Anthropic has announced several key developments: an upgraded Claude 3.5 Sonnet and a new model, Claude 3.5 Haiku.
Both models offer significant advances, particularly in coding and tool use. The upgraded Claude 3.5 Sonnet raises its score on the SWE-bench Verified coding benchmark from 33.4% to 49.0%, which Anthropic says is higher than any publicly available model, including specialized agentic coding systems. It also improves on its predecessor on TAU-bench, an agentic tool-use benchmark, in both the retail and airline domains. Early feedback from customers such as GitLab, Cognition, and The Browser Company suggests substantial gains in reasoning and problem-solving with no added latency.
Claude 3.5 Haiku is the next generation of Anthropic’s fastest model and improves on its predecessor, Claude 3 Haiku. According to Anthropic, “Claude 3.5 Haiku improves across every skill set and surpasses even Claude 3 Opus…on many intelligence benchmarks.” Claude 3 Opus was the largest model of the previous generation. Claude 3.5 Haiku is particularly strong in coding, scoring 40.6% on SWE-bench Verified. The model is well suited to user-facing products and to generating personalized experiences from large volumes of data.
Anthropic has also introduced a new capability in public beta: computer use. It allows Claude 3.5 Sonnet to carry out tasks by interacting with a computer as a person does, by “looking at a screen, moving a cursor, clicking buttons, and typing text.” The capability is still experimental, but companies such as Replit, Asana, and DoorDash have begun testing it, especially on complex tasks that require many steps. On OSWorld, a benchmark that evaluates an AI model’s ability to use computers, Claude 3.5 Sonnet scores 14.9% in the screenshot-only category, well above the next-best system’s 7.8%.
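To make the “looking at a screen, moving a cursor, clicking buttons, and typing text” loop concrete, here is a minimal Python sketch of how a harness might apply such actions and return a fresh observation after each one. It does not call any Anthropic API. `FakeScreen`, `apply_action`, `run_episode`, the action names, and the scripted `ACTIONS` list are all hypothetical stand-ins chosen for illustration, not the actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class FakeScreen:
    """Hypothetical in-memory stand-in for a real desktop."""
    cursor: tuple = (0, 0)
    clicks: list = field(default_factory=list)
    typed: str = ""

    def screenshot(self) -> str:
        # A real harness would return image data; a text summary is enough here.
        return f"cursor={self.cursor} clicks={len(self.clicks)} typed={self.typed!r}"


def apply_action(screen: FakeScreen, action: dict) -> str:
    """Execute one requested action and return the resulting observation."""
    kind = action["action"]
    if kind == "mouse_move":
        screen.cursor = tuple(action["coordinate"])
    elif kind == "left_click":
        screen.clicks.append(screen.cursor)
    elif kind == "type":
        screen.typed += action["text"]
    elif kind != "screenshot":
        raise ValueError(f"unsupported action: {kind}")
    return screen.screenshot()


def run_episode(screen: FakeScreen, actions: list) -> list:
    """Apply each action in order, collecting the observation after every step."""
    return [apply_action(screen, a) for a in actions]


# Scripted stand-in for the decisions a model would make (assumption).
ACTIONS = [
    {"action": "screenshot"},
    {"action": "mouse_move", "coordinate": [120, 40]},
    {"action": "left_click"},
    {"action": "type", "text": "hello"},
]

if __name__ == "__main__":
    for observation in run_episode(FakeScreen(), ACTIONS):
        print(observation)
```

In a real deployment the scripted list would be replaced by a model choosing each next action from the latest screenshot, which is why multi-step tasks depend on the model reading the screen correctly at every step.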
Why is this important?
Because the technology is still at an early stage, developers are encouraged to begin with low-risk tasks; some actions, such as scrolling and zooming, remain difficult for the model. Anthropic is also mindful of potential misuse, such as spam and misinformation, and has implemented classifiers intended to support the safe deployment of computer use. The company is inviting feedback from users and hopes these developments will open up new ways of interacting with Claude.