Hugging Face and IBM open a leaderboard for AI agent testing
Hugging Face and IBM Research introduced an Open Agent Leaderboard to compare how well AI agents handle tool use and multi-step tasks.
Topic hub
AI agents are systems that can plan steps and use tools. This desk tracks where agentic workflows are becoming real and where they still need caution.
OpenAI and Dell are partnering to support Codex in hybrid and on-premise enterprise environments where companies need tighter data controls.
OpenAI detailed the Windows sandbox work behind Codex, showing how coding agents can be given useful access without unlimited system permissions.
OpenAI’s Sea Limited case study shows how a large Asian technology company is thinking about Codex and agentic software development.
OpenAI says Databricks is using GPT-5.5 for enterprise agent workflows after benchmark gains on office-style knowledge tasks.
Anthropic says it is releasing ten finance agent templates and Claude add-ins for Microsoft 365, so teams can run governed workflows across Excel, PowerPoint, Word, and Outlook.
NVIDIA says it is expanding work with ServiceNow on governed autonomous agents, including ServiceNow’s Project Arc and an OpenShell-based runtime for sandboxed, policy-controlled execution.
Google DeepMind says AlphaEvolve, a Gemini-powered coding agent, found algorithm and infrastructure improvements, and cites gains in genomics, grid optimization, and systems tuning.
A new arXiv paper introduces AgentFloor, a 30-task tool-use benchmark, and reports that many routine agent steps work well on smaller open-weight models.
ReasoningBank stores distilled reasoning strategies from both successes and failures, improving tool-using agent performance on web navigation and coding benchmarks.