OpenAI wants developers using its new GPT-4.1 models – but how do they compare to Claude and Gemini on coding tasks?
Lower latency and larger context windows for GPT-4.1 could enable more detailed AI agents


OpenAI has unveiled a new family of AI models intended for developers, which it claims offers sizable improvements for coding and understanding complex prompts.
GPT-4.1 is a multi-modal model specifically designed to be more helpful in a professional context, with support for much longer context windows and better contextual processing – useful for handling large documents such as PDFs or code repositories, for example.
Developers can now input up to a million tokens per prompt, and OpenAI says the model has been trained to attend to all the details within a prompt. Missed details have been a common issue for users in recent months, as LLMs can struggle to pull out specific information across hundreds of thousands of tokens.
Alongside its flagship model, OpenAI announced two smaller, cheaper models for developers: GPT-4.1 mini and GPT-4.1 nano. The firm described GPT-4.1 mini as on par with GPT-4o across benchmarks such as the general accuracy test MMLU, but with far lower latency.
GPT-4.1 nano is being sold as the best choice for text-based, low-latency use cases such as data classification. OpenAI says it can return the first token of its response in under five seconds for prompts with 128,000 tokens.
Both small models also have a one million token context window and the tech giant recommended them as a highly cost-efficient option for running autonomous AI agents.
How well does GPT-4.1 perform?
OpenAI has highlighted coding as a speciality for GPT-4.1, with the model having scored 54.6% against the benchmark SWE-bench Verified, which evaluates AI models for their ability to solve real-world coding problems based on GitHub data.
This is a 21.4 percentage point improvement over GPT-4o, previously considered OpenAI’s best coding model.
Against the Aider Polyglot benchmark, which tests how well LLMs can edit code per instructions across languages such as Java, Rust, and Python, GPT-4.1 scored 52.4%. This is more than double the 23.1% scored by GPT-4o at launch.
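Benchmark gains like these can be read two ways: as a difference in percentage points or as a change relative to the older score. The short Python sketch below works through both using the figures quoted above. The function names are illustrative, and the SWE-bench Verified baseline for GPT-4o is inferred on the assumption that the 21.4 figure is a percentage point gain.

```python
def point_gain(new_score: float, old_score: float) -> float:
    """Absolute gain, in percentage points."""
    return new_score - old_score


def relative_gain(new_score: float, old_score: float) -> float:
    """Gain relative to the old score, in percent."""
    return (new_score - old_score) / old_score * 100


# Aider Polyglot scores quoted in the article.
aider_new, aider_old = 52.4, 23.1
print(f"Aider Polyglot: +{point_gain(aider_new, aider_old):.1f} points, "
      f"+{relative_gain(aider_new, aider_old):.0f}% relative")

# SWE-bench Verified: the GPT-4o baseline is inferred, not quoted.
# Assumption: the 21.4 figure is a percentage point gain.
swe_new = 54.6
swe_old = swe_new - 21.4
print(f"SWE-bench Verified: implied GPT-4o score {swe_old:.1f}%, "
      f"+{relative_gain(swe_new, swe_old):.0f}% relative")
```

On the Aider figures, a gain of roughly 29 points corresponds to a relative gain of about 127%, which is what "more than double" means here.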
But while GPT-4.1 outscores OpenAI’s previous models on coding, it doesn’t match the SWE-bench Verified scores achieved by competing options. Google’s Gemini 2.5 Pro recorded a score of 63.8%, while Anthropic’s Claude 3.7 Sonnet achieved 70.3%.
How to get your hands on the new models
GPT-4.1 will be available via OpenAI’s API for a price of $2 per million input tokens and $8 per million output tokens. For cached inputs – meaning new prompts referring to pre-computed input tokens such as a new question about a previously-uploaded PDF – the model costs $0.50 per million tokens.
Unlike GPT-4o, GPT-4.1 will not be made available in the ChatGPT app.
As for the lighter, lower-latency models, GPT-4.1 mini and GPT-4.1 nano are far cheaper to run, at just $0.40 and $0.10 per million input tokens respectively.
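To put those prices in context, here is a rough cost sketch in Python. It uses only the per-million-token prices quoted in this article. The workload (a 200,000-token PDF, a 1,000-token answer, a 500-token follow-up question) is hypothetical, and the mini and nano figures are assumed to be input-token prices. Real bills depend on details the sketch ignores, such as how much of a prompt actually qualifies for the cached rate.

```python
# Prices in US dollars per million tokens, as quoted in the article.
GPT_4_1 = {"input": 2.00, "cached_input": 0.50, "output": 8.00}

# Assumption: the mini and nano figures are input-token prices;
# the article does not list their output or cached prices.
INPUT_PRICE = {"GPT-4.1": 2.00, "GPT-4.1 mini": 0.40, "GPT-4.1 nano": 0.10}


def gpt_4_1_cost(input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Estimated dollar cost of one GPT-4.1 request.

    cached_tokens is the part of input_tokens billed at the cached rate.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    total = (fresh_tokens * GPT_4_1["input"]
             + cached_tokens * GPT_4_1["cached_input"]
             + output_tokens * GPT_4_1["output"])
    return total / 1_000_000


def input_cost(model: str, input_tokens: int) -> float:
    """Estimated dollar cost of the input side of one request."""
    return input_tokens * INPUT_PRICE[model] / 1_000_000


# Hypothetical workload: a first question about a 200,000-token PDF,
# then a 500-token follow-up that reuses the cached document.
first = gpt_4_1_cost(200_000, 1_000)
follow_up = gpt_4_1_cost(200_500, 1_000, cached_tokens=200_000)
print(f"first question: ${first:.3f}")
print(f"follow-up:      ${follow_up:.3f}")

for model in INPUT_PRICE:
    print(f"{model}: ${input_cost(model, 128_000):.4f} "
          "for a 128,000-token prompt")
```

Under these assumptions, the first question costs about $0.41 and the cached follow-up about $0.11, while the input side of a 128,000-token prompt costs roughly $0.26 on GPT-4.1 and just over a cent on GPT-4.1 nano.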
Customers who have already used GPT-4.1 include the global investment firm Carlyle, which deployed the model to pull out financial information from Excel files, PDFs, and other common business documents.
Carlyle reported a 50% improvement on the task compared with previous AI attempts. It also saw fewer of the common errors in which AI models prioritize the beginning and end of prompts and skip over details contained in the middle, along with a better ability to connect context across different documents.
OpenAI kills off 4.5, complicates offerings
Alongside the announcement of GPT-4.1, OpenAI stated it would be deactivating its experimental model GPT-4.5 Preview within the API on July 14th.
The company said this is due to GPT-4.1 offering a similar experience with far lower cost and latency compared to GPT-4.5’s $75 per million input tokens and $150 per million output tokens.
It’s unclear how the announcement of GPT-4.1 aligns with a recent post on X by CEO Sam Altman, who revealed that OpenAI would release o3 and o4-mini in “a couple of weeks”.
OpenAI describes o3-mini as a “cost-efficient reasoning model that’s optimized for coding, math, and science”, and its benchmarks show the model outperforming GPT-4.1 mini and nano.
More complicated is the claim that GPT-4.1 is particularly strong at coding. Though its 54.6% performance at SWE-bench Verified outmatches the 49.3% scored by o3-mini, the smaller model holds a 15-point lead over GPT-4.1 at Aider Polyglot.
If OpenAI launches o4-mini as promised, the full list of available models across its API and app will include GPT-4.1, GPT-4.5, GPT-4o, o4-mini, and o3-mini. On the OpenAI subreddit, some Redditors complained about the naming scheme, arguing that it was becoming increasingly hard to keep track of which models are best for which tasks.
Altman himself referenced the issue in another X post on 14 April, acknowledging that the company deserved the jokes people had been making at the expense of its naming scheme.
“[H]ow about we fix our model naming by this summer and everyone gets a few more months to make fun of us (which we very much deserve) until then?” he wrote.
Rory Bathgate is Features and Multimedia Editor at ITPro, overseeing all in-depth content and case studies. He can also be found co-hosting the ITPro Podcast with Jane McCallion, swapping a keyboard for a microphone to discuss the latest learnings with thought leaders from across the tech sector.
In his free time, Rory enjoys photography, video editing, and good science fiction. After graduating from the University of Kent with a BA in English and American Literature, Rory undertook an MA in Eighteenth-Century Studies at King’s College London. He joined ITPro in 2022 as a graduate, following four years in student journalism. You can contact Rory at rory.bathgate@futurenet.com or on LinkedIn.