Microsoft is doubling down on multilingual large language models – and Europe stands to benefit the most
The tech giant wants to ramp up development of LLMs for a range of European languages
Microsoft has announced plans to expand the development and adoption of multilingual LLMs as part of a new partnership drive across Europe.
Europe’s 24 official languages and 250 indigenous languages are currently underrepresented in the web content on which the industry’s large language models (LLMs) were trained.
The result is that LLMs currently cannot process Swedish or Romanian to the same standard as English. To bridge this language gap, Microsoft will make multilingual data from GitHub accessible to the European community in collaboration with Hugging Face.
On 1st September, Microsoft will also issue a call for applications for grants to build content out in 10 underrepresented European languages.
“We have learnt that, basically, one needs to record several hundred hours of people speaking a particular language in order to support the multi-modal capability of AI,” explained Brad Smith, vice chair and president at Microsoft.
“So for example, to be able to handle text to speech and speech to text, and we can do that by employing people to go record more audio in more languages.”
Microsoft will also create new jobs at its innovation centers in Strasbourg, the Microsoft Open Innovation Center (MOIC) and AI for Good Lab. It will partner with the ICube Laboratory at the University of Strasbourg, which is already working on this problem, and fund two post-doctoral researchers.
Building out this content will involve digitizing existing material such as books, as well as creating audio content in those languages to improve multimodal training data. To back these efforts, the company said it will provide groups with Azure cloud credits, grants, and engineering support.
Microsoft has stressed that this data will be in the public domain and will be made freely available to European citizens.
“It's important to underscore that all of this work is designed to donate more data so that others can use it,” Smith told assembled media.
“Our goal is to make it available to the European public and to open source developers. And at the same time, if there are particular partners that submit a proposal that work within a certain approach in terms of terms and the like, we want to be open to honoring their terms,” he added.
“But across the board, I want to be clear Microsoft is not going to have a proprietary interest in any of this new content that is made available.”
English dominates AI training
Microsoft research has found that 46% of the web content used to train LLMs is in English.
“When you crawl the whole open web, what you see is predominantly the number one language on the web is English,” explained Juan Lavista, adding that German, Spanish, and French come second but still make up less than 6% of the total.
This is massively disproportionate to speaker numbers: the 379.7 million native English speakers worldwide are outnumbered by the 485.1 million native Spanish speakers, for example.
Lavista added that this has been a problem since the beginning of the internet, as it was established in English and didn’t universally support special characters such as those required in French until 2003.
In a presentation, he showed how Meta’s Llama 3.1 drops 10 points in performance benchmarks when used in Swedish compared to English. In Latvian or Estonian, the gap is starker still, with the model 25 points down.
Microsoft and European governments have identified these limitations as a clear barrier to unlocking productivity through AI in the coming years.
The MOIC and AI for Good Lab will also publish an open blueprint for training LLMs and creating regional language datasets, targeting organizations such as the Basque Center for Language Technology, Barcelona Supercomputing Center, and University of Santiago de Compostela, which are working on Azure-based AI models in Basque, Catalan, and Galician.
In addition to supporting the new languages via improved datasets and hands-on support, Microsoft announced new collaborations with IE University School of Science & Technology in Madrid and the University of Strasbourg to support other ongoing research projects.
Rory Bathgate is Features and Multimedia Editor at ITPro, overseeing all in-depth content and case studies. He can also be found co-hosting the ITPro Podcast with Jane McCallion, swapping a keyboard for a microphone to discuss the latest learnings with thought leaders from across the tech sector.
In his free time, Rory enjoys photography, video editing, and good science fiction. After graduating from the University of Kent with a BA in English and American Literature, Rory undertook an MA in Eighteenth-Century Studies at King’s College London. He joined ITPro in 2022 as a graduate, following four years in student journalism. You can contact Rory at rory.bathgate@futurenet.com or on LinkedIn.

