Are small language models finally having their moment?

The smallest AI models can be run on a mobile phone or laptop – what's the business potential?

Artificial intelligence (AI) concept image showing a digitized cube with 'AI' written on it resting on top of circuit boards.
(Image credit: Getty Images)

Large language models (LLMs) with billions, or even trillions, of parameters carry heavy baggage: massive training budgets, hardware shortages, processing costs, and security risks. Despite these drawbacks, LLMs have been widely adopted across enterprises – a recent report from McKinsey and Company found that 88% of organizations now use AI in at least one business function.

In most enterprises, data security, compliance, and intellectual property concerns surface in boardroom discussions, and the need to share proprietary data with externally hosted LLMs in the cloud has held many organizations back from adopting them.

That’s where small language models (SLMs) fit into enterprise in-house operations. SLMs are compact AI models with fewer parameters, typically ranging from millions to a few billion. Because they can run and store data without an internet connection, there are no cloud round-trips to expose sensitive information, and keeping critical data on-premises helps ensure regulatory compliance and security.

There are several enterprise use cases for SLMs. According to a report by Markets and Markets, about 44% of enterprise SLM adoption is cloud-driven, with enterprises following a hybrid cloud strategy to lower hardware infrastructure costs.

Gartner projects that by 2027, organizations will use task-specific SLMs three times more than LLMs.

SLMs don’t replace LLMs but complement them. The same enterprise may choose SLMs for some secure tasks and LLMs for broader operations. Matt Beucler, CEO and founder of Plura AI, draws the line between the two. “SLMs handle speed, control, and security,” says Beucler. “LLMs do abstraction, synthesis, and creativity. That’s an important conceptual dividing line in modern engineering, and companies that recognize it tend to build scalable, reliable systems.”

LLMs, Beucler tells ITPro, are also not brilliant at rules-based work on their own. This, and the lure of deploying more industry or case-specific AI models, is part of the appeal of SLMs. In most cases, SLMs are fine-tuned on domain knowledge to perform enterprise-specific tasks. For example, an SLM trained only on medical guidelines and patient data, rather than mathematical equations, would deliver context-aware outputs and exhibit lower chances of hallucinations.

SLM adoption is also largely driven by training and processing costs. The cost of training SLMs on a smaller dataset is significantly lower than that of LLMs. In 2024, ITPro reported that OpenAI’s GPT-4o mini was 60% cheaper than GPT-3.5 Turbo. In terms of processing costs, running LLMs can amount to tens of thousands of dollars across individual tasks, while SLMs cost just a fraction of that.

SLMs can be trained in as little as a week, whereas LLMs need months of training to achieve comparable results. SLMs are also less energy-intensive, designed to run locally on low-power devices. As a result, SLMs are an enterprise-favoured choice for edge AI and embedded applications. In simple terms, SLMs offer resource efficiency and enhanced cybersecurity at a lower price compared to LLMs.

However, many SLMs are also derived from a process known as ‘knowledge distillation’, in which a smaller ‘student’ model is trained to mimic a larger ‘teacher’ model, such as a trillion-parameter frontier model. This is important because it shows SLMs aren't a replacement for LLMs altogether.
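The core idea of distillation can be shown in a few lines: rather than learning only from hard labels, the student is trained to match the teacher’s full, temperature-softened probability distribution. This is a minimal illustrative sketch of the loss function, not how any particular lab implements it.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature softens the distribution."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions.

    This is the heart of knowledge distillation: the student learns the
    teacher's relative preferences across all outputs, not just its top answer.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return float(np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student))))

# A student that roughly agrees with the teacher incurs a small loss...
close = distillation_loss([2.0, 1.0, 0.1], [2.1, 0.9, 0.2])
# ...while one that inverts the teacher's ranking incurs a larger one.
far = distillation_loss([0.1, 1.0, 2.0], [2.1, 0.9, 0.2])
```

In practice this loss is usually blended with a standard cross-entropy term on the true labels, but the distribution-matching step is what transfers the larger model’s knowledge.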

Use cases for SLMs

SLM training often involves pruning, a model compression technique that removes weights contributing little to the model's output. As a result, pruned SLMs may fail to answer general or out-of-context queries, making them better suited to specialized applications.
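One common variant, magnitude pruning, simply zeroes out the smallest weights in a layer. The sketch below is a toy illustration of the idea on a handful of weights, not a production compression pipeline.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights (unstructured magnitude pruning).

    Weights whose absolute value falls below the chosen percentile are set
    to zero, shrinking the effective model while keeping the connections
    that influence the output most.
    """
    w = np.asarray(weights, dtype=float)
    threshold = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) >= threshold
    return w * mask

# A hypothetical layer of six weights; at 50% sparsity,
# the three smallest-magnitude weights are removed.
layer = np.array([0.01, -0.80, 0.05, 1.20, -0.02, 0.60])
pruned = magnitude_prune(layer, sparsity=0.5)
```

Real pruning pipelines typically iterate this step with retraining so accuracy recovers, but the trade-off is the same: fewer active weights, less general capability.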

The developers behind some larger models have released SLMs designed for very specific workloads. For example, in February Alibaba released Qwen-3-Coder-Next, an open-weight model designed for coding agents to run on a single modern Nvidia GPU.

Indeed, SLMs are often chosen for code and text generation capabilities. ITPro spoke to Peter Schneider, senior product manager at Qt Group, a Finnish software company specializing in cross-platform development frameworks, to hear more about practical SLM deployment.

Schneider shared how they have worked with LLMs and their “student model” SLMs. “We’ve tested code completion on our Qt Modelling Language using several small language models like GPT-oss-20b or Llama 3.3 70B against larger models like Gemini 3 Pro Preview, GPT-5, and Claude 4.5.”

“Without tuning, the choice is generally this: do you want a roughly 85% chance of a good code suggestion, or are you OK with about 50%? If you want the former, you go with a frontier model like Gemini or Claude. If you want the latter, then it's a royalty-free small model.”

Schneider stresses that this doesn’t mean that SLMs should be considered inherently worse options than larger models.

“That is not to say SLMs can't be good, or even great, for very specific use cases like code completion after you fine-tune the model. We have had success fine-tuning small models like Code Llama to be as good at code completion as bigger models.”

Defense and aerospace firms deploy SLMs on-premises to handle and analyze classified data, while the healthcare industry uses them for data compliance and the protection of patient records – helped by SLMs' ability to function without the internet in an emergency.

“For customers who will never connect to a public cloud in highly regulated industries like medical and defense, the choice is pretty clear,” says Schneider, highlighting one of the most important enterprise use cases for SLMs.

Searching through data held on-premises, or generating documents grounded in information held in a private cloud, is in fact an ideal use case for SLMs. Provided they’re backed with the right hardware, such as a few on-premises server racks or suitably powerful staff devices, they can complete tasks securely and with low latency.

“When classifying thousands of support tickets, extracting invoice data, or flagging compliance issues, an SLM trained on your own data will often outperform a general-purpose giant at a fraction of the cost, running in milliseconds and keeping data strictly on-premises,” says Barry James, director of data and AI at Nucleo.

Other common SLM use cases include:

  • Telecom: Contact centers are the most common example, with SLMs handling customer data across interactive voice response (IVR) menus, conversations, ticket routing, FAQs, and product-related queries. With SLMs, enterprises can handle high volumes of interactions at low latency and use customer data to forecast trends.
  • Embedded systems: SLMs can run without the internet on local edge devices, making them suitable for industrial IoT applications where devices must operate safely in constrained settings.
  • Factories: SLMs require fewer GPUs to operate. In industries like semiconductor fabrication, SLMs can help build an enterprise-centric knowledge base of common defects and manufacturing errors.
  • Banking: Financial data is among the most sensitive data any enterprise holds, and bank data is governed globally by regulatory authorities and standards. With on-premises AI deployment, European banks can more easily comply with the General Data Protection Regulation (GDPR).

Notable SLMs

In 2026, some of the most notable SLMs include Microsoft’s Phi-4 and Phi-4-mini, Google’s Gemma 3 and Gemma 3n, OpenAI’s gpt-oss-120b and gpt-oss-20b, Alibaba’s Qwen 2.5, and Meta’s Llama 4.

These models can be run on hardware much smaller than you would find in a data center. Microsoft says Phi-4-mini can be run on a single PC with a minimum of 16GB RAM. Gemma 3n is designed to run on smartphones, while Gemma 3, depending on model size (ranging from 270 million to 27 billion parameters) and quantization, can run on a consumer-grade laptop or may require a top-of-the-line GPU.

As the demand for SLMs increases, AI labs may focus more on quantization and squeezing the most performance possible out of small models.
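The link between quantization and hardware requirements is back-of-the-envelope arithmetic: weight precision times parameter count gives the rough memory footprint. The sketch below uses a hypothetical 4-billion-parameter SLM as an example; it estimates weight storage only, ignoring activation memory and other runtime overheads.

```python
def model_memory_gb(parameters, bits_per_weight):
    """Rough memory footprint of a model's weights at a given precision.

    Quantization trades weight precision (16-bit floats down to 4-bit
    integers) for a smaller footprint, which is what lets
    multi-billion-parameter SLMs fit on laptops and phones.
    """
    return parameters * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

# A hypothetical 4-billion-parameter SLM at three common precisions:
fp16 = model_memory_gb(4e9, 16)  # 8 GB of weights - discrete GPU territory
int8 = model_memory_gb(4e9, 8)   # 4 GB - fits in typical laptop RAM
int4 = model_memory_gb(4e9, 4)   # 2 GB - feasible on a high-end phone
```

Halving the bits halves the footprint, which is why 4-bit quantized variants are the usual route to on-device deployment, at the cost of some accuracy.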

Venus Kohli
Freelance writer

Venus is a freelance technology writer specializing in IT, quantum physics, electronics, and other technical fields. She holds a degree in Electronics and Telecommunications Engineering from Mumbai University, India.

With years of experience in writing for global media brands and IT companies, she enjoys translating complex content into engaging stories. When she’s not writing about the latest IT trends, Venus can be found tracking enterprise trends or the newest processor in town.