Small Language Models

What is an SLM?

A small language model (SLM) is a neural network, typically based on the Transformer architecture, with far fewer parameters than a large language model (LLM): from millions to several billion.

The key difference is that an SLM sacrifices breadth of generalization for efficiency.

Advantages include low latency, lower memory consumption, and the ability to run on edge devices.

Technologies for creating SLMs

SLMs are commonly produced by compressing larger models with three main methods:

Quantization: reducing the number of bits used to store each weight (for example, moving from 32-bit floating point to 8-bit integers), which shrinks the model, usually with only a small loss of accuracy.
Pruning: removing neurons or individual weights that have little effect on predictions.
Distillation: training a smaller “student model” to reproduce the outputs of a large “teacher model”, which transfers the teacher’s knowledge to the student.
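
The three methods above can be illustrated with a minimal pure-Python sketch. It is a toy under stated assumptions, not how any particular framework implements them: the function names are hypothetical, quantization is symmetric with a single scale per tensor, pruning is unstructured magnitude pruning, and the distillation loss is the KL divergence between temperature-softened teacher and student distributions.

```python
import math

def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> int8 codes plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the int8 range used here.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]

def prune_by_magnitude(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative values only; these are not weights from a real model.
weights = [0.82, -0.05, 0.31, -1.27, 0.002, 0.64]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
print("int8 codes:", codes)
print("max round-trip error:", max(abs(a - b) for a, b in zip(weights, restored)))
print("50% pruned:", prune_by_magnitude(weights, 0.5))
print("distillation loss:", distillation_loss([2.0, 1.0, 0.1], [1.5, 1.2, 0.3]))
```

With round-to-nearest, the round-trip error of each weight is bounded by half the scale, which is why 8-bit storage often costs little accuracy. The distillation loss is zero when the student's logits match the teacher's and grows as they diverge.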

Comparison of SLM and LLM

Characteristic	SLM	LLM
Parameters	Millions to several billion	Tens to hundreds of billions
Memory (VRAM)	Minimal	Significant
Latency	Ultra-low	Noticeably higher
Accuracy	Moderate	High
Training cost	Affordable	High
Application	Mobile / edge tasks	Cloud systems
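
The memory row can be made concrete with simple arithmetic: the weights alone need roughly (number of parameters) × (bits per weight) / 8 bytes. The sketch below assumes two hypothetical model sizes (3 billion and 70 billion parameters) chosen only for illustration, and it ignores activations, the KV cache, and runtime overhead, so real memory use is higher.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed for the weights alone, in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Hypothetical model sizes, used only to show the scale of the difference.
models = [("3B-parameter SLM", 3e9), ("70B-parameter LLM", 70e9)]

for name, params in models:
    for bits in (32, 16, 8, 4):
        print(f"{name} at {bits}-bit: {weight_memory_gb(params, bits):.1f} GB")
```

Under these assumptions, the smaller model at 16-bit precision needs about 6 GB for its weights, while the larger one needs about 140 GB, which is more than a single consumer GPU provides.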

Strategies for use in AI agents

Four strategies are proposed for combining SLMs and LLMs effectively in an agent:

Intelligent routing: simple tasks (support, data extraction) are routed to the SLM, while complex ones go to the LLM.
Pipeline collaboration: the SLM creates a draft or filters data, while the LLM completes the work, for example by checking the draft for hallucinations.
Parallel verification: the SLM quickly generates an answer, while the LLM simultaneously verifies and corrects it.
Conditional activation: the LLM is connected only if the SLM’s confidence in its answer is below a certain threshold.
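
Two of these strategies, intelligent routing and conditional activation, can be sketched in a few lines. Everything below is a hypothetical illustration: `call_slm` and `call_llm` are stand-ins for real model calls, the confidence score is a toy heuristic based on prompt length, and the task labels and the 0.7 threshold are assumptions, not values from any real system.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    confidence: float  # 0..1, as reported by the (hypothetical) model wrapper
    model: str

def call_slm(prompt: str) -> Answer:
    """Stand-in for a real SLM call; short prompts get high toy confidence."""
    confidence = 0.9 if len(prompt.split()) <= 12 else 0.4
    return Answer(f"[SLM] {prompt}", confidence, "slm")

def call_llm(prompt: str) -> Answer:
    """Stand-in for a real LLM call."""
    return Answer(f"[LLM] {prompt}", 0.95, "llm")

# Assumed labels for tasks considered simple enough for the SLM.
SIMPLE_TASKS = {"support", "data_extraction"}

def route(task_type: str, prompt: str) -> Answer:
    """Intelligent routing: simple task types go to the SLM, the rest to the LLM."""
    if task_type in SIMPLE_TASKS:
        return call_slm(prompt)
    return call_llm(prompt)

def answer_with_fallback(prompt: str, threshold: float = 0.7) -> Answer:
    """Conditional activation: call the LLM only if SLM confidence is below the threshold."""
    draft = call_slm(prompt)
    if draft.confidence >= threshold:
        return draft
    return call_llm(prompt)

print(route("support", "Reset my password").model)
print(route("contract_review", "Compare these two agreements").model)
print(answer_with_fallback("Reset my password").model)
```

In practice the confidence signal would have to come from the model itself (for example, token probabilities or a separate verifier), and choosing the threshold is a trade-off between cost and answer quality.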

Use cases and examples

Confidentiality (on-premise): in medicine (patient triage) and law (contract analysis), an SLM can run on local infrastructure, so data does not leave the company’s protected perimeter.
High volume / low cost: processing receipts and invoices (for example, Microsoft Phi-3 is reported to do this with 99% accuracy at 1% of the cost of GPT-4).
Narrow specialization: coding assistants trained on a company’s specific style, or support ticket classification.

Real company examples

Uber: uses SLMs in its agentic RAG systems.
Microsoft: tests SLMs for managing cloud supply chains and interacting with applications through natural language.
OpenAI: publishes guides on building agents using such models.

Conclusion

According to Gartner forecasts, by 2027 organizations will use specialized small models three times more often than general-purpose LLMs.