Small but Mighty: The Enduring Relevance of Small Language Models in the Age of LLMs

Large Language Models (LLMs) have revolutionized natural language processing in recent years. The pre-train and fine-tune paradigm, exemplified by models like ELMo and BERT, has evolved into the prompt-based reasoning used by the GPT family. These approaches have shown exceptional performance across tasks spanning language generation, language understanding, and domain-specific applications. The theory of emergent abilities suggests that increasing model size enhances certain reasoning capabilities, which has driven the development of ever-larger models. LLMs have gained widespread popularity, with ChatGPT reaching approximately 180 million users by March 2024.

Despite LLMs’ advances toward artificial general intelligence, their size leads to exponential increases in computational cost and energy consumption. This has sparked interest in smaller language models (SLMs) like Phi-3.8B and Gemma-2B, which achieve comparable performance with far fewer parameters. Researchers from Imperial College London and Soda, Inria Saclay have presented an analysis of Hugging Face downloads which reveals that smaller models, especially BERT-base, remain highly popular in practical settings. This surprising trend highlights the continued relevance of SLMs and raises important questions about their role in the LLM era, a topic previously overlooked in research. The persistence of smaller models challenges assumptions about the dominance of large-scale AI.

Small Models (SMs) are defined relative to larger models, with no fixed parameter threshold. SMs are compared to LLMs across four dimensions: accuracy, generality, efficiency, and interpretability. While LLMs excel in accuracy and generality, SMs offer advantages in efficiency and interpretability. SMs can achieve comparable results through techniques like knowledge distillation and often outperform LLMs in specialized tasks. They require fewer resources, making them suitable for real-time applications and resource-constrained environments. SMs are also more interpretable, which is crucial in fields like healthcare and finance. This study examines the role of SMs in the LLM era from two perspectives: collaboration with LLMs and competition against them.

SMs play a crucial role in enhancing LLMs through data curation. For pre-training data, SMs help select high-quality subsets from large datasets, addressing the challenge of finite data availability and improving model performance. Techniques include using small classifiers to assess content quality and proxy language models to calculate perplexity scores. In instruction tuning, SMs assist in curating smaller, high-quality datasets that can effectively align LLMs with human preferences. Methods like Model-oriented Data Selection (MoDS) and the LESS framework demonstrate how SMs can select influential data for LLMs, optimizing the instruction tuning process and achieving strong alignment with fewer examples.
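
As a concrete illustration of perplexity-based filtering, the sketch below uses GPT-2 as a stand-in proxy model to score candidate documents; the threshold and toy corpus are purely illustrative, not the selection recipe of any particular paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# A small proxy LM scores candidate pre-training documents (GPT-2 as stand-in).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Compute the proxy model's perplexity on a single document."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

docs = ["The mitochondria is the powerhouse of the cell.",
        "asdf qwer zxcv lorem garble text text text"]
# Keep only documents the proxy model finds predictable (low perplexity);
# the cutoff is an illustrative assumption, tuned per corpus in practice.
threshold = 100.0
kept = [d for d in docs if perplexity(d) < threshold]
```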

The weak-to-strong paradigm addresses challenges in aligning superhuman LLMs with human values. As LLMs surpass human capabilities in complex tasks, evaluating their outputs becomes increasingly difficult. This paradigm uses smaller models to supervise larger ones, allowing strong models to generalize beyond their weaker supervisors’ limitations. Recent variants include using diverse specialized weak teachers, incorporating reliability estimation, and applying weak models during inference. Techniques like Aligner and Weak-to-Strong Search further enhance alignment by learning correctional residuals or maximizing log-likelihood differences. This approach extends beyond language models to vision foundation models, offering a promising solution for aligning advanced AI systems with human preferences.
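
The following toy sketch shows the core weak-to-strong training step under simplifying assumptions: two small feed-forward networks stand in for the weak supervisor and the strong student, and the student is trained on the weak model’s hard pseudo-labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: a small "weak" classifier supervises a larger "strong" one.
weak_model = nn.Linear(16, 4)    # pretend this was trained on human labels
strong_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(strong_model.parameters(), lr=1e-3)

inputs = torch.randn(32, 16)     # an unlabeled batch
with torch.no_grad():
    pseudo_labels = weak_model(inputs).argmax(dim=-1)  # weak teacher's labels

# Fine-tune the strong model on the weak teacher's pseudo-labels; weak-to-strong
# variants add confidence weighting so the student can override a noisy teacher.
logits = strong_model(inputs)
loss = F.cross_entropy(logits, pseudo_labels)
loss.backward()
optimizer.step()
```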

Model ensembling strategies utilize both large and small language models to optimize inference efficiency and cost-effectiveness. Two main approaches are model cascading and model routing. Model cascading sequentially uses models of varying complexity, with smaller models handling simpler queries and larger models addressing more complex tasks. Techniques like AutoMix use self-verification and confidence assessment to determine when to escalate queries. Model routing dynamically directs input to the most appropriate models in a pool. Methods like OrchestraLLM and RouteLLM use efficient routers to select optimal models without accessing their outputs. Speculative decoding further enhances efficiency by using a smaller auxiliary model to generate initial predictions, which are then verified by a larger model.
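
A minimal cascade might look like the sketch below, where `small_model` and `large_model` are hypothetical stand-ins for real SLM/LLM endpoints and the confidence threshold would be tuned on validation data; the self-verification used by systems like AutoMix is reduced here to a single confidence score.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Cascade:
    small_model: Callable[[str], tuple[str, float]]  # returns (answer, confidence)
    large_model: Callable[[str], str]
    threshold: float = 0.8

    def answer(self, query: str) -> str:
        answer, confidence = self.small_model(query)
        if confidence >= self.threshold:
            return answer               # simple query: the small model suffices
        return self.large_model(query)  # hard query: escalate to the LLM

# Usage with stub models standing in for real endpoints:
cascade = Cascade(
    small_model=lambda q: ("Paris", 0.95) if "capital" in q else ("?", 0.1),
    large_model=lambda q: "answer from the large model",
)
print(cascade.answer("What is the capital of France?"))  # handled by the SLM
```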

Model-based evaluation approaches use smaller models to assess the performance of LLMs, addressing the limitations of traditional metrics like BLEU and ROUGE. Techniques such as BERTSCORE and BARTSCORE employ smaller models to compute semantic similarity and evaluate texts from various perspectives. Some methods use natural language inference models to estimate uncertainty in LLM responses. In addition, proxy models can predict LLM performance, reducing computational costs during model selection. These approaches enhance the evaluation of open-ended text generation by LLMs, capturing nuanced semantic meaning and compositional diversity that traditional metrics often miss.
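
For instance, BERTSCORE can be computed with the open-source bert-score package, as in the short example below; the sentence pair is illustrative.

```python
# Requires: pip install bert-score
from bert_score import score

candidates = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# BERTSCORE uses a small encoder to match candidate and reference tokens in
# embedding space, capturing paraphrases that n-gram metrics like BLEU miss.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.item():.3f}")
```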

Domain adaptation techniques for LLMs use smaller models to enhance performance in specific domains. White-Box Adaptation methods, like CombLM and IPA, adjust token distributions of frozen LLMs using small, domain-specific models. These approaches modify only the parameters of small experts, allowing LLMs to adapt to specific tasks. Black-Box Adaptation, suitable for API-only services, uses small domain-specific models to guide LLMs through textual knowledge. Retrieval Augmented Generation (RAG) extracts relevant information from external sources, while approaches like BLADE and Knowledge Card use small expert models to generate domain-specific knowledge. These techniques enable LLMs to perform optimally in specialized domains without extensive retraining or access to internal parameters.
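
A simplified view of white-box adaptation, in the spirit of CombLM but not its exact formulation, is to interpolate the frozen LLM’s next-token distribution with that of a small domain expert; in the sketch below the logits are random placeholders and the mixing weight alpha would normally be learned.

```python
import torch
import torch.nn.functional as F

def combined_next_token_probs(llm_logits: torch.Tensor,
                              expert_logits: torch.Tensor,
                              alpha: float = 0.3) -> torch.Tensor:
    """Mix two next-token distributions over a shared vocabulary."""
    p_llm = F.softmax(llm_logits, dim=-1)        # frozen general-purpose LLM
    p_expert = F.softmax(expert_logits, dim=-1)  # small domain-specific expert
    # Only alpha (and the small expert) would be trained; the LLM stays frozen.
    return (1 - alpha) * p_llm + alpha * p_expert

vocab_size = 50_000
llm_logits = torch.randn(vocab_size)     # placeholder for real LLM outputs
expert_logits = torch.randn(vocab_size)  # placeholder for the domain expert
next_token = combined_next_token_probs(llm_logits, expert_logits).argmax()
```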

RAG enhances LLMs by integrating external knowledge sources to overcome limitations in domain-specific expertise and up-to-date information. RAG methods use lightweight retrievers to extract relevant information from various sources, effectively reducing hallucinations in generated content. These sources can be categorized into three types: textual documents (e.g., Wikipedia, cross-lingual text, domain-specific corpora), structured knowledge (knowledge bases, databases), and other sources (code, tools, images). RAG approaches employ diverse retrieval techniques, including sparse BM25 and dense BERT-based models for textual sources, entity linkers and query executors for structured knowledge, and specialized retrievers for other sources. By utilizing these external resources, RAG significantly enhances LLMs’ performance across various tasks and domains.
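
The sketch below illustrates the sparse-retrieval flavour of RAG with the rank-bm25 package; the three-document corpus and the prompt template are illustrative placeholders for a real pipeline.

```python
# Requires: pip install rank-bm25
from rank_bm25 import BM25Okapi

# BM25 picks relevant passages, which are then prepended to the LLM prompt.
corpus = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "BM25 is a sparse ranking function used in information retrieval.",
    "Knowledge bases store structured facts as entities and relations.",
]
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "When was the Eiffel Tower built?"
top_passages = bm25.get_top_n(query.lower().split(), corpus, n=1)

# The retrieved context grounds the generator and helps reduce hallucination.
prompt = f"Context: {top_passages[0]}\n\nQuestion: {query}\nAnswer:"
print(prompt)  # this prompt would be sent to the generator LLM
```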

Prompt-based learning exploits LLMs’ ability to adapt to new scenarios with minimal or no labelled data through carefully crafted prompts. This approach relies on In-Context Learning (ICL), which incorporates demonstration examples within natural language templates without updating model parameters. Small models can be employed to enhance prompts and improve larger models’ performance. Techniques like Uprise and DaSLaM use lightweight retrievers or small models to optimize prompts, break down complex problems, or generate pseudo labels. These methods significantly reduce manual prompt engineering effort and improve performance across various reasoning tasks. Further, small models can verify or rewrite LLM outputs, achieving performance gains without fine-tuning the larger models.
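
As an illustration of retrieval-enhanced prompting, the sketch below uses a small Sentence-Transformers encoder to pick the demonstration most similar to the test input and assembles a few-shot prompt; the example pool, labels, and prompt format are assumptions, not the exact Uprise procedure.

```python
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small retrieval encoder

# An illustrative demonstration pool of (text, label) pairs.
pool = [("The movie was fantastic!", "positive"),
        ("I want a refund.", "negative"),
        ("Shipping took three weeks.", "negative")]
query = "Absolutely loved the soundtrack."

# Embed pool and query, then select the most similar demonstration.
pool_emb = encoder.encode([t for t, _ in pool], convert_to_tensor=True)
query_emb = encoder.encode(query, convert_to_tensor=True)
best = util.cos_sim(query_emb, pool_emb)[0].argmax().item()

text, label = pool[best]
prompt = f"Review: {text}\nSentiment: {label}\n\nReview: {query}\nSentiment:"
print(prompt)  # few-shot prompt for the LLM; no parameter updates needed
```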

LLMs can sometimes generate repeated, untruthful, or toxic content. To address these deficiencies, two main approaches using smaller models have emerged: contrastive decoding and small model plug-ins. Contrastive decoding utilizes the differences between a larger “expert” model and a smaller “amateur” model to improve output quality. This technique has been successfully applied to reduce repetition, mitigate hallucinations, enhance reasoning capabilities, and protect user privacy. Small model plug-ins, on the other hand, involve fine-tuning specialized smaller models to address specific LLM shortcomings. These plug-ins can help with issues like handling out-of-vocabulary words, detecting hallucinations, or calibrating confidence scores. Both approaches offer cost-effective ways to improve LLM performance without the need for extensive fine-tuning of the larger models.
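
The core of contrastive decoding can be sketched in a few lines: tokens are rescored by the gap between expert and amateur log-probabilities, subject to a plausibility constraint; the random logits below stand in for real model outputs.

```python
import torch
import torch.nn.functional as F

def contrastive_scores(expert_logits: torch.Tensor,
                       amateur_logits: torch.Tensor,
                       plausibility_alpha: float = 0.1) -> torch.Tensor:
    """Score tokens by expert-minus-amateur log-probability."""
    log_p_expert = F.log_softmax(expert_logits, dim=-1)
    log_p_amateur = F.log_softmax(amateur_logits, dim=-1)
    # Demote tokens the amateur also likes (e.g., generic repetition).
    scores = log_p_expert - log_p_amateur
    # Plausibility constraint: only keep tokens the expert itself rates within
    # a factor of alpha of its best token, to avoid implausible outliers.
    cutoff = log_p_expert.max() + torch.log(torch.tensor(plausibility_alpha))
    scores[log_p_expert < cutoff] = float("-inf")
    return scores

expert_logits = torch.randn(50_000)   # placeholder for the large model
amateur_logits = torch.randn(50_000)  # placeholder for the small model
next_token = contrastive_scores(expert_logits, amateur_logits).argmax()
```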

Knowledge Distillation (KD) offers an effective way to enhance smaller models’ performance using the knowledge of LLMs. This approach involves training a smaller student model to replicate the behaviour of a larger teacher model, making powerful AI more accessible and deployable. KD methods fall into white-box and black-box approaches. White-box distillation uses internal states, output distributions, and intermediate features of the teacher LLM to train the student model transparently. Black-box distillation typically generates a dataset with the teacher LLM and fine-tunes the student model on it. These techniques have been successfully applied to improve reasoning capabilities, enhance zero-shot performance, and tackle various domain-specific tasks, demonstrating KD’s versatility in creating cost-effective yet powerful models across multiple applications.
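
A minimal white-box distillation step, with small linear layers standing in for the teacher and student, might look like the following; the temperature and architectures are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Linear(32, 10)   # stands in for the large teacher model's head
student = nn.Linear(32, 10)   # smaller student to be trained
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0                       # temperature softens the teacher's distribution

x = torch.randn(16, 32)       # a toy input batch
with torch.no_grad():
    teacher_probs = F.softmax(teacher(x) / T, dim=-1)

# The student matches the teacher's softened output distribution via KL
# divergence, scaled by T^2 as in standard distillation.
student_log_probs = F.log_softmax(student(x) / T, dim=-1)
loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T**2
loss.backward()
optimizer.step()
```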

LLMs offer an efficient solution for data synthesis, addressing the limitations of human-created data and the need for task-specific smaller models. This approach focuses on two key areas: Training Data Generation and Data Augmentation. In Training Data Generation, LLMs like ChatGPT create datasets from scratch, which are then used to train smaller, task-specific models. This method has been successfully applied to various tasks, including text classification, clinical text mining, and hate speech detection. Data Augmentation involves using LLMs to modify existing data points, increasing diversity for training smaller models. Techniques include paraphrasing, query rewriting, and generating additional samples for tasks such as personality detection and dialogue understanding. These approaches significantly enhance the performance and robustness of smaller models while maintaining efficiency in inference.
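
A hedged sketch of training-data generation: the `generate` function below is a stub standing in for a ChatGPT-style API call, and the four synthetic reviews it returns feed a small scikit-learn classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def generate(prompt: str) -> list[tuple[str, str]]:
    """Stub standing in for an LLM API call that returns (text, label)
    pairs; replace with a real client in practice."""
    return [("This product broke after one day.", "negative"),
            ("Exceeded all my expectations!", "positive"),
            ("Terrible customer support.", "negative"),
            ("Works exactly as advertised.", "positive")]

synthetic = generate("Write short product reviews labeled positive/negative.")
texts, labels = zip(*synthetic)

# The small task-specific model trains on LLM-generated data and then runs
# cheaply at inference time, with no further calls to the large model.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["I would buy this again."]))
```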

Smaller models prove advantageous in three key scenarios: computation-constrained environments, task-specific environments, and situations requiring interpretability.

LLMs, despite their impressive capabilities, face significant challenges in computation-constrained environments due to their substantial computational demands. Scaling model size leads to exponential increases in training time, inference latency, and energy consumption, making LLMs impractical for many academic researchers, businesses with limited resources, and edge or mobile devices. However, not all tasks require such large models. For many tasks that are not knowledge-intensive or don’t demand complex reasoning, smaller models can be equally effective. Research shows diminishing returns from increasing model sizes, particularly in tasks like text similarity and classification. In information retrieval, where faster inference speed is crucial, lightweight models like Sentence-BERT remain widely used. This has led to a growing shift towards smaller, more efficient models like Phi-3.8B, MiniCPM, and Gemma-2B, driven by the need for accessibility, efficiency, and the democratization of AI technologies.

In task-specific environments, smaller models often prove more effective and efficient than LLMs. This is particularly true in domains with limited available data or specialized requirements. Domain-specific tasks in fields like biomedicine and law benefit from fine-tuned smaller models, which can outperform general LLMs. For tabular learning, where datasets are typically smaller and structured, tree-based models often compete effectively with larger deep-learning models. Short text tasks, such as classification and phrase representation, don’t require extensive background knowledge, making smaller models particularly effective. Further, in niche areas like machine-generated text detection, spreadsheet representation, and information extraction, specialized smaller models can surpass larger ones. These scenarios highlight the advantages of developing lightweight, task-specific models, offering promising returns in specialized domains where data scarcity or unique requirements make large-scale pretraining unfeasible.

Interpretability in machine learning aims to provide human-understandable explanations of a model’s internal reasoning process. Smaller and simpler models generally offer better interpretability compared to larger, more complex ones. Industries like healthcare, finance, and law often prefer more interpretable models because their decisions must be understandable to non-experts. In high-stakes decision-making contexts, easily auditable and explainable models are typically favored. When choosing LLMs or SMs, it’s crucial to balance model complexity with the need for human understanding, making appropriate trade-offs based on the specific application and requirements.

This study analyzes the relationship between LLMs and SMs from two perspectives: collaboration and competition. LLMs and SMs can work together to balance performance and efficiency, and they compete in specific scenarios such as computation-constrained environments, task-specific applications, and situations requiring high interpretability. While LLMs offer superior performance, SMs have advantages in accessibility, simplicity, cost-effectiveness, and interpretability, so careful evaluation of these trade-offs is crucial when selecting a model for a given task. The research aims to provide insights for practitioners and to encourage further study on resource optimization and cost-effective system development.


Check out the Paper. All credit for this research goes to the researchers of this project.



Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.
