The Alchemy of Language: Distilling High-Quality Models from Small Language Models (SLMs)

Jung, Jaehun, et al. "Impossible distillation: from low-quality model to high-quality dataset & model for summarization and paraphrasing." arXiv preprint arXiv:2305.16635 (2023).

Introduction

The paper introduces a novel framework called Impossible Distillation that enables the generation of high-quality paraphrases and sentence summaries using small, off-the-shelf language models (LMs) such as GPT-2. This innovative approach challenges the prevailing notion that achieving state-of-the-art performance on these tasks necessitates the use of massive, computationally expensive models like GPT-3.

At the heart of the Impossible Distillation framework is the concept of paraphrastic proximity, which posits that paraphrases of a given sentence tend to occupy nearby regions in the probability distribution learned by pretrained LMs. By strategically constraining the LM's generation process to these paraphrastic subspaces, the model can be coaxed into generating a set of sentences that are paraphrases of each other, even if the model itself lacks the explicit capability to perform paraphrasing.


The distillation process in Impossible Distillation consists of four key stages. First, in the Pair Generation stage, a large pool of candidate (source, paraphrase) sentence pairs is generated by sampling from the small teacher LM, conditioned on the same contextual prompt. Next, in the Critic Filtering stage, these candidate pairs are passed through a gauntlet of off-the-shelf "critic" models that serve to filter out low-quality or redundant pairs, retaining only those paraphrases that are semantically equivalent to the source but sufficiently diverse in their surface realization. The surviving pairs form a high-quality dataset that is then used to train a student model in the Student Learning stage. Finally, in the Self-Distillation stage, the student model's performance is further refined by having it generate its own paraphrases, subjecting them to the same critic filtering process, and retraining the model on this self-generated data.

The effectiveness of Impossible Distillation is demonstrated through extensive empirical evaluations. The 770M-parameter student model, dubbed Impossible-T5, consistently outperforms strong baselines, including models distilled from ChatGPT, on a diverse suite of paraphrase generation and summarization benchmarks, and in some cases even surpasses ChatGPT itself. Moreover, the synthetic dataset DIMPLE, produced by applying Impossible Distillation to 1.5B-parameter GPT-2 models, exhibits higher diversity and fidelity than existing paraphrase datasets more than an order of magnitude larger in size.

Importantly, the Impossible Distillation framework is not limited to paraphrase generation but can be readily generalized to the task of sentence summarization by appropriately redefining the critic models used in the filtering stage. This flexibility underscores the broad applicability and adaptability of the proposed approach.

Figure: Overview of IMPOSSIBLE DISTILLATION. Starting from a low-quality LM (GPT-2), we generate a data pool of input-output pairs leveraging paraphrastic proximity, filter it with off-the-shelf critics, and distill a student model on this data pool. By self-distilling the student model, we obtain a high-quality dataset and model for the target task.

Paraphrastic Proximity

The authors introduce and provide compelling evidence for the key insight that underlies the Impossible Distillation framework: the tendency for paraphrases to reside in proximity within the probability distribution learned by a language model, a property they term "paraphrastic proximity". This observation is particularly significant, as it suggests that the paraphrastic knowledge essential for high-quality task distillation is already latently present in widely available, off-the-shelf language models like GPT-2, waiting to be harnessed.

To rigorously verify and quantify the extent of paraphrastic proximity in GPT-2 XL, the authors design and conduct a controlled analysis of the model's generation behavior. They first sample a diverse set of 1000 multi-sentence prompts from a news corpus, and then use each of these prompts to conditionally generate 100 next-sentence completions from the model. By computing the semantic equivalence and surface-form dissimilarity between the resulting generations, they arrive at two key findings that shed light on the nature of paraphrastic proximity.

Firstly, as the conditioning prompt increases in length and informativeness, the generated sentence completions become progressively more likely to be semantically equivalent to each other, with the equivalence rate approaching 100% for prompts containing 5 or more sentences. This result provides strong empirical confirmation of the paraphrastic proximity hypothesis, demonstrating that sufficiently rich and informative context can effectively constrain the model's generation process towards producing paraphrastic completions.

Secondly, and crucially, the high semantic equivalence observed in the generated sentences does not arise trivially from a lack of lexical diversity. Even when conditioned on the longest and most informative prompts, the average pair-wise Self-BLEU score between the generated sentences remains around 32, which is comparable to the lexical diversity exhibited by human-written paraphrases. This finding suggests that the model is not simply memorizing and reproducing verbatim chunks of text, but rather leveraging its learned linguistic knowledge to generate meaningfully distinct paraphrases.
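The two measurements described above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: `is_equivalent` is a hypothetical stand-in for an NLI-based equivalence check, and `bigram_overlap` is only a crude proxy for Self-BLEU (real Self-BLEU uses modified n-gram precision up to 4-grams with a brevity penalty).

```python
from itertools import combinations
from typing import Callable, List, Tuple


def bigrams(sentence: str) -> set:
    """Set of lowercase word bigrams in a sentence."""
    tokens = sentence.lower().split()
    return set(zip(tokens, tokens[1:]))


def bigram_overlap(a: str, b: str) -> float:
    """Jaccard overlap of bigram sets: a rough stand-in for Self-BLEU."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def analyze_generations(
    generations: List[str],
    is_equivalent: Callable[[str, str], bool],
) -> Tuple[float, float]:
    """Return (equivalence ratio, mean pairwise overlap) over all unordered pairs."""
    pairs = list(combinations(generations, 2))
    if not pairs:
        return 0.0, 0.0
    equiv = sum(is_equivalent(a, b) for a, b in pairs) / len(pairs)
    overlap = sum(bigram_overlap(a, b) for a, b in pairs) / len(pairs)
    return equiv, overlap
```

A high equivalence ratio combined with a low overlap score is the pattern the analysis looks for: completions that mean the same thing while being worded differently.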

The authors also emphasize the importance of fine-grained decoding temperature control in striking an optimal balance between sample efficiency and diversity during this guided generation process. By carefully tuning the temperature, they demonstrate that it is possible to efficiently sample a diverse set of high-quality paraphrastic sentence pairs from the model.
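To see why temperature matters here, the toy example below shows how dividing logits by a temperature reshapes the next-token distribution, and how that changes the size of the nucleus (top-p) set that sampling draws from. The logits, the temperatures, and the `top_p` value are illustrative assumptions, not settings from the paper.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs: List[float], top_p: float) -> List[int]:
    """Indices of the smallest token set whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.5, 0.1]  # toy next-token scores (assumed values)
for t in (0.5, 1.0, 1.5):
    p = softmax_with_temperature(logits, t)
    print(t, [round(x, 3) for x in p], "nucleus size:", len(nucleus(p, 0.9)))
```

Lower temperatures concentrate probability on a few tokens, so more samples stay on topic but look alike; higher temperatures widen the nucleus and increase diversity at the cost of more off-target samples.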

Taken together, these findings provide compelling evidence for paraphrastic proximity in off-the-shelf language models: given sufficiently rich context, a standard LM already produces completions that are semantically equivalent yet lexically distinct. This is the conceptual and empirical foundation of the Impossible Distillation framework, since it means high-quality paraphrase pairs can be sampled from a standard language model without expensive, specialized training or massive computational resources.

How paraphrastic are GPT2-XL generations? We compute the ratio of semantically equivalent pairs and their average Self-BLEU

Impossible Distillation

Impossible Distillation is the core framework proposed in the paper for distilling high-quality paraphrase and summarization datasets and models from off-the-shelf language models. The process takes a teacher LM, denoted as MT, and distills its knowledge into a student LM, MS, to yield a specialized model Mtask for the target task. A valuable byproduct of this distillation is a high-quality dataset Dtask.

The distillation pipeline is composed of four key stages. The first stage, Pair Generation, focuses on sampling a large pool of candidate (source, paraphrase) sentence pairs CT from the teacher LM. This is accomplished by first generating diverse multi-sentence contextual prompts ci from MT, either unconditionally or conditioned on a simple domain-specific prompt. For each prompt, a batch of k sentence completions is generated via nucleus sampling with a carefully tuned temperature τtemp. The candidate pool CT is then constructed by enumerating all pairwise combinations of distinct sentences within each batch.
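The pair-enumeration step at the end of this stage can be sketched as follows. The function name and the deduplication step are assumptions made for illustration; the sampling from the teacher LM itself is left out and the batches are passed in as plain lists of strings.

```python
from itertools import combinations
from typing import Iterable, List, Tuple


def build_candidate_pool(batches: Iterable[List[str]]) -> List[Tuple[str, str]]:
    """Enumerate all unordered pairs of distinct sentences within each batch."""
    pool: List[Tuple[str, str]] = []
    for batch in batches:
        # Drop empty strings and exact duplicates, preserving first-seen order.
        unique = list(dict.fromkeys(s.strip() for s in batch if s.strip()))
        pool.extend(combinations(unique, 2))
    return pool
```

A batch with k distinct completions contributes k(k-1)/2 candidate pairs, which is why a modest number of prompts can yield a large pool. If ordered (source, paraphrase) pairs are needed, `itertools.permutations(unique, 2)` would give both directions.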

In the second stage, Critic Filtering, the noisy candidate pool undergoes a refinement process to create a high-quality dataset DT. This is achieved using a suite of "critic" models. The Semantic Equivalence Critic, implemented as an NLI model, ensures that the sentences in each pair entail each other bidirectionally, capturing their semantic equivalence. The Dissimilarity Critic enforces lexical and syntactic diversity between the sentences, using ROUGE and tree edit distance thresholds. The Diversity Critic reduces redundancy by clustering the pairs based on entailment and retaining only the most representative pair from each cluster. The pairs that successfully pass through this critic gauntlet form the filtered dataset DT.
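A simplified sketch of two of these critics is shown below. It is not the authors' implementation: `entails` is a hypothetical callable standing in for an NLI model, the ROUGE-L score is a bare-bones version computed over whitespace tokens, and the `max_rouge_l` threshold is an assumed value. The tree-edit-distance check and the Diversity Critic are omitted.

```python
from typing import Callable, Iterable, List, Tuple


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(a: str, b: str) -> float:
    """Simplified ROUGE-L F1 over lowercase whitespace tokens."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    lcs = lcs_length(ta, tb)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(tb), lcs / len(ta)
    return 2 * precision * recall / (precision + recall)


def filter_pairs(
    pairs: Iterable[Tuple[str, str]],
    entails: Callable[[str, str], bool],
    max_rouge_l: float = 0.7,  # assumed threshold, not from the paper
) -> List[Tuple[str, str]]:
    """Keep pairs that entail each other both ways and are not too similar on the surface."""
    kept = []
    for x, y in pairs:
        if not (entails(x, y) and entails(y, x)):
            continue  # Semantic Equivalence Critic: bidirectional entailment
        if rouge_l_f1(x, y) > max_rouge_l:
            continue  # Dissimilarity Critic: reject near-copies
        kept.append((x, y))
    return kept
```

Checking entailment first is a design choice in this sketch only; in practice the cheaper surface-overlap check could run first to reduce the number of NLI calls.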

The third stage, Student Learning, involves fine-tuning an off-the-shelf student model MS on the filtered dataset DT to specialize it for the paraphrasing task. This is accomplished through standard conditional language modeling, where the objective is to maximize the log-likelihood of the paraphrase given the source sentence.
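Written out, this objective is the standard sequence-to-sequence negative log-likelihood. The notation below is an assumption made for illustration: x is the source sentence, y the paraphrase, and θ the student's parameters.

```latex
\mathcal{L}(\theta) = -\sum_{(x, y) \in D_T} \sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid y_{<t}, x\right)
```

Minimizing this loss is equivalent to maximizing the log-likelihood of each paraphrase token given the source sentence and the previously generated tokens.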

In the final stage, Self-Distillation, the student's performance is further enhanced. The fine-tuned student generates a new batch of paraphrase pairs CS by sampling from its own conditional distribution. These pairs are then filtered using the same critic pipeline, yielding a refined dataset Dpara. The student is re-trained on this self-generated data, resulting in the final specialized model Mpara. This self-distillation process has been shown to significantly boost the model's performance.
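One round of this loop might look like the sketch below. All three callables are hypothetical placeholders: `generate` stands in for sampling from the fine-tuned student, `keep` for the full critic pipeline, and `train` for a fine-tuning routine.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]


def self_distill(
    sources: Iterable[str],
    generate: Callable[[str], List[str]],
    keep: Callable[[str, str], bool],
    train: Callable[[List[Pair]], None],
) -> List[Pair]:
    """One round: the student generates paraphrases, critics filter, the student is retrained."""
    candidates = [(src, out) for src in sources for out in generate(src)]
    filtered = [(src, out) for src, out in candidates if keep(src, out)]
    train(filtered)  # retrain the student on its own filtered outputs
    return filtered
```

The returned list plays the role of the refined dataset, and the model state after `train` corresponds to the final specialized student.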

The authors demonstrate an instantiation of their framework called Impossible-T5, a 770M-parameter model distilled from 1.5B-parameter GPT-2 (general domain) and BioGPT (biomedical domain) teachers, using a T5 student. The resulting DIMPLE dataset contains 4M high-quality paraphrase pairs balanced across domains.

A notable feature of the Impossible Distillation framework is its ability to accommodate controllable generation. By conditioning the student on target syntactic parses during fine-tuning, the model gains the ability to generate paraphrases that conform to specified syntactic templates.

Lastly, the authors showcase the generalizability of their approach to the task of sentence summarization. By appropriately redefining the critic models to enforce compression and factual consistency, the same pipeline can be used to distill high-quality summarization datasets and models. This highlights the flexibility and broad applicability of the Impossible Distillation paradigm across various language generation tasks.
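As a rough illustration of what redefined critics could look like for summarization, the sketch below combines a length-based compression check with a one-directional entailment check. Both the form of the checks and the `max_length_ratio` value are assumptions for illustration; the article does not specify the exact criteria, and `entails` is again a hypothetical stand-in for an NLI model.

```python
from typing import Callable


def passes_summary_critics(
    source: str,
    summary: str,
    entails: Callable[[str, str], bool],
    max_length_ratio: float = 0.8,  # assumed threshold
) -> bool:
    """Accept a (source, summary) pair if the summary is shorter and supported by the source."""
    src_len = len(source.split())
    sum_len = len(summary.split())
    if src_len == 0 or sum_len == 0:
        return False
    if sum_len / src_len > max_length_ratio:
        return False  # compression: the summary must be sufficiently shorter
    return entails(source, summary)  # consistency: the source should entail the summary
```

The contrast with paraphrasing is that entailment only needs to hold in one direction here, since a summary is allowed to drop information from the source.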

Experiments

The authors present a thorough evaluation of the Impossible Distillation framework, assessing both the quality of the distilled dataset DIMPLE and the performance of the specialized model Impossible-T5 across a diverse set of paraphrase generation and sentence summarization tasks. This comprehensive analysis serves to validate the effectiveness of the proposed approach and shed light on the relative contributions of its key components.

The evaluation begins with an intrinsic comparison of DIMPLE against three state-of-the-art paraphrase corpora: ParaBank1, ParaBank2, and ChatGPT-Para. ParaBank1 and ParaBank2 are constructed using back-translation techniques, with the latter incorporating additional clustering and resampling steps to enhance the diversity of the generated paraphrases. ChatGPT-Para, on the other hand, is distilled directly from ChatGPT by prompting the model to paraphrase sentences sampled from a variety of web sources. To assess the quality of these datasets, the authors employ a suite of metrics that capture semantic similarity, lexical diversity, and syntactic diversity. Despite being an order of magnitude smaller in size, DIMPLE consistently outperforms all the baselines across these dimensions, showcasing the sample efficiency of the Impossible Distillation framework in extracting high-quality paraphrastic knowledge. Notably, the superiority of DIMPLE over ChatGPT-Para highlights that the scale of the teacher language model is not the sole determining factor in the quality of the distilled data. Instead, it underscores the critical role played by the critic-driven filtering process in ensuring the generation of diverse, semantically faithful, and syntactically varied paraphrases.

Quality comparison between paraphrase datasets. DIMPLE, as a purely synthetic corpus generated from 1.5B LMs, exhibits better diversity compared to others, including the dataset constructed by prompting ChatGPT.

Building on these promising dataset-level results, the authors proceed to evaluate the performance of the Impossible-T5 model on a range of benchmark tasks for unconstrained paraphrase generation. These benchmarks encompass both general-domain and specialized datasets, allowing for a comprehensive assessment of the model's generalization capabilities. Across all the evaluated datasets, Impossible-T5 significantly outperforms baseline models of the same size that were trained on substantially larger human-authored and synthetically generated corpora. The model achieves up to 10% relative improvements in iBLEU and BERT-iBLEU scores, which are widely adopted metrics known to correlate well with human judgments of paraphrase quality. Notably, Impossible-T5 is the only model with fewer than a billion parameters that can match the performance of the 175B-parameter GPT-3 across all benchmarks. Furthermore, it even surpasses the performance of ChatGPT on domain-specific tasks, highlighting its ability to generate high-quality paraphrases even in specialized linguistic contexts.

To further validate these automated evaluation results, the authors conduct human evaluations of the generated paraphrases. The results from these studies corroborate the quantitative findings, with human raters consistently judging the outputs of Impossible-T5 to be more fluent, semantically faithful to the input, and lexically diverse compared to strong baselines.
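For readers unfamiliar with iBLEU, the metric as introduced by Sun and Zhou (2012) rewards overlap with the reference while penalizing overlap with the source, so that copying the input is not mistaken for good paraphrasing. The helper below combines two precomputed BLEU scores; the default `alpha` is an assumed value, since the article does not state the setting used in the paper.

```python
def ibleu(bleu_vs_reference: float, bleu_vs_source: float, alpha: float = 0.8) -> float:
    """iBLEU = alpha * BLEU(output, reference) - (1 - alpha) * BLEU(output, source)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * bleu_vs_reference - (1.0 - alpha) * bleu_vs_source
```

With the same reference score, an output that copies the source verbatim (source BLEU near 100) scores far lower than one that rewords it, which is the behavior the metric is designed to capture.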

Experimental results of Impossible-T5 and baselines on unconstrained paraphrase generation. Impossible-T5 outperforms the same-size model trained on much larger datasets, and is competitive with a 175B LLM on both general and domain-specific benchmarks.

In addition to its strong performance on unconstrained paraphrase generation, Impossible-T5 also demonstrates remarkable abilities in syntactically controlled generation. The authors evaluate the model on the ParaNMT-small benchmark, which requires generating paraphrases that adhere to a specified target syntactic structure. Impossible-T5 achieves the highest overall quality in this setting while closely conforming to the desired syntactic templates, outperforming few-shot prompted versions of ChatGPT. This result highlights the advantage of explicitly distilling controllable generation capabilities into a specialized model, as opposed to relying on ad-hoc prompting of a large language model.

To showcase the generalizability of the Impossible Distillation framework beyond paraphrasing, the authors apply it to the task of sentence summarization. By appropriately redefining the critic models to enforce compression and factual consistency desiderata, they distill a high-quality summarization dataset and model that outperforms strong baselines on the Gigaword benchmark. The success of this transfer crucially relies on the explicit alignment of the generation process with task-specific objectives through the critic-based filtering stage, rather than naively repurposing a paraphrase corpus for summarization.

Finally, the authors present a series of ablation studies aimed at disentangling the relative contributions of the key components in the Impossible Distillation pipeline. They find that the self-distillation stage, where the student model is refined on its own generated outputs, provides a significant performance boost compared to the base model trained only on the teacher-distilled data. However, the effectiveness of this self-distillation process is contingent on the initial quality and diversity of the synthetic data used for refinement.

Naively applying self-distillation to a model trained on raw ChatGPT-generated paraphrases yields only limited gains, emphasizing the importance of the critic-driven filtering step in the overall pipeline. Notably, filtering ChatGPT outputs with the same critic suite used in Impossible Distillation does lead to improvements in downstream performance. However, this setup still underperforms the end-to-end Impossible Distillation approach due to lower sample efficiency, as a significant portion of the raw ChatGPT generations fail to pass the critic thresholds. These ablations underscore the importance of jointly optimizing the data generation and filtering processes, as is done in the proposed framework.

In summary, the extensive experimental evaluation presented in this section provides strong empirical evidence for the effectiveness of the Impossible Distillation framework in extracting high-quality, task-specific datasets and models from compact, off-the-shelf language models. The resulting artifacts, DIMPLE and Impossible-T5, establish new state-of-the-art performance across a range of paraphrasing and summarization benchmarks, rivaling models distilled from language models orders of magnitude larger in size. The ablation studies offer valuable insights into the relative contributions of the key technical innovations in the proposed approach, highlighting the critical role of the critic-driven generation and filtering process in enabling this level of performance. Taken together, these results make a compelling case for the Impossible Distillation paradigm as a new foundation for efficient, transparent, and controllable natural language generation that democratizes access to high-quality task-specific models without the need for massive computational resources or expensive human annotation.

Conclusion

The "Impossible Distillation" paper presents a groundbreaking framework for distilling high-quality, task-specific datasets and models from compact, off-the-shelf language models. By leveraging the key insights of paraphrastic proximity and critic-based filtering, the approach can extract and amplify the latent knowledge in these models, yielding performance that rivals that of models distilled from state-of-the-art language models orders of magnitude larger in size. The resulting artifacts, DIMPLE and Impossible-T5, establish new standards for paraphrase generation and sentence summarization, excelling across a diverse range of benchmarks and human evaluations. The ablation studies presented in the paper provide valuable insights into the relative contributions of the key technical innovations, underscoring the importance of jointly optimizing the data generation and filtering processes.

Beyond its empirical success, Impossible Distillation represents a significant conceptual advance in the field of natural language generation. It challenges the prevailing wisdom that high-quality task-specific models can only be obtained through expensive human annotation or distillation from massive language models. Instead, it charts a new path forward, one that democratizes access to state-of-the-art performance by enabling the efficient extraction of task-specific knowledge from widely available, off-the-shelf models. As such, Impossible Distillation opens exciting new avenues for research and application, paving the way for a new generation of natural language technologies that are more efficient, transparent, and accessible to all.

Reference

Jung, Jaehun, et al. "Impossible distillation: from low-quality model to high-quality dataset & model for summarization and paraphrasing." arXiv preprint arXiv:2305.16635 (2023).
