shakti web3’s Post

Understanding how language model performance varies with scale is critical to benchmark and algorithm development. Scaling laws are one approach to building this understanding, but the requirement of training models across many different scales has limited their use. We propose an alternative, observational approach that bypasses model training and instead builds scaling laws from ~80 publicly available models. Building a single scaling law from multiple model families is challenging due to large variations in their training compute efficiencies and capabilities. However, we show that these variations are consistent with a simple, generalized scaling law where language model performance is a function of a low-dimensional capability space, and model families only vary in their efficiency in converting training compute to capabilities. Using this approach, we show the surprising predictability of complex scaling phenomena: we show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models; we show that the agent performance of models such as GPT-4 can be precisely predicted from simpler non-agentic benchmarks; and we show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve. #LanguageModels #ScalingLaws #ModelEfficiency #PerformancePrediction #EmergentPhenomena
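
To make the "smooth, sigmoidal behavior" claim concrete, here is a minimal sketch in Python. It is not the authors' code. It assumes each model has already been reduced to a single hypothetical capability score, uses synthetic noise-free data with made-up constants (`TRUE_W`, `TRUE_B`), and assumes a plain sigmoid with no floor or ceiling terms. It fits the curve on "small" models only and then extrapolates to a more capable one.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def fit_sigmoid(capabilities, accuracies):
    """Fit accuracy = sigmoid(w * capability + b) by least squares in logit space."""
    ys = [logit(a) for a in accuracies]
    n = len(capabilities)
    mean_x = sum(capabilities) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in capabilities)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(capabilities, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


# Synthetic illustration only: these constants and scores are assumptions,
# not values taken from the post or the underlying paper.
TRUE_W, TRUE_B = 1.5, -3.0
small_caps = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical capability scores
small_acc = [sigmoid(TRUE_W * c + TRUE_B) for c in small_caps]

# Fit on the small models, then extrapolate to a hypothetical stronger model.
w, b = fit_sigmoid(small_caps, small_acc)
large_cap = 3.0
predicted_large = sigmoid(w * large_cap + b)
print(f"w={w:.3f}, b={b:.3f}, predicted accuracy={predicted_large:.3f}")
```

On the small models the task accuracy stays low and looks nearly flat, yet the fitted curve still predicts a sharp rise at higher capability. That is the sense in which an apparently emergent jump can be predictable from small models. Real benchmark scores are noisy, so a real fit would not recover the parameters this cleanly.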
