Katrin Selbak’s Post

Katrin Selbak

English-Estonian sworn translator @Pilvekiri // Translation Studies MA @UniTartu (EMT network) // legal/IT/marketing translation/MTPE

The trilemma of AI/LLMs: legal (privacy/confidentiality/copyright), accuracy, and environmental (resource) issues.

To get more accurate results, we need more training data, which could raise more legal issues. Also, to get more accurate results from that huge amount of training data, we need more computational power and thus more resources, which raises environmental issues.

It reminds me how at the Translating Europe Forum #TEF2024 ... we talked a lot about the accuracy part (how to use AI in translation and terminology management to get the best results), ... the legal issues came up here and there (like whether everything available on the Internet is really not copyright-protected and thus OK to feed to AI), ... and the environmental issues were raised several times in questions from the audience and the chat, but not really addressed by any speakers or panelists.

Is this something that will continue to evolve - from accuracy issues to legal and then environmental ones? ... Has it happened in other areas - the first concern being whether it works right and gives the expected results, then whether that is legal, and after that whether it is also environmentally reasonable? ... What if we included all these different concerns right from the beginning?

ANANT VERMA

M.Tech | IIT Patna | Artificial Intelligence and Data Science Engineering

🚀 Optimizing Large Language Models: Diving into Quantization for Efficiency and Performance

Today, I focused on the fascinating realm of quantization, exploring both symmetric and asymmetric techniques. In the ever-evolving world of AI, fine-tuning large language models (LLMs) presents both exciting opportunities and significant challenges, particularly around computational costs and resource requirements. One promising solution is quantization, a technique that makes these massive models more efficient by reducing the numerical precision of their weights.

💡 Real-World Example: The LLaMA3.1–70B model at FP32 precision requires a staggering 336 GB of VRAM, making inference feasible only across multiple high-end GPUs. With 4-bit quantization, the memory footprint shrinks by 8× (about 87.5%) to roughly 42 GB, enabling deployment on a single A100 GPU. This demonstrates quantization's transformative potential in democratizing LLM accessibility.

What is Linear Quantization?
Linear quantization is one of the most widely adopted methods for compressing LLMs: it maps model weights from higher precision (e.g., FP32) to lower precision (e.g., FP16, BF16, INT8) using a linear scaling of the value range.

🔑 Two Main Modes:
1️⃣ Asymmetric Linear Quantization: uses a scale plus a zero-point, making it flexible for tensors whose value ranges are not centered on zero.
2️⃣ Symmetric Linear Quantization: uses only a scale (zero-point fixed at 0), making it simpler and more hardware-friendly.

Types of LLM Quantization
🔸 Post-Training Quantization (PTQ): quick and efficient, applied after training without retraining the model.
🔸 Quantization-Aware Training (QAT): yields higher accuracy by simulating quantization during training.

Quantization isn't just about making models smaller; it's about making them smarter, more scalable, and accessible to everyone. Stay tuned for the next update as we explore advanced quantization techniques and their real-world applications!

#LLM #FineTuning #Quantization #AI #MachineLearning #Optimization
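
To make the two modes above concrete, here is a minimal NumPy sketch of per-tensor symmetric and asymmetric linear quantization (shown at INT8 for readability; 4-bit works the same way with a smaller integer grid). The function names and the toy weight tensor are illustrative assumptions, not code from the post or from any specific quantization library.

# Minimal sketch of symmetric vs. asymmetric linear quantization (per-tensor, INT8).
# Illustrative only: real LLM quantizers add per-channel/group scales, calibration
# data, and packed low-bit storage formats.
import numpy as np

def symmetric_quantize(x: np.ndarray, num_bits: int = 8):
    """Symmetric mode: zero-point is fixed at 0, grid is centered on zero."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g., 127 for INT8
    scale = np.max(np.abs(x)) / qmax          # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def asymmetric_quantize(x: np.ndarray, num_bits: int = 8):
    """Asymmetric mode: a zero-point shifts the grid to cover skewed ranges."""
    qmin, qmax = 0, 2 ** num_bits - 1         # e.g., 0..255 for UINT8
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(np.clip(np.round(qmin - x_min / scale), qmin, qmax))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point=0):
    """Map integer codes back to approximate floating-point values."""
    return (q.astype(np.float32) - zero_point) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy "weight" tensor with a slightly off-center distribution.
    w = rng.normal(loc=0.2, scale=0.5, size=10_000).astype(np.float32)

    q_sym, s_sym = symmetric_quantize(w)
    q_asym, s_asym, zp = asymmetric_quantize(w)

    err_sym = np.abs(w - dequantize(q_sym, s_sym)).mean()
    err_asym = np.abs(w - dequantize(q_asym, s_asym, zp)).mean()

    print(f"symmetric : scale={s_sym:.6f}, zero_point=0,  mean abs error={err_sym:.6f}")
    print(f"asymmetric: scale={s_asym:.6f}, zero_point={zp}, mean abs error={err_asym:.6f}")
    # INT8 storage is 4x smaller than FP32; packed 4-bit storage would be 8x smaller.

In general, the asymmetric variant tracks skewed value ranges a bit more tightly at the cost of storing and applying a zero-point, which is one reason symmetric quantization is often preferred for weights on integer-friendly hardware.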
