Joel Anderson’s Post

Many people are talking about how LLMs can't count the number of r's in "strawberry." Most people say it's because LLMs aren't good at math, but that's not the issue. To understand it, you need to understand how LLMs work and how the internals interact with the task at hand, which comes down to:

1. Tokenization (how LLMs interpret words)
2. Training data

-- Tokenization --

LLMs break words down into tokens. Tokens are small words or parts of larger words, but importantly, they're usually not broken up by letters. Why does this matter? It means the model sees a word like strawberry as a single unit rather than as the letters that make up the word, the way we see it. This is efficient for processing (and efficient for training the model).

See the screenshot below using the GPT-4 tokenizer. Interestingly, strawberry is a single token when there is a space before it (most of the time), and it is broken into multiple sub-tokens when it either starts a sentence or follows a quotation mark. This tokenization quirk likely adds slightly to the confusion and may be partly why strawberry gets picked on. "Blueberry," on the other hand, is broken into "blue" and "berry" tokens in both cases.

-- Training data --

The model still knows "strawberry" is a compound word made up of the words "straw" and "berry," because that fact appears in its training data many times (one of those things that just comes up in conversation). But there are far too many combinations of words and letters for all of them to appear in the training data often enough for LLMs to learn them. Note that LLMs still count the letters in many words accurately; that's because they do have some examples in their training data, and they're pretty good at pattern matching to generalize to other words.
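If the screenshot doesn't render for you, the behavior is easy to reproduce yourself. Below is a minimal sketch using OpenAI's open-source tiktoken library; cl100k_base is the encoding used by GPT-4, though exact splits can differ across models and tokenizer versions, so treat the output as illustrative.

```python
# A minimal sketch reproducing the tokenization behavior described above,
# using OpenAI's tiktoken library (pip install tiktoken). cl100k_base is
# the GPT-4 encoding; exact splits may differ for other models/versions.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["strawberry", " strawberry", '"strawberry"', "blueberry", " blueberry"]:
    token_ids = enc.encode(text)
    pieces = [
        enc.decode_single_token_bytes(t).decode("utf-8", errors="replace")
        for t in token_ids
    ]
    print(f"{text!r:16} -> {len(token_ids)} token(s): {pieces}")
```

The key point is visible in the output: the model receives token IDs, not letters, so "how many r's" asks it about units it never directly sees.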
More Relevant Posts
A clear and helpful description of how tokenization works in an LLM. Those of us experimenting with Gen AI and prompting strategies would do well to also learn foundational principles (like this) of how LLMs work so that we can better understand *why* our prompts work well (or don’t), and how to improve them.
Check out this article on 7 key terms every machine learning beginner should know. #machinelearning https://lnkd.in/guzccfs2
𝗪𝗮𝗻𝘁 𝘁𝗼 𝗹𝗲𝗮𝗿𝗻 𝗦𝘂𝗽𝗽𝗼𝗿𝘁 𝗩𝗲𝗰𝘁𝗼𝗿 𝗠𝗮𝗰𝗵𝗶𝗻𝗲𝘀 (𝗦𝗩𝗠𝘀)? ↓↓↓

💠 𝗕𝗲𝗳𝗼𝗿𝗲 𝘄𝗲 𝗯𝗲𝗴𝗶𝗻 ...
SVMs are used for both classification and regression tasks, but they’re best known for their effectiveness in high-dimensional spaces and binary classification problems.

💠 𝗪𝗵𝗮𝘁 𝗶𝘀 𝗮 𝗦𝘂𝗽𝗽𝗼𝗿𝘁 𝗩𝗲𝗰𝘁𝗼𝗿 𝗠𝗮𝗰𝗵𝗶𝗻𝗲?
SVMs are supervised learning models that find the best boundary (hyperplane) separating data points of different classes. Imagine a set of data points plotted on a graph. SVMs aim to find the "maximum margin" between the classes — the largest possible distance between data points of different categories.
🔹 SVMs use a "kernel trick" to handle non-linear boundaries by transforming the original feature space into a higher-dimensional space where a linear separator might be found.
🔹 Support vectors are the data points closest to the hyperplane; these points are critical in defining the position of the hyperplane.

💠 𝗛𝗼𝘄 𝗦𝗩𝗠𝘀 𝗪𝗼𝗿𝗸: 𝗔𝗻 𝗘𝘅𝗮𝗺𝗽𝗹𝗲
Imagine you're developing a spam detection system for emails (a runnable sketch follows this post):

=== 𝗧𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝗽𝗵𝗮𝘀𝗲 ===
1️⃣ Collect a dataset of emails labeled as "spam" or "not spam".
2️⃣ Represent each email as a data point in a high-dimensional space based on features (like the frequency of specific words, length of the email, etc.).
3️⃣ Use the SVM to find the optimal hyperplane that best separates "spam" from "not spam" emails.

=== 𝗖𝗹𝗮𝘀𝘀𝗶𝗳𝗶𝗰𝗮𝘁𝗶𝗼𝗻 𝗽𝗵𝗮𝘀𝗲 ===
4️⃣ When a new email arrives, represent it as a point in the same feature space.
5️⃣ Determine which side of the hyperplane the new email falls on to classify it as "spam" or "not spam".

💠 𝗞𝗲𝗿𝗻𝗲𝗹 𝗧𝗿𝗶𝗰𝗸: 𝗧𝗵𝗲 𝗚𝗮𝗺𝗲-𝗖𝗵𝗮𝗻𝗴𝗲𝗿
SVMs use different kernels (like linear, polynomial, and radial basis function (RBF)) to map data into higher dimensions. For example:
🔹 Linear Kernel: Best for linearly separable data.
🔹 RBF Kernel: Handles more complex, non-linear relationships by mapping data to a higher-dimensional space.

💠 𝗔𝗻 𝗦𝗩𝗠 𝗜𝗻 𝗔𝗰𝘁𝗶𝗼𝗻: 𝗔 𝗩𝗶𝘀𝘂𝗮𝗹 𝗘𝘅𝗮𝗺𝗽𝗹𝗲
Let’s imagine an SVM as a judge deciding which side of a fence you belong to:

=== 𝗦𝗲𝗽𝗮𝗿𝗮𝘁𝗶𝗻𝗴 𝗵𝗲𝗹𝗽𝗳𝘂𝗹 𝗲𝗺𝗮𝗶𝗹𝘀 𝗳𝗿𝗼𝗺 𝘀𝗽𝗮𝗺 ===
1️⃣ Your emails are scattered across a field (the feature space).
2️⃣ The SVM finds the ideal fence (hyperplane) with maximum distance from both categories.
3️⃣ If a new email lands close to spam emails, it’s flagged as spam; if it’s closer to useful emails, it stays in your inbox.
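Here is a minimal sketch of the spam-detection walkthrough above using scikit-learn. The emails and labels are invented purely for illustration; a real system would need far more data.

```python
# A toy version of the spam-detection example above (scikit-learn).
# The emails and labels here are made up purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Training phase: labeled emails ...
emails = [
    "win a free prize now", "limited offer click here to claim",
    "meeting notes attached", "lunch tomorrow at noon?",
]
labels = ["spam", "spam", "not spam", "not spam"]

# ... mapped into a high-dimensional feature space (TF-IDF word features),
# where the RBF-kernel SVM finds a maximum-margin separating boundary.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="rbf"))
model.fit(emails, labels)

# Classification phase: new emails fall on one side of the hyperplane.
print(model.predict(["claim your free prize", "see you at the meeting"]))
```

The pipeline mirrors the two phases described above: the vectorizer plays the role of step 2️⃣ (turning text into points in feature space), and the fitted SVC performs steps 3️⃣ through 5️⃣.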
Confused about Machine Learning Terms? I was too; here are a few. I'm a beginner, so I encourage you to point out anything inaccurate. This is the foundational mental picture, so I'd love for someone to flag any errors early on, or to validate that it's correct.

tl;dr: plenty of data gets split into training and testing sets.
- Training set: the model uses it to "learn".
- Testing set: used to validate that "learning" and to guide optimization.

Model = ML algorithm/architecture + learned weights. Basically: choose an algorithm that gives the correct output for a given input from the training data, and store what was learned as weights (the parameters of the algorithm, based on the features of the data). Different kinds of models have their own pros and cons. Once the model is trained, you can ask it things; querying the trained model is called inference. (A small runnable sketch follows.)
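To make the terms concrete, here is a small sketch using scikit-learn and its bundled iris dataset (chosen only for illustration): split, train, validate, then infer.

```python
# A small sketch of the train/test/inference workflow described above,
# using scikit-learn and its bundled iris dataset (just for illustration).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Split the data: the model "learns" only from the training portion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Model = algorithm/architecture + learned weights (here, the coefficients
# of a logistic regression fitted to the training features).
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# The held-out test set validates the learning.
print("test accuracy:", model.score(X_test, y_test))

# Inference: querying the trained model on new inputs.
print("prediction for one sample:", model.predict(X_test[:1]))
```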
RAG (Retrieval-Augmented Generation) is significantly more complex to work with than traditional machine learning. It is designed for a different set of tasks, requiring real-time retrieval and a dependency on external systems. Certain tasks, such as working with dynamic data, make RAG the obvious choice. In contrast, for embedding knowledge directly into the model or altering its style, traditional training is more suitable. Performing gradient descent on a known dataset is a straightforward process: optimizers like AdamW are highly effective, learning rates and schedulers are well researched, and achieving low loss on validation sets is a well-understood goal. Building a vector database from sources such as PDFs, webpages, and other content, and getting the model to access this database appropriately, is a much greater challenge. Currently, llama-index appears to be a more promising tool than langchain, which still suffers from notoriously difficult documentation even in version 0.2. Llama-index offers a wide range of plugins capable of accessing, indexing, and embedding information. Over the next few days, I will focus on using llama-index to handle relatively simple natural language questions answered by SQL queries on straightforward databases.
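For what it's worth, the text-to-SQL pattern to start from looks roughly like the sketch below. Import paths and class names have shifted between llama-index releases, and it assumes an OpenAI API key is configured and that a hypothetical orders table exists in the database, so treat it as a rough outline rather than working code for any specific version.

```python
# A rough sketch of llama-index's text-to-SQL pattern for the use case
# described above. Assumptions: import paths match a recent llama-index
# release (they have moved between versions), an OpenAI API key is set
# in the environment, and a hypothetical "orders" table exists in the DB.
from sqlalchemy import create_engine
from llama_index.core import SQLDatabase
from llama_index.core.query_engine import NLSQLTableQueryEngine

engine = create_engine("sqlite:///example.db")  # hypothetical database
sql_database = SQLDatabase(engine, include_tables=["orders"])

# The query engine prompts the LLM to write SQL against the table schema,
# executes it, and synthesizes a natural-language answer from the rows.
query_engine = NLSQLTableQueryEngine(sql_database=sql_database, tables=["orders"])

response = query_engine.query("How many orders were placed last month?")
print(response)
```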
Now, unsupervised learning is when we train an algorithm on unlabeled 🏷️ data. Let's say we have a large batch of lemons 🍋 and want the algorithm to sort them into small, medium, and large sizes. We give the unstructured data to an ML model, and it returns the data organized into groups; this is called unsupervised learning. Let's demonstrate it.
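Here is one way to demonstrate it: a toy k-means clustering of lemon measurements (the numbers below are made up for illustration).

```python
# A toy demonstration of the lemon example: cluster unlabeled lemon
# measurements into three size groups with k-means (scikit-learn).
# The diameters/weights below are made-up numbers for illustration.
import numpy as np
from sklearn.cluster import KMeans

# Each row is one lemon: (diameter in cm, weight in g). No labels given.
lemons = np.array([
    [4.0, 55], [4.2, 60], [4.1, 58],      # small-ish
    [5.5, 90], [5.6, 95], [5.4, 88],      # medium-ish
    [7.0, 140], [7.2, 150], [6.9, 135],   # large-ish
])

# Ask k-means to discover 3 groups on its own -- no "small/medium/large"
# labels are ever provided; the structure comes from the data alone.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(lemons)
print("cluster assignments:", kmeans.labels_)
print("cluster centers:\n", kmeans.cluster_centers_)
```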
Just completed the "ChatGPT + Excel: Master Data, Make Decisions, Tell Stories" course on Coursera! 🎓 As a finance professional, I found it sharpened my ability to analyze complex data, streamline decision-making, and present insights with clarity using Excel and AI tools. Excited to apply these skills to drive smarter, data-driven strategies.
CEO & Co-Founder
Thanks for this, Joel Anderson. It needed to be addressed. Beyond the technical aspects of tokenizers' mechanics and training data, the key point here is that LLMs tend to double down on validating an incorrect answer once they start down the wrong path—and can be very assertive in the process. The Strawberry example may seem trivial, but it highlights a deeper issue regarding trust in their outputs. We should also explore the types of guardrails, research frameworks, and technical stacks needed to use LLMs effectively while mitigating their quirks and shortcomings. This is a fascinating discussion—thank you for keeping it alive!