Introduction to Linear Algebra with Applications to Machine Learning - Image Compression
When it comes to machine learning, many algorithms are represented mathematically using techniques from linear algebra. I aim to share this information with simple and insightful explanations.
What is Linear Algebra?
Think of it like Algebra, except it uses the matrix.
Oops, not that Matrix...
It's actually these matrices:
There are a lot of numbers here, though what is really happening is simple: we add each number in matrix A to the number in matrix B that sits at the same row and column. For this operation, we are just doing addition many times over to process some volume of data that may mean something in a practical situation.
There is a commonly used notation for picking out individual entries: the matrix name followed by its [row, column] indices.
For example A[1,0] = 4 and B[1,0] = 8 for the image above.
A[1,0] + B[1,0] = 4 + 8 = 12, which is the result we see in the addition above.
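To make this concrete, here is a minimal sketch of matrix addition using NumPy. The full matrices below are made-up placeholders; only the A[1,0] and B[1,0] entries are taken from the example above.

import numpy as np

# Hypothetical matrices; only the [1,0] entries (4 and 8) come from the example above
A = np.array([[1, 2],
              [4, 3]])
B = np.array([[5, 6],
              [8, 7]])

C = A + B          # element-wise addition: C[i, j] = A[i, j] + B[i, j]
print(C[1, 0])     # 4 + 8 = 12, matching A[1,0] + B[1,0] above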
How to Multiply Matrices
Multiplication is different, and this is where it gets interesting.
For multiplication we do some "flippy-spinny" stuff that may seem weird to a beginner. Just follow along, and with practice it begins to make sense.
Take some time to analyze the figure above to see what is happening.
The Dot Product Used In Matrix Multiplication
We use this thing called a "dot product" of the rows of the first matrix against the columns of the second matrix.
In linear algebra a "dot product" is the way we "multiply" the rows of the first matrix against the columns of the second matrix. The reason for this comes mostly from geometry.
A row of numbers [a1, a2, a3] can be thought of as a vector, and the same goes for a column. A vector in linear algebra is just a list of numbers: a list of data that means something when given an application. It could be a list of the shoe sizes Alice owns, for example [size1, size2, size3, ...].
In regular algebra, a single number is what linear algebra calls a "scalar". When we multiply one scalar by another, the same geometry applies, except that in regular one-dimensional algebra everything happens on a single number line.
Though when we let ourselves see past one dimension and break out into "the matrix" (another cue for me to show Keanu Reeves)
then geometry applies. It is at this point we need to start thinking about circles so we can see how the rows and columns multiply in two or more dimensions.
Getting into trigonometry, a unit circle looks like this:
Within this circle we can have vectors pointing in different directions. Using some other math we can figure out the length and direction of any of these vectors. We do this using triangles.
In this article we will not get too deep into trigonometry. Instead we will only describe how to calculate the angles of a triangle and how to compute the length of a vector.
These two calculations are what is needed in order to understand the geometry of a dot product.
Inside the circle above we can draw triangles. These triangles, combined with what we know about the Pythagorean theorem, give us more insight into what is going on with a vector.
When given the lengths of two sides of a triangle, it is possible to compute the angles within it. The math for calculating these angles is in the image below.
To calculate the cos(theta) of a triangle, we just need to know the length of the adjacent side and the hypotenuse. Example of a calculation in the image below:
To get the length of a vector, known in linear algebra as its "magnitude", we use the Pythagorean theorem most people learn in school: a^2 + b^2 = c^2.
Moving terms around, we can get the magnitude of vector a, which is denoted by the symbol |a|.
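As a quick sketch, here is how the magnitude and cos(theta) calculations look in NumPy; the vector and side lengths below are made-up values for illustration.

import numpy as np

# Magnitude of a vector a via the Pythagorean theorem: |a| = sqrt(a1^2 + a2^2)
a = np.array([3.0, 4.0])               # hypothetical example vector
magnitude_a = np.sqrt(np.sum(a ** 2))  # same result as np.linalg.norm(a)
print(magnitude_a)                     # 5.0

# cos(theta) for a right triangle with a known adjacent side and hypotenuse
adjacent = 3.0                         # hypothetical side lengths
hypotenuse = 5.0
print(adjacent / hypotenuse)           # 0.6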
So now... let's forget everything we just learned and get back to why we are looking at all this in the first place
When we see two vectors on a graph, such as vectors A and B in the image below,
and we want to multiply them, we can take the magnitude of A, denoted |A|, and form a triangle by drawing an imaginary line, like the dotted line in the image above, to get the cos(theta) of that triangle.
When we multiply |A| by cos(theta), what we get is the piece of A that points in the SAME direction as B. This flattens the two-dimensional picture into the one-dimensional one that most of us are used to when multiplying scalars (regular numbers) like 2*2=4.
So then we can compute |A| times cos(theta) times the magnitude of B, denoted |B|, to get the dot product of the two vectors, thus finishing the multiplication.
So given this, we can see that the algebraic dot product is equivalent to the geometric dot product:
A•B = a0b0 + a1b1 + a2b2 + ... + aNbN = |A||B|cos(θ)
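We can check this equivalence numerically with a small NumPy sketch; the two vectors below are made-up examples.

import numpy as np

A = np.array([2.0, 1.0])   # hypothetical vectors
B = np.array([3.0, 4.0])

# Algebraic dot product: a1*b1 + a2*b2
algebraic = np.dot(A, B)   # 2*3 + 1*4 = 10

# Geometric dot product: |A| |B| cos(theta), with theta measured from each vector's angle
theta = np.arctan2(B[1], B[0]) - np.arctan2(A[1], A[0])
geometric = np.linalg.norm(A) * np.linalg.norm(B) * np.cos(theta)

print(algebraic, geometric)   # both are 10.0 (up to floating-point rounding)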
So when looking at matrix multiplication again...
we now have a sense of what is going on when each row and each column is multiplied in terms of the numbers. If we were to put these numbers on a graph, we could also visualize how each dot product is performed, given the explanations just provided.
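Here is a small sketch, with made-up 2x2 matrices, showing that each entry of a matrix product is exactly one of these row-times-column dot products.

import numpy as np

A = np.array([[1, 2],
              [3, 4]])       # hypothetical matrices
B = np.array([[5, 6],
              [7, 8]])

C = A @ B                    # full matrix multiplication

# Each entry of C is the dot product of a row of A with a column of B
print(np.dot(A[0, :], B[:, 0]))   # row 0 of A dot column 0 of B = 1*5 + 2*7 = 19
print(C[0, 0])                    # also 19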
What Matrix Transformations Are
Ok, so now that we have a sense of what a vector is, how vectors are multiplied, and how they move around in terms of geometry, we can get a bit more tricky and think about how we can transform information. These transformations are essential for manipulating images and for running machine learning algorithms that process large volumes of data.
Without getting deep into the math of it, we can leverage matrix multiplication and addition to do things to an image or to data: move it, blow it up by scaling it, rotate it around some point, or skew it along some axis. A small sketch of these ideas follows below.
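This is a minimal sketch of those transformations in NumPy, applied to a few made-up 2-D points: scaling with a diagonal matrix, rotating with a rotation matrix, and moving (translating) by adding an offset vector.

import numpy as np

# A few 2-D points stored as the columns of a matrix (hypothetical coordinates)
points = np.array([[1.0, 2.0, 3.0],
                   [1.0, 0.5, 2.0]])

# Scaling: a diagonal matrix blows every point up to twice its size
scale = np.array([[2.0, 0.0],
                  [0.0, 2.0]])
scaled = scale @ points

# Rotation: rotate every point 90 degrees counter-clockwise around the origin
theta = np.pi / 2
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
rotated = rotation @ points

# Moving: add the same offset vector to every point
moved = points + np.array([[5.0], [-1.0]])

print(scaled, rotated, moved, sep="\n")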
There are some fundamental properties and technical manipulations of matrices which people in mathematics lost sleep over, and possibly cried over, to help us get to where we are today. This allows me to be here, casually drinking tea and writing this article. Below is what it may have looked like when these algorithms were first being invented. An image of a scholar crying and thinking while sleep-deprived below:
Overview of Linear Combinations
One thing of common interest when looking at matrices is analyzing linear combinations.
Given a set of vectors [v1, v2, v3, ... vn] and a set of scalars [c1, c2, c3, ... cn]
We can write out the following math formula:
c1·v1 + c2·v2 + … + cn·vn
An example of multiplying a vector by a scalar looks like this:
In the image above we could imagine that c1 equals 3 and v1 equals [4, -2]. The result of the multiplication is [12, -6]. We take the 3 and multiply it against every element in the column matrix.
Doing this multiple times and adding the scaled vectors together gives us the resulting value of the linear combination.
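A quick sketch of a linear combination in NumPy; c1 and v1 match the example above, while c2 and v2 are made-up values.

import numpy as np

c1, v1 = 3, np.array([4, -2])   # from the example above: 3 * [4, -2] = [12, -6]
c2, v2 = 2, np.array([1, 5])    # a second, hypothetical scalar/vector pair

result = c1 * v1 + c2 * v2      # [12, -6] + [2, 10] = [14, 4]
print(result)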
Why Linear Combinations are useful
So we see the math of this, though why are we even looking at this in the first place?
Well, for all practical purposes, these can be directly used in matrix transformations. For example, multiplying a matrix by a scalar c1 will scale it, and adding one matrix to another will move it in some desired direction. This can all be done by setting the right values of c1 ... cn in a linear combination.
Linear combinations are also a fundamental mathematical foundation that other concepts are built on top of, and that is where things get really interesting.
With the definition of linear combinations we can now start looking at vector spaces, the span of a vector set, basis vectors and, lastly, eigenvalues and eigenvectors. The last part is especially interesting: without eigenvalues and eigenvectors, the billion-dollar algorithm that made Google possible would not have existed!
Vector Spaces
First let's start wrapping our heads around a vector space. A vector space is just a collection of vectors [v1, v2, ... vN] that can be added together or multiplied by scalars. There are ten fundamental rules that serve as the axioms defining a vector space. An axiom is a rule we accept as true without proof; it is the starting point from which everything else is derived.
All of these operations can serve to do things such as transform a matrix: scale it, move it, rotate it, etc.
As long as these ten rules are followed, any combination of these operations on vectors is guaranteed to produce a vector that lies within the same space.
The span of a vector set
A span of a set of vectors [v1, v2, ... vN] is defined as the set of all the linear combinations of those vectors.
Imagine I gave you a cup of water, salt and sugar, and then asked you to make cups of water with every possible combination of sugar and salt. One cup might have half sugar and half salt. Another cup might be 3/4 salt and 1/4 sugar, another might contain only water, etc. There are millions, or theoretically an infinite number, of ways these ingredients can be combined.
The set of all possible ways of combining these ingredients in the cup is the "span" of salt, water and sugar: it defines all the possible linear combinations of adding these ingredients to the cup.
Linear Dependence and Independence
It is said that the vectors in a set [v1, v2, ... vN] are "dependent" on each other if there exists a set of scalars [c1, c2, ... cN], not all zero, whose linear combination with these vectors adds up to zero.
If no combination of scalars other than all zeros can satisfy this equation, then the vectors are said to be linearly "independent".
Linear dependence exists if and only if there is a set of scalars [c1, c2, ... cN], not all zero, such that c1 v1 + c2 v2 + ··· + cN vN = 0
If there is no such set of [c1, c2, ... cN] with at least one non-zero value satisfying this equation, then the vectors [v1, v2, ... vN] are linearly independent.
When thinking about linear dependence and independence on a graph, two vectors are dependent if they lie along the same line, pointing in the same or exactly opposite directions, so that scaling one can cancel the other. They are independent when this is not possible, no matter what non-zero numbers are tried.
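One practical way to check this with NumPy is to stack the vectors as columns of a matrix and look at its rank; the vectors below are made-up examples.

import numpy as np

# The vectors are linearly independent exactly when the rank equals the number of vectors
dependent = np.column_stack([[1, 2], [2, 4]])     # second vector is 2 * the first
independent = np.column_stack([[1, 0], [0, 1]])   # neither is a multiple of the other

print(np.linalg.matrix_rank(dependent))    # 1 -> linearly dependent
print(np.linalg.matrix_rank(independent))  # 2 -> linearly independent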
So when fathoming the "span" of a vector space, one significant thing often looked for is which sets of vectors within that span are dependent or independent.
Linear dependence and independence alone may not be of much interest unless your goal is to ponder pure mathematics. These properties of a vector space are needed as building blocks for other ideas which give rise to some powerful applications.
Basis Vectors
Now that we have an idea of what spans are, how they relate to linear combinations, and how to determine whether a set of vectors within a space is dependent or independent, the next building block to look at is "basis vectors".
So what is a basis vector?
The concept of basis vectors, along with all the other concepts described, is another fundamental building block of linear algebra.
Essentially, basis vectors are the information needed to define a vector space.
If you are a painter wanting to paint a rainbow, it is not necessary to have every single specific color of a rainbow. Rather, all the painter needs are the primary colors. Using the primary colors of red, blue and yellow we can then span the space of all the other possible colors by mixing our paints in the way we need.
Similarly, to define a vector space we need the necessary basis vectors. For example, the vectors (1, 0) and (0, 1) can be linearly combined to define the entire 2-dimensional vector space that we typically understand as the x,y plane or Cartesian graph.
We can similarly define basis vectors for 3-dimensional space as (1, 0, 0), (0, 1, 0), (0, 0, 1).
These ideas can then be generalized to define a space of any number of dimensions: one hundred dimensions, a thousand dimensions, however many are desired.
It may not be obvious why we would need this, but in the field of data science it is common to think in terms of N dimensions, meaning some finite number of dimensions such as N = 1432.
The reason is that in data science we are always looking at properties of information that may help us predict something. To predict the height of a person, for example, a data scientist may look at finger length, foot size, forehead height and many other properties of the human body. Each one of these properties can be thought of as a "dimension".
So that is essentially all basis vectors are: a set of linearly independent vectors that span a vector space.
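As a small sketch, here is how any 2-D vector (a made-up one below) can be rebuilt as a linear combination of the standard basis vectors (1, 0) and (0, 1).

import numpy as np

e1 = np.array([1, 0])   # standard basis vectors for 2-D space
e2 = np.array([0, 1])

v = np.array([7, -3])   # hypothetical vector
# The vector's own coordinates act as the scalars of the linear combination
reconstructed = 7 * e1 + (-3) * e2
print(np.array_equal(v, reconstructed))   # True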
Eigenvalues and Eigenvectors
Ok, now that we have the fundamental building blocks of linear algebra in place, let's look at one last core concept: eigenvalues and eigenvectors.
When performing "linear transformations" with a matrix, sometimes we want to know which vectors have their directions unchanged by the operation.
What we want to find is the set of vectors that the linear transformation only "scales" or "stretches", without rotating or skewing them.
This set of values and vectors can be found with the following equation
this equates to finding a vector x that the matrix A only scales by some factor λ.
So x is some vector of numbers, such as [2, 3, 4] in a 3-dimensional space, and the symbol λ is some number, such as 5.
We want to find the x and the λ that solve the equation for some matrix A that is given to us.
The math for finding these values can get quite involved and we will not dive into it in this article, though below is one example of how to find the eigenvalues and eigenvectors of a 2x2 matrix A.
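If you prefer to let the computer do the work, here is a sketch using NumPy's eigenvalue solver on a made-up 2x2 matrix.

import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])   # hypothetical 2x2 matrix

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # the values of lambda (5 and 2, in some order)
print(eigenvectors)   # each column is an eigenvector x for the matching lambda

# Check the defining equation A x = lambda x for the first pair
x = eigenvectors[:, 0]
print(np.allclose(A @ x, eigenvalues[0] * x))   # True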
Application to Machine Learning - Image Compression
Now that we have all this fancy math under our tool-belt what is the point?
Well, for one, with this math we can begin to understand some of the algorithms behind image compression. To get started in understanding how this is done, we need to look at the machine learning technique of "Principal Component Analysis" (PCA).
Along with image compression, PCA has a myriad of other applications. It can help those in quantitative finance understand the risks of investing money, it is used in neuroscience to identify individual neurons based on properties of the brain's activity, and it can help us use the properties of a customer to understand why they made a given purchase decision.
Principal Component Analysis
Principal component analysis (PCA) is used when we have a collection of data with a high number of dimensions and we want to reduce that number by creating a new data set with fewer dimensions that still preserves the most important patterns and relationships of the original data.
In the data science world we call this "dimensionality reduction".
Eigenvectors and Eigenvalues are a core part of what is needed in order to determine how to reduce the dimensions of a data set.
For example, if we wanted to determine the threat level of a power outage happening in a given area, we may look at a number of dimensions such as location, population density, degree of wealth gaps within a given geographic radius, racial and ethnic distributions of the area's population within some radius, crime rate over the past year and many other factors. Each one of these factors is a dimension. There could be a hundred or even a thousand dimensions that could be analyzed to determine the likelihood of a power outage happening somewhere.
Computationally it may be too expensive to analyze all these dimensions, so we may wish to reduce them to a smaller feature set that minimizes the loss of information while maximizing predictive power.
Below is an example data set with two features on an XY plane, meaning there are two dimensions to the following data set. For the sake of example we could say the first dimension is the time a movie was released on the X axis and the second dimension is the number of people who watched the movie in the first week of release on the Y axis.
We may wish to reduce the number of dimensions in this data before feeding it into other machine learning models that predict which movie is likely to make the most money. The first step is to find the "principal components" of the data.
In order to do this, there are some pre-processing steps that we will not cover in this article. Ultimately, the first thing we need is the "covariance matrix" of the data set.
This is a matrix that helps us understand how each dimension of our data relates to every other dimension. A 2D data set means the covariance matrix will be 2x2. The number of rows and columns in this matrix always equals the number of dimensions of the data being looked at, because we are analyzing how the data in each dimension varies with every other dimension.
Without getting into the math of how to compute this, essentially some of the numbers in the matrix will be positive and others will be negative after the computation.
If a number in the matrix is positive, the two dimensions tend to move together. For example, if Cov(x, y) = 0.8, then when x increases, y tends to increase as well. If Cov(x, y) = -0.6, then when one of x or y increases, the other tends to decrease.
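Here is a small sketch of computing a covariance matrix with NumPy for a made-up 2-D data set along the lines of the movie example.

import numpy as np

# Hypothetical data: release time (x) and first-week viewers, in thousands (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.cov expects each row to be one dimension (feature) of the data
cov_matrix = np.cov(np.vstack([x, y]))
print(cov_matrix.shape)   # (2, 2): one row/column per dimension
print(cov_matrix)         # the off-diagonal entry is Cov(x, y); positive here because y rises with x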
Once the covariance matrix is determined, we can treat it as our matrix A and compute its eigenvectors and eigenvalues.
The eigenvectors of the covariance matrix give us the "principal components" of the data set, whereas the eigenvalues reveal how much information each principal component actually carries. In other words, we can use the eigenvalues to decide which eigenvectors to keep so that the reduced data set retains as much predictive power as possible while minimizing loss of information.
We can see in the above plot that the principal component represented by the line V1 contains the most information, whereas V2 contains the least.
Now we have a new set of basis vectors that spans the same 2-dimensional space. We can rotate our view of the coordinate plane accordingly: instead of an XY plane, we have a PC1-PC2 plane.
In our example, the original axes had meaning: the X axis was the time the movie was released and the Y axis was the number of people who watched it. After applying Principal Component Analysis, the new axes don't allow for much direct interpretation. Despite this, they still show us how to keep the most predictive power for answering a question such as "which movie is likely to have the most sales?".
In this example we can see that the axis PC1 has the most information contained near it whereas PC2 has the least amount.
We can also say that PC1 and PC2 are eigenvectors of the covariance matrix of our original XY data set. Each of these eigenvectors also has an eigenvalue.
For the sake of example, we could say that λ1 = 0.8 and λ2 = 0.07.
λ1 corresponds to the PC1 vector, and λ2 corresponds to the PC2 vector.
To know what percentage of the original information each of these vectors contains, we divide its eigenvalue by the sum of all the eigenvalues.
So in this case AmountOfInfo(λ1) = λ1 / (λ1 + λ2) = 0.8 / (0.8 + 0.07) ≈ 92%
and AmountOfInfo(λ2) = λ2 / (λ1 + λ2) = 0.07 / (0.8 + 0.07) ≈ 8%
This means we only need to keep PC1 for further analysis and can throw away PC2, reducing the number of dimensions we need.
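The same percentage calculation in NumPy, using the assumed eigenvalues from the example above:

import numpy as np

eigenvalues = np.array([0.8, 0.07])   # assumed example eigenvalues for PC1 and PC2

# Each component's share of the information is its eigenvalue divided by the sum of all eigenvalues
info_ratio = eigenvalues / eigenvalues.sum()
print(info_ratio)                     # approximately [0.92, 0.08]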
Applying PCA to Image Compression
So how can PCA be used for image compression?
Let's take a look at how to do this using some common libraries in the Python programming language. For our example we will compress an image resembling an explosion of color.
First import the required libraries
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
Next read the source image into memory
img = mpimg.imread('colorExplosion.png')
Once the image is read into memory we can print its dimensions. In this case the dimensions are (1024, 1024, 3). These numbers are in the format (height, width, number of color channels). The number of color channels is 3, representing an RGB image. Every computer image of this type has Red, Green and Blue color channels; these are the primary colors used to represent every possible color in the pixels of an image.
print(img.shape) #Output: (1024, 1024, 3)
PCA does not know about the image's color channels, so we need to reshape it into a 2D array. The first dimension will be the image height and the second will be the image width multiplied by the number of color channels, in this case 1024*3 = 3072. So the matrix given to the PCA algorithm will have shape 1024 x 3072, meaning 1024 rows and 3072 columns, where each column is treated as a different dimension.
img_r = np.reshape(img, (1024, 3072))
print(img_r.shape) # Output: (1024, 3072)
To compress the image, we will reduce these 3,072 dimensions down to 64 by keeping the principal components that contain the most information about the original data.
pca = PCA(64).fit(img_r)
img_transformed = pca.transform(img_r)
Printing the shape of the transformed image shows us that we successfully reduced the number of dimensions to 64.
print(img_transformed.shape) #Output: (1024, 64)
We can also print the result of how much information was preserved by these 64 principal components. The value output below is 0.9735123. This means that approximately 97% of the data was preserved after the dimensionality reduction.
print(np.sum(pca.explained_variance_ratio_)) #Output: 0.9735123
We can now take the result of the PCA and apply an inverse transform, which converts the compressed data back into its original vector space.
temp = pca.inverse_transform(img_transformed)
print(temp.shape) #Output (1024, 3072)
From here we can reshape the data back into the three-dimensional array with the original RGB color channels.
temp = np.reshape(temp, (1024, 1024, 3))
print(temp.shape) #Output (1024, 1024, 3)
The result of this is the original image goes from looking like this:
To looking like this:
Looking closely, we can see a visible loss of quality between the original image and the compressed image, though most of the data stays intact. Increasing the number of components in the PCA from 64 to, say, 128 would compress the image less aggressively while preserving more quality.
Below is a full snippet of all the code required to run this experiment:
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
# Read the image into memory
img = mpimg.imread('colorExplosion.png')
# Print the dimensions of the image
print(img.shape) #Output: (1024, 1024, 3)
# Display the image
plt.imshow(img)
plt.show()
# Reshape the image for PCA.
img_r = np.reshape(img, (1024, 3072))
print(img_r.shape) # Output: (1024, 3072)
# Run The PCA Algorithm
pca = PCA(64).fit(img_r)
img_transformed = pca.transform(img_r)
# Print the shape of the reduced data
print(img_transformed.shape) #Output: (1024, 64)
# Explain the variance preserved.
print(np.sum(pca.explained_variance_ratio_)) #Output: 0.9735123
# Use the PCA result to convert back to its original vector space
temp = pca.inverse_transform(img_transformed)
print(temp.shape) #Output (1024, 3072)
# Reshape the image to its original RGB color channels
temp = np.reshape(temp, (1024, 1024, 3))
print(temp.shape)
# Display the compressed image
plt.imshow(temp)
plt.show()
Ending Summary
In this crash course on the fundamentals of linear algebra, we gained a much deeper understanding of what happens under the hood of many of the machine learning algorithms that power our world. Applying these mathematical foundations to the technique of Principal Component Analysis opened the door to a world of applications in data science. Specifically, we saw how this machine learning technique can be used for image compression, along with gaining an understanding of its much wider range of possible applications.
If you got this far, I applaud you for your efforts on gaining a deeper understanding of a few of the mathematical foundations driving the world of Machine Learning and Artificial Intelligence.