My Favorite Algorithms as a Data Scientist, in a Nutshell.
Machine Learning: Is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy. Machine learning algorithms are typically created using frameworks that accelerate solution development, such as TensorFlow and PyTorch. Machine learning is behind chatbots and predictive text, language translation apps, the shows Netflix suggests to you, and how your social media feeds are presented. It powers autonomous vehicles and machines that can diagnose medical conditions based on images. When companies today deploy artificial intelligence programs, they are most likely using machine learning — so much so that the terms are often used interchangeably, and sometimes ambiguously. Machine learning is a subfield of artificial intelligence that gives computers the ability to learn without explicitly being programmed. While not everyone needs to know the technical details, they should understand what the technology does and what it can and cannot do.
The goal of AI is to create computer models that exhibit “intelligent behaviors” like humans. In some cases, writing a program for the machine to follow is time-consuming or impossible, such as training a computer to recognize pictures of different people. While humans can do this task easily, it’s difficult to tell a computer how to do it. Machine learning takes the approach of letting computers learn to program themselves through experience. Machine learning uses two types of techniques: supervised learning, which trains a model on known input and output data so that it can predict future outputs, and unsupervised learning, which finds hidden patterns or intrinsic structures in input data. Supervised learning uses classification and regression techniques to develop machine learning models. Unsupervised learning is used to draw inferences from datasets consisting of input data without labeled responses.
Data-Driven Decision-Making: Decision making is the process of making choices by identifying a decision, gathering information, and assessing alternative resolutions. Data-driven decision-making is a process in which organizations use data and analytical techniques to inform and guide their strategic, tactical, and operational choices. It's about basing decisions on empirical evidence and insights extracted from data, rather than relying solely on intuition or experience. When data is harnessed correctly, it has the potential to drive decision-making, shape strategy formulation, and improve organizational performance. Learning how to analyze data effectively can enable you to draw conclusions, make predictions, and generate actionable insights that drive impactful decision-making. Though intuition can be a helpful tool, it would be a mistake to base all decisions on a mere gut feeling. While intuition can provide a hunch or spark that starts you down a particular path, it's through data that you verify, understand, and quantify. You can use tools, frameworks, and software to analyze data, such as Microsoft Excel and Power BI, Google Charts, Datawrapper, Infogram, Tableau, and Zoho Analytics. These can help you examine data from different angles and create visualizations that illuminate the story you are trying to tell. Data-driven decision-making (sometimes abbreviated as DDDM) is the process of using data to inform your decision-making process and validate a course of action before committing to it. Once you begin collecting and analyzing data, you’re likely to find that it’s easier to reach a confident decision about virtually any business challenge. Just because a decision is based on data does not mean it will always be correct: while the data might show a particular pattern or suggest a certain outcome, if the data collection process or interpretation is flawed, then any decision based on the data will be inaccurate. This is why the impact of every business decision should be regularly measured and monitored. Ultimately, always remember that correlation does not imply causation. Being data-driven means that you try to make decisions without bias or emotion. If you’re in a leadership position, making objective decisions is the best way to remain fair and balanced.
Linear Regression: Is used to predict the value of a variable based on the value of another variable. The variable you want to predict is called the dependent variable; the variable you are using to predict it is called the independent variable. Linear regression is a data analysis technique that predicts the value of unknown data by using another related and known data value. It mathematically models the unknown or dependent variable and the known or independent variable as a linear equation. It is commonly used in many fields, including economics, finance, and social sciences, to analyze and predict trends in data. It can also be extended to multiple linear regression, where there are multiple independent variables, and logistic regression, which is used for binary classification problems. The stronger the linear relationship between the independent variable and the dependent variable, the more accurate the prediction is. The ability to make predictions and evaluate them can provide benefits to many businesses and individuals, such as optimized operations and detailed research materials. Linear regression allows researchers to predict or explain the variation in one variable based on another variable. The goal is to find the best-fitting line that minimizes the difference between predicted and actual values. Linear regression is used for continuous outcome variables, and it rests on two key assumptions: first, the relationship between x and y should be linear; second, all the observations in a sample must be independent of each other, so this method should not be used if the data includes more than one observation on any individual. The formula for a linear regression line is: y = mx + b
Where y is the dependent variable, x is the independent variable, m is the slope (weight), and b is the intercept. It represents the best-fitting straight line that describes the relationship between the variables by minimizing the sum of squared differences between actual and predicted values. Businesses use it to reliably and predictably convert raw data into business intelligence and actionable insights. Scientists in many fields, including biology and the behavioral, environmental, and social sciences, use linear regression to conduct preliminary data analysis and predict future trends. Many data science methods, such as machine learning and artificial intelligence, use linear regression to solve complex problems. A positive coefficient indicates that as the value of the independent variable increases, the mean of the dependent variable also tends to increase. A negative coefficient suggests that as the independent variable increases, the dependent variable tends to decrease. If there appears to be no association between the proposed explanatory and dependent variables (i.e., the scatterplot does not indicate any increasing or decreasing trends), then fitting a linear regression model to the data probably will not provide a useful model. After a regression line has been computed for a group of data, a point which lies far from the line (and thus has a large residual value) is known as an outlier. Linear regression is unbounded, and this brings logistic regression into the picture.
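Before moving on, here is a minimal sketch of fitting y = mx + b by ordinary least squares. It uses scikit-learn on synthetic data; the slope and intercept values (3.0 and 5.0) and the variable names are purely illustrative assumptions, not anything from a real dataset.

```python
# Minimal sketch: ordinary least squares fit of y = mx + b on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))                 # independent variable
y = 3.0 * x.ravel() + 5.0 + rng.normal(0, 1, 100)     # dependent variable with noise

model = LinearRegression().fit(x, y)
print("slope (m):", model.coef_[0])                   # estimated weight, close to 3.0
print("intercept (b):", model.intercept_)             # estimated intercept, close to 5.0
print("prediction at x = 4:", model.predict([[4.0]])[0])
```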
Logistic Regression: Is a process of modeling the probability of a discrete outcome given an input variable. The most common logistic regression models a binary outcome; something that can take two values such as true/false, yes/no, and so on. It is an extension of the linear regression model for classification problems and is also known as the logit model. The output from the hypothesis is the estimated probability, which indicates how confident we can be that the predicted value is the actual value for a given input X. Logistic regression is used for classification tasks where the goal is to predict the probability that an instance belongs to a given class. In logistic regression, we use the concept of a threshold value, which determines whether the predicted probability maps to 0 or 1: values above the threshold map to 1, and values below the threshold map to 0. There should not be collinearity between independent variables. The logistic regression model transforms the linear regression function's continuous value output into categorical value output using a sigmoid function, which maps any real-valued combination of the independent variables into a value between 0 and 1. Another assumption is that the sample size is sufficiently large and there are no outliers. Logistic regression was born out of an evident problem with linear regression: the predicted values may be out of range. We know that a probability must be between 0 and 1, but if we use linear regression this "probability" may exceed 1 or go below 0. Logistic regression converts the straight best-fit line of linear regression into an S-curve using the sigmoid function, which always gives values between 0 and 1. Businesses can use these insights for predictive analysis to reduce operational costs, increase efficiency, and scale faster. For example, businesses can uncover patterns that improve employee retention or lead to more profitable product design.
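A minimal sketch of binary classification with logistic regression, assuming synthetic two-feature data and a 0.5 threshold; all names and numbers here are illustrative, not taken from a real problem.

```python
# Minimal sketch: logistic regression with a probability threshold on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                  # two independent variables
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # binary outcome (0 or 1)

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X[:5])[:, 1]         # sigmoid output: values between 0 and 1
labels = (probs >= 0.5).astype(int)            # threshold: above 0.5 -> class 1, below -> class 0
print("probabilities:", probs.round(3))
print("predicted classes:", labels)
```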
Multiple Linear Regression: This refers to a statistical technique that uses two or more independent variables to predict the outcome of a dependent variable. The technique enables analysts to determine the variation of the model and the relative contribution of each independent variable to the total variance. It allows researchers to assess the strength of the relationship between an outcome (the dependent variable) and several predictor variables, as well as the importance of each of the predictors to the relationship, often with the effect of other predictors statistically eliminated. For example, you could use multiple regression to understand whether exam performance can be predicted based on revision time, test anxiety, lecture attendance, and gender. Multiple regression includes two or more independent variables – sometimes called predictor variables – in the model, rather than just one as in linear regression. One of the main advantages of multiple regression is that it can capture the complex and multifaceted nature of real-world phenomena. By including multiple independent variables, you can account for more factors that influence the dependent variable and reduce the error and bias in your estimates. Multiple regression, like all statistical techniques based on correlation, has a severe limitation due to the fact that correlation doesn't prove causation, and no amount of measuring of "control" variables can untangle the web of causality. Five main assumptions underlying multiple regression models must be satisfied: (1) linearity, (2) homoskedasticity, (3) independence of errors, (4) normality, and (5) independence of the independent variables. Diagnostic plots can help detect whether these assumptions are satisfied. In essence, multiple regression is the extension of ordinary least-squares (OLS) regression because it involves more than one explanatory variable. It is possible that some of the independent variables are actually correlated with one another, so it is important to check this before developing the regression model. If two independent variables are too highly correlated (r² > ~0.6), then only one of them should be used in the regression model. Multiple regression assumes that there is a linear relationship between the dependent variable and each independent variable, and no major correlation between the independent variables. The model calculates the line of best fit that minimizes the variance of each included variable as it relates to the dependent variable. Because it fits a line, it is a linear model: it creates a relationship in the form of a straight line (linear) that best approximates all the individual data points.
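A hedged sketch of multiple regression using the exam example from above, but with synthetic numbers: revision_time and attendance are made-up predictors and the coefficients are arbitrary assumptions, not real findings. It also includes the quick correlation check between predictors mentioned in the paragraph.

```python
# Sketch: multiple linear regression with two predictors and a quick correlation check.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
revision_time = rng.uniform(0, 20, 150)                    # hours of revision (synthetic)
attendance = rng.uniform(0, 1, 150)                        # share of lectures attended (synthetic)
exam_score = 2.5 * revision_time + 10 * attendance + rng.normal(0, 2, 150)

X = np.column_stack([revision_time, attendance])
# Check that the predictors are not too highly correlated before fitting
print("correlation between predictors:", np.corrcoef(X.T)[0, 1].round(3))

model = LinearRegression().fit(X, exam_score)
print("coefficients:", model.coef_.round(2), "intercept:", round(model.intercept_, 2))
```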
K-Means Clustering: It tries to group similar kinds of items in the form of clusters. It finds the similarity between the items and groups them into clusters. K-means clustering is an unsupervised learning algorithm; there is no labeled data for this clustering, unlike in supervised learning. It fails to give good results when the data contains outliers or when the density of data points varies across the data space. One of its limitations is the difficulty of choosing the number of clusters, k, and it cannot be used with arbitrary distances. It is also sensitive to scaling and requires careful preprocessing. The objective of k-means is simple: group similar data points together and discover underlying patterns. You define a target number k, which refers to the number of centroids you need in the dataset. A centroid is the imaginary or real location representing the center of a cluster. In clustering, we do not have a target to predict; we look at the data, try to club similar observations together, and form different groups. Hence it is an unsupervised learning problem. Data points from different clusters should be as different from each other as possible to have more meaningful clusters. The k-means algorithm uses an iterative approach to find the optimal cluster assignments by minimizing the sum of squared distances between data points and their assigned cluster centroid. Every data point is allocated to one of the clusters by reducing the in-cluster sum of squares. All the data points in a cluster should be similar to each other, and the data points from different clusters should be as different as possible. It is essentially a grouping of things based on how similar and different they are to one another. Finding the ideal number of groups to divide the data into is a basic step in any unsupervised algorithm; one of the most common techniques for figuring out this ideal value of k is the elbow approach. K-means clustering performs best on data that are spherical. Spherical data are data that group in space in close proximity to each other; this can be visualized in 2- or 3-dimensional space more easily. Data that aren't spherical or should not be spherical do not work well with k-means clustering.
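A minimal sketch of k-means with scaling and the elbow approach, on synthetic blobs generated for illustration; the choice of four clusters and every parameter value here are assumptions.

```python
# Sketch: k-means with feature scaling and the elbow method for choosing k.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)   # synthetic, roughly spherical clusters
X = StandardScaler().fit_transform(X)                          # k-means is sensitive to scaling

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(round(km.inertia_, 1))                     # within-cluster sum of squares
print(inertias)                                                # look for the "elbow" where the drop flattens

labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
print("cluster assignments for the first 10 points:", labels[:10])
```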
Naive Bayes Classifier: Naive Bayes is suitable for solving multi-class prediction problems. If its assumption of the independence of features holds true, it can perform better than other models and requires much less training data. Naive Bayes is better suited for categorical input variables than numerical variables. If, within the training data, a given attribute value never occurs in the context of a given class or is missing, then the conditional probability is set to zero. It is a popular algorithm mainly because it can be easily written in code and predictions can be made very quickly, which in turn increases the scalability of the solution. For continuous inputs, one technique involves using a kernel function to estimate the probability density function of the input data, allowing the classifier to improve its performance in complex scenarios where the data distribution is not well-defined. When the assumption of independence holds, the classifier performs better than other machine learning models like logistic regression or decision trees and requires less training data. Naive Bayes classifiers are fast and easy to implement, but their biggest disadvantage is the requirement that predictors be independent. In most real-life cases the predictors are dependent, which hinders the performance of the classifier. It performs well with categorical input variables compared to numerical variables. Another limitation of this algorithm is the assumption of independent predictors; in real life, it is almost impossible to get a set of predictors which are completely independent. Despite their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering. They require a small amount of training data to estimate the necessary parameters, and naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods.
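A minimal sketch of a Gaussian naive Bayes classifier on a small multi-class dataset (scikit-learn's iris set is used purely as stand-in data); for categorical or text inputs, other naive Bayes variants would be the usual choice.

```python
# Sketch: Gaussian naive Bayes on a multi-class problem with a small training set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

nb = GaussianNB().fit(X_train, y_train)        # fast to train, needs relatively little data
print("test accuracy:", round(nb.score(X_test, y_test), 3))
```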
Random Forest: Random forests or random decision forests are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time to reach a single result, for example classifying whether an email is “spam” or “not spam”. The algorithm’s strength lies in its ability to handle complex datasets and mitigate overfitting, making it a valuable tool for various predictive tasks in machine learning. It can handle data sets containing continuous variables, as in the case of regression, and categorical variables, as in the case of classification, and it performs well on both classification and regression tasks. The “forest” it builds is an ensemble of decision trees, usually trained with the bagging method. The general idea of the bagging method is that a combination of learning models increases the overall result. In general, these algorithms are fast to train, but quite slow to create predictions once they are trained: a more accurate prediction requires more trees, which results in a slower model. In most real-world applications the random forest algorithm is fast enough, but there can certainly be situations where run-time performance is important and other approaches would be preferred. Random forests do not overfit; you can run as many trees as you want. It is fast: running on a data set with 50,000 cases and 100 variables, it produced 100 trees in 11 minutes on an 800 MHz machine. For large data sets the major memory requirement is the storage of the data itself. One of the biggest advantages of random forest is its versatility. A real-life example: Lee wants to decide where to work after his internship, so he asks the mentor who knows him best for suggestions. The mentor asks him about his future dreams and plans in life and, based on the answers, gives Lee some advice. This is a typical decision tree approach: Lee’s mentor created rules to guide his recommendation by using Lee’s answers. Afterward, Lee starts asking more and more leaders in the industry to advise him, and they again ask him different questions from which they can derive recommendations. Finally, Lee chooses the workplaces that his leaders recommend the most to him, which is the typical random forest approach. Another great quality of the random forest algorithm is that it is very easy to measure the relative importance of each feature on the prediction. In a decision tree, each internal node represents a ‘test’ on an attribute (e.g., whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (the decision taken after computing all attributes); a node that has no children is a leaf. Using feature importance, one can decide which features to possibly drop because they don’t contribute enough (or sometimes nothing at all) to the prediction process. This is important because a general rule in machine learning is that the more features you have, the more likely your model will suffer from overfitting, and vice versa. A random forest mitigates the limitations of a single decision tree: it reduces the overfitting of datasets and increases precision, and it generates predictions without requiring many configurations in packages (like scikit-learn).
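A hedged sketch of a bagged random forest with feature importances, using a built-in scikit-learn dataset purely as stand-in data; the number of trees and all other settings are illustrative defaults rather than tuned choices.

```python
# Sketch: random forest classifier (an ensemble of bagged trees) with feature importances.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("test accuracy:", round(rf.score(X_test, y_test), 3))

# Relative importance of each feature in the ensemble's predictions (top 5 shown)
ranked = sorted(zip(data.feature_names, rf.feature_importances_), key=lambda t: -t[1])
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```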
Decision Trees: Understanding decision trees helps us understand how random forest algorithms work. Decision tree learning employs a divide-and-conquer strategy, conducting a greedy search to identify the optimal split points within a tree. This process of splitting is then repeated in a top-down, recursive manner until all, or the majority of, records have been classified under specific class labels. A decision tree consists of three components: decision nodes, leaf nodes, and a root node. A decision tree algorithm divides a training dataset into branches, which further segregate into other branches; this sequence continues until a leaf node is attained, and the leaf node cannot be segregated further. In a forest of trees, the outcome chosen by most decision trees will be the final choice: if three trees predict YES and one tree predicts NO, then the final prediction will be YES. A decision tree is a hierarchical model used in decision support that depicts decisions and their potential outcomes, incorporating chance events, resource expenses, and utility. This algorithmic model utilizes conditional control statements and is a non-parametric, supervised learning method, useful for both classification and regression tasks. The name itself suggests that it uses a flowchart-like tree structure to show the predictions that result from a series of feature-based splits. Decision trees are prone to overfitting when they capture noise in the data; pruning and setting appropriate stopping criteria are used to address this. In layman's terms, decision trees are nothing but a bunch of if-else statements: the tree checks whether a condition is true and, if it is, goes to the next node attached to that decision. Decision trees assume that there are no missing values in the dataset or that missing values have been appropriately handled through imputation or other methods. Pruning is another method that can help us avoid overfitting: it improves the performance of the tree by cutting the nodes or sub-nodes which are not significant and removing the branches which have very low importance. A decision tree enables developers to analyze the possible consequences of a decision, and as the algorithm accesses more data, it can predict outcomes for future data.
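A minimal sketch of a single decision tree with simple stopping criteria and cost-complexity pruning, printed as the if-else rules it learns; the dataset and hyperparameter values are illustrative assumptions.

```python
# Sketch: a pruned decision tree, shown as the if-else rules it learns.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

# max_depth and min_samples_leaf act as stopping criteria; ccp_alpha applies cost-complexity pruning
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, ccp_alpha=0.01,
                              random_state=0).fit(X_train, y_train)
print("test accuracy:", round(tree.score(X_test, y_test), 3))
print(export_text(tree, feature_names=list(data.feature_names)))   # the learned if-else splits
```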
Support Vector Machine: In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data for classification, regression analysis, and outlier detection. SVMs are particularly good at solving binary classification problems, which require classifying the elements of a data set into two groups. SVMs are used in applications like handwriting recognition, intrusion detection, face detection, email classification, gene classification, and web page classification. This is one of the reasons we use SVMs in machine learning: they can handle both classification and regression on linear and non-linear data. They are easy to understand, interpret, and implement, making them an ideal choice for beginners in the field of machine learning. The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put new data points in the correct category in the future. This best decision boundary is called a hyperplane. The SVM method also suffers from potential setbacks such as high memory consumption when it processes large volumes of data, and it is not easy to interpret the parameters of the solved SVM model. The SVM method requires all the input data to be correctly labeled before the process. The advantages of the SVM method are good classification accuracy and strong overall performance. SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. The data points or vectors that are closest to the hyperplane and which affect its position are termed support vectors. The distance between these vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with maximum margin is called the optimal hyperplane. SVMs are widely adopted across disciplines such as healthcare, natural language processing, signal processing applications, and speech and image recognition. The hyperplane is learned from training data using an optimization procedure that maximizes the margin. SVM performs both linear classification and nonlinear classification; nonlinear classification is performed using a kernel function, such as homogeneous or complex polynomial kernels. If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we cannot draw a single straight line. So to separate these data points, we need to add one more dimension: for linear data we have used two dimensions, x and y, so for non-linear data we add a third dimension, z. In this way, complex data points can be separated with the help of more dimensions. The concept highlighted here is that the data points continue to get mapped into higher dimensions until a hyperplane is identified that shows a clear separation between the data points. The geo-sounding problem is one of the widespread use cases for SVMs, wherein the process is employed to track the planet’s layered structure. With SVMs, you can also determine whether any digital image is tampered with, contaminated, or pure. Such examples are helpful when handling security-related matters for organizations or government agencies, as it is easier to encrypt and embed data as a watermark in high-resolution images.
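A hedged sketch comparing a linear kernel with a non-linear kernel on synthetic, non-linearly-separable data; the RBF kernel here stands in for the non-linear case discussed above, and all parameter values are illustrative.

```python
# Sketch: SVM classification with a linear kernel versus a non-linear (RBF) kernel.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)     # not linearly separable
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0)).fit(X_train, y_train)
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")).fit(X_train, y_train)

print("linear kernel accuracy:", round(linear_svm.score(X_test, y_test), 3))
print("RBF kernel accuracy:   ", round(rbf_svm.score(X_test, y_test), 3))   # kernel handles the curved boundary
```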
K-Nearest Neighbors: Also known as KNN or k-NN, this is a non-parametric, supervised learning classifier which uses proximity to make classifications or predictions about the grouping of an individual data point. It is easy to implement and understand, but has the major drawback of becoming significantly slower as the size of the data in use grows. It is a simple algorithm that stores all the available cases and classifies new data or cases based on a similarity measure. It is a machine learning technique used for classification and regression tasks, and it relies on the idea that similar data points tend to have similar labels or values. For example, suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, as it works on a similarity measure. During the training phase, the KNN algorithm stores the entire training dataset as a reference. When making predictions, it calculates the distance between the input data point and all the training examples, using a chosen distance metric such as Euclidean distance. It does not require any assumptions about the underlying data distribution. It can also handle both numerical and categorical data, making it a flexible choice for various types of datasets in classification and regression tasks. It is a non-parametric method that makes predictions based on the similarity of data points in the dataset. K-NN is less sensitive to outliers compared to other algorithms. The K-NN algorithm works by finding the K nearest neighbors to a given data point based on a distance metric, such as Euclidean distance. The class or value of the data point is then determined by the majority vote or average of the K neighbors. This approach allows the algorithm to adapt to different patterns and make predictions based on the local structure of the data. The algorithm identifies the nearest K neighbors to the input data point based on their distances. In the case of classification, the algorithm assigns the most common class label among the K neighbors as the predicted label for the input data point. For regression, it calculates the average or weighted average of the target values of the K neighbors to predict the value for the input data point. The KNN algorithm is straightforward and easy to understand, making it a popular choice in various domains. However, its performance can be affected by the choice of K and the distance metric, so careful parameter tuning is necessary for optimal results. The value of k in the k-nearest neighbors (k-NN) algorithm should be chosen based on the input data: if the input data has more outliers or noise, a higher value of k would be better. It is recommended to choose an odd value for k to avoid ties in classification. The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity. The specific method for selecting the nearest neighbors can vary, but a common approach is to sort the distances in ascending order and choose the K data points with the shortest distances. KNN captures the idea of similarity (sometimes called distance, proximity, or closeness) with some mathematics. After identifying the K nearest neighbors, the algorithm makes predictions based on the labels or values associated with these neighbors. For classification tasks, the majority class among the K neighbors is assigned as the predicted label for the new data point.
For regression tasks, the average or weighted average of the values of the K neighbors is assigned as the predicted value. As we decrease the value of K toward 1, our predictions become less stable: with K=1, a single noisy neighbor can cause KNN to incorrectly predict the query point. Conversely, as we increase the value of K, our predictions become more stable due to majority voting and are more likely to be accurate (up to a certain point). Eventually, we begin to witness an increasing number of errors; it is at this point we know we have pushed the value of K too far. KNN’s main disadvantage of becoming significantly slower as the volume of data increases makes it an impractical choice in environments where predictions need to be made rapidly. Moreover, there are faster algorithms that can produce more accurate classification and regression results.
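A minimal sketch of k-NN classification comparing a few values of K under Euclidean distance; the dataset, the scaling step, and the particular K values are illustrative assumptions.

```python
# Sketch: k-NN with feature scaling, comparing several values of K.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 5, 15):
    knn = make_pipeline(StandardScaler(),
                        KNeighborsClassifier(n_neighbors=k, metric="euclidean"))
    knn.fit(X_train, y_train)                       # "training" essentially stores the data
    print(f"k={k:2d}  test accuracy: {knn.score(X_test, y_test):.3f}")
```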
Artificial Neural Networks: A neural network is a method in artificial intelligence that teaches computers to process data in a way that is inspired by the human brain. It is a type of machine learning process, called deep learning, that uses interconnected nodes or neurons in a layered structure that resembles the human brain. Each node, or artificial neuron, connects to others and has an associated weight and threshold. If the output of any individual node is above the specified threshold value, that node is activated, sending data to the next layer of the network; otherwise, no data is passed along to the next layer. Neural networks are widely used in a variety of applications, including image recognition, predictive modeling, and natural language processing (NLP). Artificial neural networks contain artificial neurons which are called units. These units are arranged in a series of layers that together constitute the whole artificial neural network in a system. A layer can have only a dozen units or millions of units, depending on how complex the neural network must be to learn the hidden patterns in the dataset. As the data transfers from one unit to another, the neural network learns more and more about the data, which eventually results in an output from the output layer. The concept of artificial neural networks comes from biological neurons found in animal brains. Artificial neural networks are trained using a training set. For example, suppose you want to teach an ANN to recognize a cat: it is shown thousands of different images of cats so that the network can learn to identify a cat. Once the neural network has been trained enough using images of cats, you need to check whether it can identify cat images correctly; this is done by making the ANN classify the images it is provided, deciding whether they are cat images or not. The workings of ANNs are extremely similar to those of biological neural networks, although they are not identical. The ANN algorithm accepts only numeric and structured data. Information flows through these nodes, and the network adjusts the connection strengths (weights) during training to learn from data, enabling it to recognize patterns, make predictions, and solve various tasks in machine learning and artificial intelligence. These versions, complete with artificial neurons, allow computers to more effectively process an input dataset. As for the “intelligent” machines involved, the quality of their output is limited to the material they receive, but they do “think” like an animal does. The number of layers in a neural network is a clue to its classification: a basic neural network has two or three layers, one that has more layers, which adds some complexity, is technically a deep neural network, and a very large neural network is a deep-learning tool. These tools can help identify relationships among people, content, and data, as well as connections between user interests and search queries. Large financial institutions have used ANNs to improve performance in such areas as bond rating, credit scoring, target marketing, and evaluating loan applications. These systems are typically only a few percentage points more accurate than their predecessors, but because of the amount of money involved, they are very profitable. ANNs are now used to analyze credit card transactions to detect likely instances of fraud.
From making cars drive autonomously on the roads, to generating shockingly realistic CGI faces, to machine translation, to fraud detection, to reading our minds, to recognizing when a cat is in the garden and turning on the sprinklers, neural nets are behind many of the biggest advances in A.I. The ANN also has excellent fault tolerance and is fast and highly scalable with parallel processing. Overfitting can occur when the weights make the system learn the details of the learning set instead of discovering general structures; this happens when the size of the learning set is too small in relation to the complexity of the model. Whether or not a hidden layer is present, the output layer of the network can have many units when there are many classes to predict. In technical terms, you want to “reduce the dimension of your feature space.” By reducing the dimension of your feature space, you have fewer relationships between variables to consider and you are less likely to overfit your model.
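Before turning to dimensionality reduction, here is a minimal sketch of a small feed-forward network. It uses scikit-learn's MLPClassifier rather than the TensorFlow or PyTorch frameworks mentioned earlier, and the hidden-layer size and other settings are illustrative assumptions.

```python
# Sketch: a small feed-forward neural network (one hidden layer) on numeric data.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)            # 8x8 digit images flattened to 64 numeric features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
)
mlp.fit(X_train, y_train)                      # connection weights are adjusted during training
print("test accuracy:", round(mlp.score(X_test, y_test), 3))
```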
Principal Component Analysis: Principal component analysis, or PCA, is a dimensionality reduction method that is often used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one that still contains most of the information in the large set. A PCA plot shows clusters of samples based on their similarity. PCA does not discard any samples or characteristics (variables); instead, it reduces the overwhelming number of dimensions by constructing principal components (PCs). If you have worked with a lot of variables before, you know this can present problems: do you have so many variables that you are in danger of overfitting your model to your data, or that you might be violating the assumptions of whichever modeling tactic you are using? Reducing the dimension of the feature space is called “dimensionality reduction.” There are many ways to achieve dimensionality reduction, but most of these techniques fall into one of two classes of feature engineering techniques: feature elimination and feature extraction. Principal component analysis is a technique for feature extraction, so it combines our input variables in a specific way, and then we can drop the “least important” variables while still retaining the most valuable parts of all of the variables. As an added benefit, the “new” variables after PCA are all independent of one another. This is a benefit because the assumptions of a linear model require our independent variables to be independent of one another; if we decide to fit a linear regression model with these “new” variables (this is called principal component regression), this assumption will necessarily be satisfied. PCA helps manage high-dimensional datasets by extracting essential information and discarding less relevant features, simplifying analysis. PCA is a statistical procedure that uses an orthogonal transformation to convert a set of correlated variables into a set of uncorrelated variables. PCA is the most widely used tool in exploratory data analysis and in machine learning for predictive models. Moreover, principal component analysis is an unsupervised learning technique used to examine the interrelations among a set of variables. It is also known as a general factor analysis where regression determines a line of best fit. The main goal of PCA is to reduce the dimensionality of a dataset while preserving the most important patterns or relationships between the variables, without any prior knowledge of the target variables. PCA is used to reduce the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables, retaining most of the sample’s information, and useful for the regression and classification of data. The principal components are linear combinations of the original variables in the dataset and are ordered in decreasing order of importance. The total variance captured by all the principal components is equal to the total variance in the original dataset. The first principal component captures the most variation in the data, while the second principal component captures the maximum variance that is orthogonal to the first principal component, and so on. Principal component analysis can be used for a variety of purposes, including data visualization, feature selection, and data compression.
In data visualization, PCA can be used to plot high-dimensional data in two or three dimensions, making it easier to interpret. In feature selection, PCA can be used to identify the most important variables in a dataset. In data compression, PCA can be used to reduce the size of a dataset without losing important information. Dimensionality reduction is obtained by only retaining those axes (dimensions) that account for most of the variance, and discarding all others. While Principal Component Analysis reduces the number of variables, it can also lead to loss of information. The degree of information loss depends on the number of principal components selected. Therefore, it is important to carefully select the number of principal components to retain. Principal Component Analysis assumes that the relationships between variables are linear. However, if there are non-linear relationships between variables, Principal Component Analysis may not work well.
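A minimal sketch of PCA for dimensionality reduction, standardizing the features first and keeping only enough components to explain most of the variance; the dataset and the 95% threshold are illustrative assumptions.

```python
# Sketch: PCA keeping the components that explain ~95% of the variance.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)        # 30 original, partly correlated features
X_scaled = StandardScaler().fit_transform(X)      # PCA is sensitive to variable scale

pca = PCA(n_components=0.95)                      # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X_scaled)

print("original dimensions:", X.shape[1])
print("retained components:", pca.n_components_)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```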
Quicksort: Sorting is a way of arranging items in a systematic manner. Quicksort is a widely used sorting algorithm that makes O(n log n) comparisons in the average case when sorting an array of n elements. It is a fast and highly efficient sorting algorithm that follows the divide-and-conquer approach. Divide and conquer is a technique of breaking down an algorithm into subproblems, solving the subproblems, and combining the results back together to solve the original problem. Quicksort is a divide-and-conquer recursive algorithm: it picks an index, typically referred to as the pivot, and divides the array into two sub-arrays above and below the pivot. Quicksort’s performance can be inefficient when the algorithm encounters imbalanced partitions. The worst-case scenario is if the first or last element is always the partition point for an array or sub-array; in this case, one side of the partition will contain all the elements. In the divide step, first pick a pivot element; after that, partition or rearrange the array into two sub-arrays such that each element in the left sub-array is less than or equal to the pivot element and each element in the right sub-array is larger than the pivot element. In other words, a large array is divided into two arrays, one of which holds values that are smaller than the specified value (the pivot), and another which holds the values that are greater than the pivot. After that, the left and right sub-arrays are also partitioned using the same approach, and this continues until a single element remains in each sub-array. Picking a good pivot is necessary for a fast implementation of quicksort; however, it can be difficult to determine a good pivot. The choice of pivot can vary: it can be the first (leftmost) element, the last (rightmost) element, the median, or a random element of the array. The efficiency of the quicksort algorithm depends on the choice of the pivot: a bad pivot choice can result in significantly slower performance, but a good choice can optimize the sorting process. Ideally, the pivot should be the median of the array, which would divide the array into two equal halves; however, finding the true median is a time-consuming process, so good pivot selection strategies are crucial for ensuring quicksort's efficiency. In quicksort, every element arranges itself at its correct position to sort the given array: the pivot element is placed at its correct sorted position, and hence it is the element that we know is sorted, and the sub-arrays are divided around the pivot element. The quicksort algorithm is performed as follows: a pivot point is chosen from the array; the array is reordered so that all values smaller than the pivot are moved before it and all values larger than the pivot are moved after it, with values equaling the pivot going either way; when this is done, the pivot is in its final position; the above step is repeated for each subarray of smaller values, as well as separately for the subarray with greater values; this is repeated until the entire array is sorted.
The time complexity of Quicksort in the average and best case is O(n log n), where n is the number of items being sorted. In the worst case, which occurs when the smallest or largest element is always chosen as the pivot, the time complexity is O(n^2). However, this worst-case scenario can typically be avoided with a good pivot selection strategy. Quicksort is not a stable sort, meaning that the relative order of equal sort items is not preserved. This is typically not an issue unless certain applications require stability.
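A minimal sketch of quicksort with a random pivot, written as the simple out-of-place version rather than the in-place partition scheme; it is illustrative only.

```python
# Sketch: recursive quicksort with a random pivot to avoid the worst case on sorted input.
import random

def quicksort(items):
    if len(items) <= 1:                          # base case: 0 or 1 elements is already sorted
        return items
    pivot = random.choice(items)                 # random pivot choice
    left = [x for x in items if x < pivot]       # values smaller than the pivot
    middle = [x for x in items if x == pivot]    # values equal to the pivot stay in the middle
    right = [x for x in items if x > pivot]      # values larger than the pivot
    return quicksort(left) + middle + quicksort(right)

print(quicksort([9, 3, 7, 1, 8, 2, 5]))          # [1, 2, 3, 5, 7, 8, 9]
```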
CART: It stands for Classification and Regression Tree. CART is a predictive algorithm used in machine learning that explains how the target variable's values can be predicted based on other variables. It is a decision tree where each fork is a split on a predictor variable and each end node has a prediction for the target variable. In the decision tree, nodes are split into sub-nodes on the basis of a threshold value of an attribute; the CART algorithm does this by searching for the best homogeneity in the sub-nodes, with the help of the Gini index criterion. The root node is taken as the training set and is split into two by considering the best attribute and threshold value. Further, the subsets are also split using the same logic, and this continues till the last pure sub-set is found in the tree or the maximum number of leaves possible in that growing tree is reached. The Gini index has a range of 0 to 1 and is used by CART as a tool for attribute selection. The Gini index, also known as the Gini coefficient or Gini impurity, calculates the likelihood that a given variable will be incorrectly classified when selected randomly, where 0 denotes the presence of a single class, i.e., that all objects fall under that class.
When the Gini index is 1, all the items are dispersed randomly among the classes. When the Gini index is 0.5, the elements are distributed evenly among several classes. The representation for the CART model is a binary tree, the same binary tree from algorithms and data structures, nothing too fancy. Each root node represents a single input variable (x) and a split point on that variable (assuming the variable is numeric). The leaf nodes of the tree contain an output variable (y) which is employed to form a prediction. Given a dataset with two inputs (x), height in centimetres and weight in kilograms, and an output of sex as male or female, one could build an example binary decision tree (completely fictitious, for demonstration purposes only). Classification and regression trees work to supply accurate predictions or predicted classifications based on a set of if-else conditions. They typically have several advantages over regular decision trees, and the interpretation of results summarized in classification or regression trees is typically fairly simple.
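A small sketch of the Gini index calculation that CART uses to compare candidate splits; the toy label lists are made up for illustration.

```python
# Sketch: Gini impurity of a node and the weighted Gini of a binary split.
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels: 0 means pure, higher means more mixed."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left_labels, right_labels):
    """Weighted Gini of a binary split; CART prefers the split with the lowest value."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + (len(right_labels) / n) * gini(right_labels)

print(gini(["yes", "yes", "yes"]))               # 0.0 -> a pure node (single class)
print(gini(["yes", "no", "yes", "no"]))          # 0.5 -> two classes, evenly mixed
print(split_gini(["yes", "yes"], ["no", "no"]))  # 0.0 -> a perfect split
```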
Merge sort: Merge sort is one of the most efficient sorting algorithms. It is based on the divide-and-conquer strategy: merge sort continuously cuts down a list into multiple sublists until each has only one item, then merges those sublists into a sorted list. Divide-and-conquer recursively solves subproblems; each subproblem must be smaller than the original problem, and each must have a base case. A divide-and-conquer algorithm has three parts:
Divide the problem into a number of smaller instances of the same problem. Conquer the subproblems by recursively solving them, or solve them directly as base cases if they're small enough. Combine the solutions of the subproblems to find the solution to the original problem. So, the merge sort working rule involves the following steps: divide the unsorted array into subarrays, each containing a single element; take adjacent pairs of single-element arrays and merge them to form arrays of 2 elements; repeat the process till a single sorted array is obtained. An array of size N is divided into two parts of size N/2 each; those arrays are then further divided till we reach a single element. The base case here is reaching one single element. When the base case is hit, we start merging the left part and the right part, and we get a sorted array at the end. Merge sort repeatedly breaks down an array into several subarrays until each subarray consists of a single element, and then merges those subarrays in a manner that results in a sorted array. Other algorithms use the divide-and-conquer paradigm, such as Quicksort, Binary Search, and Strassen’s algorithm. Merge sort performs well when sorting large lists, but its operation time is slower than other sorting solutions when used on smaller lists. Another disadvantage of merge sort is that it will execute the operational steps even if the initial list is already sorted. In the use case of sorting linked lists, merge sort is one of the fastest sorting algorithms to use. Merge sort can also be used for file sorting within external storage systems, such as hard drives.
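A minimal sketch of top-down merge sort with an explicit merge step; the example list is arbitrary.

```python
# Sketch: top-down merge sort (divide, conquer, combine).
def merge_sort(items):
    if len(items) <= 1:                          # base case: a single element is already sorted
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])               # divide and conquer each half recursively
    right = merge_sort(items[mid:])
    return merge(left, right)                    # combine: merge the two sorted halves

def merge(left, right):
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):      # repeatedly take the smaller head element
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])                      # one side may still have leftover elements
    merged.extend(right[j:])
    return merged

print(merge_sort([38, 27, 43, 3, 9, 82, 10]))    # [3, 9, 10, 27, 38, 43, 82]
```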
Apriori: It is often associated with Market Basket Analysis. Association rule mining is a technique to identify underlying relations between different items, and there are many methods to perform it. The most famous story about association rule mining is the “beer and diapers” story: researchers discovered that customers who buy diapers also tend to buy beer. This classic example shows that there might be many interesting association rules hidden in our daily data. With association rules, we identify the set of items or attributes that occur together in a table. Frequent itemset mining is a data mining technique to identify the items that often occur together, for example bread and butter, or a laptop and antivirus software. Suppose you have a data set in which customers are buying multiple products, and your goal is to find out which combinations of products are frequently bought together. You need to organize the data in such a way that you have a set of products on each line, where each of those sets contains products that were bought in the same transaction. The most basic solution would be to loop through all the transactions and, inside the transactions, loop through all the combinations of products and count them. Unfortunately, this is going to take way too much time, so we need something better. You must have noticed that a pizza shop seller makes a pizza, soft drink, and breadstick combo, and offers a discount to customers who buy the combo. Have you ever wondered why he does so? He thinks that customers who buy pizza also buy soft drinks and breadsticks, so by making combos he makes it easy for the customers and, at the same time, improves his sales performance. The Apriori algorithm is an algorithm used for mining frequent product sets and relevant association rules. Generally, the Apriori algorithm operates on a database containing a huge number of transactions. The key concept in the Apriori algorithm is that it assumes all subsets of a frequent itemset to be frequent; similarly, for any infrequent itemset, all its supersets must also be infrequent. In order to select the interesting rules out of the many possible rules in this small business scenario, we will be using the following measures: support, confidence, lift, and conviction. Support of item x is the ratio of the number of transactions in which item x appears to the total number of transactions. Confidence (x => y) signifies the likelihood of item y being purchased when item x is purchased; this measure takes into account the popularity of item x.
Lift (x => y) is nothing but the ‘interestingness’ or the likelihood of the item y being purchased when item x is sold. Unlike confidence (x => y), this method takes into account the popularity of the item y.
Lift (x => y) = 1 means that there is no correlation within the itemset.
Lift (x => y) > 1 means that there is a positive correlation within the itemset, i.e., products in the itemset, x and y, are more likely to be bought together.
Lift (x => y) < 1 means that there is a negative correlation within the itemset, i.e., products in itemset, x and y, are unlikely to be bought together.
Apriori is an algorithm that is simple to grasp. Its join and prune steps are simple to apply on big itemsets in huge databases, but it requires a significant amount of computation if the itemsets are extremely big and the minimum support is kept to a bare minimum. Also, a full scan of the whole database is required.
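To make the support, confidence, and lift definitions above concrete, here is a small sketch that computes them by brute force for one rule over a handful of made-up pizza-shop transactions; a real analysis would typically use a dedicated Apriori implementation (for example from the mlxtend library) rather than this manual calculation.

```python
# Sketch: support, confidence, and lift for the rule x => y, computed from raw transactions.
transactions = [
    {"pizza", "soft drink", "breadsticks"},
    {"pizza", "soft drink"},
    {"pizza", "breadsticks"},
    {"soft drink"},
    {"pizza", "soft drink", "breadsticks"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    # Likelihood of y being purchased when x is purchased
    return support(x | y) / support(x)

def lift(x, y):
    # > 1 positive correlation, = 1 no correlation, < 1 negative correlation
    return confidence(x, y) / support(y)

x, y = {"pizza"}, {"soft drink"}
print("support(pizza):", support(x))
print("confidence(pizza => soft drink):", confidence(x, y))
print("lift(pizza => soft drink):", round(lift(x, y), 3))
```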