Simplifying key Data Science Concepts! (drafted by Dr Ratika Datta)
(1).Difference between eigenvectors and eigenvalues, as used in Principal Component Analysis: solving |A - lambda*I| = 0 gives the eigenvalues lambda; the corresponding eigenvectors v satisfy (A - lambda*I)v = 0.
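A minimal NumPy sketch (the 2x2 matrix is an assumed illustration):

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])          # example matrix, assumed for illustration
values, vectors = np.linalg.eig(A)  # eigenvalues solve |A - lambda*I| = 0
print(values)                       # [2. 3.]
print(vectors)                      # columns are the corresponding eigenvectors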
(2).p-value: a high p-value (e.g., greater than .05!) means the data are consistent with the Null Hypothesis, so we fail to reject it.
(3).Missing values: if a variable has missing values (a common heuristic threshold is 30 pct), pandas in Python can fill them with the column mean, e.g., via df.fillna(df.mean()).
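A quick pandas sketch with assumed toy data:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, np.nan]})  # toy column, assumed
df["age"] = df["age"].fillna(df["age"].mean())        # impute with the column mean
print(df)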
(4).Kernel trick: implicitly maps non-linear data into a higher-dimensional space where a linear model can fit it, without computing the mapping explicitly.
(5).Feature selection in predictive modelling: backward versus forward versus recursive versus embedded approaches (e.g., Random Forest).
(6).ROC Curve, Receiver Operating Characteristic: plotting true positive rates against false positive rates gives the ROC; practically used, e.g., in illness identification for path labs etc.
(7).Machine Learning algorithms are used to automate predictive modelling
(8).List and tuple, in Python, are ordered collections of items. A dictionary is a collection of key-value pairs (unordered before Python 3.7, insertion-ordered since). List and dictionary objects are mutable, i.e., it is possible to add a new item or delete an item from them. A tuple is an immutable object.
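A short illustration in Python:

items = [1, 2, 3]      # list: mutable
items.append(4)        # OK
d = {"a": 1}           # dictionary: mutable
d["b"] = 2             # OK
t = (1, 2, 3)          # tuple: immutable
# t[0] = 9             # would raise a TypeError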
(9).Pickling: the process whereby a Python object hierarchy is converted into a byte stream. Unpickling: the inverse of the pickling process, whereby a byte stream is converted back into an object hierarchy.
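A minimal sketch with Python's pickle module:

import pickle

data = {"model": "demo", "score": 0.91}  # any Python object hierarchy (assumed)
blob = pickle.dumps(data)                # pickling: object -> byte stream
restored = pickle.loads(blob)            # unpickling: byte stream -> object
print(restored == data)                  # True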
(10).dir(): in Python, dir() displays the list of names and attributes available on an object or in the current scope (the similarly named shell command dir, by contrast, lists files and subdirectories in a directory).
(11).In Bagging, multiple homogeneous algorithms are trained independently on bootstrapped samples and combined afterward, averaging (or voting on) their predictions.
Boosting is an ensemble technique where we train multiple homogeneous algorithms sequentially, each correcting the errors of the previous one; together these individual algorithms create a final model with the best results.
(12).Difference between classification and regression: an example of classification is sorting email into spam and non-spam.
Regression is when we predict a continuous value, e.g., the stock price for a specific period of time, based on historical data.
(13).The purpose of reinforcement learning is for the agent to learn an optimal, or nearly-optimal, policy that maximizes the "reward function" or other user-provided reinforcement signal that accumulates from the immediate rewards.
(14).Softmax is used for multi-class classification in the Logistic Regression model, whereas Sigmoid is used for binary classification in the Logistic Regression model.
(15).Fourier analysis converts a signal from its original domain to a representation in the frequency domain, and vice versa.
(16).DIT (Decimation in Time) and DIF (Decimation in Frequency) algorithms are two different ways of implementing the Fast Fourier Transform (FFT), reducing the total number of computations used by the DFT algorithm and making the process faster and device-friendly.
(17).Mercer's Theorem determines which functions can be used as a kernel function with SVM.
(18).SVM and convex hull creation: geometrically, the SVM's maximum-margin hyperplane separates the convex hulls of the two classes.
(19/20).RNN or CNN (Deep Learning):
A CNN identifies patterns in the current input in one direction, feeding forward from input to output.
A Recurrent Neural Network, by contrast, also feeds information back through its loops, letting it handle sequences; it can be used for sentiment analysis (e.g., tone identification), text mining and image captioning.
(21).R Language:
(a)It includes vectors, lists, matrices and data frames. It is less efficient than Python in memory use and performance.
(b)Importing a CSV file in R: read.csv(path)
(c)R Markdown: R's reporting tool.
Gives output in HTML, PDF, Word.
(d)Installing packages in R: install.packages("package name")
(e)To impute missing data in R: mice, imputeR, mi, Hmisc, Amelia, missForest
(f)Cross tabulation (R's table()) creates a confusion matrix.
(g)dplyr package: filter, select, mutate, arrange, count
(h)What are the 6 classes of R objects?
R's basic data types are character, numeric, integer, complex, logical and raw. R's basic data structures include the vector, list, matrix, data frame and factor.
(i)Rattle: an R GUI for supervised and unsupervised learning.
(j)debug() and browser() are used for debugging.
(k)A factor in R, also known as a categorical variable, stores both string and integer data values as levels.
(l)Over 10,000 free packages exist in R's CRAN library.
(m)setwd() sets the working directory; a CSV can then be read into a variable with read.csv() and the path.
(n)The confusion matrix summarizes R output as a table of actual versus predicted values.
(o)A confusion matrix has:
True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN)
(p)Based on the same:
Accuracy = (TP + TN) / (sum of all 4 categories)
Precision = right positives / overall predicted positives = TP / (TP + FP)
Recall = sensitivity, the true positive rate = TP / (TP + FN), i.e., true positives / total actual positives
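A quick worked example with assumed counts, TP = 40, TN = 45, FP = 10, FN = 5:
Accuracy = (40 + 45) / 100 = 0.85; Precision = 40 / (40 + 10) = 0.80; Recall = 40 / (40 + 5) ≈ 0.89.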
(22)Random Forest versus Decision Tree: a decision tree is a single model that can easily overfit; a random forest averages many trees trained on bootstrapped samples for more robust predictions.
(23)K-means clustering (unsupervised learning) versus the K-nearest-neighbour algorithm (supervised learning).
(24)Euclidean distance: the straight-line distance between a data point and the centroid, sqrt(sum((x_i - c_i)^2)).
(25)SQL: ANSI (the American National Standards Institute) standardized SQL in 1986. SQL is an English-like standard language, while MySQL is a specific DBMS built around it.
(26)Joins: left, right, inner and full join.
(27).Cross Join / Cartesian join:
The CARTESIAN join is also called CROSS JOIN: every row of one table is joined to every row of some other table.
It usually occurs when the matching column or the WHERE condition isn't specified.
(28).Subqueries in SQL: a query inside another query. A subquery may be nested within any query, including SELECT and UPDATE, and used with comparison operators (e.g., > or =).
(29).MySQL and Oracle DB are examples of RDBMS (Relational Database Management Systems).
(30).SQL commands: DDL (CREATE, ALTER, DROP, TRUNCATE, RENAME), DML (SELECT, INSERT, UPDATE, DELETE), DCL (GRANT, REVOKE) and TCL (COMMIT, ROLLBACK, SAVEPOINT).
(31).Data warehousing: an organisation's historical data, stored to aid decision making!
(32).The intension of a database is the set of definitions of the data structures for the particular database (also called the schema).
The extension of the database is the set of database values that populate these data structures.
(33).DELETE with a WHERE clause versus TRUNCATE:
DELETE removes specific rows, e.g., one row matching a WHERE clause; TRUNCATE deletes all data from the table when used.
(34).Shared lock: allows many transactions to read a data item at the same time.
Exclusive lock: allows only one transaction at a time.
(35).Primary key versus foreign key versus candidate key versus super key.
(36).A trigger can be executed when you run one of the following MySQL statements on the table: INSERT, UPDATE or DELETE; it can be invoked before or after the event.
(37).MySQL relationships:
One-to-One
One-to-Many
Many-to-Many
(38).Sharding in SQL: sharding is when you logically separate your RDBMS data into multiple databases, typically with the same schema.
(39).A cursor makes it possible to define a result set (a set of data rows) and perform complex logic on a row-by-row basis.
(40).Data wrangling: the process of removing errors and combining complex data sets to make them more accessible and easier to analyze.
(41).Data parsing is converting data from one format to another.
(42).Row-level security: it limits the data a person can see, based on access filters that can be specified per user.
(43).Depth cueing is available in Architectural and Coordination discipline views to allow architects to better visualize their elevations and sections. This graphic display option allows you to quickly show which elements are farthest away and which elements are closest to the front of the view.
(44).Parallel projection: creating a 2D representation of a 3D scene.
(45).Data Science additional: an example of a palindrome is "Madam", which reads the same forwards and backwards (ignoring case).
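A one-function Python sketch of the check:

def is_palindrome(s):
    s = s.lower()              # case-insensitive comparison
    return s == s[::-1]        # a palindrome reads the same reversed

print(is_palindrome("Madam"))  # True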
(46)In CHAR, if the length of the string is less than the set or fixed length, it is padded with extra memory space. In VARCHAR, if the length of the string is less than the set length, it is stored as-is, without padding.
(47)UNIQUE, FOREIGN KEY, CHECK, NOT NULL, DEFAULT and PRIMARY KEY are some of the constraints.
(48)Data integrity: the consistency and correctness of data kept in a database.
(49)GETDATE() returns the current date (SQL Server).
(50)Query optimisation: choosing the lowest-cost, highest-performance way to execute a query.
(51) Denormalization is the process of adding precomputed redundant data to an otherwise normalized relational database to improve read performance of the database. Normalizing a database involves removing redundancy so only a single copy exists of each piece of information.
(52)An index makes an entry for each value and can therefore retrieve records faster, e.g., unique index, clustered index, non-clustered index.
(53)First normal form, second normal form and third normal form exist.
(54)ACID properties: Atomicity, Consistency, Isolation and Durability.
(55)SQL operators: logical, arithmetic and comparison operators.
(56)Recommender systems: e.g., movie recommendations, music preferences etc.
(57)Machine learning in real-world scenarios:
E-commerce: understanding customer churn, deploying targeted advertising, remarketing.
Search engines: ranking pages depending on the personal preferences of the searcher.
Finance: evaluating investment opportunities and risks, detecting fraudulent transactions.
Medicine: designing drugs depending on the patient's history and needs.
Robotics: machine learning for handling situations that are out of the ordinary.
Social media: understanding relationships and recommending connections.
Information extraction: framing questions for getting answers from the web.
(58)Explain the steps in making a decision tree?
Take the entire dataset; make the split that best separates the classes; apply the split to the inputs; reapply the above steps recursively until a stopping criterion is met; then clean up the tree by pruning the extra splits.
(59)Similar to finding the line of best fit in linear regression, the goal of gradient descent is to minimize the cost function, or the error between predicted and actual y.
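A bare-bones sketch for a one-feature linear regression (the data points are assumed):

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # toy inputs
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])   # roughly y = 2x + 1

m, b, lr = 0.0, 0.0, 0.01                 # slope, intercept, learning rate
for _ in range(5000):
    pred = m * x + b
    grad_m = (2 / len(x)) * np.sum((pred - y) * x)  # d(MSE)/dm
    grad_b = (2 / len(x)) * np.sum(pred - y)        # d(MSE)/db
    m -= lr * grad_m                                # step against the gradient
    b -= lr * grad_b

print(round(m, 2), round(b, 2))           # close to the best-fit slope and intercept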
(60) A/B testing is a way to compare two versions of something to figure out which performs better.
(61)Confounding variable: a third variable that influences both the supposed cause and the effect, producing illogical or spurious causation.
(62)A schema is a collection of database objects, including tables, views, indexes and synonyms. A star schema is denormalised, versus a snowflake schema, which is normalised and occupies less space than a star schema.
(63)There are advanced tools like text analysis with Python that can help you transform your data into meaningful insights, quickly.
(64)Box-Cox transformation is a statistical technique that transforms the target variable so that the data more closely resembles a normal distribution.
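A sketch using SciPy (the skewed sample is assumed; Box-Cox needs strictly positive data):

import numpy as np
from scipy import stats

y = np.random.lognormal(size=1000)            # skewed, strictly positive data
y_transformed, best_lambda = stats.boxcox(y)  # lambda chosen by maximum likelihood
print(best_lambda)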
(65)Applying a model trained on labelled training data to new test data is an example of supervised learning; when no labelled examples are available, e.g., in clustering, it is unsupervised learning.
(66)What is Combinatorics? A branch of mathematics that concerns the study of finite discrete structures. It deals with the study of permutations and combinations, and enumerations of sets of elements.
(67)MLE, the Maximum Likelihood Estimate: achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable.
(68)Bayesian analysis is a statistical paradigm that answers research questions about unknown parameters using probability statements.
E.g., Bayesian probability, also known as evidential probability, is the process of assigning a prior probability to a hypothesis and adjusting that probability as new information becomes available; a second example is updating beliefs about a known coin's bias as new tosses are observed.
(69)Regularisation in data: penalizing model complexity so that our model works well not only with training or test data, but also with the data it'll receive in the future.
(70)Root Mean Square Error versus Mean Absolute Error: the raw error, actual minus predicted, can be positive, negative or zero; MAE = (1/n) * sum |actual - predicted| averages these gaps in absolute terms. RMSE = sqrt((1/n) * sum (actual - predicted)^2) squares the errors first, so it penalizes large errors more heavily, a befitting tool for gradient-based fitting.
(71)Naive Bayes algorithms are mostly used in sentiment analysis, spam filtering and recommendation systems.
A Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an orange if it is orange in colour, round, and about 3 inches in diameter.
(72)A practical example is logistic regression in Excel with an MLE calculation: coefficients b0, b1, b2, inputs X1, X2, and Y taking values 0 and 1. The steps: for each row, compute the logit b0 + b1*X1 + b2*X2 and the probability e^logit / (1 + e^logit); then, e.g., if y is zero for a row, take 0 * (calculated probability) + (1 - 0) * (1 - calculated probability); the sum of these row values is the MLE objective, which Solver is then targeted to maximize by changing b0, b1 and b2.
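The same idea as a hedged Python sketch, maximizing the log-likelihood in place of Excel's Solver (the tiny dataset is assumed):

import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0],   # toy X1, X2 rows
              [3.0, 4.0], [4.0, 3.0], [2.0, 2.0]])
y = np.array([0, 0, 0, 1, 1, 1])                     # binary outcome

def neg_log_likelihood(b):
    logit = b[0] + X @ b[1:]          # b0 + b1*X1 + b2*X2
    p = 1 / (1 + np.exp(-logit))      # e^logit / (1 + e^logit)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

result = minimize(neg_log_likelihood, x0=np.zeros(3))  # Solver-style optimization
print(result.x)                       # fitted b0, b1, b2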
(73)Fuzzy matching is a method that provides an improved ability to process word-based matching queries to find matching phrases or sentences in a database.
(74)The mean absolute percentage error (MAPE) is the mean or average of the absolute percentage errors of forecasts:
MAPE = (100/n) * sum(ABS(Actual - Forecast) / Actual)
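A quick check in Python with assumed figures:

import numpy as np

actual = np.array([100.0, 200.0, 300.0])    # assumed actuals
forecast = np.array([110.0, 190.0, 330.0])  # assumed forecasts
mape = 100 * np.mean(np.abs((actual - forecast) / actual))
print(round(mape, 2))                       # 8.33, i.e., about 8 pct average error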
(75)Seasonality and time series: seasonality in a time series implies a pattern repeated over time. Seasonal differencing, whereby each value is differenced against the value one seasonal lag earlier, removes that bias.
(76)The “training” data set is the general term for the samples used to create the model, while the “test” or “validation” data set is used to qualify performance and evaluate predictive power.
(77)Sensitivity is the same as recall, the true positive rate; it is not the same as the model's overall accuracy.
(78)When to use SVM versus Random Forest?
Accuracy under both can be tested. When the data is clean, without missing values etc., one can go for SVM; with multi-class problems in the data, Random Forest is preferred.
(79)Concepts for handling missing values:
Complete case treatment: e.g., drop the whole row.
Available case analysis: e.g., in R coefficient calculations, use whatever values are available and remove only the missing ones.
Mean substitution: substituting the missing values with the mean value. This does make the distribution biased, as regression, correlation and standard deviation are mean-dependent.
(80).Machine learning and deep learning are both types of AI. In short, machine learning is AI that can automatically adapt with minimal human interference.
Deep learning is a subset of machine learning that uses artificial neural networks to mimic the learning process of the human brain.
(81). Inertia measures how well a dataset was clustered by K-Means.
Inertia is calculated by measuring the distance between each data point and its centroid, squaring this distance, and summing these squares over all points in all clusters. A good model is one with low inertia AND a low number of clusters (K).
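A minimal scikit-learn sketch (the four points are assumed, forming two obvious clusters):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.inertia_)   # sum of squared distances of points to their centroids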
(82).LabelEncoder encodes labels with a value between 0 and n_classes - 1, where n_classes is the number of distinct labels. If a label repeats, it assigns the same value as assigned earlier.
(83)The purpose of cross-validation is to test the ability of a machine learning model to predict new data. It is also used to flag problems like overfitting or selection bias, and its iterations give insight into how the model will generalize to an independent dataset.
(84)sns.pairplot: the Seaborn pairplot allows us to plot pairwise relationships between variables within a dataset. A variable plotted against itself is shown as a histogram-style plot on the diagonal, unless it's categorical!
(85).ROC Curve: the relationship of true positives to false positives. The closer the curve lies to the Y axis (the top-left corner), the better the true positive rate and the better the ROC curve.
(86).Sentiment analysis is an analytical technique that uses statistics, natural language processing, and machine learning to determine the emotional meaning of communications.
Companies use sentiment analysis to evaluate customer messages, call centre interactions, online reviews, social media posts, and other content.
(87).The CountVectorizer selects the words/features/terms which occur most frequently.
It takes absolute counts, so if you set max_features=3, it will select the 3 most common words in the data.
By setting binary=True, the CountVectorizer no longer takes the frequency of a term/word into consideration, only its presence or absence.
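A short sketch on an assumed toy corpus:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["good movie", "bad movie", "good good plot"]   # toy corpus, assumed
vec = CountVectorizer(max_features=3)                  # keep the 3 most frequent terms
print(vec.fit_transform(docs).toarray())
print(vec.get_feature_names_out())

binary_vec = CountVectorizer(binary=True)              # presence/absence only
print(binary_vec.fit_transform(docs).toarray())        # entries are 0 or 1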
(88). Cosine similarity measures the similarity between two vectors of an inner product space.
It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction.
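A short NumPy sketch with assumed vectors:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos)   # 1.0, since b points in exactly the same direction as a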
(89).Soft-margin SVM (Support Vector Machine) is implemented with the help of the regularization parameter (C).
Regularization parameter (C): it tells us how much misclassification we want to avoid; the margin narrows as the value of C is raised!
Hard-margin SVM corresponds to large values of C; soft-margin SVM to small values of C.
(90) SVM is a supervised machine learning algorithm which can be used for classification or regression problems.
It uses a technique called the kernel trick to transform your data and then based on these transformations it finds an optimal boundary between the possible outputs.
(91).Linear kernel versus RBF kernel: a linear SVM is a parametric model. But what is RBF? A nonlinear kernel, the Radial Basis Function, used for non-linear problems: it implicitly maps the data so a separating hyperplane can be found.
(92).Gamma is a hyperparameter used with non-linear SVM kernels like RBF, the radial basis function.
The gamma parameter of RBF controls the distance of influence of a single training point.
What does it imply in the 3D plane? Keeping the regularization tolerance constant, i.e., keeping C constant, one can analyse different simulations of gamma in RBF!
(93).StandardScaler standardizes features towards a standard normal distribution (SND): it makes the mean 0 and scales the data to unit variance.
MinMaxScaler scales all the data features into the range [0, 1] by default; a range such as [-1, 1] can be chosen instead when the dataset has negative values.
(94).One can do SVC facial recognition for image identification,
mentioning C, gamma and degree, and opting for the linear SVC kernel. The objective of a Linear SVC (Support Vector Classifier) is to fit the data you provide, returning a "best fit" hyperplane that divides, or categorizes, your data.
(95).Grid search with image recognition: a process that searches exhaustively through a manually specified subset of the hyperparameter space of the targeted algorithm.
What does it signify? The best model it finds can then be evaluated, e.g., with a confusion matrix.
E.g., in a set of 20 images, Colin Powell was correctly classified 16 out of 20 times in image recognition.
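A hedged scikit-learn sketch on the built-in digits data (the parameter grid is an assumed subset):

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
grid = {"C": [1, 10], "gamma": [0.001, 0.0001]}  # manually specified subset
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=3)
search.fit(X, y)
print(search.best_params_)   # best C and gamma found by exhaustive search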
(96).A two-dimensional dataset can be reduced to a one-dimensional dataset with PCA (Principal Component Analysis);
e.g., 1000 columns can be reduced with PCA to 100 columns that still retain 90 pct of the information in the original dataset.
What is PCA? Principal Component Analysis is a popular unsupervised learning technique for reducing the dimensionality of data. It increases interpretability yet, at the same time, minimizes information loss. A short sketch follows below.
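A short scikit-learn sketch on the built-in digits dataset, keeping 90 pct of the variance:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)     # 64 pixel columns
pca = PCA(n_components=0.90)            # keep components explaining 90 pct of variance
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)   # far fewer columns, most information kept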
(97).Filtering noise with PCA: e.g., in image recognition there exist some blurred images; that blur is noise, captured by the low-variance components, and it can be removed from the original image-recognition dataset by keeping only the top components. Filtering noise this way won't harm the final output.
(98). One example of anonymized data is a dataset that has been stripped of any personally identifiable information such as names, addresses, and phone numbers.
This type of data can be used to analyze trends and patterns without the risk of exposing any individual's personal information.
(99).Anomaly detection examples: failure in jet engines, credit card fraud (1 would be fraud and 0 no fraud), or anomalies in telemetry data.
(100). Isolation Forests: Isolation forest is a machine learning algorithm for anomaly detection.
It is an unsupervised learning algorithm that identifies anomalies by isolating outliers in the data. Isolation Forest is based on the Decision Tree algorithm.
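A minimal scikit-learn sketch (the readings are assumed, with one obvious outlier):

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[10.0], [10.2], [9.9], [10.1], [55.0]])  # toy sensor readings
clf = IsolationForest(random_state=0).fit(X)
print(clf.predict(X))   # 1 = normal, -1 = anomaly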
(101).GPUs versus TPUs?
Graphics Processing Unit vs Tensor Processing Unit, whereby the TPU is Google's custom-developed processor.
GPUs have the ability to break complex problems into thousands or millions of separate tasks and work them out all at once, while TPUs were designed specifically for neural network loads and have the ability to work quicker than GPUs while also using fewer resources.
(102).What does the Deep in deep learning mean?
The word "deep" in "deep learning" refers to the number of layers through which the data is transformed.
(103).CNN vs RNN vs GAN
Convolutional Neural Network vs Recurrent Neural Network vs Generative Adversarial Network.
A CNN is good at image processing and classification; an RNN is good at sequential data processing (e.g., time series data is time-sequenced); and a GAN is good at generative tasks (GANs use two neural networks and pit one against the other), e.g., enabling computers to create art, music and other content.
(104).The simplest neural network is the Multilayer Perceptron.
A multilayer perceptron (MLP) is a feedforward neural network. It consists of three types of layers: the input layer, the output layer and hidden layers.
So, how does it work?
Weights and biases are given, along with the inputs. A linear transformation, e.g., y = Mx + b with the given weights and biases, takes us from the input layer to the hidden layer. From the hidden layer, we have to reach the output. The Rectified Linear Unit (ReLU) is the most commonly used activation function in deep learning: it returns 0 if the input is negative, but for any positive input, it returns that value back. After ReLU, we again apply the linear transformation y = Mx + b to reach the output Y.
(105).Backpropagation: inputs are given, hidden layers exist, and one feeds forward to compute the output.
The error is calculated, e.g., the gap between the computed output and the correct output, and the loss function is computed.
Then comes backpropagation, i.e., propagating back by adjusting weights and biases until the error is sufficiently small. With each iteration, the weights and biases become more refined and the error becomes smaller! So, what do you do? Propagate back, to reduce the error.
(106).Metrics with backpropagation:
Adam: an effective optimizer for backpropagation.
Loss: e.g., MAE, the mean absolute error, to be minimized.
Metrics: the lowest MAE values on the training dataset, to test training efficacy.
The larger the gap between the training and testing metrics, the greater the overfitting of the model.
(107).7 steps of the simplest TensorFlow model (sketched in code below):
1. tf.keras.models.Sequential
2. tf.keras.layers.Dense
3. Dense with three hidden layers, 2 inputs and one output.
4. Adam optimizer, MAE loss and the MAE metric are chosen.
5. Model history: fit with a validation split of 20 pct.
6. Training and testing fit.
7. Model prediction with 2 inputs.
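A hedged Keras sketch of those 7 steps (the toy data and layer sizes are assumptions):

import numpy as np
import tensorflow as tf

X = np.random.rand(500, 2)                             # 2 assumed inputs
y = 3 * X[:, 0] + 2 * X[:, 1]                          # 1 continuous output

model = tf.keras.models.Sequential([                   # step 1
    tf.keras.layers.Dense(16, activation="relu"),      # steps 2-3: hidden layer 1
    tf.keras.layers.Dense(16, activation="relu"),      # hidden layer 2
    tf.keras.layers.Dense(16, activation="relu"),      # hidden layer 3
    tf.keras.layers.Dense(1),                          # one output
])
model.compile(optimizer="adam", loss="mae", metrics=["mae"])           # step 4
history = model.fit(X, y, validation_split=0.2, epochs=20, verbose=0)  # steps 5-6
print(model.predict(np.array([[0.5, 0.5]])))           # step 7: predict from 2 inputs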
***************************************************************
Aim: simplifying practicality over tough concepts, across Economics, Finance, Analytics and Econometrics.
Regards,Self-Drafted
Dr Ratika Datta
Project Co-ordinator - BI & Analytics