You want to be a data guru?

Manish Kushwaha

Director Of Data Engineering (PE) @ McAfee | Data Science, Data Engineering

Published Jun 27, 2021

If you want to be a data guru, you need to understand a few fundamentals of data. Start with a few basic concepts of statistics; this will help you get into descriptive analytics. Then move on to machine learning (ML), both supervised and unsupervised.

But whether the insights you draw are descriptive or learning-based, be a prescriber. It is important to prescribe medicine when a problem exists, or to prescribe a way to avoid an upcoming problem.

Here is some basic terminology you need to know.

Data: Raw information generated by any activity is known as data. Any attribute of an object can be recorded as data. For example, the colour of a car, the speed of a car, etc.

Dataset: A collection of data from a particular group or study is known as a dataset. For example, the speed of different cars on a highway on New Year’s Eve.

Variable: A variable is a name that represents a piece of data, such as "speed" or "colour". If you need to refer to the same data again and again in a report, you can use the variable instead of describing the data each time.

Variables can be classified into two categories, depending on the values they can take:

Qualitative or categorical variables: These variables can take values that describe the quality of an object or an activity. For example, the colour of a car, customer satisfaction levels, etc.
Quantitative variables: These variables take numerical values that count or measure something. For example, the speed of a car, the number of students in a class, etc.

Quantitative variables can be further classified into Discrete and Continuous.

Discrete variables: The variables which can be counted, and do not have any decimal parts, are known as discrete variables. For example, the number of students in a class. A class can have 10 students or 11 students, but it cannot have 10.25 students.
Continuous variables: The variables which can be divided infinitely into smaller parts are known as continuous variables. For example, a student’s height can be 1 metre or 0.99 metre, or 0.998 metre.
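
As a rough illustration of the classification above, the following Python sketch guesses a variable's type from the values it holds. The function name classify_variable and the sample values are made up for this example, and the rule it uses (text means qualitative, whole numbers mean discrete, decimals mean continuous) is only a heuristic, not a formal definition.

```python
def classify_variable(values):
    """Guess the variable type from the Python types of its values (heuristic)."""
    def is_int(v):
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(v, int) and not isinstance(v, bool)

    def is_number(v):
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    if all(isinstance(v, str) for v in values):
        return "qualitative"
    if all(is_int(v) for v in values):
        return "quantitative (discrete)"
    if all(is_number(v) for v in values):
        return "quantitative (continuous)"
    return "unknown"


# Made-up sample values mirroring the examples in the text
car_colours = ["red", "blue", "red"]
class_sizes = [10, 11, 25]
heights_m = [1.0, 0.99, 0.998]

for name, values in [("colour", car_colours),
                     ("class size", class_sizes),
                     ("height", heights_m)]:
    print(name, "->", classify_variable(values))
```

In real datasets the stored type can mislead (for example, a postcode stored as a number is still qualitative), so treat the result as a starting point.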

"Descriptive Statistics is a branch of statistics that describes or summarizes a collection of information."

Data Visualization

There are various techniques to visualise data. The process of using charts and graphs to represent data is known as data visualisation.

Visualising data with charts and graphs makes data analysis very intuitive. Various charts are used to represent data, such as the bar chart, histogram and pie chart.

A pie chart is useful if you want to represent a small number of data categories, say six at most. More than six categories make a pie chart cluttered; in that case, use a bar chart.

If you are trying to understand trends in data, a line chart is the best fit.

If you are comparing two numerical variables to see how they relate, a scatter chart is the chart of choice.

Histogram

A histogram is a visual representation of the distribution of data. It is meant for numerical data grouped into intervals (bins); for categorical data, the equivalent picture is a bar chart. The two important points to keep in mind about histograms are as below.

A histogram is an extended form of a bar graph and there are no gaps between adjacent bars in a histogram.
To construct a histogram, you need to first construct a frequency distribution table.
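
The frequency distribution table mentioned above can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the speeds are made-up data, the bin width of 10 is an arbitrary choice, and frequency_table is a hypothetical helper name. Note that this simple version skips bins that contain no values.

```python
from collections import Counter


def frequency_table(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    # Floor each value to the start of its bin, then count values per bin start
    counts = Counter((v // bin_width) * bin_width for v in values)
    return [(start, start + bin_width, counts[start]) for start in sorted(counts)]


# Made-up speeds (km/h) of cars on a highway
speeds_kmh = [62, 75, 81, 68, 90, 77, 85, 73, 66, 95]

table = frequency_table(speeds_kmh, bin_width=10)

# Print a text histogram: one '#' per observation, bins listed in order
for low, high, count in table:
    print(f"{low}-{high}: {'#' * count}")
```

Each row of the table becomes one bar of the histogram, and because the bins sit side by side on a numerical axis, there are no gaps between adjacent bars.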

Data visualisation offers you the insight that you are looking for immediately, but the information is not always complete. Hence, in most cases, you will need to calculate metrics that provide you with relevant information on the characteristics of a dataset.

We can draw various insights from a dataset by using different summary metrics. For that, we need to understand the following powerful concepts.

Measures of Central Tendency of a dataset
Measures of Dispersion of a dataset
Interquartile Spread

Measures of Central Tendency

A central tendency is a central or typical value for a probability distribution. It may also be called a center or location of the distribution. Colloquially, measures of central tendency are often called averages.

Depending on the type of data sample, you can choose one of the following as a measure of central tendency:

Mean: The mean is the sum of all the data values divided by the number of values. The population mean is commonly represented by the symbol μ, and the sample mean by x̄.
Median: If you arrange the sample data in ascending order of value, from left to right, the value in the middle is called the median. For an odd number of values, there is a single middle value. For an even number of values, the median is the average of the two central values.
Mode: In a dataset, the value with the highest frequency is the mode. For qualitative data, it is not possible to calculate the mean, as there are no numerical values. Thus, the category with the highest frequency is considered the measure of central tendency in such cases.
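
The three measures can be tried out with Python's standard statistics module. The sketch below uses made-up speed and colour data purely for illustration.

```python
import statistics

# Made-up speeds (km/h); 65 appears twice
speeds_kmh = [60, 65, 65, 70, 80]

mean_speed = statistics.mean(speeds_kmh)      # sum of values / number of values
median_speed = statistics.median(speeds_kmh)  # middle value after sorting
mode_speed = statistics.mode(speeds_kmh)      # most frequent value

# With an even number of values, the median averages the two central values
even_median = statistics.median([60, 65, 70, 80])

# For qualitative data only the mode is meaningful
colour_mode = statistics.mode(["red", "blue", "red", "white"])

print(mean_speed, median_speed, mode_speed, even_median, colour_mode)
```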

Measure of Dispersion

A measure of dispersion is used to describe the variability in a sample or population. It is usually used in conjunction with a measure of central tendency, such as the mean or median, to provide an overall description of a set of data.

The measure of dispersion shows how scattered the values are in a dataset and how much they differ from the mean value.

First, we should understand two basic terms:

Population: The complete set of data points you want to study is known as the population.

Sample: A part of the population is known as a sample.

The three most popular metrics used to quantify spread within data are variance, standard deviation and interquartile range.

Variance (for a population) = ∑(x − μ)^2 / n
Variance (for a sample) = ∑(x − x̄)^2 / (n − 1)
where x is any data point within the dataset, n is the number of data points, μ is the population mean and x̄ is the sample mean.

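Standard deviation: It is the square root of the variance. Because variance is built from squared differences, it is expressed in squared units; taking the square root brings the measure back to the units of the data, so it describes variation without exaggerating its magnitude. The population standard deviation is popularly represented as σ, so the population variance is represented as σ^2.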
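
To make the difference between the population and sample formulas concrete, here is a small Python sketch. The helper functions population_variance and sample_variance are hypothetical names written for this example, and the data is made up; the results are compared against the standard statistics module.

```python
import math
import statistics


def population_variance(values):
    """Sum of squared deviations from the mean, divided by n."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


# Made-up data: mean is 5 and the squared deviations add up to 32
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = population_variance(data)   # 32 / 8
samp_var = sample_variance(data)      # 32 / 7
pop_std = math.sqrt(pop_var)          # square root of the population variance

print(pop_var, samp_var, pop_std)

# The standard library provides the same calculations
print(statistics.pvariance(data), statistics.variance(data), statistics.pstdev(data))
```

Dividing by n − 1 makes the sample variance slightly larger than the population variance, which compensates for the fact that a sample tends to understate the spread of the full population.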

Interquartile Range

Interquartile range based calculations are very useful when outliers in a dataset cause the standard deviation to give distorted results.

A quick way to check for an outlier is to calculate the standard deviation. If the result is much higher than expected, there is a high chance that your data contains an outlier.

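In such cases, the interquartile spread is a much better way to communicate the variation or spread in the data. Quartiles are the values in a sorted sample at the 25th, 50th and 75th percentiles (Q1, Q2 and Q3), and the interquartile range is the difference Q3 − Q1, which covers the middle 50% of the data.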
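
The effect of an outlier on both measures can be seen in a short Python sketch. The data is made up, and interquartile_range is a hypothetical helper name. It relies on statistics.quantiles (Python 3.8 or later) with method="inclusive"; other quartile conventions exist and can give slightly different values.

```python
import statistics


def interquartile_range(values):
    """Return Q3 - Q1, using the 'inclusive' quartile convention."""
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Made-up data: the second list replaces the largest value with an outlier
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]

iqr_clean = interquartile_range(clean)
iqr_outlier = interquartile_range(with_outlier)

std_clean = statistics.stdev(clean)
std_outlier = statistics.stdev(with_outlier)

print("IQR:", iqr_clean, iqr_outlier)
print("Standard deviation:", round(std_clean, 2), round(std_outlier, 2))
```

In this example a single extreme value inflates the standard deviation several times over, while the interquartile range stays the same because it only looks at the middle half of the data.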
