Introduction to seaborn! (Part-1)
Hi guys! Welcome to the next article on EDA (Exploratory Data Analysis) with Python. EDA is the first step of the model-building journey. Visualizing the data properly surfaces insights and valuable information, and the more ways you visualize the data, the more you learn about it.
For data visualization, Python offers tools such as Matplotlib, Plotly, the plotting methods built into pandas, and seaborn. Here we look at the seaborn module. Seaborn is a Python library used predominantly for making statistical graphics. It is built on top of Matplotlib and closely integrated with pandas data structures, and it has many features that help you explore and understand your data.
Let's look at seaborn plots for both numerical and categorical data.
Because seaborn is built on top of Matplotlib, you can mix Matplotlib and seaborn syntax in the same visualization.
Before going into detail about seaborn, let me introduce a useful helper that makes the plots easier to read. IPython can render notebook plots in SVG (Scalable Vector Graphics) format, which keeps them sharp. Compare a plot before and after switching a notebook to SVG.
If you look carefully at the image before, the plot looks blurred. After switching to SVG, the plot looks sharp, which makes the data easier to read.
Code to activate SVG in our notebook:
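# Install the module (run in a terminal; IPython already ships with Jupyter)
pip install ipython
# Import the module and switch inline plots to SVG
from IPython import display
display.set_matplotlib_formats('svg')
# On newer IPython versions this call is deprecated; the equivalent is
# matplotlib_inline.backend_inline.set_matplotlib_formats('svg')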
Regression Plot:
Let's start with the basic regression plot. I hope you are comfortable with the concept of regression! In regression, our ultimate aim is to fit the best-fit line to the data. We usually find that line by minimizing an error measure such as MSE or MAE. Now think about it this way: if we can also see the best-fit line in a plot, we learn even more, right? That is what we are going to do now. Seaborn has a dedicated plot for regression that draws the best-fit line over the data. Let us see some of the basics of the regression plot in seaborn.
The function is called regplot. It is a bivariate plot because it fits the best-fit line between two features.
In the image, the cyan line is the best-fit line and the green dots are the actual data points. The shaded band around the line is the confidence interval (95% by default). Seaborn estimates it by bootstrapping and does all of this on its own, so you don't need to compute anything yourself.
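To make the bootstrapping idea concrete, here is a minimal sketch of how a confidence interval for a fitted line can be estimated by resampling. It is not seaborn's internal code: the helper names fit_line and bootstrap_band, the synthetic data, and the choice of a simple percentile interval are all my own assumptions for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def bootstrap_band(xs, ys, x_eval, n_boot=500, ci=95, seed=0):
    """Percentile confidence interval for the fitted value at x_eval."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_boot):
        # Resample the (x, y) pairs with replacement and refit the line.
        idx = [rng.randrange(n) for _ in range(n)]
        slope, intercept = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(slope * x_eval + intercept)
    preds.sort()
    tail = (100 - ci) / 200  # fraction cut from each side of the distribution
    lower = preds[int(tail * n_boot)]
    upper = preds[int((1 - tail) * n_boot) - 1]
    return lower, upper


# Synthetic data (an assumption for this sketch): y = 2x + 1 plus noise.
noise = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2 * x + 1 + noise.gauss(0, 0.5) for x in xs]

slope, intercept = fit_line(xs, ys)
lower, upper = bootstrap_band(xs, ys, x_eval=2.5)
print("fitted value at x = 2.5:", slope * 2.5 + intercept)
print("95% bootstrap interval:", (lower, upper))
```

Repeating bootstrap_band over a grid of x values gives the lower and upper edges of a band like the one in the plot.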
regplot has many parameters for customizing the plot to suit your needs. The code below uses fit_reg, marker and scatter_kws.
See the code:
import seaborn as sns
diamonds = sns.load_dataset('diamonds').dropna()
diamonds = diamonds.sample(n = 200, random_state = 44) # sampling
# recent seaborn versions require x and y to be passed as keywords
sns.regplot(x = 'carat', y = 'price', data = diamonds, fit_reg = False, marker = '+', scatter_kws = {'color':'red'});
# change the parameters in your own code and see how the plot changes!
What about polynomials?
If the relationship in your data is polynomial rather than linear, the default regplot, which fits a straight line, won't capture it. Pass the order parameter with the degree of the polynomial, and regplot fits and draws a curve of that degree instead.
Code!
sns.regplot(x = 'carat', y = 'price', data = diamonds, order = 2, line_kws = {'color':'cyan'}, scatter_kws = {'color':'pink'});
The best-fit line changes with the order you pass in the code! regplot helps us understand many things about our data, especially the best-fit line.
Beyond straight and polynomial fits, regplot can also draw a logistic regression curve. For that, the y variable should be binary, and you need to pass logistic = True and install the statsmodels package, which seaborn uses for the fit.
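To see why the order matters, here is a small sketch that fits a straight line and a degree-2 polynomial to the same curved data and compares their mean squared error. It uses NumPy's polyfit on synthetic data, and the helper fit_mse is a made-up name for this illustration, so treat it as a picture of the idea rather than as what seaborn runs internally.

```python
import numpy as np

# Synthetic curved data (an assumption for this sketch): a quadratic plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 3, 60)
y = 1.5 * x**2 - x + 2 + rng.normal(0, 0.3, size=x.size)


def fit_mse(x, y, order):
    """Fit a polynomial of the given order; return its coefficients and MSE."""
    coeffs = np.polyfit(x, y, order)          # highest power first
    residuals = y - np.polyval(coeffs, x)
    return coeffs, float(np.mean(residuals**2))


coeffs_line, mse_line = fit_mse(x, y, order=1)
coeffs_quad, mse_quad = fit_mse(x, y, order=2)

print("MSE with order = 1:", mse_line)
print("MSE with order = 2:", mse_quad)
print("order = 2 coefficients:", coeffs_quad)
```

On data like this, the order = 2 fit follows the curve and leaves a much smaller error than the straight line.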
Cat Plot:
Cat plot gives access to all the categorical plots, such as box plot, violin plot, bar plot and count plot. If you don't know some of them, don't worry: future articles will cover everything mentioned here. For now, just understand that cat plot is not a single plot; it is one interface to several kinds of plot.
Cat plot supports both univariate and bivariate analysis.
This is how a cat plot looks for univariate analysis. In this plot, the horsepower values look evenly spread and increase steadily.
As with the previous plot, there are many parameters here.
If your dataset has categorical features, you can use these plots to inspect them before feeding them to a categorical statistical test such as chi-square.
This is what the plot looks like when cat plot draws one of the other plot kinds.
This one is called a swarm plot. If you don't know it, don't worry, future articles will cover it. Just understand that cat plot can draw multiple kinds of plot, and that a swarm plot shows each data point individually.
We can use more plot kinds like this to analyze categorical data clearly!
Code!
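# importing data
cars = sns.load_dataset('mpg').dropna()
# keep the common cylinder counts; copy so adding a column does not raise a warning
cars = cars[cars.cylinders.isin([4,6,8])].copy()
# label each car as old or new based on its model year
cars['type'] = ['old' if x <= 76 else 'new' for x in cars.model_year]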
# Univariate plot (the default kind is a strip plot)
sns.catplot(x = 'horsepower', data = cars, color = 'green', marker = 'o');
# kind selects which categorical plot to draw
sns.catplot(x = 'horsepower', data = cars, kind = 'swarm', color = 'pink');
# Multivariate plot
sns.catplot(y = 'horsepower', x = 'origin', data = cars, kind = 'box', hue = 'cylinders');
Explore all the parameters to get to know cat plot well.
That is all about the regression plot and the categorical plot. We will look at more plots in future articles!
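One easy way to start exploring is to loop over a few values of kind and draw the same data each time. The sketch below does that with a tiny hand-made table instead of the mpg dataset, so the numbers and the name toy are made up for illustration. The Agg backend line is only there so the script also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import pandas as pd
import seaborn as sns

# Tiny made-up table (not the real mpg data), just enough to draw each plot.
toy = pd.DataFrame({
    "origin": ["usa", "usa", "usa", "japan", "japan", "japan",
               "europe", "europe", "europe"],
    "horsepower": [150, 130, 165, 88, 95, 70, 90, 87, 110],
})

grids = {}
for kind in ["strip", "swarm", "box", "bar"]:
    # catplot returns a FacetGrid; the same call draws a different plot per kind
    grids[kind] = sns.catplot(x="origin", y="horsepower", data=toy, kind=kind)
    grids[kind].ax.set_title("kind = " + kind)
```

In a notebook you can drop the first two lines, and each figure is shown right after it is created.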
Name: R.Aravindan
Company: Artificial Neurons.AI
Position: Content Writer
Thank you!