Cleaning and manipulating data with Python
Data Science is a multidisciplinary field that combines various techniques and processes to extract knowledge from structured and unstructured data.
One of the key concepts in this field is data cleaning and preprocessing. Data sources often come with information that is not necessarily useful for the project we are working on, or that must be cleaned and properly processed to deal with missing values or inconsistencies, among other things.
Python has many powerful tools to achieve this. I am going to use the TripAdvisor Rating Impact on Hotel Popularity dataset from Kaggle to show some of the functions we can use for data cleaning and manipulation.
You can download the original dataset at the following URL:
For this exercise, I'll use Jupyter Notebook because it's easy to run commands cell by cell, so I can put each task in an individual cell that I can run separately. I will leave the link to the GitHub repository at the end of the post.
First of all, we need to import the Pandas library using the common alias pd:
import pandas as pd
Then we need to load the dataset into a DataFrame object:
df = pd.read_csv("./datasets/tripadvisor-rating-impact-on-hotel-popularity/data_rdd.csv", encoding_errors="ignore")
The previous line of code reads the CSV file named 'data_rdd.csv' located in the “./datasets/tripadvisor-rating-impact-on-hotel-popularity/” directory. It then loads the data from the CSV file into a DataFrame, which is stored with the variable name “df.” The “encoding_errors” parameter is set to “ignore”, which means any encoding errors encountered during reading will be disregarded.
One of the first things we can do is get summary information for an initial insight into the object, using the DataFrame info() method. This will give us information about the index dtype, columns, non-null values and memory usage.
df.info()
This particular dataset is composed of 272 columns, so in this case df.info() will show a reduced summary without individual column detail information.
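If you still want to see each column's details, the info() method accepts the verbose and show_counts parameters (available in recent Pandas versions) to force the full per-column listing:
#Force the full per-column summary, including non-null counts
df.info(verbose=True, show_counts=True)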
Describing statistical information:
Now that we have the structure of the DataFrame, we can view some statistical information about the numerical columns. This is done using the describe() method.
df.describe()
This will show the following information for each numerical column: count, mean, standard deviation, minimum, 25th, 50th and 75th percentiles, and maximum.
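By default describe() only summarizes numeric columns; if you also want counts, unique values and the most frequent value for the non-numeric columns, you can pass include="all":
#Include non-numeric columns in the statistical summary
df.describe(include="all")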
Showing some data:
Other common functions used to get a quick view of the structure and some data are the head and tail methods. These commands will show the list of columns with the first 5 rows of the DataFrame if used without any additional parameters.
df.head()
You can modify the number of rows shown by entering the desired number inside the parentheses, for example, to show the first 20 rows:
df.head(20)
The tail method shows the same but starting from the last rows. So if you want to show the last 5 rows of the DataFrame you can use the following:
df.tail()
As with df.head(), it can be used with a number inside the parentheses to select the desired number of rows to be shown.
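For example, to show the last 10 rows:
#Show the last 10 rows of the DataFrame
df.tail(10)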
Show column data types:
Knowing the type of data is very important in data analysis and manipulation. Datasets can come with a variety of data types, and it is often necessary to convert from one type to another for specific tasks. You can use the dtypes attribute to show the data type of each column.
df.dtypes
In this dataset we have three different types of data.
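A quick way to confirm how many columns belong to each type is to count them:
#Count how many columns there are of each data type
df.dtypes.value_counts()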
Handling missing values
Identifying and working with missing values is a very common task when dealing with datasets and Python has a lot of tools to work with.
First of all, we can list the number of missing values in each column of our DataFrame by iterating over the df.columns object.
The isna() function is used to detect missing values and will return True or False indicating whether the value evaluated is empty or NA. You can find missing values as simply empty values or as NaN values. NaN stands for “Not a Number” and it is a special value that represents the absence of a valid number. This is typically used to indicate missing or undefined data in numeric datasets. In Python, NaN is a specific floating-point value, while None is a more general object used to represent missing or undefined values in non-numeric data.
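As a minimal illustration of this behavior, pd.isna() flags both NaN and None as missing, while valid values return False:
import numpy as np
#Both NaN and None are reported as missing values
print(pd.isna(np.nan))   #True
print(pd.isna(None))     #True
print(pd.isna(0))        #False, 0 is a valid value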
for column in df.columns:
print("Column", column, "has", df[column].isna().sum(), "empty values.")
The above loop finds the total number of NaN values in each column of the DataFrame; chaining sum() after isna() returns the total count of NaN values for each feature.
You can also show missing values for one specific feature or column. In this case it will return rows with NA values in the column “photos”. Notice that this will not show the total number of missing values, but list the content of the rows that have NaN values in the “photos” column.
df[df["photos"].isna()]
You can also add a list of columns at the end of the line to limit the output to the desired features. In this example I will use “name” and “photos”.
df[df["photos"].isna()][["name", "photos"]]
Replace missing values:
fillna() is a method in Pandas that is used to fill or replace missing values in a DataFrame or Series with specified values.
First, we can use isna() to identify missing values in a particular column.
df[df["location_grade"].isna()][["name", "location_grade"]]
The above line will display the rows with NaN values in the “location_grade” column, showing the name and location_grade of each hotel. Rows 4498, 4500, 4516, 4528 and 4587, among others, have missing values.
Then we can use fillna() method to replace NaN values with a simple 0. We use the parameter inplace=True to modify the original dataframe in place. If False, it will return a new DataFrame with missing values filled.
df["location_grade"].fillna(0, inplace=True)
Now we can check the results and see the new output. “iloc” stands for Integer Location and it is used to select rows and columns by their integer-based positions. By specifying the index numbers of the rows, we can filter the output to see a group of records. In this case I passed a list with the same index numbers that we evaluated before to see the changes.
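For example, assuming the rows flagged above are still at the same positions, we can check them with iloc:
#Check the previously missing rows after filling with 0
df["location_grade"].iloc[[4498, 4500, 4516, 4528, 4587]]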
The fillna() function can also be combined with other functions like max(), min() or mean(). There are cases where we need to replace missing values with calculated values; these are known as aggregation functions, and they are useful when you want to use a general summary statistic to replace missing values.
For example, we can list NA values for the price_min column as we did before, but in this case, we will use aggregation functions.
#Show NA values for the price_min column.
df[df["price_min"].isna()][["name", "price_min"]]
Before replacing missing values, we can check the mean of all values in “price_min” column.
df["price_min"].mean()
Then we can use fillna() method to replace values.
#Replace NaN values with the mean() in the price_min column
df["price_min"].fillna(df["price_min"].mean(), inplace=True)
Once it's done, we can check the results as we did before, but with the index numbers of the desired rows.
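As a sketch, assuming rows 790, 858, 859, 866 and 893 are among those that had missing price_min values (the same positions inspected later in this post), the check could look like this:
#Check some of the previously missing rows after filling with the mean
df["price_min"].iloc[[790, 858, 859, 866, 893]]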
We can also use fillna() to replace a group of column values, for doing this we can pass a key/value dictionary with the proper values as shown below.
#Replace NaN values with the mean(), min() and max() in two or more columns (location_grade, price_curr_min and price_min).
df.fillna({"location_grade" : df["location_grade"].mean(),
"price_curr_min": df["price_curr_min"].max(),
"price_min": df["price_min"].min()}, inplace=True)
Removing rows
An alternative way of handling missing values is to remove them from the DataFrame. This is done using the dropna() method. In the following example we use print(df.shape[0]) to show the number of rows in the DataFrame before and after.
#Delete rows with at least one missing value
print("Before shape:", df.shape[0])
df.dropna(inplace=True)
print("After shape:", df.shape[0])
You can also include parameters to provide control over which rows and columns to consider for removal.
“how”: Specifies the condition for dropping rows or columns based on the presence of missing values. It can take one of the following values: “any” (drop if at least one value is missing, the default) or “all” (drop only if all values are missing).
“axis”: Determines whether you want to drop rows or columns. Axis=0 (default) is used to remove rows, and axis=1 is for columns.
“subset”: Allows you to specify a subset of columns to consider for missing values; it is passed as a list.
One example of dropping rows with missing values using these additional parameters could be the following:
df.dropna(subset=["column_1", "column_2"], how="all", axis=0)
The above line will remove rows where both "column_1" and "column_2" have missing (NaN) values in the same row.
Missing values imputation
Missing values imputation is a crucial step in data preprocessing for data analysis and machine learning. It involves filling in or estimating missing data points in a dataset to make it more suitable for analysis.
Python offers various imputation methods, including mean/median imputation, mode imputation and backward/forward fill imputation, among others.
Mean imputation:
#Impute missing numerical values.
#Mean Imputation: Replace missing values with the mean (average) value of the non-missing values in the same column.
#Show NaN values for price_min column.
print("price_min has", df["price_min"].isna().sum(), "nan values.")
#show rows with NaN values for price_min column
df[df["price_min"].isna()][["name", "price_min"]].head()
#show mean value for price_min
print("Mean value for price_min column is: ", df["price_min"].mean())
#fill na’s with mean
df["price_min"].fillna(df["price_min"].mean(), inplace=True)
#check for result
print("\nNow, column price_min has", df["price_min"].isna().sum(), "NaN values.")
df["price_min"].iloc[[790,858,859,866,893]]
Median imputation:
Median imputation is a data cleaning technique where missing values are replaced with the median value of the respective column. This method helps maintain the central tendency of the data and can be particularly useful when outliers are present. Median imputation ensures that missing values do not significantly affect the overall statistical properties of the dataset while preserving its integrity.
#Median Imputation: Replace missing values with the median value of the non-missing values in the same column.
#This is less sensitive to outliers compared to mean imputation.
#Show NaN values for price_min column.
print("price_min has", df["price_min"].isna().sum(), "nan values.")
#show rows with NaN values for price_min column
df[df["price_min"].isna()][["name", "price_min"]].head()
#show median value for price_min
print("Median value for price_min column is: ", df["price_min"].median())
#fill na’s with median
df["price_min"].fillna(df["price_min"].median(), inplace=True)
#check for result
print("\nNow, column price_min has", df["price_min"].isna().sum(), "nan values.")
df["price_min"].iloc[[790,858,859,866,893]]
Mode imputation:
Mode imputation is a simple way to preserve the original distribution of categorical data while filling in gaps, although it can also be applied to numerical columns, as in the example below.
#Mode Imputation:
#you can replace missing values with the mode (most frequent) value in the same column.
#Show NaN values for price_min column
print("price_min has", df["price_min"].isna().sum(), "nan values.")
#show rows with NaN values for price_min column
df[df["price_min"].isna()][["name", "price_min"]].head()
#show most frequent value for price_min
print("Mode value for price_min column is: ", df["price_min"].mode()[0])
#fill na's with most frequent value
df["price_min"].fillna(df["price_min"].mode()[0], inplace=True)
#check for result
print("\nNow, column price_min has", df["price_min"].isna().sum(), "nan values.")
df["price_min"].iloc[[790,858,859,866,893]]
Forward fill imputation:
Forward fill imputation fills missing values with the most recent non-missing value in the column, propagating the last known value forward.
In the following example we can see that the last non-missing value, in row 789, is 5438.0; this is the value that will be imputed into the first missing value encountered, so the NaN in row 790 will be replaced by 5438.0. As can be seen in the final output after the ffill method, row 790 now holds that value. This process is repeated for every row that has a NaN value in the price_min column.
#Forward fill imputation. Fill missing values with the most recent non-missing value in the column.
#show rows with NaN values for price_min column
print("Rows with missing values")
print(df[df["price_min"].isna()][["name", "price_min"]])
#show price_min for row 789 (last non-missing value of the column)
print("First value for forward fill will be: ", df["price_min"].iloc[789])
#fill na's with forward fill imputation
df["price_min"].ffill(inplace=True)
#check for result
print("\nNow, column price_min has", df["price_min"].isna().sum(), "nan values.")
df["price_min"].iloc[[790,858,859,866,893]]
Backward fill imputation:
It is the same as forward fill imputation, except that NaN values are replaced by the next valid observation in the column, propagating values backward through the DataFrame. Usage is the same as ffill.
In this example the value that will be used to replace row 790 will be the value of row 791 as it is the next valid observation in the column after the first NaN value in the row 790.
#Backward fill imputation. Fill missing values with next valid observation in the column
#show rows with NaN values for price_min column
print("Rows with missing values")
print(df[df["price_min"].isna()][["name", "price_min"]])
#show price_min for row 791 (next valid value of the column)
print("First value for backward fill will be: ", df["price_min"].iloc[791])
#fill na's with backward fill imputation
df["price_min"].bfill(inplace=True)
#check for result
print("\nNow, column price_min has", df["price_min"].isna().sum(), "nan values.")
df["price_min"].iloc[[790,858,859,866,893]]
Dropping rows and columns:
Sometimes it is necessary to remove missing values instead of replacing them with others. For this task we have the dropna() function.
dropna() can be used on either rows or columns; here is an overview of how it works.
Drop all rows with at least one missing value:
#Drop all rows with at least one missing value
#df.shape[0] shows the number of rows in the DataFrame
print("Before shape:", df.shape[0])
df.dropna(inplace=True)
print("After shape:", df.shape[0])
Drop all columns with at least one missing value:
#Drop all columns with at least one missing value
#df.shape[1] shows the number of columns in the DataFrame
print("Before shape", df.shape[1])
df.dropna(inplace=True, axis=1)
print("After shape", df.shape[1])
Drop rows with missing values in a specific column when more than 15% of its values are missing:
#Drop rows with missing values in one specific column when more than 15% of its values are missing.
#will use price_min for this example
#df.shape[0] shows the number of rows in the DataFrame
print("Before shape:", df.shape[0])
if df["price_min"].isna().sum() / df.shape[0] > 0.15:
    df.dropna(subset=["price_min"], inplace=True)
print("After shape:", df.shape[0])
Handling duplicates
Handling duplicates is a critical step in data science. Duplicates can skew statistical analysis, lead to incorrect results, and consume unnecessary computational resources. By identifying and removing duplicates, you can ensure the accuracy and reliability of the analysis, enabling more informed decision-making. Additionally, eliminating duplicates streamlines data processing, making it more efficient and reducing processing time.
Detect duplicated values in each column:
#Handling Duplicates:
#Detect duplicated values in each column.
for column in df.columns:
print("Column", column, "has", df[column].duplicated().sum(),
"duplicated values")
Detect and remove duplicated rows for a group of columns:
#Handling Duplicates:
#Detect and remove all duplicated rows for a group of columns.
#df.shape[0] returns the number of rows in the DataFrame
print("Before shape:", df.shape[0])
df.drop_duplicates(subset =["price_min", "price_max", "views_binary"], inplace=True)
print("After drop duplicates shape:" ,df.shape[0])
Remove all duplicated rows for a group of columns. Drop all except the first occurrence:
#Handling Duplicates:
#Detect and remove all duplicated rows for a group of columns. Drop all except first occurrence
print("Before shape:", df.shape[0])
df.drop_duplicates(subset= ["price_min", "price_max", "views_binary"], inplace=True, keep="first")
print("After dropna shape:" ,df.shape[0])
Data type conversions:
Data type conversion in Pandas is vital for data integrity, memory efficiency, and analysis. It ensures proper data handling for operations like filtering, sorting, and statistical analysis. Consistent data types are essential for data visualization and machine learning compatibility. Overall, it's a foundational step in data preparation for robust data science workflows.
You can apply several functions to do many tasks in data type conversion like the following:
Convert all numeric columns to string:
#Data Type Conversion:
#Convert all numeric columns to string
import numpy as np
#print number of numerical columns before converting
#df.shape[1] returns the number of columns in the Data Frame
print("Number of numerical columns before conversion is", df.select_dtypes(include=np.number).shape[1])
#generate a list of numeric columns
columns_list = list(df.select_dtypes(include=np.number))
for item in columns_list:
    df[item] = df[item].astype(str)
#print number of numerical columns after converting
print("Number of numerical columns after conversion is", df.select_dtypes(include=np.number).shape[1])
Convert one numerical column to string:
#Data Type Conversion:
#Convert one numerical column to string
print("Before conversion, price_min is", df["price_min"].dtypes)
df["price_min"] = df["price_min"].astype(str)
print("After conversion, price_min is", df["price_min"].dtypes)
Convert one categorical 0/1 column to no/yes:
#Data Type Conversion
#Convert one categorical 0/1 column to categorical no/yes
#will use views_binary as categorical column
print("Before conversion:" ,df["views_binary"].head(7))
#convert column to string
df["views_binary"] = df["views_binary"].astype(str)
#map 0 for no and 1 for yes
df["views_binary"] = df["views_binary"].map({"0" : "no", "1" : "yes"})
#check result
print("\nAfter conversion:", df["views_binary"].head(7))
Renaming columns:
Rename one column by name:
#Renaming Columns:
#Rename column by name
#show column names
print(df.columns)
#change hotel_url to hotel_url_address
df.rename(columns={"hotel_url" : "hotel_url_address"}, inplace=True)
#check result
print("\n", df.columns)
Rename columns using a list:
#Renaming Columns:
#Rename a list of columns
#print column names
print(df.columns)
#create list of columns to be renamed
original_columns = ["name", "views", "views_binary"]
new_columns = ["name_x", "views_x", "views_binary_x"]
#create dictionary from the two lists above using a dictionary comprehension
columns_dict = {original_columns[i]: new_columns[i] for i in range(len(original_columns))}
df.rename(columns=columns_dict, inplace=True)
#check for result
print("\n", df.columns)
Rename all columns. Convert to lowercase:
#Renaming Columns:
#Rename all columns. Convert to lowercase
#show column names (the first column "Unnamed 0" is capitalized)
print(df.columns)
#convert to lowercase
df.rename(str.lower, axis="columns", inplace=True)
#check result
print("\n", df.columns)
View the GitHub repository here: