Exploratory Data Analysis (EDA) in Python using DataPrep
Trypanophobia?
Yes, that's the word!!!
It is the fear of injection needles for medical treatment. To some, tablets are better, while others prefer liquid or even drips. Irrespective of your choice of treatment method, the most important thing is adhering to the doctor's advice. You already know why, right?
You just want to get fit and go back to your daily business.
Oh! Talking about business
It is not news that every business seeks consistent and constant growth. Businesses also look for ways to outdo their competitors. All of this can be achieved using data analysis, just as a doctor analyzes your biological data for diagnosis.
Now that I have your attention,🤗😃
This piece of writing has nothing to do with medical science, and the author isn't medically inclined. Instead, this article details the procedures involved in Data Analysis from start to finish.
In this article, you'll find:
Overview of Data Analysis
Types of Data Analysis
Data analysis using a programming language
EDA using DataPrep (you can skip this part if you're yet to learn Python for data analysis)
When you're done with this article, you'll be ready to take up your first dataset and analyze your way to a feasible solution.
To understand this article, the author assumes that you:
- Have an interest in generating business solutions using data
- Have prior knowledge of Data Analysis tools (non-coding)
- Have a basic knowledge of Python (not necessary though)
Overview of Data Analysis
Data Analysis is the bedrock of many operations in the data field. For instance, to prepare a dataset for any type of Machine Learning algorithm, the data must be properly analyzed. Likewise, to give a detailed report on the current inflation in the market price of foodstuffs, proper analysis must be done on the available dataset.
Data Analysis is not restricted to programmers; it is also adopted in finance, economics, the healthcare sector, and more. It plays a significant role in decision-making and future projections.
Types of Data Analysis
Data is analyzed for different reasons, and the purpose determines the type of analysis used.
There are various types of Data Analysis, such as:
Exploratory Analysis
Also known as Exploratory Data Analysis (EDA), it is one of the most common methods of data analysis. As the word "explore" suggests, EDA uses various statistical methods to check for relationships between the variables in the dataset. It also investigates the anomalies in a particular dataset and displays the outcome using one or more graphical methods such as a bar chart, cat plot, matrix, etc. EDA can be used to suggest a solution, propose a hypothesis, or make a recommendation.
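To make those two checks concrete, here is a minimal sketch that measures the relationship between two variables with a Pearson correlation and flags anomalies with the common 1.5 * IQR rule. It uses only Python's standard library. The ages and fares are made-up values, and the helper names pearson and iqr_outliers are hypothetical, not part of any library.

```python
from math import sqrt
from statistics import mean, quantiles


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


# Made-up sample: passenger ages and the fares they paid
ages = [22, 38, 26, 35, 54, 2, 27, 14]
fares = [7.25, 71.28, 7.93, 53.10, 51.86, 21.08, 11.13, 30.07]

# The same fares plus one suspiciously large value
fares_with_anomaly = fares + [512.33]

relationship = pearson(ages, fares)
anomalies = iqr_outliers(fares_with_anomaly)

print(f"correlation(age, fare) = {relationship:.2f}")
print(f"anomalous fares: {anomalies}")
```

A value close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little or none. EDA tools run checks of this kind for every column and present the results as charts.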
Inferential Analysis
Inferential Analysis is a type of analysis that draws conclusions about an entire dataset from a piece of it. In other words, a fraction (a sample) of the entire dataset is analyzed to arrive at a general conclusion. An example of Inferential Analysis is the Chi-square test, where conclusions are drawn by sampling a certain portion of the data.
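As an illustration of the Chi-square idea, the sketch below computes the chi-square statistic of independence by hand for a small sample. The customer counts are made up, and the function name chi_square_statistic is hypothetical. The threshold 3.841 is the standard 5% critical value for one degree of freedom, which applies to a 2x2 table.

```python
def chi_square_statistic(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were unrelated
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Made-up sample of 200 customers drawn from a much larger population.
# Rows: region A, region B. Columns: bought, did not buy.
sample = [
    [60, 40],
    [40, 60],
]

statistic = chi_square_statistic(sample)

# Standard 5% critical value for 1 degree of freedom (a 2x2 table)
CRITICAL_VALUE_5_PERCENT_DF1 = 3.841
related = statistic > CRITICAL_VALUE_5_PERCENT_DF1

print(f"chi-square = {statistic:.2f}, region and purchase related: {related}")
```

If the statistic exceeds the critical value, the sample gives evidence that the two variables are related in the wider population, not just in the 200 customers sampled.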
Descriptive Analysis
This type of analysis uses different metrics to describe the data points in a dataset. Take, for instance, a dataset of yearly rainfall. The metrics for description can answer questions such as:
How heavy is the rainfall?
Which region witnessed the highest rainfall?
What magnitude of rainfall was experienced?
The purpose of Descriptive Analysis is to give you a summary of the entire dataset and to help detect outliers in it.
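The rainfall questions above could be answered with a few summary metrics. The following is a minimal sketch using Python's standard library. The region names and rainfall totals are made up for illustration.

```python
from statistics import mean, median

# Made-up yearly rainfall totals in millimetres, keyed by region
rainfall_mm = {
    "North": 820,
    "South": 1460,
    "East": 1105,
    "West": 990,
    "Central": 1230,
}

values = list(rainfall_mm.values())

# A few common descriptive metrics
summary = {
    "mean": mean(values),
    "median": median(values),
    "min": min(values),
    "max": max(values),
    "range": max(values) - min(values),
}

# Which region witnessed the highest rainfall?
wettest_region = max(rainfall_mm, key=rainfall_mm.get)

for metric, value in summary.items():
    print(f"{metric}: {value}")
print(f"wettest region: {wettest_region}")
```

A value sitting far from the mean or median is a candidate outlier worth a closer look.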
Predictive Analysis
Predictive analysis is all about prediction, that is, using past or current occurrences to predict the future. It is an advanced data analysis method often used by companies to know their future risks and opportunities.
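One of the simplest ways to predict from past occurrences is to fit a straight-line trend and extend it. The sketch below does that with an ordinary least-squares fit. The monthly sales figures are made up, and fit_line is a hypothetical helper. Real predictive work usually relies on richer models than a straight line.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return slope, intercept


# Made-up history: month number and units sold in that month
months = [1, 2, 3, 4, 5, 6]
units_sold = [112, 118, 131, 139, 152, 158]

slope, intercept = fit_line(months, units_sold)

# Extend the trend one month into the future
forecast_month_7 = slope * 7 + intercept

print(f"trend: about {slope:.1f} extra units per month")
print(f"forecast for month 7: {forecast_month_7:.0f} units")
```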
Prescriptive Analysis
The essence of analyzing data is not only to gain insights but also to deduce some outcomes from the insights obtained. It is similar to a doctor performing a diagnosis. After the diagnosis, what next?
Prescriptive Analysis involves suggesting solutions based on all the phases of analysis. At the prescriptive analysis stage, the Data Analyst is expected to come up with a feasible solution to be implemented.
Now that we have walked briefly through the types of Data Analysis, let's go on to the order of the day, which is Exploratory Data Analysis.
Bearing in mind that EDA is all about exploring, there are different ways to explore a dataset. EDA can be very tasking, but using well-known and trusted tools will make your job easier.
Data Analysis using a Programming language
Data is most often stored in Excel sheets. For more robustness, some companies use relational databases queried with SQL. As a Data Analyst, you would be working with either of these and maybe more.
Although it is not often required (except for data scientists), you may want to explore your data using a programming language such as R, Python, or Java. Fear not if you're hearing of these for the first time; the most important thing, as stated earlier, is driving business solutions using data.
Some trusted tools to help you complete your EDA in the shortest time possible:
DataPrep
AutoViz
Pandas Profiling
Sweetviz
This article will explain how to use DataPrep for your EDA in Python.
EDA using DataPrep
Data Analysis involves different stages such as getting the data, cleaning the data, and preparing the data for analysis. Data can be obtained from the web, from databases (e.g. SQL), or other resources. After getting the data, the next step is to prepare it for analysis.
The first step in data preparation is data cleaning. During this stage, various lines of code are written to clean up the data by removing duplicate records, missing values, and so on, ensuring that the outcome is a dataset worthy of exploration.
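To give a feel for that hand-written cleaning code, here is a minimal sketch in plain Python that drops exact duplicate rows and rows with a missing value. The tiny inline CSV is made up for illustration, and dropping incomplete rows is only one possible policy (filling them in is another).

```python
import csv
import io

# Made-up raw data: one exact duplicate row and one row with a missing Age
raw_csv = """Name,Age,Fare
Alice,29,7.25
Bob,,8.05
Alice,29,7.25
Carol,41,26.55
"""

rows = list(csv.DictReader(io.StringIO(raw_csv)))

cleaned = []
seen = set()
for row in rows:
    key = tuple(row.values())
    if key in seen:
        continue  # drop exact duplicate records
    seen.add(key)
    if any(value == "" for value in row.values()):
        continue  # drop records with a missing value
    cleaned.append(row)

print(f"{len(rows)} raw rows -> {len(cleaned)} clean rows")
```

With pandas, the same two steps are usually written as df.drop_duplicates() and df.dropna().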
Writing these lines of code for data preparation or cleaning can be stressful, but a dedicated tool takes much of that stress away.
The tool here is DataPrep, an open-source Python library built for performing data preparation tasks with very little code, often a single line. It can also detect missing data and surface insights available in the dataset.
How to set up DataPrep in your environment:
- Install DataPrep on your machine using:
!pip install dataprep
- Import the dependencies
import pandas as pd
import dataprep.connector
import dataprep.clean
from dataprep.eda import plot, plot_correlation, plot_missing, create_report
Note:
Each of the imported modules of the dataprep library has its own function.
dataprep.connector: This module is used to collect data from web APIs.
dataprep.clean: This module is used to clean and standardize the dataset and eliminate unwanted values.
dataprep.eda: This module of the DataPrep library is used to carry out efficient EDA on the dataset.
Practical Example of Using DataPrep for EDA
…remember the famous Titanic?
Let’s use a sample of the Titanic dataset to practice EDA using DataPrep.
# Import necessary libraries
import pandas as pd
from dataprep.eda import create_report
# Load a sample dataset (let's use the Titanic dataset)
url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
df = pd.read_csv(url)
# Generate an EDA report
report = create_report(df, title='Exploratory Data Analysis of Titanic Dataset')
# Display the report in a new browser tab
report.show_browser()
# You can also save the report as an HTML file
report.save('titanic_eda_report.html')
# Let's perform some specific analyses
# 1. Missing values in the dataset
from dataprep.eda import plot_missing
plot_missing(df)
# 2. Distribution of a numerical column
from dataprep.eda import plot
plot(df, "Age")
# 3. Relationship between two columns
plot(df, "Age", "Fare")
# 4. Correlation between numerical columns
from dataprep.eda import plot_correlation
plot_correlation(df)
# 5. Distribution of a categorical column
plot(df, "Survived")
# 6. Relationship between a categorical and a numerical column
plot(df, "Survived", "Age")
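# 7. Summary statistics
# create_report() expects a DataFrame, so pass the whole dataset and set the title by keyword
from dataprep.eda import create_report
create_report(df, title='Titanic Summary Statistics').show_browser()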
So, what exactly is happening up there in the code block?
First, the necessary libraries are imported.
Then the dataset is loaded using pandas.
The create_report() function from DataPrep is called to generate a comprehensive EDA report.
Going further:
- plot_missing() is called to visualize missing data in the dataset.
- plot() is used to visualize the distribution of individual columns and the relationships between columns.
- plot_correlation() shows the correlation between numerical columns.
- To check the relationships between categorical and numerical data, I created some specific plots.
- The 'Survived' column gets extra attention, plotted on its own and against 'Age', to draw detailed insights about survival rates.
Data Analysis requires you to know SQL, Excel, or both, in addition to a visualization tool such as Tableau or Power BI. Mastering them is the first step to becoming a pro Data Analyst.
Generally, sound decisions are based on efficient analysis, which is the product of a properly cleaned, analyzed, and visualized dataset. Using an EDA tool is the first step toward reaching a sound conclusion from that analysis.
Get acquainted with one or more of these tools as you continue your learning journey in the data field.