Exploratory data analysis (EDA) is the most important step in the data modelling process. Once the data has been gathered, we need answers to a few questions before moving on to the feature engineering and feature selection steps:
- Are there any missing values in the data?
- Is there a correlation between the dependent and independent variables?
- Are there any outliers in the dataset?
- What kind of distribution is present in the data?
Answering these questions leads to a proper strategy for feature selection as well as feature engineering. A good ML strategy helps in creating the best model.
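As a rough illustration, the four questions above can each be answered with a line or two of pandas. This is a minimal sketch on a made-up dataset: the column names, the values, and the choice of the 1.5 × IQR rule for outliers are my assumptions, not the only way to do it.

```python
import pandas as pd

# Tiny made-up dataset; "price" plays the role of the dependent variable.
df = pd.DataFrame({
    "area":  [50, 60, 70, 80, 90, 100, 110, 400],
    "rooms": [1, 2, 2, 3, 3, 4, None, 5],
    "price": [100, 120, 140, 160, 180, 200, 220, 800],
})

# 1. Missing values: number of NaNs in each column.
missing = df.isna().sum()

# 2. Correlation of each independent variable with the dependent one.
corr_with_target = df.corr()["price"].drop("price")

# 3. Outliers: rows whose "area" lies more than 1.5 * IQR outside the quartiles.
q1, q3 = df["area"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["area"] < q1 - 1.5 * iqr) | (df["area"] > q3 + 1.5 * iqr)]

# 4. Distribution: skewness near 0 suggests a roughly symmetric column.
skewness = df.skew()

print(missing, corr_with_target, outliers, skewness, sep="\n\n")
```

The tools below automate checks like these and add plots on top, but it helps to know what they compute underneath.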
Let’s look at some tools and Python libraries, and the features that have helped me quickly analyze structured data and do some initial exploration to understand it.
Microsoft Excel
Excel can read .xls, .csv, and .tsv files, which are the most common file types for structured data. It lets users sort and filter, group and slice data, apply conditional formatting, and convert tables to charts.
It is easy to generate scatter plots, pie charts, bar graphs, and line graphs from data tables and embed them in PowerPoint. I personally use Excel for initial analysis because I can quickly import the charts into PowerPoint slides when presenting to my team.
Excel is a very powerful tool for data analysis, and there is much more to it than what I have covered so far. Other features include hypothesis testing, what-if analysis, and dashboards.
Pandas Profiling
This tool generates a data profile; all you have to do is pass in a pandas DataFrame as an argument. It is completely automated and performs basic analysis of the size of the data, missing values, correlations, and interactions.
This is how to generate a profiling report:
import pandas as pd
from pandas_profiling import ProfileReport
from pandas_profiling.utils.cache import cache_file

# Read the Titanic Dataset
file_name = cache_file(
    "titanic.csv",
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
)
df = pd.read_csv(file_name)

# Generate the Profiling Report
ProfileReport(
    df, title="Titanic Dataset", html={"style": {"full_width": True}}, sort=None
)
The profiling report will look like this:
a. Overview:
b. Missing Values and Descriptive statistics for each column:
Pandas Profiling is an easy-to-use Python library for performing data analysis. Note that newer releases of the library are published under the name ydata-profiling. To install Pandas Profiling, please refer to this link. If you would like to try it out yourself, please refer to this notebook.
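To get a feel for what the overview and per-column sections contain, here is a rough hand-rolled approximation in plain pandas. It is only a sketch of the same kind of numbers, not how the library computes its report; the rows are made up (loosely shaped like Titanic columns) and the names `overview` and `per_column` are my own.

```python
import pandas as pd

# Made-up rows, loosely shaped like a few Titanic columns.
df = pd.DataFrame({
    "age":  [22.0, 38.0, None, 35.0, 35.0],
    "sex":  ["male", "female", "female", "male", "male"],
    "fare": [7.25, 71.28, 7.92, 8.05, 8.05],
})

# Dataset-level overview: size, missing cells, duplicate rows.
overview = {
    "variables": df.shape[1],
    "observations": df.shape[0],
    "missing_cells": int(df.isna().sum().sum()),
    "duplicate_rows": int(df.duplicated().sum()),
}

# Column-level summary: type, share of missing values, distinct values.
per_column = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing_pct": df.isna().mean() * 100,
    "distinct": df.nunique(),
})

print(overview)
print(per_column)
```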
Dython
Another Python library for automated data analysis is Dython. What sets Dython apart from the tools discussed so far is that it can easily distinguish between numerical and categorical features.
We can also easily measure the associations between categorical features. Let’s look at some examples.
a. Associations:
import pandas as pd
from dython.nominal import associations

# Download and load data from UCI (the file has no header row)
df = pd.read_csv(
    'http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data',
    header=None,
)
df.columns = ['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']

# Plot features associations
associations(df, nom_nom_assoc='theil', figsize=(15, 15))
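As I understand it, the 'theil' option refers to Theil's U (the uncertainty coefficient), an asymmetric measure between 0 and 1 of how much knowing one categorical feature tells you about another. The sketch below is my own standard-library implementation of the textbook formula U(x|y) = (H(x) - H(x|y)) / H(x), not Dython's code; the function names and toy values are made up for illustration.

```python
import math
from collections import Counter

def entropy(x):
    """Shannon entropy H(x) of a sequence of categories."""
    n = len(x)
    return -sum((c / n) * math.log(c / n) for c in Counter(x).values())

def conditional_entropy(x, y):
    """Conditional entropy H(x | y)."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    return -sum(
        (c / n) * math.log(c / y_counts[y_val])
        for (_, y_val), c in xy_counts.items()
    )

def theils_u(x, y):
    """U(x | y): fraction of the uncertainty in x that is explained by y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so there is nothing left to explain
    return (h_x - conditional_entropy(x, y)) / h_x

# Toy columns: "odor" fully determines "edible", but not the other way round.
odor = ["almond", "anise", "foul", "musty"]
edible = ["yes", "yes", "no", "no"]

print(theils_u(edible, odor))  # knowing odor, how much of edible is explained
print(theils_u(odor, edible))  # knowing edible, how much of odor is explained
```

The two calls give different values, which is why a heatmap based on this measure is not symmetric around its diagonal.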
b. Split Histogram:
import pandas as pd
from sklearn import datasets
from dython.data_utils import split_hist

# Load data and convert to DataFrame
data = datasets.load_breast_cancer()
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df['malignant'] = [not bool(x) for x in data.target]

# Plot histogram
split_hist(df, 'mean radius', split_by='malignant', bins=20, figsize=(15,7))
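The idea behind a split histogram is to bin one numeric column separately for each value of a second column, using the same bin edges so the groups are directly comparable. Here is a minimal sketch of that idea with NumPy and pandas, without the plotting; it is my own approximation on made-up values, not Dython's implementation.

```python
import numpy as np
import pandas as pd

# Made-up measurements with a boolean label to split by.
df = pd.DataFrame({
    "radius":    [10.0, 11.0, 12.0, 13.0, 18.0, 19.0, 20.0, 21.0],
    "malignant": [False, False, False, False, True, True, True, True],
})

# Compute the bin edges once, over the whole column, so every group shares them.
edges = np.histogram_bin_edges(df["radius"], bins=4)

# Count the values that fall into each bin, separately for each group.
counts = {
    bool(group): np.histogram(values, bins=edges)[0]
    for group, values in df.groupby("malignant")["radius"]
}

for group, group_counts in counts.items():
    print(group, group_counts)
```

Plotting each group's counts over the same edges gives the overlaid picture: here the two groups barely overlap, which suggests the column separates them well.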
Dython is really good at analyzing categorical features, and it makes it easy to create quick graphs and plots. Get more info from this link.
Mito
Mito is another great tool for exploratory data analysis. It has a GUI from which you can import any dataset, perform transformations on the data, and create graphs and plots. The good part is that it automatically generates the Python code for those operations as you perform them.
Let’s see some features of Mito.
a. Summary Stats:
Mito’s column summary statistics provide a quick and simple method to analyze column data.
b. Graphing:
Mito graphing is intended to assist you in developing intuition about your data and creating presentation-ready graphs to express insights.
There is much more to Mito than summary statistics and graphing: we can also sort and filter, build pivot tables, deduplicate rows, and more. You can explore more through this link.
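For reference, the operations mentioned above look roughly like this when written by hand in pandas. This is my own sketch on made-up data; the code Mito actually generates may be structured and named differently.

```python
import pandas as pd

# Made-up sales data containing one exact duplicate row.
df = pd.DataFrame({
    "city":  ["Pune", "Delhi", "Pune", "Delhi", "Pune"],
    "year":  [2020, 2020, 2021, 2021, 2021],
    "sales": [100, 150, 120, 90, 120],
})

# Deduplication: drop rows that repeat an earlier row exactly.
deduped = df.drop_duplicates()

# Filter: keep rows with sales of at least 100.
filtered = deduped[deduped["sales"] >= 100]

# Sort: highest sales first.
sorted_df = filtered.sort_values("sales", ascending=False)

# Pivot table: total sales per city (rows) and year (columns).
pivot = deduped.pivot_table(
    index="city", columns="year", values="sales", aggfunc="sum"
)

print(sorted_df)
print(pivot)
```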
Conclusion
All of the tools discussed above come with their own benefits. They are easy to use, fast, and convenient for quick data analysis. There are alternative tools and libraries as well, so explore, try, and test them, and pick the best fit for your work.