Tools for quick initial insight into structured data

Rajat Roy
5 min read · Apr 26, 2022

Exploratory data analysis is the most important step in the data modelling process. Once the data has been gathered, we need answers to a few questions before moving on to feature engineering and feature selection.

  1. Are there any missing values in the data?
  2. Is there a correlation between the dependent and independent variables?
  3. Are there any outliers in the dataset?
  4. What kind of distribution does the data follow?

Answering these questions leads to a sound strategy for feature selection and feature engineering, and a good ML strategy helps in building the best possible model.
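As a rough illustration, the four questions above can be answered with a few lines of plain pandas. This is only a minimal sketch: the function name quick_eda, the toy dataset and the 1.5 × IQR outlier rule are my own assumptions for the example, not part of any of the tools discussed below.

```python
import numpy as np
import pandas as pd


def quick_eda(df: pd.DataFrame, target: str) -> dict:
    """Answer the four questions for the numeric columns of df (illustrative helper)."""
    numeric = df.select_dtypes(include="number")

    # 1. Missing values: number of NaNs per column.
    missing = df.isna().sum()

    # 2. Correlation of every numeric feature with the (numeric) target column.
    correlation = numeric.corr()[target].drop(target)

    # 3. Outliers: values further than 1.5 * IQR from the quartiles (an assumed rule).
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outliers = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum()

    # 4. Distribution shape: skewness near 0 suggests a roughly symmetric column.
    skewness = numeric.skew()

    return {
        "missing": missing,
        "correlation": correlation,
        "outliers": outliers,
        "skewness": skewness,
    }


# Tiny made-up dataset, for illustration only.
toy = pd.DataFrame({
    "age": [22, 25, 31, 35, 41, 44, 52, 95],
    "income": [20, 24, np.nan, 33, 40, 45, 51, 58],
    "spend": [5, 6, 8, 9, 11, 12, 14, 16],
})
report = quick_eda(toy, target="spend")
for name, result in report.items():
    print(name, result, sep="\n")
```

The tools in the rest of this article automate checks of this kind and add visualisations on top.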

Let’s look at some tools and Python libraries, and the features that have helped me quickly analyze structured data and do some initial exploration to understand it.

Microsoft Excel

Excel can read .xls, .csv and .tsv files, which are the most common file types for structured data. It lets users sort and filter, group and slice data, apply conditional formatting, and convert tables to charts.

Conditional formatting in Excel

It is easy to generate scatter plots, pie charts, bar graphs and line graphs from data tables and embed them into PowerPoint. I personally use Excel for initial analysis because I can quickly import the charts into PowerPoint slides and present them to my team.

Creating charts/graphs in Excel

Excel is a very powerful tool for data analysis, and there is much more to it than what I have discussed so far. Other features include hypothesis testing, what-if analysis and dashboards.
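For readers who prefer to stay in Python, the sort, filter and grouping operations mentioned for Excel have rough pandas counterparts. The sketch below is only an illustration: the sales data is made up, and an in-memory string stands in for a .tsv file on disk.

```python
import io

import pandas as pd

# Hypothetical tab-separated data standing in for a .tsv file.
tsv_text = (
    "city\tproduct\tunits\n"
    "Pune\tpen\t10\n"
    "Pune\tbook\t4\n"
    "Delhi\tpen\t7\n"
    "Delhi\tbook\t12\n"
)

# sep="\t" reads TSV; the default sep="," reads CSV.
sales = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Sort: largest orders first.
sorted_sales = sales.sort_values("units", ascending=False)

# Filter: keep only rows with at least 7 units.
big_orders = sales[sales["units"] >= 7]

# Group: total units per city.
units_per_city = sales.groupby("city")["units"].sum()

print(sorted_sales)
print(big_orders)
print(units_per_city)
```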

Pandas Profiling

This tool generates a data profile; all you have to do is pass in a pandas DataFrame as an argument. It’s completely automated and performs basic analysis of the size of the data, missing values, correlations and interactions.

This is how to generate a profiling report:

import pandas as pd
from pandas_profiling import ProfileReport
from pandas_profiling.utils.cache import cache_file

# Read the Titanic Dataset
file_name = cache_file(
    "titanic.csv",
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
)
df = pd.read_csv(file_name)

# Generate the Profiling Report
ProfileReport(
    df, title="Titanic Dataset", html={"style": {"full_width": True}}, sort=None
)

The profiling report looks like this:

a. Overview:

Pandas Profile Report Overview

b. Missing Values and Descriptive statistics for each column:

Pandas Profile Report Stats

Pandas Profiling is an easy-to-use Python library for performing data analysis. To install Pandas Profiling, please refer to this link. If you would like to try it out yourself, please refer to this notebook.
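To get a feel for what the overview and per-column statistics contain, here is a rough pandas-only approximation. It is a sketch under my own assumptions, not how Pandas Profiling computes its report: the overview helper, its key names and the sample data are all made up for illustration.

```python
import numpy as np
import pandas as pd


def overview(df: pd.DataFrame) -> dict:
    """Dataset-level numbers similar in spirit to a profiling overview (illustrative)."""
    n_rows, n_cols = df.shape
    missing_cells = int(df.isna().sum().sum())
    return {
        "variables": n_cols,
        "observations": n_rows,
        "missing_cells": missing_cells,
        "missing_cells_pct": 100 * missing_cells / (n_rows * n_cols),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Made-up sample with one missing value and one duplicated row.
sample = pd.DataFrame({
    "age": [22, 38, np.nan, 35, 35],
    "sex": ["male", "female", "female", "male", "male"],
})
summary = overview(sample)
print(summary)

# Per-column descriptive statistics for numeric and non-numeric columns alike.
print(sample.describe(include="all"))
```

The real report goes much further, adding histograms, correlation matrices and interaction plots on top of numbers like these.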

Dython

Another Python library for automated data analysis is Dython. What sets Dython apart from the tools discussed so far is that it can easily distinguish between numerical and categorical features.
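The idea of separating the two kinds of features can be sketched with pandas alone. The snippet below is an approximation based on column dtypes, not Dython's own code, and Dython's exact rule may differ; the data is made up.

```python
import pandas as pd

# Made-up frame mixing numeric and non-numeric columns.
people = pd.DataFrame({
    "age": [22, 38, 26],
    "fare": [7.25, 71.28, 7.92],
    "sex": ["male", "female", "female"],
    "embarked": pd.Categorical(["S", "C", "S"]),
})

# Assumption for this sketch: numeric dtypes are numerical features,
# and every other column is treated as categorical.
numerical_cols = people.select_dtypes(include="number").columns.tolist()
categorical_cols = people.select_dtypes(exclude="number").columns.tolist()

print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```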

Also, we can easily find out the associations between categorical features. Let’s look into some examples.

a. Associations:

import pandas as pd
from dython.nominal import associations

# Download and load data from UCI
# The file has no header row, so header=None keeps the first record as data
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data', header=None)
df.columns = ['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']

# Plot features associations
associations(df, nom_nom_assoc='theil', figsize=(15, 15))

Dython associations heatmap
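The nom_nom_assoc='theil' option refers to Theil's U, also called the uncertainty coefficient: the fraction of the uncertainty in one categorical feature that is removed by knowing another. Unlike a correlation coefficient, it is asymmetric. The following is a hand-written sketch of the measure using natural-log entropy, not Dython's implementation; the function names and toy data are my own.

```python
import numpy as np
import pandas as pd


def entropy(x: pd.Series) -> float:
    """Shannon entropy H(x) in nats."""
    p = x.value_counts(normalize=True).to_numpy()
    p = p[p > 0]  # guard against zero-count categories
    return float(-(p * np.log(p)).sum())


def conditional_entropy(x: pd.Series, y: pd.Series) -> float:
    """H(x | y): entropy of x within each group of y, weighted by group size."""
    weights = y.value_counts(normalize=True)
    return float(sum(w * entropy(x[y == level]) for level, w in weights.items()))


def theils_u(x: pd.Series, y: pd.Series) -> float:
    """U(x | y) = (H(x) - H(x | y)) / H(x)."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so there is no uncertainty left to explain
    return (h_x - conditional_entropy(x, y)) / h_x


# Made-up data in which odor fully determines class, but not the other way round.
toy = pd.DataFrame({
    "odor": ["foul", "foul", "none", "none", "almond", "almond"],
    "class": ["poisonous", "poisonous", "edible", "edible", "edible", "edible"],
})
u_class_given_odor = theils_u(toy["class"], toy["odor"])
u_odor_given_class = theils_u(toy["odor"], toy["class"])
print(u_class_given_odor, u_odor_given_class)
```

Because U(x | y) and U(y | x) generally differ, a heatmap built from this measure is not symmetric about its diagonal.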

b. Split Histogram:

import pandas as pd
from sklearn import datasets
from dython.data_utils import split_hist

# Load data and convert to DataFrame
data = datasets.load_breast_cancer()
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df['malignant'] = [not bool(x) for x in data.target]

# Plot histogram
split_hist(df, 'mean radius', split_by='malignant', bins=20, figsize=(15,7))

Dython split histogram
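A split histogram is simply one histogram per value of the split column, drawn over shared bins so that the groups are directly comparable. The sketch below computes the underlying counts with NumPy on synthetic data; it only illustrates the idea and is not what split_hist does internally. The column names and distribution parameters are made up.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a numeric feature and a boolean label.
rng = np.random.default_rng(0)
cells = pd.DataFrame({
    "radius": np.concatenate([rng.normal(12, 2, 300), rng.normal(17, 3, 200)]),
    "malignant": [False] * 300 + [True] * 200,
})

# Shared bin edges computed on the full column.
bin_edges = np.histogram_bin_edges(cells["radius"], bins=20)

# One array of bin counts per value of the split column.
counts = {
    bool(label): np.histogram(group["radius"], bins=bin_edges)[0]
    for label, group in cells.groupby("malignant")
}

for label, bin_counts in counts.items():
    print(label, bin_counts)
```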

Dython is really good at analyzing categorical features, and it makes it easy to create quick graphs and plots. Get more info from this link.

Mito

Mito is another great tool for exploratory data analysis. It has a GUI from which you can import any dataset, perform transformations on the data, and create graphs and plots. The good part is that it automatically generates Python code as you perform those operations.

Let’s see some features of Mito.

a. Summary Stats:

Mito’s column summary statistics provide a quick and simple method to analyze column data.

mito summary stats

b. Graphing:

Mito graphing is intended to assist you in developing intuition about your data and creating presentation-ready graphs to express insights.

mito graph

There is much more to Mito than summary statistics and graphing: we can also sort and filter, build pivot tables, deduplicate and more. You can explore more through this link.
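For a sense of what deduplication and a pivot table amount to in code, here is a hand-written pandas sketch. It is not actual Mito output, and the order data is made up for illustration.

```python
import pandas as pd

# Made-up order data containing one exact duplicate row.
orders = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "product": ["pen", "book", "pen", "pen", "book"],
    "units": [10, 4, 7, 7, 12],
})

# Deduplication: drop rows that repeat an earlier row exactly.
deduped = orders.drop_duplicates()

# Pivot table: regions as rows, products as columns, summed units as values.
pivot = deduped.pivot_table(
    index="region", columns="product", values="units", aggfunc="sum"
)

print(deduped)
print(pivot)
```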

Conclusion

All of the tools discussed above come with their own benefits. They are easy, fast and convenient ways to do some quick data analysis. There are alternative tools and libraries as well, so explore, try and test them, and pick the best fit for your work.
