The following article is from Python Data Science. Akik Tongo takes off

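Exploratory data analysis (EDA) is an essential step in data analysis. It examines the statistical characteristics of the variables, which in turn guides feature engineering. This article compares three Python tools that speed it up: Pandas Profiling, Sweetviz, and PandasGUI.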

1. Pandas Profiling

This is the lightest and simplest of the three. It can quickly generate reports that give an overview of variables. First, we need to install the package.

# install the Jupyter widget extension
jupyter nbextension enable --py widgetsnbextension
# conda install, in a dedicated environment
conda create -n pandas-profiling
conda activate pandas-profiling
conda install -c conda-forge pandas-profiling
# or install with pip directly from the source archive
pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

After the installation is successful, you can import data to generate reports directly.

import pandas as pd
import seaborn as sns

# Load the example mpg dataset that ships with seaborn
mpg = sns.load_dataset('mpg')
mpg.head()

from pandas_profiling import ProfileReport

# Build the report; as the last expression in a notebook cell it renders inline
profile = ProfileReport(mpg, title='MPG Pandas Profiling Report', explorative=True)
profile

![](https://p1.pstatp.com/origin/pgc-image/da35c91f515545eb9eeafaf78dd05585)

Pandas Profiling is used to generate a quick report with good visualization. Instead of opening the report in a separate file, the notebook displays the results directly.

![](https://p1-tt-ipv6.byteimg.com/origin/pgc-image/c6ccdeaed66c44cf951813024b0e11e2)

Six sections are provided: overview, variables, interaction, correlation, missing values, and samples.

Pandas Profiling's variables section is thorough and generates a detailed report for each variable.

![](https://p3-tt-ipv6.byteimg.com/origin/pgc-image/2d052fdde09a4f13841d781e2e4a9502)

As the figure above shows, a great deal of information is available for even a single variable, including descriptive statistics and quantile statistics.
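To make the two kinds of statistics concrete, here is a minimal pandas sketch of what descriptive and quantile statistics for one variable look like. It is not how Pandas Profiling computes its report; the `cars` table and its values are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real mpg dataset.
cars = pd.DataFrame({
    "mpg": [18.0, 15.0, 36.0, 26.0, 30.0],
    "horsepower": [130.0, 165.0, 58.0, 90.0, 70.0],
})

# Descriptive statistics: count, mean, std, min, quartiles, max.
summary = cars["mpg"].describe()

# Quantile statistics at a few chosen probabilities.
quantiles = cars["mpg"].quantile([0.05, 0.25, 0.5, 0.75, 0.95])

print(summary)
print(quantiles)
```

The report gives the same kind of numbers for every column at once, together with histograms, so there is no need to write this per variable.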

Interactions

![](https://p1.pstatp.com/origin/pgc-image/0f94b1c7d3d8451fadbd7450e3ebe593)

In the interactions section, we can view a scatter plot between any two numerical variables.

Correlations

The correlations section shows how pairs of variables are related to each other.

![](https://p1.pstatp.com/origin/pgc-image/342f551999a4444d92b33077cae1a03e)
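For readers who want to see what such a pairwise relationship looks like in numbers, the sketch below computes a Pearson correlation matrix with plain pandas. The choice of Pearson and the made-up `cars` table are assumptions for illustration, not something taken from the report above.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real mpg dataset.
cars = pd.DataFrame({
    "mpg": [10.0, 20.0, 30.0, 40.0],
    "weight": [4000.0, 3000.0, 2000.0, 1000.0],
    "acceleration": [12.0, 16.0, 14.0, 18.0],
})

# Pairwise Pearson correlation between all numeric columns.
pearson = cars.corr()

print(pearson.round(2))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.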

Missing values

This section shows the missing value count for each variable.

![](https://p1.pstatp.com/origin/pgc-image/798aca2392364106a8991ebc75069228)
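The same per-variable count can be reproduced by hand in pandas, which is a handy cross-check on the report. This is a minimal sketch on a made-up table, not the code the report runs.

```python
import pandas as pd

nan = float("nan")

# Made-up rows for illustration only; not taken from the real mpg dataset.
cars = pd.DataFrame({
    "mpg": [18.0, nan, 36.0, 26.0],
    "horsepower": [130.0, 165.0, nan, nan],
    "origin": ["usa", "usa", "japan", "europe"],
})

# Number of missing values in each column.
missing_count = cars.isna().sum()

# Share of missing values in each column (between 0 and 1).
missing_share = cars.isna().mean()

print(missing_count)
print(missing_share)
```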

Sample

Sample rows from the dataset are displayed to give a feel for the data.

![](https://p1.pstatp.com/origin/pgc-image/f0fb09f8dc3f422cb501e59384ef7148)

2. Sweetviz

Sweetviz is another open source Python package that generates attractive EDA reports with just one line of code. Unlike Pandas Profiling, it outputs a fully self-contained HTML application.

Install the package using pip:

pip install sweetviz

Once installed, Sweetviz can be used to generate reports, so let’s give it a try.

import sweetviz as sv

# Analyze the dataset with mpg as the target feature
my_report = sv.analyze(mpg, target_feat='mpg')
my_report.show_html()

![](https://p6-tt-ipv6.byteimg.com/origin/pgc-image/7cb528a15088435b8c618444dc0e567f)

As you can see from the figure above, the Sweetviz report generates similar content to the previous Pandas Profiling, but with a different UI.

![](https://p1.pstatp.com/origin/pgc-image/608cf00846e84b298605713d1c16b89d)

Sweetviz shows not only the distribution and statistical characteristics of each variable but also, once a target variable is set, how every other variable relates to that target. The far right of the report above shows the numerical associations and categorical associations for all existing variables.
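As a rough analogue of analysis against a target, the sketch below relates each feature to a target column using plain pandas. It is not Sweetviz's own association measure; the correlation and group-mean approach, the `cars` table, and the `target` variable are all assumptions made for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real mpg dataset.
cars = pd.DataFrame({
    "mpg": [10.0, 20.0, 30.0, 40.0],
    "weight": [4000.0, 3000.0, 2000.0, 1000.0],
    "origin": ["usa", "usa", "japan", "japan"],
})
target = "mpg"

# Numeric features: Pearson correlation of each one with the target.
numeric_features = cars.select_dtypes("number").drop(columns=[target])
numeric_assoc = numeric_features.corrwith(cars[target])

# Categorical feature: mean of the target within each category.
target_by_origin = cars.groupby("origin")[target].mean()

print(numeric_assoc)
print(target_by_origin)
```

A large gap between the group means, or a correlation far from zero, suggests the feature carries information about the target.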

Sweetviz's strength lies not in the EDA report for a single dataset, but in comparing datasets.

Datasets can be compared in two ways: by comparing two separate datasets (for example, training and test sets), or by splitting one dataset into subpopulations with a filter. The example below splits the data into two subsets, USA and not-USA.

# Rows where the condition is True form the first group, the rest the second.
# The origin values in seaborn's mpg dataset are lowercase.
my_report = sv.compare_intra(mpg, mpg["origin"] == "usa", ["USA", "not-USA"], target_feat='mpg')
my_report.show_html()

![](https://p1.pstatp.com/origin/pgc-image/f5409a37be2a4a5490f7f3a9bdd6dd43)

This lets us analyze the variables quickly without writing much code, which cuts down the work in EDA and leaves more time for analysis and variable selection.
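The idea behind splitting one dataset with a filter can be sketched in plain pandas: a boolean condition divides the rows into two groups, and the same summary is computed for each. This only illustrates the concept on a made-up table and is not what Sweetviz runs internally.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real mpg dataset.
cars = pd.DataFrame({
    "mpg": [15.0, 17.0, 31.0, 33.0, 28.0],
    "origin": ["usa", "usa", "japan", "japan", "europe"],
})

# The boolean condition splits the rows into two subpopulations.
is_usa = cars["origin"] == "usa"
usa, not_usa = cars[is_usa], cars[~is_usa]

# Same summary for both groups, side by side.
comparison = pd.DataFrame({
    "USA": usa["mpg"].describe(),
    "not-USA": not_usa["mpg"].describe(),
})

print(comparison)
```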

Some of Sweetviz’s strengths are:

  • The ability to analyze a dataset against a target variable
  • The ability to compare two datasets

But there are some disadvantages:

  • There is no visualization between variables, such as a scatter plot
  • The report opens in a separate tab

3. PandasGUI

PandasGUI differs from the previous two in that it does not generate a report. Instead, it opens a GUI (graphical user interface) view of the data that we can use to analyze our DataFrame in more detail.

First, install PandasGUI.

# install with pip
pip install pandasgui
# or install from source
pip install git+https://github.com/adamerose/pandasgui.git

Then, run a few lines of code to try it out.

from pandasgui import show

gui = show(mpg)

![](https://p6-tt-ipv6.byteimg.com/origin/pgc-image/4b05a3f752b2408ebd7cf342941c7db8)

In this GUI, you can do many things, such as filtering, statistics, creating charts between variables, and reshaping data. This can be done by dragging tabs as needed.

![](https://p1.pstatp.com/origin/pgc-image/0cd6145502ff4e44bc19b3919b05ae5e)
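For comparison, the same kind of filtering and summary statistics can be written by hand in pandas. The sketch below assumes a made-up table and an arbitrary filter; it says nothing about how PandasGUI implements its filters.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real mpg dataset.
cars = pd.DataFrame({
    "mpg": [15.0, 25.0, 30.0, 34.0],
    "cylinders": [8, 4, 4, 4],
    "origin": ["usa", "usa", "japan", "japan"],
})

# Keep the efficient four-cylinder cars using a query expression.
efficient = cars.query("mpg > 20 and cylinders == 4")

# Summary statistics of the filtered rows, grouped by origin.
stats = efficient.groupby("origin")["mpg"].agg(["count", "mean"])

print(stats)
```

The GUI replaces this typing with point-and-click controls, which is where most of its convenience comes from.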

Take the statistics view, for example.

![](https://p1.pstatp.com/origin/pgc-image/bb73ab3056174ef388591862064145f9)

The best part is the plotting feature. Its drag-and-drop workflow feels much like Excel, with almost no learning curve.

![](https://p1.pstatp.com/origin/pgc-image/c685772b2c294a3c8472f9c76492efc4)

You can also reshape the data by creating a new pivot table or melting the dataset.

The processed dataset can then be exported directly to CSV.

![](https://p1.pstatp.com/origin/pgc-image/a0e2bd0942e0473cba0cbbe78a22fb3b)
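The two reshaping operations and the CSV export have direct pandas counterparts, sketched below on a made-up table. The column choices are assumptions for illustration, and the code is not taken from PandasGUI.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real mpg dataset.
cars = pd.DataFrame({
    "origin": ["usa", "usa", "japan", "japan"],
    "cylinders": [8, 4, 4, 4],
    "mpg": [15.0, 25.0, 30.0, 34.0],
})

# Pivot: one row per origin, one column per cylinder count, mean mpg in the cells.
pivot = cars.pivot_table(index="origin", columns="cylinders", values="mpg", aggfunc="mean")

# Melt: go from wide to long, one (origin, variable, value) row per cell.
long = cars.melt(id_vars="origin", value_vars=["cylinders", "mpg"],
                 var_name="variable", value_name="value")

# Export as CSV text; pass a file path instead to write to disk.
csv_text = long.to_csv(index=False)

print(pivot)
print(csv_text)
```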

Some of the advantages of PandasGUI are:

  • Drag-and-drop interaction
  • Fast data filtering
  • Quick plotting

The disadvantages are:

  • It does not provide complete statistics
  • It cannot generate a report

4. Conclusion

Pandas Profiling, Sweetviz, and PandasGUI are all good tools designed to simplify EDA. Each has its own advantages and suits a different workflow. In summary:

  • Pandas Profiling is suited to quickly generating a detailed analysis of each individual variable.
  • Sweetviz is suited to comparing datasets and to analysis against a target variable.
  • PandasGUI is suited to in-depth, hands-on analysis with drag-and-drop.