SweetViz
## A package is a collection of modules.
## A library is a collection of packages.
Hi there. Today, we are going to see how to use the SweetViz library in Python, which enables us to perform powerful Exploratory Data Analysis (EDA) on a dataset. So, let's get started.
First, you will have to pip install this package, as it does not ship with Python. You can do so from the command prompt with pip install sweetviz, or with !pip install sweetviz from a Jupyter notebook environment.
I will be using the USA Housing data in this example.
!pip install sweetviz
import numpy as np
import pandas as pd
import sweetviz
df = pd.read_csv(r"C:\Users\Sharan Babu\Desktop\Data science\original\Refactored_Py_DS_ML_Bootcamp-master\11-Linear-Regression\USA_housing.csv")
df.head()
# In this dataset, the 'Price' column is the target feature (the dependent variable).
analysis = sweetviz.analyze([df,"EDA"], target_feat='Price')
type(analysis)
analysis.show_html('EDA.html')
This is an amazing visualization library: you instantly get various insights into your data that you could have derived manually, but only with a lot more time.
For numerical features, you get a point plot, a histogram, the number of missing values, the number of distinct values, quartile values, and other useful information such as the skewness of the column.
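To get a feel for what the numeric summary saves you, here is a minimal pandas sketch that computes a few of the same statistics by hand. The tiny `sample` DataFrame and its values are made up for illustration; they are not taken from the USA Housing data, and the numbers SweetViz reports may be computed slightly differently.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the USA Housing dataset.
sample = pd.DataFrame({"Price": [100.0, 150.0, 150.0, 200.0, np.nan, 400.0]})

col = sample["Price"]
summary = {
    "missing": int(col.isna().sum()),   # number of missing values
    "distinct": int(col.nunique()),     # distinct non-missing values
    "q1": float(col.quantile(0.25)),    # first quartile
    "median": float(col.median()),      # second quartile
    "q3": float(col.quantile(0.75)),    # third quartile
    "skew": float(col.skew()),          # sample skewness (positive = longer right tail)
}
print(summary)
```

Doing this for every column, plus the plots, is exactly the repetitive work the report takes off your hands.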
For categorical features, along with the number of distinct and missing values, you
Additionally, you get the 'Associations', or pairwise correlations between two variables, which give a first indication of which features are related to the target.
You can also use this library to compare two DataFrames, say, your training set and test set, and infer some meaning from the comparison.
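As a rough illustration of the idea behind associations, the sketch below computes plain Pearson correlations between numeric columns with pandas and ranks the features by how strongly they correlate with the target. The `toy` DataFrame and its column names are hypothetical, and this covers only numeric-to-numeric correlation, not everything an association view may show.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 5, 6],
    "Age":   [30, 5, 22, 11, 17],
    "Price": [110, 160, 205, 260, 310],
})

# Pairwise Pearson correlations between all numeric columns.
corr = toy.corr()

# Absolute correlation of each feature with the target, strongest first.
with_target = corr["Price"].drop("Price").abs().sort_values(ascending=False)
print(with_target)
```

A high correlation with the target is only a hint, not proof, that a feature will be useful in a model.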
train = df[:3000]
test = df[3000:]
# Consider 'test' to be the Test data.
# The command to perform EDA comparison is:
analysis = sweetviz.compare([train,"Train"],[test,"Test"], "Price") # Price is the target variable common to both tables
# Now you can view your results.
analysis.show_html('EDA2.html')
Now you can see the comparison between the Train and Test datasets, differentiated by color, for all the parameters discussed above.
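Note that the slices `df[:3000]` and `df[3000:]` split the data by row position. If the rows happen to be ordered in some way, a shuffled split may give more comparable sets. Here is a minimal sketch of that idea with pandas, using a made-up DataFrame, a hypothetical 75/25 ratio, and illustrative names (`toy`, `toy_train`, `toy_test`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Price": rng.normal(loc=1_000_000, scale=250_000, size=200)})

# Randomly pick 75% of the rows (reproducibly) for training; the rest is the test set.
toy_train = toy.sample(frac=0.75, random_state=42)
toy_test = toy.drop(toy_train.index)

# Quick manual comparison of the target distribution in both sets.
comparison = pd.DataFrame({
    "Train": toy_train["Price"].describe(),
    "Test": toy_test["Price"].describe(),
})
print(comparison)
```

The two resulting DataFrames could then be passed to sweetviz.compare in the same way as train and test above.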
Therefore, this is a handy module