Data Visualization with Python

In [ ]:

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter magic command to display the plots inline in the notebook
%matplotlib inline

In [ ]:

df = pd.read_csv('Automobile (1).csv')
# df = pd.read_csv('/location on your computer/Automobile (1).csv')

In [ ]:

df.head()

Out[4]:

img

In [ ]:

df.shape

Out[5]:

(201, 26)

• The data has 201 rows and 26 columns.


In [ ]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 201 entries, 0 to 200
Data columns (total 26 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   symboling            201 non-null    int64
 1   normalized_losses    201 non-null    int64
 2   make                 201 non-null    object
 3   fuel_type            201 non-null    object
 4   aspiration           201 non-null    object
 5   number_of_doors      201 non-null    object
 6   body_style           201 non-null    object
 7   drive_wheels         201 non-null    object
 8   engine_location      201 non-null    object
 9   wheel_base           201 non-null    float64
 10  length               201 non-null    float64
 11  width                201 non-null    float64
 12  height               201 non-null    float64
 13  curb_weight          201 non-null    int64
 14  engine_type          201 non-null    object
 15  number_of_cylinders  201 non-null    object
 16  engine_size          201 non-null    int64
 17  fuel_system          201 non-null    object
 18  bore                 201 non-null    float64
 19  stroke               201 non-null    float64
 20  compression_ratio    201 non-null    float64
 21  horsepower           201 non-null    int64
 22  peak_rpm             201 non-null    int64
 23  city_mpg             201 non-null    int64
 24  highway_mpg          201 non-null    int64
 25  price                201 non-null    int64
dtypes: float64(7), int64(9), object(10)
memory usage: 41.0+ KB

• There are attributes of different types (int, float, object) in the data.


In [ ]:

df.describe(include='all').T

Out[7]:


img

• The car price ranges from 5118 to 45400 units.
• The car weight ranges from 1488 to 4066 units.
• The most common car make in the data is Toyota.
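The observations above are read off the transposed summary table. As a minimal sketch of where those values sit, the snippet below builds a tiny hypothetical DataFrame (the name cars and its four rows are made up for illustration and are not the Automobile data) and pulls the same fields out of describe().

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: four made-up rows
cars = pd.DataFrame({
    "make": ["toyota", "toyota", "audi", "bmw"],
    "price": [5118, 9000, 17450, 45400],
})

# One row per column; numeric rows fill min/max, object rows fill top/freq
summary = cars.describe(include="all").T

price_min = summary.loc["price", "min"]   # smallest price
price_max = summary.loc["price", "max"]   # largest price
top_make = summary.loc["make", "top"]     # most frequent category

print(price_min, price_max, top_make)
```

For numeric columns the top and freq cells are empty (NaN), and for object columns the min and max cells are empty, which is why the table mixes both kinds of statistics.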


Histogram

• A histogram is a univariate plot which helps us understand the distribution of a continuous numerical variable.
• It breaks the range of the continuous variable into intervals of equal length and then counts the number of observations in each interval.
• We will use the histplot() function of seaborn to create histograms.
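The binning step described above can be sketched with NumPy alone. The values array is a small made-up sample chosen so the counts are easy to verify by hand; it is not taken from the dataset.

```python
import numpy as np

# Hypothetical sample of a continuous variable
values = np.array([1.0, 2.0, 2.5, 3.0, 4.5, 5.0, 7.0, 9.0])

# Split the range [1, 9] into 4 intervals of equal length and count
# how many observations fall into each interval
counts, edges = np.histogram(values, bins=4)

print(edges)   # interval boundaries
print(counts)  # number of observations per interval
```

Every interval is closed on the left and open on the right, except the last one, which also includes its right edge. A histogram plot draws one bar per interval with its height equal to the count.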


In [ ]:

sns.histplot(data=df, x='price')

Out[8]:

<AxesSubplot:xlabel='price', ylabel='Count'>


Let's see how we can customize a histogram.


In [ ]:

plt.title('Histogram:Price')
plt.xlim(3000,50000)
plt.ylim(0,70)
plt.xlabel('Price of cars')
plt.ylabel('Frequency')
sns.histplot(data=df, x='price',color='orange');
img

We can specify the number of intervals (or groups or bins) to create by setting the bins parameter.

• If bins is not specified, the default value 'auto' is used. Whatever value is given is passed to numpy.histogram_bin_edges() (https://numpy.org/doc/stable/reference/generated/numpy.histogram_bin_edges.html).


In [ ]:

sns.histplot(data=df, x='price', bins=5)

Out[10]:

<AxesSubplot:xlabel='price', ylabel='Count'>

img

In [ ]:

sns.histplot(data=df, x='price', bins=20)

Out[11]:

<AxesSubplot:xlabel='price', ylabel='Count'>

img

If we want to specify the width of the intervals (or groups or bins) instead of their number, we can use the binwidth parameter.


In [ ]:

sns.histplot(data=df, x='price', binwidth=20)

Out[12]:

<AxesSubplot:xlabel='price', ylabel='Count'>

img

In [ ]:

sns.histplot(data=df, x='price', binwidth=200)

Out[13]:

<AxesSubplot:xlabel='price', ylabel='Count'>

img
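To get a feel for how binwidth translates into a number of bars, the rough arithmetic below divides the price range by the bin width. The minimum and maximum are the values reported earlier in the summary table; the helper approx_bin_count is a hypothetical name introduced here, and the exact count seaborn draws may differ by one depending on the version.

```python
import math

# Price range as reported by df.describe() earlier in this notebook
price_min, price_max = 5118, 45400

def approx_bin_count(low, high, binwidth):
    """Approximate number of bins needed to cover [low, high]."""
    return math.ceil((high - low) / binwidth)

bins_20 = approx_bin_count(price_min, price_max, 20)
bins_200 = approx_bin_count(price_min, price_max, 200)

print(bins_20, bins_200)
```

With 201 observations, a bin width of 20 leaves most bins empty, which is why the first plot looks like a set of thin spikes.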

How to find the optimal number of bins: Rule of thumb

img
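The rule shown in the figure is not reproduced here. As an assumption about what such a rule looks like in practice, the sketch below tries three commonly cited rules that NumPy implements (Sturges, square root, and Freedman-Diaconis) on a synthetic right-skewed sample of 201 values. The sample is randomly generated and only stands in for the price column.

```python
import numpy as np

# Synthetic right-skewed sample with as many values as the dataset has rows
rng = np.random.default_rng(1)
sample = rng.lognormal(mean=9.4, sigma=0.5, size=201)

# Number of bins suggested by each rule (edges minus one)
bin_counts = {
    rule: len(np.histogram_bin_edges(sample, bins=rule)) - 1
    for rule in ("sturges", "sqrt", "fd")
}

print(bin_counts)
```

Sturges and the square-root rule depend only on the sample size, while Freedman-Diaconis also uses the interquartile range, so its suggestion changes with the spread of the data. Any of these names can be passed as the bins argument of histplot().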

In addition to the bars, we can also add a density estimate by setting the kde parameter to True.

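• Kernel Density Estimation, or KDE, produces a smooth estimate of the distribution of data over a continuous interval.
• In histplot(), the KDE curve is rescaled to match the histogram's y-axis: with the default count statistic, the estimated density is multiplied by the number of observations and the bin width.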


In [ ]:

sns.histplot(data=df, x='price', kde=True);
img

In [ ]:

sns.histplot(data=df, x='price', bins=700, kde=True);
img

Clearly, increasing the number of bins reduces the frequency count in each bin. Since the KDE curve is scaled by the number of observations times the bin width, narrower bins shrink the curve along with the bars, and the above code gives us a flattened KDE plot.
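The flattening can be checked numerically. The sketch below is an illustration under stated assumptions, not seaborn's own code: it uses a synthetic sample, a hand-written Gaussian KDE with Scott's rule bandwidth (the helper names gaussian_kde_on_grid and count_scale are hypothetical), and computes the factor that converts a density into the count scale for 10 and for 700 bins.

```python
import numpy as np

# Synthetic stand-in for df['price']
rng = np.random.default_rng(0)
values = rng.lognormal(mean=9.4, sigma=0.5, size=201)

def gaussian_kde_on_grid(samples, grid):
    """Evaluate a Gaussian KDE (Scott's rule bandwidth) at each grid point."""
    n = samples.size
    h = samples.std(ddof=1) * n ** (-1 / 5)
    z = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

pad = 3 * values.std()
grid = np.linspace(values.min() - pad, values.max() + pad, 2000)
density = gaussian_kde_on_grid(values, grid)

# A density integrates to about 1 regardless of the binning
area = density.sum() * (grid[1] - grid[0])

def count_scale(samples, bins):
    """Total bar area of a count histogram: observations times bin width."""
    counts, edges = np.histogram(samples, bins=bins)
    return (counts * np.diff(edges)).sum()

scale_coarse = count_scale(values, 10)
scale_fine = count_scale(values, 700)

# The curve drawn on a count histogram is the density times this factor
kde_peak_coarse = density.max() * scale_coarse
kde_peak_fine = density.max() * scale_fine
print(area, kde_peak_coarse, kde_peak_fine)
```

The density itself does not change with the number of bins; only the scale factor does. Going from 10 to 700 bins makes each bin 70 times narrower, so the curve's peak drops by the same factor.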

Let's check out the histograms for a few more attributes in the data.


In [ ]:

sns.histplot(data=df, x='curb_weight', kde=True);
img

• A histogram is said to be symmetric if the left-hand and right-hand sides resemble mirror images of each other when the histogram is cut down the middle.


In [ ]:

sns.histplot(data=df, x='horsepower', kde=True);
img

• The tallest clusters of bars, i.e., peaks, in a histogram represent the modes of the data.
• A histogram skewed to the right has a large number of occurrences on the left side of the plot and a few on the right side of the plot.
• Similarly, a histogram skewed to the left has a large number of occurrences on the right side of the plot and few on the left side of the plot.
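The direction of skew can also be checked with a number. The sketch below computes the standard moment-based sample skewness on synthetic data; sample_skewness is a helper written for this illustration, and the samples are randomly generated rather than taken from the dataset.

```python
import numpy as np

def sample_skewness(x):
    """Third standardized moment: positive for right skew, negative for left."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5

rng = np.random.default_rng(42)
right_skewed = rng.lognormal(mean=0.0, sigma=0.75, size=1000)  # long right tail
left_skewed = -right_skewed                                    # mirror image

skew_right = sample_skewness(right_skewed)
skew_left = sample_skewness(left_skewed)

print(skew_right, skew_left)
print(right_skewed.mean() > np.median(right_skewed))  # mean pulled toward the tail
```

A value near zero suggests a roughly symmetric distribution. For a pandas column, df['horsepower'].skew() gives a comparable (bias-corrected) figure.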


Histograms are intuitive, but they are hardly a good choice when we want to compare the distributions of several groups. For example,


In [ ]:

sns.histplot(data=df, x='price', hue='body_style', kde=True);
img

It might be better to use subplots!

In [ ]:

g = sns.FacetGrid(df, col="body_style")
g.map(sns.histplot, "price");
img

In such cases, we can use boxplots. Boxplots, or box-and-whiskers plots, are an excellent way to visualize differences among groups.


Boxplot

• A boxplot, or a box-and-whisker plot, shows the distribution and skewness of numerical data by displaying the data quartiles.
• It is also called a five-number summary plot, where the five-number summary includes the minimum value, first quartile, median, third quartile, and the maximum value.
• The boxplot() function of seaborn can be used to create a boxplot.
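The quantities behind a boxplot can be computed directly. The sketch below uses a small made-up array with one extreme value. Note that with seaborn's default settings the whiskers do not always reach the minimum and maximum: they stop at the most extreme data points within 1.5 times the interquartile range of the box, and anything beyond is drawn as an individual point.

```python
import numpy as np

# Hypothetical data with one extreme value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Box: first quartile, median, third quartile
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Whisker limits: 1.5 * IQR beyond the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Observations outside the fences are plotted as separate points
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(data.min(), q1, median, q3, data.max())
print(whisker_low, whisker_high, outliers)
```

Here the maximum is 100, but the upper whisker ends at 9 and the value 100 appears as an outlier.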


In [ ]:

from IPython.display import Image
Image('/content/drive/MyDrive/Python Course/boxplot.png')
#Image('/location on your computer/boxplot.png')

Out[20]:

img

In [ ]:

# creating a boxplot with seaborn
sns.boxplot(data=df, x='curb_weight');
img

Let's see how we can customize a boxplot.

In [ ]:

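# axes_style() only takes effect when used as a context manager,
# so the plot is created inside the with block
with sns.axes_style('whitegrid'):
    plt.title('Boxplot:Horsepower')
    plt.xlim(30,300)
    plt.xlabel('Horsepower')
    sns.boxplot(data=df, x='horsepower',color='green');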
img

• In a boxplot, when the median is closer to the left of the box and the whisker is shorter on the left end of the box, we say that the distribution is positively skewed (skewed right).
• Similarly, when the median is closer to the right of the box and the whisker is shorter on the right end of the box, we say that the distribution is negatively skewed (skewed left).


In [ ]:

from IPython.display import Image
Image('/content/drive/MyDrive/skew_box.png')
#Image('/location on your computer/skew_box.png')

Out[23]:

img

For example,

In [ ]:

sns.boxplot(data=df, x='price');
img

From the above plot, we can see that the distribution of price is positively skewed.
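The visual reading of the box can be backed by a quartile-based measure. One option, brought in here as an illustration and not used elsewhere in this notebook, is the quartile (Bowley) skewness coefficient, which compares how far the median sits from each end of the box. The data array is made up.

```python
import numpy as np

# Hypothetical right-skewed data
data = np.array([1, 2, 2, 3, 3, 3, 4, 5, 8, 15], dtype=float)

q1, median, q3 = np.percentile(data, [25, 50, 75])

# Positive when the median is closer to Q1 (left of the box),
# negative when it is closer to Q3 (right of the box)
quartile_skew = (q3 + q1 - 2 * median) / (q3 - q1)

print(q1, median, q3, quartile_skew)
```

The coefficient lies between -1 and 1, and a value of 0 means the median is exactly in the middle of the box.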


Let's see how we can compare groups with boxplots.

In [ ]:

sns.boxplot(data=df, x='body_style', y='price') ;
img

Although a boxplot visually summarizes variation in large datasets, it is unable to show multimodality and clusters.

In [ ]:

sns.boxplot(data=df, x='bore');
img

• From the above boxplot we cannot tell whether the data is bimodal, but the bimodality is clearly visible in the following histogram.


In [ ]:

sns.histplot(data=df, x='bore',kde = True);
img
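The same limitation can be reproduced with synthetic data. The sketch below mixes two normal samples (the centers 2.9 and 3.6 are arbitrary choices for illustration, not estimates from the bore column) and compares what the quartiles and the bin counts reveal.

```python
import numpy as np

# Synthetic bimodal sample: two well-separated clusters
rng = np.random.default_rng(0)
bimodal = np.concatenate([
    rng.normal(2.9, 0.1, 500),
    rng.normal(3.6, 0.1, 500),
])

# What a boxplot summarizes: three numbers that say nothing about a gap
q1, median, q3 = np.percentile(bimodal, [25, 50, 75])

# What a histogram shows: counts per bin, including the valley in between
counts, edges = np.histogram(bimodal, bins=15, range=(2.5, 4.0))
valley = counts[7]  # bin covering roughly 3.2 to 3.3

print(q1, median, q3)
print(counts)
```

The quartiles place the median in the nearly empty region between the two clusters, while the bin counts show two peaks separated by a valley.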

Bar Graph

• A bar graph is generally used to show the count of observations in each level (or group) of a categorical variable using bars.
• We can use the countplot() function of seaborn to plot a bar graph.


In [ ]:

sns.countplot(data=df, x='body_style');
img

We can also make the plot more granular by specifying the hue parameter to display counts for subgroups.

In [ ]:

sns.countplot(data=df, x='body_style', hue='fuel_type');
img
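The bar heights in these plots are plain frequency counts, which can be inspected as tables. The sketch below uses a six-row hypothetical DataFrame named cars (not the Automobile data) to show the numbers behind a count plot with and without hue.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset
cars = pd.DataFrame({
    "body_style": ["sedan", "sedan", "hatchback", "wagon", "sedan", "hatchback"],
    "fuel_type": ["gas", "diesel", "gas", "gas", "gas", "gas"],
})

# Heights of the bars in countplot(x='body_style')
counts = cars["body_style"].value_counts()

# Heights of the bars in countplot(x='body_style', hue='fuel_type'):
# one row per body style, one column per fuel type
counts_by_fuel = pd.crosstab(cars["body_style"], cars["fuel_type"])

print(counts)
print(counts_by_fuel)
```

Looking at the table is handy when a subgroup bar is too short to read off the plot.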

Let's check out the bar graphs for a few more attributes in the data.


In [ ]:

sns.countplot(data=df, x='make');
img

• This plot looks a little messy and congested.
• Let's increase the size of the plot to make it look better.


In [ ]:

plt.figure(figsize=(20,7))
sns.countplot(data=df, x='make');
img

• Some of the tick labels on the x-axis are overlapping with each other.
• Let's rotate the tick labels to make the plot easier to read.


In [ ]:

plt.figure(figsize=(20,7))
sns.countplot(data=df, x='make')
plt.xticks(rotation=90)

Out[32]:

(array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21]),
[Text(0, 0, 'alfa-romero'),
Text(1, 0, 'audi'),
Text(2, 0, 'bmw'),
Text(3, 0, 'chevrolet'),
Text(4, 0, 'dodge'),
Text(5, 0, 'honda'),
Text(6, 0, 'isuzu'),
Text(7, 0, 'jaguar'),
Text(8, 0, 'mazda'),
Text(9, 0, 'mercedes-benz'),
Text(10, 0, 'mercury'),
Text(11, 0, 'mitsubishi'),
Text(12, 0, 'nissan'),
Text(13, 0, 'peugot'),
Text(14, 0, 'plymouth'),
Text(15, 0, 'porsche'),
Text(16, 0, 'renault'),
Text(17, 0, 'saab'),
Text(18, 0, 'subaru'),
Text(19, 0, 'toyota'),
Text(20, 0, 'volkswagen'),
Text(21, 0, 'volvo')])

img

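• A lot of plot-specific text has shown up in the output. It is the return value of the last statement in the cell, plt.xticks(), which the notebook prints automatically.
• Let's see how we can get rid of it.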


In [ ]:

plt.figure(figsize=(20,7))
sns.countplot(data=df, x='make')
plt.xticks(rotation=90)
plt.show() # this will ensure that the plot is displayed without the text
img
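The extra text comes from the value that plt.xticks() returns. The sketch below, which uses matplotlib only and three made-up bar heights, captures that return value in a variable named result (a name chosen for this illustration) to show what it contains.

```python
import matplotlib.pyplot as plt

# Small hypothetical bar chart
fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(["audi", "bmw", "toyota"], [6, 8, 32])

# plt.xticks() returns a tuple: (tick locations, tick label objects).
# A notebook echoes the value of the last expression in a cell, so
# assigning it to a variable keeps it out of the output.
result = plt.xticks(rotation=90)

locations, labels = result
print(len(locations), [label.get_text() for label in labels])
```

Ending the last line of a cell with a semicolon has the same effect as assigning the result, which is why the earlier cells in this notebook end with one.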

Here are some common ways to customize a bar graph.


In [ ]:

plt.figure(figsize=(10,7))
plt.title('Barplot:Engine-type')
plt.ylim(0,180)
sns.countplot(data=df, x='engine_type',hue='fuel_type')
plt.xlabel('Engine-type');
img

About the Author



Silan Software is one of India's leading providers of offline and online training for Java, Python, AI (Machine Learning, Deep Learning), Data Science, Software Development, and many more emerging technologies.

We provide Academic Training || Industrial Training || Corporate Training || Internship || Java || Python || AI using Python || Data Science etc





