Pandas advanced tutorial: detailed explanation of plot drawing

Introduction

Matplotlib is one of the most widely used plotting libraries in Python and makes it easy to analyze data visually. This article explains in detail how pandas uses matplotlib through its plot() API.

Basic drawing

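To use matplotlib, import it first. The examples below also use numpy and pandas under their usual aliases np and pd:

In [1]: import matplotlib.pyplot as plt

In [2]: import numpy as np

In [3]: import pandas as pd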

Suppose we want to generate 365 days of random data starting from January 1, 2020, and plot it as a line chart. The code looks like this:

ts = pd.Series(np.random.randn(365), index=pd.date_range("1/1/2020", periods=365))

ts.plot()

Use a DataFrame to plot multiple Series at the same time:

df3 = pd.DataFrame(np.random.randn(365, 4), index=ts.index, columns=list("ABCD"))

df3 = df3.cumsum()

df3.plot()

You can specify which columns are used for the x and y axes:

df3 = pd.DataFrame(np.random.randn(365, 2), columns=["B", "C"]).cumsum()

df3["A"] = pd.Series(list(range(len(df3))))

df3.plot(x="A", y="B");
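The steps above can be collected into one self-contained script. This is only a sketch: the Agg backend line (for running without a display), the seeded random generator, and the variable names are my own assumptions, not part of the original snippets.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless script; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the figures are reproducible
index = pd.date_range("2020-01-01", periods=365)

# A Series draws a single line; cumsum() turns the noise into a random walk.
ts = pd.Series(rng.standard_normal(365), index=index).cumsum()
fig1, ax1 = plt.subplots()
ts.plot(ax=ax1)

# A DataFrame draws one line per column on a single set of axes.
frame = pd.DataFrame(
    rng.standard_normal((365, 4)), index=index, columns=list("ABCD")
).cumsum()
ax_frame = frame.plot()

# x= and y= select the columns used for the horizontal and vertical axes.
xy = pd.DataFrame(rng.standard_normal((365, 2)), columns=["B", "C"]).cumsum()
xy["A"] = np.arange(len(xy))
ax_xy = xy.plot(x="A", y="B")
```

In a notebook the figures are shown inline; in a script you would typically call savefig on the figure returned by ax.get_figure().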

Other images

plot() supports many chart types, including bar, hist, box, density, area, scatter, hexbin, and pie. Let's see how to use them with examples.

bar

df3.iloc[5].plot(kind="bar");

Bar chart with multiple columns:

df2 = pd.DataFrame(np.random.rand(10, 4), columns=["a", "b", "c", "d"])

df2.plot.bar();

stacked bar

df2.plot.bar(stacked=True);

barh

barh draws a horizontal bar chart:

df2.plot.barh(stacked=True);

Histograms

df2.plot.hist(alpha=0.5);

box

df2.plot.box();

The colors of the box plot components can be customized:

color = {
    "boxes": "DarkGreen",
    "whiskers": "DarkOrange",
    "medians": "DarkBlue",
    "caps": "Gray",
}

df2.plot.box(color=color, sym="r+");

The boxes can also be drawn horizontally:

df2.plot.box(vert=False);

In addition to box, you can also use DataFrame.boxplot to draw box plots:

In [42]: df = pd.DataFrame(np.random.rand(10, 5))

In [44]: bp = df.boxplot()

boxplot can group the data with the by argument:

df = pd.DataFrame(np.random.rand(10, 2), columns=["Col1", "Col2"])

df
Out[90]: 
       Col1      Col2
0  0.047633  0.150047
1  0.296385  0.212826
2  0.562141  0.136243
3  0.997786  0.224560
4  0.585457  0.178914
5  0.551201  0.867102
6  0.740142  0.003872
7  0.959130  0.581506
8  0.114489  0.534242
9  0.042882  0.314845

df.boxplot()

Now add a column to df:

df["X"] = pd.Series(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

df
Out[92]: 
       Col1      Col2  X
0  0.047633  0.150047  A
1  0.296385  0.212826  A
2  0.562141  0.136243  A
3  0.997786  0.224560  A
4  0.585457  0.178914  A
5  0.551201  0.867102  B
6  0.740142  0.003872  B
7  0.959130  0.581506  B
8  0.114489  0.534242  B
9  0.042882  0.314845  B

bp = df.boxplot(by="X")

Area

Use Series.plot.area() or DataFrame.plot.area() to draw area graphs.

In [60]: df = pd.DataFrame(np.random.rand(10, 4), columns=["a", "b", "c", "d"])

In [61]: df.plot.area();

If you don't want the areas to be stacked, specify stacked=False:

In [62]: df.plot.area(stacked=False);

Scatter

DataFrame.plot.scatter() creates scatter plots.

In [63]: df = pd.DataFrame(np.random.rand(50, 4), columns=["a", "b", "c", "d"])

In [64]: df.plot.scatter(x="a", y="b");

A third variable can be encoded as the color of the points with c:

df.plot.scatter(x="a", y="b", c="c", s=50);

It can also be encoded as the size of the points with s:

df.plot.scatter(x="a", y="b", s=df["c"] * 200);
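Color and size can be combined in one call to show four columns at once. The sketch below is an illustration under my own assumptions (seeded data, the "viridis" colormap, column "d" used for size, Agg backend for headless runs):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless script
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((50, 4)), columns=["a", "b", "c", "d"])

# Position comes from a and b, color from c, marker area from d.
ax = df.plot.scatter(x="a", y="b", c="c", s=df["d"] * 200, colormap="viridis")

# The points are stored in a single matplotlib collection.
points = ax.collections[0]
```

Because c names a numeric column, pandas also draws a colorbar next to the plot.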

Hexagonal bin

Use DataFrame.plot.hexbin() to create a hexagonal bin (honeycomb) plot:

In [69]: df = pd.DataFrame(np.random.randn(1000, 2), columns=["a", "b"])

In [70]: df["b"] = df["b"] + np.arange(1000)

In [71]: df.plot.hexbin(x="a", y="b", gridsize=25);

By default, the color of each hexagon represents the number of points that fall into that bin. To aggregate another column instead, pass it as C and choose the aggregation with reduce_C_function, for example np.mean, np.max, np.sum, or np.std.

In [72]: df = pd.DataFrame(np.random.randn(1000, 2), columns=["a", "b"])

In [73]: df["b"] = df["b"] + np.arange(1000)

In [74]: df["z"] = np.random.uniform(0, 3, 1000)

In [75]: df.plot.hexbin(x="a", y="b", C="z", reduce_C_function=np.max, gridsize=25);
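To make the difference between the two modes concrete, the following sketch draws the same data twice, once colored by count and once by the maximum of z, and then reads the per-hexagon values back from the plot. The seeded generator, variable names, and Agg backend are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless script
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 1000
df = pd.DataFrame(rng.standard_normal((n, 2)), columns=["a", "b"])
df["b"] = df["b"] + np.arange(n)
df["z"] = rng.uniform(0, 3, n)

# Default: each hexagon is colored by how many points fall into it.
ax_count = df.plot.hexbin(x="a", y="b", gridsize=25)

# With C and reduce_C_function: each hexagon shows max(z) of its points.
ax_max = df.plot.hexbin(x="a", y="b", C="z", reduce_C_function=np.max, gridsize=25)

counts = np.asarray(ax_count.collections[0].get_array(), dtype=float)
maxima = np.asarray(ax_max.collections[0].get_array(), dtype=float)
```

Since z is drawn from the interval [0, 3), no value in maxima can exceed 3, while counts holds the number of points per bin.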

Pie

Use DataFrame.plot.pie() or Series.plot.pie() to build a pie chart:

In [76]: series = pd.Series(3 * np.random.rand(4), index=["a", "b", "c", "d"], name="series")

In [77]: series.plot.pie(figsize=(6, 6));

For a DataFrame, pass subplots=True to draw one pie chart per column:

In [78]: df = pd.DataFrame(
   ....:     3 * np.random.rand(4, 2), index=["a", "b", "c", "d"], columns=["x", "y"]
   ....: )

In [79]: df.plot.pie(subplots=True, figsize=(8, 4));

Handling NaN data when plotting

The table below shows how each plot type handles NaN values by default:

Plot type	NaN handling
Line	Leave gaps at NaNs
Line (stacked)	Fill 0’s
Bar	Fill 0’s
Scatter	Drop NaNs
Histogram	Drop NaNs (column-wise)
Box	Drop NaNs (column-wise)
Area	Fill 0’s
KDE	Drop NaNs (column-wise)
Hexbin	Drop NaNs
Pie	Fill 0’s
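If a default from the table is not what you want, you can fill or interpolate the missing values yourself before plotting. The following is a small sketch using a hypothetical five-point series; the choice of linear interpolation is only one of several reasonable options.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless script
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])

fig, ax = plt.subplots()

# Default line behavior: the NaN at position 2 leaves a gap in the line.
s.plot(ax=ax, marker="o", label="raw (gap at NaN)")

# Linear interpolation replaces the NaN with the midpoint of its neighbors.
filled = s.interpolate()
filled.plot(ax=ax, style="--", label="interpolated")

ax.legend()
```

Series.fillna(0) or Series.dropna() can be used in the same way when zero filling or dropping is more appropriate.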

Other drawing tools

Scatter matrix

You can use scatter_matrix in pandas.plotting to draw a scatter matrix chart:

In [83]: from pandas.plotting import scatter_matrix

In [84]: df = pd.DataFrame(np.random.randn(1000, 4), columns=["a", "b", "c", "d"])

In [85]: scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal="kde");

Density plot

Use Series.plot.kde() and DataFrame.plot.kde() to draw a density plot:

In [86]: ser = pd.Series(np.random.randn(1000))

In [87]: ser.plot.kde();

Andrews curves

Andrews curves plot multivariate data as a large number of curves, where the attributes of each sample are used as the coefficients of a Fourier series. Coloring the curves by class makes clusters visible: curves of samples from the same class usually lie closer together and form larger structures.

In [88]: from pandas.plotting import andrews_curves

In [89]: data = pd.read_csv("data/iris.data")

In [90]: plt.figure();

In [91]: andrews_curves(data, "Name");
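To show what "coefficients of a Fourier series" means, here is a sketch that evaluates the standard Andrews formula f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + ... by hand. The helper andrews_curve and the small synthetic frame (used instead of the iris file) are hypothetical, and the pandas implementation may differ in details.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless script
import numpy as np
import pandas as pd
from pandas.plotting import andrews_curves


def andrews_curve(row, t):
    """Evaluate one Andrews curve for the attribute values in row."""
    row = np.asarray(row, dtype=float)
    result = np.full_like(t, row[0] / np.sqrt(2.0))
    for i, coeff in enumerate(row[1:], start=1):
        harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            result += coeff * np.sin(harmonic * t)
        else:
            result += coeff * np.cos(harmonic * t)
    return result


t = np.linspace(-np.pi, np.pi, 200)

# Synthetic stand-in for the iris data: four attributes plus a class column.
rng = np.random.default_rng(3)
frame = pd.DataFrame(rng.random((6, 4)), columns=["w", "x", "y", "z"])
frame["Name"] = ["class_1"] * 3 + ["class_2"] * 3

manual = andrews_curve(frame.loc[0, ["w", "x", "y", "z"]], t)
ax = andrews_curves(frame, "Name")
```

Each row of the frame produces one curve, so samples with similar attribute values produce similar curves.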

Parallel coordinates

Parallel coordinates is a technique for plotting multivariate data. It lets you see clusters in the data and estimate other statistics visually. Each vertical line represents one attribute, and each data point is drawn as a set of connected line segments across those lines. Points that tend to cluster appear closer together.

In [92]: from pandas.plotting import parallel_coordinates

In [93]: data = pd.read_csv("data/iris.data")

In [94]: plt.figure();

In [95]: parallel_coordinates(data, "Name");

Lag plot

A lag plot is a scatter plot of a time series against a lagged copy of itself. It can be used to observe autocorrelation: random data shows no visible structure, while autocorrelated data does.

In [96]: from pandas.plotting import lag_plot

In [97]: plt.figure();

In [98]: spacing = np.linspace(-99 * np.pi, 99 * np.pi, num=1000)

In [99]: data = pd.Series(0.1 * np.random.rand(1000) + 0.9 * np.sin(spacing))

In [100]: lag_plot(data);
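The points in a lag plot are simply the pairs (y(t), y(t + lag)). The sketch below builds those pairs by hand for lag 1 and measures their correlation; the seeded generator and the names pairs and r are my own additions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless script
import numpy as np
import pandas as pd
from pandas.plotting import lag_plot

rng = np.random.default_rng(4)
spacing = np.linspace(-99 * np.pi, 99 * np.pi, num=1000)
data = pd.Series(0.1 * rng.random(1000) + 0.9 * np.sin(spacing))

lag = 1
# Pair every value with the value that follows it `lag` steps later.
pairs = pd.DataFrame(
    {
        "y(t)": data.iloc[:-lag].to_numpy(),
        "y(t + lag)": data.iloc[lag:].to_numpy(),
    }
)
r = pairs.corr().iloc[0, 1]  # Pearson correlation of the pairs

ax = lag_plot(data, lag=lag)
```

For this mostly sinusoidal series the pairs are strongly correlated, which is why the lag plot shows a clear shape instead of a shapeless cloud.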

Autocorrelation plot

Autocorrelation plots are often used to check for randomness in a time series. The horizontal axis shows the lag and the vertical axis shows the autocorrelation coefficient at that lag. For random data, the coefficients stay close to zero at all lags.

In [101]: from pandas.plotting import autocorrelation_plot

In [102]: plt.figure();

In [103]: spacing = np.linspace(-9 * np.pi, 9 * np.pi, num=1000)

In [104]: data = pd.Series(0.7 * np.random.rand(1000) + 0.3 * np.sin(spacing))

In [105]: autocorrelation_plot(data);
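As a rough numerical cross-check, Series.autocorr(lag) returns the correlation between a series and its lagged copy. The sketch below compares pure noise with the noisy sine from the example. Note that autocorrelation_plot uses a slightly different estimator, so the values are close to but not identical with the plotted ones; the variable names are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless script
import numpy as np
import pandas as pd
from pandas.plotting import autocorrelation_plot

rng = np.random.default_rng(5)
spacing = np.linspace(-9 * np.pi, 9 * np.pi, num=1000)

noise = pd.Series(rng.standard_normal(1000))
signal = pd.Series(0.7 * rng.random(1000) + 0.3 * np.sin(spacing))

# Autocorrelation coefficients for the first 20 lags of each series.
lags = range(1, 21)
acf_noise = pd.Series({lag: noise.autocorr(lag=lag) for lag in lags})
acf_signal = pd.Series({lag: signal.autocorr(lag=lag) for lag in lags})

ax = autocorrelation_plot(signal)
```

The noise coefficients stay near zero, while the sine component keeps the coefficients of the second series clearly positive at small lags.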

Bootstrap plot

The bootstrap plot is used to visually assess the uncertainty of a statistic such as the mean, median, or midrange. A random subset of the specified size is selected from the data set, the statistic is computed for that subset, and the process is repeated a specified number of times. The resulting line plots and histograms make up the bootstrap plot.

In [106]: from pandas.plotting import bootstrap_plot

In [107]: data = pd.Series(np.random.rand(1000))

In [108]: bootstrap_plot(data, size=50, samples=500, color="grey");
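The resampling procedure described above can be sketched with numpy alone. This version handles only the mean, and it draws each subset without replacement to mirror the "random subset" wording; both choices, along with the variable names, are assumptions on my part.

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.random(1000)

size = 50      # number of values in each random subset
samples = 500  # number of times the procedure is repeated

# Compute the mean of each random subset.
means = np.array(
    [rng.choice(data, size=size, replace=False).mean() for _ in range(samples)]
)

# The spread of the subset means indicates how uncertain the mean is.
spread = means.std()
```

A histogram of means corresponds to one panel of the figure drawn by bootstrap_plot.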

RadViz

RadViz is based on a simple spring tension minimization algorithm. Each feature of the data set is assigned an anchor point, and the anchors are spaced evenly on the unit circle. Each sample starts at the center of the circle and is "pulled" toward every anchor by a spring whose stiffness is proportional to the sample's normalized value for that feature. The sample is drawn at the position where these forces balance.

In [109]: from pandas.plotting import radviz

In [110]: data = pd.read_csv("data/iris.data")

In [111]: plt.figure();

In [112]: radviz(data, "Name");

Plot style

Since version 1.5, matplotlib ships with a number of predefined styles, which can be applied with matplotlib.style.use(my_plot_style).

You can list all available styles with matplotlib.style.available. The exact list depends on the installed matplotlib version:

import matplotlib.pyplot as plt

plt.style.available
Out[128]: 
['seaborn-dark',
 'seaborn-darkgrid',
 'seaborn-ticks',
 'fivethirtyeight',
 'seaborn-whitegrid',
 'classic',
 '_classic_test',
 'fast',
 'seaborn-talk',
 'seaborn-dark-palette',
 'seaborn-bright',
 'seaborn-pastel',
 'grayscale',
 'seaborn-notebook',
 'ggplot',
 'seaborn-colorblind',
 'seaborn-muted',
 'seaborn',
 'Solarize_Light2',
 'seaborn-paper',
 'bmh',
 'seaborn-white',
 'dark_background',
 'seaborn-poster',
 'seaborn-deep']
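A style can be applied globally with plt.style.use, or only to a block of code with plt.style.context. The sketch below uses the "ggplot" style on a small random frame; the data, the seed, and the Agg backend are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless script
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame(rng.standard_normal((100, 4)), columns=list("ABCD")).cumsum()

# Inspect a few of the styles that this matplotlib installation provides.
print(plt.style.available[:5])

# The style applies only inside the with block.
with plt.style.context("ggplot"):
    fig, ax = plt.subplots()
    df.plot(ax=ax)
```

Using the context manager keeps the style change local, so later plots are not affected.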

Hide the legend

By default, the plot includes a legend that identifies each column. It can be disabled with legend=False:

In [115]: df = pd.DataFrame(np.random.randn(1000, 4), index=pd.date_range("1/1/2000", periods=1000), columns=list("ABCD"))

In [116]: df = df.cumsum()

In [117]: df.plot(legend=False);

Set the axis labels

Use the xlabel and ylabel arguments to override the default axis labels:

In [118]: df.plot();

In [119]: df.plot(xlabel="new x", ylabel="new y");

Log scale

If the values on an axis span several orders of magnitude, the small values are hard to see on a linear scale. Pass logy=True to use a logarithmic scale on the Y axis:

In [120]: ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))

In [121]: ts = np.exp(ts.cumsum())

In [122]: ts.plot(logy=True);
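To see the effect, the same series can be drawn side by side with a linear and a logarithmic Y axis. This is a sketch; the two-panel layout, titles, and seed are my own choices.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless script
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
ts = pd.Series(
    rng.standard_normal(1000), index=pd.date_range("2000-01-01", periods=1000)
)
ts = np.exp(ts.cumsum())  # values spread over many orders of magnitude

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))
ts.plot(ax=ax_lin, title="linear y")
ts.plot(ax=ax_log, logy=True, title="logy=True")
```

The logx and loglog arguments work in the same way for the X axis and for both axes.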

Multiple Y axes

Use secondary_y to plot selected columns on a secondary Y axis on the right:

In [125]: plt.figure();

In [126]: ax = df.plot(secondary_y=["A", "B"])

In [127]: ax.set_ylabel("CD scale");

In [128]: ax.right_ax.set_ylabel("AB scale");
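A secondary axis is most useful when the columns have very different magnitudes. In the sketch below, columns A and B are scaled by 100 (an arbitrary factor chosen for illustration) so that they would flatten C and D on a shared axis.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless script
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
df = pd.DataFrame(
    rng.standard_normal((1000, 4)),
    index=pd.date_range("2000-01-01", periods=1000),
    columns=list("ABCD"),
).cumsum()
df[["A", "B"]] = df[["A", "B"]] * 100  # make A and B much larger than C and D

# A and B use the right-hand axis; C and D stay on the left-hand axis.
ax = df.plot(secondary_y=["A", "B"])
ax.set_ylabel("CD scale")
ax.right_ax.set_ylabel("AB scale")
```

The returned object is the left-hand axes, and the right-hand axes are reachable through its right_ax attribute.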

By default, "(right)" is appended to the legend labels of the columns plotted on the secondary axis. To remove it, set mark_right=False:

In [129]: plt.figure();

In [130]: df.plot(secondary_y=["A", "B"], mark_right=False);

Axis tick adjustment

pandas automatically adjusts the tick resolution on the x axis for regular-frequency time series. If this produces unsuitable or misaligned tick labels, pass x_compat=True to suppress the adjustment:

In [133]: plt.figure();

In [134]: df["A"].plot(x_compat=True);

If several plots need the same setting, use a with block:

In [135]: plt.figure();

In [136]: with pd.plotting.plot_params.use("x_compat", True):
   .....:     df["A"].plot(color="r")
   .....:     df["B"].plot(color="g")
   .....:     df["C"].plot(color="b")
   .....:

Subplots

When plotting a DataFrame, each Series can be drawn in its own subplot:

In [137]: df.plot(subplots=True, figsize=(6, 6));

You can change the layout of the subplots:

df.plot(subplots=True, layout=(2, 3), figsize=(6, 6), sharex=False);

Passing -1 for one dimension lets pandas infer it from the number of series and the other dimension:

In [139]: df.plot(subplots=True, layout=(2, -1), figsize=(6, 6), sharex=False);
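With subplots=True, plot() returns an array of matplotlib Axes, one per subplot, which is handy for further customization. The sketch below uses a four-column frame with two rows and an inferred number of columns; the data and the title loop are hypothetical additions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless script
import numpy as np
import pandas as pd

rng = np.random.default_rng(10)
df = pd.DataFrame(rng.standard_normal((100, 4)), columns=list("ABCD")).cumsum()

# Two rows; -1 lets pandas work out how many columns are needed.
axes = df.plot(subplots=True, layout=(2, -1), figsize=(6, 6), sharex=False)

# Inspect the resulting grid of axes.
print(axes.shape)

# Every element is a regular matplotlib Axes object.
for ax, name in zip(np.asarray(axes).ravel(), df.columns):
    ax.set_title(f"column {name}")
```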

A more complex example:

In [140]: fig, axes = plt.subplots(4, 4, figsize=(9, 9))

In [141]: plt.subplots_adjust(wspace=0.5, hspace=0.5)

In [142]: target1 = [axes[0][0], axes[1][1], axes[2][2], axes[3][3]]

In [143]: target2 = [axes[3][0], axes[2][1], axes[1][2], axes[0][3]]

In [144]: df.plot(subplots=True, ax=target1, legend=False, sharex=False, sharey=False);

In [145]: (-df).plot(subplots=True, ax=target2, legend=False, sharex=False, sharey=False);

Draw a table

If you set table=True, the data is displayed as a table below the plot:

In [165]: fig, ax = plt.subplots(1, 1, figsize=(7, 6.5))

In [166]: df = pd.DataFrame(np.random.rand(5, 3), columns=["a", "b", "c"])

In [167]: ax.xaxis.tick_top()  # Display x-axis ticks on top.

In [168]: df.plot(table=True, ax=ax)

fig

A table can also be placed inside the plot area with pandas.plotting.table:

In [172]: from pandas.plotting import table

In [173]: fig, ax = plt.subplots(1, 1)

In [174]: table(ax, np.round(df.describe(), 2), loc="upper right", colWidths=[0.2, 0.2, 0.2]);

In [175]: df.plot(ax=ax, ylim=(0, 2), legend=None);

Use Colormaps

If a DataFrame has many columns, the default line colors may be hard to tell apart. In that case, pass a colormap:

In [176]: df = pd.DataFrame(np.random.randn(1000, 10), index=ts.index)

In [177]: df = df.cumsum()

In [178]: plt.figure();

In [179]: df.plot(colormap="cubehelix");

This article has been included in http://www.flydean.com/09-python-pandas-plot/