Plotting with Pandas: An introduction to Data Visualization
If you are a budding Data Scientist or Data Journalist, being able to visualize your data gives you the ability to better understand it.
Visualizing data gives you the opportunity to gain insights into the relationships between elements of that data, to spot correlations and dependencies, and to communicate those insights to others.
By following this article you will learn how to plot impressive graphics using Python and Pandas.
- Importing the appropriate libraries
- Getting data about the weather in London
- Produce a first Pandas visualization using the
- Find out how different types of charts are created
- Plotting simple charts: line charts, bar charts, pie charts and scatter diagrams
- Plotting statistical Pandas charts— spotting unusual events
- Box Plots — Showing the range of data
- Changing the number of bins to focus on the outliers — Just how often is it really, really wet?
- Pandas plot utilities — multiple plots and saving images
Getting started with data visualization in Python Pandas
You don’t need to be an expert in Python to be able to do this, although some exposure to programming in Python would be very useful, as would be a basic understanding of DataFrames in Pandas.
If you are familiar with Jupyter Notebooks then that might be a good platform on which to follow this tutorial but if you are happier with a straightforward Python editor or IDE then that’s fine, too.
We are going to explore the data visualization capabilities of Pandas. We’ll start by introducing the basics (line graphs, bar charts and pie charts) and then we’ll take a look at the more statistical views with histograms and box plots. Lastly, we’ll see how we can create multiple plots in one chart and how we can save charts as images, so we can use them in our own reports, documents and web pages.
Throughout the tutorial you will use a dataset about the weather in London, UK, and you’ll create a number of charts using that data. I have created this dataset from public domain information that is available from the UK Meteorological Office.
Plotting with Pandas
Fundamentally, Pandas Plot is a set of methods that can be used with a Pandas DataFrame to plot various graphs from the data contained in that DataFrame. It relies on a Python plotting library called matplotlib. The purpose of Pandas Plot is to simplify the creation of graphs and plots, so you don't need to know the details of how matplotlib works. You will, however, need to know one or two matplotlib commands, but they are very simple and I'll explain them as we get to them.
You can think of matplotlib as being a backend for Pandas Plot that takes care of the mechanics of creating a plot.
There are other plotting libraries built on top of matplotlib, too, such as Seaborn, and there are alternatives to using matplotlib with Pandas, for example, Bokeh. I'm not going to dwell on these alternatives in this tutorial but you can get more information in the links that I have provided at the end of this tutorial.
Importing the libraries
The first thing you need to do is import the Python libraries that we are going to use. There are three libraries that you need: numpy is a maths package, pandas gives us ways of storing and manipulating data in dataframes, and matplotlib, as mentioned above, provides the basic plotting functionality that Pandas uses to produce charts and graphs.
It's very possible that you already have the libraries. If not, you need to install them with pip or conda, e.g.
pip install numpy
pip install pandas
pip install matplotlib
Now you are ready to start programming!
My own preference for this type of work is to use a Jupyter Notebook. If you are familiar with them then you can follow the tutorial by typing each of the code blocks into a new Notebook cell and running them individually.
However, if you prefer to use a normal editor or IDE to create a single Python program, you can add the code blocks one after the other to create a program.
Now you can start up a Jupyter notebook or a new file in a Python editor and type in the following:
# The first line is only required if you are using a Jupyter Notebook
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
This, of course, imports the libraries that we need.
But you may not be familiar with the first line.
%matplotlib inline is specific to Jupyter Notebooks. It's a so-called magic command that ensures that the figures that we are going to plot will show up properly in the notebook when you run a cell. It only needs to be included in the Notebook once.
Getting the data
Before you start visualization you need to get some data.
I’ve created a couple of csv files of weather data from London. This is a simple data set that is derived from historical data from the UK Met Office. It records the maximum and minimum temperatures, the rainfall and the number of hours of sun for each month over a few decades.
There are two files: one is a record of several decades of data and the other is a subset of that data for the year 2018 only. You'll be using the subset for the first part of this tutorial and the larger one later, in the section about statistical plots.
The snippet of code below uses a Pandas DataFrame, weather, to hold the weather data and it loads the csv file into that DataFrame from a url.
The second line of code prints the DataFrame and this displays the data as a table.
Looking at the table you can see that the columns are labelled Year, Month, Tmax (maximum temperature), Tmin (minimum temperature), Rain (rainfall in millimetres) and Sun (hours of sunlight).
weather = pd.read_csv('https://raw.githubusercontent.com/alanjones2/dataviz/master/london2018.csv')
print(weather)
    Year  Month  Tmax  Tmin   Rain    Sun
 0  2018      1   9.7   3.8   58.0   46.5
 1  2018      2   6.7   0.6   29.0   92.0
 2  2018      3   9.8   3.0   81.2   70.3
 3  2018      4  15.5   7.9   65.2  113.4
 4  2018      5  20.8   9.8   58.4  248.3
 5  2018      6  24.2  13.1    0.4  234.5
 6  2018      7  28.3  16.4   14.8  272.5
 7  2018      8  24.5  14.5   48.2  182.1
 8  2018      9  20.9  11.0   29.4  195.0
 9  2018     10  16.5   8.5   61.0  137.0
10  2018     11  12.2   5.8   73.8   72.9
11  2018     12  10.7   5.2   60.6   40.3
mydata = pd.read_csv('mydata.csv')
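If you cannot reach the URL, the 2018 table is small enough to type in directly. The sketch below is one way to do that, assuming the values printed above are the whole file; the constant name LONDON_2018_CSV is only an illustrative choice.

```python
import io

import pandas as pd

# The 2018 figures, copied from the table printed above.
# LONDON_2018_CSV is a name made up for this sketch.
LONDON_2018_CSV = """Year,Month,Tmax,Tmin,Rain,Sun
2018,1,9.7,3.8,58.0,46.5
2018,2,6.7,0.6,29.0,92.0
2018,3,9.8,3.0,81.2,70.3
2018,4,15.5,7.9,65.2,113.4
2018,5,20.8,9.8,58.4,248.3
2018,6,24.2,13.1,0.4,234.5
2018,7,28.3,16.4,14.8,272.5
2018,8,24.5,14.5,48.2,182.1
2018,9,20.9,11.0,29.4,195.0
2018,10,16.5,8.5,61.0,137.0
2018,11,12.2,5.8,73.8,72.9
2018,12,10.7,5.2,60.6,40.3
"""

# read_csv accepts any file-like object, so a StringIO works like a file
weather = pd.read_csv(io.StringIO(LONDON_2018_CSV))
print(weather)
```

The resulting DataFrame has the same columns as the downloaded one, so the rest of the plotting code should work with it unchanged.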
A first Pandas Graph
weather is a Pandas dataframe. This is essentially a table for storing data, but, in addition, Pandas provides us with all sorts of functionality associated with a dataframe.
You don't need to go into all of the clever things that you can do with a Pandas DataFrame for this tutorial; instead you will be concentrating on the method used to plot a graph.
To plot a graph we use the method call weather.plot() and this, by default, will create a line graph.
We need to specify the x and y coordinates, and we do this by referencing the column names from the dataframe. The horizontal axis of the graph (the x coordinate) will be the month and the vertical axis (the y coordinate) will be Tmax, the maximum temperature for each month. The code looks like this:
weather.plot(y='Tmax', x='Month')
plt.show()
When you run this code, you'll see a graph in which the maximum temperature, Tmax, for each month of the year is plotted against those months.
The first line of the code above is the one that does the work of creating the plot. It calls the DataFrame method .plot() from the DataFrame weather and passes two parameters to that method: the first is the value for y, the value that will be plotted vertically, and the second is x, the value for the horizontal axis.
The second line of code refers to the matplotlib library: plt.show() is a function from that library and does what you would expect, it displays the graph.
The x and y parameters are fundamental to drawing a line chart but plot() can take a number of other parameters, some of which you will come across later.
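As a taste of those other parameters, here is a minimal sketch that sets a title and a figure size and switches off the legend. The title text and the axis label are made up for illustration, and only the Month and Tmax columns of the 2018 data are typed in so that the example runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; leave it out in a notebook
import pandas as pd

# Month and Tmax copied from the 2018 table
weather = pd.DataFrame({
    "Month": list(range(1, 13)),
    "Tmax": [9.7, 6.7, 9.8, 15.5, 20.8, 24.2, 28.3, 24.5, 20.9, 16.5, 12.2, 10.7],
})

# title, figsize and legend are ordinary plot() parameters
ax = weather.plot(
    x="Month",
    y="Tmax",
    title="London 2018: maximum temperature",  # illustrative title
    figsize=(8, 4),                            # width and height in inches
    legend=False,
)

# plot() returns a matplotlib Axes object, which can be adjusted further
ax.set_ylabel("Tmax (degrees Celsius)")
```

Because plot() hands back the matplotlib Axes, anything it does not expose as a parameter can usually be changed on that object afterwards.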
You are going to discover a few types of chart. We start with the simple ones.
Line charts are suitable for visualizing data that changes continuously over time, so are a good way to show temperature changes as these tend to be gradual.
Of course, you can create more than just one type of plot and you can specify which type you want in two ways: you can pass a parameter, or you can modify the function call. The code above does not specify the type of plot because the default in Pandas is a line plot. But if you wanted to be specific you could pass the type of plot in the kind parameter like this:
weather.plot(kind='line', y='Tmax', x='Month')
Alternatively, there is a specific method for a line plot that you can use like this:
weather.plot.line(y='Tmax', x='Month')
These two alternatives produce exactly the same result.
Later, you will see how to produce bar charts, pie charts, histograms and box plots. In each case, you can specify the type of plot using the kind parameter or use the method call for that type of plot.
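To convince yourself that the two spellings are interchangeable, you can draw the same chart both ways and compare what ends up on the axes. This is only a sketch, with the Month and Tmax columns of the 2018 data typed in by hand.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; leave it out in a notebook
import pandas as pd

weather = pd.DataFrame({
    "Month": list(range(1, 13)),
    "Tmax": [9.7, 6.7, 9.8, 15.5, 20.8, 24.2, 28.3, 24.5, 20.9, 16.5, 12.2, 10.7],
})

# Spelling 1: the kind parameter
ax_kind = weather.plot(kind="line", x="Month", y="Tmax")

# Spelling 2: the dedicated method
ax_method = weather.plot.line(x="Month", y="Tmax")

# Each Axes holds one line; compare the y values that were actually drawn
same = list(ax_kind.lines[0].get_ydata()) == list(ax_method.lines[0].get_ydata())
print(same)
```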
Multiple line plots
What if you wanted to plot both the maximum and minimum temperatures in the same figure? It's a reasonable thing to want to do and it's easy to do. You create a list of y values like this: ['Tmax','Tmin'] and assign this to the y parameter in the plot function.
Take a look at the following code and the resulting plot.
weather.plot(y=['Tmax','Tmin'], x='Month')
plt.show()
You can see that the code is almost identical to the first plot but the y parameter now contains values for both the maximum and minimum temperatures (Tmax and Tmin) in a list. The result is a graph with one line for each.
You could add more values to the list that is assigned to y if you wished to, but in this data set we don't have any more suitable values (it wouldn't make sense to plot, say, temperatures and rainfall on the same graph because they are different measurements with different units, i.e. degrees Celsius and millimetres).
So, just for illustrative purposes, we'll use a little Pandas magic to create a new column and make a Pandas plot of that, too. In the code, below, we create a column Tmed which is the average of Tmax and Tmin (the sum of Tmax and Tmin divided by 2).
weather['Tmed'] = (weather['Tmax'] + weather['Tmin'])/2
If you want to see what it looks like you can print(weather) and you'll see the extra column with the average values.
Now we plot as before but with the extra column added to the list of y values.
weather.plot(y=['Tmax','Tmin','Tmed'], x='Month')
plt.show()
Unsurprisingly, the new y value draws a line halfway between the maximum and minimum temperatures.
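The arithmetic behind Tmed works on whole columns at once. As a sketch of an alternative spelling (not the one used in this tutorial), the same column can be produced by asking Pandas for the row-wise mean of the two temperature columns; the variable name row_mean is just for illustration.

```python
import pandas as pd

# Tmax and Tmin copied from the 2018 table
weather = pd.DataFrame({
    "Tmax": [9.7, 6.7, 9.8, 15.5, 20.8, 24.2, 28.3, 24.5, 20.9, 16.5, 12.2, 10.7],
    "Tmin": [3.8, 0.6, 3.0, 7.9, 9.8, 13.1, 16.4, 14.5, 11.0, 8.5, 5.8, 5.2],
})

# The tutorial's version: add the two columns and halve the result
weather["Tmed"] = (weather["Tmax"] + weather["Tmin"]) / 2

# Alternative: the mean across each row of the two columns (axis=1 means "along the row")
row_mean = weather[["Tmax", "Tmin"]].mean(axis=1)

print(weather.head())
```

Either way, each row's Tmed is computed from that row's Tmax and Tmin, with no explicit loop.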
While line charts are great for plotting continuous values, some measurements are more discrete in their nature. Temperatures tend to change gradually without sudden leaps from one value to another but rainfall is a different matter.
It's not unusual for it to rain one day and not the next. Rain doesn't change gradually but rather it can start and stop quite abruptly. A line chart is not really suitable for visualizing that sort of behaviour - better to use a bar chart.
If you draw a bar chart of the rainfall data for 2018, you can see that the changes over time are more abrupt than the temperature data and, so, a bar chart representation is quite appropriate. Here's what it looks like:
weather.plot(kind='bar', y='Rain', x='Month')
plt.show()
There is quite a difference in rainfall between May and June, and it is very clear from the bar chart that this is the case. A line chart would make it look like there was a smooth transition between the two months whereas this is unlikely to be the case.
This sort of representation makes it easier to spot phenomena like April showers - April being the month when it typically rains a lot in the UK. Except that, in 2018, April was less wet than March, and not much worse than May.
Ah well, that's folklore for you.
Looking at the code you can see that it is very similar to the line chart code. The only difference is that the kind parameter is set to 'bar' rather than 'line'. As we discussed above, the same result would be obtained if you called the specific bar chart method, like this:
weather.plot.bar(y='Rain', x='Month')
I won't continue to talk about the two ways of specifying a particular chart; suffice it to say that in all of the charts you can use either construction.
What if we want our bars to be horizontal? In that case we specify barh as the type of chart, as below:
weather.plot(kind='barh', y='Rain', x='Month')
plt.show()
And, as you might expect, you can create multiple bars by adding a list of y values in the same way as the line chart.
Clearly, if you are going to plot two values on the same chart, the type of data needs to be similar. It would not make sense to have rainfall and temperature on the same chart as they are measured in different units. So, to illustrate a multiple bar chart we are going back to temperatures. Here is a bar chart that plots both Tmax and Tmin:
weather.plot(kind='bar', y=['Tmax','Tmin'], x='Month')
plt.show()
And one that plots the three temperatures, as you did with an earlier line chart:
weather.plot(kind='bar', y=['Tmax','Tmed','Tmin'], x='Month')
plt.show()
A scatter plot is a Pandas Plot that plots a series of points that correspond to two variables and allows us to determine if there is a relationship between them.
Is there a relationship between the amount of sunshine in any particular month and the level of rainfall? Probably there is.
The scatter plot below plots Sun against Rain. There are 12 points, one for each row in the table, and the points plot the value of Rain on the vertical axis against Sun on the horizontal one.
It’s not particularly clear but you can see a vague linear relationship.
A straight line that was the best fit through the twelve points would start somewhere high up on the left and end up low on the right. That tells us that when Rain has a high level, Sun has a low one, and vice versa. Common sense tells us that this is probably right: there is, generally speaking, an inverse relationship between the amount of sun and the amount of rain, because when it's raining it's also cloudy and so there is less sunshine.
weather.plot(kind='scatter', x='Sun', y='Rain')
plt.show()
In this example, the relationship between the values is fairly obvious and if, from some hypothetical data set, we were to plot rainfall against umbrella sales, we might also see what we expected. But it can be useful to be able to demonstrate such relationships when they are not so obvious.
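If you want to put a number on what the scatter plot suggests, one option (not used elsewhere in this tutorial) is the correlation coefficient, which Pandas can calculate for two columns. The sketch below types in the 2018 Sun and Rain values; a result below zero means that one value tends to go down as the other goes up.

```python
import pandas as pd

# Sun and Rain copied from the 2018 table
weather = pd.DataFrame({
    "Sun": [46.5, 92.0, 70.3, 113.4, 248.3, 234.5, 272.5, 182.1, 195.0, 137.0, 72.9, 40.3],
    "Rain": [58.0, 29.0, 81.2, 65.2, 58.4, 0.4, 14.8, 48.2, 29.4, 61.0, 73.8, 60.6],
})

# Series.corr() gives the Pearson correlation by default:
# +1 is a perfect rising line, -1 a perfect falling line, 0 no linear relationship
r = weather["Sun"].corr(weather["Rain"])
print(round(r, 2))
```

For this data the value is negative but well short of -1, which matches the impression of a vague, downward-sloping relationship rather than a tidy line.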
Pie charts are typically used to show what proportion of some value can be associated with a particular group or category. You might, for example, show people's preferences for different types of fast food - 80% like pizza and 20% prefer burgers - or the market share of various types, or makes, of automobile.
Here we are going to see what proportion of the annual sunshine happens in each month.
So, the pie chart only takes one parameter for the data, in this case Sun. This is then plotted against the index of the DataFrame (the index is the set of row labels, 0 to 11, shown at the left of the DataFrame when it is printed). Here's a first try at creating the pie chart:
weather.plot(kind='pie', y='Sun')
plt.show()
This is the sort of thing that we want but, frankly, it could be better.
There are two problems. One is that the legend, being quite big, obscures part of the chart and the second is that what you are interested in is the proportion of sunshine in each month, whereas what you have here is the proportion of sunshine in some categories labelled 0 to 11.
To fix this, you need to do a couple of things.
Firstly, you are going to change the index into something more meaningful by assigning a list of strings to the index of the weather DataFrame. You can see this in the first line of the code below. Secondly, you can simply dispose of the legend, as it doesn't really add any value to the plot. You do this by adding a new parameter to the plot method. The parameter is legend and you simply set it to False.
weather.index=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
weather.plot(kind='pie', y = 'Sun', legend=False)
plt.show()
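The size of each slice is simply that month's share of the annual total. If you would like to see the shares as numbers, a short calculation along these lines will do it; this is a sketch using the 2018 Sun values, and the variable names are only illustrative.

```python
import pandas as pd

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Hours of sun for each month of 2018, copied from the table
sun = pd.Series(
    [46.5, 92.0, 70.3, 113.4, 248.3, 234.5, 272.5, 182.1, 195.0, 137.0, 72.9, 40.3],
    index=months,
    name="Sun",
)

# Each month's percentage of the year's sunshine: the quantity a pie slice represents
share = sun / sun.sum() * 100
print(share.round(1))
```

The shares add up to 100, with July the largest slice and December the smallest.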
Statistical charts and spotting unusual events
Getting more data
You've been using a small dataset up to now and the simple charts that you've used have been entirely appropriate.
Now, you are going to download a larger dataset. It's the same format as before but covers several decades, not just one year.
Again, it is a csv file and you will get it from the same site as before. The following code downloads the data and stores it in a DataFrame, more_weather. It then prints the first four years (48 rows) of data:
more_weather = pd.read_csv('https://raw.githubusercontent.com/alanjones2/dataviz/master/londonweather.csv')
print(more_weather[0:48])
    Year  Month  Tmax  Tmin   Rain    Sun
 0  1957      1   8.7   2.7   39.5   53.0
 1  1957      2   9.0   2.9   69.8   64.9
 2  1957      3  13.9   5.7   25.4   96.7
 3  1957      4  14.2   5.2    5.7  169.6
 4  1957      5  16.2   6.5   21.3  195.0
 5  1957      6  23.6  10.7   22.4  284.5
 6  1957      7  22.5  13.8   87.0  152.3
 7  1957      8  21.1  12.5   86.2  154.4
 8  1957      9  17.6  10.1   51.7   88.5
 9  1957     10  15.5   7.7   47.0   85.9
10  1957     11   9.4   4.3   59.5   67.5
11  1957     12   7.6   1.0   42.1   40.8
12  1958      1   6.8   0.9   64.3   40.1
13  1958      2   8.9   1.9   58.7   45.7
14  1958      3   8.1   1.1   26.0  105.2
15  1958      4  12.3   3.8   29.5  153.2
16  1958      5  17.3   7.8   59.5  189.2
17  1958      6  19.4  10.7  104.3  152.2
18  1958      7  21.7  12.9   51.9  190.5
19  1958      8  20.8  13.1   75.2  103.1
20  1958      9  20.0  12.1   83.8  134.8
21  1958     10  14.9   8.3   50.7   94.2
22  1958     11   9.7   4.4   50.7   40.8
23  1958     12   8.0   2.7   85.1   29.6
24  1959      1   5.7  -1.1   54.8   76.2
25  1959      2   7.4   1.2    2.4   54.8
26  1959      3  11.9   4.4   43.8  103.9
27  1959      4  14.2   6.3   52.9  139.1
28  1959      5  18.7   8.0   21.9  221.4
29  1959      6  22.1  11.1   16.2  231.6
30  1959      7  24.7  13.3   86.5  276.9
31  1959      8  24.2  13.7   27.6  240.0
32  1959      9  22.7  10.6    5.1  209.2
33  1959     10  17.8   8.5   46.9  150.1
34  1959     11  10.8   3.5   53.5   53.0
35  1959     12   9.3   3.0   75.7   30.2
36  1960      1   6.9   1.8   47.9   34.4
37  1960      2   7.9   1.6   48.0   80.1
38  1960      3  10.2   4.5   33.9   65.0
39  1960      4  14.3   4.6   12.4  156.1
40  1960      5  18.4   9.3   45.6  181.7
41  1960      6  22.1  12.1   42.8  248.6
42  1960      7  20.1  12.4   67.2  139.7
43  1960      8  20.3  11.8   60.8  150.9
44  1960      9  18.5  10.5   75.3  128.4
45  1960     10  14.2   8.2  155.5   75.2
46  1960     11  11.2   4.5   89.5   69.4
47  1960     12   6.9   2.1   56.5   44.5
To get a feel for this new set of data you can use the Pandas method describe. The following code prints out a description of the Rain column:
print(more_weather['Rain'].describe())
count    748.000000
mean      50.408957
std       29.721493
min        0.300000
25%       27.800000
50%       46.100000
75%       68.800000
max      174.800000
Name: Rain, dtype: float64
You can see from this that there are 748 rows of data (representing 748 months, that's over 62 years), the mean monthly rainfall over that time was a little over 50mm, the minimum in any month was 0.3mm and the maximum over 174mm.
But if you want to communicate that data graphically, the charts that you've seen so far are not much help. A boxplot, however, gives you a great summary of the data in one simple graphic.
The code below is a box plot of the Rain data and it contains a great deal of information in a single graphic.
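A box plot is requested in the same way as the other chart types, by setting kind to 'box'. The sketch below is a stand-in that runs offline: it uses only the 1957 and 1958 Rain values from the table above (an assumption made for the sake of a self-contained example), so the picture will differ from one drawn with the full more_weather data, where the call would be more_weather.plot(kind='box', y='Rain').

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; leave it out in a notebook
import pandas as pd

# Monthly rainfall for 1957 and 1958 only, copied from the table above.
# This small sample stands in for the full more_weather DataFrame.
sample = pd.DataFrame({
    "Rain": [39.5, 69.8, 25.4, 5.7, 21.3, 22.4, 87.0, 86.2, 51.7, 47.0, 59.5, 42.1,
             64.3, 58.7, 26.0, 29.5, 59.5, 104.3, 51.9, 75.2, 83.8, 50.7, 50.7, 85.1],
})

# kind='box' draws the box, the median line, the whiskers and any outliers
ax = sample.plot(kind="box", y="Rain")
ax.set_ylabel("Monthly rainfall (mm)")
```

In a notebook the chart appears when the cell runs; in a script, follow the call with plt.show() as in the other examples.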
This is also called a box and whisker plot because of the lines, or whiskers, coming from the top and bottom of the plot.
Here's how you interpret the graphic.
The box itself represents the range of the Rain data between the first and third quartiles. Quartiles are simply the boundaries when you split the data into quarters. So the first quartile (Q1) is at the 25% mark - 25% of the data points are below Q1 and 75% are above it. The third quartile (Q3) is at the 75% mark and so has 25% of the data points above it and 75% below.
And as you probably realise, the second quartile (Q2) is the one halfway through the data: 50% of the data points are above Q2 and 50% below.
So, the box itself represents 50% of the data; the top of the box is Q3 and the bottom of the box is Q1. The horizontal line inside the box is Q2, which is also the median - the middle value of all the data points.
By convention, the whiskers are set from a value calculated from the interquartile range (IQR). The IQR is the value of Q3 minus Q1, and the whiskers reach at most 1.5 times the IQR beyond the top and the bottom of the box; each one stops at the most extreme data point that lies within that distance.
The idea is that the bulk of the data is represented by the box and whiskers and anything beyond them is considered an outlier.
In the graph, each outlier is represented by a circle - there are 8 outliers in the plot above.
The numbers down the left are, of course, the values that we are measuring, in this case, monthly rainfall in millimeters.
The outliers are when it was really, really wet!
There's a lot of information in a boxplot and you can see at a glance the shape of the data that you are looking at: where the bulk of the values lie and what outliers there are in the data. You can see that the monthly rainfall varies quite a lot, from less than 1mm to around 125mm. That, by London standards, is a range from really very dry to really wet. The outliers represent extremely wet months, but in the 748 months for which we have data there are only 8 of them.
To summarize, then, in the box plot:
- The centre line is the second quartile (Q2) and is also the median of the data
- The bottom of the box is the first quartile (Q1) and is also the median of the bottom half of the data
- The top of the box is the third quartile (Q3), and is also the median of the top half of the data
- The whiskers extend up to 1.5 times the interquartile range (IQR = Q3 - Q1) from the edge of the box
- The circles are outliers, the individual values that lie beyond the end of the whiskers
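The whisker limits can be worked out by hand from the quartiles that describe reported for the Rain column (27.8 for the 25% mark and 68.8 for the 75% mark). The sketch below does the arithmetic, assuming the usual 1.5 x IQR convention; the variable names are only illustrative.

```python
# Quartiles of the Rain column, taken from the describe() output
q1 = 27.8   # 25% mark
q3 = 68.8   # 75% mark

# Interquartile range: the height of the box
iqr = q3 - q1

# Furthest that each whisker is allowed to reach
upper_limit = q3 + 1.5 * iqr
lower_limit = q1 - 1.5 * iqr

print(f"IQR         = {iqr:.1f} mm")          # 41.0 mm
print(f"upper limit = {upper_limit:.1f} mm")  # 130.3 mm
print(f"lower limit = {lower_limit:.1f} mm")  # -33.7 mm
```

The lower limit is negative, which rainfall can never be, so the bottom whisker simply stops at the driest month on record (0.3mm). At the other end, any month with more than about 130mm of rain lies beyond the upper limit and is drawn as an outlier.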
You could see from the box plot of rainfall that half the rainfall was in the range around 25 to around 75 millimeters, that the bulk was between roughly 0 and 125 and that there were a handful of outliers when Londoners got completely soaked.
But you can look at this sort of distribution in more detail with a histogram.
The diagram, below, is a histogram of the monthly rainfall in our data - a histogram is plotted by setting the kind parameter to 'hist'.
You can see that the data is split up into ranges, or bins, each being represented by a bar. The width of the bar is the range of values and the height of the bar is the number of times that values in that range have occurred.
The default number of bins is 10, so when you run the code below you will get 10 bars.
more_weather.plot(kind='hist', y='Rain')
plt.show()
You can adjust the number of bins by setting the bins parameter. You can set this to a number such as bins=15 or to a list of values which represent the boundaries of the bins. For example, say you wanted to take a closer look at the outliers that you found in the box plot earlier. Those outliers are in the range of about 125 to 175mm, so you could make sure that your bins match those ranges.
The code below sets the bins parameter to a list of 8 values which, because these represent the boundaries, gives us 7 bins each representing a range of 25mm. The last two bins represent the outliers and you can see in the new histogram that, indeed, they are not particularly significant.
more_weather.plot(kind='hist', y='Rain', bins=[0,25,50,75,100,125,150,175])
plt.show()
Let's take another view. You could decide that, according to the previous plots, normal rainfall is in the range 25 to 75mm and that everything else is unusual. So, to identify the frequency of unusual events you could display 3 bins, one representing unusually dry weather, a second for normal weather, and a third that records unusually wet weather.
The code below gives you 3 bins, representing 0 to 25mm (unusually dry), 25 to 75mm (normal) and 75 to 175mm (unusually wet).
more_weather.plot.hist(y='Rain', bins=[0,25,75,175])
plt.show()
The resulting graph shows us that there are around 450 normal months, and a similar number of unusually wet and unusually dry months - about 150 of each.
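Reading counts off a chart is approximate. If you want the exact number of values in each bin, numpy can do the same binning without drawing anything. The sketch below applies the three bins to the twelve Rain values for 2018 (a small stand-in, so that it runs without downloading the full dataset).

```python
import numpy as np

# Monthly rainfall for 2018, copied from the first table
rain_2018 = [58.0, 29.0, 81.2, 65.2, 58.4, 0.4, 14.8, 48.2, 29.4, 61.0, 73.8, 60.6]

# The same bin boundaries as the three-bar histogram: dry, normal, wet
edges = [0, 25, 75, 175]

# np.histogram returns the count in each bin and the boundaries it used
counts, used_edges = np.histogram(rain_2018, bins=edges)

for label, count in zip(["unusually dry", "normal", "unusually wet"], counts):
    print(f"{label}: {count} months")
```

By this yardstick 2018 had two unusually dry months (June and July), one unusually wet one (March) and nine normal ones. With the full data you would pass more_weather['Rain'] in place of the list.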
Pandas Plot utilities
You can also create a set of separate charts for each series of data points. We set the x and y values as usual but, in addition, we specify the parameter subplots as being True (the default is False) and, if we wish, we can set the arrangement of the plots and the overall size of the figure using the layout and figsize parameters, as you can see below. Each of these parameters takes a pair of values. In the case of layout the first value specifies the number of rows and the second one the number of plots in each row. For figsize, the first value is the width of the figure and the second its height, both in inches.
Try changing the values to see what effect they have.
weather.plot(y=['Tmax', 'Tmin','Rain','Sun'], subplots=True, layout=(2,2), figsize=(10,5))
plt.show()
Here's a set of bar charts:
weather.plot(kind='bar', y=['Tmax', 'Tmin','Rain','Sun'], subplots=True, layout=(2,2), figsize=(10,5))
plt.show()
And a set of pie charts:
weather.plot(kind='pie', y=['Tmax', 'Tmin','Rain','Sun'], subplots=True, legend=False, layout=(2,2), figsize=(10,10))
plt.show()
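When subplots is True, plot() hands back a grid of matplotlib Axes objects, one per chart, arranged according to layout. That makes it possible to adjust each chart separately. The sketch below labels each vertical axis with its unit; the 2018 values are typed in so that it runs on its own, and the label texts are my own wording.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; leave it out in a notebook
import pandas as pd

# The 2018 data, copied from the first table
weather = pd.DataFrame({
    "Month": list(range(1, 13)),
    "Tmax": [9.7, 6.7, 9.8, 15.5, 20.8, 24.2, 28.3, 24.5, 20.9, 16.5, 12.2, 10.7],
    "Tmin": [3.8, 0.6, 3.0, 7.9, 9.8, 13.1, 16.4, 14.5, 11.0, 8.5, 5.8, 5.2],
    "Rain": [58.0, 29.0, 81.2, 65.2, 58.4, 0.4, 14.8, 48.2, 29.4, 61.0, 73.8, 60.6],
    "Sun": [46.5, 92.0, 70.3, 113.4, 248.3, 234.5, 272.5, 182.1, 195.0, 137.0, 72.9, 40.3],
})

# With subplots=True and layout=(2, 2) the result is a 2 x 2 array of Axes
axes = weather.plot(x="Month", y=["Tmax", "Tmin", "Rain", "Sun"],
                    subplots=True, layout=(2, 2), figsize=(10, 5))

# Walk through the grid row by row and label each chart with its unit
units = ["degrees Celsius", "degrees Celsius", "mm", "hours"]
for ax, unit in zip(axes.flatten(), units):
    ax.set_ylabel(unit)
```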
Saving the Charts
This is all very well but maybe you want to be able to use the charts that you produce. If you want to use them in a presentation or document, then it would be useful to be able to export them as image files that you can include in another file. The simple way of saving the images is like this:
weather.plot(kind='pie', y='Rain', legend=False)
# Save before calling plt.show(): once the figure has been shown it is discarded,
# and savefig would then write an empty image
plt.savefig("pie.png")
plt.show()
We've used functions from matplotlib before and this is just another one called savefig(). You can see that the name of the file is specified as a parameter and the type of image that you save is inferred from the file extension - in this case a png file called “pie.png”. As we haven't specified a path, the file will be saved in the current working directory, which is normally the same directory as the notebook or program.
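An alternative that avoids any doubt about which figure is being saved is to keep hold of the Axes that plot() returns and ask its figure to save itself. The sketch below writes the image into a temporary folder, chosen here only so that the example leaves nothing behind; in your own work you would give whatever path suits you.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; leave it out in a notebook
import pandas as pd

# Rainfall for 2018, indexed by month name as in the pie chart example
weather = pd.DataFrame(
    {"Rain": [58.0, 29.0, 81.2, 65.2, 58.4, 0.4, 14.8, 48.2, 29.4, 61.0, 73.8, 60.6]},
    index=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
           'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
)

ax = weather.plot(kind="pie", y="Rain", legend=False)

# A throwaway folder for this sketch; replace with a path of your own
out_path = os.path.join(tempfile.mkdtemp(), "pie.png")

# Save the figure that this particular chart belongs to
ax.get_figure().savefig(out_path)
print("saved to", out_path)
```

Changing the extension in the file name, for example to .pdf or .svg, changes the format that is written.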
What you have learned and what more you can learn
You have seen how to create various charts using Pandas and how they can be used. One thing that you have probably realized, working through these examples, is that you are not really getting an accurate picture of the data using these sorts of graphics.
You could see more detail by simply increasing the size of the plots (using the parameter figsize) but the thing to bear in mind is that these figures are used to communicate broad ideas about data, not to provide a detailed analysis.
Use these plots in documents, web pages or in presentations but if more detail is needed, you really need to provide your audience with proper numbers.
Finally, you can download the datasets used in this tutorial, as well as the example code as either a Jupyter Notebook or a plain old Python program: