Chapter 24 Introduction to dataframes in Pandas

24.1 The Index object

The Index object was introduced in lectures, and at first sight seems similar to a tuple. However it is better thought of as a combination of an immutable array and an ordered set.

An Index behaves like an immutable array in the following ways. The elements form an ordered list and cannot be individually changed (we can only replace the whole index). Just like an array in NumPy, all elements must be of the same type. We can slice an Index in exactly the same way as for an array in NumPy.

An Index behaves like an ordered set in that we can use the standard set operations on them to form new Index objects. For example if ind_1 and ind_2 respectively contain the elements

["red", "blue", "green"] \(\quad\quad\) and \(\quad\quad\) ["red", "yellow", "blue"]

then ind_1.union(ind_2) (which could also be written ind_1 | ind_2 in older versions of Pandas, though recent versions reserve | for elementwise logical operations) contains the elements

["blue", "green", red", "yellow"].

The order of the resulting list can be confusing. If we take the union of two Index objects then the result will be sorted in the natural ordering on the elements. If we take the intersection then the order of the first Index will be preserved in the result. For example, if we let x correspond to the list [5, 1, 2, 3, 4] and y correspond to the list [6, 5, 3, 1], then the union will be the Index object corresponding to the list [1, 2, 3, 4, 5, 6] and the intersection will correspond to [5, 1, 3].
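The ordering behaviour just described can be checked directly using the union and intersection methods:

```python
import pandas as pd

x = pd.Index([5, 1, 2, 3, 4])
y = pd.Index([6, 5, 3, 1])

# Union sorts the combined elements into their natural order.
print(list(x.union(y)))         # [1, 2, 3, 4, 5, 6]

# Intersection keeps the order of the first Index.
print(list(x.intersection(y)))  # [5, 1, 3]
```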

There is a special method for generating an Index of dates. This uses a special datetime datatype coming from NumPy. There are lots of options (and precise times can be given as well as dates) but the basic syntax is

mydates = pd.date_range(start="2021-6-2", end="2021-9-13")

which will generate a list of dates from 2/6/21 to 13/9/21 inclusive. There are options to change the frequency (for example to weekly) and many more. These can even include restricting dates to only business days (Mon-Fri) or times to business hours (9-5). Note that we have used the Year-Month-Day convention here to avoid any confusion over UK versus US conventions for the order.
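As a quick check, the range above contains 104 days, and adding the freq option restricts it (here to business days):

```python
import pandas as pd

# Daily dates from 2 June to 13 September 2021 inclusive.
mydates = pd.date_range(start="2021-6-2", end="2021-9-13")
print(len(mydates))  # 104

# Restrict to business days (Mon-Fri) using the freq option.
workdays = pd.date_range(start="2021-6-2", end="2021-9-13", freq="B")
```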

24.2 The Series object

When discussing the basic Python data types we omitted the dictionary objects, as they can be confusing. However, the Series object is a good way to understand the basic idea, as it is essentially a variation on a dictionary. We will not cover all of the details of dictionaries here, but briefly explain the relation between the two concepts.

As we saw in lectures, a Series is in some ways like a one-dimensional array. The difference is that each object in the array has an associated label. All of the objects are of one type, and all of the labels are also of a single type (which need not be the same as the type of the objects).

A dictionary is an unordered collection of pairs of labels and objects. There is no restriction on the types of either labels or objects. So a Series is like a dictionary with a restriction on the types allowed and a choice of ordering of the elements by their labels. Because of this we can construct a Series directly from a dictionary.
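For example (a small sketch with made-up rainfall figures), the dictionary keys become the labels of the Series:

```python
import pandas as pd

# A Series built directly from a dictionary: keys become the labels.
rainfall = pd.Series({"London": 520.0, "New York": 640.0, "Tokyo": 890.0})
print(rainfall["Tokyo"])  # 890.0
```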

24.3 The DataFrame object

We saw various ways of constructing a DataFrame in lectures: from an Excel spreadsheet (which we will discuss in the next section), a NumPy array, or a sequence of Series. This final method gives us a good opportunity to continue the discussion of dictionary objects from the preceding section.

Given a collection of Series we saw that we could make a DataFrame from them by using the pd.DataFrame command applied to something of the form

{ name1 : series1, name2 : series2 ...}

This is simply the notation for a dictionary containing objects series1, series2, … with corresponding labels name1, name2, …
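A minimal sketch of this construction, using two hypothetical Series:

```python
import pandas as pd

rainfall = pd.Series({"London": 520.0, "New York": 640.0})
ranking = pd.Series({"London": 3.0, "New York": 1.0})

# The dictionary labels become the column labels of the DataFrame,
# and the Series labels become the row labels.
df = pd.DataFrame({"rainfall": rainfall, "ranking": ranking})
print(df.columns.tolist())  # ['rainfall', 'ranking']
```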

Very often we may wish to work with a DataFrame interactively, or display certain properties of a DataFrame in the console. Given a DataFrame called our_results the command

print(our_results.head())

will display the header and the first few lines of our DataFrame (with labels), while

print(our_results.tail())

will do the same with the final few lines of our DataFrame. If we print our_results.index or our_results.columns we will display the labels for the rows or columns respectively.

Suppose that we had a DataFrame called cities shown below.

          rainfall  ranking    comment  food
London       520.0      3.0  very dear   NaN
Moscow         NaN      2.0       dear   NaN
New York     640.0      1.0      cheap   NaN
Tokyo        890.0      NaN        NaN   NaN
Lima         420.0      NaN        NaN  27.0

Notice that there are many entries where no data has been supplied.
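The cities DataFrame above could be constructed directly, using np.nan for the missing entries (a sketch reproducing the table):

```python
import numpy as np
import pandas as pd

# Each dictionary entry is a column; np.nan marks missing data.
cities = pd.DataFrame(
    {
        "rainfall": [520.0, np.nan, 640.0, 890.0, 420.0],
        "ranking": [3.0, 2.0, 1.0, np.nan, np.nan],
        "comment": ["very dear", "dear", "cheap", np.nan, np.nan],
        "food": [np.nan, np.nan, np.nan, np.nan, 27.0],
    },
    index=["London", "Moscow", "New York", "Tokyo", "Lima"],
)
```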

We can get a basic set of statistics for each column of this data using cities.describe() (we must remember to include the final pair of brackets). Printing this gives the results in Figure 24.1.


Figure 24.1: The result of using the describe method on the data above

Here we see the number of entries, the mean, percentiles and other statistical data for each column that is of a numerical type. Notice that in calculating these values the NaN entries do not give an error (in fact they are of type float) and that they are excluded from the calculations. This is one of the reasons that NaN is such a useful value to have at our disposal.
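To see the NaN-skipping behaviour in isolation, consider the rainfall column on its own:

```python
import numpy as np
import pandas as pd

s = pd.Series([520.0, np.nan, 640.0, 890.0, 420.0])

# The NaN entry is excluded: four values are counted, and the mean
# is taken over those four observed values only.
print(s.count())  # 4
print(s.mean())   # 617.5
```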

You can transpose the rows and columns of a DataFrame just as for an array using the .T method on the DataFrame.

There are various ways to sort our data. Note that none of these methods change the underlying DataFrame — they are mainly used to print out in a suitable format. Given a DataFrame, for example cities, the command

print(cities.sort_index())

will display the data with the rows sorted by their labels. So in our example above Lima would move to the top. If we want to sort the column labels instead we use

print(cities.sort_index(axis=1))

We can also sort by the values in a column. For example the command

print(cities.sort_values(by="rainfall"))

would produce the results

          rainfall  ranking    comment  food
Lima         420.0      NaN        NaN  27.0
London       520.0      3.0  very dear   NaN
New York     640.0      1.0      cheap   NaN
Tokyo        890.0      NaN        NaN   NaN
Moscow         NaN      2.0       dear   NaN

If we add the optional argument ascending=False to any of our sort methods then the ordering is reversed. Notice that the NaN values go to the end of the sort regardless of whether we sort or reverse sort.
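For example, using a cut-down version of the cities data, a descending sort still leaves the NaN row at the bottom:

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame(
    {"rainfall": [520.0, np.nan, 640.0]},
    index=["London", "Moscow", "New York"],
)

# Descending sort on rainfall; the NaN row (Moscow) still goes last.
out = cities.sort_values(by="rainfall", ascending=False)
print(out.index.tolist())  # ['New York', 'London', 'Moscow']
```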

We can also use slices and masks to filter our data before presenting it; this was covered in lectures and there is nothing further to add here.

24.4 Working with Excel spreadsheets in Pandas

Pandas allows us to read from an Excel spreadsheet directly into a DataFrame, and in lectures we saw the most basic case, where the rows and columns already have labels given by entries in the first row and first column.

There are four common scenarios we might wish to consider:

  1. The rows and columns of our sheet are already labelled.
  2. The rows are not labelled.
  3. The columns are not labelled, and we do not have an alternative header that we wish to use.
  4. The columns are not labelled and we wish to give a set of labels for them.

Consider the following code.

input1 = pd.read_excel("input1.xlsx", index_col=0)
input2 = pd.read_excel("input2.xlsx")
input3 = pd.read_excel("input3.xlsx", index_col=0, header=None)
input4 = pd.read_excel("input4.xlsx", index_col=0, names=["x", "y", "z"])

The first line corresponds to the case we saw in lectures, where both row and column labels already exist. The second line is the case where we have column labels but no row labels, and will input the sheet with the standard numbers used for row labels.

The third line covers the case where we have row labels and are happy for the default column numbers to be applied as labels for the columns, while the final line is where we have no row labels or column labels, and want to provide a list of column labels to use.

You may be wondering why there is not a similar example to the final line where we provide a list of labels to use for the rows of our file. There is no simple option in the pd.read_excel command to do this. It is probably easiest to open your file using openpyxl instead and make the necessary modifications before importing as a DataFrame.

Note that the above method reads a single sheet from the Excel file. If you want to read a different sheet from the default one then you should add the argument sheet_name="NameOfSheet" inside the pd.read_excel command.

If you want to read all of the sheets from a file at the same time then you can use the option sheet_name=None. This will then create a dictionary with the given name, and the various sheets will be stored inside the dictionary as separate DataFrames. If we add this option to the first line above then we will get a dictionary called input1, and the DataFrame corresponding to the first sheet (assuming the sheets have the default names) would be input1["Sheet1"].
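A minimal round trip illustrating this (demo.xlsx is a hypothetical file name created here purely for the example):

```python
import pandas as pd

# Create a small two-sheet workbook to read back in.
df = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
with pd.ExcelWriter("demo.xlsx") as writer:
    df.to_excel(writer, sheet_name="Sheet1")
    df.to_excel(writer, sheet_name="Sheet2")

# sheet_name=None returns a dictionary keyed by sheet name,
# with one DataFrame per sheet.
sheets = pd.read_excel("demo.xlsx", index_col=0, sheet_name=None)
print(list(sheets.keys()))  # ['Sheet1', 'Sheet2']
```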

There are further options you can use in the pd.read_excel command. If there are rows at the top of the sheet that you wish to skip, add the optional argument header=n where n is the number of the first line you want to import from (which will become the collection of column labels). You can also skip the final rows in a sheet using the option skipfooter=n where n is the number of rows to skip at the bottom.
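As a sketch, we can manufacture a hypothetical messy sheet (messy.xlsx, invented here for illustration) and then skip the junk rows on import:

```python
import pandas as pd

# Build a sheet with two title rows, the real table, then a totals row.
raw = pd.DataFrame(
    [
        ["Survey results", "", ""],
        ["(collected 2021)", "", ""],
        ["city", "rainfall", "ranking"],
        ["London", 520.0, 3.0],
        ["New York", 640.0, 1.0],
        ["Totals", 1160.0, ""],
    ]
)
raw.to_excel("messy.xlsx", header=False, index=False)

# header=2 takes row 2 (counting from 0) as the column labels;
# skipfooter=1 drops the totals row at the bottom.
data = pd.read_excel("messy.xlsx", index_col=0, header=2, skipfooter=1)
print(data.index.tolist())  # ['London', 'New York']
```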

Once you have worked on your DataFrame you may want to export it as a new Excel file. As you would expect, there are several ways to do this depending on what you wish to achieve.

The simplest case is when you want to create a file with a single sheet inside it. If your DataFrame is called my_work and you want to write it to Sheet1 in a file called MyOutput then you would use

my_work.to_excel("MyOutput.xlsx", sheet_name="Sheet1")

If you do not want the index column included add the optional argument index=False.

Warning: this will overwrite any existing file of the same name! So you could not use this method to write multiple sheets to a file, as each one would overwrite the file and you would end up with only the final sheet.

To write a file with multiple sheets inside it you need to use a different approach. Suppose you have worked on DataFrames called input1, input2, etc. Then you could save all of these in a single file called MyOutput.xlsx using the code

with pd.ExcelWriter("MyOutput.xlsx") as writer:
    input1.to_excel(writer, sheet_name="Sheet1")
    input2.to_excel(writer, sheet_name="Sheet2")

(including as many lines in the with block as needed).

Openpyxl is not made redundant by Pandas (in fact, it is used by Pandas to make the commands above work). Depending on what you want to do with the data, using openpyxl may be a more convenient approach. And you may well want to use both, particularly if you want to format the Excel file output at the end after Pandas has processed the data. For example, the output from the last example produced a file with the row and column labels in bold and with borders around them, which you might want to tidy up using openpyxl.