76 changes: 38 additions & 38 deletions episodes/07-pandas_essential.md
@@ -41,15 +41,15 @@ with a text editor and look at the data layout.
The data within this file is organised much as you'd expect the data within a spreadsheet. The first row of the file contains the headers for each of the columns. The first column contains the name of the countries, while the remaining columns contain the GDP values for these countries for each year. Pandas has the `read_csv` function for reading structured data such as this, which makes reading the file easy:

```python
data = pd.read_csv('data/gapminder_gdp_europe.csv',index_col='country')
df = pd.read_csv('data/gapminder_gdp_europe.csv',index_col='country')
```

Here we specify that the `country` column should be used as the index column (`index_col`).
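As a minimal sketch of what `index_col` does, the example below reads a tiny in-memory CSV instead of the real file. The two rows and their values are made-up placeholders, and the names `csv_text` and `sample` are only for illustration.

```python
import io

import pandas as pd

# Made-up placeholder data standing in for the real CSV file.
csv_text = """country,gdpPercap_1952,gdpPercap_1957
Albania,1.0,2.0
Austria,3.0,4.0
"""

# index_col='country' turns the country column into the row index.
sample = pd.read_csv(io.StringIO(csv_text), index_col='country')

print(sample.index.tolist())    # row labels come from the 'country' column
print(sample.columns.tolist())  # 'country' is no longer a data column
```

Because the countries become the index, they can later be used as row labels rather than being treated as data.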

This creates a `DataFrame` object containing the dataset. This is similar to a numpy array, but has a number of significant differences. The first is that there are more ways to quickly understand a pandas dataframe. For example, the `info` function gives an overview of the data types and layout of the DataFrame:

```python
data.info()
df.info()
```

```output
@@ -74,10 +74,10 @@ dtypes: float64(12)
memory usage: 3.0+ KB
```

You can also carry out quick analysis of the data using the `describe` function:
You can also carry out quick analysis of the DataFrame using the `describe` function:

```python
data.describe()
df.describe()
```

```output
@@ -94,58 +94,58 @@ max 14734.232750 17909.489730 20431.092700 ...

## Accessing elements, rows, and columns

The other major difference to numpy arrays is that we cannot directly access the array elements using numerical indices such as `data[0,0]`. It is possible to access columns of data using the column headers as indices (for example, `data['gdpPercap_1952']`), but this is not recommended. Instead you should use the `iloc` and `loc` methods.
The other major difference to numpy arrays is that we cannot directly access the array elements using numerical indices such as `df[0,0]`. It is possible to access columns of data using the column headers as indices (for example, `df['gdpPercap_1952']`), but this is not recommended. Instead you should use the `iloc` and `loc` methods.

The `iloc` method enables us to access the DataFrame as we would a numpy array:

```python
print(data.iloc[0,0])
print(df.iloc[0,0])
```

while the `loc` method enables the same access using the index and column headers:

```python
print(data.loc["Albania", "gdpPercap_1952"])
print(df.loc["Albania", "gdpPercap_1952"])
```

For both of these methods, we can leave out the column indexes, and these will all be returned for the specified index row:

```python
print(data.loc["Albania"])
print(df.loc["Albania"])
```

This will not work for column headings (in the inverse of the `data['gdpPercap_1952']` method) however. While it is quick to type, we recommend trying to avoid using this method of slicing the DataFrame, in favour of the methods described below.
This shorthand does not work for column headings, however: a single label passed to `loc` is always looked up in the row index, so it is not the row-wise counterpart of `df['gdpPercap_1952']`. While `df['gdpPercap_1952']` is quick to type, we recommend avoiding this way of slicing the DataFrame, in favour of the methods described below.
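The difference between row and column access can be sketched on a small stand-in DataFrame. The values and the name `toy` are invented for illustration, not taken from the lesson's dataset.

```python
import pandas as pd

# Invented stand-in data, not the real GDP values.
toy = pd.DataFrame(
    {'gdpPercap_1952': [1.0, 2.0], 'gdpPercap_1957': [3.0, 4.0]},
    index=['Albania', 'Austria'],
)

row = toy.loc['Albania']            # row label alone: all columns for that row
col = toy.loc[:, 'gdpPercap_1952']  # ':' for the rows, then the column label
same_col = toy.iloc[:, 0]           # the same column selected by position

print(col.equals(same_col))

try:
    toy.loc['gdpPercap_1952']       # looked up in the row index, so it fails
except KeyError as err:
    print('KeyError:', err)
```

Selecting a column through `loc` or `iloc` therefore needs an explicit `:` in the row position.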

For both of these methods we can use the `:` character to select all elements in a row or column. For example, to get all information for Albania:

```python
print(data.loc["Albania", :])
print(df.loc["Albania", :])
```

or:

```python
print(data.iloc[0, :])
print(df.iloc[0, :])
```

The `:` character by itself is shorthand for all elements along that axis, but it can also be combined with index values or column headers to specify a slice of the DataFrame:

```python
print(data.loc["Albania", 'gdpPercap_1962':'gdpPercap_1972'])
print(df.loc["Albania", 'gdpPercap_1962':'gdpPercap_1972'])
```

If either end of the slice definition is omitted, then the slice will run to the end of that axis (just as it does for `:` by itself):

```python
print(data.loc["Albania", 'gdpPercap_1962':])
print(df.loc["Albania", 'gdpPercap_1962':])
```
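One detail worth checking when slicing: label-based slices with `loc` include both endpoints, whereas position-based slices with `iloc` exclude the stop position, as ordinary Python slices do. The sketch below uses an invented single-row DataFrame (`toy`) to show this.

```python
import pandas as pd

# Invented stand-in data with the same style of column labels.
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0]],
    index=['Albania'],
    columns=['gdpPercap_1952', 'gdpPercap_1957',
             'gdpPercap_1962', 'gdpPercap_1967'],
)

by_label = toy.loc['Albania', 'gdpPercap_1957':'gdpPercap_1962']
by_position = toy.iloc[0, 1:2]

print(list(by_label.index))     # both endpoint labels are included
print(list(by_position.index))  # the stop position (2) is excluded
```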

Slices can also be defined using a list of indexes or column headings:

```python
year_list = ['gdpPercap_1952','gdpPercap_1967','gdpPercap_1982','gdpPercap_1997']
country_list = ['Albania','Belgium']
print(data.loc[country_list, year_list])
print(df.loc[country_list, year_list])
```

```output
@@ -157,10 +157,10 @@ Belgium 8343.105127 13149.041190 20979.845890 27561.196630

## Masking data

Pandas data arrays are based on numpy arrays, and retain some of the numpy tools, such as masked arrays. This enables us to apply selection criteria to the datasets, so that only the values that we require are shown. For example, the following selects all data where the GDP is above $10,000:
Pandas DataFrames are based on numpy arrays, and retain some of the numpy tools, such as boolean masking. This enables us to apply selection criteria to a DataFrame, so that only the values that we require are shown. For example, the following selects all data where the GDP is above $10,000:

```python
subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
subset = df.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
print(subset[subset>10000])
```
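To see the two steps separately, the following sketch builds the boolean mask first and then applies it. The country names and values are invented placeholders; `toy`, `mask` and `masked` are illustrative names only.

```python
import pandas as pd

# Invented placeholder values chosen to straddle the 10000 threshold.
toy = pd.DataFrame(
    {'gdpPercap_1962': [8000.0, 12000.0],
     'gdpPercap_1967': [11000.0, 15000.0]},
    index=['CountryA', 'CountryB'],
)

mask = toy > 10000   # DataFrame of booleans with the same shape as toy
masked = toy[mask]   # values that fail the test are replaced by NaN

print(mask)
print(masked)

# Keep only the rows in which every value passed the test.
print(masked.dropna())
```

The shape of the result matches the original selection, which is why values that fail the test appear as `NaN` rather than being removed.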

@@ -179,7 +179,7 @@ Poland NaN NaN NaN
Pandas is integrated with matplotlib, and so data can be plotted directly using the integrated `plot` method. For example, to plot the GDP for Sweden:

```python
data.loc['Sweden',:].plot()
df.loc['Sweden',:].plot()
plt.xticks(rotation=90)
```

@@ -191,7 +191,7 @@ Note that, in the case above, we passed a single column of data to the `plot` me
For example, we will transpose the GDP data for the first 3 countries in our dataset:

```python
print(data.iloc[0:3,:].T)
print(df.iloc[0:3,:].T)
```

```output
@@ -214,7 +214,7 @@ This data is now ready to be plotted as a histogram - first we set the style of

```python
plt.style.use('ggplot')
data.iloc[0:3,:].T.plot(kind='bar')
df.iloc[0:3,:].T.plot(kind='bar')
plt.xticks(rotation=90)
plt.ylabel('GDP per capita')
```
@@ -237,7 +237,7 @@ axis (`axs`) objects. Pass the axis object to pandas when plotting your figure

```python
fig, axs = plt.subplots()
data.loc['Albania':'Belgium',:].T.plot(kind='bar',ax=axs)
df.loc['Albania':'Belgium',:].T.plot(kind='bar',ax=axs)
plt.xticks(rotation=90)
plt.ylabel('GDP per capita')
fig.savefig('albania-austria-belgium_GDP.png', bbox_inches='tight')
@@ -251,26 +251,26 @@ fig.savefig('albania-austria-belgium_GDP.png', bbox_inches='tight')

Note that the x-tick labels have been taken directly from the index values of the transposed DataFrame (which were the original column labels). These don't really need to be more than the year of the GDP values, so we could change the column labels to reflect this.

First we make a new copy of the dataframe (in case anything goes wrong):
First we make a new copy of the DataFrame (in case anything goes wrong):

```python
gdpPercap = data.copy(deep=True)
df_gdpPercap = df.copy(deep=True)
```

We have given this new dataframe a more appropriate name, replacing the information that will be removed from the column headers.
We have given this new DataFrame a more appropriate name, replacing the information that will be removed from the column headers.

Now we will use the inbuilt `str.strip` method to clean up our column labels for the new
dataframe. Which of these commands is correct:
DataFrame. Which of these commands is correct:

1. `gdpPercap.columns = data.columns.str.strip('gdpPercap_')`
2. `gdpPercap = data.columns.str.strip('gdpPercap_')`
1. `df_gdpPercap.columns = df.columns.str.strip('gdpPercap_')`
2. `df_gdpPercap = df.columns.str.strip('gdpPercap_')`

::::::::::::::: solution

## Solution

The correct answer is 1. We have to pass the new column labels explicitly back to the
array columns, otherwise all we do is replace the data array with a list of the new
DataFrame's `columns` attribute; otherwise all we do is replace the DataFrame with an
`Index` holding the new column labels.
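The difference between the two options can be sketched on a small invented DataFrame. The names `toy`, `toy_renamed`, `labels_only` and `hypothetical` are illustrative. The last part is an aside rather than part of the exercise: `str.strip` removes a set of characters from both ends of each label, not a literal prefix. That works here because the years contain none of those characters, and `str.replace` is shown as one possible alternative for the made-up label `gdpPercap_adjusted`.

```python
import pandas as pd

# Invented stand-in data with the lesson's style of column labels.
toy = pd.DataFrame(
    [[1.0, 2.0]],
    index=['Albania'],
    columns=['gdpPercap_1952', 'gdpPercap_1957'],
)
toy_renamed = toy.copy(deep=True)

# Option 1: assign the cleaned labels back to the copy's columns.
toy_renamed.columns = toy.columns.str.strip('gdpPercap_')
print(list(toy_renamed.columns))
print(list(toy.columns))           # the original labels are untouched

# Option 2 would bind the name to the labels alone, losing the data.
labels_only = toy.columns.str.strip('gdpPercap_')
print(type(labels_only).__name__)

# str.strip treats its argument as a set of characters to trim.
hypothetical = pd.Index(['gdpPercap_adjusted'])  # made-up label
print(list(hypothetical.str.strip('gdpPercap_')))
print(list(hypothetical.str.replace('gdpPercap_', '', regex=False)))
```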


@@ -287,7 +287,7 @@ Now that we've cleaned up the column labels, we now want to plot the GDP data fo
Sweden and Iceland from 1972 onwards. The code block we will be using is:

```python
gdp_percap<BLOCK>.T.plot(kind='line')
df_gdpPercap<BLOCK>.T.plot(kind='line')

# Create legend.
plt.legend(loc='upper left')
@@ -297,18 +297,18 @@ plt.ylabel('GDP per capita ($)')

Which of the following blocks of code should replace the `<BLOCK>` in the code above?

1. `.loc['Sweden':'Iceland','gdpPercap_1972':]`
2. `.loc['gdpPercap_1972':,['Sweden','Iceland']]`
3. `.loc[['Sweden','Iceland'],'gdpPercap_1972':]`
4. `.loc['gdpPercap_1972':,'Sweden':'Iceland']`
1. `.loc['Sweden':'Iceland','1972':]`
2. `.loc['1972':,['Sweden','Iceland']]`
3. `.loc[['Sweden','Iceland'],'1972':]`
4. `.loc['1972':,'Sweden':'Iceland']`

::::::::::::::: solution

## Solution

The correct answer is 3. The two countries are not adjacent in the dataset, so we need
to use a list to slice them, not a range (disqualifying answers 1 and 4). At the point
where we select the countries using `.loc`, we have not yet transposed the dataset
where we select the countries using `.loc`, we have not yet transposed the DataFrame
(using `.T`), so the country names are still index labels, not column labels, and therefore
need to come first inside the square brackets (disqualifying
answers 2 and 4).
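The reasoning can be checked without plotting. The sketch below uses an invented three-country DataFrame (`toy`) whose column labels are already reduced to years. It is an illustration under those assumptions, not the lesson's dataset.

```python
import pandas as pd

# Invented stand-in data; rows are countries, columns are years.
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    index=['Iceland', 'Norway', 'Sweden'],
    columns=['1967', '1972', '1977'],
)

# A list picks exactly the named rows, in the order given.
picked = toy.loc[['Sweden', 'Iceland'], '1972':]
print(picked)

# A label range also returns every row in between.
ranged = toy.loc['Iceland':'Sweden', '1972':]
print(list(ranged.index))

# Only after transposing do the countries become column labels.
print(picked.T)
```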
@@ -323,11 +323,11 @@ answers 2 and 4).

:::::::::::::::::::::::::::::::::::::::: keypoints

- CSV data is loaded using the `load_csv()` function
- The `describe()` function gives a quick analysis of the data
- `loc[<index>,<column>]` indexes the data array by the index and column labels
- `iloc[<index>,<column>]` indexes the data array using numerical indicies
- The data can be sliced by providing index and/or column indicies as ranges or lists of values
- CSV data is loaded using the `read_csv()` function to create a pandas `DataFrame` object
- The `describe()` function gives a quick analysis of the DataFrame
- `loc[<index>,<column>]` indexes the DataFrame by the index and column labels
- `iloc[<index>,<column>]` indexes the DataFrame using numerical indices
- The data can be sliced by providing index and/or column indices as ranges or lists of values
- The built-in `plot()` function can be used to plot the data using the `matplotlib` library

::::::::::::::::::::::::::::::::::::::::::::::::::