From a13afa1e96ba30aa590a93d679a8a85780f06509 Mon Sep 17 00:00:00 2001
From: scottan <33283688+Scottan@users.noreply.github.com>
Date: Fri, 14 Nov 2025 11:01:43 +0000
Subject: [PATCH 1/4] Replace gdpPercap_1972 with 1972 in code examples

Updated code blocks in the pandas essential episode to use '1972' instead of 'gdpPercap_1972'.
---
 episodes/07-pandas_essential.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/episodes/07-pandas_essential.md b/episodes/07-pandas_essential.md
index a80167c..e2edd56 100644
--- a/episodes/07-pandas_essential.md
+++ b/episodes/07-pandas_essential.md
@@ -297,10 +297,10 @@ plt.ylabel('GDP per capita ($)')

 Which of the following blocks of code should replace the `` in the code above?

-1. `.loc['Sweden':'Iceland','gdpPercap_1972':]`
-2. `.loc['gdpPercap_1972':,['Sweden','Iceland']]`
-3. `.loc[['Sweden','Iceland'],'gdpPercap_1972':]`
-4. `.loc['gdpPercap_1972':,'Sweden':'Iceland']`
+1. `.loc['Sweden':'Iceland','1972':]`
+2. `.loc['1972':,['Sweden','Iceland']]`
+3. `.loc[['Sweden','Iceland'],'1972':]`
+4. `.loc['1972':,'Sweden':'Iceland']`

 ::::::::::::::: solution

From d6d5b8792654dfbf44d248f67706ffc248e52861 Mon Sep 17 00:00:00 2001
From: scottan <33283688+Scottan@users.noreply.github.com>
Date: Fri, 14 Nov 2025 11:10:17 +0000
Subject: [PATCH 2/4] Replace 'data' with 'df' in pandas examples

---
 episodes/07-pandas_essential.md | 66 ++++++++++++++++-----------------
 1 file changed, 33 insertions(+), 33 deletions(-)

diff --git a/episodes/07-pandas_essential.md b/episodes/07-pandas_essential.md
index e2edd56..af63874 100644
--- a/episodes/07-pandas_essential.md
+++ b/episodes/07-pandas_essential.md
@@ -41,7 +41,7 @@ with a text editor and look at the data layout.
 The data within this file is organised much as you'd expect the data within a spreadsheet. The first row of the file contains the headers for each of the columns.
The first column contains the name of the countries, while the remaining columns contain the GDP values for these countries for each year. Pandas has the `read_csv` function for reading structured data such as this, which makes reading the file easy:

 ```python
-data = pd.read_csv('data/gapminder_gdp_europe.csv',index_col='country')
+df = pd.read_csv('data/gapminder_gdp_europe.csv',index_col='country')
 ```

 Here we specify that the `country` column should be used as the index column (`index_col`).
@@ -49,7 +49,7 @@ Here we specify that the `country` column should be used as the index column (`i
 This creates a `DataFrame` object containing the dataset. This is similar to a numpy array, but has a number of significant differences. The first is that there are more ways to quickly understand a pandas dataframe. For example, the `info` function gives an overview of the data types and layout of the DataFrame:

 ```python
-data.info()
+df.info()
 ```

 ```output
@@ -74,10 +74,10 @@ dtypes: float64(12)
 memory usage: 3.0+ KB
 ```

-You can also carry out quick analysis of the data using the `describe` function:
+You can also carry out quick analysis of the DataFrame using the `describe` function:

 ```python
-data.describe()
+df.describe()
 ```

 ```output
@@ -94,50 +94,50 @@ max 14734.232750 17909.489730 20431.092700 ...

 ## Accessing elements, rows, and columns

-The other major difference to numpy arrays is that we cannot directly access the array elements using numerical indices such as `data[0,0]`. It is possible to access columns of data using the column headers as indices (for example, `data['gdpPercap_1952']`), but this is not recommended. Instead you should use the `iloc` and `loc` methods.
+The other major difference to numpy arrays is that we cannot directly access the array elements using numerical indices such as `df[0,0]`. It is possible to access columns of data using the column headers as indices (for example, `df['gdpPercap_1952']`), but this is not recommended.
Instead you should use the `iloc` and `loc` methods.

 The `iloc` method enables us to access the DataFrame as we would a numpy array:

 ```python
-print(data.iloc[0,0])
+print(df.iloc[0,0])
 ```

 while the `loc` method enables the same access using the index and column headers:

 ```python
-print(data.loc["Albania", "gdpPercap_1952"])
+print(df.loc["Albania", "gdpPercap_1952"])
 ```

 For both of these methods, we can leave out the column indexes, and these will all be returned for the specified index row:

 ```python
-print(data.loc["Albania"])
+print(df.loc["Albania"])
 ```

-This will not work for column headings (in the inverse of the `data['gdpPercap_1952']` method) however. While it is quick to type, we recommend trying to avoid using this method of slicing the DataFrame, in favour of the methods described below.
+This will not work for column headings (in the inverse of the `df['gdpPercap_1952']` method) however. While it is quick to type, we recommend trying to avoid using this method of slicing the DataFrame, in favour of the methods described below.

 For both of these methods we can use the `:` character to select all elements in a row or column.
For example, to get all information for Albania:

 ```python
-print(data.loc["Albania", :])
+print(df.loc["Albania", :])
 ```

 or:

 ```python
-print(data.iloc[0, :])
+print(df.iloc[0, :])
 ```

 The `:` character by itself is shorthand to indicate all elements across that indice, but it can also be combined with index values or column headers to specify a slice of the DataArray:

 ```python
-print(data.loc["Albania", 'gdpPercap_1962':'gdpPercap_1972'])
+print(df.loc["Albania", 'gdpPercap_1962':'gdpPercap_1972'])
 ```

 If either end of the slice definition is omitted, then the slice will run to the end of that indice (just as it does for `:` by itself):

 ```python
-print(data.loc["Albania", 'gdpPercap_1962':])
+print(df.loc["Albania", 'gdpPercap_1962':])
 ```

 Slices can also be defined using a list of indexes or column headings:
@@ -145,7 +145,7 @@ Slices can also be defined using a list of indexes or column headings:
 ```python
 year_list = ['gdpPercap_1952','gdpPercap_1967','gdpPercap_1982','gdpPercap_1997']
 country_list = ['Albania','Belgium']
-print(data.loc[country_list, year_list])
+print(df.loc[country_list, year_list])
 ```

 ```output
@@ -157,10 +157,10 @@ Belgium 8343.105127 13149.041190 20979.845890 27561.196630

 ## Masking data

-Pandas data arrays are based on numpy arrays, and retain some of the numpy tools, such as masked arrays. This enables us to apply selection criteria to the datasets, so that only the values that we require are shown. For example, the following selects all data where the GDP is above $10,000:
+Pandas DataFrames are based on numpy arrays, and retain some of the numpy tools, such as masked arrays. This enables us to apply selection criteria to the DataFrames, so that only the values that we require are shown.
For example, the following selects all data where the GDP is above $10,000:

 ```python
-subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
+subset = df.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
 print(subset[subset>10000])
 ```

@@ -179,7 +179,7 @@ Poland NaN NaN NaN
 Pandas is integrated with matplotlib, and so data can be plotted directly using the integrated `plot` method. For example, to plot the GDP for Sweden:

 ```python
-data.loc['Sweden',:].plot()
+df.loc['Sweden',:].plot()
 plt.xticks(rotation=90)
 ```

@@ -191,7 +191,7 @@ Note that, in the case above, we passed a single column of data to the `plot` me
 For example, we will transpose the GDP data for the first 3 countries in our dataset:

 ```python
-print(data.iloc[0:3,:].T)
+print(df.iloc[0:3,:].T)
 ```

 ```output
@@ -214,7 +214,7 @@ This data is now ready to be plotted as a histogram - first we set the style of

 ```python
 plt.style.use('ggplot')
-data.iloc[0:3,:].T.plot(kind='bar')
+df.iloc[0:3,:].T.plot(kind='bar')
 plt.xticks(rotation=90)
 plt.ylabel('GDP per capita')
 ```
@@ -237,7 +237,7 @@ axis (`axs`) objects. Pass the axis object to pandas when plotting your figure

 ```python
 fig, axs = plt.subplots()
-data.loc['Albania':'Belgium',:].T.plot(kind='bar',ax=axs)
+df.loc['Albania':'Belgium',:].T.plot(kind='bar',ax=axs)
 plt.xticks(rotation=90)
 plt.ylabel('GDP per capita')
 fig.savefig('albania-austria-belgium_GDP.png', bbox_inches='tight')
@@ -251,26 +251,26 @@ fig.savefig('albania-austria-belgium_GDP.png', bbox_inches='tight')

 Note that the x-tick labels have been taken directly from the index values of the transposed DataFrame (which were the original column labels). These don't really need to be more than the year of the GDP values, so we could change the column labels to reflect this.

-First we make a new copy of the dataframe (in case anything goes wrong):
+First we make a new copy of the DataFrame (in case anything goes wrong):

 ```python
-gdpPercap = data.copy(deep=True)
+df_gdpPercap = df.copy(deep=True)
 ```

-We have given this new dataframe a more appropriate name, replacing the information that will be removed from the column headers.
+We have given this new DataFrame a more appropriate name, replacing the information that will be removed from the column headers.

 Now we will use the inbuilt `str.strip` method to clean up our column labels for the new
-dataframe. Which of these commands is correct:
+DataFrame. Which of these commands is correct:

-1. `gdpPercap.columns = data.columns.str.strip('gdpPercap_')`
-2. `gdpPercap = data.columns.str.strip('gdpPercap_')`
+1. `df_gdpPercap.columns = df.columns.str.strip('gdpPercap_')`
+2. `df_gdpPercap = df.columns.str.strip('gdpPercap_')`

 ::::::::::::::: solution

 ## Solution

 The correct answer is 1. We have to pass the new column labels explicitly back to the
-array columns, otherwise all we do is replace the data array with a list of the new
+array columns, otherwise all we do is replace the DataFrame with a list of the new
 column labels.


@@ -287,7 +287,7 @@ Now that we've cleaned up the column labels, we now want to plot the GDP data fo
 Sweden and Iceland from 1972 onwards. The code block we will be using is:

 ```python
-gdp_percap.T.plot(kind='line')
+df_gdpPercap.T.plot(kind='line')

 # Create legend.
 plt.legend(loc='upper left')
@@ -308,7 +308,7 @@ Which of the following blocks of code should replace the `` in the code a

 The correct answer is 3. The two countries are not adjacent in the dataset, so we need to use a list to slice them, not a range (disqualifying answers 1 and 4).
At the point
-where we select the countries using `.loc`, we have not yet transposed the dataset
+where we select the countries using `.loc`, we have not yet transposed the DataFrame
 (using `.T`), so the country names are still indexes, not column labels, and therefore need to be referenced first (ie in the first set of square brackets), (disqualifying
 answers 2 and 4).
@@ -323,10 +323,10 @@ answers 2 and 4).

 :::::::::::::::::::::::::::::::::::::::: keypoints

-- CSV data is loaded using the `load_csv()` function
-- The `describe()` function gives a quick analysis of the data
-- `loc[,]` indexes the data array by the index and column labels
-- `iloc[,]` indexes the data array using numerical indicies
+- CSV data is loaded using the `load_csv()` function to create a pandas `DataFrame` object
+- The `describe()` function gives a quick analysis of the DataFrame
+- `loc[,]` indexes the DataFrame by the index and column labels
+- `iloc[,]` indexes the DataFrame using numerical indicies
 - The data can be sliced by providing index and/or column indicies as ranges or lists of values
 - The built-in `plot()` function can be used to plot the data using the `matplotlib` library

From 6a7199276d5147805109b7af425074159b2dc98e Mon Sep 17 00:00:00 2001
From: scottan <33283688+Scottan@users.noreply.github.com>
Date: Fri, 14 Nov 2025 11:25:45 +0000
Subject: [PATCH 3/4] change load_csv to read_csv

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---
 episodes/07-pandas_essential.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/episodes/07-pandas_essential.md b/episodes/07-pandas_essential.md
index af63874..9dd2480 100644
--- a/episodes/07-pandas_essential.md
+++ b/episodes/07-pandas_essential.md
@@ -323,7 +323,7 @@ answers 2 and 4).

 :::::::::::::::::::::::::::::::::::::::: keypoints

-- CSV data is loaded using the `load_csv()` function to create a pandas `DataFrame` object
+- CSV data is loaded using the `read_csv()` function to create a pandas `DataFrame` object
 - The `describe()` function gives a quick analysis of the DataFrame
 - `loc[,]` indexes the DataFrame by the index and column labels
 - `iloc[,]` indexes the DataFrame using numerical indicies

From 0dcee27ae7c2ab8f0e6c125d8950f142e1d8ff6b Mon Sep 17 00:00:00 2001
From: scottan <33283688+Scottan@users.noreply.github.com>
Date: Fri, 14 Nov 2025 11:26:18 +0000
Subject: [PATCH 4/4] spelling correction indicies -> indices

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---
 episodes/07-pandas_essential.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/episodes/07-pandas_essential.md b/episodes/07-pandas_essential.md
index 9dd2480..2afca0a 100644
--- a/episodes/07-pandas_essential.md
+++ b/episodes/07-pandas_essential.md
@@ -326,8 +326,8 @@ answers 2 and 4).
 - CSV data is loaded using the `read_csv()` function to create a pandas `DataFrame` object
 - The `describe()` function gives a quick analysis of the DataFrame
 - `loc[,]` indexes the DataFrame by the index and column labels
-- `iloc[,]` indexes the DataFrame using numerical indicies
-- The data can be sliced by providing index and/or column indicies as ranges or lists of values
+- `iloc[,]` indexes the DataFrame using numerical indices
+- The data can be sliced by providing index and/or column indices as ranges or lists of values
 - The built-in `plot()` function can be used to plot the data using the `matplotlib` library

 ::::::::::::::::::::::::::::::::::::::::::::::::::
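
Note on the series: the renamed identifiers (`df`, `df_gdpPercap`) and the shortened `'1972'` column label can be checked together with a small sketch. It assumes a tiny stand-in DataFrame with made-up values rather than `data/gapminder_gdp_europe.csv`, so the numbers are illustrative only; the column and country names follow the episode.

```python
import pandas as pd

# Hypothetical stand-in for data/gapminder_gdp_europe.csv; the values are made up.
df = pd.DataFrame(
    {
        "gdpPercap_1967": [1.0, 2.0, 3.0],
        "gdpPercap_1972": [4.0, 5.0, 6.0],
        "gdpPercap_1977": [7.0, 8.0, 9.0],
    },
    index=pd.Index(["Albania", "Iceland", "Sweden"], name="country"),
)

# Copy first, then assign the stripped labels back to .columns
# (option 1 in the first challenge touched by PATCH 2/4).
df_gdpPercap = df.copy(deep=True)
df_gdpPercap.columns = df.columns.str.strip("gdpPercap_")

# Rows selected by list (the countries are not adjacent),
# columns selected by a label slice starting at '1972' (option 3 in PATCH 1/4).
subset = df_gdpPercap.loc[["Sweden", "Iceland"], "1972":]
print(subset)
```

`str.strip` removes any of the listed characters from both ends of each label rather than a literal prefix; it yields the bare years here because none of the year digits appear in `'gdpPercap_'`.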