Working with Pandas¶

The previous course notes covered some pandas basics. These course notes will cover some other functionalities of pandas that we'll use throughout the course.

In [1]:
# Importing pandas
import pandas as pd

More Ways to Create Pandas DataFrames¶

Pandas DataFrames can be created with both dictionaries and lists of lists. With both of these data structures, you just have to make sure that the length of the lists are the same within the dictionaries values and inner lists.

In [35]:
# Creates a dictionary of places
place_dictionary = {'city': ['Seattle', 'Portland', 'Corvallis', 'San Francisco'],
                    'state': ['Washington', 'Oregon', 'Oregon', 'California'],
                    'population': [725000, 650000, 60000, 815000]}
# Turns the dictionary into a pandas DataFrame
place_df = pd.DataFrame(place_dictionary)
place_df
Out[35]:
city state population
0 Seattle Washington 725000
1 Portland Oregon 650000
2 Corvallis Oregon 60000
3 San Francisco California 815000
In [16]:
# Creates a list of lists of places
place_list = [['Seattle', 'Washington', 725000],
              ['Portland', 'Oregon', 650000],
              ['Corvallis', 'Oregon', 60000],
              ['San Francisco', 'California', 815000]]
# Turns the list of lists into a pandas DataFrame
place_df = pd.DataFrame(place_list, columns = ['city', 'state', 'population'])
place_df
Out[16]:
city state population
0 Seattle Washington 725000
1 Portland Oregon 650000
2 Corvallis Oregon 60000
3 San Francisco California 815000

Accessing Values from DataFrames¶

Values can be accessed from a DataFrame using loc and iloc.

loc allows you to access values based on their label and index. The first argument is the index of the row you want to get the data from, while the second argument is the column label for the column you want to get the data from. Instead of using parentheses, loc uses square brackets.

In [5]:
# Accesses the value in row 2 of the state column
place_df.loc[1, 'state']
Out[5]:
'Oregon'
In [6]:
# Accesses all the values of the population column
place_df.loc[:, 'population']
Out[6]:
0    725000
1    650000
2     60000
3    815000
Name: population, dtype: int64
In [36]:
# Accesses all the values of both the city and population columns and creates a DataFrame with the data
population_df = place_df.loc[:, ['city', 'population']]
population_df
Out[36]:
city population
0 Seattle 725000
1 Portland 650000
2 Corvallis 60000
3 San Francisco 815000

iloc stands for integer location. Instead of using labels for columns, iloc treats the columns as indices. For our DataFrame, the indices of columns city, state, and population would be 0, 1, and 2, respectively. The first argument of iloc is the row index and the second argument is the column index. Like loc, iloc uses square brackets.

In [17]:
# Accesses the value at row 1 and column 1 (city)
place_df.iloc[0,0]
Out[17]:
'Seattle'
In [18]:
# Accesses the values in rows 1 and 2 and column 3 (population)
place_df.iloc[0:2,2]
Out[18]:
0    725000
1    650000
Name: population, dtype: int64

Replacing Values in DataFrames¶

If values in our DataFrame are incorrect, or if we simply want to update them, we can replace them using the replace() function. The first argument of the function is the current value in the DataFrame that we want to change, and the second argument is the value we want to replace it with.

In [19]:
# Replaces the city of Corvallis with Beaverton
place_df.replace('Corvallis', 'Beaverton')
Out[19]:
city state population
0 Seattle Washington 725000
1 Portland Oregon 650000
2 Beaverton Oregon 60000
3 San Francisco California 815000

However, simply using replace() doesn't alter the original DataFrame. We can see this when we print the DataFrame.

In [20]:
place_df
Out[20]:
city state population
0 Seattle Washington 725000
1 Portland Oregon 650000
2 Corvallis Oregon 60000
3 San Francisco California 815000

Corvallis is still in the DataFrame!

Values can be permanently replaced in the DataFrame using the inplace = True argument.

In [21]:
# Permanently replaces Corvallis with Beaverton
place_df.replace('Corvallis', 'Beaverton', inplace = True)
place_df
Out[21]:
city state population
0 Seattle Washington 725000
1 Portland Oregon 650000
2 Beaverton Oregon 60000
3 San Francisco California 815000

Multiple values can be replaced in the DataFrame by using a dictionary of replacement values, with the keys being the old values and the values being the new values.

In [23]:
# Creates a replacement dictionary
replacement_dict = {'Washington': 'WA', 'Oregon': 'OR', 'California': 'CA'}
# Replaces values in the DataFrame with the replacement dictionary
place_df.replace(replacement_dict, inplace = True)
place_df
Out[23]:
city state population
0 Seattle WA 725000
1 Portland OR 650000
2 Beaverton OR 60000
3 San Francisco CA 815000

You can also create a nested dictionary for replacement, for more precise replacement.

In [37]:
# Creates a nested replacement dictionary
replacement_dict = {'state': {'Washington': 'WA', 'Oregon': 'OR', 'California': 'CA'}}
# Replaces values in the DataFrame with the replacement dictionary
place_df.replace(replacement_dict, inplace= True)
place_df
Out[37]:
city state population
0 Seattle WA 725000
1 Portland OR 650000
2 Corvallis OR 60000
3 San Francisco CA 815000

Grouping Data in DataFrames¶

It can be helpful to group data, such as into groups by year, species type, or other groups, in order to better analyze and understand the data.

We'll work with the data from surveys.csv file, which contains data on various species that have been recorded between 1977 and 2002. The dataset includes a record id, the month, day, and year when the specimen was recorded, a plot id and species id, and the sex, hindfoot length, and weight of the specimen.

In [26]:
# Reads in the species data
species_df = pd.read_csv("surveys.csv")
species_df
Out[26]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
... ... ... ... ... ... ... ... ... ...
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

35549 rows × 9 columns

Data can be grouped using the function groupby(). When we use groupby(), we also want to follow it with a function that performs some analysis of the data being grouped, such as taking the mean, summing the values, or recording the number of entries.

In [27]:
# Groups the DataFrame by year and takes the mean
species_df.groupby('year').mean(numeric_only = True)
Out[27]:
record_id month day plot_id hindfoot_length weight
year
1977 252.0 9.568588 14.986083 11.616302 36.276961 46.650376
1978 1027.5 6.560115 8.339695 10.822519 37.278880 67.911290
1979 1911.0 6.589708 23.025035 12.029207 35.167219 63.390282
1980 2978.0 5.942049 15.309541 11.649470 35.082035 62.448747
1981 4421.5 5.790082 13.550272 11.864130 35.538830 65.843888
1982 6146.5 6.280081 23.142568 11.943883 31.710897 53.765888
1983 7972.0 6.148834 15.041841 11.901375 32.390635 55.102828
1984 9299.0 6.486239 17.147808 11.697248 31.929385 50.955682
1985 10508.5 6.429764 17.654381 10.566064 32.617844 46.694570
1986 11698.5 7.366242 9.264331 11.032909 33.057444 55.053927
1987 13005.0 6.470975 19.435069 11.404548 29.553377 49.443639
1988 14575.0 6.107556 15.264806 11.089857 28.190691 45.055765
1989 16094.0 5.940089 7.652645 11.400255 25.792094 35.754691
1990 17534.0 5.699466 20.770404 10.594966 28.408658 35.483044
1991 18863.0 6.792873 12.401633 10.647365 25.953202 32.040033
1992 20055.5 5.748555 15.296724 10.765896 26.859956 33.328160
1993 20949.5 6.362667 18.480000 12.153333 27.285935 34.271028
1994 21658.5 6.652695 10.502994 11.586826 29.757194 34.546595
1995 22603.5 7.557283 18.128478 10.960720 26.915146 29.530387
1996 24067.5 6.225088 19.032825 11.059789 26.328422 28.207174
1997 26167.0 6.328520 16.446049 11.681107 26.177299 31.761885
1998 28218.5 6.845963 19.727329 11.909317 28.202515 34.826523
1999 29591.0 5.661674 13.646696 11.540969 28.322160 36.503256
2000 30934.5 6.154639 15.262887 11.091495 26.130375 32.394591
2001 32515.5 7.576398 20.282609 11.676398 27.140363 36.469529
2002 34435.0 7.084343 13.198744 11.487663 27.242353 35.641555

A single column can be isolated when using groupby() by including the column name in square brackets after the groupby() function.

In [28]:
# Isolates the weight column
species_df.groupby('year')['weight'].mean()
Out[28]:
year
1977    46.650376
1978    67.911290
1979    63.390282
1980    62.448747
1981    65.843888
1982    53.765888
1983    55.102828
1984    50.955682
1985    46.694570
1986    55.053927
1987    49.443639
1988    45.055765
1989    35.754691
1990    35.483044
1991    32.040033
1992    33.328160
1993    34.271028
1994    34.546595
1995    29.530387
1996    28.207174
1997    31.761885
1998    34.826523
1999    36.503256
2000    32.394591
2001    36.469529
2002    35.641555
Name: weight, dtype: float64
In [29]:
# Groups by year and counts the number of entries that aren't NaN
species_df.groupby('year').count()
Out[29]:
record_id month day plot_id species_id sex hindfoot_length weight
year
1977 503 503 503 503 487 418 408 266
1978 1048 1048 1048 1048 992 936 857 868
1979 719 719 719 719 658 651 604 638
1980 1415 1415 1415 1415 1375 1332 963 1317
1981 1472 1472 1472 1472 1417 1376 1043 1358
1982 1978 1978 1978 1978 1964 1850 1560 1841
1983 1673 1673 1673 1673 1652 1568 1559 1556
1984 981 981 981 981 951 888 878 880
1985 1438 1438 1438 1438 1416 1352 1345 1326
1986 942 942 942 942 922 869 853 853
1987 1671 1671 1671 1671 1668 1539 1377 1517
1988 1469 1469 1469 1469 1457 1341 1332 1327
1989 1569 1569 1569 1569 1562 1458 1366 1439
1990 1311 1311 1311 1311 1296 1217 1155 1209
1991 1347 1347 1347 1347 1332 1243 1218 1224
1992 1038 1038 1038 1038 1008 920 914 902
1993 750 750 750 750 714 665 647 642
1994 668 668 668 668 615 570 556 558
1995 1222 1222 1222 1222 1180 1108 1096 1086
1996 1706 1706 1706 1706 1689 1635 1629 1617
1997 2493 2493 2493 2493 2492 2428 2414 2419
1998 1610 1610 1610 1610 1590 1515 1511 1395
1999 1135 1135 1135 1135 1108 1075 1074 1075
2000 1552 1552 1552 1552 1509 1469 1465 1442
2001 1610 1610 1610 1610 1540 1487 1489 1444
2002 2229 2229 2229 2229 2192 2128 2125 2084

So far, we've only been able to compute one statistic with groupby(). However, the function agg() allows us to perform different operations on different columns. In the agg() function, you can select the column and the statistic you'd like to calculate with the syntax ('column name', 'statistical operation'), then assign it to a column name. This creates various summary columns in the generated DataFrame.

In [38]:
# Groups by year and makes summary columns using agg
species_df.groupby('year').agg(mean_hindfoot = ('hindfoot_length', 'mean'), max_weight = ('weight', 'max'))
Out[38]:
mean_hindfoot max_weight
year
1977 36.276961 149.0
1978 37.278880 232.0
1979 35.167219 274.0
1980 35.082035 243.0
1981 35.538830 264.0
1982 31.710897 252.0
1983 32.390635 256.0
1984 31.929385 259.0
1985 32.617844 225.0
1986 33.057444 240.0
1987 29.553377 278.0
1988 28.190691 248.0
1989 25.792094 275.0
1990 28.408658 243.0
1991 25.953202 199.0
1992 26.859956 220.0
1993 27.285935 233.0
1994 29.757194 226.0
1995 26.915146 171.0
1996 26.328422 185.0
1997 26.177299 231.0
1998 28.202515 238.0
1999 28.322160 227.0
2000 26.130375 265.0
2001 27.140363 280.0
2002 27.242353 248.0

Changing the Indexing of DataFrames¶

When we run certain operations on DataFrames, they may generate an output DataFrame that has the rows indexed by a data column, instead of basic indices. Below is an example:

In [32]:
# Example of a DataFrame indexed by year
hindfoot_df = species_df.groupby('year')['hindfoot_length'].mean()
hindfoot_df
Out[32]:
year
1977    36.276961
1978    37.278880
1979    35.167219
1980    35.082035
1981    35.538830
1982    31.710897
1983    32.390635
1984    31.929385
1985    32.617844
1986    33.057444
1987    29.553377
1988    28.190691
1989    25.792094
1990    28.408658
1991    25.953202
1992    26.859956
1993    27.285935
1994    29.757194
1995    26.915146
1996    26.328422
1997    26.177299
1998    28.202515
1999    28.322160
2000    26.130375
2001    27.140363
2002    27.242353
Name: hindfoot_length, dtype: float64

If we want the indices to be integers, rather than information like the year, we can use reset_index() to reset the indices. This turns the current indices into a column and creates separate indices for the rows.

In [33]:
# Resets the indices of the DataFrame
hindfoot_df.reset_index()
Out[33]:
year hindfoot_length
0 1977 36.276961
1 1978 37.278880
2 1979 35.167219
3 1980 35.082035
4 1981 35.538830
5 1982 31.710897
6 1983 32.390635
7 1984 31.929385
8 1985 32.617844
9 1986 33.057444
10 1987 29.553377
11 1988 28.190691
12 1989 25.792094
13 1990 28.408658
14 1991 25.953202
15 1992 26.859956
16 1993 27.285935
17 1994 29.757194
18 1995 26.915146
19 1996 26.328422
20 1997 26.177299
21 1998 28.202515
22 1999 28.322160
23 2000 26.130375
24 2001 27.140363
25 2002 27.242353

If you don't want to retain the current indices when resetting the indices, include the drop = True argument.

In [34]:
# Resets the indices of the DataFrame and drops the year
hindfoot_df.reset_index(drop = True)
Out[34]:
0     36.276961
1     37.278880
2     35.167219
3     35.082035
4     35.538830
5     31.710897
6     32.390635
7     31.929385
8     32.617844
9     33.057444
10    29.553377
11    28.190691
12    25.792094
13    28.408658
14    25.953202
15    26.859956
16    27.285935
17    29.757194
18    26.915146
19    26.328422
20    26.177299
21    28.202515
22    28.322160
23    26.130375
24    27.140363
25    27.242353
Name: hindfoot_length, dtype: float64