More Functionalities of Pandas¶

In [2]:
import pandas as pd

Creating a Pandas DataFrame from a dictionary¶

In [3]:
place_dictionary={'city':['Madrid','Shanghai','Corvallis'],'country':['Spain','China','USA'],
                  'population':[4000000,26000000,60000]}
place_df = pd.DataFrame(place_dictionary)
place_df
Out[3]:
city country population
0 Madrid Spain 4000000
1 Shanghai China 26000000
2 Corvallis USA 60000

Creating a Pandas DataFrame from a list of lists¶

In [5]:
place_list=[['Madrid', 'Spain', 4000000], ['Shanghai', 'China', 26000000], ['Corvallis', 'USA', 60000]]
place_df = pd.DataFrame(place_list, columns=['city', 'country', 'population'])
place_df
Out[5]:
city country population
0 Madrid Spain 4000000
1 Shanghai China 26000000
2 Corvallis USA 60000

Accessing values from the DataFrame with .loc¶

In [6]:
place_df.loc[1, 'country']
Out[6]:
'China'
In [7]:
place_df.loc[:, 'population']
Out[7]:
0     4000000
1    26000000
2       60000
Name: population, dtype: int64
In [8]:
population_df = place_df.loc[:,['city', 'population']]
population_df
Out[8]:
city population
0 Madrid 4000000
1 Shanghai 26000000
2 Corvallis 60000

Using .iloc (integer location) as an alternative¶

In [9]:
place_df.iloc[0,0]
Out[9]:
'Madrid'
In [10]:
place_df.iloc[0:2,2]
Out[10]:
0     4000000
1    26000000
Name: population, dtype: int64

Replacing values in the DataFrame¶

In [11]:
# Replacement in the output
place_df.replace('Corvallis', 'Beavertown')
Out[11]:
city country population
0 Madrid Spain 4000000
1 Shanghai China 26000000
2 Beavertown USA 60000
In [12]:
# The original DataFrame has not changed
place_df
Out[12]:
city country population
0 Madrid Spain 4000000
1 Shanghai China 26000000
2 Corvallis USA 60000
In [13]:
# Permenant replacment to the DataFrame
place_df.replace('Corvallis', 'Beavertown', inplace=True)
place_df
Out[13]:
city country population
0 Madrid Spain 4000000
1 Shanghai China 26000000
2 Beavertown USA 60000
In [14]:
# Using a dictionary
replacement_dict={'USA':'United States','Spain':'Espana'}
place_df.replace(replacement_dict, inplace= True)
place_df
Out[14]:
city country population
0 Madrid Espana 4000000
1 Shanghai China 26000000
2 Beavertown United States 60000
In [15]:
# A more precise way to use a replacement dictionary
replacement_dict={'country':{'USA':'United States','Spain':'Espana'}}
place_df.replace(replacement_dict, inplace= True)
place_df
Out[15]:
city country population
0 Madrid Espana 4000000
1 Shanghai China 26000000
2 Beavertown United States 60000

Using .groupby() to combine data of the same category¶

In [16]:
# Familiarizing ourselves with the data
df = pd.read_csv("surveys.csv")
df
Out[16]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
... ... ... ... ... ... ... ... ... ...
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

35549 rows × 9 columns

In [17]:
# Grouping by year and taking the average of each column that has numerical values
df.groupby('year').mean(numeric_only=True)
Out[17]:
record_id month day plot_id hindfoot_length weight
year
1977 252.0 9.568588 14.986083 11.616302 36.276961 46.650376
1978 1027.5 6.560115 8.339695 10.822519 37.278880 67.911290
1979 1911.0 6.589708 23.025035 12.029207 35.167219 63.390282
1980 2978.0 5.942049 15.309541 11.649470 35.082035 62.448747
1981 4421.5 5.790082 13.550272 11.864130 35.538830 65.843888
1982 6146.5 6.280081 23.142568 11.943883 31.710897 53.765888
1983 7972.0 6.148834 15.041841 11.901375 32.390635 55.102828
1984 9299.0 6.486239 17.147808 11.697248 31.929385 50.955682
1985 10508.5 6.429764 17.654381 10.566064 32.617844 46.694570
1986 11698.5 7.366242 9.264331 11.032909 33.057444 55.053927
1987 13005.0 6.470975 19.435069 11.404548 29.553377 49.443639
1988 14575.0 6.107556 15.264806 11.089857 28.190691 45.055765
1989 16094.0 5.940089 7.652645 11.400255 25.792094 35.754691
1990 17534.0 5.699466 20.770404 10.594966 28.408658 35.483044
1991 18863.0 6.792873 12.401633 10.647365 25.953202 32.040033
1992 20055.5 5.748555 15.296724 10.765896 26.859956 33.328160
1993 20949.5 6.362667 18.480000 12.153333 27.285935 34.271028
1994 21658.5 6.652695 10.502994 11.586826 29.757194 34.546595
1995 22603.5 7.557283 18.128478 10.960720 26.915146 29.530387
1996 24067.5 6.225088 19.032825 11.059789 26.328422 28.207174
1997 26167.0 6.328520 16.446049 11.681107 26.177299 31.761885
1998 28218.5 6.845963 19.727329 11.909317 28.202515 34.826523
1999 29591.0 5.661674 13.646696 11.540969 28.322160 36.503256
2000 30934.5 6.154639 15.262887 11.091495 26.130375 32.394591
2001 32515.5 7.576398 20.282609 11.676398 27.140363 36.469529
2002 34435.0 7.084343 13.198744 11.487663 27.242353 35.641555
In [18]:
# Isolating one specific column
df.groupby('year')['weight'].mean()
Out[18]:
year
1977    46.650376
1978    67.911290
1979    63.390282
1980    62.448747
1981    65.843888
1982    53.765888
1983    55.102828
1984    50.955682
1985    46.694570
1986    55.053927
1987    49.443639
1988    45.055765
1989    35.754691
1990    35.483044
1991    32.040033
1992    33.328160
1993    34.271028
1994    34.546595
1995    29.530387
1996    28.207174
1997    31.761885
1998    34.826523
1999    36.503256
2000    32.394591
2001    36.469529
2002    35.641555
Name: weight, dtype: float64
In [19]:
# Grouping by year and counting the number of observations that are not NaN
df.groupby('year').count()
Out[19]:
record_id month day plot_id species_id sex hindfoot_length weight
year
1977 503 503 503 503 487 418 408 266
1978 1048 1048 1048 1048 992 936 857 868
1979 719 719 719 719 658 651 604 638
1980 1415 1415 1415 1415 1375 1332 963 1317
1981 1472 1472 1472 1472 1417 1376 1043 1358
1982 1978 1978 1978 1978 1964 1850 1560 1841
1983 1673 1673 1673 1673 1652 1568 1559 1556
1984 981 981 981 981 951 888 878 880
1985 1438 1438 1438 1438 1416 1352 1345 1326
1986 942 942 942 942 922 869 853 853
1987 1671 1671 1671 1671 1668 1539 1377 1517
1988 1469 1469 1469 1469 1457 1341 1332 1327
1989 1569 1569 1569 1569 1562 1458 1366 1439
1990 1311 1311 1311 1311 1296 1217 1155 1209
1991 1347 1347 1347 1347 1332 1243 1218 1224
1992 1038 1038 1038 1038 1008 920 914 902
1993 750 750 750 750 714 665 647 642
1994 668 668 668 668 615 570 556 558
1995 1222 1222 1222 1222 1180 1108 1096 1086
1996 1706 1706 1706 1706 1689 1635 1629 1617
1997 2493 2493 2493 2493 2492 2428 2414 2419
1998 1610 1610 1610 1610 1590 1515 1511 1395
1999 1135 1135 1135 1135 1108 1075 1074 1075
2000 1552 1552 1552 1552 1509 1469 1465 1442
2001 1610 1610 1610 1610 1540 1487 1489 1444
2002 2229 2229 2229 2229 2192 2128 2125 2084
In [20]:
# Using different operations on different columns that generate new summary columns
df.groupby('year').agg(mean_hindfoot = ('hindfoot_length','mean'), 
                             max_weight = ('weight', 'max'))
Out[20]:
mean_hindfoot max_weight
year
1977 36.276961 149.0
1978 37.278880 232.0
1979 35.167219 274.0
1980 35.082035 243.0
1981 35.538830 264.0
1982 31.710897 252.0
1983 32.390635 256.0
1984 31.929385 259.0
1985 32.617844 225.0
1986 33.057444 240.0
1987 29.553377 278.0
1988 28.190691 248.0
1989 25.792094 275.0
1990 28.408658 243.0
1991 25.953202 199.0
1992 26.859956 220.0
1993 27.285935 233.0
1994 29.757194 226.0
1995 26.915146 171.0
1996 26.328422 185.0
1997 26.177299 231.0
1998 28.202515 238.0
1999 28.322160 227.0
2000 26.130375 265.0
2001 27.140363 280.0
2002 27.242353 248.0

Changing the indexing with reset_index()¶

In [21]:
# DataFrame indexed my year
hindfoot_df = df.groupby('year')['hindfoot_length'].mean()
hindfoot_df
Out[21]:
year
1977    36.276961
1978    37.278880
1979    35.167219
1980    35.082035
1981    35.538830
1982    31.710897
1983    32.390635
1984    31.929385
1985    32.617844
1986    33.057444
1987    29.553377
1988    28.190691
1989    25.792094
1990    28.408658
1991    25.953202
1992    26.859956
1993    27.285935
1994    29.757194
1995    26.915146
1996    26.328422
1997    26.177299
1998    28.202515
1999    28.322160
2000    26.130375
2001    27.140363
2002    27.242353
Name: hindfoot_length, dtype: float64
In [22]:
# Year is now a column and there are seperate indicies
hindfoot_df.reset_index()
Out[22]:
year hindfoot_length
0 1977 36.276961
1 1978 37.278880
2 1979 35.167219
3 1980 35.082035
4 1981 35.538830
5 1982 31.710897
6 1983 32.390635
7 1984 31.929385
8 1985 32.617844
9 1986 33.057444
10 1987 29.553377
11 1988 28.190691
12 1989 25.792094
13 1990 28.408658
14 1991 25.953202
15 1992 26.859956
16 1993 27.285935
17 1994 29.757194
18 1995 26.915146
19 1996 26.328422
20 1997 26.177299
21 1998 28.202515
22 1999 28.322160
23 2000 26.130375
24 2001 27.140363
25 2002 27.242353
In [ ]:
# Year is removed as the index and not included as a column
hindfoot_df.reset_index(drop=True)
Out[ ]:
0     36.276961
1     37.278880
2     35.167219
3     35.082035
4     35.538830
5     31.710897
6     32.390635
7     31.929385
8     32.617844
9     33.057444
10    29.553377
11    28.190691
12    25.792094
13    28.408658
14    25.953202
15    26.859956
16    27.285935
17    29.757194
18    26.915146
19    26.328422
20    26.177299
21    28.202515
22    28.322160
23    26.130375
24    27.140363
25    27.242353
Name: hindfoot_length, dtype: float64