Intro to Pandas¶
Reading a csv file into a Pandas DataFrame¶

In [3]:
import pandas as pd
df = pd.read_csv('surveys.csv')
df
Out[3]:
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
35549 rows × 9 columns
Confirming that we have a Pandas DataFrame¶
In [4]:
type(df)
Out[4]:
pandas.core.frame.DataFrame
Viewing the first few entries in the DataFrame¶
In [5]:
df.head()
Out[5]:
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
Viewing the last few entries in the DataFrame¶
In [6]:
df.tail()
Out[6]:
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
Determining the number of rows and columns in the DataFrame¶
In [11]:
df.shape
Out[11]:
(35549, 9)
Extracting the column names from the DataFrame¶
In [7]:
df.columns
Out[7]:
Index(['record_id', 'month', 'day', 'year', 'plot_id', 'species_id', 'sex', 'hindfoot_length', 'weight'], dtype='object')
Selecting a column from the DataFrame¶
In [8]:
df['species_id']
Out[8]:
0 NL 1 NL 2 DM 3 DM 4 DM ... 35544 AH 35545 AH 35546 RM 35547 DO 35548 NaN Name: species_id, Length: 35549, dtype: object
The selected column is a Series meaning it retains the original index column¶
In [9]:
type(df['hindfoot_length'])
Out[9]:
pandas.core.series.Series
Getting the values from the column as an array¶
In [10]:
df['hindfoot_length'].values
Out[10]:
array([32., 33., 37., ..., 15., 36., nan])
Creating a new DataFrame from a subset of columns¶
In [24]:
length_weight_df=df[['hindfoot_length','weight']]
length_weight_df
Out[24]:
hindfoot_length | weight | |
---|---|---|
0 | 32.0 | NaN |
1 | 33.0 | NaN |
2 | 37.0 | NaN |
3 | 36.0 | NaN |
4 | 35.0 | NaN |
... | ... | ... |
35544 | NaN | NaN |
35545 | NaN | NaN |
35546 | 15.0 | 14.0 |
35547 | 36.0 | 51.0 |
35548 | NaN | NaN |
35549 rows × 2 columns
Adding a new column to a DataFrame¶
In [25]:
import numpy as np
length_weight_df['sequence']=np.arange(df.shape[0])
length_weight_df['word'] = 'Nature'
length_weight_df
Out[25]:
hindfoot_length | weight | sequence | word | |
---|---|---|---|---|
0 | 32.0 | NaN | 0 | Nature |
1 | 33.0 | NaN | 1 | Nature |
2 | 37.0 | NaN | 2 | Nature |
3 | 36.0 | NaN | 3 | Nature |
4 | 35.0 | NaN | 4 | Nature |
... | ... | ... | ... | ... |
35544 | NaN | NaN | 35544 | Nature |
35545 | NaN | NaN | 35545 | Nature |
35546 | 15.0 | 14.0 | 35546 | Nature |
35547 | 36.0 | 51.0 | 35547 | Nature |
35548 | NaN | NaN | 35548 | Nature |
35549 rows × 4 columns
Selecting rows that contain specific values¶
In [26]:
df[df['species_id']=="DM"]
Out[26]:
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
7 | 8 | 7 | 16 | 1977 | 1 | DM | M | 37.0 | NaN |
8 | 9 | 7 | 16 | 1977 | 1 | DM | F | 34.0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
35532 | 35533 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 48.0 |
35533 | 35534 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 56.0 |
35534 | 35535 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 53.0 |
35535 | 35536 | 12 | 31 | 2002 | 14 | DM | F | 35.0 | 42.0 |
35536 | 35537 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 46.0 |
10596 rows × 9 columns