Intro to Pandas¶
Reading a csv file into a Pandas DataFrame¶
In [3]:
import pandas as pd
df = pd.read_csv('surveys.csv')
df
Out[3]:
| record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
| 1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
| 2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
| 3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
| 4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
| 35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
| 35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
| 35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
| 35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
35549 rows × 9 columns
Confirming that we have a Pandas DataFrame¶
In [4]:
type(df)
Out[4]:
pandas.core.frame.DataFrame
Viewing the first few entries in the DataFrame¶
In [5]:
df.head()
Out[5]:
| record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
| 1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
| 2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
| 3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
| 4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
Viewing the last few entries in the DataFrame¶
In [6]:
df.tail()
Out[6]:
| record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
|---|---|---|---|---|---|---|---|---|---|
| 35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
| 35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
| 35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
| 35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
| 35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
Determining the number of rows and columns in the DataFrame¶
In [11]:
df.shape
Out[11]:
(35549, 9)
Extracting the column names from the DataFrame¶
In [7]:
df.columns
Out[7]:
Index(['record_id', 'month', 'day', 'year', 'plot_id', 'species_id', 'sex',
'hindfoot_length', 'weight'],
dtype='object')
Selecting a column from the DataFrame¶
In [8]:
df['species_id']
Out[8]:
0 NL
1 NL
2 DM
3 DM
4 DM
...
35544 AH
35545 AH
35546 RM
35547 DO
35548 NaN
Name: species_id, Length: 35549, dtype: object
The selected column is a Series meaning it retains the original index column¶
In [9]:
type(df['hindfoot_length'])
Out[9]:
pandas.core.series.Series
Getting the values from the column as an array¶
In [10]:
df['hindfoot_length'].values
Out[10]:
array([32., 33., 37., ..., 15., 36., nan])
Creating a new DataFrame from a subset of columns¶
In [24]:
length_weight_df=df[['hindfoot_length','weight']]
length_weight_df
Out[24]:
| hindfoot_length | weight | |
|---|---|---|
| 0 | 32.0 | NaN |
| 1 | 33.0 | NaN |
| 2 | 37.0 | NaN |
| 3 | 36.0 | NaN |
| 4 | 35.0 | NaN |
| ... | ... | ... |
| 35544 | NaN | NaN |
| 35545 | NaN | NaN |
| 35546 | 15.0 | 14.0 |
| 35547 | 36.0 | 51.0 |
| 35548 | NaN | NaN |
35549 rows × 2 columns
Adding a new column to a DataFrame¶
In [25]:
import numpy as np
length_weight_df['sequence']=np.arange(df.shape[0])
length_weight_df['word'] = 'Nature'
length_weight_df
Out[25]:
| hindfoot_length | weight | sequence | word | |
|---|---|---|---|---|
| 0 | 32.0 | NaN | 0 | Nature |
| 1 | 33.0 | NaN | 1 | Nature |
| 2 | 37.0 | NaN | 2 | Nature |
| 3 | 36.0 | NaN | 3 | Nature |
| 4 | 35.0 | NaN | 4 | Nature |
| ... | ... | ... | ... | ... |
| 35544 | NaN | NaN | 35544 | Nature |
| 35545 | NaN | NaN | 35545 | Nature |
| 35546 | 15.0 | 14.0 | 35546 | Nature |
| 35547 | 36.0 | 51.0 | 35547 | Nature |
| 35548 | NaN | NaN | 35548 | Nature |
35549 rows × 4 columns
Selecting rows that contain specific values¶
In [26]:
df[df['species_id']=="DM"]
Out[26]:
| record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
| 3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
| 4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
| 7 | 8 | 7 | 16 | 1977 | 1 | DM | M | 37.0 | NaN |
| 8 | 9 | 7 | 16 | 1977 | 1 | DM | F | 34.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 35532 | 35533 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 48.0 |
| 35533 | 35534 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 56.0 |
| 35534 | 35535 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 53.0 |
| 35535 | 35536 | 12 | 31 | 2002 | 14 | DM | F | 35.0 | 42.0 |
| 35536 | 35537 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 46.0 |
10596 rows × 9 columns