Intro to Pandas¶

Reading a csv file into a Pandas DataFrame¶

Description of image
In [3]:
import pandas as pd
df = pd.read_csv('surveys.csv')
df
Out[3]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
... ... ... ... ... ... ... ... ... ...
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

35549 rows × 9 columns

Confirming that we have a Pandas DataFrame¶

In [4]:
type(df)
Out[4]:
pandas.core.frame.DataFrame

Viewing the first few entries in the DataFrame¶

In [5]:
df.head()
Out[5]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN

Viewing the last few entries in the DataFrame¶

In [6]:
df.tail()
Out[6]:
record_id month day year plot_id species_id sex hindfoot_length weight
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

Determining the number of rows and columns in the DataFrame¶

In [11]:
df.shape
Out[11]:
(35549, 9)

Extracting the column names from the DataFrame¶

In [7]:
df.columns
Out[7]:
Index(['record_id', 'month', 'day', 'year', 'plot_id', 'species_id', 'sex',
       'hindfoot_length', 'weight'],
      dtype='object')

Selecting a column from the DataFrame¶

In [8]:
df['species_id']
Out[8]:
0         NL
1         NL
2         DM
3         DM
4         DM
        ... 
35544     AH
35545     AH
35546     RM
35547     DO
35548    NaN
Name: species_id, Length: 35549, dtype: object

The selected column is a Series meaning it retains the original index column¶

In [9]:
type(df['hindfoot_length'])
Out[9]:
pandas.core.series.Series

Getting the values from the column as an array¶

In [10]:
df['hindfoot_length'].values
Out[10]:
array([32., 33., 37., ..., 15., 36., nan])

Creating a new DataFrame from a subset of columns¶

In [24]:
length_weight_df=df[['hindfoot_length','weight']]
length_weight_df
Out[24]:
hindfoot_length weight
0 32.0 NaN
1 33.0 NaN
2 37.0 NaN
3 36.0 NaN
4 35.0 NaN
... ... ...
35544 NaN NaN
35545 NaN NaN
35546 15.0 14.0
35547 36.0 51.0
35548 NaN NaN

35549 rows × 2 columns

Adding a new column to a DataFrame¶

In [25]:
import numpy as np
length_weight_df['sequence']=np.arange(df.shape[0])
length_weight_df['word'] = 'Nature'
length_weight_df
Out[25]:
hindfoot_length weight sequence word
0 32.0 NaN 0 Nature
1 33.0 NaN 1 Nature
2 37.0 NaN 2 Nature
3 36.0 NaN 3 Nature
4 35.0 NaN 4 Nature
... ... ... ... ...
35544 NaN NaN 35544 Nature
35545 NaN NaN 35545 Nature
35546 15.0 14.0 35546 Nature
35547 36.0 51.0 35547 Nature
35548 NaN NaN 35548 Nature

35549 rows × 4 columns

Selecting rows that contain specific values¶

In [26]:
df[df['species_id']=="DM"]
Out[26]:
record_id month day year plot_id species_id sex hindfoot_length weight
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
7 8 7 16 1977 1 DM M 37.0 NaN
8 9 7 16 1977 1 DM F 34.0 NaN
... ... ... ... ... ... ... ... ... ...
35532 35533 12 31 2002 14 DM F 36.0 48.0
35533 35534 12 31 2002 14 DM M 37.0 56.0
35534 35535 12 31 2002 14 DM M 37.0 53.0
35535 35536 12 31 2002 14 DM F 35.0 42.0
35536 35537 12 31 2002 14 DM F 36.0 46.0

10596 rows × 9 columns