Part 3 - Introduction to Pandas¶
Table of Contents
- 1 Introduction to Pandas
  - 1.1 Installation and Import
- 2 Working with Series
  - 2.1 Creating Series from a List or Array
  - 2.2 Creating Series from a Dictionary
- 3 Working with DataFrame
  - 3.1 Creating a DataFrame
    - 3.1.1 From a Dictionary
    - 3.1.2 From a Series
    - 3.1.3 From a Dictionary of Series
    - 3.1.4 From a two-dimensional NumPy Array
  - 3.2 Importing Data into a DataFrame
    - 3.2.1 Read from a csv
    - 3.2.2 Read from an excel file
    - 3.2.3 Read from a json
    - 3.2.4 Read from an API
  - 3.3 Navigating a DataFrame
    - 3.3.1 Selecting Columns
    - 3.3.2 Selection using loc and iloc
    - 3.3.3 Conditional Selection
    - 3.3.4 Selection using query()
      - 3.3.4.1 Subset based on Numeric Variable
      - 3.3.4.2 Subset based on Categorical Variable
      - 3.3.4.3 Subset based on Multiple Conditions
      - 3.3.4.4 Escape special characters
  - 3.4 Data Types in a DataFrame
    - 3.4.1 Checking Data Types
    - 3.4.2 Changing Data Types
      - 3.4.2.1 Using astype() function
      - 3.4.2.2 Custom Functions
      - 3.4.2.3 Pandas built-in functions
  - 3.5 Operating on DataFrames
    - 3.5.1 Arithmetic Operations
      - 3.5.1.1 Unary Operations
      - 3.5.1.2 Binary Operations
      - 3.5.1.3 Operations between DataFrame and Series
    - 3.5.2 Reindexing Data
      - 3.5.2.1 Reset Index
    - 3.5.3 Dropping Data
    - 3.5.4 Adding Data
    - 3.5.5 Renaming Columns
  - 3.6 Working with Missing Data
    - 3.6.1 Missing Data in Pandas
      - 3.6.1.1 None object
      - 3.6.1.2 NaN: Not a Number
      - 3.6.1.3 NaN and None
    - 3.6.2 Detect Missing Data
      - 3.6.2.1 Missing values in each column
      - 3.6.2.2 Missing values in DataFrame
      - 3.6.2.3 Using notnull()
    - 3.6.3 Visualize Missing Data
    - 3.6.4 Filter based on Missing Data
    - 3.6.5 Drop Missing Data
    - 3.6.6 Impute Missing Data
      - 3.6.6.1 Fill with a Constant Value
      - 3.6.6.2 Forward and Backward Fill
      - 3.6.6.3 Fill with Arithmetic or Aggregate Functions
      - 3.6.6.4 Fill using a Dictionary of Values
      - 3.6.6.5 Using interpolate()
- 4 Conclusion
- 5 References
Introduction to Pandas¶
In the previous notebook, we dove into the details of NumPy, which provides efficient storage and the ability to perform complex computations through its ndarray object for working with homogeneous numerical array data. Here, we will build on that knowledge by looking into the data structures provided by Pandas.
Pandas provides high-level data structures and functions designed to make working with structured or tabular data fast, easy, and expressive. It is based on two main data structures:
- Series: one-dimensional, such as a list of items
- DataFrame: two-dimensional, such as a table
Both Series
and DataFrame
objects build on the NumPy array structure and form the core data model for Pandas in Python. While NumPy is best suited for working with homogeneous data, Pandas is designed for working with tabular or heterogeneous data. NumPy serves its purpose very well when the data is well organized and clean, however its limitations become clear when working with less structured data where we need more flexibility such as attaching labels to data, working with missing data, grouping, pivoting etc.
Pandas blends the high-performance, array-computing ideas of NumPy with the flexible data manipulation capabilities of relational databases (such as SQL). Pandas provides sophisticated indexing functionality to reshape, slice and dice, perform aggregations, and select subsets of data. It also provides capabilities for easily handling missing data, adding/deleting columns, imputing missing data, and creating plots on the go. Pandas is a must-have tool for data wrangling and manipulation.
Installation and Import¶
A typical installation of Python API comes with Pandas. You can import it as:
In [1]:
import pandas as pd
import numpy as np
You can also check the version of Pandas that is installed:
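The version-check cell is not shown above; it is typically just the __version__ attribute:

pd.__version__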
Working with Series¶
A Pandas Series
is a one-dimensional array of indexed data. A Series
object contains a sequence of values and an associated array of data labels, called index
. While Numpy Array has an implicitly defined integer index that can be used to access the values, the index for a Pandas Series
can also be explicitly defined. Let's take a look at some examples.
Creating Series from a List or Array¶
A Series
can be created from a list or array as follows:
In [3]:
# Create Series
data1 = [1, 2, 3, 6, 7]
s1 = pd.Series(data1)
s1
Out[3]:
0 1 1 2 2 3 3 6 4 7 dtype: int64
The array representation and index object of the Series
can be accessed via its values
and index
attributes:
In [4]:
print('Values in series s1:', s1.values)
print('Index of series s1:', s1.index)
Values in series s1: [1 2 3 6 7] Index of series s1: RangeIndex(start=0, stop=5, step=1)
Like NumPy arrays, data in a Series
can be accessed by the associated index.
In [5]:
# Access elements of a Series
print(s1[0])
print(s1[3])
For Series
objects, the index need not be an integer and can be explicitly defined as follows:
In [6]:
s2 = pd.Series([1, 2, 3, 6, 7], index=['d', 'b', 'a', 'c', 'e'])
s2
Out[6]:
d 1 b 2 a 3 c 6 e 7 dtype: int64
In [7]:
# Check the index s2 . index
Out[7]:
Index(['d', 'b', 'a', 'c', 'e'], dtype='object')
In [8]:
# Access elements using index label
print('Element associated with index a:', s2['a'])
print('Element associated with index e:', s2['e'])
Element associated with index a: 3 Element associated with index e: 7
In [9]:
# Access elements using index position
print('Element associated with index position 2:', s2[2])
print('Element associated with last index position:', s2[-1])
Element associated with index position 2: 3 Element associated with last index position: 7
Creating Series from a Dictionary¶
Just like a dictionary maps keys to a set of values, a series can be thought of as a mapping of index values to data values. Let's create a series from a dictionary.
In [10]:
data = {'CA': 35000, 'TX': 71000, 'WA': 16000, 'OR': 5000}
s3 = pd.Series(data)
s3
Out[10]:
CA 35000 TX 71000 WA 16000 OR 5000 dtype: int64
By default, index for Series
will be created using the keys.
In [11]:
# Access elements using index label s3 [ 'CA' ]
In [12]:
# Access elements using index position s3 [ 0 ]
Working with DataFrame¶
A DataFrame
represents a rectangular table of data and contains an ordered collection of columns. You can think of it as a spreadsheet or SQL table where each column has a column name for reference and each row can be accessed by using row numbers. Column names and row numbers are known as column and row index.
DataFrame is a fundamental Pandas data structure in which each column can be of a different value type (numeric, string, boolean, etc.). A data set can be first read into a DataFrame and then various operations (i.e. indexing, grouping, aggregation etc.) can be easily applied to it.
Creating a DataFrame¶
There are many ways to construct a DataFrame. Let's look at some examples.
From a Dictionary¶
DataFrame
can be constructed from a dictionary of equal sized lists or NumPy arrays.
In [13]:
data = {'state': ['CA', 'WA', 'CA', 'WA', 'CA', 'WA'],
        'year': [2015, 2015, 2016, 2016, 2017, 2017],
        'population': [38.5, 7.0, 39.0, 7.25, 39.5, 7.5]}
df1 = pd.DataFrame(data)
df1
Out[13]:
state | year | population | |
---|---|---|---|
0 | CA | 2015 | 38.50 |
1 | WA | 2015 | 7.00 |
2 | CA | 2016 | 39.00 |
3 | WA | 2016 | 7.25 |
4 | CA | 2017 | 39.50 |
5 | WA | 2017 | 7.50 |
From a Series¶
DataFrame
is a collection of Series
objects. Let's look at constructing a DataFrame
from a single Series
object.
In [14]:
# Create Population series
pop_data = {'CA': 39.5, 'TX': 29, 'NY': 8.39, 'FL': 21.48, 'IL': 12.67}
population = pd.Series(pop_data)
In [15]:
# Create DataFrame from series
pop_df = pd.DataFrame(population, columns=['population'])
pop_df
Out[15]:
population | |
---|---|
CA | 39.50 |
TX | 29.00 |
NY | 8.39 |
FL | 21.48 |
IL | 12.67 |
From a Dictionary of Series¶
DataFrame
can also be constructed from a dictionary of Series
.
In [16]:
# Create Area series
area_data = {'CA': 155779.22, 'TX': 261231.71, 'NY': 47126.40, 'FL': 53624.76, 'IL': 55518.93}
area = pd.Series(area_data)
area
Out[16]:
CA 155779.22 TX 261231.71 NY 47126.40 FL 53624.76 IL 55518.93 dtype: float64
In [17]:
# Create DataFrame from dictionary of Series
df2 = pd.DataFrame({'population': population, 'land area': area})
df2
Out[17]:
population | land area | |
---|---|---|
CA | 39.50 | 155779.22 |
TX | 29.00 | 261231.71 |
NY | 8.39 | 47126.40 |
FL | 21.48 | 53624.76 |
IL | 12.67 | 55518.93 |
From a two-dimensional NumPy Array¶
DataFrame
can be constructed from a two-dimensional NumPy array by specifying the column names. An integer Index
is used if not specified.
In [18]:
pd . DataFrame ( np . random . rand ( 4 , 2 ), columns = [ 'Col_A' , 'Col_B' ], index = [ 'a' , 'b' , 'c' , 'd' ])
Out[18]:
Col_A | Col_B | |
---|---|---|
a | 0.028278 | 0.687096 |
b | 0.999634 | 0.443590 |
c | 0.750714 | 0.785287 |
d | 0.611652 | 0.651515 |
Importing Data into a DataFrame¶
Reading data into a DataFrame is one of the most common tasks in any data science problem. Pandas provides the ability to read data from various formats such as CSV, JSON, Excel, APIs, etc. directly into a DataFrame object. Let's look at how to read data from some common formats into a DataFrame.
Read from a csv¶
The read_csv() function can be used to read csv (comma-separated value) files. The function includes a number of different parameters, and you can read more about them in the pandas documentation.
Let's import the 'health.csv' file we used earlier in this guide series.
In [21]:
# Read the data from csv
df_csv = pd.read_csv('../data/health.csv')
df_csv.head()
Out[21]:
Number of Beds | Name | Address | City | State | Zip Code | |
---|---|---|---|---|---|---|
0 | 156 | Facility 1 | 2468 SOUTH ST ANDREWS PLACE | LOS ANGELES | CA | 90018 |
1 | 59 | Facility 2 | 2300 W. WASHINGTON BLVD. | LOS ANGELES | CA | 90018 |
2 | 25 | Facility 3 | 4060 E. WHITTIER BLVD. | LOS ANGELES | CA | 90023 |
3 | 49 | Facility 4 | 6070 W. PICO BOULEVARD | LOS ANGELES | CA | 90035 |
4 | 55 | Facility 5 | 1480 S. LA CIENEGA BL | LOS ANGELES | CA | 90035 |
In [22]:
# Confirm type type ( df_csv )
Out[22]:
pandas.core.frame.DataFrame
Read from an excel file¶
Tabular data is often stored using Microsoft Excel 2003 (and higher) and can be read using read_excel()
.
In [26]:
# Read the data from excel
df_excel = pd.read_excel('../data/health.xlsx')
df_excel
Out[26]:
Number of Beds | Name | Address | City | State | Zip Code | |
---|---|---|---|---|---|---|
0 | 156 | Facility 1 | 2468 SOUTH ST ANDREWS PLACE | LOS ANGELES | CA | 90018 |
1 | 59 | Facility 2 | 2300 W. WASHINGTON BLVD. | LOS ANGELES | CA | 90018 |
2 | 25 | Facility 3 | 4060 E. WHITTIER BLVD. | LOS ANGELES | CA | 90023 |
3 | 49 | Facility 4 | 6070 W. PICO BOULEVARD | LOS ANGELES | CA | 90035 |
4 | 55 | Facility 5 | 1480 S. LA CIENEGA BL | LOS ANGELES | CA | 90035 |
In [27]:
# Confirm type type ( df_excel )
Out[27]:
pandas.core.frame.DataFrame
Read from a json¶
read_json() can be used to read JSON (JavaScript Object Notation) files. JSON is mostly used to store unstructured data with key/value pairs. The function accepts a valid JSON string, path object or file-like object and does not consume a dictionary (key/value pair) directly. Let's read 'health.json' into a DataFrame.
In [28]:
# Read the data from json
df_json = pd.read_json('../data/health.json')
df_json
Out[28]:
Number of Beds | Name | Address | City | State | Zip Code | |
---|---|---|---|---|---|---|
0 | 156 | Facility 1 | 2468 SOUTH ST ANDREWS PLACE | LOS ANGELES | CA | 90018 |
1 | 59 | Facility 2 | 2300 W. WASHINGTON BLVD. | LOS ANGELES | CA | 90018 |
2 | 25 | Facility 3 | 4060 E. WHITTIER BLVD. | LOS ANGELES | CA | 90023 |
3 | 49 | Facility 4 | 6070 W. PICO BOULEVARD | LOS ANGELES | CA | 90035 |
4 | 55 | Facility 5 | 1480 S. LA CIENEGA BL | LOS ANGELES | CA | 90035 |
In [29]:
# Confirm type type ( df_json )
Out[29]:
pandas.core.frame.DataFrame
Read from an API¶
Many websites provide data through public APIs in json or other formats. One easy way to access these APIs from Python is using requests
package. Let's find issues for pandas on GitHub using the add-on requests library.
In [30]:
# Import library import requests
In [31]:
# Make request and store response
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
The response object's json
method will return a dictionary containing JSON. .keys()
method can be used to explore the structure of the returned JSON object.
In [32]:
# Get json
json_data = resp.json()
json_data[0].keys()
Out[32]:
dict_keys(['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'performed_via_github_app'])
Since each element in json_data
is a dictionary, you can create a DataFrame using specific columns from the data.
In [33]:
# Create dataframe for specific cols
issues = pd.DataFrame(json_data, columns=['number', 'title', 'labels', 'state', 'created_at'])
issues.head()
Out[33]:
number | title | labels | state | created_at | |
---|---|---|---|---|---|
0 | 38070 | BUG: Index.drop raising Error when Index has d... | [] | open | 2020-11-25T20:11:11Z |
1 | 38069 | ENH: Rename multi-level columns or indices usi... | [{'id': 76812, 'node_id': 'MDU6TGFiZWw3NjgxMg=... | open | 2020-11-25T19:57:28Z |
2 | 38068 | ENH: NDArrayBackedExtensionArray.__array_funct... | [] | open | 2020-11-25T18:32:41Z |
3 | 38067 | DOC: Include missing holiday observance rules | [{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OT... | open | 2020-11-25T17:25:35Z |
4 | 38066 | RLS: 1.1.5 | [{'id': 131473665, 'node_id': 'MDU6TGFiZWwxMzE... | open | 2020-11-25T15:51:33Z |
To create a DataFrame
with all the data directly from the response object, "json" library can be used. Since read_json()
accepts a valid JSON string, json.dumps()
can be used to convert the object back to a string.
In [34]:
# Import library import json
In [35]:
# Read all data from response object's json method
all_issues = pd.read_json(json.dumps(json_data))
all_issues.head()
Out[35]:
url | repository_url | labels_url | comments_url | events_url | html_url | id | node_id | number | title | ... | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | pull_request | body | performed_via_github_app | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | https://api.github.com/repos/pandas-dev/pandas... | https://api.github.com/repos/pandas-dev/pandas | https://api.github.com/repos/pandas-dev/pandas... | https://api.github.com/repos/pandas-dev/pandas... | https://api.github.com/repos/pandas-dev/pandas... | https://github.com/pandas-dev/pandas/pull/38070 | 751090409 | MDExOlB1bGxSZXF1ZXN0NTI3NjcwMDk4 | 38070 | BUG: Index.drop raising Error when Index has d... | ... | None | 0 | 2020-11-25 20:11:11+00:00 | 2020-11-25 20:39:16+00:00 | NaT | MEMBER | NaN | {'url': 'https://api.github.com/repos/pandas-d... | - [x] closes #38051\r\n- [x] closes #33494\r\n... | NaN |
1 | https://api.github.com/repos/pandas-dev/pandas... | https://api.github.com/repos/pandas-dev/pandas | https://api.github.com/repos/pandas-dev/pandas... | https://api.github.com/repos/pandas-dev/pandas... | https://api.github.com/repos/pandas-dev/pandas... | https://github.com/pandas-dev/pandas/issues/38069 | 751083108 | MDU6SXNzdWU3NTEwODMxMDg= | 38069 | ENH: Rename multi-level columns or indices usi... | ... | None | 0 | 2020-11-25 19:57:28+00:00 | 2020-11-25 19:57:28+00:00 | NaT | NONE | NaN | NaN | It's currently quite difficult to rename a sin... | NaN |
2 | https://api.github.com/repos/pandas-dev/pandas... | https://api.github.com/repos/pandas-dev/pandas | https://api.github.com/repos/pandas-dev/pandas... | https://api.github.com/repos/pandas-dev/pandas... | https://api.github.com/repos/pandas-dev/pandas... | https://github.com/pandas-dev/pandas/pull/38068 | 751038480 | MDExOlB1bGxSZXF1ZXN0NTI3NjI3Njc2 | 38068 | ENH: NDArrayBackedExtensionArray.__array_funct... | ... | None | 1 | 2020-11-25 18:32:41+00:00 | 2020-11-25 18:49:53+00:00 | NaT | MEMBER | NaN | {'url': 'https://api.github.com/repos/pandas-d... | motivated by getting np.delete and np.repeat w... | NaN |
3 | https://api.github.com/repos/pandas-dev/pandas... | https://api.github.com/repos/pandas-dev/pandas | https://api.github.com/repos/pandas-dev/pandas... | https://api.github.com/repos/pandas-dev/pandas... | https://api.github.com/repos/pandas-dev/pandas... | https://github.com/pandas-dev/pandas/issues/38067 | 750998859 | MDU6SXNzdWU3NTA5OTg4NTk= | 38067 | DOC: Include missing holiday observance rules | ... | None | 0 | 2020-11-25 17:25:35+00:00 | 2020-11-25 17:28:13+00:00 | NaT | NONE | NaN | NaN | #### Location of the documentation\r\n\r\n[Tim... | NaN |
4 | https://api.github.com/repos/pandas-dev/pandas... | https://api.github.com/repos/pandas-dev/pandas | https://api.github.com/repos/pandas-dev/pandas... | https://api.github.com/repos/pandas-dev/pandas... | https://api.github.com/repos/pandas-dev/pandas... | https://github.com/pandas-dev/pandas/issues/38066 | 750932114 | MDU6SXNzdWU3NTA5MzIxMTQ= | 38066 | RLS: 1.1.5 | ... | {'url': 'https://api.github.com/repos/pandas-d... | 3 | 2020-11-25 15:51:33+00:00 | 2020-11-25 16:11:36+00:00 | NaT | MEMBER | NaN | NaN | Tracking issue for the 1.1.5 release.\r\n\r\nh... | NaN |
5 rows × 26 columns
Navigating a DataFrame¶
Elements or subsets of a DataFrame can be accessed in multiple ways. We can use [] or slice notation, marked by the colon (:) character, to access subsets of data. Indexing and slicing to select subsets of a DataFrame can be performed:
- By location, like lists
- By label, like dictionaries
- By boolean masks like NumPy arrays
Special indexing operators such as loc
and iloc
also enable selection of a subset of the rows and columns from a DataFrame. Let's look at some examples.
Selecting Columns¶
Columns of the DataFrame are essentially Series
objects that can be accessed via dictionary-style indexing.
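The input cell is not shown here; for the population DataFrame built above it would be dictionary-style indexing such as:

df2['population']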
Out[36]:
CA 39.50 TX 29.00 NY 8.39 FL 21.48 IL 12.67 Name: population, dtype: float64
Individual columns can also be selected using attribute-style access.
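The corresponding cell is not shown either; attribute-style access looks like:

df2.population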
Out[37]:
CA 39.50 TX 29.00 NY 8.39 FL 21.48 IL 12.67 Name: population, dtype: float64
Selection using loc and iloc¶
Special indexing operators such as loc and iloc can be used to select a subset of the rows and columns from a DataFrame.
- .loc for label-based indexing can be used to index the data in an array-like style by specifying index and column names
- .iloc for positional indexing can be used to index the underlying array as if it is a simple NumPy array.
Let's add another column to df2
DataFrame and then look at some examples.
In [38]:
# Add column
df2['density'] = df2['population'] * 1000000 / df2['land area']
df2
Out[38]:
population | land area | density | |
---|---|---|---|
CA | 39.50 | 155779.22 | 253.563986 |
TX | 29.00 | 261231.71 | 111.012557 |
NY | 8.39 | 47126.40 | 178.031846 |
FL | 21.48 | 53624.76 | 400.561233 |
IL | 12.67 | 55518.93 | 228.210450 |
Selection using .iloc
.iloc
is used for positional indexing.
In [39]:
# First two rows and all columns df2 . iloc [: 2 , :]
Out[39]:
population | land area | density | |
---|---|---|---|
CA | 39.5 | 155779.22 | 253.563986 |
TX | 29.0 | 261231.71 | 111.012557 |
In [40]:
# Last two rows and first two columns df2 . iloc [ - 2 :, : 2 ]
Out[40]:
population | land area | |
---|---|---|
FL | 21.48 | 53624.76 |
IL | 12.67 | 55518.93 |
In [41]:
# Middle subset of rows and columns df2 . iloc [ 2 : 4 , 1 : 2 ]
Out[41]:
land area | |
---|---|
NY | 47126.40 |
FL | 53624.76 |
Selection using .loc
.loc
is used for label-based indexing by specifying index and column names.
In [42]:
# Select rows until 'TX' and all columns df2 . loc [: 'TX' , :]
Out[42]:
population | land area | density | |
---|---|---|---|
CA | 39.5 | 155779.22 | 253.563986 |
TX | 29.0 | 261231.71 | 111.012557 |
In [43]:
# Select rows from 'FL' and columns until 'area' df2 . loc [ 'FL' :, : 'land area' ]
Out[43]:
population | land area | |
---|---|---|
FL | 21.48 | 53624.76 |
IL | 12.67 | 55518.93 |
In [44]:
# Middle subset of rows from NY to FL and columns from 'area' to 'density' df2 . loc [ 'NY' : 'FL' , 'land area' : 'density' ]
Out[44]:
land area | density | |
---|---|---|
NY | 47126.40 | 178.031846 |
FL | 53624.76 | 400.561233 |
Conditional Selection¶
Boolean masks can be used to conditionally select specific subsets of the data. They select the elements of a DataFrame that satisfy some condition. Let's take a look.
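The masking expression that produced the boolean Series below is not shown; it was presumably a comparison like:

df2['density'] > 200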
Out[45]:
CA True TX False NY False FL True IL True Name: density, dtype: bool
In [46]:
# Select data where density > 200 df2 [ df2 [ 'density' ] > 200 ]
Out[46]:
population | land area | density | |
---|---|---|---|
CA | 39.50 | 155779.22 | 253.563986 |
FL | 21.48 | 53624.76 | 400.561233 |
IL | 12.67 | 55518.93 | 228.210450 |
In [47]:
# Select data for only those state where area > 50000 and return first 2 rows df2 [ df2 [ 'land area' ] > 50000 ] . iloc [: 2 ]
Out[47]:
population | land area | density | |
---|---|---|---|
CA | 39.5 | 155779.22 | 253.563986 |
TX | 29.0 | 261231.71 | 111.012557 |
Selection using query()¶
query() enables you to "query" a DataFrame and retrieve subsets based on logical conditions. It acts similar to the WHERE clause in SQL, making it much easier to read and understand. query() uses string expressions to efficiently compute operations on a DataFrame and offers more efficient computation compared to a masking expression. It also provides more efficient memory use than pure Python operations.
Let's add some data to df2
and take a look.
In [48]:
df2 [ 'drought' ] = [ 'Yes' , 'Yes' , 'No' , 'Yes' , 'No' ]
Out[49]:
population | land area | density | drought | |
---|---|---|---|---|
CA | 39.50 | 155779.22 | 253.563986 | Yes |
TX | 29.00 | 261231.71 | 111.012557 | Yes |
NY | 8.39 | 47126.40 | 178.031846 | No |
FL | 21.48 | 53624.76 | 400.561233 | Yes |
IL | 12.67 | 55518.93 | 228.210450 | No |
Subset based on Numeric Variable¶
In [50]:
df2 . query ( 'population > 20' )
Out[50]:
population | land area | density | drought | |
---|---|---|---|---|
CA | 39.50 | 155779.22 | 253.563986 | Yes |
TX | 29.00 | 261231.71 | 111.012557 | Yes |
FL | 21.48 | 53624.76 | 400.561233 | Yes |
Subset based on Categorical Variable¶
In [51]:
df2 . query ( 'drought == "No"' )
Out[51]:
population | land area | density | drought | |
---|---|---|---|---|
NY | 8.39 | 47126.40 | 178.031846 | No |
IL | 12.67 | 55518.93 | 228.210450 | No |
Subset based on Multiple Conditions¶
Condition 1: population > 20 and density < 200
In [52]:
# AND operator df2 . query ( '(population > 20) and (density < 200)' )
Out[52]:
population | land area | density | drought | |
---|---|---|---|---|
TX | 29.0 | 261231.71 | 111.012557 | Yes |
Condition 2: population < 25 or drought == "No"
In [53]:
# OR operator df2 . query ( '(population < 25) or (drought == "No")' )
Out[53]:
population | land area | density | drought | |
---|---|---|---|---|
NY | 8.39 | 47126.40 | 178.031846 | No |
FL | 21.48 | 53624.76 | 400.561233 | Yes |
IL | 12.67 | 55518.93 | 228.210450 | No |
Condition 3: population < 20 and index in ["NY", "IL"]
In [54]:
# IN operator df2 . query ( '(population < 20) and (index in ["NY", "IL"])' )
Out[54]:
population | land area | density | drought | |
---|---|---|---|---|
NY | 8.39 | 47126.40 | 178.031846 | No |
IL | 12.67 | 55518.93 | 228.210450 | No |
Condition 4: Like operation
Although "like" is not supported, it can be simulated using string operations and specifying engine='python'
.
In [55]:
df2 . query ( 'drought.str.contains("e")' , engine = 'python' )
Out[55]:
population | land area | density | drought | |
---|---|---|---|---|
CA | 39.50 | 155779.22 | 253.563986 | Yes |
TX | 29.00 | 261231.71 | 111.012557 | Yes |
FL | 21.48 | 53624.76 | 400.561233 | Yes |
Escape special characters¶
The land area column name contains a whitespace, which may cause issues when a query gets executed. Wrap column names in backticks to escape special characters such as whitespace.
In [56]:
df2 . query ( '`land area` > 60000' )
Out[56]:
population | land area | density | drought | |
---|---|---|---|---|
CA | 39.5 | 155779.22 | 253.563986 | Yes |
TX | 29.0 | 261231.71 | 111.012557 | Yes |
Data Types in a DataFrame¶
The way information is stored in a DataFrame affects what we can do with it and the outputs of calculations on it. When doing data analysis, it is important to make sure you are using the correct data types; otherwise you may get unexpected results or errors. Common data types available in Pandas are object, int64, float64, datetime64 and bool.
An object
data type can contain multiple different types such as integers, floats and strings. These different data types when included in a single column are collectively labeled as an object
.
Let's expand the df2 DataFrame to add year, month, day, gdp and rainfall columns and explore various data types.
In [57]:
# Add Columns
df2['year'] = [2019, 2019, 2020, 2019, 2020]
df2['month'] = [2, 5, 7, 9, 11]
df2['day'] = [27, 12, 10, 7, 17]
df2['gdp'] = ['3183', '1918', '1751', '1111', '908']
df2['rainfall'] = ['22.9', 35, 42.87, '54.73', 'No Value']
df2
Out[57]:
population | land area | density | drought | year | month | day | gdp | rainfall | |
---|---|---|---|---|---|---|---|---|---|
CA | 39.50 | 155779.22 | 253.563986 | Yes | 2019 | 2 | 27 | 3183 | 22.9 |
TX | 29.00 | 261231.71 | 111.012557 | Yes | 2019 | 5 | 12 | 1918 | 35 |
NY | 8.39 | 47126.40 | 178.031846 | No | 2020 | 7 | 10 | 1751 | 42.87 |
FL | 21.48 | 53624.76 | 400.561233 | Yes | 2019 | 9 | 7 | 1111 | 54.73 |
IL | 12.67 | 55518.93 | 228.210450 | No | 2020 | 11 | 17 | 908 | No Value |
Checking Data Types¶
Using the dtypes
property of a DataFrame
, we can check the different Data Types of each column in a DataFrame.
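The cell itself is not shown; checking the types is simply:

df2.dtypes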
Out[58]:
population float64 land area float64 density float64 drought object year int64 month int64 day int64 gdp object rainfall object dtype: object
The info() function can also be used to get more detailed information; however, it is a more time-consuming operation.
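The call behind the summary below (cell not shown) is presumably:

df2.info()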
<class 'pandas.core.frame.DataFrame'> Index: 5 entries, CA to IL Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 population 5 non-null float64 1 land area 5 non-null float64 2 density 5 non-null float64 3 drought 5 non-null object 4 year 5 non-null int64 5 month 5 non-null int64 6 day 5 non-null int64 7 gdp 5 non-null object 8 rainfall 5 non-null object dtypes: float64(3), int64(3), object(3) memory usage: 560.0+ bytes
By looking at the data types of different columns, we can identify a few issues:
- year, month and day columns are of type int64 and should be converted to datetime64.
- gdp is of type object and should be float64.
- drought, a Yes/No column, is of type object and should be converted to bool.
- rainfall is also of type object and should be converted to float64.
Changing Data Types¶
In Pandas, there are three options to convert the data type of a column:
- Using the astype() function to force the appropriate data type.
- Writing a custom function.
- Using pandas built-in functions such as to_numeric() and to_datetime().
Let's look at each of these options.
Using astype() function¶
astype()
is the simplest way to convert the data type. Let's start by converting the gdp
column of type object
to float64
data type.
In [60]:
df2 [ 'gdp' ] . astype ( 'float' )
Out[60]:
CA 3183.0 TX 1918.0 NY 1751.0 FL 1111.0 IL 908.0 Name: gdp, dtype: float64
Out[61]:
population float64 land area float64 density float64 drought object year int64 month int64 day int64 gdp object rainfall object dtype: object
Simply running astype()
on a column only returns a copy of the column. To ensure the change gets applied to the DataFrame, we need to assign it back.
In [62]:
# Change data type
df2['gdp'] = df2['gdp'].astype('float')

# Check data type
df2['gdp'].dtype
Now let's try changing the data types of drought (dtype object) to bool and rainfall (dtype object) to float64.

Change drought to bool type
In [63]:
df2 [ 'drought' ] . astype ( 'bool' )
Out[63]:
CA True TX True NY True FL True IL True Name: drought, dtype: bool
Here we see that after type conversion, all values are interpreted as True; however, the column consisted of both Yes and No values. By default, if there is a value in the column, Python evaluates it to True irrespective of the type of value. Learn more about truth value evaluation in the Python documentation.
Change rainfall to float64 type
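The attempted conversion (cell not shown) would be along these lines:

df2['rainfall'].astype('float')   # raises ValueError because of the 'No Value' string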
This results in a ValueError, which means the conversion of rainfall to the float64 data type did not work.
astype()
works when the data is clean and when you want to convert a numeric to a string object. However, when the data is not homogeneous (i.e. object
) or has characters, astype
may not be the right way to change the data type of a column.
Let's look at the other options of converting data types (mentioned above) to see if we can fix these issues.
Custom Functions¶
We can write custom functions and then apply them to convert the data type of a column to an appropriate type. Let's change the data type of drought
(dtype object) to bool
.
In [64]:
# Create function
def str2bool(val):
    """
    Converts a string data type to boolean
    - converts string values to lowercase
    - checks for multiple positive values
    - returns the correct boolean type
    """
    val = val.lower()                           # converts string value to lowercase
    if val in ("yes", "y", "true", "t", "1"):   # checks for multiple positive values
        return True
    return False
We can now use the apply
function to apply the str2bool
function to all values of the column.
Remember that simply applying a function to a column only returns a copy of the column. To ensure the change gets applied to the DataFrame, we need to assign it back.
In [65]:
# Apply the function
df2['drought'] = df2['drought'].apply(str2bool)

# Check type
df2['drought'].dtype
In [66]:
# Look at column values df2 [ 'drought' ]
Out[66]:
CA True TX True NY False FL True IL False Name: drought, dtype: bool
The beauty of custom functions is that they open up a gateway of opportunities. With a simple function, we could consider multiple string values such as "yes", "y", "true", "t", "1"
. As a result, the column can essentially have any of these values and the function will return True
.
This was just a simple function and much more complex functions can be written to address various issues in the data.
Using np.where()¶
Another popular approach that is useful for many types of problems is to use np.where()
function. Let's convert the data type of drought
back to object
and then take a look at using np.where()
.
In [67]:
df2['drought'] = ['Y', 'Y', 'N', 'Y', 'N']
df2['drought'].dtype
In [68]:
# Use np.where() df2 [ 'drought' ] = np . where ( df2 [ 'drought' ] == 'Y' , True , False )
In [69]:
# Check values df2 [ 'drought' ]
Out[69]:
CA True TX True NY False FL True IL False Name: drought, dtype: bool
The np.where() function converts all Y values to True and everything else is changed to False.
Pandas built-in functions¶
Pandas built-in helper functions, such as to_numeric()
and to_datetime()
, can be very useful for converting certain data type.
Convert rainfall column (type object) to float¶
In [70]:
# Look at rainfall column df2 [ 'rainfall' ]
Out[70]:
CA 22.9 TX 35 NY 42.87 FL 54.73 IL No Value Name: rainfall, dtype: object
In [71]:
# Check the data type of each value in the column df2 [ 'rainfall' ] . apply ( type )
Out[71]:
CA <class 'str'> TX <class 'int'> NY <class 'float'> FL <class 'str'> IL <class 'str'> Name: rainfall, dtype: object
The rainfall
column contains values of multiple different types, such as integers, floats and strings. Let's use the to_numeric
function to change data type to float64
.
In [72]:
pd . to_numeric ( df2 [ 'rainfall' ], errors = 'coerce' )
Out[72]:
CA 22.90 TX 35.00 NY 42.87 FL 54.73 IL NaN Name: rainfall, dtype: float64
By passing errors=coerce
, the function replaces the invalid "No Value" value with a NaN
.
To ensure the change gets applied to the DataFrame, we need to assign it back. Here is another way of applying to_numeric
using the apply
function.
In [73]:
df2 [ 'rainfall' ] = df2 [ 'rainfall' ] . apply ( pd . to_numeric , errors = 'coerce' )
Out[74]:
CA 22.90 TX 35.00 NY 42.87 FL 54.73 IL NaN Name: rainfall, dtype: float64
Convert year, month, day columns (type int) to datetime64¶
The year, month, day
columns can be combined into a single new date
column with the correct data type. Let's convert the date columns to datetime64
type using pd.to_datetime()
. We will dive into the details of individual date or time columns in a later section of this guide series.
In [75]:
df2 [ 'date' ] = pd . to_datetime ( df2 [[ 'year' , 'month' , 'day' ]])
Out[76]:
CA 2019-02-27 TX 2019-05-12 NY 2020-07-10 FL 2019-09-07 IL 2020-11-17 Name: date, dtype: datetime64[ns]
Now, the data types of df2
DataFrame are all cleaned up. Let's take a look.
Out[77]:
population float64 land area float64 density float64 drought bool year int64 month int64 day int64 gdp float64 rainfall float64 date datetime64[ns] dtype: object
Operating on DataFrames¶
Pandas is designed to work with NumPy and essentially inherits the ability to perform quick element-wise operations with basic arithmetic (add, subtract, multiply, etc.) and more sophisticated operations (trigonometric, exponential and logarithmic functions, etc.) from NumPy.
There are various other ways in which users can interact with the data in a DataFrame, such as reindexing data, dropping data, adding data, renaming columns etc. Let's explore how we can operate on the data in a DataFrame.
Arithmetic Operations¶
In Pandas,
- Unary ufuncs (operate on a single input): such as exponential and logarithmic functions, preserve index and column labels in the output.
- Binary ufuncs (operate on two inputs): such as addition and multiplication, automatically align indices and return a DataFrame whose index and columns are the unions of the ones in each DataFrame.
Let's create sample data to demonstrate this.
In [78]:
# Create data
rng = np.random.RandomState(55)

df3 = pd.DataFrame(rng.randint(1, 20, (3, 3)),
                   columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
print('df3 is:\n', df3)
print()

df4 = pd.DataFrame(rng.randint(1, 25, (4, 3)),
                   columns=list('bde'),
                   index=['Arizona', 'Ohio', 'Texas', 'Iowa'])
print('df4 is:\n', df4)
df3 is: b c d Ohio 14 8 9 Texas 6 6 2 Colorado 17 4 15 df4 is: b d e Arizona 14 13 16 Ohio 8 2 22 Texas 24 9 20 Iowa 11 8 1
Unary Operations¶
When applying a NumPy ufunc on DataFrame object, the result will be a Pandas object with the indices preserved.
In [79]:
print('Exponentiation:\n', np.power(df3, 2))
print()
print('Square Root:\n', np.sqrt(df3))
print()
print('ln(x)\n', np.log(df4))
Exponentiation: b c d Ohio 196 64 81 Texas 36 36 4 Colorado 289 16 225 Square Root: b c d Ohio 3.741657 2.828427 3.000000 Texas 2.449490 2.449490 1.414214 Colorado 4.123106 2.000000 3.872983 ln(x) b d e Arizona 2.639057 2.564949 2.772589 Ohio 2.079442 0.693147 3.091042 Texas 3.178054 2.197225 2.995732 Iowa 2.397895 2.079442 0.000000
Binary Operations¶
Pandas will automatically align indices and return a DataFrame whose index and columns are the unions of the ones in each DataFrame.
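The cell that produced the result below is not shown; with operator syntax it is simply:

df3 + df4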
Out[80]:
b | c | d | e | |
---|---|---|---|---|
Arizona | NaN | NaN | NaN | NaN |
Colorado | NaN | NaN | NaN | NaN |
Iowa | NaN | NaN | NaN | NaN |
Ohio | 22.0 | NaN | 11.0 | NaN |
Texas | 30.0 | NaN | 11.0 | NaN |
Since the c
and e
columns are not found in both DataFrame objects, they appear as all missing in the result. The same holds for the rows whose labels are not common in both DataFrame.
We can also use Pandas methods to perform arithmetic operations. Adding these DataFrames will result in NA values in the locations that don't overlap.
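The method-based cell is not shown; it was presumably:

df3.add(df4)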
Out[81]:
b | c | d | e | |
---|---|---|---|---|
Arizona | NaN | NaN | NaN | NaN |
Colorado | NaN | NaN | NaN | NaN |
Iowa | NaN | NaN | NaN | NaN |
Ohio | 22.0 | NaN | 11.0 | NaN |
Texas | 30.0 | NaN | 11.0 | NaN |
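The next output appears to be a subtraction; the missing cell was likely:

df3.sub(df4)   # or equivalently df3 - df4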
Out[82]:
b | c | d | e | |
---|---|---|---|---|
Arizona | NaN | NaN | NaN | NaN |
Colorado | NaN | NaN | NaN | NaN |
Iowa | NaN | NaN | NaN | NaN |
Ohio | 6.0 | NaN | 7.0 | NaN |
Texas | -18.0 | NaN | -7.0 | NaN |
In [83]:
# Multiplication df3 * df4
Out[83]:
b | c | d | e | |
---|---|---|---|---|
Arizona | NaN | NaN | NaN | NaN |
Colorado | NaN | NaN | NaN | NaN |
Iowa | NaN | NaN | NaN | NaN |
Ohio | 112.0 | NaN | 18.0 | NaN |
Texas | 144.0 | NaN | 18.0 | NaN |
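The output below looks like element-wise division; the missing cell was presumably:

df3.div(df4)   # or df3 / df4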
Out[84]:
b | c | d | e | |
---|---|---|---|---|
Arizona | NaN | NaN | NaN | NaN |
Colorado | NaN | NaN | NaN | NaN |
Iowa | NaN | NaN | NaN | NaN |
Ohio | 1.75 | NaN | 4.500000 | NaN |
Texas | 0.25 | NaN | 0.222222 | NaN |
Operations between DataFrame and Series¶
Operations between a DataFrame and a Series are similar to the operations between a two-dimensional and a one-dimensional NumPy array. The index and column alignment is maintained when applying arithmetic operations between a DataFrame and a Series.
As an example, when we subtract a two-dimensional array and one of its rows, the subtraction is performed once for each row based on NumPy's broadcasting rules. Operations between a DataFrame and a Series are similar.
Out[85]:
b | c | d | |
---|---|---|---|
Ohio | 14 | 8 | 9 |
Texas | 6 | 6 | 2 |
Colorado | 17 | 4 | 15 |
In [86]:
# Subtract row for `Ohio` from df3 df3 - df3 . iloc [ 0 ]
Out[86]:
b | c | d | |
---|---|---|---|
Ohio | 0 | 0 | 0 |
Texas | -8 | -2 | -7 |
Colorado | 3 | -4 | 6 |
The resulting DataFrame shows element values when the row for Ohio
gets subtracted from the DataFrame.
For operating on columns, you can use the object methods while specifying the axis
keyword.
In [87]:
df3 . subtract ( df3 [ 'c' ], axis = 'columns' )
Out[87]:
Colorado | Ohio | Texas | b | c | d | |
---|---|---|---|---|---|---|
Ohio | NaN | NaN | NaN | NaN | NaN | NaN |
Texas | NaN | NaN | NaN | NaN | NaN | NaN |
Colorado | NaN | NaN | NaN | NaN | NaN | NaN |
The intent here is to subtract column c from the DataFrame; however, the result is all NaN because with axis='columns' the index of df3['c'] (the state names) is aligned against the column labels (b, c, d), so nothing matches. To subtract column c from each column row-wise, use axis='index' (or axis=0) instead.
Reindexing Data¶
reindex is an important Pandas method that conforms the data to a new index. reindex, when applied to a DataFrame, can alter either the (row) index, the columns, or both. Let's take a look at reindexing.
In [88]:
# Create data
df5 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                   index=['a', 'c', 'e'],
                   columns=['CA', 'TX', 'OH'])
df5
Out[88]:
CA | TX | OH | |
---|---|---|---|
a | 0 | 1 | 2 |
c | 3 | 4 | 5 |
e | 6 | 7 | 8 |
In [89]:
df5 = df5.reindex(['e', 'd', 'c', 'b', 'a'])
df5
Out[89]:
CA | TX | OH | |
---|---|---|---|
e | 6.0 | 7.0 | 8.0 |
d | NaN | NaN | NaN |
c | 3.0 | 4.0 | 5.0 |
b | NaN | NaN | NaN |
a | 0.0 | 1.0 | 2.0 |
The index for this DataFrame is now updated. For the new indices b and d that were added with reindexing, Pandas has automatically generated NaN values.
Reindexing Columns
Columns can be reindexed using the columns
keyword.
In [90]:
df5 = df5.reindex(columns=['CA', 'OH', 'TX'])
df5
Out[90]:
CA | OH | TX | |
---|---|---|---|
e | 6.0 | 8.0 | 7.0 |
d | NaN | NaN | NaN |
c | 3.0 | 5.0 | 4.0 |
b | NaN | NaN | NaN |
a | 0.0 | 2.0 | 1.0 |
Reindexing Rows and Columns
reindex
can be used more succinctly by label-indexing with loc
.
In [91]:
df5 = df5.loc[['a', 'b', 'c', 'd', 'e'], ['TX', 'OH', 'CA']]
df5
Out[91]:
TX | OH | CA | |
---|---|---|---|
a | 1.0 | 2.0 | 0.0 |
b | NaN | NaN | NaN |
c | 4.0 | 5.0 | 3.0 |
d | NaN | NaN | NaN |
e | 7.0 | 8.0 | 6.0 |
Reset Index¶
reset_index() can be used to reset the index of a DataFrame to a default index. This operation turns the index labels into columns and assigns a default numerical index to the DataFrame. To modify the DataFrame, use inplace=True.
reset_index()
comes in handy when working with data that has multiple levels of indexing. We will dig into the details of MultiIndex
objects in the next part of this guide series. For now, let's take a quick look at how it works.
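The cell is not shown; resetting the index of df5 is simply:

df5.reset_index()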
Out[92]:
index | TX | OH | CA | |
---|---|---|---|---|
0 | a | 1.0 | 2.0 | 0.0 |
1 | b | NaN | NaN | NaN |
2 | c | 4.0 | 5.0 | 3.0 |
3 | d | NaN | NaN | NaN |
4 | e | 7.0 | 8.0 | 6.0 |
drop
parameter can be used to avoid the old index being added as a column.
In [93]:
df5 . reset_index ( drop = True )
Out[93]:
TX | OH | CA | |
---|---|---|---|
0 | 1.0 | 2.0 | 0.0 |
1 | NaN | NaN | NaN |
2 | 4.0 | 5.0 | 3.0 |
3 | NaN | NaN | NaN |
4 | 7.0 | 8.0 | 6.0 |
Dropping Data¶
Dropping rows or columns comes in handy when cleaning your data. The drop()
function can be easily applied.
Functions that modify the size or shape of a DataFrame return a new object so that the original data remains unchanged. To apply changes to existing DataFrame, we need to either assign the function back to DataFrame or use inplace
keyword.
Dropping rows
Calling drop with index labels will drop values from the row.
In [94]:
# Drop rows df5 . drop ([ 'b' , 'd' ])
Out[94]:
TX | OH | CA | |
---|---|---|---|
a | 1.0 | 2.0 | 0.0 |
c | 4.0 | 5.0 | 3.0 |
e | 7.0 | 8.0 | 6.0 |
The result of drop
operation is a new object as it does not modify the original DataFrame.
Out[95]:
TX | OH | CA | |
---|---|---|---|
a | 1.0 | 2.0 | 0.0 |
b | NaN | NaN | NaN |
c | 4.0 | 5.0 | 3.0 |
d | NaN | NaN | NaN |
e | 7.0 | 8.0 | 6.0 |
To modify the DataFrame, we will use inplace
keyword.
In [96]:
df5.drop(['b', 'd'], inplace=True)
df5
Out[96]:
TX | OH | CA | |
---|---|---|---|
a | 1.0 | 2.0 | 0.0 |
c | 4.0 | 5.0 | 3.0 |
e | 7.0 | 8.0 | 6.0 |
Dropping Columns
Columns can be dropped by passing a value to the axis
keyword: axis=1
or axis='columns'
.
In [97]:
df5 . drop ( 'CA' , axis = 'columns' )
Out[97]:
TX | OH | |
---|---|---|
a | 1.0 | 2.0 |
c | 4.0 | 5.0 |
e | 7.0 | 8.0 |
Adding Data¶
A key aspect of data exploration is feature engineering, i.e. adding new features to the data. New columns can be easily added to a DataFrame using the following methods.
Using Values in a List
A new column can be added with default values from a list.
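The assignment cell is not shown; based on the output below it was presumably something like:

df5['FL'] = [5, 8, 10]
df5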
Out[98]:
TX | OH | CA | FL | |
---|---|---|---|---|
a | 1.0 | 2.0 | 0.0 | 5 |
c | 4.0 | 5.0 | 3.0 | 8 |
e | 7.0 | 8.0 | 6.0 | 10 |
The column for FL
is now added to the end of the DataFrame.
Using insert()
insert()
provides the flexibility to add a column at any index position. It also provides different options for inserting the column values. Let's add a new column for NY
at index position 2
between OH
and CA
.
In [99]:
df5.insert(loc=2, column='NY', value=[7, 4, 9])
df5
Out[99]:
TX | OH | NY | CA | FL | |
---|---|---|---|---|---|
a | 1.0 | 2.0 | 7 | 0.0 | 5 |
c | 4.0 | 5.0 | 4 | 3.0 | 8 |
e | 7.0 | 8.0 | 9 | 6.0 | 10 |
Using assign()
assign()
returns a new DataFrame object that has a copy of the original data with the requested changes. The original DataFrame remains unchanged. To apply changes to existing DataFrame, we need to assign the function back to the DataFrame.
In [100]:
df5 = df5 . assign ( WA = [ 0 , 3 , 12 ])
Out[101]:
TX | OH | NY | CA | FL | WA | |
---|---|---|---|---|---|---|
a | 1.0 | 2.0 | 7 | 0.0 | 5 | 0 |
c | 4.0 | 5.0 | 4 | 3.0 | 8 | 3 |
e | 7.0 | 8.0 | 9 | 6.0 | 10 | 12 |
Renaming Columns¶
Real world data is messy. In your data exploration journey, you may come across column names that are not representative of the data or that are too long, or you may just want to standardize the names of columns in your dataset. Let's look at how we can rename columns.
Using rename()
rename()
is quite useful when we need to rename some selected columns. Let's rename some columns to reflect the names of states.
In [102]:
df5 . rename ( columns = { 'TX' : 'Texas' , 'OH' : 'Ohio' }, inplace = True )
Out[103]:
Texas | Ohio | NY | CA | FL | WA | |
---|---|---|---|---|---|---|
a | 1.0 | 2.0 | 7 | 0.0 | 5 | 0 |
c | 4.0 | 5.0 | 4 | 3.0 | 8 | 3 |
e | 7.0 | 8.0 | 9 | 6.0 | 10 | 12 |
By assigning a list of Columns

We can also assign a list of new names to the columns attribute of the DataFrame object. The caveat is that this method requires providing new names for all the columns, even if we want to rename only a few.
In [104]:
df5 . columns = [ 'Texas' , 'Ohio' , 'New York' , 'California' , 'Florida' , 'Washington' ]
Out[105]:
Texas | Ohio | New York | California | Florida | Washington | |
---|---|---|---|---|---|---|
a | 1.0 | 2.0 | 7 | 0.0 | 5 | 0 |
c | 4.0 | 5.0 | 4 | 3.0 | 8 | 3 |
e | 7.0 | 8.0 | 9 | 6.0 | 10 | 12 |
Working with Missing Data¶
Missing data occurs in many applications as real-world data is rarely clean. It is generally referred to as Null, NaN, or NA values. Assessing the missing data is an important process, as it can identify potential problems or biases that may arise as a result of the missing data. There are various useful methods for detecting, removing, and replacing null values in Pandas such as:
- isnull(): generates a boolean mask indicating missing values.
- notnull(): generates a boolean mask indicating non-missing values.
- dropna(): drops null values and returns a filtered version of the data.
- fillna(): returns a copy of the data with missing values filled or imputed.
- interpolate(): powerful function that provides various interpolation techniques to fill the missing values.
Let's start by looking at the types of missing data in Pandas and then we will explore how to detect, filter, drop and impute missing data.
Missing Data in Pandas¶
Pandas uses two already existing Python null values:
- None: a Python singleton object only used in arrays with data type 'object'.
- NaN: a special floating point value.
None object¶
In [106]:
v1 = np.array([1, None, 4, 6])
v1
Out[106]:
array([1, None, 4, 6], dtype=object)
dtype=object
shows NumPy inferred that the contents of this array are Python objects. Any operations on the data will be done at the Python level, which are typically slower than the arrays with native types.
In [107]:
for dtype in ['object', 'int', 'float']:
    print("dtype =", dtype)
    %timeit np.arange(1E4, dtype=dtype).sum()
    print()
dtype = object 391 µs ± 5.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) dtype = int 11.2 µs ± 708 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) dtype = float 11.2 µs ± 653 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Aggregation operations on an array with a None value result in an error.
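For example (the cell is not shown), summing the object array fails because None does not support addition with integers:

v1.sum()   # TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'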
NaN: Not a Number¶
In [108]:
v2 = np.array([1, np.NaN, 4, 6])
v2.dtype
dtype('float64')
shows NumPy inferred that the contents of this array are native floating-point type. Unlike object
, this array supports faster operations.
Aggregation operations on an array with NaN will result in a NaN.
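The corresponding cell is not shown; for example:

v2.sum()   # nan - the NaN propagates through the aggregation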
NaN and None¶
Pandas handles both NaN
and None
interchangeably and automatically converts them as appropriate.
In [110]:
pd . Series ([ 1 , np . nan , 5 , None ])
Out[110]:
0 1.0 1 NaN 2 5.0 3 NaN dtype: float64
Pandas automatically converts None
to a NaN
value.
Detect Missing Data¶
isnull()
and notnull()
methods can be used to detect and filter missing data. We will start by creating some data to apply these methods.
In [111]:
# Create data
df6 = df3 + df4
df6.loc['Ohio', 'c'] = 10
df6.loc['Colorado', 'b'] = 5
df6
Out[111]:
b | c | d | e | |
---|---|---|---|---|
Arizona | NaN | NaN | NaN | NaN |
Colorado | 5.0 | NaN | NaN | NaN |
Iowa | NaN | NaN | NaN | NaN |
Ohio | 22.0 | 10.0 | 11.0 | NaN |
Texas | 30.0 | NaN | 11.0 | NaN |
Out[112]:
b | c | d | e | |
---|---|---|---|---|
Arizona | True | True | True | True |
Colorado | False | True | True | True |
Iowa | True | True | True | True |
Ohio | False | False | False | True |
Texas | False | True | False | True |
The result shows a matrix of True or False values based on the presence of null values in data.
Missing values in each column¶
Chaining a sum()
method returns a summation of missing values in each column.
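The cell is not shown; counting missing values per column is:

df6.isnull().sum()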
Out[113]:
b 2 c 4 d 3 e 5 dtype: int64
Missing values in DataFrame¶
Chaining two sum()
methods will return the total number of missing values in the DataFrame.
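The cell and its output are not shown here; it would be:

df6.isnull().sum().sum()   # 14 for df6 above (2 + 4 + 3 + 5)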
Using notnull()¶
notnull()
is the opposite of isnull()
and can be used to check the number of non-missing values.
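The cell is not shown; the per-column count of non-missing values is:

df6.notnull().sum()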
Out[115]:
b 3 c 1 d 2 e 0 dtype: int64
In [116]:
# Total non-missing values in dataframe df6 . notnull () . sum () . sum ()
Visualize Missing Data¶
It is important to understand the pattern of occurrence of missing values before we decide to drop or impute missing data. An easy way to visualize missing records is to use heatmap()
from the seaborn library.
In [117]:
# Import Seaborn import seaborn as sns
Out[118]:
b | c | d | e | |
---|---|---|---|---|
Arizona | NaN | NaN | NaN | NaN |
Colorado | 5.0 | NaN | NaN | NaN |
Iowa | NaN | NaN | NaN | NaN |
Ohio | 22.0 | 10.0 | 11.0 | NaN |
Texas | 30.0 | NaN | 11.0 | NaN |
In [119]:
# Visualize missing sns . heatmap ( df6 . isnull (), cbar = False , cmap = 'viridis' )
Out[119]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb2e87b5a90>
Missing records are displayed in yellow color.
Filter based on Missing Data¶
Both isnull()
and notnull()
can be applied to columns of a DataFrame to filter out rows with missing and non-missing data.
Out[120]:
b | c | d | e | |
---|---|---|---|---|
Arizona | NaN | NaN | NaN | NaN |
Colorado | 5.0 | NaN | NaN | NaN |
Iowa | NaN | NaN | NaN | NaN |
Ohio | 22.0 | 10.0 | 11.0 | NaN |
Texas | 30.0 | NaN | 11.0 | NaN |
Filter rows where values in column b
are null.
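The filtering expression (cell not shown) was presumably:

df6[df6['b'].isnull()]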
Out[121]:
b | c | d | e | |
---|---|---|---|---|
Arizona | NaN | NaN | NaN | NaN |
Iowa | NaN | NaN | NaN | NaN |
Filter rows where values in column b
are not null.
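And for the non-null case (cell not shown):

df6[df6['b'].notnull()]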
Out[122]:
b | c | d | e | |
---|---|---|---|---|
Colorado | 5.0 | NaN | NaN | NaN |
Ohio | 22.0 | 10.0 | 11.0 | NaN |
Texas | 30.0 | NaN | 11.0 | NaN |
Drop Missing Data¶
dropna()
method allows us to drop any missing values from the data. The method returns a copy of the DataFrame, so to apply the changes inplace, use inplace=True
.
Out[123]:
b | c | d | e | |
---|---|---|---|---|
Arizona | NaN | NaN | NaN | NaN |
Colorado | 5.0 | NaN | NaN | NaN |
Iowa | NaN | NaN | NaN | NaN |
Ohio | 22.0 | 10.0 | 11.0 | NaN |
Texas | 30.0 | NaN | 11.0 | NaN |
By default, dropna()
will drop all rows in which any null value is present.
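The cell demonstrating this is not shown; it would simply be:

df6.dropna()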
Since all rows of df6
have some NA values, the result is an empty copy of the DataFrame. Alternatively, using axis='columns'
drops all columns containing a null value.
In [125]:
df6 . dropna ( axis = 'columns' )
Out[125]:
Arizona |
---|
Colorado |
Iowa |
Ohio |
Texas |
Since all columns have some NA values, the result is an empty copy of the DataFrame.
dropna()
provides the flexibility to drop rows or columns with all NA values, or a majority of NA values using how
or thresh
parameters. The default how='any'
, allows any row or column containing a null value to be dropped. To only drop rows or columns that have all null values, how='all'
can be specified.
In [126]:
# Drop rows with all null values df6 . dropna ( how = 'all' )
Out[126]:
b | c | d | e | |
---|---|---|---|---|
Colorado | 5.0 | NaN | NaN | NaN |
Ohio | 22.0 | 10.0 | 11.0 | NaN |
Texas | 30.0 | NaN | 11.0 | NaN |
In [127]:
# Drop columns with all null values df6 . dropna ( how = 'all' , axis = 'columns' )
Out[127]:
b | c | d | |
---|---|---|---|
Arizona | NaN | NaN | NaN |
Colorado | 5.0 | NaN | NaN |
Iowa | NaN | NaN | NaN |
Ohio | 22.0 | 10.0 | 11.0 |
Texas | 30.0 | NaN | 11.0 |
The thresh parameter allows you to specify a minimum number of non-null values a row/column should have to be kept in the result.
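The call behind the output below is presumably:

df6.dropna(thresh=2)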
Out[128]:
b | c | d | e | |
---|---|---|---|---|
Ohio | 22.0 | 10.0 | 11.0 | NaN |
Texas | 30.0 | NaN | 11.0 | NaN |
Only rows with 2 or more non-null values are kept, and since the row for Colorado
has only 1 non-null value, it is dropped.
In [129]:
df6 . dropna ( thresh = 2 , axis = 'columns' )
Out[129]:
b | d | |
---|---|---|
Arizona | NaN | NaN |
Colorado | 5.0 | NaN |
Iowa | NaN | NaN |
Ohio | 22.0 | 11.0 |
Texas | 30.0 | 11.0 |
Similarly, column c
has only 1 non-null value and is therefore dropped.
Impute Missing Data¶
Rather than dropping NA values and potentially discarding some other data with it, you may just want to replace them with a value such as 0, or some other imputation such as mean or median of the data. Pandas fillna()
method can be used for such operations. The method returns a new object, but you can modify the existing object in-place.
Out[130]:
b | c | d | e | |
---|---|---|---|---|
Arizona | NaN | NaN | NaN | NaN |
Colorado | 5.0 | NaN | NaN | NaN |
Iowa | NaN | NaN | NaN | NaN |
Ohio | 22.0 | 10.0 | 11.0 | NaN |
Texas | 30.0 | NaN | 11.0 | NaN |
Fill with a Constant Value¶
Calling fillna()
with a constant value replaces missing values with that value.
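The cell is not shown; filling with zero, for example:

df6.fillna(0)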
Out[131]:
b | c | d | e | |
---|---|---|---|---|
Arizona | 0.0 | 0.0 | 0.0 | 0.0 |
Colorado | 5.0 | 0.0 | 0.0 | 0.0 |
Iowa | 0.0 | 0.0 | 0.0 | 0.0 |
Ohio | 22.0 | 10.0 | 11.0 | 0.0 |
Texas | 30.0 | 0.0 | 11.0 | 0.0 |
Forward and Backward Fill¶
Forward and backward fill can be used to propagate the previous value forward (ffill
) or next values backward (bfill
). We can specify an axis
along which the fill method will operate.
Note: If a previous value is not available during a fill operation, the NA value remains.
In [132]:
# Forward fill along rows df6 . fillna ( method = 'ffill' )
Out[132]:
b | c | d | e | |
---|---|---|---|---|
Arizona | NaN | NaN | NaN | NaN |
Colorado | 5.0 | NaN | NaN | NaN |
Iowa | 5.0 | NaN | NaN | NaN |
Ohio | 22.0 | 10.0 | 11.0 | NaN |
Texas | 30.0 | 10.0 | 11.0 | NaN |
In [133]:
# Forward fill along columns df6 . fillna ( method = 'ffill' , axis = 'columns' )
Out[133]:
b | c | d | e | |
---|---|---|---|---|
Arizona | NaN | NaN | NaN | NaN |
Colorado | 5.0 | 5.0 | 5.0 | 5.0 |
Iowa | NaN | NaN | NaN | NaN |
Ohio | 22.0 | 10.0 | 11.0 | 11.0 |
Texas | 30.0 | 30.0 | 11.0 | 11.0 |
In [134]:
# Backward fill along rows df6 . fillna ( method = 'bfill' )
Out[134]:
b | c | d | e | |
---|---|---|---|---|
Arizona | 5.0 | 10.0 | 11.0 | NaN |
Colorado | 5.0 | 10.0 | 11.0 | NaN |
Iowa | 22.0 | 10.0 | 11.0 | NaN |
Ohio | 22.0 | 10.0 | 11.0 | NaN |
Texas | 30.0 | NaN | 11.0 | NaN |
Fill with Arithmetic or Aggregate Functions¶
A mean, median, mode, max or min value for the column can be used to fill missing values.
Out[135]:
b | c | d | e | |
---|---|---|---|---|
Arizona | NaN | NaN | NaN | NaN |
Colorado | 5.0 | NaN | NaN | NaN |
Iowa | NaN | NaN | NaN | NaN |
Ohio | 22.0 | 10.0 | 11.0 | NaN |
Texas | 30.0 | NaN | 11.0 | NaN |
In [136]:
# Fill with mean df6 . fillna ( df6 . mean ())
Out[136]:
b | c | d | e | |
---|---|---|---|---|
Arizona | 19.0 | 10.0 | 11.0 | NaN |
Colorado | 5.0 | 10.0 | 11.0 | NaN |
Iowa | 19.0 | 10.0 | 11.0 | NaN |
Ohio | 22.0 | 10.0 | 11.0 | NaN |
Texas | 30.0 | 10.0 | 11.0 | NaN |
In [137]:
# Fill with median df6 . fillna ( df6 . median ())
/Users/mohi9282/opt/anaconda3/envs/arcgis-clone/lib/python3.8/site-packages/numpy/lib/nanfunctions.py:1115: RuntimeWarning: All-NaN slice encountered r, k = function_base._ureduce(a, func=_nanmedian, axis=axis, out=out,
Out[137]:
b | c | d | e | |
---|---|---|---|---|
Arizona | 22.0 | 10.0 | 11.0 | NaN |
Colorado | 5.0 | 10.0 | 11.0 | NaN |
Iowa | 22.0 | 10.0 | 11.0 | NaN |
Ohio | 22.0 | 10.0 | 11.0 | NaN |
Texas | 30.0 | 10.0 | 11.0 | NaN |
In [138]:
# Fill with max df6 . fillna ( df6 . max ())
Out[138]:
b | c | d | e | |
---|---|---|---|---|
Arizona | 30.0 | 10.0 | 11.0 | NaN |
Colorado | 5.0 | 10.0 | 11.0 | NaN |
Iowa | 30.0 | 10.0 | 11.0 | NaN |
Ohio | 22.0 | 10.0 | 11.0 | NaN |
Texas | 30.0 | 10.0 | 11.0 | NaN |
Fill using a Dictionary of Values¶
A dictionary of constant values or aggregate functions can be passed to fill missing values in columns differently.
Out[139]:
b | c | d | e | |
---|---|---|---|---|
Arizona | NaN | NaN | NaN | NaN |
Colorado | 5.0 | NaN | NaN | NaN |
Iowa | NaN | NaN | NaN | NaN |
Ohio | 22.0 | 10.0 | 11.0 | NaN |
Texas | 30.0 | NaN | 11.0 | NaN |
In [140]:
fill_value = {'b': df6['b'].median(), 'd': df6['d'].mean(), 'e': 1}
df6.fillna(value=fill_value)
Out[140]:
b | c | d | e | |
---|---|---|---|---|
Arizona | 22.0 | NaN | 11.0 | 1.0 |
Colorado | 5.0 | NaN | 11.0 | 1.0 |
Iowa | 22.0 | NaN | 11.0 | 1.0 |
Ohio | 22.0 | 10.0 | 11.0 | 1.0 |
Texas | 30.0 | NaN | 11.0 | 1.0 |
Using interpolate()¶
interpolate() is a very powerful function that provides various interpolation techniques (linear, quadratic, polynomial, etc.) to fill the missing values. Let's take a quick look; you can learn more about interpolate() in the pandas documentation.
Out[141]:
b | c | d | e | |
---|---|---|---|---|
Arizona | NaN | NaN | NaN | NaN |
Colorado | 5.0 | NaN | NaN | NaN |
Iowa | NaN | NaN | NaN | NaN |
Ohio | 22.0 | 10.0 | 11.0 | NaN |
Texas | 30.0 | NaN | 11.0 | NaN |
interpolate()
using Linear method. Note that Linear method ignores the index and treats the values as equally spaced.
In [142]:
df6 . interpolate ( method = 'linear' , limit_direction = 'forward' )
Out[142]:
b | c | d | e | |
---|---|---|---|---|
Arizona | NaN | NaN | NaN | NaN |
Colorado | 5.0 | NaN | NaN | NaN |
Iowa | 13.5 | NaN | NaN | NaN |
Ohio | 22.0 | 10.0 | 11.0 | NaN |
Texas | 30.0 | 10.0 | 11.0 | NaN |
Conclusion¶
In this part of the guide series we introduced Pandas, a Python package that builds on NumPy and provides data structures and functions designed to make working with structured data fast, easy, and expressive.
You have seen how DataFrame
can be created and then data can be accessed using loc
and iloc
operators. We discussed in detail how to check the different data types in a DataFrame and ways to change these data types. We also discussed how to perform various operations on a DataFrame
(i.e. Arithmetic, Reindex, Add and Drop data) and to work with missing data. We briefly introduced working with a Series
object as well.
In the next part of this guide series, you will learn about how to be more productive with Pandas. We will discuss data aggregation, transformation operations, multi-level indexing and working with time series data.
References¶
[1] Wes McKinney. 2017. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (2nd. ed.). O'Reilly Media, Inc.
[2] Jake VanderPlas. 2016. Python Data Science Handbook: Essential Tools for Working with Data (1st. ed.). O'Reilly Media, Inc.
[3] https://pbpython.com/pandas_dtypes.html
Source: https://developers.arcgis.com/python/guide/part3-introduction-to-pandas/