Pandas DataFrames
Last updated on 2025-10-23 | Edit this page
Estimated time: 30 minutes
Overview
Questions
- How can I do statistical analysis of tabular data?
- What are Booleans? How can I use them?
Objectives
- Select individual values from a Pandas dataframe.
- Select entire rows or entire columns from a dataframe.
- Select a subset of both rows and columns from a dataframe in a single operation.
- Construct simple Boolean expressions with comparisons.
- Select a subset of a dataframe by a single Boolean criterion.
Note about Pandas DataFrames/Series
A DataFrame is a collection of Series; The DataFrame is the way Pandas represents a table, and Series is the data-structure Pandas use to represent a column.
Pandas is built on top of the Numpy library, which in practice means that most of the methods defined for Numpy Arrays apply to Pandas Series/DataFrames.
What makes Pandas so attractive is the powerful interface to access individual records of the table, proper handling of missing values, and relational-databases operations between DataFrames.
Selecting values
To access a value at the position [i,j] of a DataFrame,
we have two options, depending on what is the meaning of i
in use. Remember that a DataFrame provides an index as a way to
identify the rows of the table; a row, then, has a position
inside the table as well as a label, which uniquely identifies
its entry in the DataFrame.
Use DataFrame.iloc[..., ...] to select values by their
(entry) position
- Can specify location by numerical index analogously to 2D version of character selection in strings.
PYTHON
import pandas as pd
europe = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
print(europe.iloc[0, 0])
OUTPUT
1601.056136
Use DataFrame.loc[..., ...] to select values by their
(entry) label.
- Can specify location by row and/or column name.
OUTPUT
1601.056136
Use : on its own to mean all columns or all rows.
- Just like Python’s usual slicing notation.
OUTPUT
gdpPercap_1952 1601.056136
gdpPercap_1957 1942.284244
gdpPercap_1962 2312.888958
gdpPercap_1967 2760.196931
gdpPercap_1972 3313.422188
gdpPercap_1977 3533.003910
gdpPercap_1982 3630.880722
gdpPercap_1987 3738.932735
gdpPercap_1992 2497.437901
gdpPercap_1997 3193.054604
gdpPercap_2002 4604.211737
gdpPercap_2007 5937.029526
Name: Albania, dtype: float64
- Would get the same result printing
europe.loc["Albania"](without a second index).
OUTPUT
country
Albania 1601.056136
Austria 6137.076492
Belgium 8343.105127
⋮ ⋮ ⋮
Switzerland 14734.232750
Turkey 1969.100980
United Kingdom 9979.508487
Name: gdpPercap_1952, dtype: float64
- Would get the same result printing
europe["gdpPercap_1952"] - Also get the same result printing
europe.gdpPercap_1952(not recommended, because easily confused with.notation for methods)
Select multiple columns or rows using DataFrame.loc and
a named slice.
OUTPUT
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
country
Italy 8243.582340 10022.401310 12269.273780
Montenegro 4649.593785 5907.850937 7778.414017
Netherlands 12790.849560 15363.251360 18794.745670
Norway 13450.401510 16361.876470 18965.055510
Poland 5338.752143 6557.152776 8006.506993
In the above code, we discover that slicing using
loc is inclusive at both ends, which differs from
slicing using iloc, where slicing
indicates everything up to but not including the final index.
Result of slicing can be used in further operations.
- Usually don’t just print a slice.
- All the statistical operators that work on entire dataframes work the same way on slices.
- E.g., calculate max of a slice.
OUTPUT
gdpPercap_1962 13450.40151
gdpPercap_1967 16361.87647
gdpPercap_1972 18965.05551
dtype: float64
OUTPUT
gdpPercap_1962 4649.593785
gdpPercap_1967 5907.850937
gdpPercap_1972 7778.414017
dtype: float64
Comparisons in Python return True or False
Let’s introduce a simple Python data type that hasn’t been explicitly
introduced yet: the Boolean
or bool.
- Booleans are either
TrueorFalse. - Capitalization matters here!
trueis not a valid Boolean.
OUTPUT
bool
Booleans are returned when we test the truth value
of a statement. To test whether two objects are equal, we use
==. (This is different from the = used for
assignment.)
PYTHON
x = 5 # Assign x the value 5
print(x)
print(x == 10) # Is the value at x equal to 10, True or False
print(x == 5) # Is the value at x equal to 5, True or False
OUTPUT
5
False
True
Booleans are returned by statements that make numerical comparisons
using operators like >, <= and
!=.
- Try
!=, which can be read as “is not equal to”.
OUTPUT
False
-
<is the “less than” operator.
OUTPUT
True
You can also do comparisons between two variables.
OUTPUT
True
Use comparisons to select data based on value.
- Comparison is applied element by element.
- Returns a similarly-shaped dataframe of
TrueandFalse.
PYTHON
# Use a subset of data to keep output readable.
subset = europe.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
print('Subset of europe:\n', subset)
# Which values were greater than 10000 ?
print('\nWhere are values greater than 10000?', subset > 10000)
OUTPUT
Subset of europe:
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
country
Italy 8243.582340 10022.401310 12269.273780
Montenegro 4649.593785 5907.850937 7778.414017
Netherlands 12790.849560 15363.251360 18794.745670
Norway 13450.401510 16361.876470 18965.055510
Poland 5338.752143 6557.152776 8006.506993
Where are values greater than 10000?
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
country
Italy False True True
Montenegro False False False
Netherlands True True True
Norway True True True
Poland False False False
Select values or NaN using a Boolean mask.
- A frame full of Booleans is sometimes called a mask because of how it can be used.
OUTPUT
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
country
Italy NaN 10022.40131 12269.27378
Montenegro NaN NaN NaN
Netherlands 12790.84956 15363.25136 18794.74567
Norway 13450.40151 16361.87647 18965.05551
Poland NaN NaN NaN
- Get the value where the mask is true, and NaN (Not a Number) where it is false.
- Useful because NaNs are ignored by operations like max, min, average, etc.
OUTPUT
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
count 2.000000 3.000000 3.000000
mean 13120.625535 13915.843047 16676.358320
std 466.373656 3408.589070 3817.597015
min 12790.849560 10022.401310 12269.273780
25% 12955.737547 12692.826335 15532.009725
50% 13120.625535 15363.251360 18794.745670
75% 13285.513523 15862.563915 18879.900590
max 13450.401510 16361.876470 18965.055510
Selection of Individual Values
No, they do not produce the same output! The output of the first statement is:
OUTPUT
gdpPercap_1952 gdpPercap_1957
country
Albania 1601.056136 1942.284244
Austria 6137.076492 8842.598030
The second statement gives:
OUTPUT
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962
country
Albania 1601.056136 1942.284244 2312.888958
Austria 6137.076492 8842.598030 10750.721110
Belgium 8343.105127 9714.960623 10991.206760
Clearly, the second statement produces an additional column and an
additional row compared to the first statement.
What conclusion can we draw? We see that a numerical slice, 0:2,
omits the final index (i.e. index 2) in the range provided,
while a named slice, ‘gdpPercap_1952’:‘gdpPercap_1962’,
includes the final element.
For each column in europe, idxmin will
return the index value corresponding to each column’s minimum;
idxmax will do accordingly the same for each column’s
maximum value.
You can use these functions whenever you want to get the row index of the minimum/maximum value and not the actual minimum/maximum value.
Practice with Selection
Assume Pandas has been imported and the Gapminder GDP data for Europe has been loaded. Write an expression to select each of the following:
- GDP per capita for all countries in 1982.
- GDP per capita for Denmark for all years.
- GDP per capita for all countries for years after 1985.
- GDP per capita for each country in 2007 as a multiple of GDP per capita for that country in 1952.
1:
2:
3:
Pandas is smart enough to recognize the number at the end of the
column label and does not give you an error, although no column named
gdpPercap_1985 actually exists. This is useful if new
columns are added to the CSV file later.
4:
Remember that = is used to assign a value to a variable,
while == is a comparator.
Solve the problem by comparing “sky” to “blue” with ==,
which evaluates to False, then assign False to
is_the_sky_blue. The second line will then print
False.
Many Ways of Access
There are at least two ways of accessing a value or slice of a
DataFrame: by name or index. However, there are many others. For
example, a single column or row can be accessed either as a
DataFrame or a Series object.
Suggest different ways of doing the following operations on a DataFrame:
- Access a single column
- Access a single row
- Access an individual DataFrame element
- Access several columns
- Access several rows
- Access a subset of specific rows and columns
- Access a subset of row and column ranges
1. Access a single column:
PYTHON
# by name
data["col_name"] # as a Series
data[["col_name"]] # as a DataFrame
# by name using .loc
data.T.loc["col_name"] # as a Series
data.T.loc[["col_name"]].T # as a DataFrame
# Dot notation (Series)
data.col_name
# by index (iloc)
data.iloc[:, col_index] # as a Series
data.iloc[:, [col_index]] # as a DataFrame
# using a mask
data.T[data.T.index == "col_name"].T
2. Access a single row:
PYTHON
# by name using .loc
data.loc["row_name"] # as a Series
data.loc[["row_name"]] # as a DataFrame
# by name
data.T["row_name"] # as a Series
data.T[["row_name"]].T # as a DataFrame
# by index
data.iloc[row_index] # as a Series
data.iloc[[row_index]] # as a DataFrame
# using mask
data[data.index == "row_name"]
3. Access an individual DataFrame element:
PYTHON
# by column/row names
data["column_name"]["row_name"] # as a Series
data[["col_name"]].loc["row_name"] # as a Series
data[["col_name"]].loc[["row_name"]] # as a DataFrame
data.loc["row_name"]["col_name"] # as a value
data.loc[["row_name"]]["col_name"] # as a Series
data.loc[["row_name"]][["col_name"]] # as a DataFrame
data.loc["row_name", "col_name"] # as a value
data.loc[["row_name"], "col_name"] # as a Series. Preserves index. Column name is moved to `.name`.
data.loc["row_name", ["col_name"]] # as a Series. Index is moved to `.name.` Sets index to column name.
data.loc[["row_name"], ["col_name"]] # as a DataFrame (preserves original index and column name)
# by column/row names: Dot notation
data.col_name.row_name
# by column/row indices
data.iloc[row_index, col_index] # as a value
data.iloc[[row_index], col_index] # as a Series. Preserves index. Column name is moved to `.name`
data.iloc[row_index, [col_index]] # as a Series. Index is moved to `.name.` Sets index to column name.
data.iloc[[row_index], [col_index]] # as a DataFrame (preserves original index and column name)
# column name + row index
data["col_name"][row_index]
data.col_name[row_index]
data["col_name"].iloc[row_index]
# column index + row name
data.iloc[:, [col_index]].loc["row_name"] # as a Series
data.iloc[:, [col_index]].loc[["row_name"]] # as a DataFrame
# using masks
data[data.index == "row_name"].T[data.T.index == "col_name"].T
4. Access several columns:
PYTHON
# by name
data[["col1", "col2", "col3"]]
data.loc[:, ["col1", "col2", "col3"]]
# by index
data.iloc[:, [col1_index, col2_index, col3_index]]
5. Access several rows
PYTHON
# by name
data.loc[["row1", "row2", "row3"]]
# by index
data.iloc[[row1_index, row2_index, row3_index]]
6. Access a subset of specific rows and columns
PYTHON
# by names
data.loc[["row1", "row2", "row3"], ["col1", "col2", "col3"]]
# by indices
data.iloc[[row1_index, row2_index, row3_index], [col1_index, col2_index, col3_index]]
# column names + row indices
data[["col1", "col2", "col3"]].iloc[[row1_index, row2_index, row3_index]]
# column indices + row names
data.iloc[:, [col1_index, col2_index, col3_index]].loc[["row1", "row2", "row3"]]
7. Access a subset of row and column ranges
PYTHON
# by name
data.loc["row1":"row2", "col1":"col2"]
# by index
data.iloc[row1_index:row2_index, col1_index:col2_index]
# column names + row indices
data.loc[:, "col1_name":"col2_name"].iloc[row1_index:row2_index]
# column indices + row names
data.iloc[:, col1_index:col2_index].loc["row1":"row2"]
- Use
DataFrame.iloc[..., ...]to select values by integer location. - Use
:on its own to mean all columns or all rows. - Select multiple columns or rows using
DataFrame.locand a named slice. - Result of slicing can be used in further operations.
- Use comparisons to select data based on value.
- Booleans store truth values.
- Select values or NaN using a Boolean mask.