Extract data from DataFrame in Python


Extracting data from a pandas DataFrame is a routine step in data analysis, manipulation, and visualization. This guide walks through the basics: selecting columns, filtering rows, combining conditions, reading single values, and aggregating data.

Prerequisites

  • Pandas library. If you haven't installed it yet, you can do so by running pip install pandas in your terminal or command prompt.

Creating a DataFrame

Before extracting data, let's create a simple DataFrame to work with:

import pandas as pd

# Sample data
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}

df = pd.DataFrame(data)
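
Before extracting anything, it can help to confirm what the DataFrame looks like. The following is a minimal sketch that rebuilds the sample data and prints its shape, column labels, and first rows; the default row index here is the integers 0 to 3.

```python
import pandas as pd

# Same sample data as above
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)

print(df.shape)          # (4, 3): four rows, three columns
print(list(df.columns))  # ['Name', 'Age', 'City']
print(list(df.index))    # [0, 1, 2, 3]
print(df.head(2))        # first two rows (John and Anna)
```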

Selecting Columns

Single Column: To extract a single column, put its label in square brackets. The result is a pandas Series.

names = df['Name']

Multiple Columns: To extract multiple columns, pass a list of column labels. The result is a new DataFrame.

subset = df[['Name', 'City']]
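
The difference between the two forms shows up in the type of the result. The sketch below rebuilds the sample DataFrame and checks what each selection returns; the variable name one_col_frame is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
})

names = df['Name']             # single label -> Series
subset = df[['Name', 'City']]  # list of labels -> DataFrame
one_col_frame = df[['Name']]   # a one-element list still gives a DataFrame

print(type(names).__name__)          # Series
print(type(subset).__name__)         # DataFrame
print(type(one_col_frame).__name__)  # DataFrame
print(list(subset.columns))          # ['Name', 'City']
```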

Filtering Rows

Based on Conditions: Put a condition inside the square brackets to keep only the rows where it is true.

under_30 = df[df['Age'] < 30]
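
A condition such as df['Age'] < 30 is evaluated first and produces a boolean Series (often called a mask); indexing with that mask keeps the rows marked True. A small sketch of the two steps, using the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
})

# Step 1: the comparison yields one True/False value per row
mask = df['Age'] < 30
print(mask.tolist())  # [True, False, True, False]

# Step 2: indexing with the mask keeps only the True rows
under_30 = df[mask]
print(under_30['Name'].tolist())  # ['John', 'Peter']
```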

Using loc and iloc: For more control over row selection, use loc to select by index label (or boolean mask) and iloc to select by integer position.

# Select the rows at integer positions 0 and 2
selected_rows = df.iloc[[0, 2]]

# Select rows where Name is 'Peter'
peter_row = df.loc[df['Name'] == 'Peter']
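
With the default integer index, labels and positions happen to coincide, which can hide the difference between loc and iloc. The sketch below makes it visible by using the names as the index (df_by_name is a name introduced here for illustration) and by comparing slices, where loc includes the end label and iloc excludes the end position.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
})

# Use the names as row labels
df_by_name = df.set_index('Name')

anna_by_label = df_by_name.loc['Anna']   # select by label
anna_by_position = df_by_name.iloc[1]    # select by position (second row)
print(anna_by_label['City'], anna_by_position['City'])  # Paris Paris

# Slices behave differently: loc includes the end label, iloc does not
print(len(df.loc[0:2]))   # 3 (labels 0, 1, 2)
print(len(df.iloc[0:2]))  # 2 (positions 0, 1)

# loc can select rows and columns in one step
older = df.loc[df['Age'] > 30, ['Name', 'City']]
print(older['Name'].tolist())  # ['Anna', 'Linda']
```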

Conditional Selections

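You can combine conditions with the element-wise operators & (and), | (or), and ~ (not). Wrap each condition in parentheses, because these operators bind more tightly than comparisons such as < and ==: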

# Select people aged below 30 and living in Berlin
young_in_berlin = df[(df['Age'] < 30) & (df['City'] == 'Berlin')]
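
The other operators follow the same pattern. As a rough sketch on the sample data, here are an "or" condition, a negation, and isin, which is a convenient shorthand for matching against several values:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
})

# OR: older than 30, or living in Berlin
over_30_or_berlin = df[(df['Age'] > 30) | (df['City'] == 'Berlin')]
print(over_30_or_berlin['Name'].tolist())  # ['Anna', 'Peter', 'Linda']

# NOT: everyone who does not live in Paris
not_paris = df[~(df['City'] == 'Paris')]
print(not_paris['Name'].tolist())  # ['John', 'Peter', 'Linda']

# isin: city is any one of the listed values
paris_or_london = df[df['City'].isin(['Paris', 'London'])]
print(paris_or_london['Name'].tolist())  # ['Anna', 'Linda']
```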

Extracting Specific Data Points

Using at and iat: To read a single value, use at with a row label and column label, or iat with an integer row position and column position.

# Using `at` to get the age of John (0 is John's row label in the default index)
john_age = df.at[0, 'Age']

# Using `iat` for the same purpose
john_age_iat = df.iat[0, 1]
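
The label-versus-position distinction is easier to see when the index is not the default integers. In this sketch the names are used as the index (df_by_name is an illustrative name), so at takes a name while iat still takes positions. Note that after set_index the remaining columns are Age and City, so City is at column position 1.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
})

df_by_name = df.set_index('Name')

# at: row label and column label
anna_city = df_by_name.at['Anna', 'City']
# iat: row position 1 (Anna), column position 1 (City)
anna_city_iat = df_by_name.iat[1, 1]
print(anna_city, anna_city_iat)  # Paris Paris

# at can also assign a single value
df_by_name.at['Anna', 'Age'] = 35
print(df_by_name.at['Anna', 'Age'])  # 35
```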

Data Aggregation

Pandas provides groupby for splitting rows into groups by the values of a column, and methods such as sum and mean for reducing each group to a single value.

# Average age by city
average_age_by_city = df.groupby('City')['Age'].mean()
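
In the sample data every city appears once, so each group mean is just that person's age. The sketch below uses a hypothetical extended dataset (the rows for Marie and Karl are invented for illustration) so that some groups have more than one member, and computes several statistics at once with agg.

```python
import pandas as pd

# Hypothetical extended data: Marie and Karl are added for illustration
people = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Marie', 'Karl'],
    'Age': [28, 34, 29, 32, 40, 35],
    'City': ['New York', 'Paris', 'Berlin', 'London', 'Paris', 'Berlin'],
})

# One row per city, one column per statistic
summary = people.groupby('City')['Age'].agg(['mean', 'min', 'max', 'count'])
print(summary)

print(summary.loc['Paris', 'mean'])   # 37.0, the mean of 34 and 40
print(summary.loc['Berlin', 'mean'])  # 32.0, the mean of 29 and 35
```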

Advanced Data Extraction

Using the query Method: Filter rows with a query string.

adults_in_london = df.query("Age >= 18 and City == 'London'")

Using pivot_table for Data Summarization: Create a pivot table that summarizes data.

pivot = df.pivot_table(values='Age', index='City', aggfunc='mean')
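
Both methods have equivalents among the earlier techniques, which can be a useful sanity check. The sketch below shows a query string that references a Python variable with the @ prefix (min_age is a name chosen for this example) next to the same filter written as a boolean mask, followed by a pivot table on the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
})

# query: @min_age refers to the local Python variable
min_age = 30
older_outside_paris = df.query("Age >= @min_age and City != 'Paris'")
print(older_outside_paris['Name'].tolist())  # ['Linda']

# The same filter written as a boolean mask
same_filter = df[(df['Age'] >= min_age) & (df['City'] != 'Paris')]
print(same_filter['Name'].tolist())  # ['Linda']

# pivot_table: one row per city, with the mean age in the 'Age' column
pivot = df.pivot_table(values='Age', index='City', aggfunc='mean')
print(pivot.loc['London', 'Age'])  # 32.0
```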

Extracting data from DataFrames is a core pandas skill that lets you select, filter, and aggregate data efficiently. Practice these techniques on different datasets to become proficient in data manipulation and analysis.