Converting CSV to DataFrame in Python

The article is maintained by the team at commabot.

To convert a CSV file to a DataFrame in Python, we can use the pandas library. Here's a step-by-step guide to doing this:

Install Pandas: If you haven't already installed pandas, you can do so by running the following command in your terminal or command prompt:

pip install pandas

Read the CSV File: Use the pd.read_csv() function to read your CSV file into a DataFrame. Pass the path to the file as the first argument. Optional parameters handle other formats, for example sep (or its alias delimiter) for the field separator, names for column names, and encoding for the file's character encoding.

Basic usage:

import pandas as pd

df = pd.read_csv('path/to/your/file.csv')

With optional parameters (for example, specifying a delimiter and skipping rows):

df = pd.read_csv('path/to/your/file.csv', delimiter=';', skiprows=1)
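To try these optional parameters without a file on disk, the sketch below parses a small in-memory string through io.StringIO. The sample text, its junk first line, and the column names are made up for illustration; sep is the parameter that delimiter is an alias for.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data with one junk line before the header
raw = "exported by some tool\nname;age\nAna;31\nBen;27\n"

# skiprows=1 skips the first line; sep=';' sets the field separator
df = pd.read_csv(io.StringIO(raw), sep=';', skiprows=1)

# names= supplies new column names; header=0 marks the first remaining
# line as the old header so it is replaced instead of read as data
renamed = pd.read_csv(
    io.StringIO(raw), sep=';', skiprows=1, header=0, names=['person', 'years']
)

print(df)
print(renamed.columns.tolist())
```

Without header=0, the original header line would be read as a data row when names is given.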

Use the DataFrame: Once the CSV is loaded into a DataFrame, you can analyze and manipulate the data with pandas. For example, df.head() returns the first five rows. Putting the steps together:

# Import pandas library
import pandas as pd

# Read the CSV file
df = pd.read_csv('path/to/your/file.csv')

# Display the first 5 rows of the DataFrame
print(df.head())
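Besides df.head(), a few other attributes and methods are useful for a first look at freshly loaded data. The following is a minimal sketch; the item and quantity data is invented and read from a string so the example runs without a file.

```python
import io

import pandas as pd

# Hypothetical data standing in for a real CSV file
csv_text = "item,quantity\npen,3\nbook,1\nlamp,2\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # inferred type of each column
print(df.head(2))  # first two rows only
df.info()          # per-column summary with non-null counts
```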

Now let's look at a more comprehensive example. Imagine you have a CSV file named sample_data.csv with the following content:

Name,Age,Salary,Department
John Doe,28,50000,Marketing
Jane Smith,,55000,Finance
Emily Jones,22,,HR
Michael Brown,30,60000,IT
Alex Johnson,25,52000,Marketing

Notice that some data points are missing (indicated by empty fields): Jane Smith has no Age and Emily Jones has no Salary. pandas reads these empty fields as NaN.
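One way to confirm where the gaps are is to count missing values per column. The sketch below holds the same sample content in a string (an assumption made so it runs without creating sample_data.csv first).

```python
import io

import pandas as pd

# The sample_data.csv content, held in a string for this sketch
csv_text = """Name,Age,Salary,Department
John Doe,28,50000,Marketing
Jane Smith,,55000,Finance
Emily Jones,22,,HR
Michael Brown,30,60000,IT
Alex Johnson,25,52000,Marketing
"""
df = pd.read_csv(io.StringIO(csv_text))

# Count missing values in each column
missing_counts = df.isna().sum()
print(missing_counts)

# Numeric columns with gaps are read as floats, because NaN is a float value
print(df[['Age', 'Salary']].dtypes)
```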

Let's write a script that does the following:

  1. Read CSV: The script starts by reading the sample_data.csv file into a DataFrame.

  2. Handle Missing Values: It fills missing Age values with the average age and drops rows where Salary is missing.

  3. Filter Data: It creates a new DataFrame (marketing_df) containing only rows for employees in the Marketing department.

  4. Compute Statistics: It calculates the average, maximum, and minimum salaries across all rows that remain after cleaning (not only the Marketing rows).

  5. Save to CSV: Finally, it saves the cleaned DataFrame (after handling missing values; the Marketing filter is not applied to it) to a new CSV file named modified_sample_data.csv, excluding the index column.

import pandas as pd

# Read the CSV file
df = pd.read_csv('sample_data.csv')

# Display the first few rows of the DataFrame
print("Original DataFrame:")
print(df.head())

# Handling missing values
# Fill missing ages with the average age (assign the result back;
# calling fillna(inplace=True) on a selected column is deprecated in recent pandas)
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Drop rows where 'Salary' is missing
df = df.dropna(subset=['Salary'])

# Filter data: Select only employees in the Marketing department
marketing_df = df[df['Department'] == 'Marketing']

# Compute basic statistics for the Salary column
average_salary = df['Salary'].mean()
max_salary = df['Salary'].max()
min_salary = df['Salary'].min()

print("\nFiltered DataFrame (Marketing Department):")
print(marketing_df)

print("\nSalary Statistics:")
print(f"Average Salary: {average_salary}")
print(f"Maximum Salary: {max_salary}")
print(f"Minimum Salary: {min_salary}")

# Save the modified DataFrame to a new CSV file
df.to_csv('modified_sample_data.csv', index=False)

This example demonstrates basic data manipulation tasks with pandas, including cleaning data, filtering based on conditions, and computing statistics, which are common steps in data analysis workflows.
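To check that the saved file contains what you expect, you can read it back. The sketch below repeats the cleaning steps on the sample content and writes to a temporary directory; the temporary directory and the in-memory input are assumptions made to keep the example self-contained.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd

csv_text = """Name,Age,Salary,Department
John Doe,28,50000,Marketing
Jane Smith,,55000,Finance
Emily Jones,22,,HR
Michael Brown,30,60000,IT
Alex Johnson,25,52000,Marketing
"""

# Same cleaning steps as the script: fill Age, drop rows without Salary
df = pd.read_csv(io.StringIO(csv_text))
df['Age'] = df['Age'].fillna(df['Age'].mean())
df = df.dropna(subset=['Salary'])

# Write to a throwaway directory, then read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / 'modified_sample_data.csv'
    df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded)
print(reloaded.isna().sum().sum())  # total number of missing values left
```

The reloaded frame should have four rows and no missing values: the row without a Salary is gone, and the missing Age has been replaced by the mean of the four known ages.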