
Essential Pandas Functions for Data Cleaning and Analysis

 



If you work with data in Python, pandas is one of the most widely used libraries for cleaning and analysis. Here’s a quick reference to the functions you will reach for most often, with a short note on what each one does.


📌 Data Loading & Basic Operations

| Function | Purpose |
| --- | --- |
| pd.read_csv("file.csv") | Load data from a CSV file |
| df.head(n) | Show the first n rows |
| df.tail(n) | Show the last n rows |
| df.shape | Get the number of rows and columns |
| df.info() | Show column types and non-null counts |
| df.describe() | Get summary statistics for numerical columns |
| df.columns | List all column names |
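
The inspection calls above can be tried without a file on disk. This is a minimal sketch: the CSV text, column names, and values are made up for illustration and stand in for "file.csv".

```python
import io

import pandas as pd

# Hypothetical sample data standing in for "file.csv"
csv_text = """song,artist,genre,duration,year
Song A,Artist X,rock,210,2001
Song B,Artist Y,pop,185,2015
Song C,Artist Z,jazz,320,1998
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))        # first two rows
print(df.tail(1))        # last row
print(df.shape)          # (3, 5): three rows, five columns
print(list(df.columns))  # column names as a plain list
df.info()                # dtypes and non-null counts (prints directly)
print(df.describe())     # statistics for the numeric columns only
```

With a real dataset, replace the io.StringIO(...) argument with the path to your CSV file.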

🛠 Handling Missing Data

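| Function | Purpose | Example Usage |
| --- | --- | --- |
| df.isnull().sum() | Count missing values in each column | df.isnull().sum() |
| df.dropna() | Remove rows that contain any missing value | df = df.dropna() |
| df.fillna(value) | Fill missing values with a specific value | df = df.fillna("Unknown") |
| df.fillna(df.mean(numeric_only=True)) | Fill missing values in numerical columns with the column mean (numeric_only=True is required in pandas 2.0+ when the DataFrame also has text columns) | df = df.fillna(df.mean(numeric_only=True)) |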
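
As a rough illustration of how these calls combine, the sketch below fills a numeric column with its mean and a text column with a placeholder. The DataFrame and its values are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing artist and one missing duration
df = pd.DataFrame({
    "artist": ["Artist X", None, "Artist Z"],
    "duration": [210.0, np.nan, 320.0],
})

# Count missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Fill numeric columns with their mean first, then any remaining
# (text) gaps with a placeholder string
numeric_means = df.mean(numeric_only=True)
filled = df.fillna(numeric_means).fillna("Unknown")
print(filled)

# Alternatively, drop every row that has at least one missing value
dropped = df.dropna()
print(len(dropped))
```

Filling in two steps keeps the placeholder string out of the numeric column, so "duration" stays a float column.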

🔄 Handling Duplicates

| Function | Purpose | Example Usage |
| --- | --- | --- |
| df.duplicated() | Check for duplicate rows | df.duplicated().sum() |
| df.drop_duplicates() | Remove duplicate rows | df = df.drop_duplicates() |
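
A small sketch of both calls on made-up data. The subset and keep parameters shown at the end are standard drop_duplicates options that the table does not cover.

```python
import pandas as pd

# Hypothetical data: the third row repeats the first
df = pd.DataFrame({
    "song": ["Song A", "Song B", "Song A"],
    "artist": ["Artist X", "Artist Y", "Artist X"],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = df.duplicated().sum()
print(n_duplicates)

# Drop the repeats and renumber the index
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)

# Compare only selected columns, keeping the first occurrence
by_artist = df.drop_duplicates(subset=["artist"], keep="first")
print(by_artist)
```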

🔍 Filtering & Fixing Incorrect Data

| Function | Purpose | Example Usage |
| --- | --- | --- |
| df[df["column_name"] > 0] | Keep only rows where values are greater than 0 | df = df[df["duration"] > 0] |
| df["column_name"].replace(old, new) | Replace specific values | df["genre"] = df["genre"].replace("hiphop", "Hip-Hop") |
| df["column_name"].str.lower() | Convert text to lowercase | df["artist"] = df["artist"].str.lower() |
| df["column_name"].astype(int) | Convert column to integer type (raises an error if the column still has missing values) | df["year"] = df["year"].astype(int) |
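
The four fixes above can be chained on a single DataFrame. The data below is invented for illustration; the column names follow the examples in the table.

```python
import pandas as pd

# Hypothetical data with a negative duration, inconsistent genre
# spelling, mixed-case artist names, and years stored as text
df = pd.DataFrame({
    "artist": ["Artist X", "ARTIST Y", "artist z"],
    "genre": ["hiphop", "Rock", "hiphop"],
    "duration": [210, -5, 180],
    "year": ["2001", "2015", "1998"],
})

# Keep only rows with a positive duration; .copy() avoids
# SettingWithCopyWarning on the assignments that follow
df = df[df["duration"] > 0].copy()

# Replace an exact value with its corrected spelling
df["genre"] = df["genre"].replace("hiphop", "Hip-Hop")

# Normalize text case
df["artist"] = df["artist"].str.lower()

# Convert the text years to integers
df["year"] = df["year"].astype(int)

print(df)
print(df.dtypes)
```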

📊 Sorting & Grouping

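| Function | Purpose | Example Usage |
| --- | --- | --- |
| df.sort_values("column_name") | Sort by a column (ascending by default) | df = df.sort_values("duration") |
| df.groupby("column_name").mean(numeric_only=True) | Group by a column and find the mean of each numerical column | df.groupby("genre")["duration"].mean() |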
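
A short sketch of sorting and grouping on made-up data. Selecting the column before calling mean(), as in the example usage above, avoids errors from text columns.

```python
import pandas as pd

# Hypothetical data: two genres, two songs each
df = pd.DataFrame({
    "genre": ["rock", "pop", "rock", "pop"],
    "duration": [200, 180, 300, 220],
})

# Sort longest first; the default order is ascending
sorted_df = df.sort_values("duration", ascending=False)
print(sorted_df)

# Average duration per genre, returned as a Series indexed by genre
mean_duration = df.groupby("genre")["duration"].mean()
print(mean_duration)
```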

📈 Data Visualization (Graphs & Charts)

| Function | Purpose | Example Usage |
| --- | --- | --- |
| df["column_name"].value_counts().plot(kind="bar") | Create a bar chart of how often each value occurs | df["genre"].value_counts().plot(kind="bar") |
| df.plot(kind="line") | Create a line chart | df.plot(kind="line") |
| df.hist() | Create a histogram | df["duration"].hist() |

🎵 Example: Cleaning & Analyzing Music Data

If you have a dataset with song names, duration, genre, artist, and year, you can clean and analyze it using pandas.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("music_data.csv")

# Remove missing values
df = df.dropna()

# Remove duplicates
df = df.drop_duplicates()

# Fix incorrect durations (e.g., remove negative durations)
df = df[df["duration"] > 0]

# Convert all genre names to lowercase
df["genre"] = df["genre"].str.lower()

# Show the most popular genres
df["genre"].value_counts().plot(kind="bar")
plt.show()
```

🚀 Summary

With these functions, you can:
✔ Load and inspect your data
✔ Handle missing values, duplicates, and incorrect data
✔ Sort and group data for analysis
✔ Create basic visualizations

Mastering these pandas functions will make your data cleaning and analysis process much more efficient. Try them with your own datasets!


