If you are working with data, especially in Python, pandas is one of the most powerful libraries for data analysis. Here’s a quick reference guide to data cleaning and analysis using pandas, with useful functions and their purposes.
📌 Data Loading & Basic Operations
Function | Purpose |
---|
pd.read_csv("file.csv") | Load data from a CSV file |
df.head(n) | Show first n rows |
df.tail(n) | Show last n rows |
df.shape | Get number of rows and columns |
df.info() | Get summary info about dataset |
df.describe() | Get statistics of numerical columns |
df.columns | List all column names |
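
As a quick illustration of the loading and inspection calls above, here is a minimal sketch. The inline CSV text and its column names are made-up sample data standing in for "file.csv"; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for "file.csv"
csv_text = """title,artist,genre,duration,year
Song A,Artist 1,rock,210,2001
Song B,Artist 2,pop,185,2010
Song C,Artist 1,rock,240,2005
Song D,Artist 3,jazz,300,1998
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))        # first 2 rows
print(df.tail(2))        # last 2 rows
print(df.shape)          # (number of rows, number of columns)
df.info()                # dtypes and non-null counts (prints directly)
print(df.describe())     # summary statistics for the numeric columns
print(list(df.columns))  # column names as a plain list
```

Note that `df.shape` and `df.columns` are attributes, so they take no parentheses, while `df.info()` prints its report itself rather than returning it.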
🛠 Handling Missing Data
Function | Purpose | Example Usage |
---|
df.isnull().sum() | Count missing values in each column | df.isnull().sum() |
df.dropna() | Remove rows with missing values | df = df.dropna() |
df.fillna(value) | Fill missing values with a specific value | df = df.fillna("Unknown") |
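df.fillna(df.mean(numeric_only=True)) | Fill missing values in numerical columns with each column's mean (numeric_only=True skips text columns, which would otherwise raise an error in pandas 2.0+) | df = df.fillna(df.mean(numeric_only=True)) |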
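
The missing-data functions above are often combined per column, since a mean only makes sense for numbers and a placeholder string only for text. The sketch below uses a small made-up DataFrame; the column names and the "Unknown" placeholder are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
df = pd.DataFrame({
    "artist": ["Artist 1", None, "Artist 3", "Artist 4"],
    "duration": [200.0, np.nan, 240.0, 280.0],
})

# Count missing values per column
missing_per_column = df.isnull().sum()

# Option 1: drop every row that has at least one missing value
dropped = df.dropna()

# Option 2: fill each column with a value suited to its type
filled = df.fillna({
    "duration": df["duration"].mean(),  # mean of the non-missing durations
    "artist": "Unknown",                # placeholder for missing text
})

print(missing_per_column)
print(dropped)
print(filled)
```

Passing a dictionary to `fillna` keeps the numeric and text columns separate, which avoids writing a string into a numeric column by accident.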
🔄 Handling Duplicates
Function | Purpose | Example Usage |
---|
df.duplicated() | Check for duplicate rows | df.duplicated().sum() |
df.drop_duplicates() | Remove duplicate rows | df = df.drop_duplicates() |
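
A short sketch of how the duplicate checks above behave, using made-up rows. By default both functions compare entire rows; the `subset` and `keep` parameters of `drop_duplicates` narrow the comparison to chosen columns.

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly, row 3 shares only the title
df = pd.DataFrame({
    "title": ["Song A", "Song B", "Song A", "Song A"],
    "artist": ["Artist 1", "Artist 2", "Artist 1", "Artist 9"],
})

# Number of rows that exactly repeat an earlier row
n_exact = df.duplicated().sum()

# Remove exact duplicates, keeping the first occurrence
deduped = df.drop_duplicates()

# Treat rows as duplicates when only the title matches
by_title = df.drop_duplicates(subset=["title"], keep="first")

print(n_exact)
print(deduped)
print(by_title)
```

Whether matching on a single column is appropriate depends on the dataset; here it would discard a song by a different artist that happens to share a title.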
🔍 Filtering & Fixing Incorrect Data
Function | Purpose | Example Usage |
---|
df[df["column_name"] > 0] | Keep only rows where values are greater than 0 | df = df[df["duration"] > 0] |
df["column_name"].replace(old, new) | Replace specific values | df["genre"] = df["genre"].replace("hiphop", "Hip-Hop") |
df["column_name"].str.lower() | Convert text to lowercase | df["artist"] = df["artist"].str.lower() |
df["column_name"].astype(int) | Convert column to integer type (raises an error if the column contains missing or non-numeric values) | df["year"] = df["year"].astype(int) |
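
The fixes above can be chained on one DataFrame. This is a minimal sketch on invented data; it also shows one possible way to handle a column that `astype(int)` would reject, using `pd.to_numeric` with `errors="coerce"` and the nullable `Int64` dtype. That substitution is an addition for illustration, not part of the table above.

```python
import pandas as pd

# Hypothetical messy data
df = pd.DataFrame({
    "artist": ["  Artist One", "ARTIST TWO", "Artist Three"],
    "genre": ["hiphop", "Rock", "hiphop"],
    "duration": [210, -5, 180],
    "year": ["2001", "2010", "n/a"],
})

# Keep only rows with a positive duration; copy so later assignments are safe
df = df[df["duration"] > 0].copy()

# Standardize one known misspelling and normalize artist text
df["genre"] = df["genre"].replace("hiphop", "Hip-Hop")
df["artist"] = df["artist"].str.strip().str.lower()

# "n/a" cannot be parsed as an integer, so coerce it to a missing value
# and store the result in the nullable integer dtype
df["year"] = pd.to_numeric(df["year"], errors="coerce").astype("Int64")

print(df)
print(df.dtypes)
```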
📊 Sorting & Grouping
Function | Purpose | Example Usage |
---|
df.sort_values("column_name") | Sort by a column | df = df.sort_values("duration") |
df.groupby("column_name").mean(numeric_only=True) | Group by a column and find the mean of the numeric columns | df.groupby("genre")["duration"].mean() |
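
Here is a small sketch of sorting and grouping on made-up data. The `ascending=False` argument and the `agg` call with several statistics go slightly beyond the table above and are shown as optional extras.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "genre": ["rock", "pop", "rock", "jazz"],
    "duration": [210, 180, 240, 300],
})

# Sort from longest to shortest (the default order is ascending)
longest_first = df.sort_values("duration", ascending=False)

# Mean duration per genre
mean_by_genre = df.groupby("genre")["duration"].mean()

# Several statistics per genre in one table
summary = df.groupby("genre")["duration"].agg(["count", "mean", "max"])

print(longest_first)
print(mean_by_genre)
print(summary)
```

Selecting the column before aggregating, as in `["duration"].mean()`, keeps text columns out of the calculation.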
📈 Data Visualization (Graphs & Charts)
Function | Purpose | Example Usage |
---|
df["column_name"].value_counts().plot(kind="bar") | Create a bar chart | df["genre"].value_counts().plot(kind="bar") |
df.plot(kind="line") | Create a line chart | df.plot(kind="line") |
df["column_name"].hist() | Create a histogram of one column (df.hist() draws one for every numeric column) | df["duration"].hist() |
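
pandas plotting is built on matplotlib, so matplotlib must be installed for the calls above to work. The sketch below draws a bar chart and a histogram side by side on made-up data; the figure layout, titles, and the non-interactive "Agg" backend are choices made for this example only.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "genre": ["rock", "pop", "rock", "jazz", "pop", "rock"],
    "duration": [210, 180, 240, 300, 195, 225],
})

# One figure with two side-by-side axes
fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: number of songs in each genre
df["genre"].value_counts().plot(kind="bar", ax=ax_bar, title="Songs per genre")

# Histogram: distribution of durations
df["duration"].hist(ax=ax_hist, bins=5)
ax_hist.set_title("Duration")

fig.tight_layout()
# In a script, fig.savefig(...) writes the figure to a file;
# in an interactive session, plt.show() displays it.
```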
🎵 Example: Cleaning & Analyzing Music Data
If you have a dataset with song names, duration, genre, artist, and year, you can clean and analyze it using pandas.
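
One possible end-to-end version of that workflow is sketched below. The sample rows, the column names (`song`, `duration`, `genre`, `artist`, `year`), the `clean_music` helper, and the specific cleaning rules are all assumptions made for illustration; a real dataset will need its own rules.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a duplicate row, an invalid duration,
# a missing duration, a missing genre, and inconsistent text
raw = pd.DataFrame({
    "song": ["Song A", "Song B", "Song B", "Song C", "Song D", "Song E"],
    "duration": [210.0, 185.0, 185.0, -1.0, np.nan, 300.0],
    "genre": ["hiphop", "Rock", "Rock", "rock", "Jazz", None],
    "artist": ["Artist 1", "ARTIST 2", "ARTIST 2", "Artist 3", "Artist 4", "Artist 5"],
    "year": [2001, 2010, 2010, 2005, 1998, 2015],
})


def clean_music(df):
    """Return a cleaned copy of the music DataFrame (illustrative rules)."""
    df = df.drop_duplicates()

    # Drop invalid durations first so they do not distort the mean,
    # but keep missing ones so they can be filled
    df = df[df["duration"].isna() | (df["duration"] > 0)].copy()
    df["duration"] = df["duration"].fillna(df["duration"].mean())

    # Normalize text columns
    df["genre"] = df["genre"].fillna("Unknown").str.lower().replace("hiphop", "hip-hop")
    df["artist"] = df["artist"].str.lower()

    return df.sort_values("year").reset_index(drop=True)


clean = clean_music(raw)
mean_by_genre = clean.groupby("genre")["duration"].mean()

print(clean)
print(mean_by_genre)
```

The order of the steps matters here: removing the negative duration before computing the mean keeps that bad value out of the fill.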
🚀 Summary
With these functions, you can:
✔ Load and inspect your data
✔ Handle missing values, duplicates, and incorrect data
✔ Sort and group data for analysis
✔ Create basic visualizations
Mastering these pandas functions will make your data cleaning and analysis process much more efficient. Try them with your own datasets!