In housing price forecasting projects, data cleaning is a crucial step. It not only determines the accuracy of the model, but also directly affects the reliability of subsequent analysis. This article takes the Boston housing price dataset as an example and uses Python's Pandas, Matplotlib, and other tools to walk through the entire data cleaning process, generating a data cleaning report and visual charts along the way. It is suitable for complete beginners, and following along in a Jupyter Notebook is recommended.
1. Overview of the importance and process of data cleaning
(1) Why is data cleaning needed?
- Improve model performance: raw data may contain noise, missing values, or outliers, which directly degrade prediction accuracy.
- Ensure sound analysis: incorrect or inconsistent data leads to wrong conclusions.
- Meet algorithm input requirements: many machine learning algorithms require input data to be complete and uniformly formatted.
(2) Basic process of data cleaning
- Data collection: Get the original data (such as Kaggle dataset).
- Data preview: Check the data structure, field type and basic information.
- Handle missing values: Fill in, delete or mark missing data.
- Handle outliers: Identify and correct or delete abnormal data.
- Feature Engineering: Calculate correlation, normalization, feature selection, etc.
- Data visualization: Analyze data distribution and relationships through charts.
2. Environment preparation and data import
1. Install the dependency library
Make sure the following Python libraries are installed:
pip install pandas matplotlib seaborn numpy
2. Import data
Take the Boston House Price Dataset as an example (it can be downloaded from Kaggle or loaded from sklearn's built-in data):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Example: Loading data from local CSV file
df = pd.read_csv('boston_house_prices.csv') # Replace with the actual path
# Or use the sklearn built-in loader (note: load_boston was removed in scikit-learn 1.2;
# with older versions the following works)
# from sklearn.datasets import load_boston
# boston = load_boston()
# df = pd.DataFrame(boston.data, columns=boston.feature_names)
# df['PRICE'] = boston.target
3. Data preview and basic information analysis
1. View the data structure
print(df.head())      # View the first 5 rows
print(df.info())      # Data types and missing value statistics
print(df.describe())  # Summary statistics for numeric fields
2. Check field type
Ensure that all field types are correct (for example, numeric types versus category types).
# Example: Convert a category field to a category type
df['CHAS'] = df['CHAS'].astype('category') # Assume CHAS is a binary classification variable
3. Data cleaning report template
It is recommended to record the following information:
- Dataset size (number of rows, columns)
- Missing value statistics
- Field Type
- Statistical information such as mean, standard deviation and other numerical fields
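The items in this report template can be collected programmatically. Below is a minimal sketch; the helper name `cleaning_report` and the tiny example DataFrame are illustrative, not part of the Boston dataset:

```python
import pandas as pd
import numpy as np

def cleaning_report(df: pd.DataFrame) -> dict:
    """Collect the basic facts a cleaning report should record."""
    numeric = df.select_dtypes(include=np.number)
    return {
        'shape': df.shape,                                # (rows, columns)
        'missing': df.isnull().sum().to_dict(),           # missing values per field
        'dtypes': df.dtypes.astype(str).to_dict(),        # field types
        'numeric_summary': numeric.describe().to_dict(),  # mean, std, quartiles, ...
    }

# Toy data for demonstration
df = pd.DataFrame({'RM': [6.5, 5.9, np.nan], 'CHAS': ['0', '1', '0']})
report = cleaning_report(df)
print(report['shape'])    # (3, 2)
print(report['missing'])  # {'RM': 1, 'CHAS': 0}
```

The resulting dictionary can be dumped into a Markdown or JSON report at the end of the notebook.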
4. Handle missing values
1. Missing value detection
missing_summary = df.isnull().sum()
print(missing_summary[missing_summary > 0])  # Output only fields with missing values
2. Missing value processing strategy
- Delete missing values: suitable for fields with a small proportion of missing data.
  df.dropna(subset=['FIELD_NAME'], inplace=True)  # Replace with the actual field name
- Fill in missing values:
  - Numeric fields: fill with the mean, median, or a specific value.
    df['FIELD_NAME'].fillna(df['FIELD_NAME'].mean(), inplace=True)
  - Category fields: fill with the mode.
    df['FIELD_NAME'].fillna(df['FIELD_NAME'].mode()[0], inplace=True)
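The two filling strategies can be tried end to end on a toy DataFrame (the field names `AGE` and `CHAS` mirror the Boston dataset, but the values here are made up; assignment is used instead of `inplace=True` to avoid chained-assignment warnings in recent pandas):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'AGE':  [65.2, np.nan, 45.8, 90.0],  # numeric field with a gap
    'CHAS': ['0', '1', np.nan, '0'],     # category field with a gap
})

# Numeric field: fill with the median (robust to outliers)
df['AGE'] = df['AGE'].fillna(df['AGE'].median())
# Category field: fill with the mode (most frequent value)
df['CHAS'] = df['CHAS'].fillna(df['CHAS'].mode()[0])

print(df['AGE'].tolist())   # [65.2, 65.2, 45.8, 90.0]
print(df['CHAS'].tolist())  # ['0', '1', '0', '0']
```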
3. Example: Comprehensive processing of missing values
# Assume that the 'AGE' field has a missing value
df['AGE'].fillna(df['AGE'].median(), inplace=True)  # Fill with the median
5. Handle outliers
1. Outlier detection methods
- Boxplot (IQR) method: identify outliers via the interquartile range.
  Q1 = df['FIELD_NAME'].quantile(0.25)
  Q3 = df['FIELD_NAME'].quantile(0.75)
  IQR = Q3 - Q1
  outliers = df[(df['FIELD_NAME'] < Q1 - 1.5*IQR) | (df['FIELD_NAME'] > Q3 + 1.5*IQR)]
  print(outliers)
- 3σ rule: applicable to normally distributed data.
  mean = df['FIELD_NAME'].mean()
  std = df['FIELD_NAME'].std()
  outliers = df[(df['FIELD_NAME'] < mean - 3*std) | (df['FIELD_NAME'] > mean + 3*std)]
2. Outlier processing strategies
- Correct outliers: replace them with reasonable values (such as the median).
- Delete outliers: suitable for extreme values with a large impact on the analysis.
  df = df[~((df['FIELD_NAME'] < Q1 - 1.5*IQR) | (df['FIELD_NAME'] > Q3 + 1.5*IQR))]
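The IQR fence can be packaged as a small reusable helper. This is a sketch; the function name and the six toy values (imitating the Boston `CRIM` field) are illustrative:

```python
import pandas as pd

def iqr_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True for rows inside the [Q1 - k*IQR, Q3 + k*IQR] fence."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.between(q1 - k * iqr, q3 + k * iqr)

crim = pd.Series([0.02, 0.03, 0.05, 0.04, 0.03, 88.9])  # one extreme value
kept = crim[iqr_mask(crim)]
print(kept.tolist())  # the extreme 88.9 is dropped
```

On a real DataFrame you would filter with `df[iqr_mask(df['CRIM'])]` so that all columns of the flagged rows are removed together.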
6. Feature engineering and correlation analysis
1. Calculate feature correlation
corr_matrix = df.corr(numeric_only=True)
print(corr_matrix['PRICE'].sort_values(ascending=False)) # Assume PRICE is the target field
2. Visualize the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()
3. Feature selection
Based on the correlation analysis results, select features with high correlation with the target field.
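One simple way to act on the correlation results is to keep only features whose absolute correlation with the target exceeds a threshold. The sketch below uses synthetic data (the 0.3 cutoff and the column names are illustrative, not a recommendation from the original dataset):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
n = 200
rm = rng.normal(6, 0.7, n)       # pretend "rooms" feature
lstat = rng.normal(12, 5, n)     # pretend "lower status %" feature
df = pd.DataFrame({
    'RM': rm,
    'LSTAT': lstat,
    'NOISE': rng.normal(0, 1, n),                          # irrelevant feature
    'PRICE': 5 * rm - 0.5 * lstat + rng.normal(0, 2, n),   # target driven by RM, LSTAT
})

corr = df.corr(numeric_only=True)['PRICE'].drop('PRICE')
selected = corr[corr.abs() > 0.3].index.tolist()  # illustrative threshold
print(selected)  # RM and LSTAT pass; NOISE does not
```

Correlation-based selection only catches linear relationships; tree-based feature importance is a common complement.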
7. Data visual analysis
1. Numerical field distribution
plt.figure(figsize=(12, 6))
sns.histplot(df['PRICE'], kde=True)
plt.title('House Price Distribution')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()
2. Category field distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='CHAS', data=df)  # Assume CHAS is a category field
plt.title('CHAS Field Distribution')
plt.show()
3. Multi-field relationship analysis
sns.pairplot(df, vars=['RM', 'LSTAT', 'PRICE'])  # Sample fields
plt.show()
8. Generate a data cleaning report
1. Report content recommendations
- Basic information of the dataset (size, field type)
- Missing value and outlier processing records
- Characteristic correlation analysis results
- Data visualization conclusion
2. Sample report snippet
# House price forecast data cleaning report
## 1. Basic information of the data set
- Number of rows: 506
- Number of columns: 14
- Target field: PRICE
## 2. Missing value processing
- Missing values in the 'AGE' field were filled with the median.
- Rows with missing values in the 'RAD' field were deleted.
## 3. Outlier value processing
- 5 outliers were detected in the 'CRIM' field and deleted.
## 4. Characteristic correlation analysis
- 'RM' has the highest positive correlation with 'PRICE' (0.7).
- The correlation between 'LSTAT' and 'PRICE' is -0.74.
## 5. Data visualization conclusion
- The housing price distribution is right-skewed; a logarithmic transformation is recommended.
- 'RM' is positively correlated with 'PRICE' and 'LSTAT' is negatively correlated.
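The log transformation recommended above can be sketched with `np.log1p` (log(1 + x), which stays safe at zero). The prices below are synthetic lognormal values, not the actual Boston data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
price = pd.Series(np.exp(rng.normal(3, 0.5, 1000)))  # synthetic right-skewed prices

print(round(price.skew(), 2))      # clearly positive: right-skewed
log_price = np.log1p(price)        # compress the long right tail
print(round(log_price.skew(), 2))  # much closer to 0 after the transform
```

If the model is trained on `log_price`, remember to invert predictions with `np.expm1` before reporting them.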
9. Summary and Extension
1. Summary
This article walks through the entire data cleaning process using the Boston housing price dataset, including data collection, missing value handling, outlier handling, feature engineering, and visual analysis. After mastering these skills, you can handle similar data cleaning tasks with ease.
2. Extended learning direction
After completing the basic data cleaning, you can further explore the following directions:
- Feature Engineering Deepening: Try more feature combinations (such as interactive features) or dimensionality reduction techniques (such as PCA).
- Model adaptation optimization: adjust the model input according to the data distribution (such as log-transforming the target variable to alleviate skewness).
- Automated cleaning tools: use sklearn-pandas or Pandas Profiling (now published as ydata-profiling) to generate automated cleaning reports.
- Data augmentation: extend the dataset by synthesizing minority-class samples (such as with SMOTE) or by data simulation.
10. Complete code implementation and comment
The following provides a complete code framework that can be run directly in a Jupyter Notebook:
# 1. Import dependency libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# 2. Data loading and preview
df = pd.read_csv('boston_house_prices.csv')  # Replace with the actual path
print(df.head())
print(df.info())
print(df.describe())
# 3. Missing value processing
missing_summary = df.isnull().sum()
print("Missing value statistics:\n", missing_summary[missing_summary > 0])
# Example: fill missing values in a numeric field
df['AGE'].fillna(df['AGE'].median(), inplace=True)
# 4. Outlier detection and processing
Q1 = df['CRIM'].quantile(0.25)  # Example field
Q3 = df['CRIM'].quantile(0.75)
IQR = Q3 - Q1
df = df[~((df['CRIM'] < Q1 - 1.5*IQR) | (df['CRIM'] > Q3 + 1.5*IQR))]  # Delete outliers
# 5. Feature correlation analysis
corr_matrix = df.corr(numeric_only=True)
print(corr_matrix['PRICE'].sort_values(ascending=False))  # Assume PRICE is the target field
# 6. Visual analysis
plt.figure(figsize=(10, 6))
sns.histplot(df['PRICE'], kde=True, bins=30)
plt.title('House Price Distribution')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()
# 7. Save the cleaned data (optional)
df.to_csv('cleaned_boston_house_prices.csv', index=False)
11. Data cleaning report (template example)
1. Report structure recommendations
- Cover: Project name, date, author
- Table of contents: chapters organized by cleaning step
- Body:
- Dataset overview: field names, data types, sample size
- Cleaning process: code snippets and result screenshots for each step
- Analysis conclusions: key insights derived from the charts (such as "RM is strongly positively correlated with PRICE")
- Improvement suggestions: for example, "more accurate data on house age (AGE) should be supplemented"
2. Automated report generation tool
- Use Pandas Profiling (the package is now published as ydata-profiling):
  from pandas_profiling import ProfileReport
  profile = ProfileReport(df, title="House Price Forecast Data Cleaning Report")
  profile.to_notebook_iframe()  # Render directly in Jupyter
12. Practical Techniques and Pit Avoidance Guide
1. Missing value processing
- Wrong approach: directly delete all rows with missing values (this may discard a large amount of data).
- Right approach: choose a filling strategy based on business logic (such as filling a house area field with the median).
2. Outlier value processing
- Wrong approach: keep all outliers (which can cause the model to overfit).
- Right approach: remove only extreme outliers (such as samples with house price > $1 million).
3. Feature Engineering
- Wrong approach: use raw features directly (for example, 'RM' and 'ZN' may be collinear).
- Right approach: reduce dimensionality with PCA, or select features with VIF < 5.
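The variance inflation factor mentioned above can be computed with plain least squares: VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing feature j on all the others. This is a minimal sketch on synthetic collinear data; in practice `statsmodels` provides a ready-made `variance_inflation_factor`:

```python
import numpy as np
import pandas as pd

def vif(df: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R^2_j), regressing each feature on the others."""
    out = {}
    X = df.to_numpy(dtype=float)
    for j, col in enumerate(df.columns):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # add intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - (y - A @ beta).var() / y.var()
        out[col] = 1.0 / max(1 - r2, 1e-12)             # guard against r2 == 1
    return pd.Series(out)

rng = np.random.default_rng(0)
rm = rng.normal(size=300)
df = pd.DataFrame({
    'RM': rm,
    'ZN': rm * 0.9 + rng.normal(scale=0.1, size=300),  # nearly collinear with RM
    'AGE': rng.normal(size=300),                       # independent
})
v = vif(df)
print(v.round(1))  # RM and ZN inflate each other; AGE stays near 1
```

Dropping one of any pair with VIF above the chosen cutoff (the article suggests 5) usually resolves the collinearity.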
13. Advanced learning resources
1. Books
- "Python Data Science Manual" (Wes McKinney)
- "Feature Engineering and Selection" (Zheng Laiyi)
2. Online courses
- Kaggle's "Data Cleaning" micro-course
- The data preprocessing section of Coursera's "Machine Learning Specialization"
3. Community
- Kaggle forum threads on data cleaning tips
- Stack Overflow discussions on Pandas missing value handling
14. Conclusion
The core of housing price prediction is data quality. This cleaning tutorial showed how to separate the wheat from the chaff in raw data:
- Delete duplicate records (such as multiple listings with the same address)
- Fill in key fields (such as house age and construction year)
- Build a healthy dataset: delete samples containing missing values or outliers
These operations may shrink the dataset slightly, but they:
- Make feature distributions more reasonable (for example, house prices no longer take negative values)
- Make model training more stable (for example, a correlation coefficient rising from 0.1 to 0.7)
- Yield more credible business conclusions (for example, the share of high-priced areas rising from 30% to 65%)
All data practitioners are encouraged to bookmark the code templates in this article:
- Missing value processing scripts (including 10 strategies)
- Outlier detection code (both the IQR and 3σ versions)
- Correlation analysis visualizations (heatmap / scatter plot matrix)
Call to action:
- Download the dataset now (from Kaggle)
- Clean it hands-on (using the Jupyter Notebook template)
May every reader use data cleaning to turn "dirty data" into gold:
- Follow the technical communities (Kaggle / Stack Overflow)
- Subscribe to industry news (DataCamp / Towards Data Science)
Start your data cleaning journey now! 🚀
I hope this article becomes the first cornerstone on your data science and artificial intelligence path 🧱
If in doubt, feel free to discuss in the comments 💬
Follow this blog for more data science and AI content 📚🌟