In housing price forecasting projects, data cleaning is a crucial step. It not only determines the accuracy of the model, but also directly affects the reliability of subsequent analysis. This article takes the Boston housing price dataset as an example and uses Python's Pandas, Matplotlib, and other tools to walk through the entire data cleaning process, generating a data cleaning report and visual charts along the way. It is suitable for complete beginners, and following along in a Jupyter Notebook is recommended.
1. Overview of the importance and process of data cleaning
(1) Why is data cleaning needed?
- Improve model performance: raw data may contain noise, missing values, or outliers, which directly degrade prediction accuracy.
- Ensure sound analysis: incorrect or inconsistent data leads to wrong conclusions.
- Meet algorithm input requirements: many machine learning algorithms require input data to be complete and uniformly formatted.
(2) Basic process of data cleaning
- Data collection: Get the original data (such as Kaggle dataset).
- Data preview: Check the data structure, field type and basic information.
- Handle missing values: Fill in, delete or mark missing data.
- Handle outliers: Identify and correct or delete abnormal data.
- Feature Engineering: Calculate correlation, normalization, feature selection, etc.
- Data visualization: Analyze data distribution and relationships through charts.
2. Environment preparation and data import
1. Install the dependency library
Make sure the following Python libraries are installed:
pip install pandas matplotlib seaborn numpy
2. Import data
Take the Boston House Price Dataset as an example (it can be downloaded from Kaggle or loaded from sklearn's built-in data):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Example: Loading data from local CSV file
df = pd.read_csv('boston_house_prices.csv') # Replace with the actual path
# Or use the sklearn built-in loader (note: load_boston was removed in scikit-learn 1.2;
# with older versions the following works)
# from sklearn.datasets import load_boston
# boston = load_boston()
# df = pd.DataFrame(boston.data, columns=boston.feature_names)
# df['PRICE'] = boston.target
3. Data preview and basic information analysis
1. View the data structure
print(df.head())      # View the first 5 rows
print(df.info())      # Data types and missing value statistics
print(df.describe())  # Summary statistics for numeric fields
2. Check field type
Ensure that all field types are correct (for example, numeric types versus category types).
# Example: Convert a category field to a category type
df['CHAS'] = df['CHAS'].astype('category') # Assume CHAS is a binary classification variable
3. Data cleaning report template
It is recommended to record the following information:
- Dataset size (number of rows, columns)
- Missing value statistics
- Field Type
- Statistical information such as mean, standard deviation and other numerical fields
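The items in this report template can be collected programmatically. Below is a minimal sketch; the helper name `cleaning_report` and the tiny example DataFrame are illustrative, not part of the Boston dataset:

```python
import pandas as pd
import numpy as np

def cleaning_report(df: pd.DataFrame) -> dict:
    """Collect the basic facts a cleaning report should record."""
    numeric = df.select_dtypes(include=np.number)
    return {
        'shape': df.shape,                                # (rows, columns)
        'missing': df.isnull().sum().to_dict(),           # missing values per field
        'dtypes': df.dtypes.astype(str).to_dict(),        # field types
        'numeric_summary': numeric.describe().to_dict(),  # mean, std, quartiles, ...
    }

# Toy data for demonstration
df = pd.DataFrame({'RM': [6.5, 5.9, np.nan], 'CHAS': ['0', '1', '0']})
report = cleaning_report(df)
print(report['shape'])    # (3, 2)
print(report['missing'])  # {'RM': 1, 'CHAS': 0}
```

The resulting dictionary can be dumped into a Markdown or JSON report at the end of the notebook.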
4. Handle missing values
1. Missing value detection
missing_summary = df.isnull().sum()
print(missing_summary[missing_summary > 0])  # Output only fields with missing values
2. Missing value processing strategy
- Delete missing values: suitable for fields with a small proportion of missing data.
  df.dropna(subset=['FIELD_NAME'], inplace=True)  # Replace with the actual field name
- Fill in missing values:
  - Numeric fields: fill with the mean, median, or a specific value.
    df['FIELD_NAME'].fillna(df['FIELD_NAME'].mean(), inplace=True)
  - Category fields: fill with the mode.
    df['FIELD_NAME'].fillna(df['FIELD_NAME'].mode()[0], inplace=True)
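The two filling strategies can be tried end to end on a toy DataFrame (the field names `AGE` and `CHAS` mirror the Boston dataset, but the values here are made up; assignment is used instead of `inplace=True` to avoid chained-assignment warnings in recent pandas):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'AGE':  [65.2, np.nan, 45.8, 90.0],  # numeric field with a gap
    'CHAS': ['0', '1', np.nan, '0'],     # category field with a gap
})

# Numeric field: fill with the median (robust to outliers)
df['AGE'] = df['AGE'].fillna(df['AGE'].median())
# Category field: fill with the mode (most frequent value)
df['CHAS'] = df['CHAS'].fillna(df['CHAS'].mode()[0])

print(df['AGE'].tolist())   # [65.2, 65.2, 45.8, 90.0]
print(df['CHAS'].tolist())  # ['0', '1', '0', '0']
```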
3. Example: Comprehensive processing of missing values
# Assume that the 'AGE' field has a missing value
df['AGE'].fillna(df['AGE'].median(), inplace=True)  # Fill with the median
5. Handle outliers
1. Outlier detection methods
- Boxplot (IQR) method: identify outliers via the interquartile range.
  Q1 = df['FIELD_NAME'].quantile(0.25)
  Q3 = df['FIELD_NAME'].quantile(0.75)
  IQR = Q3 - Q1
  outliers = df[(df['FIELD_NAME'] < Q1 - 1.5*IQR) | (df['FIELD_NAME'] > Q3 + 1.5*IQR)]
  print(outliers)
- 3σ rule: applicable to normally distributed data.
  mean = df['FIELD_NAME'].mean()
  std = df['FIELD_NAME'].std()
  outliers = df[(df['FIELD_NAME'] < mean - 3*std) | (df['FIELD_NAME'] > mean + 3*std)]
2. Outlier processing strategies
- Correct outliers: replace them with reasonable values (such as the median).
- Delete outliers: suitable for extreme values with a large impact on the analysis.
  df = df[~((df['FIELD_NAME'] < Q1 - 1.5*IQR) | (df['FIELD_NAME'] > Q3 + 1.5*IQR))]
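The IQR fence can be packaged as a small reusable helper. This is a sketch; the function name and the six toy values (imitating the Boston `CRIM` field) are illustrative:

```python
import pandas as pd

def iqr_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True for rows inside the [Q1 - k*IQR, Q3 + k*IQR] fence."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.between(q1 - k * iqr, q3 + k * iqr)

crim = pd.Series([0.02, 0.03, 0.05, 0.04, 0.03, 88.9])  # one extreme value
kept = crim[iqr_mask(crim)]
print(kept.tolist())  # the extreme 88.9 is dropped
```

On a real DataFrame you would filter with `df[iqr_mask(df['CRIM'])]` so that all columns of the flagged rows are removed together.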
6. Feature engineering and correlation analysis
1. Calculate feature correlation
corr_matrix = df.corr(numeric_only=True)
print(corr_matrix['PRICE'].sort_values(ascending=False)) # Assume PRICE is the target field
2. Visualize the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()
3. Feature selection
Based on the correlation analysis results, select features with high correlation with the target field.
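One simple way to act on the correlation results is to keep only features whose absolute correlation with the target exceeds a threshold. The sketch below uses synthetic data (the 0.3 cutoff and the column names are illustrative, not a recommendation from the original dataset):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
n = 200
rm = rng.normal(6, 0.7, n)       # pretend "rooms" feature
lstat = rng.normal(12, 5, n)     # pretend "lower status %" feature
df = pd.DataFrame({
    'RM': rm,
    'LSTAT': lstat,
    'NOISE': rng.normal(0, 1, n),                          # irrelevant feature
    'PRICE': 5 * rm - 0.5 * lstat + rng.normal(0, 2, n),   # target driven by RM, LSTAT
})

corr = df.corr(numeric_only=True)['PRICE'].drop('PRICE')
selected = corr[corr.abs() > 0.3].index.tolist()  # illustrative threshold
print(selected)  # RM and LSTAT pass; NOISE does not
```

Correlation-based selection only catches linear relationships; tree-based feature importance is a common complement.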
7. Data visual analysis
1. Numerical field distribution
plt.figure(figsize=(12, 6))
sns.histplot(df['PRICE'], kde=True)
plt.title('House Price Distribution')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()
2. Category field distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='CHAS', data=df)  # Assume CHAS is a category field
plt.title('CHAS Field Distribution')
plt.show()
3. Multi-field relationship analysis
sns.pairplot(df, vars=['RM', 'LSTAT', 'PRICE'])  # Sample fields
plt.show()
8. Generate a data cleaning report
1. Report content recommendations
- Basic information of the dataset (size, field type)
- Missing value and outlier processing records
- Characteristic correlation analysis results
- Data visualization conclusion
2. Sample report snippet
# House price forecast data cleaning report
## 1. Basic information of the data set
- Number of rows: 506
- Number of columns: 14
- Target field: PRICE
## 2. Missing value processing
- Missing values in the 'AGE' field were filled with the median.
- Rows with missing values in the 'RAD' field were deleted.
## 3. Outlier value processing
- 5 outliers were detected in the 'CRIM' field and deleted.
## 4. Characteristic correlation analysis
- 'RM' has the highest positive correlation with 'PRICE' (0.7).
- The correlation between 'LSTAT' and 'PRICE' is -0.74.
## 5. Data visualization conclusion
- The housing price distribution is right-skewed; a logarithmic transformation is recommended.
- 'RM' is positively correlated with 'PRICE' and 'LSTAT' is negatively correlated.
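The log transformation recommended above can be sketched with `np.log1p` (log(1 + x), which stays safe at zero). The prices below are synthetic lognormal values, not the actual Boston data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
price = pd.Series(np.exp(rng.normal(3, 0.5, 1000)))  # synthetic right-skewed prices

print(round(price.skew(), 2))      # clearly positive: right-skewed
log_price = np.log1p(price)        # compress the long right tail
print(round(log_price.skew(), 2))  # much closer to 0 after the transform
```

If the model is trained on `log_price`, remember to invert predictions with `np.expm1` before reporting them.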
9. Summary and Extension
1. Summary
This article walks through the entire data cleaning process using the Boston housing price dataset, including data collection, missing value handling, outlier handling, feature engineering, and visual analysis. After mastering these skills, you can handle similar data cleaning tasks with ease.
2. Extended learning direction
After completing the basic data cleaning, you can further explore the following directions:
- Feature Engineering Deepening: Try more feature combinations (such as interactive features) or dimensionality reduction techniques (such as PCA).
- Model adaptation optimization: adjust the model input according to the data distribution (such as log-transforming the target variable to alleviate skewness).
- Automated cleaning tools: use sklearn-pandas or Pandas Profiling (now published as ydata-profiling) to generate automated cleaning reports.
- Data augmentation: extend the dataset by synthesizing minority-class samples (such as with SMOTE) or by data simulation.
10. Complete code implementation and comment
The following provides a complete code framework that can be run directly in a Jupyter Notebook:
# 1. Import dependency libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# 2. Data loading and preview
df = pd.read_csv('boston_house_prices.csv')  # Replace with the actual path
print(df.head())
print(df.info())
print(df.describe())
# 3. Missing value processing
missing_summary = df.isnull().sum()
print("Missing value statistics:\n", missing_summary[missing_summary > 0])
# Example: fill missing values in a numeric field
df['AGE'].fillna(df['AGE'].median(), inplace=True)
# 4. Outlier detection and processing
Q1 = df['CRIM'].quantile(0.25)  # Example field
Q3 = df['CRIM'].quantile(0.75)
IQR = Q3 - Q1
df = df[~((df['CRIM'] < Q1 - 1.5*IQR) | (df['CRIM'] > Q3 + 1.5*IQR))]  # Delete outliers
# 5. Feature correlation analysis
corr_matrix = df.corr(numeric_only=True)
print(corr_matrix['PRICE'].sort_values(ascending=False))  # Assume PRICE is the target field
# 6. Visual analysis
plt.figure(figsize=(10, 6))
sns.histplot(df['PRICE'], kde=True, bins=30)
plt.title('House Price Distribution')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()
# 7. Save the cleaned data (optional)
df.to_csv('cleaned_boston_house_prices.csv', index=False)
11. Data cleaning report (template example)
1. Report structure recommendations
- Cover: Project name, date, author
- Table of contents: chapters organized by cleaning step
- Body:
- Dataset overview: field names, data types, sample size
- Cleaning process: code snippets and result screenshots for each step
- Analysis conclusions: key insights derived from the charts (such as "RM is strongly positively correlated with PRICE")
- Improvement suggestions: for example, "more accurate data on house age (AGE) should be supplemented"
2. Automated report generation tool
- Use Pandas Profiling (the package is now published as ydata-profiling):
  from pandas_profiling import ProfileReport
  profile = ProfileReport(df, title="House Price Forecast Data Cleaning Report")
  profile.to_notebook_iframe()  # Render directly in Jupyter
12. Practical Techniques and Pit Avoidance Guide
1. Missing value processing
- Wrong approach: directly delete all rows with missing values (this may discard a large amount of data).
- Right approach: choose a filling strategy based on business logic (such as filling a house area field with the median).
2. Outlier value processing
- Wrong approach: keep all outliers (which can cause the model to overfit).
- Right approach: remove only extreme outliers (such as samples with house price > $1 million).
3. Feature Engineering
- Wrong approach: use raw features directly (for example, 'RM' and 'ZN' may be collinear).
- Right approach: reduce dimensionality with PCA, or select features with VIF < 5.
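The variance inflation factor mentioned above can be computed with plain least squares: VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing feature j on all the others. This is a minimal sketch on synthetic collinear data; in practice `statsmodels` provides a ready-made `variance_inflation_factor`:

```python
import numpy as np
import pandas as pd

def vif(df: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R^2_j), regressing each feature on the others."""
    out = {}
    X = df.to_numpy(dtype=float)
    for j, col in enumerate(df.columns):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # add intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - (y - A @ beta).var() / y.var()
        out[col] = 1.0 / max(1 - r2, 1e-12)             # guard against r2 == 1
    return pd.Series(out)

rng = np.random.default_rng(0)
rm = rng.normal(size=300)
df = pd.DataFrame({
    'RM': rm,
    'ZN': rm * 0.9 + rng.normal(scale=0.1, size=300),  # nearly collinear with RM
    'AGE': rng.normal(size=300),                       # independent
})
v = vif(df)
print(v.round(1))  # RM and ZN inflate each other; AGE stays near 1
```

Dropping one of any pair with VIF above the chosen cutoff (the article suggests 5) usually resolves the collinearity.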
13. Advanced learning resources
1. Books
- "Python Data Science Manual" (Wes McKinney)
- "Feature Engineering and Selection" (Zheng Laiyi)
2. Online courses
- Kaggle's "Data Cleaning" micro-course
- The data preprocessing section of Coursera's "Machine Learning Specialization"
3. Community
- Kaggle forum threads on data cleaning tips
- Stack Overflow discussions on Pandas missing value handling
14. Conclusion
The core of housing price prediction is data quality. This cleaning tutorial showed how to separate the wheat from the chaff in raw data:
- Delete duplicate records (such as multiple listings with the same address)
- Fill in key fields (such as house age and construction year)
- Build a healthy dataset: delete samples containing missing values or outliers
These operations may shrink the dataset slightly, but they:
- Make feature distributions more reasonable (for example, house prices no longer take negative values)
- Make model training more stable (for example, a correlation coefficient rising from 0.1 to 0.7)
- Yield more credible business conclusions (for example, the share of high-priced areas rising from 30% to 65%)
All data practitioners are encouraged to bookmark the code templates in this article:
- Missing value processing scripts (including 10 strategies)
- Outlier detection code (both the IQR and 3σ versions)
- Correlation analysis visualizations (heatmap / scatter plot matrix)
Call to action:
- Download the dataset now (from Kaggle)
- Clean it hands-on (using the Jupyter Notebook template)
May every reader use data cleaning to turn "dirty data" into gold:
- Follow the technical communities (Kaggle / Stack Overflow)
- Subscribe to industry news (DataCamp / Towards Data Science)
Start your data cleaning journey now! 🚀
I hope this article becomes the first cornerstone on your data science and artificial intelligence path 🧱
If in doubt, feel free to discuss in the comments 💬
Follow this blog for more data science and AI content 📚🌟