- 1. Characteristics
- 2.1 Creating a Series
- 2.2 Using labels to select data
- 2.3 Selecting data by position
- 2.4 Selecting Data Using Boolean Values
- 2.5 Other operations
- 2.5.1 Modifying data
- 2.5.2 Statistical operations
- 2.5.3 Handling missing data
- 3.1 Creating a DataFrame
- 3.2 Selection of data
- 3.2.1 Selecting data using labels
- 3.2.2 Selecting data using iloc
- 3.2.3 Selecting data using specified column names
- 3.2.4 Selecting Data Using Boolean Values
- 3.3 Modifying data
- 3.4 Statistical operations
- 3.5 Handling missing data
- 4. Reading data in a variety of formats
- 4.1 Reading CSV files
- 4.2 Reading Excel files
- 4.3 Reading SQL data
- 4.4 Reading HTML files
- 5. Data pre-processing
- 5.1 Filtering Data with Boolean Values
- 5.2 Filtering data using the where method
- 5.3 Modifying data
- 5.4 Handling missing values
- 5.5 Sorting
- 6. Statistical calculations
- 6.1 Common Statistical Functions
- Descriptive statistics
- Distribution and shape
- Correlation
- Custom statistics
- 6.2 Rapid statistical summaries
- 7. Cross-tabulation statistics
- 7.1 Using groupby() statistics
- 7.2 Statistics with pivot_table()
- 8. Time series data processing
- 8.1 Functions for working with time series data
- 8.2 DatetimeIndex
- 8.3 Filtering time series data
- 8.4 Resampling
pandas is a third-party Python package that makes working with indexed data intuitive and easy. pandas has two main data structures, Series and DataFrame, which are widely used for data analysis in finance, statistics, the social sciences, and other fields.
1. Characteristics

Data structures:
- Series: a one-dimensional array that can store any data type (integers, strings, floats, Python objects, etc.).
- DataFrame: similar to an Excel table; it can store columns of different data types.

Data manipulation:
- Supports a large number of data operations, including data cleansing, handling missing data, and resampling time series data.
- Provides rich data alignment and integrated processing capabilities.

Data indexing:
- Supports a variety of indexing methods, including timestamps, integer indexing, and label indexing.
- Data can be efficiently sliced, filtered, and grouped.

Time series functionality:
- Powerful time series features make it easy to process and analyze time series data.

Data merging:
- Provides a variety of merging and joining tools, such as merge, join, and concat.

Data grouping:
- The groupby function lets you group data and apply aggregation functions to each group.

Data reshaping:
- Supports pivot, melt, and other operations that make it easy to reshape data structures.

Handling big data:
- While pandas is not designed for large-scale datasets, it can be used together with libraries such as Dask to work with datasets that exceed memory limits.

Integration:
- Integrates seamlessly with other Python data science libraries such as NumPy, SciPy, Matplotlib, and Scikit-learn.

Performance:
- The underlying layer is written in Cython and C, providing fast data manipulation performance.

Usability:
- Provides an intuitive API that makes data manipulation and analysis simple.

Documentation and community:
- With detailed official documentation and an active community, users can easily find help and resources.
In the Pandas library, a Series is a one-dimensional array structure that can store any data type (integers, strings, floats, Python objects, etc.). It is similar to a Python list or a one-dimensional NumPy array, but a Series is more powerful because it can hold mixed data types and each element has a label (called an index).
2.1 Creating a Series
You can use the pd.Series class to create a new Series; the first parameter accepts data such as a list, tuple, or dictionary.
import pandas as pd
ser = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))
ser
If index is omitted, a default integer index starting from 0 is created.
pd.Series([1, 2, 3, 4, 5])
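A Series can also be created from a dictionary, in which case the keys become the index labels. A minimal sketch (the fruit prices are made-up example data):

```python
import pandas as pd

# Dictionary keys become the index labels
prices = pd.Series({'apple': 3, 'banana': 2, 'cherry': 5})
print(prices.index.tolist())  # ['apple', 'banana', 'cherry']
print(prices['banana'])       # 2
```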
2.2 Using labels to select data
The loc method allows you to select data based on labels.
# Specify a label
print(ser.loc['b'])
# Without using loc
print(ser['b'])
# Specify a range of labels
print(ser.loc['a':'c'])
2.3 Selecting data by position
In Pandas, in addition to using labels (indexes) to select data, you can also select data by position (integer index) with iloc. This is similar to indexing a Python list. Here are some examples:
import pandas as pd
# Create a Series
ser = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))
# Select the first element by position
print(ser.iloc[0])  # Output: 1
# Select multiple elements by position
print(ser.iloc[0:3])  # Output: a 1, b 2, c 3
# Select the last element by position
print(ser.iloc[-1])  # Output: 5
2.4 Selecting Data Using Boolean Values
Boolean indexing is a very powerful feature in Pandas that allows you to select data based on conditions. Here are some examples:
import pandas as pd
# Create a Series
ser = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))
# Use a boolean index to select elements greater than 2
print(ser[ser > 2])
# Use boolean index to select elements less than or equal to 3
print(ser[ser <= 3])
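Conditions can also be combined with the element-wise operators & (and), | (or), and ~ (not); each condition must be wrapped in parentheses because of operator precedence. A quick sketch:

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))
# Parentheses are required around each condition
print(ser[(ser > 1) & (ser < 4)])    # b 2, c 3
print(ser[(ser == 1) | (ser == 5)])  # a 1, e 5
```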
2.5 Other operations
2.5.1 Modifying data
You can modify the data in a Series directly through its index:
ser['a'] = 10 # modify the element with index 'a'
print(ser)
2.5.2 Statistical operations
Series provides a number of built-in statistical methods, such as sum(), mean(), max(), min(), std(), and var():
print(ser.sum())   # Sum
print(ser.mean())  # Mean
print(ser.max())   # Maximum value
print(ser.min())   # Minimum value
print(ser.std())   # Standard deviation
print(ser.var())   # Variance
2.5.3 Handling missing data
If a Series contains missing values (NaN), Pandas provides a variety of methods to handle them, such as dropna() and fillna():
ser = pd.Series([1, 2, None, 4, 5])
print(ser.dropna())  # Remove missing values
ser.fillna(0, inplace=True)  # Fill missing values with 0
print(ser)
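Before removing or filling missing values, it is often useful to locate them first; isna() returns a boolean mask of the missing entries. A minimal sketch:

```python
import pandas as pd

ser = pd.Series([1, 2, None, 4, 5])
print(ser.isna())        # True where the value is missing
print(ser.isna().sum())  # number of missing values: 1
```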
These operations make Series a very flexible and powerful data structure for a variety of data analysis tasks.
DataFrame is another core data structure in Pandas. It is a two-dimensional tabular structure that can be thought of as a collection of Series (each Series forming one column of the DataFrame), all sharing a common index.
3.1 Creating a DataFrame
A DataFrame can be created in a variety of ways, for example from a dictionary, a list, a NumPy array, an existing DataFrame, or by reading directly from a data file (e.g. CSV).
import pandas as pd
# Create a DataFrame from a dictionary
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 23, 34, 29],
'City': ['New York', 'Paris', 'Berlin', 'London']}
df = pd.DataFrame(data)
print(df)
# Create a DataFrame from a list
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df)
# Create a DataFrame from a NumPy array
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df)
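Whichever way a DataFrame is created, a quick first look pays off: shape gives the dimensions, dtypes the column types, and head() the first rows. A small sketch with made-up data (df_demo is a hypothetical name, used to leave the tutorial's df untouched):

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0]})
print(df_demo.shape)   # (3, 2): 3 rows, 2 columns
print(df_demo.dtypes)  # column types: A is integer, B is float
print(df_demo.head())  # first rows of the table
```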
3.2 Selection of data
3.2.1 Selecting data using labels
Using .loc you can select data based on labels. It allows you to select both rows and columns.
# Select rows where 'Name' is 'John'
print(df.loc[df['Name'] == 'John'])
# Select the 'Age' and 'City' columns
print(df.loc[:, ['Age', 'City']])
3.2.2 Selecting data using iloc
Using .iloc you can select data based on integer positions. It allows you to select both rows and columns.
# Select the first row
print(df.iloc[0])
# Select the first two rows and first two columns
print(df.iloc[:2, :2])
3.2.3 Selecting data using specified column names
Columns can be quickly selected by using the column name directly.
# Select the 'Age' column
print(df['Age'])
3.2.4 Selecting Data Using Boolean Values
Boolean indexes allow you to select rows based on conditions.
# Select rows where 'Age' is greater than 25
print(df[df['Age'] > 25])
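Row conditions can be combined here as well, and isin() tests membership in a list of values. A sketch recreating the example table:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 23, 34, 29],
                   'City': ['New York', 'Paris', 'Berlin', 'London']})
# Older than 25 AND living in one of the listed cities
result = df[(df['Age'] > 25) & df['City'].isin(['New York', 'London'])]
print(result)
```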
3.3 Modification of data
modificationsDataFrame
The data in theSeries
Similarly, changes can be made directly by label or location.
# Change 'City' of 'John' to 'Los Angeles'
[df['Name'] == 'John', 'City'] = 'Los Angeles'
print(df)
3.4 Statistical operations
DataFrame provides a rich set of statistical methods that can operate on the entire data frame or on specific columns.
# Calculate descriptive statistics for each column
print(df.describe())
# Calculate the mean of the 'Age' column
print(df['Age'].mean())
3.5 Handling missing data
Like Series, DataFrame supports multiple methods of handling missing data.
# Add a missing value
df.loc[3, 'Age'] = None
# Remove rows containing missing values
print(df.dropna())
# Fill in missing values
df.fillna(value=30, inplace=True)
print(df)
DataFrame is a very powerful tool for data science and analysis, providing flexible data manipulation and analysis capabilities.
4. Reading data in a variety of formats
Pandas provides a variety of functions to read data files in different formats, and these functions make data import very simple and straightforward. The following are some commonly used methods for reading data:
4.1 Reading CSV files
CSV (Comma-Separated Values) files are a common format for data exchange. Pandas' read_csv function can easily read CSV files.
import pandas as pd
# Read a CSV file
df = pd.read_csv('path_to_file.csv')
# Display the first few rows of data
print(df.head())
The read_csv function provides many parameters for handling different CSV formats, such as specifying the delimiter, handling missing values, and selecting specific columns.
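As an illustration of those parameters, the sketch below reads a semicolon-delimited table from an in-memory buffer; io.StringIO simply stands in for a real file path, and the two rows of data are made up:

```python
import io
import pandas as pd

# In-memory text that plays the role of a CSV file on disk
csv_text = io.StringIO("name;age\nJohn;28\nAnna;23")
df = pd.read_csv(csv_text, sep=';')
print(df)
print(df.shape)  # (2, 2)
```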
4.2 Reading Excel files
Excel files are a widely used spreadsheet format. The read_excel function can be used to read them.
# Read an Excel file
df = pd.read_excel('path_to_file.xlsx')
# Display the first few rows of data
print(df.head())
The read_excel function allows you to specify worksheets, read specific cell ranges, and more.
4.3 Reading SQL data
Pandas can connect to a database through SQLAlchemy and use the read_sql or read_sql_query function to read SQL data.
from sqlalchemy import create_engine
import pandas as pd
# Create a database connection engine
engine = create_engine('database_connection_string')
# Retrieve the results of a SQL query
df = pd.read_sql_query('SELECT * FROM table_name', con=engine)
# Display the first few rows of data
print(df.head())
A valid database connection string is required here, as well as a corresponding database driver.
4.4 Reading HTML files
Pandas' read_html function parses HTML <table> tags and converts them to DataFrame objects.
# Read an HTML file
df = pd.read_html('path_to_file.html')
# df is a list of DataFrames; select the first one
df = df[0]
# Display the first few rows of data
print(df.head())
The read_html function tries to find all <table> tags and returns a list of DataFrames.
When reading these files, Pandas allows you to specify a variety of parameters to deal with specific formats in the file, such as encoding, column names, data types, and so on. These functions greatly simplify the process of importing data from different data sources.
5. Data pre-processing
Data preprocessing is a critical step in data analysis and machine learning projects, and Pandas provides a variety of tools to help us accomplish these tasks. Here are some common data preprocessing techniques:
5.1 Filtering Data with Boolean Values
Boolean indexing allows us to filter data based on conditions: we can apply a boolean expression to a DataFrame or Series to select the rows or columns that satisfy it.
import pandas as pd
# Suppose we have the following DataFrame
# (an 'Age' column is assumed here; without it the filter below cannot run)
df = pd.DataFrame({
'Name': ['John', 'Anna', 'Peter', 'Linda', 'Michael'],
'Age': [28, 23, 34, 29, 40]
})
# Use a boolean mask to filter for people older than 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)
5.2 Filtering data using the where method
The where method filters data based on a conditional expression: it returns an object of the same shape as the original in which entries that fail the condition are replaced (with NaN by default).
# Use the where method to keep rows where age is greater than 30
filtered_df = df.where(df['Age'] > 30)
print(filtered_df)
The where method masks the original data with the condition; if you need to replace values that do not satisfy the condition with something other than NaN, you can combine it with fillna, or use the inverse method mask.
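The relationship between where and mask can be seen in a small sketch: where keeps the values that satisfy the condition, while mask replaces them.

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5])
# where: keep values meeting the condition, replace the rest with `other`
print(ser.where(ser > 2, other=0))  # 0, 0, 3, 4, 5
# mask: the inverse - replace values that DO meet the condition
print(ser.mask(ser > 2, other=0))   # 1, 2, 0, 0, 0
```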
5.3 Modifying data
Data can be modified directly by label or position.
# Modify data in specific rows
df.loc[df['Name'] == 'John', 'Age'] = 28
# Modify data for a specific column
df['Age'] = df['Age'] + 1
print(df)
5.4 Handling missing values
Handling missing values is an important part of data preprocessing. Pandas provides several ways to do it.
# Remove rows containing missing values
df_cleaned = df.dropna()
# Fill in missing values
df_filled = df.fillna(value=0)
print(df_filled)
You can also use the interpolate method to fill missing values by interpolation.
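A short sketch of interpolation: by default, interpolate() fills each gap linearly from its neighboring known values.

```python
import pandas as pd

ser = pd.Series([1.0, None, 3.0, None, 5.0])
# Linear interpolation fills each gap from its neighbors
print(ser.interpolate())  # 1.0, 2.0, 3.0, 4.0, 5.0
```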
5.5 Sorting
Sorting is a common operation in data analysis; Pandas provides the sort_values method to sort data.
# Sort by age in ascending order
sorted_df = df.sort_values(by='Age')
# Sort by age in descending order
sorted_df_desc = df.sort_values(by='Age', ascending=False)
print(sorted_df)
print(sorted_df_desc)
You can specify multiple columns when sorting and set whether they are in ascending or descending order.
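For example, sorting by several columns with a direction per column (the table below is made-up data under a hypothetical name, people):

```python
import pandas as pd

people = pd.DataFrame({'City': ['Paris', 'Paris', 'London'],
                       'Age': [30, 25, 40]})
# City ascending, then Age descending within each city
print(people.sort_values(by=['City', 'Age'], ascending=[True, False]))
```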
These are some of the common operations used in data preprocessing, and the features provided by Pandas make data cleaning and preparation very efficient and convenient.
6. Statistical calculations
Statistical calculations are a core part of data analysis, and Pandas provides a rich set of functions to perform descriptive statistical analysis. The following are some commonly used statistical calculations:
6.1 Common Statistical Functions

Descriptive statistics:
- count(): counts the number of non-NA/null values.
- mean(): calculates the average.
- median(): calculates the median.
- min() and max(): calculate the minimum and maximum values.
- std() and var(): calculate the standard deviation and variance.
- sum(): calculates the sum.
- size: an attribute that returns the total number of elements.
import pandas as pd
# Create a simple DataFrame
# (an 'A' column is assumed here so the later examples that reference it can run)
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [10, 20, 30, 40, 50]
})
# Calculate descriptive statistics
print(df.describe())
Distribution and shape:
- skew(): calculates the skewness (the asymmetry of the data distribution).
- kurt(): calculates the kurtosis (the "tailedness" of the data distribution).
print(df.skew())
print(df.kurt())
Correlation:
- corr(): calculates the correlation coefficients between columns.
print(df.corr())
Custom statistics:
- agg(): allows multiple statistical functions to be applied at once.
print(df.agg(['mean', 'max', 'min']))
6.2 Rapid statistical summaries
Pandas' describe() method quickly provides summary statistics for a data frame, including the mean, standard deviation, minimum, maximum, and so on.
# Descriptive statistics for the entire DataFrame
print(df.describe())
# Perform descriptive statistics for the specified column
print(df[['A', 'B']].describe())
The describe() method calculates statistics for numeric columns by default, but it can also be used on string columns, in which case it displays the count, the number of unique values, the most frequent value, and so on.
For categorical data, you can use the value_counts() method to see the frequency of each category.
# Suppose we have a category column
df['Category'] = ['A', 'B', 'A', 'C', 'B']
print(df['Category'].value_counts())
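value_counts() can also report relative frequencies instead of raw counts via normalize=True. A small sketch:

```python
import pandas as pd

categories = pd.Series(['A', 'B', 'A', 'C', 'B'])
print(categories.value_counts())                # counts: A 2, B 2, C 1
print(categories.value_counts(normalize=True))  # frequencies: A 0.4, B 0.4, C 0.2
```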
Statistical calculations are the foundation of data analysis, and Pandas provides these functions that make it simple to extract meaningful statistical information from data. With these statistical functions, we can quickly understand the distribution, central tendency, and degree of dispersion of the data.
7. Cross-tabulation statistics
In Pandas, groupby() and pivot_table() are two very powerful tools that help us group data and compute summary statistics.
7.1 Using groupby() statistics
The groupby() method allows us to group data based on one or more keys and then apply an aggregation function, such as sum(), mean(), or count(), to each group.
import pandas as pd
# Create an example DataFrame
# (a 'Category' column is assumed here; without it the grouping below cannot run)
df = pd.DataFrame({
'Category': ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'],
'Values': [10, 20, 30, 40, 50, 60, 70, 80]
})
# Group by the 'Category' column and calculate the sum for each group
grouped_sum = df.groupby('Category')['Values'].sum()
print(grouped_sum)
# You can apply multiple aggregation functions at the same time
grouped_stats = df.groupby('Category')['Values'].agg(['sum', 'mean', 'count'])
print(grouped_stats)
groupby() can also be used for multi-level grouping, i.e. grouping based on multiple columns.
# Suppose we have another column 'Subcategory'
df['Subcategory'] = ['X', 'X', 'Y', 'Y', 'X', 'Y', 'X', 'Y']
grouped_multi = df.groupby(['Category', 'Subcategory'])['Values'].sum()
print(grouped_multi)
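A multi-level groupby result can be pivoted into a table with unstack(), which moves the inner index level into the columns. A sketch with made-up data (sales is a hypothetical name, used to leave the tutorial's df untouched):

```python
import pandas as pd

sales = pd.DataFrame({'Category': ['A', 'A', 'B', 'B'],
                      'Subcategory': ['X', 'Y', 'X', 'Y'],
                      'Values': [10, 20, 30, 40]})
grouped = sales.groupby(['Category', 'Subcategory'])['Values'].sum()
# unstack() turns the inner 'Subcategory' level into columns
print(grouped.unstack())
```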
7.2 Statistics with pivot_table()
The pivot_table() method is similar to groupby(), but it provides more flexibility: it rearranges the data into a pivot table in which specified columns become the row and column indexes while other columns are used to calculate the values.
# Create a pivot table with 'Category' as the row index and 'Subcategory' as the column index to calculate the sum of 'Values'
pivot_table = df.pivot_table(index='Category', columns='Subcategory', values='Values', aggfunc='sum')
print(pivot_table)
The pivot_table() method is flexible enough to handle multiple aggregation functions, and it can fill in missing values, handle missing combinations, and so on.
# Create a pivot table and fill in the missing values
pivot_table_filled = df.pivot_table(index='Category', columns='Subcategory', values='Values', aggfunc='sum', fill_value=0)
print(pivot_table_filled)
pivot_table() also allows us to specify multiple aggregation functions and further process the results.
# Create a pivot table and apply multiple aggregation functions
pivot_table_multi = df.pivot_table(index='Category', columns='Subcategory', values='Values', aggfunc=['sum', 'mean'])
print(pivot_table_multi)
These tools are very useful in data analysis, especially when you need to analyze data in groups or create complex summary reports. With groupby() and pivot_table(), we can easily explore and analyze data across multiple dimensions.
8. Time series data processing
Time series data is a series of data points arranged in chronological order. Time series analysis is an important analytical tool in finance, meteorology, economics, and many other fields. Pandas provides powerful tools for working with time series data.
8.1 Functions for working with time series data
Pandas provides a series of functions specialized for working with time series data. These functions help us index, resample, and move window statistics on time series data.
import pandas as pd
import datetime as dt
# Create time series data
dates = pd.date_range('20230101', periods=6)
values = [10, 20, 25, 30, 40, 50]
ts = pd.Series(values, index=dates)
# Access the time series data
print(ts)
# Shift the time series by one day
ts_1day_later = ts.shift(1)
print(ts_1day_later)
# Rolling statistics over the time series
rolling_mean = ts.rolling(window=3).mean()
print(rolling_mean)
8.2 DatetimeIndex
DatetimeIndex is an index object in Pandas specialized for time series. It handles date and time data and provides rich time series functionality.
# Create a DatetimeIndex
index = pd.DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03'])
# Create a DatetimeIndex as a date range
date_range = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
# Create time series data
ts_with_range = pd.Series(range(10), index=date_range)
print(ts_with_range)
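A DatetimeIndex also exposes date components such as year, month, and day directly, which is handy for grouping or filtering. A minimal sketch:

```python
import pandas as pd

idx = pd.DatetimeIndex(['2023-01-15', '2023-02-20', '2023-03-25'])
print(idx.month.tolist())       # [1, 2, 3]
print(idx.day_name().tolist())  # weekday name of each date
```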
8.3 Filtering time series data
You can use the DatetimeIndex to filter time series data.
# Filter data for a specific time period
selected_ts = ts['2023-01-02':'2023-01-04']
print(selected_ts)
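With a DatetimeIndex you can also select by a partial date string, e.g. a whole month at once. A sketch with made-up data (ts_long is a hypothetical name, used to leave the tutorial's ts untouched):

```python
import pandas as pd

daily_dates = pd.date_range('2023-01-01', periods=60, freq='D')
ts_long = pd.Series(range(60), index=daily_dates)
# Select every entry from February 2023 by its partial date string
feb = ts_long.loc['2023-02']
print(feb.head())
print(len(feb))  # 28 days in February 2023
```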
8.4 Resampling
Resampling means converting a time series to a different frequency and aggregating the data accordingly. Pandas provides the resample method for this.
# Resample the time series data
resampled_ts = ts.resample('D').mean()  # daily average
print(resampled_ts)
# You can specify a different frequency
resampled_ts_monthly = ts.resample('M').mean()  # monthly average
print(resampled_ts_monthly)
When working with time series data, Pandas provides these tools to help us manage and analyze the data effectively. Through time series analysis, we can identify patterns, trends, and seasonal variations in our data, which is valuable for forecasting and decision making.