
Data analysis using pandas


Contents
  • 1. Characteristics
  • 2. Series
    • 2.1 New Series
    • 2.2 Using labels to select data
    • 2.3 Selecting data by position
    • 2.4 Selecting Data Using Boolean Values
    • 2.5 Other operations
      • 2.5.1 Modifying data
      • 2.5.2 Statistical operations
      • 2.5.3 Handling missing data
  • 3. DataFrame
    • 3.1 New DataFrame
    • 3.2 Selecting data
      • 3.2.1 Selecting data using labels
      • 3.2.2 Selecting data using iloc
      • 3.2.3 Selecting data using specified column names
      • 3.2.4 Selecting Data Using Boolean Values
    • 3.3 Modifying data
    • 3.4 Statistical operations
    • 3.5 Handling missing data
  • 4. Reading data in a variety of formats
    • 4.1 Reading CSV files
    • 4.2 Reading Excel files
    • 4.3 Reading from SQL databases
    • 4.4 Reading HTML files
  • 5. Data pre-processing
    • 5.1 Filtering Data with Boolean Values
    • 5.2 Filtering data using the where method
    • 5.3 Modifying data
    • 5.4 Handling missing values
    • 5.5 Sorting
  • 6. Statistical calculations
    • 6.1 Common Statistical Functions
      • Descriptive statistics
      • Distribution and shape
      • Correlation
      • Custom statistics
    • 6.2 Rapid statistical summaries
  • 7. Grouped and cross-tabulated statistics
    • 7.1 Statistics using groupby()
    • 7.2 Statistics with pivot_table()
  • 8. Working with time series data
    • 8.1 Functions for time series data
    • 8.2 DatetimeIndex
    • 8.3 Filtering time series data
    • 8.4 Resampling

pandas is a third-party Python package that makes it very intuitive and easy to manipulate indexed data. pandas has two main data structures, Series and DataFrame, which are widely used for data analysis in finance, statistics, the social sciences, and other fields.

1. Characteristics

  1. Data structures
    • DataFrame: similar to an Excel table; it can store columns of different types.
    • Series: a one-dimensional array that can store any data type (integers, strings, floats, Python objects, etc.).
  2. Data manipulation
    • Supports a large number of data operations, including data cleansing, handling missing data, and resampling time series data.
    • Provides rich data alignment and integrated processing capabilities.
  3. Data indexing
    • Supports a variety of indexing methods, including timestamps, integer indexes, and label indexes.
    • Data can be efficiently sliced, filtered, and grouped.
  4. Time series functionality
    • Powerful time series features make it easy to process and analyze time series data.
  5. Data merging
    • A variety of merging and joining tools are provided, such as merge, join, and concat (a short sketch of these operations follows this list).
  6. Data grouping
    • The groupby function lets you group data and apply aggregation functions.
  7. Data reshaping
    • Supports pivot, melt, and other operations that make it easy to reshape data structures.
  8. Handling big data
    • While pandas is not designed for large-scale datasets, it can be used together with libraries such as Dask to work with datasets that exceed memory limits.
  9. Integration
    • Integrates seamlessly with other Python data science libraries such as NumPy, SciPy, Matplotlib, Scikit-learn, and more.
  10. Performance
    • The underlying layer is written in Cython and C, providing fast data manipulation performance.
  11. Usability
    • Provides an intuitive API that makes data manipulation and analysis simple.
  12. Documentation and community
    • With detailed official documentation and an active community, users can easily find help and resources.
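
The following minimal sketch illustrates a few of the merging, grouping, and reshaping features listed above; the DataFrames and column names here are invented for illustration:

import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'y': [10, 20, 40]})

# Merge two DataFrames on a shared key column (inner join by default)
merged = pd.merge(left, right, on='key')
print(merged)

# Group by a column and aggregate with sum
print(merged.groupby('key')['x'].sum())

# Reshape from wide to long with melt
print(merged.melt(id_vars='key', value_vars=['x', 'y']))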

2. Series

In the Pandas library, a Series is a one-dimensional array structure that can store any data type (integers, strings, floats, Python objects, etc.). It is similar to a Python list or a one-dimensional NumPy array, but a Series is more powerful because it can store mixed data types and each element has a label (called an index).

2.1 New Series

You can use the pd.Series class to create a new Series; the first parameter accepts data such as a list, tuple, dictionary, or NumPy array.

import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))
ser

If index is omitted, the index will be created from 0 by default.

pd.Series([1, 2, 3, 4, 5])

2.2 Using labels to select data

Using the loc accessor allows you to select data based on labels:

# Specify a label
print(ser.loc['b'])

# Without loc
print(ser['b'])

# Specify a range of labels (both endpoints are included)
print(ser.loc['a':'c'])

Beyond creation and label-based access, a Series also supports selection by position and by Boolean mask, covered below.

2.3 Selecting data by position

In Pandas, in addition to using labels (indexes) to select data, you can also select data by position (integer indexes). This is similar to indexing Python lists. Here are some examples:

import pandas as pd

# Create a Series
ser = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))

# Select the first element by position
print(ser.iloc[0])  # Output: 1

# Select multiple elements by position
print(ser.iloc[0:3])  # Output: a 1, b 2, c 3

# Select the last element by position
print(ser.iloc[-1])  # Output: 5

2.4 Selecting Data Using Boolean Values

Boolean indexing is a very powerful feature in Pandas that allows you to select data based on conditions. Here are some examples:

import pandas as pd

# Create a Series
ser = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))

# Use a boolean index to select elements greater than 2
print(ser[ser > 2])

# Use boolean index to select elements less than or equal to 3
print(ser[ser <= 3])

2.5 Other operations

2.5.1 Modifying data

You can modify the data in a Series directly through its index:

ser['a'] = 10 # modify the element with index 'a'
print(ser)

2.5.2 Statistical operations

Series provides a number of built-in statistical methods, such as sum(), mean(), max(), min(), std(), var(), etc.:

print(ser.sum())   # Sum
print(ser.mean())  # Mean
print(ser.max())   # Maximum value
print(ser.min())   # Minimum value
print(ser.std())   # Standard deviation
print(ser.var())   # Variance

2.5.3 Handling missing data

If a Series contains missing values (NaN), Pandas provides a variety of methods for handling them, such as dropna(), fillna(), etc.:

ser = pd.Series([1, 2, None, 4, 5])
print(ser.dropna())  # Remove missing values

ser.fillna(0, inplace=True)  # Fill missing values with 0
print(ser)

These operations make the Series a very flexible and powerful data structure for a variety of data analysis tasks.

3. DataFrame

DataFrame is the other core data structure in Pandas: a two-dimensional tabular structure that can be thought of as a collection of Series (each Series forms one column of the DataFrame), all sharing the same index.

3.1 New DataFrame

A DataFrame can be created in a variety of ways: from dictionaries, lists, NumPy arrays, or an existing DataFrame, or read directly from a data file (e.g. CSV).

import pandas as pd

# Create a DataFrame from a dictionary
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 23, 34, 29],
        'City': ['New York', 'Paris', 'Berlin', 'London']}
df = pd.DataFrame(data)
print(df)

# Create a DataFrame from a list of lists
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df)

# Create a DataFrame from a NumPy array
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df)

3.2 Selecting data

3.2.1 Selecting data using labels

Using .loc, you can select data based on labels. It allows you to select both rows and columns.

# Select rows where 'Name' is 'John'
print(df.loc[df['Name'] == 'John'])

# Select the 'Age' and 'City' columns
print(df.loc[:, ['Age', 'City']])

3.2.2 Selecting data using iloc

Using .iloc, you can select data based on integer positions. It allows you to select both rows and columns.

# Select the first row
print(df.iloc[0])

# Select the first two rows and first two columns
print(df.iloc[:2, :2])

3.2.3 Selecting data using specified column names

Columns can be quickly selected by using the column name directly.

# Select the 'Age' column
print(df['Age'])

3.2.4 Selecting Data Using Boolean Values

Boolean indexes allow you to select rows based on conditions.

# Select rows where 'Age' is greater than 25
print(df[df['Age'] > 25])

3.3 Modifying data

Modifying data in a DataFrame works much like it does for a Series: changes can be made directly by label or position.

# Change John's 'City' to 'Los Angeles'
df.loc[df['Name'] == 'John', 'City'] = 'Los Angeles'
print(df)

3.4 Statistical operations

DataFrame provides a rich set of statistical methods that can operate on the entire frame or on specific columns.

# Calculate descriptive statistics for each column
print(df.describe())

# Calculate the mean of the 'Age' column
print(df['Age'].mean())

3.5 Handling missing data

As with Series, DataFrame supports multiple methods for handling missing data.

# Introduce a missing value
df.loc[3, 'Age'] = None

# Remove rows containing missing values
print(df.dropna())

# Fill in the missing values
df.fillna(value=30, inplace=True)
print(df)

DataFrame is a very powerful tool for data science and analysis, providing flexible data manipulation and analysis capabilities.

4. Reading data in a variety of formats

Pandas provides a variety of functions to read data files in different formats, and these functions make data import very simple and straightforward. The following are some commonly used methods for reading data:

4.1 Reading CSV files

CSV (comma-separated values) files are a common format for data exchange. Pandas' read_csv function can easily read CSV files.

import pandas as pd

# Read a CSV file
df = pd.read_csv('path_to_file.csv')

# Display the first few rows of data
print(df.head())

The read_csv function provides a number of parameters to handle different CSV formats, such as specifying delimiters, handling missing values, and selecting specific columns, as sketched below.
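
A hedged sketch of a few of these parameters (the file name and column names are placeholders):

import pandas as pd

# Hypothetical semicolon-delimited file where 'NA' marks missing values
df = pd.read_csv('path_to_file.csv',
                 sep=';',                  # custom delimiter
                 na_values=['NA'],         # extra strings to treat as missing
                 usecols=['Name', 'Age'],  # load only these columns
                 encoding='utf-8')         # file encoding
print(df.head())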

4.2 Reading Excel files

Excel files are a widely used spreadsheet format. The read_excel function can be used to read them.

# Read the Excel file
df = pd.read_excel('path_to_file.xlsx')

# Display the first few rows of data
print(df.head())

The read_excel function allows you to specify worksheets, read specific cell ranges, and more, for example:
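
A minimal sketch, assuming a workbook with a sheet named 'Sheet2' (the names are placeholders):

import pandas as pd

df = pd.read_excel('path_to_file.xlsx',
                   sheet_name='Sheet2',  # worksheet name or zero-based index
                   usecols='A:C',        # Excel-style column range
                   skiprows=1)           # skip rows above the table header
print(df.head())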

4.3 Reading from SQL databases

Pandas can connect to a database through SQLAlchemy and use the read_sql or read_sql_query function to read SQL data.

from sqlalchemy import create_engine
import pandas as pd

# Creating a Database Connection Engine
engine = create_engine('database_connection_string')

# retrieve SQL Inquiry results
df = pd.read_sql_query('SELECT * FROM table_name', con=engine)

# Display the first few rows of data
print(df.head())

A valid database connection string and the corresponding database driver are required here.
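
For example, with a local SQLite database (the file and table names are placeholders), no extra driver is needed because SQLite support ships with Python:

from sqlalchemy import create_engine
import pandas as pd

# SQLite connection string pointing at a local file
engine = create_engine('sqlite:///example.db')
df = pd.read_sql_query('SELECT * FROM table_name', con=engine)
print(df.head())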

4.4 Reading HTML files

Pandas' read_html function parses HTML <table> tags and converts them to DataFrame objects.

# Read the HTML file
df = pd.read_html('path_to_file.html')

# df is a list of DataFrames, select the first one
df = df[0]

# Display the first few rows of data
print(df.head())

The read_html function tries to find all <table> tags and returns a list of DataFrames.

When reading these files, Pandas allows you to specify a variety of parameters to deal with specific formats in the file, such as encoding, column names, data types, and so on. These functions greatly simplify the process of importing data from different data sources.

5. Data pre-processing

Data preprocessing is a critical step in data analysis and machine learning projects, and Pandas provides a variety of tools to help us accomplish these tasks. Here are some common data preprocessing techniques:

5.1 Filtering Data with Boolean Values

Boolean indexing allows us to filter data based on conditions: we can use a Boolean expression on a DataFrame or Series to select the rows or columns that satisfy the condition.

import pandas as pd

# Suppose we have the following DataFrame
df = pd.DataFrame({
    'Age': [28, 23, 34, 29, 32],  # example ages (assumed for illustration)
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Michael']
})

# Use a boolean to filter for people older than 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)

5.2 Filtering data using the where method

The where method filters data based on a conditional expression, returning a DataFrame or Series of the same shape in which entries that fail the condition are replaced with NaN.

# Use the where method to filter for ages greater than 30
filtered_df = df.where(df['Age'] > 30)
print(filtered_df)

The where method keeps the original shape of the data and masks out values that fail the condition; to replace those values with something else, combine it with fillna, or use the mask method for the inverse operation, as sketched below.
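
A minimal sketch of these combinations, reusing the df defined above:

# Fill the NaN produced by where with 0
print(df.where(df['Age'] > 30).fillna(0))

# where also accepts a replacement value directly
print(df.where(df['Age'] > 30, other=0))

# mask is the inverse: it replaces values where the condition IS true
print(df.mask(df['Age'] > 30, other=0))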

5.3 Modifying data

Modify data directly by label or location.

# Modify data on specific rows
df.loc[df['Name'] == 'John', 'Age'] = 28

# Modify data for a specific column
df['Age'] = df['Age'] + 1
print(df)

5.4 Handling missing values

Handling missing values is an important part of data preprocessing. Pandas provides several ways to do it.

# Remove rows containing missing values
df_cleaned = df.dropna()

# Fill in missing values
df_filled = df.fillna(value=0)
print(df_filled)

It is also possible to use the interpolate method to fill missing values by interpolation, as in the sketch below.
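
A minimal sketch of interpolate on a numeric Series (the values are invented for illustration):

import pandas as pd

s = pd.Series([1.0, None, 3.0, None, 5.0])
# Linear interpolation fills each NaN from its neighbors (here 2.0 and 4.0)
print(s.interpolate())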

5.5 Sorting

Sorting is a common operation in data analysis, and Pandas provides the sort_values method for it.

# Sort by age in ascending order
sorted_df = df.sort_values(by='Age')

# Sort by age in descending order
sorted_df_desc = df.sort_values(by='Age', ascending=False)
print(sorted_df)
print(sorted_df_desc)

You can sort by multiple columns and set the ascending or descending order per column, as shown below.
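
A short sketch of multi-column sorting, reusing the df from above:

# Sort by 'Age' ascending, breaking ties by 'Name' in descending order
sorted_multi = df.sort_values(by=['Age', 'Name'], ascending=[True, False])
print(sorted_multi)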

These are some of the common operations used in data preprocessing, and the features provided by Pandas make data cleaning and preparation very efficient and convenient.

6. Statistical calculations

Statistical calculations are a core part of data analysis, and Pandas provides a rich set of functions to perform descriptive statistical analysis. The following are some commonly used statistical calculations:

6.1 Common Statistical Functions

Descriptive statistics

  • count(): Counts the number of non-NA/null values.
  • mean(): Calculates the mean.
  • median(): Calculates the median.
  • min() and max(): Calculate the minimum and maximum values.
  • std() and var(): Calculate the standard deviation and variance.
  • sum(): Calculates the sum.
  • size: The total number of elements (an attribute rather than a method).
import pandas as pd

# Create a simple DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],  # example values (assumed for illustration)
    'B': [10, 20, 30, 40, 50]
})

# Calculate descriptive statistics
print(df.describe())

Distribution and shape

  • skew(): Calculates skewness (the asymmetry of the data distribution).
  • kurt(): Calculates kurtosis (the "tailedness" of the data distribution).
print(df.skew())
print(df.kurt())

Correlation

  • corr(): Calculates the correlation coefficients between columns.
print(df.corr())

Custom statistics

  • agg(): Allows multiple statistical functions to be applied at once.
print(df.agg(['mean', 'max', 'min']))

6.2 Rapid statistical summaries

Pandas' describe() method quickly provides summary statistics for a DataFrame, including the mean, standard deviation, minimum, maximum, and so on.

# Perform descriptive statistics on the entire DataFrame
print(df.describe())

# Perform descriptive statistics for the specified column
print(df[['A', 'B']].describe())

The describe() method computes statistics for numeric columns by default, but it can also be used on string columns, in which case it reports the count, the number of unique values, the most common value, and its frequency, as sketched below.
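
A minimal sketch on a string Series (the values are invented for illustration):

import pandas as pd

cat_ser = pd.Series(['A', 'B', 'A', 'C', 'B'])
# On string data, describe() reports count, unique, top, and freq
print(cat_ser.describe())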

For categorical data, you can use the value_counts() method to see the frequency of each category.

# Suppose we have a category column
df['Category'] = ['A', 'B', 'A', 'C', 'B']
print(df['Category'].value_counts())

Statistical calculations are the foundation of data analysis, and Pandas provides these functions that make it simple to extract meaningful statistical information from data. With these statistical functions, we can quickly understand the distribution, central tendency, and degree of dispersion of the data.

7. Grouped and cross-tabulated statistics

In Pandas, groupby() and pivot_table() are two very powerful tools that help us group data and compute summary statistics.

7.1 Statistics using groupby()

The groupby() method allows us to group data based on one or more keys and then apply an aggregation function to each group, such as sum(), mean(), count(), etc.

import pandas as pd

# Create an example DataFrame
df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'],  # example labels (assumed for illustration)
    'Values': [10, 20, 30, 40, 50, 60, 70, 80]
})
})

# Group by the 'Category' column and calculate the sum for each group
grouped_sum = df.groupby('Category')['Values'].sum()
print(grouped_sum)

# You can apply multiple aggregation functions at the same time
grouped_stats = df.groupby('Category')['Values'].agg(['sum', 'mean', 'count'])
print(grouped_stats)

groupby() can also be used for multi-level grouping, i.e. grouping based on multiple columns.

# Suppose we have another column 'Subcategory'
df['Subcategory'] = ['X', 'X', 'Y', 'Y', 'X', 'Y', 'X', 'Y']
grouped_multi = df.groupby(['Category', 'Subcategory'])['Values'].sum()
print(grouped_multi)

7.2 Statistics with pivot_table()

The pivot_table() method is similar to groupby(), but it provides more flexibility by letting us rearrange the data into a pivot table in which specified columns become the row and column indexes while another column supplies the values.

# Create a pivot table with 'Category' as the row index and 'Subcategory' as the column index to calculate the sum of 'Values'
pivot_table = df.pivot_table(index='Category', columns='Subcategory', values='Values', aggfunc='sum')
print(pivot_table)

The pivot_table() method is flexible enough to handle multiple aggregation functions, fill in missing values, handle missing combinations, and so on.

# Create a pivot table and fill in missing values
pivot_table_filled = df.pivot_table(index='Category', columns='Subcategory', values='Values', aggfunc='sum', fill_value=0)
print(pivot_table_filled)

pivot_table() also allows us to specify multiple aggregation functions and process the results further.

# Create a pivot table and apply multiple aggregation functions
pivot_table_multi = df.pivot_table(index='Category', columns='Subcategory', values='Values', aggfunc=['sum', 'mean'])
print(pivot_table_multi)

These tools are very useful in data analysis, especially when you need to analyze data by group or create complex summary reports. With groupby() and pivot_table(), we can easily explore and analyze data across multiple dimensions.

8. Working with time series data

Time series data is a series of data points arranged in chronological order. Time series analysis is an important analytical tool in finance, meteorology, economics, and many other fields. Pandas provides powerful tools for working with time series data.

8.1 Functions for time series data

Pandas provides a series of functions specialized for working with time series data. These functions help us index, resample, and compute moving-window statistics on time series data.

import pandas as pd

# Create time series data
dates = pd.date_range('20230101', periods=6)
values = [10, 20, 25, 30, 40, 50]
ts = pd.Series(values, index=dates)

# Access the time series data
print(ts)

# Shift the series values by one period
ts_1day_later = ts.shift(1)
print(ts_1day_later)

# Rolling statistics for the time series
rolling_mean = ts.rolling(window=3).mean()
print(rolling_mean)

8.2 DatetimeIndex

DatetimeIndex is an index object in Pandas specialized for time series. It can handle date and time data and provides rich time series functionality.

# Create a DatetimeIndex from a list of dates
index = pd.DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03'])

# Create a DatetimeIndex over a regular date range
date_range = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')

# Create time series data
ts_with_range = pd.Series(range(10), index=date_range)
print(ts_with_range)

8.3 Filtering time series data

A DatetimeIndex makes it easy to filter time series data by date labels.

# Filter data for a specific time period
selected_ts = ts['2023-01-02':'2023-01-04']
print(selected_ts)
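
Pandas also supports partial-string indexing on a DatetimeIndex, so whole months or years can be selected with a single string:

# Select all observations in January 2023
print(ts['2023-01'])

# The same idea works at year granularity
print(ts['2023'])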

8.4 Resampling

Resampling converts a time series from one frequency to another (for example, from daily to monthly), aggregating the values that fall into each new period. Pandas provides the resample method for this.

# Resample the time series data
resampled_ts = ts.resample('D').mean()  # daily average
print(resampled_ts)

# A different frequency can be specified
resampled_ts_monthly = ts.resample('M').mean()  # monthly average
print(resampled_ts_monthly)

When working with time series data, Pandas provides these tools to help us manage and analyze the data effectively. Through time series analysis, we can identify patterns, trends, and seasonal variations in our data, which is valuable for forecasting and decision making.