This article describes how to use Python to read data from an Excel spreadsheet file and, taking the values of one column as the criterion, select all rows whose value in that column falls within a specified range, then filter out some of those rows based on the values of other columns. It also plots several histograms of the data before and after filtering, and exports the resulting data to a new Excel spreadsheet file.
First, let's clarify the specific requirements. We have an existing Excel spreadsheet file; this article uses the .csv format as an example. As shown in the figure below, the file contains a column (here, the days column) whose data we use as the baseline. We first want to take out all samples (one row is one sample) whose days value lies in the range 0 to 45 or 320 to 365 for the subsequent operations.
Next, for the samples taken out, we look at four other columns (here blue_dif, green_dif, red_dif, and inf_dif) and delete the rows whose values in these four columns are not within a specified range. In the process, we also want to plot a histogram of each of these four columns before and after the deletion, 8 plots in total. Finally, we want to save the data that remains after the deletion as a new Excel spreadsheet file.
Knowing the requirements, we can write the code. The code used in this article is shown below.
# -*- coding: utf-8 -*-
"""
Created on Tue Sep 12 07:55:40 2023
@author: fkxxgis
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
original_file_path = "E:/01_Reflectivity/99_Model/02_Extract_Data/26_Train_Model_New/Train_Model_0715_Main_Over_NIR.csv"
# original_file_path = "E:/01_Reflectivity/99_Model/02_Extract_Data/26_Train_Model_New/"
result_file_path = "E:/01_Reflectivity/99_Model/02_Extract_Data/26_Train_Model_New/Train_Model_0715_Main_Over_NIR_New.csv"
df = pd.read_csv(original_file_path)
blue_original = df[(df['blue_dif'] >= -0.08) & (df['blue_dif'] <= 0.08)]['blue_dif']
green_original = df[(df['green_dif'] >= -0.08) & (df['green_dif'] <= 0.08)]['green_dif']
red_original = df[(df['red_dif'] >= -0.08) & (df['red_dif'] <= 0.08)]['red_dif']
inf_original = df[(df['inf_dif'] >= -0.1) & (df['inf_dif'] <= 0.1)]['inf_dif']
mask = ((df['days'] >= 0) & (df['days'] <= 45)) | ((df['days'] >= 320) & (df['days'] <= 365))
range_min = -0.03
range_max = 0.03
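# For rows selected by mask, keep in-range values; randomly replace out-of-range values with NaN
df.loc[mask, 'blue_dif'] = df.loc[mask, 'blue_dif'].apply(lambda x: x if range_min <= x <= range_max else np.random.choice([np.nan, x]))
df.loc[mask, 'green_dif'] = df.loc[mask, 'green_dif'].apply(lambda x: x if range_min <= x <= range_max else np.random.choice([np.nan, x]))
df.loc[mask, 'red_dif'] = df.loc[mask, 'red_dif'].apply(lambda x: x if range_min <= x <= range_max else np.random.choice([np.nan, x]))
df.loc[mask, 'inf_dif'] = df.loc[mask, 'inf_dif'].apply(lambda x: x if range_min <= x <= range_max else np.random.choice([np.nan, x], p =[0.9, 0.1]))
# Drop every row that now contains a NaN
df = df.dropna()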
blue_new = df[(df['blue_dif'] >= -0.08) & (df['blue_dif'] <= 0.08)]['blue_dif']
green_new = df[(df['green_dif'] >= -0.08) & (df['green_dif'] <= 0.08)]['green_dif']
red_new = df[(df['red_dif'] >= -0.08) & (df['red_dif'] <= 0.08)]['red_dif']
inf_new = df[(df['inf_dif'] >= -0.1) & (df['inf_dif'] <= 0.1)]['inf_dif']
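# Figures 0-3: distributions before filtering; figures 4-7: after filtering
plt.figure(0)
plt.hist(blue_original, bins = 50)
plt.figure(1)
plt.hist(green_original, bins = 50)
plt.figure(2)
plt.hist(red_original, bins = 50)
plt.figure(3)
plt.hist(inf_original, bins = 50)
plt.figure(4)
plt.hist(blue_new, bins = 50)
plt.figure(5)
plt.hist(green_new, bins = 50)
plt.figure(6)
plt.hist(red_new, bins = 50)
plt.figure(7)
plt.hist(inf_new, bins = 50)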
df.to_csv(result_file_path, index=False)
First, we use the pd.read_csv function to read the data from the .csv file at the specified path and store it in a DataFrame named df.
Next, a series of conditional filters selects subsets of the raw data that satisfy specific conditions. Specifically, for each of the four columns blue_dif, green_dif, red_dif, and inf_dif, we select the values that lie within a certain range (-0.08 to 0.08 for the first three columns, -0.1 to 0.1 for inf_dif) and store them in new Series named blue_original, green_original, red_original, and inf_original. These Series are what we use to plot the histograms later.
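As a small illustration of this kind of range filter, the sketch below applies the same Boolean condition to a made-up column and compares it with Series.between, an equivalent pandas shorthand. The sample values and the names df_demo, blue_kept, and blue_kept_between are hypothetical and are not taken from the article's data.

```python
import pandas as pd

# Hypothetical sample values; the article's real data comes from the .csv file.
df_demo = pd.DataFrame({"blue_dif": [-0.20, -0.08, 0.00, 0.05, 0.08, 0.30]})

# Same pattern as in the article: keep values within [-0.08, 0.08], both ends included.
blue_kept = df_demo[(df_demo["blue_dif"] >= -0.08) & (df_demo["blue_dif"] <= 0.08)]["blue_dif"]

# Equivalent shorthand: Series.between is inclusive on both ends by default.
blue_kept_between = df_demo.loc[df_demo["blue_dif"].between(-0.08, 0.08), "blue_dif"]

print(blue_kept.tolist())
```

Both forms keep the boundary values themselves, so which one to use is mostly a matter of readability.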
Then we create a Boolean mask named mask, which is used to select the rows that satisfy the condition. Here, it selects the rows whose days value is between 0 and 45 or between 320 and 365.
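To see how such a mask behaves at the boundaries, here is a minimal sketch on made-up days values. The names df_demo, mask_demo, and selected_days are hypothetical.

```python
import pandas as pd

# Hypothetical day values, chosen to sit on and around the range boundaries.
df_demo = pd.DataFrame({"days": [0, 45, 46, 200, 319, 320, 365]})

# Each comparison is wrapped in parentheses because & and | bind more
# tightly than >= and <= in Python.
mask_demo = ((df_demo["days"] >= 0) & (df_demo["days"] <= 45)) | (
    (df_demo["days"] >= 320) & (df_demo["days"] <= 365)
)

selected_days = df_demo.loc[mask_demo, "days"].tolist()
print(selected_days)
```

The mask is a Boolean Series with the same index as the DataFrame, so it can be passed directly to df.loc to read or assign only the selected rows.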
Subsequently, we use the apply function with a lambda expression. For the rows whose days value is between 0 and 45 or between 320 and 365, if the value in the blue_dif, green_dif, red_dif, or inf_dif column is not within the specified range (-0.03 to 0.03 here), that value is randomly replaced with NaN. The argument p =[0.9, 0.1] specifies the probabilities of the random choice: NaN is chosen with probability 0.9 and the original value is kept with probability 0.1. Note that in the code above only the inf_dif line passes p; if we do not give such a probability distribution, the program chooses between the two options uniformly, that is, with equal probability.
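The effect of p can be checked with a short experiment. The sketch below wraps the lambda's logic in a helper named blank_if_outside, which is a hypothetical name introduced only for illustration; the fixed seed is also an assumption, since the article's code does not set one.

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed so the sketch is repeatable

range_min, range_max = -0.03, 0.03

def blank_if_outside(x, p_nan=0.5):
    """Hypothetical helper: keep x if it is in range, otherwise
    replace it with NaN with probability p_nan."""
    if range_min <= x <= range_max:
        return x
    return np.random.choice([np.nan, x], p=[p_nan, 1.0 - p_nan])

in_range = blank_if_outside(0.01)               # in range: always kept
always_nan = blank_if_outside(0.2, p_nan=1.0)   # out of range, certain replacement
never_nan = blank_if_outside(0.2, p_nan=0.0)    # out of range, never replaced

# Share of NaN results over many draws with the article's p = [0.9, 0.1]
draws = np.array([blank_if_outside(0.2, p_nan=0.9) for _ in range(10000)])
nan_share = np.isnan(draws).mean()
print(nan_share)
```

Over many draws the share of NaN results should be close to 0.9, while the default p_nan=0.5 corresponds to calling np.random.choice without p.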
Finally, we use the dropna function to delete all rows that contain NaN values, which gives us the filtered data. We then compute the subsets of the processed data again, using the same filter conditions on these four columns, and store them in blue_new, green_new, red_new, and inf_new. Immediately afterward, Matplotlib is used to create histograms that visualize the distributions of the original and the processed data; these histograms are drawn in 8 separate figures.
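One detail of dropna is worth a quick check: called without arguments, it removes a row if any column contains a missing value, including columns unrelated to the filtering. The sketch below uses made-up data (the note column and all names are hypothetical) and also shows the subset parameter as a variant that the article's code does not use.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "note" is an unrelated column that already has a missing value.
df_demo = pd.DataFrame({
    "blue_dif": [0.01, np.nan, 0.02, 0.00],
    "inf_dif": [0.02, 0.01, np.nan, 0.03],
    "note": ["a", "b", "c", None],
})

# As in the article: drop every row that has a missing value in any column.
dropped_any = df_demo.dropna()

# Variant (not used in the article): only consider the listed columns.
dropped_subset = df_demo.dropna(subset=["blue_dif", "inf_dif"])

print(len(dropped_any), len(dropped_subset))
```

If the input file already contains missing values in other columns, the subset form may be the safer choice, because it deletes only the rows that the random replacement step marked.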
The code ends by saving the processed data as a new .csv file, whose path is specified by result_file_path.
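The role of index=False in the export can be seen without touching the disk. The following sketch writes a made-up DataFrame to an in-memory buffer instead of a file path; the data and the names df_demo, csv_text, with_index, and reloaded are hypothetical.

```python
import io
import pandas as pd

# Hypothetical data standing in for the processed DataFrame.
df_demo = pd.DataFrame({"days": [10, 330], "blue_dif": [0.01, -0.02]})

# Write to an in-memory text buffer rather than to result_file_path.
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Without a path, to_csv returns the text; by default the row index is written too.
with_index = df_demo.to_csv()

# Reading the text back gives the same columns and number of rows.
reloaded = pd.read_csv(io.StringIO(csv_text))
print(csv_text)
```

With index=False the file starts directly with the column names; without it, an extra unnamed first column holding the row index is written, which would then show up as an additional column when the file is read again.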
Running the above code produces 8 histograms, as shown in the following figure, and the result file appears in the specified folder.
At this point, the job is done.