Filtering and removing table data by condition in Python, with histograms before and after removal

This article describes how to use Python to read data from an Excel spreadsheet file, take the values in one column as the baseline, and, for all rows whose baseline value falls within a specified range, filter out and remove data according to the values in several other columns. It also plots histograms of the data before and after filtering, and saves the resulting data as a new Excel spreadsheet file.

First, let's clarify the requirements. We have an existing Excel spreadsheet file; in this article we use the .csv format as an example. As shown below, the file contains a column (the days column in this article) that we use as the baseline data. We first want to select all samples (each row is one sample) whose days value lies in the range 0 to 45 or 320 to 365, and then operate on them.

Second, for the selected samples, we look at 4 other columns (blue_dif, green_dif, red_dif, and inf_dif in this article) and delete the rows whose values in these 4 columns are not within a specified range. Along the way, we also want to plot a histogram of each of these 4 columns before and after the deletion, 8 figures in total. Finally, we want to save the data that remains after the deletion as a new Excel spreadsheet file.

With the requirements clear, we can write the code. The code used in this article is as follows.

# -*- coding: utf-8 -*-
"""
Created on Tue Sep 12 07:55:40 2023

@author: fkxxgis
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

original_file_path = "E:/01_Reflectivity/99_Model/02_Extract_Data/26_Train_Model_New/Train_Model_0715_Main_Over_NIR.csv"
# original_file_path = "E:/01_Reflectivity/99_Model/02_Extract_Data/26_Train_Model_New/"
result_file_path = "E:/01_Reflectivity/99_Model/02_Extract_Data/26_Train_Model_New/Train_Model_0715_Main_Over_NIR_New.csv"

df = pd.read_csv(original_file_path)

blue_original = df[(df['blue_dif'] >= -0.08) & (df['blue_dif'] <= 0.08)]['blue_dif']
green_original = df[(df['green_dif'] >= -0.08) & (df['green_dif'] <= 0.08)]['green_dif']
red_original = df[(df['red_dif'] >= -0.08) & (df['red_dif'] <= 0.08)]['red_dif']
inf_original = df[(df['inf_dif'] >= -0.1) & (df['inf_dif'] <= 0.1)]['inf_dif']

mask = ((df['days'] >= 0) & (df['days'] <= 45)) | ((df['days'] >= 320) & (df['days'] <= 365))
range_min = -0.03
range_max = 0.03

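# For rows selected by mask: keep in-range values, otherwise randomly pick NaN or the original value
df.loc[mask, 'blue_dif'] = df.loc[mask, 'blue_dif'].apply(lambda x: x if range_min <= x <= range_max else np.random.choice([np.nan, x]))
df.loc[mask, 'green_dif'] = df.loc[mask, 'green_dif'].apply(lambda x: x if range_min <= x <= range_max else np.random.choice([np.nan, x]))
df.loc[mask, 'red_dif'] = df.loc[mask, 'red_dif'].apply(lambda x: x if range_min <= x <= range_max else np.random.choice([np.nan, x]))
df.loc[mask, 'inf_dif'] = df.loc[mask, 'inf_dif'].apply(lambda x: x if range_min <= x <= range_max else np.random.choice([np.nan, x], p =[0.9, 0.1]))
# Drop every row that now contains a NaN
df = df.dropna()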

blue_new = df[(df['blue_dif'] >= -0.08) & (df['blue_dif'] <= 0.08)]['blue_dif']
green_new = df[(df['green_dif'] >= -0.08) & (df['green_dif'] <= 0.08)]['green_dif']
red_new = df[(df['red_dif'] >= -0.08) & (df['red_dif'] <= 0.08)]['red_dif']
inf_new = df[(df['inf_dif'] >= -0.1) & (df['inf_dif'] <= 0.1)]['inf_dif']

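plt.figure(0)
plt.hist(blue_original, bins = 50)
plt.figure(1)
plt.hist(green_original, bins = 50)
plt.figure(2)
plt.hist(red_original, bins = 50)
plt.figure(3)
plt.hist(inf_original, bins = 50)

plt.figure(4)
plt.hist(blue_new, bins = 50)
plt.figure(5)
plt.hist(green_new, bins = 50)
plt.figure(6)
plt.hist(red_new, bins = 50)
plt.figure(7)
plt.hist(inf_new, bins = 50)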

df.to_csv(result_file_path, index=False)

First, we use the pd.read_csv function to read the data from the .csv file at the specified path and store it in a DataFrame named df.

Next, a series of conditional filters selects the subsets of the original data that satisfy specific conditions. Specifically, for each of the 4 columns blue_dif, green_dif, red_dif, and inf_dif, we select the values that lie within a given range (-0.08 to 0.08 for the first three columns, -0.1 to 0.1 for inf_dif) and store them in new Series named blue_original, green_original, red_original, and inf_original. These Series are used later to plot the histograms.
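As a small illustration of this kind of range filter, the sketch below applies the same comparison to a made-up column and compares it with the pandas method Series.between. The DataFrame toy and its values are hypothetical and are not taken from the article's file.

```python
import pandas as pd

# Hypothetical toy data standing in for one column of the real file.
toy = pd.DataFrame({"blue_dif": [-0.20, -0.05, 0.00, 0.07, 0.08, 0.15]})

# Explicit comparison form, as used in the article's code.
explicit = toy[(toy["blue_dif"] >= -0.08) & (toy["blue_dif"] <= 0.08)]["blue_dif"]

# Equivalent form with Series.between, which includes both endpoints by default.
compact = toy.loc[toy["blue_dif"].between(-0.08, 0.08), "blue_dif"]

print(explicit.tolist())
print(compact.equals(explicit))
```

Both forms return a Series that keeps the original row index, so the selected values can still be traced back to their rows.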

Next, we create a Boolean mask named mask that selects the rows satisfying the condition. Here, it selects the rows whose days value is between 0 and 45 or between 320 and 365.
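To see what such a mask looks like, here is a minimal sketch on a made-up days column (the values are hypothetical). Each comparison needs its own parentheses, because in Python the operators & and | bind more tightly than >= and <=.

```python
import pandas as pd

# Hypothetical day-of-year values, including the boundary cases.
toy = pd.DataFrame({"days": [0, 30, 45, 46, 200, 319, 320, 365]})

# True where days is in [0, 45] or in [320, 365]; both ranges include their endpoints.
mask = ((toy["days"] >= 0) & (toy["days"] <= 45)) | ((toy["days"] >= 320) & (toy["days"] <= 365))

print(mask.tolist())
print(toy.loc[mask, "days"].tolist())
```

The mask is a Boolean Series aligned with the rows of the DataFrame, which is why it can be passed directly to loc to select or assign to those rows.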

Then, using the apply function and a lambda expression, we process the rows whose days value is between 0 and 45 or between 320 and 365: if a row's value in one of the 4 columns blue_dif, green_dif, red_dif, or inf_dif is not within the specified range (-0.03 to 0.03 here), that value is randomly replaced with NaN. The argument p =[0.9, 0.1], which this code passes only for inf_dif, sets the probability of replacement with NaN to 0.9. Note that if no such probability distribution is given, as for the other three columns, the program chooses uniformly, so an out-of-range value is replaced with NaN with probability 0.5.
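The effect of the probability argument can be checked on synthetic data. The sketch below assumes the random step is NumPy's np.random.choice, whose p parameter takes one probability per candidate; the helper maybe_drop, its p_nan parameter, and the test values are hypothetical names introduced only for this illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed only so that the sketch is repeatable

range_min, range_max = -0.03, 0.03

def maybe_drop(x, p_nan=0.5):
    """Keep x if it is in range; otherwise return NaN with probability p_nan."""
    if range_min <= x <= range_max:
        return x
    return np.random.choice([np.nan, x], p=[p_nan, 1 - p_nan])

# Hypothetical data: 1000 in-range values followed by 1000 out-of-range values.
values = pd.Series([0.01] * 1000 + [0.05] * 1000)
result = values.apply(lambda x: maybe_drop(x, p_nan=0.9))

kept_in_range = result[:1000].notna().all()
nan_fraction = result[1000:].isna().mean()
print(kept_in_range, nan_fraction)
```

In-range values are never touched, while roughly nine out of ten out-of-range values become NaN. With the default p_nan=0.5 the choice is uniform, which matches the behavior when p is omitted.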

After that, we use the dropna function to delete the rows that contain NaN values, which gives the filtered data. We then compute, with the same filter conditions on these four columns, the subsets of the processed data and store them in blue_new, green_new, red_new, and inf_new. Immediately afterward, Matplotlib is used to create histograms that visualize the distributions of the original and processed data; the histograms are drawn in 8 separate figures.
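One detail worth keeping in mind is that dropna, when called without arguments, removes a row if any of its columns is missing, not only the four columns processed above. The sketch below shows the difference on made-up data; the DataFrame toy and its note column are hypothetical, and the subset parameter is shown as a possible alternative, not as what the article's code does.

```python
import numpy as np
import pandas as pd

# Hypothetical data: NaN in the processed columns plus a missing value elsewhere.
toy = pd.DataFrame({
    "blue_dif": [0.01, np.nan, 0.02, 0.00],
    "red_dif": [0.02, 0.01, np.nan, 0.01],
    "note": ["a", "b", "c", None],  # unrelated column with one missing entry
})

# Default behavior: drop a row if ANY column contains a missing value.
all_cols = toy.dropna()

# Alternative: only consider the listed columns when deciding which rows to drop.
only_dif = toy.dropna(subset=["blue_dif", "red_dif"])

print(len(all_cols), len(only_dif))
```

If the source file already contains missing values in other columns, the default call will remove those rows as well, so the row count after filtering may drop by more than the random replacement alone would explain.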

The code ends by saving the processed data as a new .csv file at the path given by result_file_path.

Running the above code produces 8 histograms, as shown in the following figure, and the result file appears in the specified folder.
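When comparing the before and after histograms, it can help to draw each pair with identical bin edges, since bins = 50 alone lets Matplotlib choose the edges from each data set separately. The following is a minimal sketch of that idea on synthetic data, not the article's code; the arrays before and after, the output file name, and the Agg backend choice are all assumptions made for the illustration.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic stand-ins for one column before and after removal.
before = rng.normal(0.0, 0.03, size=2000)
after = before[np.abs(before) <= 0.03]

# 51 fixed edges give 50 bins that are identical in both panels.
bins = np.linspace(-0.08, 0.08, 51)

fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
axes[0].hist(before, bins=bins)
axes[0].set_title("blue_dif before")
axes[1].hist(after, bins=bins)
axes[1].set_title("blue_dif after")
fig.tight_layout()
fig.savefig("blue_dif_before_after.png")  # hypothetical output file name
```

With shared edges and a shared y-axis, the bars in the two panels can be compared directly, bin by bin.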

At this point, the job is done.