This article describes a Python-based method that reads data from an Excel spreadsheet file, copies every row that meets our specified criteria a specified number of times, leaves rows that do not meet the criteria uncopied, and saves the result as a new Excel spreadsheet file.
Note that our previous post, "Multiple copies of Excel conforming data rows: Python batch implementation", introduced another Python implementation of a similar requirement; you can refer to that article if needed. However, the code in that article relies on a method that has been removed from the latest version of the pandas library, so it may sometimes raise an error. The requirements in this article also go further than those in the earlier one, so this article should be your main reference.
First, let's clarify the specific requirements. We have an existing Excel spreadsheet file; in this article we use the .csv format as an example. As shown below, one column in this file (the inf_dif column) is particularly important, and we want to process the data based on it. For each row, if the value of this column falls within a specified range, the row is copied a specified number of times (copying means generating new rows whose data is identical to the current row). The number of copies for rows that meet our requirements is not fixed either; it also depends on the value of this column in that row. For example, if the value falls within one range, the row is copied 10 times; if it falls within another range, the row is copied 50 times, and so on.
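The rule described above can be sketched as a small plain-Python function. The function name copies_for and the two ranges below are hypothetical placeholders, chosen only to illustrate the 10-copy and 50-copy example; they are not the thresholds used later in this article. As in the article's code, the count is taken to mean the total number of times a row appears in the output.

```python
def copies_for(value):
    """Return how many times a row should appear in the output.

    The ranges are hypothetical and only illustrate the rule.
    """
    if 0.5 <= value < 1.0:  # assumed "first" value range
        return 10
    if value >= 1.0:        # assumed "second" value range
        return 50
    return 1                # outside every range: the row is kept once, not copied


values = [0.2, 0.7, 1.3]                  # toy stand-in for the inf_dif column
counts = [copies_for(v) for v in values]  # one count per row
total_rows = sum(counts)                  # number of rows in the result
```

With these assumed ranges, the three input rows become 61 output rows: the first row is kept once, the second appears 10 times, and the third appears 50 times.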
Knowing the requirements, we can start writing the code. The specific code used in this article is shown below.
# -*- coding: utf-8 -*-
"""
Created on Thu Jul 6 22:04:48 2023
@author: fkxxgis
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
original_file_path = "E:/01_Reflectivity/99_Model/02_Extract_Data/26_Train_Model_New/Train_Model_0715.csv"
result_file_path = "E:/01_Reflectivity/99_Model/02_Extract_Data/26_Train_Model_New/Train_Model_0715_Over_NIR_0717_2.csv"
df = pd.read_csv(original_file_path)
duplicated_num_0 = 70
duplicated_num_1 = 35
duplicated_num_2 = 7
duplicated_num_3 = 2
num = [duplicated_num_0 if (value <= -0.12 or value >= 0.12) else duplicated_num_1 if (value <= -0.1 or value >= 0.1) \
else duplicated_num_2 if (value <= -0.07 or value >= 0.07) else duplicated_num_3 if (value <= -0.03 or value >= 0.03) \
else 1 for value in df.inf_dif]
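# Repeat each index label as many times as its entry in num, then select those rows
duplicated_df = df.loc[np.repeat(df.index.values, num)]
# Histogram of inf_dif before duplication
plt.figure(0)
plt.hist(df["inf_dif"], bins = 50)
# Histogram of inf_dif after duplication
plt.figure(1)
plt.hist(duplicated_df["inf_dif"], bins = 50)
# Save the duplicated data without writing the index column
duplicated_df.to_csv(result_file_path, index=False)
plt.show()  # display both figures when running as a plain script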
The meaning of each part of the code above is as follows.
First, we import the required libraries, including numpy, pandas, and matplotlib, for the subsequent data processing and plotting. Next, we read the raw data: the pd.read_csv() function reads the file and stores it in a DataFrame object named df. The path to the original file is specified by the original_file_path variable.
Afterwards, we set the number of repetitions. For each row, the number of repetitions is chosen from the value in the inf_dif column and stored in the num list. A chained conditional expression (if-else) assigns a different count to each value range: 70 when the absolute value is at least 0.12, 35 when it is at least 0.1, 7 when it is at least 0.07, 2 when it is at least 0.03, and 1 otherwise (the row is kept once and not copied).
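As an optional alternative to the chained conditional expression, the same thresholds can be written in vectorized form. The sketch below uses np.select on toy data; only the column name inf_dif and the thresholds come from the article, while the sample values and the names abs_dif, num_vectorized, and num_loop are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real file (values are made up).
df = pd.DataFrame({"inf_dif": [-0.15, 0.11, -0.08, 0.05, 0.0, 0.12, -0.03]})

# The thresholds are symmetric around zero, so compare absolute values.
abs_dif = df["inf_dif"].abs()

# np.select checks the conditions in order and the first match wins,
# which mirrors the chained if-else in the list comprehension.
conditions = [abs_dif >= 0.12, abs_dif >= 0.1, abs_dif >= 0.07, abs_dif >= 0.03]
choices = [70, 35, 7, 2]
num_vectorized = np.select(conditions, choices, default=1)

# The same rule written as the article's chained conditional expression.
num_loop = [70 if (v <= -0.12 or v >= 0.12) else 35 if (v <= -0.1 or v >= 0.1)
            else 7 if (v <= -0.07 or v >= 0.07) else 2 if (v <= -0.03 or v >= 0.03)
            else 1 for v in df.inf_dif]
```

Both forms give the same counts on this data. The vectorized form keeps the thresholds and counts in two short lists, which may be easier to adjust when the ranges change.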
Next, we use the loc indexer together with the np.repeat() function to duplicate each row according to its number of repetitions, and store the result in duplicated_df.
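To make the duplication step concrete, here is a minimal sketch on a three-row toy DataFrame. The id column, the sample values, and the small counts are assumptions chosen for readability. The sketch also assumes the index labels of df are unique, which is the case for the default index produced by pd.read_csv.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": ["a", "b", "c"], "inf_dif": [0.13, 0.05, 0.0]})
num = [3, 2, 1]  # hypothetical counts, kept small for readability

# np.repeat turns the index labels [0, 1, 2] into [0, 0, 0, 1, 1, 2].
repeated_labels = np.repeat(df.index.values, num)

# .loc with a sequence of (possibly repeated) labels returns one row per label.
duplicated_df = df.loc[repeated_labels]

# Optional: the copies keep their original index labels,
# so reset the index if unique row labels are needed afterwards.
duplicated_df = duplicated_df.reset_index(drop=True)
```

The result has sum(num) rows. A count of 1 leaves a row as it is, so rows outside every range pass through unchanged.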
Finally, to compare the data before and after duplication, we can plot histograms. Here we use the hist() function of the matplotlib library to plot two histograms: the first is the histogram of the inf_dif column in the original dataset df, and the second is the histogram of the inf_dif column in the duplicated dataset duplicated_df. The bins parameter splits the data into 50 intervals.
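If a numerical check is preferred over a visual one, the bin counts can also be compared directly. The sketch below uses np.histogram with a shared set of bin edges on toy data. The sample values, the counts in num, and the four hand-picked bins are assumptions made for illustration, not the 50 bins used in the article.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"inf_dif": [-0.15, -0.02, 0.0, 0.01, 0.13]})
num = [3, 1, 1, 1, 3]  # hypothetical counts: only the two extreme rows are copied
duplicated_df = df.loc[np.repeat(df.index.values, num)]

# Use the same bin edges for both datasets so the counts are directly comparable.
edges = [-0.2, -0.1, 0.0, 0.1, 0.2]
before, _ = np.histogram(df["inf_dif"], bins=edges)
after, _ = np.histogram(duplicated_df["inf_dif"], bins=edges)
```

Here before is [1, 1, 2, 1] and after is [3, 1, 2, 3]: the outer bins grow while the central bins stay the same. Duplication only repeats existing values, so the minimum and maximum of the column do not change.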
After completing the above, we can save the data. The duplicated dataset duplicated_df is saved as a .csv file, whose path is specified by the result_file_path variable.
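The effect of index=False when saving can be checked with a small round trip. This sketch writes to an in-memory buffer instead of a real file; the buffer is a stand-in for result_file_path, and the sample values are made up.

```python
import io

import pandas as pd

# A duplicated frame keeps repeated index labels (here 0, 0, 1).
duplicated_df = pd.DataFrame({"inf_dif": [0.125, 0.125, 0.5]}, index=[0, 0, 1])

buffer = io.StringIO()                     # in-memory stand-in for result_file_path
duplicated_df.to_csv(buffer, index=False)  # index=False leaves the index labels out

buffer.seek(0)                             # rewind before reading the text back
reloaded = pd.read_csv(buffer)
```

Because the index is not written, the saved file contains only the data columns, and reading it back yields a fresh 0, 1, 2 index.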
Executing the above code produces two histograms, as shown below. The first is the histogram of the inf_dif column in the original dataset df, that is, before any data duplication has been performed.
The second is the histogram of the inf_dif column in the duplicated dataset duplicated_df.
As you can see, the distribution of the data changes significantly after running the code above.
At this point, the job is done.