This article describes how to use Python to batch-process a large number of Excel-style table files in a folder: for each file, the rows we need are first filtered out based on the characteristics of one data column, row-by-row differences are then computed on the filtered data, and finally the result is merged, file by file, with the same large number of same-named table files in several other folders.
First, let us clarify the specific requirements of this article. There is an existing folder containing a large number of Excel-style table files (in this article we take .csv files as the example), and each file name indicates the ID of the data source point corresponding to that file, as shown in the figure below.
Each of these table files has the data format shown below, where the first column is time data expressed as day of year (DOY), and the time span between consecutive rows is 8 days.
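To make the day encoding concrete, here is a minimal sketch (the column names and values are illustrative, not taken from the real data) of what such a table looks like in pandas:

```python
import pandas as pd

# Toy table in the same layout: DOY encodes year * 1000 + day-of-year,
# and consecutive rows are 8 days apart.
df = pd.DataFrame({
    "DOY":  [2022001, 2022009, 2022017],
    "blue": [0.10, 0.12, 0.15],
})

# The day-of-year part alone can be recovered with a modulo operation.
days = df["DOY"] % 1000  # 1, 9, 17
```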
What we want to accomplish is the following. First, for each file in this folder, keep only the rows from day 2022001 (that is, day 1 of 2022) onward. Next, compute the row-by-row difference of each column of the filtered data (except the first column, since the first column holds the time data): for example, the 2022009 row minus the 2022001 row, then the 2022017 row minus the 2022009 row, and so on, appending the differences as new columns after the original columns. We would also like to extract some key information from the current file name and from the first (day) column and append it as new columns (I am producing training data for a deep neural network regression here, so the various types of data need to be combined). In addition, we have 2 other folders with the same large number of files, the same file naming convention, and the same data format, and we would like to merge, for each file in the current folder, the data for the same day from the same-named files in these 2 folders.
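The first two steps described above (filtering by day, then differencing) can be sketched on a toy version of one point file; the values here are illustrative:

```python
import pandas as pd

# Toy version of one point file: a DOY time column plus one data column.
df = pd.DataFrame({
    "DOY": [2021361, 2022001, 2022009, 2022017],
    "red": [0.28, 0.30, 0.34, 0.31],
})

# Step 1: keep only rows from day 2022001 onward.
filter_df = df[df["DOY"] >= 2022001].reset_index(drop=True)

# Step 2: row-by-row difference, e.g. the 2022009 row minus the 2022001 row.
# The first row has no predecessor, so its difference is NaN.
filter_df["red_dif"] = filter_df["red"].diff()
```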
With the requirements understood, we can start writing the code. The code used in this article is shown below.
# -*- coding: utf-8 -*-
"""
Created on Thu May 18 11:36:41 2023

@author: fkxxgis
"""

import os
import numpy as np
import pandas as pd

original_path = "E:/01_Reflectivity/99_Model_Training/00_Data/02_Extract_Data/17_HANTS"
era5_path = "E:/01_Reflectivity/99_Model_Training/00_Data/03_Meteorological_Data/02_AllERA5"
history_path = "E:/01_Reflectivity/99_Model_Training/00_Data/02_Extract_Data/18_AllYearAverage_2"
output_path = "E:/01_Reflectivity/99_Model_Training/00_Data/02_Extract_Data/19_2022Data"

era5_files = os.listdir(era5_path)
history_files = os.listdir(history_path)

for file in os.listdir(original_path):
    file_path = os.path.join(original_path, file)

    if file.endswith(".csv") and os.path.isfile(file_path):
        point_id = file[4 : -4]

        df = pd.read_csv(file_path)
        # Keep rows from day 2022001 onward and compute row-by-row differences.
        filter_df = df[df["DOY"] >= 2022001]
        filter_df = filter_df.reset_index(drop = True)
        filter_df["blue_dif"] = filter_df["blue"].diff()
        filter_df["green_dif"] = filter_df["green"].diff()
        filter_df["red_dif"] = filter_df["red"].diff()
        filter_df["inf_dif"] = filter_df["inf"].diff()
        filter_df["si1_dif"] = filter_df["si1"].diff()
        filter_df["si2_dif"] = filter_df["si2"].diff()
        filter_df["NDVI_dif"] = filter_df["NDVI"].diff()
        filter_df["PointType"] = file[4 : 7]
        filter_df["days"] = filter_df["DOY"] % 1000

        # Attach averaged ERA5 meteorological data for each day.
        for era5_file in era5_files:
            if point_id in era5_file:
                era5_df = pd.read_csv(os.path.join(era5_path, era5_file))

                rows_num = filter_df.shape[0]
                for i in range(rows_num):
                    day = filter_df.iloc[i, 0]
                    row_need_index = era5_df.index[era5_df.iloc[ : , 1] == day]
                    row_need = row_need_index[0]
                    sola_data_all = era5_df.iloc[row_need - 2 : row_need, 2]
                    temp_data_all = era5_df.iloc[row_need - 6 : row_need - 2, 3]
                    prec_data_all = era5_df.iloc[row_need - 5 : row_need - 1, 4]
                    soil_data_all = era5_df.iloc[row_need - 6 : row_need - 2, 5 : 7 + 1]
                    sola_data = np.mean(sola_data_all.values)
                    temp_data = np.mean(temp_data_all.values)
                    prec_data = np.mean(prec_data_all.values)
                    soil_data = np.mean(soil_data_all.values)
                    filter_df.loc[i, "sola"] = sola_data
                    filter_df.loc[i, "temp"] = temp_data
                    filter_df.loc[i, "prec"] = prec_data
                    filter_df.loc[i, "soil"] = soil_data
                break

        # Attach the matching point's historical averages.
        for history_file in history_files:
            if point_id in history_file:
                history_df = pd.read_csv(os.path.join(history_path, history_file)).iloc[ : , 1 : ]
                history_df.columns = ["blue_h", "green_h", "red_h", "inf_h", "si1_h", "si2_h", "ndvi_h"]
                break

        filter_df_new = pd.concat([filter_df, history_df], axis = 1)

        output_file = os.path.join(output_path, file)
        filter_df_new.to_csv(output_file, index = False)
The code first defines several folder paths: the raw data folder (that is, the folder shown in the first figure of this article), the ERA5 meteorological data folder, the historical data folder, and the output folder. Then, the os.listdir() function is used to get all file names in the ERA5 meteorological data folder and the historical data folder, for use in the subsequent loops.
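The directory-listing and path-joining steps can be sketched in isolation; here a throwaway folder with hypothetical file names stands in for the real data folders:

```python
import os
import tempfile

# Create a throwaway folder with two empty CSV files to mimic a data folder.
folder = tempfile.mkdtemp()
for name in ("Pnt_0001.csv", "Pnt_0002.csv"):  # hypothetical names
    open(os.path.join(folder, name), "w").close()

files = sorted(os.listdir(folder))  # names only, no directory part
# os.path.join prepends the folder so each name becomes a usable path.
paths = [os.path.join(folder, f) for f in files]
```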
Next is a for loop that iterates over all files in the raw data folder; if a file name ends with .csv and refers to a regular file, the file is processed. The point ID is extracted from the file name, and the file is read with pandas' read_csv() function. The data are then processed: rows whose DOY is greater than or equal to 2022001 are kept, the index is reset, and the differences of the reflectance columns are computed with diff(). Finally, some metadata are appended to the filtered data, including the point type and the day of year.
Next are two for loops, for processing the ERA5 meteorological data and the historical data respectively. When processing the ERA5 meteorological data, we first find the ERA5 file whose name matches the current point ID and read it with pandas' read_csv() function. Then, for each row of the filtered data, iloc[] is used to locate the row in the ERA5 data corresponding to the current date, and solar radiation, temperature, precipitation, and soil moisture values are extracted and averaged from a few rows preceding the matched row (the window differs per variable). Finally, these values are added to the filtered data as new columns.
When processing the historical data, we first find the historical data file whose name matches the current point ID and read it with pandas' read_csv() function. Then, iloc[] is used to drop the first column, and the remaining columns are renamed to blue_h, green_h, red_h, inf_h, si1_h, si2_h, and ndvi_h. Finally, pandas' concat() function combines the filtered data and the historical data into a new DataFrame.
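The column-wise combination works as follows; the two toy frames below are illustrative stand-ins for the filtered data and the renamed historical data:

```python
import pandas as pd

left = pd.DataFrame({"DOY": [2022001, 2022009], "red": [0.30, 0.34]})
right = pd.DataFrame({"red_h": [0.28, 0.33]})

# axis=1 places the frames side by side, aligning rows by index.
merged = pd.concat([left, right], axis=1)
```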
Finally, pandas' to_csv() function saves the new DataFrame to the output folder.
Running the above code gives us a large number of merged table files; each of them has the columns shown below, with the various types of information already combined.
This completes the production process of our neural network training dataset.
At this point, the job is done.