This article describes how to use Python to batch-process a large number of Excel-style table files in a folder: for each file, the rows we need are first filtered out based on the characteristics of one data column, row-by-row differences are then computed on the filtered data, and finally the result is merged, file by file, with the same large number of same-named table files in several other folders.
First, let us clarify the specific requirements of this article. There is an existing folder containing a large number of Excel-style table files (in this article we take .csv files as the example), and each file name indicates the ID of the data source point corresponding to that file, as shown in the figure below.
Each of these table files has the data format shown below, where the first column is time data expressed as day of year (DOY), and the time span between consecutive rows is 8 days.
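To make the day encoding concrete, here is a minimal sketch (the column names and values are illustrative, not taken from the real data) of what such a table looks like in pandas:

```python
import pandas as pd

# Toy table in the same layout: DOY encodes year * 1000 + day-of-year,
# and consecutive rows are 8 days apart.
df = pd.DataFrame({
    "DOY":  [2022001, 2022009, 2022017],
    "blue": [0.10, 0.12, 0.15],
})

# The day-of-year part alone can be recovered with a modulo operation.
days = df["DOY"] % 1000  # 1, 9, 17
```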
What we want to accomplish is the following. First, for each file in this folder, keep only the rows from day 2022001 (that is, day 1 of 2022) onward. Next, compute the row-by-row difference of each column of the filtered data (except the first column, since the first column holds the time data): for example, the 2022009 row minus the 2022001 row, then the 2022017 row minus the 2022009 row, and so on, appending the differences as new columns after the original columns. We would also like to extract some key information from the current file name and from the first (day) column and append it as new columns (I am producing training data for a deep neural network regression here, so the various types of data need to be combined). In addition, we have 2 other folders with the same large number of files, the same file naming convention, and the same data format, and we would like to merge, for each file in the current folder, the data for the same day from the same-named files in these 2 folders.
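The first two steps described above (filtering by day, then differencing) can be sketched on a toy version of one point file; the values here are illustrative:

```python
import pandas as pd

# Toy version of one point file: a DOY time column plus one data column.
df = pd.DataFrame({
    "DOY": [2021361, 2022001, 2022009, 2022017],
    "red": [0.28, 0.30, 0.34, 0.31],
})

# Step 1: keep only rows from day 2022001 onward.
filter_df = df[df["DOY"] >= 2022001].reset_index(drop=True)

# Step 2: row-by-row difference, e.g. the 2022009 row minus the 2022001 row.
# The first row has no predecessor, so its difference is NaN.
filter_df["red_dif"] = filter_df["red"].diff()
```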
With the requirements understood, we can start writing the code. The code used in this article is shown below.
# -*- coding: utf-8 -*-
"""
Created on Thu May 18 11:36:41 2023

@author: fkxxgis
"""

import os
import numpy as np
import pandas as pd

original_path = "E:/01_Reflectivity/99_Model_Training/00_Data/02_Extract_Data/17_HANTS"
era5_path = "E:/01_Reflectivity/99_Model_Training/00_Data/03_Meteorological_Data/02_AllERA5"
history_path = "E:/01_Reflectivity/99_Model_Training/00_Data/02_Extract_Data/18_AllYearAverage_2"
output_path = "E:/01_Reflectivity/99_Model_Training/00_Data/02_Extract_Data/19_2022Data"

era5_files = os.listdir(era5_path)
history_files = os.listdir(history_path)

for file in os.listdir(original_path):
    file_path = os.path.join(original_path, file)

    if file.endswith(".csv") and os.path.isfile(file_path):
        point_id = file[4 : -4]

        df = pd.read_csv(file_path)
        # Keep rows from day 2022001 onward and compute row-by-row differences.
        filter_df = df[df["DOY"] >= 2022001]
        filter_df = filter_df.reset_index(drop = True)
        filter_df["blue_dif"] = filter_df["blue"].diff()
        filter_df["green_dif"] = filter_df["green"].diff()
        filter_df["red_dif"] = filter_df["red"].diff()
        filter_df["inf_dif"] = filter_df["inf"].diff()
        filter_df["si1_dif"] = filter_df["si1"].diff()
        filter_df["si2_dif"] = filter_df["si2"].diff()
        filter_df["NDVI_dif"] = filter_df["NDVI"].diff()
        filter_df["PointType"] = file[4 : 7]
        filter_df["days"] = filter_df["DOY"] % 1000

        # Attach averaged ERA5 meteorological data for each day.
        for era5_file in era5_files:
            if point_id in era5_file:
                era5_df = pd.read_csv(os.path.join(era5_path, era5_file))

                rows_num = filter_df.shape[0]
                for i in range(rows_num):
                    day = filter_df.iloc[i, 0]
                    row_need_index = era5_df.index[era5_df.iloc[ : , 1] == day]
                    row_need = row_need_index[0]
                    sola_data_all = era5_df.iloc[row_need - 2 : row_need, 2]
                    temp_data_all = era5_df.iloc[row_need - 6 : row_need - 2, 3]
                    prec_data_all = era5_df.iloc[row_need - 5 : row_need - 1, 4]
                    soil_data_all = era5_df.iloc[row_need - 6 : row_need - 2, 5 : 7 + 1]
                    sola_data = np.mean(sola_data_all.values)
                    temp_data = np.mean(temp_data_all.values)
                    prec_data = np.mean(prec_data_all.values)
                    soil_data = np.mean(soil_data_all.values)
                    filter_df.loc[i, "sola"] = sola_data
                    filter_df.loc[i, "temp"] = temp_data
                    filter_df.loc[i, "prec"] = prec_data
                    filter_df.loc[i, "soil"] = soil_data
                break

        # Attach the matching point's historical averages.
        for history_file in history_files:
            if point_id in history_file:
                history_df = pd.read_csv(os.path.join(history_path, history_file)).iloc[ : , 1 : ]
                history_df.columns = ["blue_h", "green_h", "red_h", "inf_h", "si1_h", "si2_h", "ndvi_h"]
                break

        filter_df_new = pd.concat([filter_df, history_df], axis = 1)

        output_file = os.path.join(output_path, file)
        filter_df_new.to_csv(output_file, index = False)
The code first defines several folder paths: the raw data folder (that is, the folder shown in the first figure of this article), the ERA5 meteorological data folder, the historical data folder, and the output folder. Then, the os.listdir() function is used to get all file names in the ERA5 meteorological data folder and the historical data folder, for use in the subsequent loops.
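The directory-listing and path-joining steps can be sketched in isolation; here a throwaway folder with hypothetical file names stands in for the real data folders:

```python
import os
import tempfile

# Create a throwaway folder with two empty CSV files to mimic a data folder.
folder = tempfile.mkdtemp()
for name in ("Pnt_0001.csv", "Pnt_0002.csv"):  # hypothetical names
    open(os.path.join(folder, name), "w").close()

files = sorted(os.listdir(folder))  # names only, no directory part
# os.path.join prepends the folder so each name becomes a usable path.
paths = [os.path.join(folder, f) for f in files]
```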
Next is a for loop that iterates over all files in the raw data folder; if a file name ends with .csv and refers to a regular file, the file is processed. The point ID is extracted from the file name, and the file is read with pandas' read_csv() function. The data are then processed: rows whose DOY is greater than or equal to 2022001 are kept, the index is reset, and the differences of the reflectance columns are computed with diff(). Finally, some metadata are appended to the filtered data, including the point type and the day of year.
Next are two for loops, for processing the ERA5 meteorological data and the historical data respectively. When processing the ERA5 meteorological data, we first find the ERA5 file whose name matches the current point ID and read it with pandas' read_csv() function. Then, for each row of the filtered data, iloc[] is used to locate the row in the ERA5 data corresponding to the current date, and solar radiation, temperature, precipitation, and soil moisture values are extracted and averaged from a few rows preceding the matched row (the window differs per variable). Finally, these values are added to the filtered data as new columns.
When processing the historical data, we first find the historical data file whose name matches the current point ID and read it with pandas' read_csv() function. Then, iloc[] is used to drop the first column, and the remaining columns are renamed to blue_h, green_h, red_h, inf_h, si1_h, si2_h, and ndvi_h. Finally, pandas' concat() function combines the filtered data and the historical data into a new DataFrame.
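The column-wise combination works as follows; the two toy frames below are illustrative stand-ins for the filtered data and the renamed historical data:

```python
import pandas as pd

left = pd.DataFrame({"DOY": [2022001, 2022009], "red": [0.30, 0.34]})
right = pd.DataFrame({"red_h": [0.28, 0.33]})

# axis=1 places the frames side by side, aligning rows by index.
merged = pd.concat([left, right], axis=1)
```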
Finally, pandas' to_csv() function saves the new DataFrame to the output folder.
Running the above code gives us a large number of merged table files; each of them has the columns shown below, with the various types of information already combined.
This completes the production process of our neural network training dataset.
At this point, the job is done.