
Chinese keyword search analysis - export to CSV or Excel - multiple files or folders - DataFrame with Python, asyncio and pandas

  • Version 1.02

    • Instead of splicing the tab-separated results one by one into the output file, the results are now built up as a pandas DataFrame
    • Uses the asyncio library for coroutines, but in testing the speed seems about the same - the runs are probably too fast to measure the difference well (a minimal sketch of the coroutine pattern follows this item)
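    • A minimal sketch of that pattern, not the full script (the helper names search_file and write_output are made up for illustration): each file is searched in its own task, and a shared asyncio.Lock keeps the per-file DataFrame writes from interleaving.

      import asyncio
      import pandas as pd

      async def write_output(df: pd.DataFrame, out, lock: asyncio.Lock):
          # only one task appends to the shared output file at a time
          async with lock:
              df.to_csv(out, sep="\t", index=False, header=False)

      async def search_file(path: str, out, lock: asyncio.Lock):
          # pretend search result for one file; the real script builds this per line
          df = pd.DataFrame({"lineNo": [1], "lineContent": ["example hit in " + path]})
          await write_output(df, out, lock)

      async def main():
          lock = asyncio.Lock()
          with open("out.tsv", "w") as out:
              await asyncio.gather(*(search_file(p, out, lock) for p in ["a.txt", "b.txt"]))

      asyncio.run(main())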
  • The initial version of the code was written in Java and has now been rewritten in Python

    • The Java version uses CompletableFuture for asynchronous I/O, mainly for the file output, though the order of the files does not seem to change.
    • The Java version does not use any special classes or libraries; the results are printed either to the console or to a file
  • Functions of the code

    • Prints the keywords being searched for
    • Prints the matched file name, the line number, and the regular-expression hits (multiple hits go onto multiple rows), plus how many regular expressions hit on the same line and the original content of that line.
    • Multiple regular expressions can be searched at the same time
    • File paths and file names can also be matched or excluded by pattern
    • Results from the same file stay together, as do multiple hits on the same line
    • Results are written as a tab-separated file or an Excel spreadsheet, with each column holding the hits of one regular expression (a row-building sketch follows this list)
    • In the original Java version a parameter could be set to also return the next line whenever a line hit.
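    • A minimal sketch of how one line is turned into such a row (simplified; the helper name build_row and the example patterns are made up for illustration): each regular expression fills its own resultN column, and the hit numbers, hit keywords and the original line go at the end.

      import re
      import pandas as pd

      patterns = [r"https?://\S+", r"TODO"]  # example keyword regexes

      def build_row(line_no, line):
          # one output row per line that has at least one hit
          row = {"lineNo": line_no, "lineContent": line.rstrip("\n")}
          hit_nos, hit_kws = [], []
          for i, pat in enumerate(patterns, start=1):
              m = re.search(pat, line)
              row["result" + str(i)] = m.group() if m else ""
              if m:
                  hit_nos.append(str(i))
                  hit_kws.append(pat)
          row["hitNos"] = ";".join(hit_nos)
          row["hitKws"] = ";".join(hit_kws)
          return row if hit_nos else None

      rows = []
      with open("some_file.txt", encoding="utf-8") as f:
          for no, line in enumerate(f, start=1):
              r = build_row(no, line)
              if r:
                  rows.append(r)
      df = pd.DataFrame(rows)  # one column per regular expression, ready for to_csv / to_excel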
  • Differences between this code and my local copy

    • Change the paths to your own search path and output file, as in the placeholder snippet below.
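    • Placeholder example (the paths here are only placeholders; the variable names match the source below):

      basePaths = ['/path/to/your/search/folder/']            # folders to search
      filout = open('/path/to/your/output/result.txt', 'w')   # tab-separated text output
      file_path = '/path/to/your/output/result.xlsx'          # Excel output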
  • What's the point?

    • Search for multiple keywords at the same time and output the hit results into a table together with the original content of each row. The table can then be filtered and analyzed. This kind of customization is not available in existing IDE tools.
    • The search result itself becomes a view of the code or file contents that can be analyzed directly, without opening each file again.
  • Examples of practical use

    • An exam-cramming question bank: the answer explanations contain links to the official documentation. Copy everything out, search with a URL regular expression (https://...), and all links are extracted into one column in a single pass, ready to paste next to the questions. That makes it quick to look up the official link cited in each answer, and links written in the wrong place can also be found and fixed individually (see the URL-extraction sketch below).
    • The functionality of this script comes out of real work experience: pain points discovered bit by bit, then boiled down to the most basic and simple tool. It started as VBA, then Java, and now Python. Pulling out the official links as described above can indeed also be done with the search function of some IDEs, and the newest IDE versions have even more features, but this tool is something I worked out myself under daily pressure, and I am free to customize it. The functionality is very down-to-earth.
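    • A minimal sketch of that URL-extraction use case (standalone, not the full script; the file names are placeholders):

      import re
      import pandas as pd

      url_pattern = r"https?://[^\s，。]+"
      rows = []
      with open("question_bank.txt", encoding="utf-8") as f:
          for line_no, line in enumerate(f, start=1):
              for url in re.findall(url_pattern, line):
                  rows.append({"lineNo": line_no, "url": url, "lineContent": line.strip()})

      pd.DataFrame(rows).to_excel("urls.xlsx", index=False)  # one column of links, ready to paste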
  • Unsolved problems

    • If the code being searched assigns the value to a differently named variable, you have to search again, or add that variable name as another keyword (see the small example below). So either the code has to be more standardized, or the variable names have to be unified.
    • When searching Chinese text there is no tokenizer, so everything is matched character by character; to get complete results you have to search for the characters before and after the word, and with terms like [knowledge] and [recognition] the hit results will not be clean and direct. This has nothing to do with how AI handles word vectors; the original intent was simply to handle code.
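    • A tiny illustration of the variable-name limitation (the names totalPrice and orderSum are made up): a plain keyword search only finds the literal text, so an alias used elsewhere has to be added as its own keyword.

      searchKwsArr = [
          r"totalPrice",   # the keyword you care about
          r"orderSum",     # alias assigned elsewhere; must be added by hand or it is missed
      ]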
  • Reasons for not uploading to github and gitee

    • It is a basic text-processing tool, very practical, just one piece of code
    • It was written in a hurry and is easy to find again; it could be uploaded to GitHub together with a sample input file and the two kinds of output files, but the point of this tool is that it is hand-rolled and lightweight, so functionality can be added whenever the need comes up!
  • Example output (screenshots)

    • Results of analyzing an article (the lotus-pond essay)
    • (images omitted)
  • source code

    • Uploaded hastily without many changes; I will update it as I use it.
# % TODO: switch to operating on a DataFrame directly
# % Then... just finish this: make the file operations async
# Print the header of the search results: the title, the conditions, and the filter conditions
# Then expose an API and test it yourself with curl.
# Then package it with docker: send in the file contents and get the processed result back.
# Then build a page: fill in the fields, click a button, and the results show up on the other side.
# Add a memo or whiteboard feature
# Make a mini-program interface
# Make a search function: type the keywords in, then the results can be queried
# Rewrite the Java version too.


# 3 main functions
# Search folder and file names
# (like VS Code cmd + P)
# Plus: multiple folders can be selected, and subfolder exclusion conditions can be added

# Search file contents
# (like VS Code cmd + shift + F)
# Plus: multiple regular expressions can be searched at once, and the first match can be carried downward as a variable within the same line or the same file,
#   but because there is no end-of-match pattern, data at the start and end can get misaligned. If a regular expression that ends the match could be added, the default scope would be the file.

# Extract the contents of a table column, then classify the hits
# Plus: extract multiple features from the same line; one regular expression can extract multiple tokens
# Plus: several regular expressions can hit and be classified at the same time
# Multiple hits on the same line are split into multiple rows.

# Configuration Constants
import datetime
from datetime import date
from operator import concat
import os
import asyncio
from asyncio import Lock
import pandas as pd
import numpy as np
# from traceback import print_list
# import tornado
# from threading import Thread
import re
# from typing import Concatenate

def getChildFiles(basePath):
    # files directly under basePath (non-recursive)
    return [f for f in os.listdir(basePath) if os.path.isfile(basePath + f)]

def getChildFolders(basePath):
    # sub-folders directly under basePath (non-recursive)
    return [f for f in os.listdir(basePath) if os.path.isdir(basePath + f)]

isFirstExcelOutput = True
# On macOS, once you go into the relevant privacy settings there are allow/deny options; below them are "removable volumes" and "network volumes", which are currently not ticked for VS Code
# How to use async def
# /asyncio-async-def/
# /3/library/
# Regular expression Chinese example https:///article/
# /weixin_40907382/article/details/79654372
# Official regular expression docs /3/library/
async def writeToFile(filout, finalStrArr, lock: Lock, oneFileData: pd.DataFrame):
	async with lock:
		# for finalStr in finalStrArr: 
		# print(oneFileData)
		
		# note: to_string output sometimes contains runs of multiple space characters
		# oneFileData.to_string(filout)
		# 
		# filout.write("\n\n")
		# filout.write("".join(finalStrArr))
		# Header row is not written here; the header has already been printed.
		oneFileData.to_csv(filout, sep='\t', index=False, header=None)
		# This does not go into the Excel file, because with an Excel file we do not know where the end of the file is, so we cannot simply append.
		# To append, mode='a' plus the sheet name are required, and the starting row is the sheet's max_row.
		file_path = 'Your output file/output/'  # TODO: point this at your .xlsx output file
		global isFirstExcelOutput

		oneFileData = oneFileData.fillna(" ")
		if isFirstExcelOutput:
			oneFileData.index.name = "No"
			oneFileData.columns.name = "No2"  # This doesn't seem to show up when it is set.
			oneFileData.index = oneFileData.index + 1
			# oneFileData.rename(columns={"result1": "result1" + "\nresult1_1"}, inplace=True)
			multHd = []
			multHd.append((t_lineNo, ""))
			resultNoCnt = 1
			for kw in searchKwsArr:
				multHd.append((t_result_tmp + str(resultNoCnt), kw))
				resultNoCnt += 1
			multHd.append((t_hitNos, ""))
			multHd.append((t_hitKws, ""))
			multHd.append((t_lineContent, ""))
			oneFileData.columns = pd.MultiIndex.from_tuples(multHd, names=["titles", "keywords"])
			# oneFileData.columns = pd.MultiIndex.from_tuples([("lineNo",""),("result1",""),("result2", "result1_1"),("result3",""),("result4",""),("result5",""),("hitNos",""),("hitKws",""),("lineContent","")])
			# oneFileData.columns.values[2] = ("result1", r"(在|到).+里")
			oneFileData.to_excel(file_path)

			isFirstExcelOutput = False
		else:
			with pd.ExcelWriter(file_path, mode='a', engine='openpyxl', if_sheet_exists='overlay') as writer:
				oneFileData.index = oneFileData.index + 1 - 1 + writer.sheets['Sheet1'].max_row
				oneFileData.to_excel(writer, sheet_name='Sheet1', startrow=writer.sheets['Sheet1'].max_row, header=None)
			# with pd.ExcelWriter(file_path) as writer:
			# 	oneFileData.to_excel(writer, sheet_name='Sheet1', startrow=writer.sheets['Sheet1'].max_row, header=None)

			# oneFileData.to_excel(writer, sheet_name='Sheet1', startrow=writer.sheets['Sheet1'].max_row, index=False, header=None)
		# oneFileData.to_excel("Your output file/output/")

	# print("fileout" + str(datetime.datetime.now()))
# def writeToFile(filout, finalStr):
# 	filout.write(finalStr)
def multiMatch(content, kwsArr):
	# True if the content matches any of the given regular expressions
	for kw in kwsArr:
		if re.search(kw, content):
			return True
	return False

excFileType = [
	r"^\._.*",
	r".*\.xls.*"
]
incFileType = [
	r"^[^\.]+\.[^\.]+"
]
# Example keyword patterns (Chinese function words, used here to analyze a prose essay); replace with your own
searchKwsArr = [
	r"(在|到)[^，。]+里",
	r"突然[^，。]+",
	r"[^，。]+都",
	r"好像[^，。]+",
	r"是[^，。]+"
]
t_lineNo="lineNo"
t_result_tmp="result"
t_hitNos="hitNos"
t_hitKws="hitKws"
t_lineContent="lineContent"

async def searchInFile(f, basePath, filout, lock: Lock):

	print("filename: " + f)
	# if not (r"^\._.*", f) and not (r".*\.xls.*", f):
	if not multiMatch(f,excFileType):
	# if not (r"^\.", f):
		col_title=[t_lineNo]
		resultNoCnt = 1
		for kw in searchKwsArr:
			col_title.append(t_result_tmp+str(resultNoCnt))
			resultNoCnt+=1
		# col_title.extend([t_hitNos,t_hitKws,t_lineContent])
		col_title.append(t_hitNos)
		col_title.append(t_hitKws)
		col_title.append(t_lineContent)
		with open(basePath + f, "r") as file:

			one_file_result = pd.DataFrame(columns=col_title)
			finalStrArr = []

			# ["lineNo","result1","result2","result3","result4","result5","hitNos","hitKws","lineContent"])
			# one_file_result = pd.DataFrame()
			# note: the docs clearly show DataFrame.append, but here it reports that there is no such append; one suggested workaround is to downgrade the version, but because pandas is bundled with many other libraries that is not recommended
			# pip install pandas==1.3.4 
			# Most answers say to use concat instead
			# one_file_result = pd.concat([one_file_result, pd.DataFrame({"lineNo":[5],"result1":["tmp"],"result2":["tmp"],"result3":["tmp"],"result4":["tmp"]
			# ,"result5":["tmp"],"hitNos":["tmp"],"hitKws":["tmp"],"lineContent":["tmp"]})], ignore_index=True)

			# one_file_result.add(pd.DataFrame({"lineNo":5,"result1":"tmp","result2":"tmp","result3":"tmp","result4":"tmp","result5":"tmp"
			# 	,"hitNos":"tmp","hitKws":"tmp","lineContent":"tmp"}), ignore_index=True)

			# one_file_result = pd.concat([one_file_result, pd.DataFrame({"lineNo":5,"result1":"tmp","result2":"tmp","result3":"tmp","result4":"tmp","result5":"tmp"
			# 	,"hitNos":"tmp","hitKws":"tmp","lineContent":"tmp"})], ignore_index=True)
			# print(one_file_result)
			linNo = 0
			lines = file.readlines()
			for line in lines:
				linNo += 1
				ptStrs = list()
				resultPD_key = pd.DataFrame(columns=col_title)
				ptStrTmp = str(linNo) + "\t"
				resultPD_tmp = pd.DataFrame(columns=col_title)
				resultPD_tmp.loc[0,t_lineNo]=linNo

				maxFnd = 0
				hitKws = []
				hitNos = []
				kwsSeq = 0
				# for pp in [r"https://hXXXXXXXXXXXXXXXXXXl/[0-9]+\.html"]:
				# for pp in [r"(exist|until (a time)).+Li (surname)", r"all of a sudden[^,。]+", r"[^,。]+general", r"look as if[^,。]+", r"be[^,。]+"]:
				for pp in searchKwsArr:
					kwsSeq = kwsSeq + 1
				# for pp in [r".custom.", r".savory or appetizing", r""one" radical in Chinese characters (Kangxi radical 1).", r".{2,4}(structural particle: used before a verb or adjective, linking it preceding the verb or adjective)" 	, r"carry on one's shoulder or back.", r".hot-water bathing pool", r"moon.", r".color"]:
					lastFnd = "\t"
					findCnt = 0
					for m in re.finditer(
						pp
						, line
						, flags=0):
						findCnt += 1
						if findCnt > maxFnd:
							maxFnd = findCnt
							ptStrs.append(ptStrTmp)
							resultPD_key = pd.concat([resultPD_key, resultPD_tmp], ignore_index=True)

						ptStrs[findCnt-1] = ptStrs[findCnt-1] + pp + ": " + m.group() + "\t"
						# resultPD_key.loc[findCnt-1,t_result_tmp+str(kwsSeq)] = pp + ": " + m.group()
						resultPD_key.loc[findCnt-1,t_result_tmp+str(kwsSeq)] = m.group()
						lastFnd = pp + ": " + m.group() + "\t"
						hitNos.append(str(kwsSeq))
						hitKws.append(pp)
					if False:	
						ptStrTmp = ptStrTmp + lastFnd
					else:
						ptStrTmp = ptStrTmp + "\t"

					# With pandas, a single key with no hit does not need to be padded here
					notfnd = 0
					for fnd in ptStrs:
						notfnd += 1
						if notfnd > findCnt:
							ptStrs[notfnd-1] = ptStrs[notfnd-1] + "\t" 
				
				# Collect the hit results of this line
				fndNo = 0
				for fnd in ptStrs:
					fndNo += 1
					ptStrs[fndNo-1] = ptStrs[fndNo-1] + ";"+";".join(hitNos) +";"+ "\t"	 +";"+ ";".join(hitKws) +";"+ "\t"		
				# for i in range(0,maxFnd-1):

				# This is a single-line search; the multiple results of one line are concatenated together
				if maxFnd > 0:
					finalStr = ""
					for st in ptStrs: finalStr = finalStr + st + line # + "\n"
					finalStrArr.append(finalStr)
					resultPD_key[t_hitNos]=";".join(hitNos)
					resultPD_key[t_hitKws]="【"+"】;【".join(hitKws)+"】"
					resultPD_key[t_lineContent]=line.replace("\n","").replace("\r","")
					one_file_result = pd.concat([one_file_result,resultPD_key], ignore_index=True)
			
					# one_file_result = one_file_result.fillna({t_result_tmp+str(1):"b"})
					# writeToFile(filout, finalStr)	
			# print(one_file_result)
			# one_file_result.columns[2].
			await asyncio.create_task(writeToFile(filout, finalStrArr, lock, one_file_result))

async def searchInFolder(basePath, filout, lock: Lock):
	tasklist = []
	for fo in getChildFolders(basePath):
		# recurse into sub-folders; keep the task so it is awaited below
		tasklist.append(asyncio.create_task(searchInFolder(basePath + fo + "/", filout, lock)))

	files = getChildFiles(basePath)
	for f in files:
		tasklist.append(asyncio.create_task(searchInFile(f, basePath, filout, lock)))
		# if f 
	await asyncio.gather(*tasklist)

async def main():
	lock = Lock()
	starttime = datetime.datetime.now()
	basePaths = ['/Volumes/SDCARD_01/tmp/']
	filout = open("/Volumes/SDCARD_01/output/"+"","w")  # TODO: append your output file name

	# write the search conditions at the top of the output file
	filout.write("excFileType:" + "\n")
	filout.write("\t" + "\n\t".join(excFileType) + "\n")
	filout.write("incFileType:" + "\n")
	filout.write("\t" + "\n\t".join(incFileType) + "\n")
	filout.write("searchKwsArr:" + "\n")
	filout.write("\t" + "\n\t".join(searchKwsArr) + "\n")
	filout.write("basePaths:" + "\n")
	filout.write("\t" + "\n\t".join(basePaths) + "\n")
	titleStr = "lineNo\t"
	titleStrDes = "\t"
	resultNo = 1
	for kw in searchKwsArr:
		titleStr = titleStr + "result" + str(resultNo) + "\t"
		titleStrDes = titleStrDes + kw + "\t"
		resultNo = resultNo + 1
	titleStr = titleStr + "hitNos" + "\t" + "hitKws" + "\t" + "lineContent" + "\t"
	filout.write(titleStr + "\n")
	filout.write(titleStrDes + "\n")
	
	task_fol_list = []
	for basePath in basePaths:
		task_fol_list.append(asyncio.create_task(searchInFolder(basePath, filout, lock)))
	await asyncio.gather(*task_fol_list)
	# await coro

	print('search complete!')
	print("start" + str(starttime))
	print("end  " + str(()))
# 2024-03-04 21:53:57.998985
# 2024-03-04 21:53:58.041339
# 2024-03-04 22:10:00.298639
# 2024-03-04 22:10:00.443002
# async
# 2024-03-04 21:55:17.430653
# 2024-03-04 21:55:17.490983
# lock
# 2024-03-04 22:07:11.735860
# 2024-03-04 22:07:11.850801
# 2024-03-04 22:11:36.540289
# 2024-03-04 22:11:36.595845
# create task
# start2024-03-04 22:40:18.462565
# end  2024-03-04 22:40:18.653983

if __name__ == "__main__":
    # loop = asyncio.get_event_loop()
    # result = loop.run_until_complete(main())
	asyncio.run(main())
	# print(())
	
def foldersSample():

	basePath = 'Path to your retrieval folder/'
	print("当前目录下(used form a nominal expression)文件夹name (of a thing)为:", getChildFolders(basePath))
	# print("当前目录下(used form a nominal expression)文件夹name (of a thing)为:", getChildFolders(basePath))
	files = getChildFiles(basePath)
	print("当前目录下(used form a nominal expression)文件name (of a thing)为:", getChildFiles(basePath))
	# TODO 觉得可以修改"one" radical in Chinese characters (Kangxi radical 1)下快捷键 ctrl + K
	# TODO Read file,read by line,which is better
	# TODO 文件名可以先用regular expression (math.)筛选"one" radical in Chinese characters (Kangxi radical 1)下。如果be多次匹配来试"one" radical in Chinese characters (Kangxi radical 1)下比如atwo persons,The time to testprint"one" radical in Chinese characters (Kangxi radical 1)下
# foldersSample()

def sample():
	pattern = re.compile("(d)[o|a](g)")
	matc = pattern.search("abcdogabcdagabc")     # Match at index 0
	matc = pattern.search("abcdogabcdagabc",3)     # Match at index 0
	matcs = re.findall(pattern, "abcdogabcdagabc", flags=0)
	print(re.findall(re.compile("c(d([o|a])g)"), "abcdogabcdagabc", flags=0))
	iter = re.finditer(re.compile("c(d([o|a])g)"), "abcdogabcdagabc", flags=0)
	for m in re.finditer(
		"c(d([o|a])g)"
		, "abcdogabcdagabc"
		, flags=0):

		print(m.group())
		for g in m.groups():
			print(g)
		print(m.span())
	# findall would probably have been enough for what I need; it just does not give the index of each match,
	print(re.search(r'l','liuyan1').group())
	print(re.search(r'y','liuyan1'))
	print(re.search(r'y','liuyan1').groups())
	pattern.match("dog", 1)  # No match; search doesn't include the "d"
# sample()

# How to use coroutines
# asyncio walkthrough
# /async-io-python/
# Coroutines and Tasks official documentation
# /3/library/
# async def main2():
#     print('hello')
#     await asyncio.sleep(1)
#     print('world')


# loop = asyncio.get_event_loop()
# result = loop.run_until_complete(main2())