I. Background and requirements
There are several report files, and at the moment someone looks up the information in each report by hand and types it into Excel to produce statistics.
After discussing the requirements with the user and looking at the few sample files they provided, I found that they are all HTML files.
In fact, each so-called report is a set of static resources that can be opened locally, including JS files, images, and so on.
II. Choosing an approach
My boss kept calling this document parsing, but after searching around I concluded that it is really just writing a crawler ....
Because the feature is being added to an existing system whose back end is still Java, I did not want to use the Python I had learned earlier.
Using Python would also mean deploying a separate service. Java has jsoup, which works, even if it is not as convenient as Python ...
III. Implementation
1. jsoup dependency coordinates:
<!-- /artifact//jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.18.1</version>
</dependency>
2. Reading the files
I found that each type of report file is stored differently.
The first: a single HTML file:
This is the simple case: take the path and read the file contents directly.
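import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String reportFilePath = "C:/Users/Administrator/Desktop/report-type/";
// Read the whole file as UTF-8 text, then let jsoup parse it
String htmlContent = new String(Files.readAllBytes(Paths.get(reportFilePath)), StandardCharsets.UTF_8);
Document doc = Jsoup.parse(htmlContent);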
The second: a single Zip file:
It is compressed only once, so it can be read through the ZipFile API: take the entries out one by one and check each entry name.
Then open an input stream on the matching entry via ZipFile.
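import java.io.File;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Enumeration;
import java.util.Objects;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

String targetFile = "";
ZipEntry targetEntry = null;
String reportFilePath = "C:/Users/Administrator/Desktop/report-type/";
// On Windows, pass GBK explicitly so the entry names can be decoded
ZipFile zipFile = isWinSys()
        ? new ZipFile(new File(reportFilePath), ZipFile.OPEN_READ, Charset.forName("GBK"))
        : new ZipFile(reportFilePath);
Enumeration<? extends ZipEntry> zipEntries = zipFile.entries();
while (zipEntries.hasMoreElements()) {
    ZipEntry zipEntry = zipEntries.nextElement();
    boolean isDirectory = zipEntry.isDirectory();
    if (isDirectory) continue;
    String name = zipEntry.getName();
    if (targetFile.equals(name)) {
        targetEntry = zipEntry;
        break;
    }
}
boolean hasFind = Objects.nonNull(targetEntry);
if (!hasFind) {
    /* No readable target file */
    zipFile.close();
    return;
}
InputStream inputStream = zipFile.getInputStream(targetEntry);
String htmlCode = IoUtil.readUtf8(inputStream);
Document doc = Jsoup.parse(htmlCode);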
Remember to release the resources once execution is complete:
/* Resource release */
inputStream.close();
zipFile.close();
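As an alternative to closing the streams by hand, the same single-layer lookup can be sketched with try-with-resources, which closes the entry stream and the ZipFile even when an exception is thrown. This is only a sketch built on the JDK: the class name SingleZipReader, the method readEntry, and the exact-name lookup via getEntry are assumptions made for illustration, and InputStream.readAllBytes (Java 9+) stands in for IoUtil.readUtf8.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Optional;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of the original code
final class SingleZipReader {

    /**
     * Returns the UTF-8 text of the file entry called entryName, or an empty
     * Optional if the archive has no such entry. nameCharset is the charset
     * used to decode the entry names (for example GBK on Windows).
     */
    static Optional<String> readEntry(Path zipPath, String entryName, Charset nameCharset) {
        // Both resources are closed automatically, in reverse order of opening
        try (ZipFile zipFile = new ZipFile(zipPath.toFile(), ZipFile.OPEN_READ, nameCharset)) {
            ZipEntry entry = zipFile.getEntry(entryName);
            if (entry == null || entry.isDirectory()) {
                return Optional.empty();
            }
            try (InputStream in = zipFile.getInputStream(entry)) {
                return Optional.of(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned string can then be passed to Jsoup.parse as in the snippet above.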
The third: a nested Zip file (a Zip inside a Zip):
The file is compressed twice, so you have to decompress it twice to reach the content.
1. Reading the embedded Zip file failed with a MALFORMED error; the encoding used for reading has to be set according to the operating system ...
/qq_25112523/article/details/136060946
So add an operating system check at the point where the ZipFile object is created:
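public static boolean isWinSys() {
    // os.name is the JVM system property that holds the operating system name
    String property = System.getProperty("os.name");
    return property.contains("win") || property.contains("Win");
}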
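The check above can also be written so that letter case does not matter and the charset choice lives in one place. A small sketch under assumptions: the class ZipNameCharset and its methods are hypothetical names, and the GBK/UTF-8 split simply mirrors the rule used in this article rather than a general rule for every archive.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of the original code
final class ZipNameCharset {

    /** True when the given os.name value looks like Windows, ignoring letter case. */
    static boolean isWindows(String osName) {
        return osName != null && osName.toLowerCase(Locale.ROOT).contains("win");
    }

    /** GBK for Windows, UTF-8 (the ZipFile default) for everything else. */
    static Charset forOs(String osName) {
        return isWindows(osName) ? Charset.forName("GBK") : StandardCharsets.UTF_8;
    }

    /** Convenience overload that reads the running JVM's os.name property. */
    static Charset forCurrentOs() {
        return forOs(System.getProperty("os.name"));
    }
}
```

With this, the ZipFile could be created in one expression, for example new ZipFile(new File(reportFilePath), ZipFile.OPEN_READ, ZipNameCharset.forCurrentOs()).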
2. ZipFile only handles a single layer of compression; it cannot open a nested archive directly.
In this kind of report file there is only one entry on the first level, so the only case I handle is an uploaded file that contains exactly one embedded zip file.
When that condition holds, get the entry's input stream from ZipFile and wrap it in a ZipInputStream; otherwise the file is not processed.
As the following code shows, once you have the inputStream of the embedded archive, you have to read it with ZipInputStream.
ZipInputStream is roughly ZipFile and its entry enumeration combined: it carries the entry iteration state itself.
However, it only offers the getNextEntry method, so you have to write a while loop that keeps checking whether a next entry exists.
Inside the loop, read the entry name and stop as soon as it matches the target file.
Then read the ZipInputStream directly with the IO utility class (each getNextEntry call positions the ZipInputStream at the data of the current entry).
If you need to handle more complex cases inside the while loop, it is recommended to call closeEntry after finishing each entry.
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipInputStream;

String targetSuffix = ".zip";
String targetFile = "";
String reportFilePath = "C:/Users/Administrator/Desktop/report-type/xx_20240729153751.zip";
ZipFile zipFile = isWinSys()
        ? new ZipFile(new File(reportFilePath), ZipFile.OPEN_READ, Charset.forName("GBK"))
        : new ZipFile(reportFilePath);
Enumeration<? extends ZipEntry> enumeration = zipFile.entries();
/* Copy the entries into a list, because an Enumeration cannot report its size */
List<ZipEntry> zipEntrieList = new ArrayList<>();
while (enumeration.hasMoreElements()) {
    ZipEntry zipEntry = enumeration.nextElement();
    zipEntrieList.add(zipEntry);
}
/* Only process the file when it contains exactly one entry and that entry is a zip archive */
boolean isOnlyOneEntry = zipEntrieList.size() == 1;
boolean anyMatch = zipEntrieList.stream().anyMatch(ze -> ze.getName().endsWith(targetSuffix));
if (!isOnlyOneEntry || !anyMatch) {
    zipFile.close();
    return;
}
ZipEntry zipEntry = zipEntrieList.get(0);
/* Keep switching entries through the ZipInputStream to find the target file */
InputStream inputStream = zipFile.getInputStream(zipEntry);
ZipInputStream zipInputStream = new ZipInputStream(inputStream);
/* Find the target file in the inner layer */
ZipEntry reportEntry = zipInputStream.getNextEntry();
while (Objects.nonNull(reportEntry)) {
    String name = reportEntry.getName();
    if (targetFile.equals(name)) break;
    reportEntry = zipInputStream.getNextEntry();
}
/* The stream is now positioned at the data of the matched entry */
String htmlCode = IoUtil.readUtf8(zipInputStream);
Document doc = Jsoup.parse(htmlCode);
Here too, the resources need to be released:
/* Resource release */
zipInputStream.close();
inputStream.close();
zipFile.close();
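The two-layer read can also be sketched with ZipInputStream on both layers, so that it works on any InputStream (an upload, for instance) and cleans up through try-with-resources. This is a simplified variant, not the code above: the class NestedZipReader and its methods are hypothetical, it takes the first .zip entry it meets instead of insisting on exactly one entry, and it decodes entry names with the ZipInputStream default (UTF-8).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of the original code
final class NestedZipReader {

    /** Advances zis to the first file entry whose name matches; false if there is none. */
    private static boolean seek(ZipInputStream zis, Predicate<String> nameMatches) throws IOException {
        ZipEntry entry;
        // getNextEntry closes the previous entry and returns null after the last one
        while ((entry = zis.getNextEntry()) != null) {
            if (!entry.isDirectory() && nameMatches.test(entry.getName())) {
                return true;
            }
        }
        return false;
    }

    /** Reads targetFile as UTF-8 text from the zip embedded in outerZip, if present. */
    static Optional<String> readInner(InputStream outerZip, String targetFile) {
        // Closing the outer stream at the end also releases the underlying input
        try (ZipInputStream outer = new ZipInputStream(outerZip)) {
            if (!seek(outer, name -> name.endsWith(".zip"))) {
                return Optional.empty();
            }
            // The outer stream now yields the bytes of the embedded archive
            ZipInputStream inner = new ZipInputStream(outer);
            if (!seek(inner, targetFile::equals)) {
                return Optional.empty();
            }
            return Optional.of(new String(inner.readAllBytes(), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the outer entry names are GBK-encoded, the two-argument constructor new ZipInputStream(in, charset) can be used for the outer layer instead.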
3. Common query API usage
I. Common API Methods
It was only after I got home from work that I realized ownText returns just the element's own text content, leaving out the text of nested elements.
You can also query with a cssQuery directly:
doc.select("-report-ui-report-info-grid")
II. Using sibling elements to find correspondences
One special case: according to the logic of the document, some elements ought to form a hierarchy.
First there is A, then B inside A, then C inside B, and so on.
But the actual markup is flat, with A -> B -> C -> D laid out as siblings, and neither the element ids nor the class names relate them to each other, which makes it hard to build the association.
The structure can only be inferred from the order of the elements:
1. Get the sibling index of the current ip title element and of the next ip title element
2. Get the sibling index of the idp element
3. Check whether the idp element's index lies between the two; if it does, the idp element belongs to the first ip title element
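The comparison in these three steps amounts to assigning each item to the nearest title that comes before it. Here is a minimal sketch in plain Java, assuming the sibling indices have already been collected (for example with jsoup's Element.elementSiblingIndex()). The class SiblingGrouper, its method, and the use of a TreeMap are illustrative choices, not the original code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the original code
final class SiblingGrouper {

    /**
     * titles maps a title element's sibling index to its label (e.g. an IP);
     * itemIndexes holds the sibling indexes of the detail elements.
     * Each item is assigned to the closest title that appears before it.
     */
    static Map<String, List<Integer>> groupByPrecedingTitle(Map<Integer, String> titles,
                                                            List<Integer> itemIndexes) {
        // Sort the titles by position so the nearest preceding one can be looked up
        TreeMap<Integer, String> sorted = new TreeMap<>(titles);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (String label : sorted.values()) {
            groups.put(label, new ArrayList<>());
        }
        for (int idx : itemIndexes) {
            // lowerEntry returns the title with the greatest index strictly below idx
            Map.Entry<Integer, String> owner = sorted.lowerEntry(idx);
            if (owner != null) {
                groups.get(owner.getValue()).add(idx);
            }
        }
        return groups;
    }
}
```

An item that sits before the first title has no owner and is skipped, which matches the "between the two titles" rule.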
III. Using parent and child operations to get sibling elements
In the report's detail list, each title row shows summary information such as the xx name and xx level, and clicking it reveals the details in the next row.
All of xx's information is then listed in the tr on that next row.
Using siblingIndex here is inaccurate because the elements are dynamic: the first table might have 10 rows and the second 20.
So when reading from the table, use parent() + child() instead.
After selecting all summary rows of the table, get the index of the current summary row with the indexOf method on the parent element's list of children.
Adding one gives the detail row that follows it in the table.
You can also go straight to the nth child element with the child method of the current element.
Compared with select, this does not require picking from a collection of elements, and it guarantees a single element.
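/* 2. Read [vulnerability distribution] information */
Element vulnTable = doc.getElementById("vuln_distribution");
Element vulnTableBody = vulnTable.child(1);
/* All rows of the table body, summary rows and detail rows alike */
Elements allTrList = vulnTableBody.children();
/* Only the clickable summary rows */
Elements vulnTitleTrList = vulnTableBody.select("tr[style='cursor:pointer;']");
for (Element vrTr : vulnTitleTrList) {
    /* 2-1, Vulnerability Name */
    String vt = vrTr.child(1).text();
    /* The detail row is the row right after the summary row */
    int vrTrIdx = allTrList.indexOf(vrTr);
    Element vrDetailTr = allTrList.get(vrTrIdx + 1);
    Element vrDetailTableBody = vrDetailTr.child(1).child(0).child(0);
    /* 2-2, Vulnerable Host */
    String ipHosts = vrDetailTableBody.child(0).child(1).text();
    ipHosts = ipHosts.replaceAll(" ", "").replaceAll(" Click for details;", "");
    /* 2-3, Vulnerability Description */
    String vulnDesc = vrDetailTableBody.child(1).child(1).text();
    /* 2-4, Threat Score */
    String vulnTag = vrDetailTableBody.child(3).child(1).text();
    /* StrUtil is Hutool's string utility; date is defined elsewhere */
    String format = StrUtil.format("reportTime: {}, ip: {}, name: {}, tag: {} desc: {}, ", date, ipHosts, vt, vulnTag, vulnDesc);
    System.out.println(format);
}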