
LLM Application Practice: Multi-Label Classification for Industry Governance


1. Background

It's been a while; I've missed you all!~

qiang~ recently changed jobs and has been adjusting to a new line of business, so the articles went unupdated for a while. Thanks for understanding!

We are now working on industry governance, and I am mainly responsible for one of its functional modules: multi-label classification of enterprise profiles and their patent data against the industry-governance label system.

This installment shares hands-on experience in building multi-label classification on top of an LLM when the labels are numerous and the hierarchy is deep; readers facing similar problems may find it a useful reference.

2. Data Overview

2.1 Label System

The industry-governance label system contains 200+ labels in total, organized into 4 levels; the level-3 and level-4 labels carry brief, generalized descriptions of their meaning.
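To make the structure concrete, here is a minimal sketch of such a label tree in Python. All label names and descriptions are invented for illustration (the real system has 200+ labels); only levels 3 and 4 carry meaning descriptions.

```python
# Hypothetical 4-level label tree; names and descriptions are illustrative only.
label_tree = {
    "Manufacturing": {                # level 1 (no description)
        "Automotive": {               # level 2 (no description)
            "Smart car": {            # level 3
                "desc": "Vehicles and systems with intelligent-driving capability",
                "children": {         # level 4: label name -> description
                    "Automatic parking": "Sensor-driven autonomous parking",
                    "Lane-change assistance": "Monitors lane lines and assists lane changes",
                },
            },
        },
    },
}
```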

2.2 Raw Data

  1. Enterprise official-website data, including a basic company introduction, main products, etc.
  2. Enterprise patent data, including patent titles and abstracts; the volume of patent data is large.

2.3 LLM Selection

After some research, we chose Qwen2-72B-Instruct-GPTQ-Int4. The quantized version uses less GPU memory while performing comparably to the non-quantized version; see the official Qwen2 documentation for details.

3. Technical Difficulties

  1. The team has no annotators, so small BERT-style models cannot be trained for the multi-label classification task.
  2. The task sits in a vertical domain; even with annotators available, strong background knowledge would be required to label correctly.
  3. The number of labels is large, the hierarchy is deep, and the project has accuracy requirements.

4. Solution Design

Given the lack of annotators and the strong background knowledge the task demands, the only viable option was to have an LLM perform it.

The meaning of each label in the system is not concrete enough: the descriptions are high-level summaries tied to specific scenarios, so the LLM may not classify accurately. We therefore considered extracting domain keywords to serve as background knowledge for RAG.

The enterprise-website and patent data volumes are huge, so LLM calls would take an inordinate amount of time. Fortunately, we have two 8-GPU machines on which distributed inference can be run to improve response performance.

The overall solution design is as follows:

[Figure: overall solution architecture]

The diagram looks simple and straightforward, but the details are worth digging into.

4.1 Word-Level Matching Module

(1) For the vertical domain, manually compile the keywords each label may involve, based on the label's meaning and domain experience; for example, "smart car" might involve smart driving, automatic parking, lane-change assistance, etc. Manually compiled keyword lists are, however, necessarily limited;

(2) For the enterprise and patent data, segment the text with LAC + jieba (note: the manually compiled terms go into the user dictionary so they are not split), then use KeyBERT plus edit distance for keyword matching (KeyBERT's underlying embedding model is xiaobu-embedding-v2, which worked best in our tests), and filter to the labels mapped by the matched keywords, as sketched below.
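Here is a minimal sketch of this module, assuming jieba and KeyBERT are installed, that `user_dict.txt` holds the manually compiled keywords (so jieba will not split them), and that `KEYWORD2LABELS` is a hypothetical curated keyword-to-label mapping. The Hugging Face model id for xiaobu-embedding-v2 is my assumption; LAC is omitted for brevity, and difflib's ratio stands in for a normalized edit distance.

```python
import difflib

import jieba
from keybert import KeyBERT

# Protect manually compiled keywords so the segmenter keeps them whole.
jieba.load_userdict("user_dict.txt")

# Assumed model id for xiaobu-embedding-v2.
kw_model = KeyBERT(model="lier007/xiaobu-embedding-v2")

# Hypothetical curated mapping: keyword -> candidate labels.
KEYWORD2LABELS = {
    "automatic parking": ["Smart car"],
    "lane-change assistance": ["Smart car"],
}

def candidate_labels(text: str, top_n: int = 10, threshold: float = 0.8) -> set[str]:
    """Extract keywords from a document and map them to candidate labels."""
    # KeyBERT expects space-separated tokens for Chinese, so pre-segment first.
    segmented = " ".join(jieba.lcut(text))
    keywords = kw_model.extract_keywords(
        segmented, keyphrase_ngram_range=(1, 2), stop_words=None, top_n=top_n
    )
    labels: set[str] = set()
    for kw, _score in keywords:
        for curated, mapped in KEYWORD2LABELS.items():
            # Normalized edit similarity as a stand-in for edit distance.
            if difflib.SequenceMatcher(None, kw, curated).ratio() >= threshold:
                labels.update(mapped)
    return labels
```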

4.2 Classification RAG Module

(1) Under each level-3 label, the number of level-4 labels is limited, so classification first targets the first 3 levels of the hierarchy. The trick is to go coarse first, then fine: the level-3 tier contains many labels, so they are split into N groups; each group is sent to the LLM with its own prompt and yields a result; the per-group results are then aggregated, and the LLM is called again to produce the fine-grained label.

(2) Once the first 3 levels are determined, the final label is chosen from the level-4 labels beneath them.
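A minimal sketch of the coarse-then-fine flow, assuming an OpenAI-compatible vLLM endpoint; the URL, model name, and prompt wording are all my assumptions, not the original prompts:

```python
from openai import OpenAI

# Assumed OpenAI-compatible endpoint exposed by vLLM (behind haproxy, see below).
client = OpenAI(base_url="http://haproxy-host:8000/v1", api_key="EMPTY")
MODEL = "Qwen/Qwen2-72B-Instruct-GPTQ-Int4"

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}], temperature=0
    )
    return resp.choices[0].message.content

def classify_level3(profile: str, keywords: list[str], level3_labels: list[str],
                    n_groups: int = 4) -> str:
    # Coarse pass: split the level-3 labels into N groups and shortlist per group.
    size = -(-len(level3_labels) // n_groups)  # ceiling division
    shortlist = []
    for i in range(0, len(level3_labels), size):
        group = level3_labels[i:i + size]
        ans = ask(
            f"Company profile: {profile}\n"
            f"Domain keywords: {', '.join(keywords)}\n"
            f"Candidate labels: {', '.join(group)}\n"
            "List the labels that apply, or answer 'none'."
        )
        if ans.strip().lower() != "none":
            shortlist.append(ans.strip())
    # Fine pass: merge the per-group shortlists and have the LLM confirm.
    return ask(
        f"Company profile: {profile}\n"
        f"Shortlisted labels: {'; '.join(shortlist)}\n"
        "Choose the label(s) that fit best."
    )
```

The same pattern then repeats one level down: given the confirmed level-3 label, the small set of level-4 labels under it fits in a single prompt.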

5. Design Q&A

  1. Why use keywords for RAG?

A: Keywords do not map directly to their labels (think about why), but they carry strong contextual cues, so including them in the prompt raises the probability that the relevant labels are surfaced.

  2. Why add edit distance on top of semantic keyword matching?

A: Semantic-similarity models are generally trained to compare longer texts and are less effective at comparing single words, so edit distance is introduced to improve word-level matching.
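A small sketch of that hybrid word-level matching, assuming sentence-transformers with the same (assumed) embedding model id as above: take the stronger of the embedding similarity and the normalized edit similarity, so near-identical surface forms still match even when the embedding score is weak.

```python
import difflib

from sentence_transformers import SentenceTransformer, util

# Assumed model id for xiaobu-embedding-v2.
model = SentenceTransformer("lier007/xiaobu-embedding-v2")

def word_match_score(a: str, b: str) -> float:
    semantic = float(util.cos_sim(model.encode(a), model.encode(b)))
    edit = difflib.SequenceMatcher(None, a, b).ratio()  # normalized edit similarity
    return max(semantic, edit)  # keep whichever signal is stronger
```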

  3. How do you handle multiple labels mapping to the same keyword?

A: We rely on the underlying LLM to distinguish which label the text actually belongs to.

  4. How was the classification RAG module conceived?

A: The number of labels is large, the hierarchy is deep, and the LLM's input length is limited, so a divide-then-merge approach is applied: the whole label system is first split into groups, the LLM is called once per group, and the per-group outputs are then consolidated and passed to the LLM again to confirm the fine-grained classification.

  5. What is the benefit of the classification RAG going coarse before fine?

A: In the coarse-grained pass, the LLM sees only one group of labels at a time rather than the full label set; reconfirming at fine granularity after the coarse split helps improve classification accuracy.

According to qiang~'s experiments, accuracy rose from 70%-80% to 85%-90% (the experiment covers only this particular scenario, of course). The downside is the added LLM response time.

  6. With the labels split into N groups, each requiring an LLM call, how can response performance be improved?

A: Since the quantized Qwen2 is deployed and two 8-GPU machines are available, we start 8 vLLM processes and use haproxy for request forwarding, which improves LLM response performance. Experiments showed that 70K+ records can be processed in about 1 day, whereas a single-node non-quantized deployment might take a couple of weeks.
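On the client side, throughput scales by keeping all backend processes busy. A sketch under these assumptions: 8 vLLM processes (2 GPUs each across the two 8-card machines) sit behind haproxy at `haproxy-host:8000`, and requests are fanned out concurrently so haproxy can spread them across the pool.

```python
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

# haproxy front end; it forwards requests to the 8 vLLM worker processes.
client = OpenAI(base_url="http://haproxy-host:8000/v1", api_key="EMPTY")

def classify_one(profile: str) -> str:
    resp = client.chat.completions.create(
        model="Qwen/Qwen2-72B-Instruct-GPTQ-Int4",
        messages=[{"role": "user", "content": f"Classify: {profile}"}],
        temperature=0,
    )
    return resp.choices[0].message.content

def classify_all(profiles: list[str]) -> list[str]:
    # Roughly one in-flight request per worker keeps all 8 processes busy.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(classify_one, profiles))
```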

  7. How good are the results, concretely?

A: With this solution, random per-label sampling shows accuracy holding between 85% and 95%.

  8. Why not add few-shot examples?

A: The keywords here already play a role similar to few-shot examples. Using companies or patents directly as few-shot examples has two problems: the range of labels they exemplify is wide and hard to organize well, and the longer input would seriously hurt LLM response time.

  9. How is accurate classification ensured for cases whose keywords were never manually compiled?

A: We rely on the underlying LLM's own ability, which is exactly why Qwen2-72B was chosen: its results are currently among the best in the industry.

6. Future Optimization Points

To further improve accuracy, the current solution already reserves two extension points: the detailed label descriptions and the manually compiled vertical-domain keywords. The more detailed the label descriptions and the more complete the keyword lists, the higher the classification accuracy.

The catch is that manually compiling keywords is time-consuming and labor-intensive; further reducing this manual effort is the next optimization direction.

7. Summary

One sentence is enough~

This article described using an LLM to implement a multi-label classification task in the industry-governance domain, covering the concrete solution design, LLM engineering-level optimizations, the achieved results, and future optimization directions.

Readers can refer to this solution when tackling multi-label classification with an LLM, especially when the labels are numerous and the hierarchy is deep.

If you have questions or would like to collaborate, feel free to reach out via private message.

8. References

(1) Qwen2 documentation: /zh-cn/latest/