PaddleNLP UIE - Extraction of information from drug descriptions (name, specification, usage, dosage)

Environment Configuration
Create a project
Upload Code
Customized Models - Training
- code structure
- data annotation
  - Preparing the corpus
  - data annotation
  - Export data
- data conversion
  - doccano
  - Label Studio
- Model fine-tuning
- Model Evaluation
Customized Models - Prediction
Results

PaddleNLP UIE Entity Relationship Extraction - Extraction of Drug Specifications (name, specification, usage, dosage)

For domain-specific scenarios, light customization (fine-tuning the model on a small amount of labeled data) is recommended to further improve results.

from paddlenlp import Taskflow

# Zero-shot extraction with the default UIE model
schema = ['drug name', 'usage', 'dosage', 'frequency']
ie = Taskflow('information_extraction', schema=schema)
print(ie("Ibuprofen dispersible tablets, taken by mouth or dispersed with water. For use in adults and children 12 years of age and older, the recommended dosage is 0.2 to 0.4 (1 to 2 tablets) three times a day or as directed by a physician."))

The default model can only extract drug names. Next, the UIE model is fine-tuned with training data.

Environment Configuration

Creating your own project avoids many of the pitfalls caused by version mismatches: /vipsoft/p/18265581#Problem solving

Python 3.10.10
paddlepaddle-gpu 2.5.2 (a GPU is required for model fine-tuning; the other steps can run on CPU)
PaddleNLP 2.6.1
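Since version mismatches are the main source of trouble, it can help to check the installed versions before going further. The following is a minimal sketch using only the standard library; the `check_versions` helper and the prefix comparison are my own illustration, not part of PaddleNLP.

```python
from importlib import metadata

# Versions listed above; keys are the pip distribution names.
EXPECTED = {"paddlepaddle-gpu": "2.5.2", "paddlenlp": "2.6.1"}


def check_versions(expected):
    """Return {package: (installed_or_None, expected, ok)} without importing the packages."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Loose prefix check so that suffixes such as "2.6.1.post0" still pass.
        ok = installed is not None and installed.startswith(wanted)
        report[name] = (installed, wanted, ok)
    return report


if __name__ == "__main__":
    for name, (installed, wanted, ok) in check_versions(EXPECTED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: installed={installed} expected={wanted} {status}")
```

A package that is not installed is reported with `installed=None` rather than raising, so the script can be run in any environment.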

Create a project

Log in to AI Studio to get free compute: /personalcenter/thirdview/2631487
Extraction of information from drug inserts

Select Environment PaddlePaddle 2.5.2

At startup you can choose the environment. The basic CPU environment is unlimited, and 8 compute points per day are available for learning.
A GPU is needed for training; prediction and the other steps can use the CPU.

Upload Code

Project Address:/projectdetail/8126673?sUid=2631487&shared=1&ts=1719990517603
If you forked the project, skip ahead to Model fine-tuning.

AI Studio cannot access the repository directly, so download the code locally and then upload it.
Download V2.8:/paddlepaddle/PaddleNLP/branches

Upload the code. (In practice you only need to upload the PaddleNLP-v2.5.0\model_zoo\uie directory and nothing else, because PaddleNLP is already installed in the environment; see, for example, the project I built.)

Unzip the code
unzip PaddleNLP-release-2.

Customized Models - Training

code structure

The code files in the model_zoo/uie directory are described below:

.
├── # Data processing tools
├── # Model network definition script
├── # Data annotation script => used in the data conversion step below
├── # Data annotation documentation
├── # Model fine-tuning and compression script
├── # Model evaluation script
└──

data annotation

For the detailed process, see: Data annotation tool doccano | Named Entity Recognition (NER)

Preparing the corpus

Prepare the corpus with one text to be labeled per line, for example:

Ibuprofen Dispersible Tablets, taken orally or dispersed with water. For adults and children over 12 years of age, the recommended dosage is 0.2 to 0.4 (1 to 2 tablets) three times a day, or as directed by a physician
White Plus Black (Amphenol Pseudoephedrine Tablets II Amphentermine Tablets), for oral administration. 1 to 2 tablets once, 3 times a day (1 to 2 white tablets in the morning and midday, and 1 to 2 black tablets at night), or as directed by a doctor for children
Loratadine tablets, oral, size 10 mg of loratadine tablets, usually 1 tablet once a day for adults and children over 12 years of age
Fitalin (diclofenac diethylamide emulsion), topical. According to the size of the painful area, use the appropriate amount of this product, gently rubbing, so that the product penetrates the skin, 3-4 times a day
Heptaphyllum digitalis biside, topical. When used for macular degeneration, apply 1 drop 3 times a day into the conjunctival sac of the eye (near the outer corner of the eye on the ear side)
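Because each line is imported as a separate text, a stray line break inside a description would split it into two items. The sketch below shows one way to write such a file safely; the `write_corpus` helper and the file name are hypothetical, and collapsing whitespace to a single space is my own choice.

```python
from pathlib import Path


def write_corpus(texts, path):
    """Write one text per line, collapsing internal line breaks and skipping blank entries."""
    lines = []
    for text in texts:
        cleaned = " ".join(text.split())  # collapse newlines, tabs and repeated spaces
        if cleaned:
            lines.append(cleaned)
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)


if __name__ == "__main__":
    count = write_corpus(
        ["Loratadine tablets, oral,\nusually 1 tablet once a day", "   "],
        "corpus.txt",  # hypothetical file name
    )
    print(f"wrote {count} line(s)")
```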

data annotation

Define Tags
For this demo, the tags are simply set to "drug name, generic name, specification, usage, dosage, frequency".

data annotation
On the doccano platform, create a project of type Sequence Labeling.
Define the entity label categories. In the example above, the entity labels to define are [ Drug Name, Generic Name, Specification, Usage, Dosage, Frequency ].
Use the tags defined above to start labeling the data. An example of doccano labeling is shown below:

Export data

After labeling is complete, export the file from the doccano platform, rename it to doccano_ext.json, and place it in the ./data directory.
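Before converting, a quick sanity check of the exported file can catch broken spans early. This is only a sketch: it assumes the export is JSON Lines with a `text` field and an `entities` list whose items carry `label`, `start_offset` and `end_offset`, which may differ between doccano versions. The `summarize_doccano` helper is hypothetical.

```python
import json
from collections import Counter


def summarize_doccano(lines):
    """Count labels and collect out-of-range spans from doccano-style JSONL lines."""
    label_counts = Counter()
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        text = record["text"]
        for ent in record.get("entities", []):
            start, end = ent["start_offset"], ent["end_offset"]
            if not (0 <= start < end <= len(text)):
                # Span falls outside the text or is empty
                problems.append((lineno, ent["label"], start, end))
                continue
            label_counts[ent["label"]] += 1
    return label_counts, problems
```

Typical use would be `summarize_doccano(open("./data/doccano_ext.json", encoding="utf-8"))`, then printing the label counts to see whether any tag is rarely or never used.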

data conversion

doccano

Create a data directory in the AI Studio environment and place doccano_ext.json in it.

Run the following script to convert the data. It generates the training/validation/test set files in the ./data directory.

python  \
    --doccano_file ./data/doccano_ext.json \
    --task_type ext \
    --save_dir ./data \
    --splits 0.8 0.2 0 \
    --schema_lang ch

# After execution, the training/validation/test set files are generated in the ./data directory.
[2024-06-26 09:48:38,269] [ INFO] - Save 24 examples to ./data/.
[2024-06-26 09:48:38,269] [ INFO] - Save 5 examples to ./data/.
[2024-06-26 09:48:38,269] [ INFO] - Save 0 examples to ./data/.

Configurable parameter description:

doccano_file: The data annotation file exported from doccano.
save_dir: The directory where the training data is saved; by default the data directory.
negative_ratio: The maximum ratio of negative examples. This parameter applies only to extraction tasks; constructing negative examples appropriately can improve model performance. The number of negative examples depends on the actual number of labels: maximum number of negative examples = negative_ratio * number of positive examples. This parameter applies only to the training set and defaults to 5. To keep the evaluation metrics accurate, the validation and test sets are built with all negative examples by default.
splits: The proportions of the training, validation, and test sets when splitting the dataset. The default is [0.8, 0.1, 0.1], meaning the data is split into training, validation, and test sets at a ratio of 8:1:1.
task_type: The task type; two types are available, extraction and classification.
options: The category labels of the classification task; this parameter applies only to classification tasks. Defaults to ["positive", "negative"].
prompt_prefix: The prompt prefix of the classification task; this parameter applies only to classification tasks. Default is "emotional orientation".
is_shuffle: Whether to shuffle the dataset randomly; defaults to True.
seed: Random seed; default 1000.
separator: The separator between the entity category/evaluation dimension and the classification label; this parameter applies only to entity-level/evaluation-dimension-level classification tasks. The default is "###".
schema_lang: The language of the schema; the options are ch and en. The default is ch; for English datasets, select en.
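To make the splits, is_shuffle and seed parameters concrete, here is a minimal sketch of a proportional split. It illustrates the idea and is not the conversion script's actual code; the `split_examples` helper is hypothetical, and the cut points are simply the cumulative proportions truncated to integers.

```python
import random


def split_examples(examples, splits=(0.8, 0.1, 0.1), is_shuffle=True, seed=1000):
    """Split examples into train/dev/test lists according to the given proportions."""
    items = list(examples)
    if is_shuffle:
        # A dedicated Random instance keeps the shuffle reproducible for a given seed
        random.Random(seed).shuffle(items)
    p1 = int(len(items) * splits[0])
    p2 = int(len(items) * (splits[0] + splits[1]))
    return items[:p1], items[p1:p2], items[p2:]


if __name__ == "__main__":
    # Same proportions as the command above: --splits 0.8 0.2 0
    train, dev, test = split_examples(range(10), splits=(0.8, 0.2, 0))
    print(len(train), len(dev), len(test))
```

With `--splits 0.8 0.2 0` the test portion is empty, which matches the "Save 0 examples" line in the log above.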

Remarks:

By default, the script splits the data into train/dev/test sets proportionally.
Each run of the script overwrites existing data files of the same name.
We recommend constructing some negative examples during model training to improve results, and this is built into the data conversion step. The negative_ratio parameter controls the proportion of automatically constructed negative examples: number of negative examples = negative_ratio * number of positive examples.
For files exported from doccano, every record in the file is assumed by default to be correctly hand-labeled.
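The negative_ratio rule can be sketched as a simple cap on the number of negative examples. This only illustrates the arithmetic described above; how the conversion script actually builds and samples negatives is not shown here, and the `cap_negatives` helper is hypothetical.

```python
import random


def cap_negatives(positives, negatives, negative_ratio=5, seed=1000):
    """Keep at most negative_ratio * len(positives) negatives, sampled at random."""
    max_negatives = negative_ratio * len(positives)
    candidates = list(negatives)
    if len(candidates) <= max_negatives:
        return candidates
    return random.Random(seed).sample(candidates, max_negatives)


if __name__ == "__main__":
    positives = ["pos-1", "pos-2"]
    negatives = [f"neg-{i}" for i in range(20)]
    kept = cap_negatives(positives, negatives)
    print(len(kept))  # 2 positives * ratio 5 = at most 10 negatives
```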

Label Studio

Data can also be labeled with the Label Studio annotation platform. The script converts the JSON file exported by Label Studio into the doccano export format; the subsequent data conversion and model fine-tuning steps remain unchanged.

python  --labelstudio_file

Configurable parameter description:

labelstudio_file: The path to the Label Studio export file (JSON format only).
doccano_file: The path where the doccano-format data file is saved; default is "doccano_ext.jsonl".
task_type: The task type, either extraction ("ext") or classification ("cls"); defaults to "ext".
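To show roughly what such a conversion involves, the sketch below maps Label Studio span annotations onto doccano-style records. It is not the script shipped with PaddleNLP. It assumes the Label Studio JSON export is a list of tasks with `data`, `annotations` and `result` fields, that the text is stored under the `text` key (this depends on the labeling configuration), and that the target format uses `entities` with `start_offset`/`end_offset`. The `labelstudio_to_doccano` helper is hypothetical.

```python
def labelstudio_to_doccano(tasks, text_key="text"):
    """Convert Label Studio span annotations to doccano-style records (extraction only)."""
    records = []
    for task in tasks:
        entities = []
        # Results from all annotations of a task are merged into one entity list
        for annotation in task.get("annotations", []):
            for item in annotation.get("result", []):
                if item.get("type") != "labels":
                    continue  # skip relations, choices and other result types
                value = item["value"]
                for label in value["labels"]:
                    entities.append(
                        {
                            "id": len(entities),
                            "label": label,
                            "start_offset": value["start"],
                            "end_offset": value["end"],
                        }
                    )
        records.append(
            {
                "id": task["id"],
                "text": task["data"][text_key],
                "entities": entities,
                "relations": [],
            }
        )
    return records
```

Each returned record can then be written as one JSON object per line to produce a file in the doccano layout.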

Model fine-tuning

Note: use the GPU environment for this step.

Enabling the GPU Environment

aistudio@jupyter-2631487-8126673:~$ python -V
Python 3.10.10
aistudio@jupyter-2631487-8126673:~$ pip show paddlepaddle-gpu
Name: paddlepaddle-gpu
Version: 2.5.2
Summary: Parallel Distributed Deep Learning
Home-page: /
Author: 
Author-email: Paddle-better@
License: Apache Software License
Location: /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages
Requires: astor, decorator, httpx, numpy, opt-einsum, Pillow, protobuf
Required-by: 
aistudio@jupyter-2631487-8126673:~$ pip show paddlenlp
Name: paddlenlp
Version: 2.6.1.post0
Summary: Easy-to-use and powerful NLP library with Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including Neural Search, Question Answering, Information Extraction and Sentiment Analysis end-to-end system.
Home-page: /PaddlePaddle/PaddleNLP
Author: PaddleNLP Team
Author-email: paddlenlp@
License: Apache 2.0
Location: /opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages
Requires: aistudio-sdk, colorama, colorlog, datasets, dill, fastapi, Flask-Babel, huggingface-hub, jieba, jinja2, multiprocess, onnx, paddle2onnx, paddlefsl, protobuf, rich, safetensors, sentencepiece, seqeval, tool-helpers, tqdm, typer, uvicorn, visualdl
Required-by: paddlehub
aistudio@jupyter-2631487-8126673:~$

The Trainer API is recommended for fine-tuning. Given a model and a dataset, the Trainer API handles pre-training, fine-tuning, and model compression efficiently, and supports one-command multi-GPU training, mixed-precision training, gradient accumulation, resuming from checkpoints, and logging. It also encapsulates common training configuration such as the optimizer and learning-rate scheduling.
Use the following command to fine-tune with uie-base as the pre-trained model and save the fine-tuned model to $finetuned_model:
Single-GPU launch:
cd PaddleNLP-release-2.8/model_zoo/uie/

export finetuned_model=./checkpoint/model_best

python   \
    --device gpu \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --seed 42 \
    --model_name_or_path uie-base \
    --output_dir $finetuned_model \
    --train_path data/ \
    --dev_path data/  \
    --per_device_eval_batch_size 16 \
    --per_device_train_batch_size  16 \
    --num_train_epochs 20 \
    --learning_rate 1e-5 \
    --label_names "start_positions" "end_positions" \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir $finetuned_model \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end  True \
    --save_total_limit 1

Note: if the model is the cross-lingual model UIE-M, you also need to set --multilingual.

Configurable parameter description:

model_name_or_path: Required. The pre-trained model used for few-shot training. Options are "uie-base", "uie-medium", "uie-mini", "uie-micro", "uie-nano", "uie-m-base", "uie-m-large".
multilingual: Whether the model is cross-lingual. Models fine-tuned from "uie-m-base", "uie-m-large", etc. are also multilingual and need this set to True; the default is False.
output_dir: Required. The directory where the model is saved after training or compression; defaults to None.
device: The training device; one of 'cpu', 'gpu', 'npu'. The default is GPU training.
per_device_train_batch_size: The batch size for training. Adjust it to your GPU memory, and lower it if memory is insufficient; the default is 32.
per_device_eval_batch_size: The batch size for evaluation on the development set. Adjust it to your GPU memory, and lower it if memory is insufficient; the default is 32.
learning_rate: The maximum learning rate for training; 1e-5 is recommended for UIE. The default is 3e-5.
num_train_epochs: The number of training epochs; 100 can be used together with early stopping. The default is 10.
logging_steps: The number of steps between log outputs during training; default 100.
save_steps: The number of steps between model checkpoints saved during training; default 100.
seed: Global random seed; default is 42.
weight_decay: The weight decay applied to all layers except bias and LayerNorm weights. Optional; default is 0.0.
do_train: Whether to run fine-tuning. Passing this flag enables training; it is not set by default.
do_eval: Whether to run evaluation. Passing this flag enables evaluation.

Because the sample command sets --do_eval, evaluation runs automatically after training.
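When choosing logging_steps, save_steps and eval_steps for a small dataset, it helps to estimate how many optimizer steps the run will have. The sketch below is plain arithmetic under stated assumptions: a single GPU, no gradient accumulation, and the last partial batch kept. The figure of 24 training examples is taken from the conversion log above; the `total_training_steps` helper is hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Estimate optimizer steps: batches per epoch (rounded up) times the number of epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


if __name__ == "__main__":
    # 24 training examples, --per_device_train_batch_size 16, --num_train_epochs 20
    print(total_training_steps(24, 16, 20))
```

Under these assumptions the run has 40 steps in total, which is fewer than the step intervals of 100 used in the command, so for a dataset this small lower values for the step-based intervals may be worth trying.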

Model Evaluation

Model evaluation is performed by running the following command:

python  \
    --model_path ./checkpoint/model_best \
    --test_path ./data/ \
    --batch_size 16 \
    --max_seq_len 512

Output:

[2024-07-03 14:14:16,345] [    INFO] - Class Name: all_classes
[2024-07-03 14:14:16,345] [    INFO] - Evaluation Precision: 1.00000 | Recall: 0.80000 | F1: 0.88889
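The F1 value in this log is the harmonic mean of precision and recall, which can be checked with a few lines of Python. The `f1_score` helper below is only an illustration of the formula and is not taken from the evaluation script.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision 1.00000 and Recall 0.80000 from the log above
    print(f"{f1_score(1.0, 0.8):.5f}")  # 0.88889
```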

To evaluate a UIE-M model, run the following command:

python  \
    --model_path ./checkpoint/model_best \
    --test_path ./data/ \
    --batch_size 16 \
    --max_seq_len 512 \
    --multilingual

Evaluation method: single-stage evaluation is used, i.e., for tasks that require staged prediction, such as relation extraction and event extraction, the prediction results of each stage are evaluated separately. By default, the validation/test set uses all labels at the same level to construct the full set of negative examples.

Debug mode can be enabled to evaluate each positive class separately; this mode is intended only for model debugging:

python  \
    --model_path ./checkpoint/model_best \
    --test_path ./data/ \
    --debug

Output:

[2024-07-03 14:15:53,892] [ INFO] - -----------------------------
[2024-07-03 14:15:53,892] [ INFO] - Class Name: Common Name
[2024-07-03 14:15:53,892] [ INFO] - Evaluation Precision: 0.00000 | Recall: 0.00000 | F1: 0.00000
[2024-07-03 14:15:53,922] [ INFO] - -----------------------------
[2024-07-03 14:15:53,922] [ INFO] - Class Name.
[2024-07-03 14:15:53,922] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2024-07-03 14:15:54,039] [ INFO] - -----------------------------
[2024-07-03 14:15:54,039] [ INFO] - Class Name: usage
[2024-07-03 14:15:54,039] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2024-07-03 14:15:54,065] [ INFO] - -----------------------------
[2024-07-03 14:15:54,065] [ INFO] - Class Name: Dosage
[2024-07-03 14:15:54,065] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2024-07-03 14:15:54,091] [ INFO] - -----------------------------
[2024-07-03 14:15:54,091] [ INFO] - Class Name: frequency
[2024-07-03 14:15:54,091] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000

Configurable parameter description:

model_path: The path to the model folder to evaluate, which must contain the model weights file model_state.pdparams and the configuration file model_config.json.
test_path: The test set file to evaluate.
batch_size: The batch size; adjust it to your machine. The default is 16.
max_seq_len: The maximum length for splitting the text; inputs longer than this are split automatically. The default is 512.
debug: Whether to enable debug mode, which evaluates each positive class separately. This mode is only for model debugging and is off by default.
multilingual: Whether the model is cross-lingual; off by default.
schema_lang: The language of the schema; the options are ch and en. The default is ch; for English datasets, select en.

Customized Models - Prediction

Create a test file in the uie directory. (You can also type the code directly into the terminal; personally I prefer to run it as a file.)

from pprint import pprint
from paddlenlp import Taskflow

schema = ['drug name', 'usage', 'dosage', 'frequency']

# Set the extraction targets and point task_path at the fine-tuned model weights
my_ie = Taskflow("information_extraction", schema=schema, task_path='./checkpoint/model_best')
pprint(my_ie("Ibuprofen dispersible tablets, taken by mouth or dispersed with water. For use in adults and children 12 years of age and older, the recommended dosage is 0.2 to 0.4 (1 to 2 tablets) three times a day or as directed by a physician."))

# A drug description that is not in the training corpus
pprint(my_ie("Cefixime Dispersible Tablets, can be taken by melting in warm water or swallowed directly. It can be taken before or after meals. Adults and children weighing more than 30 kilograms: take orally, 50~100mg each time, twice a day;"))

Run it from the uie directory:

aistudio@jupyter-2631487-8126673:~/PaddleNLP-release-2.8/model_zoo/uie$ python

Output:

[{'usage': [{'end': 9,
             'probability': 0.9967116760425654,
             'start': ...,
             'text': 'Oral'}],
  'dosage': [{'end': 50,
              'probability': 0.9942849459419811,
              'start': 46,
              'text': '1~2 tablets'}],
  'drug name': [{'end': 6,
                 'probability': 0.9993706125082298,
                 'start': 0,
                 'text': 'Ibuprofen Dispersible Tablets'}],
  'frequency': [{'end': 55,
                 'probability': 0.993725564192772,
                 'start': ...,
                 'text': '3 times a day'}]}]
[{'usage': [{'end': 50,
             'probability': 0.9894656717712884,
             'start': 48,
             'text': ...},
            {'end': 24,
             'probability': 0.468091403198585,
             'start': ...,
             'text': 'Swallow Directly'}],
  'dosage': [{'end': 61,
              'probability': 0.7317206007179742,
              'start': 53,
              'text': '50~100mg'}],
  'drug name': [{'end': 7,
                 'probability': 0.999166781489965,
                 'start': 0,
                 'text': 'Cefixime Dispersible Tablets'}],
  'frequency': [{'end': 66,
                 'probability': 0.9919675951550744,
                 'start': ...,
                 'text': '2 times daily'}]}]
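Each result above is a dict that maps a schema label to a list of candidate spans with text, start, end and probability. For downstream use it is often handier to keep only the confident texts per label. The sketch below does that with plain Python; the `flatten_results` helper and the 0.5 threshold are my own assumptions, not PaddleNLP defaults.

```python
def flatten_results(results, min_probability=0.5):
    """Reduce Taskflow-style results to {label: [text, ...]}, most confident first."""
    flattened = []
    for result in results:
        fields = {}
        for label, spans in result.items():
            kept = [s for s in spans if s["probability"] >= min_probability]
            kept.sort(key=lambda s: s["probability"], reverse=True)
            if kept:
                fields[label] = [s["text"] for s in kept]
        flattened.append(fields)
    return flattened
```

With this threshold, a low-confidence candidate such as 'Swallow Directly' (probability about 0.468) would be dropped, while the other spans shown above would be kept. Calling `flatten_results(my_ie(text))` would apply it directly to a prediction.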

Results

Project Address:
Extraction of information from drug inserts:/projectdetail/8126673?sUid=2631487&shared=1&ts=1719990517603