Overview of Dictionary Compilation Configuration Files

Overview

Dr. Chinese allows users to compile their own dictionary files. This article briefly describes the dictionary compilation process and how to write the related configuration files.

Readers are expected to have a basic knowledge of XML and regular expressions.

Dictionary Compiler

The dictionary compiler for Dr. Chinese can be accessed by clicking on "Dictionary Compiler" in the "File" menu.

Before compiling, specify the configuration file by clicking the "Load" button. Once the configuration file is selected, the output location is filled in automatically. After you click the "Compile" button, the dictionary is compiled and automatically registered in the program's dictionary library, ready for searching.

Overview of the Compilation Configuration File

The dictionary compilation configuration file is an XML file, and the program processes the specified text file using the regular expressions defined in the configuration file to convert the text content to dictionary content.

Tip: A large number of examples are available in the program's Download Space/Dictionary Database/Dictionary Source Files directory.

The compilation configuration file should contain at least the following types of information:

Basic information about the dictionary, such as title, version, publisher, creator, description, etc.
Dictionary fields, such as pronunciation, source, explanation, example sentences, etc., can all be freely defined.
- A dictionary database must have at least one field (field).
- Fields need to be specified with a name and a description.
- The default field data type (valueType) is plain text.
- The markup type indicates that the field content is HTML.
- There are also other types, such as pinyin, Cantonese spelling, lists of related Chinese characters, and so on.
- Before a field is displayed in the search results, its text can be matched against a regular expression (pattern) and replaced (replace) with other content (replacement), such as HTML tags or jump links. The source file therefore does not need to be filled with HTML tags; the conversion happens at runtime, which also lets the dictionary compress to a smaller size.
The source file for the data (DATA) of the dictionary.
- The source file of the dictionary is a text file. The compiler reads it line by line (row), extracts the headwords according to a regular expression (pattern), and maps the remaining text to the corresponding fields.
- A dictionary can have more than one source file.
- Entries can be repeated. When searching for an entry, the program displays all the contents of the dictionary that match the entry, in the order in which they appear in the source file.
- UTF-8 or GB18030 is recommended as the source file encoding (encoding), so that the widest range of Chinese characters can be displayed.
- The text can be replaced before it is compiled and written to the dictionary file.
- Generally, there is no need to provide both simplified and traditional forms of a headword; during a search the program automatically checks simplified, traditional, and variant forms.
- Generally, there is no need to worry about whether the explanation text is in Simplified or Traditional Chinese. Users can select text in the dictionary body and convert it to Simplified or Traditional Chinese from the right-click menu.
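The rule that repeated entries are all shown, in source order, can be pictured with a minimal Python sketch. This only illustrates the behaviour described above; the function name `build_index` and the sample entries are hypothetical and say nothing about how the program stores its data internally.

```python
from collections import defaultdict


def build_index(entries):
    """Group explanations by headword, keeping the order of the source file."""
    index = defaultdict(list)
    for headword, explanation in entries:
        index[headword].append(explanation)
    return index


# Hypothetical entries in the order they would appear in a source file.
entries = [
    ("甘草", "first explanation"),
    ("丹参", "another entry"),
    ("甘草", "second explanation"),
]

index = build_index(entries)
for explanation in index["甘草"]:
    print(explanation)
```

Looking up a repeated headword returns every explanation rather than only the last one, which is the behaviour the list above describes.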

In addition to the essential information above, the compilation profile supports the following advanced features:

Dictionary documentation (documentation). "Documentation" refers to auxiliary content that is not part of the dictionary entries, such as the foreword, preface, explanatory notes, and appendices. When the user clicks the dictionary title in the search results, the program opens the dictionary's basic information page, where the documentation can be viewed. A document can be a text file, an HTML page, or a simple Markdown file.
Dictionary resource (resource). "Resource" refers to data that is not retrieved by searching, typically illustrations. Content that should not appear directly in the dictionary's list of documents can also be placed in resources.
Field category (category). Fields can be assigned categories, which are displayed in the search results. This is relatively rare and is usually found in dictionaries with more complex structures.
Character dictionary. Because the entries of a character dictionary are single characters, they can be retrieved somewhat faster than those of a word dictionary. If the dictionary being compiled is in fact a character dictionary, or contains a large number of single-character entries, you can set the enableCharacterDictionary attribute in the configuration file so that the compiled dictionary stores single-character entries in a character-dictionary structure.
Synonym. The compiler lets you specify synonyms for a headword using a regular expression in the configuration file. The index entries for a synonym and for its headword point to the same explanation content. A synonym can be specified in either the headword line (word) or a content line (row).

Compilation Example

The following walks through compiling a dictionary, using "Herbal Harm" (本草害利) as the example. The source files can be downloaded from the Dictionary Database → Dictionary Source Files → Herbal Harm directory in the program's download space. For convenience, the source files are packed into a 7zip archive; after downloading, decompress the archive and then select the "Herbal Harm" .xml configuration file in the compiler to compile it.

Analyzing Source Document Structure

"Herbal Harm" is a materia medica compiled by Ling Huan of the Qing Dynasty. The book covers about three hundred Chinese herbs. Following the idea that using medicines is like deploying soldiers, the author groups the herbs by internal organ and effect into categories such as "Heart Medicine Troop [Heart-Tonifying Fierce Generals]" and "Lung Medicine Troop [Lung-Warming Fierce Generals]", and explains the "harm", "benefit", and "processing" of each herb one by one. In this example, we use the herb names as headwords. For example, the search result for the herb "Bei Wu Wei" is shown below:

The beginning of the source file looks roughly as shown below (word wrap is turned off, so the text at the end of long lines is hidden).

Preparing the File Before Compilation

Some edits were made to the original file in order to compile it into a dictionary. For example:

A "★" sign is added in front of each herb name; when compiling, we specify that whatever follows the "★" is a headword. The choice of marker is a matter of personal preference, and there are no special requirements.
Herbs often have aliases; for example, "Bei Wu Wei" is also known as "Wu Wei Zi". The original file has only "Bei Wu Wei", so unless we specify the alias, searching for "Wu Wei Zi" will not find this entry. We therefore add "、Wu Wei Zi" after "★Bei Wu Wei" and specify "、" as the synonymDelimiter.
When explaining one herb, the author sometimes mentions other herbs. We would like to click these herb names in the search results and jump to the corresponding entries. For example, "dried ginger" in the figure above is also a herb, so we enclose it in half-width square brackets as "[dried ginger]" to mark it as a clickable word link. This notation is again a matter of personal preference; we specify how it is handled in the configuration file.
Some passages use abbreviated herb names, for example "(Bai Shao) with Gui and Di tonifies Blood", where "Gui" (归) refers to "Dang Gui" (当归, Angelica sinensis) and "Di" (地) refers to "Di Huang" (地黄). We use parentheses to express this relationship and write the text as "with [Gui](Dang Gui)[Di](Di Huang), tonifies Blood". In the search results this is displayed as "with Gui Di, tonifies Blood"; clicking "Gui" retrieves "Dang Gui", and clicking "Di" retrieves "Di Huang".

Looking at the file again, each herb-troop line appears before the herb names it covers, that is, before the "★" headword marker. As a result, the "Heart Medicine Troop [Heart-Tonifying Fierce Generals]" line that appears before "★Bei Wu Wei" would not be included in the "Bei Wu Wei" entry, while the herb-troop line that appears before "★Suan Zao Ren" (sour jujube kernel) would become part of the "Bei Wu Wei" entry. It happens not to matter at that point because the troop content is the same, but "Bei Wu Wei" would still lack its own herb-troop content, and wherever the troop changes later in the file it would be attached to the wrong herb, producing incorrect content. To remedy this, we first use "ignore" to drop the herb-troop lines from the body file, identifying them by the "Medicine Troop [" text they contain. We then use the text editor "BowPad" and regular expressions to extract the relationships between herb troops and herbs into a file called "Herbal Harm Catalog.txt". Its contents are roughly as follows: each line relates one herb to one herb troop, separated by a tab. When compiling, we put this file first and the body file after it, which produces the search result shown at the beginning.

Bei Wu Wei、Wu Wei Zi	Heart Medicine Troop
Suan Zao Ren	Heart Medicine Troop
Bai Zi Ren	Heart Medicine Troop
Yuan Zhi Rou、Yuan Zhi	Heart Medicine Troop
Dan Shen	Heart Medicine Troop
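The article builds the catalog file with regular expressions in BowPad. As an alternative illustration only, the same extraction can be sketched as a short Python loop that remembers the most recent herb-troop line and pairs it with each headword that follows. The marker strings ("药队〔" for troop lines, "★" for headwords) and the sample lines are assumptions made for this sketch, not content copied from the real source file.

```python
def extract_catalog(lines, troop_marker="药队〔", head_marker="★"):
    """Pair every headword line with the herb-troop line that precedes it."""
    current_troop = None
    catalog = []
    for raw in lines:
        line = raw.strip()
        if troop_marker in line:
            # A troop line applies to all headwords until the next troop line.
            current_troop = line
        elif line.startswith(head_marker) and current_troop is not None:
            headword = line[len(head_marker):]
            catalog.append(f"{headword}\t{current_troop}")
    return catalog


# Hypothetical excerpt laid out like the body file described above.
sample = [
    "心部药队〔补心猛将〕",
    "★北五味、五味子",
    "body text",
    "★酸枣仁",
    "肺部药队〔温肺猛将〕",
    "★北五味、五味子",
]

catalog = extract_catalog(sample)
for row in catalog:
    print(row)
```

Each output row has the tab-separated shape of the catalog file: the headword (with its synonyms) first, then the herb troop.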

Compile Configuration File

Here is the compilation configuration file:

<?xml version="1.0" encoding="gbk"?>
<config title="Herbal Harm" author="Ling Huan">
    <fields>
        <!-- Field 1: herb troop; adjacent values are merged with "、" -->
        <field name="category" cssClass="categoryTitle" description="Troops" fieldNameVisible="no"
        mergeAdjacentField="true" adjacentFieldDelimiter="、"/>
        <!-- Field 2: body text, treated as markup; the replacements run in the order listed -->
        <field name="content" description="Content" valueType="Marker" fieldNameVisible="no" fullText="yes">
            <replace pattern="\[([^\]]+)\]\(([^\)]+)\)">
                <replacement><![CDATA[<ref q="$2" title="$2">$1</ref>]]></replacement>
            </replace>
            <replace pattern="\[([^\]]+)\]" replacement="&lt;ref>$1&lt;/ref>"/>
            <replace pattern="〔.+〕" replacement="&lt;b>$0&lt;/b>"/>
        </field>
    </fields>
    <!-- Catalog file (column layout): headword column, then the herb-troop column -->
    <data file="Herbal Harm Catalog.txt" encoding="GBK" synonymDelimiter="、">
        <column field="1"/>
    </data>
    <!-- Body file (row layout): "★" marks a headword; all other lines go to field 2 -->
    <data file="Herbal Harm.txt" encoding="utf-8" synonymDelimiter="、" allowEmptyLine="true">
        <word pattern="^★" removePattern="true"/>
        <row field="2">
            <!-- Plain-text matches: lines containing these strings are skipped; the first one drops the herb-troop lines -->
            <ignore pattern="Drug squad." useRegex="false"/>
            <ignore pattern=" (see " useRegex="false" />
            <ignore pattern="(ibid.)" useRegex="false" />
        </row>
    </data>
</config>

The "title" and "author" attributes that appear in the "config" element indicate the name and author of the dictionary, respectively. These will be displayed on the search results window.

Dictionary Fields

The "field" elements under the "fields" element define two fields. The first field displays the herb troop, and the second displays the body text, such as "harm", "benefit", and "processing".

The code for the first field is as follows:

<field name="category" cssClass="categoryTitle" description="Troops" fieldNameVisible="no"
        mergeAdjacentField="true" adjacentFieldDelimiter="、"/>

The search results window makes it clear that the two fields are displayed differently. The reason is that the first field (name="category") has the attribute cssClass="categoryTitle", which applies the CSS style with the class name "categoryTitle". The available styles are defined in a file in the program directory. By default, the program displays the field description (in this case, "Troops") in the search results; specifying fieldNameVisible="no" hides it.

An herb may belong to more than one herb troop; for example, "Bei Wu Wei、Wu Wei Zi" appears three times in the catalog file above, so the program needs to join the troops together. Specifying mergeAdjacentField="true" in the configuration concatenates adjacent results of the same field, and adjacentFieldDelimiter="、" separates the joined contents with the enumeration comma "、". This produces the string of several herb troops seen in the opening screenshot.
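The effect of merging adjacent results can be pictured with a small Python sketch. It is a rough model of the behaviour described above, not the program's implementation; the function name `merge_adjacent_fields`, the `mergeable` parameter, and the sample values are hypothetical.

```python
from itertools import groupby


def merge_adjacent_fields(results, mergeable, delimiter="、"):
    """Join runs of adjacent results that belong to the same mergeable field."""
    merged = []
    for field, group in groupby(results, key=lambda item: item[0]):
        values = [value for _, value in group]
        if field in mergeable:
            # Adjacent values of a mergeable field collapse into one result.
            merged.append((field, delimiter.join(values)))
        else:
            merged.extend((field, value) for value in values)
    return merged


# Hypothetical lookup results: three troop values followed by the body text.
results = [
    ("category", "troop A"),
    ("category", "troop B"),
    ("category", "troop C"),
    ("content", "body text"),
]

merged = merge_adjacent_fields(results, mergeable={"category"})
print(merged)
```

Only neighbouring results of the same field are joined, so the order of the data files still decides where the merged troop line appears relative to the body text.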

The second field is more extensive and the code is as follows:

<field name="content" description="Content" valueType="Marker" fieldNameVisible="no" fullText="yes">
    <replace pattern="\[([^\]]+)\]\(([^\)]+)\)">
        <replacement><![CDATA[<ref q="$2" title="$2">$1</ref>]]></replacement>
    </replace>
    <replace pattern="\[([^\]]+)\]" replacement="&lt;ref>$1&lt;/ref>"/>
    <replace pattern="〔.+〕" replacement="&lt;b>$0&lt;/b>"/>
</field>

The valueType="Marker" attribute indicates that the content of this field is HTML markup. Without this attribute, all content would be treated as plain text and there would be no jump links.

The "replace" element specifies a replacement that is applied to the field content before the search results are displayed. The value of the "pattern" attribute is a regular expression used to match text, and the "replacement" attribute or element gives the content to substitute. Tokens such as "$1" and "$2" stand for the groups captured by the regular expression.

The first replacement, with the pattern "\[([^\]]+)\]\(([^\)]+)\)", handles text such as the previously mentioned "with [Gui](Dang Gui)[Di](Di Huang), tonifies Blood". "[Gui](Dang Gui)" is replaced with "<ref q="Dang Gui" title="Dang Gui">Gui</ref>". The "ref" element has a special meaning in the program: it denotes a word link. Its "q" attribute is the phrase to be retrieved, its "title" attribute is the tooltip shown when the mouse hovers over the link, and the text of the "ref" element is the visible content of the link. Because this replacement is fairly complex, we place it in a CDATA section to avoid writing unintuitive entity references for the quotes and angle brackets in the replacement text.

The second replacement, with the pattern "\[([^\]]+)\]", handles text such as the previously mentioned "[dried ginger]", which is replaced with "<ref>dried ginger</ref>". Since the text to be retrieved is the same as the text displayed, the "q" attribute is not needed.

The order of the replacements matters: the square-brackets-plus-parentheses pattern must be replaced before the square-brackets-only pattern. If the order were reversed, "[Gui](Dang Gui)" would first be replaced by the rule for "\[([^\]]+)\]", yielding "<ref>Gui</ref>(Dang Gui)", and the rule for "\[([^\]]+)\]\(([^\)]+)\)" would never get the chance to match.
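The ordering issue is easy to reproduce outside the program. The Python sketch below applies the same two patterns in both orders. It is only an approximation of the configuration: Python's `re` module writes group references as `\1` and `\2` where the configuration file uses `$1` and `$2`, and the helper `apply_rules` is a hypothetical name.

```python
import re

# The two link rules from the configuration, rewritten with Python group references.
LINK_WITH_TARGET = (r"\[([^\]]+)\]\(([^\)]+)\)", r'<ref q="\2" title="\2">\1</ref>')
LINK_ONLY = (r"\[([^\]]+)\]", r"<ref>\1</ref>")


def apply_rules(text, rules):
    """Apply each (pattern, replacement) pair in the order given."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


text = "同[归](当归)[地](地黄)，补血"

# Intended order: the bracket-plus-parenthesis rule runs first.
correct = apply_rules(text, [LINK_WITH_TARGET, LINK_ONLY])

# Reversed order: the bracket-only rule consumes "[归]" and leaves "(当归)" behind.
wrong = apply_rules(text, [LINK_ONLY, LINK_WITH_TARGET])

print(correct)
print(wrong)
```

In the reversed run the square brackets are gone by the time the first rule is tried, so the parenthesized targets remain in the text as literal characters.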

The third replacement, with the pattern "〔.+〕", displays markers such as "〔harm〕", "〔benefit〕", and "〔processing〕" in bold.

Data Source Files

The "data" elements that appear after the "fields" element represent the dictionary's data source files. A source file is a text file, typically in either a column layout or a row layout.
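One property of this pattern is worth keeping in mind when adapting it: ".+" is greedy, so if a single line ever contained two bracketed markers, the match would run from the first "〔" to the last "〕". The Python sketch below demonstrates this with made-up input; the lazy variant "〔.+?〕" is a hypothetical alternative that does not come from the article, and `\g<0>` is Python's spelling of the whole-match reference that the configuration writes as `$0`.

```python
import re

GREEDY = re.compile(r"〔.+〕")   # pattern as written in the configuration
LAZY = re.compile(r"〔.+?〕")    # hypothetical alternative, not from the article


def embolden(pattern, text):
    """Wrap every match of the pattern in <b> tags."""
    return pattern.sub(r"<b>\g<0></b>", text)


single = embolden(GREEDY, "〔害〕")
greedy_two = embolden(GREEDY, "〔害〕text〔利〕")
lazy_two = embolden(LAZY, "〔害〕text〔利〕")

print(single)
print(greedy_two)
print(lazy_two)
```

When each marker sits on its own line, as the example dictionary appears to assume, both variants behave the same way.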

In a column layout, each line holds one headword and the columns are separated by tabs; the first column is the headword and the subsequent columns are field data.
In a row layout, which most files use, the program reads the text line by line and uses regular expressions to decide whether each line is a headword or belongs to a field.

The corresponding configuration code for the first source file is as follows:

<data file="Herbal Harm Catalog.txt" encoding="GBK" synonymDelimiter="、">
    <column field="1"/>
</data>

The "encoding" attribute gives the text encoding of the source file. It must match the actual encoding of the file named by the "file" attribute; otherwise the file cannot be read correctly.
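Why a mismatched encoding makes a file unreadable can be seen with a few lines of Python. The sketch encodes a sample headword as GBK and then tries to decode the bytes as UTF-8; it illustrates text encodings in general and makes no claim about how the compiler itself reports such an error.

```python
text = "北五味、五味子"

# Bytes as they would be stored in a GBK-encoded source file.
gbk_bytes = text.encode("gbk")

# Decoding with the matching encoding restores the text.
round_trip = gbk_bytes.decode("gbk")

# Decoding with the wrong encoding fails, because these bytes are not valid UTF-8.
try:
    gbk_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

print(round_trip == text, decoded_as_utf8)
```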

The synonymDelimiter="、" attribute specifies the delimiter that separates synonyms in the headword.

Note the "column" sub-element under the "data" element. It indicates that this file uses the tab-separated column layout. Column 1 is implicitly the headword, and each "column" element describes one of the subsequent columns and the field it maps to. This "column" element therefore corresponds to the second column, and its field="1" attribute maps that column to the field defined by the first "field" element above (the herb troop).

In this example the file has only two columns (roughly as shown below). The first column is the headword; because of synonymDelimiter="、", a headword such as "Bei Wu Wei、Wu Wei Zi" is split into the two headwords "Bei Wu Wei" and "Wu Wei Zi". The second column, such as "Heart Medicine Troop [Heart-Tonifying Fierce Generals]", is the herb-troop field.

Bei Wu Wei、Wu Wei Zi	Heart Medicine Troop
Suan Zao Ren	Heart Medicine Troop
Bai Zi Ren	Heart Medicine Troop
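How one line of a column-layout file breaks down can be sketched in Python as follows. The function `parse_column_line` is a hypothetical helper written for this illustration; it follows the description above (tab-separated columns, headword in column 1, synonyms split on "、") and is not taken from the compiler.

```python
def parse_column_line(line, synonym_delimiter="、"):
    """Split one column-layout line into its headwords and its field columns."""
    head, *columns = line.rstrip("\n").split("\t")
    headwords = head.split(synonym_delimiter)
    return headwords, columns


headwords, columns = parse_column_line("北五味、五味子\t心部药队\n")
print(headwords)
print(columns)
```

Both headwords end up pointing at the same field data, which is how a search for either name finds the entry.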

The code for the second source file is as follows:

<data file="Herbal Harm.txt" encoding="utf-8" synonymDelimiter="、" allowEmptyLine="true">
    <word pattern="^★" removePattern="true"/>
    <row field="2">
        <ignore pattern="Drug squad." useRegex="false"/>
        <ignore pattern=" (see " useRegex="false" />
        <ignore pattern="(ibid.)" useRegex="false" />
    </row>
</data>

The source file contains a large number of blank lines. By default, the program treats a blank line as the end marker of an entry's content, which does not suit the file in this example. We therefore specify allowEmptyLine="true", which allows blank lines to appear within the content of an entry. The compiler then automatically skips the blank lines when compiling.

The "word" element here describes the headword, and the "row" element gives the rule for deciding which field each non-headword line belongs to.

The "pattern" attribute of the "word" element indicates that lines beginning with a five-pointed star "★" are recognized as headwords.

The star only serves to identify the headword; it is not part of the headword text but a marker we added to help the compiler, so it should be removed. The removePattern="true" attribute deletes the text matched by the "pattern" attribute, so that the star at the beginning of the line does not appear in the headword.

Because this "data" element also has the synonymDelimiter="、" attribute, the enumeration comma "、" is used here as well to separate headwords.

The field="2" attribute of the "row" element indicates that this element defines the rule for the second field, the "content" field defined earlier.

Since the file contains only body content and nothing else, no filtering is needed. If a file contains several fields, you can specify a "pattern" attribute, a regular expression that matches the lines belonging to the field.

The three "ignore" elements tell the compiler to skip lines that match their "pattern" attribute. The useRegex="false" attribute indicates that the "pattern" attribute is plain text rather than a regular expression. If a line of text in the source file contains any of the three patterns here, it is ignored.
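The difference between a plain-text pattern and a regular expression matters as soon as the pattern contains characters such as parentheses. The Python sketch below contrasts the two modes; `line_is_ignored` and its `use_regex` parameter are hypothetical names that merely mirror the "useRegex" attribute, and the sample line is made up.

```python
import re


def line_is_ignored(line, pattern, use_regex):
    """Return True if the line should be skipped under the given ignore pattern."""
    if use_regex:
        return re.search(pattern, line) is not None
    # Plain-text mode: a simple substring test, no special characters.
    return pattern in line


sample_line = "text (see above)"

plain_result = line_is_ignored(sample_line, "(see ", use_regex=False)

# Interpreted as a regular expression, the same pattern is an unbalanced group.
try:
    line_is_ignored(sample_line, "(see ", use_regex=True)
    regex_failed = False
except re.error:
    regex_failed = True

print(plain_result, regex_failed)
```

Treating such patterns as plain text avoids having to escape every bracket by hand.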

Within a "row" element, you can also specify "replace" elements to replace some of the content before it is written to the dictionary.

Summary

This article introduced the dictionary compilation tool and the common elements and attributes of the configuration file, and showed how to write a compilation configuration file using "Herbal Harm" as the example. Only the more basic parts of the configuration are covered here; further features will be explained in subsequent articles.
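Putting the pieces of the row layout together, the following Python sketch models how the second data file might be read: lines matching the "word" pattern start a new entry (with the marker removed and synonyms split), ignored lines and blank lines are skipped, and every other line becomes content of the current entry. This is a simplified model under stated assumptions, not the compiler's algorithm; the function name, the order of the checks, and the sample lines (including the "药队〔" marker) are all hypothetical.

```python
import re


def parse_rows(lines, word_pattern=r"^★", synonym_delimiter="、", ignore=()):
    """Group row-layout lines into (headwords, content_lines) entries."""
    entries = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # Modelled on allowEmptyLine="true": a blank line does not end the entry.
            continue
        if re.search(word_pattern, line):
            # Modelled on removePattern="true": strip the marker from the headword.
            head = re.sub(word_pattern, "", line, count=1)
            current = (head.split(synonym_delimiter), [])
            entries.append(current)
        elif any(token in line for token in ignore):
            continue  # plain-text ignore patterns
        elif current is not None:
            current[1].append(line)
    return entries


# Hypothetical excerpt shaped like the body file described in this article.
sample = [
    "心部药队〔补心猛将〕",
    "★北五味、五味子",
    "",
    "〔害〕body text A",
    "",
    "心部药队〔补心猛将〕",
    "★酸枣仁",
    "〔利〕body text B",
]

entries = parse_rows(sample, ignore=("药队〔",))
for headwords, content in entries:
    print(headwords, content)
```

Because the troop lines are dropped here, the troop information has to come from the separate catalog file, which is why that file is compiled first.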

Examples of dictionary source files can be downloaded in the download area (Access code: 8518)

As the design above shows, the dictionary compiler of Dr. Chinese is flexible: it supports several text file layouts, and the concept of fields makes it convenient to structure the logical content of a dictionary. The runtime text replacement mechanism keeps the dictionary source file concise, with no need to embed large numbers of tags, which also reduces the size of the dictionary file. The program's search across simplified, traditional, and variant forms, together with the synonym mechanism, makes words easier to find and lets dictionary creators focus on producing quality content rather than on the functional limits of the dictionary program.