
Text Similarity with HanLP Chinese Language Processing



Contents
  • Preface
  • Requirement
  • Introduction
  • Getting started
    • 1. Adding dependencies
    • 2. Text similarity toolkit
    • 3. Case validation
    • 4. Validation results
  • Summary


Preface

Please respect original knowledge sharing and remember my blog: South of the South i


Tip: below is the body of this post; the examples that follow are for reference.

Requirement

When we need to find the percentage similarity between two or more strings, the HanLP Chinese language processing package can help: it lets us compute the similarity percentage and the Hamming distance between two texts.

Introduction

HanLP (Han Language Processing) is a powerful toolkit for Chinese semantic processing, developed by the Research Center for Natural Language Processing and Social Humanities Computing at the Institute of Computing Technology, Chinese Academy of Sciences. It provides functions for a variety of natural language processing tasks, including word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, sentiment analysis, text classification, and more.

Key features (a short usage example follows this list):

  • Word segmentation: HanLP provides a variety of segmentation models, including rule-based and neural-network-based models, for accurate Chinese word segmentation.
  • Part-of-speech tagging: on top of segmentation, HanLP can tag each word's part of speech, which helps in understanding its role in the sentence.
  • Named entity recognition: HanLP can recognize named entities in text, such as names of people, places, and organizations, which matters for tasks like information extraction.
  • Dependency parsing: by modeling the dependencies between the words of a sentence, HanLP can analyze sentence structure and extract its semantic information.
  • Sentiment analysis: HanLP supports sentiment analysis to determine the emotional tendency a text expresses, such as positive, negative, or neutral.
  • Text classification: HanLP can also classify text according to a preset classification scheme.
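
For a quick taste of the API (a minimal sketch, assuming the com.hankcs:hanlp portable artifact from section 1 below is on the classpath; HanLPTaste is just an illustrative class name):

    import com.hankcs.hanlp.HanLP;
    import com.hankcs.hanlp.seg.common.Term;
    import java.util.List;

    public class HanLPTaste {
        public static void main(String[] args) {
            // Segment a sentence; each Term carries the token and its part of speech.
            List<Term> terms = HanLP.segment("商品和服务");
            for (Term term : terms) {
                System.out.println(term.word + "\t" + term.nature);
            }
        }
    }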

Links: HanLP project home | HanLP download address | detailed introduction

Getting started

Note: this article uses the Java development language for its case study.

1. Adding dependencies

        <!-- junit dependency -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <!-- hanlp language-processing dependency -->
        <dependency>
            <groupId>com.hankcs</groupId>
            <artifactId>hanlp</artifactId>
            <version>portable-1.7.6</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.4</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.3</version>
        </dependency>

2. Text similarity toolkit

import com.hankcs.hanlp.seg.common.Term;
import com.hankcs.hanlp.tokenizer.StandardTokenizer;
import org.apache.commons.lang3.StringUtils;
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

import java.math.BigInteger;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class MySimHash {
    private String tokens;          // the input string
    private BigInteger strSimHash;  // the string's simhash fingerprint
    private int hashbits = 64;      // number of bits in the fingerprint

    public MySimHash(String tokens) {
        this.tokens = tokens;
        this.strSimHash = this.simHash();
    }

    private MySimHash(String tokens, int hashbits) {
        this.tokens = tokens;
        this.hashbits = hashbits;
        this.strSimHash = this.simHash();
    }

    /**
     * Clears HTML tags and whitespace from the input.
     *
     * @param content raw text, possibly containing HTML
     * @return cleaned, lower-cased text
     */
    private String cleanResume(String content) {
        // If the input is HTML, this filters out all HTML tags.
        content = Jsoup.clean(content, Whitelist.none());
        content = StringUtils.lowerCase(content);
        String[] strings = {" ", "\n", "\r", "\t", "\\r", "\\n", "\\t", "&nbsp;"};
        for (String s : strings) {
            content = content.replaceAll(s, "");
        }
        return content;
    }

    /**
     * Computes the simhash fingerprint of the whole string.
     */
    private BigInteger simHash() {
        tokens = cleanResume(tokens); // cleanResume removes some special characters
        int[] v = new int[this.hashbits];
        // 1. Split the string into words.
        List<Term> termList = StandardTokenizer.segment(this.tokens);
        // Special handling of the terms: weight by part of speech,
        // filter out punctuation, filter out over-frequent words, etc.
        Map<String, Integer> weightOfNature = new HashMap<String, Integer>(); // weight per part of speech
        weightOfNature.put("n", 2); // nouns get a weight of 2
        Map<String, String> stopNatures = new HashMap<String, String>(); // stopped parts of speech, e.g. punctuation
        stopNatures.put("w", "");
        int overCount = 5; // cut-off for over-frequent words
        Map<String, Integer> wordCount = new HashMap<String, Integer>();
        for (Term term : termList) {
            String word = term.word;                // the token itself
            String nature = term.nature.toString(); // its part of speech
            // Filter over-frequent words.
            if (wordCount.containsKey(word)) {
                int count = wordCount.get(word);
                if (count > overCount) {
                    continue;
                }
                wordCount.put(word, count + 1);
            } else {
                wordCount.put(word, 1);
            }
            // Filter stopped parts of speech.
            if (stopNatures.containsKey(nature)) {
                continue;
            }
            // 2. Hash each token into a fixed-length integer, e.g. a 64-bit value.
            BigInteger t = this.hash(word);
            for (int i = 0; i < this.hashbits; i++) {
                BigInteger bitmask = new BigInteger("1").shiftLeft(i);
                // 3. Walk an integer array of length 64 (assuming a 64-bit fingerprint;
                // any width works): for every 1-bit of the token's hash add the token's
                // weight at that position, and for every 0-bit subtract it, until every
                // token's hash has been processed.
                int weight = 1; // default weight
                if (weightOfNature.containsKey(nature)) {
                    weight = weightOfNature.get(nature);
                }
                if (t.and(bitmask).signum() != 0) {
                    // Accumulate the weighted feature vector of the whole document.
                    v[i] += weight;
                } else {
                    v[i] -= weight;
                }
            }
        }
        // 4./5. Reduce the vector to a 0/1 fingerprint: positions >= 0 become 1.
        BigInteger fingerprint = new BigInteger("0");
        for (int i = 0; i < this.hashbits; i++) {
            if (v[i] >= 0) {
                fingerprint = fingerprint.add(new BigInteger("1").shiftLeft(i));
            }
        }
        return fingerprint;
    }

    /**
     * Performs the hash calculation for a single token.
     *
     * @param source the token
     * @return its hash as a BigInteger
     */
    private BigInteger hash(String source) {
        if (source == null || source.length() == 0) {
            return new BigInteger("0");
        } else {
            /**
             * When source is too short the hash algorithm degenerates,
             * so short tokens are padded by repeating their first character.
             */
            while (source.length() < 3) {
                source = source + source.charAt(0);
            }
            char[] sourceArray = source.toCharArray();
            BigInteger x = BigInteger.valueOf(((long) sourceArray[0]) << 7);
            BigInteger m = new BigInteger("1000003");
            BigInteger mask = new BigInteger("2").pow(this.hashbits).subtract(new BigInteger("1"));
            for (char item : sourceArray) {
                BigInteger temp = BigInteger.valueOf((long) item);
                x = x.multiply(m).xor(temp).and(mask);
            }
            x = x.xor(new BigInteger(String.valueOf(source.length())));
            if (x.equals(new BigInteger("-1"))) {
                x = new BigInteger("-2");
            }
            return x;
        }
    }

    /**
     * Calculates the Hamming distance; the smaller the distance, the more similar the texts.
     *
     * @param other the fingerprint to compare against
     * @return the number of differing bits
     */
    private int hammingDistance(MySimHash other) {
        BigInteger m = new BigInteger("1").shiftLeft(this.hashbits).subtract(
                new BigInteger("1"));
        BigInteger x = this.strSimHash.xor(other.strSimHash).and(m);
        int tot = 0;
        while (x.signum() != 0) {
            tot += 1;
            x = x.and(x.subtract(new BigInteger("1"))); // clear the lowest set bit
        }
        return tot;
    }

    /**
     * Converts the Hamming distance into a similarity percentage.
     *
     * @param s2 the fingerprint to compare against
     * @return similarity in [0, 1]
     */
    public double getSemblance(MySimHash s2) {
        double i = (double) this.hammingDistance(s2);
        return 1 - i / this.hashbits;
    }

}

Details:

/*-------------------------------------- similarity algorithm notes --------------------------------------
    Borrowing the idea behind a hashmap, we want a key that can be looked up directly. Because simhash
    is a locality-sensitive hash, similar strings differ in only a few individual bits of the fingerprint,
    so we can infer that two similar texts share at least 16 identical bits, i.e. at least one of the four
    16-bit segments of a 64-bit simhash matches exactly. Whether to index on 16-, 8-, or 4-bit segments
    should be decided by testing on your own data: shorter segments filter candidates more accurately,
    but storage grows.
    Storing the fingerprint as four 16-bit segments takes four times the space of a single simhash. We
    previously measured 50 million (5000w) records at 382 MB, so four times that is about 1.5 GB, which
    is still acceptable :) With this index, a simhash lookup drops to under 1 millisecond. How can one
    extra level of hashing be so powerful? Do the math: instead of 50 million sequential comparisons, the
    first 16 bits become a hash lookup, and only the matching bucket is scanned sequentially. How many
    sequential comparisons remain? 2^16 = 65536, and 50,000,000 / 65536 ≈ 763, so the final chained-list
    scan compares only about 763 records. Efficiency improves enormously: the whole similarity check over
    50 million records completes in about 3.6 milliseconds.
    A second issue: if duplicate texts are submitted at the same moment, only one short text should be
    kept, so how is that comparison resolved? Once the above is in place, this is not a problem either.
    Even if multiple threads submit similarity requests, the similarity server still computes linearly:
    two requests arriving together are queued and processed one after another, and the first text is added
    to the simhash library before the second is processed. With a queue on the server side, the situation
    of simultaneous requests that cannot be judged never arises.
    How does simhash handle short texts? Think of it another way: use simhash as a first,
    locality-sensitive pass to narrow the comparison range. Once only ~700 candidates remain, even a slow
    but very accurate measure such as edit distance can handle them; if that still feels slow, a somewhat
    more efficient similarity measure such as cosine similarity also works.
     */
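
The segment-index trick described above can be sketched in a few lines (an illustrative sketch, not part of the article's toolkit; the class and method names are hypothetical):

    import java.math.BigInteger;

    public class SimHashIndexSketch {
        // Split a 64-bit fingerprint into four 16-bit segments. By the pigeonhole
        // principle, two fingerprints within Hamming distance 3 must agree exactly
        // on at least one segment, so each segment can serve as an index key.
        public static int[] segments(BigInteger fingerprint) {
            long bits = fingerprint.longValue();
            int[] seg = new int[4];
            for (int i = 0; i < 4; i++) {
                seg[i] = (int) ((bits >>> (16 * i)) & 0xFFFFL); // i-th 16-bit block
            }
            return seg;
        }
    }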

    /* -------------------------------------- word-segmentation notes --------------------------------------
    1. Segmentation: segment the text to be judged into the feature words of the document, remove the
    noise words, and assign each remaining word a weight; assume five weight levels (1 to 5). Because
    similar strings differ in only a few bits, we can infer that two texts are similar. For example:
    "U.S. 'Area 51' employees say there are nine flying saucers inside and they have seen gray aliens"
    segments into "United States(4) Area 51(5) employees(3) say(1) inside(2) there are(1) nine(3)
    flying saucers(5) have(1) seen(3) gray(4) aliens(5)", where the number in parentheses is the word's
    importance in the whole sentence: the larger the number, the more important the word.
    2. Hashing: turn each word into a hash value, e.g. "United States" hashes to 100101 and "Area 51"
    hashes to 101011. Our string has now become a string of numbers; recall from the beginning of the
    article that turning the article into numbers is what makes similarity computation fast. This is the
    dimensionality reduction in progress.
    3. Weighting: turn each hash into a weighted sequence according to the word's weight: for
    "United States" (hash 100101, weight 4) this gives "4 -4 -4 4 -4 4"; for "Area 51" (hash 101011,
    weight 5) it gives "5 -5 5 -5 5 5".
    4. Merging: add the per-word sequences position by position into a single sequence. For the two
    words above: "4+5 -4-5 -4+5 4-5 -4+5 4+5" = "9 -9 1 -1 1 9". (Only two words are shown here; the
    real computation sums the sequences of all words.)
    5. Dimensionality reduction: turn the "9 -9 1 -1 1 9" from step 4 into a 0/1 string, where each
    position greater than 0 becomes 1 and each position less than 0 becomes 0, forming the final simhash
    signature: "1 0 1 0 1 1".
     */
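
Steps 3 to 5 can be verified with a tiny worked example (a toy sketch using the illustrative six-bit hashes and weights from the text above):

    public class SimHashStepsDemo {
        public static void main(String[] args) {
            int[][] bits = {{1, 0, 0, 1, 0, 1},   // "United States", hash 100101
                            {1, 0, 1, 0, 1, 1}};  // "Area 51", hash 101011
            int[] weights = {4, 5};
            int[] v = new int[6];
            for (int w = 0; w < bits.length; w++) {
                for (int i = 0; i < 6; i++) {
                    // step 3 (weighting) and step 4 (merging) in one pass
                    v[i] += bits[w][i] == 1 ? weights[w] : -weights[w];
                }
            }
            StringBuilder fp = new StringBuilder();
            for (int i = 0; i < 6; i++) {
                fp.append(v[i] > 0 ? '1' : '0'); // step 5: dimensionality reduction
            }
            System.out.println(fp); // prints 101011, from v = [9, -9, 1, -1, 1, 9]
        }
    }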


    /*-------------------------------------- locality-sensitive hash notes --------------------------------------
    Traditional hash functions solve the problem of generating unique values: md5 can generate a unique
    signature string, and if the input differs by even one character the two md5 values look far apart;
    a hashmap likewise uses hashing for key-value lookup, a data structure built for fast insertion and
    retrieval. What we need to solve, however, is text similarity: deciding whether two articles resemble
    each other after both have been reduced to hashcodes. The simhash used here is a locality-sensitive
    hash: even after an article has been reduced to a 0/1 string, the result can still be used to compute
    similarity, which a traditional hashcode cannot. Try it on two texts that differ by only a few
    characters, "your mother called you home for dinner oh, home Luo home Luo" versus "your mom called
    you home for dinner, home Luo home Luo": their simhash fingerprints remain close, while their md5
    values look completely different.
     */
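
To see this concretely, the MySimHash class from section 2 can be pointed at the near-duplicate pair quoted above (a usage sketch; the sentences are the original Chinese pair behind the translation, and the exact value depends on the segmentation model):

    public class NearDuplicateDemo {
        public static void main(String[] args) {
            // Two sentences that differ by only a couple of characters.
            MySimHash a = new MySimHash("你妈妈喊你回家吃饭哦，回家罗回家罗");
            MySimHash b = new MySimHash("你妈妈叫你回家吃饭啦，回家罗回家罗");
            // Near-duplicates keep a high semblance (small Hamming distance),
            // whereas the md5 digests of the two strings would look completely different.
            System.out.println("similarity = " + a.getSemblance(b));
        }
    }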

    /* -------------------------------------- filtering notes --------------------------------------
    After segmentation, the noise words are removed and every remaining word carries a weight that
    encodes its importance in the whole sentence: the larger the number, the more important the word.
    Stopped parts of speech (e.g. punctuation) are dropped and over-frequent words are capped. One
    caveat: in very short texts, heavily repeated information is not filtered out well, which is one
    reason plain simhash is weak on short text; see the two-stage strategy (simhash first, then edit
    distance or cosine similarity) described in the similarity algorithm notes above.
     */

3. Case validation

    // Declared inside the MySimHash class so it can call hammingDistance directly.
    public static void main(String[] args) {
        String text = "Xingnan District 1";
        List<String> itemList = new LinkedList<String>();
        itemList.add("Xingnan District 1");
        itemList.add("Xingnan District 2");
        itemList.add("Xingnan District 3");
        itemList.add("Xingnan District 4");
        itemList.add("Xingnan District 5");
        itemList.add("Xingnan District 6");
        itemList.add("Xingnan District 7");
        itemList.add("Xingnan District 8");
        itemList.add("Xingnan District 9");
        itemList.add("Xingnan District 10");
        System.out.println("======================================");
        long startTime = System.currentTimeMillis();
        MySimHash hash1 = new MySimHash(text, 64);
        List<Double> list = new LinkedList<Double>();
        for (String str : itemList) {
            MySimHash hash2 = new MySimHash(str, 64);
            // The smaller the Hamming distance, the more similar the two texts.
            System.out.println("Hamming distance:" + hash1.hammingDistance(hash2)
                    + "###" + "Text similarity:" + hash2.getSemblance(hash1));
            list.add(hash2.getSemblance(hash1));
        }
        long endTime = System.currentTimeMillis();
        double max = Collections.max(list);
        int index = list.indexOf(max); // index of the most similar candidate
        System.out.println("======================================");
        System.out.println("Elapsed time:" + (endTime - startTime));
        System.out.println("Compared text:" + text + "####" + "Maximum similarity:" + itemList.get(index));
    }

4. Validation results

(Screenshot of the console output from running the validation case.)

Summary

I am South of the South i. Recording the little drops: grow a little every day, for learning never ends! Please include a link to the original article when reprinting!!!
