Law of cosines for text similarity calculation

Preface

Cosine similarity evaluates how similar two vectors are by calculating the cosine of the angle between them. The vectors are plotted in a vector space according to their coordinates, and the cosine of the angle between them measures the difference between the two individuals they represent. The closer the cosine is to 1, the closer the angle is to 0 degrees, that is, the more similar the two vectors are. Conversely, the closer the cosine is to 0, the closer the angle is to 90 degrees and the less similar the two vectors are. Hence the name "cosine similarity".
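To make the definition concrete, here is a minimal sketch of the measure for two plain numeric vectors. The class name CosineSketch and the method cosine are made up for this illustration and are not part of the code later in the article.

```java
/** Minimal sketch: cosine similarity of two equal-length numeric vectors. */
final class CosineSketch {

    /** Returns cos(theta) = (a . b) / (|a| * |b|). */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];     // dot product
            normA += a[i] * a[i];   // squared length of a
            normB += b[i] * b[i];   // squared length of b
        }
        if (normA == 0 || normB == 0) {
            throw new IllegalArgumentException("cosine is undefined for a zero vector");
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // Same direction: the angle is 0, so the cosine is 1.
        System.out.println(cosine(new double[] {1, 2}, new double[] {2, 4}));
        // Perpendicular: the angle is 90 degrees, so the cosine is 0.
        System.out.println(cosine(new double[] {1, 0}, new double[] {0, 3}));
    }
}
```

The vector lengths do not matter, only the directions do, which is why (1, 2) and (2, 4) count as identical under this measure.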

Main text

Review the law of cosines

Let's briefly review some high school math: the law of cosines, which relates the three sides of a triangle to the cosine of one of its angles.

Do you remember the formula? If not, picture the triangle formed by two vectors a and b drawn from the same origin, with the third side c joining their tips.

Let vector a be (xa, ya) and vector b be (xb, 0). How do we calculate the length of each side?

Substituting the length of each side into the law of cosines gives the final formula for the cosine of the angle between the two vectors.
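The formulas themselves were shown as images that are not reproduced here. What follows is a reconstruction under stated assumptions rather than the author's exact figures: the triangle has side lengths a, b and c, θ is the angle between the sides a and b, and the symbols A_i, B_i and n in the last line are introduced only to state the general n-dimensional form.

```latex
\begin{align*}
% Law of cosines for a triangle with sides a, b, c and angle \theta between a and b
\cos\theta &= \frac{a^2 + b^2 - c^2}{2ab} \\[4pt]
% Side lengths from the coordinates \vec{a} = (x_a, y_a), \vec{b} = (x_b, 0)
a &= \sqrt{x_a^2 + y_a^2}, \qquad
b = \sqrt{x_b^2}, \qquad
c = \sqrt{(x_a - x_b)^2 + y_a^2} \\[4pt]
% Numerator: a^2 + b^2 - c^2 = 2 x_a x_b
\cos\theta &= \frac{2\,x_a x_b}{2\sqrt{x_a^2 + y_a^2}\,\sqrt{x_b^2}}
            = \frac{x_a x_b}{\sqrt{x_a^2 + y_a^2}\,\sqrt{x_b^2}} \\[4pt]
% General form for two n-dimensional vectors A and B
\cos\theta &= \frac{\sum_{i=1}^{n} A_i B_i}
                  {\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}
\end{align*}
```

The last line is the form the code works with: the numerator is the dot product of the two word frequency vectors and the denominator is the product of their lengths.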

Text similarity calculation steps

So what are the steps in our text similarity calculation?

1. Segment the text into words. Take two sentences as an example. The first sentence, "Hello, I'm Xiao Wang. I'm a programmer.", is split into Hello / I / am / Xiao Wang / I / am / a / programmer. The second sentence, "Hello, I'm a designer.", is split into Hello / I / am / designer. (The word lists are word-for-word renderings of the Chinese originals, so they do not match the English sentences exactly.)
2. Count word frequencies: for every word that appears in any of the sentences, count how many times it occurs in the current sentence. First sentence: Hello 1, I 2, am 2, Xiao Wang 1, a 1, programmer 1, designer 0. Second sentence: Hello 1, I 1, am 1, Xiao Wang 0, a 0, programmer 0, designer 1.
3. Build the word frequency vectors. First sentence: (1, 2, 2, 1, 1, 1, 0). Second sentence: (1, 1, 1, 0, 0, 0, 1).
4. Plug the vectors into the formula above to calculate the similarity.
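The counting, vector-building and calculation steps above can be sketched without any segmentation library by starting from word lists written out by hand. The class WordFrequencySketch and its methods are hypothetical names for this illustration, and the token lists are English glosses of the example segmentation, not the output of a real segmenter.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: word frequency vectors and cosine similarity for two pre-segmented sentences. */
final class WordFrequencySketch {

    /** Maps each word to {count in the first sentence, count in the second sentence}. */
    static Map<String, int[]> countWords(List<String> first, List<String> second) {
        // LinkedHashMap keeps the words in first-seen order, which makes the vectors easy to read.
        Map<String, int[]> counts = new LinkedHashMap<>();
        for (String word : first) {
            counts.computeIfAbsent(word, k -> new int[2])[0]++;
        }
        for (String word : second) {
            counts.computeIfAbsent(word, k -> new int[2])[1]++;
        }
        return counts;
    }

    /** Dot product divided by the product of the vector lengths (NaN if a sentence is empty). */
    static double similarity(Map<String, int[]> counts) {
        double dot = 0, normA = 0, normB = 0;
        for (int[] c : counts.values()) {
            dot += c[0] * c[1];
            normA += c[0] * c[0];
            normB += c[1] * c[1];
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // Hand-written token lists standing in for the segmenter output (an assumption).
        List<String> s1 = Arrays.asList("Hello", "I", "am", "Xiao Wang", "I", "am", "a", "programmer");
        List<String> s2 = Arrays.asList("Hello", "I", "am", "designer");
        Map<String, int[]> counts = countWords(s1, s2);
        counts.forEach((word, c) -> System.out.println(word + " -> " + Arrays.toString(c)));
        System.out.println(similarity(counts));
    }
}
```

For these two lists the dot product is 5 and the squared lengths are 12 and 4, so the similarity is 5 / sqrt(48), roughly 0.7217.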

Adding the IKAnalyzer dependency with Maven

IKAnalyzer is used here to implement simple word segmentation.

<dependency>
    <groupId>com.janeluo</groupId>
    <artifactId>ikanalyzer</artifactId>
    <version>2012_u6</version>
</dependency>

The IKUtils word segmentation utility class is fairly simple: its only method returns the segmented words of a statement as a List.

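import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

/**
 * @author wangzh
 */
public class IKUtils {

    /**
     * Returns the result of text segmentation in List format.
     *
     * @param text the text to segment
     * @return the segmented words, or null if the text is null or blank
     */
    public static List<String> divideText(String text) {
        if (null == text || "".equals(text.trim())) {
            return null;
        }
        List<String> resultList = new ArrayList<>();
        StringReader re = new StringReader(text);
        IKSegmenter ik = new IKSegmenter(re, true); // true = smart segmentation mode
        Lexeme lex;
        try {
            while ((lex = ik.next()) != null) {
                resultList.add(lex.getLexemeText());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // do not silently swallow read errors
        }
        return resultList;
    }
}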

Below is the main code logic, with the steps commented in the code

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Analysis {
    public static void main(String[] args) {
        // word -> {count in text1, count in text2}
        Map<String, int[]> resultMap = new HashMap<>();
        // Test text (English renderings; the output shown below appears to come from the original Chinese sentences)
        String text1 = "Hello, I'm Xiao Wang. I'm a programmer.";
        String text2 = "Hello, I'm a designer.";
        // Count the word frequencies of each sentence
        statistics(resultMap, IKUtils.divideText(text1), 1);
        statistics(resultMap, IKUtils.divideText(text2), 0);
        // Accumulate the dot product and the squared lengths of both vectors
        final Calculation calculation = new Calculation();
        resultMap.forEach((k, v) -> {
            calculation.setNumerator(calculation.getNumerator() + v[0] * v[1]);
            calculation.setElementA(calculation.getElementA() + v[0] * v[0]);
            calculation.setElementB(calculation.getElementB() + v[1] * v[1]);
        });

        System.out.println("Text similarity: " + calculation.result());
    }

    /**
     * Combines the word frequency vectors.
     *
     * @param map       word -> {count in the first sentence, count in the second sentence}
     * @param words     segmented words of one sentence
     * @param direction 1 for the first sentence, any other value for the second
     */
    private static void statistics(Map<String, int[]> map, List<String> words, int direction) {
        if (null == words || words.size() == 0) {
            return;
        }
        boolean flag = direction(direction);
        for (String word : words) {
            int[] wordD = map.get(word);
            if (null == wordD) {
                // First occurrence of the word: start its counter for the current sentence
                map.put(word, flag ? new int[] {1, 0} : new int[] {0, 1});
            } else if (flag) {
                wordD[0]++;
            } else {
                wordD[1]++;
            }
        }
    }

    // Tells the two sentences apart
    private static boolean direction(int direction) {
        return direction == 1;
    }
}

Class used to calculate cosine similarity

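public class Calculation {

    private double elementA;  // sum of squared word counts of the first sentence
    private double elementB;  // sum of squared word counts of the second sentence
    private double numerator; // dot product of the two word frequency vectors

    public double result() {
        return numerator / Math.sqrt(elementA * elementB);
    }

    public double getElementA() { return elementA; }
    public void setElementA(double elementA) { this.elementA = elementA; }
    public double getElementB() { return elementB; }
    public void setElementB(double elementB) { this.elementB = elementB; }
    public double getNumerator() { return numerator; }
    public void setNumerator(double numerator) { this.numerator = numerator; }
}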

Output result:

Text similarity: 0.7216878364870323

The result shows that the two sentences are fairly similar. In layman's terms, they are about 72% similar.

Figure reference:

www.jianshu.com/p/f4606ae11…

The posts from my public account are also synced to a GitHub repository. If you find them useful, a star would be appreciated. Writing takes effort, so thank you for your support.

Github.com/PeppaLittle…
