Anyone who has worked on search knows that the quality of word segmentation directly affects the final search results. Segmentation is much simpler in English, because English sentences are separated by spaces. Chinese is another matter: the same word can mean very different things in different contexts, and sometimes you need the surrounding context to know exactly what is meant. That is why Chinese word segmentation has always been a major challenge in the field.

A while ago I introduced pkuseg, a new open-source word segmenter from Peking University. According to its author's test results, it is more accurate and faster than jieba and other segmenters.

So I thought I'd run a simple test myself: use Romance of the Three Kingdoms as the test text and extract the frequencies of the famous characters' names.

First, put together a list of the characters in Romance of the Three Kingdoms.

Once you have the names, build a name list and turn it into a dictionary with the characters' names as keys and 0 as the initial value. I only took the names of Cao Wei and Shu Han; the code is as follows:

wei = ["许褚", "荀攸", "Giffin", "Country", "Freeze", "Volunteers", "Liu", "Catalan", "Iwslt", "华歆", "钟繇", "Wald", "Kimbro", "王朗", "Crumb", "Wargo", "杜畿", "Fields", "Bosco", "The beginning", "The sheen which", "Henton", "Mori", "Maines", "The more kuai", "Poetry", "With", "Jujube only", "曹操", "孟德", "Aldana", "Chen pose", "Xi worries", "桓阶", "Rockett", "丁廙", "司马朗", "Korea and", "韦康", "邴原", "Zhao yanyan", "Polston", "Gilroy", "陈琳", "司马懿", "After", "Coppage", "夏侯惇", "夏侯渊", "庞德", "张郃", "Primm", "乐进", "典韦", "曹洪", "Coss", "曹彰"]
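# shu: the Shu Han name list, built the same way as wei in the original post.
# Only an illustrative subset is shown here; extend it with the full roster.
shu = ["刘备", "关羽", "张飞", "诸葛亮", "赵云", "马超", "黄忠", "魏延", "庞统", "姜维"]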

wei_dict = dict.fromkeys(wei, 0)
shu_dict = dict.fromkeys(shu, 0)

Then download an e-book of Romance of the Three Kingdoms and read it in:

def read_txt():
    # the downloaded e-book; the filename here is assumed
    with open("三国.txt", encoding="utf-8") as f:
        content = f.read()

    return content
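Before counting anything, it helps to see what the two segmenters actually return. A tiny illustration (the sample sentence is arbitrary):

import jieba
import pkuseg

sample = "曹操与刘备青梅煮酒论英雄"

seg = pkuseg.pkuseg()
print(seg.cut(sample))          # pkuseg returns a list of tokens
print(list(jieba.cut(sample)))  # jieba returns a generator, so turn it into a list to compare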

Pkuseg test results

First, instantiate a pkuseg segmenter and count the character names: loop over the tokens produced by segmentation, and whenever a token matches a name in one of the name dictionaries, increment that name's count by 1, as shown in the following code:

import time

import pkuseg


def extract_pkuseg(content):
    start = time.time()
    seg = pkuseg.pkuseg()    # load the default pkuseg model
    text = seg.cut(content)  # a list of tokens

    for name in text:
        if name in wei:
            wei_dict[name] = wei_dict.get(name) + 1
        elif name in shu:
            shu_dict[name] = shu_dict.get(name) + 1

    print(f"pkuseg took: {time.time() - start}")
    print(f"pkuseg total number of names found: {sum(wei_dict.values()) + sum(shu_dict.values())}")

The result is as follows:

Jieba test results

The code is basically the same; only the way the segmenter is invoked differs slightly.

import jieba


def extract_jieba(content):
    start = time.time()
    seg_list = jieba.cut(content)  # a generator of tokens
    for name in seg_list:
        if name in wei:
            wei_dict[name] = wei_dict.get(name) + 1
        elif name in shu:
            shu_dict[name] = shu_dict.get(name) + 1

    print(f"jieba took: {time.time() - start}")
    print(f"jieba total number of names found: {sum(wei_dict.values()) + sum(shu_dict.values())}")
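The post doesn't show how these functions are actually called, so here is a minimal driver sketch. Resetting the count dictionaries between the two runs is my own addition, so that jieba's numbers don't accumulate on top of pkuseg's:

def main():
    global wei_dict, shu_dict
    content = read_txt()

    # pkuseg run
    wei_dict = dict.fromkeys(wei, 0)
    shu_dict = dict.fromkeys(shu, 0)
    extract_pkuseg(content)

    # jieba run, with the counters reset so both tools start from zero
    wei_dict = dict.fromkeys(wei, 0)
    shu_dict = dict.fromkeys(shu, 0)
    extract_jieba(content)


if __name__ == "__main__":
    main()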

The result is as follows:

Emmm, the results are a little unexpected. Wasn't pkuseg supposed to be more accurate? It took nearly three times as long as jieba, and its extraction was no better than jieba's. So I went and did some searching on pkuseg, and this is what I found….

In general, pkuseg feels a bit overhyped. It is not as magical as the author claims, and there is a whiff of attention-grabbing about it; perhaps it is simply a segmenter that focuses more on domain-specific text!
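If domain-specific segmentation really is pkuseg's strength, it can be pointed at a pretrained domain model and a user dictionary. A minimal sketch, assuming the model_name and user_dict options described in the pkuseg README; the dictionary file name here is hypothetical:

import pkuseg

# pick a pretrained domain model ("news", "web", "medicine", "tourism") instead of the default mixed model,
# and supply a user dictionary file with one custom word (e.g. a character name) per line
seg = pkuseg.pkuseg(model_name="news", user_dict="sanguo_names.txt")

print(seg.cut("曹操败走华容道"))

jieba offers a similar facility via jieba.load_userdict, so a fairer rerun would give both tools the same list of character names.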