Preface

Chinese corpora are often needed in natural language processing, but high-quality Chinese corpora are hard to find. Wikipedia and Baidu Encyclopedia are relatively good sources. Wikipedia regularly publishes dumps of its Chinese corpus at https://dumps.wikimedia.org/zhwiki/, where you can download the latest version. Baidu Encyclopedia has to be crawled yourself, but others have already crawled it and shared usable corpora, for example at https://pan.baidu.com/share/init?surl=i3wvfil (extraction code: neqs).

This article focuses on how to use the Corpus of Chinese Wikipedia.

Wikipedia Dump

From https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2 you can download the latest Chinese Wikipedia corpus, which is about 1.37 GB. The contents are stored in XML format, so some further processing is needed. The XML node structure looks like the following:

<page>
  <title></title>
  <id></id>
  <timestamp></timestamp>
  <username></username>
  <comment></comment>
  <text xml:space="preserve"></text>
</page>

The meaning of each node is easy to infer from its tag name. The text node holds the article content, which still contains a lot of markup and special symbols that need to be filtered out.
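Before any processing, you can peek at the raw dump to verify this structure. The following is a minimal sketch (not part of the original pipeline; the file name is the archive downloaded above) that streams the bz2 archive with iterparse instead of loading it into memory and prints the first few page titles:

import bz2
import xml.etree.ElementTree as ET

def local_name(tag):
    # strip the XML namespace, e.g. '{http://...}page' -> 'page'
    return tag.rsplit('}', 1)[-1]

count = 0
with bz2.open('zhwiki-latest-pages-articles.xml.bz2', 'rb') as f:
    for _, elem in ET.iterparse(f):
        if local_name(elem.tag) == 'page':
            # each <page> contains <title>, <id> and the article <text>
            title = next((c.text for c in elem if local_name(c.tag) == 'title'), None)
            print(title)
            elem.clear()  # release the finished subtree
            count += 1
            if count >= 5:  # only print the first few titles
                break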

Extract the data

The downloaded corpus requires further extraction, and you can choose one of the following:

  • Write your own extraction program.
  • Extract using Wikipedia Extractor.
  • Extract using the WikiCorpus class in the Gensim library (a sketch follows this list).
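For completeness, here is a minimal sketch of the third option. It assumes a recent Gensim is installed; note that WikiCorpus does its own markup stripping and tokenization, so its output differs from the Wikipedia Extractor output used in the rest of this article, and the output file name wiki_texts.txt is just an example.

from gensim.corpora.wikicorpus import WikiCorpus

# dictionary={} skips building a vocabulary; we only want the article texts
wiki = WikiCorpus('zhwiki-latest-pages-articles.xml.bz2', dictionary={})
with open('wiki_texts.txt', 'w', encoding='utf-8') as out:
    for i, tokens in enumerate(wiki.get_texts()):
        out.write(' '.join(tokens) + '\n')  # one article per line
        if (i + 1) % 10000 == 0:
            print('%d articles processed' % (i + 1))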

Because the Wikipedia corpus contains a lot of markup, and what should be filtered out depends on the usage scenario, you can handle it according to your own needs. Here Wikipedia Extractor is used for the preliminary processing:

git clone https://github.com/attardi/wikiextractor.git wikiextractor

cd wikiextractor

python setup.py install

python WikiExtractor.py -b 1024M -o extracted zhwiki-latest-pages-articles.xml.bz2

The execution output looks like the following; you can see that a total of 965,446 articles were processed.

INFO:root:5913353 Service level management
INFO:root:5913361 Shi Shojong
INFO:root:5913367 The DPRK delegation to the 2018 Winter Olympic Games
INFO:root:5913369 support
INFO:root:5913390 Peng Yuzhen
INFO:root:5913402 Schneider 75mm Ultra Portable Mountain gun
INFO:root:5913435 Ian Maynell
INFO:root:5913442 Lehman 2: Great Escape
INFO:root:5913443 MB
INFO:root:5913445 Bit like
INFO:root:5913446 Tanaka Light
INFO:root:5913450 Toy Commander
INFO:root:5913453 How the Grinch Stole Christmas (Game)
INFO:root:5913457 Ontario Solar Panel Business Proposal
INFO:root:5913458 Tokai Adventure Drama Super Hero Cast
INFO:root:5913465 Toy Story 2
INFO:root:5913467 Canon EOS-1
INFO:root:5913480 Nan Byinggi
INFO: Finished 11-process extraction of 965446 articles in 1138.8s (847.7 art/s)

After the extraction above, two files, wiki_00 and wiki_01, are obtained under the extracted output directory. Their format is similar to the following:

<doc id="5323477" url="https://zh.wikipedia.org/wiki?curid=5323477" title="Structure and agency"> </doc>

Secondary processing

Wikipedia Extractor removes the content of some specially marked spans, but for most use cases this does not matter. All that remains is to remove the <doc> tags it adds and the now-empty symbol pairs it leaves behind, such as empty parentheses （）, empty quotes “” and 「」, and empty book-title marks 《》.

import re
import sys
import codecs

def filte(input_file):
    # empty full-width symbol pairs left behind after extraction
    p1 = re.compile('（）')
    p2 = re.compile('“”')
    p3 = re.compile('「」')
    p4 = re.compile('《》')
    # the <doc ...> and </doc> tags added by Wikipedia Extractor
    p5 = re.compile('<doc (.*)>')
    p6 = re.compile('</doc>')
    outfile = codecs.open('std_' + input_file, 'w', 'utf-8')
    with codecs.open(input_file, 'r', 'utf-8') as myfile:
        for line in myfile:
            line = p1.sub(' ', line)
            line = p2.sub(' ', line)
            line = p3.sub(' ', line)
            line = p4.sub(' ', line)
            line = p5.sub(' ', line)
            line = p6.sub(' ', line)
            outfile.write(line)
    outfile.close()

if __name__ == '__main__':
    input_file = sys.argv[1]
    filte(input_file)
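The script takes the file to clean as its only argument and writes the result to a file with the std_ prefix. For example, saving it as filter_wiki.py (a file name chosen here just for illustration) and running python filter_wiki.py zh_wiki_01 on the simplified file produced in the next section yields std_zh_wiki_01, which is the file segmented at the end of this article.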

From traditional to simplified

The Wikipedia corpus contains a large amount of Traditional Chinese, which we may need to convert into Simplified Chinese; here OpenCC is used for the conversion. There are two ways to use OpenCC:

  • Use the Windows version of OpenCC directly, downloaded from https://bintray.com/package/files/byvoid/opencc/OpenCC, and then run a command to convert the files.
  • Use the Python version of OpenCC (a Python sketch follows this list), which may fail under Python 3.5. Running pip install opencc-python may report ImportError: No module named distribute_setup; in that case download distribute_setup.py from http://download.csdn.net/download/tab_space/9455349, unzip it, and copy the file into the Lib directory of your Python installation. Running the command again may then report chown() missing 1 required positional argument: 'numeric_owner'; to fix this, change self.chown(tarinfo, dirpath) in distribute_setup.py to self.chown(tarinfo, dirpath, '').
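If the Python route works for you, the conversion can also be scripted. The sketch below assumes an opencc package that exposes OpenCC('t2s').convert(), as the opencc-python-reimplemented package on PyPI does; check the API of whichever package you actually install.

from opencc import OpenCC

cc = OpenCC('t2s')  # t2s: Traditional Chinese to Simplified Chinese
with open('wiki_00', encoding='utf-8') as fin, \
     open('zh_wiki_00', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(cc.convert(line))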

With the Windows version of OpenCC, run the following commands to convert the traditional wiki_00 and wiki_01 files into the simplified zh_wiki_00 and zh_wiki_01:

opencc -i wiki_00 -o zh_wiki_00 -c t2s.json

opencc -i wiki_01 -o zh_wiki_01 -c t2s.json

Word segmentation

To segment the text we use jieba. Install the Python jieba package (for example with pip install jieba) and run the following script:

import jieba

# std_zh_wiki_01 is the cleaned, simplified file produced by the previous steps;
# the segmented result is written to cut_std_zh_wiki_01
filename = 'cut_std_zh_wiki_01'
fileneedCut = 'std_zh_wiki_01'
fn = open(fileneedCut, "r", encoding="utf-8")
f = open(filename, "w", encoding="utf-8")
for line in fn:
    words = jieba.cut(line)
    # write the words separated by spaces so later tools can split on whitespace
    f.write(' '.join(words))
f.close()
fn.close()
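As a quick sanity check (a throwaway snippet, not part of the script above), you can segment a single sentence and inspect the output; the exact segmentation depends on jieba's dictionary:

import jieba

print(' '.join(jieba.cut('中文维基百科语料库需要先分词')))
# prints the sentence split into space-separated words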

======== Advertising time ========

My new book, "Analysis of Tomcat Kernel Design", is now on sale on JD.com. If you need it, you can pre-order it at item.jd.com/12185360.ht… Thank you all.

Why I wrote "Analysis of Tomcat Kernel Design"

=========================
