A few days ago I built a [WeChat room-finding robot](https://zhuanlan.zhihu.com/p/23798922). While crawling Douban rental groups, I found many agents posting in disguise, with titles like the following:

Monthly payment without intermediary Fangzhuang subway near Fangcheng Park area 1 single room rental

Monthly payment without intermediary Fangzhuang subway near Fangcheng Park area 1 master bedroom second bedroom rental

Fangzhuang subway near Fangcheng Park area 1 second bedroom rental

Douban has a simple filter for group posts: if a title exactly matches an existing one, the post is deleted outright. Agents therefore usually change a few words in the title and shuffle the order of the content. This problem has simple solutions, such as cosine similarity, described below.
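To see why an exact-match filter is easy to evade, here is a minimal sketch. `exact_dedupe` is a hypothetical stand-in: Douban's actual filter is not public, so the code only assumes the whole-string comparison described above.

```python
def exact_dedupe(titles):
    """Keep the first occurrence of each title, comparing whole strings."""
    seen = set()
    kept = []
    for title in titles:
        if title not in seen:
            seen.add(title)
            kept.append(title)
    return kept


titles = [
    "Monthly payment without intermediary Fangzhuang subway near Fangcheng Park area 1 single room rental",
    "Monthly payment without intermediary Fangzhuang subway near Fangcheng Park area 1 master bedroom second bedroom rental",
    "Fangzhuang subway near Fangcheng Park area 1 second bedroom rental",
]

# No two strings are identical, so every title survives the filter.
print(len(exact_dedupe(titles)))
```

Changing a single word is enough to get past this kind of check, which is why a similarity score is more useful than equality.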

The principle behind cosine similarity is simple. For background, see [Applications of TF-IDF and cosine similarity (2): finding similar articles](http://www.ruanyifeng.com/blog/2013/03/cosine_similarity.html).
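For reference, the standard definition for two n-dimensional weight vectors is given below. The symbols A and B are my own notation, not taken from the linked article.

```latex
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \, \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```

When all weights are non-negative, as with keyword weights, the value lies between 0 and 1.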

Python is implemented as follows:

(1) Segment the text and extract keywords

import jieba.analyse

def cut_word(content):
    # Extract the top 20 keywords together with their weights
    tags = jieba.analyse.extract_tags(content, withWeight=True, topK=20)
    return tags

Note: jieba is extremely handy for this step.
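With `withWeight=True`, `extract_tags` returns a list of `(word, weight)` pairs. For experimenting without jieba on whitespace-separated text, a rough stand-in can produce the same shape. Note that `cut_word_tf` is a hypothetical helper that uses plain term frequency, not jieba's TF-IDF, so its weights will differ.

```python
from collections import Counter


def cut_word_tf(content, top_k=20):
    """Return up to top_k (word, weight) pairs, weight = relative frequency.

    Assumption: words are separated by whitespace, which does not hold
    for Chinese text; that case needs a real segmenter such as jieba.
    """
    words = content.lower().split()
    counts = Counter(words)
    total = float(len(words))
    return [(word, count / total) for word, count in counts.most_common(top_k)]


print(cut_word_tf("bedroom near subway bedroom for rent", top_k=2))
```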

(2) Build the keyword weight vectors for the two texts

def merge_tag(tag1, tag2):
    # Map each keyword to its weight
    tag_dict1 = dict(tag1)
    tag_dict2 = dict(tag2)
    # Union of the keywords of both texts (works on Python 2 and 3)
    merged_tag = set(tag_dict1) | set(tag_dict2)
    # A keyword missing from a text gets weight 0, so both vectors
    # share the same dimensions in the same order
    v1 = [tag_dict1.get(i, 0) for i in merged_tag]
    v2 = [tag_dict2.get(i, 0) for i in merged_tag]
    return v1, v2
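As a small illustration of the alignment step, the sketch below uses made-up keyword weights. `align_weights` is a hypothetical variant that sorts the vocabulary so the output order is reproducible; the order does not affect the cosine value as long as both vectors use the same one.

```python
def align_weights(tags1, tags2):
    """Align two (word, weight) lists onto a shared, sorted vocabulary."""
    d1, d2 = dict(tags1), dict(tags2)
    vocab = sorted(set(d1) | set(d2))
    v1 = [d1.get(word, 0.0) for word in vocab]
    v2 = [d2.get(word, 0.0) for word in vocab]
    return vocab, v1, v2


# Made-up weights, for illustration only
tags_a = [("subway", 0.5), ("bedroom", 0.3)]
tags_b = [("subway", 0.4), ("balcony", 0.2)]

vocab, v1, v2 = align_weights(tags_a, tags_b)
print(vocab)  # ['balcony', 'bedroom', 'subway']
print(v1)     # [0.0, 0.3, 0.5]
print(v2)     # [0.2, 0.0, 0.4]
```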

(3) Calculate cosine similarity

from math import sqrt

def dot_product(v1, v2):
    return sum(a * b for a, b in zip(v1, v2))

def magnitude(vector):
    return sqrt(dot_product(vector, vector))

def similarity(v1, v2):
    # The small constant avoids division by zero when a vector is all zeros
    return dot_product(v1, v2) / (magnitude(v1) * magnitude(v2) + .00000000001)
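A quick numeric check can make the calculation concrete. The following is a self-contained sketch with toy vectors. Unlike the version above, this hypothetical `cosine` helper returns 0.0 for a zero vector instead of adding a small constant to the denominator.

```python
from math import sqrt


def cosine(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0


# Two shared dimensions out of three: 2 / (sqrt(2) * sqrt(3))
print(round(cosine([1, 1, 0], [1, 1, 1]), 6))  # 0.816497
# No shared dimensions: the vectors are orthogonal
print(cosine([1, 0], [0, 1]))  # 0.0
```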

(4) Results

Take a few posts as samples. Each content string is the post title plus its details.

# https://www.douban.com/group/topic/93410497/
content1 = u"""month pay without mediation Fangzhuang near the subway Fang city garden area single room rent my house in fangzhuang fang city garden area near the subway, formal district building, three share, Now rent a master bedroom and a second bedroom with a small balcony, indoor appliances are complete, refrigerator, washing machine and so on, can take a bath, Internet access, cooking can be, the community is convenient, accessible, Hope is to stay near normal work friends"""

# https://www.douban.com/group/topic/93410328/
content2 = u"""month pay without mediation Fangzhuang near the subway Fang city garden area master bedroom second bedroom for rent My house is located in fangcheng Garden Area 1, near Fangzhuang Subway. It is a regular residential building with three families living together. Now I rent a master bedroom and a secondary bedroom with a small balcony. Hope is to stay near normal work friends"""

# https://www.douban.com/group/topic/93410308/
content3 = u"""fangzhuang near the subway Fang city garden area second bedroom for rent My house is in fangzhuang fang city garden area near the subway, Regular residential buildings, three families live together, now rent a master bedroom and a second bedroom with a small balcony, complete indoor appliances, refrigerators, washing machines and so on, can take a shower, Internet access, cooking can be, the community is convenient, accessible, Hope is to stay near normal work friends"""

# https://www.douban.com/group/topic/93381171/
content4 = u"""on yu ting bridge next to the next month after 27 can stay 2 r Fangzhuang Fanggu Garden area 1 downstairs no. 5 due to rent on 27th, I am the owner of the broker fee, the New Year rent 6000 yuan a month to pay three, the primary and secondary bedroom can be separated. It takes 5 minutes to puhuangyu Station of Metro Line 5. The house is 60 level facing with fixed parking space for guard."""

The test results are as follows. The closer the value is to 1, the more similar the two posts are.

content1 and content2 similarity: 0.968802386285
content1 and content3 similarity: 0.926323584519
content2 and content3 similarity: 0.921885685549
content2 and content4 similarity: 0.174889264654

This suggests that content1, content2, and content3 are very likely a batch of posts from the same agent.
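One way to turn pairwise scores into batches is to link every pair above a cutoff and collect the connected groups. The sketch below does this with a small union-find. The function name `group_similar` and the 0.9 threshold are assumptions for illustration, not values from the original code; the scores are the labeled ones reported above.

```python
def group_similar(scores, threshold=0.9):
    """Group items connected by at least one pair scoring >= threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        root_a, root_b = find(a), find(b)
        if score >= threshold and root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for item in parent:
        groups.setdefault(find(item), []).append(item)
    return sorted(sorted(group) for group in groups.values())


scores = {
    ("content1", "content2"): 0.968802386285,
    ("content1", "content3"): 0.926323584519,
    ("content2", "content4"): 0.174889264654,
}
print(group_similar(scores))
# [['content1', 'content2', 'content3'], ['content4']]
```

A suitable threshold would have to be tuned on real data; too low a value merges unrelated listings for the same neighborhood.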

This is only a simple experiment. For better results, you would need a corpus for this kind of keyword and then weight the keywords accordingly.
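One possible reading of "a corpus plus weighting" is to compute inverse document frequency (IDF) over a collection of rental posts, so that words appearing in almost every post count for less. The sketch below is a pure-Python illustration under that assumption. The helper names and the toy corpus are hypothetical, and the smoothed formula is one common variant rather than the one jieba uses. jieba can also load a custom IDF file through `jieba.analyse.set_idf_path`.

```python
from math import log


def inverse_document_frequency(tokenized_posts):
    """Return {word: idf} with the smoothed form log((1 + N) / (1 + df)) + 1."""
    n_posts = len(tokenized_posts)
    doc_freq = {}
    for words in tokenized_posts:
        for word in set(words):  # count each word once per post
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {
        word: log((1.0 + n_posts) / (1.0 + df)) + 1.0
        for word, df in doc_freq.items()
    }


def reweight(tags, idf, default_idf=1.0):
    """Scale each (word, weight) pair by the word's IDF."""
    return [(word, weight * idf.get(word, default_idf)) for word, weight in tags]


# Toy corpus of already-segmented posts, for illustration only
corpus = [
    ["rent", "subway", "bedroom"],
    ["rent", "balcony"],
    ["rent", "subway", "parking"],
]
idf = inverse_document_frequency(corpus)

# "rent" occurs in every post, so it receives the lowest IDF
print(idf["rent"], idf["subway"], idf["balcony"])
print(reweight([("rent", 0.5), ("balcony", 0.5)], idf))
```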

For the full code, see [cosine_similarity.py](https://gist.github.com/facert/097af928b50ef9946513c7a5b42ec5c2).