This article is for personal study reference only and is not intended for commercial use.

There are more than 20 million WeChat public accounts. If each has published an average of 1,000 articles, that adds up to tens of billions of articles (about 50 billion by my estimate, or roughly 36 for every person in China). These articles carry a great deal of information and knowledge and reach into every industry, with the Internet, education and finance being the most popular. If you understand the principles of open source intelligence (OSINT) analysis, WeChat public account data is a source you cannot afford to miss.

Anyone with a WeChat account can view the content of any public account, but do you actually see it? Can you really read all of it? Do you really have the time? Admit it: there is always a small group of people who have mastered the methods and techniques, and whose way of thinking and access to information completely outclass everyone else on the subway. Make no mistake, printing, the Internet and WeChat public accounts have actually made information asymmetry worse.

Requirements

As mentioned in the introduction, you may follow hundreds of WeChat public accounts that contain a lot of valuable information. How to aggregate that information efficiently, for reading or for further processing, becomes a problem. Nearly two months ago I added the step "forward public account articles to a WeChat robot for later processing" to my workflow. But as time has passed and the frequency and scope of use have grown, the system feels less and less efficient: whether for information-collection practitioners or for me, a lot of time goes into the mechanics of reading public accounts, which is not elegant. So I have new requirements for the system:

  1. Collect public account articles in real time
  2. Automatically categorize articles according to the tags I specify
  3. Automatic keyword extraction, event extraction
  4. Automatic text summarization

This article, the first in a series, focuses on implementing the first requirement.

Implementation

Here is a simple list of the methods I can currently think of for obtaining WeChat public account article data:

  1. Using third-party platform APIs
  2. Collecting from the Sogou WeChat search platform
  3. Subscribing to RSS feeds
  4. Collecting via the WeChat public account platform
  5. Using an automated testing framework to simulate clicks
  6. Collecting by constructing mobile client requests
  7. Collecting by passively receiving public account messages

One term I recently picked up on Clubhouse is "FOMO", the fear of missing out on what is going on in your circle of friends. I find that I am also sensitive to the timeliness of information, so I want collection to be as close to real time as possible; a one-day delay does not meet my needs. In my experience only the last method does. To document the exploration I will go through each method in turn; to save time, skip straight to the last one.

Using third-party platform APIs

According to my research, the main data service providers are Xinbang, Qingbo and Tuotuo Data. While searching I also came across an article about a court ruling that prohibits unauthorized access to WeChat public account data, which is the reason for the notice at the beginning of this article.

  1. Xinbang

The Xinbang new media data API does offer a lot of capabilities, including a WeChat article API.

It bills in its own units. After the registration trial, the price appears to be 3,000 yuan for 20,000 units, which works out to about 0.15 yuan per article for public account data without the body text. In my API test the data was not real time (the delay was about a day), and on top of that it costs money, so I skipped it.

  2. Qingbo

In the API documentation of the Qingbo open platform I only found APIs for Toutiao, Douyin and Kuaishou accounts. Qingbo does have a "WeChat Article Collection" entry, but when I tried it with my own small public account "xyzlabAI" it returned no data, so I skipped it as well.

  3. Tuotuo Data

Tuotuo Data seems to be the friendlier platform for non-developers: adding a monitoring task is very convenient, and the cost without body text is also 0.15 yuan per article.

But the service seems to have been suspended.

According to the Tuotuo Data API documentation it should fit my requirements quite well, but I have not tried the API and cannot say anything about its data delay. Once the number of monitored accounts grows, the cost is not small either, so I skipped it.
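To put the pricing in perspective, here is a back-of-the-envelope calculation. The package price and unit count are the Xinbang figures quoted above; the weekly article volume and the one-unit-per-article rate are assumptions for illustration only, not measured numbers.

```python
# Rough cost estimate for a paid article API.
# PACKAGE_PRICE_YUAN and PACKAGE_UNITS are the figures quoted in the text;
# ARTICLES_PER_WEEK is an assumed volume, and one unit per article is assumed too.
PACKAGE_PRICE_YUAN = 3000
PACKAGE_UNITS = 20000
ARTICLES_PER_WEEK = 1000  # assumption


def cost_per_article(price_yuan, units):
    """Price of one article when one article consumes one unit."""
    return price_yuan / units


def yearly_cost(per_article, per_week, weeks=52):
    """Total yearly cost for a steady weekly volume."""
    return per_article * per_week * weeks


unit_cost = cost_per_article(PACKAGE_PRICE_YUAN, PACKAGE_UNITS)
print(f"{unit_cost:.2f} yuan per article")
print(f"{yearly_cost(unit_cost, ARTICLES_PER_WEEK):.0f} yuan per year")
```

Even at a modest assumed volume the yearly bill runs into the thousands of yuan, which is why cost counts against this option alongside the delay.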

Collecting from the Sogou WeChat search platform

From what I observed, the Sogou WeChat search platform looks close to real time, but it only shows the 10 most recent mass-sent posts, so I skipped it.

Subscribing to RSS feeds

Really Simple Syndication (RSS) is an easy way to get updates from news channels, blogs and so on. This blog supports it too: you can subscribe to this blog via its RSS feed.

RSSHub is an open source, easy-to-use and easy-to-extend RSS generator that can generate RSS feeds for all sorts of odd content. It has already been adapted to thousands of content sources across hundreds of sites and is a very good tool. Later, when I extend my personal intelligence and public opinion analysis system, I plan to use it to subscribe to the Weibo and Twitter posts of big names.

According to the RSSHub documentation, it offers eight kinds of RSS sources for public accounts, but only three currently work, and one of those requires forwarding WeChat messages to Telegram. So only two work out of the box: the CareerEngine source and the Ershicimi (二十次幂) source. In my tests the RSS data was not real time either, so I skipped it.

Collecting via the WeChat public account platform

In fact, the WeChat public account platform itself has an interface that returns public account articles in real time. It requires you to own a public account; you then only need to create a new article in that account.

On the new article page, choose to insert a hyperlink and select "other public accounts" to search for an account and get its list of articles.

I will skip the intermediate steps and go straight to the code:
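One way to check whether a feed is "real time" is to compare each item's publication time with the moment you first see it. The sketch below does that with the standard library only. The sample feed is hypothetical (a real RSSHub feed would be fetched over HTTP), and `feed_lag_hours` is a name made up for this illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical sample feed used instead of a live HTTP request.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example public account</title>
    <item>
      <title>First article</title>
      <link>https://example.com/a1</link>
      <pubDate>Thu, 04 Feb 2021 08:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second article</title>
      <link>https://example.com/a2</link>
      <pubDate>Wed, 03 Feb 2021 20:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def feed_lag_hours(rss_text, seen_at):
    """Return (title, hours between publication and seen_at) for each item."""
    root = ET.fromstring(rss_text)
    lags = []
    for item in root.iter("item"):
        published = parsedate_to_datetime(item.findtext("pubDate"))
        lag = (seen_at - published).total_seconds() / 3600
        lags.append((item.findtext("title"), round(lag, 1)))
    return lags


# Pretend the poller first saw both items at this moment.
seen_at = datetime(2021, 2, 4, 20, 0, tzinfo=timezone.utc)
for title, lag in feed_lag_hours(SAMPLE_RSS, seen_at):
    print(f"{title}: first seen {lag} h after publication")
```

Run against a live feed on a schedule, a check like this would show how far a given source lags behind the original push.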

import time
import requests
import pandas as pd
from datetime import datetime, timedelta


def get_fakeid():
    """Look up the fakeid of every public account listed in the spreadsheet."""
    rows = []
    content = pd.read_excel('XLSX')  # spreadsheet of public accounts, with a NickName column
    search_params = {
        'action': 'search_biz', 'begin': 0, 'count': 5, 'query': '', 'token': TOKEN,
        'lang': 'zh_CN', 'f': 'json', 'ajax': 1,
    }
    headers = {
        'cookie': COOKIE,
    }
    URL = 'https://mp.weixin.qq.com/cgi-bin/searchbiz'
    for mp in content['NickName']:
        search_params['query'] = mp
        res = requests.get(URL, params=search_params, headers=headers).json()
        print('{}, {}'.format(mp, res['list'][0]['fakeid']))
        # Take the first search hit as the matching account
        rows.append({
            'NickName': res['list'][0]['nickname'], 'fakeid': res['list'][0]['fakeid'],
        })
        time.sleep(10)
        # Save once per round so data is not lost if the API starts rate limiting
        pd.DataFrame(rows).to_excel('fakeid.xlsx', index=False)


def get_mps(start, end):
    """Collect the articles of every listed public account for a date range.

    Args:
        start (string): start date, e.g. '2021-02-01'
        end (string): end date, e.g. '2021-02-04', automatically extended to 23:59:59
    """
    start = datetime.strptime(start, '%Y-%m-%d')
    end = datetime.strptime(end, '%Y-%m-%d')
    end = end + timedelta(hours=23, minutes=59, seconds=59)
    start = start.timestamp()
    end = end.timestamp()
    rows = []
    content = pd.read_excel('XLSX')  # read the list of public accounts, including their fakeid
    search_params = {
        'action': 'list_ex', 'begin': 0, 'count': 5, 'fakeid': '', 'query': '', 'token': TOKEN,
        'lang': 'zh_CN', 'f': 'json', 'ajax': 1,
    }
    headers = {
        'cookie': COOKIE,
    }
    URL = 'https://mp.weixin.qq.com/cgi-bin/appmsg'
    for mp in content['fakeid']:
        search_params['fakeid'] = mp
        search_params['begin'] = 0  # restart paging for every account
        res = requests.get(URL, params=search_params,
                           headers=headers).json()['app_msg_list']
        articles = [{'title': i['title'], 'Source link': i['link'], 'digest': i['digest'], 'time': i['create_time']} for i in res]
        times = [i['create_time'] for i in res]
        count = sum(start < i < end for i in times)
        # Keep paging while every article on the page falls inside the date range;
        # an empty page ends the loop
        while times and count == len(times):
            time.sleep(10)
            search_params['begin'] += 5
            res = requests.get(URL, params=search_params,
                               headers=headers).json()['app_msg_list']
            articles.extend([{'title': i['title'], 'Source link': i['link'], 'digest': i['digest'], 'time': i['create_time']} for i in res])
            times = [i['create_time'] for i in res]
            count = sum(start < i < end for i in times)
        for article in articles:
            rows.append({
                'title': article['title'], 'Source link': article['Source link'], 'digest': article['digest'], 'time': datetime.fromtimestamp(article['time']).strftime('%Y-%m-%d'),
            })
        # Save once per round so data is not lost if the API starts rate limiting
        pd.DataFrame(rows).to_excel('XLSX', index=False)


if __name__ == '__main__':
    start = '2021-01-29'
    end = '2021-02-04'
    TOKEN = ''  # your TOKEN
    COOKIE = ''  # your cookies
    get_fakeid()
    get_mps(start, end)


The TOKEN and COOKIE in the code can be copied from the requests your browser sends to the platform. The script first obtains the fakeid of each public account and then simulates the request that returns that account's articles. In my tests with a little over a hundred public accounts, fetching fakeids slightly too fast triggered rate limiting. After I changed the code to wait 10 seconds per account the limit no longer hit, but I did not test the API's total access quota, so this method can only serve as a fallback.

In addition, on the WeChat platform a fakeid is one-to-one while an OpenID is one-to-many: each public account has a unique fakeid, but different users who follow the same public account get different OpenIDs.
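The paging logic in get_mps can be isolated and tried without touching the network. The sketch below is a variant, not the code above: it assumes the interface returns articles newest first, stops as soon as a page contains an article older than the start date, and keeps only the articles inside the window. `fetch_page` and `fake_fetch` are hypothetical stand-ins for the HTTP request.

```python
from datetime import datetime, timedelta


def collect_in_window(fetch_page, start, end, page_size=5):
    """Page through newest-first articles and keep those inside [start, end].

    fetch_page(begin, count) is a hypothetical callable standing in for the
    HTTP request; it returns a list of dicts with a 'create_time' timestamp.
    """
    start_ts = datetime.strptime(start, '%Y-%m-%d').timestamp()
    end_dt = datetime.strptime(end, '%Y-%m-%d') + timedelta(hours=23, minutes=59, seconds=59)
    end_ts = end_dt.timestamp()
    kept, begin = [], 0
    while True:
        page = fetch_page(begin, page_size)
        if not page:
            break  # no more articles
        kept.extend(a for a in page if start_ts <= a['create_time'] <= end_ts)
        if min(a['create_time'] for a in page) < start_ts:
            break  # everything further back is older than the window
        begin += page_size
    return kept


# Fake data: one article per day, newest first (2021-02-06 back to 2021-01-26).
FAKE = [
    {'title': f'post {n}',
     'create_time': (datetime(2021, 2, 6, 9) - timedelta(days=n)).timestamp()}
    for n in range(12)
]


def fake_fetch(begin, count):
    """Return one page of the fake article list."""
    return FAKE[begin:begin + count]


result = collect_in_window(fake_fetch, '2021-01-29', '2021-02-04')
print([a['title'] for a in result])
```

With a real request plugged in, a sleep between pages would still be needed to stay under the rate limit described above.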

Using an automated testing framework to simulate clicks

Using an automated testing framework to control the client, simulate clicks and capture the parameters is too much trouble, so I did not write any code for it.

Collecting by constructing mobile client requests

Three or four years ago, during my first internship, I took part in the company's hackathon and collected WeChat public account article data to supplement the training data of an AI model (I won first place in the competition, LOL). Back then I also collected likes, comments, view counts and body text, mainly through a man-in-the-middle attack. This time I was lazy and simply bought a set of source code. It is a Flask service backed by MongoDB and Elasticsearch that combines mitmproxy with an old WeChat client to simulate requests. The user has to open each public account's history list and article bodies by hand (an automated testing framework could do this, but I was too lazy to write it), after which the tool can fetch the account's historical data.

At first I thought it could only collect one public account at a time because I had not bought the commercial version. I asked the author and learned that the commercial version also supports only one account at a time. Having to pick the account by hand for every collection run actually lowers my efficiency. The tool is simply positioned for a different need: it leans toward collecting an account's historical data. Later I may use it to gather the history and body text of the accounts I need and build my own knowledge base that is easy to search with Elasticsearch.

Collecting by passively receiving public account messages

All of the methods above actively pull the data. They give you more control, but they are still uncomfortable to use. Why not let WeChat deliver the data directly? This method is a continuation of my earlier article, "Breaking through the limits of personal WeChat accounts: clever use of Serverless functions and Feishu shortcuts in the post-itchat era". Since the WeChat client receives the messages of the public accounts you subscribe to, those messages can be processed directly. There are still some pitfalls in practice. Start again with the autoReplyByAI method:

#pragma mark - Other
- (void)autoReplyByAI:(AddMsg *)addMsg
{
    if (addMsg.msgType != 1) {
         return;
    }
    
    NSString *userName = addMsg.fromUserName.string;
    
    MMSessionMgr *sessionMgr = [[objc_getClass("MMServiceCenter") defaultCenter] getService:objc_getClass("MMSessionMgr")];
    WCContactData *msgContact = nil;
    
    if (LargerOrEqualVersion(@"2.3.26")) {
        msgContact = [sessionMgr getSessionContact:userName];
    } else {
        msgContact = [sessionMgr getContact:userName];
    }
    
    if ([msgContact isBrandContact] || [msgContact isSelf]) {
        // This message is sent by the public account or by myself
        return;
    }
    YMAIAutoModel *AIModel = [[YMWeChatPluginConfig sharedConfig] AIReplyModel];
    if (AIModel.specificContacts.count < 1) {
        return;
    }
    
    [AIModel.specificContacts enumerateObjectsUsingBlock:^(NSString *wxid, NSUInteger idx, BOOL * _Nonnull stop) {
        if ([wxid isEqualToString:addMsg.fromUserName.string]) {
            
            NSString *content = @"";
            NSString *session = @"";
            if ([wxid containsString:@"@chatroom"]) {
                NSArray *contents = [addMsg.content.string componentsSeparatedByString:@":\n"];
                NSArray *sessions = [wxid componentsSeparatedByString:@"@"];
                if (contents.count > 1) {
                    content = contents[1];
                }
                if (sessions.count > 1) {
                    session = sessions[0];
                }
            } else {
                content = addMsg.content.string;
                session = wxid;
            }
            
            [[YMNetWorkHelper share] GET:content session:session success:^(NSString *content, NSString *session) {
                [[YMMessageManager shareManager] sendTextMessage:content toUsrName:addMsg.fromUserName.string delay:kArc4random_Double_inSpace(3, 8)];
            }];
        }
    }];
}

As you can see in the code, a message sent by a public account or by myself makes the method return immediately, so those message types have to be let through before public account messages can be handled. When I first tried it, I went straight to NSLog(@"message content: %@", addMsg); and did not pay attention to the end of the output, so I drew a hasty conclusion: the WeChat client only receives the first article from a public account and fetches the rest after you tap in, to save traffic. The next day I felt something was off and asked a classmate who specializes in reverse engineering; he said a friend of his had done this on Android. Since Android is not my primary phone, I stopped asking, looked at the Xcode console again and noticed that the printed string was cut off abruptly at the end, which made me suspect that NSLog was not printing the whole thing.

I asked an iOS developer friend, and he said, "How can that be?" Then I added a macro at the top of the file:

#ifdef DEBUG
#define NSLog(FORMAT, ...) fprintf(stderr, "%s:%zd\t%s\n", [[[NSString stringWithUTF8String: __FILE__] lastPathComponent] UTF8String], __LINE__, [[NSString stringWithFormat: FORMAT, ## __VA_ARGS__] UTF8String]);
#else
#define NSLog(FORMAT, ...) nil
#endif

With this macro NSLog prints the message in full: when the WeChat client receives a push from a subscribed public account, it receives all of the pushed articles as XML. All I need to do is change the relevant part of the code (I had never written Objective-C before, so go easy on me):

#pragma mark - Other
- (void)autoReplyByAI:(AddMsg *)addMsg
{
    if (addMsg.msgType != 1 && addMsg.msgType != 49) {
         return;
    }
    
    NSString *userName = addMsg.fromUserName.string;
    
    MMSessionMgr *sessionMgr = [[objc_getClass("MMServiceCenter") defaultCenter] getService:objc_getClass("MMSessionMgr")];
    WCContactData *msgContact = nil;
    
    if (LargerOrEqualVersion(@"2.3.26")) {
        msgContact = [sessionMgr getSessionContact:userName];
    } else {
        msgContact = [sessionMgr getContact:userName];
    }
    
    if ([msgContact isSelf]) {
        // This message was sent by myself
        return;
    }
    YMAIAutoModel *AIModel = [[YMWeChatPluginConfig sharedConfig] AIReplyModel];
    // Listen for messages from concerned public accounts
    if([msgContact isBrandContact]){
        if ([addMsg.content.string hasPrefix:@"<msg>"]) {
            NSError *error = nil;
            NSDictionary *xmlDict = [XMLReader dictionaryForXMLString:addMsg.content.string error:&error];
            NSDictionary *msgDict = [xmlDict valueForKey:@"msg"];
            NSDictionary *appMsgDict = [msgDict valueForKey:@"appmsg"];
            NSDictionary *mmreaderDict = [appMsgDict valueForKey:@"mmreader"];
            NSDictionary *categoryDict = [mmreaderDict valueForKey:@"category"];
            NSArray *items = [categoryDict valueForKey:@"item"];
            NSMutableArray *mps = [[NSMutableArray alloc] init];
            // Single-article push: items is then handled as one dictionary below, not as an array
            if (items.count > 20) {
                NSString *title = @"";
                NSString *url = @"";
                NSString *pub_time = @"";
                NSString *digest = @"";
                NSString *source = @"";
                NSDictionary *titleDict = [items valueForKey:@"title"];
                title = [titleDict valueForKey:@"text"];
                NSDictionary *urlDict = [items valueForKey:@"url"];
                url = [urlDict valueForKey:@"text"];
                NSDictionary *pub_timeDict = [items valueForKey:@"pub_time"];
                pub_time = [pub_timeDict valueForKey:@"text"];
                NSDictionary *digestDict = [items valueForKey:@"digest"];
                digest = [digestDict valueForKey:@"text"];
                NSDictionary *sourcesDict = [items valueForKey:@"sources"];
                NSDictionary *sourceDict = [sourcesDict valueForKey:@"source"];
                NSDictionary *nameDict = [sourceDict valueForKey:@"name"];
                source = [nameDict valueForKey:@"text"];
                NSDictionary *article = @{
                    @"title": title,
                    @"url": url,
                    @"pub_time": pub_time,
                    @"digest": digest,
                    @"source": source,
                };
                [mps addObject:article];
            } else {
                for (id item in items) {
                    NSString *title = @"";
                    NSString *url = @"";
                    NSString *pub_time = @"";
                    NSString *digest = @"";
                    NSString *source = @"";
                    NSDictionary *titleDict = [item valueForKey:@"title"];
                    title = [titleDict valueForKey:@"text"];
                    NSDictionary *urlDict = [item valueForKey:@"url"];
                    url = [urlDict valueForKey:@"text"];
                    NSDictionary *pub_timeDict = [item valueForKey:@"pub_time"];
                    pub_time = [pub_timeDict valueForKey:@"text"];
                    NSDictionary *digestDict = [item valueForKey:@"digest"];
                    digest = [digestDict valueForKey:@"text"];
                    NSDictionary *sourcesDict = [item valueForKey:@"sources"];
                    NSDictionary *sourceDict = [sourcesDict valueForKey:@"source"];
                    NSDictionary *nameDict = [sourceDict valueForKey:@"name"];
                    source = [nameDict valueForKey:@"text"];
                    NSDictionary *article = @{
                        @"title": title,
                        @"url": url,
                        @"pub_time": pub_time,
                        @"digest": digest,
                        @"source": source,
                    };
                    [mps addObject:article];
                }
            }
            NSError *parseError = nil;
            NSData *jsonData = [NSJSONSerialization dataWithJSONObject:mps options:NSJSONWritingPrettyPrinted error:&parseError];
            NSString *str = [[NSString alloc] initWithData:jsonData encoding:NSUTF8StringEncoding];
            // NSLog(@"Formatted content: %@", str);
            [AIModel.specificContacts enumerateObjectsUsingBlock:^(NSString *wxid, NSUInteger idx, BOOL * _Nonnull stop) {
                [[YMNetWorkHelper share] GET:str session:wxid success:^(NSString *content, NSString *session) {
                }];
            }];
        } else {
            return;
        }
    }
    if (AIModel.specificContacts.count < 1) {
        return;
    }
    [AIModel.specificContacts enumerateObjectsUsingBlock:^(NSString *wxid, NSUInteger idx, BOOL * _Nonnull stop) {
        
        if (addMsg.msgType == 1) {
            if ([wxid isEqualToString:addMsg.fromUserName.string]) {
                
                NSString *content = @"";
                NSString *session = @"";
                if ([wxid containsString:@"@chatroom"]) {
                    NSArray *contents = [addMsg.content.string componentsSeparatedByString:@":\n"];
                    NSArray *sessions = [wxid componentsSeparatedByString:@"@"];
                    if (contents.count > 1) {
                        content = contents[1];
                    }
                    if (sessions.count > 1) {
                        session = sessions[0];
                    }
                } else {
                    content = addMsg.content.string;
                    session = wxid;
                }
                [[YMNetWorkHelper share] GET:content session:session success:^(NSString *content, NSString *session) {
                    [[YMMessageManager shareManager] sendTextMessage:content toUsrName:addMsg.fromUserName.string delay:kArc4random_Double_inSpace(3, 8)];
                }];
            }
        } else if (addMsg.msgType == 49) {
            if ([wxid isEqualToString:addMsg.fromUserName.string]) {
                NSString *msgContentStr = nil;
                NSString *session = @"";
                if ([addMsg.fromUserName.string containsString:@"@chatroom"]) {
                    NSArray *msgAry = [addMsg.content.string componentsSeparatedByString:@":\n<?xml"];
                    NSArray *sessions = [wxid componentsSeparatedByString:@"@"];
                    if (msgAry.count > 1) {
                        msgContentStr = [NSString stringWithFormat:@"<?xml%@",msgAry[1]];
                    } else {
                        msgAry = [addMsg.content.string componentsSeparatedByString:@":\n<msg"];
                        if (msgAry.count > 1) {
                            msgContentStr = [NSString stringWithFormat:@"<msg%@",msgAry[1]];
                        }
                        if (sessions.count > 1) {
                            session = sessions[0];
                        }
                    }
                } else {
                    msgContentStr = addMsg.content.string;
                    session = wxid;
                }
                NSString *url = @"";
                NSString *title = @"";
                NSError *error;
                NSDictionary *xmlDict = [XMLReader dictionaryForXMLString:msgContentStr error:&error];
                NSDictionary *msgDict = [xmlDict valueForKey:@"msg"];
                NSDictionary *appMsgDict = [msgDict valueForKey:@"appmsg"];
                NSDictionary *titleDict = [appMsgDict valueForKey:@"title"];
                title = [titleDict valueForKey:@"text"];
                NSDictionary *urlDict = [appMsgDict valueForKey:@"url"];
                url = [urlDict valueForKey:@"text"];
                NSDictionary *content = @{
                    @"title":title,
                    @"url":url
                };
                NSError *parseError = nil;
                NSData *jsonData = [NSJSONSerialization dataWithJSONObject:content options:NSJSONWritingPrettyPrinted error:&parseError];
                NSString *str = [[NSString alloc] initWithData:jsonData encoding:NSUTF8StringEncoding];
                [[YMNetWorkHelper share] GET:str session:session success:^(NSString *str, NSString *session) {
                    [[YMMessageManager shareManager] sendTextMessage:str toUsrName:addMsg.fromUserName.string delay:kArc4random_Double_inSpace(3, 8)];
                }];
            }
        }
    }];
}
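
For readers who do not write Objective-C, the XML handling above can be sketched in Python. Only the element paths (msg, appmsg, mmreader, category, item and the fields under each item) are taken from the plugin code; the sample message itself is made up, and real pushes carry many more fields. Because `findall` always returns a list, the single-article and multi-article cases need no separate branches here.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical push message; only the element paths mirror the plugin code.
SAMPLE_PUSH = """<msg>
  <appmsg>
    <mmreader>
      <category>
        <item>
          <title>Headline article</title>
          <url>https://example.com/s/1</url>
          <pub_time>1612137600</pub_time>
          <digest>Short summary</digest>
          <sources><source><name>Example account</name></source></sources>
        </item>
        <item>
          <title>Second article</title>
          <url>https://example.com/s/2</url>
          <pub_time>1612137600</pub_time>
          <digest>Another summary</digest>
          <sources><source><name>Example account</name></source></sources>
        </item>
      </category>
    </mmreader>
  </appmsg>
</msg>"""

# Output key -> path of the element below each <item>
FIELDS = {
    'title': 'title',
    'url': 'url',
    'pub_time': 'pub_time',
    'digest': 'digest',
    'source': 'sources/source/name',
}


def parse_push(xml_text):
    """Return one dict per pushed article; missing fields become ''."""
    root = ET.fromstring(xml_text)  # root is the <msg> element
    articles = []
    for item in root.findall('appmsg/mmreader/category/item'):
        articles.append({key: (item.findtext(path) or '').strip()
                         for key, path in FIELDS.items()})
    return articles


articles = parse_push(SAMPLE_PUSH)
print(json.dumps(articles, ensure_ascii=False, indent=2))
```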

From then on, the plugin automatically forwards the messages received from subscribed public accounts to the previously configured API address, and that API needs to change too. Previously I stored the data through a combination of Feishu shortcuts, Feishu documents and Quip. Forwarding everything from all subscribed public accounts produces much more data, probably more than a thousand records a week, so I introduced LeanCloud, which is also the backend for this blog's comments and read counts, mainly to receive and store the article data. I forgot to include the Serverless function in the previous article, so here is the relevant code:

# -*- coding: utf8 -*-
import re
import json
import requests
import leancloud
from datetime import datetime

leancloud.init("{{Application AppID}}", "{{Application Key}}")
TOKEN = 'Bearer {{Bearer document TOKEN}}'
headers = {'Content-Type': 'application/x-www-form-urlencoded', 'Authorization': TOKEN}


def main_handler(event, context):
    print("Received event: " + json.dumps(event, indent=2))
    print("Received context: " + str(context))
    CATCH_URL = ''  # Feishu shortcut webhook address
    question = event['queryString']['question']
    data = None
    try:
        question = json.loads(question)
    except (TypeError, ValueError):
        # Not JSON, so this is not a forwarded article list
        return None
    if isinstance(question, list):
        for i in question:
            post_data = {
                'pub_time': datetime.fromtimestamp(int(i['pub_time'])).strftime('%Y-%m-%d %H:%M:%S'),
                'title': i['title'], 'digest': i['digest'], 'url': i['url'], 'source': i['source'],
            }
            r = requests.post(CATCH_URL, json=post_data)  # Save to the Feishu document
            mps = leancloud.Object.extend('mps')
            mps = mps()
            mps.set("title", i['title'])
            mps.set("digest", i['digest'])
            mps.set("url", i['url'])
            mps.set("pub_time", datetime.fromtimestamp(int(i['pub_time'])))
            mps.set("source", i['source'])
            mps.save()
    return data


With that, real-time WeChat public account article data lands in the LeanCloud data store. The fields reflect my own data model and can be changed in the code.

The code is done, but it needs a Mac to act as the server, and I happen to have several Macs I can use. When the MacBook's screen is off the network goes to sleep, which rules out 7 × 24 collection, so I installed Amphetamine to keep the machine awake. After a few days of testing, the data is being collected normally.
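Before anything is written to storage it helps to validate the forwarded payload. The sketch below is an assumption-laden illustration rather than part of the function above: `normalize_question` is a hypothetical helper, it accepts both the JSON list sent for public account pushes and the single title/url object sent for shared links, and it formats timestamps in UTC, whereas the handler above uses the server's local time.

```python
import json
from datetime import datetime, timezone


def normalize_question(question):
    """Turn the 'question' query parameter into a list of clean records."""
    try:
        items = json.loads(question)
    except (TypeError, ValueError):
        return []  # not JSON at all
    if isinstance(items, dict):
        items = [items]  # a single shared link
    if not isinstance(items, list):
        return []
    records = []
    for item in items:
        # Skip anything without the two fields every record needs
        if not isinstance(item, dict) or 'title' not in item or 'url' not in item:
            continue
        record = {
            'title': item['title'],
            'url': item['url'],
            'digest': item.get('digest', ''),
            'source': item.get('source', ''),
        }
        try:
            ts = int(item.get('pub_time'))
            record['pub_time'] = datetime.fromtimestamp(ts, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')
        except (TypeError, ValueError):
            record['pub_time'] = None  # missing or malformed timestamp
        records.append(record)
    return records


payload = '[{"title": "t", "url": "u", "pub_time": "1612137600", "digest": "d", "source": "s"}]'
print(normalize_question(payload))
```

Records that pass this step can then be posted to the webhook and saved one by one, as in the handler above.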

Next steps

The last method currently only gives real-time access to messages from the public accounts I already subscribe to; to add a new account I have to follow it by hand. The next step is to let the WeChat robot follow an account automatically when I forward the account's contact card to it.

Also, running a MacBook Pro as the server is not very elegant. I will consider buying a Mac mini or migrating directly to a macOS instance on AWS.

So far the system only builds the data warehouse. Next it needs display and filtering, plus historical data collected by constructing mobile client requests to fill out the warehouse. After that come automatic text classification, keyword extraction, event extraction and text summarization. This is meant to be a series, but I do not know when the next installment will appear.