Originally published on my blog: www.bmpi.dev/dev/coupon-…

About half a year ago, I worked on a coupon aggregation platform aimed at overseas markets. Some background on the project: the site shows the latest coupons from many brands, and users can browse and search them. The coupons come from two kinds of sources. The first is affiliate networks; our site integrates with three of them: Shareasale, LinkShare and CJ. The second is social platforms, mainly Twitter. Data from the affiliate networks is pulled through their APIs and stored on our site automatically, and staff can also apply manually for the coupon offers of individual brands on those networks. For affiliate-network offers, a commission is paid when a user clicks a coupon, jumps to the brand's official site and buys a product; the commission rate differs from product to product. The second kind, such as Twitter coupons, carries no commission but comes in very large volume, so it expands our coupon count and also increases both the number of pages on our site and the number of pages Google indexes.

At first our site integrated only the three affiliate networks. When the first version went live it had about 30,000 valid coupons and about 3,000 brands. The site was designed around aggregated brand pages, so the first version had only about 3,000 pages. To get more pages indexed by Google, we started on SEO. The basics included HTML optimization, a sitemap, title, keywords, description and meta tags, a brand index page, long-tail keyword pages, backlink building, URL optimization, site search, and so on. In a short time we built more than 10,000 backlinks and seven or eight thousand long-tail keyword pages, so the page count jumped to about 12,000. The long-tail pages were SEO-optimized as well, with recommendations of similar long-tail keywords and similar brands. Log monitoring at the time showed Google's crawler visiting more often, two to three thousand requests a day. Some brand keywords had started to rank, and everything was heading in a good direction.

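The crawl-frequency monitoring mentioned above can be approximated with a small log-counting script. This is only a sketch: it assumes a combined-format access log (the original log format is not stated), and the function name `googlebot_hits_per_day` is made up for illustration.

```python
import re
from collections import Counter

# Pulls the date out of a timestamp such as [10/Oct/2020:13:55:36 +0000].
# The combined log format is an assumption, not something the project confirmed.
DATE_RE = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}):")


def googlebot_hits_per_day(lines):
    """Count requests per day for lines whose text mentions Googlebot."""
    counts = Counter()
    for line in lines:
        if "Googlebot" not in line:
            continue
        match = DATE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file (any iterable of lines) returns a `Counter` keyed by date string. Matching on the user-agent text alone does not prove a request really came from Google, so the numbers are best read as a trend rather than an exact count.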
There was one hidden problem we did not anticipate at launch, and it planted a landmine for the rest of the project. We launched on a newly registered, ordinary domain, but for some reason the team also owned a domain that was roughly 10 years old. The old domain still had some backlinks that brought in a little traffic, presumably left over from sites built on it earlier. Thinking the old domain would carry good authority, we switched to it a week after launch. That left us with two versions: the old version on the new domain, and the new version on the old domain (the one that brought traffic). Because the team had limited manpower, we added new features only to the site on the old domain. The site on the new domain was left unmaintained, with robots rules blocking crawlers, and parked for a while.
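Whether a parked site really blocks crawlers can be checked with Python's standard `urllib.robotparser`. The sketch below is an illustration only: the robots rules and the `example.com` URL are assumptions, since the post does not show the actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that blocks every crawler from the whole site.
BLOCK_ALL = ["User-agent: *", "Disallow: /"]


def is_crawlable(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


blocked = not is_crawlable(BLOCK_ALL, "Googlebot", "https://example.com/brand/some-brand")
```

With the rules above, `blocked` is `True`. Note that robots rules only stop crawling; they do not by themselves say anything about how a search engine relates two domains to each other.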

We soon ran into trouble: Google kept having problems indexing our pages. First, Google's cached copy of our pages returned 404. Then we found that the site loaded very slowly, scoring about 60 in Google PageSpeed. We spent a week speeding it up: we added a CDN, optimized the rendering logic, reduced the size of the front-end JS and CSS files, optimized image sizes, and so on. After that, pages finished loading within 2 seconds and server response time stayed under 200 ms. But Google's cached page was still broken. In the end our SEO specialist said this was Google's problem, and Google's official word was also that it was an internal Google server issue. As long as indexing was normal, we decided not to pursue it. What kept bothering us, however, was that the number of pages Google indexed was unstable: a few hundred in the morning, a few thousand in the afternoon. SEO traffic from Google hovered at a few dozen visits a day.

The second phase of the project focused on improving Google indexing. Our SEO specialist suspected that our pages were too similar to each other, so we spent a week or two investigating. For pages that might be too similar we added some randomized differentiation, so that part of each page changes on every refresh. This is not ideal for crawlers, but it does lower page similarity. I also wrote a Python program to examine the tens of thousands of pages across the whole site. A crawler first spent two days fetching the pages, then rendered and saved them automatically; the program then extracted the text from the HTML and computed cosine similarity on the text. Finally we randomly sampled 1,000 pages, compared them pairwise and plotted the results. Similarity across the site turned out not to be high, and the gap between us and our competitors was small, which makes sense: everyone's data comes from the affiliate networks, so it should be about the same.

After that, Google's cache finally worked, but it showed the mobile version of the page. That was odd, because Google has two crawlers, one for mobile and one for desktop, and our site serves different HTML templates depending on the UA, yet Google ended up caching the mobile version. The cached mobile version also rendered incorrectly, which we eventually traced to a front-end component that used a technique incompatible with Google. After fixing that, we spent another week observing, and indexing was still abnormal.
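The pairwise similarity check described above can be sketched as follows. This is a minimal illustration, not the original program: it assumes plain term-frequency vectors (the post does not say how the text was vectorized), uses only the standard library, and every function name here is made up for the example.

```python
import math
import random
import re
from collections import Counter
from html.parser import HTMLParser
from itertools import combinations


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def term_counts(text):
    # Assumption: lowercase word/number tokens with raw counts, no TF-IDF weighting.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sampled_pairwise_similarity(pages, sample_size, seed=0):
    """pages maps url -> html; returns {(url_a, url_b): similarity} for a random sample."""
    rng = random.Random(seed)
    urls = rng.sample(sorted(pages), min(sample_size, len(pages)))
    vectors = {u: term_counts(html_to_text(pages[u])) for u in urls}
    return {(u, v): cosine_similarity(vectors[u], vectors[v])
            for u, v in combinations(urls, 2)}
```

Sampling matters here because comparing every pair among tens of thousands of pages means hundreds of millions of comparisons, while 1,000 sampled pages need only 499,500. The resulting scores can then be plotted as a histogram and set against the same measurement on a competitor's pages.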

The third phase was still about fixing the abnormal indexing. This time we focused on content: because our content was pulled from the affiliate networks' APIs, it was hardly different from our competitors'. To enrich the site, we chose to crawl social networks and generate coupons from what we found, since many brands post their own coupon information on social sites. After some investigation we decided to crawl tweets. I built a distributed Twitter crawler that searched Twitter every second for coupon-related keywords and stored the matching tweets in a database. A background Python program then read the previous day's tweets from the database every day and processed them algorithmically into our coupon format, and another scheduled script automatically picked up the generated coupons and published them to the site. After adding Twitter coupons, our coupon count quickly doubled and we had 80,000 pages. Because our earlier analysis showed we had too many long-tail keyword pages, I stopped the long-tail pages, and backlinks also began to decline. Even so, indexing was still unstable. Bing had indexed 20,000 pages by then, while Google hovered around 23,000 and swung widely, sometimes showing 16,000 in the afternoon and a couple of thousand in the morning. Compared with competitors, this was really weird.
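To make the "tweet to coupon" step more concrete, here is a rough sketch of what such processing could look like. The post does not describe the actual algorithm, so everything below is an assumption: the regular expressions, the `Coupon` fields and the `tweet_to_coupon` name are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical patterns: a keyword such as "code" followed by an uppercase token,
# and a percentage followed by "off". Real tweets would need far more rules.
CODE_RE = re.compile(r"\b(?i:code|coupon|promo)\s*:?\s*([A-Z0-9]{4,15})\b")
DISCOUNT_RE = re.compile(r"(?<!\d)(\d{1,3})\s*%\s*off", re.IGNORECASE)


@dataclass
class Coupon:
    brand: str                       # assumed to be the tweeting account
    code: Optional[str]
    discount_percent: Optional[int]
    source_text: str


def tweet_to_coupon(author: str, text: str) -> Optional[Coupon]:
    """Return a Coupon if the tweet mentions a code or a discount, else None."""
    code = CODE_RE.search(text)
    discount = DISCOUNT_RE.search(text)
    if not code and not discount:
        return None
    return Coupon(
        brand=author,
        code=code.group(1) if code else None,
        discount_percent=int(discount.group(1)) if discount else None,
        source_text=text,
    )
```

A daily job along these lines would run `tweet_to_coupon` over the previous day's stored tweets and drop the ones that return `None`. Keeping the extraction separate from the publishing script, as in the pipeline described above, means bad extractions can be reviewed before anything reaches the site.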

In the end we thought the problem most likely lay in the domain: nothing else we had tested solved it, and the domain was the one thing we had not tested. Changing the domain would throw away much of the work already done, but we still wanted to find out whether the domain was to blame. So we re-enabled the earlier new domain and put a simplified version of the coupon system on it, to compare against the site on the old domain. After a period of observation, indexing on the earlier new domain was abnormal as well. Our SEO specialist believed the two domains had become associated back when we first deployed the system, and that Google might have detected this, so this approach did not work. We then planned to test with yet another domain, and also to build a traffic site to divert visitors to the coupon platform. However, none of this got done in time. Because the project had run too long, the company eventually stopped developing new features and just left the site waiting to be indexed.