Brief introduction: 2020 was an unforgettable year for everyone. The impact of COVID-19 brought challenges to every industry, including education, and gave rise to a new business landscape. As online education platforms grew rapidly, Alibaba Cloud responded actively, providing efficient and stable technical support for many online education customers. This article introduces the technical principles behind photo question search in Alibaba Cloud Open Search, an important tool for online education companies to acquire traffic.

Shared by: Xu Guangwei (Kunka), algorithm expert at Alibaba DAMO Academy

Search is a traffic acquisition tool for online education companies

As of December 2020, as many as five of the education apps leading the industry in monthly active users offered question search. As a product capability, photo question search helps customers acquire large numbers of users and large volumes of traffic, and it supports monetization for their other products. Because of this positioning, the overall accuracy and efficiency of photo question search are critical, so Open Search has been extensively customized for this scenario.

Education search business characteristics

Education search scenarios have three main characteristics:

The first is a massive question bank. Education question banks hold tens of millions or even hundreds of millions of questions, and they keep growing. Search traffic also has pronounced peaks, for example at seven or eight o'clock in the evening or on the last day of a holiday. At those times the search service sees a very high QPS peak, and search latency can seriously affect the user experience.

The second is a rich variety of scenarios. Photo question search covers more and more scenarios, including different age groups. For example, questions for the lower grades are mostly picture-based, such as picture-reading questions and matching (line-connecting) questions, which require more image information. It also covers different subjects, with more than a dozen currently supported. This richness of scenarios makes good search results harder to achieve.

The third is demanding algorithmic requirements. Photo question search products generally show only the Top 3 or Top 5 results, which makes accuracy crucial. Photo question search also calls for multimodal and multilingual processing, to handle image search and questions in multiple languages.

Architecture of the Open Search education search solution

In the Alibaba Cloud Open Search solution for photo question search, the user takes a photo and OCR recognizes the text. The Open Search engine then processes the recognized text and returns the top 3 to 5 results to display to the user, while strictly guaranteeing the security and privacy of the data in the enterprise question bank.

Education search algorithm capabilities

Query analysis algorithms that optimize the complete processing flow

Word segmentation and subject category prediction for the education industry

Word segmentation in photo question search has two major difficulties. The first is that spaces are often missing after OCR recognizes English questions. As the first figure on the left shows, the model segments even long English text without spaces correctly. The second is segmenting mathematical problems. As the second figure on the left shows, all the mathematical symbols are segmented correctly.
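To make the two segmentation problems concrete, here is a minimal sketch. It is not the model used by Open Search, which the article does not describe. It assumes a simple dictionary-based dynamic program for space-free English and a regular expression for mathematical expressions. The vocabulary and the function names `segment` and `tokenize_math` are hypothetical.

```python
import re
from typing import List, Optional, Set


def segment(text: str, vocab: Set[str]) -> Optional[List[str]]:
    """Split `text` into the fewest vocabulary words; return None if impossible."""
    n = len(text)
    max_len = max((len(w) for w in vocab), default=0)
    # best[i] holds the shortest word list that covers text[:i], or None.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prev = best[start]
            word = text[start:end]
            if prev is not None and word in vocab:
                if best[end] is None or len(prev) + 1 < len(best[end]):
                    best[end] = prev + [word]
    return best[n]


def tokenize_math(expr: str) -> List[str]:
    """Split an expression into numbers, letter runs, and single symbols."""
    return re.findall(r"\d+(?:\.\d+)?|[A-Za-z]+|[^\sA-Za-z\d]", expr)


# Hypothetical toy vocabulary; a real system would learn this from data.
VOCAB = {"what", "is", "the", "area", "of", "a", "circle", "are", "he"}

print(segment("whatistheareaofacircle", VOCAB))
print(tokenize_math("3x+2=11"))
```

Preferring the fewest words is one simple way to resolve ambiguity, for example choosing "area" over "are" followed by "a". A learned model can use context instead, which matters for OCR text that contains recognition errors.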

Category prediction in photo question search means predicting the subject and the question type. We combine the picture with the OCR-recognized text for multimodal prediction, which improves search accuracy.

Multi-way recall and ranking technology

Because of the particular nature of the photo question search scenario, Open Search also introduces multi-way recall and ranking technology.

Why use multi-way recall?

Educational photo search differs clearly from traditional web or e-commerce search. First, the search query is extremely long. Second, the search query is text produced by OCR from a photo. If a key term is recognized incorrectly, recall and ranking are seriously affected.

Traditional text-only query schemes come in two kinds: OR-logic queries and AND-logic queries. With the query analysis module optimized and customized for the education domain, as described above, the AND-logic query improves greatly, and its accuracy can now come close to that of OR logic.
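The difference between the two query schemes can be sketched with a toy inverted index. This is an illustration only, assuming whitespace tokenization and a three-question bank invented for the example. The function names are hypothetical and are not the Open Search API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lowercase term to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def or_query(index: Dict[str, Set[int]], terms: Iterable[str]) -> Set[int]:
    """Recall documents that contain at least one query term."""
    result: Set[int] = set()
    for term in terms:
        result |= index.get(term, set())
    return result


def and_query(index: Dict[str, Set[int]], terms: Iterable[str]) -> Set[int]:
    """Recall documents that contain every query term."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)


# Toy question bank, invented for this example.
DOCS = {
    1: "solve the equation x plus two equals five",
    2: "find the area of a circle with radius three",
    3: "solve for the radius of a circle with area nine",
}
INDEX = build_index(DOCS)

# A common term makes OR recall everything, while AND stays selective.
print(or_query(INDEX, ["the", "area", "circle"]))
print(and_query(INDEX, ["the", "area", "circle"]))

# A single OCR error ("radios" for "radius") empties the AND result.
print(or_query(INDEX, ["area", "circle", "radios"]))
print(and_query(INDEX, ["area", "circle", "radios"]))
```

In this sketch OR logic tolerates the misrecognized term but recalls more documents to score, while AND logic recalls fewer documents but fails outright on one bad term. That is why the quality of query analysis matters so much for the AND scheme.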

How can search computation cost and search accuracy be balanced?

Text vector recall is introduced, with three optimizations.

First, we adopted the StructBERT model developed by DAMO Academy and customized it for the education industry. We also compressed and accelerated the BERT model.

Second, the vector retrieval engine is Proxima, developed by DAMO Academy, which is more accurate and faster than open-source systems.

Third, training data can be accumulated from the customer's search logs, so the results keep improving over time.

As the figure on the right shows, the two-tower BERT model ultimately achieves very good results: its accuracy exceeds that of OR logic by 3% to 5%, the total number of recalled documents is reduced by a factor of 40, and latency is reduced by a factor of more than 10.
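The core idea of text vector recall can be sketched as a nearest-neighbor search over embeddings. This is a simplified illustration under stated assumptions. The vectors below are hand-written toy values rather than StructBERT outputs, and the brute-force scan stands in for a vector retrieval engine such as Proxima, whose API is not shown here. The function names are hypothetical.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def vector_recall(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k documents most similar to the query, best first."""
    scored = ((doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy embeddings, invented for this example; a real system would encode
# the query and the questions with a trained text model.
DOC_VECS = {
    "q1": [0.9, 0.1, 0.0],
    "q2": [0.1, 0.9, 0.2],
    "q3": [0.8, 0.2, 0.1],
}

for doc_id, score in vector_recall([1.0, 0.0, 0.0], DOC_VECS, k=2):
    print(doc_id, round(score, 3))
```

Because only the k nearest vectors are returned, the number of candidates passed to ranking is fixed and small, which fits the reduction in recalled documents and latency reported above. A production engine replaces the linear scan with an approximate index.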

Search effect demonstration

Consider two specific search cases. In the case on the left, a traditional search engine returns results of poor relevance because the wording of the question differs from the wording in the question bank. After semantic vector recall is introduced, the Top 3 results on the right fully match the meaning of the question. In the second case, the question contains picture information, so a traditional search engine cannot recall it accurately. With our multi-way recall, once image information is introduced, the Top 1 result is exactly the same question.
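One way to picture how several recall channels combine in cases like these is a weighted score fusion. The article does not describe how Open Search merges its channels, so this is only an assumed scheme. The channel names, weights, and scores are all hypothetical.

```python
from typing import Dict, List, Tuple


def merge_recalls(
    channels: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
    k: int,
) -> List[Tuple[str, float]]:
    """Fuse per-channel scores with a weighted sum and return the top k."""
    fused: Dict[str, float] = {}
    for name, hits in channels.items():
        weight = weights.get(name, 0.0)
        for doc_id, score in hits.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * score
    # Sort by fused score, highest first; break ties by document id.
    ranked = sorted(fused.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:k]


# Hypothetical candidates from three recall channels, scores in [0, 1].
CHANNELS = {
    "text": {"q1": 0.6, "q2": 0.5},
    "vector": {"q2": 0.7, "q3": 0.9},
    "image": {"q3": 0.8},
}
WEIGHTS = {"text": 0.4, "vector": 0.4, "image": 0.2}

for doc_id, score in merge_recalls(CHANNELS, WEIGHTS, k=3):
    print(doc_id, round(score, 2))
```

In this toy run the question found by both the vector and image channels ranks first even though the text channel missed it, which mirrors the second case above. A real system would more likely feed the merged candidates to a learned ranking model than rely on fixed weights.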

Advantages of the Open Search solution

Case 1: A K12 education customer has tens of millions of users and a question bank of about 80 million questions that keeps growing. After the customer connected to Open Search, the accuracy of returned questions increased by 45%, and latency was reduced to 50 milliseconds.

Case 2: A higher vocational education customer has 3 million daily active users and 10 million monthly active users. According to the customer's feedback, search latency at peak times exceeded two seconds before connecting to Open Search. It now stays stable at 50 milliseconds, 40 times lower than before. The miss rate of Top 5 search results dropped from 40% to less than 1%, and capacity can be expanded smoothly within seconds during peak business periods.
