In-app guidance is an important means of building user mindshare on the client side. We experimented with "playbook" (scripted) thinking and achieved good results. Putting the idea into practice, however, involves a large R&D workload, and terminal technology stacks are diverse, so the solution had to be "zero-code" and "technology-stack independent." We ultimately achieved a breakthrough through core techniques such as image matching and a standard protocol. This article walks through the project's thinking and analyzes the key technical solutions, hoping to inspire engineers doing related development work.

Background

The pace of the Internet industry is fast, and App updates are increasingly frequent. Helping users keep up with updates, understand product features, and complete their cognitive iteration is an important part of business development. At the same time, the concept of "low code / zero code" has gradually gained public recognition, and research reports indicate it can accelerate the digital transformation of enterprises. Take the Meituan In-Home business group as an example: after the stay-at-home economy warmed up again, instant-delivery applications grew faster than other delivery applications. The influx of new users is both an opportunity and a challenge. The In-Home business group now covers 10+ business lines such as medicine, group meals, flash purchase, errand running, group goods, and unmanned delivery. Each new business model means trials in a new field, and the main takeout business launches new function modules every few days on average. All of this demands attention to building users' mindshare and improving efficiency.

Current Status

To improve users' mindshare and win recognition for their services, the industry has made many attempts, ranging from various light interactions to "hand-holding" game-style guided tutorials. At the technical level, these all come down to in-App function guidance, which lets users quickly understand product features and usage in a short time. Compared with traditional approaches such as advertising, slogan campaigns, and on-the-ground promotion, in-App function guidance is low-cost, precisely targeted, and reusable.

App function guidance is the "stepping stone" of user mindshare building. Only when users are familiar with platform operations and understand product characteristics can mindshare be built further through emotion, scene recognition, operating skills, and other means. As App functions keep iterating, the phenomenon of users "not understanding" has gradually emerged, and it is especially prominent in the Meituan takeaway merchant client. As the main tool of merchants' production and operation, the client carries complex and diverse business functions, and its settings are even more intricate. If merchants do not understand how to use them, the whole operating system suffers.

To help merchants "understand," in the first quarter of 2021 the Meituan takeaway merchant side invested a lot of manpower in function-guidance requirements, supported merchants with platform products at the requirements level, and piloted projects such as "emotional guidance." Although the business results were positive, the follow-up R&D workload was large, so many ideas were hard to land. Similar guidance demands in marketing, advertising, goods, and orders also piled up, because product functions iterate rapidly and each launch produces a train of guidance requirements.

Goals and Challenges

Based on the background and current status above, we urgently needed a solution that lets the business side implement its own ideas faster, keeps costs under control, and builds user mindshare better, while also clearing the current backlog of business tasks, including but not limited to operation teaching, feature introduction, and emotional and serious scenarios. That is where scripted-guidance projects like ASG (Application Scripted Guidance) come in.

Project Objectives

Our goal is to build a user-friendly scripted-guidance tool with which even non-technical colleagues can independently complete production and delivery, at lower cost and with better effect than traditional approaches. It is currently used mainly in "operation guidance," "mindshare building," and similar scenarios.

How should the "script" here be understood? It places the user in a real scene, simulates an expected goal, and leads the user through a series of operations toward that goal, so that the user can feel the overall process along with its associations and sequencing. It can also be read as a small, pre-arranged program presented to the user step by step, which may or may not require interaction.

Scripted guidance is common in games. For example, when you encounter a fire-type enemy, the game walks you through opening the weapon interface, selecting a weapon, and swapping in a water gem. In the past two years, scripted guidance has gradually appeared in content and tool Apps as well.

Previously, the Meituan takeout merchant side applied similar ideas to guidance demands such as "opening for business" and "simulated order receiving." The approach was ahead of its time, but development costs were high, which caused subsequent guidance demands to pile up.

Revenue Measurement Logic

The revenue logic of the ASG scripted-guidance project is "cost reduction and efficiency improvement," where "efficiency" covers both speed and effect. The result metric is computed as: efficiency improvement multiple x = (1 / (1 - cost reduction ratio)) × (1 + product metric growth ratio). The target therefore splits into two directions:

  • Lower production cost: with the help of terminal and configuration capabilities, simple interactions let product and operations colleagues put scripts online independently. "Zero code" and "technology-stack independence" are the project's core competitiveness. We provide a standardized framework with limited customization inside it, achieved by adjusting parameters and types, to cover different demand scenarios.
  • Higher application effect: scripted guidance can be more vivid than traditional function guidance and can integrate more elements (natural voice, well-timed motion effects, a friendly IP mascot), bringing an immersive experience and sharpening user perception. It emphasizes interaction with users, and the feedback after each operation is the change of the real page, which deepens understanding. Timing is also more controllable: scripts trigger automatically once rules are met, and the backend can screen users with specific characteristics (such as those who "don't understand") for targeted delivery.

Challenges

  1. Terminal technology stacks, including Flutter, React Native, mini-programs, and PWA, each have their own application scenarios, and most Apps combine several of them. How do we bridge the differences and make guidance technology-stack independent (that is, containerless)?
  2. How do we ensure the success rate and robustness of script execution? (The MVP demo's success rate was only 50%; the stable version targets 99%+.)
  3. How do we deliver the "zero-code" script production plan, so that scripts can be produced and distributed independently? (Previous similar requirements took 20-50 person-days.)

Overall Design

Choosing the Presentation Form

What form should the project be based on? Our approach was to first identify a form with "good results," then try to achieve "lower cost" within that form.

"Good effect" should naturally be reflected in product metrics, but in the early stage the landing metrics varied widely across scenarios, making standardized horizontal comparison across forms difficult. We therefore used how well users actually absorb the delivered information as a proxy for the final product effect.

We sampled previous business data, including video tutorials: the average watch-time ratio was roughly 50%-66%, and most users did not finish the videos. Our analysis: users absorb content at different speeds, so a longer video is hard to finish if it is not engaging enough or does not fit the user's pace. Video is also one-way communication, lacks interaction, and does not embody scripted thinking. After consulting with product, we piloted lightly scripted interactive guidance built on real pages (with a persistent button in the upper-left corner so users can exit at any time) on some guidance requirements.

The pilot results matched our expectations: interactive guidance based on real pages is indeed more acceptable to users. The completion rate of guided steps reached 76%-83%, significantly higher than the average video watch-time ratio.

Conventional presentation forms also include image carousels, which essentially force users to tap through before entering a feature. They suit some lightweight guidance scenes, but for guidance of medium or higher complexity their data has little reference value. Based on the collected data and basic understanding, we compared the three categories, as shown in the table below:

We concluded that to get better results with user-centered guidance that users more readily accept, developing on real pages has clear advantages; the drawback is high development cost. Since the simple pilot already achieved good improvement, the team is confident that after introducing more client capabilities and tuning, the overall effect has further room to grow.

Solution Overview

The target users of the ASG scripted-guidance project are product and operations colleagues. We tried to think from their perspective: what does a convenient, efficient "scripted guidance production and delivery tool" look like?

As shown in the figure above, we expose only four interaction steps to product and operations colleagues: record, edit, preview, and publish. When they need to launch guidance in a business module, they only need to draft a script and walk through these four steps to complete the "requirement," with almost no involvement from R&D or design throughout the process.

In the concrete execution plan, we designed templates and orchestration for scripted guidance: each guiding action is abstracted into an event, and multiple events combine into a script. To ensure compatibility across terminals, we designed a standard, easily extensible protocol to describe script elements. At runtime, both the PC management console and the App automatically parse the script into executable events (such as coordinate clicks, page navigation, and voice playback).
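As a rough illustration of what such a protocol and its runtime parsing could look like (field names and event types here are hypothetical, not the actual ASG protocol), a script can be an ordered list of typed events that each side parses and dispatches:

```python
# A minimal sketch of a script protocol and its dispatch loop, assuming
# hypothetical field names; the real ASG protocol is not public.
import json

script_json = """
{
  "scriptId": "demo-001",
  "events": [
    {"type": "navigate",  "payload": {"url": "app://promotion/home"}},
    {"type": "highlight", "payload": {"anchorImage": "cdn://roi_123.png"}},
    {"type": "click",     "payload": {"x": 0.42, "y": 0.77}},
    {"type": "speak",     "payload": {"text": "Tap here to start a promotion."}}
  ]
}
"""

HANDLERS = {
    "navigate":  lambda p: print("open page:", p["url"]),
    "highlight": lambda p: print("locate & highlight region from:", p["anchorImage"]),
    "click":     lambda p: print("simulate click at relative coords:", p["x"], p["y"]),
    "speak":     lambda p: print("TTS:", p["text"]),
}

def run_script(raw: str) -> None:
    """Parse the protocol and dispatch each event to its handler in order."""
    for event in json.loads(raw)["events"]:
        HANDLERS[event["type"]](event["payload"])

run_script(script_json)
```

Because each terminal only needs a parser plus handlers for a small, fixed set of event types, new technology stacks can be supported without touching the protocol itself.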

The core functional modules live on the script execution side. To ensure a strong application effect, we require that interaction with users happens on real business pages, and that the overlaid elements are computed and drawn in real time, which places higher demands on system performance and accuracy. The system panorama is shown in the figure below; it consists of three parts: the terminal side, the management console, and cloud services:

Terminal side: it provides two capabilities, recording scripts and playing scripts, and consists of four functional modules. The preprocessing module handles script resource download, protocol parsing, codec work, and similar operations; it is the pre-step that ensures successful execution. The real-time computing module dynamically obtains anchor-element information through screen capture, feature matching, and image intelligence; it guarantees that guidance is drawn accurately and is the core link that makes scripted guidance technology-stack independent. The task scheduling module implements an event queue to ensure orderly, correct execution. The multimedia module handles speech synthesis and animation rendering, providing an immersive experience in specific business scenarios. PC capabilities are extended on top of the client; common React/Vue/Svelte web applications can be connected at low cost.

Management console: includes script editing, import and release, permission control, data dashboards, and other functional modules. The script editing module carries the key functions of protocol parsing, editing, and preview. Its interface is divided into the following areas by function:

  • Event flow control area: events in the script flow are displayed as page frames, with editing functions such as dynamic insertion and deletion and reordering of frames.
  • Protocol configuration area: following the script's standard protocol, visual per-frame configuration items generate guidance events that meet the requirements; rich materials are also provided to support the emotional creation of mindshare scripts.
  • Script preview area: QR-code scanning is supported for convenient, faithful preview, ensuring consistency with the final guidance effect presented to the user.

Cloud services: built on Meituan's underlying cloud platform, using resource hosting and CDN to manage and distribute resources after editing and to complete script delivery and updates. Together with the server-side SDK and console policy configuration, the service middle platform provides fine-grained delivery configuration and richer trigger opportunities, supporting targeting by time, city, account, store, and business label.

Analysis of Key Technical Solutions

Region Location Based on Visual Intelligence

During guidance, the target area on the critical path must be highlighted. Under the premise of technology-stack independence, the basic idea is to capture the target area offline, take a full-screen screenshot at runtime, and use an image matching algorithm to find the target area's position within the screenshot, thereby obtaining its coordinates, as shown in the figure below:
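As a minimal sketch of this offline-capture / online-screenshot / match idea, here is OpenCV's region-based template matching (the production pipeline described below is feature-based, and the file names here are placeholders):

```python
# A minimal sketch of "find the offline-captured target area inside a
# runtime full-screen screenshot" using OpenCV template matching.
# File names are placeholders; the production pipeline is feature-based.
import cv2

screenshot = cv2.imread("fullscreen.png", cv2.IMREAD_GRAYSCALE)
target = cv2.imread("target_roi.png", cv2.IMREAD_GRAYSCALE)

# Normalized cross-correlation: slide the template over the screenshot
# and score each candidate position.
scores = cv2.matchTemplate(screenshot, target, cv2.TM_CCOEFF_NORMED)
_, best_score, _, best_top_left = cv2.minMaxLoc(scores)

h, w = target.shape
if best_score > 0.8:  # empirical confidence threshold
    x, y = best_top_left
    print(f"target region at ({x}, {y}), size {w}x{h}, score {best_score:.2f}")
else:
    print("no confident match; fall back to feature matching / AI tracking")
```

Plain template matching like this is neither scale- nor rotation-invariant and degrades under resolution changes, which is exactly why a feature-based approach is needed in practice.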

The overall idea looks simple, but in practice it faces many challenges:

  1. UI elements such as rounded-corner icons (RadioButton, Switch) expose too few detectable feature points in their edge regions, resulting in a low matching success rate.
  2. In small-font areas, enough feature points cannot be detected at low resolution. Upscaling the resolution improves matching accuracy, but time consumption also rises steeply.
  3. Without an initial location, only full-image detection and brute-force matching are possible. The number of feature points to detect and store becomes too large, especially for complex, high-resolution images, making the performance and memory overhead unacceptable on mobile devices.
  4. Terminal mobile devices span dozens of screen resolutions, and the algorithm must adapt to all of them.
  5. On-device deployment constrains the package size, performance, and memory footprint of the algorithm library. OpenCV, for example, still occupies 10-15 MB even after careful tailoring and cannot be integrated directly into a production App.

After theoretical research and practical pilots, we finally adopted a traditional CV (Computer Vision) + AI solution: most scenes are handled by traditional corner-feature detection and matching, and unmatched cases fall through to detection and tracking by a deep learning network. Corresponding optimizations were also made for engineering deployment. The implementation of this solution is described in detail next.

Outline of the Image Matching Process

An image matching algorithm consists of information extraction and matching criteria. Depending on whether the two-dimensional structure of the information carrier is retained, matching algorithms divide into region-based matching and feature-based matching, as shown in the figure below:

Region-based matching uses the original image (or the image after a domain transform) as the carrier and selects the region with the minimum information difference as the match. This approach handles image deformation poorly and is sensitive to noise. Feature-based matching discards the image's two-dimensional structure and instead extracts texture, shape, color, and other features together with location information, then matches on those. Feature-based algorithms are more robust, match quickly, adapt well, and apply more broadly.

Image Matching Based on Traditional CV Features

Our application scenario is a typical Region of Interest (ROI) detection and location problem. Traditional CV offers many mature algorithms for different scenarios, such as contour features, connected regions, color features, and corner detection. A corner feature is a point whose brightness differs sharply from the surrounding pixels, and it is largely unaffected by rotation, scaling, shading, and similar changes. Classic corner detectors include SIFT, SURF, and ORB, and the field has been studied extensively. A 2017 comparative study by E. Karami et al. [5] (shown in the figure below) found that ORB is the fastest, while SIFT produces the best matches in most cases. ORB feature points concentrate in the central area of the image, whereas SIFT, SURF, and FAST points spread over the whole image. In Meituan's in-home scenarios, the target area may sit in the center, in the four corners, or anywhere else, so ORB has a high probability of failing to match target areas near the edges, which requires special handling.

Generally speaking, an effective feature detection and matching algorithm needs scale invariance, rotation invariance, and brightness invariance at the same time, so that it adapts to more scenarios with good robustness. Let's use ORB as an example to briefly walk through the algorithm (consult the references if you want more depth).

ORB = Oriented FAST + rotated BRIEF (hereafter oFAST and rBRIEF). It combines FAST detection with the BRIEF feature descriptor and improves both: oFAST adds orientation to FAST detection, and rBRIEF gives the BRIEF descriptor rotation invariance. FAST and BRIEF are both very cheap to compute, which is where ORB's significant performance advantage comes from.

To decide whether a pixel p is a FAST feature point, we only need to check, among the 16 pixels on the circle within its surrounding 7×7 neighborhood, whether N consecutive points differ from p in gray value by more than a threshold. There is also an accelerated variant: first test only the four compass points (up, down, left, right); if they already fail the corner condition, reject p immediately, and only otherwise examine the remaining 12 points. Since most pixels in an image are not feature points, this early rejection leaves the result, in deep-learning "alchemist" parlance, "basically unchanged" while greatly reducing computation time. Non-maximum suppression can then remove overlapping adjacent feature points.
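A minimal Python sketch of the FAST test with that compass-point early rejection (simplified for clarity; real implementations are heavily optimized and add non-maximum suppression on top):

```python
# Sketch of the FAST-12 corner test with 4-point early rejection.
# Assumes (x, y) is at least 3 pixels away from the image border.
import numpy as np

# Offsets of the 16 circle pixels (radius 3), clockwise from the top.
CIRCLE = [(0,-3),(1,-3),(2,-2),(3,-1),(3,0),(3,1),(2,2),(1,3),
          (0,3),(-1,3),(-2,2),(-3,1),(-3,0),(-3,-1),(-2,-2),(-1,-3)]

def is_fast_corner(img: np.ndarray, x: int, y: int,
                   t: int = 20, n: int = 12) -> bool:
    p = int(img[y, x])
    ring = [int(img[y + dy, x + dx]) for dx, dy in CIRCLE]
    # Early rejection: for n=12, at least 3 of the 4 compass points
    # (indices 0, 4, 8, 12) must be brighter than p+t or darker than p-t.
    compass = [ring[i] for i in (0, 4, 8, 12)]
    if sum(v > p + t for v in compass) < 3 and sum(v < p - t for v in compass) < 3:
        return False
    # Full test: n contiguous circle pixels all brighter or all darker than p
    # (the ring is doubled so runs can wrap around).
    for sign in (1, -1):
        run = 0
        for v in ring + ring:
            run = run + 1 if (v - p) * sign > t else 0
            if run >= n:
                return True
    return False
```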

The improved oFAST computes a direction vector for each feature point. Research shows that orienting each feature point along the vector connecting its geometric center to its intensity centroid works better than the histogram and MAX approaches.

The second step in ORB is computing feature descriptors using rBRIEF: each descriptor is a 128-512 bit vector containing only 0s and 1s. With feature points and descriptors in hand, feature matching can proceed. Many matching algorithms exist; to simplify computation we adopted the LPM [6] algorithm here. After the matched feature pairs are obtained, their enclosing rectangle is computed, and inverse-transforming back to the original image's coordinate system yields the target region's coordinates.
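A compact sketch of this detect-describe-match-box pipeline using stock OpenCV ORB, with a brute-force Hamming matcher and a ratio test standing in for LPM (the production system uses a custom C implementation):

```python
# Sketch: locate a target ROI inside a screenshot via ORB features.
# Uses stock OpenCV; the production pipeline customizes ORB and uses LPM.
import cv2
import numpy as np

screenshot = cv2.imread("fullscreen.png", cv2.IMREAD_GRAYSCALE)
target = cv2.imread("target_roi.png", cv2.IMREAD_GRAYSCALE)

# Cap feature counts separately for the small ROI image and the full
# screenshot, trading accuracy against speed (cf. the tuning notes below).
orb_small = cv2.ORB_create(nfeatures=1000)
orb_full = cv2.ORB_create(nfeatures=5000)
kp_t, des_t = orb_small.detectAndCompute(target, None)
kp_s, des_s = orb_full.detectAndCompute(screenshot, None)

# Binary descriptors are compared with Hamming distance; Lowe's ratio test
# filters ambiguous matches (production replaces this step with LPM).
matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
pairs = [p for p in matcher.knnMatch(des_t, des_s, k=2) if len(p) == 2]
good = [m for m, n in pairs if m.distance < 0.75 * n.distance]

if len(good) >= 4:
    pts = np.float32([kp_s[m.trainIdx].pt for m in good])
    x, y, w, h = cv2.boundingRect(pts)  # enclosing rectangle of matches
    print(f"target region ~ ({x}, {y}), {w}x{h}, {len(good)} matches")
else:
    print("too few matches; fall back to deep-learning tracking")
```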

Tests of the pure traditional-CV algorithm show that the number of feature points directly affects matching recall. Beyond 10,000 feature points, performance suffers seriously, especially on mobile devices, exceeding 1 second even on high-end models. We therefore configure different feature-point counts for the small target-area image and the full original image before matching, balancing performance against matching accuracy.

Measured feature points and matching results under different configuration parameters are shown in the figure below. For most image-and-text regions, more than 5,000 feature points give good matching results, though some regions still commonly fail. With more than 10,000 feature points, matching is satisfactory in most scenes apart from a few special cases. If no approximate initial location of the target region is provided, most regions require 10,000-20,000 feature points to match, and end-to-end performance becomes a problem.

Image Matching Based on Deep Learning

Given the drawbacks of traditional CV and the cases it cannot solve, we need an algorithm with stronger image-feature expression for matching. Deep learning has made major breakthroughs in recent years, including in image feature matching. In our scenario, the algorithm must quickly locate a sub-region's position within a full-screen screenshot; that is, a model must use the features of a local region to find its corresponding position within the global features. Object detection seems applicable at first glance, but general object detectors rely on the category/semantic information of the target, whereas we need to match on the target region's appearance features. We therefore adopted a detection-based image tracking algorithm: the target area is treated as the object to track, and we find it within the full-screen screenshot. Concretely, we use an algorithm similar to GlobalTrack [7]: first extract the target area's features, use them to modulate the features of the full-screen screenshot, and locate the target area from the modulated features. Given the limited compute on mobile devices, we designed a single-stage detector based on GlobalTrack to accelerate the process.
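A toy sketch of the core "query modulates search" idea, reduced to plain NumPy (shapes are illustrative; GlobalTrack itself uses learned convolutions inside a detector, and our single-stage variant is not public):

```python
# Toy sketch of query-guided feature modulation, the core idea behind
# GlobalTrack-style tracking: the target's pooled feature re-weights the
# full-screenshot feature map so the head focuses on look-alike regions.
# Shapes are illustrative; a real model uses learned conv layers.
import numpy as np

C, H, W = 64, 48, 27                    # backbone feature channels / size
search_feat = np.random.rand(C, H, W)   # features of the full screenshot
target_feat = np.random.rand(C, 6, 6)   # features of the target ROI

# Pool the target ROI into a single C-dim query vector (a 1x1 "kernel").
query = target_feat.mean(axis=(1, 2))            # (C,)

# Channel-wise modulation: emphasize screenshot channels that resemble
# the query (a Hadamard product, as in GlobalTrack's modulator).
modulated = search_feat * query[:, None, None]   # (C, H, W)

# Collapsing channels gives a correlation "heatmap": the highest response
# sits where local appearance best matches the target.
heatmap = modulated.sum(axis=0)                  # (H, W)
y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
print(f"coarse target location on the feature grid: ({x}, {y})")
```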

Because we use the target area's features directly to guide detection, the approach can handle more complex target areas, such as pure text, pure images or icons, and mixed text and images. Any element that can appear on the UI may be a target area, as shown in the figure below.

Combined with the business scenario, any local area of the App UI on a mobile device must be located precisely. As analyzed above, this can be framed either as a detection-and-matching problem or as a tracking problem. The algorithm must also adapt to different ROI contents, different screen resolutions, and different mobile devices.

The Solution We Chose

As mentioned above, we adopted a CV + AI solution with two advantages: it covers the scenes that traditional CV detection cannot, and it optimizes performance, reducing time consumption on mobile devices.

For engineering deployment, we implemented the detection and matching algorithm in pure C, made custom modifications to the ORB algorithm, and used multi-threading, NEON optimization, and other techniques to improve performance from about 800 milliseconds to around 100 milliseconds. The final version depends on neither OpenCV nor any other third-party library, greatly reducing the algorithm library's package size. The deep learning model runs on the MTNN on-device inference engine for optimal inference performance and accuracy. On mid- and high-end models, heterogeneous hardware acceleration lets CV and AI compute in parallel: feature detection runs on the CPU, model inference on the GPU or NPU, and the results are fused afterwards, improving performance and accuracy without increasing CPU load.

Ensuring Robust Task Execution

Sensing Task Execution Status

In a traditional scheme, task execution status can be obtained through function callbacks, broadcasts, component changes, and so on. With technology-stack independence, however, it is harder to sense failures in the guidance process and whether the user executed or clicked correctly. Error types must also be identified accurately, with retry schemes added for specific steps, to keep scripts executing smoothly as far as possible. In the rare blocking-error cases, the system should confirm, report the error, and exit guidance promptly to reduce the impact on users.

First, the more elegant "black box" scheme uses image similarity comparison, a relatively basic capability in visual intelligence. After jumping to the target page, a screenshot is compared against the target's features for rapid fault tolerance. From a large volume of offline test data, excluding some extreme cases, we found the following pattern across thresholds (sketched in code after the list):

  • Similarity above 80%: the target page can basically be confirmed; badges or image blocks that have not finished loading are what keep the score from going higher.
  • Similarity between 60% and 80%: caused by slight differences in list styles, background images, or banner images; fuzzy judgment can treat this as a hit (data is reported, but no exception is raised).
  • Similarity between 40% and 60%: most likely the module's UI has been modified or a popup partially covers it; retry policies are needed here and exceptions should be reported promptly.
  • Similarity below 40%: almost certainly a redirect to the wrong page; guidance can be terminated directly and an exception reported.
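A sketch of this threshold ladder as decision logic (the similarity model itself is stood in by a `compare` callback, since the actual capability is internal):

```python
# Sketch of the threshold-based page verification described above.
# compare() stands in for the internal similarity model.
from enum import Enum

class Verdict(Enum):
    HIT = "proceed"
    FUZZY_HIT = "proceed, report data only"
    RETRY = "retry with adjusted policy, report exception"
    ABORT = "terminate guidance, report exception"

def verify_page(screenshot, target_features, compare) -> Verdict:
    """Map an image-similarity score onto the four empirical bands."""
    score = compare(screenshot, target_features)  # 0.0 - 1.0
    if score > 0.8:
        return Verdict.HIT
    if score > 0.6:
        return Verdict.FUZZY_HIT
    if score > 0.4:
        return Verdict.RETRY
    return Verdict.ABORT
```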

Alongside image comparison, client-side decision rules assist the verdict, such as comparing the container route URL. When the image comparison misses but the route URL is correct, policy adjustment and retry logic kick in. Once the page is verified, the highlight-area search and subsequent drawing logic run. The final fallback is naturally timeout failure: for the complete decision process of one script key frame, we set a 5-second timeout policy.

On Scale and Rotation Invariance

For better robustness to scale, the computation first applies Gaussian blur to the image to remove noise, then downsamples it to build a multi-level image pyramid; feature detection runs on every level, and the union of all levels' feature points is output for subsequent matching. To cope with image rotation, rBRIEF is used: it samples pairs of pixels from the 31×31 neighborhood (called a patch) of a given feature point. The figure below shows random point pairs sampled from Gaussian distributions: the blue-square pixel is drawn from a Gaussian centered on the key point with standard deviation σ, and the yellow-square pixel, the second of the pair, is drawn from a Gaussian centered on the blue square with standard deviation σ/2. Experience shows this Gaussian sampling improves the feature matching rate; other sampling schemes exist, but we will not list them here. A rotation matrix is then built from the feature point's direction vector, and the N point pairs are rotated into alignment with the feature point's principal direction before the descriptor is computed from them. Because the descriptor's orientation tracks the feature point's, rBRIEF can recognize the same feature point in an image rotated at any angle.
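A small sketch of the scale-pyramid mechanism (OpenCV's ORB performs this internally via its `scaleFactor` and `nlevels` parameters; this version just makes the per-level detection and coordinate mapping explicit):

```python
# Sketch: multi-scale feature detection via an image pyramid, the mechanism
# ORB applies internally (cf. cv2.ORB_create(scaleFactor=1.2, nlevels=8)).
import cv2

def detect_multiscale(img, n_levels: int = 8, scale: float = 1.2):
    """Blur, repeatedly downsample, detect per level, merge all keypoints."""
    orb = cv2.ORB_create(nfeatures=500, nlevels=1)  # one level per call
    blurred = cv2.GaussianBlur(img, (5, 5), 0)      # suppress noise first
    keypoints, level_img, level_scale = [], blurred, 1.0
    for _ in range(n_levels):
        for kp in orb.detect(level_img, None):
            # Map coordinates back into the original image's system.
            kp.pt = (kp.pt[0] * level_scale, kp.pt[1] * level_scale)
            keypoints.append(kp)
        level_scale *= scale
        level_img = cv2.resize(level_img, None, fx=1 / scale, fy=1 / scale)
    return keypoints
```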

Other Fault-Tolerance Handling

For pages with multiple identical or similar elements, we must not arbitrarily pick just any matching area. So, when locating the target region, we additionally provide a reference region drawn from the information surrounding the target. At runtime, image information for both the target area and the reference area is supplied; when multiple target candidates are found, the reference area's location is queried, and the candidate closest to the reference area is chosen as the final target.
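A minimal sketch of this nearest-to-reference disambiguation (box format and names are illustrative):

```python
# Sketch of reference-area disambiguation: when several candidate regions
# match the target image, pick the one nearest the matched reference region.
import math

def center(box):
    x, y, w, h = box
    return (x + w / 2, y + h / 2)

def pick_target(candidates, reference_box):
    """candidates: list of (x, y, w, h) boxes; reference_box: the anchor."""
    ref = center(reference_box)
    return min(candidates, key=lambda b: math.dist(center(b), ref))

# Example: two identical "Edit" buttons; a nearby section title, recorded
# as the reference region, decides which one the script meant.
buttons = [(100, 200, 80, 40), (100, 900, 80, 40)]
section_title = (60, 860, 200, 30)
print(pick_target(buttons, section_title))   # -> (100, 900, 80, 40)
```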

Popups from different technology stacks appear at unpredictable times, easily covering the target area and derailing the whole guidance process, so all kinds of popups must be filtered and blocked. For the Native stack, we intercept the unified popup components and suppress popups during script execution; popups the business deems critical are whitelisted. For Flutter, the global didPush hook in NavigatorObserver intercepts and filters all Widget, Dialog, and Alert popups. On the Web, there are many popup owners and no unified popup specification, so popup characteristics are hard to capture; currently, JavaScript injected into the Web container hides components with popup characteristics and specified types, and the injected code is dynamically updatable for extensibility.

For pages whose complex elements take longer to load, delay-determination strategies are applied at playback time, based on the delayInfo field supplied by the recording side.

With the efforts above, the success rate of the script execution link (shown in the figure below) basically reaches 98%+. Scripts with low success rates can be drilled down by dimension to find the specific causes of failure.

Zero-Code Script Authoring and Editing

A script's life cycle has two stages: "production" and "consumption." Production covers recording a script and uploading it to the management console for editing; consumption covers delivery and playback. If the first two challenges centered on consumption, this one centers on production. Next we introduce "recording-side enablement" and "standard protocol design" in detail.

Recording-Side Enablement

The integrated recording SDK is constrained by screen size on mobile, which makes fine-grained creation impractical, so it is positioned to create and record the basic script skeleton.

During this process, the recording SDK first records the user's operations and basic page information. When the user uses the recording feature, the SDK synchronously captures the current page information and the corresponding audio input to form a key frame, and subsequent recording proceeds the same way. Once all input is complete, the generated key frames form a key-frame sequence, which is combined with some basic information into a script skeleton and uploaded to the server for fine-grained creation in the console.

Meanwhile, the recording SDK actively infers the user's intent to reduce editing input. Key frames are divided into two types according to whether they produce a page jump, and different paths are generated automatically for each type. When an operation causes a page jump, the SDK classifies the operation and automatically marks the voice input as the description of the next key frame, saving the operator work.
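A rough sketch of that inference rule (field names are hypothetical, not the real recording SDK's):

```python
# Sketch of the recording-side intent inference: if a recorded operation
# triggered a page jump, attach the narration to the *next* key frame.
# Field names are hypothetical, not the real recording SDK's.
from dataclasses import dataclass

@dataclass
class KeyFrame:
    page_id: str
    action: str          # e.g. "click", "input"
    caused_jump: bool    # did the action navigate to a new page?
    narration: str = ""  # voice recorded alongside this step

def assign_narrations(frames: list[KeyFrame]) -> None:
    """Shift narration forward when the operation navigated away, so the
    voice describes the page the user lands on, not the one just left."""
    for i, frame in enumerate(frames):
        if frame.caused_jump and i + 1 < len(frames) and frame.narration:
            frames[i + 1].narration = frames[i + 1].narration or frame.narration
            frame.narration = ""
```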

During recording, each page's opening time is also recorded as part of the key frame, serving as reference information to help script authors adjust the script's rhythm.

Standard Protocol Design

The standard protocol is the cornerstone of zero code, bridging the recording and editing processes.

There are currently dozens of operation-guidance scenarios in the App. By separating the transport model from the view model, we extracted the core fields and stripped the redundant ones. Under the premise of standardization and compatibility, the dozens of scenes are abstracted into four general event types, which simplifies key-frame orchestration and business-scenario coverage. For mindshare scripts, new branches keep spawning as users interact, eventually forming a complex, redundant binary-tree structure. In the protocol design we flatten the tree's nodes into a HashMap, and the connection between two key frames is identified by ID.
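A small sketch of such a flattened structure (field names illustrative, not the actual ASG protocol): instead of nesting branches, every key frame lives in one ID-keyed map and points to its successors by ID:

```python
# Sketch: flattening a branching "mindshare" script into an ID-keyed map.
# Each key frame stores successor IDs instead of nested children, so deep
# interaction trees stay flat and shared frames are not duplicated.
# Field names are illustrative, not the actual ASG protocol.
script = {
    "kf_promo_intro": {
        "say": "You haven't tried store promotion. Not sure how, or worried?",
        "choices": {"Don't know how": "kf_howto_1", "Worried": "kf_reassure_1"},
    },
    "kf_howto_1":    {"say": "Let's walk through it together.", "choices": {}},
    "kf_reassure_1": {"say": "Here is how similar stores did.", "choices": {}},
}

def play(frame_id: str, choose) -> None:
    """Walk the flattened graph: render a frame, follow the chosen edge."""
    while frame_id:
        frame = script[frame_id]
        print(frame["say"])
        frame_id = frame["choices"].get(choose(frame["choices"]))

# e.g. always take the first option until a leaf frame is reached:
play("kf_promo_intro", choose=lambda c: next(iter(c), None))
```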

When users use the App under certain guidance requirements, mindshare and operation scripts can alternate. For example, after a merchant (the user) opens the promotion page, a mindshare script appears: the "Little Pouch" mascot animates while a voice says, "Hello, boss. Little Pouch noticed you've had the store open for three months but haven't used the store promotion feature. Is it that you don't know how to operate it, or are you worried about the promotion effect?" Two buttons appear on screen: (1) don't know how to operate; (2) worried about the promotion effect. If the user taps (1), the flow switches to an "operation" script, so the protocol design must pay special attention to the hand-off between the two scripts. We refined the protocol here by splitting it into a basic-capability protocol and a display protocol; the two script types share the basic-capability protocol to prevent compatibility issues.

After the management console's editor engine parses the script protocol, it initializes the built-in logic and renders the script's event key frames. The editor engine implements subscription on top of an event mechanism: when one key frame triggers events such as insertion, editing, or reordering, all other key frames can subscribe to these core events for complete linkage. The edited script protocol is then published through Meituan's unified dynamic delivery platform, enabling gray release, full release, and patching of scripts. The editor has a complete built-in life cycle and exposes event hooks at each stage of operation, supporting good access and extension capabilities.
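A minimal sketch of that subscription mechanism (a generic event bus; the real editor engine's API is not public):

```python
# Sketch of the editor's event-subscription linkage: key frames subscribe
# to core editing events so that inserting/deleting/reordering one frame
# lets the others react (renumber, revalidate links, etc.).
from collections import defaultdict
from typing import Callable

class EditorBus:
    def __init__(self) -> None:
        self._subs: dict[str, list[Callable]] = defaultdict(list)

    def subscribe(self, event: str, handler: Callable) -> None:
        self._subs[event].append(handler)

    def emit(self, event: str, **payload) -> None:
        for handler in self._subs[event]:
            handler(**payload)

bus = EditorBus()
bus.subscribe("frame.inserted", lambda index: print(f"relink frames after #{index}"))
bus.subscribe("frame.reordered", lambda order: print("revalidate jump targets:", order))
bus.emit("frame.inserted", index=2)
```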

Stage Results

Capability Building

We abstracted scripts into the two standard styles shown in the figure above. Most scripts online are operation-guidance scripts, largely drawn from the backlogged tasks. We have iterated to a standardized, easy-to-access form: while a new module is in development and testing, product and operations can quickly orchestrate an operation-guidance script and launch it in sync with the feature. Guidance can also be configured for existing complex modules, hidden by default behind the "?" icon in the navigation bar and triggered when appropriate.

User mindshare building is not limited to conventional operation guidance. We also provide mindshare scripts (also called concept scripts) for "concept transfer" or "concept implantation" scenarios. In appropriate business contexts, the platform's systems and rules are delivered to users in a personified way, making the platform's concepts easier to accept and its operating rules easier to follow. For example, when merchants read bad reviews, an emotional script can run (its gist: bad reviews are common, roughly one appears every XX orders, don't worry too much, and the platform has fair bad-review protection and appeal rules). If a merchant violates operating rules, a serious, concept-reinforcing script can run (its gist: the platform is fair and has multiple inspection measures; do not try to get away with uploading false materials in an appeal).

It is worth mentioning that the image-feature localization and alpha-channel video animation capabilities produced along the way also build up our technical reserves and can be reused in other scenarios. We have also filed two national invention patent applications for the core technologies above.

Results on Some Business Lines

  • The new-store growth plan was the first major demand for scripted guidance, and ASG supported its smooth launch with very positive results: ASG carried 78.1% of the project's guidance playback volume, at a development cost below 0.5 person-days per script. The composite metric "merchant task completion" rose from 18% to 35.7% over the same observation period, and other process metrics also improved to varying degrees.
  • Value exchange: mindshare guidance helps merchants create exchange activities in the optimal way. Process data suggests the visit rate rose from 4% to 5.5% and the order penetration rate of active merchants from 2.95% to 4%, both gains of about 35%.
  • Delivery-information task guidance: optimizes the action-point guidance across the delivery-information flow, avoids blocking merchant behavior, reduces users' operation and comprehension costs, improves merchant satisfaction during onboarding, and raises recognition of the delivery service.
  • …

Since launching in November 2021, ASG has supported new-store, events, marketing, advertising, and other businesses, landing in 20+ Meituan business scenarios. Overall, compared with the traditional guidance solution, ASG scripted guidance improves the effect by about 20% at about 1/10 of the previous cost. Plugging into the earlier formula, the improvement multiple is x = (1 / (1 - 90%)) × (1 + 20%) = 12.

The cost reduction is thus far more pronounced than the effect improvement, which is why this article focuses more on the former. On effect improvement, we currently use only basic combinations of client-side capabilities, and we are not worried: there is still much to explore in cutting-edge industry techniques, and we will follow up gradually to make scripts more empathetic and immersive.

Summary and Outlook

This article introduced the Meituan takeaway terminal team's exploration and practice in user mindshare building. Starting from the business status quo and scripted thinking, it discussed the one-stop design spanning terminal and management console that lowers the threshold of script access, and then the key roles traditional CV and deep learning play in script execution. As a whole, the project is a bold attempt at expanding terminal capabilities: we learned to understand the business perspective and empower non-technical colleagues through open cross-team collaboration.

The results so far validate our direction. Next, we will continue deepening along the two axes of "lower production cost" and "higher application effect" (for example, script composability and ease of use, update-cost optimization, combining trigger timing with a rule engine and intent inference, and collapse-and-reawaken logic) to support more similar scenarios. We are also glad to see that the terminal's "container independent" leverage is obvious, with plenty of room left to play. You are welcome to discuss and exchange ideas with us.

About the Authors

Song Tao, Shang Xian, Cheng Hao, Zhang Xue, Qing Bin, et al., from the Meituan In-Home R&D Platform / Delivery Technology Department; Xiaobin, Minqin, Debang, et al., from the Meituan Basic R&D Platform / Visual Intelligence Department.

References

  • [1] App Annie. "2022 Mobile Market Report."
  • [2] HBR. "When Low-Code/No-Code Development Works and When It Doesn't."
  • [3] Google. "Compression Techniques."
  • [4] Apple. "Quartz 2D Programming Guide."
  • [5] E. Karami, S. Prasad, M. Shehata. "Image Matching Using SIFT, SURF, BRIEF and ORB: Performance Comparison for Distorted Images." Newfoundland Electrical and Computer Engineering Conference, St. John's, Canada, October 2017.
  • [6] Jiayi Ma, Ji Zhao, Junjun Jiang, Huabing Zhou, Xiaojie Guo. "Locality Preserving Matching." International Journal of Computer Vision, 127(5), pp. 512-531, May 2019.
  • [7] Lianghua Huang, et al. "GlobalTrack: A Simple and Strong Baseline for Long-Term Tracking." Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 11037-11044.


This article was produced by the Meituan technical team, and its copyright belongs to Meituan. You are welcome to reprint or use the content for non-commercial purposes such as sharing and communication, provided you credit "Content reprinted from the Meituan technical team." This article may not be reproduced or used commercially without permission. For any commercial use, please email [email protected] to request authorization.