Introduction: Ask operations and maintenance (O&M) engineers to vote for their five most hectic support scenarios, and promotional campaigns are sure to make the list. Every sales season means sleepless nights: massive content updates, a surge of customers, and heavy data reads and writes. Various technical solutions and tools exist to keep a promotion running smoothly, yet complaints still arrive from users around the country, such as “the product picture cannot be loaded”, “the page opens slowly”, and “the order payment cannot be completed”. When poor user experience and website performance lead to low user conversion and slow business growth, the O&M engineers end up as the “expected” villains.

Bai Yu

To understand the issue of “user experience and website performance”, we interviewed many enterprise O&M engineers and independent webmasters, and found that their concerns centered on the following aspects:

(1) Performance and experience problems caused by “the gap between product and user experience”

As the Internet dividend fades, product features and user experience design become increasingly inward-focused. A gap opens between the functional logic designed into the product and how users understand it in practice. Large numbers of flash-sale (seckill) events, promotions, and UGC content make the product logic more complicated. Even when guidance and explanatory documents are provided, users still need time to understand the product and form usage habits. At the same time, to further enrich functional modules, rich media, third-party components, and customer advertisements are constantly added. Excessive and poorly planned external content increases system load and drags down product performance. Wanting everything at once ultimately comes at the price of sacrificing some site performance and user experience.

(2) Performance and experience problems brought by “complex network environment”

As is well known, the country is served by a wide variety of tier-1 and tier-2 operators, which significantly increases the complexity of the network environment. Operator infrastructure is slow to be upgraded, sudden man-made incidents occur, and, worse still, such problems can affect even regular IDCs. In these cases all an enterprise can do is reassure users and wait for repairs; resolving the problems is largely out of its hands. At the same time, the access network is complicated by wide geographical coverage, scattered users, and individualized access modes, so enterprises cannot effectively estimate the environment their users are in. Even with widely distributed data centers and multi-line BGP access, resources are still insufficient to solve network environment problems, which makes the network harder to optimize and the actual user experience harder to predict.

(3) Performance and experience problems caused by “widely differing PC environments”

China has the largest number of netizens in the world, and behind that massive user base lie huge differences in client hardware. Some people watch 4K live video on Bilibili with an i9-11900K and an RTX 3080 Ti; others read text news on portal sites with a turn-of-the-millennium Pentium 4 and integrated graphics. Differences in browser versions, rendering mechanisms, and local machine performance therefore produce differences in user experience, such as access anomalies, slow speeds, and heavy local resource consumption. Understanding the actual experience of most users, weighing or evaluating these differences, and deciding among the trade-offs has become a difficult problem that every website's O&M and R&D teams must face.

(4) System availability problems caused by “the aftereffects of pursuing iteration speed”

Because competition on the Internet is fierce, products forced to choose between hitting a feature launch window and fine-tuning have to selectively neglect architecture and stability. A loosely designed architecture, together with business growth that outpaces what the architecture can support, leads to system overload, crashes, response timeouts, and other problems. Several factors contribute:

First, the business iterates very quickly, so intrusive monitoring cannot be implemented in a short time, yet failures of the business system still need to be detected quickly.

Second, development resources are limited or poorly coordinated, infrastructure-level monitoring cannot directly reflect business problems, and the cost of implementing application monitoring is too high.

Finally, the application calls third-party APIs whose availability cannot be guaranteed. If one of them fails, the application cannot respond to and handle the fault in time.

Taken individually, each of these may look like a single-point problem, but when a chain reaction occurs in the business, they are magnified and directly affect the user experience.

(5) Passive response to customer complaints caused by “the lack of user-perspective monitoring”

Product features are tested as they go live, and the operations team continuously monitors usage. Even so, handling system problems only after customers complain leaves the O&M team in a very passive position. It may take a full day to reproduce an anomaly and locate the problem, which seriously affects the NPS (Net Promoter Score). Common monitoring methods mostly take the operator's own perspective and cannot directly reflect users’ problems.

So, with so many factors in play, how should we test our website from the perspective of real users, quantify its user experience, and locate its performance bottlenecks? Here we take marketing campaigns in the e-commerce industry as an example. With increasingly fierce competition, promotions such as Singles’ Day and 618 have become important annual marketing events for e-commerce and other transaction-oriented industries. However, a large influx of users in a short period can cause slow page loads, service delays, and other problems that affect user experience.

Specific issues include:

Before launch, it is impossible to simulate real users and test the actual product experience under highly concurrent access at peak load.

The actual browsing paths of users are not accurately assessed, so the bottleneck in the conversion path cannot be located and it is unclear what to optimize.

During the promotion, product information is updated frequently. After updates, complaints often arrive from users around the country, such as “the product picture cannot be loaded” and “the page opens slowly”.

The campaign performance of competing products cannot be obtained, so their marketing trends cannot be understood.

In the past, these problems have been difficult to solve for the following reasons:

Although methods such as task walls exist, the O&M team cannot find enough real traffic that matches actual needs for user experience testing, and purchasing such traffic is time-consuming and expensive.

Launch windows for marketing promotions are usually very tight, leaving the R&D team limited delivery time. Adding intrusive probes for monitoring would slow down delivery and might affect product stability.

The O&M team cannot proactively run the relevant tests, so problems are discovered only through actual user experience and can only be troubleshot passively. Reproducing problems and locating faults can then tie up the entire O&M team, delaying repairs indefinitely.

Therefore, O&M teams need a product or solution that can solve the problems above. As a business-oriented, non-intrusive, cloud-native monitoring product, cloud dial testing has become the best choice. Through Aliyun’s worldwide service network, it simulates real user behavior and continuously monitors the availability and performance of websites and their networks, services, and API endpoints around the clock. It locates problems at fine granularity: page-element level, network-request level, and network-link level. A rich set of monitoring items and analysis models helps enterprises find and locate performance bottlenecks and blind spots in user experience in time, reduce operating risks, and improve service experience and efficiency.
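The probing idea described above can be illustrated with a minimal sketch, assuming a single HTTP check issued from one machine. It is not the product's implementation or API: the `ProbeResult` type, the `probe` function, and the injectable `opener` parameter are hypothetical names introduced here, and only the Python standard library is used.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    """Outcome of one probe (hypothetical structure for illustration)."""
    url: str
    ok: bool
    status: Optional[int]
    elapsed_ms: float
    error: Optional[str] = None


def probe(url: str, timeout: float = 5.0,
          opener: Callable = urllib.request.urlopen) -> ProbeResult:
    """Fetch `url` once and record availability and total response time."""
    start = time.perf_counter()
    try:
        with opener(url, timeout=timeout) as resp:
            resp.read()  # read the body so the timing covers the full response
            status = resp.status
        elapsed = (time.perf_counter() - start) * 1000
        return ProbeResult(url, 200 <= status < 400, status, elapsed)
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        elapsed = (time.perf_counter() - start) * 1000
        return ProbeResult(url, False, exc.code, elapsed, str(exc))
    except OSError as exc:
        # DNS failure, refused connection, timeout, and similar network errors.
        elapsed = (time.perf_counter() - start) * 1000
        return ProbeResult(url, False, None, elapsed, str(exc))
```

Calling `probe("https://example.com/")` on a schedule from machines in different regions and on different operators would approximate, very roughly, what a dial-testing node does; a real service also measures page elements and network links, which this sketch does not attempt.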

(1) Global monitoring node coverage

With more than 200,000 LM (last-mile) nodes worldwide, more than 500 IDC monitoring nodes, 400+ operators at home and abroad, and hundreds of thousands of registered members, the monitoring scale keeps pace with ever-growing business scale.

(2) No embedded code, works out of the box

Monitoring is zero-intrusion: just enter the URL and perform simple configuration, with no R&D support required. A complete site performance analysis report is available within minutes. Multiple purchase modes, including resource packs and pay-as-you-go, meet the needs of O&M testing.

(3) Business-oriented, with a variety of preset analysis models

The monitoring period is refined to the minute level, and more than 20 monitoring-related parameter settings in 7 categories are supported. Multiple mainstream protocols are supported, and fine-grained real-time fault monitoring, alarming, and performance analysis are provided 7×24 for sites and service ports. From the end customer's perspective, multi-dimensional combined analysis (by region, operator, and so on), drill-down into the details of a single sample, and a rich set of metrics and chart types make it possible to intuitively locate problems, their affected scope, and their root causes, reducing analysis time and improving O&M efficiency. The result is truly fine-grained monitoring.
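As a rough illustration of the multi-dimensional analysis described above, the sketch below groups probe samples by region and operator and computes availability and mean latency per group. The sample fields (`region`, `operator`, `ok`, `elapsed_ms`) and the function names are assumptions made for this example, not the product's data model.

```python
from collections import defaultdict
from statistics import mean


def summarize(samples):
    """Group samples by (region, operator) and compute per-group metrics."""
    groups = defaultdict(list)
    for sample in samples:
        groups[(sample["region"], sample["operator"])].append(sample)

    summary = {}
    for key, rows in groups.items():
        ok_rows = [r for r in rows if r["ok"]]
        summary[key] = {
            "samples": len(rows),
            "availability": len(ok_rows) / len(rows),
            # Mean latency over successful samples only; None if all failed.
            "mean_ms": mean(r["elapsed_ms"] for r in ok_rows) if ok_rows else None,
        }
    return summary


def worst_groups(summary, n=3):
    """Return the n groups with the lowest availability, worst first."""
    return sorted(summary.items(), key=lambda item: item[1]["availability"])[:n]
```

Sorting groups by availability is one simple way to narrow a nationwide complaint down to a particular region and operator before drilling into individual samples.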

(4) Intelligent alarm and accurate positioning

Real-time alarms cover first-screen time, overall performance, and availability, with rich alarm policy settings and deep integration with the Alibaba Cloud Alarm Center to effectively shorten MTTR. Errors can be detected at the page-element level, and faults can be attributed precisely to a single network request, improving the efficiency of problem location.
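One common way to express an alarm policy of this kind is a threshold that must be breached several times in a row before an alert fires, which limits noise from isolated slow samples. The sketch below assumes that approach; `AlertRule`, its fields, and the example thresholds are hypothetical and do not reflect the product's actual policy settings.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alarm rule for one metric."""
    metric: str
    threshold: float
    above: bool = True    # True: breach when value > threshold; False: when value < threshold
    consecutive: int = 3  # number of consecutive breaches required to alert


def breached(rule: AlertRule, value: float) -> bool:
    """Check a single value against the rule's threshold."""
    return value > rule.threshold if rule.above else value < rule.threshold


def should_alert(rule: AlertRule, recent: Sequence[float]) -> bool:
    """Alert only if the most recent `consecutive` values all breach the rule."""
    if len(recent) < rule.consecutive:
        return False
    return all(breached(rule, v) for v in recent[-rule.consecutive:])
```

For example, `AlertRule("first_screen_ms", 3000)` would fire after three consecutive first-screen times above 3 seconds, while `AlertRule("availability", 0.99, above=False, consecutive=2)` would fire after two consecutive availability readings below 99%.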

Take the marketing promotion of an e-commerce enterprise as an example. The website has more than one million monthly active users, mainly distributed across third-, fourth-, and fifth-tier cities in China, and its annual O&M expenditure exceeds 2 million yuan. However, because product information was updated frequently during promotions, complaints such as “product pictures cannot be loaded” and “the page opens slowly” were frequently received from users around the country after updates, resulting in low user conversion and complaints against the O&M team.

Faced with this dilemma, we used cloud dial testing to solve the problem and further optimized website performance to support the business.

(1) Stress test

Before a marketing campaign or the launch of a new system, cloud dial testing is used to select monitoring points across operators in different cities nationwide, set up browsing and network tasks, and immediately obtain front-line data on real user access experience. This accurately locates problematic page elements and helps the technical team fix problems in time. By simulating highly concurrent access from peak users and increasing the peak load, we can observe how the major performance indicators change and uncover performance bottlenecks.
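The idea of raising concurrency step by step and watching the indicators can be sketched locally as follows. This is a simplified illustration using a thread pool on one machine, not a distributed test from real operator networks; `request_fn` stands for any caller-supplied function that performs one request, and all names are assumptions made for this example.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(request_fn):
    """Run one request; return (succeeded, elapsed milliseconds)."""
    start = time.perf_counter()
    try:
        request_fn()
        ok = True
    except Exception:
        ok = False
    return ok, (time.perf_counter() - start) * 1000


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def run_load(request_fn, total=100, concurrency=10):
    """Issue `total` requests with at most `concurrency` in flight at once."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: timed_call(request_fn), range(total)))
    latencies = [ms for _, ms in results]
    errors = sum(1 for ok, _ in results if not ok)
    return {
        "requests": total,
        "errors": errors,
        "error_rate": errors / total,
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
    }


def ramp(request_fn, levels=(5, 10, 20), per_level=50):
    """Repeat the load at increasing concurrency levels to expose bottlenecks."""
    return {c: run_load(request_fn, total=per_level, concurrency=c) for c in levels}
```

Comparing `p95_ms` and `error_rate` across the levels returned by `ramp` shows roughly where latency starts to climb or errors appear, which is the kind of change in indicators the stress test is meant to reveal.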

(2) Optimization of user experience

With first-screen monitoring and real-time monitoring, problems can be verified and reproduced immediately, and website performance can be evaluated and optimized. Through transaction flow analysis, we can understand the user’s real experience, optimize the browsing path, uncover bottlenecks, and improve the conversion rate.
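Finding the bottleneck in a browsing path usually comes down to comparing how many users reach each consecutive step. The following is a minimal sketch of that calculation; the step names and counts in the usage note are invented for illustration and are not data from the case described here.

```python
def step_conversion(funnel):
    """Compute the conversion rate between consecutive steps.

    `funnel` is an ordered list of (step_name, users_reaching_step) pairs.
    """
    rates = []
    for (prev_name, prev_count), (name, count) in zip(funnel, funnel[1:]):
        rate = count / prev_count if prev_count else 0.0
        rates.append((f"{prev_name} -> {name}", rate))
    return rates


def bottleneck(funnel):
    """Return the transition with the lowest conversion rate."""
    return min(step_conversion(funnel), key=lambda item: item[1])
```

With a hypothetical funnel of `[("home", 1000), ("product", 600), ("cart", 150), ("pay", 120)]`, the transition with the lowest conversion is `product -> cart`, so that step would be the first candidate for optimization.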

(3) Competitive product analysis and iteration

With the help of the zero-intrusion feature, we collect and analyze the performance of competing products’ marketing campaigns in the same industry, understand changes in their marketing trends and countermeasures, and make targeted IT investments and optimization iterations to make up for marketing shortcomings and consolidate the leading position.

Through these measures, website performance improved greatly, and the quantitative indicators related to user experience improved by more than 30%, effectively driving business growth. Beyond the scenarios above, cloud dial testing can also be applied to network interface and service availability monitoring, CDN service monitoring and selection, DNS resolution status, hijacking analysis, and many other scenarios.


This article is the original content of Aliyun and shall not be reproduced without permission.