
In this article, the author describes what an ideal BI data system would look like. Does it match the one in your mind? Enjoy~

In daily work, both C-end and B-end products eventually face the same problem: once the business grows to a certain scale, for reasons such as security or increasingly complex business scenarios, the third-party data analysis platforms on the market (or the company's own early tools) can no longer meet the needs of the business. At that point, it is time to build a powerful in-house BI tool to support rapid business growth.

A good tool liberates productivity and greatly improves work efficiency. Operators and analysts equipped with this buff will show you what the strongest fighting force on the planet looks like.

Even without KPI pressure, just imagine: when the operations colleagues are amazed by the excellent BI system you designed, will you, as a single-dog product manager, finally get a chance to... well~

Before designing the product, let's take a moment to talk about the stages data goes through in its life cycle, from generation to death (archiving).

Figure: image sourced from the Internet

Data collection

Data sources fall into two parts: internal data and external data. For B-end businesses, internal data is mostly business data, such as warehouse and GMV breakdown data in a retail supply chain, or transaction flows and risk-control model data in finance (risk-control models also contain user behavior data). For C-end businesses or products, besides traffic data and business data, there is also user behavior data, and so on. Collecting such internal data is relatively straightforward: it can be done through conventional C/S or B/S interface interactions. For C-end products, collecting user behavior data is the more important task.

Take WeChat as an example: from early behaviors such as installation, startup, and login/registration, to later behaviors such as chatting, reading official accounts, and browsing Moments, the event lines of different processes share some tracking points.

When a user wakes up the app, recording the series of operation nodes with sequence numbers and timestamps captures the user behavior log completely. The technical details of such logging will not be explored in depth here; they depend on the technical solution the R&D team chooses. Suppose a log message is created for this behavior group.

Setting aside data encryption for now, this string of logs means: at 13:40:03, the user with ID 498161 tapped the push notification with ID 164618, actively waking up the app; the behavior-group record generated at this moment has ID 341325. After waking up, the user opened the app and, following the parameters carried by the push, entered the WeChat tab page, i.e. the home page. After staying there for about 4 seconds, the user performed a drag from (x:334, y:130) to (x:560, y:363). Then, after pausing for about 2 seconds, the user tapped the coordinate (x:634, y:30), whose target was the conversation with ID 76316. After about 2 more seconds, the user tapped the input box on the page; about 10 seconds later the input box lost focus, its content being '\u8001\u54e5\u53ef\u4ee5\u7684' ("nice one, bro"), and the send button was pressed at the same time. At this point the whole process is complete: the app received a friend's message push and reminded the user, the user responded, and a series of operations completed the reply.
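To make the log concrete, here is a minimal sketch of what such a behavior-group record might look like as structured events. The schema and field names (seq, ts, event, payload) are the author's illustrative assumptions, not WeChat's actual log format.

```python
import json

# Hypothetical event schema: field names are illustrative, not WeChat's real format.
def make_event(seq, ts, event, **payload):
    return {"seq": seq, "ts": ts, "event": event, "payload": payload}

behavior_group = {
    "group_id": 341325,
    "user_id": 498161,
    "events": [
        make_event(1, "13:40:03", "push_click", push_id=164618),
        make_event(2, "13:40:03", "app_wake", target_tab="wechat"),
        make_event(3, "13:40:07", "drag", frm=[334, 130], to=[560, 363]),
        make_event(4, "13:40:09", "tap", x=634, y=30, dialog_id=76316),
        make_event(5, "13:40:11", "input_focus", field="message_box"),
        make_event(6, "13:40:21", "input_blur", text="\u8001\u54e5\u53ef\u4ee5\u7684"),
        make_event(7, "13:40:21", "send", dialog_id=76316),
    ],
}

# One line of JSON per behavior group keeps the log easy to ship and parse.
log_line = json.dumps(behavior_group, ensure_ascii=False)
```

A real client SDK would append such lines to a local buffer and upload them in batches, but the one-group-one-record shape is the key idea.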

So what does this set of information mean?

One of the new features in WeChat 5.3.1 is the ability to recall a message within two minutes of sending it.

Figure: new features in WeChat 5.3.1

Regarding this feature, the author did some quick back-of-the-envelope thinking:

  1. Consider only user needs, not business needs.
  2. This operation is mostly demanded by the person recalling the message.
  3. Assume it takes average users 10 to 30 minutes to receive and see a message, and that 60% of users have read a message within the first two minutes.
  4. Research on users who want to recall messages shows the motives are confidentiality, mistakes, regret, or other reasons, and the acceptable "regret window" is about 90s. For fault tolerance, this window should be extended somewhat.
  5. Out of consideration for the recipient's experience, the recall window should be kept short, to avoid the discomfort of seeing a message recalled after it has already been read. The window is therefore set at 120s.

The values above are all guesses. Before designing the "recall" function, one should research and analyze user behavior through data: quantifiable data provides a measurement standard for the function design, and after the function ships, data feedback is used to revise the design.
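As a sketch of that data-driven approach: given a sample of message read delays, one could derive the recall window from a coverage percentile instead of a guess. The sample numbers, the 60% coverage target, and the fault-tolerance factor below are all fabricated for illustration.

```python
import math

def recall_window(read_delays, covered_share=0.6, slack=1.3):
    """Smallest window (seconds) within which `covered_share` of messages
    have been read, stretched by a fault-tolerance factor `slack`."""
    ordered = sorted(read_delays)
    idx = max(0, int(len(ordered) * covered_share) - 1)
    return math.ceil(ordered[idx] * slack)

# Fabricated read delays in seconds; a real analysis would pull from behavior logs.
sample = [5, 12, 30, 45, 60, 75, 90, 300, 900, 1800]
window = recall_window(sample)  # 60% of the sample is read within 75 s -> 98 s
```

The point is not the exact formula but that the 120s in the list above becomes a measured, revisable quantity.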

For concision, the log above shows only behavior logs; it omits the client- and server-side response and monitoring logs around these behaviors. From the moment the app is woken up, data flows between server and app: the first request received carries terminal information such as device type and geographic location, and the wake-up notification generally also contains the channel mark of whatever aroused the app, among other information. Keystrokes, meanwhile, are usually not collected except by input-method developers: they have little value for most products, and system permissions may block their collection anyway.

As long as such logs are collected at as fine a granularity as possible, problems in technical support, operation strategy, or product design can be accurately located. The author has seen web applications that report every drag or click back to the server. This high-frequency transmission guarantees data integrity, but creates constant pressure on the server under high concurrency. The product manager can raise these requirements and concerns at the requirements review meeting, so that the team knows about this business scenario in advance and can provide solutions. Some technical options will be outlined below when we talk about data processing.

Internal data is your own: as long as permissions allow, you can create, read, update, and delete it. Most external data, however, is not public; it must be purchased, and some cannot be bought anywhere. External data also comes in heterogeneous structures, which can cause trouble when classifying it. For data that can be purchased, cleansing is usually required.

The year 2015 is known as the first year of Internet consumer finance, and related industries boomed. Banks, e-commerce giants, and traditional companies all entered the market, and these giants had enough data to support fairly comprehensive risk-control systems. New Internet teams, however, lacked the data sources to build their own, and had to buy data from companies on the market "specializing" in big-data services. Yet many of these were second- or third-hand resellers who had never touched first-hand data; middlemen watered the data down at every layer for profit, and some big players repeatedly repackaged and resold data to interfere with the market. By the time the data reached the buying team, the share that could be used effectively, amid all the garbage, might be less than ten percent, and the cost of cleaning it would be enormous. Worse, iterating a user risk-control model requires at least one full consumption-repayment cycle, a time cost the average team cannot afford.

For front-end data that cannot be bought but is visible, crawlers can be used to collect it. Crawlers have a long history: nearly thirty years have passed since the birth of the first search engine. There is no definitive account of the earliest crawler; all that can be said is that it dates back to the early Web era. Today, crawler technology and its communities flourish: from the recently popular Python to "the best language" PHP, and even C, almost any language can be used to write one. The general principle is to construct requests for the relevant content and send them to the target server. Data is a sensitive matter, and some numerical data is the lifeblood of a company, so to curb crawlers, companies generally design anti-crawler mechanisms. Conventional ones include server-side checks on the request's UA and Referer headers, cookie and CAPTCHA verification, limits on the number of accesses per IP, and CDN-based identification. There are also front-end obstacles, such as loading data dynamically with JS and other tricks that make pages harder to parse.
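To illustrate the conventional server-side checks just listed, here is a toy request filter. The header names are standard HTTP, but the token blacklist, the rate limit, and the expected Referer are made up for the example:

```python
from collections import Counter

BOT_UA_TOKENS = ("python-requests", "curl", "scrapy", "bot")  # illustrative list
MAX_HITS_PER_IP = 100  # made-up rate limit

hits = Counter()

def allow_request(headers, ip, expected_referer="example.com"):
    """Toy anti-crawler gate: UA blacklist, Referer check, per-IP rate limit."""
    ua = headers.get("User-Agent", "").lower()
    if not ua or any(tok in ua for tok in BOT_UA_TOKENS):
        return False  # empty or bot-like User-Agent
    if expected_referer not in headers.get("Referer", ""):
        return False  # request did not come through our own pages
    hits[ip] += 1
    return hits[ip] <= MAX_HITS_PER_IP

ok = allow_request(
    {"User-Agent": "Mozilla/5.0", "Referer": "https://example.com/list"},
    ip="1.2.3.4",
)
blocked = allow_request({"User-Agent": "python-requests/2.31"}, ip="5.6.7.8")
```

Of course, every one of these checks can be spoofed by a determined crawler, which is why the behavior-model approaches described next exist.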

Some mature approaches combine front end and back end: for example, using the user behavior records mentioned above and a user-model algorithm to identify crawler-like patterns and block them. That said, most sites are fairly tolerant of crawlers, or rather there is no foolproof way to distinguish real users from crawlers, because the error rate has always been an insurmountable obstacle. As for how to deal with anti-crawler mechanisms, interested readers can take a detour to the official account of Ctrip's technology center; as the target site of countless crawler tutorials, Ctrip has real experience in anti-crawling.

Figure: front-end anti-crawler strategy, from an official account

Data processing

Generally speaking, the storage and processing of data is not within the scope of the product role. However, the product manager needs to clarify the requirements for data invocation so that DB developers can design a reasonable technical architecture. Based on its life cycle, or how often it is invoked, data can be divided into four levels: active, dormant, silent, and archived. The order cycle of e-commerce serves as the model below.

  • Active data: Active data here refers to data that, per requirements, must be stored for the long term, excluding cached data that expires shortly. After a user confirms and submits an order, the order data is accessed at high frequency during this period: users, merchants, the platform, logistics, and other roles all need to view it to see the order through to completion.
  • Dormant data: Dormant data is accessed at a moderate frequency. Once the order has gone through its normal cycle and the return or exchange window has passed, it may still be invoked for warranty claims, statistical analysis, and similar situations.
  • Silent data: This type of data is accessed relatively infrequently. After routine processing and storage, the content that needs to be called over the long term has already been extracted into other tables as active data; the source data then enters the silent stage and is called only during periodic statistics, e.g. when a quarterly merchandise-sales analysis needs to drill into the order's details.
  • Archived data: Archived data refers to source data that has been processed and extracted. To ensure integrity, archived source data is not only backed up, but backed up at each stage of the data life cycle. Another purpose of archiving is to compress data structures so that core data can respond quickly.
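The four tiers above can be sketched as a simple classification on an order's age. The day thresholds are invented for illustration; a real system would set them from requirements:

```python
from datetime import date

# Invented thresholds for the four tiers described above.
ACTIVE_DAYS, DORMANT_DAYS, SILENT_DAYS = 15, 90, 365

def order_tier(order_date, today=None):
    """Map an order's age to active / dormant / silent / archived."""
    today = today or date.today()
    age = (today - order_date).days
    if age <= ACTIVE_DAYS:
        return "active"    # order still in fulfilment, high-frequency access
    if age <= DORMANT_DAYS:
        return "dormant"   # past the return window, occasional warranty lookups
    if age <= SILENT_DAYS:
        return "silent"    # touched only by periodic statistics
    return "archived"      # compressed, backed-up source data

today = date(2017, 11, 1)
tier = order_tier(date(2017, 10, 25), today)  # 7 days old -> "active"
```

In practice, access frequency rather than pure age may drive the tiering, which is exactly the clarification the product manager owes the DB developers.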

In everyday work, the hierarchy of data may not be so clear-cut; it depends on the requirements.

Data applications

Having talked about how data is generated and archived, I believe readers already have an overall picture of how to build this BI data system. Below, the author describes what an ideal BI data system looks like in his mind.

Figure: Product architecture

In the data-operations background, users can choose between GUI and SQL operations. All the features described below refer only to the GUI mode, not the SQL mode. There are several reasons to retain SQL functionality: just as Bash is far less accident-prone than GUI interfaces, pure SQL operations are more stable than GUI operations, and data developers prefer writing SQL. This insurance design also covers for the fact that, during product iteration, some rarely used functions may simply be missing from the GUI.

The following describes the functions of each module according to the operation process.

Data management

In the production environment, business data, traffic data, and other metadata have already been produced and stored in databases. This part of the data is called the "source", and the corresponding page is called "data management". The background should hold read-only (select) permission on data sources: if data were lost or mis-operated, the blame would intuitively fall on the operator; digging deeper, it would be a lack of technical specification; but ultimately it would be a defect in product design. Therefore, permission on sources should be restricted to read-only.

Users (operators, data analysts, and data engineers are collectively called users) can read metadata from tables in the source library and store the refined results in the database; this refined data is called a "dataset". Much data needs to be processed daily or on other cycles. For example, to avoid competing with concurrent server sessions, statistical jobs mostly run in the early hours of the morning or at other idle times. To save users' energy and prevent omissions, scheduled periodic tasks are very meaningful; however, a single task cannot meet the processing requirements of complex data. Borrowing the design of MySQL's event scheduler and triggers, tasks can be chained together and executed in sequence under specified conditions, forming a task loop that processes data structure and data content in order.

For the database tables generated by dataset operations, full create, read, update, and delete permission defaults to the creating user at initialization; if needed, this can be adjusted later in permission management.
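The task-loop idea borrowed from MySQL's event scheduler and triggers can be sketched as a chain in which each step runs only if its predecessor succeeded. The task names, the context dict, and the toy extract/aggregate/load steps are the author's illustration, not a real scheduler API:

```python
# Minimal task chain: each step runs only if the previous one succeeded,
# mimicking trigger-style sequencing. All names are illustrative.
def run_chain(tasks, context=None):
    context = context or {}
    for name, task in tasks:
        try:
            context = task(context)
        except Exception as exc:
            context["failed_at"] = name  # stop the chain, record where it broke
            context["error"] = str(exc)
            break
    return context

def extract(ctx):
    ctx["rows"] = [{"gmv": 120}, {"gmv": 80}]  # stand-in for a source query
    return ctx

def aggregate(ctx):
    ctx["total_gmv"] = sum(r["gmv"] for r in ctx["rows"])
    return ctx

def load(ctx):
    ctx["dataset"] = {"daily_gmv": ctx["total_gmv"]}  # write to the dataset table
    return ctx

result = run_chain([("extract", extract), ("aggregate", aggregate), ("load", load)])
```

A production version would persist the chain state and run on a clock (e.g. in the early-morning idle window), but the sequencing contract is the same.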

Business reports

After data processing, the data is still plain, single-structure content; dry data is hard to observe intuitively, so data visualization is essential. There are many excellent open-source front-end frameworks on the market; two are worth naming: AntV and ECharts. These two can satisfy the vast majority of business scenarios. ECharts is especially recommended: open-sourcing such a fine piece of work deserves applause, the works in the ECharts gallery community are amazing, and if the framework keeps optimizing its performance it is the obvious choice for data visualization. As for AntV, Ant Design was arguably the pioneering UI component library of its kind, and its integration of visual design, interaction design, and front end is impressive; the React implementation uses a more advanced stack, and the Angular version was finally released in August. But a framework is just a framework, and choosing the technical solution is the R&D students' prerogative. For the background design of the product, AntV's guidelines are worth consulting and can save many detours.

The business-report design function lets users customize the menu and page layout of reports. In chart design, users can choose from a variety of chart types, from conventional line, bar, and pie charts to charts common in specific fields, such as candlestick (K-line) charts, funnel charts, and relationship graphs. Users then fill the chart with data, entered according to the rules, and select the display targets of the table. This highly free mode of operation seems to lengthen the R&D cycle and raise costs, but the design actually liberates the R&D students: once the product has iterated, users will generate reports from data on their own instead of going through the regular demand-development cycle, letting R&D focus on core technology rather than business requests.
To sum up, this is not only a functional design, but also a design for transferring cost.

Real-time monitoring

This module has two main functions: an overview, and abnormality warning. The overview page is the dashboard of the business reports; the only difference is that the data here is mostly real-time, so it needs no repetition. If you want this facade to shine, it can be split out and designed and developed as an independent functional requirement. As for the warning module, its name often overstates it: "warning" implies predicting and alarming when some value approaches a critical point, but the algorithms behind such prediction can be complex and are hard to specify as product requirements. In the actual design, therefore, it is enough to configure multiple conditions that trigger an alarm, poll them on an execution cycle, and push SMS or email notifications when one fires. As with data-source management, pay attention to date variables in table names and the configuration of the event loop.
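The configurable multi-condition alarm can be sketched as a rule table evaluated once per polling cycle. The rule format, metric names, and thresholds below are assumptions for illustration; the push to SMS/email is left as a stub:

```python
import operator

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

# Illustrative rule format: (metric name, comparison, threshold, message).
RULES = [
    ("error_rate", ">", 0.05, "error rate above 5%"),
    ("orders_per_min", "<", 10, "order volume abnormally low"),
]

def poll(metrics, rules=RULES):
    """One polling cycle: return a message for every rule whose condition fired.
    In production, each message would fan out to SMS / email push."""
    fired = []
    for metric, op, threshold, message in rules:
        value = metrics.get(metric)
        if value is not None and OPS[op](value, threshold):
            fired.append(f"{message} (current: {value})")
    return fired

alerts = poll({"error_rate": 0.08, "orders_per_min": 42})
```

Because the conditions are plain data, operators can edit them in the GUI without a development cycle, which is the whole point of the module.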

Permission management

Throughout this product-design discussion, permission management has been repeatedly emphasized, not because the author is fussy, but because anything touching data security deserves attention. Database permissions can be controlled from the server, to the database, to specific tables, down to specific columns and even stored procedures. MySQL has this capability, though regular permissions rarely need to be so fine-grained. Permission control over both data structure and data content, however, is necessary. Finally, add a basic operation-log display for error checking and accountability.

At this point, the author has finished describing the overall product design. Readers may feel that without diagrams it is too abstract to grasp. In fact, the author deliberately omitted them, for fear that a prototype would mislead readers into copying the same design; hence only the core functions are described, not the visual and interactive presentation. If anything is unclear or ambiguous, you can ask via the author's official account, and the author will take time to reply.

Using third-party platform services in the early stage of business is a cost-effective choice for enterprises, but data migration must be considered later on. To maximize its own value, a third-party platform inevitably designs a generic product structure and service system to accommodate more business scenarios, and thus cannot meet some of an enterprise's special business requirements. Most importantly, all the operations on your own data run on someone else's servers, and data security remains a nagging worry. This year's clash between Cainiao Logistics and SF Express, one of Cainiao's founding members, made that plain: from IaaS to SaaS, when the infrastructure for data processing and transmission sits in someone else's house, questions of information sovereignty become hard for either side to handle once a dispute breaks out. The recent revision of the Anti-Unfair Competition Law by the National People's Congress may improve matters; after all, enterprises would rather put their energy into operating the business than into ugly commercial wars. Fortunately, more and more data service providers on the market offer private deployment, and personalized functional design can solve enterprises' business needs to some extent.

Whether to build their own BI system or use a third-party platform, readers surely already have the answer in mind.

Figure: Data server

The product design references DataWorks and Quick BI; if you are interested, you can try them out, though the price is not exactly lovely (the premium version has many features, while the basic version lacks too many).

If readers have any thoughts, feel free to throw bricks (criticism welcome).

 

Author: Fruit Basket. Official account: "Lao Yang keeps you company with stories"

This article was originally published by @FruitBasket on Everyone Is a Product Manager. Reproduction without permission is prohibited.

The picture is from Unsplash
