I. Business background and demand sorting

At the beginning of the annual plan of the year, WE planned to go on BI this year, but the reasons for the epidemic were delayed and delayed, and it was just at that time that the demand broke out. Various business problems and various data analysis made us realize that we could not go on and could not carry the load, so we began to set up projects and conduct BI selection research.

Here to share my research notes in a written form, only on behalf of personal views.

Companies’ data needs now fall into two main categories:

  • Temporary demand: The business suddenly wants to see how effective this wave of activity is (the definition of data metrics can be changed and added at any time)

  • Solidified requirements: weekly to see, monthly to see data, (data is very clearly defined).

For these two types of requirements, our current solution is as follows:

  • For temporary requirements: write HQL to Hive to check, and then convert the result into Excel and send it to the requirements personnel.

  • For solidified requirements: write scripts, run results with Hive, write the results into the corresponding DB library, and then summarize and show them through third-party open source BI tools.

This is neat and clear, but the problem is obvious:

  • Development costs were too high: Every requirement, whether temporary or long-term, required custom development, and we were swamped with manpower.

  • Inflexible use: a report can only be displayed without analysis function. If you want to analyze it, you need to copy the data into Excel for processing and analysis. However, our data users may not have such ability.

  • Waste of resources: Reports developed by different people, in many cases with a lot of double counting.

  • The hue query speed is very slow. It takes more than one minute to perform a select from query.

In this case, we are going to build a multi-dimensional analysis platform that allows the business side to take numbers themselves, preferably without SQL, because most of our business side does not know SQL(although we have SQL training, there is still a threshold).

Based on the above, we conducted product research on BI selection.

Ii. Product trial analysis

Based on the discussion and understanding of BI tools recommended by peers in several groups, after a round of screening, the following products were selected as key research objects: Superset, Metabase and FineBI.

1, the Superset

Overall, after personal experience, the feeling is as follows:

  • Installing the tutorial is a pain, just like installing mysql

  • Table joins are not supported. The speed of calculation depends on the speed of your database

  • Visualization options are very rich, and there are several geographic location visualization schemes based on latitude and longitude

  • Permissions are very fine, down to each function key;

  • Unfortunately, the biggest problem is that the user experience of business analysts is not very good. The visualization process requires setting corresponding parameters for different graphic schemes, and the permission control is also very complicated.

The details are as follows:

1) Data source and data management

  • The supported databases are very rich: Druid, Hive, Impala, Kylin, Spark SQL, BigQuery, Pinot, ClickHouse, Google Sheets, Greenplum, IBM Db2, MySQL, Oracle, PostgreSQL, Presto, Snowflake, SQLite, SQL Server, Teradata, and Vertica support uploading local CSV files

  • Data table model management, can set the field type, dimension/can filter/can do time column, secondary processing fields, statistical indicators

  • Chart available tables have to be added one by one from the database (SQL toolbox can see all), which is not very convenient.

  • Deep support for DURID

2) Chart chart & Dashboard kanban

  • Chart making process: Select data source (table or view) > Select chart type > Set chart parameters (indicator/dimension/filter criteria). The data source can only be selected from the data table list page. After entering the analysis page, the data source cannot be changed. Because when switching different chart types, it is necessary to fill in parameters according to different charts, which is not convenient to use in self-help analysis.

  • It supports a variety of visualization graphics, 48 kinds of visualization schemes;

  • Kanban’s filtering function is very weak, even the most basic date filtering component is not implemented through a filter component in a single graph, can only be made for a single table, and then applied to kanban, this feature is also very inconvenient.

  • Provides simple chart drilling and exploration (jump straight into a single chart), but does not support chart linkage;

  • Kanban can not be cloned directly, to copy a Kanban can only be re-edited to select a single graph;

  • Kanban supports automatic refresh with a minimum refresh time of 10 seconds

3) SQL query

  • Support to fill field/table information by association

  • Supports cross-library associated query

  • A multi-tabbed environment that processes multiple queries at once

  • The query results should be visualized and saved as a view before jumping to chart page. And you need to weight that view (very inconvenient process)

  • Can do search on query history records;

  • Support for templated using the Jinja template language, which allows macros to be used in SQL code

4) Permission management

  • By setting permissions for roles, users can specify roles to control permissions

  • The granularity of permission control is very fine, supporting functional permission control (table modification can be subdivided to delete, new operations), support menu, data source, data table, field, chart, kanban permission control

  • The configuration of permissions is very complicated and tedious

  • Data row level control is not supported

  1. Secondary development
  • Technical architecture: Python+Flask+Recat+Redux+SQLAlchemy

  • Originally an open source project of Airbnb, it is supported by a large company team for maintenance, version update, bug repair and secondary development

  • Support for restful API

Metabase (Open Source, Gihub Star 15,670)

Overall, after personal experience, the highlights are as follows:

  • The interactive experience is business-friendly. Through a global search function for kanban and single figure, it creates an intelligent scene of “ask a question”, that is, the system will tell you the answer through the search box consultation, and the interface of the whole product is very simple and clear.

  • It is very simple to create a single graph, data-centric, to select different graphics (optional graphics are automatically grayed). Basically, it takes half a minute to complete a single graph analysis.

  • However, the biggest disadvantage is that the permission management is too weak, only modifiable/visible coarse-grained control, whether the table can be deleted can not be independently controlled.

The details are as follows:

1) Data source and data management

  • The supported databases are relatively weak: Postgres, MySQL, SQL Server, Redshift, SQLite, Google BigQuery, H2, Oracle, Vertica, Snowflake, MongoDB, Druid, Presto, SparkSQL Druid does not support SQL queries. Druid does not support SQL queries. Also no Hive support,Kylin)

  • Unified data model management portal to set dimension/measure fields after adding tables/views (this section is very detailed and the types of Settings are extended greatly)

  • Provides scheduled tasks, database synchronization (hour level)

  • Self-service table field information perspective function, intelligent exploration, automatic kanban, automatic distribution of associated data (bonus points cool features)

2) Chart chart & Dashboard kanban

  • The process of making a single graph is very simple: select data source -> Filter criteria -> Analysis indicators -> Group dimensions -> Visual type

  • The types of visualizations supported only meet basic requirements, 14 visualization schemes (including funnels, numbers with changes, maps)

  • Some charts can be controlled in detail, such as the form according to the condition control line color, adjust field position, display mini color bar, suffix setting

  • Support for basic filtering criteria, including date segments (associations between fields passed by filters and fields in a single graph)

  • Provides simple chart drilling function, but does not support chart linkage

  • Copy existing kanban with one click

  • Automatically refreshing data in the minimum granularity of 1 minute

  • Sharing support: public links, public embedding (blog pages), embedding in the application

  • Using Trap to send data to Slack (a foreign chat tool) or email

3) SQL query

  • Support to fill field/table information by association

  • SQL query results can be directly switched to graphical display scheme

  • Cross-library associated queries are not supported

  • Variables in native queries allow dynamic substitution of values in queries using filtering components or URL parameters

4) Permission management

  • By setting permissions for roles, users can specify roles to control permissions

  • The permission setting is very weak, only the access permission can be set (the accessible data may be directly deleted)

  • The permission setting object is shallow: it can only control the permission of data source, data table, chart and analysis item set, not the data row level

  • Field control at the field level can be set visible or not (sensitive field scenarios), but cannot be managed by role

5) Secondary development

  • Technical architecture: Clojure+Recat+Redux

  • Complete API documentation is provided, and many secondary developments can be completed with rich API and documentation

3. FineBI (Business)

(Small program has been added here, please go to toutiao client to view)
  • 5 minutes to complete the data analysis, zero code operation, click and drag to complete the analysis, can make a data report within half an hour.

  • Meet a variety of analysis needs, data processing, exploratory OLAP analysis, self-service data analysis

  • With the self-service data set function, ordinary business personnel can screen, cut, sort and summarize the data, so as to achieve the desired data results flexibly

  • One-click realization of data sharing and control, detailed and accurate data permission control, data and reports can be shared throughout the company, sharing results updated in real time

  • It supports large amount of data analysis, adopts advanced column storage, has efficient computing capability and powerful data compression capability, and supports fast data analysis at the front end.

Overall, after personal experience, the highlights are as follows:

  • Getting started requires getting used to the process of configuring data, handling self-service data sets, and then visualizing dashboards and charts, which is a bit confusing

  • Creating the visualizations is very simple, and the interaction is a bit like Tableau, dragging and dropping data fields into dimension boxes, rendering the visualizations immediately, and then building dashboards based on the visualized components

  • Linkage drill function is large, but also very intelligent, can automatically associate to the common field

  • Data processing function is very powerful, self-help data set, can deal with a lot of data functions, including group summary, modify data fields, table merge and so on

  • The biggest highlight is that the authority control is very detailed and practical. It can be subdivided into original data sources, processed data sets and dashboards. It can manage users of different roles, including positions and departments, with authority control of administrators and users

The details are as follows:

1) Data source and data management

  • The supported databases are very rich, as you can see in the figure below

  • There is a business bundle capability to sort different data sources, such as by department or by business requirements

  • Data table can be visual management, data preview, blood analysis, associated view, etc

2) Chart chart & Dashboard kanban

  • Visual production process: connect data source (database or import Excel) — self-service data set processing data — make chart components based on data — make visual dashboard.

  • There are more than 50 kinds of basic charts and over 100 kinds of styles with overlapping charts.

  • There are three types of table: grouping, cross and detail. Especially complex table is not supported. There is another report software finereport

  • The dashboard has powerful filtering functions, such as time, text, value, and query, as well as user-defined condition filtering.

  • Provide data drilling, linkage, jump functions, can directly interact in the dashboard, can also interact with other dashboards, support chart linkage;

  • Components and kanban can be reused and copied directly

  • There are rich function functions, in the visualization of the data can be filtered twice, summary, sorting, as well as their own formula calculation

  • The dashboard has the timed refresh function, which can automatically refresh a single dashboard, multiple dashboards, and a single component. JS is required to write the timed refresh frequency

  • Dashboard public links are supported on Share and can be attached to its decision system or embedded in web pages

  • Dashboards share directly, dashboards create public links, and dashboards hang out.

3) SQL query

Connect to the database directly through JDBC

  • Supports cross-library associated query

  • Support SQL data sets, allowing write SQL fetch

  • Supports visual data preview. After data tables are added and updated in the service package, the data preview area exists on the service package editing page, where you can view the edited table data.

  • Visual association tables and consents analysis

  • BI project that provides real-time data and extracted data in two computing modes

4) Permission management

  • You can set permissions based on roles. Permission recipients include departments, roles, duties, and users

  • It can set permissions for personnel management, directory permissions, management system, data connection, data permissions (data tables), sharing permissions, timing scheduling management permissions, etc., with rich permissions setting

  • The permission setting object is deep, down to the component or data row level

5) Secondary development

  • Pure Java development, is basically zero code tools

  • Support certain secondary development, API documentation

conclusion

Finally, in terms of comprehensive selection, the choice of BI selection falls between Metabase and FineBI, with the former open source and the latter commercial.

Open source has its disadvantages, such as weak rights and functions, lack of platform operation and maintenance functions, and the interface is all in English. Some development applications in our department are ok. However, considering that BI may be promoted to the whole company in the future, the selection of tools needs to be accepted by the business side, and product stability needs to be guaranteed by technology and service.

In either case, mature platforms are to be purchased, so I prefer the latter within the range of acceptable budget. The function satisfaction reaches 90%, and the specific performance has to be tested. The price is about 200,000 ~ hundreds of thousands, and there are also service projects.