1. Demand background

1.1 Challenges of big data visualization

With the rise of big data, storage and computing technologies have emerged in an endless stream, but the final visual presentation and exploration of data is just as important, and this area is nowhere near as mature as the storage and computing stack. Have you run into these puzzles when doing big data visualization?

  1. Traditional visualization tools integrate with traditional databases, but work poorly or not at all with big data components such as Hive, Spark, Presto, Elasticsearch, and ClickHouse. Each time, a redundant step is needed to export data from the big data cluster into a traditional database.
  2. Commercial products are expensive and often impose technical lock-in; many even require adopting the vendor's own big data stack.
  3. Users with an Excel background prefer drag-and-drop, users with a SQL background prefer the convenience of SQL, and both tend to resist yet another new tool; a web version with account login beats asking users to download a client.
  4. The company's developers are stretched thin, with no spare manpower to research a big data visualization platform, yet decision makers want a unified visualization platform.

Apache Superset is an open source tool for big data analysis, exploration, and visual reporting.

1.2 Target architecture of big data visualization

As with any work, it helps to establish a target architecture first; with the target in mind, things become much easier. As shown in Figure 1.2, the architecture is divided into three tiers.

  1. Tier 1: data stored in excellent OLAP engines such as ClickHouse, DorisDB, and Kylin, queried through their native connection engines for fast response. This tier supports both real-time and offline data access and is exposed to users through the visualization platform under permission control.
  2. Tier 2: data stored in Hive or in NoSQL stores such as HBase, connected to the visualization platform through excellent, efficient query engines such as Presto, Flink, and Spark, and presented to users after permission control.
  3. Tier 3: the rest is special-case access, such as MySQL, temporary files, and so on.

Note: other technical architectures are also in common use, such as ELK, made up of Elasticsearch, Logstash, and Kibana. Elasticsearch is an open source distributed search engine featuring distribution, zero configuration, automatic discovery, index sharding, index replicas, a RESTful interface, multiple data sources, and automatic search load balancing. Logstash is a fully open source tool that collects, parses, and stores your logs for later use (e.g., searching). Kibana is likewise an open source, free tool that provides a friendly web interface for log analysis on top of Logstash and Elasticsearch, helping you aggregate, analyze, and search important log data. I will cover ELK later; for now, back to Apache Superset.

2. Introduction to Apache Superset

2.1 What is Apache Superset?

Apache Superset is a Python-based, open source, modern data exploration, analysis, and visual reporting platform. It supports a rich set of data sources and offers a colorful selection of visualization charts.

  • Website: superset.apache.org/
  • GitHub: Github.com/apache/supe…
  • Domestic mirror sites: Aliyun: Mirrors.aliyun.com/pypi/simple…, Douban: pypi.douban.com/simple/, etc.
  • Development language: primarily Python

2.2 Why Apache Superset?

  1. Supports a rich set of databases as data sources, covering essentially all commonly used databases. As shown in Figure 2.2.0, the supported data sources include:
  • Amazon Athena
  • Amazon Redshift
  • Apache Drill
  • Apache Druid
  • Apache Hive
  • Apache Impala
  • Apache Kylin
  • Apache Pinot
  • Apache Solr
  • Apache Spark SQL
  • Ascend.io
  • Azure MS SQL
  • Big Query
  • ClickHouse
  • CockroachDB
  • Dremio
  • Elasticsearch
  • Exasol
  • Google Sheets
  • Hologres
  • IBM Db2
  • IBM Netezza Performance Server
  • MySQL
  • Oracle
  • PostgreSQL
  • Trino
  • Presto
  • SAP Hana
  • Snowflake
  • SQLite
  • SQL Server
  • Teradata
  • Vertica

  2. Apache Superset has a very rich set of charts for different visualization needs, as shown in Figure 2.2.1.

  3. Lightweight and highly extensible: it uses your existing database models directly for data exploration and visual presentation, with no need for a separate ingestion layer, as shown in Figure 2.2.2. After configuring a database, enter SQL Lab to explore and analyze the data. SQL Lab is more like a database query client; for better visual presentation of the data, combine it with the chart and dashboard features.

  4. Easy to use, as shown in Figure 2.3.3. Using Apache Superset mainly breaks down into the following parts:
  • Data: add data sources and datasets (Dataset; older versions also call it Table). Datasets are the basis of chart visualization.
  • Charts: select a suitable chart type to present a prepared Dataset.
  • Dashboards: a dashboard is effectively a report or a large-screen kanban display; multiple Charts can be placed on one dashboard and displayed together.
  • SQL Lab: essentially a multi-database query client like DBeaver, Navicat, or DataGrip, but with query capability only. After configuring the driver and connection, it can run SQL queries against the database, table, and field models.
  • Settings: language selection, login and logout, user permissions, operation logs, and other settings.

2.3 Comparison with Metabase

The blogger previously wrote a blog post on Metabase, another open source tool for big data analysis, exploration, and visual reporting. So what are the strengths and weaknesses of Apache Superset relative to Metabase?

  • Built-in data source support: Apache Superset beats Metabase.
  • Variety of chart types: Apache Superset beats Metabase.
  • Polish and smoothness of the UI: Apache Superset is slightly behind Metabase.
  • Drag-and-drop operation: Apache Superset is slightly behind Metabase.

Metabase is better suited to business users with data needs but little SQL, while Apache Superset suits users comfortable with SQL. The dashboards the two produce are ordinary web pages, so they can be linked together with hyperlinks from a unified page to form a single reporting platform.

3. Get started

Here is a quick start to give you a feel for the tool; details follow in later chapters. First configure a database connection (see 5.1 Creating Databases for the method), open SQL Lab, select the configured database, and write a SQL statement to analyze and explore the data, as shown in Figure 3.1.0. Run the statement to get the data result. Click Save to keep frequently used exploration SQL, then click the EXPLORE button above the query result to jump to chart analysis, as in Figure 3.1.1.

Using the data set produced by exploring in SQL Lab, select a chart that fits the requirement, choose the appropriate dimensions and metrics, and click RUN at the top to get the result. It is very convenient; you can click SAVE at the top to save the chart directly.

Create a Dashboard, edit it, drag and drop the previously generated Charts onto it, and the final presentation of the data dashboard is complete. The dashboard can then be shared with stakeholders, or an access link can be generated and shared.

Note: when dragging a chart onto the Dashboard, drag it toward the top until a blue line appears before releasing; otherwise the drop may fail. This part of the design is quite poor.

4. Deploy and install

4.1 Deployment mode and Version

  • Docker deployment on Linux, Windows, and Mac
  • Python-environment source deployment on Linux, Windows, and Mac
  • You can download Apache Superset from GitHub, the Apache Superset website, or a local mirror, but hold off on downloading for now.

  • The blogger chose apache-superset-0.38.1.tar.gz with Python-environment deployment on Linux.

4.2 Configuration Requirements

  • apache-superset-0.38.1.tar.gz
  • CentOS 7, 16 cores, 32 GB (optional; an ordinary server works)
  • Python 3.6
  • A server with internet access is required; if unavailable, a proxy server with internet access can be used. There are many dependencies, so installation is done online.

4.3 Download and Installation

  1. To install Python 3.6, you can choose the Anaconda distribution; refer to the blogger's earlier post on installing Anaconda3-5.2.0-Linux-x86_64 on Linux (see Anaconda Download). Once installed, if an older server already has Python 2 and it remains the default, that does not matter: just set a new environment variable or alias so that python3 launches the version you just installed.

  2. Create a Python virtual environment, activate it, and then install Apache Superset.

# Switch to the software installation directory (the blogger's is /usr/local/tools) and create a superset directory
cd /usr/local/tools
mkdir superset
cd superset

# If the server has no internet access, a proxy server can be used (e.g. 10.212.18.34:3129). Method:
# append the following to /etc/profile:
#   export http_proxy=10.212.18.34:3129
#   export https_proxy=10.212.18.34:3129
# then save with :wq and run `source /etc/profile` to refresh the configuration
# After installation you can remove the proxy settings; remember to run `source /etc/profile` again
# If you do not want to (or cannot) edit /etc/profile, pass the proxy to pip directly:
#   pip install virtualenv --proxy=10.212.18.34:3129
pip install virtualenv

# Create a virtual environment named venv (a venv directory is created in the current directory)
python3 -m venv venv

# Activate the virtual environment
. venv/bin/activate

# To exit the virtual environment (exiting is not required here)
deactivate

#Install and update some dependencies
pip install --upgrade setuptools pip -i https://pypi.douban.com/simple/

yum install gcc gcc-c++ libffi-devel python-devel python-pip python-wheel openssl-devel libsasl2-devel openldap-devel mysql-devel gcc-devel

# Possible error: GPG key retrieval failed: [Errno 14] curl#37 - "Couldn't open file /etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7"
# Solution: edit the repo file and disable the GPG check, then rerun the yum install above
vi /etc/yum.repos.d/epel.repo
gpgcheck=0
# Save with :wq, then rerun: yum install gcc gcc-c++ libffi-devel python-devel python-pip python-wheel openssl-devel libsasl2-devel openldap-devel mysql-devel gcc-devel

# Prefer the official source first, since it resolves and installs dependencies automatically;
# use mirrors only if the official source really cannot be reached:
#   pip install apache-superset==1.4.2
# Install superset from a mirror, pinning a version (omitting the version installs the latest):
pip install superset==0.30.1 -i https://pypi.douban.com/simple
#Install email_validator 
pip3 install email_validator -i https://pypi.douban.com/simple/

#Updating the database
superset db upgrade

# Create an admin user. The username is up to you (e.g. bigdata123 or admin). After entering it you will be prompted for last name, first name, and email; these three are optional (press Enter to skip). Then set a password, which must be entered.
export FLASK_APP=superset
superset fab create-admin

# Load the example data; this also tests the network. If it keeps failing to load, give it up; it does not affect subsequent use.
superset load_examples

# Initialize default roles and permissions
superset init

# Development-server start (works, but not recommended):
#   superset run -p 8088 --with-threads --reload --debugger
# Gunicorn is recommended for a quick start; to keep logs printing to the terminal, install it first:
#   pip install gunicorn
# then start Superset:
#   gunicorn -w 5 --timeout 120 -b 10.218.10.290:9089 "superset.app:create_app()"
# Gunicorn is a Python WSGI web server, roughly what Tomcat is to Java
# -w WORKERS: number of worker processes
# --timeout: worker timeout; a worker exceeding it is restarted automatically
# -b BIND: the address and port Superset is served on
# --daemon: run in the background

# On a machine that can reach 10.218.10.290:9089, open a browser and log in with the username and password just created.


# If not started in the background, press Ctrl+C to stop directly
# To stop a background Gunicorn process:
ps -ef | awk '/gunicorn/ && ! /awk/{print $2}' | xargs kill -9

4.4 Installation Precautions and Troubleshooting

The keywords Successfully installed appearing during the pip install superset step confirm a correct installation, as shown in Figure 4.3.0.

When superset fab create-admin prompts while configuring the username, see Figure 4.3.1.

Everyone's server environment differs, so the missing dependencies differ too. If you hit bugs along the way, a web search will usually solve them; they are mostly Python dependency package issues, so be patient.

# Error
ModuleNotFoundError: No module named 'dataclasses'
# Solution
pip install dataclasses

# Error
No PIL installation found
# Solution
pip install pillow
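Missing-module errors like the two above can be spotted in one pass with a short check script. This is a convenience sketch, not part of the official install steps; the module names checked are just the ones mentioned above:

```python
# Report which modules are not importable in the current environment.
import importlib.util

def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# 'dataclasses' is stdlib from Python 3.7 on; 'PIL' comes from the pillow package.
print(missing_modules(["dataclasses", "PIL"]))
```

Run it inside the activated venv; every name it prints corresponds to a pip install you still need.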

Once everything is resolved, the web login page appears as shown in Figure 4.3.2.

4.5 Startup and Shutdown

The direct startup method on the official website is not ideal; the blogger recommends the Gunicorn approach. First shut down any running Superset.


# Installing Superset generates many files under venv; switch into venv
cd /usr/local/tools/superset/venv/



#Creating a Log Folder
mkdir log

# Switch to the log directory; create access log, error log, and startup pid files
cd log
touch gunicorn_access.log
touch gunicorn_error.log
touch pidfile
chmod 755 gunicorn_access.log gunicorn_error.log pidfile

# Switch to /usr/local/tools/superset/venv/bin and write a Gunicorn configuration file (Python)
cd /usr/local/tools/superset/venv/bin
vim gunicorn_config.py

# File content starts here
import multiprocessing

bind = '10.218.10.290:9089'   # IP and port to bind
backlog = 512                 # listen queue size
timeout = 30                  # worker timeout in seconds
worker_class = 'gevent'
workers = 5                   # number of worker processes
worker_connections = 1000
threads = 2                   # threads per worker process
loglevel = 'info'             # log level
# gunicorn access log format (the error log format cannot be set)
access_log_format = '%(t)s %(p)s %(h)s "%(r)s" %(s)s %(L)s %(b)s %(f)s "%(a)s"'
pidfile = '/usr/local/tools/superset/venv/log/pidfile'
errorlog = '/usr/local/tools/superset/venv/log/gunicorn_error.log'
accesslog = '/usr/local/tools/superset/venv/log/gunicorn_access.log'
print("IP and PORT:" + bind)
print("pid_file:" + pidfile)
print("error_log:" + errorlog)
print("access_log:" + accesslog)
# End of file content

# Save and exit with :wq

# Start Gunicorn with -c <config file>; --daemon runs it in the background, and logs can be viewed at the paths specified in the config file
gunicorn -c ./gunicorn_config.py "superset.app:create_app()" --daemon

#Background Process Viewing
ps -ef | grep gunicorn

#Or view it through the port
netstat -tunlp | grep 9089
#or
ss -anp | grep 9089

# If not started in the background, press Ctrl+C to stop directly
# To stop a background Gunicorn process:
ps -ef | awk '/gunicorn/ && ! /awk/{print $2}' | xargs kill -9
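On the worker count: the workers = 5 in gunicorn_config.py matches the common Gunicorn rule of thumb of 2 × CPU cores + 1 for a 2-core machine. That heuristic is a general Gunicorn convention, not something this setup requires; a quick sketch:

```python
# Gunicorn worker-count heuristic: 2 * cores + 1 (a common convention, not mandatory).
import multiprocessing

def suggested_workers(cores=None):
    """Suggest a Gunicorn worker count; defaults to the local CPU count."""
    if cores is None:
        cores = multiprocessing.cpu_count()
    return 2 * cores + 1

print(suggested_workers(2))  # -> 5
```

Tune it against your own load; IO-bound dashboards with the gevent worker class can often serve many connections per worker.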

5. User Manual (key points)

5.1 Creating Databases

Before creating a new Database connection, install the corresponding Python driver package first; for details, see the Database Drivers documentation, as shown in Figure 5.1.0. It is usually a simple pip install xxx.

A configured Database can then serve as a data source for SQL Lab. There is also an Upload CSV option under Data (the latest version also supports Upload Excel), so local CSV files can be uploaded directly to the Superset site as data sources for exploration and analysis.

After logging in to Apache Superset, click Data, select Databases, and you land on Figure 5.1.1. Click the + sign on the upper right to jump to Figure 5.1.2.

In Figure 5.1.2, Database is the display name for the new database connection. SQLAlchemy URI is the database connection string shown in Figure 5.1.0; make sure it matches the database type you chose. Then click TEST CONNECTION; a Seems OK! popup appears on success, after which you scroll to the bottom and click SAVE. If the connection fails, check whether the database instance, port, username, and password are correct, whether the network of the server hosting Apache Superset can reach the database, and whether the SQLAlchemy URI follows the specification. Saved database connections are listed as in Figure 5.1.1.
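As a concrete illustration, SQLAlchemy URIs follow the pattern dialect+driver://user:password@host:port/database. The credentials and host below are made-up placeholders, not real connection details:

```python
# Building a SQLAlchemy URI from its parts. All values here are hypothetical
# placeholders; substitute your own.
params = {
    "user": "superset",
    "password": "s3cret",
    "host": "10.0.0.5",
    "port": 3306,
    "database": "sales",
}
# MySQL through the pymysql driver (installed with: pip install pymysql)
uri = "mysql+pymysql://{user}:{password}@{host}:{port}/{database}".format(**params)
print(uri)  # mysql+pymysql://superset:s3cret@10.0.0.5:3306/sales
```

Other databases follow the same shape (for example postgresql://user:pass@host:5432/db); check the Database Drivers page for the exact dialect and driver names.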

5.2 Creating Datasets (called Tables in older versions)

As shown in Figure 5.2.0, click Datasets under Data, then click + to jump to Figure 5.2.1. Select the configured database name, pick a database under the connected instance, select a table, and click Save. The saved Dataset is listed in Figure 5.2.0; you can see why older versions called this Tables. The dataset serves as the data source for subsequent Charts visualization.

5.3 SQL Lab

SQL Lab is essentially a database query client: it uses SQL statements to query and explore the database's table and field models, with intelligent auto-completion support. Query results in SQL Lab can also be sent directly to Charts as visualization data sources. As shown in Figure 5.3.0, SQL Lab has three menu entries, whose functions are:

  • SQL Editor: perform SQL query exploration
  • Saved Queries: saved, commonly used query SQL
  • Query Search: the query history

Click SQL Editor to enter the SQL query exploration shown in Figure 5.3.1. The upper left shows the configured database connection and selected database; the lower left shows the table and field models in use. The upper right is where SQL statements are written, with support for RUN (query), RUN SELECTION (run only the selected statement), SAVE, SHARE, and more. The lower right shows the data results, with support for EXPLORE (to Charts visualization), .CSV (download), and CLIPBOARD (copy to clipboard).

5.4 Creating Charts

The purpose of charts is data visualization; different chart types meet different business needs. Charts are also the building blocks of a dashboard: one dashboard can display one or more charts. There are two ways to create a chart:

  • As shown in Figure 5.4.0, click Charts, then click + to create a new chart and jump to Figure 5.4.1
  • In SQL Lab, click EXPLORE on a query result to send it directly to Charts for visualization

As shown in Figure 5.4.1, select a chart that fits the requirement (as Figure 5.4.2 shows, the supported chart types are very rich), choose the appropriate dimensions and metric values, and click RUN at the top to get the result. It is very convenient; click SAVE at the top to save the chart.

Reputed to have the most beautiful visualizations, Superset indeed supports a rich and colorful range of chart types that can satisfy all kinds of visualization needs.

5.5 Creating Dashboards

The dashboard is the final overall presentation of the data, the report presentation.

As shown in Figure 5.5.0, click Dashboards and then click + New Dashboard to jump to Figure 5.5.1.

Click Edit Dashboard in the upper right corner of Figure 5.5.1, then drag and drop the previously created Charts onto the dashboard. Note: when dragging for the first time, drag toward the top until the blue guide line appears; otherwise the chart cannot be dropped onto the dashboard.

It also supports some common Components, such as Header, Tabs, Row, Column, Markdown, and Divider. Remember to click Save after editing.

The saved dashboard supports sharing, downloading, and other functions, and its charts refresh from their data sources to fetch new data.

The dashboard shared to others is shown in Figure 5.5.3.

6. Settings

Settings are grouped under the Settings entry in the menu bar, mainly covering permissions and the operation log module, each explained below.

6.1 Role List and Rights

Security in Apache Superset is handled by Flask AppBuilder (FAB), an application development framework built on top of Flask. FAB provides authentication, user management, permissions, and roles; see its documentation for details. Apache Superset ships with several default roles, each with different permissions. When the superset init command runs, the permissions associated with each default role are resynchronized to their original values, so changing the permissions of the default roles (for example, by deleting or adding permissions) is not recommended; instead, an admin can create a new role type and assign it the desired permissions. The default roles and permissions are as follows:

  • Admin: The administrator has all possible permissions, including granting or revoking permissions of other users and changing slices and dashboards of other users.

  • Alpha: The Alpha user can access all data sources, but cannot grant or revoke access to other users. They are also limited to changing the objects they own. Alpha users can add and change data sources.

  • Gamma: User Gamma has limited access rights. They can only use data from a data source accessed through another complementary role. They can only view slices and dashboards made by data sources they have access to. Currently the Gamma user cannot change or add a data source. We assume that they are primarily content consumers, although they can create slices and dashboards. Also note that when the Gamma user looks at the dashboard and slice list views, they will only see the objects they have access to.

  • Sql_lab: the SQL_lab role grants access to SQL lab. Note that while administrator users have access to all databases by default, both Alpha and Gamma users need access on a per-database basis.

  • Public: to allow logged-out users to access certain Superset features, configure the Public role with the permissions you want to grant them, copied over from another role.
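To make the division of labor concrete, here is a toy model of the default role semantics described above. It is purely illustrative, not Superset's actual permission implementation; the permission names are invented for the sketch:

```python
# A toy model of the default-role semantics described above (illustrative only;
# Superset's real permissions are managed by Flask AppBuilder).
DEFAULT_ROLES = {
    "Admin":   {"all_datasources", "grant_revoke", "edit_any_object", "sql_lab"},
    "Alpha":   {"all_datasources", "add_datasource", "edit_own_objects"},
    "Gamma":   {"edit_own_objects"},  # data access comes from extra complementary roles
    "sql_lab": {"sql_lab"},
}

def can(role, permission):
    """Return True if the given role carries the given permission in this toy model."""
    return permission in DEFAULT_ROLES.get(role, set())

print(can("Alpha", "grant_revoke"))  # Alpha cannot grant or revoke access
```

The takeaway mirrors the bullets above: only Admin grants or revokes, Alpha manages data sources, and Gamma consumes content it has been given access to.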

For more on role permissions, see Apache Superset Security on the official website, or click Edit Role as in Figure 6.1.0 to view them. Try not to change the permissions of the default roles.

Apache Superset also supports the administrator to create new roles, as shown in Figure 6.1.1. Create new roles and specify role permissions.

6.2 User List

Create and edit a user to specify a role. The user’s permission is bound to the role. A user can have multiple roles.

6.3 Operation Logs

An action log is a log of the behavior of different users on your Superset platform, as shown in Figure 6.3.0.

6.4 User Information, Logout, and Version Information

The personal information on the right of the menu bar mainly includes:

  • User info: change the username and reset the password.
  • Logout: return to the login screen.
  • Version: information about the currently installed Superset version.

6.5 Language Selection

As an Apache top-level project, Superset is naturally international, supporting many of the world's most common languages; pick your favorite.

6.6 Managing Settings

This is for adding custom styles and templates to dashboard and chart rendering; in practice it sees little use.

6.7 + NEW

The + NEW entry in the menu bar is a shortcut to the three most commonly used modules: SQL query, Chart, and Dashboard.


The above is a basic introduction to Apache Superset, an open source platform for big data exploration, analysis, and visual reporting. For more, see the Apache Superset Documentation.