Building an SQL Database Audit System Using Kafka, MongoDB and Maxwell’s Daemon
  • Translator: joyking7
  • Proofreaders: lsvih, PassionPenguin

Key takeaways

  • Audit logging systems are useful for much more than storing data for audit purposes. Beyond compliance and security, audit data can also be used by marketing teams for targeted campaigns and to generate critical alerts.
  • The audit logging capability built into the database may not be sufficient, and it is certainly not the best choice for covering every use case.
  • There are many open source tools, such as Maxwell’s Daemon and Debezium, that implement audit logging and support these requirements with minimal infrastructure and time.
  • Maxwell’s Daemon can read the SQL binlog and send binlog events to producers such as Kafka, Amazon Kinesis, SQS, and Rabbit MQ.
  • The binlog generated by the SQL database must be in ROW format for the entire setup to work.

Suppose you are using a relational database to maintain your transactional data and you need to store an audit trail for the data in just a few tables. If you do what most developers do, you might end up with one of the following approaches:

1. Use the database’s audit log feature

Most databases provide plug-ins to support audit logging. These plug-ins can be easily installed and configured to record data. However, this approach has the following problems:

  • Full-fledged audit log plug-ins are typically available only in the enterprise edition and may be missing in the community edition. In the case of MySQL, the audit log plug-in is available only in the Enterprise edition. It’s worth noting that users of MySQL Community Edition can still install other audit logging plug-ins from MariaDB or Percona to get around this limitation.
  • As discussed in the performance impact of slow query logging and the performance impact of MySQL logging, database level audit logging can impose an additional 10-20% overhead on the database server. In general, on a high-load system, you might want to enable audit logging only for slow queries, not for all queries.
  • Audit logs are written to log files and cannot be easily searched. For data analysis and auditing purposes, we would prefer to have audit data stored in a searchable format.
  • A large number of audit archives can consume significant storage because they are stored on the same server as the database.

2. Use the application to maintain audit logs

You can do this in several ways:

A. Before updating existing data, copy the existing row to another table, then update the data in the current table (a sketch of this approach follows this section).

B. Add a version number to the data, so that every update inserts a new row with an incremented version number.

C. Write the data to two database tables, one containing the latest data and the other containing the audit trail.

As a rule of thumb in designing scalable systems, it is important to avoid writing the same data more than once, as this can degrade system performance and cause data synchronization problems.
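For concreteness, here is a minimal sketch of approach A, with purely illustrative table and column names: the application first copies the current row into an audit table and then applies the update.

-- Illustrative sketch of approach A: archive the current row, then update it.
INSERT INTO persons_audit (person_id, last_name, first_name, city, archived_at)
SELECT person_id, last_name, first_name, city, NOW()
FROM persons
WHERE person_id = 42;

UPDATE persons
SET city = 'Oslo'
WHERE person_id = 42;

Every audited update now costs two writes, which is exactly the double-write overhead warned about above.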

Why do companies need audit data?

Before starting with the audit logging system architecture, let’s first look at the requirements for audit logging systems in various organizations:

  • Compliance and audit: From an auditor’s perspective, they need to capture data in a meaningful and context-relevant way. Database audit logs are good for DBA teams but not for auditors.
  • A basic requirement for any large software system is to generate critical alerts when a security breach occurs, and audit logs can be used to do this.
  • You need to be able to answer questions about who accessed the data, what state the data was in prior to that, what was changed when it was updated, and whether internal users abused access.
  • It is also important to note that audit trail information can help identify infiltrators, which can enhance deterrence against “insiders”. People are less likely to access unauthorized databases or tamper with specific data if they know their actions will be scrutinized.
  • All industries, from finance and energy to food services and public works, need to analyze data access and submit detailed reports to various government agencies on a regular basis. The Health Insurance Portability and Accountability Act (HIPAA) requires health care providers to provide audit trails of everyone who touches their data records, detailed down to the row and record level. The new European Union General Data Protection Regulation (GDPR) has similar requirements. The Sarbanes-Oxley Act (SOX) introduced broad accounting regulations for public companies, which are required to regularly analyze data access and generate detailed reports.

In this article, I’ll use technologies like Maxwell’s Daemon and Kafka to provide you with an extensible solution for managing audit trail data.

Problem description

We need to build an audit system that does not depend on the application or its data model, and that system must be scalable and cost-effective.

Architecture

Important note: This system only works with a MySQL database and relies on binlogs in ROW format.

Before we get into the details of the solution, let’s take a quick look at each of the technologies discussed in this article.

Maxwell’s Daemon

Maxwell’s Daemon (MD) is an open source project developed by Zendesk that reads MySQL binlogs and writes row updates in JSON format to Kafka, Kinesis, or other streaming platforms. Its operational overhead is low: as the Maxwell’s Daemon documentation notes, it requires nothing more than MySQL and some place to write the data to. In short, MD is a change-data-capture (CDC) tool.

There are quite a few CDC tools out there, such as Red Hat’s Debezium, Netflix’s DBLog, and LinkedIn’s Brooklin. Any of them could be used to implement CDC, but DBLog and Brooklin were built for different usage scenarios, whereas Debezium is very similar to MD and could replace it in our architecture. Here are a few things to consider when choosing between MD and Debezium:

  • Debezium can only write data to Kafka, at least that’s the producer it mainly supports. MD supports a variety of producers including Kafka, Kinesis, Google Cloud Pub/Sub, SQS, Rabbit MQ, and Redis.
  • You can also write your own producer and configure it. For details, see Maxwell’s Daemon producer documentation.
  • The advantage of Debezium is that it can read change data from a variety of sources, such as MySQL, MongoDB, PostgreSQL, SQL Server, Cassandra, DB2, and Oracle. Debezium is very flexible with new data sources, whereas MD currently only supports MySQL data sources.

Kafka

Apache Kafka is an open source distributed event streaming platform used for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

MongoDB

MongoDB is a general-purpose, document-based distributed database built for modern application developers and the cloud era. We are using MongoDB purely for illustration; you could instead choose other options, such as S3, or time-series databases such as InfluxDB or Cassandra.

The following figure shows the data flow diagram of the audit trail scenario:

An audit trail management system involves the following steps:

  1. The application performs a database write, update, or delete.
  2. The SQL database generates a binlog entry for the operation in ROW format; this requires configuring the SQL database accordingly.
  3. Maxwell’s Daemon polls the SQL binlog, reads the new entries, and writes them to a Kafka topic.
  4. The consumer application polls the Kafka topic, reads the data, and processes it.
  5. The consumer writes the processed data to the new data store.

Setup

For easy configuration, we use Docker containers as much as possible. If you don’t already have Docker installed on your computer, consider installing Docker Desktop.

The MySQL database

1. Run the MySQL server locally. The following command starts a MySQL container on port 3307.

docker run -p 3307:3306 -p 33061:33060 --name=mysql83 -d mysql/mysql-server:latest

2. Since this is a fresh installation and we do not know the root password, run the following command to print the generated password to the console.

docker logs mysql83 2>&1 | grep GENERATED

3. Log in to the container and change the password if necessary.

docker exec -it mysql83 mysql -uroot -p
alter user 'root'@'localhost' IDENTIFIED BY 'abcd1234';

4. For security reasons, the mysql Docker container does not allow external applications to connect by default. We need to run the following command to make the change.

update mysql.user set host = '%' where user='root';

5. Exit the mysql prompt window and restart the Docker container.

docker container restart mysql83

6. Log in to the mysql client again and run the following command to create a user for Maxwell’s Daemon. For details about this step, see Maxwell’s Daemon Quick Guide.

docker exec -it mysql83 mysql -uroot -p
set global binlog_format=ROW;
set global binlog_row_image=FULL;
CREATE USER 'maxwell'@'%' IDENTIFIED BY 'pmaxwell';
GRANT ALL ON maxwell.* TO 'maxwell'@'%';
GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'maxwell'@'%';
CREATE USER 'maxwell'@'localhost' IDENTIFIED BY 'pmaxwell';
GRANT ALL ON maxwell.* TO 'maxwell'@'localhost';
GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'maxwell'@'localhost';
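Note that the set global statements above only last until the MySQL server restarts. To make the ROW binlog configuration permanent, it would normally be placed in the server’s my.cnf instead; a minimal sketch, in which the server_id value and log file name are just examples:

[mysqld]
server_id=1
log-bin=master
binlog_format=row
binlog_row_image=full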

Kafka message broker

Setting up Kafka is very simple. Download Kafka from this link.

Run the following command:

Extract the Kafka archive (use the version you downloaded in place of <version>):

tar -xzf kafka_2.13-<version>.tgz
cd kafka_2.13-<version>

Start ZooKeeper, which Kafka currently requires:

bin/zookeeper-server-start.sh config/zookeeper.properties

Start Kafka on another terminal

bin/kafka-server-start.sh config/server.properties

Create a Kafka topic from another terminal:

bin/kafka-topics.sh --create --topic maxwell-events --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
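Optionally, verify that the topic was created by describing it:

bin/kafka-topics.sh --describe --topic maxwell-events --bootstrap-server localhost:9092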

The above commands start the Kafka message broker and create a topic named “maxwell-events” in it.

To push messages to the Kafka topic, run the following command from a new terminal:

bin/kafka-console-producer.sh --topic maxwell-events --bootstrap-server localhost:9092

The command above gives us a prompt where we can type a message and press Enter to send it to Kafka.

Consume messages from Kafka topics:

bin/kafka-console-consumer.sh --topic maxwell-events --from-beginning --bootstrap-server localhost:9092

Maxwell’s Daemon

Download Maxwell’s Daemon from this link. Unpack it and run the following command:

bin/maxwell --user=maxwell   --password=pmaxwell --host=localhost --port=3307  --producer=kafka     --kafka.bootstrap.servers=localhost:9092 --kafka_topic=maxwell-events
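If you prefer a configuration file over command-line flags, Maxwell’s Daemon can also read its settings from a config.properties file in its working directory. A sketch that should be roughly equivalent to the command above:

user=maxwell
password=pmaxwell
host=localhost
port=3307
producer=kafka
kafka.bootstrap.servers=localhost:9092
kafka_topic=maxwell-events

You would then start it with bin/maxwell and no extra flags.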

Maxwell is now started and monitoring the binlog. You can also configure it to monitor only specific databases or tables. For more information, refer to Maxwell’s Daemon configuration documentation.

Testing the setup

To test if the setup works, connect to MySQL and insert some data into a table.

docker exec -it mysql83 mysql -uroot -p

CREATE DATABASE maxwelltest;

USE maxwelltest;

CREATE TABLE Persons (
    PersonId int NOT NULL AUTO_INCREMENT,
    LastName varchar(255),
    FirstName varchar(255),
    City varchar(255),
    primary key (PersonId)

);

INSERT INTO Persons (LastName, FirstName, City) VALUES ('Erichsen', 'Tom',  'Stavanger');


On another terminal, run the following command:

bin/kafka-console-consumer.sh --topic maxwell-events --from-beginning --bootstrap-server localhost:9092

On the terminal, you should see the following:

{"database":"maxwelltest"."table":"Persons"."type":"insert"."ts":1602904030."xid":17358."commit":true."data": {"PersonId":1."LastName":"Erichsen"."FirstName":"Tom"."City":"Stavanger"}}
Copy the code

As you can see, Maxwell’s Daemon captures the database insert event and writes a JSON string containing the event details to the Kafka topic.
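For updates, Maxwell’s Daemon also includes the previous values of the changed columns in an old field, which is exactly what an audit trail needs. Updating the row above, for example changing City to Oslo, should produce an event roughly like this (the ts and xid values are illustrative):

{"database":"maxwelltest","table":"Persons","type":"update","ts":1602904100,"xid":17401,"commit":true,"data":{"PersonId":1,"LastName":"Erichsen","FirstName":"Tom","City":"Oslo"},"old":{"City":"Stavanger"}}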

Setting up MongoDB

To run MongoDB locally, run the following command:

docker run --name mongolocal -p 27017:27017 mongo:latest

Kafka consumer

The Kafka consumer code is available in the GitHub project KMaxwell; download the source code and consult the README to see how it works.
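If you would rather write your own consumer, the core loop is small. Below is a minimal sketch, not the KMaxwell code, using the standard Kafka and MongoDB Java clients; the group id and the auditlog database and events collection names are arbitrary choices for illustration:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class AuditTrailConsumer {
    public static void main(String[] args) {
        // Kafka consumer settings for the local setup used in this article
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "audit-trail-consumer");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // MongoDB connection; database and collection names are illustrative
        MongoClient mongo = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> audit = mongo.getDatabase("auditlog").getCollection("events");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("maxwell-events"));
            while (true) {
                // Poll Maxwell's JSON change events from the Kafka topic
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Store each change event as-is so every modification becomes an audit entry
                    audit.insertOne(Document.parse(record.value()));
                }
            }
        }
    }
}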

The final test

Finally, our setup is complete. Log in to the MySQL database and run any insert, delete, or update command. If everything is configured correctly, the corresponding entries appear in MongoDB’s AuditLog database. Happy auditing!
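To spot-check the result, you can open a shell inside the MongoDB container (the mongo:latest image ships with the mongosh shell). The AuditLog database name is the one used above; the collection names depend on how the KMaxwell consumer is configured, so list them first and then run find() on the one it writes to:

docker exec -it mongolocal mongosh
use AuditLog
show collections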

Conclusion

The system described in this article works well in real-world deployments and gives us a source of audit data in addition to the user data, but there are a few things to keep in mind before adopting this architecture:

  1. Infrastructure cost. This setup requires additional infrastructure: the data makes multiple hops from the database to Kafka and on to another database, and may also be stored in backups, all of which adds to the infrastructure cost.
  2. The audit log data is not available in real time, because it goes through multiple hops and arrives with a delay of seconds or minutes. We might argue that nobody needs real-time audit log data, but if you plan to use this data for real-time monitoring, this is something you have to consider.
  3. In this architecture, we capture data changes, but not who changed the data. If you also care about which user changed the data, this design may not support that directly.

After highlighting some of the tradeoffs of this architecture, I’d like to conclude by reiterating the benefits of this architecture. The main benefits are as follows:

  • This design reduces the performance cost of audit logging on the database while still serving marketing and alerting needs from a conventional data source.
  • The architecture is easy to set up and resilient: a failure in any component does not result in data loss. For example, if MD fails, the data is still stored in the binlog files, and the next time the daemon starts it can resume reading from where it left off. If the Kafka message broker fails, MD detects it and stops reading from the binlog. If the Kafka consumer crashes, the data remains in the Kafka message broker. So, in the worst case, the audit log may be delayed, but no data is lost.
  • The installation and configuration process is simple and direct, without consuming too much development effort.

About the author

Vishal Sinha is a passionate technologist with expertise and interest in distributed computing and large-scale systems. He currently works as CTO at an industry-leading Indian unicorn. During his 16-year career in the software industry, he has worked at multinational companies and startups, developed a variety of large-scale systems, and led teams of software engineers. He enjoys solving complex problems and experimenting with new technologies.
