Among the real-time message queues implemented in Go, NSQ is arguably the most popular.

NSQ is a simple, easy-to-use message middleware designed to provide a robust foundation for decentralized services running in a distributed environment. Its topology is distributed and decentralized, with no single point of failure, and it offers fault tolerance, high availability, and reliable message delivery.

NSQ has won over many Gophers with its distributed architecture and its ability to handle hundreds of millions of messages, and our company is no exception: many of our services rely on NSQ for message delivery. Today I want to talk a bit about monitoring NSQ.

Why deploy monitoring?

The importance of monitoring should be clear to everyone. A service without monitoring is like “a blind man riding a blind horse toward a deep pool at midnight.” That may sound abstract, so let me share a real incident from my own experience.

I still remember that day: I was happily eating hot pot when my phone suddenly rang. A customer was reporting that a CDN refresh had been submitted successfully but had not taken effect.

That was the end of the hot pot. I grabbed my laptop and got to work, but the first pass was not encouraging: I checked the logs of every service on the call chain, and the URL the customer wanted refreshed left no trace anywhere. So where was the problem?

The figure above is a schematic of the service calls involved in this business. Given the urgency of the customer’s request, I reluctantly took my eyes off the boiling hot pot and pondered the call chain:

  • As the figure shows, the fact that the refresh request was submitted successfully means it was forwarded to the OHM service layer and that OHM processed it without error.
  • OHM is the gateway service for the refresh and preheat business, and it only logs at the ERROR level. No record of the request appears in OHM’s logs, so nothing failed there, which means OHM successfully pushed the message to the downstream NSQ.
  • Downstream of NSQ, the consumers are the Purge and Preheat components. Purge performs the actual refresh; it logs at the INFO level and records every URL it refreshes (a minimal sketch of such a consumer follows this list). So why was there no trace in Purge’s logs?
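
Here is that sketch: a minimal Purge-style consumer built with the official go-nsq client. The topic name, channel name, lookupd address, and the purgeURL helper are illustrative assumptions, not the actual service code.

```go
package main

import (
	"log"

	"github.com/nsqio/go-nsq"
)

func main() {
	cfg := nsq.NewConfig()

	// Consume a (hypothetical) refresh topic on a dedicated channel.
	consumer, err := nsq.NewConsumer("cdn_refresh", "purge", cfg)
	if err != nil {
		log.Fatal(err)
	}

	consumer.AddHandler(nsq.HandlerFunc(func(m *nsq.Message) error {
		url := string(m.Body)
		log.Printf("INFO purging url=%s", url) // every refreshed URL is logged at INFO level
		return purgeURL(url)                   // returning an error would requeue the message
	}))

	// The nsqlookupd address is an assumption; adjust it for your deployment.
	if err := consumer.ConnectToNSQLookupd("127.0.0.1:4161"); err != nil {
		log.Fatal(err)
	}
	<-consumer.StopChan
}

// purgeURL is a hypothetical placeholder for the real CDN purge logic.
func purgeURL(url string) error { return nil }
```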

I was stuck right there. The crux of the problem was that no corresponding log could be found in the Purge service. I listed the possible causes:

  • Service changes. A bug in recently released code could make the Purge service misbehave. I quickly ruled this out: the release records show the service had not been touched for months.
  • NSQ is down. Even less likely. NSQ is deployed as a cluster, which avoids a single point of failure, and if there had been a cluster-wide outage the company chat groups would have exploded long ago.
  • NSQ has not delivered the message. But NSQ is a real-time message queue, delivery should be fast, and the customer’s refresh request had been submitted hours earlier.

Could it be that messages were piling up in NSQ, so this one simply had not been delivered yet? I had never seen this in the test environment, but test traffic was nowhere near the volume of production… The more I thought about it, the more plausible it seemed. I logged into the NSQ admin console and looked at the topic in question. Sure enough, that was the problem: hundreds of millions of undelivered messages had accumulated in NSQ!
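
Besides the admin console, the backlog can also be checked against nsqd’s HTTP stats endpoint. Below is a minimal sketch that queries /stats?format=json; the default HTTP port 4151 and the fields shown are based on current nsqd versions (older releases wrap the payload in a data field), so treat it as an assumption and adjust for your deployment.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// stats mirrors only the fields we care about from nsqd's /stats JSON output.
type stats struct {
	Topics []struct {
		TopicName    string `json:"topic_name"`
		Depth        int64  `json:"depth"`
		BackendDepth int64  `json:"backend_depth"`
	} `json:"topics"`
}

func main() {
	// nsqd's HTTP API listens on port 4151 by default.
	resp, err := http.Get("http://127.0.0.1:4151/stats?format=json")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var s stats
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		log.Fatal(err)
	}
	for _, t := range s.Topics {
		fmt.Printf("topic=%s depth=%d backend_depth=%d\n", t.TopicName, t.Depth, t.BackendDepth)
	}
}
```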

Once the problem was identified, the first step was of course to resolve the customer’s issue with internal tools; I won’t expand on that here.

Deploying and implementing the monitoring

I was relieved once the ticket was closed, but that was not the end of the story. This failure was a wake-up call: NSQ’s good performance does not guarantee that messages will never pile up, so the necessary monitoring and alerting had to be put in place.

Given our existing infrastructure, I decided to monitor NSQ with Prometheus. (Background on Prometheus itself is out of scope here; leave a comment if you would like a separate write-up.)

Prometheus collects data from third-party services through exporters, which means NSQ needs an exporter before Prometheus can scrape it.
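
For readers unfamiliar with the exporter pattern, here is a minimal sketch in Go using the official client_golang library: it registers a single gauge and exposes it on /metrics for Prometheus to scrape. This only illustrates the pattern and is not the code of the nsq_exporter project mentioned below; the metric name, the dummy value, and the listen port are assumptions.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// depthGauge is a hypothetical metric tracking per-topic backlog.
var depthGauge = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "nsq_topic_depth",
		Help: "Number of messages currently queued in a topic.",
	},
	[]string{"topic"},
)

func main() {
	prometheus.MustRegister(depthGauge)

	// A real exporter would poll nsqd's /stats endpoint periodically;
	// here we set a dummy value so the example is self-contained.
	depthGauge.WithLabelValues("cdn_refresh").Set(42)

	// Expose /metrics for Prometheus to scrape; the port is an arbitrary choice.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9117", nil))
}
```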

Prometheus’s official documentation (https://prometheus.io/docs/in…) lists recommended exporters. Following the link there, I found the recommended NSQ exporter (https://github.com/lovoo/nsq_…). Unfortunately, the project has fallen into disrepair; its most recent commit was already four years old.

So I forked the project and made some simple changes to support Go modules. (PR: https://github.com/lovoo/nsq_…)

With the NSQ exporter deployed, the next question was: which metrics should be monitored?

Referring to the official documentation (https://nsq.io/components/nsq…), the key metrics are:

  • Depth: the number of messages currently backlogged in NSQ. By default NSQ keeps only 8000 messages in memory; messages beyond that are persisted to disk.
  • Requeued: the number of times messages have been requeued.
  • Timed Out: the number of messages that timed out during processing, i.e., were not acknowledged by a consumer within the message timeout.

The Prometheus documentation suggests pairing it with Grafana to visualize metric changes more intuitively. I configured the dashboard as follows:

  • The timed-out messages panel corresponds to the Timed Out metric.
  • The message backlog panel corresponds to the Depth metric.
  • The load panel is computed with the formula sum(irate(nsq_topic_message_count{}[5m])).
  • The probe panel checks whether the NSQ exporter itself is reachable, because the exporter’s own service can become unavailable when NSQ is under pressure.
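
Alerting works alongside the dashboards. As an illustration of how an alarm check could consume these metrics, here is a minimal sketch that queries Prometheus over its HTTP API (using a recent version of client_golang) and warns when a topic’s backlog exceeds a threshold. The Prometheus address, the nsq_topic_depth metric name, and the threshold are assumptions, not our production alerting code.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

func main() {
	// Prometheus address and metric name are assumptions; adjust for your setup.
	client, err := api.NewClient(api.Config{Address: "http://127.0.0.1:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Current backlog per topic, as exposed by the NSQ exporter.
	result, warnings, err := promAPI.Query(ctx, `sum by (topic) (nsq_topic_depth)`, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		log.Println("warnings:", warnings)
	}

	const threshold = 100000 // hypothetical alert threshold
	vector, ok := result.(model.Vector)
	if !ok {
		log.Fatalf("unexpected result type %T", result)
	}
	for _, sample := range vector {
		if float64(sample.Value) > threshold {
			fmt.Printf("ALERT: topic %s depth %.0f exceeds %d\n",
				sample.Metric["topic"], float64(sample.Value), threshold)
		}
	}
}
```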

With monitoring in place for NSQ, we can now quickly see its current state and intervene manually as soon as an alarm fires. The stability of the related businesses has improved noticeably, and tickets caused by this kind of problem have become rare. In addition, the data collected by the monitoring has made our thinking clearer and our direction more obvious in the subsequent performance-optimization work.
