Recently, a new RocketMQ cluster with dual master and slave roles was deployed. A few days ago, I received feedback from my colleague that an application occasionally failed to send messages recently, and the following error logs were recorded:

The number of messages produced is too large for the broker to handle. But colleagues said, this is a new application, not many messages. Log in to the server and check stats. Log.

I looked at the gc of the broker and saw that there were few gc occurrences, so it was not the GC that caused the stalling.

The other broker logs also did not print exception messages, which RocketMQ would have been comfortable with, but it crashed. Change waitTimeMillsInSendQueue from 200 to 1000, which should alleviate the problem.

The day after I changed the parameters, I received feedback again, but this time the exception was a little different:The changed parameters should have worked, but a new exception was raised due to broker problems that must be resolved.

The page cache timed out when the data was flushed.RocketMQ reads and writes files using MappedByteBuffer. Instead of writing data directly to disk, the data is first written to the memory map. The operating system automatically flusher the data to disk when appropriate, and the broker periodically flusher the disk.

The flush log is found in store.log. The flush log is displayed when the flush time exceeds 500 ms. As you can see, this machine takes a long time to flush. If this time exceeds 1000 ms, the [PCBUSY_CLEAN_QUEUE]broker is busy.

We have two main brokers and one is much better

So I suspect that there is something wrong with the disk of the machine, which leads to the frequent timeout of the disk brushing. So I wrote a piece of code to test the brush speed:

public class Benchmark {
  /** 1G */
  private static long FILE_SIZE = 1024 * 1024 * 1024;

  public static void main(String[] args) throws Exception {
    for (int i = 0; i < 5; i++) {
      MappedByteBuffer mb = getMappedByteBuffer(FILE_SIZE);

      byte[] arr = new byte[4096];
      Arrays.fill(arr, (byte) 2);
      int length = 0;
      long s = System.currentTimeMillis();
      while (length < mb.capacity()) {
        length += arr.length;
        mb.put(arr);
      }
      long e = System.currentTimeMillis();
      System.out.print("MappedByteBuffer  cost: " + (e - s) + "ms");
      mb.force();
      System.out.println(", force time: " + (System.currentTimeMillis() - e) + "ms");
    }
  }

  private static MappedByteBuffer getMappedByteBuffer(long fileSize) throws IOException {
    File file = getRandomFile();
    return new RandomAccessFile(file, "rw").getChannel().
        map(FileChannel.MapMode.READ_WRITE, 0, fileSize);
  }

  private static File getRandomFile() {
    String fileName = UUID.randomUUID().toString();

    File file = new File(fileName);
    file.deleteOnExit();
    return file;
  }
}
Copy the code

After the suspected failure machine and other machines were run again, the average time to brush the disk doubled.

MappedByteBuffer cost: 1622ms, force time: 243ms MappedByteBuffer cost: 1441ms, force time: 292ms MappedByteBuffer cost: 1376ms, force time: 560ms MappedByteBuffer cost: 2232ms, force time: 333ms MappedByteBuffer cost: 1747ms, force time: 370ms MappedByteBuffer cost: 2073ms, force time: 412ms MappedByteBuffer cost: 1912ms, force time: 524ms MappedByteBuffer cost: 1968ms, force time: 503ms MappedByteBuffer cost: 1838ms, force time: 434ms MappedByteBuffer cost: 2135ms, force time: 231ms MappedByteBuffer cost: 1899ms, force time: 209ms MappedByteBuffer cost: 2103ms, force time: 567ms MappedByteBuffer cost: 1716ms, force time: 411ms MappedByteBuffer cost: 1953ms, force time: 806ms MappedByteBuffer cost: 2156ms, force time: 299ms MappedByteBuffer cost: 1925ms, force time: 526ms MappedByteBuffer cost: 1809ms, force time: 414ms MappedByteBuffer cost: 2655ms, force time: 61ms MappedByteBuffer cost: 1888ms, force time: 1477ms MappedByteBuffer cost: 1791ms, force time: Other machines: MappedByteBuffer cost: 2101ms, force time: 487ms MappedByteBuffer cost: 2317ms, force time: 171ms MappedByteBuffer cost: 2037ms, force time: 578ms MappedByteBuffer cost: 2502ms, force time: 137ms MappedByteBuffer cost: 2144ms, force time: 210ms MappedByteBuffer cost: 2253ms, force time: 51ms MappedByteBuffer cost: 2404ms, force time: 295ms MappedByteBuffer cost: 2186ms, force time: 335ms MappedByteBuffer cost: 1932ms, force time: 632ms MappedByteBuffer cost: 2492ms, force time: 79ms MappedByteBuffer cost: 2004ms, force time: 566ms MappedByteBuffer cost: 2252ms, force time: 414ms MappedByteBuffer cost: 2163ms, force time: 399ms MappedByteBuffer cost: 2214ms, force time: 131ms MappedByteBuffer cost: 2343ms, force time: 44ms MappedByteBuffer cost: 2270ms, force time: 149ms MappedByteBuffer cost: 2316ms, force time: 170ms MappedByteBuffer cost: 2312ms, force time: 191ms MappedByteBuffer cost: 2225ms, force time: 89ms MappedByteBuffer cost: 2359ms, force time: 106msCopy the code

Finally, we applied for replacing the machine, which has been running for a week now without broker busy problem.