
If you are using a streaming platform such as Kafka, one of the first things you need to be clear about is which topics to use. In particular, if you are publishing a set of different events as messages to Kafka, should you put them all in the same topic, or should you split them across different topics?

One of the most important features of Kafka topics is that consumers can choose the subset of messages they want to consume. At one extreme, putting all of your data in a single topic is probably a bad idea, because consumers then cannot pick out the events they are interested in; they have to consume everything. At the other extreme, having millions of different topics is also a bad idea, because every topic in Kafka has a cost, and a very large number of topics hurts performance.

In fact, from a performance point of view it is the number of partitions that matters. Every Kafka topic has at least one partition, so if you have n topics you have at least n partitions. Not long ago, Jun Rao wrote a blog post explaining the cost of having many partitions (end-to-end latency, file descriptors, memory overhead, recovery time after a failure). As a rule of thumb, if you care about latency, a few hundred partitions per node should be fine; beyond tens of thousands of partitions per node, latency can suffer significantly.

This performance discussion gives some guidance for designing your topic structure: if you find yourself with thousands of topics, it is probably wise to merge some of the fine-grained, low-throughput topics into coarser-grained topics, to avoid a proliferation of partitions.

However, performance is not our only concern. Even more important, in my opinion, are the data integrity and data modelling consequences of the topic structure. We will discuss these in the rest of this article.

Is a topic a collection of events of the same type?

It is generally accepted that events of the same type should go in the same topic, and that different event types should use different topics. This thinking is reminiscent of relational databases, where a table is a collection of records of the same type, giving us an analogy between database tables and Kafka topics.

The Confluent Avro Schema Registry further reinforces this notion, because it encourages you to use the same Avro schema for all messages in a topic. The schema can evolve while maintaining compatibility (for example, by adding optional fields), but ultimately all messages must conform to a certain record type. I will come back to this later.

For some types of streaming data, such as activity events, it is reasonable to require that all messages in the same topic conform to the same schema. However, some people use Kafka more like a database, for example for event sourcing, or for exchanging data between microservices. In this case, I think it matters less whether a topic is defined as a collection of messages with the same schema. What matters much more is that the messages within a topic partition are ordered.

Imagine a scenario in which you have an entity (say, a customer) to which many different things can happen: the customer is created, changes their address, adds a new credit card to the account, opens a customer service request, pays an invoice, closes the account.

The order of these events matters. For example, we expect other events to occur only after the customer has been created, and no further events to occur after the customer has closed the account. With Kafka, you can keep them in order by putting them all in the same topic partition. In this example, you would use the customer ID as the partitioning key and put all of the events in the same topic. They must be in the same topic, because different topics mean different partitions, and Kafka does not guarantee ordering across partitions.
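As an illustration, here is a minimal Java sketch of this approach. The topic name, serializer choice, and JSON payloads are hypothetical; the point is that all events for a customer use the customer ID as the message key, so Kafka's default partitioner routes them to the same partition and their relative order is preserved.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CustomerEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String customerId = "customer-42"; // hypothetical entity ID used as the partitioning key

            // All events for this customer share the same key, so Kafka's default
            // partitioner sends them to the same partition of the "customer-events"
            // topic, which preserves their relative order for consumers.
            producer.send(new ProducerRecord<>("customer-events", customerId,
                    "{\"type\":\"customerCreated\"}"));
            producer.send(new ProducerRecord<>("customer-events", customerId,
                    "{\"type\":\"customerAddressChanged\"}"));
            producer.send(new ProducerRecord<>("customer-events", customerId,
                    "{\"type\":\"customerInvoicePaid\"}"));
        }
    }
}
```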

Ordering problems

If you use separate topics for the customerCreated, customerAddressChanged, and customerInvoicePaid events, consumers of those topics may not see the events in a sensible order. For example, a consumer might see an address change for a customer that does not exist yet (because the corresponding customerCreated event may have been delayed).

The likelihood of events arriving out of order is even higher if a consumer is paused for a while (for example, for maintenance or to deploy a new version). While the consumer is shut down, events continue to be published and are stored in the relevant topic partitions. When the consumer starts up again, it consumes the backlog of events in those partitions. If the consumer consumes only one partition, that is fine: the backlogged events are processed in the order in which they were stored. However, if the consumer consumes several topics at once, the data in those topics can be read in any order. It might read all of the backlog on one topic before moving on to another, or it might interleave data from multiple topics.

So if you put the customerCreated, customerAddressChanged, and customerInvoicePaid events in three separate topics, a consumer may well see a customerAddressChanged event before the corresponding customerCreated event, that is, an address change for a customer that has not yet been created.

You might think of attaching a timestamp to each message and using it to order the events. If you are importing events into a data warehouse and sorting them there, that may be fine. But in streaming data, timestamps alone are not enough: when you receive an event with a particular timestamp, you do not know whether you still need to wait for an event with an earlier timestamp, or whether all earlier events have already arrived. Relying on clocks for synchronization usually leads to nightmares; see Chapter 8 of Designing Data-Intensive Applications for more details on clock issues.

When to split topics and when to merge topics?

With this background, I’ll give you a few rules of thumb to help you determine what data should be in the same topic and what data should be in different topics.

  1. First, events that need to stay in a fixed order must go in the same topic (and must use the same partitioning key). The order of events matters mostly when the events belong to the same entity, so we can say: all events about the same entity should go in the same topic.

    Event ordering is especially important if you use event sourcing for data modelling. The state of an aggregate object is derived by replaying its event log in a specific order. Therefore, even though several different event types may exist, all of the events needed by the aggregate must be in the same topic.

  2. For events about different entities, should they go in the same topic or in different topics? I would say that if one entity depends on another (for example, an address belongs to a customer), or if they are frequently needed together, they should go in the same topic. On the other hand, if they are unrelated and owned by different teams, it is better to put them in different topics.

    This also depends on event throughput: if one entity type has a much higher event throughput than the others, it is better to split it off into separate topics, so as not to overwhelm consumers who only want the low-throughput entities (see point 4). Several entities with low throughput can, however, be merged.

  3. What if an event involves multiple entities? For example, orders involve products and customers, and transfers involve at least two accounts.

    I recommend initially recording such events as a single atomic message rather than splitting them into several messages in different topics. When recording an event, it is best to keep it as close to its raw form as possible. You can always split the compound event later using a stream processor, but it is much harder to reconstruct the original event if you split it prematurely. It helps to give the initial event a unique ID (such as a UUID) that any derived events carry along when you later split the original event, so that the origin of each event can be traced (see the first sketch below).

  4. Look at the number of topics a consumer needs to subscribe to. If several consumers all subscribe to a particular group of topics, this suggests that those topics could perhaps be merged.

    If fine-grained topics are merged into coarser-grained ones, some consumers will receive events they do not need and have to ignore them. That is not a big deal: consuming messages is cheap, so even if a consumer ends up ignoring half of the events, the total cost is probably not significant. Only when a consumer has to ignore the overwhelming majority of messages (say, 99.9% of them) would I recommend splitting the low-volume event stream out of the high-volume one (see the second sketch below).

  5. A changelog topic used as a Kafka Streams state store (KTable) should be kept separate from other topics. In this case, the topic is managed by the Kafka Streams process and should not contain any other types of events.

Finally, what if none of the rules above lets you make a decision? Then group events by type, putting events of the same type in the same topic. However, I consider this rule the least important of them all.
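To illustrate point 3, here is a minimal sketch (the event and field names are invented for illustration, and Java records are assumed) of a compound event that carries a unique ID which any later derived events copy, so that their origin stays traceable:

```java
import java.util.UUID;

// A compound "order placed" event recorded as a single atomic message.
// The eventId lets any derived events be traced back to this original event.
record OrderPlaced(UUID eventId, String customerId, String productId, long amountCents) {}

// Derived events produced later by a stream processor; each one carries the
// originating eventId, so the lineage of the split is preserved.
record CustomerCharged(UUID originEventId, String customerId, long amountCents) {}
record ProductReserved(UUID originEventId, String productId) {}

class OrderSplitter {
    CustomerCharged toCharge(OrderPlaced order) {
        return new CustomerCharged(order.eventId(), order.customerId(), order.amountCents());
    }
    ProductReserved toReservation(OrderPlaced order) {
        return new ProductReserved(order.eventId(), order.productId());
    }
}
```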
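And to illustrate point 4, here is a minimal consumer sketch that reads a merged topic and simply skips the event types it does not care about. The topic name, group ID, and the crude string check on the JSON payload are hypothetical; in practice you might dispatch on a message header or on the deserialized record type instead.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class InvoiceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "invoice-service");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("customer-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // The merged topic carries several event types; this consumer
                    // only cares about invoice payments and ignores everything else.
                    if (!record.value().contains("\"customerInvoicePaid\"")) {
                        continue;
                    }
                    System.out.println("Processing payment event: " + record.value());
                }
            }
        }
    }
}
```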

Schema management

If your data is plain text (such as JSON) and you do not use a statically defined schema, you can easily put different types of events in the same topic. However, if you use a schema-based encoding (such as Avro), storing several types of events in a single topic requires more thought.

As mentioned above, the Avro-based Confluent Schema Registry assumes one schema per topic (more precisely, one schema for the message key and one for the message value). You can register new versions of a schema, and the registry checks that the versions are forward and backward compatible. One advantage of this design is that different producers and consumers can use different versions of the schema at the same time and still remain compatible with each other.

Confluent's Avro serializer registers schemas in the registry under a subject name. By default, the subject for the message key is <topic>-key and the subject for the message value is <topic>-value. The schema registry checks the compatibility of all schemas registered under a particular subject.

Recently, I contributed a patch to the Avro serializer (github.com/confluentin…) that makes compatibility checking more flexible. The patch adds two new configuration options: key.subject.name.strategy (which defines how the subject name for message keys is constructed) and value.subject.name.strategy (which defines how the subject name for message values is constructed). Their values can be one of the following:

  • io.confluent.kafka.serializers.subject.TopicNameStrategy (default): the subject name is <topic>-key for message keys and <topic>-value for message values. This means that the schemas of all messages in a topic must be compatible with each other.
  • io.confluent.kafka.serializers.subject.RecordNameStrategy: the subject name is the fully qualified name of the Avro record type. The schema registry therefore checks compatibility per record type, regardless of topic. This setting allows the same topic to contain different types of events.
  • io.confluent.kafka.serializers.subject.TopicRecordNameStrategy: the subject name is <topic>-<type>, where <topic> is the Kafka topic name and <type> is the fully qualified name of the Avro record type. This setting also allows a topic to contain different types of events, and additionally scopes the compatibility check to the current topic.

With this new feature, you can easily put all the different event types belonging to a particular entity into the same topic. You can now choose the granularity of your topics freely, rather than being limited to one event type per topic.
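For example, a producer that publishes several Avro event types to one topic might be configured roughly as follows; this is a sketch, and the broker and registry URLs are placeholders:

```java
import java.util.Properties;
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class MultiTypeProducerConfig {
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");       // placeholder broker address
        props.put("schema.registry.url", "http://localhost:8081"); // placeholder registry address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", KafkaAvroSerializer.class.getName());

        // With TopicRecordNameStrategy, the subject becomes <topic>-<fully qualified
        // record name>, so each Avro event type in the same topic is checked for
        // compatibility against its own history rather than against the whole topic.
        props.put("value.subject.name.strategy",
                "io.confluent.kafka.serializers.subject.TopicRecordNameStrategy");
        return props;
    }
}
```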

Original English article:

Martin.kleppmann.com/2018/01/18/…