Kafka : Generating unique IDs for strings across partitions

I'm trying to assess whether Kafka could be used to scale out our current solution.
I can identify the partitions easily. The current requirement is 1500 partitions, each receiving 1-2 events per second, but in the future this might go as high as 10000 partitions.

But there is one part of our solution that I don't know how to solve in Kafka.
The problem is that each message contains a string, and I want to assign a unique ID to each string across the whole topic, so that the same strings have the same ID while different strings have different IDs. The IDs don't need to be sequential, nor do they need to be always growing.

The IDs will then be used down-stream as unique keys to identify those strings. The strings can be hundreds of characters long, so I don't think they would make efficient keys.
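
To make the size concern concrete, here is a minimal sketch of one coordination-free way to turn a long string into a compact, fixed-size ID: hashing its content. This is only an assumption for illustration, not something the current solution does. `string_id` and `long_path` are hypothetical names, and the approach is only acceptable if the (very small) chance of a hash collision is tolerable.

```python
import hashlib


def string_id(value: str, num_bytes: int = 16) -> str:
    """Hypothetical helper: derive a fixed-size ID from a string's content.

    The same input always yields the same ID, with no shared state needed.
    Different inputs yield different IDs unless the truncated hash collides.
    """
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return digest[:num_bytes].hex()


# A string several hundred characters long (made-up example value).
long_path = "C:/Users/example/" + "nested/" * 40 + "report.docx"

same_id = string_id(long_path) == string_id(long_path)
different_id = string_id(long_path) != string_id(long_path + "x")
```

The ID is always 32 hex characters regardless of how long the input string is, which is what makes it a cheaper key than the string itself.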

A more advanced use case is where messages have different "kinds" of strings, so there would be multiple independent sequences of unique IDs. Messages would contain only some of those kinds, depending on the type of the message.
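
The semantics I am after can be sketched in a single process as follows. This only illustrates the required behaviour (one ID space per kind, same string gives same ID); it says nothing about how to achieve it across Kafka partitions, which is the actual question. `IdRegistry` and the kind names are hypothetical.

```python
from collections import defaultdict
from itertools import count


class IdRegistry:
    """Hypothetical in-memory registry with a separate ID space per kind."""

    def __init__(self):
        # kind -> {string value -> assigned ID}
        self._ids = defaultdict(dict)
        # kind -> counter producing the next unused ID for that kind
        self._counters = defaultdict(count)

    def get_id(self, kind: str, value: str) -> int:
        ids = self._ids[kind]
        if value not in ids:
            # First time this string is seen for this kind: assign a new ID.
            ids[value] = next(self._counters[kind])
        return ids[value]


registry = IdRegistry()
first = registry.get_id("file", "a.txt")
again = registry.get_id("file", "a.txt")         # same string, same ID
other = registry.get_id("file", "b.txt")         # different string, new ID
app = registry.get_id("application", "a.txt")    # separate ID space per kind
```

The IDs here happen to be sequential, but as stated above that is not a requirement.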

Another advanced use case is that the values are not strings but structures, and whether two structures are the same is decided by a more elaborate rule, for example: if PropA is equal, the structures are equal; if not, the structures are equal if PropB is equal.
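
As a minimal sketch, the example rule could be written like this. The `Item` type and the field names `prop_a` and `prop_b` are hypothetical stand-ins for PropA and PropB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical structure with the two properties from the rule."""
    prop_a: str
    prop_b: str


def same_item(left: Item, right: Item) -> bool:
    """Equal if PropA matches; otherwise equal if PropB matches."""
    if left.prop_a == right.prop_a:
        return True
    return left.prop_b == right.prop_b


x = Item(prop_a="a1", prop_b="b1")
y = Item(prop_a="a1", prop_b="b2")
z = Item(prop_a="a2", prop_b="b2")

x_matches_y = same_item(x, y)  # PropA is equal
y_matches_z = same_item(y, z)  # PropA differs, PropB is equal
x_matches_z = same_item(x, z)  # neither property is equal
```

Note that a rule of this shape is not necessarily transitive: in the example, `x` matches `y` and `y` matches `z`, but `x` does not match `z`. So a plain lookup by a single key is not enough to assign IDs under such a rule.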

To illustrate the problem: each partition is a computer in a network. Each event is an action on that computer. Events need to be ordered per computer, because events that change the state of the computer (e.g. a user logged in) can affect other types of events, and ordering is critical for that. Examples: a user opened an application, a file is written, a flash drive is inserted, etc. I need each application, file, flash drive, and many other things to have a unique identifier across all computers. This is then used to calculate statistics down-stream. Sometimes an event can have several of those, e.g. an operation on a specific file on a specific flash drive.

So you want the same strings to end up in the same partition? Let me know if I have misunderstood your problem.
– Raman Mishra
Jul 2 at 5:29

What exactly is the question?
– cricket_007
Jul 2 at 5:35

@RamanMishra No. The strings are part of a bigger event and won't be used as the partition key.
– Euphoric
Jul 2 at 6:00

Adding brokers is what scales, not adding partitions to a limited set of machines.
– cricket_007
Jul 2 at 6:01

Okay, so if you set a null key, then messages will round-robin over all partitions. If you generate key values as random UUIDs, then you potentially run the risk of "hot partitions", where your data gets skewed onto those. The same applies if you define your own partitioner class. If you care about tracking messages by key, then ordering is guaranteed within a partition for matching keys. Again, scalability comes more from the number of producers and the hardware. At some point you will be CPU or network capped, and adding partitions will not help.
– cricket_007
Jul 2 at 6:07
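
The keyed versus null-key behaviour described in the comment above can be sketched roughly as follows. This is a simplified illustration, not Kafka's actual partitioner: CRC32 is used here only as a stand-in hash (the Java client's default partitioner hashes the key bytes with murmur2), and `make_partitioner` and the key `"computer-42"` are hypothetical.

```python
import zlib
from itertools import count
from typing import Callable, Optional


def make_partitioner(num_partitions: int) -> Callable[[Optional[str]], int]:
    """Return a function that picks a partition for a message key."""
    round_robin = count()

    def choose(key: Optional[str]) -> int:
        if key is None:
            # No key: spread messages over all partitions in turn.
            return next(round_robin) % num_partitions
        # Keyed: the same key always maps to the same partition, which is
        # what gives per-key ordering within that partition.
        return zlib.crc32(key.encode("utf-8")) % num_partitions

    return choose


choose = make_partitioner(1500)
keyed_first = choose("computer-42")
keyed_second = choose("computer-42")
unkeyed = [choose(None) for _ in range(3)]
```

With a per-computer key, all events of one computer stay in one partition and keep their order; with a null key they are spread out and per-computer ordering is lost.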
