Sunday, January 29, 2023

Kafka is good for transport, not for system boundaries

Over the past few years I have learned that you should not run Kafka as a system boundary. A system boundary in this article is the place where messages are passed from one autonomy domain to another.

Now why is that? Let’s look at two classes of problems: connecting to Kafka and the long feedback loop. To prove my points, I am going to bore you with long stories from my personal experience. You may be in a different situation, YMMV!

Problem 1: Connecting to Kafka is hard

Compared to calling an HTTP endpoint, sending messages to Kafka is much, much harder.

Don’t agree? Watch out for observation bias! On holiday we often take long highway drives through unfamiliar countries. After looking at a highway for several hours non-stop, you might be inclined to believe that the entire country is covered by a dense highway network. In reality though, the next highway might be 200 km away. A similar thing can happen at work. My part of the company offers Kafka as a service. We also run several services that invariably use Kafka in some way. We have deep knowledge and experience. It would be easy to think that Kafka is simple for everyone. However, for the rest of the company this Kafka thing is just another faraway system that they have to integrate with, and their knowledge will be spotty and incomplete.

Let’s look at some of the problems that you have to deal with.

Partitioning is hard

It is easier to deal with partitioning problems when you control both the producer and the broker. We once had a problem where our systems could not keep up with the inflow of Kafka messages from one of the producers. The weird thing was that most of the machines were just idling. The problem grew slowly, so it took us some time before we realized it was caused by a few partitions receiving most of the traffic. Producers of Kafka events do not always realize the effect of badly chosen key values. When many messages have the same key, they end up in the same partition. It took some time to get across to the producing team that they needed to change the message key.
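
To make the effect of a hot key concrete, here is a minimal sketch of key-based partitioning. It is a simulation, not Kafka code: the partition count, the key names, and the use of CRC32 are assumptions for illustration (Kafka's default partitioner hashes keys with murmur2), but the principle of "hash of the key, modulo the number of partitions" is the same.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 12  # hypothetical topic size


def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in hash for illustration only; Kafka itself uses murmur2.
    return zlib.crc32(key) % num_partitions


def partition_load(keys):
    # Count how many messages land on each partition.
    return Counter(partition_for(key) for key in keys)


# 900 of 1000 messages share one key (hypothetical "tenant" keys).
skewed_keys = [b"tenant-1"] * 900 + [f"tenant-{i}".encode() for i in range(2, 102)]
# 1000 messages, each with its own key (hypothetical "order" keys).
spread_keys = [f"order-{i}".encode() for i in range(1000)]

skewed = partition_load(skewed_keys)
spread = partition_load(spread_keys)

print("busiest partition, skewed keys:", max(skewed.values()))
print("busiest partition, spread keys:", max(spread.values()))
```

With the skewed keys, one partition receives at least 900 of the 1000 messages no matter how many partitions or consumers you add, which matches the picture of one busy machine and many idle ones.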

When you run an HTTP endpoint, spreading traffic and partitioning is handled by the load balancer and is therefore under the control of the receiver and not the sender.

Cross-network connections are hard

Producers and the Kafka brokers need to have the same view of the network. This is because the brokers tell a producer which broker (by DNS name or IP address) it needs to connect to for each partition. This can go wrong when the producers and brokers use different DNS servers, or when they are on networks with colliding IP address ranges. Getting this right is a lot easier when you’re running everything in a single network you control.
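
One way to catch the DNS half of this early is to check, from the producer's side of the network, whether the addresses that the brokers advertise can be resolved at all. The sketch below assumes you already have the list of advertised host/port pairs (brokers typically publish these through their `advertised.listeners` setting); the function name and the host names in the usage note are made up for illustration.

```python
import socket


def unresolvable_brokers(advertised, resolve=socket.getaddrinfo):
    """Return the advertised (host, port) pairs that cannot be resolved here.

    `advertised` is a list of (host, port) tuples as handed out by the
    cluster. `resolve` is injectable so the check can be tested without DNS.
    """
    failures = []
    for host, port in advertised:
        try:
            resolve(host, port)
        except OSError:  # socket.gaierror is a subclass of OSError
            failures.append((host, port))
    return failures
```

Calling `unresolvable_brokers([("broker-1.kafka.internal", 9092)])` on a producer machine returns the entries that machine cannot resolve. An empty result only says that DNS works; it does not prove that the resolved IP address is the right one when address ranges collide.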

This is not a problem with HTTP endpoints. Producers only need one hostname and optionally an HTTP proxy.

We haven’t talked about authentication and encryption yet. Kafka is very flexible; it has many knobs and settings in this area, and the producers have to be configured exactly right or else it just won’t work. And don’t expect good error messages. Good documentation and cooperation are required to make this work across different teams.

With HTTP endpoints, encryption is very well supported through HTTPS. Authentication is straightforward with HTTP’s basic authentication.
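
To show how little the producer has to configure, here is a minimal sketch of a request that uses HTTPS and basic authentication, built with Python's standard library. The URL, the credentials, and the function name are placeholders; only the shape of the `Authorization` header is fixed by the HTTP basic authentication scheme.

```python
import base64
import urllib.request


def build_request(url: str, username: str, password: str, body: bytes):
    # Basic authentication: base64("username:password") in the Authorization header.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical endpoint and credentials, for illustration only.
request = build_request(
    "https://ingest.example.com/messages", "user", "pass", b'{"id": 1}'
)
```

Passing `request` to `urllib.request.urlopen` would send it. Because the URL starts with `https://`, the connection is encrypted without any further settings on the producer's side.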

Problems that have been solved

Just for completeness, here are some problems from around 2019 that have since been solved.

Around 2019 Kafka did not support authentication and TLS out of the box. Crossing untrusted networks was quite cumbersome.

Also around that time you had to be very careful about versioning. The client and server had to be upgraded in a very controlled order. Today this looks much better; you can combine almost any client and server version.

The default partitioner would give slow brokers more work instead of less. This was solved a few months ago.

Problem 2: Long feedback loop

When messages are given to you via Kafka, you cannot reject them. They are send-and-forget: once a message is sent, the producer no longer cares. Dealing with invalid messages is now your responsibility.

In one of our projects we used to set invalid messages aside and send Slack alerts so that the producers knew they had to look at the validation errors. Unfortunately, it didn’t work well. The feedback loop was simply too long and the number of invalid messages stayed high.

Later we introduced an HTTP endpoint in which we reject invalid messages with a 400 response. This simple change was nothing less than a miracle. For every producer that switched, the vast majority of invalid messages disappeared. The number of invalid messages has remained very low since then.

Because we were able to reject invalid messages, the feedback loop shortened and became much more effective.
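
A minimal sketch of such a rejecting endpoint, using only Python's standard library. It is not our actual implementation: the required fields, the 202 status for accepted messages, and all names are assumptions for illustration. The point is that the producer gets the validation errors back in the response to the very request that caused them.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUIRED_FIELDS = ("id", "timestamp")  # hypothetical message schema


def validation_errors(body: bytes) -> list:
    """Return a list of problems with the message; empty means valid."""
    try:
        message = json.loads(body)
    except ValueError as exc:  # covers invalid JSON and invalid UTF-8
        return [f"body is not valid JSON: {exc}"]
    if not isinstance(message, dict):
        return ["body must be a JSON object"]
    return [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in message]


class IngestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        errors = validation_errors(self.rfile.read(length))
        if errors:
            # Reject right away: the sender sees the errors in the response.
            self._reply(400, {"errors": errors})
            return
        # A real service would hand the message to its internal transport
        # here (for example, produce it to Kafka inside its own domain).
        self._reply(202, {"status": "accepted"})

    def _reply(self, status: int, payload: dict):
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def make_server(port: int = 8080) -> HTTPServer:
    # Binds to localhost only; call serve_forever() on the result to run it.
    return HTTPServer(("127.0.0.1", port), IngestHandler)
```

Running `make_server().serve_forever()` and posting `{"id": 1}` to it returns a 400 with `missing field: timestamp` in the body, so the producing team sees the problem during their own development and testing rather than in a Slack alert days later.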

Conclusions

Kafka within your own autonomy domain can be a great solution for message transport. However, Kafka as a boundary between autonomy domains will hurt.

Footnotes

  1. Though at high enough volume, HTTP is not easy either; you’ll need proper connection pooling and an endpoint that accepts batches, or else you’ll have to deploy a huge server park.
  2. Many load balancers offer sticky sessions, which is a weak form of partitioning.
  3. We suffered both.
  4. When your authentication settings are wrong, the Kafka command line tools tell you that by showing an OutOfMemoryError. My head still hurts from this one.
  5. Though unfortunately, many architects will make this complex by using OAuth or other such systems.
  6. Most invalid messages could be fixed with a few minutes of coding time.