Interesting post, but I do have a few observations.
1. Sliding window functions live in the Kafka Streams API and KSQL (see the windowed-count sketch after this list), along with better handling for Avro/Protobuf/JSON schemas, so you know your incoming data is valid. A plain consumer is liable to stop if it hits any bad data.
2. Never rely on being able to reset the offset. In a large organisation you won't have that option easily available to you. Plus, if DevOps decide to change the retention and don't tell you (it does happen), there's an awkward conversation to be had about why you can't access the data.
3. A Streams API job would let you consume, compute and then produce to a topic; from there, the file sink connector (FileStreamSink) in Kafka Connect can do the file writing, or the S3 sink if that works better for you (see the second sketch after this list).
4. Logging/monitoring/alerting for the Python applications. It's another dependency, not a huge problem, but who monitors the app as well as the cluster?
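To make point 1 concrete, here's a minimal Kafka Streams sketch of a windowed count. It uses a hopping window (the DSL also has a dedicated SlidingWindows type); the topic name, serdes and window sizes are just placeholder assumptions for illustration:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.TimeWindows;

import java.time.Duration;
import java.util.Properties;

public class WindowedCountExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "windowed-count-example");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Count records per key in 5-minute windows that advance every minute (hopping window)
        builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(5), Duration.ofSeconds(30))
                                      .advanceBy(Duration.ofMinutes(1)))
               .count()
               .toStream()
               .foreach((windowedKey, count) -> System.out.printf("%s @ %s -> %d%n",
                       windowedKey.key(), windowedKey.window().startTime(), count));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```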
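And for point 3, a rough sketch of the consume -> compute -> produce shape. Topic names and the transform are placeholders, not anything from the original post; the point is that the Streams job only writes to a topic and leaves the file/S3 writing to a sink connector:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class ConsumeComputeProduce {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "consume-compute-produce");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // consume -> compute -> produce; the file writing is left to Kafka Connect
        builder.stream("raw-events", Consumed.with(Serdes.String(), Serdes.String()))
               .filter((key, value) -> value != null && !value.isEmpty()) // drop bad records instead of dying on them
               .mapValues(value -> value.trim())                          // stand-in for the real computation
               .to("processed-events", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Then the file writing itself is just the stock file sink connector config that ships with Kafka (connect-file-sink.properties), pointed at the output topic:

```properties
name=local-file-sink
connector.class=org.apache.kafka.connect.file.FileStreamSinkConnector
tasks.max=1
topics=processed-events
file=/tmp/processed-events.txt
```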
It's a very good post and the process is sound. I'd sit the team down in a meeting with a piece of paper first and get them to think about every piece of the process, from input to output. Then tell them to go away and think about it again.
Kafka is not a database. It's not indexed :)