📨
Messages Being Lost or Processed Multiple Times
An order event was published but the notification service never received it. Or the payment service processed the same payment twice because the consumer crashed after completing the work but before committing the offset, so the message was redelivered. At-least-once delivery sounds simple until you realize your consumers are not idempotent.
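One common way to make redelivery harmless is to commit the offset only after the work finishes and to deduplicate on an event ID. The sketch below simulates that in memory; `PaymentEvent`, `event_id`, and `IdempotentPaymentHandler` are hypothetical names, and a real consumer would need to store the processed-ID record durably alongside the side effect.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentEvent:
    event_id: str       # hypothetical unique ID assigned by the producer
    amount_cents: int


@dataclass
class IdempotentPaymentHandler:
    processed_ids: set = field(default_factory=set)
    total_charged_cents: int = 0

    def handle(self, event: PaymentEvent) -> bool:
        """Charge once per event_id; return False for a duplicate delivery."""
        if event.event_id in self.processed_ids:
            return False
        self.total_charged_cents += event.amount_cents
        self.processed_ids.add(event.event_id)
        return True


def consume(events, handler):
    """Process each event, then record its offset as committed."""
    committed_offset = -1
    for offset, event in enumerate(events):
        handler.handle(event)
        committed_offset = offset  # commit only after the work is done
    return committed_offset
```

If the same `event_id` arrives twice, the second delivery is skipped, so the payment is charged once even though the message was delivered twice.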
🔀
Event Ordering and Consistency Issues
A user updated their profile and then placed an order, but the order service processed the order event before the profile update arrived. Partition keys were not set correctly, consumer group rebalancing shuffled assignments, and now your downstream services see events in the wrong sequence.
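To illustrate why the partition key matters: Kafka only guarantees ordering within a single partition of a topic, and keyed records are assigned to a partition by hashing the key. The sketch below uses `zlib.crc32` as a stand-in hash (Kafka's default partitioner uses murmur2), and the `route` helper and the event names are made up for illustration.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition; equal keys always map to the same one."""
    # crc32 is a stand-in here; Kafka's default partitioner uses a different hash.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Group (key, payload) pairs by partition, preserving publish order."""
    partitions = {p: [] for p in range(num_partitions)}
    for key, payload in events:
        partitions[partition_for(key, num_partitions)].append(payload)
    return partitions
```

Keying both events by the same user ID keeps them in publish order on one partition. Note that this holds only when the events share a topic; events published to different topics have no relative ordering guarantee, whatever the key.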
📉
Kafka Consumer Lag Growing Uncontrollably
Your consumer group's lag keeps increasing and you cannot figure out why. Is it slow processing, too few partitions, a downstream dependency that is throttling requests, or a poison message that keeps causing retries? The consumer metrics do not tell you enough to diagnose the bottleneck.
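A first diagnostic step is often to look at lag per partition instead of one group-wide total. Lag is the log end offset minus the committed offset; a partition whose committed offset has stopped moving while lag remains hints at a stuck or poison message, whereas lag growing evenly across partitions points more toward throughput. The sketch below assumes you have already collected the offsets as plain dicts keyed by partition number; both function names are hypothetical.

```python
def partition_lag(end_offsets, committed_offsets):
    """Lag per partition: log end offset minus committed offset."""
    return {p: end_offsets[p] - committed_offsets.get(p, 0) for p in end_offsets}


def stalled_partitions(previous_committed, current_committed, current_lag):
    """Partitions that still have lag but made no progress between two samples."""
    return sorted(
        p
        for p, lag in current_lag.items()
        if lag > 0 and current_committed.get(p, 0) == previous_committed.get(p, 0)
    )
```

Sampling the committed offsets twice, a few seconds apart, is enough to separate "slow everywhere" from "stuck on one partition".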
❓
Unclear When to Use Events vs API Calls
Your team is debating whether a new feature should use events or direct API calls. Some interactions naturally fit async patterns while others need synchronous responses. Making the wrong choice leads to unnecessary complexity or awkward request-reply patterns over a message broker.
📐
Event Schema Evolution Breaking Consumers
You added a new field to an event schema and three consumers broke. Or you renamed a field and old events in the topic cannot be deserialized anymore. Without a schema evolution strategy, every schema change is a potential production incident.
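One defensive pattern on the consumer side is a tolerant reader: ignore fields you do not know, give new fields defaults, and map renamed fields through an alias table so old events still parse. The sketch below is a minimal illustration with plain dicts; the `OrderCreated` event, its fields, and the `amount_cents` to `total_cents` rename are all invented for the example.

```python
from dataclasses import dataclass, fields


@dataclass
class OrderCreated:
    order_id: str
    total_cents: int
    currency: str = "USD"  # newer field; the default keeps old events readable


# Hypothetical rename: old events used "amount_cents" for "total_cents".
FIELD_ALIASES = {"amount_cents": "total_cents"}


def parse_order_created(payload: dict) -> OrderCreated:
    """Build the event from a payload, tolerating extra and renamed fields."""
    normalized = {FIELD_ALIASES.get(k, k): v for k, v in payload.items()}
    known = {f.name: normalized[f.name] for f in fields(OrderCreated) if f.name in normalized}
    return OrderCreated(**known)
```

With this approach, adding an optional field or renaming one does not break existing consumers, while a payload missing a required field still fails loudly with a `TypeError`.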
☠️
Dead Letter Queues Filling Up with No Resolution
Failed messages are routed to the DLQ but nobody looks at them. They accumulate for months until the queue is full and new failures start getting dropped silently. There is no process for analyzing, fixing, and replaying dead-lettered messages.
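A lightweight triage loop can be sketched as two steps: summarize dead letters by error so you know what to fix first, then replay them through the corrected handler and keep only the ones that still fail. The `DeadLetter` record and both helper functions below are hypothetical and work on an in-memory list, not on a real queue.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class DeadLetter:
    payload: dict
    error: str        # name of the exception that caused the dead-lettering
    attempts: int = 1


def summarize(dead_letters):
    """Count dead letters per error type to show what to fix first."""
    return Counter(d.error for d in dead_letters)


def replay(dead_letters, handler):
    """Re-run each payload; return (replayed_count, still_failing)."""
    remaining = []
    replayed = 0
    for d in dead_letters:
        try:
            handler(d.payload)
            replayed += 1
        except Exception as exc:
            d.attempts += 1
            d.error = type(exc).__name__
            remaining.append(d)
    return replayed, remaining
```

Running `summarize` on a schedule and alerting on its totals is one way to keep the queue from growing unnoticed.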