Redis, RabbitMQ, and BullMQ in Real Products

Apr 12, 2026 · 11 min read · Backend

Choosing between a cache, a broker, and a job queue requires understanding latency, durability, retries, ownership, and the operational behavior each tool introduces.

These tools solve different problems

Redis is excellent for fast, short-lived coordination: caching query results and rendered responses, tracking counters and metrics, managing distributed locks, storing session tokens, enforcing rate limits, and recording idempotency keys. Its speed comes from keeping everything in memory, and its simplicity comes from a clean set of data structures with predictable behavior. Redis is not a database replacement — it is the fast layer in front of systems that need coordination without persistence guarantees.

RabbitMQ is better suited for explicit messaging between parts of a system. It is designed around delivery semantics: with manual acknowledgements, a message counts as handled only once the consumer confirms it has processed it, not merely received it. Queues survive consumer restarts, and durable queues holding persistent messages also survive broker restarts. Exchanges route messages to the right consumers based on routing keys and binding rules. Dead-letter queues catch messages that could not be processed. This is a broker built for durability and delivery guarantees, not speed alone.

BullMQ is practical for application-level job scheduling: sending emails after a delay, generating reports on a schedule, processing uploads in the background, running cleanup tasks at night, and retrying failed external API calls with exponential backoff. It is a job queue library built on top of Redis that provides a higher-level API for common application job patterns without the operational overhead of running a separate message broker.
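
As a rough illustration of the retry pattern mentioned above, the sketch below computes an exponential backoff schedule. The option names (attempts, backoff.type, backoff.delay) mirror the shape of BullMQ job options, but the RetryOptions type and the retryDelays helper are hypothetical, and the plain doubling formula is an assumption made for illustration; check the BullMQ documentation for the exact delays it applies.

```typescript
// Hypothetical local type. It mirrors the shape of BullMQ job options but is
// defined here so the sketch runs without any dependencies.
interface RetryOptions {
  attempts: number; // total tries, including the first one
  backoff: { type: "fixed" | "exponential"; delay: number }; // delay in ms
}

// Returns the wait (in ms) before each retry. Assumes a plain doubling
// schedule for "exponential": delay, 2*delay, 4*delay, ...
function retryDelays(opts: RetryOptions): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry < opts.attempts; retry++) {
    const factor = opts.backoff.type === "exponential" ? Math.pow(2, retry - 1) : 1;
    delays.push(opts.backoff.delay * factor);
  }
  return delays;
}

const emailRetry: RetryOptions = {
  attempts: 5,
  backoff: { type: "exponential", delay: 1000 },
};
console.log(retryDelays(emailRetry)); // [ 1000, 2000, 4000, 8000 ]
```

Five attempts means four retries, so a call to a flaky external API would be spread over roughly fifteen seconds instead of hammering the provider immediately.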

The right choice depends on the failure behavior you need, not the API you prefer. Losing a few cached records under memory pressure is an acceptable cost. Losing a payment notification is not. Understanding the failure characteristics of each tool before choosing it prevents painful architectural changes later.

Redis in production: what to watch for

Redis's speed and simplicity make it easy to reach for in almost every situation, which is also what makes it dangerous when misused. The most common mistake is using Redis as a primary data store for business records because it is fast and the data fits in memory. When Redis restarts without persistence, or when the memory limit is reached and the eviction policy removes records, the 'fast database' becomes data loss.

Eviction policies deserve explicit attention. Behavior under memory pressure is controlled by the maxmemory-policy setting, and the default may not suit your use case: open-source Redis ships with noeviction, which rejects writes once the memory limit is reached, while some managed services default to a policy that evicts keys. A cache should probably evict least-recently-used items. A rate limit counter should never be evicted. An idempotency key should not be evicted before it expires naturally. These different requirements may mean different Redis instances with different configurations, not one shared instance.

Persistence options (RDB snapshots and AOF logs) reduce the risk of data loss on restart but introduce their own tradeoffs in write performance and recovery time. An RDB snapshot taken every five minutes can lose up to five minutes of writes on a crash. AOF logging provides much stronger durability (with the common appendfsync everysec setting, roughly the last second of writes is at risk) but increases write amplification. The choice should be deliberate, based on what the application can tolerate losing.

Cluster mode adds horizontal scaling but changes the semantics of multi-key operations. MULTI/EXEC transactions, Lua scripts, and multi-key commands only work when all the keys involved map to the same hash slot. Hash tags, where only the part of the key inside curly braces is hashed, are the usual way to keep related keys in one slot. BullMQ and many locking libraries rely on multi-key operations and may require configuration changes or compatibility shims to work correctly in cluster mode.
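
To make the hash slot constraint concrete, here is a minimal sketch of how Redis Cluster maps a key to one of its 16384 slots: CRC16 of the key (or of the hash tag inside curly braces, if present), modulo 16384. The function names are illustrative, and the sketch assumes ASCII keys for brevity; a real client library already does this for you.

```typescript
// CRC16, XMODEM variant (polynomial 0x1021, initial value 0).
function crc16(bytes: number[]): number {
  let crc = 0;
  for (let i = 0; i < bytes.length; i++) {
    crc ^= bytes[i] << 8;
    for (let bit = 0; bit < 8; bit++) {
      crc = (crc & 0x8000) ? ((crc << 1) ^ 0x1021) & 0xffff : (crc << 1) & 0xffff;
    }
  }
  return crc;
}

// If the key contains "{...}" with at least one character inside, only that
// part is hashed. Assumption: keys are ASCII (one byte per character).
function hashSlot(key: string): number {
  let hashed = key;
  const open = key.indexOf("{");
  if (open !== -1) {
    const close = key.indexOf("}", open + 1);
    if (close > open + 1) hashed = key.slice(open + 1, close);
  }
  const bytes: number[] = [];
  for (let i = 0; i < hashed.length; i++) bytes.push(hashed.charCodeAt(i) & 0xff);
  return crc16(bytes) % 16384;
}

console.log(hashSlot("foo")); // 12182
// Keys that share a hash tag always land in the same slot:
console.log(hashSlot("{user:1}:cart") === hashSlot("{user:1}:profile")); // true
```

Without a shared tag, two related keys will usually land in different slots, which is why a library that touches several keys in one script needs its key prefix to carry a hash tag before it can run against a cluster.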

Queues do not remove failure

A queue moves failure into a place where it can be handled more deliberately. It does not make failure disappear. If the job is not idempotent, observable, and retry-safe, the queue simply delays the incident — and adds the operational complexity of a queue to the investigation.

Idempotency in job processing means that processing the same message twice produces the same result as processing it once. An email job that is retried after a transient failure should not send two emails. A billing job that is replayed after a crash should not charge the customer twice. Achieving idempotency often requires checking external state before taking action (has this email already been sent? has this charge already been created?), which adds latency but prevents the worse outcome of duplicate effects.
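
The check-before-acting idea can be sketched in a few lines. Everything here is hypothetical: the EmailJob shape, the in-memory completed set standing in for external state, and the sendEmail stub. A real system would keep that state in a database table or in Redis keys with a suitable expiry.

```typescript
// Hypothetical job shape; the idempotency key identifies the intended effect.
interface EmailJob {
  idempotencyKey: string;
  to: string;
}

const completed = new Set<string>(); // stands in for external state
const sent: string[] = [];           // records side effects for the demo

function sendEmail(job: EmailJob): void {
  sent.push(job.to); // a real implementation would call the mail provider
}

// Returns true if the side effect ran, false if it was skipped as a duplicate.
function handleEmailJob(job: EmailJob): boolean {
  if (completed.has(job.idempotencyKey)) return false; // already done
  sendEmail(job);
  completed.add(job.idempotencyKey); // record only after the effect succeeded
  return true;
}

const welcome: EmailJob = { idempotencyKey: "welcome:user-42", to: "user42@example.com" };
console.log(handleEmailJob(welcome)); // true  (email sent)
console.log(handleEmailJob(welcome)); // false (retry skipped)
console.log(sent.length);             // 1
```

This simple check-then-act version still has a gap: a crash between sending and recording would cause a duplicate on retry. Closing that gap usually means claiming the key atomically before acting, or passing the key to a provider that deduplicates on its side.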

Dead-letter behavior should be designed before the first production failure, not after. When a job fails its maximum retries, where does it go? How does an operations engineer see it? How do they inspect the failure reason? How do they replay it after the underlying issue is fixed? Systems that do not answer these questions accumulate failed jobs invisibly until someone notices an anomaly in the business metrics.
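
One way to answer those questions up front is to give failed jobs an explicit home that keeps the payload, the attempt count, and the failure reason. The sketch below is an in-memory illustration only; DeadLetterStore and runWithRetries are hypothetical names, and a real setup would use the dead-letter or failed-job features of the queue itself.

```typescript
interface FailedJob<T> {
  payload: T;
  attempts: number;
  reason: string;
  failedAt: Date;
}

class DeadLetterStore<T> {
  private items: FailedJob<T>[] = [];
  add(job: FailedJob<T>): void { this.items.push(job); }
  // Lets an operator inspect what failed and why.
  list(): FailedJob<T>[] { return this.items.slice(); }
  // Removes and returns everything so the caller can replay it.
  drain(): FailedJob<T>[] { const out = this.items; this.items = []; return out; }
}

// Tries the handler up to maxAttempts times; on final failure the job is
// parked in the dead-letter store instead of disappearing.
function runWithRetries<T>(
  payload: T, handler: (p: T) => void, maxAttempts: number, dlq: DeadLetterStore<T>
): boolean {
  let lastError = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      handler(payload);
      return true;
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  dlq.add({ payload, attempts: maxAttempts, reason: lastError, failedAt: new Date() });
  return false;
}

const dlq = new DeadLetterStore<{ reportId: number }>();
let storageUp = false;
const generateReport = (_p: { reportId: number }): void => {
  if (!storageUp) throw new Error("storage provider unavailable");
};

console.log(runWithRetries({ reportId: 7 }, generateReport, 3, dlq)); // false
console.log(dlq.list()[0].reason); // storage provider unavailable
storageUp = true; // underlying issue fixed, replay the parked jobs
for (const failed of dlq.drain()) {
  console.log(runWithRetries(failed.payload, generateReport, 3, dlq)); // true
}
```

The demo retries immediately for brevity; in practice the attempts would be spaced out with a backoff delay.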

I also avoid hiding user-facing truth inside background work without a visible status model. If a user triggers an action that results in a background job, and that job fails, the user should eventually see a status that reflects the failure — not a perpetual 'processing' state with no resolution. The queue is an implementation detail. The user-facing status model is the product.
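
A visible status model can be as small as a set of states and the transitions allowed between them. The sketch below is one possible shape, not a prescription; the ExportStatus states, the transition table, and the messages are all assumptions chosen for illustration.

```typescript
type ExportStatus = "queued" | "processing" | "succeeded" | "failed";

// Which states may follow which. "failed" is a real, terminal-until-retried
// state, so a job can never sit in "processing" forever without resolution.
const allowed: Record<ExportStatus, ExportStatus[]> = {
  queued: ["processing", "failed"],
  processing: ["succeeded", "failed", "queued"], // back to queued on retry
  succeeded: [],
  failed: ["queued"], // manual replay
};

function transition(from: ExportStatus, to: ExportStatus): ExportStatus {
  if (allowed[from].indexOf(to) === -1) {
    throw new Error("Invalid transition: " + from + " -> " + to);
  }
  return to;
}

// What the user sees for each state.
const userMessage: Record<ExportStatus, string> = {
  queued: "Your export is waiting to start.",
  processing: "Your export is being generated.",
  succeeded: "Your export is ready.",
  failed: "Export failed. You can retry it.",
};

let status: ExportStatus = "queued";
status = transition(status, "processing");
status = transition(status, "failed");
console.log(userMessage[status]); // Export failed. You can retry it.
```

The worker that runs the job writes these transitions to the application's own data store, so the product can show the truth even if the queue's internal records are cleaned up.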

RabbitMQ patterns worth knowing

The exchange and binding model in RabbitMQ is more flexible than a simple queue, and knowing the right pattern for a use case prevents over-engineering. Direct exchanges route messages to the queue with a matching routing key — useful when the producer knows exactly which consumer should receive the message. Topic exchanges use pattern matching on routing keys — useful when consumers want to subscribe to categories of events rather than specific message types. Fanout exchanges deliver to all bound queues — useful for broadcasting events that multiple services need to react to independently.
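
The three routing rules can be expressed as a small matching function. This is a sketch of the semantics only, with hypothetical function names; the broker does this internally. In topic patterns, words are separated by dots, an asterisk matches exactly one word, and a hash sign matches zero or more words.

```typescript
type ExchangeType = "direct" | "topic" | "fanout";

// Topic matching: "*" is exactly one word, "#" is zero or more words.
function topicMatches(pattern: string, routingKey: string): boolean {
  const p = pattern.split(".");
  const k = routingKey.split(".");
  function match(pi: number, ki: number): boolean {
    if (pi === p.length) return ki === k.length;
    if (p[pi] === "#") {
      // Try letting "#" absorb 0, 1, 2, ... of the remaining words.
      for (let next = ki; next <= k.length; next++) {
        if (match(pi + 1, next)) return true;
      }
      return false;
    }
    if (ki === k.length) return false;
    return (p[pi] === "*" || p[pi] === k[ki]) && match(pi + 1, ki + 1);
  }
  return match(0, 0);
}

// Would a queue bound with bindingKey receive a message with routingKey?
function routes(type: ExchangeType, bindingKey: string, routingKey: string): boolean {
  if (type === "fanout") return true;                     // key is ignored
  if (type === "direct") return bindingKey === routingKey; // exact match
  return topicMatches(bindingKey, routingKey);
}

console.log(routes("direct", "order.created", "order.created"));   // true
console.log(routes("topic", "order.*", "order.created"));          // true
console.log(routes("topic", "order.*", "order.created.eu"));       // false
console.log(routes("topic", "order.#", "order.created.eu"));       // true
console.log(routes("fanout", "", "anything.at.all"));              // true
```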

Consumer acknowledgement mode is the most important operational decision. Automatic acknowledgement, where the broker treats a message as acknowledged as soon as it has been sent to the consumer, is the fastest mode but provides no delivery guarantee: if the consumer crashes after receiving the message but before finishing it, the message is lost. Manual acknowledgement, where the consumer explicitly confirms success or rejects the message after processing, is slower, but an unacknowledged message is redelivered if the consumer's connection drops, and a rejected one can be requeued or dead-lettered.
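
A manual-acknowledgement handler might look like the sketch below. The Delivery and AckChannel interfaces are minimal local types, assumed to match the ack and nack methods of an amqplib channel so that a real channel could be passed in; handleDelivery itself is a hypothetical helper.

```typescript
// Minimal structural types so the sketch runs without dependencies.
// Assumption: they line up with the relevant parts of amqplib's channel.
interface Delivery {
  content: Uint8Array;
}
interface AckChannel {
  ack(msg: Delivery): void;
  nack(msg: Delivery, allUpTo?: boolean, requeue?: boolean): void;
}

// Acknowledge only after the work succeeded; reject without requeueing on
// failure so the broker can dead-letter the message if a dead-letter
// exchange is configured for the queue.
async function handleDelivery(
  channel: AckChannel,
  msg: Delivery,
  work: (m: Delivery) => Promise<void>
): Promise<void> {
  try {
    await work(msg);
    channel.ack(msg);
  } catch {
    channel.nack(msg, false, false); // allUpTo=false, requeue=false
  }
}
```

With amqplib, a handler like this would be called from the callback passed to channel.consume with the noAck option left at false. Rejecting with requeue set to true instead would put the message straight back on the queue, which can produce a tight redelivery loop for a message that can never succeed.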

Prefetch count controls how many unacknowledged messages a consumer can hold at once. A prefetch of one means the consumer processes one message at a time before receiving another. A higher prefetch allows parallelism but can create uneven load distribution if some messages take much longer to process than others. For most application-level job processing, a prefetch of three to ten is a reasonable starting point.
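
The uneven-load effect is easy to see in a toy simulation. The model below is a deliberate simplification and entirely hypothetical: messages are handed out round-robin until each consumer holds its prefetch limit, each consumer works through its buffer in order, and a new message is delivered whenever one is acknowledged. Real brokers and consumers are messier, but the shape of the result carries over.

```typescript
// Returns the time at which the last message finishes.
function makespan(durations: number[], consumers: number, prefetch: number): number {
  const pending = durations.slice();  // messages still in the queue
  const unacked: number[][] = [];     // per consumer: finish times of held messages
  const busyUntil: number[] = [];     // per consumer: when its buffer will be done
  for (let c = 0; c < consumers; c++) { unacked.push([]); busyUntil.push(0); }

  const deliver = (c: number, now: number): void => {
    const finish = Math.max(busyUntil[c], now) + (pending.shift() as number);
    busyUntil[c] = finish;
    unacked[c].push(finish);
  };

  // Initial round-robin fill, up to `prefetch` messages per consumer.
  for (let round = 0; round < prefetch; round++) {
    for (let c = 0; c < consumers && pending.length > 0; c++) deliver(c, 0);
  }

  let end = 0;
  for (;;) {
    // The next acknowledgement is the earliest finish time across consumers.
    let next = -1;
    for (let c = 0; c < consumers; c++) {
      if (unacked[c].length > 0 && (next === -1 || unacked[c][0] < unacked[next][0])) next = c;
    }
    if (next === -1) return end;
    end = unacked[next].shift() as number;
    if (pending.length > 0) deliver(next, end); // ack frees a prefetch slot
  }
}

// One slow message (10 time units) followed by five quick ones, two consumers.
const work = [10, 1, 1, 1, 1, 1];
console.log(makespan(work, 2, 1)); // 10
console.log(makespan(work, 2, 3)); // 12
```

With a prefetch of one, the second consumer picks up all the quick messages while the first is busy. With a prefetch of three, two quick messages are already buffered behind the slow one, so they wait while the other consumer sits idle.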

Publisher confirms are the producer-side equivalent of consumer acknowledgements. A producer that does not wait for a confirm has no guarantee that the message was durably stored by the broker. For critical messages — payment events, booking confirmations, compliance records — publisher confirms should be enabled, even at the cost of throughput.
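
On the producer side, waiting for the confirm can be wrapped in a promise. The ConfirmingChannel interface below is a minimal local type, assumed to match the callback form of publish on an amqplib confirm channel; publishConfirmed is a hypothetical helper, and with the real library the content would be a Buffer.

```typescript
// Minimal structural type so the sketch runs without dependencies.
// Assumption: it lines up with publish on amqplib's confirm channel.
interface ConfirmingChannel {
  publish(
    exchange: string,
    routingKey: string,
    content: Uint8Array,
    options: { persistent?: boolean },
    callback: (err: unknown) => void
  ): boolean;
}

// Resolves only once the broker has confirmed the message; rejects if the
// broker reports that it could not take responsibility for it.
function publishConfirmed(
  channel: ConfirmingChannel,
  exchange: string,
  routingKey: string,
  content: Uint8Array
): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    channel.publish(exchange, routingKey, content, { persistent: true }, (err) => {
      if (err) reject(new Error("broker did not confirm the message"));
      else resolve();
    });
  });
}
```

A caller handling a payment event would await this before telling the upstream system that the event was accepted, and would retry or surface an error if the promise rejects. Retrying a publish can create duplicates, which is one more reason consumers should be idempotent.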

A practical rule

Use Redis when the data is temporary, latency-sensitive, or coordination-heavy and you can accept the risk of memory-bound eviction. Rate limiting, caching, sessions, locks, and presence indicators are the natural fit. Use RabbitMQ when messages cross service boundaries, require explicit delivery guarantees, need routing flexibility, or are handled by several consumers that each acknowledge their own work. Use BullMQ when the application needs jobs, schedules, retries, and a simple operational model that lives close to the application code.

The cleanest architectures usually use fewer tools, but use each one honestly. Adding a queue to avoid modeling state only moves complexity into a darker corner. Adding Redis to avoid thinking about data ownership creates a system where data lives in two places with inconsistent staleness semantics.

The questions worth asking before choosing are: what happens when this tool restarts? What happens when it is temporarily unavailable? What happens when the data grows beyond the expected size? What does an operations engineer need to do when a failure occurs in this layer? If the answers to those questions are clear and acceptable, the tool is a good fit. If the answers are unclear, the tool will introduce operational complexity that was not budgeted for.

Good backend design is less about choosing fashionable infrastructure and more about making failure predictable, visible, and recoverable. The tools are a means to that end — not the goal itself.
