Troubleshooting
If something isn't working, start here. The sections below are grouped by symptom — scan them before opening an issue.
Connection errors
NATS connection refused
The transport throws this on startup when the NATS server is unreachable.
Causes:
- NATS server is not running
- Wrong server URL in servers config
- Firewall or network policy blocking the port
Fix:
# Verify NATS is running
nats-server --version
docker ps | grep nats
# Test connectivity
nats server check connection --server nats://localhost:4222
CONNECTION_REFUSED in production
If the app crashes immediately on deploy, ensure the NATS server URL is correct for the environment. Use forRootAsync() with ConfigService to load URLs from environment variables:
JetstreamModule.forRootAsync({
name: 'my-service',
imports: [ConfigModule],
inject: [ConfigService],
useFactory: (config: ConfigService) => ({
servers: config.get<string>('NATS_SERVERS')!.split(','),
}),
})
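The non-null assertion in the factory above turns a missing NATS_SERVERS variable into an opaque TypeError at startup. A minimal sketch of a stricter parser follows; parseNatsServers is a hypothetical helper name, not part of the library. It trims each entry, drops empty ones, and fails with a descriptive message.

```typescript
// Hypothetical helper, not part of the library: turns a comma-separated
// NATS_SERVERS value into a clean list of server URLs.
function parseNatsServers(raw: string | undefined): string[] {
  if (raw === undefined || raw.trim() === '') {
    throw new Error('NATS_SERVERS is not set or empty');
  }
  const servers = raw
    .split(',')
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
  if (servers.length === 0) {
    throw new Error('NATS_SERVERS contains no server URLs');
  }
  return servers;
}
```

Inside useFactory this would read servers: parseNatsServers(config.get<string>('NATS_SERVERS')), so a misconfigured environment fails with a message that names the variable.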
Reconnection loop
The transport defaults to unlimited reconnection (maxReconnectAttempts: -1). If you see repeated Reconnecting... logs, the NATS server is flapping or the connection is being dropped.
Diagnosis:
- Register a Reconnect hook to see when reconnections happen:
  hooks: {
    [TransportEvent.Reconnect]: (server) => console.log(`Reconnected to ${server}`),
  }
- Check NATS server logs for client disconnect reasons.
- Verify TLS certificates haven't expired.
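To tell a flapping connection apart from an occasional reconnect, it can help to count reconnects inside a sliding time window. The sketch below is one way to do that. ReconnectFlapDetector, its window, and its threshold are illustrative assumptions, not library features.

```typescript
// Hypothetical helper: counts reconnects inside a sliding time window so a
// flapping connection can be told apart from an occasional reconnect.
class ReconnectFlapDetector {
  private timestamps: number[] = [];
  private readonly windowMs: number;
  private readonly threshold: number;

  constructor(windowMs: number, threshold: number) {
    this.windowMs = windowMs;
    this.threshold = threshold;
  }

  // Records one reconnect and returns true when the number of reconnects
  // inside the window has reached the threshold.
  record(nowMs: number = Date.now()): boolean {
    this.timestamps.push(nowMs);
    const cutoff = nowMs - this.windowMs;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length >= this.threshold;
  }
}
```

Calling detector.record() from the Reconnect hook and logging a warning when it returns true gives a single clear signal instead of a stream of individual reconnect lines.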
Consumer issues
Messages not being delivered
Checklist:
- Handler registered? Check startup logs for Registered handlers: X RPC, Y events, Z broadcasts (plus N ordered when ordered handlers are present). A zero count means the decorator didn't hit the registry, usually an import-order or module-wiring problem.
- Stream exists? The transport creates streams on startup. Check with nats stream ls.
- Consumer exists? Check with nats consumer ls <stream-name>.
- Subject matches? Use nats sub "servicename__microservice.ev.>" to see if messages arrive on the expected subject.
- Publisher-only mode? If consumer: false is set, no handlers are registered.
Messages redelivered unexpectedly
Messages are redelivered when the handler does not acknowledge them before the ack_wait deadline expires, or when they are nak'd after a handler error.
Causes:
- Handler is too slow (exceeds ack_wait)
- Handler throws an error (message is nak'd for retry)
- Connection lost during processing
Fix:
- Increase ack_wait:
  consumer: { ack_wait: toNanos(60, 'seconds') }
- Enable ackExtension: true for long-running handlers
- Set explicit concurrency to prevent overload
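JetStream expresses ack_wait in nanoseconds, so a value is easy to get wrong by several orders of magnitude. As a rough sketch, one way to size it is to start from the slowest handler duration you have measured and apply a safety multiplier. ackWaitNanos and its default factor of 2 are assumptions for illustration, not library defaults.

```typescript
const NANOS_PER_MILLI = 1_000_000;

// Hypothetical helper: derives an ack_wait value (in nanoseconds) from the
// slowest handler duration you have observed, with a safety multiplier.
function ackWaitNanos(slowestHandlerMs: number, safetyFactor = 2): number {
  if (slowestHandlerMs <= 0 || safetyFactor < 1) {
    throw new RangeError('slowestHandlerMs must be > 0 and safetyFactor >= 1');
  }
  return Math.ceil(slowestHandlerMs * safetyFactor) * NANOS_PER_MILLI;
}
```

A handler that has been observed to take up to 30 seconds would get ackWaitNanos(30_000), which is 60 seconds expressed in nanoseconds.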
Consumer lag growing
If your consumer is falling behind (messages accumulating faster than they're processed):
- Increase concurrency:
  events: { concurrency: 200 }
- Increase max_ack_pending:
  consumer: { max_ack_pending: 500 }
- Scale horizontally: Deploy more instances; each gets a share of the workqueue messages.
- Check handler performance: Profile your handlers. Database queries, external API calls, and heavy computations are common bottlenecks.
Monitor lag with the NATS CLI:
nats consumer info <stream> <consumer> | grep "Num Pending"
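Whether a backlog will drain at all depends on the gap between how fast messages are processed and how fast they are published. The sketch below does that arithmetic. estimateDrainSeconds is a hypothetical helper, and the rates are numbers you would measure yourself.

```typescript
// Hypothetical helper: estimates how long a backlog takes to drain, given the
// pending count and the observed processing and publish rates (messages/second).
// Returns Infinity when consumers are not outpacing publishers.
function estimateDrainSeconds(
  pending: number,
  processedPerSec: number,
  publishedPerSec: number,
): number {
  if (pending <= 0) return 0;
  const netRate = processedPerSec - publishedPerSec;
  if (netRate <= 0) return Infinity;
  return pending / netRate;
}
```

With 6,000 pending messages, 150 processed per second, and 100 published per second, the net rate is 50 per second and the backlog drains in about 120 seconds. An Infinity result means tuning or scaling is needed before the lag can shrink.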
RPC issues
RPC timeout errors
The caller didn't receive a response within the timeout period.
Causes:
- Handler is slow or stuck
- No handler registered for the pattern
- Network partition between publisher and consumer
- Wrong RPC mode: publisher uses core but handler expects jetstream (or vice versa)
Diagnosis:
hooks: {
[TransportEvent.RpcTimeout]: (subject, correlationId) =>
console.warn(`RPC timeout: ${subject} (${correlationId})`),
}
Fix:
- Increase timeout:
  rpc: { mode: 'core', timeout: 60_000 }
- Per-request timeout:
  new JetstreamRecordBuilder(data).setTimeout(120_000).build()
- Ensure both sides use the same RPC mode
No handler for subject warnings
A message arrived on a subject that has no registered handler.
Causes:
- Typo in the @EventPattern() or @MessagePattern() pattern
- Handler not imported in the module
- Subject naming mismatch between publisher and consumer
Fix: Compare the full NATS subject in the warning with your handler patterns. See Naming Conventions for the subject structure.
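When comparing subjects by hand, it helps to keep the NATS wildcard rules in mind: tokens are separated by dots, * matches exactly one token, and > matches one or more trailing tokens. The following sketch implements those rules for quick checks in a test or script. It is not the library's matcher, and the example subjects are illustrative.

```typescript
// Sketch of NATS subject matching: tokens are dot-separated, `*` matches
// exactly one token, and `>` matches one or more trailing tokens.
function subjectMatches(pattern: string, subject: string): boolean {
  const p = pattern.split('.');
  const s = subject.split('.');
  for (let i = 0; i < p.length; i++) {
    if (p[i] === '>') {
      // `>` is only valid as the last token and needs at least one token left.
      return i === p.length - 1 && s.length > i;
    }
    if (i >= s.length) return false;
    if (p[i] !== '*' && p[i] !== s[i]) return false;
  }
  return p.length === s.length;
}
```

For example, subjectMatches('servicename__microservice.ev.>', 'servicename__microservice.ev.order.created') is true, while the same pattern does not match the bare prefix 'servicename__microservice.ev'.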
Dead letter queue
onDeadLetter not called
The callback only fires when all of these conditions are met:
- onDeadLetter is configured in module options
- The message has been delivered max_deliver times (default: 3)
- The handler throws on every delivery attempt
Not applicable to:
- RPC commands (they use term, not DLQ)
- Ordered events (no ack/nak)
- Messages that are term'd (decode errors, missing handlers)
DLQ callback throws
If your onDeadLetter callback throws, the message is nak'd for another retry instead of being terminated. This is intentional — it allows transient failures (e.g., DLQ database is down) to recover.
DLQ stream publish fails
When dlq: { stream } is configured, the transport republishes exhausted messages to a dedicated DLQ stream. If that publish fails, the transport falls back to the onDeadLetter callback and then to nak() as a last resort — the full sequence is documented in the Fallback chain.
Causes:
- DLQ stream was deleted manually (nats stream rm orders__microservice_dlq-stream)
- NATS server is out of disk space or has hit max_bytes
- NATS connection dropped between the original publish and the DLQ republish
Fix:
- Check that the DLQ stream exists: nats stream ls | grep dlq-stream
- If it was deleted, restart the pod; the transport's ensureDlqStream() recreates it on startup when dlq is configured
- Check NATS server disk usage: nats server report accounts
- Ensure the dlq.stream config (if overridden) is compatible with the server's resource limits
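The fallback order described in this section (DLQ stream publish, then the onDeadLetter callback, then nak) can be sketched as a small function. This is only an illustration of the ordering. DlqSteps, runDlqFallback, and the step names are assumptions, and the transport's real implementation is not shown here.

```typescript
type DlqOutcome = 'published' | 'callback' | 'nak';

// Illustrative stand-ins for the three steps; these names are assumptions,
// not the transport's internal API.
interface DlqSteps {
  publishToDlqStream: () => Promise<void>;
  onDeadLetter: () => Promise<void>;
  nak: () => void;
}

// Tries each fallback in order and reports which one handled the message.
async function runDlqFallback(steps: DlqSteps): Promise<DlqOutcome> {
  try {
    await steps.publishToDlqStream();
    return 'published';
  } catch {
    // DLQ stream publish failed: fall through to the callback.
  }
  try {
    await steps.onDeadLetter();
    return 'callback';
  } catch {
    // Callback failed too: nak so the message is redelivered later.
  }
  steps.nak();
  return 'nak';
}
```

Reading the chain this way makes it clear why a message can keep being redelivered after max_deliver: both the DLQ publish and the callback have failed, and nak is the only step left.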
Handler metadata registry
Entries missing from the KV bucket
The transport only publishes handler metadata when the handler has a meta field in its decorator extras. Handlers without meta are intentionally skipped — see Handler Metadata for the quick-start example.
Checklist:
- Does the handler have meta: { ... } in @EventPattern / @MessagePattern?
- Is the NATS server version >= 2.10 (KV support)?
- Did startup succeed? Check logs for MetadataRegistry errors.
- Inspect the bucket: nats kv ls handler_registry
Bucket config mismatch error on startup
NATS KV buckets have immutable config for some fields (replicas, ttl). If you change these in forRoot() after the bucket already exists, startup fails.
Fix: Delete your configured metadata bucket — the default name is handler_registry, but if you overrode metadata.bucket in forRoot(), substitute your own. Entries are re-published on the next startup, so the delete is safe.
# Replace `handler_registry` with your metadata.bucket value if you overrode it
nats kv rm handler_registry
# Restart the service → fresh bucket with new config
Stream migration
Consumer self-healing waits on "migration in progress"
If a previous migration was interrupted (process killed mid-phase, NATS crash), an orphaned {stream}__migration_backup stream exists. During consumer self-healing (after a live consumer's iterator breaks), the transport detects the backup stream and refuses to recreate the consumer until the backup is gone — the self-healing loop waits with exponential backoff. This check runs only in the recovery path, not during initial application startup.
Diagnosis:
nats stream ls | grep migration_backup
Fix: None needed in most cases — self-healing will recover automatically once the migrating pod finishes and cleans up the backup. If the backup is orphaned permanently (the migrating pod died and nobody retried), manually inspect it and either retain it (if it contains messages you need) or delete it with nats stream rm <stream>__migration_backup so self-healing can resume.
See Stream Migration — Error handling for the full recovery flow.
Publisher errors during rolling update
During the brief window between Phase 2 (delete) and Phase 3 (create) of a stream migration, publishers may see "stream not found" errors. The window is effectively one NATS round-trip, but is not zero. Mitigations:
- For client.emit() (fire-and-forget), accept the loss or implement caller-side retry.
- For client.send() (RPC), the caller receives an error and can retry.
- For zero-loss migrations, schedule migration during a maintenance window with publishers paused.
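Caller-side retry for the migration window can be as simple as a few attempts with exponential backoff. The sketch below is a generic helper, not a library feature. retryWithBackoff, its attempt count, and its delays are assumptions to adjust for your own latency budget.

```typescript
// Hypothetical caller-side retry with exponential backoff. `operation` stands
// in for any publish call that may fail while the stream is being recreated.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 50,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        // With the defaults: 50 ms, 100 ms, 200 ms, ... between attempts.
        const delayMs = baseDelayMs * 2 ** attempt;
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

Because client.send() and client.emit() return Observables in NestJS, the operation would typically wrap them with firstValueFrom from rxjs, for example retryWithBackoff(() => firstValueFrom(client.send(pattern, payload))), where pattern and payload are your own values.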
Startup issues
listen() hangs on startup
If the application doesn't finish starting:
- Ordered consumer can't connect — The stream might not exist yet. Check that the ordered stream is created before the consumer tries to connect.
- NATS connection timeout — The initial connection attempt blocks startup.
Stream already exists with different config
NATS returns an error when you try to update a stream with incompatible changes (e.g., changing retention policy on an existing stream).
Fix: Delete the stream and let the transport recreate it:
nats stream rm <stream-name>
Deleting a stream destroys all messages in it. Only do this in development or when data loss is acceptable.
Typed error handling with NatsErrorCode
When your own code needs to react to a NATS JetStream API error, use the NatsErrorCode enum instead of matching on error messages. It covers the three error conditions the transport itself observes most often:
import { NatsErrorCode } from '@horizon-republic/nestjs-jetstream';
try {
await jsm.streams.info('orders__microservice_ev-stream');
} catch (err) {
const code = (err as { code?: number }).code;
if (code === NatsErrorCode.StreamNotFound) {
// 10059 — stream does not exist yet, safe to create
} else if (code === NatsErrorCode.ConsumerNotFound) {
// 10014 — consumer was deleted externally
} else if (code === NatsErrorCode.ConsumerAlreadyExists) {
// 10148 — race on consumer create, fetch the existing one instead
} else {
throw err;
}
}
The library itself uses these constants in its self-healing flows (src/server/infrastructure/consumer.provider.ts), so consumer code that wraps library calls can reuse the same vocabulary.
See also
- Lifecycle Hooks — register hooks for observability
- Health Checks — monitor connection status
- Default Configs — all default values
- Edge Cases — less obvious behaviors