Architecture
Overview
nat-zero uses a reconciliation pattern to manage NAT instance lifecycles. A single Lambda function (concurrency=1) observes the current state of an AZ and takes one action to converge toward desired state, then returns. The next event picks up where this one left off.
Pattern Origins
The reconciliation loop pattern has deep roots:
- Control theory (1788+): Feedback loops comparing actual state to desired state, taking corrective action
- CFEngine (1993): Mark Burgess introduced "convergence" to configuration management
- Google Borg/Omega (2005+): Internal cluster managers used reconciliation controllers
- Kubernetes (2014+): Popularized the pattern as "level-triggered" vs "edge-triggered" logic
The key insight: state is more useful than events. Rather than tracking event sequences, we observe current state and compute the delta. This makes the system robust to missed events, crashes, and restarts.
See: Borg, Omega, and Kubernetes (ACM Queue), Tim Hockin - Edge vs Level Triggered Logic
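The level-triggered idea can be sketched as a minimal loop. This is a hedged illustration with hypothetical types, not nat-zero's actual implementation:

```go
package main

import "fmt"

// State captures everything the reconciler observes in one AZ.
type State struct {
	Workloads  int  // pending/running non-NAT instances
	NATRunning bool // is a NAT instance currently running?
}

// Action is the single corrective step chosen per invocation.
type Action string

// reconcile is level-triggered: it looks only at current state,
// never at which event fired, and returns exactly one action.
func reconcile(s State) Action {
	switch {
	case s.Workloads > 0 && !s.NATRunning:
		return "start-nat"
	case s.Workloads == 0 && s.NATRunning:
		return "stop-nat"
	default:
		return "converged"
	}
}

func main() {
	fmt.Println(reconcile(State{Workloads: 2})) // start-nat
	fmt.Println(reconcile(State{NATRunning: true})) // stop-nat
}
```

Because the function only compares observed state to desired state, a missed event costs nothing: the next event re-runs the same comparison.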
EventBridge (EC2 state changes)
│
▼
┌─────────────────────────┐
│ Lambda (concurrency=1) │
│ │
│ 1. Resolve AZ │
│ 2. Observe state │
│ 3. Take one action │
│ 4. Return │
└─────────────────────────┘
│
┌────┴────┐
▼ ▼
EC2 API EIP API
(NATs) (allocate/release)
Reconciliation Loop
Every invocation runs the same loop regardless of which event triggered it:
reconcile(az):
    workloads = pending/running non-NAT instances in AZ
    nats      = non-terminated NAT instances in AZ
    eips      = EIPs tagged for this AZ
    needNAT   = len(workloads) > 0
    # One action per invocation, then return
Decision Matrix
| Workloads? | NAT State | EIP State | Action |
|---|---|---|---|
| Yes | None / shutting-down | — | Create NAT |
| Yes | Stopped | — | Start NAT |
| Yes | Stopping | — | Wait (no-op) |
| Yes | Outdated config | — | Terminate NAT (recreate on next event) |
| Yes | Running | No EIP | Allocate + attach EIP |
| Yes | Running | Has EIP | Converged |
| No | Running / pending | — | Stop NAT |
| No | Stopped | Has EIP | Release EIP |
| No | Stopped | No EIP | Converged |
| No | Stopping | — | Wait (no-op) |
| — | Multiple NATs | — | Terminate duplicates |
| — | — | Multiple EIPs | Release extras |
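The matrix might translate to code roughly as follows. Field and action names here are illustrative, not nat-zero's actual identifiers:

```go
package main

import "fmt"

// AZState is the observed state for one AZ (illustrative field names).
type AZState struct {
	Workloads int    // pending/running non-NAT instances
	NATCount  int    // non-terminated NAT instances
	NATState  string // "", "pending", "running", "stopping", "stopped", "shutting-down"
	Outdated  bool   // NAT's ConfigVersion tag differs from current config
	EIPs      int    // EIPs tagged for this AZ
}

// decide picks the single action for this invocation, row by row
// from the decision matrix above.
func decide(s AZState) string {
	if s.NATCount > 1 {
		return "terminate-duplicates"
	}
	if s.EIPs > 1 {
		return "release-extras"
	}
	if s.Workloads > 0 {
		switch {
		case s.NATState == "" || s.NATState == "shutting-down":
			return "create-nat"
		case s.NATState == "stopped":
			return "start-nat"
		case s.NATState == "stopping":
			return "wait"
		case s.Outdated:
			return "terminate-nat" // recreated on the next event
		case s.NATState == "running" && s.EIPs == 0:
			return "allocate-attach-eip"
		default:
			return "converged"
		}
	}
	// No workloads: wind the NAT and EIP down.
	switch {
	case s.NATState == "running" || s.NATState == "pending":
		return "stop-nat"
	case s.NATState == "stopping":
		return "wait"
	case s.NATState == "stopped" && s.EIPs == 1:
		return "release-eip"
	default:
		return "converged"
	}
}

func main() {
	fmt.Println(decide(AZState{Workloads: 1}))                                   // create-nat
	fmt.Println(decide(AZState{Workloads: 1, NATCount: 1, NATState: "running"})) // allocate-attach-eip
}
```

Each call returns one action; the caller performs it and returns, leaving further convergence to the next invocation.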
Why Single Writer
Setting reserved_concurrent_executions = 1 ensures only one Lambda invocation runs at a time. Events that arrive during execution are queued and processed sequentially. This eliminates:
- Duplicate NAT creation
- Double EIP allocation
- Start/stop race conditions
- Need for distributed locking
Event Agnosticism
The reconciler does not care what type of instance triggered the event. It observes all workloads and NATs in the AZ, computes desired state, and acts. The event is just a signal that "something changed."
- Workload pending → reconcile → creates NAT if needed
- NAT running → reconcile → attaches EIP if needed
- Workload terminated → reconcile → stops NAT if no workloads
- NAT stopped → reconcile → releases EIP if present
- Instance gone from API → sweep all configured AZs
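A handler embodying this agnosticism only needs the AZ from the event, falling back to a sweep of all configured AZs when the instance can no longer be resolved. A sketch with hypothetical types:

```go
package main

import "fmt"

// Event is a trimmed EC2 state-change payload (illustrative fields).
type Event struct {
	InstanceID string
	State      string
	AZ         string // empty when the instance is already gone from the API
}

var configuredAZs = []string{"us-east-1a", "us-east-1b"}

// handle ignores what kind of instance changed; the event is only a
// signal telling us which AZ(s) to reconcile.
func handle(e Event, reconcile func(az string)) {
	if e.AZ == "" {
		// Instance no longer resolvable: sweep every configured AZ.
		for _, az := range configuredAZs {
			reconcile(az)
		}
		return
	}
	reconcile(e.AZ)
}

func main() {
	handle(Event{InstanceID: "i-abc", State: "terminated", AZ: "us-east-1a"},
		func(az string) { fmt.Println("reconciling", az) })
}
```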
Event Flows
Scale-up
Workload launches (pending)
→ reconcile: workloads=1, NAT=nil → createNAT
NAT reaches running
→ reconcile: workloads=1, NAT=running, EIP=0 → allocateAndAttachEIP
Next event
→ reconcile: workloads=1, NAT=running, EIP=1 → converged ✓
Scale-down
Last workload terminates
→ reconcile: workloads=0, NAT=running → stopNAT
NAT reaches stopped
→ reconcile: workloads=0, NAT=stopped, EIP=1 → releaseEIP
Next event
→ reconcile: workloads=0, NAT=stopped, EIP=0 → converged ✓
Restart
New workload launches, NAT is stopped
→ reconcile: workloads=1, NAT=stopped → startNAT
NAT reaches running
→ reconcile: workloads=1, NAT=running, EIP=0 → allocateAndAttachEIP
→ converged ✓
Terraform Destroy
Terraform invokes Lambda with {action: "cleanup"}
→ terminate all NAT instances
→ wait for full termination (ENI detachment)
→ release all EIPs
→ return (Terraform proceeds to delete ENIs/SGs)
Dual ENI Architecture
Each NAT instance uses two ENIs to separate public and private traffic:
Private Subnet NAT Instance Public Subnet
┌──────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Route Table │ │ │ │ │
│ 0.0.0.0/0 ───┼──→│ Private ENI │ │ Public ENI │
│ │ │ (ens6) │ │ (ens5) + EIP │──→ IGW
│ │ │ ↓ iptables ──┼──→│ │
│ │ │ MASQUERADE │ │ src_dst_check=off│
└──────────────┘ └──────────────────┘ └──────────────────┘
- Pre-created by Terraform: ENIs persist across stop/start cycles, keeping route tables intact
- source_dest_check=false: Required on both ENIs for NAT forwarding
- EIP lifecycle: Allocated on NAT running, released on NAT stopped — no charge when idle
Config Versioning
The Lambda tags each NAT instance with a ConfigVersion hash derived from AMI, instance type, market type, and volume size.
When the reconciler detects an outdated NAT, replacement takes two events (following the "one action per invocation" pattern):
- Event 1: Outdated config detected → terminate NAT → return
- Event 2: NAT is now shutting-down/terminated → create new NAT with current config
This avoids racing with ENI detachment and keeps error handling simple.
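The ConfigVersion hash might be derived like this. The field set matches the description above, but the exact encoding is an assumption:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// configVersion derives a short, stable hash from the inputs that define
// a NAT instance's configuration. Any change to AMI, instance type,
// market type, or volume size produces a different tag value.
func configVersion(ami, instanceType, marketType string, volumeGiB int) string {
	input := strings.Join([]string{ami, instanceType, marketType, fmt.Sprint(volumeGiB)}, "|")
	sum := sha256.Sum256([]byte(input))
	return hex.EncodeToString(sum[:])[:12] // a short prefix is enough for a tag value
}

func main() {
	fmt.Println(configVersion("ami-0abc", "t4g.nano", "spot", 8))
}
```

Comparing the tag on a running NAT to the hash of the current config is a single string equality check, which is all the reconciler needs to flag an outdated instance.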
Reliability
EC2 API Eventual Consistency
The EC2 API is eventually consistent. When EventBridge fires a state change event (e.g., running), the EC2 DescribeInstances API may still return the previous state (e.g., pending) for several seconds.
nat-zero handles this by trusting the event state for the trigger instance:
// Trust event state over EC2 API (eventual consistency)
if triggerInst != nil {
triggerInst.StateName = event.State
}
This also applies to NAT instances that may not appear in filter-based queries immediately after creation (tag propagation delay). The reconciler adds the trigger instance to the NAT list if it's missing.
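Both fixes can live in one helper: override the trigger instance's state if it is present, and append it if tag propagation has hidden it from the query. A sketch with hypothetical types:

```go
package main

import "fmt"

// Instance is a trimmed view of a DescribeInstances result (illustrative).
type Instance struct {
	ID        string
	StateName string
}

// applyEventState trusts the event's state for the trigger instance over
// the possibly stale EC2 API response, and ensures a freshly created NAT
// appears in the list even before its tags propagate to filtered queries.
func applyEventState(nats []Instance, triggerID, eventState string, triggerIsNAT bool) []Instance {
	for i := range nats {
		if nats[i].ID == triggerID {
			nats[i].StateName = eventState // event wins over stale API state
			return nats
		}
	}
	if triggerIsNAT {
		nats = append(nats, Instance{ID: triggerID, StateName: eventState})
	}
	return nats
}

func main() {
	nats := []Instance{{ID: "i-nat1", StateName: "pending"}}
	nats = applyEventState(nats, "i-nat1", "running", true)
	fmt.Println(nats[0].StateName) // running
}
```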
EventBridge Propagation Delay
After Terraform creates the EventBridge rule and target, there's a propagation delay before events are reliably delivered. Events fired during this window may be silently dropped.
nat-zero includes a 60-second time_sleep resource after target creation to mitigate this. Workloads launched immediately after terraform apply may still miss their initial events, but subsequent events will trigger reconciliation.
NAT Stop Behavior
NAT instances are stopped with Force=true because they're stateless packet forwarders. There's no graceful shutdown needed — the routing table instantly fails over when the ENI becomes unreachable, and workloads retry their connections.
Lambda Timeout
The Lambda has a 90-second timeout. Typical invocations complete in 400-600ms. The extended timeout accommodates:
- Cleanup operations during terraform destroy (terminate NATs, wait for ENI detachment, release EIPs)
- Slow EC2 API responses under load