From 7fdad1300dacd65b738d7df3bb97a85228ad2a5f Mon Sep 17 00:00:00 2001
From: Leonard O'Sullivan
Date: Tue, 3 Mar 2026 08:13:38 +1000
Subject: [PATCH 1/2] feat: make root EBS volume encryption configurable

Add `encrypt_root_volume` variable (default: true) so users who prioritize
boot speed over root volume encryption can opt out. Encrypted gp3 volumes
add ~6s to EC2 pending state vs unencrypted. The NAT instance is a stateless
packet forwarder with no sensitive data on the root volume.

- Wire variable into launch template and CONFIG_VERSION hash
- Test fixture defaults to false to benchmark unencrypted boot
- Timing summary now logs encryption state for comparison across runs
- Add performance docs section and "Faster Cold Start" example

Co-Authored-By: Claude Opus 4.6
---
 docs/examples.md                   | 16 ++++++++++++++++
 docs/performance.md                | 16 ++++++++++++++++
 docs/reference.md                  |  3 ++-
 lambda.tf                          |  1 +
 launch_template.tf                 |  2 +-
 tests/integration/fixture/main.tf  | 14 ++++++++++++--
 tests/integration/nat_zero_test.go |  8 ++++++++
 variables.tf                       |  6 ++++++
 8 files changed, 62 insertions(+), 4 deletions(-)

diff --git a/docs/examples.md b/docs/examples.md
index d182de4..833e632 100644
--- a/docs/examples.md
+++ b/docs/examples.md
@@ -118,6 +118,22 @@ module "nat_zero" {
 }
 ```
 
+## Faster Cold Start
+
+Disable root volume encryption to shave ~5-10 seconds off NAT cold-start time. The NAT instance is a stateless packet forwarder — no sensitive data is stored on the root volume.
+
+```hcl
+module "nat_zero" {
+  source = "github.com/MachineDotDev/nat-zero"
+
+  # ... required variables ...
+
+  encrypt_root_volume = false
+}
+```
+
+See [Performance](performance.md#root-volume-encryption) for benchmarks.
+
 ## Building Lambda Locally
 
 For development or if you want to build from source:
diff --git a/docs/performance.md b/docs/performance.md
index f7a5816..c5d4264 100644
--- a/docs/performance.md
+++ b/docs/performance.md
@@ -44,6 +44,22 @@ The ~8 second gap is EC2 instance lifecycle (placement, OS boot, iptables config
 
 Restart is ~2 seconds faster — `StartInstances` skips AMI resolution and launch template processing.
 
+## Root Volume Encryption
+
+The module encrypts the root EBS volume by default (`encrypt_root_volume = true`). Encryption adds to EC2 pending-state time because the volume must be initialized with an encrypted key before the instance can boot.
+
+| Volume encryption | Approximate pending time (gp3) |
+|-------------------|-------------------------------|
+| Encrypted (default) | ~11 s |
+| Unencrypted | ~5 s |
+
+For faster cold starts in non-compliance environments, set `encrypt_root_volume = false`. This has no impact on:
+
+- **Restart from stopped** — the volume already exists, so no encryption overhead.
+- **Steady-state throughput** — Nitro instances use dedicated hardware for AES-256 encryption at line rate.
+
+The NAT instance is a stateless packet forwarder with no sensitive data on the root volume, so disabling encryption does not expose user traffic.
+
 ## Lambda Execution
 
 | Metric | Value |
diff --git a/docs/reference.md b/docs/reference.md
index 914612b..b705b81 100644
--- a/docs/reference.md
+++ b/docs/reference.md
@@ -56,6 +56,7 @@ No modules.
 | [custom\_ami\_name\_pattern](#input\_custom\_ami\_name\_pattern) | AMI name pattern when use\_fck\_nat\_ami is false | `string` | `null` | no |
 | [custom\_ami\_owner](#input\_custom\_ami\_owner) | AMI owner account ID when use\_fck\_nat\_ami is false | `string` | `null` | no |
 | [enable\_logging](#input\_enable\_logging) | Create a CloudWatch log group for the Lambda function | `bool` | `true` | no |
+| [encrypt\_root\_volume](#input\_encrypt\_root\_volume) | Encrypt the root EBS volume. Disabling may reduce cold-start time by ~5-10 seconds on gp3. | `bool` | `true` | no |
 | [ignore\_tag\_key](#input\_ignore\_tag\_key) | Tag key used to mark instances the Lambda should ignore | `string` | `"nat-zero:ignore"` | no |
 | [ignore\_tag\_value](#input\_ignore\_tag\_value) | Tag value used to mark instances the Lambda should ignore | `string` | `"true"` | no |
 | [instance\_type](#input\_instance\_type) | Instance type for the NAT instance | `string` | `"t4g.nano"` | no |
@@ -68,7 +69,7 @@ No modules.
 | [nat\_tag\_value](#input\_nat\_tag\_value) | Tag value used to identify NAT instances | `string` | `"true"` | no |
 | [private\_route\_table\_ids](#input\_private\_route\_table\_ids) | Route table IDs for the private subnets (one per AZ) | `list(string)` | n/a | yes |
 | [private\_subnets](#input\_private\_subnets) | Private subnet IDs (one per AZ) for NAT instance private ENIs | `list(string)` | n/a | yes |
-| [private\_subnets\_cidr\_blocks](#input\_private\_subnets\_cidr\_blocks) | CIDR blocks for the private subnets (one per AZ, used in security group rules) | `list(string)` | n/a | yes |
+| [private\_subnets\_cidr_blocks](#input\_private\_subnets\_cidr\_blocks) | CIDR blocks for the private subnets (one per AZ, used in security group rules) | `list(string)` | n/a | yes |
 | [public\_subnets](#input\_public\_subnets) | Public subnet IDs (one per AZ) for NAT instance public ENIs | `list(string)` | n/a | yes |
 | [tags](#input\_tags) | Additional tags to apply to all resources | `map(string)` | `{}` | no |
 | [use\_fck\_nat\_ami](#input\_use\_fck\_nat\_ami) | Use the public fck-nat AMI. Set to false to use a custom AMI. | `bool` | `true` | no |
diff --git a/lambda.tf b/lambda.tf
index 3f219c1..049ec2e 100644
--- a/lambda.tf
+++ b/lambda.tf
@@ -80,6 +80,7 @@ resource "aws_lambda_function" "nat_zero" {
         var.instance_type,
         var.market_type,
         tostring(var.block_device_size),
+        tostring(var.encrypt_root_volume),
       ]))
     }
   }
diff --git a/launch_template.tf b/launch_template.tf
index cbf06ac..2388791 100644
--- a/launch_template.tf
+++ b/launch_template.tf
@@ -25,7 +25,7 @@ resource "aws_launch_template" "nat_launch_template" {
       volume_type = "gp3"
       iops        = 3000
       throughput  = 250
-      encrypted   = true
+      encrypted   = var.encrypt_root_volume
     }
   }
 
diff --git a/tests/integration/fixture/main.tf b/tests/integration/fixture/main.tf
index 0e47126..ee41255 100644
--- a/tests/integration/fixture/main.tf
+++ b/tests/integration/fixture/main.tf
@@ -66,6 +66,11 @@ variable "nat_instance_type" {
   default = "t4g.nano"
 }
 
+variable "encrypt_root_volume" {
+  type    = bool
+  default = false
+}
+
 module "nat_zero" {
   source = "../../../"
 
@@ -78,8 +83,9 @@ module "nat_zero" {
   private_route_table_ids     = [aws_route_table.private.id]
   private_subnets_cidr_blocks = [aws_subnet.private.cidr_block]
 
-  instance_type = var.nat_instance_type
-  market_type   = "on-demand"
+  instance_type       = var.nat_instance_type
+  market_type         = "on-demand"
+  encrypt_root_volume = var.encrypt_root_volume
 }
 
 output "vpc_id" {
@@ -97,3 +103,7 @@ output "lambda_function_name" {
 output "nat_security_group_ids" {
   value = module.nat_zero.nat_security_group_ids
 }
+
+output "encrypt_root_volume" {
+  value = var.encrypt_root_volume
+}
diff --git a/tests/integration/nat_zero_test.go b/tests/integration/nat_zero_test.go
index 842578f..dc4af3a 100644
--- a/tests/integration/nat_zero_test.go
+++ b/tests/integration/nat_zero_test.go
@@ -71,9 +71,16 @@ func TestNatZero(t *testing.T) {
 		phases = append(phases, phase{name, d})
 		t.Logf("[TIMER] %-45s %s", name, d.Round(time.Millisecond))
 	}
+	var encryptRootVolume string
 	defer func() {
 		t.Log("")
 		t.Log("=== TIMING SUMMARY ===")
+		encryptLabel := "enabled"
+		if encryptRootVolume == "false" {
+			encryptLabel = "disabled"
+		}
+		t.Logf(" EBS Encryption: %s", encryptLabel)
+		t.Log("")
 		t.Logf(" %-45s %s", "PHASE", "DURATION")
 		t.Log(" " + strings.Repeat("-", 60))
 		var total time.Duration
@@ -121,6 +128,7 @@ func TestNatZero(t *testing.T) {
 	vpcID := terraform.Output(t, opts, "vpc_id")
 	privateSubnet := terraform.Output(t, opts, "private_subnet_id")
 	lambdaName := terraform.Output(t, opts, "lambda_function_name")
+	encryptRootVolume = terraform.Output(t, opts, "encrypt_root_volume")
 	t.Logf("VPC: %s, private subnet: %s, Lambda: %s", vpcID, privateSubnet, lambdaName)
 
 	// Terminate test workload instances before terraform destroy.
diff --git a/variables.tf b/variables.tf
index 1e09963..dd31a55 100644
--- a/variables.tf
+++ b/variables.tf
@@ -62,6 +62,12 @@ variable "block_device_size" {
   description = "Size in GB of the root EBS volume"
 }
 
+variable "encrypt_root_volume" {
+  type        = bool
+  default     = true
+  description = "Encrypt the root EBS volume. Disabling may reduce cold-start time by ~5-10 seconds on gp3."
+}
+
 # AMI configuration
 variable "use_fck_nat_ami" {
   type        = bool

From ea7b0999853a0673d4d4c209b52a3d47af0e20c7 Mon Sep 17 00:00:00 2001
From: Leonard O'Sullivan
Date: Tue, 3 Mar 2026 09:47:40 +1000
Subject: [PATCH 2/2] fix: correct docs factual errors and remove unverified performance claims

- Remove false DLQ claim from README (no SQS DLQ exists; Lambda retries
  via EventBridge with maximum_retry_attempts=2)
- Fix testing.md: phases 1-3 are EventBridge-driven, not direct Lambda
  invocations
- Remove unverified "Root Volume Encryption" performance section and
  "Faster Cold Start" example (benchmark showed no measurable difference)
- Update ConfigVersion hash descriptions to include encryption setting
- Strip performance claim from encrypt_root_volume variable description
- Reset test fixture default back to encrypted (true)
- Regenerate terraform-docs in README and reference.md

Co-Authored-By: Claude Opus 4.6
---
 README.md                         |  3 ++-
 docs/architecture.md              |  2 +-
 docs/examples.md                  |  6 ++----
 docs/performance.md               | 16 ----------------
 docs/reference.md                 |  4 ++--
 docs/testing.md                   | 10 +++++-----
 tests/integration/fixture/main.tf |  2 +-
 variables.tf                      |  2 +-
 8 files changed, 14 insertions(+), 31 deletions(-)

diff --git a/README.md b/README.md
index 9e9754b..997dd83 100644
--- a/README.md
+++ b/README.md
@@ -99,7 +99,7 @@ See [Performance](docs/performance.md) for detailed timings and cost breakdowns.
 - **EventBridge scope**: Captures all EC2 state changes in the account; Lambda filters by VPC ID.
 - **Startup delay**: First workload in an idle AZ waits ~10 seconds for internet. Design scripts to retry outbound connections.
 - **Dual ENI**: Persistent public + private ENIs survive stop/start cycles.
-- **DLQ**: Failed Lambda invocations go to an SQS dead letter queue.
+- **Retries**: Failed Lambda invocations are retried up to 2 times by EventBridge.
 - **Clean destroy**: A cleanup action terminates NAT instances before `terraform destroy` removes ENIs.
 - **Config versioning**: Changing AMI or instance type auto-replaces NAT instances on next workload event.
 - **EC2 events only**: Currently nat-zero responds only to EC2 instance state changes. If you have a use case for other event sources (ECS tasks, Lambda, etc.), PRs are welcome.
@@ -163,6 +163,7 @@ No modules.
 | [custom\_ami\_name\_pattern](#input\_custom\_ami\_name\_pattern) | AMI name pattern when use\_fck\_nat\_ami is false | `string` | `null` | no |
 | [custom\_ami\_owner](#input\_custom\_ami\_owner) | AMI owner account ID when use\_fck\_nat\_ami is false | `string` | `null` | no |
 | [enable\_logging](#input\_enable\_logging) | Create a CloudWatch log group for the Lambda function | `bool` | `true` | no |
+| [encrypt\_root\_volume](#input\_encrypt\_root\_volume) | Encrypt the root EBS volume. | `bool` | `true` | no |
 | [ignore\_tag\_key](#input\_ignore\_tag\_key) | Tag key used to mark instances the Lambda should ignore | `string` | `"nat-zero:ignore"` | no |
 | [ignore\_tag\_value](#input\_ignore\_tag\_value) | Tag value used to mark instances the Lambda should ignore | `string` | `"true"` | no |
 | [instance\_type](#input\_instance\_type) | Instance type for the NAT instance | `string` | `"t4g.nano"` | no |
diff --git a/docs/architecture.md b/docs/architecture.md
index 85a3da9..c3d58e2 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -156,7 +156,7 @@ Each NAT instance uses two ENIs to separate public and private traffic:
 
 ## Config Versioning
 
-The Lambda tags each NAT instance with a `ConfigVersion` hash derived from AMI, instance type, market type, and volume size.
+The Lambda tags each NAT instance with a `ConfigVersion` hash derived from AMI, instance type, market type, volume size, and encryption setting.
 
 When the reconciler detects an outdated NAT, replacement takes two events (following the "one action per invocation" pattern):
 
diff --git a/docs/examples.md b/docs/examples.md
index 833e632..6ac5448 100644
--- a/docs/examples.md
+++ b/docs/examples.md
@@ -118,9 +118,9 @@ module "nat_zero" {
 }
 ```
 
-## Faster Cold Start
+## Disable Root Volume Encryption
 
-Disable root volume encryption to shave ~5-10 seconds off NAT cold-start time. The NAT instance is a stateless packet forwarder — no sensitive data is stored on the root volume.
+The root EBS volume is encrypted by default. To disable encryption (e.g., for environments without compliance requirements):
 
 ```hcl
 module "nat_zero" {
@@ -132,8 +132,6 @@ module "nat_zero" {
 }
 ```
 
-See [Performance](performance.md#root-volume-encryption) for benchmarks.
-
 ## Building Lambda Locally
 
 For development or if you want to build from source:
diff --git a/docs/performance.md b/docs/performance.md
index c5d4264..f7a5816 100644
--- a/docs/performance.md
+++ b/docs/performance.md
@@ -44,22 +44,6 @@ The ~8 second gap is EC2 instance lifecycle (placement, OS boot, iptables config
 
 Restart is ~2 seconds faster — `StartInstances` skips AMI resolution and launch template processing.
 
-## Root Volume Encryption
-
-The module encrypts the root EBS volume by default (`encrypt_root_volume = true`). Encryption adds to EC2 pending-state time because the volume must be initialized with an encrypted key before the instance can boot.
-
-| Volume encryption | Approximate pending time (gp3) |
-|-------------------|-------------------------------|
-| Encrypted (default) | ~11 s |
-| Unencrypted | ~5 s |
-
-For faster cold starts in non-compliance environments, set `encrypt_root_volume = false`. This has no impact on:
-
-- **Restart from stopped** — the volume already exists, so no encryption overhead.
-- **Steady-state throughput** — Nitro instances use dedicated hardware for AES-256 encryption at line rate.
-
-The NAT instance is a stateless packet forwarder with no sensitive data on the root volume, so disabling encryption does not expose user traffic.
-
 ## Lambda Execution
 
 | Metric | Value |
diff --git a/docs/reference.md b/docs/reference.md
index b705b81..d286474 100644
--- a/docs/reference.md
+++ b/docs/reference.md
@@ -56,7 +56,7 @@ No modules.
 | [custom\_ami\_name\_pattern](#input\_custom\_ami\_name\_pattern) | AMI name pattern when use\_fck\_nat\_ami is false | `string` | `null` | no |
 | [custom\_ami\_owner](#input\_custom\_ami\_owner) | AMI owner account ID when use\_fck\_nat\_ami is false | `string` | `null` | no |
 | [enable\_logging](#input\_enable\_logging) | Create a CloudWatch log group for the Lambda function | `bool` | `true` | no |
-| [encrypt\_root\_volume](#input\_encrypt\_root\_volume) | Encrypt the root EBS volume. Disabling may reduce cold-start time by ~5-10 seconds on gp3. | `bool` | `true` | no |
+| [encrypt\_root\_volume](#input\_encrypt\_root\_volume) | Encrypt the root EBS volume. | `bool` | `true` | no |
 | [ignore\_tag\_key](#input\_ignore\_tag\_key) | Tag key used to mark instances the Lambda should ignore | `string` | `"nat-zero:ignore"` | no |
 | [ignore\_tag\_value](#input\_ignore\_tag\_value) | Tag value used to mark instances the Lambda should ignore | `string` | `"true"` | no |
 | [instance\_type](#input\_instance\_type) | Instance type for the NAT instance | `string` | `"t4g.nano"` | no |
@@ -69,7 +69,7 @@ No modules.
 | [nat\_tag\_value](#input\_nat\_tag\_value) | Tag value used to identify NAT instances | `string` | `"true"` | no |
 | [private\_route\_table\_ids](#input\_private\_route\_table\_ids) | Route table IDs for the private subnets (one per AZ) | `list(string)` | n/a | yes |
 | [private\_subnets](#input\_private\_subnets) | Private subnet IDs (one per AZ) for NAT instance private ENIs | `list(string)` | n/a | yes |
-| [private\_subnets\_cidr_blocks](#input\_private\_subnets\_cidr\_blocks) | CIDR blocks for the private subnets (one per AZ, used in security group rules) | `list(string)` | n/a | yes |
+| [private\_subnets\_cidr\_blocks](#input\_private\_subnets\_cidr\_blocks) | CIDR blocks for the private subnets (one per AZ, used in security group rules) | `list(string)` | n/a | yes |
 | [public\_subnets](#input\_public\_subnets) | Public subnet IDs (one per AZ) for NAT instance public ENIs | `list(string)` | n/a | yes |
 | [tags](#input\_tags) | Additional tags to apply to all resources | `map(string)` | `{}` | no |
 | [use\_fck\_nat\_ami](#input\_use\_fck\_nat\_ami) | Use the public fck-nat AMI. Set to false to use a custom AMI. | `bool` | `true` | no |
diff --git a/docs/testing.md b/docs/testing.md
index 8ce0bec..0dce6a3 100644
--- a/docs/testing.md
+++ b/docs/testing.md
@@ -22,22 +22,22 @@ The test uses [Terratest](https://terratest.gruntwork.io/) with a single `terraf
 
 1. Deploy fixture (private subnet + nat-zero module in default VPC)
 2. Launch workload instance in private subnet
-3. Invoke Lambda → creates NAT instance
+3. EventBridge fires workload state change → Lambda creates NAT instance
 4. Wait for NAT running with EIP attached
 5. Verify workload's egress IP matches NAT's EIP
 
 ### Phase 2: Scale-Down
 
 1. Terminate workload
-2. Invoke Lambda → stops NAT
+2. EventBridge fires workload terminated → Lambda stops NAT
 3. Wait for NAT stopped
-4. Invoke Lambda → releases EIP
+4. EventBridge fires NAT stopped → Lambda releases EIP
 5. Verify no EIPs remain
 
 ### Phase 3: Restart
 
 1. Launch new workload
-2. Invoke Lambda → restarts stopped NAT
+2. EventBridge fires workload state change → Lambda restarts stopped NAT
 3. Wait for NAT running with new EIP
 4. Verify connectivity
 
@@ -64,4 +64,4 @@ Integration tests run in GitHub Actions when the `integration-test` label is add
 
 ## Config Version Replacement
 
-The Lambda tags NAT instances with a `ConfigVersion` hash (AMI + instance type + market type + volume size). When the config changes and a workload triggers reconciliation, the Lambda terminates the outdated NAT and creates a replacement. The integration test doesn't exercise this path directly, but it's covered by unit tests.
+The Lambda tags NAT instances with a `ConfigVersion` hash (AMI + instance type + market type + volume size + encryption). When the config changes and a workload triggers reconciliation, the Lambda terminates the outdated NAT and creates a replacement. The integration test doesn't exercise this path directly, but it's covered by unit tests.
diff --git a/tests/integration/fixture/main.tf b/tests/integration/fixture/main.tf
index ee41255..0a608b4 100644
--- a/tests/integration/fixture/main.tf
+++ b/tests/integration/fixture/main.tf
@@ -68,7 +68,7 @@ variable "nat_instance_type" {
 
 variable "encrypt_root_volume" {
   type    = bool
-  default = false
+  default = true
 }
 
 module "nat_zero" {
diff --git a/variables.tf b/variables.tf
index dd31a55..47c046a 100644
--- a/variables.tf
+++ b/variables.tf
@@ -65,7 +65,7 @@ variable "block_device_size" {
 variable "encrypt_root_volume" {
   type        = bool
   default     = true
-  description = "Encrypt the root EBS volume. Disabling may reduce cold-start time by ~5-10 seconds on gp3."
+  description = "Encrypt the root EBS volume."
 }
 
 # AMI configuration