Feat/deb based images #274
- Add Dockerfile_postgres: lean PG18 image with documentdb extension
installed from a pre-built .deb package (CNPG-compatible)
- Add Dockerfile_gateway_deb: lean gateway-only image installed from
a pre-built gateway .deb package
- Update build_images.yml:
- Add build-packages job that checks out documentdb repo at a
pinned ref and builds extension + gateway debs
- Replace documentdb/gateway matrix entries to use new Dockerfiles
- Add documentdb_ref workflow input for pinning source version
- Remove old Dockerfiles: Dockerfile_docdb, Dockerfile_docdb_packages,
Dockerfile_gateway
This eliminates the dependency on the documentdb.io/deb APT repo and
the slow build-from-source approach. Images are now built from debs
produced by the documentdb repo packaging scripts, supporting PG18
on debian:trixie-slim.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update build-documentdb and build-gateway jobs to use the new Dockerfile_postgres and Dockerfile_gateway_deb with pre-built debs. Add build-packages job to build extension and gateway debs from the documentdb repo before image builds.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix create-manifest to check build-packages result (not build-and-push) for skip logic, preventing manifest creation for unbuilt images
- Add set -e and deb file validation to gateway context preparation in both build_images.yml and test-build-and-package.yml
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…way pg version
- Add documentdb_ref input to test-build-and-package.yml (defaults to main, can be overridden by callers for pinning)
- Add comments explaining why gateway uses --pg 17 (pure Rust binary, PG version is passthrough, matches upstream convention)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Update operator constants to use new image repos under
ghcr.io/documentdb/documentdb-kubernetes-operator/{documentdb,gateway}
- Enable version-based image resolution (documentDBVersion -> image ref)
for both documentdb extension and gateway images
- Load documentdb+gateway images into kind cluster in E2E tests
- Compute image refs dynamically from build outputs instead of
hardcoding external images in E2E matrix
- Update all test workflows to use consistent image references
- Add unit tests for version-based image resolution
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add image ref computation step to test-integration.yml and test-backup-and-restore.yml so they use locally built documentdb and gateway images instead of stale external defaults
- Pass documentdb-image and gateway-image to setup-test-environment
- Update backup restore cluster specs to use computed image refs
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR updates the operator and CI/CD pipeline to use new deb-based DocumentDB/gateway container images, and wires spec.documentDBVersion / DOCUMENTDB_VERSION into image resolution so deployments can pin component versions consistently.
Changes:
- Add spec.documentDBVersion + DOCUMENTDB_VERSION support for resolving DocumentDB and gateway image tags.
- Introduce new image repository constants and update defaults to the new GHCR locations.
- Rework GitHub Actions image build/test workflows to build deb packages first, then build documentdb/gateway images from those artifacts using new Dockerfiles.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| operator/src/internal/utils/util.go | Enables documentDBVersion + env-var based image selection for gateway and engine images. |
| operator/src/internal/utils/util_test.go | Adds unit tests for documentDBVersion image resolution. |
| operator/src/internal/utils/constants.go | Adds new image repo constants and updates default image references. |
| .github/workflows/test-upgrade-and-rollback.yml | Updates default test image references to new GHCR locations. |
| .github/workflows/test-integration.yml | Passes resolved documentdb/gateway image references into the test setup action. |
| .github/workflows/test-build-and-package.yml | Adds package-build job and switches documentdb/gateway image builds to consume built deb artifacts. |
| .github/workflows/test-backup-and-restore.yml | Updates default test image references and wiring for gateway/documentdb images. |
| .github/workflows/test-E2E.yml | Updates E2E workflows to use separate documentdb/gateway images for ImageVolume mode and combined image for legacy mode. |
| .github/workflows/build_images.yml | Adds deb package build stage and updates build/push logic for documentdb/gateway images using deb-based Dockerfiles. |
| .github/dockerfiles/Dockerfile_postgres | New lean Postgres image that installs DocumentDB extension from a pre-built deb. |
| .github/dockerfiles/Dockerfile_gateway_deb | New lean gateway image that installs the gateway binary from a pre-built deb. |
| .github/dockerfiles/Dockerfile_gateway | Removes legacy gateway build-from-source Dockerfile. |
| .github/dockerfiles/Dockerfile_docdb_packages | Removes the old “official packages” image build Dockerfile. |
| .github/dockerfiles/Dockerfile_docdb | Removes legacy build-from-source DocumentDB Dockerfile. |
| .github/actions/setup-test-environment/action.yml | Updates defaults to new GHCR images and loads additional locally-built images into kind. |
Comments suppressed due to low confidence (1)
.github/actions/setup-test-environment/action.yml:266
- For use-external-images == 'true', this step only verifies/pre-pulls operator and sidecar images. The action also accepts documentdb-image and gateway-image, but those are neither verified nor pre-pulled/loaded into kind, which can cause failures if they’re private or rate-limited. Consider adding manifest checks (and optionally pre-pull + kind load) for the DocumentDB and gateway images referenced by the inputs.
- name: Pre-pull external images for kind cluster (external images)
  if: inputs.use-external-images == 'true'
  shell: bash
  run: |
    echo "Pre-pulling external Docker images for kind cluster..."
    # For external images, we use manifest-based names (no architecture suffix)
    OPERATOR_IMAGE="ghcr.io/${{ inputs.repository-owner }}/documentdb-kubernetes-operator/operator:${{ inputs.image-tag }}"
    SIDECAR_IMAGE="ghcr.io/${{ inputs.repository-owner }}/documentdb-kubernetes-operator/sidecar:${{ inputs.image-tag }}"
    echo "Pre-pulling operator image: $OPERATOR_IMAGE"
    docker pull "$OPERATOR_IMAGE"
    echo "Pre-pulling sidecar image: $SIDECAR_IMAGE"
    docker pull "$SIDECAR_IMAGE"
    # Load the pulled images into kind cluster
    CLUSTER_NAME="documentdb-${{ inputs.test-type }}-${{ inputs.architecture }}-${{ inputs.test-scenario-name }}"
    kind load docker-image "$OPERATOR_IMAGE" --name "$CLUSTER_NAME"
    kind load docker-image "$SIDECAR_IMAGE" --name "$CLUSTER_NAME"
- name: Build documentdb Docker image for ${{ matrix.arch }}
  run: |
    echo "Building documentdb Docker image for ${{ matrix.arch }} architecture..."
    DEB_FILE=$(ls packages/ | grep -E 'postgresql.*documentdb.*\.deb' | grep -v dbgsym | head -1)
    echo "Using deb: $DEB_FILE"
    docker buildx build \
      --platform linux/${{ matrix.arch }} \
      --build-arg ARCH=${{ matrix.base_arch }} \
      --build-arg POSTGRES_VERSION=18 \
      --build-arg DEB_PACKAGE_REL_PATH=packages/$DEB_FILE \
      --tag ghcr.io/${{ github.repository_owner }}/documentdb-kubernetes-operator/documentdb:${{ env.IMAGE_TAG }}-${{ matrix.arch }} \
The deb selection pipeline can succeed even when no matching deb exists (no set -o pipefail / no explicit non-empty check), leaving DEB_FILE empty and causing a hard-to-debug Docker build error later. Add set -euo pipefail and fail with a clear message when no matching deb is found.
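One way to apply this suggestion is sketched below. It is a minimal illustration, not the workflow's actual code: the helper name `select_deb` and its argument order are hypothetical, and it replaces the `ls | grep | head` pipeline with a glob loop so that an empty match becomes an explicit error.

```shell
#!/usr/bin/env bash
set -euo pipefail

# select_deb DIR PATTERN [EXCLUDE]   (hypothetical helper, not part of the PR)
# Prints the first *.deb file name in DIR that matches the extended regex
# PATTERN and does not match EXCLUDE. Returns non-zero with a clear message
# when the directory is missing or nothing matches.
select_deb() {
  local dir="$1" pattern="$2" exclude="${3:-}"
  local path name match=""

  if [[ ! -d "$dir" ]]; then
    echo "ERROR: package directory '$dir' does not exist" >&2
    return 1
  fi

  for path in "$dir"/*.deb; do
    [[ -e "$path" ]] || continue            # glob matched nothing
    name="$(basename "$path")"
    [[ "$name" =~ $pattern ]] || continue   # not the package we want
    if [[ -n "$exclude" && "$name" =~ $exclude ]]; then
      continue                              # e.g. skip dbgsym packages
    fi
    match="$name"
    break
  done

  if [[ -z "$match" ]]; then
    echo "ERROR: no .deb matching '$pattern' found in '$dir'" >&2
    return 1
  fi
  printf '%s\n' "$match"
}
```

In a workflow step this could be called as `DEB_FILE="$(select_deb packages 'postgresql.*documentdb.*\.deb' dbgsym)"`. With `set -e` active, a failed selection stops the step before `docker buildx build` runs.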
- name: Build gateway Docker image for ${{ matrix.arch }}
  run: |
    echo "Building gateway Docker image for ${{ matrix.arch }} architecture..."
    GW_DEB=$(ls gateway-context/packages/ | grep -E 'gateway.*\.deb' | head -1)
    echo "Using deb: $GW_DEB"
    docker buildx build \
      --platform linux/${{ matrix.arch }} \
      --build-arg ARCH=${{ matrix.base_arch }} \
      --build-arg GATEWAY_DEB_REL_PATH=packages/$GW_DEB \
      --tag ghcr.io/${{ github.repository_owner }}/documentdb-kubernetes-operator/gateway:${{ env.IMAGE_TAG }}-${{ matrix.arch }} \
Similar to the documentdb image build, GW_DEB is selected via a pipeline that can yield an empty value without failing, causing a confusing Docker build failure. Add set -euo pipefail (or explicit checks) and fail fast when no gateway deb is found.
// Use global documentDbVersion if set
if version := os.Getenv(DOCUMENTDB_VERSION_ENV); version != "" {
	return fmt.Sprintf("%s:%s", GATEWAY_IMAGE_REPO, version)
}
This adds support for resolving the gateway image from the global DOCUMENTDB_VERSION env var, but there’s no unit test validating this precedence/behavior. Add a test that sets/unsets DOCUMENTDB_VERSION (with t.Setenv) and asserts the expected image reference.
// Use global documentDbVersion if set (from DOCUMENTDB_VERSION env var)
if version := os.Getenv(DOCUMENTDB_VERSION_ENV); version != "" {
	if useImageVolume {
		return fmt.Sprintf("%s:%s", DOCUMENTDB_EXTENSION_IMAGE_REPO, version)
	}
	return fmt.Sprintf("%s:%s", COMBINED_IMAGE_REPO, version)
}
This adds support for resolving DocumentDB images from the global DOCUMENTDB_VERSION env var (with different repos depending on mode), but the unit tests don’t cover this path. Add tests using t.Setenv(DOCUMENTDB_VERSION_ENV, ...) for both useImageVolume=true/false to prevent regressions.
    && chmod +x /home/documentdb/gateway/scripts/*.sh

USER documentdb
WORKDIR /home/documentdb/gateway/scripts
The gateway image does not define a CMD/ENTRYPOINT. The sidecar injector sets only container Args (no Command), so Kubernetes will rely on the image entrypoint; with none set here (and Debian’s default CMD), the gateway container is likely to start the wrong process or fail with “no command specified”. Add an explicit ENTRYPOINT (or CMD) to run the gateway binary (or a wrapper script) so the provided args are actually consumed by the gateway process.
Suggested change
- WORKDIR /home/documentdb/gateway/scripts
+ WORKDIR /home/documentdb/gateway/scripts
+ ENTRYPOINT ["/home/documentdb/gateway/scripts/build_and_start_gateway.sh"]
    wget -qO /etc/apt/keyrings/pgdg.asc \
        https://www.postgresql.org/media/keys/ACCC4CF8.asc; \
    echo "deb [signed-by=/etc/apt/keyrings/pgdg.asc] \
        http://apt.postgresql.org/pub/repos/apt \
The pgdg APT repository is configured over plain HTTP. Even though packages are signed, this still allows MITM/replay against repository metadata and is generally discouraged. Use the HTTPS endpoint for apt.postgresql.org instead.
Suggested change
-         http://apt.postgresql.org/pub/repos/apt \
+         https://apt.postgresql.org/pub/repos/apt \
CLUSTER_NAME="documentdb-${{ inputs.test-type }}-${{ inputs.architecture }}-${{ inputs.test-scenario-name }}"
OPERATOR_IMAGE="ghcr.io/${{ inputs.repository-owner }}/documentdb-kubernetes-operator/operator:${{ inputs.image-tag }}-${{ inputs.architecture }}"
SIDECAR_IMAGE="ghcr.io/${{ inputs.repository-owner }}/documentdb-kubernetes-operator/sidecar:${{ inputs.image-tag }}-${{ inputs.architecture }}"
DOCUMENTDB_IMAGE="ghcr.io/${{ inputs.repository-owner }}/documentdb-kubernetes-operator/documentdb:${{ inputs.image-tag }}-${{ inputs.architecture }}"
GATEWAY_IMAGE="ghcr.io/${{ inputs.repository-owner }}/documentdb-kubernetes-operator/gateway:${{ inputs.image-tag }}-${{ inputs.architecture }}"

# Load the operator image into kind cluster
kind load docker-image "$OPERATOR_IMAGE" --name "$CLUSTER_NAME"

# Load the sidecar image into kind cluster
kind load docker-image "$SIDECAR_IMAGE" --name "$CLUSTER_NAME"

# Load the documentdb extension image into kind cluster
kind load docker-image "$DOCUMENTDB_IMAGE" --name "$CLUSTER_NAME"

# Load the gateway image into kind cluster
kind load docker-image "$GATEWAY_IMAGE" --name "$CLUSTER_NAME"
In the local-build path, the images loaded into kind are hard-coded to ghcr.io/${owner}/...:${image-tag}-${arch} and ignore the action inputs documentdb-image / gateway-image. This can lead to loading the wrong images (or missing required ones) when the workflow passes an explicit image reference (e.g., combined-mode using an external all-in-one image). Consider loading/pulling the exact images specified by the inputs (or at least conditionally skipping loading when they don’t match the local tag).
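A rough sketch of the behavior this comment asks for follows. The function names `image_to_load` and `load_into_kind` are hypothetical, and the idea that an empty input means "use the locally built image" is an assumption about how the action could treat its `documentdb-image` / `gateway-image` inputs.

```shell
#!/usr/bin/env bash

# image_to_load EXPLICIT_REF OWNER COMPONENT TAG ARCH   (hypothetical helper)
# Prints EXPLICIT_REF when the caller passed one; otherwise prints the
# locally built, per-architecture reference used elsewhere in the action.
image_to_load() {
  local explicit="$1" owner="$2" component="$3" tag="$4" arch="$5"
  if [[ -n "$explicit" ]]; then
    printf '%s\n' "$explicit"
  else
    printf 'ghcr.io/%s/documentdb-kubernetes-operator/%s:%s-%s\n' \
      "$owner" "$component" "$tag" "$arch"
  fi
}

# load_into_kind IMAGE CLUSTER   (hypothetical helper)
# Pulls IMAGE only when it is not already in the local Docker store
# (explicit external refs), then loads it into the kind cluster.
load_into_kind() {
  local image="$1" cluster="$2"
  if ! docker image inspect "$image" >/dev/null 2>&1; then
    docker pull "$image"
  fi
  kind load docker-image "$image" --name "$cluster"
}
```

For example, the step could compute `GATEWAY_IMAGE="$(image_to_load "${{ inputs.gateway-image }}" ...)"` and then call `load_into_kind "$GATEWAY_IMAGE" "$CLUSTER_NAME"`, so the image that gets loaded is always the one the test will reference.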
        image: [operator, sidecar, documentdb, gateway]
    runs-on: ubuntu-22.04
-   needs: [build-and-push]
+   needs: [build-packages, build-and-push]
create-manifest lists build-packages in needs. If build-packages is skipped (e.g., skip_optional_images=true), GitHub Actions will skip this entire job, so manifests for operator/sidecar won’t be created either. Consider removing build-packages from needs (and keep the per-image skip logic inside the job), or add a job-level if: always() and handle the skipped dependency explicitly.
Suggested change
-   needs: [build-packages, build-and-push]
+   needs: [build-packages, build-and-push]
+   if: always()
documentdb)
  DEB_FILE=$(ls packages/ | grep -E 'postgresql.*documentdb.*\.deb' | grep -v dbgsym | head -1)
  BUILD_ARGS="--build-arg POSTGRES_VERSION=18 --build-arg DEB_PACKAGE_REL_PATH=packages/$DEB_FILE"
  ;;
The deb selection uses a pipeline that won’t fail when no files match (because set -e/pipefail isn’t enabled), which can produce an empty DEB_FILE and a confusing downstream Docker build failure. Add set -euo pipefail and validate that DEB_FILE is non-empty before using it.
Extension debs built on debian:trixie-slim require GLIBC 2.38+. The CNPG postgresql:18-minimal-bookworm image only has GLIBC 2.36, causing 'GLIBC_2.38 not found' errors when loading extension .so files in ImageVolume mode.
Switch default postgresImage from bookworm to trixie across CRD defaults, generated CRD manifests, and tests.
Verified locally:
- bookworm: FATAL: GLIBC_2.38 not found
- trixie: extensions load successfully
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Compute DOCUMENTDB_IMAGE and GATEWAY_IMAGE dynamically from build output instead of hardcoding external image references. The old baseline images remain hardcoded as they represent a released version to upgrade from.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
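The GLIBC mismatch described in this commit could be caught before an extension is loaded with a check along these lines. This is only a sketch: the helper names are hypothetical, `sort -V` assumes GNU coreutils, and `getconf GNU_LIBC_VERSION` works only on glibc-based systems.

```shell
#!/usr/bin/env bash

# version_ge A B   (hypothetical helper)
# Succeeds when dotted version A is greater than or equal to B.
# Feeds "B, A" to a version-aware sortedness check: the pair is already
# in order exactly when B <= A.
version_ge() {
  printf '%s\n%s\n' "$2" "$1" | sort -V -C
}

# glibc_version   (hypothetical helper)
# Prints the runtime glibc version, e.g. "2.36".
# `getconf GNU_LIBC_VERSION` prints a line such as "glibc 2.36".
glibc_version() {
  getconf GNU_LIBC_VERSION | awk '{print $2}'
}

# require_glibc MINIMUM   (hypothetical helper)
# Fails with a readable message when the running image is too old.
require_glibc() {
  local minimum="$1" actual
  actual="$(glibc_version)"
  if ! version_ge "$actual" "$minimum"; then
    echo "ERROR: glibc $actual is older than required $minimum" >&2
    return 1
  fi
}
```

Running `require_glibc 2.38` inside a candidate base image would fail on a glibc 2.36 system and pass on a newer one. The version-aware comparison matters because a plain string or numeric comparison would rank 2.9 above 2.38.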
CNPG mounts extension images at /extensions/{name}/ and resolves
library paths relative to that mount point. Our Dockerfile_postgres
uses the standard Debian filesystem layout, so libraries are at
usr/lib/postgresql/18/lib/ (not lib/) and extension control files
at usr/share/postgresql/18/extension/ (not share/).
Root cause: initdb pods crashed with 'FATAL: could not access file
pg_cron: No such file or directory' because CNPG looked for .so
files at /extensions/documentdb/lib/ but they were actually at
/extensions/documentdb/usr/lib/postgresql/18/lib/.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three issues prevented CREATE EXTENSION documentdb from working when the extension image is mounted as a CNPG ImageVolume:
1. extension_control_path must point to the share dir (PG appends /extension/). Changed from 'usr/share/postgresql/18/extension' to 'usr/share/postgresql/18'.
2. Debian-alternatives symlinks (e.g. postgis.control -> /etc/alternatives/...) break in ImageVolume mounts because the target is outside the volume. Added a build step to resolve all dangling symlinks to real files.
3. PostGIS shared libraries (libgeos_c, etc.) live in the platform-specific lib dir. Added 'usr/lib/aarch64-linux-gnu' and 'usr/lib/x86_64-linux-gnu' to LdLibraryPath so both architectures are covered.
Tested locally in kind (K8s 1.35): initdb completes, all extensions load, PostgreSQL 18 starts successfully with documentdb 0.110-0.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The sidecar injector was using PullAlways for the gateway container, which fails in kind clusters where images are loaded locally (not pushed to a registry). Changed to IfNotPresent which works for both CI (kind load) and production (images available in registry).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
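The symlink fix in item 2 is not shown in this conversation. A minimal sketch of what such a build step might look like is given below, assuming it runs inside the image build, where the link targets still exist; the function name `materialize_symlinks` is hypothetical, and `readlink -f` assumes GNU coreutils.

```shell
#!/usr/bin/env bash

# materialize_symlinks ROOT   (hypothetical helper, not the PR's actual step)
# Replaces every symlink under ROOT that resolves to a regular file with a
# copy of that file, so nothing under ROOT depends on paths outside the
# directory once it is mounted elsewhere. Symlinks that do not resolve to a
# regular file are reported and left untouched.
materialize_symlinks() {
  local root="$1" link target
  while IFS= read -r -d '' link; do
    target="$(readlink -f -- "$link" || true)"
    if [[ -n "$target" && -f "$target" ]]; then
      rm -- "$link"
      cp -p -- "$target" "$link"   # -p keeps mode and timestamps
    else
      echo "WARN: skipping symlink that does not resolve to a regular file: $link" >&2
    fi
  done < <(find "$root" -type l -print0)
}
```

In a Dockerfile this could be invoked from a `RUN` step against the extension directory, for example `materialize_symlinks /usr/share/postgresql/18/extension`, before the image is published.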
Root cause: Dockerfile_gateway_deb had no ENTRYPOINT/CMD, so K8s tried to execute '--create-user' as a command when the sidecar injector set only Args without Command, causing RunContainerError.
Fixes:
- Create gateway_entrypoint.sh as a standalone script that handles --create-user, --start-pg, --pg-port, --cert-path, --key-file args passed by the sidecar injector plugin
- Set OWNER default to 'postgres' (CNPG superuser) instead of whoami
- Export PGHOST=localhost to force TCP connection (PG runs in separate container, sharing pod network but not filesystem)
- Install postgresql-client for pg_isready and psql (needed for readiness check and admin user creation via SetupCustomAdminUser)
- Set ENTRYPOINT in Dockerfile to the entrypoint script
- Remove build_and_start_gateway.sh from build context (replaced by gateway_entrypoint.sh)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
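The contents of gateway_entrypoint.sh are not included in this conversation. The sketch below only illustrates how an entrypoint script could consume the Args that Kubernetes appends to ENTRYPOINT. The flag names come from the commit message; the assumption that every flag takes one value, the default values, and the function name `parse_gateway_args` are all hypothetical.

```shell
#!/usr/bin/env bash

# parse_gateway_args ARGS...   (hypothetical; not the real gateway_entrypoint.sh)
# Parses the flags named in the commit message into shell variables.
# Assumption: every flag is followed by exactly one value.
parse_gateway_args() {
  # Illustrative defaults only; the real script may use different ones.
  CREATE_USER="false"
  START_PG="false"
  PG_PORT="5432"
  CERT_PATH=""
  KEY_FILE=""

  while [[ $# -gt 0 ]]; do
    if [[ $# -lt 2 ]]; then
      echo "ERROR: missing value for $1" >&2
      return 1
    fi
    case "$1" in
      --create-user) CREATE_USER="$2" ;;
      --start-pg)    START_PG="$2" ;;
      --pg-port)     PG_PORT="$2" ;;
      --cert-path)   CERT_PATH="$2" ;;
      --key-file)    KEY_FILE="$2" ;;
      *)
        echo "ERROR: unknown argument: $1" >&2
        return 1
        ;;
    esac
    shift 2
  done
}
```

An entrypoint built on this would call `parse_gateway_args "$@"` first, then `export PGHOST=localhost` as described above, and only then wait for PostgreSQL and start the gateway. Because the image now declares an ENTRYPOINT, the injector's Args reach this parser instead of being executed as a command.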