Feat/deb based images #274

Open
guanzhousongmicrosoft wants to merge 12 commits into documentdb:main from guanzhousongmicrosoft:feat/deb-based-images

@guanzhousongmicrosoft
Collaborator

No description provided.

- Add Dockerfile_postgres: lean PG18 image with documentdb extension
  installed from a pre-built .deb package (CNPG-compatible)
- Add Dockerfile_gateway_deb: lean gateway-only image installed from
  a pre-built gateway .deb package
- Update build_images.yml:
  - Add build-packages job that checks out documentdb repo at a
    pinned ref and builds extension + gateway debs
  - Replace documentdb/gateway matrix entries to use new Dockerfiles
  - Add documentdb_ref workflow input for pinning source version
- Remove old Dockerfiles: Dockerfile_docdb, Dockerfile_docdb_packages,
  Dockerfile_gateway

This eliminates the dependency on the documentdb.io/deb APT repo and
the slow build-from-source approach. Images are now built from debs
produced by the documentdb repo packaging scripts, supporting PG18
on debian:trixie-slim.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Update build-documentdb and build-gateway jobs to use the new
Dockerfile_postgres and Dockerfile_gateway_deb with pre-built debs.
Add build-packages job to build extension and gateway debs from the
documentdb repo before image builds.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Fix create-manifest to check build-packages result (not build-and-push)
  for skip logic, preventing manifest creation for unbuilt images
- Add set -e and deb file validation to gateway context preparation
  in both build_images.yml and test-build-and-package.yml

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…way pg version

- Add documentdb_ref input to test-build-and-package.yml (defaults to
  main, can be overridden by callers for pinning)
- Add comments explaining why gateway uses --pg 17 (pure Rust binary,
  PG version is passthrough, matches upstream convention)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Update operator constants to use new image repos under
  ghcr.io/documentdb/documentdb-kubernetes-operator/{documentdb,gateway}
- Enable version-based image resolution (documentDBVersion -> image ref)
  for both documentdb extension and gateway images
- Load documentdb+gateway images into kind cluster in E2E tests
- Compute image refs dynamically from build outputs instead of
  hardcoding external images in E2E matrix
- Update all test workflows to use consistent image references
- Add unit tests for version-based image resolution

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings February 27, 2026 20:59

- Add image ref computation step to test-integration.yml and
  test-backup-and-restore.yml so they use locally built documentdb
  and gateway images instead of stale external defaults
- Pass documentdb-image and gateway-image to setup-test-environment
- Update backup restore cluster specs to use computed image refs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Contributor

Copilot AI left a comment

Pull request overview

This PR updates the operator and CI/CD pipeline to use new deb-based DocumentDB/gateway container images, and wires spec.documentDBVersion / DOCUMENTDB_VERSION into image resolution so deployments can pin component versions consistently.

Changes:

  • Add spec.documentDBVersion + DOCUMENTDB_VERSION support for resolving DocumentDB and gateway image tags.
  • Introduce new image repository constants and update defaults to the new GHCR locations.
  • Rework GitHub Actions image build/test workflows to build deb packages first, then build documentdb/gateway images from those artifacts using new Dockerfiles.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| operator/src/internal/utils/util.go | Enables documentDBVersion + env-var based image selection for gateway and engine images. |
| operator/src/internal/utils/util_test.go | Adds unit tests for documentDBVersion image resolution. |
| operator/src/internal/utils/constants.go | Adds new image repo constants and updates default image references. |
| .github/workflows/test-upgrade-and-rollback.yml | Updates default test image references to new GHCR locations. |
| .github/workflows/test-integration.yml | Passes resolved documentdb/gateway image references into the test setup action. |
| .github/workflows/test-build-and-package.yml | Adds package-build job and switches documentdb/gateway image builds to consume built deb artifacts. |
| .github/workflows/test-backup-and-restore.yml | Updates default test image references and wiring for gateway/documentdb images. |
| .github/workflows/test-E2E.yml | Updates E2E workflows to use separate documentdb/gateway images for ImageVolume mode and combined image for legacy mode. |
| .github/workflows/build_images.yml | Adds deb package build stage and updates build/push logic for documentdb/gateway images using deb-based Dockerfiles. |
| .github/dockerfiles/Dockerfile_postgres | New lean Postgres image that installs DocumentDB extension from a pre-built deb. |
| .github/dockerfiles/Dockerfile_gateway_deb | New lean gateway image that installs the gateway binary from a pre-built deb. |
| .github/dockerfiles/Dockerfile_gateway | Removes legacy gateway build-from-source Dockerfile. |
| .github/dockerfiles/Dockerfile_docdb_packages | Removes the old “official packages” image build Dockerfile. |
| .github/dockerfiles/Dockerfile_docdb | Removes legacy build-from-source DocumentDB Dockerfile. |
| .github/actions/setup-test-environment/action.yml | Updates defaults to new GHCR images and loads additional locally-built images into kind. |
Comments suppressed due to low confidence (1)

.github/actions/setup-test-environment/action.yml:266

  • For use-external-images == 'true', this step only verifies/pre-pulls operator and sidecar images. The action also accepts documentdb-image and gateway-image, but those are neither verified nor pre-pulled/loaded into kind, which can cause failures if they’re private or rate-limited. Consider adding manifest checks (and optionally pre-pull + kind load) for the DocumentDB and gateway images referenced by the inputs.
    - name: Pre-pull external images for kind cluster (external images)
      if: inputs.use-external-images == 'true'
      shell: bash
      run: |
        echo "Pre-pulling external Docker images for kind cluster..."
        
        # For external images, we use manifest-based names (no architecture suffix)
        OPERATOR_IMAGE="ghcr.io/${{ inputs.repository-owner }}/documentdb-kubernetes-operator/operator:${{ inputs.image-tag }}"
        SIDECAR_IMAGE="ghcr.io/${{ inputs.repository-owner }}/documentdb-kubernetes-operator/sidecar:${{ inputs.image-tag }}"
        
        echo "Pre-pulling operator image: $OPERATOR_IMAGE"
        docker pull "$OPERATOR_IMAGE"
        
        echo "Pre-pulling sidecar image: $SIDECAR_IMAGE"
        docker pull "$SIDECAR_IMAGE"
        
        # Load the pulled images into kind cluster
        CLUSTER_NAME="documentdb-${{ inputs.test-type }}-${{ inputs.architecture }}-${{ inputs.test-scenario-name }}"
        
        kind load docker-image "$OPERATOR_IMAGE" --name "$CLUSTER_NAME"
        kind load docker-image "$SIDECAR_IMAGE" --name "$CLUSTER_NAME"
        

Comment on lines 205 to 214
- name: Build documentdb Docker image for ${{ matrix.arch }}
  run: |
    echo "Building documentdb Docker image for ${{ matrix.arch }} architecture..."
    DEB_FILE=$(ls packages/ | grep -E 'postgresql.*documentdb.*\.deb' | grep -v dbgsym | head -1)
    echo "Using deb: $DEB_FILE"
    docker buildx build \
      --platform linux/${{ matrix.arch }} \
      --build-arg ARCH=${{ matrix.base_arch }} \
      --build-arg POSTGRES_VERSION=18 \
      --build-arg DEB_PACKAGE_REL_PATH=packages/$DEB_FILE \
      --tag ghcr.io/${{ github.repository_owner }}/documentdb-kubernetes-operator/documentdb:${{ env.IMAGE_TAG }}-${{ matrix.arch }} \

Copilot AI Feb 27, 2026

The deb selection pipeline can succeed even when no matching deb exists (no set -o pipefail / no explicit non-empty check), leaving DEB_FILE empty and causing a hard-to-debug Docker build error later. Add set -euo pipefail and fail with a clear message when no matching deb is found.
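One way to get the fail-fast behavior described above is sketched below. It is only an illustration, not the change that landed in the PR: `select_deb` is a hypothetical helper name, and the sketch assumes the debs sit directly in the directory passed as the first argument.

```bash
#!/usr/bin/env bash
set -euo pipefail

# select_deb DIR INCLUDE_REGEX [EXCLUDE_REGEX]   (hypothetical helper)
# Prints the name of the first .deb in DIR whose file name matches
# INCLUDE_REGEX and does not match EXCLUDE_REGEX. Returns non-zero with a
# clear message on stderr when nothing matches.
select_deb() {
  local dir="$1" include="$2" exclude="${3:-}"
  local path name
  for path in "$dir"/*.deb; do
    [ -e "$path" ] || continue            # the glob matched nothing
    name="$(basename "$path")"
    [[ "$name" =~ $include ]] || continue
    if [ -n "$exclude" ] && [[ "$name" =~ $exclude ]]; then
      continue                            # e.g. skip dbgsym packages
    fi
    printf '%s\n' "$name"
    return 0
  done
  echo "ERROR: no .deb matching '$include' found in $dir" >&2
  return 1
}

# Intended use inside the workflow step (with set -e, a failed selection
# aborts the step before docker buildx runs):
#   DEB_FILE="$(select_deb packages 'postgresql.*documentdb.*\.deb' 'dbgsym')"
```

Because the helper returns non-zero instead of printing an empty string, the step stops at the selection rather than at a later Docker build error.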

Copilot uses AI. Check for mistakes.
Comment on lines 277 to 285
- name: Build gateway Docker image for ${{ matrix.arch }}
  run: |
    echo "Building gateway Docker image for ${{ matrix.arch }} architecture..."
    GW_DEB=$(ls gateway-context/packages/ | grep -E 'gateway.*\.deb' | head -1)
    echo "Using deb: $GW_DEB"
    docker buildx build \
      --platform linux/${{ matrix.arch }} \
      --build-arg ARCH=${{ matrix.base_arch }} \
      --build-arg GATEWAY_DEB_REL_PATH=packages/$GW_DEB \
      --tag ghcr.io/${{ github.repository_owner }}/documentdb-kubernetes-operator/gateway:${{ env.IMAGE_TAG }}-${{ matrix.arch }} \

Copilot AI Feb 27, 2026

Similar to the documentdb image build, GW_DEB is selected via a pipeline that can yield an empty value without failing, causing a confusing Docker build failure. Add set -euo pipefail (or explicit checks) and fail fast when no gateway deb is found.

Comment on lines +442 to +445
	// Use global documentDbVersion if set
	if version := os.Getenv(DOCUMENTDB_VERSION_ENV); version != "" {
		return fmt.Sprintf("%s:%s", GATEWAY_IMAGE_REPO, version)
	}

Copilot AI Feb 27, 2026

This adds support for resolving the gateway image from the global DOCUMENTDB_VERSION env var, but there’s no unit test validating this precedence/behavior. Add a test that sets/unsets DOCUMENTDB_VERSION (with t.Setenv) and asserts the expected image reference.

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +475 to +481
	// Use global documentDbVersion if set (from DOCUMENTDB_VERSION env var)
	if version := os.Getenv(DOCUMENTDB_VERSION_ENV); version != "" {
		if useImageVolume {
			return fmt.Sprintf("%s:%s", DOCUMENTDB_EXTENSION_IMAGE_REPO, version)
		}
		return fmt.Sprintf("%s:%s", COMBINED_IMAGE_REPO, version)
	}

Copilot AI Feb 27, 2026

This adds support for resolving DocumentDB images from the global DOCUMENTDB_VERSION env var (with different repos depending on mode), but the unit tests don’t cover this path. Add tests using t.Setenv(DOCUMENTDB_VERSION_ENV, ...) for both useImageVolume=true/false to prevent regressions.

&& chmod +x /home/documentdb/gateway/scripts/*.sh

USER documentdb
WORKDIR /home/documentdb/gateway/scripts

Copilot AI Feb 27, 2026

The gateway image does not define a CMD/ENTRYPOINT. The sidecar injector sets only container Args (no Command), so Kubernetes will rely on the image entrypoint; with none set here (and Debian’s default CMD), the gateway container is likely to start the wrong process or fail with “no command specified”. Add an explicit ENTRYPOINT (or CMD) to run the gateway binary (or a wrapper script) so the provided args are actually consumed by the gateway process.

Suggested change
- WORKDIR /home/documentdb/gateway/scripts
+ WORKDIR /home/documentdb/gateway/scripts
+ ENTRYPOINT ["/home/documentdb/gateway/scripts/build_and_start_gateway.sh"]

wget -qO /etc/apt/keyrings/pgdg.asc \
https://www.postgresql.org/media/keys/ACCC4CF8.asc; \
echo "deb [signed-by=/etc/apt/keyrings/pgdg.asc] \
http://apt.postgresql.org/pub/repos/apt \

Copilot AI Feb 27, 2026

The pgdg APT repository is configured over plain HTTP. Even though packages are signed, this still allows MITM/replay against repository metadata and is generally discouraged. Use the HTTPS endpoint for apt.postgresql.org instead.

Suggested change
- http://apt.postgresql.org/pub/repos/apt \
+ https://apt.postgresql.org/pub/repos/apt \

Comment on lines 225 to +241
CLUSTER_NAME="documentdb-${{ inputs.test-type }}-${{ inputs.architecture }}-${{ inputs.test-scenario-name }}"
OPERATOR_IMAGE="ghcr.io/${{ inputs.repository-owner }}/documentdb-kubernetes-operator/operator:${{ inputs.image-tag }}-${{ inputs.architecture }}"
SIDECAR_IMAGE="ghcr.io/${{ inputs.repository-owner }}/documentdb-kubernetes-operator/sidecar:${{ inputs.image-tag }}-${{ inputs.architecture }}"
DOCUMENTDB_IMAGE="ghcr.io/${{ inputs.repository-owner }}/documentdb-kubernetes-operator/documentdb:${{ inputs.image-tag }}-${{ inputs.architecture }}"
GATEWAY_IMAGE="ghcr.io/${{ inputs.repository-owner }}/documentdb-kubernetes-operator/gateway:${{ inputs.image-tag }}-${{ inputs.architecture }}"

# Load the operator image into kind cluster
kind load docker-image "$OPERATOR_IMAGE" --name "$CLUSTER_NAME"

# Load the sidecar image into kind cluster
kind load docker-image "$SIDECAR_IMAGE" --name "$CLUSTER_NAME"

# Load the documentdb extension image into kind cluster
kind load docker-image "$DOCUMENTDB_IMAGE" --name "$CLUSTER_NAME"

# Load the gateway image into kind cluster
kind load docker-image "$GATEWAY_IMAGE" --name "$CLUSTER_NAME"

Copilot AI Feb 27, 2026

In the local-build path, the images loaded into kind are hard-coded to ghcr.io/${owner}/...:${image-tag}-${arch} and ignore the action inputs documentdb-image / gateway-image. This can lead to loading the wrong images (or missing required ones) when the workflow passes an explicit image reference (e.g., combined-mode using an external all-in-one image). Consider loading/pulling the exact images specified by the inputs (or at least conditionally skipping loading when they don’t match the local tag).
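A rough sketch of the "prefer the explicit input, fall back to the local tag" idea follows. The helper names `resolve_image` and `load_into_kind` are hypothetical and not part of the action. The sketch assumes the `documentdb-image` / `gateway-image` inputs arrive as plain strings that may be empty.

```bash
#!/usr/bin/env bash
set -euo pipefail

# resolve_image EXPLICIT_REF OWNER NAME TAG ARCH   (hypothetical helper)
# Prints EXPLICIT_REF when it is non-empty; otherwise prints the reference
# of the locally built, architecture-suffixed image.
resolve_image() {
  local explicit="$1" owner="$2" name="$3" tag="$4" arch="$5"
  if [ -n "$explicit" ]; then
    printf '%s\n' "$explicit"
  else
    printf 'ghcr.io/%s/documentdb-kubernetes-operator/%s:%s-%s\n' \
      "$owner" "$name" "$tag" "$arch"
  fi
}

# load_into_kind IMAGE CLUSTER   (hypothetical helper)
# Pulls IMAGE only when it is not already in the local Docker image store,
# then loads it into the named kind cluster.
load_into_kind() {
  local image="$1" cluster="$2"
  docker image inspect "$image" >/dev/null 2>&1 || docker pull "$image"
  kind load docker-image "$image" --name "$cluster"
}

# Intended use in the action step:
#   GATEWAY_IMAGE="$(resolve_image "${{ inputs.gateway-image }}" \
#     "${{ inputs.repository-owner }}" gateway \
#     "${{ inputs.image-tag }}" "${{ inputs.architecture }}")"
#   load_into_kind "$GATEWAY_IMAGE" "$CLUSTER_NAME"
```

With this shape, the image loaded into kind is always the one the cluster spec will reference, whether it was built locally or supplied by the caller.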

  image: [operator, sidecar, documentdb, gateway]
  runs-on: ubuntu-22.04
- needs: [build-and-push]
+ needs: [build-packages, build-and-push]

Copilot AI Feb 27, 2026

create-manifest lists build-packages in needs. If build-packages is skipped (e.g., skip_optional_images=true), GitHub Actions will skip this entire job, so manifests for operator/sidecar won’t be created either. Consider removing build-packages from needs (and keep the per-image skip logic inside the job), or add a job-level if: always() and handle the skipped dependency explicitly.

Suggested change
- needs: [build-packages, build-and-push]
+ needs: [build-packages, build-and-push]
+ if: always()

Comment on lines +187 to +190
documentdb)
  DEB_FILE=$(ls packages/ | grep -E 'postgresql.*documentdb.*\.deb' | grep -v dbgsym | head -1)
  BUILD_ARGS="--build-arg POSTGRES_VERSION=18 --build-arg DEB_PACKAGE_REL_PATH=packages/$DEB_FILE"
  ;;

Copilot AI Feb 27, 2026

The deb selection uses a pipeline that won’t fail when no files match (because set -e/pipefail isn’t enabled), which can produce an empty DEB_FILE and a confusing downstream Docker build failure. Add set -euo pipefail and validate that DEB_FILE is non-empty before using it.

Extension debs built on debian:trixie-slim require GLIBC 2.38+.
The CNPG postgresql:18-minimal-bookworm image only has GLIBC 2.36,
causing 'GLIBC_2.38 not found' errors when loading extension .so
files in ImageVolume mode.

Switch default postgresImage from bookworm to trixie across CRD
defaults, generated CRD manifests, and tests.

Verified locally:
- bookworm: FATAL: GLIBC_2.38 not found
- trixie: extensions load successfully
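
The mismatch can be expressed as a version comparison. The sketch below is illustrative and not part of the commit. `glibc_satisfies` and `required_glibc` are hypothetical helpers; the first relies on GNU `sort -V`, and the second assumes binutils' `objdump` is available in the image being inspected.

```bash
#!/usr/bin/env bash
set -euo pipefail

# glibc_satisfies REQUIRED AVAILABLE   (hypothetical helper)
# Succeeds when AVAILABLE is at least REQUIRED under version ordering.
glibc_satisfies() {
  local required="$1" available="$2" lowest
  lowest="$(printf '%s\n%s\n' "$required" "$available" | sort -V | head -n 1)"
  [ "$lowest" = "$required" ]
}

# required_glibc LIBRARY   (hypothetical helper)
# Prints the highest GLIBC_x.y symbol version referenced by a shared object,
# read from its dynamic symbol table.
required_glibc() {
  objdump -T "$1" | grep -o 'GLIBC_[0-9.]*' | sed 's/^GLIBC_//' | sort -Vu | tail -n 1
}

# With the numbers from this commit message:
#   glibc_satisfies 2.38 2.36   -> fails (bookworm base image)
```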

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Compute DOCUMENTDB_IMAGE and GATEWAY_IMAGE dynamically from build
output instead of hardcoding external image references. The old
baseline images remain hardcoded as they represent a released
version to upgrade from.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

CNPG mounts extension images at /extensions/{name}/ and resolves
library paths relative to that mount point. Our Dockerfile_postgres
uses the standard Debian filesystem layout, so libraries are at
usr/lib/postgresql/18/lib/ (not lib/) and extension control files
at usr/share/postgresql/18/extension/ (not share/).

Root cause: initdb pods crashed with 'FATAL: could not access file
pg_cron: No such file or directory' because CNPG looked for .so
files at /extensions/documentdb/lib/ but they were actually at
/extensions/documentdb/usr/lib/postgresql/18/lib/.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Three issues prevented CREATE EXTENSION documentdb from working when the
extension image is mounted as a CNPG ImageVolume:

1. extension_control_path must point to the share dir (PG appends /extension/)
   Changed from 'usr/share/postgresql/18/extension' to 'usr/share/postgresql/18'

2. Debian-alternatives symlinks (e.g. postgis.control -> /etc/alternatives/...)
   break in ImageVolume mounts because the target is outside the volume.
   Added a build step to resolve all dangling symlinks to real files.

3. PostGIS shared libraries (libgeos_c, etc.) live in the platform-specific
   lib dir. Added 'usr/lib/aarch64-linux-gnu' and 'usr/lib/x86_64-linux-gnu'
   to LdLibraryPath so both architectures are covered.
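
For item 2, a build step along the following lines would replace such symlinks with real files. This is a sketch under assumptions, not the step from the Dockerfile: `materialize_symlinks` is a hypothetical name, and it assumes the step runs during the image build, while the /etc/alternatives targets still exist.

```bash
#!/usr/bin/env bash
set -euo pipefail

# materialize_symlinks ROOT   (hypothetical helper)
# Replaces every symlink under ROOT that resolves to a regular file with a
# copy of that file, so the tree no longer depends on paths outside ROOT.
materialize_symlinks() {
  local root="$1" link tmp
  find "$root" -type l -print0 | while IFS= read -r -d '' link; do
    if [ -f "$link" ]; then               # target exists and is a regular file
      tmp="${link}.materialized.$$"
      cp -L "$link" "$tmp"                # -L copies the target's content
      mv -f "$tmp" "$link"                # rename over the symlink itself
    else
      echo "WARNING: leaving unresolved symlink $link" >&2
    fi
  done
}

# Intended use in a Dockerfile RUN step (path taken from item 1 above):
#   materialize_symlinks /usr/share/postgresql/18
```

Symlinks that point at directories or at missing targets are left alone and reported, so a broken package layout shows up in the build log instead of being silently copied.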

Tested locally in kind (K8s 1.35): initdb completes, all extensions load,
PostgreSQL 18 starts successfully with documentdb 0.110-0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The sidecar injector was using PullAlways for the gateway container,
which fails in kind clusters where images are loaded locally (not
pushed to a registry). Changed to IfNotPresent which works for both
CI (kind load) and production (images available in registry).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Root cause: Dockerfile_gateway_deb had no ENTRYPOINT/CMD, so K8s tried
to execute '--create-user' as a command when the sidecar injector set
only Args without Command, causing RunContainerError.

Fixes:
- Create gateway_entrypoint.sh as a standalone script that handles
  --create-user, --start-pg, --pg-port, --cert-path, --key-file args
  passed by the sidecar injector plugin
- Set OWNER default to 'postgres' (CNPG superuser) instead of whoami
- Export PGHOST=localhost to force TCP connection (PG runs in separate
  container, sharing pod network but not filesystem)
- Install postgresql-client for pg_isready and psql (needed for
  readiness check and admin user creation via SetupCustomAdminUser)
- Set ENTRYPOINT in Dockerfile to the entrypoint script
- Remove build_and_start_gateway.sh from build context (replaced by
  gateway_entrypoint.sh)
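
Once an ENTRYPOINT is set, Kubernetes passes the container Args to it as positional parameters, so the script has to parse them itself. The sketch below shows one possible parsing loop and is not the actual gateway_entrypoint.sh. It assumes every flag takes exactly one value, `parse_gateway_args` is a hypothetical name, and every default other than OWNER and PGHOST is a placeholder.

```bash
#!/usr/bin/env bash
set -euo pipefail

# OWNER and PGHOST follow the commit message; the other defaults are
# placeholders chosen for this sketch.
OWNER="postgres"
export PGHOST="localhost"
CREATE_USER="false"
START_PG="false"
PG_PORT="5432"
CERT_PATH=""
KEY_FILE=""

# parse_gateway_args ARGS...   (hypothetical helper)
# Consumes "--flag value" pairs and stores them in the variables above.
# Returns 2 on an unknown flag or a flag without a value.
parse_gateway_args() {
  while [ "$#" -gt 0 ]; do
    if [ "$#" -lt 2 ]; then
      echo "ERROR: missing value for $1" >&2
      return 2
    fi
    case "$1" in
      --create-user) CREATE_USER="$2" ;;
      --start-pg)    START_PG="$2" ;;
      --pg-port)     PG_PORT="$2" ;;
      --cert-path)   CERT_PATH="$2" ;;
      --key-file)    KEY_FILE="$2" ;;
      *) echo "ERROR: unknown argument $1" >&2; return 2 ;;
    esac
    shift 2
  done
}

# Intended use at the top of the entrypoint script:
#   parse_gateway_args "$@"
```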

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>