feat: add fake-gcs-server to devstack with GCS integration tests #3155

Open
npow wants to merge 7 commits into master from gcs-devstack-tests

npow (Collaborator) commented Apr 27, 2026

Summary

  • Add fake-gcs-server (Google Cloud Storage emulator) as a devstack component with full CI coverage
  • Add gcs-local backend to the UX test CI matrix
  • Monkey-patch GCS client factory to use anonymous credentials with the emulator
  • Add Tiltfile, k8s deployment, bucket init job, and secret for fake-gcs-server

Resurrected from npow/devstack-fake-gcs where these changes were added then removed in a cleanup pass.

Test plan

  • Verify gcs-local backend passes in CI UX test matrix
  • Confirm fake-gcs-server starts and bucket init succeeds in devstack
  • Verify GCS anonymous credential monkey-patch works with the emulator
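The monkey-patch mentioned above could look roughly like the following. This is a minimal sketch, not the PR's actual conftest code: `patch_gs_client_factory`, `make_anonymous_client`, and the `test-project` placeholder are hypothetical names, and a `SimpleNamespace` stands in for the real factory module so the swap can be shown without GCP libraries installed. Only the attribute name `get_gs_storage_client` comes from the review summary.

```python
import types


def patch_gs_client_factory(factory, make_client):
    """Swap factory.get_gs_storage_client for make_client; return an undo callable."""
    original = factory.get_gs_storage_client
    factory.get_gs_storage_client = make_client

    def restore():
        factory.get_gs_storage_client = original

    return restore


def make_anonymous_client():
    """Build a client that never looks up real GCP credentials."""
    # Imported lazily so the patch helper itself has no hard dependency
    # on google-cloud-storage.
    from google.auth.credentials import AnonymousCredentials
    from google.cloud import storage

    # "test-project" is a placeholder (assumption); the emulator is assumed
    # not to validate the project name.
    return storage.Client(credentials=AnonymousCredentials(), project="test-project")


# Stand-in for the real factory module, used only to demonstrate the swap.
fake_factory = types.SimpleNamespace(get_gs_storage_client=lambda: "real-client")
restore = patch_gs_client_factory(fake_factory, lambda: "anonymous-client")
patched_result = fake_factory.get_gs_storage_client()
restore()
restored_result = fake_factory.get_gs_storage_client()
```

In a real test session, `make_anonymous_client` would be passed as the replacement, and `restore` would run at teardown so other tests see the unpatched factory.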

🤖 Generated with Claude Code

Nissan Pow added 3 commits April 27, 2026 18:54
Restore fake-gcs-server (Google Cloud Storage emulator) as a devstack
component with full CI coverage via a new gcs-local backend.

- Add fake-gcs-server Tiltfile, k8s deployment, bucket init job, and secret
- Add gcs-local backend to GHA matrix (minio + postgresql + metadata-service + fake-gcs-server)
- Add gcs-local backend to ux_test_config.yaml (runner-only, no scheduler)
- Monkey-patch GCS client factory in conftest to use anonymous credentials
  with the emulator (google.auth.default() fails without real GCP creds)
- Update verify_run_provenance to accept ds-type 'gs' for GCS backends
- Install google-cloud-storage in CI for gcs-local backend
- Set METAFLOW_DEFAULT_DATASTORE=gs and STORAGE_EMULATOR_HOST via GITHUB_ENV

When STORAGE_EMULATOR_HOST is set, create a plain storage.Client()
that auto-detects the emulator instead of calling google.auth.default(),
which fails without real GCP credentials. This fixes gcs-local CI tests
where flow subprocesses inherit the emulator env var but have no GCP
credentials configured.
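
The emulator check described in this commit message can be sketched as follows. This is an illustration under assumptions, not the code in gs_storage_client_factory.py: the helper names `emulator_host` and `get_gs_storage_client_sketch` are hypothetical, and treating an empty or whitespace-only value as unset is an assumption.

```python
import os


def emulator_host(environ=None):
    """Return the emulator endpoint if STORAGE_EMULATOR_HOST is set, else None."""
    environ = os.environ if environ is None else environ
    # Assumption: an empty or whitespace-only value counts as "not set".
    host = environ.get("STORAGE_EMULATOR_HOST", "").strip()
    return host or None


def get_gs_storage_client_sketch():
    """Create a storage client, skipping credential lookup when an emulator is set."""
    # Imported lazily so this module loads without google-cloud-storage installed.
    from google.cloud import storage

    if emulator_host() is not None:
        # Per the commit message, a plain client auto-detects the emulator.
        return storage.Client()

    import google.auth

    # Normal path: resolve application default credentials.
    credentials, project = google.auth.default()
    return storage.Client(credentials=credentials, project=project)
```

Because the branch depends only on the environment variable, flow subprocesses that inherit STORAGE_EMULATOR_HOST take the emulator path without any extra configuration.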

greptile-apps Bot (Contributor) commented Apr 27, 2026

Greptile Summary

This PR adds fake-gcs-server as a devstack component and wires up a gcs-local CI matrix entry that exercises the GCS datastore backend end-to-end using the emulator.

  • gs_storage_client_factory.py gains a native STORAGE_EMULATOR_HOST check that skips google.auth.default(), making the factory safe to use without real GCP credentials; a belt-and-suspenders monkey-patch in conftest.py covers custom provider configurations.
  • The devstack gains a fake-gcs-server Deployment, a gcs-bucket-init Job, and a fake-gcs-secret, all plumbed through a new Tiltfile extension with correct port-forwarding and resource_deps ordering.
  • verify_run_provenance in test_utils.py is extended to assert ds_type == "gs" when the GCS backend is active.
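
The datastore-type assertion in the last bullet might be expressed along these lines. This is a hedged sketch only: `expected_ds_type` and `check_ds_type` are hypothetical helpers, and how verify_run_provenance actually obtains the run's ds-type is not shown in this conversation.

```python
def expected_ds_type(environ):
    """Return "gs" when the GCS datastore is the configured default, else None."""
    if environ.get("METAFLOW_DEFAULT_DATASTORE") == "gs":
        return "gs"
    return None  # No expectation is modeled for other backends in this sketch.


def check_ds_type(actual, environ):
    """Assert the recorded ds-type matches the GCS expectation, if there is one."""
    expected = expected_ds_type(environ)
    if expected is not None:
        assert actual == expected, f"ds-type {actual!r} != expected {expected!r}"
```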

Confidence Score: 5/5

Safe to merge — changes are additive (new devstack component + CI matrix entry) and the factory fix for emulator credentials is well-guarded behind the STORAGE_EMULATOR_HOST check.

All changes are isolated to the new gcs-local CI path and devstack tooling. The factory edit touches production code but only activates when STORAGE_EMULATOR_HOST is explicitly set, so it cannot affect existing GCS users. The monkey-patch in conftest.py is now redundant with the factory fix, but both paths produce correct behavior.

No files require special attention. The image-tag and bucket-init idempotency concerns in the k8s manifests were flagged in earlier review threads.

Important Files Changed

| Filename | Overview |
| --- | --- |
| .github/workflows/ux-tests.yml | Adds gcs-local CI matrix entry with correct timeout, workers, memory, and extra_args; skips python:3.9 image pre-pull for local execution; installs google-cloud-storage conditionally; sets GCS emulator env vars before the pytest step. |
| metaflow/plugins/gcp/gs_storage_client_factory.py | Adds STORAGE_EMULATOR_HOST check to _get_gs_storage_client_default so that storage.Client() is used without google.auth.default() when a GCS emulator is active; the factory-level fix makes the conftest.py monkey-patch redundant. |
| test/ux/core/conftest.py | Adds _setup_gcs_emulator() which monkey-patches factory.get_gs_storage_client; now redundant with the factory fix in gs_storage_client_factory.py but harmless as belt-and-suspenders. |
| test/ux/core/test_utils.py | Extends verify_run_provenance to assert ds_type == 'gs' when METAFLOW_DEFAULT_DATASTORE is 'gs'; logic is correct. |
| devtools/tilt/k8s/fake-gcs-server.yaml | Adds Deployment and Service for fake-gcs-server; uses fsouza/fake-gcs-server:latest (floating tag). |
| devtools/tilt/k8s/gcs-bucket-init-job.yaml | One-shot Job to create the metaflow-test bucket; uses curlimages/curl:latest (floating tag); bucket creation is non-idempotent (POST returns 409 on retry), no backoffLimit set so it defaults to 6. |
| devtools/tilt/k8s/fake-gcs-secret.yaml | Kubernetes Secret exposing STORAGE_EMULATOR_HOST with the cluster-internal fake-gcs-server hostname; straightforward and correct. |
| devtools/tilt/fake_gcs_server.tiltfile | Tiltfile extension that applies k8s manifests, sets up port-forwards (4443), and returns result with correct config and shell_env for the GCS emulator. |
| devtools/Tiltfile | Registers fake-gcs-server as a zero-dependency devstack component with no issues. |
| test/ux/ux_test_config.yaml | Adds gcs-local backend definition (scheduler_type: null, no decospec, enabled: true); consistent with other local-execution backends. |

Reviews (5): Last reviewed commit: "Merge branch 'master' into gcs-devstack-..."

Comment thread .github/workflows/ux-tests.yml
spec:
  containers:
    - name: fake-gcs-server
      image: fsouza/fake-gcs-server:latest

P2 Mutable latest image tag reduces reproducibility

fsouza/fake-gcs-server:latest can silently pick up a breaking upstream release between CI runs, making failures hard to diagnose. Pinning to a specific release tag (e.g. 1.15.0) keeps the environment reproducible. The same applies to curlimages/curl:latest in gcs-bucket-init-job.yaml.

restartPolicy: OnFailure
containers:
  - name: init
    image: curlimages/curl:latest

P2 Mutable latest image tag

curlimages/curl:latest is a floating tag; pinning to a digest or specific version (e.g. curlimages/curl:8.7.1) makes the bucket-init job deterministic across CI runs.
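
A check for floating tags of the kind flagged in these two comments could be sketched as below. This is an illustrative helper, not part of the PR: `is_floating` is a hypothetical name, and the rule it applies (no digest, and either no tag or the tag `latest`) is a simplification of image-reference parsing.

```python
def is_floating(image):
    """Return True when an image reference is not pinned to a version or digest."""
    # A digest reference is always pinned.
    if "@sha256:" in image:
        return False
    # Only the last path segment can carry a tag; earlier colons are registry ports.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return True  # No tag means the runtime defaults to "latest".
    tag = last_segment.rsplit(":", 1)[1]
    return tag == "latest"


images = ["fsouza/fake-gcs-server:latest", "curlimages/curl:latest"]
floating = [image for image in images if is_floating(image)]
```

Run over the image fields of the devstack manifests, a check like this would flag both images named in the review comments.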

The gcs-local backend was missing timeout and memory values in the
CI matrix, causing pytest --timeout to receive an empty string.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment on lines +12 to +20
image: curlimages/curl:latest
command: ["/bin/sh", "-ec"]
args:
  - |
    curl -sf -X POST \
      http://fake-gcs-server:4443/storage/v1/b \
      -H "Content-Type: application/json" \
      -d '{"name":"metaflow-test"}'
    echo "Bucket 'metaflow-test' created successfully"

P1 Non-idempotent bucket init breaks on retry

With restartPolicy: OnFailure, if the container is killed after the bucket is created (e.g., OOM eviction, node pressure) but before the Job records success, Kubernetes restarts the container. The second attempt POSTs to an already-existing bucket and gets a 409, which curl -sf treats as a failure. The container then keeps retrying until the Job's backoffLimit is exhausted and the Job enters a permanent Failed state. Handling the 409 makes the script safe to retry:

Suggested change
- image: curlimages/curl:latest
- command: ["/bin/sh", "-ec"]
- args:
-   - |
-     curl -sf -X POST \
-       http://fake-gcs-server:4443/storage/v1/b \
-       -H "Content-Type: application/json" \
-       -d '{"name":"metaflow-test"}'
-     echo "Bucket 'metaflow-test' created successfully"
+ command: ["/bin/sh", "-ec"]
+ args:
+   - |
+     HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" -X POST \
+       http://fake-gcs-server:4443/storage/v1/b \
+       -H "Content-Type: application/json" \
+       -d '{"name":"metaflow-test"}')
+     [ "$HTTP_STATUS" = "200" ] || [ "$HTTP_STATUS" = "409" ] || \
+       (echo "Unexpected status: $HTTP_STATUS" && exit 1)
+     echo "Bucket 'metaflow-test' ready (status: $HTTP_STATUS)"
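
The same retry-safe logic can be sketched in Python for readers who prefer it to shell. This is a hedged illustration rather than anything in the PR: `post_json` and `ensure_bucket` are hypothetical helpers, the `post` parameter exists only so the logic can be exercised without a running emulator, and the accepted statuses (200 and 409) mirror the suggested change above.

```python
import json
import urllib.error
import urllib.request


def post_json(url, payload):
    """POST a JSON payload and return the HTTP status code, including 4xx/5xx."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on non-2xx responses; the status code is still available.
        return error.code


def ensure_bucket(base_url, name, post=post_json):
    """Create the bucket, treating "already exists" (409) as success."""
    status = post(f"{base_url}/storage/v1/b", {"name": name})
    if status not in (200, 409):
        raise RuntimeError(f"Unexpected status: {status}")
    return status
```

Calling `ensure_bucket("http://fake-gcs-server:4443", "metaflow-test")` twice would succeed both times under these assumptions, which is the property the Job needs when Kubernetes restarts the container.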

npow and others added 3 commits April 27, 2026 19:24
The full-stack-test workflow was timing out on generate-configs with
WAIT_TIMEOUT=600 (10 min). CI runners are slow and services sometimes
need longer to initialize.

- Increase WAIT_TIMEOUT from 600 to 900 (15 min)
- Add timeout-minutes: 30 to prevent runaway jobs (was using 6h default)
- Add diagnostic step on failure: dump tilt resource status and recent logs
- Run teardown with if: always() so cleanup happens on failure too

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The conda test takes too long to set up, causing the minikube
port-forwarding to fake-gcs-server to die mid-test. Skip conda
tests for gcs-local since they test conda integration, not the
GCS datastore backend.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>