docs: add CoreWeave full stack examples #1

Merged
TylrDn merged 1 commit into main from codex/audit-repository-for-coreweave-tensorizer on Aug 30, 2025

Conversation

@TylrDn TylrDn commented Aug 29, 2025

Owner

Summary

  • add SUNK, Tensorizer, vLLM and observability docs
  • provide Helm, Knative and Argo CD examples
  • add CI/CD workflow and security policy notes

Testing

  • python -m black examples/tensorizer/serialize_and_load.py
  • isort examples/tensorizer/serialize_and_load.py
  • pip install -r requirements.txt
  • pip install -r tests/requirements.txt
  • python -m unittest discover tests -v (fails: Killed)
  • PYTHONPATH=. python examples/tensorizer/serialize_and_load.py --local-only (fails: ProxyError)
  • linkchecker docs/overview.md (command not found)

https://chatgpt.com/codex/tasks/task_e_68b224c0d1d08323b92fc953c577dc16

Summary by Sourcery

Add comprehensive CoreWeave full-stack examples with documentation, deployment configs, and CI/CD workflows to showcase SUNK, Tensorizer, vLLM, and observability.

New Features:

  • Provide a Python example to serialize, host, and lazily load a tensorized model via HTTP or S3
  • Include SUNK Slurm-on-Kubernetes example with Slurm Pod scheduling and corresponding example scripts
  • Add vLLM integration example with a shell script for launching and testing a tensorized model
  • Offer an observability walkthrough for Grafana dashboards and a Docker Compose local demo

Enhancements:

  • Add GitHub Actions workflow to build, scan, and deploy the vLLM container and upgrade the Helm chart
  • Include CKS-aligned security policy notes with RBAC, NetworkPolicy, image signing, and SBOM guidance

CI:

  • Add .github/workflows/build-and-deploy.yml to automate image build, Trivy scanning, registry push, and Helm upgrade

Deployment:

  • Introduce Helm chart for tensorizer-vllm (deployment, service, ingress templates and values)
  • Provide Knative Service manifest for serverless deployment
  • Add Argo CD Application manifest for GitOps deployment

Documentation:

  • Create documentation pages (overview, SUNK, schedule-k8s, Tensorizer, vLLM, observability, CI/CD, security) and update README with quickstart map

Chores:

  • Add GitOps Argo CD configuration under gitops/ directory

Summary by CodeRabbit

  • New Features

    • Helm chart to deploy vLLM serving tensorized models, plus optional Ingress and Service.
    • Knative Service manifest for tensorizer-enabled vLLM.
    • Argo CD application for GitOps-based deployments.
  • Documentation

    • New guides: Overview, SUNK, Scheduling K8s with Slurm, Tensorizer, vLLM, Observability, CI/CD, and Security Notes.
    • Quickstarts and Deep Dives with diagrams, metrics, and best practices.
    • Updated README with badges, intro, and navigation.
  • Examples

    • End-to-end tensorization script, vLLM smoke-test runner, Grafana walkthrough, and Slurm-to-Pod demo.
  • Chores

    • GitHub Actions workflow to build, scan, push, and deploy images.

sourcery-ai Bot commented Aug 29, 2025

Reviewer's Guide

This PR adds documentation and end-to-end examples demonstrating CoreWeave's full-stack integration (SUNK, Tensorizer, vLLM, observability, CI/CD, and security), along with Helm, Knative, and Argo CD deployment manifests and a Python serialization example.

Sequence diagram for tensorized model serialization, upload, serving, and lazy loading

sequenceDiagram
    actor User
    participant TensorizerScript
    participant S3
    participant HTTPServer
    participant vLLM
    User->>TensorizerScript: Run serialize_and_load.py
    TensorizerScript->>TensorizerScript: serialize(model_id, out_path)
    TensorizerScript->>S3: upload_to_s3(out_path, bucket, key) (optional)
    alt Serve over HTTP
        TensorizerScript->>HTTPServer: serve_file(out_path, port)
    end
    TensorizerScript->>TensorizerScript: load(uri, device, num_readers)
    TensorizerScript->>vLLM: vLLM loads tensorized weights via TensorDeserializer
    vLLM->>User: Model ready for inference
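The diagram ends with load(uri, ...), where the URI points at either the uploaded S3 object or the file served over HTTP. A minimal sketch of how that URI might be chosen is shown here. The helper name resolve_uri and its parameters are hypothetical and not part of the PR; the HTTP form mirrors the localhost URL that the example script builds.

```python
import os
from typing import Optional


def resolve_uri(out_path: str, bucket: Optional[str] = None,
                key: Optional[str] = None, port: int = 8000) -> str:
    """Return the URI the loader should read from (hypothetical helper)."""
    if bucket:
        # S3 branch: default the object key to the file name if none is given.
        return f"s3://{bucket}/{key or os.path.basename(out_path)}"
    # HTTP branch: the file is served from its own directory on localhost.
    return f"http://localhost:{port}/{os.path.basename(out_path)}"
```

Keeping this decision in one pure function makes the S3 and HTTP branches easy to exercise without network access.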

Class diagram for Tensorizer serialization and deserialization example

classDiagram
    class TensorSerializer {
        +__init__(out_path)
        +write_module(model)
    }
    class TensorDeserializer {
        +__init__(uri, device, lazy_load, num_readers)
        +load_into_module(model)
    }
    class AutoModelForCausalLM {
        +from_pretrained(model_id)
    }
    class serialize_and_load {
        +serialize(model_id, out_path)
        +upload_to_s3(path, bucket, key)
        +serve_file(path, port)
        +load(uri, device, num_readers)
        +main()
    }
    serialize_and_load --> TensorSerializer : uses
    serialize_and_load --> TensorDeserializer : uses
    serialize_and_load --> AutoModelForCausalLM : uses
    TensorSerializer --> AutoModelForCausalLM : serializes
    TensorDeserializer --> AutoModelForCausalLM : loads

File-Level Changes

Change Details Files
Updated README with project overview and quickstart navigation
  • Added CI, container scan, and lint badges
  • Introduced architecture diagram and quickstart map
  • Linked to all major docs sections
README.md
docs/overview.md
Added SUNK and Slurm-on-Kubernetes scheduling guides
  • Created SUNK overview and five-minute quickstart
  • Outlined deep-dive details for control plane and worker scaling
  • Documented Slurm-to-K8s pod scheduling with example sbatch script
docs/sunk.md
docs/schedule-k8s-with-slurm.md
examples/sunk/slurm-pod/README.md
examples/sunk/slurm-pod/pod.sbatch
Introduced Tensorizer end-to-end example and docs
  • Wrote Python script to serialize, serve, and lazy-load a GPT-2 model
  • Detailed steps for S3 upload, HTTP serving, and multi-reader lazy loading
  • Documented throughput expectations and Knative/KServe benefits
examples/tensorizer/serialize_and_load.py
docs/tensorizer.md
Added vLLM integration examples and documentation
  • Created run script for serving a tensorized model and smoke test
  • Explained vLLM serve flags, tuning env vars, and Prometheus metrics
  • Referenced Helm chart usage for scale-out
examples/vllm/run_vllm_tensorized.sh
docs/vllm.md
Documented observability setup and local Grafana walkthrough
  • Outlined Grafana dashboard navigation for GPU, network, and pod metrics
  • Provided Loki log queries and Docker Compose local demo
  • Suggested adding screenshots under docs/img
docs/observability.md
examples/observability/grafana/README.md
Defined CI/CD workflow and GitHub Actions pipeline
  • Added build-and-deploy GitHub Actions workflow with build, scan, push, and helm upgrade steps
  • Documented workflow auth via OIDC and secret management
  • Detailed SBOM generation and image signing in CI/CD doc
.github/workflows/build-and-deploy.yml
docs/cicd.md
Included security policy notes and CKS-aligned guidelines
  • Provided RBAC and NetworkPolicy YAML examples
  • Described image signing with Cosign and SBOM generation with Syft
  • Linked continuous security checks into the CI/CD pipeline
security/policy-notes.md
docs/cks.md
Added Helm chart, GitOps and Knative deployment manifests
  • Introduced tensorizer-vllm Helm chart (Chart.yaml, values.yaml, templates)
  • Provided ArgoCD Application manifest for automated sync
  • Included Knative Service YAML for serverless deployment
helm/tensorizer-vllm/Chart.yaml
helm/tensorizer-vllm/values.yaml
helm/tensorizer-vllm/templates/deployment.yaml
helm/tensorizer-vllm/templates/service.yaml
helm/tensorizer-vllm/templates/ingress.yaml
gitops/argocd/app.yaml
k8s/knative-service.yaml

coderabbitai Bot commented Aug 29, 2025

Walkthrough

Adds CI/CD via GitHub Actions to build, scan, push, and deploy a vLLM image with Helm. Introduces Helm chart, Argo CD app, and Knative Service. Provides extensive docs (overview, SUNK, Tensorizer, vLLM, observability, CI/CD, security) and runnable examples (tensorizer script, vLLM smoke test, SUNK Slurm pod, Grafana walkthrough).

Changes

Cohort / File(s) Summary of Changes
CI/CD Workflow
.github/workflows/build-and-deploy.yml
New workflow: checkout, Docker Buildx build, Trivy scan, GHCR login/push, Helm upgrade/install targeting tensorizer-vllm in test namespace.
Docs — Overview & Guides
README.md, docs/overview.md, docs/tensorizer.md, docs/vllm.md, docs/observability.md, docs/cicd.md, docs/cks.md, docs/schedule-k8s-with-slurm.md, docs/sunk.md
Adds project intro, quickstarts, and deep dives covering stack overview, tensorization, vLLM usage, observability, CI/CD, security, SUNK, and Slurm-to-K8s scheduling.
Examples — Tensorizer & vLLM
examples/tensorizer/serialize_and_load.py, examples/vllm/run_vllm_tensorized.sh
New Python script to serialize, optionally upload/serve, and lazy-load a model; new bash script to run vLLM with tensorized weights and perform a smoke test.
Examples — SUNK Slurm Pod
examples/sunk/slurm-pod/README.md, examples/sunk/slurm-pod/pod.sbatch
Example Slurm batch job that creates a Kubernetes Pod via kubectl; README with run/verify/cleanup steps.
Observability Example
examples/observability/grafana/README.md
Grafana walkthrough for monitoring vLLM metrics and logs; local Docker Compose usage.
Helm Chart — tensorizer-vllm
helm/tensorizer-vllm/Chart.yaml, helm/tensorizer-vllm/values.yaml, helm/tensorizer-vllm/templates/deployment.yaml, helm/tensorizer-vllm/templates/service.yaml, helm/tensorizer-vllm/templates/ingress.yaml
New chart to deploy vLLM with tensorizer; configurable image, model URI, and host; includes Deployment, Service, and Ingress.
GitOps — Argo CD
gitops/argocd/app.yaml
Adds Argo CD Application pointing to the Helm chart with automated sync, prune, and self-heal.
Knative Service
k8s/knative-service.yaml
New Knative Service manifest to run vLLM with tensorizer and autoscaling settings.
Security Notes
security/policy-notes.md
Adds RBAC and NetworkPolicy examples; notes on image signing and SBOMs.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Dev as Developer
  participant GH as GitHub Actions
  participant D as Docker/Buildx
  participant S as Trivy
  participant R as GHCR
  participant H as Helm
  participant K as Kubernetes

  Dev->>GH: Push to main
  GH->>D: Build image (vLLM)
  D-->>GH: Image built (tag: example/vllm:${sha})
  GH->>S: Scan image
  S-->>GH: Scan results (pass)
  GH->>R: Login & push image
  GH->>H: helm upgrade --install tensorizer-vllm
  H->>K: Apply Deployment/Service/Ingress
  K-->>H: Resources ready
  H-->>GH: Deploy complete
sequenceDiagram
  autonumber
  actor U as User
  participant I as Ingress/Knative
  participant Svc as Service
  participant Pod as vLLM Pod
  note over Pod: vllm serve --tensorizer<br/>--model s3://... or http://...
  U->>I: HTTP request (/generate)
  I->>Svc: Route request
  Svc->>Pod: Forward to port 8000
  Pod-->>U: Response (tokens, metrics at /metrics)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I thump with glee at charts that bloom,
A pipeline hops from build to zoom—
Scan, then helm, to clusters glide,
vLLM serves with tensors pride.
Grafana stars, Argo’s tune,
Slurm meets pods—oh what a boon!
Carrot commits, shipped by noon. 🥕🐇

@sourcery-ai sourcery-ai Bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

Blocking issues:

  • Three actions sourced from third-party repositories on GitHub are not pinned to a full-length commit SHA (one finding per action; see the security issues reported for .github/workflows/build-and-deploy.yml). Pinning an action to a full-length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload. (links)

General comments:

  • In examples/tensorizer/serialize_and_load.py, avoid using os.chdir() to set the serve directory and instead supply the directory parameter to SimpleHTTPRequestHandler to prevent global working-directory side effects.
  • Add logic to gracefully shut down the HTTP server thread after load completes in serialize_and_load.py to avoid leaving orphan background threads.
  • Verify that all relative links in the new docs (for example in schedule-k8s-with-slurm.md) resolve correctly in the rendered site to prevent broken navigation.
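The first two comments can be addressed together. The following is a minimal sketch, not the PR's code: serve_directory and stop_server are hypothetical names, and binding to 127.0.0.1 is an assumption for a local demo (the PR binds 0.0.0.0). It relies on the directory keyword argument of SimpleHTTPRequestHandler, available since Python 3.7, so the process working directory is never changed.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from typing import Tuple


def serve_directory(directory: str, port: int = 0) -> Tuple[ThreadingHTTPServer, threading.Thread]:
    """Serve `directory` over HTTP in a background thread without os.chdir()."""
    # Bind the directory to the handler instead of changing the global cwd.
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    # Port 0 lets the OS pick a free port; read it from server.server_address.
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, thread


def stop_server(server: ThreadingHTTPServer, thread: threading.Thread) -> None:
    """Stop serve_forever(), release the socket, and wait for the thread."""
    server.shutdown()      # blocks until the serve_forever() loop exits
    server.server_close()  # closes the listening socket
    thread.join(timeout=5)
```

A caller would keep the returned pair, run the load step against http://127.0.0.1:<port>/<file>, and then call stop_server(server, thread), ideally in a finally block so the thread is cleaned up even if loading fails.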
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In examples/tensorizer/serialize_and_load.py, avoid using os.chdir() to set the serve directory and instead supply the directory parameter to SimpleHTTPRequestHandler to prevent global working-directory side effects.
- Add logic to gracefully shut down the HTTP server thread after load completes in serialize_and_load.py to avoid leaving orphan background threads.
- Verify that all relative links in the new docs (for example in schedule-k8s-with-slurm.md) resolve correctly in the rendered site to prevent broken navigation.

## Individual Comments

### Comment 1
<location> `examples/tensorizer/serialize_and_load.py:31` </location>
<code_context>
+    s3.upload_file(path, bucket, key)
+
+
+def serve_file(path: str, port: int) -> threading.Thread:
+    directory = os.path.dirname(os.path.abspath(path))
+    os.chdir(directory)
+    server = ThreadingHTTPServer(("0.0.0.0", port), SimpleHTTPRequestHandler)
+    thread = threading.Thread(target=server.serve_forever, daemon=True)
</code_context>

<issue_to_address>
Changing the working directory with os.chdir may have unintended side effects.

Using os.chdir can impact other threads or components. To avoid this, use the 'directory' parameter in SimpleHTTPRequestHandler if available.
</issue_to_address>

### Comment 2
<location> `examples/tensorizer/serialize_and_load.py:24` </location>
<code_context>
+    serializer.write_module(model)
+
+
+def upload_to_s3(path: str, bucket: str, key: str) -> None:
+    import boto3
+
+    s3 = boto3.client("s3")
+    s3.upload_file(path, bucket, key)
+
+
</code_context>

<issue_to_address>
No error handling for S3 upload failures.

Add error handling to manage S3 upload exceptions and consider implementing retry logic for transient errors.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def upload_to_s3(path: str, bucket: str, key: str) -> None:
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file(path, bucket, key)
=======
import time

def upload_to_s3(path: str, bucket: str, key: str, max_retries: int = 3, delay: float = 2.0) -> None:
    import boto3
    import botocore.exceptions

    s3 = boto3.client("s3")
    attempt = 0
    while attempt < max_retries:
        try:
            s3.upload_file(path, bucket, key)
            print(f"Successfully uploaded {path} to s3://{bucket}/{key}")
            return
        except botocore.exceptions.BotoCoreError as e:
            attempt += 1
            print(f"S3 upload failed (attempt {attempt}/{max_retries}): {e}")
            if attempt < max_retries:
                print(f"Retrying in {delay} seconds...")
                time.sleep(delay)
            else:
                print("Max retries reached. Upload failed.")
                raise
        except Exception as e:
            print(f"Unexpected error during S3 upload: {e}")
            raise
>>>>>>> REPLACE

</suggested_fix>

### Comment 3
<location> `helm/tensorizer-vllm/templates/service.yaml:9` </location>
<code_context>
+  selector:
+    app: {{ .Release.Name }}
+  ports:
+    - port: 80
+      targetPort: 8000
</code_context>

<issue_to_address>
Mapping port 80 to 8000 may cause confusion or conflicts.

To prevent potential port conflicts, make the service port configurable or choose a less common default port.
</issue_to_address>

## Security Issues

### Issue 1
<location> `.github/workflows/build-and-deploy.yml:13` </location>

<issue_to_address>
**security (yaml.github-actions.security.third-party-action-not-pinned-to-commit-sha):** An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload.

*Source: opengrep*
</issue_to_address>

### Issue 2
<location> `.github/workflows/build-and-deploy.yml:17` </location>

<issue_to_address>
**security (yaml.github-actions.security.third-party-action-not-pinned-to-commit-sha):** An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload.

*Source: opengrep*
</issue_to_address>

### Issue 3
<location> `.github/workflows/build-and-deploy.yml:21` </location>

<issue_to_address>
**security (yaml.github-actions.security.third-party-action-not-pinned-to-commit-sha):** An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload.

*Source: opengrep*
</issue_to_address>

Comment on lines +31 to +33
def serve_file(path: str, port: int) -> threading.Thread:
    directory = os.path.dirname(os.path.abspath(path))
    os.chdir(directory)

issue (bug_risk): Changing the working directory with os.chdir may have unintended side effects.

Using os.chdir can impact other threads or components. To avoid this, use the 'directory' parameter in SimpleHTTPRequestHandler if available.

Comment on lines +24 to +28
def upload_to_s3(path: str, bucket: str, key: str) -> None:
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file(path, bucket, key)

suggestion: No error handling for S3 upload failures.

Add error handling to manage S3 upload exceptions and consider implementing retry logic for transient errors.

Suggested change
-def upload_to_s3(path: str, bucket: str, key: str) -> None:
-    import boto3
-
-    s3 = boto3.client("s3")
-    s3.upload_file(path, bucket, key)
+import time
+
+def upload_to_s3(path: str, bucket: str, key: str, max_retries: int = 3, delay: float = 2.0) -> None:
+    import boto3
+    import botocore.exceptions
+
+    s3 = boto3.client("s3")
+    attempt = 0
+    while attempt < max_retries:
+        try:
+            s3.upload_file(path, bucket, key)
+            print(f"Successfully uploaded {path} to s3://{bucket}/{key}")
+            return
+        except botocore.exceptions.BotoCoreError as e:
+            attempt += 1
+            print(f"S3 upload failed (attempt {attempt}/{max_retries}): {e}")
+            if attempt < max_retries:
+                print(f"Retrying in {delay} seconds...")
+                time.sleep(delay)
+            else:
+                print("Max retries reached. Upload failed.")
+                raise
+        except Exception as e:
+            print(f"Unexpected error during S3 upload: {e}")
+            raise
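
The retry logic in the suggestion can also be kept separate from the upload call. Below is a minimal sketch with exponential backoff. The helper retry and its parameters (retry_on, max_attempts, base_delay, sleep) are hypothetical names, not part of the PR or of boto3. One caveat worth checking: as far as I know, boto3's upload_file reports failed transfers as boto3.exceptions.S3UploadFailedError, which does not derive from BotoCoreError, so a caller would likely want to pass (BotoCoreError, ClientError, S3UploadFailedError) as retry_on.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T],
          retry_on: Tuple[Type[BaseException], ...],
          max_attempts: int = 3,
          base_delay: float = 2.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation`, retrying with exponential backoff on `retry_on` errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each time: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

With this shape, the upload would be wrapped as retry(lambda: s3.upload_file(path, bucket, key), retry_on=...), and exceptions outside retry_on propagate immediately without a retry.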

Comment on lines +9 to +10
    - port: 80
      targetPort: 8000

suggestion (bug_risk): Mapping port 80 to 8000 may cause confusion or conflicts.

To prevent potential port conflicts, make the service port configurable or choose a less common default port.

    steps:
      - uses: actions/checkout@v4
      - name: Set up Docker
        uses: docker/setup-buildx-action@v2

security (yaml.github-actions.security.third-party-action-not-pinned-to-commit-sha): An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload.

Source: opengrep

      - name: Build
        run: docker build -t example/vllm:${{ github.sha }} .
      - name: Scan
        uses: aquasecurity/trivy-action@0.20.0

security (yaml.github-actions.security.third-party-action-not-pinned-to-commit-sha): An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload.

Source: opengrep

        with:
          image-ref: example/vllm:${{ github.sha }}
      - name: Login
        uses: docker/login-action@v3

security (yaml.github-actions.security.third-party-action-not-pinned-to-commit-sha): An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload.

Source: opengrep
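
The pinning rule above can be approximated locally before pushing. The following is a rough sketch, not the opengrep rule itself: unpinned_actions and trusted_owners are hypothetical names, and treating the actions and github owners as first-party is an assumption made to match the three findings reported here. It uses a line-based regular expression rather than a YAML parser, so unusual formatting may be missed.

```python
import re
from typing import List, Tuple

USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s#]+)")
FULL_SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str,
                     trusted_owners: Tuple[str, ...] = ("actions", "github")) -> List[str]:
    """Return `uses:` references that are not pinned to a 40-character commit SHA."""
    findings = []
    for line in workflow_text.splitlines():
        match = USES_RE.match(line)
        if not match:
            continue
        ref = match.group(1).strip("'\"")
        if ref.startswith("./") or ref.startswith("docker://"):
            continue  # local actions and container images are out of scope here
        name, _, version = ref.partition("@")
        owner = name.split("/", 1)[0]
        if owner in trusted_owners:
            continue  # assumption: GitHub-owned actions are treated as first-party
        if not FULL_SHA_RE.match(version):
            findings.append(ref)
    return findings
```

Run against the workflow steps quoted in these comments, the sketch would report docker/setup-buildx-action@v2, aquasecurity/trivy-action@0.20.0, and docker/login-action@v3, and would skip actions/checkout@v4 under the first-party assumption.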

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 24

🧹 Nitpick comments (43)
k8s/knative-service.yaml (4)

13-13: Be explicit about the entrypoint

Relying on the image's ENTRYPOINT being vllm may fail. Set command to ["vllm"] explicitly so that args: ["serve", ...] is valid.

-          args: ["serve", "--model", "s3://my-bucket/models/tiny-gpt2.tensors", "--tensorizer"]
+          command: ["vllm"]
+          args: ["serve", "--model", "s3://my-bucket/models/tiny-gpt2.tensors", "--tensorizer"]

14-15: Add health probes to avoid flapping revisions

Knative benefits from readiness and liveness probes. vLLM exposes an HTTP endpoint, so wire the probes to it to keep traffic away from revisions that are still starting or have failed.

           ports:
             - containerPort: 8000
+          readinessProbe:
+            httpGet: { path: /health, port: 8000 }
+            initialDelaySeconds: 10
+            periodSeconds: 5
+          livenessProbe:
+            httpGet: { path: /health, port: 8000 }
+            initialDelaySeconds: 30
+            periodSeconds: 10

Adjust path if your vLLM build uses a different health route.


8-9: Cold starts likely with minScale=0

Scale-to-zero is fine for cost, but adds latency on first request. For demos/benchmarks set minScale=1.

-        autoscaling.knative.dev/minScale: "0"
+        autoscaling.knative.dev/minScale: "1"

11-15: Harden the pod

Consider non-root, read-only FS, and dropped capabilities.

         - image: vllm/vllm:latest
+          securityContext:
+            runAsNonRoot: true
+            allowPrivilegeEscalation: false
+            readOnlyRootFilesystem: true
+            capabilities: { drop: ["ALL"] }
examples/observability/grafana/README.md (2)

12-15: Make the “open Grafana” step cross‑platform

open works on macOS only. Prefer echoing the URL and try xdg-open/open/start.

 ```bash
-docker compose up -d
-open http://localhost:3000
+docker compose up -d
+echo "Grafana: http://localhost:3000"
+# Try to open in a browser (Linux/macOS/Windows PowerShell)
+xdg-open http://localhost:3000 2>/dev/null || open http://localhost:3000 2>/dev/null || start http://localhost:3000 2>/dev/null || true

3-8: Tighten wording and metric names (minor)

Optional polish for readability; no content change.

-Open the CoreWeave Grafana instance and inspect the following dashboards while
-running the vLLM demo:
+Open the CoreWeave Grafana instance and, while running the vLLM demo, inspect:
examples/vllm/run_vllm_tensorized.sh (2)

8-14: Add readiness wait, JSON header, and robust cleanup

Prevents flakiness from fixed sleep, ensures correct content type, and cleans up the server on exit.

-vllm serve --model "$MODEL_URI" --tensorizer --port "$PORT" &
-SERVER_PID=$!
-
-sleep 5
-curl -sS http://localhost:$PORT/generate -d '{"prompt":"Hello","max_tokens":8}'
-
-kill $SERVER_PID
+vllm serve --model "$MODEL_URI" --tensorizer --port "$PORT" &
+SERVER_PID=$!
+cleanup() { kill "$SERVER_PID" 2>/dev/null || true; wait "$SERVER_PID" 2>/dev/null || true; }
+trap cleanup EXIT
+
+# Wait up to ~30s for the port to accept connections
+for _ in {1..30}; do
+  if timeout 1 bash -c ">/dev/tcp/127.0.0.1/$PORT" 2>/dev/null; then break; fi
+  sleep 1
+done
+
+curl -fsS "http://localhost:$PORT/generate" \
+  -H 'Content-Type: application/json' \
+  -d '{"prompt":"Hello","max_tokens":8}'
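
The same readiness wait can be expressed in Python for callers that drive the smoke test from a script. This is a minimal sketch in which wait_for_port and its parameters are hypothetical names. Note that an open port only shows that the server socket is accepting connections. It does not prove the model has finished loading, so polling a health route, if the vLLM build exposes one, would be a stronger check.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not listening yet; retry until the deadline
    return False
```

Using time.monotonic() for the deadline keeps the wait correct even if the system clock is adjusted while the loop runs.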

5-6: Validate prerequisites (optional)

Fail fast if vllm CLI is missing or port is occupied.

 MODEL_URI=${1:-s3://my-bucket/models/tiny-gpt2.tensors}
 PORT=${PORT:-8000}
+command -v vllm >/dev/null || { echo "vllm CLI not found on PATH"; exit 1; }
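The suggested line covers the missing CLI; the "port is occupied" half could look roughly like the sketch below. `port_is_free` and `preflight` are hypothetical names, and the check is best-effort since another process could still grab the port between the probe and the real bind.

```python
import shutil
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def preflight(cli: str, port: int) -> list[str]:
    """Collect human-readable problems instead of stopping at the first one."""
    problems = []
    if shutil.which(cli) is None:
        problems.append(f"{cli} CLI not found on PATH")
    if not port_is_free(port):
        problems.append(f"port {port} is already in use")
    return problems
```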
docs/overview.md (2)

3-5: Minor wording/hyphenation

Hyphenate “CoreWeave‑aligned” and “high‑performance”; no semantic change.

-This repository demonstrates a CoreWeave aligned stack for high performance model serving.
+This repository demonstrates a CoreWeave‑aligned stack for high‑performance model serving.

14-16: Verify the Quickstart is offline‑friendly

The prior test run hit a ProxyError. If the demo fetches remote artifacts, add a note about offline mode and credentials, or switch to a purely local artifact path.

docs/observability.md (1)

7-11: Make Grafana navigation unambiguous (optional)

Name a dashboard explicitly to reduce guesswork (e.g., “Kubernetes / Workloads / Pods”).

README.md (2)

1-1: Deduplicate H1 title.

Two top-level “# tensorizer” headers (Line 1 here and Line 22 below) read oddly. Keep one.


13-20: Quickstart links look good. Minor polish optional.

Consider adding brief one-line descriptions after each link for scannability.

examples/tensorizer/serialize_and_load.py (2)

63-73: Small readiness race when starting the HTTP server.

Add a short wait or probe to avoid connection refusals on fast clients.

Apply:

     else:
         port = 8000
         serve_file(out_path, port)
+        import time
+        time.sleep(0.2)  # give server a moment to start
         uri = f"http://localhost:{port}/{os.path.basename(out_path)}"
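An alternative to a fixed sleep is to make sure the socket is already listening before the client starts. The sketch below assumes the helper is free to build the server itself; `start_file_server` is a hypothetical stand-in, not the `serve_file` function from the PR, whose implementation is not shown here.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def start_file_server(directory: str, port: int = 0) -> ThreadingHTTPServer:
    """Serve `directory` over HTTP from a daemon thread and return the server."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    # The constructor binds and listens, so clients can connect as soon as it
    # returns; early connections wait in the backlog until serve_forever runs.
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Passing port 0 lets the OS pick a free port, which `server.server_address[1]` then reports; binding to 127.0.0.1 also sidesteps the Ruff S104 warning about listening on all interfaces.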

24-29: S3 upload lacks basic error handling.

Wrap boto3 call to surface actionable errors for missing creds/bucket.

Apply:

 def upload_to_s3(path: str, bucket: str, key: str) -> None:
     import boto3
-
-    s3 = boto3.client("s3")
-    s3.upload_file(path, bucket, key)
+    from botocore.exceptions import BotoCoreError, ClientError
+    s3 = boto3.client("s3")
+    try:
+        s3.upload_file(path, bucket, key)
+    except (BotoCoreError, ClientError) as e:
+        raise RuntimeError(f"Failed to upload {path} to s3://{bucket}/{key}: {e}") from e
docs/tensorizer.md (3)

19-20: Align num_readers with script default or mention the flag.

Either show num_readers=4 (default) or note that you can set --num-readers 8.

Apply:

-3. `TensorDeserializer(..., device=..., lazy_load=True, num_readers=8)` streams
+3. `TensorDeserializer(..., device=..., lazy_load=True, num_readers=4)` streams
+   (use `--num-readers 8` to increase concurrency if your source supports range requests)

21-22: Fix product name capitalization.

Knative (not KNative).

Apply:

-4. KNative/KServe benefit from faster cold starts because weights are fetched
+4. Knative/KServe benefit from faster cold starts because weights are fetched

24-24: Optional: qualify the throughput claim.

Consider adding “depending on model/dtype and storage backend” to set expectations.
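To make the qualification concrete, the headline number is just the link rate divided by eight bits per byte. The sketch below works that arithmetic through; the function names and the `efficiency` knob are illustrative assumptions, not measured values.

```python
def line_rate_gb_per_s(link_gbit_per_s: float) -> float:
    """Convert a link speed in gigabits/s to gigabytes/s (8 bits per byte)."""
    return link_gbit_per_s / 8.0


def min_load_seconds(model_gb: float, link_gbit_per_s: float, efficiency: float = 1.0) -> float:
    """Lower bound on transfer time; efficiency < 1 models protocol and storage overhead."""
    return model_gb / (line_rate_gb_per_s(link_gbit_per_s) * efficiency)
```

For example, 40GbE gives 40 / 8 = 5 GB/s, so a 10 GB checkpoint needs at least 2 seconds on the wire before any storage-backend or dtype effects are counted.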

helm/tensorizer-vllm/Chart.yaml (1)

1-5: Mark chart type for the chart apiVersion v2 schema.

Explicitly set type: application.

Apply:

 apiVersion: v2
 name: tensorizer-vllm
 version: 0.1.0
 appVersion: "0.1.0"
 description: Deploy vLLM serving tensorized models
+type: application
helm/tensorizer-vllm/templates/service.yaml (1)

3-10: Improve Service metadata and port naming.

Add standard labels and a named port for clarity/Prometheus scraping.

Apply:

 metadata:
   name: {{ .Release.Name }}
+  labels:
+    app.kubernetes.io/name: {{ .Release.Name }}
 spec:
   selector:
     app: {{ .Release.Name }}
   ports:
-    - port: 80
+    - name: http
+      port: 80
       targetPort: 8000
helm/tensorizer-vllm/templates/deployment.yaml (1)

5-14: Consider making resources configurable via values.

Avoid hardcoding; let users set CPU/memory/GPU.

Example (template-side):

 spec:
   replicas: 1
   selector:
     matchLabels:
       app: {{ .Release.Name }}
   template:
     metadata:
       labels:
         app: {{ .Release.Name }}
     spec:
       containers:
         - name: vllm
+          resources:
+            {{- toYaml .Values.resources | nindent 12 }}

And add in values.yaml:

resources:
  requests: { cpu: "1", memory: "4Gi" }
  limits:   { cpu: "2", memory: "8Gi" }
  # limits:
  #   nvidia.com/gpu: 1  # if GPU required
docs/schedule-k8s-with-slurm.md (1)

16-19: Add language to fenced code block (markdownlint MD040).

-```
+```text
 NAME          READY   STATUS    RESTARTS   AGE
 slurm-pod-0   1/1     Running   0          1m

examples/sunk/slurm-pod/pod.sbatch (1)

1-5: Optional: wait for readiness and clean up.

Improves UX for demos.

 #SBATCH --output=slurm-pod.log
 
-srun kubectl run slurm-pod-0 --image=busybox --restart=Never --command -- sh -c 'echo hello-world; sleep 30'
+srun kubectl run slurm-pod-0 --image=busybox --restart=Never --labels=job-name=slurm-pod --command -- sh -c 'echo hello-world; sleep 30'
+kubectl wait --for=condition=Ready pod/slurm-pod-0 --timeout=60s || true
+# Optional: show logs, then clean up
+kubectl logs slurm-pod-0 || true
+kubectl delete pod slurm-pod-0 --ignore-not-found
examples/sunk/slurm-pod/README.md (3)

13-17: Make the run flow deterministic and observable.

Capture the job ID, wait for the Pod readiness by label, then fetch logs. This avoids races.

 ```bash
-sbatch pod.sbatch
-squeue -u $USER
-kubectl get pods -l job-name=slurm-pod
+JOBID=$(sbatch --parsable pod.sbatch)
+squeue -j "$JOBID"
+kubectl wait --for=condition=Ready pod -l job-name=slurm-pod --timeout=120s
+kubectl logs slurm-pod-0

19-24: Tighten success criteria phrasing.

Use consistent casing and precise conditions.

-- `squeue` shows the job in `RUNNING`
-- `kubectl get pods` shows the pod in `Running`
-- `kubectl logs slurm-pod-0` prints `hello-world`
+- `squeue` shows the job as `RUNNING`
+- `kubectl get pods` shows the Pod `Running` (or `Ready` after the wait)
+- `kubectl logs slurm-pod-0` includes `hello-world`

25-29: Ensure full cleanup (Slurm job and spawned Pod).

Canceling the job may leave the Pod. Offer a label-based delete to avoid leaks.

 ```bash
-scancel <jobid>
+scancel <jobid>
+kubectl delete pod -l job-name=slurm-pod --ignore-not-found

gitops/argocd/app.yaml (2)

6-9: Create target namespace automatically.

Default namespace is fine for demos, but creating a dedicated namespace improves isolation.

   destination:
     server: https://kubernetes.default.svc
-    namespace: default
+    namespace: vllm

And enable CreateNamespace:

   syncPolicy:
     automated:
       prune: true
       selfHeal: true
+    syncOptions:
+      - CreateNamespace=true

13-17: Add sync retries for stability.

Self-heal is on; add retries with backoff so transient failures don't leave a sync stuck in a failed state.

   project: default
   syncPolicy:
     automated:
       prune: true
       selfHeal: true
+    retry:
+      limit: 5
+      backoff:
+        duration: 5s
+        factor: 2
+        maxDuration: 3m
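For a sense of what this retry block means in wall-clock terms, here is a rough sketch of the delay schedule. It assumes the base duration is multiplied by `factor` after each failed attempt and that each delay is capped at `maxDuration`; `retry_delays` is a hypothetical helper, not part of Argo CD.

```python
def retry_delays(limit: int, duration_s: float, factor: float, max_duration_s: float) -> list[float]:
    """Delay before each retry: duration * factor**attempt, capped at max_duration_s."""
    return [min(duration_s * factor**attempt, max_duration_s) for attempt in range(limit)]


if __name__ == "__main__":
    # Values from the suggested syncPolicy: limit 5, duration 5s, factor 2, maxDuration 3m.
    delays = retry_delays(limit=5, duration_s=5, factor=2, max_duration_s=180)
    print(delays, "total:", sum(delays), "seconds")
```

Under those assumptions the five retries wait 5, 10, 20, 40, and 80 seconds, so the cap of 3 minutes is never reached with a limit of 5.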
docs/cicd.md (3)

7-11: Clarify Quickstart sequencing and fix wrapping.

Combine the wrapped sentence and explicitly mention registry push.

-2. The workflow builds the image with `docker build` and scans it with Trivy.
-3. If successful the image is pushed to the registry and Helm is upgraded in a
-test namespace.
+2. The workflow builds the image with `docker build` and scans it with Trivy.
+3. On success, it pushes the image to the registry and upgrades the Helm release in a test namespace.

14-18: Verify the workflow actually implements Cosign/Syft/Trivy gates.

Docs mention OIDC auth, External Secrets, GitOps commits, Cosign, and SBOMs. Ensure .github/workflows/build-and-deploy.yml has these steps and required permissions (e.g., contents: write for GitOps commits; OIDC to GHCR; Cosign keyless).

I can align the workflow with the doc (Cosign keyless sign, Syft SBOM upload, Trivy PR gate) if you want a patch.


14-16: Name the exact secrets and permissions.

Minimal additions help users succeed.

Add a short list, for example:

  • Required permissions: id-token: write; contents: write; packages: write.
  • Required secrets (only if not using keyless signing): COSIGN_PRIVATE_KEY, COSIGN_PASSWORD.
  • External Secrets references for registry creds if pushing outside GHCR.
docs/vllm.md (3)

8-12: Quickstart: call out credentials and port.

S3 URIs require credentials and the server binds a port; add one-liners to reduce first‑run failures.

 ```bash
-bash examples/vllm/run_vllm_tensorized.sh s3://my-bucket/models/tiny-gpt2.tensors
+# Set credentials if using S3
+export AWS_REGION=us-east-1
+export AWS_ACCESS_KEY_ID=...; export AWS_SECRET_ACCESS_KEY=...
+# Launch
+bash examples/vllm/run_vllm_tensorized.sh s3://my-bucket/models/tiny-gpt2.tensors
+# Default API is on port 8000
+curl -s localhost:8000/v1/models || true

16-23: Confirm flag support and provide tuning pointers.

--tensorizer and env names can drift across versions. Ask users to match their vLLM version and add a basic throughput knob.

-1. `vllm serve --tensorizer` reads weights from disk, HTTP, or S3.
+1. `vllm serve --tensorizer` reads weights from disk, HTTP, or S3 (verify your vLLM version supports this flag).
 2. Environment variables like `VLLM_WORKER_GPU_MEMORY_UTILIZATION` tune
 throughput vs. memory usage.
+   - Also consider the `--max-model-len` and `--cpu-offload-gb` flags, sized to the available GPU memory.
 3. Prometheus metrics at `/metrics` expose time‑to‑first‑token and tokens/sec.

20-22: Link chart usage to values that matter for vLLM.

Mention key values so users don’t hunt.

Add a sentence after the Helm reference:

  • Important values: image, modelURI, resources (GPU), replicas, and env (VLLM_*).
docs/sunk.md (4)

3-5: Grammar and brevity.

Remove double spaces and tighten phrasing.

-SUNK runs Slurm control and worker nodes inside Kubernetes Pods.  Slurm and
-native Kubernetes workloads can share the same cluster while maintaining
-isolation.
+SUNK runs Slurm control and worker nodes inside Kubernetes Pods. Slurm and native Kubernetes workloads share the same cluster while maintaining isolation.

22-29: Install into a dedicated namespace and wait for readiness.

Improves reproducibility and avoids racing job submission.

 ```bash
-# install SUNK operator
-helm repo add sunk https://coreweave.github.io/sunk
-helm install sunk sunk/sunk-operator
+# install SUNK operator
+helm repo add sunk https://coreweave.github.io/sunk
+helm install sunk sunk/sunk-operator -n sunk-system --create-namespace
+kubectl -n sunk-system rollout status deploy/sunk-operator --timeout=120s

33-36: Clarify K8s plugin requirement.

Make it explicit the Slurm Kubernetes plugin is needed to launch native Pods.

-3. Slurm's Kubernetes plugin can launch native Pods alongside batch jobs.
+3. With the Slurm Kubernetes plugin enabled, Slurm can launch native Pods alongside batch jobs.

38-39: Link text polish.

Add a comma for flow and simplify the link target (folder is enough).

-For a demo of creating a Kubernetes Pod from Slurm see
-[examples/sunk/slurm-pod](../examples/sunk/slurm-pod/README.md).
+For a demo of creating a Kubernetes Pod from Slurm, see
+[examples/sunk/slurm-pod](../examples/sunk/slurm-pod/).
security/policy-notes.md (1)

35-38: Image-signing: add concrete commands and admission policy example.

Make the guidance actionable with signing/attestation and policy verification.

Append:

# Build SBOM and attach
syft app-image:latest -o spdx-json > sbom.spdx.json
cosign attach sbom --sbom sbom.spdx.json app-image:latest

# Keyless sign (OIDC)
COSIGN_YES=true cosign sign app-image:latest

# Kyverno verifyImages (keyless) – example
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-signed-images
spec:
  rules:
    - name: require-cosign
      match:
        any:
          - resources:
              kinds: ["Pod"]
      verifyImages:
        - imageReferences: ["ghcr.io/*/*"]
          attestations:
            - type: spdx
          keyless:
            issuer: "https://token.actions.githubusercontent.com"
            subject: "repo:${{ org }}/*"

Nit: remove double spaces after periods for consistency (“Cosign). Generate SBOMs…”).

.github/workflows/build-and-deploy.yml (1)

11-27: Harden workflow: pin actions, cache builds, and fail on high vulns.

Supply-chain and performance improvements.

  • Pin actions to commit SHAs.
  • Use docker/build-push-action with GHA cache (type=gha).
  • Configure Trivy to fail on HIGH/CRITICAL and ignore-unfixed as needed:
       - name: Scan
         uses: aquasecurity/trivy-action@0.20.0
         with:
-          image-ref: ${{ env.IMAGE }}
+          image-ref: ${{ env.IMAGE }}
+          vuln-type: 'os,library'
+          severity: 'HIGH,CRITICAL'
+          exit-code: '1'
+          ignore-unfixed: true
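The gate these inputs describe can be sketched as a filter over a Trivy JSON report. The report shape used here (`Results[].Vulnerabilities[]` with `Severity`, `FixedVersion`, and `VulnerabilityID` fields) is an assumption about Trivy's JSON output, and `blocking_findings` is a hypothetical helper rather than anything the action exposes.

```python
def blocking_findings(report: dict, severities=("HIGH", "CRITICAL"), ignore_unfixed: bool = True) -> list[str]:
    """Return IDs of vulnerabilities that should fail the build."""
    blocked = []
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") not in severities:
                continue  # below the configured severity threshold
            if ignore_unfixed and not vuln.get("FixedVersion"):
                continue  # no fix available yet, so do not block on it
            blocked.append(vuln.get("VulnerabilityID", "unknown"))
    return blocked
```

A CI wrapper would exit with status 1 when the returned list is non-empty, which is the behaviour `exit-code: '1'` asks the action for.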
docs/cks.md (3)

8-11: Quickstart: call out default-deny egress and binding to a dedicated SA.

Prevents accidental wide egress and avoids using default service account.

Proposed bullets:

  • Apply a namespace-level default-deny egress policy; then allow only required egress (DNS, object storage).
  • Bind workloads to a dedicated ServiceAccount via Role/RoleBinding (avoid default SA).

14-18: NetworkPolicy and admission notes: add DNS allowance and clarify tooling.

Mention explicit DNS egress and name an admission tool to reduce ambiguity.

  • “NetworkPolicies restrict pod egress…” → add “and permit DNS to kube-system on TCP/UDP 53.”
  • “Admission policies (OPA/Gatekeeper or Kyverno)…” → include a minimal example link/snippet as in security/policy-notes.md.
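One way to picture the policy these bullets describe is to build the manifest as data. The sketch below is illustrative only: the policy name, the `test` namespace, and the use of the `kubernetes.io/metadata.name` namespace label to select kube-system are assumptions, and the output is JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import json


def default_deny_egress_with_dns(namespace: str) -> dict:
    """NetworkPolicy that blocks all egress in `namespace` except DNS to kube-system."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-egress", "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector: applies to every Pod in the namespace
            "policyTypes": ["Egress"],
            "egress": [
                {
                    "to": [{"namespaceSelector": {"matchLabels": {"kubernetes.io/metadata.name": "kube-system"}}}],
                    "ports": [{"protocol": "UDP", "port": 53}, {"protocol": "TCP", "port": 53}],
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(default_deny_egress_with_dns("test"), indent=2))
```

Further egress rules (for example, object storage endpoints) would be added as extra entries in the `egress` list.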

19-20: Link text vs href mismatch.

Text says “docs/cicd.md” but href is “cicd.md”; align for consistency.

-Security is continuous; integrate checks into CI/CD as shown in
-[docs/cicd.md](cicd.md).
+Security is continuous; integrate checks into CI/CD as shown in
+[cicd.md](cicd.md).
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 92bd5bc and b1a3bfc.

⛔ Files ignored due to path filters (1)
  • docs/img/architecture.svg is excluded by !**/*.svg
📒 Files selected for processing (23)
  • .github/workflows/build-and-deploy.yml (1 hunks)
  • README.md (1 hunks)
  • docs/cicd.md (1 hunks)
  • docs/cks.md (1 hunks)
  • docs/observability.md (1 hunks)
  • docs/overview.md (1 hunks)
  • docs/schedule-k8s-with-slurm.md (1 hunks)
  • docs/sunk.md (1 hunks)
  • docs/tensorizer.md (1 hunks)
  • docs/vllm.md (1 hunks)
  • examples/observability/grafana/README.md (1 hunks)
  • examples/sunk/slurm-pod/README.md (1 hunks)
  • examples/sunk/slurm-pod/pod.sbatch (1 hunks)
  • examples/tensorizer/serialize_and_load.py (1 hunks)
  • examples/vllm/run_vllm_tensorized.sh (1 hunks)
  • gitops/argocd/app.yaml (1 hunks)
  • helm/tensorizer-vllm/Chart.yaml (1 hunks)
  • helm/tensorizer-vllm/templates/deployment.yaml (1 hunks)
  • helm/tensorizer-vllm/templates/ingress.yaml (1 hunks)
  • helm/tensorizer-vllm/templates/service.yaml (1 hunks)
  • helm/tensorizer-vllm/values.yaml (1 hunks)
  • k8s/knative-service.yaml (1 hunks)
  • security/policy-notes.md (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/tensorizer/serialize_and_load.py (1)
tensorizer/serialization.py (5)
  • TensorDeserializer (1591-3419)
  • TensorSerializer (3422-4815)
  • write_module (4579-4674)
  • key (1497-1505)
  • load_into_module (3222-3307)
🪛 LanguageTool
examples/observability/grafana/README.md

[grammar] ~3-~3: There might be a mistake here.
Context: ...d inspect the following dashboards while running the vLLM demo: - **Kubernetes /...

(QB_NEW_EN)


[grammar] ~6-~6: There might be a mistake here.
Context: ...mpute / GPU** – DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_FB_USED - Kubernetes / Networking / Namespace – ...

(QB_NEW_EN)


[grammar] ~7-~7: There might be a mistake here.
Context: ...Kubernetes / Networking / Namespace** – container_network_receive_bytes_total - Loki Logs – query app=vllm For a lo...

(QB_NEW_EN)

security/policy-notes.md

[grammar] ~35-~35: There might be a mistake here.
Context: ...//github.com/sigstore/cosign). Generate SBOMs with [Syft](https://github.com/anc...

(QB_NEW_EN)


[grammar] ~36-~36: There might be a mistake here.
Context: ...chore/syft) and store them alongside the images. Admission controllers should ve...

(QB_NEW_EN)


[grammar] ~37-~37: There might be a mistake here.
Context: ...erify signatures before allowing pods to run.

(QB_NEW_EN)


[grammar] ~38-~38: There might be a mistake here.
Context: ... signatures before allowing pods to run.

(QB_NEW_EN)

docs/tensorizer.md

[grammar] ~3-~3: There might be a mistake here.
Context: ...nto a single .tensors file that can be streamed from HTTP or S3 at wire speed. ...

(QB_NEW_EN)


[grammar] ~12-~12: There might be a mistake here.
Context: ...l, serves it over HTTP, and lazily loads it back into a fresh module. ## Fifteen...

(QB_NEW_EN)


[grammar] ~19-~19: There might be a mistake here.
Context: ... lazy_load=True, num_readers=8)` streams the model directly to CPU or GPU memory....

(QB_NEW_EN)


[grammar] ~20-~20: There might be a mistake here.
Context: ...the model directly to CPU or GPU memory. 4. KNative/KServe benefit from faster cold ...

(QB_NEW_EN)


[grammar] ~21-~21: There might be a mistake here.
Context: ... cold starts because weights are fetched on demand rather than baked into the con...

(QB_NEW_EN)


[grammar] ~24-~24: There might be a mistake here.
Context: ... network limits: on 40GbE expect ~5GB/s.

(QB_NEW_EN)

docs/overview.md

[grammar] ~3-~3: There might be a mistake here.
Context: ...tack for high performance model serving. It combines **Slurm on Kubernetes (SUNK)...

(QB_NEW_EN)


[grammar] ~4-~4: There might be a mistake here.
Context: ... Tensorizer, vLLM, and CoreWeave observability to provide fast, reproduci...

(QB_NEW_EN)

docs/vllm.md

[grammar] ~3-~3: There might be a mistake here.
Context: ...roject/vllm) can load tensorized weights without conversion. ## Five‑Minute Quic...

(QB_NEW_EN)


[grammar] ~16-~16: There might be a mistake here.
Context: ...rreads weights from disk, HTTP, or S3. 2. Environment variables likeVLLM_WORKER_...

(QB_NEW_EN)


[grammar] ~19-~19: There might be a mistake here.
Context: ...pose time‑to‑first‑token and tokens/sec. 4. Scale out with KServe or plain Deploymen...

(QB_NEW_EN)


[grammar] ~20-~20: There might be a mistake here.
Context: ...lain Deployments using the Helm chart in [helm/tensorizer-vllm](../helm/tensori...

(QB_NEW_EN)

docs/sunk.md

[grammar] ~3-~3: There might be a mistake here.
Context: ...nodes inside Kubernetes Pods. Slurm and native Kubernetes workloads can share th...

(QB_NEW_EN)


[grammar] ~4-~4: There might be a mistake here.
Context: ...share the same cluster while maintaining isolation. ``` +-----------------------...

(QB_NEW_EN)


[grammar] ~35-~35: There might be a mistake here.
Context: ...launch native Pods alongside batch jobs. 4. Metrics and logs are exported to CoreWea...

(QB_NEW_EN)


[grammar] ~38-~38: There might be a mistake here.
Context: ...creating a Kubernetes Pod from Slurm see [examples/sunk/slurm-pod](../examples/su...

(QB_NEW_EN)


[grammar] ~39-~39: There might be a mistake here.
Context: ...](../examples/sunk/slurm-pod/README.md).

(QB_NEW_EN)

docs/cicd.md

[grammar] ~9-~9: There might be a mistake here.
Context: ...o the registry and Helm is upgraded in a test namespace. ## Fifteen‑Minute Deep ...

(QB_NEW_EN)


[grammar] ~14-~14: There might be a mistake here.
Context: ...istry using OIDC and short‑lived tokens. 2. Secrets are provided via External Secret...

(QB_NEW_EN)

docs/observability.md

[grammar] ~11-~11: There might be a mistake here.
Context: ...twork throughput, and pod restarts while invoking the model. ## Fifteen‑Minute D...

(QB_NEW_EN)

examples/sunk/slurm-pod/README.md

[grammar] ~3-~3: There might be a mistake here.
Context: ...n create a native Kubernetes Pod via the kubernetes plugin. ## Prerequisites ...

(QB_NEW_EN)

README.md

[grammar] ~13-~13: There might be a mistake here.
Context: ...ure.svg) ## Quickstart Map - Overview - SUNK - [Schedule K8s Pods...

(QB_NEW_EN)


[grammar] ~14-~14: There might be a mistake here.
Context: ... - Overview - SUNK - [Schedule K8s Pods with Slurm](docs/sched...

(QB_NEW_EN)


[grammar] ~15-~15: There might be a mistake here.
Context: ...sunk.md) - Schedule K8s Pods with Slurm - Tensorizer - [vLLM]...

(QB_NEW_EN)


[grammar] ~16-~16: There might be a mistake here.
Context: ...chedule-k8s-with-slurm.md) - Tensorizer - vLLM - [Observability](do...

(QB_NEW_EN)


[grammar] ~17-~17: There might be a mistake here.
Context: ...Tensorizer - vLLM - Observability - ...

(QB_NEW_EN)


[grammar] ~18-~18: There might be a mistake here.
Context: ... - vLLM - Observability - CI/CD - [Security](docs/c...

(QB_NEW_EN)


[grammar] ~19-~19: There might be a mistake here.
Context: ...ability](docs/observability.md) - CI/CD - Security # tensorizer Mod...

(QB_NEW_EN)

docs/cks.md

[grammar] ~3-~3: There might be a mistake here.
Context: ...etes Certified Security Specialist (CKS) objectives. ## Five‑Minute Quickstart ...

(QB_NEW_EN)


[grammar] ~14-~14: There might be a mistake here.
Context: ...notes.md` for Role/RoleBinding examples. - NetworkPolicies restrict pod egress to O...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
docs/schedule-k8s-with-slurm.md

16-16: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 actionlint (1.7.7)
.github/workflows/build-and-deploy.yml

13-13: the runner of "docker/setup-buildx-action@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🪛 Ruff (0.12.2)
examples/tensorizer/serialize_and_load.py

34-34: Possible binding to all interfaces

(S104)

🔇 Additional comments (1)
README.md (1)

9-9: Verify architecture image path.

Ensure docs/img/architecture.svg exists (case-sensitive on some filesystems) and renders on GitHub.

Comment on lines +1 to +6
name: build-and-deploy

on:
  push:
    branches: [ main ]


🛠️ Refactor suggestion

Add required permissions for GHCR push (packages: write).

Without this, push to ghcr.io can fail.

 name: build-and-deploy
 
 on:
   push:
     branches: [ main ]
 
+permissions:
+  contents: read
+  packages: write
+  id-token: write
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-name: build-and-deploy
-
-on:
-  push:
-    branches: [ main ]
+name: build-and-deploy
+
+on:
+  push:
+    branches: [ main ]
+
+permissions:
+  contents: read
+  packages: write
+  id-token: write
🤖 Prompt for AI Agents
.github/workflows/build-and-deploy.yml lines 1-6: the workflow lacks required
permissions for pushing to GHCR, causing push failures; add a top-level
permissions block (under the workflow root, before jobs) and set packages: write
(and optionally contents: read) so the runner can push images to ghcr.io; ensure
indentation matches YAML (permissions: packages: write).

    steps:
      - uses: actions/checkout@v4
      - name: Set up Docker
        uses: docker/setup-buildx-action@v2


🛠️ Refactor suggestion

Update setup-buildx-action to v3 (actionlint failure).

v2 is flagged incompatible on current runners.

-      - name: Set up Docker
-        uses: docker/setup-buildx-action@v2
+      - name: Set up Docker
+        uses: docker/setup-buildx-action@v3
🧰 Tools
🪛 actionlint (1.7.7)

13-13: the runner of "docker/setup-buildx-action@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🤖 Prompt for AI Agents
.github/workflows/build-and-deploy.yml around line 13: the workflow uses
docker/setup-buildx-action@v2 which is flagged incompatible on current runners;
update the action reference to docker/setup-buildx-action@v3 (or the latest v3
tag) in the uses field, run a quick syntax/lint check and re-run the workflow to
ensure compatibility with the current runners.

Comment on lines +15 to +29
        run: docker build -t example/vllm:${{ github.sha }} .
      - name: Scan
        uses: aquasecurity/trivy-action@0.20.0
        with:
          image-ref: example/vllm:${{ github.sha }}
      - name: Login
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Push
        run: docker push example/vllm:${{ github.sha }}
      - name: Helm Upgrade
        run: helm upgrade --install tensorizer helm/tensorizer-vllm --set image=example/vllm:${{ github.sha }} --namespace test --create-namespace


⚠️ Potential issue

Tag to GHCR consistently and reuse a single IMAGE var across steps.

Current tag example/vllm doesn’t match GHCR login and will push to Docker Hub.

+      - name: Set image tag
+        run: echo "IMAGE=ghcr.io/${{ github.repository_owner }}/tensorizer-vllm:${{ github.sha }}" >> $GITHUB_ENV
       - name: Build
-        run: docker build -t example/vllm:${{ github.sha }} .
+        run: docker build -t "$IMAGE" .
       - name: Scan
         uses: aquasecurity/trivy-action@0.20.0
         with:
-          image-ref: example/vllm:${{ github.sha }}
+          image-ref: ${{ env.IMAGE }}
       - name: Login
         uses: docker/login-action@v3
         with:
           registry: ghcr.io
           username: ${{ github.actor }}
           password: ${{ secrets.GITHUB_TOKEN }}
       - name: Push
-        run: docker push example/vllm:${{ github.sha }}
+        run: docker push "$IMAGE"
       - name: Helm Upgrade
-        run: helm upgrade --install tensorizer helm/tensorizer-vllm --set image=example/vllm:${{ github.sha }} --namespace test --create-namespace
+        run: helm upgrade --install tensorizer helm/tensorizer-vllm --set image=${IMAGE} --namespace test --create-namespace
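The "single IMAGE variable" idea amounts to composing the reference once and reusing it everywhere. A minimal sketch follows; `ghcr_image` is a hypothetical helper, and the lowercasing is included because container image repository names must be lowercase while GitHub owner names may contain capitals.

```python
def ghcr_image(owner: str, name: str, tag: str) -> str:
    """Build a GHCR image reference; repository path components must be lowercase."""
    return f"ghcr.io/{owner.lower()}/{name.lower()}:{tag}"


if __name__ == "__main__":
    # The same string would feed docker build, the Trivy scan, docker push, and helm --set.
    print(ghcr_image("TylrDn", "tensorizer-vllm", "0123abc"))
```

In the workflow itself the same effect would need the owner lowercased in a shell step before writing IMAGE to `$GITHUB_ENV`.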
🤖 Prompt for AI Agents
.github/workflows/build-and-deploy.yml lines 15-29: the workflow builds, scans,
logs in and pushes an image but uses inconsistent name "example/vllm" which
targets Docker Hub while the login is for GHCR; change to define a single IMAGE
variable (e.g. IMAGE=ghcr.io/${{ github.repository_owner }}/vllm:${{ github.sha
}}) at job or step level and reference that variable in all steps (docker build
-t $IMAGE ., trivy-action image-ref: $IMAGE, docker push $IMAGE, and helm
upgrade --set image=$IMAGE ...) so the same GHCR-qualified tag is used
everywhere and removes the mismatch.

      - name: Push
        run: docker push example/vllm:${{ github.sha }}
      - name: Helm Upgrade
        run: helm upgrade --install tensorizer helm/tensorizer-vllm --set image=example/vllm:${{ github.sha }} --namespace test --create-namespace


⚠️ Potential issue

Helm step lacks Helm installation and kubeconfig; deployment will fail.

Install Helm and configure cluster auth (e.g., from a base64 KUBECONFIG secret) before running upgrade.

       - name: Push
         run: docker push "$IMAGE"
+      - name: Install Helm
+        uses: azure/setup-helm@v4
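+      - name: Configure kubeconfig
+        # secrets.* cannot be referenced directly in `if:`; check the env var in the script instead
+        run: |
+          [ -n "${KUBECONFIG_B64}" ] || { echo "KUBECONFIG_B64 is not set; skipping"; exit 0; }
+          echo "${KUBECONFIG_B64}" | base64 -d > "$HOME/kubeconfig"
+          echo "KUBECONFIG=$HOME/kubeconfig" >> "$GITHUB_ENV"
+        env:
+          KUBECONFIG_B64: ${{ secrets.KUBECONFIG_B64 }}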
       - name: Helm Upgrade
-        run: helm upgrade --install tensorizer helm/tensorizer-vllm --set image=${IMAGE} --namespace test --create-namespace
+        run: helm upgrade --install tensorizer helm/tensorizer-vllm --set image=${IMAGE} --namespace test --create-namespace

Optionally gate with environments and approvals (environment: test) and add concurrency to avoid overlapping deploys.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In .github/workflows/build-and-deploy.yml around line 29, the Helm deploy step
runs helm upgrade without ensuring Helm is installed or a kubeconfig is
configured; before the helm upgrade, add steps to install Helm (for example with
azure/setup-helm) and set up cluster auth by decoding a base64 KUBECONFIG secret
into $HOME/.kube/config (or use azure/k8s-set-context or
google-github-actions/get-gke-credentials as appropriate), log in to the
registry if needed, and only after that run helm upgrade --install;
additionally, gate this job with an environment: test and required
reviewers/approvals and add a concurrency group to the workflow to prevent
overlapping deploys.

Comment thread docs/observability.md

1. The `kube-state-metrics` and `node-exporter` dashboards show cluster health.
2. vLLM exports Prometheus metrics such as `vllm_engine_execution_time`.
3. Logs are collected via Loki; search by `app=vllm`.


⚠️ Potential issue

Correct Loki query syntax

Use {app="vllm"} rather than app=vllm.

-3. Logs are collected via Loki; search by `app=vllm`.
+3. Logs are collected via Loki; search by `{app="vllm"}`.
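For readers unsure why the braces and quotes matter, a LogQL stream selector is a brace-wrapped list of label matchers with quoted values. The sketch below builds one for equality matchers only; `logql_selector` is a hypothetical helper, not part of any Loki client.

```python
def logql_selector(labels: dict[str, str]) -> str:
    """Build a LogQL stream selector with equality matchers, e.g. {app="vllm"}."""
    parts = []
    for name, value in sorted(labels.items()):
        # Escape backslashes and double quotes so the value stays one quoted string.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{name}="{escaped}"')
    return "{" + ", ".join(parts) + "}"
```

Calling it with `{"app": "vllm"}` yields the selector from the suggestion; adding a namespace label narrows the query to a single deployment.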
🤖 Prompt for AI Agents
In docs/observability.md around line 18, the Loki query example uses incorrect
syntax "app=vllm"; replace it with the proper label matcher form {app="vllm"} so
examples and instructions use valid Loki query syntax; update the text to show
the corrected query and ensure any surrounding examples use the same
label-matcher format.

Comment thread k8s/knative-service.yaml
        autoscaling.knative.dev/minScale: "0"
    spec:
      containers:
        - image: vllm/vllm:latest


🛠️ Refactor suggestion

Pin the image (avoid latest) and align with CI/CD output

Use a tagged release or digest, or the CI-produced image (e.g., example/vllm:${{ github.sha }}) for reproducibility and supply-chain integrity.

-        - image: vllm/vllm:latest
+        - image: example/vllm:{{ .Values.image.tag | default "sha-<commit>" }}

If not using Helm templating here, pin to a specific version or digest.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In k8s/knative-service.yaml around line 12, the image is using the mutable tag
"vllm/vllm:latest"; replace it with a pinned image reference for reproducibility
and supply-chain integrity—either the CI-built image variable (e.g.,
example/vllm:${{ github.sha }}) or a fixed semver tag or immutable digest
(sha256:...), and ensure the manifest uses the same tag/digest the pipeline
publishes (or add templating if needed) so deployments always pull the exact
intended image.
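
As a rough illustration of the digest option, the fragment below shows only the changed image line. The `example/vllm` repository name comes from the suggestion above; `<digest>` is a placeholder, not a real value, and has to be replaced with the sha256 digest the registry reports for the image the pipeline pushed.

```yaml
# Fragment of k8s/knative-service.yaml; only the image reference changes.
spec:
  template:
    spec:
      containers:
        # Placeholder digest: substitute the sha256 value reported by your
        # registry for the CI-built image.
        - image: example/vllm@sha256:<digest>
```

A digest reference keeps pointing at the same image content even if a tag is later re-pushed, which is what makes it the stricter of the two pinning options.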

Comment thread k8s/knative-service.yaml
    spec:
      containers:
        - image: vllm/vllm:latest
          args: ["serve", "--model", "s3://my-bucket/models/tiny-gpt2.tensors", "--tensorizer"]

🛠️ Refactor suggestion

S3 credentials/config not provided

Loading from s3:// will fail without credentials. Mount a Secret or set env vars (and prefer an S3-compatible endpoint if using CoreWeave Object Storage).

           args: ["serve", "--model", "s3://my-bucket/models/tiny-gpt2.tensors", "--tensorizer"]
+          env:
+            - name: AWS_ACCESS_KEY_ID
+              valueFrom:
+                secretKeyRef: { name: s3-credentials, key: accessKeyId }
+            - name: AWS_SECRET_ACCESS_KEY
+              valueFrom:
+                secretKeyRef: { name: s3-credentials, key: secretAccessKey }
+            - name: AWS_REGION
+              value: us-east-1
+            # Uncomment if using a custom S3 endpoint:
+            # - name: AWS_ENDPOINT_URL_S3
+            #   value: https://<endpoint>

I can draft a Secret manifest if you share the naming you prefer.

📝 Committable suggestion

Suggested change
          args: ["serve", "--model", "s3://my-bucket/models/tiny-gpt2.tensors", "--tensorizer"]
          args: ["serve", "--model", "s3://my-bucket/models/tiny-gpt2.tensors", "--tensorizer"]
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: s3-credentials
                  key: accessKeyId
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: s3-credentials
                  key: secretAccessKey
            - name: AWS_REGION
              value: us-east-1
            # Uncomment if using a custom S3 endpoint:
            # - name: AWS_ENDPOINT_URL_S3
            #   value: https://<endpoint>
🤖 Prompt for AI Agents
In k8s/knative-service.yaml around line 13, the container args reference an
s3:// model path but no S3 credentials or endpoint are provided; add a
Kubernetes Secret (or use an existing one) holding the access key ID and secret
access key, then inject those into the Knative Service container as the
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables (individual
env entries with secretKeyRef, or envFrom: secretRef if the Secret's keys
already match the variable names), optionally alongside AWS_REGION, or mount
the Secret as files and point the runtime to them; for S3-compatible providers
such as CoreWeave Object Storage, also set AWS_ENDPOINT_URL_S3 to the provider's
endpoint. Share the Secret name you prefer and the Secret manifest can be
drafted to match.
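
A minimal sketch of the Secret that the `secretKeyRef` entries above expect. The name `s3-credentials` and the keys `accessKeyId` and `secretAccessKey` are taken from the suggested change; the values are placeholders, and the Secret is assumed to live in the same namespace as the Knative Service.

```yaml
apiVersion: v1
kind: Secret
metadata:
  # Must match secretKeyRef.name in the Knative Service.
  name: s3-credentials
type: Opaque
# stringData accepts plain text; the API server stores it base64-encoded.
stringData:
  accessKeyId: "REPLACE_WITH_ACCESS_KEY_ID"          # placeholder
  secretAccessKey: "REPLACE_WITH_SECRET_ACCESS_KEY"  # placeholder
```

Real credentials should not be committed to the repository. It is also worth confirming against the vLLM and Tensorizer documentation which environment variable names the loader actually reads, since the `AWS_*` names above are the reviewer's suggestion rather than something this PR verifies.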

Comment thread README.md
Comment on lines +3 to +5
[![CI](https://github.com/coreweave/tensorizer/actions/workflows/build-and-deploy.yml/badge.svg)](https://github.com/coreweave/tensorizer/actions/workflows/build-and-deploy.yml)
[![Container Scan](https://img.shields.io/badge/scan-trivy-blue)](#)
[![Lint](https://img.shields.io/badge/lint-black%20%2B%20isort-brightgreen)](#)

🛠️ Refactor suggestion

Fix workflow badge to this repo (not coreweave/tensorizer).

Current badge URLs point to coreweave/tensorizer, so status is misleading in this fork/PR.

Apply:

-[![CI](https://github.com/coreweave/tensorizer/actions/workflows/build-and-deploy.yml/badge.svg)](https://github.com/coreweave/tensorizer/actions/workflows/build-and-deploy.yml)
+[![CI](https://github.com/TylrDn/tensorizer/actions/workflows/build-and-deploy.yml/badge.svg)](https://github.com/TylrDn/tensorizer/actions/workflows/build-and-deploy.yml)
📝 Committable suggestion

Suggested change
[![CI](https://github.com/coreweave/tensorizer/actions/workflows/build-and-deploy.yml/badge.svg)](https://github.com/coreweave/tensorizer/actions/workflows/build-and-deploy.yml)
[![Container Scan](https://img.shields.io/badge/scan-trivy-blue)](#)
[![Lint](https://img.shields.io/badge/lint-black%20%2B%20isort-brightgreen)](#)
[![CI](https://github.com/TylrDn/tensorizer/actions/workflows/build-and-deploy.yml/badge.svg)](https://github.com/TylrDn/tensorizer/actions/workflows/build-and-deploy.yml)
[![Container Scan](https://img.shields.io/badge/scan-trivy-blue)](#)
[![Lint](https://img.shields.io/badge/lint-black%20%2B%20isort-brightgreen)](#)
🤖 Prompt for AI Agents
In README.md around lines 3 to 5, the GitHub Actions workflow badge links to
coreweave/tensorizer which is incorrect for this fork; update the badge URL so
it points to this repository's workflows (replace coreweave/tensorizer in the
badge link and image URL with the current GitHub owner/repo or use the
repository-relative workflow path), leaving the rest of the markdown intact;
verify the badge link and image render the correct status for this repo.

Comment thread security/policy-notes.md
Comment on lines +5 to +14
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-runner
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "get", "list"]
```

🛠️ Refactor suggestion

RBAC example: scope and binding are missing; add namespace and RoleBinding.

Roles are namespace-scoped; without a RoleBinding the example isn’t actionable.

Apply:

 kind: Role
 metadata:
   name: pod-runner
+  namespace: workloads
 rules:
   - apiGroups: [""]
     resources: ["pods"]
     verbs: ["create", "get", "list"]

Add this RoleBinding right below to bind a least-privileged SA:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-runner
  namespace: workloads
subjects:
  - kind: ServiceAccount
    name: runner
    namespace: workloads
roleRef:
  kind: Role
  name: pod-runner
  apiGroup: rbac.authorization.k8s.io
🤖 Prompt for AI Agents
security/policy-notes.md around lines 5 to 14: the RBAC Role example is missing
namespace scope and a RoleBinding so it’s not actionable; add
metadata.namespace: workloads to the Role and immediately add a RoleBinding
resource that binds the least-privileged ServiceAccount named "runner" in the
workloads namespace to the pod-runner Role (roleRef.kind: Role, roleRef.name:
pod-runner, apiGroup: rbac.authorization.k8s.io) and set the
RoleBinding.subjects with kind: ServiceAccount, name: runner, namespace:
workloads.
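
The RoleBinding above presupposes that the `workloads` namespace and the `runner` ServiceAccount already exist. A minimal sketch of those two objects, using the names from the suggestion (both are illustrative and should be adapted to the cluster):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: workloads
---
apiVersion: v1
kind: ServiceAccount
metadata:
  # Referenced by RoleBinding.subjects[0]; pods opt in via
  # spec.serviceAccountName: runner
  name: runner
  namespace: workloads
```

Once the Role, RoleBinding, and ServiceAccount are applied, `kubectl auth can-i create pods --as=system:serviceaccount:workloads:runner -n workloads` is one way to check that the binding grants what the Role lists.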

Comment thread security/policy-notes.md
Comment on lines +18 to +31
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-egress
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              access: object-storage
```

🛠️ Refactor suggestion

NetworkPolicy will break DNS; explicitly allow DNS egress.

As written, all egress except object storage is denied; pods will fail DNS lookups.

Apply:

 spec:
   podSelector: {}
   policyTypes: [Egress]
   egress:
     - to:
         - namespaceSelector:
             matchLabels:
               access: object-storage
+    - to:
+        - namespaceSelector:
+            matchLabels:
+              kubernetes.io/metadata.name: kube-system
+      ports:
+        - protocol: UDP
+          port: 53
+        - protocol: TCP
+          port: 53

Optional: restrict by IPBlock for your DNS service if labels differ.

📝 Committable suggestion

Suggested change
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-egress
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              access: object-storage
```
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-egress
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              access: object-storage
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
🤖 Prompt for AI Agents
In security/policy-notes.md around lines 18 to 31, the NetworkPolicy as shown
denies all egress except to namespaces labeled access=object-storage, which will
break DNS resolution; update the policy to explicitly allow DNS egress on UDP
and TCP port 53, either with a namespaceSelector for the namespace that runs
kube-dns/CoreDNS (optionally narrowed by a podSelector such as k8s-app: kube-dns
or your cluster's label) or with an ipBlock/CIDR for your DNS service. Egress
rules in a NetworkPolicy are additive, so the DNS rule can be listed before or
after the object-storage rule.
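
If the DNS rule should reach only the DNS pods rather than everything in `kube-system`, one option is to combine a `namespaceSelector` and a `podSelector` in the same `to` entry, which matches pods that satisfy both. The sketch below assumes the DNS pods carry the `k8s-app: kube-dns` label, which is common but not universal, so the label should be checked on the target cluster first.

```yaml
# Fragment of the restrict-egress NetworkPolicy: a narrower DNS rule.
spec:
  egress:
    - to:
        # One list item with both selectors: pods must be in kube-system
        # AND carry the DNS label (assumed here to be k8s-app: kube-dns).
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Writing the two selectors as separate list items would instead allow traffic to either match, which is broader than intended here.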

@TylrDn TylrDn merged commit 046fedf into main Aug 30, 2025
0 of 4 checks passed