
Enhancement proposal: kcli Service Provider #43

Open
pgarciaq wants to merge 7 commits into dcm-project:main from pgarciaq:kcli-sp

Conversation


@pgarciaq pgarciaq commented Apr 22, 2026

Summary

Enhancement proposal for the kcli Service Provider — the first
non-Kubernetes DCM service provider. It manages VMs and Kubernetes
clusters through kweb (kcli's HTTP
API), targeting development, testing, and homelab environments.

Implementation: pgarciaq/dcm-kcli-provider
Container image: quay.io/pgarciaq/dcm-kcli-provider

Important

This SP is intended for development and testing purposes only, not for production.
kcli is a community tool that Red Hat does not provide support for.
Accordingly, this SP should be considered community-only. Despite this scope limitation, the kcli SP
is architecturally significant as a reference implementation: it demonstrates DCM's ability to provision
resources through non-Kubernetes execution planes, including third-party Kubernetes clusters (k3s,
microshift, generic), cloud VMs, and bare-metal/libvirt hosts — capabilities not covered by any
production-targeted SP today.

What makes this SP unique

The kcli SP introduces several patterns not found in existing DCM
service providers (KubeVirt SP, k8s-container SP, ACM Cluster SP):

  1. Non-Kubernetes execution plane. First DCM SP that integrates via
    a standalone HTTP API (kweb) rather than the Kubernetes API. No
    kubeconfig, no client-go, no CRDs. Enables DCM on bare-metal/libvirt
    without a management cluster.

  2. Dual service-type registration. One binary registers as both vm
    and cluster providers with separate endpoints and provider IDs. All
    other SPs register a single service type.

  3. Name-prefix identity (not K8s labels). Uses dcm- prefix on kcli
    resource names + bbolt ID mapping instead of dcm.project/* labels on
    Kubernetes objects.

  4. Embedded state store (bbolt). Authoritative ID-to-name mapping
    lives in a local embedded database, not in Kubernetes etcd. Enables
    orphan detection and crash recovery without external dependencies.

  5. Polling-based status model. Contrasts with informer-driven
    KubeVirt/k8s-container SPs. Includes debouncing and cluster creation
    timeout to handle kweb's async cluster provisioning.

  6. Profile resolution via provider_hints. Explicit precedence chain
    (provider_hints.kcli.profile → guest_os.type → default) with
    runtime validation against live kweb profile cache. Fills a gap where
    catalog specs don't map directly to kcli profiles.

  7. Cluster type via provider_hints.kcli.cluster_type. The catalog
    ClusterSpec has no cluster_type field; provider_hints is the
    only mechanism to select k3s/generic/openshift/microshift/hypershift.

  8. Upstream error normalization. kweb returns 2xx with failure JSON,
    HTML error pages, and empty bodies. The SP normalizes all of these
    into consistent RFC 7807 responses — a challenge unique to integrating
    with a non-API-first upstream.

  9. Health = downstream dependency. SP health is downstream-dependent
    (probes kweb /host), unlike KubeVirt/k8s-container which report
    self-health only.

  10. Homelab-first design. Intentionally scoped for dev/test/homelab.
    Trusted-network assumption, single-replica, embedded store — all
    deliberate trade-offs that simplify deployment without compromising
    the DCM integration contract.
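The profile precedence chain in item 6 can be sketched as below. This is a minimal illustration, not the SP's code: the function name `resolveProfile`, the use of `guest_os.type` directly as a profile name, and the choice to reject an unknown candidate instead of falling through to the next level are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// resolveProfile picks a kcli profile name in precedence order:
// explicit hint, then guest OS type, then a configured default.
// The first non-empty candidate must exist in the known profile set
// (a stand-in for the live kweb profile cache).
func resolveProfile(hint, guestOS, def string, known map[string]bool) (string, error) {
	for _, candidate := range []string{hint, guestOS, def} {
		if candidate == "" {
			continue // this level was not supplied; try the next one
		}
		if !known[candidate] {
			// Assumption: an unknown candidate is an error, not a fall-through.
			return "", fmt.Errorf("profile %q not found in kweb profile cache", candidate)
		}
		return candidate, nil
	}
	return "", errors.New("no profile could be resolved")
}

func main() {
	known := map[string]bool{"fedora": true, "centos": true}
	profile, err := resolveProfile("", "fedora", "centos", known)
	fmt.Println(profile, err)
}
```

Failing fast on an unknown hint keeps a typo in `provider_hints` from silently provisioning a different image; a fall-through policy would be an equally valid design.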

Implementation status

  • Full SPM generic resource protocol (POST ?id=, DELETE /{id},
    GET /health)
  • E2E tested on Apollo hypervisor (full DCM stack + kcli v99.0)
  • Adversarial review completed (security, correctness, operations,
    design) with all critical/high findings fixed
  • Ginkgo specs across 8 suites, all passing with --race
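As a rough illustration of the `GET /health` endpoint with downstream-dependent health, the sketch below probes kweb's `/host` on each request. The handler shape, the 3-second timeout, and the 200/503 mapping are assumptions; only the `/host` probe comes from the description above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// healthStatus maps the outcome of a kweb probe to the status code of the
// SP's own health endpoint: 200 when kweb answered 2xx, 503 otherwise.
func healthStatus(kwebStatus int, probeErr error) int {
	if probeErr != nil || kwebStatus < 200 || kwebStatus > 299 {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

// healthHandler probes <kwebURL>/host every time it is called.
func healthHandler(kwebURL string) http.HandlerFunc {
	client := &http.Client{Timeout: 3 * time.Second} // timeout is an assumption
	return func(w http.ResponseWriter, r *http.Request) {
		status := 0
		resp, err := client.Get(kwebURL + "/host")
		if err == nil {
			status = resp.StatusCode
			resp.Body.Close()
		}
		w.WriteHeader(healthStatus(status, err))
	}
}

func main() {
	// Stand-in for kweb so the sketch runs without a real instance.
	kweb := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer kweb.Close()

	rec := httptest.NewRecorder()
	healthHandler(kweb.URL)(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	fmt.Println("SP health status:", rec.Code)
}
```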

Screenshots

E2E test walkthrough from the DCM UI on an Apollo hypervisor running the full DCM stack with the kcli SP.

Providers

Both kcli providers (VM and cluster) registered and in ready state:

Providers list

Cluster provider configuration — dual registration with separate endpoints:

Edit cluster provider

VM provider configuration:

Edit VM provider

Policies

Rego policy routing all VM requests to the kcli-vm provider:

Policies list

Edit Rego policy

Service types

DCM service type registry — kcli registers as both vm and cluster:

Service types

Catalog items

Catalog items for Fedora VM, K3s Cluster, and Pet Clinic:

Catalog items list

Fedora VM catalog item — editable fields for OS image, memory, and vCPUs:

Edit Fedora VM catalog item

Instances and Resources

Catalog item instance created from the Fedora VM template:

Catalog item instances

The resulting resource provisioned by kcli-vm, status APPROVED:

Resources

Review checklist

  • Proposal structure follows the enhancement template
  • API design aligns with DCM SP contracts (SPM, health, status)
  • Risks and mitigations are comprehensive
  • Alternatives were considered and documented

Introduces the kcli SP — the first non-Kubernetes DCM service provider.
It manages VMs and clusters through kweb (kcli's HTTP API), targeting
development, testing, and homelab environments.

Made-with: Cursor
Signed-off-by: Pau Garcia Quiles <pgarciaq@redhat.com>
Screenshots showing the full kcli SP integration flow through
the DCM UI: providers, policies, service types, catalog items,
instances, and resources.

Made-with: Cursor
Signed-off-by: Pau Garcia Quiles <pgarciaq@redhat.com>
Comment thread enhancements/kcli-sp/kcli-sp.md Outdated

## Open Questions

1. **kweb version pinning.** kweb has no versioned API contract. Should the SP
Collaborator

Is there any benefit to supporting multiple versions? If not, we can pin the SP version to a specific kcli version.

Author

Good question. kcli doesn't follow conventional versioning — it's been at 99.0.0 for ~6 years, with a new RPM build on every commit (e.g. 99.0.0.git.202604230909.6534e7c). There's no "kcli 1.2" vs "kcli 1.3" to pin against.

What we can do instead: each SP release documents the kweb git commit it was validated against. For example, v0.1.0 was tested against 6534e7c. When a kweb change breaks or improves something, we bump the SP version and update the validated commit. This gives traceability without pretending kcli has semver releases.

Will update the proposal to close Open Question #1 with this approach.

Comment thread enhancements/kcli-sp/kcli-sp.md Outdated
pin to a specific kcli release and test against it, or attempt to support
multiple kweb versions with feature detection?

2. **Multi-backend status mapping.** kweb VM status strings vary by backend
Collaborator

I think it would be good to normalize statuses passed to DCM control plane for all backends

Author

Agreed. The SP already normalizes kweb status strings to DCM's vocabulary (RUNNING, STOPPED, PROVISIONING, ERROR). This works across all kcli backends (libvirt, vSphere, KubeVirt, AWS, Azure, OpenStack, etc.) since the SP talks to kweb over HTTP and the kweb API is backend-agnostic.

Status strings may vary by backend (e.g. libvirt returns up/down, vSphere adds suspended), but the SP handles known values and maps unknowns to ERROR.
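The mapping described here could look roughly like the sketch below. Only the `up` and `down` inputs, the four DCM statuses, and the unknown-to-`ERROR` rule come from this thread; the function name, the case-insensitive matching, and the table layout are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// DCM status vocabulary named in the discussion.
const (
	Running      = "RUNNING"
	Stopped      = "STOPPED"
	Provisioning = "PROVISIONING"
	Error        = "ERROR"
)

// knownStatuses maps kweb status strings to DCM statuses. Only "up" and
// "down" are taken from the thread; other backends would add entries here.
var knownStatuses = map[string]string{
	"up":   Running,
	"down": Stopped,
}

// normalizeStatus returns the DCM status for a kweb status string and
// maps anything unrecognized to ERROR.
func normalizeStatus(kwebStatus string) string {
	key := strings.ToLower(strings.TrimSpace(kwebStatus)) // lenient matching is an assumption
	if s, ok := knownStatuses[key]; ok {
		return s
	}
	return Error
}

func main() {
	for _, s := range []string{"up", "down", "something-new"} {
		fmt.Printf("%s -> %s\n", s, normalizeStatus(s))
	}
}
```

Keeping the table in one place means a new backend-specific string is a one-line change, and anything unmapped surfaces as `ERROR` instead of leaking an unknown value to the control plane.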

For multi-backend deployments (e.g. one kweb on libvirt, another on AWS), the admin deploys one kweb+SP pair per backend. Each registers as service type vm (or cluster) with a unique provider name (e.g. kcli-libvirt-apollo, kcli-aws-useast). DCM's Rego policy routes requests to the right provider based on catalog item metadata or provider_hints. The SP itself is backend-agnostic — it doesn't need to know which backend kweb is configured for.

Will update the proposal to make the status mapping explicit, clarify multi-backend support, and document the multi-backend deployment pattern.

`error`). Should the SP implement per-backend mapping tables, or document
libvirt as the only supported backend for v1?

3. **Cluster type `kind`.** kweb's `swagger.yml` lists `kind` as a valid cluster
Collaborator

This could be resolved by my answer to question #1. If we pin to a specific version, we only support what works for that version. Once fixes are available, we can bump the SP version with the new kcli.

Author

Makes sense in principle. Since kcli doesn't have semver releases (see response above), we'll track validated git commits instead. The effect is the same: each SP release is tested against a known kweb state, and we bump when upstream fixes land.


## Summary

The kcli Service Provider is a DCM Service Provider that manages virtual
Collaborator

I like the idea of service provider which supports more than one service type. Potentially this could simplify how we manage SPs and reduce the footprint.

Author

Thanks! The kcli SP is a single Go binary handling both vm and cluster service types against one kweb backend, which keeps the deployment footprint minimal.

Collaborator

Today, registration must be done once per service type, each with its own URL and provider name, for clear endpoint separation and to avoid complex capability matrices. VM traffic and cluster traffic don't hit the same path (e.g. …/vm vs …/cluster, and the service health URLs), so DCM ends up with "two provider rows" even though it's one process. Re-registration is also easy, since it affects only one "row". If we wanted one registration call that lists several types at once, we'd need to define how that works (API shape, registry, how SPM picks the right base URL). That's a separate change from this proposal.

Author

Agreed. The current dual registration (one per service type) is correct and works well. The proposal documents it as the expected approach, not a workaround. Thanks for the clear explanation of why this is the right design.

with a standalone kweb instance, enabling DCM to provision infrastructure on any
hypervisor backend that kcli supports (primarily libvirt/KVM for homelab use).

Because DCM registration is per service type, the kcli SP registers **twice**
Collaborator

I think we should change this and allow registering for multiple service types.

Author

Agreed. Today the SPM API requires one registration per service type, so the kcli SP registers twice (kcli-vm and kcli-cluster). This works but is a workaround.

A future SPM enhancement to support multi-service-type registration in a single call would be the right fix — it would benefit any SP that handles multiple resource types.

Will update the proposal to document dual registration as a known limitation and reference this as a candidate SPM enhancement.


{
"spec": {
"service_type": "cluster",
Collaborator

Do we want to define a kcli SP-specific service type, or should we use an existing type?

Author

Same answer as above — existing types (`vm`, `cluster`) for consistency with other SPs.

by anyone who can reach it. Additionally, kweb exposes cluster-admin kubeconfigs
via `GET /kubes/{name}/kubeconfig` without any access control.

**Mitigation:** The homelab/dev/test deployment model assumes a **trusted
Collaborator

Reasonable approach to me. This SP would be non-production, and we need to document it as such.

Author

Agreed. Will add a prominent Production Readiness disclaimer to the proposal and to the SP repository's README and docs. Something along the lines of:

This service provider is not intended for production use. It is designed for development, testing, and homelab environments. kweb has no authentication, no TLS, no rate limiting, and no SLA guarantees. The kcli SP inherits these limitations. For production workloads, use the KubeVirt SP (VMs) or ACM Cluster SP (clusters) instead.

Collaborator

I think we also need to define how to manage non-production SPs. Do we want a community SP repo, for example (outside the dcm-project org)?
Not for this PR, obviously, but I think we may need to provide a guide/ruleset for developers. I'm also thinking about authN/authZ, and how (and if) we want to verify that an SP is safe (whatever that means).
Just random doubts :)

@ygalblum wdyt?

Author

@gciavarrini re: this vs outside dcm-project org repositories for community, for me the key question is whether DCM is something by itself, or DCM is the upstream for some RH product. If DCM is an upstream, community SPs could live in dcm-project org. But if DCM is its own "downstream", then a different org is probably better.


#### kweb Credential Exposure

**Risk:** kweb's `GET /kubes/{name}/kubeconfig` returns raw cluster-admin
Collaborator

we could provide an endpoint for ssh/vnc access. Not sure why we need to expose credentials via the api.

Author

Good point. The v1 SP does not proxy or expose VNC passwords or kubeconfig credentials today.

For VMs: the SP already surfaces the VM's IP address in the `GET /vms/{id}` response. We'll also add the default SSH user (e.g. `fedora`, `core`, `centos` — returned by kweb) so users know how to connect (`ssh fedora@192.168.x.x`). No raw credentials exposed.

For clusters: as noted above, we'll follow the ACM SP pattern — embed base64-encoded kubeconfig in the `GET /clusters/{id}` response when status is `RUNNING`.

Console/VNC access via a `GET /vms/{id}/console` endpoint (returning structured connection info with time-limited tokens and audit logging) is a good v2 candidate. For v1, kweb's console mechanism requires WebSocket proxying (`websockify`) which is a significant scope increase.

Note: the KubeVirt SP currently doesn't return IP, credentials, or console access either — so the kcli SP will actually be ahead on this front.
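A minimal sketch of the kubeconfig embedding described in this reply. The response struct and function name are hypothetical; only the base64 encoding and the "embed once the cluster is ready" condition come from the thread. The ready status is passed in as a parameter because the thread names `RUNNING` here while a later update uses `READY`.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// clusterResponse is a hypothetical response shape; only the kubeconfig
// field and its base64 encoding come from the discussion.
type clusterResponse struct {
	ID         string `json:"id"`
	Status     string `json:"status"`
	Kubeconfig string `json:"kubeconfig,omitempty"`
}

// buildClusterResponse embeds the base64-encoded kubeconfig only when the
// cluster has reached readyStatus; otherwise the field is omitted.
func buildClusterResponse(id, status, readyStatus string, rawKubeconfig []byte) clusterResponse {
	resp := clusterResponse{ID: id, Status: status}
	if status == readyStatus && len(rawKubeconfig) > 0 {
		resp.Kubeconfig = base64.StdEncoding.EncodeToString(rawKubeconfig)
	}
	return resp
}

func main() {
	raw := []byte("apiVersion: v1\nkind: Config\n")
	out, err := json.Marshal(buildClusterResponse("c1", "RUNNING", "RUNNING", raw))
	fmt.Println(string(out), err)
}
```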

Contributor

@pkliczewski currently DCM does not provide this. For VMs, users can provide an SSH public key to inject, and DCM will return the machine's IP address for the user to connect. For clusters, DCM returns the API and kubeconfig.
DCM is focused on managing the resources. I don't see it handling SSH or VNC anytime soon


#### Polling Latency vs. Informer-Based Providers

**Risk:** Other DCM providers use Kubernetes informers for near-real-time status
Collaborator

I think this is expected. Not all SPs will be k8s/ocp based

Author

Exactly. Thanks for confirming — good to know this is expected for non-K8s/OCP providers. The poll interval is configurable via `MONITOR_POLL_INTERVAL` for tighter feedback in CI/testing.
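For illustration, a configurable poll interval could be wired up as in the sketch below. It assumes `MONITOR_POLL_INTERVAL` holds a Go duration string such as `5s` and that a 30-second default applies when it is unset; neither detail is stated in this thread, and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// pollInterval parses a raw MONITOR_POLL_INTERVAL value and falls back to
// def when the value is empty, malformed, or not positive.
// Assumption: the variable holds a Go duration string such as "5s".
func pollInterval(raw string, def time.Duration) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return def
	}
	return d
}

// pollN calls check once per interval, n times, then returns.
// A long-running monitor would loop until its context is cancelled instead.
func pollN(interval time.Duration, n int, check func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for i := 0; i < n; i++ {
		<-ticker.C
		check()
	}
}

func main() {
	interval := pollInterval(os.Getenv("MONITOR_POLL_INTERVAL"), 30*time.Second)
	fmt.Println("configured interval:", interval)
	// A short interval is used here only so the demo finishes quickly.
	pollN(10*time.Millisecond, 3, func() { fmt.Println("checking kweb") })
}
```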


#### Cons

- **Breaks DCM provider conventions.** No existing DCM SP shells out to a CLI.
Collaborator

I think this is an implementation detail. It is up to the developer to implement the SP the way they want. There are always trade-offs.

Author

Agreed, included for transparency. One additional note worth mentioning: kcli previously offered a gRPC API, but it was deprecated in favor of the kweb REST API. CLI wrapping would be the only other integration path, and it's more fragile due to unstructured text output and version-sensitive parsing. kweb was the natural choice.

Contributor

I'm assuming the CLI uses kcli's REST APIs. So, writing a REST wrapper for a CLI that uses REST seems a bit redundant.

Author

@pgarciaq pgarciaq May 1, 2026

@ygalblum Both kcli (CLI) and kweb (REST API) are frontends to the same kvirt library

Contributor

I see. Anyhow, I think using the REST API is cleaner than running CLI from code. So, we're good

- Resolve Q1 (version pinning): track kweb git commit per SP release
- Resolve Q2 (multi-backend status): SP supports all kcli backends
- Add Production Readiness disclaimer section
- Clarify kweb configuration (SP uses KWEB_URL, no kcli dependency)
- Document multi-backend deployment pattern with Rego routing
- Document kubeconfig embedding (ACM SP pattern) for cluster access
- Document VM access: ip + ssh_user in GET /vms/{id} response
- Note dual registration as SPM limitation, propose enhancement
- Add gRPC deprecation note to CLI wrapping alternative

Signed-off-by: Pau Garcia Quiles <pgarciaq@redhat.com>
Made-with: Cursor
- Change cluster ready status from ACTIVE to READY (matches ACM SP)
- Expand kweb default backend docs: environment-dependent behavior
  (libvirt, KubeVirt in-pod, macOS Homebrew, or exit)
- Multi-backend table: show separate VM and cluster provider names
- Attribute gRPC deprecation to kcli maintainer (Karim Boumedhel)
- Update graduation criteria with full status vocabulary

Signed-off-by: Pedro Garcia Quiles <pgarciaq@redhat.com>
Signed-off-by: Pau Garcia Quiles <pgarciaq@redhat.com>
Made-with: Cursor
Author

pgarciaq commented Apr 24, 2026

Implementation update

Hi @pkliczewski — thanks again for the thorough review. All the enhancements we committed to in the review comments have now been implemented in both the proposal and the codebase. Here's a summary:

Proposal updates (this PR)

  • Closed Open Questions #1 and #2: kweb version tracking via validated git commits; status normalization across all backends
  • Production Readiness disclaimer added prominently in Summary and throughout
  • kweb configuration clarified: SP has no dependency on Python/kcli; kweb's default backend is environment-dependent (libvirt if sockets exist, KubeVirt in-pod, macOS Homebrew, or exit — not always libvirt)
  • Multi-backend deployment pattern documented with separate VM and cluster provider names per SP instance
  • Dual registration documented as a workaround; suggested SPM enhancement for multi-service-type registration
  • Cluster status: ACTIVE → READY to align with the ACM Cluster SP
  • Kubeconfig embedding follows the ACM SP pattern: base64-encoded kubeconfig + api_endpoint in GET /clusters/{id} when status is READY
  • VM access: ip and ssh_user returned in GET /vms/{id} responses
  • gRPC deprecation attributed to kcli maintainer (Karim Boumedhel)
  • Graduation criteria updated with the full status vocabulary for both VMs and clusters

Implementation (dcm-kcli-provider)

All the above changes are implemented and pushed to pgarciaq/dcm-kcli-provider:

  • OpenAPI schema updated with kubeconfig, api_endpoint, ip, ssh_user fields
  • GetCluster embeds base64 kubeconfig when status is READY; extractAPIEndpoint follows current-context for correctness with multi-cluster kubeconfigs
  • GetVM and ListVMs return ip and ssh_user at the top level
  • Observability: slog.Warn on silent kweb failures in GetVM/GetCluster
  • Safety: io.LimitReader (10 MB cap) on kweb HTTP responses
  • 6 new test cases covering error paths, base64 round-trip, and multi-cluster kubeconfig
  • All 141+ tests pass, go vet clean

…outes

Add implementation history entries for April 23-25:
- Kubeconfig and ssh_user in API responses
- mergeKcliHints() for forwarding provider_hints.kcli params to kweb
- Traefik routes example and combined Rego policy
- Cluster catalog item with image and node count fields
- v0.1.1 release

Also documents the node OS image override mechanism and additional
kweb parameter forwarding in the cluster creation section.

Made-with: Cursor
Signed-off-by: Pau Garcia Quiles <pgarciaq@redhat.com>
Collaborator

@gciavarrini gciavarrini left a comment

nit: The PR description states 148+ specs while the file says 141. I would suggest avoiding such a strict count, just to reduce confusion :)

Comment thread enhancements/kcli-sp/kcli-sp.md Outdated

The kcli Service Provider is a DCM Service Provider that manages virtual
machines and Kubernetes clusters through
[kcli](https://github.com/karmab/kcli)'s HTTP API (kweb). It is designed for
Collaborator

Author

@pgarciaq pgarciaq May 4, 2026

Good catch — done, added the kweb link. Thanks!



Comment thread enhancements/kcli-sp/kcli-sp.md Outdated
The kcli SP must successfully register with DCM for each service type it
provides. During startup, after the HTTP server is ready, the SP uses the DCM
registration client to send two requests to the SP API registration endpoint:
`POST /api/v1alpha1/providers`.
Collaborator

POST /api/v1/providers

Author

@pgarciaq pgarciaq May 4, 2026

You're right — the other enhancement docs (sp-registration-flow, kubevirt-sp) consistently use POST /api/v1/providers. Fixed. Thanks!
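The two startup registration requests could be sketched as below. The payload fields (`name`, `service_type`, `endpoint`) and the endpoint suffixes are hypothetical, since the actual schema is defined by the DCM registration flow; the `POST /api/v1/providers` path and the `kcli-vm` / `kcli-cluster` provider names are the ones mentioned in this conversation.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// registration is a hypothetical payload; the real SP API schema is defined
// by the DCM registration flow, not by this sketch.
type registration struct {
	Name        string `json:"name"`
	ServiceType string `json:"service_type"`
	Endpoint    string `json:"endpoint"`
}

// registrations builds one payload per service type served by the binary.
func registrations(baseURL string) []registration {
	return []registration{
		{Name: "kcli-vm", ServiceType: "vm", Endpoint: baseURL + "/vm"},
		{Name: "kcli-cluster", ServiceType: "cluster", Endpoint: baseURL + "/cluster"},
	}
}

// registerAll posts each payload to <dcmURL>/api/v1/providers and stops at
// the first transport error or non-2xx response.
func registerAll(client *http.Client, dcmURL string, regs []registration) error {
	for _, reg := range regs {
		body, err := json.Marshal(reg)
		if err != nil {
			return err
		}
		resp, err := client.Post(dcmURL+"/api/v1/providers", "application/json", bytes.NewReader(body))
		if err != nil {
			return fmt.Errorf("registering %s: %w", reg.Name, err)
		}
		resp.Body.Close()
		if resp.StatusCode >= 300 {
			return fmt.Errorf("registering %s: unexpected status %d", reg.Name, resp.StatusCode)
		}
	}
	return nil
}

func main() {
	// Stand-in for the DCM SP API so the sketch runs on its own.
	dcm := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusCreated)
	}))
	defer dcm.Close()

	err := registerAll(dcm.Client(), dcm.URL, registrations("http://kcli-sp.example:8080"))
	fmt.Println("registration error:", err)
}
```

Because each service type is a separate request, a failed cluster registration can be retried without touching the VM provider row.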

pgarciaq and others added 2 commits May 4, 2026 12:53
- Add hyperlink to kweb docs in the Summary section
- Fix registration endpoint from /api/v1alpha1/providers to /api/v1/providers
  to match the sp-registration-flow and kubevirt-sp enhancement docs

Signed-off-by: Pau Garcia Quiles <pgarciaq@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Per review feedback — the exact number goes stale as tests are added.

Signed-off-by: Pau Garcia Quiles <pgarciaq@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Author

pgarciaq commented May 4, 2026

@gciavarrini re: the spec-count nit — good catch, the number was stale in both the PR description and the document. I've dropped the exact count from both; the text now just says "Ginkgo specs across 8 suites" without a number, so it won't go stale again as tests are added.
