openstack-cgroup-tiering

Ideas about nova / watcher in RHOSO.

Technical Plan: Per-Host, Per-Domain Virtual Core Mapping in Nova & Watcher

Background and Goals

Each compute host in the RHOSO cloud is configured with

/etc/systemd/system.conf

# Pin systemd and spawned hierarchy to the first 4 physical
# cores on both sockets and their sibling threads (16 logical vcpu)
[Manager]
CPUAffinity=0-3,32-35,64-67,96-99

nova.conf

[compute]
...
# isolate instance workloads to the shared cpu set (112 logical vcpu)
# which the scheduler will balance across
cpu_shared_set = 4-31,36-63,68-95,100-127
...
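
The CPU counts quoted in the comments above (16 logical CPUs for systemd, 112 for instances) can be checked mechanically. The following is a minimal sketch; `expand_cpu_list` is an illustrative helper that handles only plain ids and inclusive ranges, not the `^` exclusion syntax that Nova also accepts.

```python
def expand_cpu_list(spec: str) -> set[int]:
    """Expand a CPU list such as "0-3,32-35" into a set of CPU ids.

    Illustrative helper only; this is not Nova's or systemd's own parser.
    """
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


system_cpus = expand_cpu_list("0-3,32-35,64-67,96-99")
shared_cpus = expand_cpu_list("4-31,36-63,68-95,100-127")

print(len(system_cpus))                      # 16
print(len(shared_cpus))                      # 112
print(system_cpus.isdisjoint(shared_cpus))   # True
```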

The cloud offers service tiers – Gold, Silver, Bronze – each with a different CPU share weight (Gold=512, Silver=333, Bronze=167, vs default 1024). These weights ensure a relative CPU time share under contention (e.g. a 2048-share VM gets twice the CPU time of a 1024-share VM). The goal is to achieve deterministic SLO guarantees: under full load on any host, Gold VMs collectively get ~50% CPU, Silver ~33%, Bronze ~17%, reflecting their share ratios. This is accomplished by a “Fully Distributed” placement scheme:

Equal Domain Capacity Per Host: Each hypervisor (say 112 logical CPUs available for VMs) is logically partitioned into 112 Gold vCPUs, 112 Silver vCPUs, and 112 Bronze vCPUs. This yields a fixed 3:1 oversubscription (336 vCPUs allocated per 112 host CPUs), with each host CPU potentially hosting one Gold, one Silver, and one Bronze vCPU concurrently. Linux CFS then schedules these with the configured weights, enforcing the intended 50/33/17% time split when all domains are busy.
Per-Host Domain Quota: On any given host, no more than 112 Gold vCPUs worth of instances should run (and same for Silver, Bronze). This guarantees no single host exceeds its budget for any tier, preserving cross-tier fairness host-by-host. Even if overall capacity in the cloud is free, the scheduler must avoid placing a new Gold VM on a host that already has 112 Gold vCPUs running. It should instead pick a host with available Gold capacity. In short, placement decisions need to consider per-host, per-tier utilization and skip hosts that would violate the tier’s quota on that host.
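
The 50/33/17 split follows directly from the share weights. A small sketch of the arithmetic, assuming every tier has the same number of busy vCPUs on the host (the name `tier_shares` is illustrative):

```python
TIER_WEIGHTS = {"gold": 512, "silver": 333, "bronze": 167}


def tier_shares(weights: dict[str, int]) -> dict[str, float]:
    """Return each tier's fraction of CPU time when all tiers are equally busy."""
    total = sum(weights.values())
    return {tier: weight / total for tier, weight in weights.items()}


shares = tier_shares(TIER_WEIGHTS)
for tier, share in shares.items():
    print(f"{tier}: {share:.1%}")
# gold: 50.6%
# silver: 32.9%
# bronze: 16.5%
```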

Problem: Nova’s default scheduler only tracks total VCPUs (with a global allocation ratio) and cannot inherently limit usage per service tier. We need a scheduling mechanism to enforce these per-host domain quotas. Two key design aspects are:

How to represent and track the “domain capacity” per host – using dynamic resource inventory.
Where to enforce it – in Nova’s scheduler (proactively during placement) or via Watcher (reactively optimizing after placement).

The solution must operate at per-host granularity for determinism (every host independently meets SLO fairness) and scale to many hosts/VMs. Below is a proposed approach for Nova and Watcher, including required flavor specs, metadata, or telemetry, and their pros/cons.

Representing Per-Host Domain Capacity

Dynamic Domain Core Inventory (Placement-Based)

Idea: Model each tier’s vCPU capacity as a custom resource class in OpenStack Placement. Each compute node will advertise three separate resource inventories: “Gold vCPU”, “Silver vCPU”, “Bronze vCPU” (in addition to normal VCPUs). The Nova scheduler will then automatically enforce per-host limits by scheduling against these resources. This essentially turns the problem into one of resource accounting, leveraging Placement’s atomic allocation tracking.

Implementation:
– Define Custom Resource Classes: We create custom resource class names for each domain, e.g. CUSTOM_VCPU_GOLD, CUSTOM_VCPU_SILVER, CUSTOM_VCPU_BRONZE. According to Nova’s rules, custom classes must be prefixed with CUSTOM_. We assign each class a capacity equal to one host’s worth of vCPUs for that tier. In our example, each compute node will have total=112 for each of CUSTOM_VCPU_GOLD/SILVER/BRONZE. No overallocation within each class is allowed (set allocation_ratio=1.0 for them, since we don’t want to permit more than 112 Gold vCPUs on that host). The three resource classes together allow up to 336 vCPUs to be allocated on the host (112 of each), matching the desired 3:1 oversubscription.
– Expose Inventory to Placement: Nova needs to report these custom resources for each compute node. In Antelope, this can be done via a provider config file (provider.yaml). Red Hat’s OpenStack on OpenShift supports defining additional inventories in a ConfigMap which Nova-compute reads on startup. In provider.yaml, for each compute, list the resource provider (by name or UUID) and add the custom inventories with the specified totals and allocation_ratio. For example:
```
meta:
  schema_version: '1.0'
providers:
  - identification:
      name: <hostname>
    inventories:
      additional:
        - CUSTOM_VCPU_GOLD:
            total: 112
            reserved: 0
            min_unit: 1
            max_unit: 112
            step_size: 1
            allocation_ratio: 1.0
        - CUSTOM_VCPU_SILVER:
            total: 112
            # ... (same fields) ...
        - CUSTOM_VCPU_BRONZE:
            total: 112
            # ...
```
This declares to Placement that each host has 112 units of each resource class available. Nova will register these resources when the compute service starts. (Note: The normal VCPU inventory will also exist, typically 112 with a cpu_allocation_ratio of 3.0 to allow 336 total vCPUs. It’s important to align that with the domain totals – more on this below.)
– Flavor Extra Specs: Each flavor corresponding to a service tier must request the appropriate custom resource class in addition to (or instead of) standard VCPUs. Nova’s flavor extra_specs syntax allows this: resources:$CUSTOM_RESOURCE_CLASS = $N (where $N is the number of units to allocate). For example, a Gold flavor with 4 vCPUs would include resources:CUSTOM_VCPU_GOLD=4. This tells the scheduler to allocate 4 units of CUSTOM_VCPU_GOLD from the chosen host’s inventory. We would do similarly for Silver and Bronze flavors (with CUSTOM_VCPU_SILVER or BRONZE). All tenant workloads should use flavors that include one of these specs, ensuring they consume from the correct domain pool. (It’s critical to enforce that every VM belongs to a tier, otherwise a VM launched with no domain resource request would bypass the capacity accounting. One way is to have no “untagged” flavors, or treat the default as Bronze by giving default flavors the Bronze resource request.)
– Placement Enforcement: When scheduling, Nova’s scheduler queries the Placement API for hosts that can satisfy the request of (say) 4 VCPU and 4 CUSTOM_VCPU_GOLD. Initially, every host has 112 free Gold units. If a host already has, e.g., 90 Gold vCPUs allocated, then 22 Gold units remain. Placement deems a host unfit once its available CUSTOM_VCPU_GOLD drops below the request, so once a host has 112 Gold allocated, it has 0 free Gold units and is no longer returned as a candidate for a Gold flavor. This automatically prevents allocation beyond 112 Gold vCPUs on any host. The same mechanism applies independently for Silver and Bronze. The scheduler doesn’t even need a custom filter – the Placement allocation-candidates query and the resources: extra specs handle it.
– Overall VCPU Coordination: We still want to ensure the total vCPUs on a host do not exceed 3x its CPU capacity. The cpu_allocation_ratio option ([DEFAULT] section of nova.conf, or an aggregate override) should be set to 3.0 so that 112 * ratio = 336 for a 112-CPU host. If it were left at 16:1, Placement would allow up to 1792 total VCPUs (112 * 16), which is far above our intended domain sum. However, because each domain resource caps at 112, no single domain can go beyond that, so in practice the total won’t exceed 336 even if allocation_ratio is higher, provided every flavor requests a domain resource. To be safe, aligning the global VCPU allocation ratio to 3.0 is recommended so that the normal VCPU inventory also imposes a 336 total limit. (An alternative is to not schedule on standard VCPUs at all by using the flavor override resources:VCPU=0. This would mean Placement ignores the generic VCPU and only uses custom classes for scheduling. This is an advanced tweak – it prevents double-counting VCPUs and ensures only the domain pools are considered. If every flavor has a custom vCPU resource, setting resources:VCPU=0 in that flavor’s extra_specs is possible to avoid redundant checks. Either approach works; keeping the normal VCPU accounting as a secondary check (with ratio=3) adds a safety net in case a flavor is misconfigured without a custom resource.)
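
The capacity check described in the implementation steps can be illustrated with a toy model. This sketch is not Placement code: the `Inventory`, `Host` and `place` names are hypothetical, and only the capacity rule `(total - reserved) * allocation_ratio` is taken from how Placement computes capacity. It ignores `min_unit`, `max_unit` and `step_size`.

```python
from dataclasses import dataclass, field


@dataclass
class Inventory:
    total: int
    reserved: int = 0
    allocation_ratio: float = 1.0
    used: int = 0

    @property
    def capacity(self) -> int:
        # Capacity rule: (total - reserved) * allocation_ratio
        return int((self.total - self.reserved) * self.allocation_ratio)

    def fits(self, amount: int) -> bool:
        return self.used + amount <= self.capacity


@dataclass
class Host:
    name: str
    inventories: dict[str, Inventory] = field(default_factory=dict)

    def can_host(self, request: dict[str, int]) -> bool:
        return all(
            rc in self.inventories and self.inventories[rc].fits(amount)
            for rc, amount in request.items()
        )

    def claim(self, request: dict[str, int]) -> None:
        for rc, amount in request.items():
            self.inventories[rc].used += amount


def new_host(name: str) -> Host:
    return Host(name, {
        "VCPU": Inventory(total=112, allocation_ratio=3.0),
        "CUSTOM_VCPU_GOLD": Inventory(total=112),
        "CUSTOM_VCPU_SILVER": Inventory(total=112),
        "CUSTOM_VCPU_BRONZE": Inventory(total=112),
    })


def place(hosts: list[Host], request: dict[str, int]):
    """Claim the request on the first host that fits; return its name or None."""
    for host in hosts:
        if host.can_host(request):
            host.claim(request)
            return host.name
    return None


hosts = [new_host("compute-0"), new_host("compute-1")]
gold_1vcpu = {"VCPU": 1, "CUSTOM_VCPU_GOLD": 1}
placements = [place(hosts, gold_1vcpu) for _ in range(230)]

print(placements.count("compute-0"),
      placements.count("compute-1"),
      placements.count(None))  # 112 112 6
```

In this model the 113th Gold vCPU spills to the second host and the last six requests find no valid host, while Silver and Bronze requests on either host remain unaffected.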
Benefits:
– First-Class Tracking: Domain capacities are modeled as real resources in OpenStack. This means you can query usage via Placement APIs or openstack resource provider inventory/usage show. It’s clearly visible how many Gold vCPUs each host has free or used. The scheduler’s decisions are based on a unified view of resource allocations, which is atomically updated on each scheduling and instance creation. This eliminates race conditions – Placement will not allow two schedulers to allocate the same remaining Gold slot concurrently due to its transaction logic.
– Proactive and Deterministic: Enforcement happens at scheduling time via a robust mechanism. A host cannot even be selected by the scheduler once its domain capacity is exhausted, guaranteeing SLO compliance from the start. This satisfies the deterministic per-host SLO requirement by construction.
– No Custom Scheduler Code: We leverage existing Nova scheduler capabilities (the Placement candidate query and custom resource requests). Nova started supporting custom resource classes in flavors in Pike, so Antelope fully supports this. There is no need to write a new filter plugin. We only need to configure Nova (provider inventories and flavors). This minimizes maintenance – we’re using supported extension points rather than out-of-tree code.
– Scales Naturally: Placement is designed to scale to large numbers of resource providers and allocations. The overhead of tracking a few extra resource classes per host is minimal. The scheduler’s flow remains standard (Placement returns candidates, then normal filters/weighers apply). We can still use normal filters (ComputeFilter, etc.) and also apply a custom weigher if desired to spread load. In fact, with this approach, the existing CPUWeigher with a positive multiplier will naturally distribute VMs across hosts since it prefers hosts with more free VCPUs. Since free VCPU correlates with free domain slots, Gold VMs will tend to spread out instead of clustering, which aligns with fairness goals.
– Flexibility: While our scenario is equal 112/112/112 split, this approach could support different per-tier ratios by setting different inventory totals or even per-host variations if needed (e.g., some specialized host could advertise more Gold capacity and less Bronze, etc., if that were a policy). All that is adjustable in the inventory config. Also, if in the future the oversubscription policy changes (say adding a Platinum tier or changing ratios), updating the provider configs and adding new resource classes is straightforward compared to rewriting filtering logic.
– Alignment with Upstream Features: This approach uses Placement in the way it’s intended – similar patterns are used for GPUs, hugepages, etc., where you define custom resource classes for each consumable. Here the consumable is “vCPU slot in Gold pool”. It’s a clean design. OpenStack docs even give examples of declaring custom resources (like CUSTOM_DISK_IOPS) and requesting them via flavor metadata. We’re essentially treating Gold/Silver/Bronze capacity as a resource to be consumed.

Feasibility: This option is fully feasible in Antelope (2023.1). The Nova scheduler was enhanced in Pike to honor resources:$CUSTOM_CLASS specs, and the provider config file method is available in recent OpenStack versions (OSP 17+/Wallaby and above). The approach scales well and can be largely automated (script the generation of provider.yaml from an inventory of hosts and apply it via the OpenShift OpenStack operator ConfigMap). Once in place, day-to-day scheduling and even minor version upgrades should continue to work, since it’s using standard Nova/Placement interfaces.
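
Generating provider.yaml from a host inventory could look like the sketch below. The hostnames are placeholders, `provider_yaml` is a hypothetical helper, and the file layout (`meta.schema_version`, `identification`, list-valued `additional`) is my understanding of the upstream provider config schema; it should be checked against the Nova documentation for the deployed release.

```python
TIER_CLASSES = ("CUSTOM_VCPU_GOLD", "CUSTOM_VCPU_SILVER", "CUSTOM_VCPU_BRONZE")


def provider_yaml(hosts: dict[str, int]) -> str:
    """Render provider config text for {hostname: shared_cpu_count}."""
    lines = ["meta:", "  schema_version: '1.0'", "providers:"]
    for hostname, cpus in sorted(hosts.items()):
        lines += [
            "  - identification:",
            f"      name: {hostname}",
            "    inventories:",
            "      additional:",
        ]
        for resource_class in TIER_CLASSES:
            lines += [
                f"        - {resource_class}:",
                f"            total: {cpus}",
                "            reserved: 0",
                "            min_unit: 1",
                f"            max_unit: {cpus}",
                "            step_size: 1",
                "            allocation_ratio: 1.0",
            ]
    return "\n".join(lines) + "\n"


# Placeholder hostnames; the counts would come from each host's cpu_shared_set.
text = provider_yaml({
    "compute-0.example.com": 112,
    "compute-1.example.com": 112,
})
print(text)
```

The text is built by hand to keep the sketch free of third-party dependencies; a YAML library would be the more usual choice in real tooling.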

Enforcing Domain Placement Policies

With the above mapping approaches in mind, we now consider how to enforce the per-host quotas and SLOs at runtime. Essentially, enforcement means ensuring that the running cloud never violates the domain capacity rules and thus maintains the intended share fairness. There are two points in time to enforce: at scheduling (placement time) or after scheduling (continuous monitoring and correction). Nova’s scheduler is the natural place for proactive enforcement, while OpenStack Watcher can provide reactive enforcement through optimization audits. We will explore both and also mention how they can complement each other.

Nova Scheduler–Based Enforcement

Using Nova’s scheduler for enforcement means the rules are applied before a VM is placed on a host – preventing violations from occurring.

With Dynamic Placement: Enforcement is largely handled by Placement. Nova’s scheduler requests the needed custom resource and Placement returns only hosts with available inventory. Therefore, the Placement query plus the standard filters (ComputeFilter and so on) are sufficient – no additional custom filter code is needed to enforce the limits. However, we can still use a custom weigher if we want to fine-tune placement beyond what the free resource count does. For example, counting free domain slots could be a good metric to weigh on. In practice, weighing by free VCPUs (with a spread strategy) achieves a similar result: since each host initially has 336 total slots, a host that has consumed a lot of Gold will also have consumed a lot of total VCPUs, making it less attractive if you’ve configured spreading (positive ram_weight_multiplier and cpu_weight_multiplier values, 1.0 by default, spread load across hosts). If needed, the MetricsWeigher could be used with a custom metric such as “gold_used”, but that requires feeding such metrics to Nova. This is probably unnecessary given that Placement already prevents violations and basic weighing prevents extreme imbalance.
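
To make the spreading behaviour concrete, here is a simplified sketch of weighing candidates by free vCPUs and free RAM. It mimics the general idea (min-max normalise each metric, multiply by its multiplier, sum) but is not Nova’s weigher API; the host names and numbers are made up.

```python
def normalize(values):
    """Min-max scale values to [0, 1]; all zeros when every value is equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_hosts(candidates, cpu_multiplier=1.0, ram_multiplier=1.0):
    """Order candidate hosts, best first.

    candidates: list of (name, free_vcpus, free_ram_mb) tuples.
    Positive multipliers favour emptier hosts (spread); negative ones
    favour fuller hosts (pack).
    """
    cpu_scores = normalize([c[1] for c in candidates])
    ram_scores = normalize([c[2] for c in candidates])
    scored = [
        (cpu_multiplier * cpu + ram_multiplier * ram, name)
        for (name, _, _), cpu, ram in zip(candidates, cpu_scores, ram_scores)
    ]
    return [name for _, name in sorted(scored, key=lambda item: -item[0])]


candidates = [
    ("compute-0", 40, 65536),
    ("compute-1", 300, 262144),
    ("compute-2", 180, 131072),
]
print(rank_hosts(candidates))  # ['compute-1', 'compute-2', 'compute-0']
```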

Prerequisites: For Nova scheduler enforcement, we need to ensure:

Flavor tier labeling: As discussed, flavors must indicate their tier either via name or extra_specs. In a static (filter-based) approach, the filter logic would rely on this (e.g., check flavor.extra_specs[service_tier]). In the dynamic approach, flavors must carry the custom resource extra_specs so that Placement knows what to allocate. This is critical: the scheduler can’t enforce what it can’t detect. An audit of all flavors should therefore verify that each one is correctly configured for Gold/Silver/Bronze. You might create new flavors specifically for this and deprecate old ones.
Nova Scheduler Config: In the dynamic approach, we point nova-compute at the provider config (e.g., mount the ConfigMap with provider.yaml and set provider_config_location = /etc/nova/provider.yaml in the [compute] section on each nova-compute, if not automatically done by the operator) and adjust cpu_allocation_ratio if needed.
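
The flavor audit mentioned in the prerequisites can be sketched as a pure function over flavor data. How the flavors are fetched (CLI, SDK, database export) is left out; the dictionary layout and the `audit_flavor` name are assumptions for illustration.

```python
TIER_RESOURCES = (
    "resources:CUSTOM_VCPU_GOLD",
    "resources:CUSTOM_VCPU_SILVER",
    "resources:CUSTOM_VCPU_BRONZE",
)


def audit_flavor(name, vcpus, extra_specs):
    """Return a list of problems for one flavor (empty if compliant)."""
    tier_keys = [key for key in TIER_RESOURCES if key in extra_specs]
    if len(tier_keys) != 1:
        return [f"{name}: expected exactly one tier resource, found {len(tier_keys)}"]
    requested = int(extra_specs[tier_keys[0]])  # extra spec values are strings
    if requested != vcpus:
        return [f"{name}: requests {requested} tier units but has {vcpus} vCPUs"]
    return []


# Made-up flavors standing in for the real flavor list.
flavors = [
    {"name": "m1.gold", "vcpus": 4,
     "extra_specs": {"resources:CUSTOM_VCPU_GOLD": "4"}},
    {"name": "m1.legacy", "vcpus": 2, "extra_specs": {}},
    {"name": "m1.silver", "vcpus": 8,
     "extra_specs": {"resources:CUSTOM_VCPU_SILVER": "4"}},
]

problems = []
for flavor in flavors:
    problems += audit_flavor(flavor["name"], flavor["vcpus"], flavor["extra_specs"])
for problem in problems:
    print(problem)
```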

Pros (Nova-side enforcement):
– Ensures no SLO violations at time of placement. At no point will a host have more Gold VMs than it should, meaning the share-based fairness holds by design. For example, if Gold capacity is 50% of cores, a Gold VM will never land on a host where Gold already occupies that 50% – so it will never end up with less than its intended share due to overallocation of Gold on that host.
– Implementation is contained within the scheduling process, which means decisions are made with up-to-date cluster state (especially if using Placement, which is strongly consistent for allocations).
– Fail-fast behavior: If the cloud is at capacity for a given tier (e.g., all hosts have 112 Gold vCPUs already), a new Gold instance will simply fail to schedule (NoValidHost), rather than being scheduled and then contending poorly. This is good because it surfaces to the user that capacity is exhausted for that tier, allowing them to take action (instead of launching an instance that performs badly or violates guarantees).
– Leverages existing Nova components: either via a small extension (filter) or via built-in Placement. This keeps the solution lightweight in terms of moving parts – it’s just part of the normal VM launch flow.

Cons / Considerations:
– Nova’s scheduler only makes initial placement decisions. Once a VM is launched, Nova won’t move it or adjust automatically if conditions change. If an admin manually changes flavors or if usage patterns shift (outside the scope of scheduling), Nova won’t revisit the placement. For example, if someone mistakenly reconfigures a running VM to a different tier (not trivial to do, but hypothetically), Nova won’t re-check the host’s compliance. Or if a host was within limits but then you reduce the capacity setting (policy change), Nova won’t automatically migrate VMs; the new limit would only apply to new placements. This means any policy drift or needed rebalancing is not handled by the scheduler – this is where Watcher can complement.
– There is no historical or utilization-based component in Nova’s decision beyond the static counts. Nova doesn’t measure actual CPU usage or performance – it only knows allocations. So, as long as the allocations are within the limits, Nova is satisfied. In rare cases, actual performance might differ (e.g., if Bronze VMs on a host are all idle, Gold VMs on that host actually get nearly 100% CPU – which is fine, but if suddenly those Bronze wake up, Gold share will drop to expected 50%). Nova won’t anticipate or react to such load changes since it doesn’t monitor runtime – it just ensured capacity for the scenario of full contention.
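
The idle-Bronze example can be made concrete with an idealized model. The sketch below treats every busy vCPU as an independently weighted, CPU-bound thread and caps each at one host CPU, redistributing leftover time (a water-filling calculation). Real CFS behaviour with cgroup hierarchies, SMT and bursty load will differ, and `per_vcpu_cpu` is an illustrative name.

```python
WEIGHTS = {"gold": 512, "silver": 333, "bronze": 167}


def per_vcpu_cpu(active, cpus=112, weights=WEIGHTS):
    """Estimate the CPU fraction each busy vCPU of a tier receives.

    active maps tier -> number of fully busy vCPUs. No vCPU can use more
    than one host CPU, so tiers that hit that cap are fixed at 1.0 and the
    leftover CPU time is re-divided among the remaining tiers.
    """
    result = {tier: 0.0 for tier in active}
    pending = {tier: n for tier, n in active.items() if n > 0}
    remaining = float(cpus)
    while pending:
        total_weight = sum(weights[t] * n for t, n in pending.items())
        per_weight = remaining / total_weight
        capped = [t for t in pending if weights[t] * per_weight >= 1.0]
        if not capped:
            for t in pending:
                result[t] = weights[t] * per_weight
            break
        for t in capped:
            result[t] = 1.0
            remaining -= pending.pop(t)
    return result


full = per_vcpu_cpu({"gold": 112, "silver": 112, "bronze": 112})
gold_only = per_vcpu_cpu({"gold": 112, "silver": 0, "bronze": 0})

print(round(full["gold"], 3))       # 0.506
print(round(gold_only["gold"], 3))  # 1.0
```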

Overall, Nova scheduler-based enforcement (especially with the dynamic placement approach) is the recommended primary mechanism to achieve deterministic placement. It’s implemented at the source of truth (Placement), scales well, and prevents violations by construction.

Watcher-Based Enforcement (Reactive)

OpenStack Watcher is an infrastructure optimization service that can be used to monitor and auto-correct the environment to meet certain goals. In this context, Watcher could enforce the domain mapping rules by moving workloads after initial placement, or by adjusting how capacity is used. This is a reactive approach: if the scheduler does not strictly prevent an imbalance, Watcher can detect it and then trigger migrations to fix it. It can also potentially optimize placement over time for better performance isolation. Here’s how a Watcher solution might look:

Architecture: Watcher’s Decision Engine runs periodic audits using a strategy plugin that you can customize. You would create a custom strategy, say “DomainCapacityEnforcementStrategy”. This strategy’s goal would be to ensure no host exceeds N vCPUs of any tier and to balance tiers across hosts if needed for performance. The strategy would query cluster data (from Nova and Telemetry) and generate action plans (like migrating a VM from HostA to HostB) to correct any issues. The Watcher API and Planner would then execute these migrations live (via Nova).
Data Required: To decide on optimizations, the strategy needs to know: per-host breakdown of running VMs by tier, and possibly the CPU utilization of those VMs or hosts. It has a few ways to get this:
(a) Nova API/Placement data for allocations – with the Placement approach you can read directly from Placement how many CUSTOM_VCPU_GOLD units are allocated on each host. Without it, the strategy can list the instances on each host and group them by flavor extra_specs to count Gold/Silver/Bronze usage.
(b) Telemetry data for actual CPU usage – e.g., using Ceilometer/Gnocchi or Monasca. Watcher usually relies on the Telemetry service for metrics. You might configure Ceilometer to gather CPU utilization or steal time metrics per instance. This can be used to detect whether a host’s VMs are actually contending and hitting limits. For example, if a host has 112 Gold and 112 Silver vCPUs and both groups are 100% busy, that’s expected contention between two tiers (by weight, Gold gets roughly 61% and Silver roughly 39% of each CPU, since Bronze is absent). If one group is not busy, SLOs aren’t being tested. The strategy might focus on cases where contention is real or likely.
Strategy Logic: The custom strategy (written in Python, extending Watcher’s base Strategy class) would implement an execute() method that:
(a) Scans each host’s current usage per tier. If any host has more vCPUs of a tier than allowed (e.g., 120 Gold due to some manual change or scheduler miss), that’s a violation to fix immediately – plan to migrate some Gold instances off that host to bring it to 112.
(b) Even if within hard limits, the strategy can look at imbalances. Perhaps Host1 has 112 Gold/0 Silver, Host2 has 0 Gold/112 Silver. This technically meets the quotas, but it means those hosts are effectively single-tier (all Gold on one, all Silver on another). This might be acceptable (no cross-tier interference on either host), but if the goal is to distribute load to avoid risk (e.g., if one host fails, all Gold are hit), the strategy might decide to shuffle some VMs (move some Gold VMs from Host1 to Host2 and some Silver from Host2 to Host1) to mix them. This would ensure that each host sees a blend of tiers, realizing the fully distributed design. Whether you want this depends on if you prefer strict adherence to the “each host has equal parts of each tier” ideal even when overall usage is skewed.
(c) Consider performance metrics: The split a tier actually achieves depends on how many vCPUs of each tier are busy. Suppose a host has 112 Gold, 56 Silver and 56 Bronze vCPUs all fully active (224 active vCPUs). Gold’s collective share is then 112*512 / (112*512 + 56*333 + 56*167) = 57344 / 85344 ≈ 67%, so Gold stays above its 50% target in that case. Now imagine an imbalance such as 112 Gold vs 1 Silver vCPU fully active: by weight, the single Silver vCPU is entitled to only 333 / (112*512 + 333) ≈ 0.6% of the host’s CPU time because it is outnumbered by Gold on that host. If such a scenario occurred and telemetry showed that the Silver VM on that host is starved (high CPU ready time or low achieved cycles), Watcher could decide to migrate it to another host with fewer Gold competitors. In essence, Watcher can use actual SLO measurements (like CPU usage or latency metrics if available) to fine-tune placements. This goes beyond what Nova can do, entering the realm of QoS optimization.
(d) Output an action plan: e.g., “Migrate instance X (Gold) from HostA to HostB” or “Migrate instance Y (Silver) from HostA to HostC”. It would ensure after these migrations, all hosts adhere to the desired state (no quotas exceeded, and maybe more balanced). The Watcher Applier then executes these live migrations one by one (honoring any migration concurrency limits to avoid overloading the network).
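
Steps (a) and (d) of the strategy logic can be sketched independently of Watcher. The code below is not a Watcher strategy plugin and uses none of its classes; `Instance` and `plan_migrations` are hypothetical names, and the input is a plain mapping of host name to instances.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    tier: str
    vcpus: int


def plan_migrations(hosts, limit=112):
    """Return (instance, source, destination) moves that clear tier overages.

    hosts maps host name -> list of Instance. Larger instances are moved
    first, so a move may free more than the strict excess.
    """
    usage = {host: defaultdict(int) for host in hosts}
    for host, instances in hosts.items():
        for inst in instances:
            usage[host][inst.tier] += inst.vcpus

    moves = []
    for src, instances in hosts.items():
        for inst in sorted(instances, key=lambda i: i.vcpus, reverse=True):
            if usage[src][inst.tier] <= limit:
                continue  # this tier is within budget on the source host
            dest = next(
                (h for h in hosts
                 if h != src and usage[h][inst.tier] + inst.vcpus <= limit),
                None,
            )
            if dest is None:
                continue  # nowhere to put it without creating a new violation
            usage[src][inst.tier] -= inst.vcpus
            usage[dest][inst.tier] += inst.vcpus
            moves.append((inst.name, src, dest))
    return moves


cluster = {
    "host-a": [Instance(f"gold-a-{i}", "gold", 4) for i in range(30)],  # 120 Gold
    "host-b": [Instance(f"gold-b-{i}", "gold", 4) for i in range(10)],  # 40 Gold
}
moves = plan_migrations(cluster)
print(moves)
```

With these inputs the plan moves two 4-vCPU Gold instances from host-a to host-b, which brings host-a back to 112 Gold vCPUs.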
Prerequisites for Watcher:
– Telemetry: As noted, configure the Telemetry service (Ceilometer or Gnocchi) to collect needed metrics (CPU utilization per instance/host, etc.). Even if the strategy initially only uses allocation counts, having performance data can let it prioritize which move yields the biggest SLO improvement. For example, if all Gold VMs on a host are idle, it doesn’t matter that there are 100 of them; moving one Silver there won’t hurt, so Watcher can avoid unnecessary migrations. Metrics like CPU utilization, load, or steal time can guide this.
– Watcher Deployment: Ensure Watcher is installed (there is a Watcher operator for OpenShift deployments). This involves the Watcher API, Decision Engine, and Applier services, plus a MySQL DB for Watcher. In Antelope, Watcher should be available (if not already deployed, it’s an additional component to set up).
– Strategy Development: Write the custom strategy code and add it to Watcher. This typically means creating a Python package with your strategy class and entrypoint, then deploying it where Watcher can load it (e.g., include it in the Watcher container image or as an installable plugin). The OpenStack Watcher docs provide guidance for writing a strategy plugin. You’ll define a name for it and possibly a custom goal like “DETERMINISTIC_SLO” or reuse a generic goal like “BALANCE_LOAD” with your strategy.
– Audit Configuration: Create a Watcher Audit Template that uses this strategy (and goal) on the scope of all hosts. You can schedule this audit to run periodically (say every 15 minutes or 1 hour, depending on how quickly you want to respond to changes). You could also set it to run on-demand or triggered by events (though event-triggered audits require some integration; e.g., a message from Nova could potentially trigger Watcher, but simplest is periodic).
Pros:
– Ensures Ongoing Compliance: Watcher provides a safety net. If for any reason the Nova scheduler approach failed or was not strict (e.g., if you opted not to implement the Nova solution fully), Watcher would catch and fix violations. Even with Nova enforcement, Watcher can help maintain balance over time, not just at the instant of scheduling. It addresses “what if something changes later?” by continuously monitoring.
– Performance Awareness: Watcher can incorporate real performance metrics, not just static allocations. This means it can optimize for actual SLO fulfillment. For example, it might observe that all Bronze VMs rarely use CPU – in that case, even if one host has more than its share of Bronze, it might not matter. Conversely, if a particular workload is sensitive, Watcher could make sure it’s placed in the best possible host environment. This context awareness is something a one-time scheduler decision cannot achieve.
– Complex Optimization Goals: The strategy could be extended to also consider other factors (network bandwidth, latency, host thermal state, etc.) if needed. Watcher is built for multi-metric optimization. In our case, the primary goal is CPU fairness, but it could incorporate, say, ensuring high-tier (Gold) VMs are spread out enough to avoid crowding the same socket or sibling threads on a host – e.g., if hyper-threading or NUMA considerations come in, the strategy could account for those too. It’s very flexible.
– No Impact on Initial Launch: Watcher operates in the background, so using it doesn’t slow down or complicate the initial VM scheduling process. VMs launch as normal (possibly even ignoring tiers, if one chose to). Then Watcher “cleans up” placements. In some environments, operators prefer an eventual consistency approach for complex policies to keep the provisioning fast and simple, and then correct things after. Watcher enables that pattern.
– Operational Visibility: Watcher will produce reports of its audits and actions. This gives operators a clear view of how the system is managing the workloads. You’ll see “Audit at 12:00 moved 2 instances from HostX to HostY to rebalance Gold capacity,” etc. This can be useful to ensure the policy is actually effective and to adjust thresholds if needed.
Cons:
– Reactive (Potential SLO Breach Window): By design, Watcher acts after an imbalance occurs. This means there could be periods where a host is violating the intended policy until the next audit. During that window, SLOs might not hold. For example, if Nova allowed 120 Gold on a host (in absence of proper checks), those VMs would contend and each get less CPU than expected until Watcher migrates 20 of them away. If the audit runs infrequently or is backlogged, this condition could last minutes to hours. Thus, Watcher alone is not ideal for strict determinism – it’s more about eventual compliance. To mitigate this, one could run audits very frequently, but that adds overhead. It’s better used as a complement to a mostly proactive strategy (Nova) or when occasional slight breaches are acceptable.
– Reliance on Telemetry: Accurate Telemetry is crucial if using performance metrics. Misleading metrics could cause bad decisions (e.g., if Ceilometer reports a spike that’s not actually sustained, Watcher might migrate unnecessarily). Ensuring the Telemetry pipeline is tuned and reliable is another operational concern. Also, Telemetry itself introduces some overhead on the system (pollsters running, etc.), though modern systems can handle this at scale with Ceilometer or Gnocchi.
– No Direct Capacity Reconfiguration: One thing Watcher cannot easily do is dynamically change the defined capacities (like altering the 112/112/112 split on the fly). It deals with moving workloads given the current state. If you ever wanted to temporarily let a host exceed 112 in one tier because another tier is empty (borrowing capacity), Nova’s placement would block it unless you reconfigure inventories. Watcher could detect an empty Bronze pool and suggest “we could put more Gold here,” but it can’t override the Placement constraint unless it triggers a reconfiguration (which is outside its normal scope).

Using Nova and Watcher Together: It’s worth noting that these approaches are not mutually exclusive. A sensible plan is: use Nova’s scheduler (preferably with dynamic resource classes) as the primary enforcement, and deploy Watcher as a secondary layer to monitor compliance and optimize placements over time. Nova will handle the gross partitioning (ensuring no new violation), and Watcher can handle any unforeseen issues or improvements: for example, if certain hosts become hot spots or if workload patterns shift (maybe many Bronze VMs got deleted on HostA and many Gold on HostB – Watcher could shuffle to re-spread remaining VMs). In this combo, Watcher might have less work to do (since flagrant violations won’t happen thanks to Nova), focusing mostly on performance balancing rather than hard rule enforcement. That minimizes Watcher’s impact (migrations should be rarer and only when beneficial).

Recommendation and Integration Plan

Considering the above, here is a recommended stack-ranked approach for deterministic per-host domain placement, with integration steps and rationale:

1. Nova Scheduler with Dynamic Domain Inventories (Primary Solution) – Highest suitability.
Implement the custom Placement resource classes for Gold, Silver, Bronze and require flavors to consume them. This gives a robust, deterministic guarantee of per-host quotas. Steps to integrate:

Prepare a provider.yaml listing each compute host’s UUID and adding inventories for CUSTOM_VCPU_GOLD, CUSTOM_VCPU_SILVER, CUSTOM_VCPU_BRONZE with totals equal to that host’s cpu_shared_set count (e.g., 112). Use allocation_ratio: 1.0 for each to disallow overcommit per tier. Create the ConfigMap and link it to the NovaCompute pods, then restart computes to register the new resources. Verify in Placement (e.g., openstack resource provider inventory list <host> shows the new classes).
Define new flavors or update existing ones for each service tier. For example, for each flavor size, create three variants (Gold, Silver, Bronze) identical in vCPU/RAM except for an extra_spec: resources:CUSTOM_VCPU_GOLD = <vcpus> (or Silver/Bronze accordingly). Optionally set resources:VCPU=0 on these flavors to rely solely on custom classes. Also set the CPU shares extra_spec (quota:cpu_shares=512 etc.) to ensure the guest’s scheduler weight is correct – this you likely already have. Ensure the naming or description clearly indicates the tier, and communicate to users that they must choose the correct flavor for their workload’s QoS tier.
Adjust Nova config: set cpu_allocation_ratio=3.0 on the computes (or whatever value allows an overall 3x) for consistency. Filter tweaks are generally not needed: the default filters (ComputeFilter, etc.), applied on top of the Placement query, are enough. If you want to enforce that no VM launches without specifying a tier, you could add a custom filter to reject instances whose flavor has no CUSTOM_VCPU_* resource request. This is optional if you simply ensure via flavor management that all flavors have one.
Testing: Launch a batch of, e.g., 120 small Gold instances (1 vCPU each, for example) and confirm that no host ends up with more than 112 Gold vCPUs (the 113th should schedule to a different host, or fail if every host already has 112 Gold vCPUs). Do the same for Silver and Bronze. Check that aggregate usage matches expectations: if you fully load all domains on all hosts, each host should carry about 336 vCPUs, 112 from each tier.
Pros: This satisfies the requirement directly and is maintainable. Once set up, Nova will naturally handle new requests. The admin overhead is mostly in flavor management and updating provider.yaml if new computes are added (which can be automated via scripting or Ansible when scaling out).
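The flavor step above can be scripted so that the three tier variants never drift apart. The following is a minimal sketch, not a definitive implementation: the base flavor name m1.small, its size, and the "<base>.<tier>" naming scheme are hypothetical, and quota:cpu_shares is assumed to be the extra spec that carries the share weight in your deployment. It only builds the openstack flavor create command lines; it does not call any API.

```python
"""Sketch: derive Gold/Silver/Bronze flavor variants from one base size."""
import shlex

# Tier name -> CPU share weight, taken from the tiers described in this plan.
TIER_SHARES = {"gold": 512, "silver": 333, "bronze": 167}


def tier_extra_specs(tier: str, vcpus: int) -> dict:
    """Extra specs that make a flavor consume one tier's per-host inventory."""
    return {
        f"resources:CUSTOM_VCPU_{tier.upper()}": str(vcpus),
        "resources:VCPU": "0",  # optional: rely solely on the custom class
        "quota:cpu_shares": str(TIER_SHARES[tier]),  # assumed extra spec name
    }


def flavor_create_command(base: str, tier: str, vcpus: int,
                          ram_mb: int, disk_gb: int) -> str:
    """Build one 'openstack flavor create' command line for a tier variant."""
    args = ["openstack", "flavor", "create",
            "--vcpus", str(vcpus), "--ram", str(ram_mb), "--disk", str(disk_gb)]
    for key, value in tier_extra_specs(tier, vcpus).items():
        args += ["--property", f"{key}={value}"]
    args.append(f"{base}.{tier}")  # hypothetical naming scheme
    return shlex.join(args)


if __name__ == "__main__":
    # Hypothetical base size: 2 vCPU, 4 GiB RAM, 20 GiB disk.
    for tier in TIER_SHARES:
        print(flavor_create_command("m1.small", tier, 2, 4096, 20))
```

Generating all variants from a single table keeps the vCPU/RAM/disk values identical across tiers, so only the resource class and share weight differ.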

2. Watcher with Custom SLO Strategy (Supplementary/Advanced) – Use as needed.
After implementing Nova-based controls, decide if the added complexity of Watcher is justified in your environment. If your workloads and placement are relatively static and the Nova scheduler does a good job, Watcher might not be strictly necessary. However, if you have very dynamic scenarios or want to maximize utilization while still guaranteeing SLOs, Watcher can be an excellent tool. For example, if you wanted to allow temporary overallocation of one tier when others are idle (to reduce wastage) and then automatically correct it when contention arises, Watcher would be the way to implement that policy (since Nova’s static placement won’t allow it by itself). In such a case, you’d intentionally loosen Nova’s constraints (maybe set Gold inventory higher or allow borrowing) and rely on Watcher to migrate VMs when conflict occurs – a more complex but flexible scheme.

Integration steps for Watcher (if chosen):

Deploy the Watcher Decision Engine, API, and Applier in the OpenShift cluster. Use the Watcher K8s operator if available for easier deployment. Verify it can access Nova and Ceilometer APIs (credentials and network).
Develop the custom strategy plugin. Based on the earlier design, implement the logic that keeps per-host allocations under their thresholds and balances workloads. You can start simple: e.g., detect hosts where a tier exceeds its limit (112 vCPUs in this plan) and fix them, then add balancing logic later if desired. The strategy can leverage Watcher’s cluster data model, which includes information on compute nodes and instances. With the Placement approach, direct violations won’t occur, so the strategy might focus purely on balancing for performance (e.g., keeping the proportions of active tiers similar across hosts where possible, to avoid extreme cases such as a 100-to-1 mix of tiers on one host).
Register the strategy with Watcher (via an entry point in the setup.cfg of your plugin package). Then configure an audit template. For example, define a goal “continuous_slo_compliance” and set your strategy as the only one for that goal. Schedule audits every 30 minutes. Monitor Watcher’s first runs and review the recommended actions. You can run Watcher in audit-only mode initially (logging suggestions without executing actions) to check that it makes sane decisions. Once confident, allow it to apply actions automatically.
Ensure proper notifications: Watcher can publish events or notifications when it migrates instances. You might want to inform users or at least log such events for transparency. Also set limits – e.g., maybe migrate at most X instances per audit to limit impact.
Over time, evaluate if Watcher’s actions are actually beneficial (e.g., improved CPU utilization fairness, better response times for Gold VMs under load, etc.). If Nova’s built-in scheduling already kept things well-distributed, Watcher might end up mostly idle (which is fine). If not, it will help correct any imbalances.
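The "start simple" version of the strategy logic described above (detect per-tier overages on a host, then propose a capped number of migrations) can be sketched in plain Python. This is an illustration under stated assumptions, not Watcher plugin code: the Instance tuple, the function names, and the max_moves cap are hypothetical, and a real strategy would read hosts and instances from Watcher’s cluster data model and emit migrate actions instead of returning tuples.

```python
"""Sketch: find per-host, per-tier overages and propose capped migrations."""
from collections import defaultdict
from typing import NamedTuple


class Instance(NamedTuple):
    uuid: str
    host: str
    tier: str   # "gold", "silver" or "bronze"
    vcpus: int


def tier_usage(instances):
    """Sum allocated vCPUs per (host, tier) pair."""
    usage = defaultdict(int)
    for inst in instances:
        usage[(inst.host, inst.tier)] += inst.vcpus
    return dict(usage)


def plan_migrations(instances, hosts, limit=112, max_moves=5):
    """Return (uuid, source, destination) moves that reduce overages.

    At most max_moves moves are proposed per audit to limit impact.
    """
    usage = defaultdict(int, tier_usage(instances))
    moves = []
    # Deterministic order: per host and tier, consider the largest VMs first.
    ordered = sorted(instances, key=lambda i: (i.host, i.tier, -i.vcpus, i.uuid))
    for inst in ordered:
        if len(moves) >= max_moves:
            break
        if usage[(inst.host, inst.tier)] <= limit:
            continue  # this host is within budget for this tier
        # Destinations that can take the VM without exceeding the tier limit.
        candidates = [h for h in hosts
                      if h != inst.host
                      and usage[(h, inst.tier)] + inst.vcpus <= limit]
        if not candidates:
            continue  # nowhere to move it; leave for a later audit
        dest = min(candidates, key=lambda h: (usage[(h, inst.tier)], h))
        usage[(inst.host, inst.tier)] -= inst.vcpus
        usage[(dest, inst.tier)] += inst.vcpus
        moves.append((inst.uuid, inst.host, dest))
    return moves
```

Choosing the least-loaded destination for the tier also nudges hosts toward similar per-tier proportions, which is the balancing goal mentioned above; a production strategy would additionally weigh migration cost and live metrics.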

Minimizing Operational Complexity:

Favor configuration over customization where possible. Using built-in mechanisms (resource classes, flavor specs) reduces the amount of custom code to maintain. It leverages OpenStack’s strengths (Placement’s solver for resource allocation) rather than duplicating that logic. Red Hat’s platform specifically provides a supported way to inject provider configs, making Dynamic Placement quite admin-friendly.
Use automation for repetitive tasks: If you have many compute nodes, writing their resource inventories by hand does not scale. Instead, integrate provider.yaml generation into your provisioning workflow. For example, an Ansible playbook can gather facts from a new compute node, set each domain’s capacity to the number of CPUs in its cpu_shared_set, and update the YAML. Since every host exposes its full shared capacity to each domain in this model, identical hosts may not need host-specific entries at all: Nova’s provider config format documents a special uuid value, $COMPUTE_NODE, that applies an entry to the compute nodes managed by the local service, but confirm that it behaves as expected in your release before relying on it. Otherwise, entries are per provider, and automation with proper templating will handle that. Similarly, automate flavor creation to ensure consistency: script the creation of Gold/Silver/Bronze flavors so that an update to one tier’s specs is replicated to the others, avoiding manual errors.
Documentation and Governance: Clearly document the new scheduling policy for operators and tenants. Tenants should know that if they choose a Gold flavor, their instance will be scheduled with certain guarantees and that if capacity isn’t available, scheduling will fail (as opposed to silently running with risk of slowdown). Internally, have monitoring in place: e.g., set up alerts if any host’s actual running vCPUs for a tier exceed the limit (this shouldn’t happen with Dynamic Placement, but it’s a good check). Also monitor for any placement rejections (NoValidHost) for certain tiers – that could indicate that tier is fully subscribed on all hosts, perhaps time to add capacity or move some lower-tier workloads off.
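The provider.yaml generation mentioned above can be sketched as a small script. This is a hedged example: the function names are hypothetical, the YAML layout follows the provider config schema as documented upstream for Nova (verify it against your release), and the CPU-set parser handles only plain ranges and single IDs, not the ^ exclusion syntax.

```python
"""Sketch: render a provider.yaml entry with one inventory per tier."""

TIERS = ("GOLD", "SILVER", "BRONZE")

HEADER = """\
meta:
  schema_version: '1.0'
providers:
  - identification:
      uuid: '{uuid}'
    inventories:
      additional:
"""

INVENTORY = """\
        - CUSTOM_VCPU_{tier}:
            total: {total}
            reserved: 0
            min_unit: 1
            max_unit: {total}
            step_size: 1
            allocation_ratio: 1.0
"""


def count_cpus(cpu_set: str) -> int:
    """Count logical CPUs in a set such as '4-31,36-63' (no '^' exclusions)."""
    cpus = set()
    for part in cpu_set.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return len(cpus)


def render_provider_yaml(uuid: str, cpu_shared_set: str) -> str:
    """Render one provider entry whose tier totals equal the shared CPU count."""
    total = count_cpus(cpu_shared_set)
    body = "".join(INVENTORY.format(tier=t, total=total) for t in TIERS)
    return HEADER.format(uuid=uuid) + body


if __name__ == "__main__":
    # cpu_shared_set value taken from the nova.conf shown earlier in this plan.
    print(render_provider_yaml("$COMPUTE_NODE", "4-31,36-63,68-95,100-127"))
```

Deriving the totals from cpu_shared_set rather than hard-coding 112 keeps the inventories correct if a host with a different core count is added.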

Summary: By using Nova’s scheduler with a per-host domain capacity map, we achieve deterministic CPU share guarantees at the host level: no host will host more Gold, Silver, or Bronze vCPUs than it can fairly schedule. The dynamic inventory approach is preferred for its accuracy and integration. OpenStack Watcher can be layered on to continuously ensure and optimize SLO compliance, though it introduces complexity and is optional if Nova’s placement is strict. In an Antelope-based cloud (with potential Epoxy backports), all required hooks – custom resource classes, flavor extra_specs, scheduler filter plugins, Watcher strategies – are available. The recommended plan is to implement the Placement-based scheduling constraints for immediate, automatic enforcement, and use Watcher in a monitoring/optimization role to refine the placement over time. This combination minimizes manual management and keeps operations predictable, ensuring each workload tier consistently gets its intended share of CPU on every host, fulfilling the deterministic SLO objectives.

Sources:

OpenStack Nova Custom Resource Classes (Placement) – Nova allows defining custom resource classes and requesting them via flavor extra specs for scheduling. This mechanism is used to represent Gold/Silver/Bronze vCPU quotas per host.
Red Hat OpenStack Provider Config – Admins can declare custom resources in a provider.yaml to advertise consumable resources like CUSTOM_POWER_WATTS (or in our case CUSTOM_VCPU_GOLD)
