Skip to content

[COST-7249] Phase 3: Migrate monthly/tag/VM cost SQL to RTU pipeline#6043

Open
jordigilh wants to merge 17 commits into
project-koku:mainfrom
jordigilh:COST-7249/phase3-sql-migration
Open

[COST-7249] Phase 3: Migrate monthly/tag/VM cost SQL to RTU pipeline#6043
jordigilh wants to merge 17 commits into
project-koku:mainfrom
jordigilh:COST-7249/phase3-sql-migration

Conversation

@jordigilh
Copy link
Copy Markdown
Contributor

@jordigilh jordigilh commented May 5, 2026

Summary

Completes the RTU (RatesToUsage) pipeline migration for all remaining cost types — monthly, tag-based, and VM costs — building on the Phase 2 foundation merged in #6017.

After the initial SQL migration, a comprehensive readiness audit (FedRAMP, QE, SRE, Security, Architecture, DBA) identified several issues. This PR includes the original migration plus all audit fixes.


What changed (by commit)

1. Original Phase 3 SQL migration (2fec011)

  • 25 SQL files converted to INSERT into rates_to_usage instead of direct-writing to reporting_ocpusagelineitem_daily_summary:
    • 4 PostgreSQL tag rate files (infra/supp specific + default)
    • 5 PostgreSQL monthly cost files (node/cluster, PVC, VM, node_cost_by_tag, PVC_by_tag)
    • 8 Trino SQL files (VM hourly/monthly, VM core, project, GPU)
    • 8 self-hosted SQL files (mirrors of Trino)
  • 6 accessor methods updated in ocp_report_db_accessor.py to pass cost_model_id, rate_uuid, and custom_name to SQL templates via new rate_info_map property on CostModelDBAccessor

2. Orchestration: RTU pipeline unconditional (da8808b)

  • All cost types now flow through RTU unconditionally. Phase 3 converts monthly/tag/VM SQL to write to rates_to_usage, making the legacy direct-write path non-functional for these cost types.
  • The COST_BREAKDOWN_RTU_UNLEASH_FLAG is still read; when OFF, a warning log is emitted but the RTU path runs anyway. This prevents silent data loss — if usage costs went through legacy direct-write while aggregation runs (triggered by Phase 3 costs in RTU), the aggregation DELETE would wipe the legacy rows.
  • The legacy _update_usage_costs() method is preserved (not called) for git-revert rollback.
  • ⚠️ This changes rollback semantics: reverting Phase 3 requires a git revert, not a flag toggle. See risk register R20 update.

3. SQL audit fixes (97767b7)

  • metric_type centralized: Added _get_routing_metric_type() static helper to OCPReportDBAccessor. Replaces hardcoded 'cpu', 'storage', 'gpu' and CASE expressions in 25 SQL files with {{metric_type}} template variable. The routing logic mirrors Phase 2's SQL-level decisions (R13 IQ-2):
    • Node/Cluster → 'cpu' if distribution is cpu, else 'memory'
    • PVC → 'storage'
    • GPU → 'gpu'
    • Tag usage → passthrough from usage_type
  • SHA-256 upgrade: Replaced md5() with sha256() across all PostgreSQL, Trino, and self-hosted SQL files. sha256(bytea) is built-in PostgreSQL 11+ (proven by koku/db_functions/jsonb_sha256_text.sql, in production since 2021). No pgcrypto needed.
    • Migration 0350 widens label_hash from VARCHAR(32) to VARCHAR(64)
    • No data migration needed — delete-and-rebuild on next RTU run
    • Aligns with NIST 800-53 Rev 5 (FedRAMP) guidance discouraging md5 for new implementations
    • Risk register R13 already documented this as an available option
  • DELETE scope restored: Added AND rtu.pod_labels ? {{tag_key}} / AND rtu.volume_labels ? {{tag_key}} to the DELETE statements in node_cost_by_tag.sql and monthly_cost_persistentvolumeclaim_by_tag.sql to prevent over-deletion across tag keys

4. Risk register update (bc59356)

  • R20: Documented Phase 3 realization — RTU unconditional, flag warning-only, rollback via git revert
  • R13: Recorded md5 → sha256 upgrade with rationale and computation patterns
  • Changelog v1.3 added

5. Tests (891753e)

  • TC-R20-01: Updated expected orchestration order to rtu → monthly → vm → agg → markup → dist
  • TC-R20-03: Asserts monthly/vm/agg/markup now called even when cost_model_id is None
  • TC-55: Asserts RTU is called when flag is OFF (override behavior)
  • TestRoutingMetricType: 12 new unit tests covering all aggregation buckets (cpu, memory, storage, gpu) for _get_routing_metric_type()

6. Cleanup (b726ce0)

  • rate_info_map on CostModelDBAccessor cached with @cached_property

Key design decisions

Decision Rationale
RTU unconditional (flag → warning) Phase 3 SQL writes to RTU; legacy path can't coexist without maintaining dual SQL versions (rejected by TL in PR #5948)
SHA-256 over md5 FedRAMP alignment; built-in PG function; risk register R13 documented this option; zero-cost migration via delete-and-rebuild
Centralized metric_type helper Prevents drift between SQL files; mirrors Phase 2 routing logic; 12 unit tests
Rollback via git revert Consistent with R18 (distribution SQL); flag toggle no longer viable

Alignment with Phase 2

Area Phase 2 Phase 3 Status
metric_type routing In SQL (rate table + overrides) Python helper, same logic ✅ Aligned
label_hash md5, VARCHAR(32) sha256, VARCHAR(64) ✅ Upgraded (documented option)
DELETE scoping By cost_model_id + period Tag-key filter restored ✅ Aligned
Test patterns _make_orchestration_patches Same decorator, extended ✅ Aligned
Feature flag Gates usage between RTU/legacy Warning-only (RTU unconditional) ⚠️ Evolved

Test plan

  • CI passes (koku-ci pipeline)
  • RTU path processes all cost types (usage, monthly, tag, VM)
  • Flag OFF: warning logged, RTU path still executes
  • Aggregation correctly routes metric_type into cost_model_cpu_cost, cost_model_memory_cost, cost_model_volume_cost, cost_model_gpu_cost
  • Tag cost DELETE only removes rows for the current tag_key
  • label_hash is 64-char hex (sha256) in rates_to_usage
  • Migration 0350 applies cleanly
  • GPU costs (allocated + unallocated) with MIG slice support preserved

@jordigilh jordigilh requested review from a team as code owners May 5, 2026 00:00
@github-actions github-actions Bot added the smokes-required Label to show that smokes tests should be run against these changes. label May 5, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements Phase 3 of the cost model refactor, migrating OpenShift cost calculations (monthly, tag-based, and VM costs) to the new rates_to_usage intermediate table. It includes updates to the database accessors, SQL templates for both Postgres and Trino, and the cost updater logic to handle rate metadata like UUIDs and custom names. Feedback identifies a high-severity issue where the legacy processing path is broken due to missing aggregation steps for the redirected SQL outputs. Other improvements were suggested regarding consistent usage_end date handling in aggregation SQL and ensuring proper fallback logic for rate information lookups in tag-based cost population.

Comment on lines 797 to 850
if rtu_enabled:
if self._cost_model_id:
self._update_usage_rates_to_usage(start_date, end_date)
self._update_monthly_cost(start_date, end_date)
if self._tag_infra_rates != {} or self._tag_supplementary_rates != {}:
self._delete_tag_usage_costs(start_date, end_date, self._provider.uuid)
self._update_tag_usage_costs(start_date, end_date)
self._update_tag_usage_default_costs(start_date, end_date)
self._update_monthly_tag_based_cost(start_date, end_date)
self._update_node_hour_tag_based_cost(start_date, end_date)
with OCPReportDBAccessor(self._schema) as report_accessor:
cluster_params = {
"cluster_id": self._cluster_id,
"cluster_alias": self._cluster_alias,
}
report_accessor.populate_tag_based_costs(
start_date,
end_date,
self._provider.uuid,
self.metric_to_tag_params_map,
cluster_params,
cost_model_id=self._cost_model_id,
rate_info_map=self._rate_info_map,
)
if not (self._tag_infra_rates or self._tag_supplementary_rates):
self._delete_tag_usage_costs(start_date, end_date, self._provider.uuid)
self._aggregate_rates_to_daily_summary(start_date, end_date)
self._update_vm_usage_costs(start_date, end_date)
else:
self._cleanup_stale_rtu_costs(start_date, end_date)
else:
self._update_usage_costs(start_date, end_date)
self._update_monthly_cost(start_date, end_date)
if self._tag_infra_rates != {} or self._tag_supplementary_rates != {}:
self._delete_tag_usage_costs(start_date, end_date, self._provider.uuid)
self._update_tag_usage_costs(start_date, end_date)
self._update_tag_usage_default_costs(start_date, end_date)
self._update_monthly_tag_based_cost(start_date, end_date)
self._update_node_hour_tag_based_cost(start_date, end_date)
with OCPReportDBAccessor(self._schema) as report_accessor:
cluster_params = {
"cluster_id": self._cluster_id,
"cluster_alias": self._cluster_alias,
}
report_accessor.populate_tag_based_costs(
start_date,
end_date,
self._provider.uuid,
self.metric_to_tag_params_map,
cluster_params,
)
if not (self._tag_infra_rates or self._tag_supplementary_rates):
self._delete_tag_usage_costs(start_date, end_date, self._provider.uuid)
self._update_markup_cost(start_date, end_date)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The logic for processing monthly, tag-based, and VM costs is duplicated between the rtu_enabled and legacy branches. More importantly, the legacy path appears to be broken for these cost types. Since the underlying SQL files (e.g., monthly_cost_cluster_and_node.sql, infrastructure_tag_rates.sql) have been modified to target the rates_to_usage table instead of the daily summary table, and the legacy branch (lines 827-850) does not include the _aggregate_rates_to_daily_summary call, these costs will never reach the daily summary reports when the feature flag is OFF. This contradicts the PR description's goal of preserving the legacy path.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in commit da8808b. The duplicated flag branches and the broken legacy path were both addressed by making the RTU pipeline unconditional — all cost types now flow through a single code path (RTU INSERT → aggregation → daily summary). The feature flag is still read but emits a warning when OFF; the legacy _update_usage_costs() method is preserved for git-revert rollback. See updated PR description § "Orchestration: RTU pipeline unconditional" for full rationale.

for value_to_skip in value_names:
key_value_pair.append(json.dumps({tag_key: value_to_skip}))
json.dumps(key_value_pair)
tag_rate_info = rate_info_map.get((metric, cost_type_label, tag_key), {})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In populate_tag_usage_default_costs, the tag_rate_info lookup is missing the fallback to the metric-level rate_info that is correctly implemented in populate_tag_usage_costs (line 1123). This may result in missing custom_name or rate_uuid values if they are defined at the metric level but not specifically for the tag key.

Suggested change
tag_rate_info = rate_info_map.get((metric, cost_type_label, tag_key), {})
tag_rate_info = rate_info_map.get((metric, cost_type_label, tag_key), rate_info_map.get((metric, cost_type_label, ""), {}))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed in commit c39760bpopulate_tag_usage_default_costs now falls back to the metric-level rate_info (matching the pattern in populate_tag_usage_costs at line 1150):

rate_info = rate_info_map.get((metric, cost_type_label, ""), {})
# ...
tag_rate_info = rate_info_map.get((metric, cost_type_label, tag_key), rate_info)

rtu.namespace,
rtu.node,
rtu.usage_start,
rtu.usage_start + interval '1 day' AS usage_end,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In Step 3, usage_end is hardcoded as rtu.usage_start + interval '1 day'. This is inconsistent with Step 2 (line 127), which uses MIN(rtu.usage_end). Since daily summary rows typically expect usage_start and usage_end to be the same date for daily granularity, and the RTU rows already have usage_end populated from the source data, using MIN(rtu.usage_end) is more accurate and maintains consistency across cost types.

    MIN(rtu.usage_end) AS usage_end,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a false positive. Both Step 2 (line 87) and Step 3 (line 158) use the same expression: rtu.usage_start + interval '1 day' AS usage_end. Neither uses MIN(rtu.usage_end). The pattern is consistent across both INSERT blocks and matches the daily granularity convention where usage_end = usage_start + 1 day.

@jordigilh
Copy link
Copy Markdown
Contributor Author

FedRAMP Readiness Review — Phase 3 RTU Migration

Reviewed the full diff (29 files, 1,256+/1,116−) against NIST 800-53 Rev 5 controls. The migration pattern is solid overall — tenant isolation is preserved, SQL injection surface is reduced by removing {{labels_field | sqlsafe}}, and the RTU staging model is consistent with Phase 2. There are a few items that need attention before merge.


Critical: Flag-OFF path silently drops monthly/tag/VM costs

Problem. All 25 SQL templates now INSERT into rates_to_usage unconditionally. The flag-OFF branch in update_summary_cost_model_costs calls _update_monthly_cost, tag cost methods, and _update_vm_usage_costs — all of which now write to RTU — but never calls _aggregate_rates_to_daily_summary. Those costs never reach the daily summary table.

Root cause. Phase 2 kept monthly/tag processing outside the flag check (they still targeted the daily summary directly). This PR moved them inside both branches and changed the SQL target, but only added aggregation to the flag-ON branch.

Recommended fix. Preserve the Phase 2 structural pattern: the flag gates only usage-cost routing. Monthly/tag/VM are now fully migrated and should sit outside the flag check, with aggregation always running after them:

# Flag gates ONLY usage costs (Phase 2 scope):
if rtu_enabled:
    if self._cost_model_id:
        self._update_usage_rates_to_usage(...)
    else:
        self._cleanup_stale_rtu_costs(...)
else:
    self._update_usage_costs(...)

# Phase 3: always through RTU (migration complete):
self._update_monthly_cost(...)
# ... tag cost methods ...
self._update_vm_usage_costs(...)

# Aggregation always runs:
self._aggregate_rates_to_daily_summary(...)

self._update_markup_cost(...)

This matches the Phase 2 test TC-R20-01 contract, avoids duplicate SQL files, and keeps the flag as a safety valve for usage costs only.


High: DELETE scope broadened in tag-based cost files

node_cost_by_tag.sql and monthly_cost_persistentvolumeclaim_by_tag.sql removed the tag_key containment filter (pod_labels ? {{tag_key}} / volume_labels ? {{tag_key}}) from their DELETE statements. When multiple tag keys have rates, processing tag_key_A will delete rows created for tag_key_B.

Fix: Restore an equivalent filter on rates_to_usage — either the JSONB containment check or a filter on rate_id / custom_name.


Medium: metric_type hardcoded — regression from Phase 2 pattern

Phase 2's insert_usage_rates_to_usage.sql reads metric_type from the Rate table via a rate_names CTE (rn.metric_type), populated by derive_metric_type() in rate_sync.py. Phase 3's SQL files hardcode 'cpu' AS metric_type everywhere (including all VM cost files), bypassing the Rate table.

Fix: Add metric_type to rate_info_map (it's already on the Rate model), and use {{metric_type}} in the SQL templates instead of hardcoded literals.


Medium: MD5 for label_hash — use SHA-256

MD5 is used across all 25 SQL files for label_hash. The codebase already has public.jsonb_sha256_text (installed via migration 0046), and the risk register doc mentions widening to sha256 / VARCHAR(64).

Fix: Switch to encode(sha256(...), 'hex') in PostgreSQL and to_hex(sha256(...)) in Trino. Update label_hash column to VARCHAR(64).


Low / Informational

  • No new tests. 29 files changed, 0 test files. At minimum: rate_info_map property, flag-ON vs flag-OFF orchestration paths, aggregation metric_type routing.
  • No audit logging distinguishing RTU vs legacy path — add structured log entries with cost_model_id and pipeline path.
  • Duplicated orchestration logic across both flag branches — extracting shared logic into a helper method would reduce maintenance drift (this becomes moot if the fix above is adopted).
  • Trino CAST(NULL AS json) chain in label_hash — simplify to CAST(NULL AS varchar) or '' directly.

Positive

  • Removing {{labels_field | sqlsafe}} in favor of hardcoded column names (pod_labels, volume_labels, all_labels) reduces SQL injection surface. Good change.
  • Schema-based tenant isolation fully preserved in all RTU paths.
  • Feature flag + aggregation staging model is sound once the flag-OFF path is fixed.

@jordigilh
Copy link
Copy Markdown
Contributor Author

@myersCody — heads-up on one architectural decision in this PR that needs your sign-off before merge.

Feature flag is now warning-only (RTU unconditional)

Phase 3 converts all 25 monthly/tag/VM SQL files to INSERT into rates_to_usage. This makes the COST_BREAKDOWN_RTU_UNLEASH_FLAG unable to gate between RTU and legacy paths:

  • Why: If the flag is OFF and usage costs go through legacy direct-write (_update_usage_costs), the aggregation DELETE — which now runs unconditionally because monthly/tag/VM costs are in RTU — would wipe those legacy usage rows. Maintaining two versions of every SQL file (RTU + legacy) would reintroduce the dual-path architecture you rejected in PR [COST-7249] Technical design: Cost breakdown for OpenShift price list costs #5948.

  • What we did: The RTU pipeline runs for all cost types regardless of the flag. When the flag is OFF, a LOG.warning is emitted for operational visibility. The legacy _update_usage_costs() method is preserved but not called, keeping git-revert clean.

  • Impact on rollback: Phase 3 rollback requires a git revert of the SQL migration commits, not a flag toggle. This is consistent with R18 (distribution SQL rewrite) which also uses git revert as its rollback mechanism.

Question: Are you comfortable with the flag becoming warning-only, with git revert as the rollback mechanism? If you'd prefer to preserve the flag as a kill switch, the alternative is maintaining dual SQL versions for all 25 files — which we assessed as higher risk.

The full rationale is documented in the updated risk register (R20, Phase 3 section) in this PR.


Everything else in the audit fixes (SHA-256, metric_type centralization, DELETE scope, tests) aligns with Phase 2 patterns and documented options. See the updated PR description for a commit-by-commit breakdown.

@jordigilh jordigilh added the ocp-smoke-tests pr_check will run ocp + ocp on cloud smoke tests, used when changes affect ocp. label May 5, 2026
@jordigilh
Copy link
Copy Markdown
Contributor Author

/retest

@codecov
Copy link
Copy Markdown

codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 95.94595% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.4%. Comparing base (17f4089) to head (402a203).

Additional details and impacted files
@@           Coverage Diff           @@
##            main   #6043     +/-   ##
=======================================
- Coverage   94.4%   94.4%   -0.0%     
=======================================
  Files        362     362             
  Lines      32101   32140     +39     
  Branches    3538    3542      +4     
=======================================
+ Hits       30293   30327     +34     
- Misses      1173    1179      +6     
+ Partials     635     634      -1     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jordigilh
Copy link
Copy Markdown
Contributor Author

/retest

2 similar comments
@jordigilh
Copy link
Copy Markdown
Contributor Author

/retest

@jordigilh
Copy link
Copy Markdown
Contributor Author

/retest

@myersCody myersCody added the flightpath-pr Issues being worked on by the flight path team label May 6, 2026
@myersCody myersCody self-assigned this May 6, 2026
@myersCody
Copy link
Copy Markdown
Contributor

@myersCody — heads-up on one architectural decision in this PR that needs your sign-off before merge.

Feature flag is now warning-only (RTU unconditional)

Phase 3 converts all 25 monthly/tag/VM SQL files to INSERT into rates_to_usage. This makes the COST_BREAKDOWN_RTU_UNLEASH_FLAG unable to gate between RTU and legacy paths:

  • Why: If the flag is OFF and usage costs go through legacy direct-write (_update_usage_costs), the aggregation DELETE — which now runs unconditionally because monthly/tag/VM costs are in RTU — would wipe those legacy usage rows. Maintaining two versions of every SQL file (RTU + legacy) would reintroduce the dual-path architecture you rejected in PR [COST-7249] Technical design: Cost breakdown for OpenShift price list costs #5948.
  • What we did: The RTU pipeline runs for all cost types regardless of the flag. When the flag is OFF, a LOG.warning is emitted for operational visibility. The legacy _update_usage_costs() method is preserved but not called, keeping git-revert clean.
  • Impact on rollback: Phase 3 rollback requires a git revert of the SQL migration commits, not a flag toggle. This is consistent with R18 (distribution SQL rewrite) which also uses git revert as its rollback mechanism.

Question: Are you comfortable with the flag becoming warning-only, with git revert as the rollback mechanism? If you'd prefer to preserve the flag as a kill switch, the alternative is maintaining dual SQL versions for all 25 files — which we assessed as higher risk.

The full rationale is documented in the updated risk register (R20, Phase 3 section) in this PR.

Everything else in the audit fixes (SHA-256, metric_type centralization, DELETE scope, tests) aligns with Phase 2 patterns and documented options. See the updated PR description for a commit-by-commit breakdown.

Yes! So, this is part of the reason I pushed for these to be done in a separate phases is because I knew these files will not be controllable through an unleash flag.

The hope is to get phase 2 into production before we merge phase 3. Once phase 2 is in prod with no issues, we can remove the unleash flag. I always planned for the rollback strategy for phase 3 would be to revert the commit.

However, I did not foresee you needing another migration, which complicates the revert strategy. I think what we need to do is pull the migration out of this PR and merge it beforehand. We can extend the capacity of the label hash prior to these changes with no impact on the system.

That allows us to push in these changes and have a clean revert if we experience problems.

cluster_id,
cluster_alias,
data_source,
NULL AS persistentvolumeclaim,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add the column to the insert if it is always null?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid point — persistentvolumeclaim is always NULL for VM costs since these rows come from data_source = 'Pod'. The column is nullable with a NULL default, and the aggregation JOIN handles NULLs via IS NOT DISTINCT FROM, so omitting it from the INSERT is safe.

Will remove it. The same pattern applies to volume_labels (also always NULL in this file) — will trim that too for consistency. Applies to hourly_vm_core.sql, hourly_cost_vm_tag_based.sql, and hourly_vm_core_tag_based.sql as well.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — removed `persistentvolumeclaim` and `volume_labels` from all four VM hourly SQL files in commit `7794ad9`. The INSERT columns now exactly match the non-NULL values the SELECT produces.

@myersCody
Copy link
Copy Markdown
Contributor

Github is down right now and I can't leave comments on the PR lines directly. Sorry for the annoying comments that are going to start rolling in, but I want to get you some feedback ASAP.

Reference:
https://github.com/project-koku/koku/pull/6043/changes#diff-f0f05a0d5883c67612ca839a0a04e95f28cb6044612c5140be4e19bfb7563004R141-R148

Feedback:
This structure doesn't match the preexisting pattern for the repo on how we write our SQL files, Example

@myersCody
Copy link
Copy Markdown
Contributor

@cached_property
    def rate_info_map(self):
        """Return (metric, cost_type, tag_key) -> {rate_uuid, custom_name} for all Rate rows."""
        if not self.cost_model:
            return {}
        rate_rows = Rate.objects.filter(price_list__cost_model_maps__cost_model=self.cost_model).only(
            "uuid", "metric", "cost_type", "custom_name", "tag_key"
        )
        return {
            (rate.metric, rate.cost_type, rate.tag_key or ""): {
                "rate_uuid": str(rate.uuid),
                "custom_name": rate.custom_name,
            }
            for rate in rate_rows
        }

What is the benefit of basically recreating the json structure that we removed here? I feel like we have the ability to query the Rate table directly now to see if the rate exists?

Additionally how are we handling scenarios where we have more than one price list. Are we taking validity period into consideration here?

@jordigilh
Copy link
Copy Markdown
Contributor Author

Agreed — will split the migration out into a standalone PR. The migration (0350_widen_ratestousage_label_hash) just widens label_hash from VARCHAR(32) to VARCHAR(64) for SHA-256. It's backwards-compatible and has zero runtime impact on Phase 2's MD5 hashes.

One change from the current migration: removing the reverse-only TRUNCATE TABLE rates_to_usage. That was added to handle test-session edge cases where 64-char hashes existed. For the standalone migration:

  • Forward: Widens the column — no-op for existing 32-char data
  • Reverse: Django's AlterField narrows back to VARCHAR(32). If 64-char data exists (from Phase 3), PostgreSQL will reject it — which is correct because Phase 3 must be reverted first. No silent data loss.

Will open the migration PR shortly and rebase Phase 3 to exclude it.

@jordigilh
Copy link
Copy Markdown
Contributor Author

This structure doesn't match the preexisting pattern for the repo on how we write our SQL files

Agreed — the sql/ directory Phase 3 files (tag rates, monthly costs, aggregate) use a compacted multi-column-per-line INSERT column list, while the repo's established pattern is one column per line. The self_hosted_sql/ and trino_sql/ files already follow the correct pattern. Will reformat all sql/openshift/cost_model/ files to match.

@jordigilh
Copy link
Copy Markdown
Contributor Author

What is the benefit of basically recreating the json structure that we removed here? I feel like we have the ability to query the Rate table directly now to see if the rate exists?

The intent was a cached lookup table (@cached_property) to avoid N+1 queries — rate_info_map is accessed ~15 times per update across populate_monthly_cost, populate_tag_usage_costs, populate_tag_usage_default_costs, and _update_vm_usage_costs. Each lookup resolves rate_uuid and custom_name for the SQL template parameters.

That said, you're right that it partially recreates the structure we removed. An alternative is to query Rate directly per-invocation with the indexes already on the table (rate_metric_idx, rate_price_list_idx). The tradeoff is query count vs. dict overhead. Happy to refactor to direct queries if you prefer that pattern — the per-query cost should be negligible with the indexes.

Additionally how are we handling scenarios where we have more than one price list. Are we taking validity period into consideration here?

Valid catch — this is a bug. The current rate_info_map query fetches from all price lists:

Rate.objects.filter(price_list__cost_model_maps__cost_model=self.cost_model)

This does not scope to the effective price list for the billing period. All other rate properties (infrastructure_rates, supplementary_rates, etc.) correctly resolve via self.effective_ratesPriceListManager.get_effective_price_list(cost_model_uuid, price_list_effective_on), but rate_info_map bypasses that resolution.

Will fix by scoping the Rate query to the effective price list when price_list_effective_on is set, matching the pattern used by effective_rates.

@jordigilh jordigilh force-pushed the COST-7249/phase3-sql-migration branch 4 times, most recently from 82dc125 to 5ea5827 Compare May 6, 2026 20:13
@myersCody
Copy link
Copy Markdown
Contributor

What is the benefit of basically recreating the json structure that we removed here? I feel like we have the ability to query the Rate table directly now to see if the rate exists?

The intent was a cached lookup table (@cached_property) to avoid N+1 queries — rate_info_map is accessed ~15 times per update across populate_monthly_cost, populate_tag_usage_costs, populate_tag_usage_default_costs, and _update_vm_usage_costs. Each lookup resolves rate_uuid and custom_name for the SQL template parameters.

That said, you're right that it partially recreates the structure we removed. An alternative is to query Rate directly per-invocation with the indexes already on the table (rate_metric_idx, rate_price_list_idx). The tradeoff is query count vs. dict overhead. Happy to refactor to direct queries if you prefer that pattern — the per-query cost should be negligible with the indexes.

Additionally how are we handling scenarios where we have more than one price list. Are we taking validity period into consideration here?

Valid catch — this is a bug. The current rate_info_map query fetches from all price lists:

Rate.objects.filter(price_list__cost_model_maps__cost_model=self.cost_model)

This does not scope to the effective price list for the billing period. All other rate properties (infrastructure_rates, supplementary_rates, etc.) correctly resolve via self.effective_ratesPriceListManager.get_effective_price_list(cost_model_uuid, price_list_effective_on), but rate_info_map bypasses that resolution.

Will fix by scoping the Rate query to the effective price list when price_list_effective_on is set, matching the pattern used by effective_rates.

What is the benefit of basically recreating the json structure that we removed here? I feel like we have the ability to query the Rate table directly now to see if the rate exists?

The intent was a cached lookup table (@cached_property) to avoid N+1 queries — rate_info_map is accessed ~15 times per update across populate_monthly_cost, populate_tag_usage_costs, populate_tag_usage_default_costs, and _update_vm_usage_costs. Each lookup resolves rate_uuid and custom_name for the SQL template parameters.

That said, you're right that it partially recreates the structure we removed. An alternative is to query Rate directly per-invocation with the indexes already on the table (rate_metric_idx, rate_price_list_idx). The tradeoff is query count vs. dict overhead. Happy to refactor to direct queries if you prefer that pattern — the per-query cost should be negligible with the indexes.

Additionally how are we handling scenarios where we have more than one price list. Are we taking validity period into consideration here?

Valid catch — this is a bug. The current rate_info_map query fetches from all price lists:

Rate.objects.filter(price_list__cost_model_maps__cost_model=self.cost_model)

This does not scope to the effective price list for the billing period. All other rate properties (infrastructure_rates, supplementary_rates, etc.) correctly resolve via self.effective_ratesPriceListManager.get_effective_price_list(cost_model_uuid, price_list_effective_on), but rate_info_map bypasses that resolution.

Will fix by scoping the Rate query to the effective price list when price_list_effective_on is set, matching the pattern used by effective_rates.

@jordigilh You bring up a good point about the N+1 query. Another additional benefit of this approach is that it allows you to do the work of figuring out the validity periods of price lists in one place instead of 15 different places. I am sold on keeping it a cache property.

@jordigilh jordigilh force-pushed the COST-7249/phase3-sql-migration branch from b9ee8b5 to e05c012 Compare May 11, 2026 00:52
jordigilh added a commit to jordigilh/koku that referenced this pull request May 11, 2026
Extend label_hash from VARCHAR(32) (MD5) to VARCHAR(64) in
preparation for Phase 3's SHA-256 upgrade. This is a backwards-
compatible schema change with zero runtime impact on existing
MD5 hashes.

Split out from Phase 3 PR project-koku#6043 per TL request so that Phase 3
remains a clean git-revert without migration entanglement.

Co-authored-by: Cursor <cursoragent@cursor.com>
@jordigilh
Copy link
Copy Markdown
Contributor Author

/retest

@jordigilh jordigilh force-pushed the COST-7249/phase3-sql-migration branch from b8239cd to 4ef269b Compare May 15, 2026 11:50
@koku-ci-triager-bot
Copy link
Copy Markdown
Collaborator

🤖 CI Triager — Diagnosis

Check: Red Hat Konflux / koku-ci / koku
PipelineRun: koku-ci-j58xg
Root cause: The run-iqe-cji task failed at step-deploy-iqe-cji due to a network connection reset while fetching IQE job pod logs, not due to a test assertion failure. This is a transient infrastructure issue.
Evidence:

error: read tcp 10.128.96.110:52838->44.205.71.56:6443: read: connection reset by peer
Test command failed

  RAN: /usr/local/bin/oc logs koku-iqe-l4sdibh-x9mmg --namespace=ephemeral-pz7eap ...
  STDOUT: (empty)
  STDERR: (empty)

Some jobs failed: {'koku-iqe-l4sdibh': 'Failed'}

Action: Re-trigger the koku-ci check. The failure is unrelated to this PR's code changes — the IQE pod log fetch was interrupted by a network error.

Generated automatically. Review before applying.

@koku-ci-triager-bot
Copy link
Copy Markdown
Collaborator

🤖 CI Triager — Diagnosis

Check: Red Hat Konflux / koku-ci / koku
PipelineRun: koku-ci-pfnll
Root cause: The IQE smoke test ran successfully for many test cases, but two OCP sources timed out waiting for manifest_complete_date to be set — the files were fully processed (6/6) but the summary/aggregation step never completed. This PR directly modifies aggregate_rates_to_daily_summary.sql and the RTU pipeline (all 25 monthly/tag/VM SQL files), making this a likely Koku-side regression in the cost model summarization step.

Evidence:

2026-05-19 16:13:57 ERROR Waiting for source ingest failed, see:
  {'manifests': [{'files_processed': '6/6',
                  'process_complete_date': '2026-05-19 15:47:17',
                  'manifest_complete_date': None, ...}]}

2026-05-19 16:40:50 ERROR Waiting for source ingest failed, see:
  {'manifests': [{'files_processed': '6/6',
                  'process_complete_date': '2026-05-19 16:14:10',
                  'manifest_complete_date': None, ...}]}

# Repeated until timeout:
'2026-04' Summary processing not complete for source ddbd114b-7fed-41df-bead-cd6f075a367a
'2026-05' Summary processing not complete for source ddbd114b-7fed-41df-bead-cd6f075a367a

Some jobs failed: {'koku-iqe-lmvedj8': 'Failed'}

Affected files in diff:

  • koku/masu/database/sql/openshift/cost_model/usage_rates/aggregate_rates_to_daily_summary.sql
  • koku/masu/database/trino_sql/openshift/cost_model/ (multiple SQL files converted to RTU)
  • koku/masu/database/ocp_report_db_accessor.py

Action: Investigate whether the changes to aggregate_rates_to_daily_summary.sql or the RTU orchestration in ocp_report_db_accessor.py can cause the manifest summary step to silently not set manifest_complete_date. Check for exception swallowing, early returns, or missing update_cost_summary calls in the new RTU flow. Re-trigger koku-ci after a fix.

Generated automatically. Review before applying.

@jordigilh jordigilh force-pushed the COST-7249/phase3-sql-migration branch from 97ae01d to 82ed053 Compare May 20, 2026 13:40
jordigilh and others added 12 commits May 20, 2026 11:20
Move all remaining cost model SQL files to INSERT into rates_to_usage
instead of directly writing to reporting_ocpusagelineitem_daily_summary.
This completes the RTU pipeline migration for all cost types.

SQL files converted (25 total):
- 4 PostgreSQL tag rate files (infra/supp specific + default)
- 3 PostgreSQL monthly cost files (node/cluster, PVC, VM)
- 2 PostgreSQL tag-based monthly files (node_cost_by_tag, PVC_by_tag)
- 8 Trino SQL files (VM hourly/monthly, VM core, project, GPU)
- 8 self-hosted SQL files (mirrors of Trino)

Accessor/orchestrator changes:
- Add rate_info_map property to CostModelDBAccessor for Rate UUID/
  custom_name lookup by (metric, cost_type, tag_key)
- Update 6 accessor methods to accept and pass cost_model_id,
  rate_uuid, custom_name to SQL templates
- Rewire orchestration: monthly/tag/VM processing now runs inside
  the RTU-enabled block, with legacy path preserved for flag-off

Aggregation expansion:
- Step 1 DELETE now covers all cost model rows (usage + monthly + tag)
- Step 2 INSERT: usage costs (monthly_cost_type IS NULL) with base JOIN
- Step 3 INSERT: monthly/tag costs (monthly_cost_type IS NOT NULL)
  without base JOIN, including GPU cost routing

Co-authored-by: Cursor <cursoragent@cursor.com>
Phase 3 converts all monthly/tag/VM SQL to INSERT into rates_to_usage.
The legacy direct-write path is no longer viable because the aggregation
DELETE would wipe those rows. The COST_BREAKDOWN_RTU_UNLEASH_FLAG is
still read; when OFF, a warning is logged but the RTU path runs for all
cost types to prevent silent data loss.

Co-authored-by: Cursor <cursoragent@cursor.com>
to SHA-256, restore tag DELETE scope

- Add _get_routing_metric_type() helper to OCPReportDBAccessor to derive
  the aggregation bucket (cpu/memory/storage/gpu) from cost parameters.
  Replace hardcoded metric_type in 25 SQL files with {{metric_type}}.
- Replace md5() with sha256() across all PostgreSQL, Trino, and
  self-hosted SQL files. Widen label_hash column from VARCHAR(32) to
  VARCHAR(64) via migration 0350.
- Restore tag_key scoping on DELETE in node_cost_by_tag.sql and
  monthly_cost_persistentvolumeclaim_by_tag.sql to prevent over-deletion.

Co-authored-by: Cursor <cursoragent@cursor.com>
…HA-256

- R20: document Phase 3 realization — RTU unconditional, feature flag is
  warning-only, rollback via git revert (not flag toggle).
- R13: record md5 → sha256 upgrade, document built-in PG sha256(),
  migration 0350, no data migration needed.
- Update risk matrix and changelog (v1.3).

Co-authored-by: Cursor <cursoragent@cursor.com>
…outing tests

- TC-R20-01: update expected order to rtu → monthly → vm → agg → markup → dist.
- TC-R20-03: assert monthly/vm/agg/markup called even with no cost model.
- TC-55: assert RTU called when flag is OFF (flag override behavior).
- Add TestRoutingMetricType with 12 unit tests covering all aggregation
  buckets (cpu, memory, storage, gpu) for _get_routing_metric_type().

Co-authored-by: Cursor <cursoragent@cursor.com>
Avoid recomputing rate metadata lookup on every access.

Co-authored-by: Cursor <cursoragent@cursor.com>
…o fallback

- Run black on cost_model_db_accessor, ocp_report_db_accessor, and
  ocp_cost_model_cost_updater to fix CI Sanity pre-commit failures.
- Fix populate_tag_usage_default_costs to fall back to metric-level
  rate_info when tag-key-specific entry is missing (consistent with
  populate_tag_usage_costs).

Co-authored-by: Cursor <cursoragent@cursor.com>
Re-run formatting with the project-pinned black 22.12.0
(--line-length 119) to match CI Sanity pre-commit checks.

Co-authored-by: Cursor <cursoragent@cursor.com>
Phase 3 SQL now writes tag-based costs to rates_to_usage instead
of daily_summary. Update test assertions to verify RatesToUsage
rows and update mock expectations for new kwargs (cost_model_id,
rate_info_map, source_uuid, report_period_id).

Co-authored-by: Cursor <cursoragent@cursor.com>
… issues

- Fix NULL cluster_id in node_cost_by_tag.sql allocated pod INSERT
  (was hardcoded NULL instead of selecting cluster_id from temp table)
- Set cost_model=None and rate_info_map={} on CostModelDBAccessor
  mocks to prevent MagicMock objects from reaching SQL queries
- Update test assertions to check RatesToUsage instead of
  daily_summary for monthly/tag-based cost verification

Co-authored-by: Cursor <cursoragent@cursor.com>
…tion

Add a reverse-only TRUNCATE on rates_to_usage so that narrowing
label_hash back to varchar(32) does not fail when 64-char SHA-256
hashes exist from earlier tests in the same session.

Co-authored-by: Cursor <cursoragent@cursor.com>
Phase 3 makes monthly/vm/agg/markup unconditional even when no effective
price list exists. Update test_cleanup_when_no_effective_price_list to
expect those calls, and broaden test_rtu_custom_name_fallback to accept
monthly/node/cluster cost_type identifiers in addition to usage metrics.

Co-authored-by: Cursor <cursoragent@cursor.com>
@jordigilh jordigilh force-pushed the COST-7249/phase3-sql-migration branch from 23b44ec to 325b7e8 Compare May 20, 2026 15:26
Copy link
Copy Markdown
Collaborator

@koku-ci-triager-bot koku-ci-triager-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 CI Triager Suggestion

Check: Sanity
Root cause: black reformatted koku/masu/test/processor/ocp/test_phase2_rates_to_usage.py because the multi-line set union expression doesn't follow black's style. Run pre-commit run --all-files locally to auto-fix.

Accept the suggestion below with one click.

Comment on lines +1243 to +1249
known_identifiers = set(metric_constants.COST_MODEL_USAGE_RATES) | set(
metric_constants.COST_MODEL_MONTHLY_RATES
) | set(metric_constants.COST_MODEL_VM_USAGE_RATES) | set(
metric_constants.COST_MODEL_NODE_RATES
) | {
"Node", "Cluster", "PVC", "OCP_VM", "Node_Core_Month", "Node_Core_Hour",
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apply black formatting for the multi-line set union expression.

Suggested change
known_identifiers = set(metric_constants.COST_MODEL_USAGE_RATES) | set(
metric_constants.COST_MODEL_MONTHLY_RATES
) | set(metric_constants.COST_MODEL_VM_USAGE_RATES) | set(
metric_constants.COST_MODEL_NODE_RATES
) | {
"Node", "Cluster", "PVC", "OCP_VM", "Node_Core_Month", "Node_Core_Hour",
}
known_identifiers = (
set(metric_constants.COST_MODEL_USAGE_RATES)
| set(metric_constants.COST_MODEL_MONTHLY_RATES)
| set(metric_constants.COST_MODEL_VM_USAGE_RATES)
| set(metric_constants.COST_MODEL_NODE_RATES)
| {
"Node",
"Cluster",
"PVC",
"OCP_VM",
"Node_Core_Month",
"Node_Core_Hour",
}
)

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown
Collaborator

@koku-ci-triager-bot koku-ci-triager-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 CI Triager Suggestion

Check: Sanity
Root cause: Two pre-commit hooks failed after the latest commit (2026-05-20T20:38:19Z):

  1. reorder-python-imports — 5 files have a blank line added between the module docstring and the first import; the tool removes it.
  2. blacktest_common.py has an inline class body (class Foo: ...) that black 22.12.0 expands to multi-line; ocp_cost_model_cost_updater.py wraps a ternary in redundant parentheses that black removes.

Accept each suggestion below with one click.

Comment thread koku/masu/test/util/ocp/test_common.py Outdated

class ManifestFactory(ModelFactory[utils.Manifest]):
...
class ManifestFactory(ModelFactory[utils.Manifest]): ...
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

black reformats inline class bodies to multi-line form.

Suggested change
class ManifestFactory(ModelFactory[utils.Manifest]): ...
class ManifestFactory(ModelFactory[utils.Manifest]):
...

log_json(
msg="RTU pipeline path selected",
rtu_enabled=rtu_enabled,
cost_model_id=(str(self._cost_model_id) if self._cost_model_id else None),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

black removes redundant parentheses around ternary expressions in keyword arguments.

Suggested change
cost_model_id=(str(self._cost_model_id) if self._cost_model_id else None),
cost_model_id=str(self._cost_model_id) if self._cost_model_id else None,

Comment thread koku/gunicorn_conf.py Outdated
Comment on lines +1 to +2
"""Gunicorn configuration file."""

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reorder-python-imports removes the blank line between the module docstring and the first import.

Suggested change
"""Gunicorn configuration file."""
"""Gunicorn configuration file."""

Comment on lines +1 to +2
"""Tests for the PriceListManager."""

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reorder-python-imports removes the blank line between the module docstring and the first import.

Suggested change
"""Tests for the PriceListManager."""
"""Tests for the PriceListManager."""

Comment on lines +1 to +2
"""Tests for the PriceList and PriceListCostModelMap models."""

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reorder-python-imports removes the blank line between the module docstring and the first import.

Suggested change
"""Tests for the PriceList and PriceListCostModelMap models."""
"""Tests for the PriceList and PriceListCostModelMap models."""

Comment on lines +1 to +2
"""Masu OCP post processor module tests."""

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reorder-python-imports removes the blank line between the module docstring and the first import.

Suggested change
"""Masu OCP post processor module tests."""
"""Masu OCP post processor module tests."""

Comment on lines +18 to +19
"""

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reorder-python-imports removes the blank line between the module docstring and the first import.

Suggested change
"""
"""

@koku-ci-triager-bot
Copy link
Copy Markdown
Collaborator

🤖 CI Triager — Diagnosis

Check: Red Hat Konflux / koku-ci / koku
PipelineRun: koku-ci-br2lv | TaskRun: koku-ci-br2lv-run-iqe-cji | Step: step-deploy-iqe-cji
Root cause: IQE smoke tests fail because manifest_complete_date is never populated — source data is ingested (files_processed: 6/6, process_complete_date set) but the manifest is never marked complete, meaning the summarization pipeline stalls or exits without writing its completion marker. This PR changes ocp_cost_model_cost_updater.py and multiple OCP cost-model SQL templates; a regression there could cause an unhandled exception during cost-model summarization that silently prevents the manifest from being marked complete.

Evidence:

2026-05-20 23:35:27 ERROR Waiting for source ingest failed, see:
  {
    "2026-05-01": {"manifests": [{
      "files_processed": "6/6",
      "process_complete_date": "2026-05-20 23:17:02",
      "manifest_complete_date": null   <-- never set
    }]}
  }
ERROR   Could not do "function check_ingest_complete()"
        at iqe_cost_management/fixtures/helpers.py:519 in time

Pattern repeated across all test attempts (~23:35 → 02:35); all OCP ingest tests errored.

Action:

  1. Check Koku worker logs in the ephemeral namespace (ephemeral-7wb4wu) for exceptions thrown during OCP cost-model summarization after file processing completes.
  2. Review changes to ocp_cost_model_cost_updater.py and the modified SQL templates for logic errors that could raise an unhandled exception after process_complete_date is set but before manifest_complete_date is written.
  3. Fix the regression and re-trigger the smoke tests.

Generated automatically. Review before acting.

…UMENT

Trino's PostgreSQL connector exposes jsonb columns as json type,
and CAST(json AS varchar) is not valid in Trino. Replace with
json_format(CAST(x AS json)) which correctly serializes json to
varchar for the sha256 label_hash computation.

COST-7249

Co-authored-by: Cursor <cursoragent@cursor.com>
@koku-ci-triager-bot
Copy link
Copy Markdown
Collaborator

🤖 CI Triager — Diagnosis

Check: Red Hat Konflux / koku-ci / koku
PipelineRun: koku-ci-bq2tc (running for 3+ hours, started 2026-05-21T14:52:54Z)
Root cause: The step-deploy-iqe-cji step is stuck in an infinite polling loop waiting for OCP source summary processing to complete. The PipelineRun also carries a SnapshotCreationFailed annotation, which may indicate the deployed image was not built from the correct SHA. The PR modifies OCP cost model processing SQL (Trino and self-hosted), which is the code path exercised by the ocp-smoke-tests.

Evidence:

# PipelineRun koku-ci-bq2tc annotation:
test.appstudio.openshift.io/snapshot-creation-report: SnapshotCreationFailed

# step-deploy-iqe-cji repeating every 10s for 3+ hours:
2026-05-21 18:14:07 INFO '2026-05' Files processing 6/6
2026-05-21 18:14:07 INFO '2026-05' Summary processing not complete for source 777b682a-0fca-45fc-9993-bc7fe3cc418c
2026-05-21 18:14:07 INFO '2026-04' Files processing 6/6
2026-05-21 18:14:07 INFO '2026-04' Summary processing not complete for source 777b682a-0fca-45fc-9993-bc7fe3cc418c

Action: The pipeline will eventually timeout (~4 more hours). Two possible causes:

  1. Konflux infra issue (likely primary): The SnapshotCreationFailed annotation suggests the group snapshot could not be created. Re-trigger the pipeline with a /retest comment or by pushing a new commit.
  2. Koku-side regression (investigate if re-trigger also gets stuck): This PR changes OCP cost model SQL (hourly_cost_virtual_machine, hourly_vm_core, monthly_vm_core, etc.). If the new SQL causes silent failures or infinite loops in the masu processing pipeline, summary processing will never complete. Monitor the koku worker logs in the ephemeral namespace ephemeral-1yc9mn if the issue recurs.

Generated automatically. Review before acting.

@jordigilh
Copy link
Copy Markdown
Contributor Author

/retest

@koku-ci-triager-bot
Copy link
Copy Markdown
Collaborator

🤖 CI Triager — Diagnosis

Check: Red Hat Konflux / koku-ci / koku
PipelineRun: koku-ci-r56zf (started 2026-05-22T01:51:03Z, failed 2026-05-22T06:35:54Z)
Root cause: Two specific IQE tests time out waiting for OCP summary processing to complete (manifest_complete_date: None), indicating that the Koku masu summarization pipeline hangs or fails silently for AI-workload and MIG-workload OCP sources after this PR's changes to the OCP cost model SQL. Standard OCP ingest tests (10+) all pass; only these two AI/MIG workload tests fail. The SnapshotCreationFailed group-snapshot annotation is also present but the deployed image SHA matches the PR head (8d1dc651), so this is a genuine regression in the PR's code, not a Konflux infra image issue.

Evidence:

# Failing tests (ERROR):
tests/data_setup/test_all_data_ingest.py::test_api_ocp_source_ai_workloads_ingest
  ERROR: Waiting for source ingest failed — manifest_complete_date: None
         files_processed: 6/6, process_complete_date set, but manifest never completes

tests/data_setup/test_all_data_ingest.py::test_api_ocp_source_mig_workloads_ingest
  ERROR: Waiting for source ingest failed — manifest_complete_date: None
         same pattern

# 10+ other OCP/AWS ingest tests PASSED in the same run
# Annotation: test.appstudio.openshift.io/snapshot-creation-report: SnapshotCreationFailed

Action: This is a Koku-side regression introduced by this PR. AI and MIG workloads trigger a code path that the new unconditional RTU pipeline (Phase 3) or the new GPU/VM SQL does not complete correctly. Investigate:

  1. monthly_cost_gpu.sql (Trino + self-hosted) — AI workloads typically involve GPU resources; a SQL error here would cause the summarization to hang.
  2. _aggregate_rates_to_daily_summary() — the unconditional DELETE+INSERT may be failing silently for these workload types, preventing manifest_complete_date from being set.
  3. Look at masu worker logs in the ephemeral namespace for the two sources that timed out — search for tracebacks or error messages around cost model update steps.

This is the third consecutive koku-ci run failing with the same pattern on this PR.

Generated automatically. Review before acting.

monthly_cost_gpu.sql was missed in the prior INVALID_CAST_ARGUMENT fix.
Trino rejects CAST(json AS varchar); use json_format() instead.

This is the code path triggered by AI and MIG workload IQE tests
(test_api_ocp_source_ai_workloads_ingest,
test_api_ocp_source_mig_workloads_ingest).

Co-authored-by: Cursor <cursoragent@cursor.com>
@koku-ci-triager-bot
Copy link
Copy Markdown
Collaborator

🤖 CI Triager — Diagnosis

Check: Red Hat Konflux / koku-ci / koku
PipelineRun: koku-ci-w2vnq
Root cause: The IQE OCP GPU smoke tests are returning GPU count = 0 for all assertions. This is a Koku-side regression caused by Phase 3 changes in this PR: monthly_cost_gpu.sql (both Trino and self-hosted) now INSERT into rates_to_usage instead of reporting_ocpusagelineitem_daily_summary directly. The IQE tests read GPU counts from the daily summary table, but the aggregation pipeline that transfers rates_to_usagereporting_ocpusagelineitem_daily_summary is not populating GPU rows correctly in the smoke test environment.

Evidence:

FAILED tests/rest_api/v1/test__ocp_gpu_reports.py::test_api_ocp_gpu_raw_calc_group_by_model_daily
  AssertionError: Expected 4, got 0 of L40S GPUs — assert 0 == 4
FAILED tests/rest_api/v1/test__ocp_gpu_reports.py::test_api_ocp_gpu_user_flow_cluster_view[no_include_no_exlude]
  AssertionError: GPU count mismatch for node_l40s_1: expected 7, got 0 — assert 0 == 7
FAILED tests/rest_api/v1/test__ocp_mig_reports.py::test_api_ocp_mig_raw_calc_group_by_model_daily
  AssertionError: GPU count mismatch for A100 SXM4 40GB on node node-a100-mixed-mig-01: expected 3 — assert 0 == 3
= 23 failed, 4089 passed =

All 23 failures are in test__ocp_gpu_reports.py and test__ocp_mig_reports.py, all with assert 0 == N (GPU count returned 0).

Affected files:

  • koku/masu/database/trino_sql/openshift/cost_model/monthly_cost_gpu.sql — now inserts into rates_to_usage
  • koku/masu/database/self_hosted_sql/openshift/cost_model/monthly_cost_gpu.sql — now inserts into rates_to_usage

Action: Verify that _aggregate_rates_to_daily_summary is correctly processing GPU rows from rates_to_usage into reporting_ocpusagelineitem_daily_summary. The smoke test environment ran the cost model updater but the GPU counts are not appearing in the daily summary table that the API reads from. Check whether the aggregation SQL (aggregate_rates_to_daily_summary.sql) handles the GPU data_source rows from rates_to_usage.

Generated automatically. Review before applying.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

flightpath-pr Issues being worked on by the flight path team ocp-smoke-tests pr_check will run ocp + ocp on cloud smoke tests, used when changes affect ocp. smokes-required Label to show that smokes tests should be run against these changes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants