[FLPATH-3323] Add AWS self-hosted/on-prem support by ydayagi · Pull Request #5918 · project-koku/koku

ydayagi · 2026-03-01T13:20:16Z

Add self-hosted PostgreSQL support for AWS provider, following the pattern established for OCP. This enables running koku without Trino/Hive for on-premises deployments.

Changes:

- Add Django models for AWS line items (aws_line_items, aws_line_items_daily)
- Add migration for partitioned AWS line item tables
- Add self_hosted_sql/aws/ directory with PostgreSQL-converted SQL files
- Refactor base processor with generic write_to_self_hosted_table()
- Add _prepare_dataframe_for_write() hook for provider-specific columns
- Add delete_day_postgres() with manifestid-based deletion
- Update AWS db accessor to use get_sql_folder_name() for SQL selection
- Add delete_self_hosted_data_by_source() for cleanup
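
The base-processor refactor listed above has a template-method shape: the shared write path lives in the base class and each provider only supplies a hook. A minimal sketch of that shape, assuming plain lists of dicts in place of dataframes and Django models. Only the method names write_to_self_hosted_table and _prepare_dataframe_for_write come from this PR; the class names, the `source` column, and the OCP column are hypothetical.

```python
from abc import ABC, abstractmethod


class ProcessorBaseSketch(ABC):
    """Hypothetical stand-in for the shared base processor."""

    def __init__(self, manifest_id, source_uuid):
        self.manifest_id = manifest_id
        self.source_uuid = source_uuid
        self.written = []  # stands in for the PostgreSQL table

    @abstractmethod
    def _prepare_dataframe_for_write(self, rows):
        """Hook: add provider-specific columns before the write."""

    def write_to_self_hosted_table(self, rows):
        # Common logic lives here; subclasses only implement the hook.
        prepared = self._prepare_dataframe_for_write([dict(r) for r in rows])
        for row in prepared:
            row["source"] = self.source_uuid  # assumed shared column
        self.written.extend(prepared)
        return len(prepared)


class AWSProcessorSketch(ProcessorBaseSketch):
    def _prepare_dataframe_for_write(self, rows):
        # The AWS hook adds the manifest ID used later for deletion.
        for row in rows:
            row["manifestid"] = self.manifest_id
        return rows


class OCPProcessorSketch(ProcessorBaseSketch):
    def _prepare_dataframe_for_write(self, rows):
        # Which OCP-specific columns are added is an assumption here.
        for row in rows:
            row["cluster_id"] = "hypothetical-cluster"
        return rows


aws = AWSProcessorSketch(manifest_id=42, source_uuid="src-1")
count = aws.write_to_self_hosted_table([{"lineitem_usageamount": 1.5}])
```

With this layout, adding another provider would only require a new subclass with its own hook; the write path itself stays in one place.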

Jira: https://issues.redhat.com/browse/FLPATH-3323

Jira Ticket

COST-####

Description

This change will ...

Testing

Checkout Branch
Restart Koku
Hit endpoint or launch shell
1. You should see ...
Do more things...

Release Notes

proposed release note

* [COST-####](https://issues.redhat.com/browse/COST-####) Fix some things

gemini-code-assist · 2026-03-01T13:20:47Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances Koku's capabilities by introducing self-hosted PostgreSQL support for AWS. This change allows on-premises deployments to process AWS cost and usage data without relying on Trino or Hive, providing greater flexibility and control over data infrastructure. The implementation involves new Django models for AWS line items, a generalized framework for writing and deleting data in PostgreSQL, and a suite of AWS-specific SQL queries tailored for the self-hosted environment.

Highlights

AWS Self-Hosted PostgreSQL Support: Introduced comprehensive support for self-hosted PostgreSQL for AWS providers, mirroring the existing OpenShift (OCP) pattern. This enables Koku to run without Trino/Hive in on-premises deployments.
New Django Models for AWS Line Items: Added new Django models, AWSLineItem and AWSLineItemDaily, to manage AWS line item data directly within PostgreSQL, leveraging partitioning for efficient data handling.
Refactored Base Processor for Generic Self-Hosted Writes: Centralized the logic for writing data to self-hosted PostgreSQL tables by refactoring ReportParquetProcessorBase. This includes a new write_to_self_hosted_table() method, a _prepare_dataframe_for_write() hook for provider-specific column additions, and a self_hosted_line_item_model property.
Manifest-ID Based Data Deletion: Implemented a generic delete_day_postgres() method in the base processor that uses manifest IDs for data deletion, ensuring consistent cleanup across self-hosted providers.
AWS-Specific Self-Hosted SQL Files: Added a new directory structure (self_hosted_sql/aws/) containing PostgreSQL-converted SQL files for various AWS reporting summaries, including daily, compute, cost, network, and storage.
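
To illustrate how an accessor might pick between the self-hosted SQL directory and the Trino one, here is a rough sketch. The signature of get_sql_folder_name below is hypothetical (the PR names the function but does not show it here), and `trino_sql` is an assumed folder name; only `self_hosted_sql/aws/` is taken from the PR description.

```python
from pathlib import PurePosixPath


def get_sql_folder_name(onprem: bool) -> str:
    """Hypothetical selector: self-hosted SQL on-prem, Trino SQL otherwise."""
    # "trino_sql" is an assumed folder name; "self_hosted_sql" is from the PR.
    return "self_hosted_sql" if onprem else "trino_sql"


def sql_path(onprem: bool, provider: str, filename: str) -> str:
    """Build a package-relative path to a provider SQL file."""
    return str(PurePosixPath(get_sql_folder_name(onprem), provider, filename))


onprem_path = sql_path(True, "aws", "reporting_awscostentrylineitem_daily_summary.sql")
trino_path = sql_path(False, "aws", "reporting_awscostentrylineitem_daily_summary.sql")
```

Keeping the file names identical across both folders means the caller only switches the folder, not the query name.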

Changelog

koku/masu/database/aws_report_db_accessor.py
- Updated SQL file path references to use get_sql_folder_name() for consistency.
- Added delete_self_hosted_data_by_source() method to remove data from self-hosted tables based on provider UUID.
koku/masu/database/self_hosted_sql/aws/openshift/populate_daily_summary/0_prepare_daily_summary_tables.sql
- Added SQL script to create temporary and summary tables for AWS OpenShift daily data in PostgreSQL.
koku/masu/database/self_hosted_sql/aws/openshift/populate_daily_summary/1_resource_matching_by_cluster.sql
- Added SQL script for resource matching by cluster, inserting data into temporary daily summary tables.
koku/masu/database/self_hosted_sql/aws/openshift/populate_daily_summary/2_summarize_data_by_cluster.sql
- Added SQL script to summarize daily data by cluster, including storage, unattributed storage, direct resource, and tag matching logic.
koku/masu/database/self_hosted_sql/aws/openshift/populate_daily_summary/3_reporting_ocpawscostlineitem_project_daily_summary_p.sql
- Added SQL script to insert managed table data into the final PostgreSQL reporting table for OCP on AWS project daily summaries.
koku/masu/database/self_hosted_sql/aws/openshift/reporting_ocpaws_matched_tags.sql
- Added SQL script to identify matched tags between AWS and OCP resources for self-hosted environments.
koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_compute_summary_p.sql
- Added SQL script to populate the OCP on AWS compute summary table for UI reporting in PostgreSQL.
koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_cost_summary_by_account_p.sql
- Added SQL script to populate the OCP on AWS cost summary by account table for UI reporting in PostgreSQL.
koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_cost_summary_by_region_p.sql
- Added SQL script to populate the OCP on AWS cost summary by region table for UI reporting in PostgreSQL.
koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_cost_summary_by_service_p.sql
- Added SQL script to populate the OCP on AWS cost summary by service table for UI reporting in PostgreSQL.
koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_database_summary_p.sql
- Added SQL script to populate the OCP on AWS database summary table for UI reporting in PostgreSQL.
koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_network_summary_p.sql
- Added SQL script to populate the OCP on AWS network summary table for UI reporting in PostgreSQL.
koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_storage_summary_p.sql
- Added SQL script to populate the OCP on AWS storage summary table for UI reporting in PostgreSQL.
koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpawscostlineitem_project_daily_summary_p.sql
- Added SQL script to insert managed table data into the final PostgreSQL reporting table for OCP on AWS project daily summaries for UI.
koku/masu/database/self_hosted_sql/aws/reporting_awscostentrylineitem_daily_summary.sql
- Added SQL script to insert daily AWS cost entry line item summaries into PostgreSQL.
koku/masu/database/self_hosted_sql/aws/reporting_awscostentrylineitem_summary_by_ec2_compute_p.sql
- Added SQL script to insert AWS cost entry line item summaries by EC2 compute into PostgreSQL.
koku/masu/database/self_hosted_sql/aws/reporting_ocpinfrastructure_provider_map.sql
- Added SQL script to map OCP infrastructure providers to AWS resources for self-hosted environments.
koku/masu/processor/aws/aws_report_parquet_processor.py
- Initialized _is_daily and _date_column attributes for self-hosted processing.
- Added self_hosted_line_item_model property to return the appropriate Django model for AWS line items.
- Added get_table_names_for_delete() to specify AWS tables for deletion.
- Implemented _prepare_dataframe_for_write() to add manifestid to the dataframe before writing to PostgreSQL.
koku/masu/processor/ocp/ocp_report_parquet_processor.py
- Initialized _date_column attribute for OCP processing.
- Refactored write_to_self_hosted_table() by moving common logic to the base class and implemented _prepare_dataframe_for_write() to add OCP-specific columns.
koku/masu/processor/report_parquet_processor_base.py
- Introduced self_hosted_line_item_model as an abstract property for subclasses to define their Django models.
- Added _prepare_dataframe_for_write() as an abstract method for subclasses to add provider-specific columns.
- Implemented a generic write_to_self_hosted_table() method to handle data writing to PostgreSQL using Django models and partitioning.
- Added get_table_names_for_delete() to return table names for deletion, with a default implementation.
- Implemented delete_day_postgres() for manifest-ID based deletion of data from self-hosted PostgreSQL tables.
koku/reporting/migrations/0344_aws_line_item_models.py
- Added a new migration to create AWSLineItem and AWSLineItemDaily Django models, enabling PostgreSQL partitioning for AWS line item data.
koku/reporting/provider/aws/self_hosted_models.py
- Added new file defining AWSLineItemBase, AWSLineItem, and AWSLineItemDaily Django models for self-hosted PostgreSQL storage of AWS line items.
- Included SELF_HOSTED_MODEL_MAP and SELF_HOSTED_DAILY_MODEL_MAP for model lookup.
- Provided helper functions get_self_hosted_models() and get_self_hosted_table_names() for managing self-hosted models and table names.
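
The model maps and helper functions in the last entry could look roughly like the sketch below. It is an assumption-heavy illustration: a small dataclass stands in for the Django models, and the map keys and function signatures are hypothetical. The table names aws_line_items and aws_line_items_daily are the ones given in the PR description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelStub:
    """Stand-in for a Django model; only the table name matters here."""

    db_table: str


AWS_LINE_ITEM = ModelStub("aws_line_items")
AWS_LINE_ITEM_DAILY = ModelStub("aws_line_items_daily")

# Keys are assumed; the PR names the maps but not their contents.
SELF_HOSTED_MODEL_MAP = {"aws": AWS_LINE_ITEM}
SELF_HOSTED_DAILY_MODEL_MAP = {"aws": AWS_LINE_ITEM_DAILY}


def get_self_hosted_models(key="aws"):
    """Return the raw and daily models registered for a key."""
    return [SELF_HOSTED_MODEL_MAP[key], SELF_HOSTED_DAILY_MODEL_MAP[key]]


def get_self_hosted_table_names(key="aws"):
    """Return table names for the registered models, raw first."""
    return [model.db_table for model in get_self_hosted_models(key)]


table_names = get_self_hosted_table_names()
```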

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request effectively adds self-hosted PostgreSQL support for the AWS provider, aligning it with the existing OCP implementation. The changes are well-structured, including the introduction of new Django models for AWS line items, a corresponding database migration, and new PostgreSQL-compatible SQL scripts for data summarization. The refactoring of ReportParquetProcessorBase to create a generic write_to_self_hosted_table method is a significant improvement, centralizing common logic and simplifying provider-specific processors. I've identified a potential issue in one of the new SQL scripts related to tag matching that needs to be addressed.

codecov · 2026-03-01T14:40:26Z

Codecov Report

❌ Patch coverage is 94.30894% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.4%. Comparing base (47db450) to head (493613a).

Additional details and impacted files

@@           Coverage Diff           @@
##            main   #5918     +/-   ##
=======================================
- Coverage   94.4%   94.4%   -0.0%     
=======================================
  Files        362     364      +2     
  Lines      31988   32323    +335     
  Branches    3513    3526     +13     
=======================================
+ Hits       30185   30501    +316     
- Misses      1168    1185     +17     
- Partials     635     637      +2
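
The percentages in the report can be reproduced from the hit and line counts in the table. A small worked check follows; the patch size of 369 lines is an inference (it is the line count for which 21 uncovered lines gives the reported 94.30894%), not a figure stated in the report.

```python
def pct(hits: int, lines: int) -> float:
    """Coverage as a percentage of lines hit."""
    return 100.0 * hits / lines


base = pct(30185, 31988)  # base (47db450)
head = pct(30501, 32323)  # head (493613a)
delta = head - base       # slightly negative, shown as -0.0% in the table

# Assumption: the patch spans 369 lines, the count for which
# 21 missing lines yields the reported 94.30894%.
patch = pct(369 - 21, 369)
```

Both project figures round to 94.4%, which is why the diff shows a change of -0.0% even though hits, misses, and partials all moved.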

dchorvat1 · 2026-03-02T10:38:49Z

/retest

dchorvat1 · 2026-03-02T13:26:22Z

/retest

dchorvat1 · 2026-03-04T10:32:39Z

/retest

dchorvat1 · 2026-03-04T13:24:28Z

/retest

dchorvat1 · 2026-03-06T08:12:58Z

/retest

ydayagi · 2026-03-08T19:42:27Z

/retest

myersCody · 2026-03-10T14:41:07Z

@dchorvat1, can you run our integration tests on this PR to confirm functionality, then move it out of draft?

myersCody · 2026-05-05T19:40:58Z

@@ -0,0 +1,154 @@
+CREATE TABLE IF NOT EXISTS {{schema | sqlsafe}}.managed_aws_openshift_daily_temp


What is the migration strategy for these tables in the on-premises flow? From my perspective, it doesn't appear that we have one at all.

I will create Django DB models. Is that OK?

myersCody · 2026-05-05T19:52:30Z

Testing Instructions are required

myersCody · 2026-05-05T20:09:46Z

+        """Return list of table names to delete from. Override in subclass if needed."""
+        return [self._table_name]
+
+    def delete_day_postgres(self, start_date, reportnumhours=None):


Why are we calling this delete day when we delete the entire month?

It is something I took from the existing Trino implementation. OCP deletes by day and the other providers don't, but they share a base class.

myersCody · 2026-05-05T20:10:25Z

+        """Return list of table names to delete from. Override in subclass if needed."""
+        return [self._table_name]
+
+    def delete_day_postgres(self, start_date, reportnumhours=None):


You pass in start_date here, but don't seem to use it anywhere.

It is the base class; the OCP implementation uses it.
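
As an illustration of the manifestid-based deletion this PR describes, and of a shared signature that carries a start_date only some providers need, here is a minimal sketch. It uses an in-memory SQLite table in place of PostgreSQL; the function name, the column names (source, usage_month, manifestid), and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE aws_line_items (source TEXT, usage_month TEXT, manifestid INTEGER)"
)
conn.executemany(
    "INSERT INTO aws_line_items VALUES (?, ?, ?)",
    [
        ("src-1", "2026-03", 10),  # older manifest, same source/month
        ("src-1", "2026-03", 11),  # current manifest
        ("src-1", "2026-02", 9),   # different month, untouched
        ("src-2", "2026-03", 10),  # different source, untouched
    ],
)


def delete_previous_manifests(conn, source, usage_month, manifest_id, start_date=None):
    """Delete rows left by earlier manifests for one source and month.

    start_date is accepted only to mirror a shared signature; this
    default ignores it, since a day-based override would be the one to use it.
    """
    cur = conn.execute(
        "DELETE FROM aws_line_items "
        "WHERE source = ? AND usage_month = ? AND manifestid != ?",
        (source, usage_month, manifest_id),
    )
    return cur.rowcount


deleted = delete_previous_manifests(conn, "src-1", "2026-03", 11)
remaining = conn.execute("SELECT COUNT(*) FROM aws_line_items").fetchone()[0]
```

Only the row from the older manifest for the same source and month is removed; rows for other months or sources are left alone.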

myersCody · 2026-05-05T20:18:14Z

+        """Return list of table names to delete from. Override in subclass if needed."""
+        return [self._table_name]
+
+    def delete_day_postgres(self, start_date, reportnumhours=None):


I highly recommend we follow the call chain for this

    for csv_filename in file_list:
        # set start date based on data in the file being processed:
        if self.provider_type == Provider.PROVIDER_OCP:
            self.start_date = self.ocp_files_to_process[csv_filename.stem]["meta_reportdatestart"]
        self._delete_old_data(Path(csv_filename))
        if self.provider_type == Provider.PROVIDER_OCP and self.report_type is None:
            msg = "Unknown report type, skipping file processing"
            LOG.warning(
                log_json(
                    self.tracing_id,
                    msg=msg,
                    context=self.error_context,
                    filename=csv_filename,
                )
            )
            return

Inside of _delete_old_data:

    if settings.ONPREM:
        self._delete_old_data_postgres(filename)
    else:
        self._delete_old_data_trino(filename)

    def _delete_old_data_postgres(self, filename):
        """remove records with data older than the data in the file being processed"""
        # Get reportnumhours for OCP (will be None for non-OCP)
        reportnumhours = None
        if self.ocp_files_to_process:
            reportnumhours = int(self.ocp_files_to_process[filename.stem]["meta_reportnumhours"])
        # Processor handles deleting from all relevant tables (raw and daily for OCP)
        processor = self._get_report_processor(daily=False)
        processor.delete_day_postgres(self.start_date, reportnumhours)

Are you deleting a whole month of data each time we process a csv?

No, the delete runs once per manifest, not per CSV. The should_s3_parquet_be_cleared / get_s3_parquet_cleared guard in _delete_old_data ensures that only the first file in a manifest triggers the delete. The manifestid-based approach deletes data from previous manifests for the same source/month, effectively replacing old data with the current manifest's data. See the guard logic at koku/masu/processor/parquet/parquet_report_processor.py lines 676-692.
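
The once-per-manifest behaviour described in this reply can be sketched as a simple guard around the delete call. This is not the code in parquet_report_processor.py; the class, the function names, and the in-memory set standing in for the persisted "cleared" flag are all hypothetical.

```python
class ManifestStateSketch:
    """Hypothetical per-manifest flag recording whether old data was cleared."""

    def __init__(self):
        self._cleared = set()

    def is_cleared(self, manifest_id):
        return manifest_id in self._cleared

    def mark_cleared(self, manifest_id):
        self._cleared.add(manifest_id)


def process_files(manifest_id, file_list, state, delete_fn):
    """Call delete_fn at most once per manifest, however many files arrive."""
    for filename in file_list:
        if not state.is_cleared(manifest_id):
            delete_fn(filename)
            state.mark_cleared(manifest_id)
        # ... per-file conversion and write would follow here ...


calls = []
state = ManifestStateSketch()
process_files(7, ["a.csv", "b.csv", "c.csv"], state, calls.append)
process_files(7, ["d.csv"], state, calls.append)  # same manifest, later batch
process_files(8, ["e.csv"], state, calls.append)  # new manifest
```

In this sketch the delete fires for the first file of manifest 7 and the first file of manifest 8, and for nothing else, which is the property the reply relies on.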

ydayagi · 2026-05-07T02:37:36Z

> Testing Instructions are required

I used the existing IQE tests for AWS and AWS on OCP. The only thing I added was the ONPREM setting. Please advise what other items you expect.

myersCody · 2026-05-07T15:20:30Z

Did you run the test locally or did you rely on CI? I don't think CI has been set up to trigger those tests yet.

ydayagi · 2026-05-07T16:04:32Z

There is a label for it, and I used it in the PR. The labels changed a few times due to CI issues. I also executed the tests locally: all of the aws_for/on_ocp tests and the AWS tests in general.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

koku-ci-triager-bot · 2026-05-19T12:21:13Z

🤖 CI Triager — Diagnosis

Check: Red Hat Konflux / koku-ci / koku
PipelineRun: koku-ci-fhqdl
Root cause: The deploy-application task timed out waiting for the ephemeral Clowder environment to become ready. Multiple dependent services (sources-api, rbac, puptoo) failed to start, and the Clowder environment was locked. This is a transient infrastructure issue unrelated to this PR's code changes.
Evidence:

Warning  ClowdEnvLocked   clowdapp/koku     Clowder Environment [env-ephemeral-lyfq6z] is locked
Warning  ClowdAppNotReady clowdapp/koku     ClowdApp [koku] is not ready
Warning  BackOff          pod/sources-api-svc-67c9f56576-xl25j  Back-off restarting failed container

ERROR: deploy failed: timed out waiting for ClowdApp-owned resources

Action: Re-trigger the koku-ci check. The ephemeral environment infrastructure was unhealthy at the time this run executed.

Generated automatically. Review before applying.

ydayagi requested review from a team as code owners March 1, 2026 13:20

github-actions Bot added the smokes-required Label to show that smokes tests should be run against these changes. label Mar 1, 2026

gemini-code-assist Bot reviewed Mar 1, 2026

Comment thread ...atabase/self_hosted_sql/aws/openshift/populate_daily_summary/2_summarize_data_by_cluster.sql Outdated

ydayagi force-pushed the trino2pgaws branch 4 times, most recently from cfc4e11 to b5c72b8 Compare March 1, 2026 14:26

ydayagi force-pushed the trino2pgaws branch from b5c72b8 to b45473b Compare March 1, 2026 15:00

ydayagi requested review from bacciotti, lcouzens and myersCody March 1, 2026 15:01

ydayagi force-pushed the trino2pgaws branch 4 times, most recently from 9cdc204 to d66e3e6 Compare March 1, 2026 18:59

ydayagi force-pushed the trino2pgaws branch 2 times, most recently from 901d1a1 to 7f60de5 Compare March 3, 2026 14:12

ydayagi force-pushed the trino2pgaws branch 2 times, most recently from cf14ee6 to f46c61c Compare March 8, 2026 15:37

ydayagi force-pushed the trino2pgaws branch 3 times, most recently from 75b327c to 1cf3547 Compare March 10, 2026 09:15

lcouzens added the flightpath-pr Issues being worked on by the flight path team label Mar 10, 2026

myersCody marked this pull request as draft March 10, 2026 14:40

myersCody added the on-hold label Mar 18, 2026

myersCody requested changes May 5, 2026

myersCody reviewed May 5, 2026

Comment thread koku/masu/processor/report_parquet_processor_base.py Outdated

myersCody reviewed May 5, 2026

ydayagi force-pushed the trino2pgaws branch 7 times, most recently from 62cec59 to f5a7761 Compare May 11, 2026 05:54

[FLPATH-3323] Add AWS self-hosted/on-prem support

493613a

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ydayagi force-pushed the trino2pgaws branch from f5a7761 to 493613a Compare May 11, 2026 05:57

5 participants

codecov Bot commented Mar 1, 2026 •

edited

Loading