[FLPATH-3323] Add AWS self-hosted/on-prem support #5918

Draft
ydayagi wants to merge 1 commit into main from trino2pgaws

Conversation

@ydayagi
Collaborator

@ydayagi ydayagi commented Mar 1, 2026

Add self-hosted PostgreSQL support for AWS provider, following the pattern established for OCP. This enables running koku without Trino/Hive for on-premises deployments.

Changes:

  • Add Django models for AWS line items (aws_line_items, aws_line_items_daily)
  • Add migration for partitioned AWS line item tables
  • Add self_hosted_sql/aws/ directory with PostgreSQL-converted SQL files
  • Refactor base processor with generic write_to_self_hosted_table()
  • Add _prepare_dataframe_for_write() hook for provider-specific columns
  • Add delete_day_postgres() with manifestid-based deletion
  • Update AWS db accessor to use get_sql_folder_name() for SQL selection
  • Add delete_self_hosted_data_by_source() for cleanup
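
The base-processor refactor listed above (a generic write path plus a provider-specific hook) follows the template-method pattern. A minimal sketch, assuming rows are plain dicts rather than a pandas DataFrame and a list stands in for the PostgreSQL table; the class names and the `_prepare_rows_for_write` hook name are hypothetical and are not the code in this PR:

```python
from abc import ABC, abstractmethod


class ReportProcessorSketch(ABC):
    """Hypothetical stand-in for the shared base processor."""

    def __init__(self, manifest_id, sink):
        self.manifest_id = manifest_id
        self.sink = sink  # a plain list standing in for a PostgreSQL table

    @abstractmethod
    def _prepare_rows_for_write(self, rows):
        """Hook: return the rows with provider-specific columns added."""

    def write_to_self_hosted_table(self, rows):
        """Generic write path shared by every provider."""
        # Copy each row so the caller's data is not mutated by the hook.
        prepared = self._prepare_rows_for_write([dict(row) for row in rows])
        self.sink.extend(prepared)
        return len(prepared)


class AWSProcessorSketch(ReportProcessorSketch):
    def _prepare_rows_for_write(self, rows):
        # AWS only needs the manifest id so stale manifests can be deleted later.
        for row in rows:
            row["manifestid"] = self.manifest_id
        return rows


source_rows = [{"lineitem_usageamount": 1.5}]
table = []
processor = AWSProcessorSketch(manifest_id=42, sink=table)
written = processor.write_to_self_hosted_table(source_rows)
```

With this shape, a new provider only overrides the hook; the write logic itself stays in one place.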

Jira: https://issues.redhat.com/browse/FLPATH-3323

Jira Ticket

COST-####

Description

This change will ...

Testing

  1. Checkout Branch
  2. Restart Koku
  3. Hit endpoint or launch shell
    1. You should see ...
  4. Do more things...

Release Notes

  • proposed release note
* [COST-####](https://issues.redhat.com/browse/COST-####) Fix some things

@ydayagi ydayagi requested review from a team as code owners March 1, 2026 13:20
@github-actions github-actions Bot added the smokes-required label (Label to show that smokes tests should be run against these changes.) Mar 1, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances Koku's capabilities by introducing self-hosted PostgreSQL support for AWS. This change allows on-premises deployments to process AWS cost and usage data without relying on Trino or Hive, providing greater flexibility and control over data infrastructure. The implementation involves new Django models for AWS line items, a generalized framework for writing and deleting data in PostgreSQL, and a suite of AWS-specific SQL queries tailored for the self-hosted environment.

Highlights

  • AWS Self-Hosted PostgreSQL Support: Introduced comprehensive support for self-hosted PostgreSQL for AWS providers, mirroring the existing OpenShift (OCP) pattern. This enables Koku to run without Trino/Hive in on-premises deployments.
  • New Django Models for AWS Line Items: Added new Django models, AWSLineItem and AWSLineItemDaily, to manage AWS line item data directly within PostgreSQL, leveraging partitioning for efficient data handling.
  • Refactored Base Processor for Generic Self-Hosted Writes: Centralized the logic for writing data to self-hosted PostgreSQL tables by refactoring ReportParquetProcessorBase. This includes a new write_to_self_hosted_table() method, a _prepare_dataframe_for_write() hook for provider-specific column additions, and a self_hosted_line_item_model property.
  • Manifest-ID Based Data Deletion: Implemented a generic delete_day_postgres() method in the base processor that uses manifest IDs for data deletion, ensuring consistent cleanup across self-hosted providers.
  • AWS-Specific Self-Hosted SQL Files: Added a new directory structure (self_hosted_sql/aws/) containing PostgreSQL-converted SQL files for various AWS reporting summaries, including daily, compute, cost, network, and storage.
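
The manifest-ID based deletion in the highlights can be sketched roughly as follows. This is an illustration only: sqlite3 stands in for PostgreSQL, the `source`, `year`, and `month` columns and the `delete_stale_manifest_rows` helper are assumptions, and only the table names and the `manifestid` column come from the PR description.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE aws_line_items "
    "(source TEXT, year TEXT, month TEXT, manifestid INTEGER, cost REAL)"
)
conn.executemany(
    "INSERT INTO aws_line_items VALUES (?, ?, ?, ?, ?)",
    [
        ("src-a", "2026", "03", 1, 10.0),  # stale: written by an earlier manifest
        ("src-a", "2026", "03", 2, 12.0),  # current manifest, must survive
        ("src-a", "2026", "02", 1, 7.0),   # different month, untouched
        ("src-b", "2026", "03", 1, 3.0),   # different source, untouched
    ],
)

ALLOWED_TABLES = {"aws_line_items", "aws_line_items_daily"}


def delete_stale_manifest_rows(conn, table, source, year, month, current_manifest_id):
    """Delete rows for one source/month that belong to any other manifest."""
    # Table names cannot be bound as parameters, so check against an allow-list.
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table: {table}")
    cursor = conn.execute(
        f"DELETE FROM {table} "
        "WHERE source = ? AND year = ? AND month = ? AND manifestid != ?",
        (source, year, month, current_manifest_id),
    )
    return cursor.rowcount


deleted = delete_stale_manifest_rows(conn, "aws_line_items", "src-a", "2026", "03", 2)
remaining = conn.execute("SELECT COUNT(*) FROM aws_line_items").fetchone()[0]
```

Only the row from the earlier manifest for the same source and month is removed; other months and sources are left alone.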

Changelog
  • koku/masu/database/aws_report_db_accessor.py
    • Updated SQL file path references to use get_sql_folder_name() for consistency.
    • Added delete_self_hosted_data_by_source() method to remove data from self-hosted tables based on provider UUID.
  • koku/masu/database/self_hosted_sql/aws/openshift/populate_daily_summary/0_prepare_daily_summary_tables.sql
    • Added SQL script to create temporary and summary tables for AWS OpenShift daily data in PostgreSQL.
  • koku/masu/database/self_hosted_sql/aws/openshift/populate_daily_summary/1_resource_matching_by_cluster.sql
    • Added SQL script for resource matching by cluster, inserting data into temporary daily summary tables.
  • koku/masu/database/self_hosted_sql/aws/openshift/populate_daily_summary/2_summarize_data_by_cluster.sql
    • Added SQL script to summarize daily data by cluster, including storage, unattributed storage, direct resource, and tag matching logic.
  • koku/masu/database/self_hosted_sql/aws/openshift/populate_daily_summary/3_reporting_ocpawscostlineitem_project_daily_summary_p.sql
    • Added SQL script to insert managed table data into the final PostgreSQL reporting table for OCP on AWS project daily summaries.
  • koku/masu/database/self_hosted_sql/aws/openshift/reporting_ocpaws_matched_tags.sql
    • Added SQL script to identify matched tags between AWS and OCP resources for self-hosted environments.
  • koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_compute_summary_p.sql
    • Added SQL script to populate the OCP on AWS compute summary table for UI reporting in PostgreSQL.
  • koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_cost_summary_by_account_p.sql
    • Added SQL script to populate the OCP on AWS cost summary by account table for UI reporting in PostgreSQL.
  • koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_cost_summary_by_region_p.sql
    • Added SQL script to populate the OCP on AWS cost summary by region table for UI reporting in PostgreSQL.
  • koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_cost_summary_by_service_p.sql
    • Added SQL script to populate the OCP on AWS cost summary by service table for UI reporting in PostgreSQL.
  • koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_database_summary_p.sql
    • Added SQL script to populate the OCP on AWS database summary table for UI reporting in PostgreSQL.
  • koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_network_summary_p.sql
    • Added SQL script to populate the OCP on AWS network summary table for UI reporting in PostgreSQL.
  • koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_storage_summary_p.sql
    • Added SQL script to populate the OCP on AWS storage summary table for UI reporting in PostgreSQL.
  • koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpawscostlineitem_project_daily_summary_p.sql
    • Added SQL script to insert managed table data into the final PostgreSQL reporting table for OCP on AWS project daily summaries for UI.
  • koku/masu/database/self_hosted_sql/aws/reporting_awscostentrylineitem_daily_summary.sql
    • Added SQL script to insert daily AWS cost entry line item summaries into PostgreSQL.
  • koku/masu/database/self_hosted_sql/aws/reporting_awscostentrylineitem_summary_by_ec2_compute_p.sql
    • Added SQL script to insert AWS cost entry line item summaries by EC2 compute into PostgreSQL.
  • koku/masu/database/self_hosted_sql/aws/reporting_ocpinfrastructure_provider_map.sql
    • Added SQL script to map OCP infrastructure providers to AWS resources for self-hosted environments.
  • koku/masu/processor/aws/aws_report_parquet_processor.py
    • Initialized _is_daily and _date_column attributes for self-hosted processing.
    • Added self_hosted_line_item_model property to return the appropriate Django model for AWS line items.
    • Added get_table_names_for_delete() to specify AWS tables for deletion.
    • Implemented _prepare_dataframe_for_write() to add manifestid to the dataframe before writing to PostgreSQL.
  • koku/masu/processor/ocp/ocp_report_parquet_processor.py
    • Initialized _date_column attribute for OCP processing.
    • Refactored write_to_self_hosted_table() by moving common logic to the base class and implemented _prepare_dataframe_for_write() to add OCP-specific columns.
  • koku/masu/processor/report_parquet_processor_base.py
    • Introduced self_hosted_line_item_model as an abstract property for subclasses to define their Django models.
    • Added _prepare_dataframe_for_write() as an abstract method for subclasses to add provider-specific columns.
    • Implemented a generic write_to_self_hosted_table() method to handle data writing to PostgreSQL using Django models and partitioning.
    • Added get_table_names_for_delete() to return table names for deletion, with a default implementation.
    • Implemented delete_day_postgres() for manifest-ID based deletion of data from self-hosted PostgreSQL tables.
  • koku/reporting/migrations/0344_aws_line_item_models.py
    • Added a new migration to create AWSLineItem and AWSLineItemDaily Django models, enabling PostgreSQL partitioning for AWS line item data.
  • koku/reporting/provider/aws/self_hosted_models.py
    • Added new file defining AWSLineItemBase, AWSLineItem, and AWSLineItemDaily Django models for self-hosted PostgreSQL storage of AWS line items.
    • Included SELF_HOSTED_MODEL_MAP and SELF_HOSTED_DAILY_MODEL_MAP for model lookup.
    • Provided helper functions get_self_hosted_models() and get_self_hosted_table_names() for managing self-hosted models and table names.
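
For readers unfamiliar with the SQL selection mentioned in the changelog, here is a hedged sketch of how a folder-name helper could route between the two SQL trees. The signature of `get_sql_folder_name()` is not shown in this conversation, so the boolean argument, the `trino_sql` folder name, and the `sql_path` helper are all assumptions:

```python
from pathlib import PurePosixPath


def get_sql_folder_name(onprem: bool) -> str:
    """Hypothetical: choose the SQL directory for the active backend."""
    return "self_hosted_sql" if onprem else "trino_sql"


def sql_path(onprem: bool, *parts: str) -> str:
    """Build a relative SQL file path under the selected folder."""
    return str(PurePosixPath(get_sql_folder_name(onprem), *parts))


onprem_path = sql_path(True, "aws", "reporting_awscostentrylineitem_daily_summary.sql")
hosted_path = sql_path(False, "aws", "reporting_awscostentrylineitem_daily_summary.sql")
```

Keeping the file names identical across both trees means callers only vary the folder, not the query name.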

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request effectively adds self-hosted PostgreSQL support for the AWS provider, aligning it with the existing OCP implementation. The changes are well-structured, including the introduction of new Django models for AWS line items, a corresponding database migration, and new PostgreSQL-compatible SQL scripts for data summarization. The refactoring of ReportParquetProcessorBase to create a generic write_to_self_hosted_table method is a significant improvement, centralizing common logic and simplifying provider-specific processors. I've identified a potential issue in one of the new SQL scripts related to tag matching that needs to be addressed.

@ydayagi ydayagi force-pushed the trino2pgaws branch 4 times, most recently from cfc4e11 to b5c72b8 Compare March 1, 2026 14:26
@codecov

codecov Bot commented Mar 1, 2026

Codecov Report

❌ Patch coverage is 94.30894% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.4%. Comparing base (47db450) to head (493613a).

Additional details and impacted files
@@           Coverage Diff           @@
##            main   #5918     +/-   ##
=======================================
- Coverage   94.4%   94.4%   -0.0%     
=======================================
  Files        362     364      +2     
  Lines      31988   32323    +335     
  Branches    3513    3526     +13     
=======================================
+ Hits       30185   30501    +316     
- Misses      1168    1185     +17     
- Partials     635     637      +2     
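
As a quick arithmetic check of the table above, project coverage can be recomputed from the hits and lines counts. The dictionary layout and the `coverage` helper are illustrative and not part of Codecov; only the numbers come from the report.

```python
base = {"lines": 31988, "hits": 30185, "misses": 1168, "partials": 635}
head = {"lines": 32323, "hits": 30501, "misses": 1185, "partials": 637}


def coverage(report):
    """Return hits / lines after checking that the counts add up."""
    total = report["hits"] + report["misses"] + report["partials"]
    if total != report["lines"]:
        raise ValueError("hits + misses + partials must equal lines")
    return report["hits"] / report["lines"]


base_cov = coverage(base)
head_cov = coverage(head)
delta_points = (head_cov - base_cov) * 100  # change in percentage points

print(f"base {base_cov:.4%}, head {head_cov:.4%}, delta {delta_points:+.4f} points")
```

Both values round to 94.4%, and the difference is a small negative fraction of a percentage point, which is why the report shows -0.0%.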

@ydayagi ydayagi force-pushed the trino2pgaws branch 4 times, most recently from 9cdc204 to d66e3e6 Compare March 1, 2026 18:59
@ydayagi ydayagi added the aws-smoke-tests (pr_check will run aws + ocp on aws smoke tests, used when changes affect AWS only.) and full-run-smoke-tests (pr_check will run all smoke tests. Used for large or wider reaching changes.) labels and removed the full-run-smoke-tests label Mar 1, 2026
@dchorvat1 dchorvat1 added the on-prem-processing (pr_check will deploy and run the on-prem data pipeline processing flow.) and ocp-smoke-tests (pr_check will run ocp + ocp on cloud smoke tests, used when changes affect ocp.) labels and removed the aws-smoke-tests label Mar 2, 2026
@dchorvat1
Member

/retest

1 similar comment
@dchorvat1
Member

/retest

@ydayagi ydayagi force-pushed the trino2pgaws branch 2 times, most recently from 901d1a1 to 7f60de5 Compare March 3, 2026 14:12
@dchorvat1
Member

/retest

1 similar comment
@dchorvat1
Member

/retest

@dchorvat1
Member

/retest

@ydayagi ydayagi force-pushed the trino2pgaws branch 2 times, most recently from cf14ee6 to f46c61c Compare March 8, 2026 15:37
@ydayagi
Collaborator Author

ydayagi commented Mar 8, 2026

/retest

@ydayagi ydayagi force-pushed the trino2pgaws branch 3 times, most recently from 75b327c to 1cf3547 Compare March 10, 2026 09:15
@lcouzens lcouzens added the flightpath-pr label (Issues being worked on by the flight path team) Mar 10, 2026
@myersCody myersCody marked this pull request as draft March 10, 2026 14:40
@myersCody
Contributor

@dchorvat1 can you run our integration tests on this to confirm functionality, then move it out of draft?

@@ -0,0 +1,154 @@
CREATE TABLE IF NOT EXISTS {{schema | sqlsafe}}.managed_aws_openshift_daily_temp
Contributor

What is the migration strategy for these tables in the on-premises flow? From my perspective, it doesn't appear that we have one at all.

Collaborator Author

I will create Django DB models. Is that OK?

@myersCody
Contributor

Testing Instructions are required

Comment thread koku/masu/processor/report_parquet_processor_base.py Outdated
        """Return list of table names to delete from. Override in subclass if needed."""
        return [self._table_name]

    def delete_day_postgres(self, start_date, reportnumhours=None):
Contributor

Why are we calling this delete day when we delete the entire month?

Collaborator Author

It is something I took from the existing Trino implementation. OCP deletes by day and the others don't, but they share a base class.

        """Return list of table names to delete from. Override in subclass if needed."""
        return [self._table_name]

    def delete_day_postgres(self, start_date, reportnumhours=None):
Contributor

You pass in start_date here, but don't seem to use it anywhere.

Collaborator Author

It is the base class. The OCP implementation uses it.

        """Return list of table names to delete from. Override in subclass if needed."""
        return [self._table_name]

    def delete_day_postgres(self, start_date, reportnumhours=None):
Contributor

I highly recommend we follow the call chain for this

for csv_filename in file_list:
    # set start date based on data in the file being processed:
    if self.provider_type == Provider.PROVIDER_OCP:
        self.start_date = self.ocp_files_to_process[csv_filename.stem]["meta_reportdatestart"]

    self._delete_old_data(Path(csv_filename))
    if self.provider_type == Provider.PROVIDER_OCP and self.report_type is None:
        msg = "Unknown report type, skipping file processing"
        LOG.warning(
            log_json(
                self.tracing_id,
                msg=msg,
                context=self.error_context,
                filename=csv_filename,
            )
        )
        return

Inside of _delete_old_data:

if settings.ONPREM:
    self._delete_old_data_postgres(filename)
else:
    self._delete_old_data_trino(filename)

def _delete_old_data_postgres(self, filename):
    """remove records with data older than the data in the file being processed"""
    # Get reportnumhours for OCP (will be None for non-OCP)
    reportnumhours = None
    if self.ocp_files_to_process:
        reportnumhours = int(self.ocp_files_to_process[filename.stem]["meta_reportnumhours"])

    # Processor handles deleting from all relevant tables (raw and daily for OCP)
    processor = self._get_report_processor(daily=False)
    processor.delete_day_postgres(self.start_date, reportnumhours)

Are you deleting a whole month of data each time we process a csv?

Collaborator Author

No, the delete runs once per manifest, not per CSV. The should_s3_parquet_be_cleared / get_s3_parquet_cleared guard in _delete_old_data ensures only the first file in a manifest triggers the delete. The manifestid-based approach deletes data from previous manifests for the same source/month, effectively replacing old data with the current manifest's data. See the guard logic at koku/masu/processor/parquet/parquet_report_processor.py lines 676-692.
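
The once-per-manifest behavior described in this reply can be illustrated with a small sketch. It does not reproduce should_s3_parquet_be_cleared or get_s3_parquet_cleared; the `ManifestDeleteGuard` class and `process_manifest` function are hypothetical stand-ins that only show the idea of a cleared flag checked before each file:

```python
class ManifestDeleteGuard:
    """Hypothetical stand-in for the per-manifest cleared-flag bookkeeping."""

    def __init__(self):
        self._cleared = set()  # manifest ids whose old data was already deleted

    def should_delete(self, manifest_id):
        return manifest_id not in self._cleared

    def mark_cleared(self, manifest_id):
        self._cleared.add(manifest_id)


def process_manifest(manifest_id, csv_files, guard, delete_old_data):
    """Process every CSV, deleting old data only before the first one."""
    processed = []
    for csv_file in csv_files:
        if guard.should_delete(manifest_id):
            delete_old_data(manifest_id)
            guard.mark_cleared(manifest_id)
        processed.append(csv_file)  # placeholder for convert-and-write work
    return processed


delete_calls = []
guard = ManifestDeleteGuard()
first = process_manifest(7, ["a.csv", "b.csv", "c.csv"], guard, delete_calls.append)
second = process_manifest(8, ["d.csv"], guard, delete_calls.append)
```

Three files in manifest 7 trigger a single delete, and the next manifest triggers its own.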

@ydayagi
Collaborator Author

ydayagi commented May 7, 2026

> Testing Instructions are required

I used the existing IQE tests for AWS and AWS on OCP. The only thing I added was the ONPREM setting. Please advise what other items you expect.

@myersCody
Contributor

> > Testing Instructions are required
>
> I used the existing IQE tests for AWS and AWS on OCP. The only thing I added was the ONPREM setting. Please advise what other items you expect.

Did you run the test locally or did you rely on CI? I don't think CI has been set up to trigger those tests yet.

@ydayagi
Collaborator Author

ydayagi commented May 7, 2026

> > > Testing Instructions are required
> >
> > I used the existing IQE tests for AWS and AWS on OCP. The only thing I added was the ONPREM setting. Please advise what other items you expect.
>
> Did you run the test locally or did you rely on CI? I don't think CI has been set up to trigger those tests yet.

There is a label for it and I used it in the PR. The labels changed a few times due to CI issues. I also executed the tests locally: all the aws_for/on_ocp tests and the AWS tests in general.

@ydayagi ydayagi force-pushed the trino2pgaws branch 7 times, most recently from 62cec59 to f5a7761 Compare May 11, 2026 05:54
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@koku-ci-triager-bot
Collaborator

🤖 CI Triager — Diagnosis

Check: Red Hat Konflux / koku-ci / koku
PipelineRun: koku-ci-fhqdl
Root cause: The deploy-application task timed out waiting for the ephemeral Clowder environment to become ready. Multiple dependent services (sources-api, rbac, puptoo) failed to start, and the Clowder environment was locked. This is a transient infrastructure issue unrelated to this PR's code changes.
Evidence:

Warning  ClowdEnvLocked   clowdapp/koku     Clowder Environment [env-ephemeral-lyfq6z] is locked
Warning  ClowdAppNotReady clowdapp/koku     ClowdApp [koku] is not ready
Warning  BackOff          pod/sources-api-svc-67c9f56576-xl25j  Back-off restarting failed container

ERROR: deploy failed: timed out waiting for ClowdApp-owned resources

Action: Re-trigger the koku-ci check. The ephemeral environment infrastructure was unhealthy at the time this run executed.

Generated automatically. Review before applying.

Labels

  • aws-smoke-tests: pr_check will run aws + ocp on aws smoke tests, used when changes affect AWS only.
  • flightpath-pr: Issues being worked on by the flight path team
  • on-hold
  • on-prem-processing: pr_check will deploy and run the on-prem data pipeline processing flow.
  • smokes-required: Label to show that smokes tests should be run against these changes.

5 participants