[FLPATH-3327] Add GCP self-hosted/on-prem support#5943

Open
ydayagi wants to merge 3 commits into main from trino2pggcp

Collaborator

@ydayagi ydayagi commented Mar 11, 2026

Add Django ORM models and PostgreSQL support for GCP line item data
storage in on-prem deployments without Trino/Hive.

Changes:

  • Add GCPLineItem and GCPLineItemDaily Django models with partitioning
  • Add migration 0346 for GCP line item tables
  • Update GCP processor with self_hosted_line_item_model property
  • Update GCP db accessor to use get_sql_folder_name()
  • Add delete_self_hosted_data_by_source() for cleanup
  • Copy PostgreSQL SQL files to self_hosted_sql/gcp/

https://issues.redhat.com/browse/FLPATH-3323

@ydayagi ydayagi requested review from a team as code owners March 11, 2026 10:35
@github-actions github-actions Bot added the smokes-required label (Label to show that smokes tests should be run against these changes.) Mar 11, 2026
@ydayagi ydayagi changed the title from "Trino2pggcp" to "[FLPATH-3327] Add GCP self-hosted/on-prem support" Mar 11, 2026
@ydayagi ydayagi added labels Mar 11, 2026:
  • gcp-smoke-tests: pr_check will run gcp + ocp on gcp smoke tests, used when changes affect GCP only.
  • on-prem-processing: pr_check will deploy and run the on-prem data pipeline processing flow.
  • flightpath-pr: Issues being worked on by the flight path team
@ydayagi ydayagi force-pushed the trino2pggcp branch 2 times, most recently from de827f4 to 5aae3eb Compare March 11, 2026 10:48

codecov Bot commented Mar 11, 2026

Codecov Report

❌ Patch coverage is 96.94224% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.4%. Comparing base (47db450) to head (a9b95d3).

Additional details and impacted files
@@           Coverage Diff           @@
##            main   #5943     +/-   ##
=======================================
+ Coverage   94.4%   94.4%   +0.1%     
=======================================
  Files        362     368      +6     
  Lines      31988   32827    +839     
  Branches    3513    3532     +19     
=======================================
+ Hits       30185   30998    +813     
- Misses      1168    1194     +26     
  Partials     635     635             

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist (see footnote 1)! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the platform's capabilities by enabling the storage and processing of Google Cloud Platform (GCP) cost and usage data within self-hosted, on-premise environments. It achieves this by integrating new Django ORM models with PostgreSQL table partitioning, providing a robust and scalable solution for managing large datasets without relying on external data lakes like Trino or Hive. The changes streamline data ingestion, summarization, and cleanup processes for cloud providers in an on-prem context.

Highlights

  • GCP Self-Hosted Support: Introduced comprehensive support for Google Cloud Platform (GCP) line item data storage in on-premise deployments, eliminating the dependency on Trino/Hive.
  • Django ORM Models & PostgreSQL Partitioning: Added new Django ORM models (GCPLineItem and GCPLineItemDaily) with PostgreSQL partitioning to efficiently manage and store GCP line item data locally.
  • Unified Self-Hosted Data Processing: Refactored report parquet processors across AWS, Azure, and GCP to leverage a common base class for writing data to self-hosted PostgreSQL tables, ensuring consistent data ingestion and management.
  • Enhanced Data Deletion Logic: Implemented new PostgreSQL functions and logic for date-scoped and manifest-ID based deletion of self-hosted data, allowing for more granular control over data retention and reprocessing.
  • New PostgreSQL SQL Templates: Included a suite of new PostgreSQL SQL files for AWS, Azure, and GCP to handle daily summary, UI summary, and matched tag processing within the self-hosted environment.

Changelog
  • koku/koku/reportdb_accessor_postgres.py
    • Added get_delete_day_by_manifestid_and_date_sql for date-scoped data deletion in PostgreSQL.
  • koku/masu/database/aws_report_db_accessor.py
    • Updated SQL file path retrieval to use get_sql_folder_name().
    • Added delete_self_hosted_data_by_source for cleaning up self-hosted AWS data.
  • koku/masu/database/azure_report_db_accessor.py
    • Updated SQL file path retrieval to use get_sql_folder_name().
    • Added delete_self_hosted_data_by_source for cleaning up self-hosted Azure data.
  • koku/masu/database/gcp_report_db_accessor.py
    • Updated SQL file path retrieval to use get_sql_folder_name().
    • Added delete_self_hosted_data_by_source for cleaning up self-hosted GCP data.
  • koku/masu/database/self_hosted_sql/aws/openshift/populate_daily_summary/0_prepare_daily_summary_tables.sql
    • Added SQL to create temporary and summary tables for AWS OCP daily data processing.
  • koku/masu/database/self_hosted_sql/aws/openshift/populate_daily_summary/1_resource_matching_by_cluster.sql
    • Added SQL for resource matching and data insertion into temporary tables for AWS OCP.
  • koku/masu/database/self_hosted_sql/aws/openshift/populate_daily_summary/2_summarize_data_by_cluster.sql
    • Added SQL for summarizing AWS OCP data by cluster, including storage and network costs.
  • koku/masu/database/self_hosted_sql/aws/openshift/populate_daily_summary/3_reporting_ocpawscostlineitem_project_daily_summary_p.sql
    • Added SQL to insert managed AWS OCP data into the final PostgreSQL summary table.
  • koku/masu/database/self_hosted_sql/aws/openshift/reporting_ocpaws_matched_tags.sql
    • Added SQL for identifying matched tags between AWS and OCP resources.
  • koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_compute_summary_p.sql
    • Added SQL for populating AWS OCP UI compute summary tables.
  • koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_cost_summary_by_account_p.sql
    • Added SQL for populating AWS OCP UI cost summary by account tables.
  • koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_cost_summary_by_region_p.sql
    • Added SQL for populating AWS OCP UI cost summary by region tables.
  • koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_cost_summary_by_service_p.sql
    • Added SQL for populating AWS OCP UI cost summary by service tables.
  • koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_cost_summary_p.sql
    • Added SQL for populating AWS OCP UI overall cost summary tables.
  • koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_database_summary_p.sql
    • Added SQL for populating AWS OCP UI database summary tables.
  • koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_network_summary_p.sql
    • Added SQL for populating AWS OCP UI network summary tables.
  • koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpaws_storage_summary_p.sql
    • Added SQL for populating AWS OCP UI storage summary tables.
  • koku/masu/database/self_hosted_sql/aws/openshift/ui_summary/reporting_ocpawscostlineitem_project_daily_summary_p.sql
    • Added SQL for inserting managed AWS OCP project daily summary data into PostgreSQL.
  • koku/masu/database/self_hosted_sql/aws/reporting_awscostentrylineitem_daily_summary.sql
    • Added SQL for inserting AWS daily line item data into the summary table.
  • koku/masu/database/self_hosted_sql/aws/reporting_awscostentrylineitem_summary_by_ec2_compute_p.sql
    • Added SQL for summarizing AWS EC2 compute costs.
  • koku/masu/database/self_hosted_sql/aws/reporting_ocpinfrastructure_provider_map.sql
    • Added SQL for mapping OCP and AWS infrastructure providers.
  • koku/masu/database/self_hosted_sql/azure/openshift/populate_daily_summary/0_prepare_daily_summary_tables.sql
    • Added SQL to create temporary and summary tables for Azure OCP daily data processing.
  • koku/masu/database/self_hosted_sql/azure/openshift/populate_daily_summary/1_resource_matching_by_cluster.sql
    • Added SQL for resource matching and data insertion into temporary tables for Azure OCP.
  • koku/masu/database/self_hosted_sql/azure/openshift/populate_daily_summary/2_summarize_data_by_cluster.sql
    • Added SQL for summarizing Azure OCP data by cluster, including storage and network costs.
  • koku/masu/database/self_hosted_sql/azure/openshift/populate_daily_summary/3_reporting_ocpazurecostlineitem_project_daily_summary_p.sql
    • Added SQL to insert managed Azure OCP data into the final PostgreSQL summary table.
  • koku/masu/database/self_hosted_sql/azure/openshift/reporting_ocpazure_matched_tags.sql
    • Added SQL for identifying matched tags between Azure and OCP resources.
  • koku/masu/database/self_hosted_sql/azure/openshift/ui_summary/reporting_ocpazure_compute_summary_p.sql
    • Added SQL for populating Azure OCP UI compute summary tables.
  • koku/masu/database/self_hosted_sql/azure/openshift/ui_summary/reporting_ocpazure_cost_summary_by_account_p.sql
    • Added SQL for populating Azure OCP UI cost summary by account tables.
  • koku/masu/database/self_hosted_sql/azure/openshift/ui_summary/reporting_ocpazure_cost_summary_by_location_p.sql
    • Added SQL for populating Azure OCP UI cost summary by location tables.
  • koku/masu/database/self_hosted_sql/azure/openshift/ui_summary/reporting_ocpazure_cost_summary_by_service_p.sql
    • Added SQL for populating Azure OCP UI cost summary by service tables.
  • koku/masu/database/self_hosted_sql/azure/openshift/ui_summary/reporting_ocpazure_cost_summary_p.sql
    • Added SQL for populating Azure OCP UI overall cost summary tables.
  • koku/masu/database/self_hosted_sql/azure/openshift/ui_summary/reporting_ocpazure_database_summary_p.sql
    • Added SQL for populating Azure OCP UI database summary tables.
  • koku/masu/database/self_hosted_sql/azure/openshift/ui_summary/reporting_ocpazure_network_summary_p.sql
    • Added SQL for populating Azure OCP UI network summary tables.
  • koku/masu/database/self_hosted_sql/azure/openshift/ui_summary/reporting_ocpazure_storage_summary_p.sql
    • Added SQL for populating Azure OCP UI storage summary tables.
  • koku/masu/database/self_hosted_sql/azure/openshift/ui_summary/reporting_ocpazurecostlineitem_project_daily_summary_p.sql
    • Added SQL for inserting managed Azure OCP project daily summary data into PostgreSQL.
  • koku/masu/database/self_hosted_sql/azure/reporting_azurecostentrylineitem_daily_summary.sql
    • Added SQL for inserting Azure daily line item data into the summary table.
  • koku/masu/database/self_hosted_sql/azure/reporting_ocpinfrastructure_provider_map.sql
    • Added SQL for mapping OCP and Azure infrastructure providers.
  • koku/masu/database/self_hosted_sql/gcp/get_invoice_month_dates.sql
    • Added SQL for fetching extended invoice month dates for GCP.
  • koku/masu/database/self_hosted_sql/gcp/openshift/populate_daily_summary/0_prepare_daily_summary_tables.sql
    • Added SQL to create temporary and summary tables for GCP OCP daily data processing.
  • koku/masu/database/self_hosted_sql/gcp/openshift/populate_daily_summary/1_resource_matching_by_cluster.sql
    • Added SQL for resource matching and data insertion into temporary tables for GCP OCP.
  • koku/masu/database/self_hosted_sql/gcp/openshift/populate_daily_summary/2_summarize_data_by_cluster.sql
    • Added SQL for summarizing GCP OCP data by cluster, including storage and network costs.
  • koku/masu/database/self_hosted_sql/gcp/openshift/populate_daily_summary/3_reporting_ocpgcpcostlineitem_project_daily_summary_p.sql
    • Added SQL to insert managed GCP OCP data into the final PostgreSQL summary table.
  • koku/masu/database/self_hosted_sql/gcp/openshift/reporting_ocpgcp_matched_tags.sql
    • Added SQL for identifying matched tags between GCP and OCP resources.
  • koku/masu/database/self_hosted_sql/gcp/openshift/ui_summary/reporting_ocpgcp_compute_summary_p.sql
    • Added SQL for populating GCP OCP UI compute summary tables.
  • koku/masu/database/self_hosted_sql/gcp/openshift/ui_summary/reporting_ocpgcp_cost_summary_by_account_p.sql
    • Added SQL for populating GCP OCP UI cost summary by account tables.
  • koku/masu/database/self_hosted_sql/gcp/openshift/ui_summary/reporting_ocpgcp_cost_summary_by_gcp_project_p.sql
    • Added SQL for populating GCP OCP UI cost summary by GCP project tables.
  • koku/masu/database/self_hosted_sql/gcp/openshift/ui_summary/reporting_ocpgcp_cost_summary_by_region_p.sql
    • Added SQL for populating GCP OCP UI cost summary by region tables.
  • koku/masu/database/self_hosted_sql/gcp/openshift/ui_summary/reporting_ocpgcp_cost_summary_by_service_p.sql
    • Added SQL for populating GCP OCP UI cost summary by service tables.
  • koku/masu/database/self_hosted_sql/gcp/openshift/ui_summary/reporting_ocpgcp_cost_summary_p.sql
    • Added SQL for populating GCP OCP UI overall cost summary tables.
  • koku/masu/database/self_hosted_sql/gcp/openshift/ui_summary/reporting_ocpgcp_database_summary_p.sql
    • Added SQL for populating GCP OCP UI database summary tables.
  • koku/masu/database/self_hosted_sql/gcp/openshift/ui_summary/reporting_ocpgcp_network_summary_p.sql
    • Added SQL for populating GCP OCP UI network summary tables.
  • koku/masu/database/self_hosted_sql/gcp/openshift/ui_summary/reporting_ocpgcp_storage_summary_p.sql
    • Added SQL for populating GCP OCP UI storage summary tables.
  • koku/masu/database/self_hosted_sql/gcp/openshift/ui_summary/reporting_ocpgcpcostlineitem_project_daily_summary_p.sql
    • Added SQL for inserting managed GCP OCP project daily summary data into PostgreSQL.
  • koku/masu/database/self_hosted_sql/gcp/reporting_gcpcostentrylineitem_daily_summary.sql
    • Added SQL for inserting GCP daily line item data into the summary table.
  • koku/masu/database/self_hosted_sql/gcp/reporting_ocpinfrastructure_provider_map.sql
    • Added SQL for mapping OCP and GCP infrastructure providers.
  • koku/masu/database/self_hosted_sql/openshift/cost_model/monthly_cost_gpu.sql
    • Updated GPU product label extraction to use ::jsonb->>'nvidia_com_gpu_product' for consistency.
  • koku/masu/database/self_hosted_sql/openshift/ocp_special_matched_tags.sql
    • Added SQL for aggregating special matched tags for OCP.
  • koku/masu/processor/aws/aws_report_parquet_processor.py
    • Updated to track daily status and define _date_column.
    • Implemented self_hosted_line_item_model property to return AWS Django models.
    • Implemented get_table_names_for_delete to return relevant AWS table names.
    • Implemented _prepare_dataframe_for_write to add manifestid to the dataframe.
  • koku/masu/processor/azure/azure_report_parquet_processor.py
    • Updated to track daily status and define _date_column.
    • Implemented self_hosted_line_item_model property to return Azure Django models.
    • Implemented get_table_names_for_delete to return relevant Azure table names.
    • Implemented _prepare_dataframe_for_write to add manifestid to the dataframe.
  • koku/masu/processor/gcp/gcp_report_parquet_processor.py
    • Updated to track daily status and define _date_column.
    • Implemented self_hosted_line_item_model property to return GCP Django models.
    • Implemented get_table_names_for_delete to return relevant GCP table names.
    • Implemented _prepare_dataframe_for_write to add manifestid to the dataframe.
  • koku/masu/processor/ocp/ocp_report_parquet_processor.py
    • Defined _date_column for OCP data.
    • Refactored write_to_self_hosted_table to delegate to the base class implementation.
  • koku/masu/processor/report_parquet_processor_base.py
    • Introduced self_hosted_line_item_model property and _prepare_dataframe_for_write method for subclass implementation.
    • Added a generic write_to_self_hosted_table method that uses Django models and PostgreSQL partitioning.
    • Added get_table_names_for_delete and delete_day_postgres for manifest-ID based deletion in PostgreSQL.
  • koku/masu/test/database/test_aws_report_db_accessor.py
    • Added tests for delete_self_hosted_data_by_source functionality.
  • koku/masu/test/database/test_azure_report_db_accessor.py
    • Added tests for delete_self_hosted_data_by_source functionality.
  • koku/masu/test/database/test_gcp_report_db_accessor.py
    • Added tests for delete_self_hosted_data_by_source functionality.
  • koku/masu/test/processor/aws/test_aws_report_parquet_processor.py
    • Added tests for _is_daily flag, self_hosted_line_item_model, get_table_names_for_delete, _prepare_dataframe_for_write, and write_to_self_hosted_table.
  • koku/masu/test/processor/azure/test_azure_report_parquet_processor.py
    • Added tests for _is_daily flag, self_hosted_line_item_model, get_table_names_for_delete, _prepare_dataframe_for_write, and write_to_self_hosted_table.
  • koku/masu/test/processor/gcp/test_gcp_report_parquet_processor.py
    • Added tests for _is_daily flag, self_hosted_line_item_model, get_table_names_for_delete, _prepare_dataframe_for_write, and write_to_self_hosted_table.
  • koku/masu/test/processor/ocp/test_ocp_report_parquet_processor.py
    • Updated test for write_to_self_hosted_table to reflect the base class refactoring.
  • koku/masu/test/processor/test_report_parquet_processor_base.py
    • Added tests for base class methods: self_hosted_line_item_model, _prepare_dataframe_for_write, write_to_self_hosted_table (when no model), and get_table_names_for_delete.
  • koku/masu/util/aws/common.py
    • Added get_table_names_for_delete to retrieve table names for PostgreSQL deletion.
    • Added _delete_old_data_postgres_by_date for date-scoped deletion in PostgreSQL.
    • Added _clear_csv_only for S3 CSV file deletion in on-prem environments.
    • Updated get_or_clear_daily_s3_by_date to integrate new PostgreSQL deletion logic for on-prem deployments.
  • koku/reporting/migrations/0344_aws_line_item_models.py
    • Added migration to create AWSLineItem and AWSLineItemDaily Django models with PostgreSQL partitioning.
  • koku/reporting/migrations/0345_azure_line_item_models.py
    • Added migration to create AzureLineItem Django model with PostgreSQL partitioning.
  • koku/reporting/migrations/0346_gcp_line_item_models.py
    • Added migration to create GCPLineItem and GCPLineItemDaily Django models with PostgreSQL partitioning.
  • koku/reporting/provider/aws/models.py
    • Imported new self-hosted AWS line item models (AWSLineItem, AWSLineItemDaily).
  • koku/reporting/provider/aws/self_hosted_models.py
    • Added new Django models AWSLineItem and AWSLineItemDaily for self-hosted PostgreSQL storage, including partitioning information.
  • koku/reporting/provider/azure/models.py
    • Imported new self-hosted Azure line item model (AzureLineItem).
  • koku/reporting/provider/azure/self_hosted_models.py
    • Added new Django model AzureLineItem for self-hosted PostgreSQL storage, including partitioning information.
  • koku/reporting/provider/gcp/models.py
    • Imported new self-hosted GCP line item models (GCPLineItem, GCPLineItemDaily).
  • koku/reporting/provider/gcp/self_hosted_models.py
    • Added new Django models GCPLineItem and GCPLineItemDaily for self-hosted PostgreSQL storage, including partitioning information.
Activity
  • No human activity has been recorded for this pull request yet.
Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces significant changes to add self-hosted/on-prem support for GCP, following a similar pattern for AWS and Azure. This includes new Django models for line item data, PostgreSQL-specific SQL queries for data processing, and refactoring of existing database accessors and processors to accommodate the on-premise logic. The changes are extensive and well-structured, with new tests covering the added functionality. My review focuses on potential security vulnerabilities, data type correctness for financial data, and opportunities for code consolidation to improve maintainability.

Note: Security Review did not run due to the size of the PR.

Comment on lines +95 to +102
return f"""
    DELETE FROM "{schema_name}"."{table_name}"
    WHERE source = '{source}'
      AND year = '{year}'
      AND month = '{month}'
      AND manifestid != '{manifestid}'
      AND {DATE_COLUMN} >= DATE '{processing_date}'
"""
Contributor

high

This method constructs a raw SQL query using an f-string, which is a potential SQL injection vulnerability. Although the values might be system-generated, it is a security best practice to use parameterized queries. Please consider modifying this method to return a SQL template and a list of parameters, and then use cursor.execute(sql, params) at the call site to safely execute the query. This would provide protection against SQL injection.
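
A minimal sketch of the template-plus-parameters shape suggested above, assuming a DB-API driver that uses `%s` placeholders (as psycopg2 does). The helper names `quote_identifier` and `build_delete_by_manifest_and_date` are hypothetical and not part of this PR. Identifiers such as the schema, table, and date column cannot be bound as parameters, so the sketch validates them against a conservative pattern instead.

```python
import re

# Hypothetical helper, not the PR's code: identifiers cannot be passed as
# bound parameters, so they are validated against a strict pattern instead.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def quote_identifier(name: str) -> str:
    """Return a double-quoted SQL identifier, rejecting anything unusual."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return f'"{name}"'


def build_delete_by_manifest_and_date(
    schema_name, table_name, date_column, source, year, month, manifestid, processing_date
):
    """Return (sql, params) for a date-scoped delete using %s placeholders."""
    sql = (
        f"DELETE FROM {quote_identifier(schema_name)}.{quote_identifier(table_name)} "
        "WHERE source = %s AND year = %s AND month = %s "
        f"AND manifestid != %s AND {quote_identifier(date_column)} >= %s"
    )
    # Values travel separately from the SQL text; the driver binds them.
    params = [source, year, month, manifestid, processing_date]
    return sql, params
```

At the call site the pair would be passed as `cursor.execute(sql, params)`, so none of the values are ever interpolated into the SQL string.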

Comment on lines +18 to +21
unblended_cost FLOAT,
blended_cost FLOAT,
savingsplan_effective_cost FLOAT,
calculated_amortized_cost FLOAT,
Contributor

high

Using FLOAT for currency columns such as unblended_cost, blended_cost, savingsplan_effective_cost, and calculated_amortized_cost can lead to floating-point inaccuracies. For financial calculations, it is highly recommended to use DECIMAL or NUMERIC data types to ensure precision. This recommendation also applies to other temporary tables created in this pull request.

    unblended_cost DECIMAL,
    blended_cost DECIMAL,
    savingsplan_effective_cost DECIMAL,
    calculated_amortized_cost DECIMAL,
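
To illustrate the precision concern, here is a small standalone Python sketch (not taken from the PR) that sums ten charges of 0.10 as binary floats and as exact decimals. The float total drifts away from 1.0, while the decimal total does not.

```python
from decimal import Decimal

# Ten charges of 0.10 each, summed as binary floats and as exact decimals.
float_total = sum([0.10] * 10)
decimal_total = sum([Decimal("0.10")] * 10, Decimal("0"))

print(float_total == 1.0)                # False
print(decimal_total == Decimal("1.00"))  # True
```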

usage_pricing_unit = models.CharField(max_length=256, null=True)

# Cost columns
cost = models.FloatField(null=True)
Contributor

high

Using FloatField for currency values like cost can introduce precision issues. It is a best practice to use DecimalField for all monetary values to maintain accuracy in financial calculations. This also applies to other new models for AWS and Azure introduced in this pull request.

Suggested change
- cost = models.FloatField(null=True)
+ cost = models.DecimalField(max_digits=24, decimal_places=9, null=True)
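
As a rough sketch of what `max_digits=24, decimal_places=9` implies for incoming values, the hypothetical helper below (`to_cost` is an illustrative name, not part of the PR or of Django) rounds a cost to nine decimal places and rejects values that would need more than 24 digits in total. It only approximates the field's validation and is meant to show the bounds, not to replace them.

```python
from decimal import Decimal, ROUND_HALF_EVEN

MAX_DIGITS = 24      # from the suggested DecimalField
DECIMAL_PLACES = 9   # from the suggested DecimalField
_QUANTUM = Decimal("0.000000001")  # one unit in the ninth decimal place


def to_cost(value) -> Decimal:
    """Round a cost to 9 decimal places and check it fits in 24 digits."""
    quantized = Decimal(str(value)).quantize(_QUANTUM, rounding=ROUND_HALF_EVEN)
    digits = len(quantized.as_tuple().digits)
    if digits > MAX_DIGITS:
        raise ValueError(f"{value!r} needs {digits} digits, more than {MAX_DIGITS}")
    return quantized
```

With these settings, at most 15 digits remain for the integer part of a cost.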

Comment on lines +527 to +555
def delete_self_hosted_data_by_source(self, provider_uuid):
    """Delete data from all self-hosted tables by source UUID (for on-prem).

    This deletes data from the line item tables when a source is deleted.

    Args:
        provider_uuid: The provider UUID to delete data for
    """
    from reporting.provider.aws.self_hosted_models import get_self_hosted_models

    provider_uuid_str = str(provider_uuid)
    total_deleted = 0

    with schema_context(self.schema):
        for model in get_self_hosted_models():
            deleted_count, _ = model.objects.filter(source=provider_uuid_str).delete()

            if deleted_count:
                LOG.info(
                    log_json(
                        msg="deleted self-hosted data by source",
                        table=model._meta.db_table,
                        provider_uuid=provider_uuid_str,
                        deleted_count=deleted_count,
                    )
                )
                total_deleted += deleted_count

    return total_deleted
Contributor

medium

The method delete_self_hosted_data_by_source is nearly identical in AWSReportDBAccessor, AzureReportDBAccessor, and GCPReportDBAccessor. To improve maintainability and reduce code duplication, consider moving this logic to a shared base class, such as ReportDBAccessorBase. The provider-specific get_self_hosted_models function could be defined as an abstract method in the base class that subclasses are required to implement.
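
One possible shape for that consolidation, sketched with a hypothetical `SelfHostedCleanupMixin` class (the name is illustrative, and the real `ReportDBAccessorBase` may look quite different). The sketch leaves out the `schema_context` wrapper and the logging shown in the diff so that only the shared loop and the abstract hook remain.

```python
from abc import ABC, abstractmethod


class SelfHostedCleanupMixin(ABC):
    """Hypothetical shared base; names here are illustrative only."""

    @abstractmethod
    def get_self_hosted_models(self):
        """Return the provider's self-hosted line item models."""

    def delete_self_hosted_data_by_source(self, provider_uuid):
        """Delete rows for one source from every self-hosted model."""
        source = str(provider_uuid)
        total_deleted = 0
        for model in self.get_self_hosted_models():
            # Django's QuerySet.delete() returns (row_count, per_model_counts).
            deleted_count, _ = model.objects.filter(source=source).delete()
            total_deleted += deleted_count
        return total_deleted
```

Each provider accessor would then only implement `get_self_hosted_models()`, and instantiating a subclass that forgets to do so raises `TypeError`.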

@ydayagi ydayagi force-pushed the trino2pggcp branch 2 times, most recently from 1e5ec1c to da2a4ac Compare March 11, 2026 19:46
@@ -0,0 +1,154 @@
CREATE TABLE IF NOT EXISTS {{schema | sqlsafe}}.managed_aws_openshift_daily_temp
Contributor

What is the migration strategy for these tables in the on-premise flow? From my perspective, it doesn't appear that we have one at all.

    """Return list of table names to delete from. Override in subclass if needed."""
    return [self._table_name]

def delete_day_postgres(self, start_date, reportnumhours=None):
Contributor

Why are we calling this delete day when we delete the entire month?

# Delete from existing tables
total_deleted = 0
for table_name in existing_tables:
    delete_sql = get_report_db_accessor().get_delete_day_by_manifestid_sql(
Contributor

The name of this method is terrible considering it deletes the entire month.

def get_delete_day_by_manifestid_sql(
    self, schema_name: str, table_name: str, source: str, year: str, month: str, manifestid: str
):
    """Return the SQL to delete data where manifestid doesn't match."""
    return f"""
        DELETE FROM "{schema_name}"."{table_name}"
        WHERE source = '{source}'
          AND year = '{year}'
          AND month = '{month}'
          AND manifestid != '{manifestid}'
    """

    """Return list of table names to delete from. Override in subclass if needed."""
    return [self._table_name]

def delete_day_postgres(self, start_date, reportnumhours=None):
Contributor

You pass in start_date here but don't seem to use it anywhere

    """Return list of table names to delete from. Override in subclass if needed."""
    return [self._table_name]

def delete_day_postgres(self, start_date, reportnumhours=None):
Contributor

I highly recommend we follow the call chain for this

for csv_filename in file_list:
    # set start date based on data in the file being processed:
    if self.provider_type == Provider.PROVIDER_OCP:
        self.start_date = self.ocp_files_to_process[csv_filename.stem]["meta_reportdatestart"]

    self._delete_old_data(Path(csv_filename))
    if self.provider_type == Provider.PROVIDER_OCP and self.report_type is None:
        msg = "Unknown report type, skipping file processing"
        LOG.warning(
            log_json(
                self.tracing_id,
                msg=msg,
                context=self.error_context,
                filename=csv_filename,
            )
        )
        return

Inside of _delete_old_data:

if settings.ONPREM:
    self._delete_old_data_postgres(filename)
else:
    self._delete_old_data_trino(filename)

def _delete_old_data_postgres(self, filename):
    """remove records with data older than the data in the file being processed"""
    # Get reportnumhours for OCP (will be None for non-OCP)
    reportnumhours = None
    if self.ocp_files_to_process:
        reportnumhours = int(self.ocp_files_to_process[filename.stem]["meta_reportnumhours"])

    # Processor handles deleting from all relevant tables (raw and daily for OCP)
    processor = self._get_report_processor(daily=False)
    processor.delete_day_postgres(self.start_date, reportnumhours)

Are you deleting a whole month of data each time we process a csv?
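
The concern can be made concrete with a toy model. The sketch below uses an in-memory SQLite table (an assumption for illustration; the table name `line_items`, the `usage_start` column, and the sample values are made up and this is not the Koku schema) to count how many rows a month-scoped predicate would match compared with one that also filters on the processing date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE line_items "
    "(source TEXT, year TEXT, month TEXT, manifestid TEXT, usage_start TEXT)"
)
# Five days of data loaded earlier under manifest "1".
conn.executemany(
    "INSERT INTO line_items VALUES ('src', '2026', '03', '1', ?)",
    [(f"2026-03-0{day}",) for day in range(1, 6)],
)

# Same predicate as the month-scoped delete, expressed as a count.
month_scoped = (
    "SELECT COUNT(*) FROM line_items "
    "WHERE source = ? AND year = ? AND month = ? AND manifestid != ?"
)
# Date-scoped variant: also require the row to fall on or after a given day.
date_scoped = month_scoped + " AND usage_start >= ?"

# A new file under manifest "2" that only covers 2026-03-04 onward.
args = ("src", "2026", "03", "2")
rows_hit_by_month = conn.execute(month_scoped, args).fetchone()[0]
rows_hit_by_date = conn.execute(date_scoped, args + ("2026-03-04",)).fetchone()[0]
print(rows_hit_by_month, rows_hit_by_date)  # 5 2
```

In this toy case the month-scoped predicate matches every earlier row for the month, while the date-scoped one matches only the rows the new file overlaps.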

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ydayagi ydayagi enabled auto-merge (squash) May 11, 2026 06:03
@ydayagi ydayagi force-pushed the trino2pggcp branch 2 times, most recently from 66a2c9c to 4c0bf43 Compare May 11, 2026 06:07
Add self-hosted PostgreSQL support for Azure provider, following the
same pattern as AWS.

Changes:
- Add Django model for Azure line items (azure_line_items)
- Add migration for partitioned Azure line item table
- Add self_hosted_sql/azure/ directory with PostgreSQL-converted SQL files
- Update Azure processor with _date_column, self_hosted_line_item_model
- Update Azure db accessor to use get_sql_folder_name()
- Add delete_self_hosted_data_by_source() for cleanup

Jira: https://issues.redhat.com/browse/FLPATH-3323

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Yaron Dayagi <ydayagi@redhat.com>
Add Django ORM models and PostgreSQL support for GCP line item data
storage in on-prem deployments without Trino/Hive.

Changes:
- Add GCPLineItem and GCPLineItemDaily Django models with partitioning
- Add migration 0346 for GCP line item tables
- Update GCP processor with self_hosted_line_item_model property
- Update GCP db accessor to use get_sql_folder_name()
- Add delete_self_hosted_data_by_source() for cleanup
- Copy PostgreSQL SQL files to self_hosted_sql/gcp/

https://issues.redhat.com/browse/FLPATH-3323

Signed-off-by: Yoni Dayagi <ydayagi@redhat.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@koku-ci-triager-bot
Collaborator

🤖 CI Triager — Diagnosis

Check: Red Hat Konflux / koku-ci / koku
PipelineRun: koku-ci-6k9xx
Root cause: The deploy-application task timed out waiting for the ephemeral Clowder environment to become ready. Multiple dependent services (sources-api, rbac, puptoo) failed to start, and the Clowder environment was locked. This is a transient infrastructure issue unrelated to this PR's code changes.
Evidence:

Warning  ClowdEnvLocked   clowdapp/koku     Clowder Environment [env-ephemeral-zdz28l] is locked
Warning  ClowdAppNotReady clowdapp/koku     ClowdApp [koku] is not ready
Warning  BackOff          pod/sources-api-svc-5d45d8d498-vpfc6  Back-off restarting failed container

ERROR: deploy failed: timed out waiting for ClowdApp-owned resources

Action: Re-trigger the koku-ci check. The ephemeral environment infrastructure was unhealthy at the time this run executed.

Generated automatically. Review before applying.

@koku-ci-triager-bot
Collaborator

🤖 CI Triager — Warning

Check: Migration convention
Root cause: This PR adds 3 migration files, but the Koku convention requires at most 1 migration per PR. Multiple migrations should be squashed into a single file before merging.
Evidence:

koku/reporting/migrations/0351_awslineitem_awslineitemdaily_and_more.py
koku/reporting/migrations/0352_azurelineitem_managedazureopenshiftdaily_and_more.py
koku/reporting/migrations/0353_gcplineitem_gcplineitemdaily_and_more.py

Action: Squash the migrations into a single file:

python koku/manage.py squashmigrations <app_label> <first_migration> <last_migration>

Replace the three migration files with the generated squashed migration and update the dependencies accordingly.

Generated automatically. Review before applying.

Labels

  • flightpath-pr: Issues being worked on by the flight path team
  • gcp-smoke-tests: pr_check will run gcp + ocp on gcp smoke tests, used when changes affect GCP only.
  • on-hold
  • on-prem-processing: pr_check will deploy and run the on-prem data pipeline processing flow.
  • smokes-required: Label to show that smokes tests should be run against these changes.

3 participants