[FLPATH-3327] Add GCP self-hosted/on-prem support#5943
Conversation
de827f4 to
5aae3eb
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #5943 +/- ##
=======================================
+ Coverage 94.4% 94.4% +0.1%
=======================================
Files 362 368 +6
Lines 31988 32827 +839
Branches 3513 3532 +19
=======================================
+ Hits 30185 30998 +813
- Misses 1168 1194 +26
Partials 635 635 🚀 New features to boost your workflow:
|
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly expands the platform's capabilities by enabling the storage and processing of Google Cloud Platform (GCP) cost and usage data within self-hosted, on-premise environments. It achieves this by integrating new Django ORM models with PostgreSQL table partitioning, providing a robust and scalable solution for managing large datasets without relying on external data lakes like Trino or Hive. The changes streamline data ingestion, summarization, and cleanup processes for cloud providers in an on-prem context. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces significant changes to add self-hosted/on-prem support for GCP, following a similar pattern for AWS and Azure. This includes new Django models for line item data, PostgreSQL-specific SQL queries for data processing, and refactoring of existing database accessors and processors to accommodate the on-premise logic. The changes are extensive and well-structured, with new tests covering the added functionality. My review focuses on potential security vulnerabilities, data type correctness for financial data, and opportunities for code consolidation to improve maintainability.
Note: Security Review did not run due to the size of the PR.
| return f""" | ||
| DELETE FROM "{schema_name}"."{table_name}" | ||
| WHERE source = '{source}' | ||
| AND year = '{year}' | ||
| AND month = '{month}' | ||
| AND manifestid != '{manifestid}' | ||
| AND {DATE_COLUMN} >= DATE '{processing_date}' | ||
| """ |
There was a problem hiding this comment.
This method constructs a raw SQL query using an f-string, which is a potential SQL injection vulnerability. Although the values might be system-generated, it is a security best practice to use parameterized queries. Please consider modifying this method to return a SQL template and a list of parameters, and then use cursor.execute(sql, params) at the call site to safely execute the query. This would provide protection against SQL injection.
| unblended_cost FLOAT, | ||
| blended_cost FLOAT, | ||
| savingsplan_effective_cost FLOAT, | ||
| calculated_amortized_cost FLOAT, |
There was a problem hiding this comment.
Using FLOAT for currency columns such as unblended_cost, blended_cost, savingsplan_effective_cost, and calculated_amortized_cost can lead to floating-point inaccuracies. For financial calculations, it is highly recommended to use DECIMAL or NUMERIC data types to ensure precision. This recommendation also applies to other temporary tables created in this pull request.
unblended_cost DECIMAL,
blended_cost DECIMAL,
savingsplan_effective_cost DECIMAL,
calculated_amortized_cost DECIMAL,| usage_pricing_unit = models.CharField(max_length=256, null=True) | ||
|
|
||
| # Cost columns | ||
| cost = models.FloatField(null=True) |
There was a problem hiding this comment.
Using FloatField for currency values like cost can introduce precision issues. It is a best practice to use DecimalField for all monetary values to maintain accuracy in financial calculations. This also applies to other new models for AWS and Azure introduced in this pull request.
| cost = models.FloatField(null=True) | |
| cost = models.DecimalField(max_digits=24, decimal_places=9, null=True) |
| def delete_self_hosted_data_by_source(self, provider_uuid): | ||
| """Delete data from all self-hosted tables by source UUID (for on-prem). | ||
|
|
||
| This deletes data from the line item tables when a source is deleted. | ||
|
|
||
| Args: | ||
| provider_uuid: The provider UUID to delete data for | ||
| """ | ||
| from reporting.provider.aws.self_hosted_models import get_self_hosted_models | ||
|
|
||
| provider_uuid_str = str(provider_uuid) | ||
| total_deleted = 0 | ||
|
|
||
| with schema_context(self.schema): | ||
| for model in get_self_hosted_models(): | ||
| deleted_count, _ = model.objects.filter(source=provider_uuid_str).delete() | ||
|
|
||
| if deleted_count: | ||
| LOG.info( | ||
| log_json( | ||
| msg="deleted self-hosted data by source", | ||
| table=model._meta.db_table, | ||
| provider_uuid=provider_uuid_str, | ||
| deleted_count=deleted_count, | ||
| ) | ||
| ) | ||
| total_deleted += deleted_count | ||
|
|
||
| return total_deleted |
There was a problem hiding this comment.
The method delete_self_hosted_data_by_source is nearly identical in AWSReportDBAccessor, AzureReportDBAccessor, and GCPReportDBAccessor. To improve maintainability and reduce code duplication, consider moving this logic to a shared base class, such as ReportDBAccessorBase. The provider-specific get_self_hosted_models function could be defined as an abstract method in the base class that subclasses are required to implement.
1e5ec1c to
da2a4ac
Compare
| @@ -0,0 +1,154 @@ | |||
| CREATE TABLE IF NOT EXISTS {{schema | sqlsafe}}.managed_aws_openshift_daily_temp | |||
There was a problem hiding this comment.
What is the migration strategy for these tables in the on premise flow? Cause it doesn't appear like we have one at all from my perspective.
| """Return list of table names to delete from. Override in subclass if needed.""" | ||
| return [self._table_name] | ||
|
|
||
| def delete_day_postgres(self, start_date, reportnumhours=None): |
There was a problem hiding this comment.
Why are we calling this delete day when we delete the entire month?
| # Delete from existing tables | ||
| total_deleted = 0 | ||
| for table_name in existing_tables: | ||
| delete_sql = get_report_db_accessor().get_delete_day_by_manifestid_sql( |
There was a problem hiding this comment.
The name of this method is terrible considering it deletes the entire month.
def get_delete_day_by_manifestid_sql(
self, schema_name: str, table_name: str, source: str, year: str, month: str, manifestid: str
):
"""Return the SQL to delete data where manifestid doesn't match."""
return f"""
DELETE FROM "{schema_name}"."{table_name}"
WHERE source = '{source}'
AND year = '{year}'
AND month = '{month}'
AND manifestid != '{manifestid}'
"""
| """Return list of table names to delete from. Override in subclass if needed.""" | ||
| return [self._table_name] | ||
|
|
||
| def delete_day_postgres(self, start_date, reportnumhours=None): |
There was a problem hiding this comment.
You pass in start_date here but don't seem to use it anywhere
| """Return list of table names to delete from. Override in subclass if needed.""" | ||
| return [self._table_name] | ||
|
|
||
| def delete_day_postgres(self, start_date, reportnumhours=None): |
There was a problem hiding this comment.
I highly recommend we follow the call chain for this
for csv_filename in file_list:
# set start date based on data in the file being processed:
if self.provider_type == Provider.PROVIDER_OCP:
self.start_date = self.ocp_files_to_process[csv_filename.stem]["meta_reportdatestart"]
self._delete_old_data(Path(csv_filename))
if self.provider_type == Provider.PROVIDER_OCP and self.report_type is None:
msg = "Unknown report type, skipping file processing"
LOG.warning(
log_json(
self.tracing_id,
msg=msg,
context=self.error_context,
filename=csv_filename,
)
)
return
Inside of _delete_old_data:
if settings.ONPREM:
self._delete_old_data_postgres(filename)
else:
self._delete_old_data_trino(filename)
def _delete_old_data_postgres(self, filename):
"""remove records with data older than the data in the file being processed"""
# Get reportnumhours for OCP (will be None for non-OCP)
reportnumhours = None
if self.ocp_files_to_process:
reportnumhours = int(self.ocp_files_to_process[filename.stem]["meta_reportnumhours"])
# Processor handles deleting from all relevant tables (raw and daily for OCP)
processor = self._get_report_processor(daily=False)
processor.delete_day_postgres(self.start_date, reportnumhours)
Are you deleting a whole month of data each time we process a csv?
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
66a2c9c to
4c0bf43
Compare
Add self-hosted PostgreSQL support for Azure provider, following the same pattern as AWS. Changes: - Add Django model for Azure line items (azure_line_items) - Add migration for partitioned Azure line item table - Add self_hosted_sql/azure/ directory with PostgreSQL-converted SQL files - Update Azure processor with _date_column, self_hosted_line_item_model - Update Azure db accessor to use get_sql_folder_name() - Add delete_self_hosted_data_by_source() for cleanup Jira: https://issues.redhat.com/browse/FLPATH-3323 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: Yaron Dayagi <ydayagi@redhat.com>
Add Django ORM models and PostgreSQL support for GCP line item data storage in on-prem deployments without Trino/Hive. Changes: - Add GCPLineItem and GCPLineItemDaily Django models with partitioning - Add migration 0346 for GCP line item tables - Update GCP processor with self_hosted_line_item_model property - Update GCP db accessor to use get_sql_folder_name() - Add delete_self_hosted_data_by_source() for cleanup - Copy PostgreSQL SQL files to self_hosted_sql/gcp/ https://issues.redhat.com/browse/FLPATH-3323 Signed-off-by: Yoni Dayagi <ydayagi@redhat.com> Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
🤖 CI Triager — Diagnosis Check: Red Hat Konflux / koku-ci / koku Action: Re-trigger the Generated automatically. Review before applying. |
|
🤖 CI Triager — Warning Check: Migration convention Action: Squash the migrations into a single file: Replace the three migration files with the generated squashed migration and update the dependencies accordingly. Generated automatically. Review before applying. |
Add Django ORM models and PostgreSQL support for GCP line item data
storage in on-prem deployments without Trino/Hive.
Changes:
https://issues.redhat.com/browse/FLPATH-3323