Skip to content

Add analyze_tables#257

Merged
wjxiz1992 merged 4 commits intoNVIDIA:devfrom
GaryShen2008:add-analyze-tables
Apr 30, 2026
Merged

Add analyze_tables#257
wjxiz1992 merged 4 commits intoNVIDIA:devfrom
GaryShen2008:add-analyze-tables

Conversation

@GaryShen2008
Copy link
Copy Markdown
Collaborator

Fix #256

Run COMPUTE STATISTICS before query running
It only runs on raw dataset, not including Iceberg and Delta Lake.

Run COMPUTE STATISTICS before query running
It only runs on raw dataset, not including Iceberg and Delta Lake.

Signed-off-by: Gary Shen <gashen@nvidia.com>
Copilot AI review requested due to automatic review settings April 30, 2026 09:30
@GaryShen2008 GaryShen2008 marked this pull request as draft April 30, 2026 09:31
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR adds an --analyze_tables flag to both nds_power.py and nds_h_power.py. When set, instead of creating TempViews the scripts register permanent catalog tables via CREATE TABLE IF NOT EXISTS … USING … LOCATION and then run ANALYZE TABLE … COMPUTE STATISTICS on each one before executing benchmark queries; the feature is intentionally skipped for Iceberg and Delta Lake formats.

Confidence Score: 5/5

Safe to merge — all findings are P2 style issues with no functional impact.

Only P2 comments (inconsistent identifier quoting and a stale docstring). The core logic — catalog table creation, ANALYZE execution, and CLI wiring — is correct. No P0 or P1 issues found.

No files require special attention.

Important Files Changed

Filename Overview
nds/nds_power.py Adds analyze_tables flag that creates real catalog tables and runs ANALYZE TABLE COMPUTE STATISTICS before benchmark queries; minor inconsistency in backtick quoting for ANALYZE SQL and a stale docstring.
nds-h/nds_h_power.py Mirrors nds_power.py changes: adds analyze_tables support with the same helper functions and the same minor issues (raw backtick formatting in ANALYZE SQL, stale docstring).

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[run_query_stream called] --> B{input_format\niceberg or delta?}
    B -- Yes, iceberg --> C[register_iceberg_tables]
    B -- Yes, delta --> D[register_delta_tables]
    B -- No --> E{analyze_tables flag?}
    E -- False --> F[reader.load.createOrReplaceTempView\nfor each table]
    E -- True --> G[_create_data_source_table\nCREATE TABLE IF NOT EXISTS USING LOCATION]
    G --> H[ANALYZE TABLE COMPUTE STATISTICS]
    F --> I[Run benchmark queries]
    H --> I
    C --> I
    D --> I
Loading

Reviews (2): Last reviewed commit: "Fix typo" | Re-trigger Greptile

Comment thread nds-h/nds_h_power.py
Comment on lines 419 to +422
choices=['parquet', 'orc', 'avro', 'csv', 'json', 'iceberg', 'delta'],
default='parquet')
parser.add_argument('--analyze_tables',
action='store_true',
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Redundant default=False with store_true

argparse's store_true action already sets default=False implicitly; the explicit default=False is redundant. Same pattern appears in nds/nds_power.py. No functional impact, but it adds noise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an optional “analyze tables” step to the NDS and NDS-H power-run drivers so users can precompute Spark table statistics before executing benchmark queries (requested in issue #256).

Changes:

  • Add --analyze_tables CLI flag to enable ANALYZE TABLE … COMPUTE STATISTICS after creating each TempView.
  • Plumb the analyze_tables option through run_query_stream()setup_tables() in both NDS and NDS-H power-run scripts.
  • Record per-table ANALYZE timing in execution_time_list alongside TempView creation timings.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
nds/nds_power.py Adds --analyze_tables flag and runs ANALYZE TABLE … COMPUTE STATISTICS after each TempView creation when enabled.
nds-h/nds_h_power.py Mirrors the same --analyze_tables behavior for the NDS-H power run script.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread nds/nds_power.py Outdated
Comment thread nds-h/nds_h_power.py Outdated
Signed-off-by: Gary Shen <gashen@nvidia.com>
Remove default False since it's false by default
Correct some typo

Signed-off-by: Gary Shen <gashen@nvidia.com>
Signed-off-by: Gary Shen <gashen@nvidia.com>
@GaryShen2008 GaryShen2008 marked this pull request as ready for review April 30, 2026 09:56
@GaryShen2008 GaryShen2008 requested a review from wjxiz1992 April 30, 2026 09:56
Copy link
Copy Markdown
Collaborator

@wjxiz1992 wjxiz1992 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wjxiz1992 wjxiz1992 merged commit e257ead into NVIDIA:dev Apr 30, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA] Support ANALYZE TABLE before running queries

4 participants