Add analyze_tables #257
Conversation
Run COMPUTE STATISTICS before running queries. It only runs on the raw dataset, not on Iceberg or Delta Lake. Signed-off-by: Gary Shen <gashen@nvidia.com>
Greptile Summary

This PR adds an optional analyze-tables step that computes table statistics before the benchmark queries run. Confidence Score: 5/5. Safe to merge — the only findings are P2 style comments with no functional impact (inconsistent identifier quoting and a stale docstring). The core logic — catalog table creation, ANALYZE execution, and CLI wiring — is correct. No P0 or P1 issues found. No files require special attention.

Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[run_query_stream called] --> B{input_format\niceberg or delta?}
    B -- Yes, iceberg --> C[register_iceberg_tables]
    B -- Yes, delta --> D[register_delta_tables]
    B -- No --> E{analyze_tables flag?}
    E -- False --> F[reader.load.createOrReplaceTempView\nfor each table]
    E -- True --> G[_create_data_source_table\nCREATE TABLE IF NOT EXISTS USING LOCATION]
    G --> H[ANALYZE TABLE COMPUTE STATISTICS]
    F --> I[Run benchmark queries]
    H --> I
    C --> I
    D --> I
```
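The branch in the flowchart above can be sketched in plain Python. This is not the PR's actual code: `spark` here is any object exposing `.sql()` and `.read`, and the function shape, table locations, and return value are illustrative assumptions based on the flow described.

```python
import time

def setup_tables(spark, input_prefix, input_format, table_names,
                 analyze_tables=False):
    """Register each benchmark table and optionally ANALYZE it.

    Returns a list of (table, seconds) entries, mirroring the per-table
    timings the PR records in execution_time_list. Sketch only.
    """
    execution_time_list = []
    for table in table_names:
        start = time.time()
        location = f"{input_prefix}/{table}"
        if analyze_tables:
            # Create a catalog table over the raw files so ANALYZE can
            # attach statistics to it, then compute the stats.
            spark.sql(f"CREATE TABLE IF NOT EXISTS {table} "
                      f"USING {input_format} LOCATION '{location}'")
            spark.sql(f"ANALYZE TABLE {table} COMPUTE STATISTICS")
        else:
            # Plain TempView path: no catalog entry, no statistics.
            df = spark.read.format(input_format).load(location)
            df.createOrReplaceTempView(table)
        execution_time_list.append((table, time.time() - start))
    return execution_time_list
```

Note that, as the commit message says, the ANALYZE path applies only to raw formats; Iceberg and Delta Lake tables take the separate registration branches.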
Reviews (2): Last reviewed commit: "Fix typo"
```python
choices=['parquet', 'orc', 'avro', 'csv', 'json', 'iceberg', 'delta'],
default='parquet')
parser.add_argument('--analyze_tables',
                    action='store_true',
```
Redundant `default=False` with `store_true`

argparse's `store_true` action already sets `default=False` implicitly; the explicit `default=False` is redundant. The same pattern appears in nds/nds_power.py. No functional impact, but it adds noise.
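The reviewer's point is easy to verify with a minimal standalone parser (the flag name matches the PR; everything else here is just for illustration):

```python
import argparse

parser = argparse.ArgumentParser()
# store_true already implies default=False, so adding an explicit
# default=False to this call would be redundant:
parser.add_argument('--analyze_tables', action='store_true')

print(parser.parse_args([]).analyze_tables)                    # False
print(parser.parse_args(['--analyze_tables']).analyze_tables)  # True
```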
Pull request overview
Adds an optional “analyze tables” step to the NDS and NDS-H power-run drivers so users can precompute Spark table statistics before executing benchmark queries (requested in issue #256).
Changes:

- Add `--analyze_tables` CLI flag to enable `ANALYZE TABLE … COMPUTE STATISTICS` after creating each TempView.
- Plumb the `analyze_tables` option through `run_query_stream()` → `setup_tables()` in both NDS and NDS-H power-run scripts.
- Record per-table ANALYZE timing in `execution_time_list` alongside TempView creation timings.
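The third bullet's timing bookkeeping can be sketched as follows. The `execution_time_list` name and the per-table (label, seconds) shape come from the PR description; the `timed()` helper itself is hypothetical, not the PR's code.

```python
import time

execution_time_list = []  # (label, seconds) entries, as in the PR

def timed(label, fn, *args, **kwargs):
    """Run fn and append (label, elapsed_seconds) to execution_time_list."""
    start = time.time()
    result = fn(*args, **kwargs)
    execution_time_list.append((label, time.time() - start))
    return result

# Usage: both the TempView creation and the ANALYZE step get an entry,
# so each table contributes two timings when --analyze_tables is set.
timed("store_sales TempView", lambda: None)   # stand-in for view creation
timed("store_sales ANALYZE", lambda: None)    # stand-in for ANALYZE TABLE
```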
Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| `nds/nds_power.py` | Adds `--analyze_tables` flag and runs `ANALYZE TABLE … COMPUTE STATISTICS` after each TempView creation when enabled. |
| `nds-h/nds_h_power.py` | Mirrors the same `--analyze_tables` behavior for the NDS-H power-run script. |
Signed-off-by: Gary Shen <gashen@nvidia.com>
Remove `default=False` since it's false by default. Correct some typos. Signed-off-by: Gary Shen <gashen@nvidia.com>
Fix #256
Run COMPUTE STATISTICS before running queries.
It only runs on the raw dataset, not on Iceberg or Delta Lake.