[ BOUNTY] Add deterministic seed support to data_generator (#4) #8

Open
xcapselx wants to merge 2 commits into thanhle74:main from xcapselx:feat/seed-support-thanhle74

@xcapselx

Summary

Add deterministic seed support to tools/data_generator.py so generated data can be reproduced exactly from a seed. The --seed flag already existed, but the helper functions drew from the global random module instead of the seeded instance, which broke determinism.
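
To illustrate the fix described above, here is a minimal sketch of helpers that accept an optional rng. The function names and the self.random attribute come from this PR; the function bodies, the output formats, and the generate_contact method are assumptions made for illustration and are not the actual code in tools/data_generator.py.

```python
import random
import string


def random_phone(rng=None):
    """Return a phone-like string (hypothetical format: 0 plus nine digits)."""
    if rng is None:
        rng = random  # fall back to the global module when no rng is passed
    return "0" + "".join(rng.choice(string.digits) for _ in range(9))


def random_email(rng=None):
    """Return an email-like string (hypothetical format)."""
    if rng is None:
        rng = random
    user = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    return f"{user}@example.com"


class DataGenerator:
    def __init__(self, seed=None):
        # One seeded instance owns all randomness for this generator.
        self.random = random.Random(seed)

    def generate_contact(self):
        # Hypothetical method: every helper call receives self.random,
        # so nothing depends on the state of the global random module.
        return {
            "phone": random_phone(self.random),
            "email": random_email(self.random),
        }
```

With this shape, two generators built from the same seed return the same values regardless of what other code does to the global random state.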

Changes

  • tools/data_generator.py:

    • Fixed random_phone(), random_email(), and random_datetime() to accept an optional rng parameter
    • Updated all DataGenerator method call sites to pass self.random to helper functions
    • Changed --seed default from 42 to None (random seed when not specified)
    • Added --print-seed flag to print the seed used so a random run can be reproduced
    • Auto-enabled --print-seed when no --seed is supplied
    • Wrote _metadata.json with the seed and parameters to the output directory
    • Printed a reproduction command when --print-seed is active
  • tests/test_data_generator_seed.py (new file):

    • test_same_seed_produces_identical_output — same seed = byte-for-byte identical
    • test_different_seeds_produce_different_output — different seeds = different output
    • test_deterministic_across_three_seeds — verifies 3 seeds (42, 123, 999) produce identical hashes across runs
    • test_print_seed_flag_exists — verifies --print-seed flag is recognized
    • test_seed_none_generates_random_seed — verifies default seed is None
  • data/README.md:

    • Added "Test Data Generation" section with usage examples for --seed and --print-seed
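
The CLI behaviour listed above could look roughly like the following sketch. The --seed and --print-seed flags and the _metadata.json file name come from this PR; the --output flag, the JSON keys, the 32-bit seed range, and the exact wording of the printed lines are assumptions for illustration only.

```python
import argparse
import json
import random
from pathlib import Path


def build_parser():
    parser = argparse.ArgumentParser(description="Seed handling sketch")
    parser.add_argument("--seed", type=int, default=None,
                        help="seed for reproducible output (random if omitted)")
    parser.add_argument("--print-seed", action="store_true",
                        help="print the seed that was used")
    # Hypothetical flag: the PR does not name the output-directory option.
    parser.add_argument("--output", default="out")
    return parser


def resolve_seed(args):
    """Pick a random seed when none was given and force --print-seed on."""
    if args.seed is None:
        args.seed = random.SystemRandom().randrange(2**32)  # assumed range
        args.print_seed = True
    return args.seed


def write_metadata(output_dir, seed, params):
    """Write _metadata.json so the run can be reproduced later."""
    path = Path(output_dir) / "_metadata.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"seed": seed, "parameters": params}  # assumed key names
    path.write_text(json.dumps(payload, indent=2, sort_keys=True))
    return path


def main(argv=None):
    args = build_parser().parse_args(argv)
    seed = resolve_seed(args)
    write_metadata(args.output, seed, {"output": args.output})
    if args.print_seed:
        print(f"Seed: {seed}")
        print("Reproduce with: python tools/data_generator.py "
              f"--seed {seed} --output {args.output}")
    return seed
```

Calling main(["--seed", "42"]) would write the metadata file quietly, while main([]) would choose a seed, print it, and print the command to reproduce the run.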

Testing

python -m unittest tests.test_data_generator_seed -v

Result:

test_deterministic_across_three_seeds ... ok
test_different_seeds_produce_different_output ... ok
test_print_seed_flag_exists ... ok
test_same_seed_produces_identical_output ... ok
test_seed_none_generates_random_seed ... ok

Ran 5 tests in 0.021s
OK
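
For readers who want a feel for what the determinism tests check, here is a small self-contained sketch in the same spirit. The test names and the seeds 42, 123, and 999 come from this PR; generate_rows is a hypothetical stand-in for the real generator, and the use of SHA-256 is an assumption, since the PR does not say which hash the tests use.

```python
import hashlib
import random
import unittest


def generate_rows(seed, count=50):
    """Hypothetical stand-in for the generator: seeded CSV-like bytes."""
    rng = random.Random(seed)
    lines = [f"{i},{rng.randint(0, 10**6)}" for i in range(count)]
    return "\n".join(lines).encode("utf-8")


def digest(seed):
    """Hash the generated bytes so runs can be compared cheaply."""
    return hashlib.sha256(generate_rows(seed)).hexdigest()


class SeedDeterminismSketch(unittest.TestCase):
    def test_same_seed_produces_identical_output(self):
        # Byte-for-byte comparison of two runs with the same seed.
        self.assertEqual(generate_rows(42), generate_rows(42))

    def test_different_seeds_produce_different_output(self):
        self.assertNotEqual(digest(42), digest(123))

    def test_deterministic_across_three_seeds(self):
        for seed in (42, 123, 999):
            with self.subTest(seed=seed):
                self.assertEqual(digest(seed), digest(seed))
```

Comparing hashes rather than full outputs keeps failure messages short when the generated data is large.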

Build diagnostic: diagnostic/build-549809c9.json (encryptly .logd unavailable on Windows; JSON metadata included).

Checklist

  • Relevant modules affected by these changes build locally
  • Tests pass locally
  • Diagnostic build log is committed in this PR
  • Documentation has been updated (data/README.md)
  • Changes are scoped to the PR purpose and avoid unrelated cleanup
  • Security, privacy, and error-handling implications have been considered

Closes #4

LE-VAI added 2 commits June 18, 2026 19:15
- Fix helper functions (random_phone, random_email, random_datetime) to accept rng parameter
- Update all DataGenerator call sites to pass self.random for full determinism
- Change --seed default from 42 to None (random seed when not specified)
- Add --print-seed flag to print seed for reproducibility
- Auto-enable --print-seed when no seed is supplied
- Write _metadata.json with seed and parameters in output directory
- Add tests proving deterministic output for seeds 42, 123, 999
- Update data/README.md with seed usage examples
@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Warning

Review limit reached

@xcapselx, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 57 minutes and 5 seconds. Learn how PR review limits work.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 94dd05cf-f5d7-4fce-bb5f-ce0581268d99

📥 Commits

Reviewing files that changed from the base of the PR and between 94e0fb0 and 7da0e03.

📒 Files selected for processing (4)
  • data/README.md
  • diagnostic/build-549809c9.json
  • tests/test_data_generator_seed.py
  • tools/data_generator.py
Development

Successfully merging this pull request may close these issues.

[$25 BOUNTY] [Python] feat: Add deterministic seed support to data_generator.py

2 participants