docs: Documentation and samples updates by nj1973 · Pull Request #1677 · GoogleCloudPlatform/professional-services-data-validator

nj1973 · 2026-02-20T11:37:14Z

Description of changes

Just fixing a missing documentation entry and a few sample file issues.

Issues to be closed

Checklist

I have followed the CONTRIBUTING Guide.
I have commented my code, particularly in hard-to-understand areas
I have updated any relevant documentation to reflect my changes, if applicable
I have added unit and/or integration tests relevant to my change as needed
I have already checked locally that all unit tests and linting are passing (use the tests/local_check.sh script)
I have manually executed end-to-end testing (E2E) with the affected databases/engines

nj1973 · 2026-02-20T16:23:07Z

/gcbrun

nehanene15 · 2026-02-23T21:01:54Z

        [
            "client_project_id",
-            "(Optional) BigQuery job/billing project (can differ from data project)",
+            "(Optional) BigQuery compute/billing project (can differ from data project)",


We should call this the 'quota project id'. The compute/slots will still be billed to the project specified by --project-id. Could reference this in the docs - https://docs.cloud.google.com/docs/quotas/quota-project

But this project_id is also used to define the client here:

professional-services-data-validator/data_validation/clients.py

Line 122 in 48f9ad0

project=effective_project,

Which I believe ensures compute is assigned to that project too.

This allows users to have their billing/compute in a different project to the one that houses the data.

Or did I misunderstand how this works?

After reviewing with Gemini - setting "client_project_id" effectively sets the billing (quota and slots) to "client_project_id". The only place we use this is get_bigquery_client. Typical behavior is that the billing - quota & slots is paid by the caller (client). When we provide this option, we are telling the user that they can use another project for billing. Therefore, I would prefer that we call this option billing_project_id and potentially accept client_project_id with a warning that customer should use billing_project_id and the option client_project_id may be deprecated in the future.

What do both of you think ?

Client in client_project_id was picked because it felt like the most accurate term, it is the project used when creating the client used to submit jobs. The customer who asked for this extra control was not just concerned about getting slot usage in the correct project, they didn't even have BigQuery job submit privileges in the "data" project.

The option is also used to define quota because that appeared to make sense. I suppose it is possible for someone to want that extra control but I imagine it is not common.

It feels like this can be resolved by setting the option description to something unambiguous without changing the option name.

Maybe this is slightly more accurate:

"(Optional) Overrides the default project id when creating BigQuery clients. Provides extra control on which project BigQuery jobs are submitted to (including quota).",

sundar-mudupalli-work · 2026-03-05T00:00:13Z


 CONNECTION_SOURCE_FIELDS = {
    consts.SOURCE_TYPE_BIGQUERY: [
        ["project_id", "GCP Project to use for BigQuery"],


I recommend changing this explanation to "GCP project holding the BigQuery dataset"

or BigQuery table - since we deal with tables mostly or dataset/table to be inclusive of how find-tables works.

This a used for everything if you don't override the compute/billing side of things with the optional parameter. If we state it is for the dataset/table then it would be misleading for cases where only a single project is in play.

How about:

Default project for BigQuery data and client (client project can be overridden with --client-project-id)

Maybe I am not understanding this - the "client project" (call it A) in BigQuery billing terms is the project to which the service account making the submit job request belongs. By default A pays for query while "project_id" pays for storage. So "project_id" is not default for billing. Cross project service accounts are not common, so the service account often belong to project_id.

Yep, I agree with that description.

My choice of "default" in the wording is a DVT default. If the user does not provide "A" (which is an optional argument) then "project_id" will be used by DVT for both the data project and the one used when creating the client. This is the behaviour we've always had, one "project_id" to rule them all.

We recently added "A" to support customers who have the split project model. If the user does provide the optional "A" then "A" is used for the client and the original "project_id" option continues to be used for data project.

Hi,

Here is what I set up

data-project - has the dataset and the table

query-project - project that holds the service account that is querying bigQuery

billing-project - project that is being billed for the query.

One service account with privileges to run jobs in the billing project and query project and view the data in the data-project.

Two connections to bq specifying the data project, each with a different client project id that is query-project and billing project respectively.
This was done as follows:

gcloud iam service-accounts create bq-query-sa --display-name "BigQuery Query Service Account" --description "Service Account to test BigQuery billing" --project=query-project gcloud projects add-iam-policy-binding query-project --member="serviceAccount:bq-query-sa@query-project.iam.gserviceaccount.com" --role="roles/bigquery.jobUser gcloud projects add-iam-policy-binding billing-project --member="serviceAccount:bq-query-sa@query-project.iam.gserviceaccount.com" --role="roles/bigquery.jobUser" gcloud projects add-iam-policy-binding data-project --member="serviceAccount:bq-query-sa@query-project.iam.gserviceaccount.com" --role="roles/bigquery.dataViewer" gcloud iam service-accounts add-iam-policy-binding "bq-query-sa@query-project.iam.gserviceaccount.com" --member="user:my_ldap@google.com" --role="roles/iam.serviceAccountTokenCreator" gcloud auth application-default login --impersonate-service-account=bq-query-sa@query-project.iam.gserviceaccount.com data-validation connections add --connection-name bq1 BigQuery --project-id data-project --client-project-id query-project data-validation connections add --connection-name bq2 BigQuery --project-id data-project --client-project-id billing-project

The following two bq commands were run, both successful in billing the query project and billing project respectively.

bq query --use_legacy_sql=false 'SELECT * from `pso-kokoro-resources.pso_data_validator.dvt_core_types`' bq query --use_legacy_sql=false --project_id=billing-project 'SELECT * from `pso-kokoro-resources.pso_data_validator.dvt_core_types`'

The following 2 data validation commands were run, both throwing an error saying permission denied:

data-validation query -c bq1 -q 'select * from pso-kokoro-resources.pso_data_validator.dvt_core_types' data-validation query -c bq2 -q 'select * from pso-kokoro-resources.pso_data_validator.dvt_core_types'

However when lines 121-127 in clients.py were modified as follows (commenting out other parameters):

return bigquery.Client( project=effective_project, # client_info=info, # credentials=credentials, # default_query_job_config=job_config, # client_options=options, )

Both of the above commands ran correctly.

Here are changes we could make for clarity:

Fix the issue with clients.py - why is it complaining about permissions - something strange with the other arguments.

Allow the service account to determine the default billing project - rather than choosing the data project. This may require us to write out bigquery table names as data-project.dataset.table when converting queries to SQL. This could be a breaking change we reserve for the next release.

Clarify that the project_id refers to the location of the data and the optional billing_project can be provided to bill to a specific project. If not provided, then the project that is billed is the project where the service account is created.

Sundar Mudupalli

sundar-mudupalli-work · 2026-03-31T20:05:17Z

Hi,

I have raised Issue #1710 and provided a PR #1715. This PR should be updated after that PR is closed.

Sundar Mudupalli

nj1973 · 2026-04-01T12:07:05Z

@sundar-mudupalli-work , as you are planning to deprecate the option I'll scale this PR back to only fixing the sample files.

nj1973 · 2026-04-01T12:12:14Z

Hi,

I have raised Issue #1710 and provided a PR #1715. This PR should be updated after that PR is closed.

Sundar Mudupalli

I've scaled back the changes on this PR, ready for re-review @sundar-mudupalli-work

nj1973 · 2026-04-01T12:12:29Z

/gcbrun

nj1973 · 2026-04-15T11:00:18Z

/gcbrun

nj1973 added 3 commits February 20, 2026 11:08

docs: Add missing doc entry for BigQuery --client-project-id option

d06790f

docs: Fix broken links in Docker readme

51b78da

docs: Fix incorrect curl commands in SQL Server Dockerfile

2930a9b

nj1973 linked an issue Feb 20, 2026 that may be closed by this pull request

BigQuery connection option --client-project-id is missing from docs #1676

Open

pull-request-size bot added the size/M label Feb 20, 2026

Merge branch 'develop' into 1676-doc-and-sample-updates

35a55b5

nehanene15 reviewed Feb 23, 2026

View reviewed changes

nj1973 added 2 commits February 24, 2026 08:41

Merge branch 'develop' into 1676-doc-and-sample-updates

70aec79

Merge branch 'develop' into 1676-doc-and-sample-updates

bc021aa

nj1973 requested a review from nehanene15 February 26, 2026 09:52

nj1973 added 3 commits February 26, 2026 09:52

Merge branch 'develop' into 1676-doc-and-sample-updates

6cf988f

Merge branch 'develop' into 1676-doc-and-sample-updates

246c7f2

Merge branch 'develop' into 1676-doc-and-sample-updates

de688a3

sundar-mudupalli-work reviewed Mar 5, 2026

View reviewed changes

sundar-mudupalli-work requested changes Mar 5, 2026

View reviewed changes

nj1973 added 2 commits March 6, 2026 08:43

Merge branch 'develop' into 1676-doc-and-sample-updates

ff5cb5f

Merge branch 'develop' into 1676-doc-and-sample-updates

6aff517

nj1973 added 2 commits April 1, 2026 13:07

Merge branch 'develop' into 1676-doc-and-sample-updates

f8ff36d

chore: Backout BigQuery doc changes

5e9f12e

pull-request-size bot added size/S and removed size/M labels Apr 1, 2026

nj1973 requested a review from sundar-mudupalli-work April 1, 2026 12:11

nj1973 added 3 commits April 2, 2026 14:52

Merge branch 'develop' into 1676-doc-and-sample-updates

43df6cb

Merge branch 'develop' into 1676-doc-and-sample-updates

35fe666

Merge branch 'develop' into 1676-doc-and-sample-updates

af3a75b

Conversation

nj1973 commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description of changes

Issues to be closed

Checklist

Uh oh!

nj1973 commented Feb 20, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nj1973 Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sundar-mudupalli-work Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sundar-mudupalli-work commented Mar 31, 2026

Uh oh!

nj1973 commented Apr 1, 2026

Uh oh!

nj1973 commented Apr 1, 2026

Uh oh!

nj1973 commented Apr 1, 2026

Uh oh!

nj1973 commented Apr 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

nj1973 commented Feb 20, 2026 •

edited

Loading

nj1973 Mar 5, 2026 •

edited

Loading

sundar-mudupalli-work Mar 18, 2026 •

edited

Loading