Skip to content

docs: Documentation and samples updates#1677

Open
nj1973 wants to merge 16 commits intodevelopfrom
1676-doc-and-sample-updates
Open

docs: Documentation and samples updates#1677
nj1973 wants to merge 16 commits intodevelopfrom
1676-doc-and-sample-updates

Conversation

@nj1973
Copy link
Copy Markdown
Collaborator

@nj1973 nj1973 commented Feb 20, 2026

Description of changes

Just fixing a missing documentation entry and a few sample file issues.

Issues to be closed

Checklist

  • I have followed the CONTRIBUTING Guide.
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated any relevant documentation to reflect my changes, if applicable
  • I have added unit and/or integration tests relevant to my change as needed
  • I have already checked locally that all unit tests and linting are passing (use the tests/local_check.sh script)
  • I have manually executed end-to-end testing (E2E) with the affected databases/engines

@nj1973
Copy link
Copy Markdown
Collaborator Author

nj1973 commented Feb 20, 2026

/gcbrun

Comment thread data_validation/cli_tools.py Outdated
[
"client_project_id",
"(Optional) BigQuery job/billing project (can differ from data project)",
"(Optional) BigQuery compute/billing project (can differ from data project)",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should call this the 'quota project id'. The compute/slots will still be billed to the project specified by --project-id. Could reference this in the docs - https://docs.cloud.google.com/docs/quotas/quota-project

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this project_id is also used to define the client here:

Which I believe ensures compute is assigned to that project too.

This allows users to have their billing/compute in a different project to the one that houses the data.

Or did I misunderstand how this works?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reviewing with Gemini - setting "client_project_id" effectively sets the billing (quota and slots) to "client_project_id". The only place we use this is get_bigquery_client. Typical behavior is that the billing - quota & slots is paid by the caller (client). When we provide this option, we are telling the user that they can use another project for billing. Therefore, I would prefer that we call this option billing_project_id and potentially accept client_project_id with a warning that customer should use billing_project_id and the option client_project_id may be deprecated in the future.

What do both of you think ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Client in client_project_id was picked because it felt like the most accurate term, it is the project used when creating the client used to submit jobs. The customer who asked for this extra control was not just concerned about getting slot usage in the correct project, they didn't even have BigQuery job submit privileges in the "data" project.

The option is also used to define quota because that appeared to make sense. I suppose it is possible for someone to want that extra control but I imagine it is not common.

It feels like this can be resolved by setting the option description to something unambiguous without changing the option name.

Maybe this is slightly more accurate:

"(Optional) Overrides the default project id when creating BigQuery clients. Provides extra control on which project BigQuery jobs are submitted to (including quota).",

@nj1973 nj1973 requested a review from nehanene15 February 26, 2026 09:52

CONNECTION_SOURCE_FIELDS = {
consts.SOURCE_TYPE_BIGQUERY: [
["project_id", "GCP Project to use for BigQuery"],
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend changing this explanation to "GCP project holding the BigQuery dataset"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or BigQuery table - since we deal with tables mostly or dataset/table to be inclusive of how find-tables works.

Copy link
Copy Markdown
Collaborator Author

@nj1973 nj1973 Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This a used for everything if you don't override the compute/billing side of things with the optional parameter. If we state it is for the dataset/table then it would be misleading for cases where only a single project is in play.

How about:

Default project for BigQuery data and client (client project can be overridden with --client-project-id)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I am not understanding this - the "client project" (call it A) in BigQuery billing terms is the project to which the service account making the submit job request belongs. By default A pays for query while "project_id" pays for storage. So "project_id" is not default for billing. Cross project service accounts are not common, so the service account often belong to project_id.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I agree with that description.

My choice of "default" in the wording is a DVT default. If the user does not provide "A" (which is an optional argument) then "project_id" will be used by DVT for both the data project and the one used when creating the client. This is the behaviour we've always had, one "project_id" to rule them all.

We recently added "A" to support customers who have the split project model. If the user does provide the optional "A" then "A" is used for the client and the original "project_id" option continues to be used for data project.

Copy link
Copy Markdown
Collaborator

@sundar-mudupalli-work sundar-mudupalli-work Mar 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

Here is what I set up

  1. data-project - has the dataset and the table
  2. query-project - project that holds the service account that is querying bigQuery
  3. billing-project - project that is being billed for the query.
  4. One service account with privileges to run jobs in the billing project and query project and view the data in the data-project.
  5. Two connections to bq specifying the data project, each with a different client project id that is query-project and billing project respectively.
    This was done as follows:
gcloud iam service-accounts create bq-query-sa --display-name "BigQuery Query Service Account" --description "Service Account to test BigQuery billing" --project=query-project
gcloud projects add-iam-policy-binding query-project --member="serviceAccount:bq-query-sa@query-project.iam.gserviceaccount.com" --role="roles/bigquery.jobUser
gcloud projects add-iam-policy-binding billing-project --member="serviceAccount:bq-query-sa@query-project.iam.gserviceaccount.com" --role="roles/bigquery.jobUser"
gcloud projects add-iam-policy-binding data-project --member="serviceAccount:bq-query-sa@query-project.iam.gserviceaccount.com" --role="roles/bigquery.dataViewer"
gcloud iam service-accounts add-iam-policy-binding "bq-query-sa@query-project.iam.gserviceaccount.com" --member="user:my_ldap@google.com" --role="roles/iam.serviceAccountTokenCreator"
gcloud auth application-default login --impersonate-service-account=bq-query-sa@query-project.iam.gserviceaccount.com
data-validation connections add --connection-name bq1 BigQuery --project-id data-project --client-project-id query-project
data-validation connections add --connection-name bq2 BigQuery --project-id data-project --client-project-id billing-project

The following two bq commands were run, both successful in billing the query project and billing project respectively.

bq query --use_legacy_sql=false 'SELECT * from `pso-kokoro-resources.pso_data_validator.dvt_core_types`'
bq query --use_legacy_sql=false --project_id=billing-project 'SELECT * from `pso-kokoro-resources.pso_data_validator.dvt_core_types`'

The following 2 data validation commands were run, both throwing an error saying permission denied:

data-validation query -c bq1 -q 'select * from pso-kokoro-resources.pso_data_validator.dvt_core_types'
data-validation query -c bq2 -q 'select * from pso-kokoro-resources.pso_data_validator.dvt_core_types'

However when lines 121-127 in clients.py were modified as follows (commenting out other parameters):

    return bigquery.Client(
        project=effective_project,
        # client_info=info,
        # credentials=credentials,
        # default_query_job_config=job_config,
        # client_options=options,
    )

Both of the above commands ran correctly.

Here are changes we could make for clarity:

  1. Fix the issue with clients.py - why is it complaining about permissions - something strange with the other arguments.
  2. Allow the service account to determine the default billing project - rather than choosing the data project. This may require us to write out bigquery table names as data-project.dataset.table when converting queries to SQL. This could be a breaking change we reserve for the next release.
  3. Clarify that the project_id refers to the location of the data and the optional billing_project can be provided to bill to a specific project. If not provided, then the project that is billed is the project where the service account is created.

Sundar Mudupalli

@sundar-mudupalli-work
Copy link
Copy Markdown
Collaborator

Hi,

I have raised Issue #1710 and provided a PR #1715. This PR should be updated after that PR is closed.

Sundar Mudupalli

@nj1973
Copy link
Copy Markdown
Collaborator Author

nj1973 commented Apr 1, 2026

@sundar-mudupalli-work , as you are planning to deprecate the option I'll scale this PR back to only fixing the sample files.

@nj1973
Copy link
Copy Markdown
Collaborator Author

nj1973 commented Apr 1, 2026

Hi,

I have raised Issue #1710 and provided a PR #1715. This PR should be updated after that PR is closed.

Sundar Mudupalli

I've scaled back the changes on this PR, ready for re-review @sundar-mudupalli-work

@nj1973
Copy link
Copy Markdown
Collaborator Author

nj1973 commented Apr 1, 2026

/gcbrun

@nj1973
Copy link
Copy Markdown
Collaborator Author

nj1973 commented Apr 15, 2026

/gcbrun

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BigQuery connection option --client-project-id is missing from docs Fix mistakes in sample SQL Server Dockerfile

3 participants