Conversation
|
/gcbrun |
| [ | ||
| "client_project_id", | ||
| "(Optional) BigQuery job/billing project (can differ from data project)", | ||
| "(Optional) BigQuery compute/billing project (can differ from data project)", |
There was a problem hiding this comment.
We should call this the 'quota project id'. The compute/slots will still be billed to the project specified by --project-id. Could reference this in the docs - https://docs.cloud.google.com/docs/quotas/quota-project
There was a problem hiding this comment.
But this project_id is also used to define the client here:
Which I believe ensures compute is assigned to that project too.
This allows users to have their billing/compute in a different project to the one that houses the data.
Or did I misunderstand how this works?
There was a problem hiding this comment.
After reviewing with Gemini - setting "client_project_id" effectively sets the billing (quota and slots) to "client_project_id". The only place we use this is get_bigquery_client. Typical behavior is that the billing - quota & slots is paid by the caller (client). When we provide this option, we are telling the user that they can use another project for billing. Therefore, I would prefer that we call this option billing_project_id and potentially accept client_project_id with a warning that customer should use billing_project_id and the option client_project_id may be deprecated in the future.
What do both of you think ?
There was a problem hiding this comment.
Client in client_project_id was picked because it felt like the most accurate term, it is the project used when creating the client used to submit jobs. The customer who asked for this extra control was not just concerned about getting slot usage in the correct project, they didn't even have BigQuery job submit privileges in the "data" project.
The option is also used to define quota because that appeared to make sense. I suppose it is possible for someone to want that extra control but I imagine it is not common.
It feels like this can be resolved by setting the option description to something unambiguous without changing the option name.
Maybe this is slightly more accurate:
"(Optional) Overrides the default project id when creating BigQuery clients. Provides extra control on which project BigQuery jobs are submitted to (including quota).",
|
|
||
| CONNECTION_SOURCE_FIELDS = { | ||
| consts.SOURCE_TYPE_BIGQUERY: [ | ||
| ["project_id", "GCP Project to use for BigQuery"], |
There was a problem hiding this comment.
I recommend changing this explanation to "GCP project holding the BigQuery dataset"
There was a problem hiding this comment.
or BigQuery table - since we deal with tables mostly or dataset/table to be inclusive of how find-tables works.
There was a problem hiding this comment.
This a used for everything if you don't override the compute/billing side of things with the optional parameter. If we state it is for the dataset/table then it would be misleading for cases where only a single project is in play.
How about:
Default project for BigQuery data and client (client project can be overridden with --client-project-id)
There was a problem hiding this comment.
Maybe I am not understanding this - the "client project" (call it A) in BigQuery billing terms is the project to which the service account making the submit job request belongs. By default A pays for query while "project_id" pays for storage. So "project_id" is not default for billing. Cross project service accounts are not common, so the service account often belong to project_id.
There was a problem hiding this comment.
Yep, I agree with that description.
My choice of "default" in the wording is a DVT default. If the user does not provide "A" (which is an optional argument) then "project_id" will be used by DVT for both the data project and the one used when creating the client. This is the behaviour we've always had, one "project_id" to rule them all.
We recently added "A" to support customers who have the split project model. If the user does provide the optional "A" then "A" is used for the client and the original "project_id" option continues to be used for data project.
There was a problem hiding this comment.
Hi,
Here is what I set up
- data-project - has the dataset and the table
- query-project - project that holds the service account that is querying bigQuery
- billing-project - project that is being billed for the query.
- One service account with privileges to run jobs in the billing project and query project and view the data in the data-project.
- Two connections to bq specifying the data project, each with a different client project id that is query-project and billing project respectively.
This was done as follows:
gcloud iam service-accounts create bq-query-sa --display-name "BigQuery Query Service Account" --description "Service Account to test BigQuery billing" --project=query-project
gcloud projects add-iam-policy-binding query-project --member="serviceAccount:bq-query-sa@query-project.iam.gserviceaccount.com" --role="roles/bigquery.jobUser
gcloud projects add-iam-policy-binding billing-project --member="serviceAccount:bq-query-sa@query-project.iam.gserviceaccount.com" --role="roles/bigquery.jobUser"
gcloud projects add-iam-policy-binding data-project --member="serviceAccount:bq-query-sa@query-project.iam.gserviceaccount.com" --role="roles/bigquery.dataViewer"
gcloud iam service-accounts add-iam-policy-binding "bq-query-sa@query-project.iam.gserviceaccount.com" --member="user:my_ldap@google.com" --role="roles/iam.serviceAccountTokenCreator"
gcloud auth application-default login --impersonate-service-account=bq-query-sa@query-project.iam.gserviceaccount.com
data-validation connections add --connection-name bq1 BigQuery --project-id data-project --client-project-id query-project
data-validation connections add --connection-name bq2 BigQuery --project-id data-project --client-project-id billing-project
The following two bq commands were run, both successful in billing the query project and billing project respectively.
bq query --use_legacy_sql=false 'SELECT * from `pso-kokoro-resources.pso_data_validator.dvt_core_types`'
bq query --use_legacy_sql=false --project_id=billing-project 'SELECT * from `pso-kokoro-resources.pso_data_validator.dvt_core_types`'
The following 2 data validation commands were run, both throwing an error saying permission denied:
data-validation query -c bq1 -q 'select * from pso-kokoro-resources.pso_data_validator.dvt_core_types'
data-validation query -c bq2 -q 'select * from pso-kokoro-resources.pso_data_validator.dvt_core_types'
However when lines 121-127 in clients.py were modified as follows (commenting out other parameters):
return bigquery.Client(
project=effective_project,
# client_info=info,
# credentials=credentials,
# default_query_job_config=job_config,
# client_options=options,
)
Both of the above commands ran correctly.
Here are changes we could make for clarity:
- Fix the issue with clients.py - why is it complaining about permissions - something strange with the other arguments.
- Allow the service account to determine the default billing project - rather than choosing the data project. This may require us to write out bigquery table names as data-project.dataset.table when converting queries to SQL. This could be a breaking change we reserve for the next release.
- Clarify that the project_id refers to the location of the data and the optional billing_project can be provided to bill to a specific project. If not provided, then the project that is billed is the project where the service account is created.
Sundar Mudupalli
|
@sundar-mudupalli-work , as you are planning to deprecate the option I'll scale this PR back to only fixing the sample files. |
I've scaled back the changes on this PR, ready for re-review @sundar-mudupalli-work |
|
/gcbrun |
|
/gcbrun |
Description of changes
Just fixing a missing documentation entry and a few sample file issues.
Issues to be closed
Checklist
CONTRIBUTINGGuide.tests/local_check.shscript)