WIP Add upload to azure functionality #356
Qi77Qi wants to merge 2 commits into DataBiosphere:master from a fork.
Conversation
```diff
 from tests import config
 from tests.infra.server import ThreadedLocalServer, BaseHTTPRequestHandler
-from terra_notebook_utils.http import HTTPAdapter, Retry, http_session
+from terra_notebook_utils.http_session import HTTPAdapter, Retry, http_session
```
I had to rename the http file because of a naming conflict error.
Qi77Qi force-pushed from 96fa58b to 4b98abe.
xbrianh left a comment:
This looks like good progress, and it's exciting to see TNU support Azure storage :)
I have left a few comments and questions. Also, the test suites for blobstore and copy_client will need to be extended to cover Azure operations.
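As a rough illustration of the kind of coverage being requested, here is a minimal sketch of an Azure blobstore test, assuming the Azure class mirrors the blob interface the existing GS/S3 suites exercise; the module path, constructor argument, and the `blob`/`put`/`get` accessors below are assumptions, not the repo's confirmed API.

```python
import unittest

# Assumed module path; the real location of AzureBlobStore may differ.
from terra_notebook_utils.blobstore.azure import AzureBlobStore


class TestAzureBlobStore(unittest.TestCase):
    def setUp(self):
        # Assumed constructor argument; the real signature may differ.
        self.blobstore = AzureBlobStore("https://myaccount.blob.core.windows.net/my-container")

    def test_put_get_exists_delete(self):
        # blob()/put()/get() mirror the accessors the GS/S3 suites exercise.
        blob = self.blobstore.blob("test-key")
        blob.put(b"some data")
        self.assertTrue(blob.exists())
        self.assertEqual(blob.get(), b"some data")
        blob.delete()
        self.assertFalse(blob.exists())
```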
```diff
 4. Attach your terminal to the image via `docker exec -it test-image bash`, then navigate to the directory the code is mounted to via `cd /work`. Note that the above command ensures any changes you make to files in the repo will be updated in the image as well.
 5. log in with your Google credentials using `gcloud auth application-default login`,
-6. install requirements with `pip install -r requirements.txt`
+6. install requirements with `pip3 install -r requirements.txt`
```
Python developers typically work in Python virtual environments. Inside a Python 3 virtual environment you use `pip` and `python`, not `pip3` and `python3`.
This line can be reverted, and we can rely on developers to understand which pip to use.
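(For example, after `python3 -m venv venv && source venv/bin/activate`, a bare `pip install -r requirements.txt` already resolves to the environment's Python 3.)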
```python
from azure.identity import DefaultAzureCredential


class AzureBlobStore(blobstore.BlobStore):
    schema = "https://"
```
This schema is unfortunately awkward for TNU. Currently, the CLI command to copy a DRS file into a Google bucket is
`tnu drs copy drs://foo gs://my-bucket/my-key`
Note the `gs://` schema. The analogous command for an Azure container looks awkward due to the `https://` schema:
`tnu drs copy drs://foo https://some-weird-azure-url`
A further complication is URL detection. TNU uses the schema to determine the storage provider for destination URLs, so logic for detecting Azure destinations will need to be added here (see the sketch below).
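One way such detection might look, since `https://` alone cannot identify the provider; the helper name and the host check are illustrative assumptions, not the PR's code:

```python
from urllib.parse import urlparse


def is_azure_blob_url(url: str) -> bool:
    """Heuristically detect an Azure Blob Storage destination URL.

    Azure blob URLs have the form
    https://<account>.blob.core.windows.net/<container>/<key>,
    so the host is inspected in addition to the schema.
    """
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.netloc.endswith(".blob.core.windows.net")
```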
```python
        self._azure_blob_client.delete_blob("include")

    def exists(self):
        return self._azure_blob_client.exists()
```
Multipart uploads are supported for both S3 and GS. It is typically more performant to upload large objects as parts, sometimes concurrently. Also, it may not be possible to upload a large object as a single part; for instance, in S3 you cannot upload an object larger than 5GB with a single PUT.
How are large object uploads handled in Azure?
It seems like multipart is supported under the hood by the Azure SDK... see this
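For reference, a minimal sketch of driving that automatic chunking with `azure-storage-blob`; the connection string and container/blob names are placeholders, while `max_single_put_size`, `max_block_size`, and `max_concurrency` are real client knobs:

```python
from azure.storage.blob import BlobServiceClient

# Placeholder connection string; the PR currently authenticates with an access key.
service = BlobServiceClient.from_connection_string(
    "DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=...;EndpointSuffix=core.windows.net",
    max_single_put_size=64 * 1024 * 1024,  # cutoff for single-request uploads
    max_block_size=8 * 1024 * 1024,        # block (part) size for chunked uploads
)
blob_client = service.get_blob_client(container="my-container", blob="my-key")

with open("large-file.bin", "rb") as fh:
    # For streams larger than max_single_put_size, the SDK stages blocks and
    # commits a block list (analogous to S3/GS multipart uploads), uploading
    # up to max_concurrency blocks in parallel.
    blob_client.upload_blob(fh, overwrite=True, max_concurrency=4)
```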
```diff
 from tests import config
 from tests.infra.server import ThreadedLocalServer, BaseHTTPRequestHandler
-from terra_notebook_utils.http import HTTPAdapter, Retry, http_session
+from terra_notebook_utils.http_session import HTTPAdapter, Retry, http_session
```
@xbrianh the error was something like …
Qi77Qi force-pushed from 441e48a to 0700125.
Closing this PR in favor of #362 since this is from a fork.
TODO:
- Figure out how to use managed identity instead of an access key for auth.

Tested on a Terra VM.
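A hedged sketch of what managed-identity auth could look like, using the `DefaultAzureCredential` already imported in this PR; the account and container names are placeholders, and the overall wiring is an assumption about the eventual design, not the PR's current code:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# On Azure infrastructure (e.g. a VM with a managed identity assigned),
# DefaultAzureCredential picks up the managed identity automatically,
# so no account key needs to be stored or passed around.
service = BlobServiceClient(
    "https://myaccount.blob.core.windows.net",  # placeholder account URL
    credential=DefaultAzureCredential(),
)
container = service.get_container_client("my-container")
print([blob.name for blob in container.list_blobs()])
```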