
WIP Add upload to azure functionality #356

Closed — Qi77Qi wants to merge 2 commits into DataBiosphere:master from Qi77Qi:add-azure

Conversation

@Qi77Qi (Collaborator) commented Sep 22, 2021

TODO: figure out how to use managed identity instead of an access key for auth.

Tested on a Terra VM:

>>> drs.copy("drs://jade.datarepo-dev.broadinstitute.org/v1_0c86170e-312d-4b39-a0a4-2a2bfaa24c7a_c0e40912-8b14-43f6-9a2f-b278144d0060", "https://qijlbdgpc4zqdee.blob.core.windows.net/qi-test-container/subdir/jade1")
2021-09-22 05:50:19::INFO  Enabling requester pays for your workspace. This will only take a few seconds...
2021-09-22 05:50:19::WARNING  Failed to init requester pays for workspace qi-monitoring-1218/qi-ws-1: Expected '204', got '401' for 'https://rawls.dsde-prod.broadinstitute.org/api/workspaces/qi-monitoring-1218/qi-ws-1/enableRequesterPaysForLinkedServiceAccounts'. You will not be able to access DRS URIs that interact with requester pays buckets.
2021-09-22 05:50:21::INFO  Request URL: 'https://qijlbdgpc4zqdee.blob.core.windows.net/qi-test-container/subdir/jade1'
2021-09-22 05:50:21::INFO  Request method: 'PUT'
2021-09-22 05:50:21::INFO  Request headers:
2021-09-22 05:50:21::INFO      'x-ms-blob-type': 'REDACTED'
2021-09-22 05:50:21::INFO      'Content-Length': '62043448'
2021-09-22 05:50:21::INFO      'If-None-Match': '*'
2021-09-22 05:50:21::INFO      'x-ms-version': 'REDACTED'
2021-09-22 05:50:21::INFO      'Content-Type': 'application/octet-stream'
2021-09-22 05:50:21::INFO      'Accept': 'application/xml'
2021-09-22 05:50:21::INFO      'User-Agent': 'azsdk-python-storage-blob/12.9.0 Python/3.7.10 (Linux-5.4.104+-x86_64-with-debian-buster-sid)'
2021-09-22 05:50:21::INFO      'x-ms-date': 'REDACTED'
2021-09-22 05:50:21::INFO      'x-ms-client-request-id': '8e96e9a6-1bcd-11ec-b6fa-0242ac120005'
2021-09-22 05:50:21::INFO      'Authorization': 'REDACTED'
2021-09-22 05:50:21::INFO  A body is sent with the request
2021-09-22 05:50:23::INFO  Response status: 201
2021-09-22 05:50:23::INFO  Response headers:
2021-09-22 05:50:23::INFO      'Content-Length': '0'
2021-09-22 05:50:23::INFO      'Content-MD5': 'REDACTED'
2021-09-22 05:50:23::INFO      'Last-Modified': 'Wed, 22 Sep 2021 17:50:23 GMT'
2021-09-22 05:50:23::INFO      'ETag': '"0x8D97DF173D92DB6"'
2021-09-22 05:50:23::INFO      'Server': 'Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0'
2021-09-22 05:50:23::INFO      'x-ms-request-id': '565e8615-201e-00d6-41da-af7108000000'
2021-09-22 05:50:23::INFO      'x-ms-client-request-id': '8e96e9a6-1bcd-11ec-b6fa-0242ac120005'
2021-09-22 05:50:23::INFO      'x-ms-version': 'REDACTED'
2021-09-22 05:50:23::INFO      'x-ms-content-crc64': 'REDACTED'
2021-09-22 05:50:23::INFO      'x-ms-request-server-encrypted': 'REDACTED'
2021-09-22 05:50:23::INFO      'Date': 'Wed, 22 Sep 2021 17:50:23 GMT'
 DefaultEndpointsProtocol=https;AccountNa   100%   [========================================]   59.2MiB   17.9MiB/s   3.31s

Comment thread: tests/test_http.py

  from tests import config
  from tests.infra.server import ThreadedLocalServer, BaseHTTPRequestHandler
- from terra_notebook_utils.http import HTTPAdapter, Retry, http_session
+ from terra_notebook_utils.http_session import HTTPAdapter, Retry, http_session
Collaborator (Author):

I had to rename the http file because of a naming conflict error.

Member:

What was the conflict?

@Qi77Qi force-pushed the add-azure branch 3 times, most recently from 96fa58b to 4b98abe on September 22, 2021, 20:28
@xbrianh (Member) left a comment:

This looks like good progress, and it's exciting to see TNU support Azure storage :)

I have left a few comments and questions. Also, the test suites for blobstore and copy_client will need to be extended to cover Azure operations.

Comment thread: README.md (outdated)

  4. Attach your terminal to the image via `docker exec -it test-image bash`, then navigate to the directory the code is mounted to via `cd /work`. Note that the above command ensures any changes you make to files in the repo will be updated in the image as well.
  5. log in with your Google credentials using `gcloud auth application-default login`,
- 6. install requirements with `pip install -r requirements.txt`
+ 6. install requirements with `pip3 install -r requirements.txt`
Member:

Python developers typically work in Python virtual environments. Inside a Python 3 virtual environment you use pip and python, not pip3 and python3.

This line can be reverted, and we can rely on developers to understand which pip to use.
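As a sketch of the workflow being described (paths are illustrative), inside an activated virtual environment the bare `pip` already resolves to the venv's interpreter, so the `pip`/`pip3` distinction disappears:

```shell
# Create and activate a virtual environment; once activated, plain `pip`
# and `python` point at the venv's own copies rather than the system ones.
python3 -m venv .venv
. .venv/bin/activate
command -v pip                       # resolves to .venv/bin/pip
# pip install -r requirements.txt    # installs into .venv, not system site-packages
```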

    from azure.identity import DefaultAzureCredential

    class AzureBlobStore(blobstore.BlobStore):
        schema = "https://"
Member:

This schema is unfortunately awkward for TNU. Currently, the CLI command to copy a DRS file into a Google bucket is

    tnu drs copy drs://foo gs://my-bucket/my-key

Note the gs:// schema. The analogous command for an Azure container looks awkward due to the https:// schema:

    tnu drs copy drs://foo https://some-weird-azure-url

A further complication is URL detection. TNU uses the schema to understand the storage provider for destination URLs. Logic for detecting Azure destinations will need to be added here.
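One possible approach (a sketch only; the helper name and the endpoint-suffix heuristic are assumptions, not TNU's actual detection logic) is to key Azure detection off the well-known blob endpoint suffix rather than the scheme alone, so gs:// and s3:// routing is unaffected:

```python
from urllib.parse import urlparse

def is_azure_blob_url(url: str) -> bool:
    """Heuristic: treat https URLs on the Azure blob endpoint as Azure destinations."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.netloc.endswith(".blob.core.windows.net")

# gs:// destinations keep their existing schema-based routing:
# is_azure_blob_url("gs://my-bucket/my-key") -> False
```

This would not cover custom Azure storage domains, which is part of why the https:// schema is awkward here.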

    self._azure_blob_client.delete_blob("include")

    def exists(self):
        return self._azure_blob_client.exists()
Member:

Multipart uploads are supported for both s3 and gs. It is typically more performant to upload large objects as parts, sometimes concurrently. Also, it may not be possible to upload a large object as a single part; for instance, in S3 you cannot upload an object larger than 5GB with a single PUT.

How are large object uploads handled in Azure?

Collaborator (Author):

It seems like multipart upload is supported under the hood by the Azure SDK...see this
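For intuition, azure-storage-blob's `upload_blob` stages large payloads as blocks and commits a block list for block blobs above its single-put threshold. The chunking itself can be sketched in plain Python (the block size and helper name are illustrative, not the SDK's internals):

```python
def block_ranges(total_size: int, block_size: int = 4 * 1024 * 1024):
    """Yield (offset, length) pairs that split an object into upload blocks."""
    offset = 0
    while offset < total_size:
        length = min(block_size, total_size - offset)
        yield offset, length
        offset += length

# The 62043448-byte blob uploaded above would be staged as 4 MiB blocks:
blocks = list(block_ranges(62043448))
assert sum(length for _, length in blocks) == 62043448
```

The SDK can also stage such blocks concurrently (e.g. via `upload_blob`'s `max_concurrency` parameter), which is the Azure analogue of concurrent multipart uploads on s3/gs.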


@Qi77Qi (Collaborator) commented Sep 27, 2021

@xbrianh the error was something like "no module found: http". I googled the error, and someone suggested it could be a naming conflict, so I changed the name of that file.

@Qi77Qi force-pushed the add-azure branch 3 times, most recently from 441e48a to 0700125 on September 28, 2021, 13:21

Pushed commits: cleanup · try · enable logging · err · test · test · test
@Qi77Qi (Collaborator) commented Sep 28, 2021

Closing this PR in favor of #362, since this one is from a fork.

@Qi77Qi Qi77Qi closed this Sep 28, 2021