Skip to content

Add multipart parallel upload support for s3#26

Open
xbrianh wants to merge 3 commits into
masterfrom
bhannafi-parallel-multipart
Open

Add multipart parallel upload support for s3#26
xbrianh wants to merge 3 commits into
masterfrom
bhannafi-parallel-multipart

Conversation

@xbrianh
Copy link
Copy Markdown
Member

@xbrianh xbrianh commented Apr 2, 2019

This is useful for optimizing DSS endpoints, and may have a wider audience.

@xbrianh xbrianh self-assigned this Apr 2, 2019
@xbrianh xbrianh requested a review from ttung April 2, 2019 22:55
Comment thread cloud_blobstore/s3.py Outdated
kwargs['Metadata'] = metadata
mpu = self.s3_client.create_multipart_upload(Bucket=bucket, Key=key, **kwargs)

size = src_file_handle.getbuffer().nbytes
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getbuffer() appears to only exist for bytesio.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll make size a parameter

Comment thread cloud_blobstore/s3.py
part_size: int,
content_type: str=None,
metadata: dict=None,
parallelization_factor=8) -> typing.Sequence[dict]:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
parallelization_factor=8) -> typing.Sequence[dict]:
parallelization_factor: int=8,
) -> typing.Sequence[dict]:

Comment thread cloud_blobstore/s3.py
metadata: dict=None,
parallelization_factor=8) -> typing.Sequence[dict]:
"""
Upload a file object in parallel.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs more docblock love.

Comment thread cloud_blobstore/s3.py
Upload a file object in parallel.
:param bucket:
"""
kwargs: dict = dict()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this type annotation actually useful? I would assume that kwargs = dict() conveys the same information.

Possibly this may be more useful?

Suggested change
kwargs: dict = dict()
kwargs: Dict[str, Any] = dict()

Comment thread cloud_blobstore/s3.py
raise BlobNotFoundError(f"Could not find s3://{bucket}/{key}") from ex
raise BlobStoreUnknownError(ex)

def multipart_parallel_upload(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The biggest design concern I have with this is that if you eventually support a GCP multipart upload, this API is not particularly generalized. It might be worth studying what parameters need to be passed in for GCP multipart upload, and moving the common parameters up to the front, and the S3-specific ones to the end.

Comment thread tests/test_s3blobstore.py
part_size,
)
part_size = 5 * 1024 * 1024
with self.subTest("should work parallelization factor of 1"):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how useful of a test this is...

Brian and others added 3 commits August 2, 2019 16:21
This optimization is usefu for the DSS, and probably has a wider
audience.
@xbrianh xbrianh force-pushed the bhannafi-parallel-multipart branch from 459e507 to f8b4564 Compare August 2, 2019 16:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants