@ararslan ararslan commented Oct 4, 2024

I've had these changes locally for months (possibly a year or more?) but hadn't committed or pushed them. I don't know if/when I'll have the bandwidth to ensure this gets over the finish line, so if someone is interested in picking this up then please feel free to do so.

Summary of changes:

  • `s3_copy` now supports a `version` keyword argument that facilitates copying a specified version of an object.
  • A new function `s3_multipart_copy` to mirror `s3_multipart_upload` has been added, which calls `UploadPartCopy` in the API.
  • An explicit `cp(::S3Path, ::S3Path)` method has been implemented, which avoids the fallback `cp(::AbstractPath, ::AbstractPath)` method that reads the source file into memory before writing to the destination.
    • To avoid breaking the convenient but possibly unintended prior behavior of using different credentials for the source and destination paths, the fallback method is called when the source and destination credentials differ.
  • `cp(::S3Path, ::S3Path)` allows the user to opt into a multipart copy, in which case multipart is used when the source is larger than the specified part size (50 MiB by default). A multipart copy is unconditionally used when the source is at least 5 GiB. This behavior mimics that of the AWS CLI. Note that this now requires an additional API call to `HeadObject` in order to retrieve the source size.
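
The size-based rule in the last bullet can be sketched roughly as follows. This is only an illustration of the described behavior, not the PR's code: the function name `use_multipart_copy` and the `multipart` keyword are hypothetical, and `source_size` stands in for the size reported by `HeadObject`.

```julia
# Hypothetical helper illustrating the decision described in the summary.
# None of these names are taken from the PR's implementation.
const MiB = 1024^2
const GiB = 1024^3

function use_multipart_copy(
    source_size::Integer; multipart::Bool=false, part_size_mb::Integer=50
)
    # Sources of at least 5 GiB always go through a multipart copy.
    source_size >= 5 * GiB && return true
    # Below that, multipart is opt-in and only applies when the source
    # is larger than a single part.
    return multipart && source_size > part_size_mb * MiB
end
```

Under these assumptions, a 100 MiB source is copied in one request unless the caller opts in, while a 5 GiB source is always split into parts.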


ararslan commented Oct 4, 2024

bors try

@ararslan
Member Author

Relevant: JuliaCloud/AWS.jl#695

[multipart copy](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html).
# Optional Arguments
- `part_size_mb`: maximum size per uploaded part, in mebibytes (MiB).
Member Author

I wonder if it's worth exposing an option that allows matching the part size between the source and destination. IIUC, that should make the range-based accesses faster while copying. If a file is big enough for a multipart copy, it was probably uploaded with a multipart upload, in which case the parts and their sizes can be obtained with `S3.get_object_attributes`. Lacking that permission, one can also get the part size with `S3.head_object` by passing `Dict("partNumber" => 1)` as a query parameter, and the number of parts will be in the entity tag (ETag) of the source object.
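
As a rough illustration of the last point, the part count could be read off the ETag like this. It is a sketch that assumes the usual `<hex digest>-<N>` form of a multipart ETag; `etag_part_count` is a hypothetical helper, not something in AWSS3.jl, and it works on a plain string rather than an actual `HeadObject` response.

```julia
# Hypothetical helper: extract the number of parts from a multipart ETag.
# Assumes the ETag looks like "<hex digest>-<N>", possibly wrapped in quotes.
function etag_part_count(etag::AbstractString)
    unquoted = strip(etag, '"')
    m = match(r"-(\d+)$", unquoted)
    # An ETag without the "-N" suffix gives no part count.
    return m === nothing ? nothing : parse(Int, m.captures[1])
end
```

A result of `nothing` would suggest the source was not uploaded in parts, in which case there is no source part size to match.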

Comment on lines 507 to 512
to_bucket,
to_path,
"$bucket/$path",
source,
Dict("headers" => headers);
aws_config=aws,
kwargs...,
Contributor

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
to_bucket,
to_path,
"$bucket/$path",
source,
Dict("headers" => headers);
aws_config=aws,
kwargs...,
to_bucket, to_path, source, Dict("headers" => headers); aws_config=aws, kwargs...

Comment on lines +1096 to +1098
"x-amz-copy-source-range" => string(
"bytes=", first(byte_range), '-', last(byte_range)
)
Contributor

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
"x-amz-copy-source-range" => string(
"bytes=", first(byte_range), '-', last(byte_range)
)
"x-amz-copy-source-range" =>
string("bytes=", first(byte_range), '-', last(byte_range)),
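
For context on the `x-amz-copy-source-range` header above, here is a minimal sketch of how a source could be split into per-part byte ranges. The helper names `copy_part_ranges` and `range_header` are hypothetical and the PR may compute its ranges differently. The header format itself (zero-based, inclusive offsets) matches the `string(...)` call in the diff.

```julia
# Hypothetical helpers; not taken from the PR.
# Split `source_size` bytes into zero-based, inclusive ranges of at most
# `part_size` bytes each.
function copy_part_ranges(source_size::Integer, part_size::Integer)
    part_size > 0 || throw(ArgumentError("part_size must be positive"))
    starts = 0:part_size:(source_size - 1)
    return [s:min(s + part_size - 1, source_size - 1) for s in starts]
end

# Format one range the same way the diff does for the request header.
range_header(byte_range) = string("bytes=", first(byte_range), '-', last(byte_range))
```

For a 10-byte source and 4-byte parts this gives the ranges `0:3`, `4:7`, and `8:9`, so only the last part is shorter than the part size.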

@ararslan ararslan force-pushed the aa/multipart-copy branch from 57bf305 to 27ad265 Compare May 21, 2025 16:26