1 change: 1 addition & 0 deletions .gitignore
@@ -27,4 +27,5 @@

# Build artifacts
/build/
/bin/
checksums.txt
26 changes: 26 additions & 0 deletions README.md
@@ -90,3 +90,29 @@ The metadata file should be a JSON file in the format:
The `aliases` and `metadata` properties are optional. Some Gen3 data commons require the `authz` property to be specified in order to upload a data file.

If you do not know what `authz` to use, you can look at your `Profile` tab or `/identity` page of the Gen3 data commons you are uploading to. You will see a list of _authz resources_ in the format `/example/authz/resource`: these are the authz resources you have access to.

## Multipart Upload

The `data-client` supports multipart upload for large files: the file is split into smaller chunks (parts), which makes uploads more reliable and allows them to be resumed after an interruption.

### Chunk Size (Message Size) in Multipart Uploads

When uploading files using multipart upload, the file is divided into chunks (also referred to as "parts" or "messages"). The chunk size is determined automatically from the file size (see the sketch after the examples below):

- **For files ≤ 512 MB**: 32 MB chunks
- **For files between 512 MB and ~49 GB**: 5 MB chunks (the S3 minimum)
  - The ~49 GB threshold (10,000 parts × 5 MB) is the point beyond which the minimum chunk size would exceed S3's limit of 10,000 parts per upload
- **For files > ~49 GB**: chunk size calculated dynamically so the upload stays within S3's limit of 10,000 parts per upload
  - Minimum chunk size: 5 MB (S3 requirement for all parts except the last)
  - Maximum number of parts: 10,000
  - Chunk sizes are rounded up to the nearest MB

**Example chunk sizes:**
- 100 MB file → 32 MB chunks (4 parts)
- 1 GB file → 5 MB chunks (~205 parts)
- 10 GB file → 5 MB chunks (~2,048 parts)
- 50 GB file → 6 MB chunks (~8,534 parts)
- 100 GB file → 11 MB chunks (~9,310 parts)
- 1 TB file → 105 MB chunks (~9,987 parts)
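
The calculation behind these numbers can be sketched in Go as follows. This is an illustration only, not the project's implementation: it follows the documented behavior of `optimalChunkSize` in `client/g3cmd/upload-multipart.go`, but the real code may differ in detail.

```go
package main

import "fmt"

// optimalChunkSizeSketch mirrors the chunk-size rules documented above.
// Illustrative only; the real logic lives in optimalChunkSize in
// client/g3cmd/upload-multipart.go.
func optimalChunkSizeSketch(fileSize int64) int64 {
	const (
		mb           int64 = 1024 * 1024
		minChunkSize int64 = 5 * mb // S3 minimum part size
		maxParts     int64 = 10000  // S3 maximum parts per upload
	)
	if fileSize <= 512*mb {
		return 32 * mb // small files: fixed 32 MB chunks
	}
	// Spread the file across at most maxParts parts.
	chunk := fileSize / maxParts
	if chunk < minChunkSize {
		chunk = minChunkSize
	}
	// Round up to a whole number of MB.
	if chunk%mb != 0 {
		chunk = (chunk/mb + 1) * mb
	}
	return chunk
}

func main() {
	for _, size := range []int64{100 << 20, 1 << 30, 50 << 30, 1 << 40} {
		chunk := optimalChunkSizeSketch(size)
		parts := (size + chunk - 1) / chunk
		fmt.Printf("%d bytes -> %d MB chunks, %d parts\n", size, chunk/(1<<20), parts)
	}
}
```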

The multipart upload process uploads up to 10 parts concurrently, with automatic retry and exponential backoff for failed parts.
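
For illustration, the upload loop can be pictured with the Go sketch below. This is not the actual implementation: the `Part` type and the `uploadPart` callback are hypothetical placeholders, and the real code additionally handles presigned URLs, progress reporting, and completing the multipart upload.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Part is a hypothetical placeholder for one chunk of the file.
type Part struct {
	Number int
	Data   []byte
}

// uploadAllParts sketches the concurrency model described above: at most 10
// parts in flight at a time, each retried up to 5 times with exponential
// backoff. Illustrative only; it is not the data-client implementation.
func uploadAllParts(ctx context.Context, parts []Part, uploadPart func(context.Context, Part) error) error {
	const (
		maxConcurrentUploads = 10
		maxRetries           = 5
	)
	sem := make(chan struct{}, maxConcurrentUploads) // bounds in-flight uploads
	errCh := make(chan error, len(parts))
	var wg sync.WaitGroup

	for _, p := range parts {
		wg.Add(1)
		sem <- struct{}{} // acquire a worker slot (blocks when 10 are in flight)
		go func(p Part) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot

			var err error
			for attempt := 0; attempt < maxRetries; attempt++ {
				if err = uploadPart(ctx, p); err == nil {
					return
				}
				// Exponential backoff between attempts, capped at 10s.
				time.Sleep(min(time.Duration(1<<uint(attempt))*200*time.Millisecond, 10*time.Second))
			}
			errCh <- fmt.Errorf("part %d failed after %d attempts: %w", p.Number, maxRetries, err)
		}(p)
	}

	wg.Wait()
	close(errCh)

	var errs []error
	for err := range errCh {
		errs = append(errs, err)
	}
	return errors.Join(errs...)
}

func main() {
	parts := []Part{{Number: 1}, {Number: 2}, {Number: 3}}
	// Dummy uploader that always succeeds; a real one would PUT each part to object storage.
	err := uploadAllParts(context.Background(), parts, func(ctx context.Context, p Part) error {
		return nil
	})
	fmt.Println("upload finished, err =", err)
}
```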
59 changes: 53 additions & 6 deletions client/g3cmd/upload-multipart.go
@@ -22,10 +22,19 @@ import (
)

const (
minChunkSize = 5 * 1024 * 1024 // S3 minimum part size
maxMultipartParts = 10000
// minChunkSize is the minimum chunk/part size for multipart uploads (5 MB)
// This is enforced by AWS S3 for all parts except the last part
minChunkSize = 5 * 1024 * 1024 // 5 MB - S3 minimum part size

// maxMultipartParts is the maximum number of parts allowed in a single multipart upload
// This is an AWS S3 limitation
maxMultipartParts = 10000

// maxConcurrentUploads is the number of parallel workers uploading parts concurrently
maxConcurrentUploads = 10
maxRetries = 5

// maxRetries is the maximum number of retry attempts per part upload
maxRetries = 5
)

func NewUploadMultipartCmd() *cobra.Command {
@@ -39,7 +48,13 @@ func NewUploadMultipartCmd() *cobra.Command {
Use: "upload-multipart",
Short: "Upload a single file using multipart upload",
Long: `Uploads a large file to object storage using multipart upload.
This method is resilient to network interruptions and supports resume capability.`,
This method is resilient to network interruptions and supports resume capability.

The file is automatically split into chunks (parts) for upload:
- Files ≤ 512 MB: 32 MB chunks
- Files > 512 MB: Dynamically calculated chunks (minimum 5 MB, maximum 10,000 parts)

Up to 10 parts are uploaded concurrently with automatic retry for failed parts.`,
Example: `./data-client upload-multipart --profile=myprofile --file-path=./large.bam
./data-client upload-multipart --profile=myprofile --file-path=./data.bam --guid=existing-guid`,
RunE: func(cmd *cobra.Command, args []string) error {
@@ -87,7 +102,16 @@ func UploadSingleFile(profile, bucket, filePath, guid string) error {
return MultipartUpload(context.TODO(), g3, fileInfo, bucket, true)
}

// MultipartUpload is now clean, context-aware, and uses modern progress bars
// MultipartUpload handles uploading large files by splitting them into multiple parts.
// This method is resilient to network interruptions and supports concurrent uploads.
//
// The file is divided into chunks (parts) whose size is automatically determined by
// optimalChunkSize() based on the file size. The chunk size ranges from 32 MB for
// smaller files to larger sizes for files approaching the 5 TB limit, always staying
// within AWS S3's constraint of maximum 10,000 parts per upload.
//
// Up to maxConcurrentUploads (10) parts are uploaded in parallel, with automatic
// retry logic for failed parts.
func MultipartUpload(ctx context.Context, g3 client.Gen3Interface, req common.FileUploadRequestObject, bucketName string, showProgress bool) error {
g3.Logger().Printf("File Upload Request: %#v\n", req)

@@ -262,7 +286,30 @@ func backoffDuration(attempt int) time.Duration {
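	// Exponential backoff that doubles with each attempt (200ms base), capped at 10s.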
return min(time.Duration(1<<uint(attempt))*200*time.Millisecond, 10*time.Second)
}

// Choose optimal chunk size
// optimalChunkSize determines the ideal chunk/part size for multipart upload based on file size.
// The chunk size (also known as "message size" or "part size") affects upload performance and
// must comply with S3 constraints.
//
// Calculation logic:
// - For files ≤ 512 MB: Returns 32 MB chunks for optimal performance
// - For files > 512 MB: Calculates fileSize/maxMultipartParts, with minimum of 5 MB
// - Enforces minimum of 5 MB (S3 requirement for all parts except the last)
// - Rounds up to nearest MB for alignment
//
// This results in:
// - Files ≤ 512 MB: 32 MB chunks
//   - Files between 512 MB and ~49 GB: 5 MB chunks (minimum enforced)
//     The ~49 GB threshold (10,000 parts × 5 MB) is the point beyond which the
//     minimum chunk size would exceed S3's limit of 10,000 parts
// - Files > ~49 GB: Dynamically calculated to stay under 10,000 parts
//
// Examples:
// - 100 MB file → 32 MB chunks (4 parts)
// - 1 GB file → 5 MB chunks (~205 parts)
// - 10 GB file → 5 MB chunks (~2,048 parts)
// - 50 GB file → 6 MB chunks (~8,534 parts)
// - 100 GB file → 11 MB chunks (~9,310 parts)
// - 1 TB file → 105 MB chunks (~9,987 parts)
func optimalChunkSize(fileSize int64) int64 {
if fileSize <= 512*1024*1024 {
return 32 * 1024 * 1024 // 32MB for smaller files