A Go rewrite of the h5ai downloader with concurrent download support and additional features.
- Concurrent Downloads: Use multiple goroutines for parallel file downloads
- Export Only Mode: Save URLs to file instead of downloading
- Flexible Output: Control directory structure and output files
- Caching: HTTP response caching to avoid redundant requests
- Progress Tracking: Track download progress and resume interrupted downloads
- Multiple URL Support: Process single URLs or files containing multiple URLs
```bash
# Build from source
go build -o h5ai_downloader

# Or run directly
go run main.go [options]
```

```bash
# Download from a single URL to the default directory (./files)
./h5ai_downloader -url "http://example.com/files/" -depth 3 -workers 8

# Download to a custom directory
./h5ai_downloader -url "http://example.com/files/" -output "./downloads" -workers 4

# Download from multiple URLs listed in a file
./h5ai_downloader -file urls.txt -depth 2 -workers 4
```

```bash
# Export URLs to the default file (urls.txt)
./h5ai_downloader -url "http://example.com/files/" -export-only

# Export URLs to a custom file with a flat structure
./h5ai_downloader -url "http://example.com/files/" -export-only -flat -output my_urls.txt

# Export with directory structure preserved
./h5ai_downloader -url "http://example.com/files/" -export-only -output detailed_urls.txt
```

| Option | Short | Description | Default |
|---|---|---|---|
| `--url` | `-u` | Single URL to scrape | - |
| `--file` | `-f` | File containing URLs to scrape | - |
| `--depth` | `-d` | Maximum depth for scraping | 4 |
| `--workers` | | Number of concurrent download workers | 4 |
| `--export-only` | | Save URLs to file instead of downloading | false |
| `--flat` | | Skip directory structure | false |
| `--output` | | Output directory for downloads or filename for export | ./files (download) / urls.txt (export) |
When using the `--file` option, create a text file with one URL per line, optionally followed by a custom depth:

```text
http://example1.com/files/
http://example2.com/data/ 5
http://example3.com/docs/ 2
```
| Feature | Python Version | Go Version |
|---|---|---|
| Basic h5ai crawling | ✅ | ✅ |
| Download tracking | ✅ | ✅ |
| HTTP caching | ✅ | ✅ |
| Multiple URLs | ✅ | ✅ |
| Concurrent downloads | ❌ | ✅ |
| Export-only mode | ❌ | ✅ |
| Flat export option | ❌ | ✅ |
| Custom output file | ❌ | ✅ |
| Worker pool control | ❌ | ✅ |
The Go version offers significant performance improvements:
- Concurrent Downloads: Download multiple files simultaneously using configurable worker pools
- Better Memory Usage: More efficient memory management compared to Python
- Faster Startup: No interpreter overhead
- Built-in HTTP Client: Optimized HTTP handling without external dependencies
- Cache System: Stores HTTP responses in `.gob` files for quick retrieval
- URL Collector: Thread-safe collection of downloadable URLs during crawling
- Download Tracker: Persistent tracking of completed downloads to enable resuming
- Worker Pool: Configurable number of goroutines for concurrent downloads
```text
├── main.go            # Main application code
├── go.mod             # Go module definition
├── url_cache/         # HTTP response cache (created automatically)
├── downloaded_db/     # Download completion tracking (created automatically)
└── [downloaded files] # Downloaded content preserving directory structure
```
```bash
./h5ai_downloader -url "http://files.example.com/" -depth 2 -workers 8 -output "./my_downloads"
```

```bash
./h5ai_downloader -url "http://files.example.com/" -export-only -output backup_urls.txt
```

Create `sites.txt`:
```text
http://site1.com/files/ 3
http://site2.com/data/ 5
http://site3.com/docs/
```
Run:
```bash
./h5ai_downloader -file sites.txt -workers 6 -output "./downloads"
```

```bash
./h5ai_downloader -url "http://files.example.com/" -export-only -flat -output flat_urls.txt
```

The `-output` parameter has dual functionality:
- Download Mode (default): Specifies the output directory where files will be downloaded
  - Default: `./files`
  - Example: `-output "./my_downloads"` creates the directory structure under `my_downloads/`
- Export Mode (`-export-only`): Specifies the filename for the exported URL list
  - Default: `urls.txt`
  - Example: `-output "backup_urls.txt"` creates a file named `backup_urls.txt`
- When `flat=false` (default): Maintains the original directory structure from the server
- When `flat=true`: Downloads all files to the output directory root (no subdirectories)
- The Go version maintains compatibility with the Python version's cache and download tracking
- The default worker count is 4, but it can be adjusted to match your system and network capacity
- Export-only mode is useful for creating backup lists or processing URLs with external tools
- The flat option in export mode outputs just the URLs without directory structure information