⚠️ WARNING - USE AT YOUR OWN RISK ⚠️
This tool moves, copies, synchronizes, and can DELETE files across your entire file system.
- Test thoroughly in TEST MODE before using production mode
- Always maintain current backups before running in production
- Review your organizer_config.json carefully
- The --delete flag in rsync will remove files from target that don't exist in source
- No warranty is provided - see LICENSE file
YOU are responsible for any data loss. Use with caution.
The easiest way to use File Organizer is with the desktop app!
./manage_organizer.sh gui
The desktop app provides:
- Visual controls - Start/stop the organizer with simple buttons
- Mode selection - Choose test mode, production mode, daemon mode, etc. with checkboxes
- Live log viewer - See what's happening in real-time
- File tree browser - Browse your organized files visually
- Cross-platform - Works on Mac, Linux, and Windows
The desktop app gives you a friendly interface to:
- Test Mode - Safely test the organizer without making changes
- Production Mode - Run the full organizer with your real files
- Daemon Mode - Run continuously in the background
- Sync Only - Just synchronize folders between drives
- Dedupe Only - Just find and remove duplicate files
What it does: Safely experiments with sample files in the test/ folder
- Files affected: Only files in test/ directory (never touches your real files)
- Output: Creates test/organized/ with organized sample files
- When to use: Learning how the organizer works, testing new configurations
- Safety: Completely safe - your real files are never touched
What it does: Full production organization that runs continuously in the background
- Files affected: Your real files in configured source folders
- Output: Creates ~/organized/ with soft links to your organized files
- When to use: Production use - keep your files organized automatically
- Behavior: Runs forever, organizing new files as they appear
What it does: Full production organization that runs once then stops
- Files affected: Your real files in configured source folders
- Output: Creates ~/organized/ with soft links to your organized files
- When to use: One-time organization of existing files
- Behavior: Scans once, organizes everything, then exits
What it does: Bidirectionally synchronizes folder pairs (no file organization)
- Files affected: Copies files between source and target folders (bidirectional sync)
- Sync logic: If file in target is newer or missing in source → copy from target to source, otherwise copy from source to target
- Output: Files synchronized between folder pairs, no organization into categories
- When to use: Just want to backup/sync files between locations without organizing them
- Behavior: Fast bidirectional file synchronization without ML analysis or categorization
What it does: Finds and removes duplicate soft links in your organized folder
- Files affected: Only soft links in ~/organized/ (never touches original files)
- Output: Removes duplicate links, keeps one link per unique file
- When to use: Clean up after multiple runs, remove redundant organization
- Safety: Only removes soft links, your original files are never deleted
When the organizer runs multiple times, you might end up with:
~/organized/images/photo.jpg → ~/Pictures/IMG_001.jpg
~/organized/2024/photo.jpg → ~/Pictures/IMG_001.jpg (duplicate)
~/organized/backup/photo.jpg → ~/Pictures/IMG_001.jpg (duplicate)
Deduplicate mode removes the extra soft links, keeping only:
~/organized/images/photo.jpg → ~/Pictures/IMG_001.jpg (kept)
~/Pictures/IMG_001.jpg (original file untouched)
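The deduplication idea can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function name dedupe_links is hypothetical, and the rule for which link survives (the first one in sorted path order) is an assumption, so the kept link may differ from the example above.

```python
import os
from pathlib import Path


def dedupe_links(organized_root):
    """Remove extra soft links that point at the same original file.

    Hypothetical helper: keeps the first link (in sorted path order) for
    each resolved target and unlinks the rest. Only symlinks are removed.
    """
    seen = {}      # resolved target path -> link that was kept
    removed = []
    # sorted() materializes the listing before anything is unlinked
    for link in sorted(Path(organized_root).rglob("*")):
        if not link.is_symlink():
            continue  # never touch regular files or directories
        target = os.path.realpath(link)
        if target in seen:
            link.unlink()  # removes the link itself, not the target
            removed.append(link)
        else:
            seen[target] = link
    return removed
```

Calling dedupe_links on a copy of the organized folder first is a cheap way to preview what would be removed.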
- Test Mode → Try it out safely with sample files
- Real - Scan Once → Do a one-time organization of your files
- Real - Background Daemon → Keep it running for continuous organization
- Deduplicate → Clean up duplicates if needed
- Sync Only → Just back up files without organizing
- Python 3.13 (tested) or Python 3.8+
- pip package manager
Python packages:
pip install -r requirements.txt
System tools (required for OCR and video/PDF processing):
# macOS
brew install tesseract poppler
# Linux
apt-get install tesseract-ocr poppler-utils
# Start the desktop app
./manage_organizer.sh gui
The desktop app will open with a friendly interface where you can:
- Select Test Mode to safely experiment
- Choose Production Mode when ready for real files
- Pick Daemon Mode to run continuously
- Use Sync Only or Dedupe Only for specific tasks
If you want to use the desktop app with your real files, first create a configuration:
# Copy the template
cp organizer_config.template.json organizer_config.json
# Edit with your actual paths
nano organizer_config.json # or use your favorite editor
Edit the "drives" and "sync_pairs" sections:
{
  "drives": {
    "MAIN_DRIVE": "/Users/yourname",
    "EXTERNAL_DRIVE": "/Volumes/YourExternalDrive",
    "PROTON_DRIVE": "/Users/yourname/ProtonDrive",
    "GOOGLE_DRIVE": "/Users/yourname/GoogleDrive/MyFiles/"
  },
  "sync_pairs": [
    {
      "source": "MAIN_DRIVE/Pictures",
      "target": "EXTERNAL_DRIVE/Pictures"
    },
    {
      "source": "MAIN_DRIVE/Documents",
      "target": "GOOGLE_DRIVE/Documents"
    }
  ],
  "exclude_patterns": [
    ".git",
    "node_modules",
    "__pycache__"
  ]
}
Drive Placeholders: You can use drive names (like MAIN_DRIVE/Pictures) in sync_pairs. If a drive is not available, the program will skip it gracefully with a warning message.
Bidirectional Sync: Files are synced in both directions:
- If a file in target has a later date OR doesn't exist in source → copied from target to source
- Otherwise → copied from source to target
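For readers curious how a drive placeholder might be expanded, here is a small sketch. It assumes a placeholder is simply the first path component of a sync_pairs entry; the function name resolve_placeholder and the choice to return None for an unavailable drive are illustrative, not taken from the tool's source.

```python
import os


def resolve_placeholder(path, drives):
    """Expand a leading drive name such as MAIN_DRIVE/Pictures.

    Returns the expanded path, or None when the drive's base directory is
    not currently available, so the caller can skip the pair with a warning.
    """
    head, _, rest = path.partition("/")
    if head not in drives:
        return path  # a direct path, nothing to expand
    base = drives[head]
    if not os.path.isdir(base):
        return None  # drive not mounted or not reachable
    return os.path.join(base, rest)
```

A caller would resolve both source and target of each pair and skip the pair if either comes back as None.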
Note: The desktop app is recommended for most users. The command line is for advanced users who prefer terminal-based control.
# Single scan of test/ directory
python file_organizer.py --scan-once
# Check the results
ls -la test/organized/
# Single scan (safe, review first)
python file_organizer.py --REAL --scan-once
# Start daemon mode (runs continuously)
python file_organizer.py --REAL
# Background daemon commands
./manage_organizer.sh start # Start daemon
./manage_organizer.sh stop # Stop daemon
./manage_organizer.sh status # Check status
./manage_organizer.sh log # View logs
# Interactive commands
./manage_organizer.sh test # Test mode
./manage_organizer.sh test-real # Production mode
./manage_organizer.sh sync # Sync only
./manage_organizer.sh dedupe # Dedupe only
The organizer learns categories from your files.
- Analyzes filenames, folder names, and content
- Discovers patterns and creates categories automatically
- Only creates categories with enough matching files
- No irrelevant categories
Example:
- You have 15 files about "budget" → Creates /budget/ category
- You have 127 Python files → Creates /python/ category
- You have 0 files about "fishing" → No /fishing/ category
- Filename dates override file metadata
- 20240101-report.txt goes in /2024/ even if created in 2025
- Supports formats: YYYYMMDD, YYYY-MM-DD, YYYY, and even DD-MM-YY
- Falls back to file creation/modification dates
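The date rules above could be approximated as follows. This is a sketch under stated assumptions: the pattern order, the 1900-2100 sanity range, and the mapping of a two-digit YY to 20YY are all guesses for illustration, month and day values are not validated, and year_for is a hypothetical name.

```python
import re
from datetime import datetime
from pathlib import Path

# Patterns are tried in order; each returns the year it found.
# Mapping a two-digit year YY to 20YY is an assumption.
_PATTERNS = [
    (re.compile(r"(?<!\d)(\d{4})-\d{2}-\d{2}(?!\d)"), lambda m: int(m.group(1))),    # YYYY-MM-DD
    (re.compile(r"(?<!\d)(\d{4})\d{2}\d{2}(?!\d)"), lambda m: int(m.group(1))),      # YYYYMMDD
    (re.compile(r"(?<!\d)\d{2}-\d{2}-(\d{2})(?!\d)"), lambda m: 2000 + int(m.group(1))),  # DD-MM-YY
    (re.compile(r"(?<!\d)(\d{4})(?!\d)"), lambda m: int(m.group(1))),                # YYYY
]


def year_for(path):
    """Year folder for a file: filename date first, then modification time."""
    name = Path(path).name
    for pattern, to_year in _PATTERNS:
        match = pattern.search(name)
        if match:
            year = to_year(match)
            if 1900 <= year <= 2100:  # ignore digit runs that are not plausible years
                return year
    # No usable date in the name: fall back to file metadata.
    return datetime.fromtimestamp(Path(path).stat().st_mtime).year
```

With these rules, year_for("20240101-report.txt") gives 2024 regardless of when the file was created.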
- Test mode (default): Safe testing with auto-created test folders
- Production mode (--REAL): Your actual file system
- Easy to experiment without risk
Automatically organizes by file type:
- documents/ - .txt, .doc, .docx, .pdf, .rtf, .odt
- images/ - .jpg, .png, .gif, .bmp, .tiff, .webp
- videos/ - .mp4, .avi, .mov, .mkv, .wmv
- audio/ - .mp3, .wav, .flac, .aac, .ogg
- code/ - .py, .js, .html, .css, .java, .cpp
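The type folders listed above amount to a lookup from file extension to folder name. A minimal sketch, assuming case-insensitive matching on the final extension (the names TYPE_FOLDERS and type_folder are illustrative):

```python
from pathlib import Path

# Extension sets copied from the list above.
TYPE_FOLDERS = {
    "documents": {".txt", ".doc", ".docx", ".pdf", ".rtf", ".odt"},
    "images": {".jpg", ".png", ".gif", ".bmp", ".tiff", ".webp"},
    "videos": {".mp4", ".avi", ".mov", ".mkv", ".wmv"},
    "audio": {".mp3", ".wav", ".flac", ".aac", ".ogg"},
    "code": {".py", ".js", ".html", ".css", ".java", ".cpp"},
}


def type_folder(path):
    """Return the type folder for a file, or None for unlisted extensions."""
    suffix = Path(path).suffix.lower()
    for folder, extensions in TYPE_FOLDERS.items():
        if suffix in extensions:
            return folder
    return None
```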
Text Extraction & Content Analysis:
- .txt - Plain text ✅
- .docx - Microsoft Word (python-docx) ✅
- .rtf - Rich Text Format (striprtf) ✅
- .odt - OpenDocument Text (odfpy) ✅
- .doc - Old Word format (basic text extraction, limited) ⚠️
- .pdf - PDF files (PyPDF2) ✅
  - Scanned PDFs: Auto-detects and uses OCR if no text found ✨
- Images - Advanced OCR and object recognition ✨ NEW!
  - EasyOCR: Reads text in photos, artistic fonts, low contrast (better than Tesseract!)
  - Tesseract: Fallback OCR with multiple modes
  - CLIP (optional): Recognizes objects in photos (fish, people, buildings, etc.)
  - Formats: .jpg, .png, .gif, .bmp, .tiff, .webp
- Videos - Frame-by-frame OCR text extraction ✨
  - Formats: .mp4, .avi, .mov, .mkv, .wmv, .flv, .m4v
  - Samples frames every 10 seconds, extracts visible text
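Sampling a video every 10 seconds comes down to picking frame indices from the frame rate. The sketch below shows only that arithmetic. It assumes sampling starts at the first frame, and frame_indices is a hypothetical name; reading the frames and running OCR on them is left out.

```python
def frame_indices(duration_seconds, fps, interval_seconds=10):
    """Indices of the frames to run OCR on, one every interval_seconds."""
    if fps <= 0 or interval_seconds <= 0:
        raise ValueError("fps and interval_seconds must be positive")
    total_frames = int(duration_seconds * fps)
    # Frames between samples; at least 1 so the range always advances.
    step = max(1, round(interval_seconds * fps))
    return list(range(0, total_frames, step))
```

For a 35-second clip at 30 fps this selects four frames, at 0, 10, 20, and 30 seconds.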
System Requirements:
- tesseract OCR engine: brew install tesseract (macOS) or apt install tesseract-ocr (Linux)
- poppler-utils for PDF→image: brew install poppler (macOS) or apt install poppler-utils (Linux)
AI Features (Optional):
- EasyOCR: Enabled by default, works on CPU (slower than Tesseract but more accurate)
- CLIP: Disabled by default (very slow on CPU, ~30 min first load). Enable with "use_clip": true in config if you have a GPU
- Simple Configuration: Just specify folder pairs to sync - no complex drive mappings
- Bidirectional Sync: Automatically syncs files in both directions based on file dates
- Smart Logic: If file in target is newer or missing in source → copy from target to source, otherwise copy from source to target
- Concurrent Chunked Sync: Large folders sync in parallel chunks for faster progress
- Cloud Storage Optimized: Special flags for Google Drive, ProtonDrive, and other FUSE mounts
- Exclude Patterns: Skip .git, node_modules, and other patterns you don't want synced
- Robust Error Handling: Graceful handling of flaky drives and timeouts
After running in test mode:
test/
├── foo/
├── bar/
├── baz/
└── organized/
    ├── 2024/             # Files from 2024
    ├── 2025/             # Files from 2025
    ├── documents/        # All document files
    ├── images/           # All image files
    ├── davis/            # Discovered: Miles Davis (from music files)
    ├── ella/             # Discovered: Ella Fitzgerald
    ├── ella-fitzgerald/  # Discovered: Full name (bigram)
    ├── existence/        # Discovered: Philosophy content
    ├── fitzgerald/       # Discovered: From music collection
    ├── life/             # Discovered: Philosophy theme
    ├── miles/            # Discovered: Miles Davis
    ├── miles-davis/      # Discovered: Full name (bigram)
    ├── music/            # Discovered: Music-related content
    ├── notes/            # Discovered: Notes files
    ├── peggy/            # Discovered: Peggy Lee (OCR from image + text)
    ├── people/           # Discovered: CLIP vision detected people
    └── something/        # Discovered: From filenames
Each folder contains soft links to the actual files wherever they are. Categories are discovered automatically by analyzing file content, filenames, and even OCR text from images!
- Scans all your files
- Extracts keywords from filenames and folder names
- Extracts keywords from file content (if enabled)
- Counts keyword frequencies across all files
- Finds keywords that appear ≥ 3 times (configurable)
- Creates categories for significant keywords
- Only creates category if ≥ 5 files match (configurable)
- Keeps top 50 categories by frequency (configurable)
- Creates soft link folders for discovered categories
- Adds soft links to all matching files
- Saves discovered categories to JSON for review
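The counting and thresholding steps above can be sketched as below. This is an illustration only, not the organizer's implementation: it looks at filenames and the immediate parent folder name, skips content analysis and stop words entirely, and uses a hypothetical function name. The default thresholds follow the numbers in the list (3, 5, 50); min_word_length=4 is an assumption.

```python
import re
from collections import defaultdict
from pathlib import Path


def discover_categories(paths, min_keyword_frequency=3, min_category_size=5,
                        max_categories=50, min_word_length=4):
    """Map each discovered keyword to the sorted list of files that mention it."""
    occurrences = defaultdict(int)        # keyword -> total occurrences
    files_by_keyword = defaultdict(set)   # keyword -> files containing it
    for path in paths:
        p = Path(path)
        text = f"{p.parent.name} {p.stem}".lower()
        for word in re.findall(r"[a-z]+", text):
            if len(word) >= min_word_length:
                occurrences[word] += 1
                files_by_keyword[word].add(str(p))
    # Keep keywords that are frequent enough and match enough files.
    kept = [
        word for word in occurrences
        if occurrences[word] >= min_keyword_frequency
        and len(files_by_keyword[word]) >= min_category_size
    ]
    # Most frequent first; ties broken alphabetically for stable output.
    kept.sort(key=lambda w: (-occurrences[w], w))
    return {w: sorted(files_by_keyword[w]) for w in kept[:max_categories]}
```

Five files named like 2024-budget-0.xlsx in a budget/ folder would yield a single "budget" category, while a lone fishing-trip.txt would produce none.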
After running:
cat ~/.file_organizer_discovered_categories.json
Example output:
{
  "python": {
    "file_count": 127,
    "sample_files": ["/path/to/script.py", ...]
  },
  "budget": {
    "file_count": 15,
    "sample_files": ["/path/to/2024-budget.xlsx", ...]
  }
}
Edit organizer_config.json:
{
  "sync_pairs": [
    {
      "source": "/Users/yourname/dev",
      "target": "/Volumes/ExternalDrive/dev"
    },
    {
      "source": "/Users/yourname/Documents",
      "target": "/Users/yourname/GoogleDrive/Documents"
    }
  ],
  "exclude_patterns": [
    ".git",
    "node_modules",
    "__pycache__",
    ".DS_Store",
    "*.pyc",
    ".venv",
    "venv"
  ],
  "output_base": "~/organized",
  "enable_content_analysis": true,
  "enable_duplicate_detection": false,
  "enable_folder_sync": true
}
The sync logic is simple and smart:
- For each file in both folders:
  - If file exists only in target → copy to source
  - If file exists only in source → copy to target
  - If file exists in both:
    - If target is newer → copy from target to source
    - Otherwise → copy from source to target
- Exclude patterns (like .git, node_modules) are skipped automatically
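The decision rules above can be written out as a small function. This is a sketch for illustration: "newer" is taken to mean a later modification time, the pattern matching in is_excluded (each path component checked against each pattern with fnmatch) is an assumption about how exclude_patterns behave, and both function names are hypothetical.

```python
import fnmatch
import os
from pathlib import PurePath


def sync_direction(source_file, target_file):
    """Return "target_to_source" or "source_to_target" for one file pair."""
    source_exists = os.path.exists(source_file)
    target_exists = os.path.exists(target_file)
    if not source_exists and not target_exists:
        raise FileNotFoundError("file is missing on both sides")
    if not source_exists:
        return "target_to_source"
    if not target_exists:
        return "source_to_target"
    # Both exist: the newer target wins, otherwise the source is copied over.
    if os.path.getmtime(target_file) > os.path.getmtime(source_file):
        return "target_to_source"
    return "source_to_target"


def is_excluded(relative_path, patterns):
    """True if any component of the path matches any exclude pattern."""
    parts = PurePath(relative_path).parts
    return any(fnmatch.fnmatch(part, pattern)
               for part in parts for pattern in patterns)
```

Note that files with identical timestamps fall into the "otherwise" branch, so the source copy is treated as authoritative in a tie.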
{
  "ml_content_analysis": {
    "enabled": true,
    "min_keyword_frequency": 8,
    "min_category_size": 5,
    "max_categories": 250,
    "min_word_length": 5,
    "stop_words_enabled": true
  },
  "use_rsync": true,
  "rsync_checksum_mode": "timestamp",
  "rsync_size_only": false,
  "rsync_additional_args": [
    "--omit-dir-times",
    "--no-perms",
    "--no-group",
    "--no-owner"
  ],
  "sync_chunk_subfolders": 30,
  "sync_chunk_concurrency": 1,
  "sync_timeout_minutes": 60
}
- drives: Optional drive shortcuts (e.g., "MAIN_DRIVE": "/Users/yourname") - use these in sync_pairs for convenience
- sync_pairs: List of folder pairs to synchronize (bidirectional) - can use drive placeholders or direct paths
- exclude_patterns: Patterns to skip during sync (e.g., .git, node_modules)
- output_base: Where to create soft link folders for organized files
- enable_content_analysis: Enable/disable ML content discovery
- enable_folder_sync: Enable/disable folder synchronization
- ml_content_analysis: Tune category discovery thresholds
- use_rsync: Use rsync for fast folder synchronization (falls back to Python if unavailable)
- rsync_checksum_mode: "timestamp" (fast) or "checksum" (thorough but slow)
- sync_chunk_concurrency: Number of parallel sync operations
- sync_timeout_minutes: Timeout for large folder syncs
If too many categories:
"ml_content_analysis": {
  "min_keyword_frequency": 5, // Increase
  "min_category_size": 10, // Increase
  "max_categories": 30 // Decrease
}
If too few categories:
"ml_content_analysis": {
  "min_keyword_frequency": 2, // Decrease
  "min_category_size": 3, // Decrease
  "max_categories": 250, // Increase (current default)
  "min_word_length": 4 // Allow shorter words
}
# Create test environment and run
python file_organizer.py --create-test
python file_organizer.py --scan-once
# Or just run (auto-creates test if needed)
python file_organizer.py --scan-once
# Check results
ls -la test/organized/
# Single scan (safe, review first)
python file_organizer.py --REAL --scan-once
# Review results
tail -100 ~/.file_organizer.log
# Start daemon (runs continuously)
python file_organizer.py --REAL
# Advanced: specific operations
python file_organizer.py --REAL --sync-only # Only sync folders
python file_organizer.py --REAL --dedupe-only # Only remove duplicates
python file_organizer.py [OPTIONS]
Options:
  -R, --REAL      Run in PRODUCTION mode (default: TEST mode)
  --scan-once     Run single scan instead of daemon
  --create-test   Create test environment and exit
  --sync-only     Only synchronize folders (production mode)
  --dedupe-only   Only remove duplicates (production mode)
  --config PATH   Custom config file path
All activity is logged to ~/.file_organizer.log
Monitor in real-time:
tail -f ~/.file_organizer.log
Check for errors:
grep ERROR ~/.file_organizer.log
- Soft Links Only - Original files never moved or modified
- Test Mode - Safe experimentation with isolated test folders
- Git Tracking - Optional version control for all changes
- Comprehensive Logging - Know exactly what happened
- Exclude Folders - Prevent recursion and protect system folders
- Graceful Error Handling - Continues on individual file failures
# Check the log
tail -100 ~/.file_organizer.log
# Look for errors
grep ERROR ~/.file_organizer.log
Some directories may require elevated permissions:
sudo python file_organizer.py --REAL --scan-once
Adjust thresholds in organizer_config.json under ml_content_analysis:
- Increase min_keyword_frequency for fewer categories
- Decrease min_category_size for more categories
- Adjust max_categories to limit total
If too few categories are discovered:
- Check that enable_content_analysis is true
- Lower min_keyword_frequency and min_category_size
- Verify files have readable content
- Check log for content analysis errors
$ python file_organizer.py --scan-once
======================================================================
TEST MODE - Operating on test/ directory
======================================================================
$ ls test/organized/
2024/ 2025/ documents/ images/ insurance/ fishing/
$ python file_organizer.py --REAL --scan-once
======================================================================
PRODUCTION MODE - Operating on your entire file system
======================================================================
$ cat ~/.file_organizer_discovered_categories.json
{
  "python": {"file_count": 127},
  "javascript": {"file_count": 85},
  "budget": {"file_count": 15}
}
$ python file_organizer.py --REAL &
$ tail -f ~/.file_organizer.log
Essential Steps:
- ✅ Run in test mode first
- ✅ Review discovered categories
- ✅ Configure paths for your system
- ✅ Start with a small subset of folders
- ✅ Have backups of critical data
- ✅ Monitor logs for first few cycles
- ✅ Understand how to stop it (Ctrl+C or kill)
Keep two folders in sync with optimized rsync:
"enable_folder_sync": true,
"use_rsync": true,
"rsync_checksum_mode": "timestamp",
"rsync_size_only": true,
"rsync_additional_args": [
  "--omit-dir-times",
  "--no-perms",
  "--no-group",
  "--no-owner",
  "--delete-after"
],
"sync_pairs": [
  {
    "source": "/source/folder",
    "target": "/target/folder"
  }
],
"sync_chunk_subfolders": 10,
"sync_chunk_concurrency": 3,
"sync_timeout_minutes": 180
Performance Notes:
- rsync_size_only: true - Much faster for cloud storage (Google Drive, ProtonDrive)
- rsync_additional_args - Reduces FUSE metadata overhead
- sync_chunk_concurrency: 3 - Sync multiple subfolders in parallel
- sync_timeout_minutes: 180 - 3-hour timeout for large folders
Find and remove duplicates (keeps newest):
"enable_duplicate_detection": true
Track all changes with Git:
"enable_git_tracking": true,
"git_user": "Your Name",
"git_email": "[email protected]"
Background backup with retry logic and cloud storage support:
"enable_background_backup": true,
"backup_drive_path": "/path/to/backupdrive",
"backup_directories": [
  "/path/to/important/folder"
]
Supported Backup Targets:
- External drives (USB, Thunderbolt)
- Cloud storage mounts (Google Drive, ProtonDrive, Dropbox)
- Network drives (SMB, NFS)
- Any mounted filesystem
- No Hardcoded Categories - Learns from YOUR files
- Smart Date Handling - Filename dates override metadata
- Test & Production Modes - Safe experimentation
- Cloud Storage Optimized - Special handling for Google Drive, ProtonDrive, etc.
- Concurrent Sync - Large folders sync in parallel chunks
- ML-Powered - Discovers patterns in your files
- Comprehensive - Handles edge cases gracefully
- Portable - Works anywhere, no hardcoded paths
Before asking for help:
- Use the desktop app - It shows logs and status visually
- Check the logs: tail -100 ~/.file_organizer.log or use the desktop app's log viewer
- Review configuration: cat organizer_config.json
- Test mode first: Use "Test Mode" in the desktop app or python file_organizer.py --scan-once
- Verify paths are absolute and exist
MIT License - Free to use and modify
- Install dependencies (pip install -r requirements.txt)
- Run test mode (python file_organizer.py --scan-once)
- Review test results (ls test/organized/)
- Check discovered categories (cat ~/.file_organizer_discovered_categories.json)
- Edit config for your system (organizer_config.json)
- Test with one small folder first
- Review logs for errors (tail ~/.file_organizer.log)
- Gradually expand to more folders
- Consider enabling advanced features
Remember: Start with test mode, then start small in production mode!
# Safe way to start
python file_organizer.py --scan-once # Test mode
python file_organizer.py --REAL --scan-once # Production (review first!)
Your files, your categories, your way.