Added missing requirement in pyproject, added shorthand flags, updated README

annejan · annejan · commit 3a8f1229ad8a · 2025-06-13T20:58:54.000+02:00
diff --git a/README.md b/README.md
@@ -1,119 +1,118 @@
 # PDF Metadata Scanner
 
-A Python tool to recursively scan a folder for PDF files and extract:
+A command-line tool to recursively scan folders for PDF files and extract:
 
 - PDF metadata (Info dictionary via `pikepdf`)
 - XMP and RDF metadata
-- Metadata from embedded images (JPEG, PNG, TIFF — EXIF, text, and other supported fields)
+- Embedded image metadata (JPEG, PNG, TIFF — EXIF, text, and other supported fields)
 
-### 🛠 Features
+---
+
+## 🛠 Features
 
-- Recursive folder scanning
-- Clean separation of logs (warnings/errors) and metadata output
-- Supports multiple image formats (via Pillow)
-- Handles XMP/RDF and embedded image metadata
+- 🔍 Recursive folder scanning
+- 🧼 Clean separation of metadata output and error/warning logs
+- 🖼 Embedded image metadata support via Pillow (JPEG, PNG, TIFF)
+- 📑 XMP/RDF metadata parsing
+- ⚙️ Optional progress bar and verbose logging
 
 ---
 
-## ✅ Requirements
+## 📦 Installation
+
+Install directly from PyPI:
 
 ```bash
-pip install -r requirements.txt 
+pip install pdf-metadata-scanner
 ````
 
+Or from source:
+
+```bash
+git clone https://github.com/yourname/pdf-metadata-scanner.git
+cd pdf-metadata-scanner
+pip install .
+```
+
 ---
 
 ## 🚀 Usage
 
+After installing:
+
 ```bash
-python scanner.py <folder> [--log LOG_FILE] [--out OUTPUT_FILE] [--verbose] [--progress]
+pdfscan <folder> [--log LOG_FILE] [--out OUTPUT_FILE] [--verbose] [--progress]
 ```
 
-Or if you want to install:
+If running from source without installation:
 
 ```bash
-pip install .
-pdfscan <folder> [--log LOG_FILE] [--out OUTPUT_FILE] [--verbose] [--progress]
-
+python scanner.py <folder> [--log LOG_FILE] [--out OUTPUT_FILE] [--verbose] [--progress]
 ```
 
 ### Arguments:
 
-| Flag         | Description                                  | Default                   |
-| ------------ | -------------------------------------------- | ------------------------- |
-| `folder`     | Folder to recursively scan for PDFs          | *required*                |
-| `--log`      | Log file for warnings/errors                 | `scanner_warnings.log`    |
-| `--out`      | Output file for extracted metadata           | `pdf_metadata_output.txt` |
-| `--verbose`  | Output logs to both file and console         |                           |
-| `--progress` | Show a live progress bar while scanning PDFs |                           |
-
+| Flag                | Shorthand | Description                                  | Default                   |
+| ------------------- | --------- | -------------------------------------------- | ------------------------- |
+| `folder`            |           | Folder to recursively scan for PDFs          | *required*                |
+| `--log LOG_FILE`    | `-l`      | Log file for warnings/errors                 | `scanner.log`             |
+| `--out OUTPUT_FILE` | `-o`      | Output file for extracted metadata           | `pdf_metadata_output.txt` |
+| `--verbose`         | `-v`      | Output logs to both file and console         | *(off)*                   |
+| `--progress`        | `-p`      | Show a live progress bar while scanning PDFs | *(off)*                   |
 
 ---
 
 ## 🧾 Example
 
 ```bash
-python scanner.py ./documents --log logs.txt --out metadata.txt
+pdfscan ./documents --log logs.txt --out metadata.txt --verbose --progress
 ```
 
 * `logs.txt`: Contains only errors or warnings.
 * `metadata.txt`: Contains all extracted metadata.
 
 ---
 
-## 📦 Output Structure
-
-Metadata output (`--out`) includes:
+## 📄 Output Format
 
 ```
-[PDF Metadata] ...
-    /Author: John Doe
-    /Title: Sample
-[XMP Metadata] ...
-[Image Metadata] ...
-    306: 2023:12:31 12:34:56
-    dpi: (300, 300)
+[PDF Metadata] test.pdf
+    /Author: Jane Doe
+    /Title: Example Document
+
+[XMP Metadata] test.pdf
+    <dc:title>Example</dc:title>
+    <dc:creator>Jane Doe</dc:creator>
+
+[Image Metadata] test.pdf - Page 1 - Im0
+    DateTimeOriginal: 2024:01:01 12:00:00
+    DPI: (300, 300)
 ```
 
 ---
 
-## 🧪 Unit Testing
+## 🧪 Testing
 
-This project includes unit tests to ensure core functionality works correctly.
+This project includes unit tests for core functionality.
 
-### Running Tests
-
-Make sure you have `unittest` (comes with Python standard library) and the required dependencies installed:
+### Run tests:
 
 ```bash
 pip install -r requirements-dev.txt
-````
-
-To run the tests, execute:
-
-```bash
 python -m unittest test_scanner.py
 ```
 
-### What is Tested?
-
-* Extraction of PDF metadata using mocked PDF files
-* Parsing of XMP and RDF metadata
-* Extraction of image metadata from embedded images (JPEG, PNG)
-* Proper handling of non-image PDF objects
-
-### Adding Tests
-
-Feel free to add more tests in `test_scanner.py` for new features or edge cases.
+---
 
 ## 🔒 Notes
 
-* Some image formats in PDFs (e.g. CCITT, JBIG2) are skipped due to incompatibility.
-* PNG metadata (text fields) and EXIF from JPEG/TIFF are both supported.
-* This tool does not modify the PDFs — it only reads metadata.
+* Some image formats (e.g. CCITT, JBIG2) are skipped due to decoding limitations.
+* PNG and JPEG/TIFF metadata is extracted where available.
+* The tool is read-only — it does **not** modify PDFs.
 
 ---
 
-## 📃 License
+## 🧾 License
 
 MIT License
+
diff --git a/publiccode.yml b/publiccode.yml
@@ -4,7 +4,7 @@ name: PDF Metadata Scanner
 applicationSuite: ''
 url: 'https://github.com/annejan/pdf-metadata-scanner'
 landingURL: 'https://github.com/annejan/pdf-metadata-scanner'
-softwareVersion: '0.1.1'
+softwareVersion: '0.1.2'
 releaseDate: '2025-06-13'
 
 developmentStatus: development
diff --git a/pyproject.toml b/pyproject.toml
@@ -1,6 +1,6 @@
 [project]
 name = "pdf-metadata-scanner"
-version = "0.1.1"
+version = "0.1.2"
 description = "A CLI tool for extracting and analyzing metadata from PDFs, including embedded images and XMP/RDF metadata."
 readme = "README.md"
 requires-python = ">=3.11"
@@ -11,8 +11,10 @@ authors = [
 dependencies = [
   "pikepdf",
   "pypdf",
-  "Pillow"
+  "Pillow",
+  'tqdm'
 ]
+keywords = ["PDF", "metadata", "scanner", "EXIF", "XMP"]
 
 [project.scripts]
 pdfscan = "scanner:main"
diff --git a/scanner.py b/scanner.py
@@ -133,11 +133,11 @@ def scan_folder(folder, out, show_progress=False):
 
 def main():
     parser = argparse.ArgumentParser(description="Extract metadata from PDFs.")
-    parser.add_argument("folder", help="Folder to scan recursively")
-    parser.add_argument("--log", default="scanner.log", help="Log file path (warnings/errors)")
-    parser.add_argument("--out", default="pdf_metadata_output.txt", help="Output file for metadata")
-    parser.add_argument("--verbose", action="store_true", help="Enable verbose logging to console")
-    parser.add_argument("--progress", action="store_true", help="Show progress while scanning PDFs")
+    parser.add_argument("folder", help="folder to scan recursively")
+    parser.add_argument("-l", "--log", default="scanner.log", help="log file path (warnings/errors)")
+    parser.add_argument("-o", "--out", default="pdf_metadata_output.txt", help="output file for metadata")
+    parser.add_argument("-v", "--verbose", action="store_true", help="output logs to both file and console")
+    parser.add_argument("-p", "--progress", action="store_true", help="show a live progress bar while scanning PDFs")
     args = parser.parse_args()
 
     setup_logger(args.log, verbose=args.verbose)
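
Reviewer note: the shorthand flags added above can be checked in isolation. The following is a minimal sketch that rebuilds only the argument parser from this hunk; `build_parser` is a hypothetical helper name used for illustration and is not part of `scanner.py`, and the rest of `main()` is omitted.

```python
import argparse


def build_parser():
    """Hypothetical helper: rebuilds only the parser shown in the hunk above."""
    parser = argparse.ArgumentParser(description="Extract metadata from PDFs.")
    parser.add_argument("folder", help="folder to scan recursively")
    parser.add_argument("-l", "--log", default="scanner.log", help="log file path (warnings/errors)")
    parser.add_argument("-o", "--out", default="pdf_metadata_output.txt", help="output file for metadata")
    parser.add_argument("-v", "--verbose", action="store_true", help="output logs to both file and console")
    parser.add_argument("-p", "--progress", action="store_true", help="show a live progress bar while scanning PDFs")
    return parser


# Short and long spellings should parse to the same namespace.
short_args = build_parser().parse_args(
    ["./documents", "-l", "logs.txt", "-o", "metadata.txt", "-v", "-p"]
)
long_args = build_parser().parse_args(
    ["./documents", "--log", "logs.txt", "--out", "metadata.txt", "--verbose", "--progress"]
)
assert vars(short_args) == vars(long_args)

# argparse also accepts clustered boolean short flags such as -vp.
combined = build_parser().parse_args(["./documents", "-vp"])
assert combined.verbose and combined.progress

# With no --log given, the default comes from the code, not the README.
assert combined.log == "scanner.log"
```

Since `-v` and `-p` are `store_true` flags, they can be combined as `-vp`, whereas `-l` and `-o` each expect a value to follow.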