Commit 3a8f122

Added missing requirement in pyproject, added shorthand flags, updated README
1 parent 09dc779 commit 3a8f122

4 files changed, +66 -65 lines changed

README.md

Lines changed: 56 additions & 57 deletions
@@ -1,119 +1,118 @@
 # PDF Metadata Scanner
 
-A Python tool to recursively scan a folder for PDF files and extract:
+A command-line tool to recursively scan folders for PDF files and extract:
 
 - PDF metadata (Info dictionary via `pikepdf`)
 - XMP and RDF metadata
-- Metadata from embedded images (JPEG, PNG, TIFF — EXIF, text, and other supported fields)
+- Embedded image metadata (JPEG, PNG, TIFF — EXIF, text, and other supported fields)
 
-### 🛠 Features
+---
+
+## 🛠 Features
 
-- Recursive folder scanning
-- Clean separation of logs (warnings/errors) and metadata output
-- Supports multiple image formats (via Pillow)
-- Handles XMP/RDF and embedded image metadata
+- 🔍 Recursive folder scanning
+- 🧼 Clean separation of metadata output and error/warning logs
+- 🖼 Embedded image metadata support via Pillow (JPEG, PNG, TIFF)
+- 📑 XMP/RDF metadata parsing
+- ⚙️ Optional progress bar and verbose logging
 
 ---
 
-## ✅ Requirements
+## 📦 Installation
+
+Install directly from PyPI:
 
 ```bash
-pip install -r requirements.txt
+pip install pdf-metadata-scanner
 ````
 
+Or from source:
+
+```bash
+git clone https://github.com/yourname/pdf-metadata-scanner.git
+cd pdf-metadata-scanner
+pip install .
+```
+
 ---
 
 ## 🚀 Usage
 
+After installing:
+
 ```bash
-python scanner.py <folder> [--log LOG_FILE] [--out OUTPUT_FILE] [--verbose] [--progress]
+pdfscan <folder> [--log LOG_FILE] [--out OUTPUT_FILE] [--verbose] [--progress]
 ```
 
-Or if you want to install:
+If running from source without installation:
 
 ```bash
-pip install .
-pdfscan <folder> [--log LOG_FILE] [--out OUTPUT_FILE] [--verbose] [--progress]
-
+python scanner.py <folder> [--log LOG_FILE] [--out OUTPUT_FILE] [--verbose] [--progress]
 ```
 
 ### Arguments:
 
-| Flag | Description | Default |
-| ------------ | -------------------------------------------- | ------------------------- |
-| `folder` | Folder to recursively scan for PDFs | *required* |
-| `--log` | Log file for warnings/errors | `scanner_warnings.log` |
-| `--out` | Output file for extracted metadata | `pdf_metadata_output.txt` |
-| `--verbose` | Output logs to both file and console | |
-| `--progress` | Show a live progress bar while scanning PDFs | |
-
+| Flag | Shorthand | Description | Default |
+| ------------------- | --------- | -------------------------------------------- | ------------------------- |
+| `folder` | | Folder to recursively scan for PDFs | *required* |
+| `--log LOG_FILE` | `-l` | Log file for warnings/errors | `scanner_warnings.log` |
+| `--out OUTPUT_FILE` | `-o` | Output file for extracted metadata | `pdf_metadata_output.txt` |
+| `--verbose` | `-v` | Output logs to both file and console | *(off)* |
+| `--progress` | `-p` | Show a live progress bar while scanning PDFs | *(off)* |
 
 ---
 
 ## 🧾 Example
 
 ```bash
-python scanner.py ./documents --log logs.txt --out metadata.txt
+pdfscan ./documents --log logs.txt --out metadata.txt --verbose --progress
 ```
 
 * `logs.txt`: Contains only errors or warnings.
 * `metadata.txt`: Contains all extracted metadata.
 
 ---
 
-## 📦 Output Structure
-
-Metadata output (`--out`) includes:
+## 📄 Output Format
 
 ```
-[PDF Metadata] ...
-/Author: John Doe
-/Title: Sample
-[XMP Metadata] ...
-[Image Metadata] ...
-306: 2023:12:31 12:34:56
-dpi: (300, 300)
+[PDF Metadata] test.pdf
+/Author: Jane Doe
+/Title: Example Document
+
+[XMP Metadata] test.pdf
+<dc:title>Example</dc:title>
+<dc:creator>Jane Doe</dc:creator>
+
+[Image Metadata] test.pdf - Page 1 - Im0
+DateTimeOriginal: 2024:01:01 12:00:00
+DPI: (300, 300)
 ```
 
 ---
 
-## 🧪 Unit Testing
+## 🧪 Testing
 
-This project includes unit tests to ensure core functionality works correctly.
+This project includes unit tests for core functionality.
 
-### Running Tests
-
-Make sure you have `unittest` (comes with Python standard library) and the required dependencies installed:
+### Run tests:
 
 ```bash
 pip install -r requirements-dev.txt
-````
-
-To run the tests, execute:
-
-```bash
 python -m unittest test_scanner.py
 ```
 
-### What is Tested?
-
-* Extraction of PDF metadata using mocked PDF files
-* Parsing of XMP and RDF metadata
-* Extraction of image metadata from embedded images (JPEG, PNG)
-* Proper handling of non-image PDF objects
-
-### Adding Tests
-
-Feel free to add more tests in `test_scanner.py` for new features or edge cases.
+---
 
 ## 🔒 Notes
 
-* Some image formats in PDFs (e.g. CCITT, JBIG2) are skipped due to incompatibility.
-* PNG metadata (text fields) and EXIF from JPEG/TIFF are both supported.
-* This tool does not modify the PDFs — it only reads metadata.
+* Some image formats (e.g. CCITT, JBIG2) are skipped due to decoding limitations.
+* PNG and JPEG/TIFF metadata is extracted where available.
+* The tool is read-only — it does **not** modify PDFs.
 
 ---
 
-## 📃 License
+## 🧾 License
 
 MIT License
+

publiccode.yml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ name: PDF Metadata Scanner
 applicationSuite: ''
 url: 'https://github.com/annejan/pdf-metadata-scanner'
 landingURL: 'https://github.com/annejan/pdf-metadata-scanner'
-softwareVersion: '0.1.1'
+softwareVersion: '0.1.2'
 releaseDate: '2025-06-13'
 
 developmentStatus: development

pyproject.toml

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
[project]
22
name = "pdf-metadata-scanner"
3-
version = "0.1.1"
3+
version = "0.1.2"
44
description = "A CLI tool for extracting and analyzing metadata from PDFs, including embedded images and XMP/RDF metadata."
55
readme = "README.md"
66
requires-python = ">=3.11"
@@ -11,8 +11,10 @@ authors = [
1111
dependencies = [
1212
"pikepdf",
1313
"pypdf",
14-
"Pillow"
14+
"Pillow",
15+
'tqdm'
1516
]
17+
keywords = ["PDF", "metadata", "scanner", "EXIF", "XMP"]
1618

1719
[project.scripts]
1820
pdfscan = "scanner:main"
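
The `[project.scripts]` entry maps the `pdfscan` command to a `module:callable` reference. As a rough sketch of what that reference means, the snippet below imports the module part and looks up the callable part. `resolve_entry_point` is a hypothetical helper written for illustration, not part of this repository or of any packaging tool, and it assumes the spec always has the `module:attr` form. It is demonstrated on a standard-library target because `scanner.py` is not importable outside the project.

```python
import importlib


def resolve_entry_point(spec):
    """Hypothetical helper: turn a "module:attr" spec into the object it names."""
    module_name, _, attr_path = spec.partition(":")
    target = importlib.import_module(module_name)
    # Walk dotted attribute paths such as "Class.method" one step at a time.
    for part in attr_path.split("."):
        target = getattr(target, part)
    return target


# Stand-in for "scanner:main": resolve a callable from the standard library.
dumps = resolve_entry_point("json:dumps")
print(dumps({"ok": True}))
```

With the project installed, the generated `pdfscan` command does the equivalent for `scanner:main` and then calls the result.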

scanner.py

Lines changed: 5 additions & 5 deletions
@@ -133,11 +133,11 @@ def scan_folder(folder, out, show_progress=False):
 
 def main():
     parser = argparse.ArgumentParser(description="Extract metadata from PDFs.")
-    parser.add_argument("folder", help="Folder to scan recursively")
-    parser.add_argument("--log", default="scanner.log", help="Log file path (warnings/errors)")
-    parser.add_argument("--out", default="pdf_metadata_output.txt", help="Output file for metadata")
-    parser.add_argument("--verbose", action="store_true", help="Enable verbose logging to console")
-    parser.add_argument("--progress", action="store_true", help="Show progress while scanning PDFs")
+    parser.add_argument("folder", help="folder to scan recursively")
+    parser.add_argument("-l", "--log", default="scanner.log", help="log file path (warnings/errors)")
+    parser.add_argument("-o", "--out", default="pdf_metadata_output.txt", help="output file for metadata")
+    parser.add_argument("-v", "--verbose", action="store_true", help="output logs to both file and console")
+    parser.add_argument("-p", "--progress", action="store_true", help="show a live progress bar while scanning PDFs")
     args = parser.parse_args()
 
     setup_logger(args.log, verbose=args.verbose)
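
The shorthand flags added in this hunk can be exercised in isolation with a minimal sketch that uses only the standard-library `argparse`. `build_parser` is a hypothetical helper introduced here for illustration; in the diff the parser is built inside `main()`. Flag names, defaults, and help strings are copied from the added lines.

```python
import argparse


def build_parser():
    # Hypothetical helper; scanner.py itself constructs the parser inside main().
    parser = argparse.ArgumentParser(description="Extract metadata from PDFs.")
    parser.add_argument("folder", help="folder to scan recursively")
    parser.add_argument("-l", "--log", default="scanner.log", help="log file path (warnings/errors)")
    parser.add_argument("-o", "--out", default="pdf_metadata_output.txt", help="output file for metadata")
    parser.add_argument("-v", "--verbose", action="store_true", help="output logs to both file and console")
    parser.add_argument("-p", "--progress", action="store_true", help="show a live progress bar while scanning PDFs")
    return parser


# Short and long spellings fill the same attributes on the parsed namespace;
# argparse also lets boolean short flags be combined, as in "-vp".
short = build_parser().parse_args(["./documents", "-l", "logs.txt", "-vp"])
long = build_parser().parse_args(["./documents", "--log", "logs.txt", "--verbose", "--progress"])
assert vars(short) == vars(long)
```

Because the long option is listed alongside the short one, the destination attribute keeps the long name (`args.log`, `args.out`), so the rest of `main()` needs no changes.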
