Merged
20 commits
29d1cb0
Move src/ocr/marker to packages/ocr/marker
p-j-smith Oct 7, 2025
bf2ae89
Make packages/ocr/marker an installable package and a workspace member
p-j-smith Oct 7, 2025
2052e10
Update marker docker file to use uv base image
p-j-smith Oct 7, 2025
12f5d8a
Use HOST_DATE env var for marker volumes
p-j-smith Oct 7, 2025
29ffbb4
Don't extract images with Marker
p-j-smith Oct 7, 2025
acc43b3
Update instructions for running inference with marker
p-j-smith Oct 7, 2025
725b1bb
Use aiohttp for async calls to the marker api
p-j-smith Oct 7, 2025
50187df
Move pyonb_marker files into src/ directory
p-j-smith Oct 7, 2025
2e79c27
Move src/ocr/paddleocr to pacakges/ocr/paddleocr
p-j-smith Oct 7, 2025
54c5217
Make packages/ocr/paddleocr an installable package and a workspace me…
p-j-smith Oct 7, 2025
5cf7c0e
Restructure pyonb_paddleocr api to be in line with marker and docling
p-j-smith Oct 7, 2025
e28cb49
Remove unused dependencies from paddleocr pyproject.toml
p-j-smith Oct 7, 2025
79482ef
Update paddle docker file to use uv base image
p-j-smith Oct 7, 2025
4d9e951
Use HOST_DATA env var for paddleocr volumes
p-j-smith Oct 7, 2025
d003a46
move paddleocr_pyonb to pyonb_paddleocr
p-j-smith Oct 7, 2025
bf9a5df
Use aiohttp for async calls to the paddleocr api
p-j-smith Oct 7, 2025
301f4e7
Use correct context for paddleocr docker service
p-j-smith Oct 7, 2025
3439371
Add README with instructions on running paddleocr
p-j-smith Oct 7, 2025
91dc728
Ensure all necessary dependencies are installed in paddleocr Dockerfile
p-j-smith Oct 7, 2025
ac7303b
Fix typo in marker README
p-j-smith Oct 7, 2025
12 changes: 6 additions & 6 deletions docker-compose.yml
@@ -50,21 +50,21 @@ services:
marker:
profiles: [marker]
build:
context: src/ocr/marker
context: packages/ocr/marker
dockerfile: Dockerfile
args:
<<: *build-args-common
MARKER_API_PORT: ${MARKER_API_PORT}
environment:
<<: [*proxy-common, *common-env]
CONTAINER_DATA_FOLDER: /data
DATA_FOLDER: /data
MARKER_API_PORT: ${MARKER_API_PORT}
env_file:
- ./.env
ports:
- "${MARKER_API_PORT}:${MARKER_API_PORT}"
volumes:
- ${HOST_DATA_FOLDER}:${CONTAINER_DATA_FOLDER:-/data}
- ${PWD}/${DATA_FOLDER}:/data
networks:
- pyonb_ocr_api
healthcheck:
@@ -84,21 +84,21 @@ services:
paddleocr:
profiles: [paddleocr]
build:
context: src/ocr/paddleocr
context: packages/ocr/paddleocr
dockerfile: Dockerfile
args:
<<: *build-args-common
PADDLEOCR_API_PORT: ${PADDLEOCR_API_PORT}
environment:
<<: [*proxy-common, *common-env]
CONTAINER_DATA_FOLDER: /data
DATA_FOLDER: /data
PADDLEOCR_API_PORT: ${PADDLEOCR_API_PORT}
env_file:
- ./.env
ports:
- "${PADDLEOCR_API_PORT}:${PADDLEOCR_API_PORT}"
volumes:
- ${HOST_DATA_FOLDER}:${CONTAINER_DATA_FOLDER:-/data}
- ${PWD}/${DATA_FOLDER}:/data
networks:
- pyonb_ocr_api
healthcheck:
19 changes: 19 additions & 0 deletions packages/ocr/marker/Dockerfile
@@ -0,0 +1,19 @@
FROM ghcr.io/astral-sh/uv:python3.13-bookworm AS app

SHELL ["/bin/bash", "-o", "pipefail", "-e", "-u", "-x", "-c"]

WORKDIR /app
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

COPY ./pyproject.toml .
COPY ./README.md .
COPY ./src src/

RUN uv venv
RUN --mount=type=cache,target=/root/.cache/uv,sharing=locked uv sync --no-editable --no-dev

# make uvicorn etc available
ENV PATH="/app/.venv/bin:$PATH"

CMD uvicorn pyonb_marker.api:app --host 0.0.0.0 --port "$MARKER_API_PORT" --workers 4 --use-colors
30 changes: 30 additions & 0 deletions packages/ocr/marker/README.md
@@ -0,0 +1,30 @@
# Instructions

## Python

First install `pyonb_marker`. From the top-level `pyonb` directory:

```shell
uv sync --extra marker
```

Then, to convert a PDF to markdown:

```python
import pyonb_marker

result = pyonb_marker.convert_pdf_to_markdown(
    file_path="path/to/data/input.pdf",
)
```
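
Judging from the converter diff later in this PR, `convert_pdf_to_markdown` logs any conversion error rather than re-raising it, so a caller may receive `None` instead of text. A minimal sketch of guarding against that before writing the result to disk; `save_markdown` is a hypothetical helper, not part of `pyonb_marker`:

```python
from __future__ import annotations

from pathlib import Path


def save_markdown(text: str | None, output_path: str | Path) -> Path:
    """Write converted markdown to disk, refusing to write a failed (None) result."""
    if text is None:
        # Assumption: None means the conversion failed and was only logged.
        raise ValueError("conversion returned no text; check the marker logs")
    output_path = Path(output_path)
    # Create the output folder if it does not exist yet.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(text, encoding="utf-8")
    return output_path
```

With the snippet above, this could be called as `save_markdown(result, "path/to/data/output.md")`.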

## Docker compose

From the `pyonb/packages/ocr/marker` directory:

```shell
docker compose run marker data/ms-note-one-page.pdf data/output.md
```

Note that you will need to set `DATA_FOLDER` in a `.env` file to the folder that holds your input files,
e.g.: `DATA_FOLDER=path/to/data`.
19 changes: 19 additions & 0 deletions packages/ocr/marker/pyproject.toml
@@ -0,0 +1,19 @@
[build-system]
build-backend = "hatchling.build"
requires = ["hatchling"]

[project]
dependencies = [
"accelerate",
"fastapi[standard]",
"marker-pdf",
"ollama",
"python-dotenv",
"requests",
"uvicorn",
]
description = "pyonb wrapper around marker"
name = "pyonb-marker"
readme = "README.md"
requires-python = ">=3.11"
version = "0.1.0"
Original file line number Diff line number Diff line change
@@ -8,8 +8,11 @@
from fastapi import FastAPI, File, HTTPException, UploadFile, status
from fastapi.responses import JSONResponse, RedirectResponse

from pyonb_marker.main import convert_pdf_to_markdown

_today = datetime.datetime.now(datetime.UTC).strftime("%Y_%m_%d") # type: ignore[attr-defined] # mypy complains that 'Module has no attribute "UTC"'
logging.basicConfig(
filename="marker." + datetime.datetime.now(tz=datetime.UTC).strftime("%Y%m%d") + ".log",
filename=f"marker-{_today}.log",
format="%(asctime)s %(message)s",
filemode="a",
)
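
The hunk above changes the log name from `marker.<YYYYMMDD>.log` to `marker-<YYYY_MM_DD>.log`. A small sketch of the same naming scheme as a testable function; `log_filename` is a hypothetical helper, and it uses `datetime.timezone.utc`, which is available on older Python versions and may sidestep the mypy complaint about `datetime.UTC` mentioned in the comment (an assumption, not verified against this repository's mypy configuration):

```python
from __future__ import annotations

import datetime


def log_filename(prefix: str, now: datetime.datetime | None = None) -> str:
    """Return '<prefix>-YYYY_MM_DD.log' for the given (or current) UTC time."""
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    return f"{prefix}-{now.strftime('%Y_%m_%d')}.log"
```

Passing `now` explicitly keeps the function deterministic in tests, while the default matches the module-level behaviour in the diff.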
@@ -18,18 +21,6 @@
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

# TODO(tom): improve imports - below try statements horrible
try:
# local
from .main import run_marker
except Exception:
logger.exception("Detected inside Docker container.")
# Docker container
try:
from main import run_marker # type: ignore # noqa: PGH003
except Exception:
logger.exception("Marker imports not possible.")

app = FastAPI(swagger_ui_parameters={"tryItOutEnabled": True})


@@ -71,7 +62,7 @@ async def inference(file: Annotated[UploadFile, File()] = None) -> JSONResponse:
# marker requires path to file rather than UploadFile object, so create temp copy of file
with Path(f"temp_api_file_{file.filename}").open("wb") as f: # noqa: ASYNC230
f.write(content)
result, _ = run_marker(f"temp_api_file_{file.filename}")
result = convert_pdf_to_markdown(f"temp_api_file_{file.filename}")
except Exception as e:
raise HTTPException(status_code=400, detail=f"Failed to run marker. Error: {e}") from e
else:
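
The handler above copies the upload to `temp_api_file_{file.filename}` in the working directory, a name that could collide if two requests carry the same filename. One possible alternative is sketched here with the standard-library `tempfile` module. `run_on_temp_copy` and its `convert` parameter are hypothetical names, and this is not what the PR does:

```python
from __future__ import annotations

import tempfile
from collections.abc import Callable
from pathlib import Path


def run_on_temp_copy(content: bytes, convert: Callable[[Path], str], suffix: str = ".pdf") -> str:
    """Write uploaded bytes to a uniquely named temp file, convert it, then delete the copy."""
    # delete=False so the file can be closed before the converter reopens it by path.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(content)
        tmp_path = Path(tmp.name)
    try:
        return convert(tmp_path)
    finally:
        # Remove the temporary copy whether or not the conversion succeeded.
        tmp_path.unlink(missing_ok=True)
```

In the handler this might be called as `run_on_temp_copy(content, convert_pdf_to_markdown)`, under the assumption that the converter only needs a path to a readable PDF.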
Original file line number Diff line number Diff line change
@@ -22,37 +22,29 @@ def setup_converter(config, config_parser) -> PdfConverter: # noqa: ANN001
)


def convert_pdf_to_markdown(file_path: str | Path, output_format: str | Path = "markdown", use_llm: bool = True): # noqa: ANN201
def convert_pdf_to_markdown( # noqa: ANN201
file_path: str | Path,
output_format: str | Path = "markdown",
use_llm: bool = True,
):
"""Convert the PDF to markdown using Marker and optionally use LLM for improved accuracy."""
config = {
"output_format": output_format,
"use_llm": use_llm,
"llm_service": "marker.services.ollama.OllamaService",
"ollama_model": "llama3.2",
"ollama_base_url": "http://localhost:11434",
"disable_images": True,
}
config_parser = ConfigParser(config)
converter = setup_converter(config_parser.generate_config_dict(), config_parser)
try:
# Optionally enable LLM for improved accuracy
config = {
"output_format": output_format,
"use_llm": use_llm,
"llm_service": "marker.services.ollama.OllamaService",
"ollama_model": "llama3.2",
"ollama_base_url": "http://localhost:11434",
}
config_parser = ConfigParser(config)
# Create the converter with the necessary settings
converter = setup_converter(config_parser.generate_config_dict(), config_parser)

# Process the PDF file and convert to the specified output format
rendered = converter(str(file_path))

# Extract the text (Markdown, JSON, or HTML) from the rendered object
text, _, images = text_from_rendered(rendered)
text, _, _ = text_from_rendered(rendered)
except Exception:
logger.exception("Error processing PDF.")
else:
return text, images


def run_marker(input_pdf_path: str | Path): # noqa: ANN201
"""Execute marker."""
res, images = convert_pdf_to_markdown(file_path=input_pdf_path, use_llm=True, output_format="json")

return res, images
return text


if __name__ == "__main__":
@@ -64,11 +56,11 @@ def run_marker(input_pdf_path: str | Path): # noqa: ANN201
input_pdf_path = Path(sys.argv[1])
output_txt_path = Path(sys.argv[2])

res, images = run_marker(input_pdf_path)
text = convert_pdf_to_markdown(input_pdf_path)

try:
with output_txt_path.open("w", encoding="utf-8") as f:
f.write(res)
f.write(text)

logger.info("Text extracted to %s", output_txt_path)

35 changes: 35 additions & 0 deletions packages/ocr/paddleocr/Dockerfile
@@ -0,0 +1,35 @@
FROM ghcr.io/astral-sh/uv:python3.13-bookworm AS app

SHELL ["/bin/bash", "-o", "pipefail", "-e", "-u", "-x", "-c"]


RUN --mount=type=cache,target=/var/cache/apt \
    --mount=type=cache,target=/var/lib/apt \
    apt-get update \
    && apt-get install -y --no-install-recommends \
        ccache \
        cmake \
        curl \
        ffmpeg \
        libpoppler-cpp-dev \
        libsm6 \
        libxext6 \
        pkg-config \
        poppler-utils \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

COPY ./pyproject.toml .
COPY ./README.md .
COPY ./src src/

RUN uv venv
RUN --mount=type=cache,target=/root/.cache/uv,sharing=locked uv sync --no-editable --no-dev

# make uvicorn etc available
ENV PATH="/app/.venv/bin:$PATH"

CMD uvicorn pyonb_paddleocr.api:app --host 0.0.0.0 --port "$PADDLEOCR_API_PORT" --workers 4 --use-colors
24 changes: 24 additions & 0 deletions packages/ocr/paddleocr/README.md
@@ -0,0 +1,24 @@
# Instructions

Before using the `paddleocr` API for OCR, you will need to set the `PADDLEOCR_API_PORT`
environment variable in the top-level `.env` file.

## Docker Compose

You will also need to define `OCR_FORWARDING_API_PORT` in the `.env` file.

Then, spin up the `ocr-forwarding-api` and `paddleocr` services:

```shell
docker-compose --profile paddleocr up --build --detach
```

You can then use `curl` to send a PDF to the forwarding API:

```shell
curl -v -X POST http://127.0.0.1:8110/paddleocr/inference_single \
-F "[email protected]" \
-H "accept: application/json"
```

Note that this assumes you have set `OCR_FORWARDING_API_PORT` to `8110`.
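
The `curl` call above hard-codes port `8110`. A small sketch of deriving the URL from the environment instead, assuming `OCR_FORWARDING_API_PORT` is exported in the calling process. `inference_url` is a hypothetical helper, and the `8110` fallback only mirrors the example:

```python
from __future__ import annotations

import os


def inference_url(host: str = "127.0.0.1", port: str | None = None) -> str:
    """Build the forwarding-API URL for single-file paddleocr inference."""
    # Prefer an explicit port, then the environment, then the port from the curl example.
    port = port or os.environ.get("OCR_FORWARDING_API_PORT", "8110")
    return f"http://{host}:{port}/paddleocr/inference_single"
```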
22 changes: 22 additions & 0 deletions packages/ocr/paddleocr/pyproject.toml
@@ -0,0 +1,22 @@
[build-system]
build-backend = "hatchling.build"
requires = ["hatchling"]

[project]
dependencies = [
"fastapi",
"paddleocr==2.10.0",
"paddlepaddle",
"pdf2image",
"pillow",
"python-multipart",
"python-poppler",
"requests",
"setuptools",
"uvicorn",
]
description = "pyonb wrapper around paddleocr"
name = "pyonb-paddleocr"
readme = "README.md"
requires-python = ">=3.11"
version = "0.1.0"