Fast lexical embeddings built on token counts, TF–IDF, and compact sparse‑to‑dense MLPs.
Read the full story in our blog post.
⏩ For the Quickest Start, check out our Luxical One model on HuggingFace.
⚠️ NOTE: No Planned Active Maintenance ⚠️ This GitHub repository is made available for reproducibility and the advancement of scientific research into fast text embedding methods. At DatologyAI we are proudly committed to our customers, which unfortunately limits the time we have to actively monitor this repository, accept PRs, or otherwise maintain this project.
```
pip install luxical
```

We currently support macOS and Linux and Python versions 3.11, 3.12, and 3.13. These limitations are due to the inclusion of compiled Rust extension code via the included arrow-tokenize package.
We have a basic HuggingFace integration that supports inference only. This integration still requires the luxical package.
```python
from pathlib import Path
from transformers import AutoModel
# Load from the Huggingface Hub.
hub_model = AutoModel.from_pretrained("datologyai/luxical-one", trust_remote_code=True)
# Load from a local export directory.
local_dir = Path("~/Downloads/luxical_one_hf").expanduser()
local_model = AutoModel.from_pretrained(local_dir, trust_remote_code=True)
# Embed.
emb = local_model(["Luxical integrates with Huggingface."]).embeddingsfrom pathlib import Path
import luxical.embedder
import luxical.misc_utils
local_dir = Path("~/Downloads/luxical_one.npz").expanduser()
embedder = luxical.embedder.Embedder.load(str(model_local_path))
emb = embedder(["Luxical goes", "very fast"], progress_bars=True)
# EXTRA
# Luxical models typically experience no quality degradation from uint8 quantization.
# Additionally, file-formats supporting dictionary-encoding-based compression
# (e.g. Parquet) may automatically compress roundtrip-quantized data by 4x.
emb_uint8 = luxical.misc_utils.fast_8bit_uniform_scalar_quantize(emb, limit=0.5)
emb_roundtrip = luxical.misc_utils.dequantize_8bit_uniform_scalar_quantized(
emb_uint8, limit=0.5
)
# EXTRA
# Luxical ships with helper methods to integrate with pyarrow.
import pyarrow as pa
import pyarrow.parquet as pq
emb_pyarrow = luxical.misc_utils.numpy_ndarray_to_pyarrow_fixed_size_list_array(
emb_roundtrip
)
emb_table = pa.table(
{
"document_id": ["doc_1", "doc_2"],
"embedding": emb_pyarrow,
}
)
pq.write_table(emb_table, "fast_and_small_embeddings.parquet")
```

Luxical uses lexical (word-based) features. To do this, it tokenizes input text and constructs a Term Frequency (TF) representation over the tokens (a.k.a. a "bag of words" featurization). It then applies an Inverse Document Frequency (IDF) scaling and L2 normalization to produce a sparse unit vector.
Luxical is not fully lexical, though. After constructing the sparse unit vector of TF-IDF features, a small feed-forward ReLU neural network maps these features to a dense, normalized embedding.
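As a toy illustration of this featurization (with a hypothetical four-word vocabulary and made-up IDF weights; the real pipeline uses Luxical's trained tokenizer and IDF table):

```python
import numpy as np
from collections import Counter

# Toy vocabulary and IDF weights, for illustration only.
vocab = {"luxical": 0, "goes": 1, "very": 2, "fast": 3}
idf = np.array([2.1, 0.7, 0.5, 1.3])

# Term frequencies: a "bag of words" over the tokens.
tf = np.zeros(len(vocab))
for token, count in Counter("luxical goes very very fast".split()).items():
    tf[vocab[token]] = count

# IDF scaling and L2 normalization yield a sparse unit vector.
x = tf * idf
x /= np.linalg.norm(x)
# A small feed-forward network then maps x to a dense embedding.
```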
Consider a sparse feature vector $\mathbf{x}$ and a dense weight matrix $A$. Multiplying them involves only the columns of $A$ corresponding to nonzero entries in $\mathbf{x}$:

$$A\mathbf{x} = \sum_{i \,:\, x_i \neq 0} x_i \, A_{:,i}$$

where $A_{:,i}$ denotes the $i$-th column of $A$.
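A small NumPy check of this property (illustrative only; Luxical's actual kernels are the Numba routines described below):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 1000))  # dense weights, [D_out, vocab_size]
idx = np.array([7, 123, 999])       # positions of the nonzero features
val = np.array([0.6, 0.3, 0.74])    # their TF-IDF values

x = np.zeros(1000)
x[idx] = val

dense_result = A @ x             # touches all 1000 columns
sparse_result = A[:, idx] @ val  # touches only the 3 relevant columns
assert np.allclose(dense_result, sparse_result)
```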
- Efficient sparse‑by‑dense matmuls (written with Numba) achieve high performance on CPU in the sparse-to-dense projection.
- Custom, efficient IDF-scaling code avoids a performance penalty from the IDF-scaling step.
- Another approach to speeding up this step would be fusing the IDF-scaling weights directly into the model weights. If $A$ is the sparse-to-dense projection matrix, $\mathbf{b}$ is the IDF weight vector, and $\mathbf{x}$ is the sparse bag-of-words vector, we can IDF-scale the projection matrix, i.e. use the matmul ordering $(A\,\mathrm{diag}(\mathbf{b}))\,\mathbf{x}$, instead of IDF-scaling the input, i.e. the matmul ordering $A\,(\mathrm{diag}(\mathbf{b})\,\mathbf{x})$; a numerical check of this equivalence appears after this list. Although this lets us scale the weights once ahead of time for all inputs, this fusion complicates the implementation of training because it changes the parameterization of the model and thus affects the trajectory of gradient-based optimization. Our fast scaling code sidesteps this issue.
- A shallow MLP with ReLU and normalization between layers provides more representational power essentially for free: the nonlinearity yields noticeably better embedding quality than a single linear projection at modest compute, and the additional FLOPs are small compared with tokenization overhead, so the model stays CPU-friendly.
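A quick numerical check of the fusion equivalence mentioned above (a sketch with random data, not Luxical's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 100))  # projection matrix
b = rng.random(100)                # IDF weight vector
x = np.zeros(100)
x[[3, 42]] = [2.0, 1.0]            # sparse bag-of-words counts

# (A diag(b)) x == A (diag(b) x): scale the weights once ahead of
# time, or scale every input at inference time.
fused = (A * b) @ x          # b broadcasts across the columns of A
scaled_input = A @ (b * x)
assert np.allclose(fused, scaled_input)
```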
Luxical uses a simple distillation objective: student embeddings are trained to match a teacher model's pairwise similarities via a KL divergence on temperature-scaled Gram matrices.
Core ideas:
- Compute Gram matrices for student and teacher mini‑batches.
- Remove the self‑similarity diagonal to avoid trivial peaks.
- Apply temperature scaling and compute KL divergence (`log_target=True`).
Relevant functions: `remove_diagonal` and `contrastive_distillation_loss` in `src/luxical/training.py`.
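For intuition, `remove_diagonal` could be implemented along these lines (a hypothetical sketch; see `src/luxical/training.py` for the actual version):

```python
import torch

def remove_diagonal(G: torch.Tensor) -> torch.Tensor:
    # Drop the self-similarity diagonal of a [B, B] Gram matrix,
    # returning the cross-similarities as a [B, B-1] matrix.
    B = G.shape[0]
    mask = ~torch.eye(B, dtype=torch.bool, device=G.device)
    return G[mask].reshape(B, B - 1)
```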
Pseudo‑code:
```python
tau = 3.0  # Set the temperature.
S = normalize(student(X)) # [B, D]
T = normalize(teacher(X)) # [B, D]
G_s = S @ S.T
G_t = T @ T.T
G_s = remove_diagonal(G_s)
G_t = remove_diagonal(G_t)
loss = tau**2 * KLDiv(log_softmax(G_s / tau), log_softmax(G_t / tau))
```

See `./projects/luxical_one` for example training code that walks through the steps of:
- Embedding a training corpus with a teacher model.
- Identifying a body of common ngrams from a training corpus and determining approximate inverse-document-frequency scaling for these terms.
- Training the core embedder parameters via knowledge distillation.
- Install `just` and set up the dev environment:
  - macOS: `brew install just`
  - Ubuntu: `sudo apt-get install -y just`
- For Rust builds (`arrow_tokenize`):
  - macOS: install the Rust toolchain locally.
  - Linux wheels: Docker is required for manylinux targets.
- Helpful commands:
  - `just help` — list tasks
  - `just setup-dev` — create the Python env for development
  - `just lint` — autoformat, lint (with autofix), and typecheck
  - `just test` — run tests
1. Choose patch/minor/major version. Versions may be separate for `luxical` and `arrow-tokenize`.
2. Bump version(s):
   - for luxical: edit `src/luxical/__about__.py`.
   - for arrow_tokenize: edit `arrow_tokenize/Cargo.toml`.
3. Update `README.md` with a dated entry for Luxical updates. Update the Release Notes section of `arrow_tokenize/README.md` for arrow_tokenize updates.
4. Build wheels (see below).
5. Publish to PyPI (see below).
6. Optional: tag versions and push tags, e.g. `git tag vX.Y.Z` or `git tag arrow-tokenize-vA.B.C` followed by `git push --tags`.
For steps 4 & 5 you would run just clean build-luxical publish-wheel-luxical to release a luxical-only change, or just clean build-arrow-tokenize build-luxical publish-wheel-arrow-tokenize publish-wheel-luxical to ship an update to both arrow-tokenize and luxical (e.g. updating the arrow-tokenize code and then updating luxical to use the new version of arrow-tokenize).
Version sources are in code, and builds read from those sources when creating wheels.
- `luxical`
  - Source of truth: `src/luxical/__about__.py:1` → `__version__ = "X.Y.Z"`
  - Runtime API: `from luxical import __version__`
- `arrow_tokenize` (Rust extension)
  - Source of truth: `arrow_tokenize/Cargo.toml` → `[package].version = "A.B.C"`
  - Runtime API: `import arrow_tokenize as at; at.__version__`
For the core Luxical codebase (in `src`):

- Local wheel: `just build-luxical`
- Publish to the configured index: `just publish-luxical --no-dry-run`
For the arrow-tokenize Rust extension:
- macOS local wheels: `just build-arrow-tokenize-macos-local`
- Linux wheels via Docker: `just build-arrow-tokenize-linux-cross`
Get an API token from PyPI (under the account settings page) and add it to your `~/.pypirc`:
```ini
# Contents of ~/.pypirc
[testpypi]
repository = https://test.pypi.org/legacy/
username = __token__
password = pypi-<something>

[pypi]
repository = https://upload.pypi.org/legacy/
username = __token__
password = pypi-<something>
```
Then publish with:

```
just publish-wheel-luxical
just publish-wheel-arrow-tokenize
```

Release notes:

- Clarify upper limit of supported Python versions in `pyproject.toml`
- Correct license metadata in `pyproject.toml`
- Public release
- Initialize release notes and publishing workflow (internal)
