Latent Space Cartography

This project applies standard TransE-style relational displacement analysis to frozen general-purpose text embeddings over Wikidata. Relations run as cheap vector displacements on embeddings that were never trained for them. Along the way, the work surfaced a silent production defect that makes mxbai-embed-large nearly useless for any text with diacritics.

Read
The defect report
How mxbai-embed-large collapses 147,687 cross-entity pairs — e.g. “Hokkaidō” and “Éire” — into identical vectors.
Download
The paper (PDF)
Full write-up: relational displacement analysis, cross-model relations, and the tokenizer defect. NeurIPS-styled typeset PDF.
Claw4S
clawRxiv submission
The peer-reviewed submission on clawRxiv (paper 2604.01127).

The three contributions

1 · Relational inference on frozen embeddings

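Relations are implemented as displacement-vector operations on existing embeddings: h + r ≈ t, where h and t are the embeddings of the head and tail entities and r is the displacement vector for the relation. This is orders of magnitude cheaper than full model inference.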
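As a minimal sketch of the h + r ≈ t idea, the snippet below estimates r as the average displacement over known (head, tail) pairs and predicts a tail by nearest neighbour to h + r. The 2-d vectors, the entity names, and the helpers relation_vector and predict_tail are all invented for illustration. They are not the project's code or data, and the choice of averaging and Euclidean distance is an assumption.

```python
import math

def relation_vector(pairs, emb):
    """Average displacement t - h over known (head, tail) pairs."""
    dims = len(next(iter(emb.values())))
    total = [0.0] * dims
    for head, tail in pairs:
        for d in range(dims):
            total[d] += emb[tail][d] - emb[head][d]
    return [x / len(pairs) for x in total]

def predict_tail(head, r, emb):
    """Return the entity whose embedding is closest to emb[head] + r."""
    query = [h + x for h, x in zip(emb[head], r)]
    candidates = [name for name in emb if name != head]
    return min(candidates, key=lambda name: math.dist(query, emb[name]))

# Toy 2-d "embeddings" (hypothetical values, not from any real model).
emb = {
    "France": [1.0, 0.0], "Paris": [1.0, 1.0],
    "Japan": [3.0, 0.2], "Tokyo": [3.0, 1.1],
    "Ireland": [5.0, 0.1], "Dublin": [5.0, 1.0],
}

# Learn a "capital" displacement from two pairs, apply it to a third entity.
r = relation_vector([("France", "Paris"), ("Japan", "Tokyo")], emb)
print(predict_tail("Ireland", r, emb))  # Dublin
```

Once r is known, each prediction costs one vector addition plus a nearest-neighbour lookup over embeddings that already exist, which is where the cost advantage over running a model comes from.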

2 · Cross-model relational structure

Three independent general-purpose models (mxbai-embed-large, nomic-embed-text, all-minilm) encode the same 30 universal relations as consistent vector displacements, which points to a property of the semantic relationships rather than of any single model.
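One way "consistent vector displacements" could be quantified is sketched below, assuming a simple score: the mean pairwise cosine similarity between the displacements t - h of a relation, computed separately inside each model's own space, since different models have different dimensions. The score, the two toy models, and all vectors are assumptions made for illustration. The paper's actual metric may differ.

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def displacement_consistency(pairs, emb):
    """Mean pairwise cosine similarity between displacements t - h."""
    disps = [[t - h for h, t in zip(emb[head], emb[tail])]
             for head, tail in pairs]
    sims = [cosine(a, b) for a, b in combinations(disps, 2)]
    return sum(sims) / len(sims)

# Two hypothetical models with different dimensionality (toy values).
models = {
    "toy_model_a": {
        "France": [1.0, 0.0], "Paris": [1.0, 1.0],
        "Japan": [3.0, 0.2], "Tokyo": [3.0, 1.1],
        "Ireland": [5.0, 0.1], "Dublin": [5.0, 1.0],
    },
    "toy_model_b": {
        "France": [0.0, 1.0, 0.0], "Paris": [2.0, 1.0, 0.1],
        "Japan": [0.0, 3.0, 0.0], "Tokyo": [1.8, 3.0, 0.0],
        "Ireland": [0.0, 5.0, 0.2], "Dublin": [2.1, 5.0, 0.1],
    },
}

capital_pairs = [("France", "Paris"), ("Japan", "Tokyo"), ("Ireland", "Dublin")]
# Control: the same entities with tails deliberately mismatched.
shuffled_pairs = [("France", "Tokyo"), ("Japan", "Dublin"), ("Ireland", "Paris")]

real = {m: displacement_consistency(capital_pairs, e) for m, e in models.items()}
control = {m: displacement_consistency(shuffled_pairs, e) for m, e in models.items()}
for name in models:
    print(name, round(real[name], 3), round(control[name], 3))
```

A relation that scores high in every model while the shuffled control scores low in each is the kind of pattern the cross-model claim describes.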

3 · A silent production defect

The [UNK]-token dominance defect causes 147,687 cross-entity embedding collisions on text with diacritics when mxbai-embed-large is served via Ollama. Standard benchmarks such as MTEB miss it.
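The sketch below illustrates how collisions of this kind can be detected: embed each entity label, group labels by their (rounded) vector, and report any group with more than one member. The embedder here is a toy stand-in (toy_embed, with a hypothetical two-word vocabulary) that maps every out-of-vocabulary word to a single [UNK] id. It only mimics the failure mode and is not the real mxbai-embed-large tokenizer or the Ollama API.

```python
from collections import defaultdict

VOCAB = {"tokyo": 1, "dublin": 2}  # hypothetical toy vocabulary
UNK_ID = 0                         # every unknown word maps here

def toy_embed(text):
    """Stand-in embedder: one token per word, mean-pooled 2-d vectors."""
    ids = [VOCAB.get(word, UNK_ID) for word in text.lower().split()]
    vecs = [(float(i), float(i * i)) for i in ids]
    return tuple(sum(v[d] for v in vecs) / len(vecs) for d in range(2))

def find_collisions(texts, embed, decimals=6):
    """Group texts whose embeddings are identical after rounding."""
    groups = defaultdict(list)
    for text in texts:
        key = tuple(round(x, decimals) for x in embed(text))
        groups[key].append(text)
    return [group for group in groups.values() if len(group) > 1]

labels = ["Tokyo", "Dublin", "Hokkaidō", "Éire"]
collisions = find_collisions(labels, toy_embed)
print(collisions)  # [['Hokkaidō', 'Éire']]
```

To run the same check against a real deployment, toy_embed would be swapped for a function that calls the embedding service. A benchmark that only scores retrieval or similarity on mostly ASCII text has no step like this, which is one plausible reason a defect of this kind goes unnoticed.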