This work applies standard TransE-style relational displacement analysis to
frozen general-purpose text embeddings over
Wikidata. Relations run as cheap vector displacements on
embeddings that were never trained for them. Along the way, the
work surfaced a silent production defect that makes
mxbai-embed-large nearly useless for any text with diacritics.
mxbai-embed-large collapses 147,687 cross-entity pairs
(e.g. “Hokkaidō” and “Éire”)
into identical vectors.
Relations implemented as displacement-vector operations
(h + r ≈ t) on existing embeddings — orders of
magnitude cheaper than full model inference.
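
A minimal sketch of the displacement idea (h + r ≈ t), assuming toy 3-dimensional vectors in place of real model embeddings. The entity names, the vector values, and the helper names (`mean_displacement`, `predict_tail`) are illustrative assumptions, not taken from the project's code. Here r is estimated as the mean of t − h over known pairs, and a tail is predicted as the nearest cosine neighbour of h + r.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mean_displacement(pairs, emb):
    """Estimate r as the average of (t - h) over known (head, tail) pairs."""
    diffs = [sub(emb[t], emb[h]) for h, t in pairs]
    n = len(diffs)
    return [sum(col) / n for col in zip(*diffs)]

def predict_tail(head, r, emb):
    """Return the entity whose embedding is closest (by cosine) to h + r."""
    target = add(emb[head], r)
    candidates = [e for e in emb if e != head]
    return max(candidates, key=lambda e: cosine(emb[e], target))

# Toy embeddings (hypothetical values, not real model output).
emb = {
    "France": [1.0, 0.0, 0.2],
    "Paris":  [1.0, 1.0, 0.2],
    "Japan":  [0.0, 0.1, 1.0],
    "Tokyo":  [0.0, 1.1, 1.0],
    "Italy":  [0.5, 0.0, 0.6],
    "Rome":   [0.5, 1.0, 0.6],
}

# Learn a "capital of" displacement from two pairs, apply it to a third head.
r_capital = mean_displacement([("France", "Paris"), ("Japan", "Tokyo")], emb)
predicted = predict_tail("Italy", r_capital, emb)
print(predicted)
```

The only per-query work is one vector addition and a nearest-neighbour search over embeddings that were computed once, which is where the cost advantage over running the model again comes from.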
Three independent general-purpose models (mxbai-embed-large, nomic-embed-text, all-minilm) encode the same 30 universal relations as consistent vector displacements — a property of the semantic relationships, not any single model.
The [UNK]-token dominance defect causes 147,687
cross-entity embedding collisions on diacritical text when served via
Ollama. Missed by standard benchmarks like MTEB.
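
One way collisions of this kind could be detected is to group labels by their embedding and report every group that holds more than one distinct label. The sketch below assumes the embeddings are already available as plain lists of floats. The function names, the rounding precision, and the toy vectors are illustrative assumptions, not the project's actual tooling or real model output.

```python
from collections import defaultdict

def find_collisions(emb, decimals=6):
    """Group labels whose vectors are identical after rounding."""
    groups = defaultdict(list)
    for label, vec in emb.items():
        key = tuple(round(x, decimals) for x in vec)
        groups[key].append(label)
    # Keep only groups where distinct labels share one vector.
    return [sorted(labels) for labels in groups.values() if len(labels) > 1]

def count_colliding_pairs(groups):
    """A group of n labels contributes n * (n - 1) / 2 colliding pairs."""
    return sum(len(g) * (len(g) - 1) // 2 for g in groups)

# Toy data (hypothetical): two different entities mapped to the same vector.
toy = {
    "Hokkaidō": [0.12, -0.40, 0.33],
    "Éire":     [0.12, -0.40, 0.33],
    "Tokyo":    [0.80, 0.10, -0.25],
}

collisions = find_collisions(toy)
n_pairs = count_colliding_pairs(collisions)
print(collisions, n_pairs)
```

Hashing rounded vectors keeps the check linear in the number of entities, so it scales to a full entity list without comparing every pair directly.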