
Migrating from Fastembed ONNX to Hugging Face TEI

The specific trade-offs, the retrieval quality delta on an institutional corpus, and why the infra complexity was worth it.

Fastembed ONNX on CPU is a reasonable place to start a RAG system. The library is easy to install, the model catalog covers the obvious choices, and you can get a working embedding pipeline in an afternoon without standing up any additional infrastructure. For VeriCite's early proof-of-concept, that was the right call.

It stopped being the right call when retrieval quality became the constraint. The catalog that's practical to serve through Fastembed can't run paraphrase-multilingual-MiniLM-L12-v2 at the throughput an institutional corpus demands, and it doesn't offer a cross-encoder reranker like BAAI/bge-reranker-v2-m3 at all. Reranking matters disproportionately on the kind of corpus VeriCite deals with: documents where the lexical overlap between a query and the relevant passage is low, and where a bi-encoder's cosine similarity is a noisy signal.

Hugging Face TEI gives you both. It's a purpose-built inference server for text embeddings and rerankers, with batching tuned for GPU throughput. The trade-off is operational: you're now running GPU nodes, managing batching configuration, and shipping k8s manifests for a TEI sidecar alongside Qdrant. That's real infra complexity, and it doesn't pay for itself unless retrieval quality is actually on the critical path.
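As a rough sketch of what the operational side looks like, TEI ships as a container that serves one model per instance; the image tag, port, and host below are illustrative, but the `/rerank` request shape follows the TEI API:

```shell
# Serve the reranker with TEI on a GPU node (image tag and port are illustrative).
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-reranker-v2-m3

# Score (query, passage) pairs jointly via the /rerank endpoint;
# the response is a list of {"index": ..., "score": ...} entries.
curl -s http://localhost:8080/rerank \
  -H 'Content-Type: application/json' \
  -d '{"query": "example query", "texts": ["passage one", "passage two"]}'
```

A second TEI instance with a `--model-id` pointing at the embedding model covers the bi-encoder side; that one-model-per-instance split is part of the manifest sprawl mentioned above.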

In this case it was. After the migration, retrieval quality on the institutional corpus improved measurably — specifically on queries where the relevant document used different vocabulary than the query itself. That's the problem the cross-encoder is solving: it sees the full (query, passage) pair rather than comparing independent embeddings. The bi-encoder gets you to the right neighborhood; the reranker gets you to the right door.
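The two-stage flow can be sketched in a few lines. Everything here is illustrative: the vectors are toys and `rerank_score` stands in for the cross-encoder call (a TEI `/rerank` request in production), but the control flow matches the description above.

```python
from math import sqrt

def cosine(a, b):
    # Bi-encoder signal: similarity between independently computed embeddings.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def retrieve_then_rerank(query_vec, doc_vecs, rerank_score, top_k=50, final_k=5):
    # Stage 1: bi-encoder gets us to the right neighborhood -- cheap,
    # approximate ranking over the whole corpus.
    candidates = sorted(range(len(doc_vecs)),
                        key=lambda i: cosine(query_vec, doc_vecs[i]),
                        reverse=True)[:top_k]
    # Stage 2: cross-encoder gets us to the right door -- it scores each
    # (query, passage) pair jointly, so low lexical overlap hurts it less.
    return sorted(candidates, key=rerank_score, reverse=True)[:final_k]
```

The point of the split is cost: the cross-encoder is too expensive to run over the whole corpus, so it only sees the short candidate list the bi-encoder produces.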

The infra overhead is now a fixed cost that's already paid, and the range of models we can serve going forward is much wider.