Improving Single Token Compression for Retrieval Augmented Generation

Authors: Mansheel Agarwal, Shyam Varahagiri, Earl Ranario, Akanksha Giliyal, Horacio Contreras

Single-token compression methods such as xRAG reduce a retrieved document to just one token, which can lead to hallucinations when the retrieved document is wrong. We propose an enhanced approach that uses an LLM to generate multiple synthetic queries with the same intent but different perspectives, then applies ensemble scoring to rerank the retrieved documents and compresses the highest-scoring one into a single token. We also introduce a multi-document token generation method that aggregates embeddings from the top-ranked documents for richer context.

Background

Figure 1 (xRAG architecture): The original xRAG method compresses document embeddings into a single token for the language model, reducing memory and compute overhead over standard RAG but losing contextual richness.

xRAG compresses a retrieved document into a single token before passing it to the reader model. While this dramatically reduces memory and compute overhead compared to standard RAG, it is prone to hallucinations when the wrong document is retrieved, and it cannot handle queries that require information from multiple documents.
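To make the compression step concrete, the sketch below maps one retrieval embedding to a single vector of the reader's hidden size and prepends it to the query's token embeddings. It is an illustration only: the single linear layer, the toy dimensions, and the names `project_to_token` and `build_reader_input` are assumptions made for this example, not the trained xRAG projector.

```python
from typing import List

Vector = List[float]


def project_to_token(doc_embedding: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Map a retrieval embedding to one vector in the reader's embedding space.

    Hypothetical stand-in for the xRAG projector: a single linear layer,
    where `weights` holds one row per output dimension.
    """
    if len(weights) != len(bias):
        raise ValueError("weights and bias must agree on the output size")
    token = []
    for row, b in zip(weights, bias):
        if len(row) != len(doc_embedding):
            raise ValueError("each weight row must match the embedding size")
        token.append(sum(w * x for w, x in zip(row, doc_embedding)) + b)
    return token


def build_reader_input(doc_token: Vector, query_token_embeddings: List[Vector]) -> List[Vector]:
    """Prepend the single document token to the query's token embeddings."""
    return [doc_token] + query_token_embeddings


# Toy sizes: retrieval embedding of size 3, reader hidden size of 2.
doc_embedding = [1.0, 2.0, 3.0]
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.5, -1.0]

doc_token = project_to_token(doc_embedding, weights, bias)
reader_input = build_reader_input(doc_token, [[0.1, 0.2], [0.3, 0.4]])
print(doc_token)          # one vector stands in for the whole document
print(len(reader_input))  # the document token plus two query tokens
```

Whatever the document's length, the reader sees exactly one extra position. This is where the savings come from, and also why a wrongly retrieved document is hard to recover from.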

Method

Figure 2: Our proposed method generates multiple synthetic queries from the original query, retrieves top documents for each, and uses ensemble scoring to select the best document before xRAG compression.

Our approach consists of three key components:

Synthetic Query Generation — LLaMA-7B generates k query variants with the same intent but different perspectives, broadening document retrieval coverage.
Ensemble Scoring — Retrieved documents are reranked by aggregating relevance scores across all synthetic queries, reducing the chance of selecting an incorrect document.
Token Compression — The highest-scoring document is compressed into a single token via the xRAG projector and passed to the reader model (Mistral-7B).
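The ensemble scoring step can be sketched as follows. This is a minimal illustration under stated assumptions: relevance is taken to be cosine similarity, scores are aggregated with a plain mean, and the candidate documents are assumed to be already pooled from the per-query retrievals. The helper names `cosine` and `ensemble_rerank` and the toy 2-d embeddings are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ensemble_rerank(
    query_embeddings: List[Vector], doc_embeddings: Dict[str, Vector]
) -> List[Tuple[str, float]]:
    """Rank documents by their mean cosine similarity to all query variants.

    The mean is an assumed aggregation; other choices (max, rank fusion)
    would fit the same interface.
    """
    if not query_embeddings:
        raise ValueError("need at least one query embedding")
    scored = []
    for doc_id, doc_vec in doc_embeddings.items():
        scores = [cosine(q, doc_vec) for q in query_embeddings]
        scored.append((doc_id, sum(scores) / len(scores)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Original query plus two synthetic variants (toy 2-d embeddings).
queries = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8]]
docs = {"doc_a": [1.0, 0.0], "doc_b": [0.6, 0.8], "doc_c": [0.0, 1.0]}

ranking = ensemble_rerank(queries, docs)
best_doc_id = ranking[0][0]  # this document would go on to xRAG compression
print(ranking)
```

In this toy setup the original query alone would pick `doc_a`, while the ensemble over all three variants prefers `doc_b`. That shift is the effect the reranking step is meant to exploit.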

Figure 3: Prompt template used to generate k synthetic query variations.
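For readers who want to try the query-generation step, here is a rough sketch of building such a prompt and parsing the model's reply. The prompt wording is hypothetical and is not the template shown in Figure 3. The function names are also assumptions, and no LLM is called: a hard-coded string stands in for the model output.

```python
import re
from typing import List


def build_variant_prompt(query: str, k: int) -> str:
    """Build a prompt asking for k rewrites of the query (hypothetical wording)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return (
        f"Rewrite the question below in {k} different ways. "
        "Keep the intent the same but vary the perspective. "
        "Return one numbered rewrite per line.\n\n"
        f"Question: {query}"
    )


def parse_variants(llm_output: str, k: int) -> List[str]:
    """Extract up to k rewrites from lines such as '1. ...' or '2) ...'."""
    variants = []
    for line in llm_output.splitlines():
        match = re.match(r"\s*\d+[.)]\s+(.*\S)", line)
        if match:
            variants.append(match.group(1))
    return variants[:k]


prompt = build_variant_prompt("Who wrote Dracula?", k=2)
# Stand-in for a model response; a real run would send `prompt` to the LLM.
fake_output = (
    "Sure, here you go:\n"
    "1. Which author created Dracula?\n"
    "2) Dracula was written by whom?\n"
)
print(parse_variants(fake_output, k=2))
```

Parsing only numbered lines skips any preamble the model adds, so a chatty reply does not leak into the retrieval queries.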

Experiments

Dataset: TriviaQA with LLaMA-7B-generated distractor documents to simulate noisy retrieval conditions. Performance is measured via Exact Match (EM).
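As a reference point for the metric, the sketch below computes Exact Match with the answer normalization commonly used for open-domain QA (lowercasing, stripping punctuation and English articles, collapsing whitespace). Whether our evaluation used exactly this normalization is an assumption of the example, and the function names and sample answers are illustrative.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def em_score(predictions: List[str], gold: List[List[str]]) -> float:
    """Fraction of predictions that exactly match one of their gold answers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, gold))
    return hits / len(predictions)


# Illustrative data only: two of the three predictions match a gold answer.
predictions = ["The Eiffel Tower.", "Bram Stoker", "1912"]
gold = [["Eiffel Tower"], ["Bram Stoker", "Abraham Stoker"], ["1911"]]
print(em_score(predictions, gold))
```

Because EM is all-or-nothing per question, small changes in the score correspond to a small number of questions flipping between right and wrong.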

Results

k (synthetic queries)	Exact Match	Δtime (s)
0 (baseline)	0.51	—
1	0.51	+7
5	0.51	+33
10	0.52	+62
15	0.51	+78
20	0.53	+80

The best single-document result is achieved at k=20 (EM=0.53), a gain of 0.02 over the baseline at a cost of about 80 s of additional processing time; for smaller k, EM stays between 0.51 and 0.52. The multi-document token generation method aggregates embeddings from the top-n documents and achieves a total processing time of 15.6 s versus 43.2 s for the original xRAG, improving both accuracy and efficiency.
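One simple way to picture the multi-document aggregation is mean pooling over the top-n documents, sketched below. The post does not fix the aggregation function, so the plain mean is an assumption, as are the helper name `aggregate_top_n` and the toy embeddings and scores.

```python
from typing import List, Sequence

Vector = List[float]


def aggregate_top_n(doc_embeddings: Sequence[Vector], scores: Sequence[float], n: int) -> Vector:
    """Average the embeddings of the n highest-scoring documents.

    Mean pooling is an assumed choice; a score-weighted average would
    fit the same signature.
    """
    if len(doc_embeddings) != len(scores):
        raise ValueError("need one score per document embedding")
    if not 1 <= n <= len(doc_embeddings):
        raise ValueError("n must be between 1 and the number of documents")
    # Indices of the n best documents, highest score first.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
    dim = len(doc_embeddings[0])
    return [sum(doc_embeddings[i][d] for i in top) / n for d in range(dim)]


embeddings = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
scores = [0.9, 0.7, 0.1]

pooled = aggregate_top_n(embeddings, scores, n=2)
print(pooled)  # a single vector that could then be projected like one document
```

The pooled vector has the same shape as a single document embedding, so the downstream compression step would not need to change.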

Embedding Analysis

Figure 4: PaCMAP visualization of query, synthetic query, and document embeddings. Synthetic queries occupy the same embedding space as the original query (linear distribution), while document embeddings cluster narrowly, suggesting room to improve retrieval diversity.

Conclusion

On TriviaQA with distractor documents, our methods improve both the accuracy and the efficiency of xRAG: multi-query ensemble scoring raises EM from 0.51 to 0.53 at k=20, and multi-document token generation cuts total processing time from 43.2 s to 15.6 s. Future work will focus on hybridizing the multi-document and multi-query approaches and building a custom structured query-answer dataset better suited to these retrieval strategies.

References

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
Cheng, X., Wang, X., Zhang, X., Ge, T., Chen, S., Wei, F., Zhang, H., & Zhao, D. (2024). xRAG: Extreme Context Compression for Retrieval-Augmented Generation with One Token. arXiv preprint.
