Theoretical Limitations of Embedding Models and Their Applications in Turkish: An In-Depth Look
TL;DR
Information Retrieval (IR) systems have evolved rapidly with AI and NLP, largely thanks to embedding models that enable semantic representation and complex query answering. Yet, as highlighted in Google DeepMind’s paper “On the Theoretical Limitations of Embedding-Based Retrieval”—and supported by our own analysis of Turkish embedding models—these methods face inherent theoretical constraints.
A single-vector embedding can only take you so far. No matter how large the vector dimension is, it’s fundamentally constrained. As you add more documents, the number of possible distinctions grows exponentially, while the vector size remains fixed. Inevitably, this means some queries—even very simple ones like “Who loves pizza?”—fail to retrieve the right result. The limitation isn’t in the data; it’s in the single-vector representation itself.
That’s where cross-encoders come in. Instead of squeezing meaning into one static vector, they evaluate the query and the candidate document together, modeling their interactions directly. The payoff is much higher accuracy—though at the cost of extra compute and latency.
The best practice is to combine the two:
- Embeddings for fast candidate retrieval.
- Cross-encoders for precise re-ranking.
This hybrid setup strikes the balance between speed and accuracy.
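The hybrid setup above can be sketched structurally. The toy cosine retrieval and the word-overlap `cross_score` used here are illustrative stubs standing in for real embedding and cross-encoder models, not any specific library's API:

```python
# Sketch of a retrieve-then-rerank pipeline. Stage 1 is fast single-vector
# retrieval; stage 2 re-scores only the shortlisted candidates.
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, doc_vecs, k):
    """Stage 1: bi-encoder-style retrieval of the top-k candidate indices."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:k]

def rerank(query, docs, candidates, cross_score):
    """Stage 2: cross-encoder-style re-ranking of the candidate indices."""
    return sorted(candidates, key=lambda i: cross_score(query, docs[i]), reverse=True)
```

In practice the `cross_score` stub would be replaced by a trained cross-encoder that reads the query and document jointly; the two-stage shape of the pipeline stays the same.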
Understanding the Limits of Embedding Models in Theory and Practice
Embedding models are at the core of modern information retrieval, offering powerful tools for representing meaning and handling complex queries. Yet to grasp their full potential, it is necessary to balance the theoretical insights that define their boundaries with the practical benchmarks that reveal how they perform in real-world scenarios. This dual perspective sets the stage for examining how embedding models have risen to prominence and the expectations placed upon them.
The Rise of Embedding Models and Expectations
Over the past twenty years, information retrieval has shifted from sparse methods such as BM25 to dense retrieval powered by neural language models. These models generate vector representations of inputs, enabling stronger generalization and the ability to tackle complex retrieval tasks. Expectations have grown further as dense approaches aim to represent any query and relevance definition. For instance, the QUEST dataset studies retrieval for logical, multi-condition queries, while BRIGHT focuses on reasoning-based relevance. Both challenge the limits of dense retrieval systems and highlight future directions.
However, contrary to the widespread belief that these models are capable of handling any retrieval task, Google DeepMind researchers point out some inherent theoretical limitations of embedding models. The paper demonstrates that these limitations do not only arise from unrealistic queries but can also appear in extremely simple and realistic scenarios. This necessitates a fundamental re-evaluation of existing information retrieval paradigms.
Theoretical Limitations of Embedding Models: Why Can't They Represent Everything?
The paper "On the Theoretical Limitations of Embedding-Based Retrieval" reveals a fundamental limitation of embedding models: for a given query, only a restricted set of relevant-document combinations can be represented. The authors show that these limitations stem from embedding models' reliance on vector representations in a geometric space, and that they can be analyzed with known results from mathematical research.
The core argument of the paper is that results from communication complexity theory apply directly to information retrieval. For any fixed embedding dimension, there will always be document combinations that cannot be retrieved, marking a fundamental limit for embedding models. This is formalized through the relationship between embedding dimension and the sign-rank of the query–relevance matrix, which can be far larger than embedding sizes typically used in practice.
To show this limitation is universal, the authors ran a “free embedding optimization” experiment, directly fitting vectors to test data. Findings revealed that for each dimension (d), there exists a critical point: once the document set grows beyond a certain size, embeddings cannot encode all possible relevance combinations. Importantly, this breakdown follows a polynomial relationship between embedding dimension and the number of documents, underscoring that the limitation persists regardless of model type or training data.
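The intuition behind that critical point can be shown with a deliberately extreme toy case: with 1-dimensional embeddings and dot-product scoring, only two of the six possible top-2 subsets of four documents can ever be retrieved, no matter which query value is chosen. Higher dimensions raise this ceiling but never remove it; the numbers below are purely illustrative:

```python
# Toy demonstration of the representational ceiling: count how many of the
# possible top-2 document subsets a 1-d embedding space can actually return.
from itertools import combinations

doc_embs = [0.2, 0.5, -0.3, 0.9]  # four documents, embedding dimension d = 1

def top2(q):
    # Rank documents by the dot-product score q * d_i and keep the top 2.
    order = sorted(range(len(doc_embs)), key=lambda i: q * doc_embs[i], reverse=True)
    return frozenset(order[:2])

# Sweep query values (excluding 0 to avoid ties) and collect distinct results.
achievable = {top2(q) for q in [x / 10 for x in range(-20, 21) if x != 0]}
possible = {frozenset(c) for c in combinations(range(len(doc_embs)), 2)}

print(len(achievable), "of", len(possible), "top-2 subsets achievable")  # 2 of 6
```

With q > 0 the two largest document scalars always win, and with q < 0 the two smallest; every other pairing is geometrically unreachable. The paper's free embedding optimization experiment generalizes exactly this counting argument to higher dimensions.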
The LIMIT Dataset: Bringing Theory to Practice
To test these theoretical limits in practice, the authors introduce LIMIT, a simple yet realistic dataset. Even on straightforward tasks (e.g., the query “Who likes apples?” paired with the document “Jon likes apples”), state-of-the-art embedding models perform poorly, highlighting the real impact of the theoretical constraints.
In summary, the paper shows that single-vector embeddings face inherent limits. Current benchmarks, often small and overfitted, conceal these weaknesses. As tasks demand returning more complex top-k combinations (e.g., logical relations between documents), models inevitably hit a representational ceiling that cannot be overcome without new methods.
Benchmark Analysis on Turkish Embedding Models: Does the LIMIT Theory Apply to Turkish as Well?
To test whether these theoretical limits also apply to Turkish, we ran a benchmark comparing five Turkish embedding models across three retrieval methods (Bi-Encoder, Multi-Vector, Cross-Encoder). The goal was to assess LIMIT’s impact and identify the most effective model–approach combinations for practical use.
Analysis Methodology and Models
Models and approaches used in the analysis are as follows:
Models:
- BAAI/bge-m3
- newmindai/TurkEmbed4Retrieval
- newmindai/modernbert-base-tr-uncased-allnli-stsb
- sentence-transformers/paraphrase-multilingual-mpnet-base-v2
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
Approaches:
- Bi-Encoder: Standard, single-vector representation approach.
- Multi-Vector: An approach that provides richer representation by using multiple vectors.
- Cross-Encoder: A reranking approach that directly calculates relevance by processing the query and document pair together.
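For illustration, the multi-vector idea can be sketched with a ColBERT-style late-interaction ("MaxSim") score: each query token vector is matched against its best-matching document token vector, and the maxima are summed. The vectors below are toy stand-ins for real token embeddings:

```python
# Late-interaction ("MaxSim") scoring sketch: a multi-vector relevance score
# that keeps one vector per token instead of one vector per text.
def maxsim(query_vecs, doc_vecs):
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    # For each query vector, take its best match among the document vectors.
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```

Because each query token can align with a different part of the document, this scoring can distinguish document combinations that a single pooled vector must conflate.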
Performance evaluation was primarily conducted using Recall@2, Recall@10, and Recall@20 metrics (Figure 2). These metrics indicate the proportion of relevant documents found among the first k results and directly measure the top-k retrieval challenge emphasized by the LIMIT theory.
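As a reference point, a minimal single-query Recall@k computation of the kind behind these tables might look like the following (the benchmark averages this quantity over all queries):

```python
# Recall@k: the fraction of a query's relevant documents that appear
# among the top-k retrieved results.
def recall_at_k(retrieved, relevant, k):
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)
```

For example, if documents 7 and 9 are relevant and the system returns [3, 7, 1, 9], Recall@2 is 0.5 and Recall@4 is 1.0.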
Figure 2. Recall@k comparison of models on the retrieval task.
Benchmark Results and Theoretical Validation
The analysis results clearly showed that the LIMIT theory also applies to Turkish embedding models: all tested models performed below the theoretically expected performance limits. Specifically, while theory already predicts Recall@2 below 0.6 in this setting, even the best configuration, BAAI/bge-m3 with the multi-vector approach, reached only 0.3132 (Figure 3 and Figure 4). Recall@10 and Recall@20 likewise stayed below the theoretical limits (Figure 1 and Figure 6). This confirms that embedding models carry an inherent representational capacity limitation, especially for complex relevance definitions and high-dimensional document combinations (Figure 5).
Approach Comparison and Practical Implications
The benchmark analysis also revealed the performance of different retrieval approaches:
Multi-Vector Approach showed the best performance (Recall@2: 0.3132), especially when used with the BAAI/bge-m3 model. This demonstrates the effectiveness of using multiple vectors to overcome the limitations of single-vector representation. A 29.4% improvement was observed for BAAI/bge-m3 compared to the Bi-Encoder approach.
Cross-Encoder Approach provided significant improvements for larger k values (in the reranking stage), especially for Recall@20 (e.g., 34.3% improvement for newmindai/modernbert-base-tr-uncased-allnli-stsb). However, it performed worse than Bi-Encoder and Multi-Vector approaches for small k values like Recall@2.
Bi-Encoder Approach, while still suitable for fast and simple solutions, remained more limited compared to Multi-Vector and Cross-Encoder approaches in the face of challenges posed by the LIMIT theory.
Model-wise:
- BAAI/bge-m3 generally exhibited the best performance.
- newmindai/TurkEmbed4Retrieval stood out as a good alternative specifically optimized for Turkish.
- The sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model showed the lowest performance.
These results provide important practical guidance for developers of Turkish information retrieval systems. In scenarios involving complex relevance definitions and high-dimensional document combinations, using more advanced approaches like Multi-Vector or Cross-Encoder can play a critical role in overcoming the limitations of single-vector Bi-Encoder models.
Future Work and Ways to Overcome Limitations
The paper ‘On the Theoretical Limitations of Embedding-Based Retrieval’ by Google DeepMind and our Turkish benchmark analysis clearly demonstrate the inherent limitations of current embedding-based information retrieval models. However, this is not a cause for despair regarding the future of the field, but rather a call to develop new and more innovative approaches. As the paper states, the community needs to be aware of these limitations and choose alternative retrieval approaches, such as cross-encoders or multi-vector models, when building systems that must handle the full range of instruction-based queries.
Potential Areas for Improvement
Development of Multi-Vector Representations: When a single dense vector cannot capture semantic complexity, representing documents or queries with multiple vectors helps address the combinatorial challenges of LIMIT. Our Turkish benchmark showed clear improvements of Multi-Vector over Bi-Encoder.
Hybrid Approaches: Combining sparse methods (e.g., BM25) with dense retrieval can boost both precision and coverage. Using cross-encoders for reranking further improves accuracy by better ranking candidate documents.
Larger and More Diverse Datasets: Synthetic datasets like LIMIT reveal theoretical limits, but larger, diverse datasets reflecting real-world complexity improve generalization. For Turkish, creating rich datasets across domains is especially vital.
Fine-Tuning and Domain Adaptation: Task or domain-specific fine-tuning allows models to better capture relevance in context, improving practical performance despite theoretical constraints.
New Model Architectures: Going beyond current embeddings with new representations—such as graph-based or relational networks—may overcome fundamental limitations highlighted by communication complexity theory.
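One concrete, widely used way to realize the hybrid idea above (fusing a sparse ranking such as BM25 with a dense ranking) is Reciprocal Rank Fusion; the sketch below is illustrative and independent of any particular retrieval library:

```python
# Reciprocal Rank Fusion (RRF): each ranker contributes 1 / (K + rank)
# per document; K = 60 is the conventional smoothing constant.
def rrf(rankings, K=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (K + rank)
    # Return document ids sorted by fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs only rank positions, not raw scores, which sidesteps the problem that BM25 and cosine similarities live on incomparable scales.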
Our Take
Google DeepMind’s paper ‘On the Theoretical Limitations of Embedding-Based Retrieval’ serves as an important warning for the future of embedding-based information retrieval systems. This work shows that the performance of models is not only limited by the quality of training data or model size, but also by inherent mathematical and theoretical constraints. Our Turkish benchmark analysis also confirmed that these theoretical findings are valid for models in the Turkish language. In particular, approaches such as Multi-Vector and Cross-Encoder have been shown to offer potential in overcoming the limitations of single-vector Bi-Encoder models.
These findings carry important implications for researchers and developers in the field of information retrieval. Simply building larger models or using more data may no longer be sufficient. Instead, exploring new architectures and hybrid approaches that expand the fundamental representational capabilities of embedding models is critical to realizing the true potential of future information retrieval systems. This direction will allow us to deepen our theoretical understanding, ground our practical solutions on more solid foundations, and continue to push the boundaries of AI-powered information retrieval.
References
[1] Weller, O., Boratko, M., Naim, I., & Lee, J. (2025). On the Theoretical Limitations of Embedding-Based Retrieval. arXiv preprint arXiv:2508.21038. https://arxiv.org/pdf/2508.21038
[2] Malaviya, C., Shaw, P., Chang, M.-W., Lee, K., & Toutanova, K. (2023). QUEST: A Retrieval Dataset of Entity-Seeking Queries with Implicit Set Operations. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL). https://arxiv.org/pdf/2305.11694
[3] Su, H., et al. (2024). BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval. arXiv preprint arXiv:2407.12883. https://arxiv.org/pdf/2407.12883
[4] Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., & Liu, Z. (2024). BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv preprint arXiv:2402.03216. https://arxiv.org/pdf/2402.03216
[5] Ezerceli, Ö., Gümüşçekiçci, G., Erkoç, T., & Özenç, B. (2025, June). TurkEmbed4Retrieval: Turkish Embedding Model for Retrieval Task. In 2025 33rd Signal Processing and Communications Applications Conference (SIU) (pp. 1-4). IEEE. https://ieeexplore.ieee.org/abstract/document/11112338
[6] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint arXiv:1908.10084. https://arxiv.org/pdf/1908.10084