qwen25-deposium-1024d / examples /advanced_test_output.log


	================================================================================
	🧪 ADVANCED LIMITS TESTING: qwen25-deposium-1024d
	================================================================================

	🔄 Loading model...
	✅ Model loaded!


	================================================================================
	🌍 PART 1: Cross-Lingual Instruction-Awareness
	================================================================================

	────────────────────────────────────────────────────────────────────────────────
	Test 1.1: FR Question → EN Documents
	────────────────────────────────────────────────────────────────────────────────

	Can the model understand FR 'Explique' → EN 'explanation tutorial'?

	📝 Query: "Explique comment fonctionnent les réseaux de neurones"

	📄 Documents:
	1. ⚪ [0.741] Comment installer TensorFlow sur Ubuntu
	2. ❌ [0.674] Neural networks explanation tutorial and comprehensive guide
	3. ⚪ [0.671] Neural network architecture overview and history

	❌ FAIL: Cross-lingual instruction matching
	Score difference: -0.067

	────────────────────────────────────────────────────────────────────────────────
	Test 1.2: EN Question → FR Documents
	────────────────────────────────────────────────────────────────────────────────

	Can the model understand EN 'Find articles' → FR 'Articles ... publications'?

	📝 Query: "Find articles about climate change"

	📄 Documents:
	1. ⚪ [0.950] Climate change scientific research overview
	2. ❌ [0.737] Articles sur le changement climatique et publications scientifiques
	3. ⚪ [0.646] Le changement climatique est un problème majeur

	❌ FAIL: Cross-lingual instruction matching
	Score difference: -0.213

	────────────────────────────────────────────────────────────────────────────────
	Test 1.3: FR Question → Multilingual Documents
	────────────────────────────────────────────────────────────────────────────────

	FR 'Résume' → EN 'summary' (mixed FR/EN/ES/DE results)

	📝 Query: "Résume les avantages de l'apprentissage profond"

	📄 Documents:
	1. ⚪ [0.932] L'apprentissage profond est une technique d'IA
	2. ⚪ [0.881] Resumen de las ventajas del aprendizaje profundo
	3. ⚪ [0.838] Zusammenfassung der Vorteile des Deep Learning
	4. ❌ [0.534] Deep learning advantages summary: fast, accurate, scalable

	❌ FAIL: Multilingual instruction matching
	Score difference: -0.398
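
The "Score difference" values above are consistent with subtracting the top-ranked score from the expected document's score, so a test passes only when the expected document ranks first (difference 0.000). The log does not show the test script, so the sketch below is an assumption about that rule; `evaluate` and `expected_idx` are hypothetical names.

```python
def evaluate(scores, expected_idx):
    """Return (passed, difference) for one ranking test.

    scores: similarity of each candidate document to the query.
    expected_idx: index of the document the test expects at rank 1.
    Assumed rule (not confirmed by the log): pass only if the expected
    document has the highest score.
    """
    top = max(scores)
    difference = scores[expected_idx] - top  # 0.0 when the expected doc is on top
    return scores[expected_idx] == top, round(difference, 3)


# Test 1.1 from the log: expected document scored 0.674, top document 0.741.
passed, diff = evaluate([0.741, 0.674, 0.671], expected_idx=1)
print("PASS" if passed else "FAIL", f"{diff:+.3f}")
```

The same rule reproduces the 0.000 difference reported for the passing tests, where the expected document is already at rank 1.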

	================================================================================
	🤔 PART 2: Difficult and Ambiguous Cases
	================================================================================

	────────────────────────────────────────────────────────────────────────────────
	Test 2.1: Negative Instructions
	────────────────────────────────────────────────────────────────────────────────

	Does the model understand 'Avoid' correctly?

	📝 Query: "Avoid using neural networks for this task"

	📄 Documents:
	1. ✅ [0.969] Alternative methods to neural networks: decision trees, random forests
	2. ⚪ [0.969] When not to use machine learning algorithms
	3. ⚪ [0.958] Neural network implementation guide and tutorial

	✅ PASS: Negative instruction understanding
	Score difference: 0.000

	────────────────────────────────────────────────────────────────────────────────
	Test 2.2: Ambiguous Instructions
	────────────────────────────────────────────────────────────────────────────────

	'Train the model' - Does it default to ML context?

	📝 Query: "Train the model"

	📄 Documents:
	1. ⚪ [0.918] Train scheduling and railway timetables
	2. ⚪ [0.917] Employee training program for new hires
	3. ❌ [0.905] Machine learning model training procedures and optimization

	❌ FAIL: Ambiguity resolution (ML context)
	Score difference: -0.014

	────────────────────────────────────────────────────────────────────────────────
	Test 2.3: Multiple Instructions
	────────────────────────────────────────────────────────────────────────────────

	Multiple intents: Find + Compare + Summarize

	📝 Query: "Find, compare and summarize articles about quantum computing"

	📄 Documents:
	1. ✅ [0.977] Quantum computing articles comparison summary: top papers analyzed
	2. ⚪ [0.966] Quantum computing summary and overview
	3. ⚪ [0.962] Quantum computing research articles and publications
	4. ⚪ [0.704] GPT-3 vs GPT-4 comparison summary

	✅ PASS: Multiple intentions handling
	Score difference: 0.000

	────────────────────────────────────────────────────────────────────────────────
	Test 2.4: Formal vs Informal Nuances
	────────────────────────────────────────────────────────────────────────────────

	Formal query → Formal doc: 0.969
	Formal query → Informal doc: 0.962
	Informal query → Formal doc: 0.883
	Informal query → Informal doc: 0.937

	✅ PASS: Formality awareness

	================================================================================
	⚠️ PART 3: Edge Cases and Failure Modes
	================================================================================

	────────────────────────────────────────────────────────────────────────────────
	Test 3.1: Spelling Errors
	────────────────────────────────────────────────────────────────────────────────

	Query with typos: 'Explan', 'nural', 'netwrks', 'wrk'

	📝 Query: "Explan how nural netwrks wrk"

	📄 Documents:
	1. ⚪ [0.601] How to install neural network frameworks
	2. ❌ [0.577] Neural networks explanation tutorial and comprehensive guide
	3. ⚪ [0.565] Neural network architecture technical specifications

	❌ FAIL: Typo robustness
	Score difference: -0.023

	────────────────────────────────────────────────────────────────────────────────
	Test 3.2: Very Long and Complex Query
	────────────────────────────────────────────────────────────────────────────────

	Very long query (71 words) with multiple intents

	📝 Query: "I need to find comprehensive research articles and academic papers that provide
	a detailed explanation and thorough comparison of different neural network
	architectures, specifically comparing convolutional neural networks, recurrent
	neural networks, and transformer-based models, with a focus on their practical
	applications in natural language processing, computer vision, and time series
	prediction tasks, including performance benchmarks and computational efficiency
	analysis."

	📄 Documents:
	1. ⚪ [0.963] Deep learning frameworks installation guide
	2. ⚪ [0.958] Neural networks overview and basic introduction
	3. ❌ [0.898] Neural network architectures comparison: CNN, RNN, Transformers for NLP, vision, time series

	❌ FAIL: Long query handling
	Score difference: -0.065

	────────────────────────────────────────────────────────────────────────────────
	Test 3.3: Contradictory Instructions
	────────────────────────────────────────────────────────────────────────────────

	Contradictory: 'in detail' vs 'keep it brief'

	📝 Query: "Explain in detail but keep it brief"

	📄 Documents:
	1. ⚪ [0.952] Quick overview and brief summary of the topic
	2. ⚪ [0.941] Comprehensive detailed explanation with examples
	3. ❌ [0.924] Medium-length explanation with key points

	❌ FAIL: Contradiction handling (balanced)
	Score difference: -0.029

	────────────────────────────────────────────────────────────────────────────────
	Test 3.4: Non-Latin Scripts
	────────────────────────────────────────────────────────────────────────────────

	Arabic query → English documents

	📝 Query: "اشرح كيف تعمل الشبكات العصبية"

	📄 Documents:
	1. ⚪ [0.961] شبكات عصبية معمارية عامة
	2. ❌ [-0.445] Neural networks explanation tutorial comprehensive guide
	3. ⚪ [-0.474] Neural network training procedures

	Russian query → English documents

	📝 Query: "Объясни, как работают нейронные сети"

	📄 Documents:
	1. ⚪ [0.982] Нейронные сети архитектура обзор
	2. ❌ [-0.234] Neural networks explanation tutorial comprehensive guide
	3. ⚪ [-0.242] Neural network training procedures

	Chinese query → English documents

	📝 Query: "解释神经网络如何工作"

	📄 Documents:
	1. ⚪ [0.973] 神经网络架构概述
	2. ⚪ [-0.629] Neural network training procedures
	3. ❌ [-0.642] Neural networks explanation tutorial comprehensive guide

	❌ FAIL: Non-Latin script support
	Arabic: ❌ | Russian: ❌ | Chinese: ❌
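
The negative scores in Test 3.4 make sense if the bracketed values are cosine similarities, which range from -1 to 1. The log does not name the metric, so that is an assumption. A minimal, dependency-free sketch of the measure on toy vectors:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, not real embeddings from the model.
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # parallel: close to 1
opposed = cosine_similarity([1.0, 2.0], [-1.0, -2.0])   # opposite: close to -1
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # orthogonal: 0
print(same, opposed, unrelated)
```

Under that reading, a score such as -0.642 means the query and document embeddings point in broadly opposing directions, not merely that they are unrelated.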

	================================================================================
	📊 PART 4: Performance Degradation Analysis
	================================================================================

	Progressive difficulty test:

	🔴 1. Simple EN instruction
	Score: 0.934 | Margin: -0.010
	🔴 2. Cross-lingual FR→EN
	Score: 0.590 | Margin: -0.002
	🔴 3. Cross-lingual with typos
	Score: 0.578 | Margin: 0.011
	🔴 4. Long cross-lingual query
	Score: 0.569 | Margin: 0.024

	📉 Performance Degradation:
	Cross-lingual FR→EN: -0.343 (36.8% drop)
	Cross-lingual with typos: -0.356 (38.1% drop)
	Long cross-lingual query: -0.365 (39.0% drop)
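
The degradation figures appear to be each score's drop relative to the "Simple EN instruction" baseline of 0.934. The sketch below recomputes them from the rounded scores printed in the log, assuming that formula; `degradation` is a hypothetical helper. Because the inputs are already rounded to three decimals, the results can differ from the logged values in the last digit.

```python
def degradation(baseline, score):
    """Return (absolute change, percentage drop) relative to the baseline."""
    delta = score - baseline
    pct_drop = -delta / baseline * 100.0
    return delta, pct_drop


baseline = 0.934  # "Simple EN instruction" score from the log
cases = [
    ("Cross-lingual FR→EN", 0.590),
    ("Cross-lingual with typos", 0.578),
    ("Long cross-lingual query", 0.569),
]
for label, score in cases:
    delta, pct = degradation(baseline, score)
    print(f"{label}: {delta:+.3f} ({pct:.1f}% drop)")
```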

	================================================================================
	📈 FINAL SUMMARY: Limits and Capabilities
	================================================================================

	╔══════════════════════════════════════════════════════════════════════════════╗
	║                             TEST RESULTS SUMMARY                             ║
	╚══════════════════════════════════════════════════════════════════════════════╝

	✅ STRENGTHS (What Works Well):

	🤔 Difficult Cases: 75% pass rate
	• Negative instructions: ✅
	• Ambiguity resolution: ❌
	• Multiple intentions: ✅
	• Formality matching: ✅

	⚠️ LIMITATIONS (Where It Struggles):

	🌍 Cross-Lingual Instruction-Awareness: 0% pass rate
	• FR→EN: ❌
	• EN→FR: ❌
	• Multilingual: ❌

	⚠️ Edge Cases: 0% pass rate
	• Spelling errors: ❌
	• Very long queries: ❌
	• Contradictions: ❌
	• Non-Latin scripts: ❌

	📉 Performance Degradation:

	• Cross-lingual FR→EN: -36.8% from baseline
	• Cross-lingual with typos: -38.1% from baseline
	• Long cross-lingual query: -39.0% from baseline

	🎯 RECOMMENDATIONS FOR HUGGINGFACE DOCUMENTATION:

	1. ⚠️ WARN: Cross-lingual instruction-awareness failed every test (0%)
	2. ✅ HIGHLIGHT: Handles difficult cases well (75%)
	3. ⚠️ WARN: Edge cases failed every test (0%)
	4. ⚠️ WARN: Performance degrades with complexity
	5. ⚠️ WARN: Non-Latin script queries failed for all three languages tested (Arabic, Russian, Chinese)

	💡 HONEST ASSESSMENT:
	In this run the model handled difficult same-language English queries
	(negative instructions, multiple intents, formality) but failed on:
	- Cross-lingual instruction matching (FR→EN, EN→FR, multilingual)
	- Non-Latin script queries against English documents (Arabic, Chinese, Russian)
	- Very long or contradictory queries
	- Spelling errors

	Best use: same-language instruction-aware search and RAG systems
	Not ideal: cross-lingual retrieval, Non-Latin-to-English queries, highly noisy input


	💾 Saving detailed results to test_results.json...
	Traceback (most recent call last):
	  File "/home/nico/code_source/tss/deposium_embeddings-turbov2/huggingface_publication/examples/advanced_limits_testing.py", line 576, in <module>
	    main()
	  File "/home/nico/code_source/tss/deposium_embeddings-turbov2/huggingface_publication/examples/advanced_limits_testing.py", line 570, in main
	    json.dump(output, f, indent=2, ensure_ascii=False)
	  File "/usr/lib/python3.10/json/__init__.py", line 179, in dump
	    for chunk in iterable:
	  File "/usr/lib/python3.10/json/encoder.py", line 431, in _iterencode
	    yield from _iterencode_dict(o, _current_indent_level)
	  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
	    yield from chunks
	  File "/usr/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
	    yield from chunks
	  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
	    yield from chunks
	  File "/usr/lib/python3.10/json/encoder.py", line 438, in _iterencode
	    o = _default(o)
	  File "/usr/lib/python3.10/json/encoder.py", line 179, in default
	    raise TypeError(f'Object of type {o.__class__.__name__} '
	TypeError: Object of type bool is not JSON serializable
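
Python's built-in `bool` is JSON serializable, so the value that triggered this error is most likely a boolean-like scalar from a numeric library, for example a NumPy boolean produced by comparing two scores. The log does not show the offending value, so that is an assumption. One possible workaround, sketched below, is to pass a `default` hook to `json.dump` that unwraps any object exposing an `.item()` method, as NumPy scalars do. `to_builtin` and `FakeScalar` are hypothetical names; `FakeScalar` only stands in for a NumPy scalar so the example runs without NumPy installed.

```python
import json


def to_builtin(obj):
    """Fallback for json.dump/json.dumps: unwrap scalars that expose .item()."""
    item = getattr(obj, "item", None)
    if callable(item):
        return item()  # NumPy scalars return the matching Python built-in
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


class FakeScalar:
    """Hypothetical stand-in for a NumPy scalar (assumption, see lead-in)."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


# Shape of the results dict is illustrative only; the real one is not in the log.
output = {"tests": [{"name": "FR→EN", "passed": FakeScalar(False)}]}
text = json.dumps(output, indent=2, ensure_ascii=False, default=to_builtin)
print(text)
```

An alternative is to convert each flag with `bool(...)` at the point where the results are collected, which keeps the `json.dump` call unchanged.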