================================================================================
🧪 ADVANCED LIMITS TESTING: qwen25-deposium-1024d
================================================================================
🔄 Loading model...
✅ Model loaded!
================================================================================
🌍 PART 1: Cross-Lingual Instruction-Awareness
================================================================================
────────────────────────────────────────────────────────────────────────────────
Test 1.1: FR Question → EN Documents
────────────────────────────────────────────────────────────────────────────────
Can the model understand FR 'Explique' → EN 'explanation tutorial'?
📝 Query: "Explique comment fonctionnent les réseaux de neurones"
📄 Documents:
1. ⚪ [0.741] Comment installer TensorFlow sur Ubuntu
2. ❌ [0.674] Neural networks explanation tutorial and comprehensive guide
3. ⚪ [0.671] Neural network architecture overview and history
❌ FAIL: Cross-lingual instruction matching
Score difference: -0.067
────────────────────────────────────────────────────────────────────────────────
Test 1.2: EN Question → FR Documents
────────────────────────────────────────────────────────────────────────────────
Can the model understand EN 'Find articles' → FR 'Articles ... publications'?
📝 Query: "Find articles about climate change"
📄 Documents:
1. ⚪ [0.950] Climate change scientific research overview
2. ❌ [0.737] Articles sur le changement climatique et publications scientifiques
3. ⚪ [0.646] Le changement climatique est un problème majeur
❌ FAIL: Cross-lingual instruction matching
Score difference: -0.213
────────────────────────────────────────────────────────────────────────────────
Test 1.3: FR Question → Multilingual Documents
────────────────────────────────────────────────────────────────────────────────
FR 'Résume' → EN 'summary' (mixed FR/EN/ES/DE results)
📝 Query: "Résume les avantages de l'apprentissage profond"
📄 Documents:
1. ⚪ [0.932] L'apprentissage profond est une technique d'IA
2. ⚪ [0.881] Resumen de las ventajas del aprendizaje profundo
3. ⚪ [0.838] Zusammenfassung der Vorteile des Deep Learning
4. ❌ [0.534] Deep learning advantages summary: fast, accurate, scalable
❌ FAIL: Multilingual instruction matching
Score difference: -0.398
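The "Score difference" values in these tests appear to be the expected document's similarity minus the top-ranked score (for Test 1.1, 0.674 - 0.741 = -0.067), so a passing test reports 0.000. The following is a minimal sketch of that check, inferred from the printed numbers rather than taken from the test script; `score_difference` and the `(score, is_expected)` pair layout are hypothetical.

```python
def score_difference(docs):
    """docs: list of (score, is_expected) pairs with exactly one expected document.

    Returns the expected document's score minus the best score overall,
    which is 0.0 when the expected document ranks first.
    """
    expected = next(score for score, is_expected in docs if is_expected)
    best = max(score for score, _ in docs)
    return expected - best


# Scores as printed for Test 1.1 (rounded to three decimals in the log)
test_1_1 = [(0.741, False), (0.674, True), (0.671, False)]
diff = score_difference(test_1_1)
passed = diff >= 0  # assumption: the expected document must rank first
print(f"{'PASS' if passed else 'FAIL'}  Score difference: {diff:.3f}")
```

Because the log prints rounded scores, recomputing a difference from them can be off by 0.001 relative to the value the script reported.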
================================================================================
🤔 PART 2: Difficult and Ambiguous Cases
================================================================================
────────────────────────────────────────────────────────────────────────────────
Test 2.1: Negative Instructions
────────────────────────────────────────────────────────────────────────────────
Does the model understand 'Avoid' correctly?
📝 Query: "Avoid using neural networks for this task"
📄 Documents:
1. ✅ [0.969] Alternative methods to neural networks: decision trees, random forests
2. ⚪ [0.969] When not to use machine learning algorithms
3. ⚪ [0.958] Neural network implementation guide and tutorial
✅ PASS: Negative instruction understanding
Score difference: 0.000
────────────────────────────────────────────────────────────────────────────────
Test 2.2: Ambiguous Instructions
────────────────────────────────────────────────────────────────────────────────
'Train the model' - Does it default to ML context?
📝 Query: "Train the model"
📄 Documents:
1. ⚪ [0.918] Train scheduling and railway timetables
2. ⚪ [0.917] Employee training program for new hires
3. ❌ [0.905] Machine learning model training procedures and optimization
❌ FAIL: Ambiguity resolution (ML context)
Score difference: -0.014
────────────────────────────────────────────────────────────────────────────────
Test 2.3: Multiple Instructions
────────────────────────────────────────────────────────────────────────────────
Multiple intents: Find + Compare + Summarize
📝 Query: "Find, compare and summarize articles about quantum computing"
📄 Documents:
1. ✅ [0.977] Quantum computing articles comparison summary: top papers analyzed
2. ⚪ [0.966] Quantum computing summary and overview
3. ⚪ [0.962] Quantum computing research articles and publications
4. ⚪ [0.704] GPT-3 vs GPT-4 comparison summary
✅ PASS: Multiple intentions handling
Score difference: 0.000
────────────────────────────────────────────────────────────────────────────────
Test 2.4: Formal vs Informal Nuances
────────────────────────────────────────────────────────────────────────────────
Formal query → Formal doc: 0.969
Formal query → Informal doc: 0.962
Informal query → Formal doc: 0.883
Informal query → Informal doc: 0.937
✅ PASS: Formality awareness
================================================================================
⚠️ PART 3: Edge Cases and Failure Modes
================================================================================
────────────────────────────────────────────────────────────────────────────────
Test 3.1: Spelling Errors
────────────────────────────────────────────────────────────────────────────────
Query with typos: 'Explan', 'nural', 'netwrks', 'wrk'
📝 Query: "Explan how nural netwrks wrk"
📄 Documents:
1. ⚪ [0.601] How to install neural network frameworks
2. ❌ [0.577] Neural networks explanation tutorial and comprehensive guide
3. ⚪ [0.565] Neural network architecture technical specifications
❌ FAIL: Typo robustness
Score difference: -0.023
────────────────────────────────────────────────────────────────────────────────
Test 3.2: Very Long and Complex Query
────────────────────────────────────────────────────────────────────────────────
Very long query (71 words) with multiple intents
📝 Query: "I need to find comprehensive research articles and academic papers that provide
a detailed explanation and thorough comparison of different neural network
architectures, specifically comparing convolutional neural networks, recurrent
neural networks, and transformer-based models, with a focus on their practical
applications in natural language processing, computer vision, and time series
prediction tasks, including performance benchmarks and computational efficiency
analysis."
📄 Documents:
1. ⚪ [0.963] Deep learning frameworks installation guide
2. ⚪ [0.958] Neural networks overview and basic introduction
3. ❌ [0.898] Neural network architectures comparison: CNN, RNN, Transformers for NLP, vision, time series
❌ FAIL: Long query handling
Score difference: -0.065
────────────────────────────────────────────────────────────────────────────────
Test 3.3: Contradictory Instructions
────────────────────────────────────────────────────────────────────────────────
Contradictory: 'in detail' vs 'keep it brief'
📝 Query: "Explain in detail but keep it brief"
📄 Documents:
1. ⚪ [0.952] Quick overview and brief summary of the topic
2. ⚪ [0.941] Comprehensive detailed explanation with examples
3. ❌ [0.924] Medium-length explanation with key points
❌ FAIL: Contradiction handling (balanced)
Score difference: -0.029
────────────────────────────────────────────────────────────────────────────────
Test 3.4: Non-Latin Scripts
────────────────────────────────────────────────────────────────────────────────
Arabic query → English documents
📝 Query: "اشرح كيف تعمل الشبكات العصبية"
📄 Documents:
1. ⚪ [0.961] شبكات عصبية معمارية عامة
2. ❌ [-0.445] Neural networks explanation tutorial comprehensive guide
3. ⚪ [-0.474] Neural network training procedures
Russian query → English documents
📝 Query: "Объясни, как работают нейронные сети"
📄 Documents:
1. ⚪ [0.982] Нейронные сети архитектура обзор
2. ❌ [-0.234] Neural networks explanation tutorial comprehensive guide
3. ⚪ [-0.242] Neural network training procedures
Chinese query → English documents
📝 Query: "解释神经网络如何工作"
📄 Documents:
1. ⚪ [0.973] 神经网络架构概述
2. ⚪ [-0.629] Neural network training procedures
3. ❌ [-0.642] Neural networks explanation tutorial comprehensive guide
❌ FAIL: Non-Latin script support
Arabic: ❌ | Russian: ❌ | Chinese: ❌
================================================================================
📊 PART 4: Performance Degradation Analysis
================================================================================
Progressive difficulty test:
🔴 1. Simple EN instruction
Score: 0.934 | Margin: -0.010
🔴 2. Cross-lingual FR→EN
Score: 0.590 | Margin: -0.002
🔴 3. Cross-lingual with typos
Score: 0.578 | Margin: 0.011
🔴 4. Long cross-lingual query
Score: 0.569 | Margin: 0.024
📉 Performance Degradation:
Cross-lingual FR→EN: -0.343 (36.8% drop)
Cross-lingual with typos: -0.356 (38.1% drop)
Long cross-lingual query: -0.365 (39.0% drop)
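The drops above are measured against the 0.934 score of the simple EN instruction. A minimal sketch of that arithmetic follows, using the rounded scores printed in this log, so the last digit can differ slightly from the script's unrounded values; `relative_drop` is a hypothetical helper name.

```python
def relative_drop(baseline, score):
    """Return (absolute change, percentage drop) of score relative to baseline."""
    delta = score - baseline
    return delta, -delta / baseline * 100.0


baseline = 0.934  # score of "Simple EN instruction"
for label, score in [
    ("Cross-lingual FR->EN", 0.590),
    ("Cross-lingual with typos", 0.578),
    ("Long cross-lingual query", 0.569),
]:
    delta, pct = relative_drop(baseline, score)
    print(f"{label}: {delta:.3f} ({pct:.1f}% drop)")
```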
================================================================================
📈 FINAL SUMMARY: Limits and Capabilities
================================================================================
╔══════════════════════════════════════════════════════════════════════════════╗
║                             TEST RESULTS SUMMARY                             ║
╚══════════════════════════════════════════════════════════════════════════════╝
✅ STRENGTHS (What Works Well):
🤔 Difficult Cases: 75% pass rate
• Negative instructions: ✅
• Ambiguity resolution: ❌
• Multiple intentions: ✅
• Formality matching: ✅
⚠️ LIMITATIONS (Where It Struggles):
🌍 Cross-Lingual Instruction-Awareness: 0% pass rate
• FR→EN: ❌
• EN→FR: ❌
• Multilingual: ❌
⚠️ Edge Cases: 0% pass rate
• Spelling errors: ❌
• Very long queries: ❌
• Contradictions: ❌
• Non-Latin scripts: ❌
📉 Performance Degradation:
• Cross-lingual FR→EN: -36.8% from baseline
• Cross-lingual with typos: -38.1% from baseline
• Long cross-lingual query: -39.0% from baseline
🎯 RECOMMENDATIONS FOR HUGGINGFACE DOCUMENTATION:
1. ⚠️ WARN: Cross-lingual instruction-awareness failed all tests (0%)
2. ✅ HIGHLIGHT: Handles difficult cases well (75%)
3. ⚠️ WARN: Poor edge case performance (0%)
4. ⚠️ WARN: Performance degrades with complexity
5. ⚠️ WARN: Non-Latin script queries failed in all tested languages (Arabic, Russian, Chinese)
💡 HONEST ASSESSMENT:
This model passed most of the English difficult-case tests (negative,
multi-intent, and formality-sensitive queries) but shows limitations with:
- Cross-lingual instruction matching (FR→EN, EN→FR, multilingual: 0% pass rate)
- Non-Latin scripts (Arabic, Chinese, Russian)
- Very complex or contradictory queries
- Spelling errors (expected document missed first place by 0.023)
Best use: English instruction-aware search and RAG systems
Not ideal: Cross-lingual queries, non-Latin languages, highly noisy input
💾 Saving detailed results to test_results.json...
Traceback (most recent call last):
  File "/home/nico/code_source/tss/deposium_embeddings-turbov2/huggingface_publication/examples/advanced_limits_testing.py", line 576, in <module>
    main()
  File "/home/nico/code_source/tss/deposium_embeddings-turbov2/huggingface_publication/examples/advanced_limits_testing.py", line 570, in main
    json.dump(output, f, indent=2, ensure_ascii=False)
  File "/usr/lib/python3.10/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/usr/lib/python3.10/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/usr/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type bool is not JSON serializable
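Python's built-in `bool` serializes to JSON without error, so this `TypeError` would be consistent with a NumPy boolean (for example, the result of comparing two NumPy floats) ending up in the results dictionary. That diagnosis is an assumption, since the script itself is not shown. Under that assumption, one possible fix is a `default=` hook that unwraps NumPy-style scalars through their `.item()` method; `to_builtin` and `dumps_results` are hypothetical names.

```python
import json


def to_builtin(obj):
    """Fallback for json.dump/json.dumps: unwrap scalar wrappers via .item().

    NumPy scalars (bool, float32, int64, ...) expose .item(), which returns
    the equivalent built-in Python value. Anything else is rejected as usual.
    """
    item = getattr(obj, "item", None)
    if callable(item):
        return item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


def dumps_results(results):
    """Serialize a results dict with the same options the traceback shows."""
    return json.dumps(results, indent=2, ensure_ascii=False, default=to_builtin)
```

In the script, this would amount to adding `default=to_builtin` to the existing `json.dump(output, f, indent=2, ensure_ascii=False)` call. Note that `.item()` on a multi-element array raises an error, so arrays would need `.tolist()` instead.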