
================================================================================
  🧪 ADVANCED LIMITS TESTING: qwen25-deposium-1024d
================================================================================

🔄 Loading model...
✅ Model loaded!


================================================================================
  🌍 PART 1: Cross-Lingual Instruction-Awareness
================================================================================

────────────────────────────────────────────────────────────────────────────────
  Test 1.1: FR Question → EN Documents
────────────────────────────────────────────────────────────────────────────────

Can the model understand FR 'Explique' → EN 'explanation tutorial'?

📝 Query: "Explique comment fonctionnent les réseaux de neurones"

📄 Documents:
  1. ⚪ [0.741] Comment installer TensorFlow sur Ubuntu
  2. ❌ [0.674] Neural networks explanation tutorial and comprehensive guide
  3. ⚪ [0.671] Neural network architecture overview and history

❌ FAIL: Cross-lingual instruction matching
   Score difference: -0.067
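
The "Score difference" values in this log are consistent with the expected document's score minus the highest score in the list, which gives 0.000 when the expected document ranks first. The sketch below reproduces the number under that reading. It is an inference from the printed values, not the script's confirmed logic, and `score_difference` is a hypothetical helper name.

```python
def score_difference(scores, expected_index):
    """Expected document's score minus the top score in the list.

    Zero means the expected document is ranked first (or tied for first);
    a negative value means another document outranks it.
    """
    return scores[expected_index] - max(scores)


# Scores from Test 1.1; index 1 is the document marked as the expected match.
diff = score_difference([0.741, 0.674, 0.671], expected_index=1)
print(f"Score difference: {diff:.3f}")
```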

────────────────────────────────────────────────────────────────────────────────
  Test 1.2: EN Question → FR Documents
────────────────────────────────────────────────────────────────────────────────

Can the model understand EN 'Find articles' → FR 'Articles ... publications'?

📝 Query: "Find articles about climate change"

📄 Documents:
  1. ⚪ [0.950] Climate change scientific research overview
  2. ❌ [0.737] Articles sur le changement climatique et publications scientifiques
  3. ⚪ [0.646] Le changement climatique est un problème majeur

❌ FAIL: Cross-lingual instruction matching
   Score difference: -0.213

────────────────────────────────────────────────────────────────────────────────
  Test 1.3: FR Question → Multilingual Documents
────────────────────────────────────────────────────────────────────────────────

FR 'Résume' → EN 'summary' (mixed FR/EN/ES/DE results)

📝 Query: "Résume les avantages de l'apprentissage profond"

📄 Documents:
  1. ⚪ [0.932] L'apprentissage profond est une technique d'IA
  2. ⚪ [0.881] Resumen de las ventajas del aprendizaje profundo
  3. ⚪ [0.838] Zusammenfassung der Vorteile des Deep Learning
  4. ❌ [0.534] Deep learning advantages summary: fast, accurate, scalable

❌ FAIL: Multilingual instruction matching
   Score difference: -0.398

================================================================================
  🤔 PART 2: Difficult and Ambiguous Cases
================================================================================

────────────────────────────────────────────────────────────────────────────────
  Test 2.1: Negative Instructions
────────────────────────────────────────────────────────────────────────────────

Does the model understand 'Avoid' correctly?

📝 Query: "Avoid using neural networks for this task"

📄 Documents:
  1. ✅ [0.969] Alternative methods to neural networks: decision trees, random forests
  2. ⚪ [0.969] When not to use machine learning algorithms
  3. ⚪ [0.958] Neural network implementation guide and tutorial

✅ PASS: Negative instruction understanding
   Score difference: 0.000

────────────────────────────────────────────────────────────────────────────────
  Test 2.2: Ambiguous Instructions
────────────────────────────────────────────────────────────────────────────────

'Train the model' - Does it default to ML context?

📝 Query: "Train the model"

📄 Documents:
  1. ⚪ [0.918] Train scheduling and railway timetables
  2. ⚪ [0.917] Employee training program for new hires
  3. ❌ [0.905] Machine learning model training procedures and optimization

❌ FAIL: Ambiguity resolution (ML context)
   Score difference: -0.014

────────────────────────────────────────────────────────────────────────────────
  Test 2.3: Multiple Instructions
────────────────────────────────────────────────────────────────────────────────

Multiple intents: Find + Compare + Summarize

📝 Query: "Find, compare and summarize articles about quantum computing"

📄 Documents:
  1. ✅ [0.977] Quantum computing articles comparison summary: top papers analyzed
  2. ⚪ [0.966] Quantum computing summary and overview
  3. ⚪ [0.962] Quantum computing research articles and publications
  4. ⚪ [0.704] GPT-3 vs GPT-4 comparison summary

✅ PASS: Multiple intentions handling
   Score difference: 0.000

────────────────────────────────────────────────────────────────────────────────
  Test 2.4: Formal vs Informal Nuances
────────────────────────────────────────────────────────────────────────────────

Formal query → Formal doc:   0.969
Formal query → Informal doc: 0.962
Informal query → Formal doc:   0.883
Informal query → Informal doc: 0.937

✅ PASS: Formality awareness
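
One plausible pass criterion for this test is that each query style scores its matching document style above the mismatched one. The log does not state the actual rule, so the check below is an assumption, and `formality_aware` is a hypothetical helper name.

```python
def formality_aware(scores):
    """Return True if every query style prefers the document of the same style.

    `scores` maps (query_style, doc_style) pairs to similarity scores.
    """
    pairs = (("formal", "informal"), ("informal", "formal"))
    return all(scores[(style, style)] > scores[(style, other)]
               for style, other in pairs)


# Similarities reported in Test 2.4.
scores = {
    ("formal", "formal"): 0.969,
    ("formal", "informal"): 0.962,
    ("informal", "formal"): 0.883,
    ("informal", "informal"): 0.937,
}
print("PASS" if formality_aware(scores) else "FAIL")
```

Under this rule the formal query passes by a margin of only 0.007, so the result is sensitive to small score changes.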

================================================================================
  ⚠️ PART 3: Edge Cases and Failure Modes
================================================================================

────────────────────────────────────────────────────────────────────────────────
  Test 3.1: Spelling Errors
────────────────────────────────────────────────────────────────────────────────

Query with typos: 'Explan', 'nural', 'netwrks', 'wrk'

📝 Query: "Explan how nural netwrks wrk"

📄 Documents:
  1. ⚪ [0.601] How to install neural network frameworks
  2. ❌ [0.577] Neural networks explanation tutorial and comprehensive guide
  3. ⚪ [0.565] Neural network architecture technical specifications

❌ FAIL: Typo robustness
   Score difference: -0.023

────────────────────────────────────────────────────────────────────────────────
  Test 3.2: Very Long and Complex Query
────────────────────────────────────────────────────────────────────────────────

Very long query (71 words) with multiple intents

📝 Query: "I need to find comprehensive research articles and academic papers that provide
    a detailed explanation and thorough comparison of different neural network
    architectures, specifically comparing convolutional neural networks, recurrent
    neural networks, and transformer-based models, with a focus on their practical
    applications in natural language processing, computer vision, and time series
    prediction tasks, including performance benchmarks and computational efficiency
    analysis."

📄 Documents:
  1. ⚪ [0.963] Deep learning frameworks installation guide
  2. ⚪ [0.958] Neural networks overview and basic introduction
  3. ❌ [0.898] Neural network architectures comparison: CNN, RNN, Transformers for NLP, vision, time series

❌ FAIL: Long query handling
   Score difference: -0.065

────────────────────────────────────────────────────────────────────────────────
  Test 3.3: Contradictory Instructions
────────────────────────────────────────────────────────────────────────────────

Contradictory: 'in detail' vs 'keep it brief'

📝 Query: "Explain in detail but keep it brief"

📄 Documents:
  1. ⚪ [0.952] Quick overview and brief summary of the topic
  2. ⚪ [0.941] Comprehensive detailed explanation with examples
  3. ❌ [0.924] Medium-length explanation with key points

❌ FAIL: Contradiction handling (balanced)
   Score difference: -0.029

────────────────────────────────────────────────────────────────────────────────
  Test 3.4: Non-Latin Scripts
────────────────────────────────────────────────────────────────────────────────

Arabic query → English documents

📝 Query: "اشرح كيف تعمل الشبكات العصبية"

📄 Documents:
  1. ⚪ [0.961] شبكات عصبية معمارية عامة
  2. ❌ [-0.445] Neural networks explanation tutorial comprehensive guide
  3. ⚪ [-0.474] Neural network training procedures

Russian query → English documents

📝 Query: "Объясни, как работают нейронные сети"

📄 Documents:
  1. ⚪ [0.982] Нейронные сети архитектура обзор
  2. ❌ [-0.234] Neural networks explanation tutorial comprehensive guide
  3. ⚪ [-0.242] Neural network training procedures

Chinese query → English documents

📝 Query: "解释神经网络如何工作"

📄 Documents:
  1. ⚪ [0.973] 神经网络架构概述
  2. ⚪ [-0.629] Neural network training procedures
  3. ❌ [-0.642] Neural networks explanation tutorial comprehensive guide

❌ FAIL: Non-Latin script support
   Arabic: ❌ | Russian: ❌ | Chinese: ❌
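
Negative scores such as -0.445 are possible if the bracketed values are cosine similarities, which range from -1 to 1. The log does not name the metric, so that is an assumption. The sketch below uses toy 2-D vectors rather than real embeddings to show how the sign arises.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (not model output): same direction, orthogonal, opposite.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

Read this way, a negative score means the query and document vectors point in broadly opposite directions, not merely that they are unrelated.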

================================================================================
  📊 PART 4: Performance Degradation Analysis
================================================================================

Progressive difficulty test:

🔴 1. Simple EN instruction
   Score: 0.934 | Margin: -0.010
🔴 2. Cross-lingual FR→EN
   Score: 0.590 | Margin: -0.002
🔴 3. Cross-lingual with typos
   Score: 0.578 | Margin: 0.011
🔴 4. Long cross-lingual query
   Score: 0.569 | Margin: 0.024

📉 Performance Degradation:
   Cross-lingual FR→EN: -0.343 (36.8% drop)
   Cross-lingual with typos: -0.356 (38.1% drop)
   Long cross-lingual query: -0.365 (39.0% drop)
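
The degradation figures look like each score's change relative to the "Simple EN instruction" baseline of 0.934. A minimal sketch of that arithmetic follows, assuming this is how the script derives them. Because it starts from the rounded scores printed above, its results can differ from the log in the last digit. `degradation` is a hypothetical helper name.

```python
def degradation(baseline, score):
    """Return (absolute change, percentage drop) relative to the baseline."""
    delta = score - baseline
    pct_drop = -delta / baseline * 100
    return delta, pct_drop


baseline = 0.934  # "Simple EN instruction" score
cases = [
    ("Cross-lingual FR→EN", 0.590),
    ("Cross-lingual with typos", 0.578),
    ("Long cross-lingual query", 0.569),
]
for name, score in cases:
    delta, pct = degradation(baseline, score)
    print(f"   {name}: {delta:.3f} ({pct:.1f}% drop)")
```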

================================================================================
  📈 FINAL SUMMARY: Limits and Capabilities
================================================================================

╔══════════════════════════════════════════════════════════════════════════════╗
║                          TEST RESULTS SUMMARY                                 ║
╚══════════════════════════════════════════════════════════════════════════════╝

✅ STRENGTHS (What Works Well):

  🤔 Difficult Cases: 75% pass rate
     • Negative instructions: ✅
     • Ambiguity resolution: ❌
     • Multiple intentions: ✅
     • Formality matching: ✅

⚠️ LIMITATIONS (Where It Struggles):

  🌍 Cross-Lingual Instruction-Awareness: 0% pass rate
     • FR→EN: ❌
     • EN→FR: ❌
     • Multilingual: ❌

  ⚠️ Edge Cases: 0% pass rate
     • Spelling errors: ❌
     • Very long queries: ❌
     • Contradictions: ❌
     • Non-Latin scripts: ❌
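
The per-category pass rates follow directly from the ✅/❌ marks: passes divided by tests run. A short sketch, with `pass_rate` as a hypothetical helper name:

```python
def pass_rate(results):
    """Percentage of tests in `results` (name -> bool) that passed."""
    return 100 * sum(bool(passed) for passed in results.values()) / len(results)


difficult_cases = {
    "Negative instructions": True,
    "Ambiguity resolution": False,
    "Multiple intentions": True,
    "Formality matching": True,
}
edge_cases = {
    "Spelling errors": False,
    "Very long queries": False,
    "Contradictions": False,
    "Non-Latin scripts": False,
}
print(f"Difficult Cases: {pass_rate(difficult_cases):.0f}% pass rate")
print(f"Edge Cases: {pass_rate(edge_cases):.0f}% pass rate")
```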

📉 Performance Degradation:

   • Cross-lingual FR→EN: -36.8% from baseline
   • Cross-lingual with typos: -38.1% from baseline
   • Long cross-lingual query: -39.0% from baseline

🎯 RECOMMENDATIONS FOR HUGGINGFACE DOCUMENTATION:

  1. ✅ HIGHLIGHT: Handles difficult cases well (75%)
  2. ⚠️ WARN: Cross-lingual instruction-awareness failed every test (0%)
  3. ⚠️ WARN: Edge cases failed every test (0%)
  4. ⚠️ WARN: Performance degrades with complexity
  5. ⚠️ WARN: Non-Latin script queries failed for Arabic, Russian, and Chinese

💡 HONEST ASSESSMENT:
   This model handles several English-only instruction cases (negative
   instructions, multiple intents, formality matching) but shows limitations with:
   - Cross-lingual queries (FR→EN, EN→FR, multilingual): 0% pass rate
   - Non-Latin scripts (Arabic, Chinese, Russian)
   - Ambiguous, very long, or contradictory queries
   - Spelling errors (a narrow miss, score difference -0.023)

   Best use: English instruction-aware search and RAG systems
   Not ideal: Cross-lingual or non-Latin retrieval, highly noisy input


💾 Saving detailed results to test_results.json...
Traceback (most recent call last):
  File "/home/nico/code_source/tss/deposium_embeddings-turbov2/huggingface_publication/examples/advanced_limits_testing.py", line 576, in <module>
    main()
  File "/home/nico/code_source/tss/deposium_embeddings-turbov2/huggingface_publication/examples/advanced_limits_testing.py", line 570, in main
    json.dump(output, f, indent=2, ensure_ascii=False)
  File "/usr/lib/python3.10/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/usr/lib/python3.10/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/usr/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type bool is not JSON serializable
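
Python's built-in `bool` is JSON serializable, so the object rejected here is most likely a NumPy boolean, for example the result of comparing two NumPy floats; recent NumPy versions report that class name as `bool`. One possible fix, sketched below under that assumption, is to pass a `default` hook to `json.dump`. `to_builtin` is a hypothetical helper, and `FakeNumpyBool` is a stand-in so the example runs without NumPy installed.

```python
import json


def to_builtin(obj):
    """Fallback for json.dump: unwrap NumPy-style scalars and arrays.

    NumPy scalars and arrays expose .tolist(), which returns plain
    Python values (bool, int, float, or nested lists).
    """
    tolist = getattr(obj, "tolist", None)
    if callable(tolist):
        return tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


class FakeNumpyBool:
    """Stand-in for a NumPy boolean (assumption: the real culprit behaves alike)."""

    def tolist(self):
        return True


output = {"test": "Negative instruction understanding", "passed": FakeNumpyBool()}
print(json.dumps(output, indent=2, ensure_ascii=False, default=to_builtin))
```

In the failing call this would read `json.dump(output, f, indent=2, ensure_ascii=False, default=to_builtin)`. An alternative is to wrap each comparison in `bool(...)` where the pass/fail flags are produced.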