---
title: Spanish Embeddings Api
emoji: 🐨
colorFrom: green
colorTo: green
sdk: docker
pinned: false
---

# Multilingual & Legal Embeddings API

A high-performance FastAPI application providing access to **5 specialized embedding models** for Spanish, Catalan, English, and multilingual text. Each model has its own dedicated endpoint for optimal performance and clarity.

🌐 **Live API**: [https://aurasystems-spanish-embeddings-api.hf.space](https://aurasystems-spanish-embeddings-api.hf.space)

📖 **Interactive Docs**: [https://aurasystems-spanish-embeddings-api.hf.space/docs](https://aurasystems-spanish-embeddings-api.hf.space/docs)

## 🚀 Quick Start

### Basic Usage

```bash
# Test jina-v3 endpoint (multilingual, loads at startup)
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/jina-v3" \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Hello world", "Hola mundo"], "normalize": true}'

# Test Catalan RoBERTa endpoint
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/roberta-ca" \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Bon dia", "Com estàs?"], "normalize": true}'
```

## 📚 Available Models & Endpoints

| Endpoint | Model | Languages | Dimensions | Max Tokens | Loading Strategy |
|----------|--------|-----------|------------|------------|------------------|
| `/embed/jina-v3` | jinaai/jina-embeddings-v3 | Multilingual (30+) | 1024 | 8192 | **Startup** |
| `/embed/roberta-ca` | projecte-aina/roberta-large-ca-v2 | Catalan | 1024 | 512 | On-demand |
| `/embed/jina` | jinaai/jina-embeddings-v2-base-es | Spanish, English | 768 | 8192 | On-demand |
| `/embed/robertalex` | PlanTL-GOB-ES/RoBERTalex | Spanish Legal | 768 | 512 | On-demand |
| `/embed/legal-bert` | nlpaueb/legal-bert-base-uncased | English Legal | 768 | 512 | On-demand |

### Model Recommendations

- **🌍 General multilingual**: Use `/embed/jina-v3` - Best overall performance
- **🇪🇸 Spanish general**: Use `/embed/jina` - Excellent for Spanish/English
- **🇪🇸 Spanish legal**: Use `/embed/robertalex` - Specialized for legal texts
- **🏴 Catalan**: Use `/embed/roberta-ca` - Best for Catalan text
- **🇬🇧 English legal**: Use `/embed/legal-bert` - Specialized for legal documents
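
The recommendations above can be encoded as a small lookup table on the client side. The sketch below is illustrative only: the `pick_endpoint` helper and its `(language, domain)` key scheme are assumptions made for this example and are not part of the API.

```python
# Hypothetical client-side helper: maps a (language, domain) pair to one of
# the documented endpoint names. The key scheme is an assumption of this sketch.
ENDPOINTS = {
    ("es", "legal"): "robertalex",
    ("en", "legal"): "legal-bert",
    ("es", "general"): "jina",
    ("ca", "general"): "roberta-ca",
}


def pick_endpoint(language: str, domain: str = "general") -> str:
    """Return the recommended endpoint, falling back to multilingual jina-v3."""
    return ENDPOINTS.get((language, domain), "jina-v3")


print(pick_endpoint("ca"))           # roberta-ca
print(pick_endpoint("es", "legal"))  # robertalex
print(pick_endpoint("fr"))           # jina-v3
```

Falling back to `jina-v3` mirrors the "general multilingual" recommendation; a stricter client might prefer to raise an error for unknown pairs.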

## 🔗 API Endpoints

### Model-Specific Embedding Endpoints

Each model has its dedicated endpoint:

```
POST /embed/jina-v3      # Multilingual (startup model)
POST /embed/roberta-ca   # Catalan
POST /embed/jina         # Spanish/English
POST /embed/robertalex   # Spanish Legal
POST /embed/legal-bert   # English Legal
```

### Utility Endpoints

```
GET /         # API information
GET /health   # Health check and model status
GET /models   # List all models with specifications
```

## 📖 Usage Examples

### Python

```python
import requests

API_URL = "https://aurasystems-spanish-embeddings-api.hf.space"

# Example 1: Multilingual with Jina v3 (startup model - fastest)
response = requests.post(
    f"{API_URL}/embed/jina-v3",
    json={
        "texts": [
            "Hello world",      # English
            "Hola mundo",       # Spanish
            "Bonjour monde",    # French
            "こんにちは世界"      # Japanese
        ],
        "normalize": True
    }
)
result = response.json()
print(f"Jina v3: {result['dimensions']} dimensions")  # 1024

# Example 2: Catalan text with RoBERTa-ca
response = requests.post(
    f"{API_URL}/embed/roberta-ca",
    json={
        "texts": [
            "Bon dia, com estàs?",
            "Barcelona és una ciutat meravellosa",
            "M'agrada la cultura catalana"
        ],
        "normalize": True
    }
)
catalan_result = response.json()
print(f"Catalan: {catalan_result['dimensions']} dimensions")  # 1024

# Example 3: Spanish legal text with RoBERTalex
response = requests.post(
    f"{API_URL}/embed/robertalex",
    json={
        "texts": [
            "Artículo primero de la constitución",
            "El contrato será válido desde la fecha de firma",
            "La jurisprudencia establece que..."
        ],
        "normalize": True
    }
)
legal_result = response.json()
print(f"Spanish Legal: {legal_result['dimensions']} dimensions")  # 768

# Example 4: English legal text with Legal-BERT
response = requests.post(
    f"{API_URL}/embed/legal-bert",
    json={
        "texts": [
            "This agreement is legally binding",
            "The contract shall be governed by English law",
            "The party hereby agrees and covenants"
        ],
        "normalize": True
    }
)
english_legal_result = response.json()
print(f"English Legal: {english_legal_result['dimensions']} dimensions")  # 768

# Example 5: Spanish/English bilingual with Jina v2
response = requests.post(
    f"{API_URL}/embed/jina",
    json={
        "texts": [
            "Inteligencia artificial y machine learning",
            "Artificial intelligence and machine learning",
            "Procesamiento de lenguaje natural"
        ],
        "normalize": True
    }
)
bilingual_result = response.json()
print(f"Bilingual: {bilingual_result['dimensions']} dimensions")  # 768
```

### JavaScript/Node.js

```javascript
const API_URL = 'https://aurasystems-spanish-embeddings-api.hf.space';

// Function to get embeddings from specific endpoint
async function getEmbeddings(endpoint, texts) {
  const response = await fetch(`${API_URL}/embed/${endpoint}`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      texts: texts,
      normalize: true
    })
  });

  if (!response.ok) {
    throw new Error(`Error: ${response.status}`);
  }

  return await response.json();
}

// Usage examples
// Note: top-level await only works in an ES module (e.g. a .mjs file)
try {
  // Multilingual embeddings
  const multilingualResult = await getEmbeddings('jina-v3', [
    'Hello world',
    'Hola mundo',
    'Ciao mondo'
  ]);
  console.log('Multilingual dimensions:', multilingualResult.dimensions);

  // Catalan embeddings
  const catalanResult = await getEmbeddings('roberta-ca', [
    'Bon dia',
    'Com estàs?'
  ]);
  console.log('Catalan dimensions:', catalanResult.dimensions);

} catch (error) {
  console.error('Error:', error);
}
```

### cURL Examples

```bash
# Multilingual with Jina v3 (startup model)
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/jina-v3" \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["Hello", "Hola", "Bonjour"],
    "normalize": true
  }'

# Catalan with RoBERTa-ca
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/roberta-ca" \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["Bon dia", "Com estàs?"],
    "normalize": true
  }'

# Spanish legal with RoBERTalex
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/robertalex" \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["Artículo primero"],
    "normalize": true
  }'

# English legal with Legal-BERT
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/legal-bert" \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["This agreement is binding"],
    "normalize": true
  }'

# Spanish/English bilingual with Jina v2
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed/jina" \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["Texto en español", "Text in English"],
    "normalize": true
  }'
```

## 📋 Request/Response Schema

### Request Body

```json
{
  "texts": ["text1", "text2", "..."],
  "normalize": true,
  "max_length": null
}
```

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `texts` | array[string] | ✅ Yes | - | 1-50 texts to embed |
| `normalize` | boolean | No | `true` | L2-normalize embeddings |
| `max_length` | integer/null | No | `null` | Max tokens (model-specific limits) |
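
Checking a request against this schema on the client avoids a round trip for obviously invalid input. The following is a minimal sketch, assuming a hypothetical `build_payload` helper; the server performs its own validation, and its exact rules may differ from the checks shown here.

```python
from typing import List, Optional

MAX_TEXTS = 50  # per-request limit taken from the table above


def build_payload(
    texts: List[str],
    normalize: bool = True,
    max_length: Optional[int] = None,
) -> dict:
    """Validate inputs client-side and return a JSON-serializable request body."""
    if not 1 <= len(texts) <= MAX_TEXTS:
        raise ValueError(f"expected 1-{MAX_TEXTS} texts, got {len(texts)}")
    if not all(isinstance(text, str) for text in texts):
        raise TypeError("every item in texts must be a string")
    if max_length is not None and max_length <= 0:
        raise ValueError("max_length must be a positive integer or None")
    return {"texts": list(texts), "normalize": normalize, "max_length": max_length}


print(build_payload(["Hola mundo"]))
```

The returned dictionary can be passed directly as the `json=` argument of `requests.post`.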

### Response Body

```json
{
  "embeddings": [[0.123, -0.456, ...], [0.789, -0.012, ...]],
  "model_used": "jina-v3",
  "dimensions": 1024,
  "num_texts": 2
}
```
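
The `num_texts` and `dimensions` fields make it easy to sanity-check a response before using it. A small sketch of such a check follows; the `check_response` name is hypothetical, and the sample body uses toy 3-dimensional vectors rather than real model output.

```python
def check_response(body: dict) -> list:
    """Return the embeddings after checking they match the reported shape."""
    embeddings = body["embeddings"]
    if len(embeddings) != body["num_texts"]:
        raise ValueError("num_texts does not match the number of embeddings")
    for vector in embeddings:
        if len(vector) != body["dimensions"]:
            raise ValueError("embedding length does not match dimensions")
    return embeddings


# Toy response body with the same fields as the schema above
sample = {
    "embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "model_used": "jina-v3",
    "dimensions": 3,
    "num_texts": 2,
}
print(len(check_response(sample)))  # 2
```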

## ⚡ Performance & Limits

- **Maximum texts per request**: 50
- **Startup model**: `jina-v3` loads at startup (fastest response)
- **On-demand models**: Load on first request (~30-60s first time)
- **Typical response time**: 100-300ms after models are loaded
- **Memory optimization**: Automatic cleanup for large batches
- **CORS enabled**: Works from any domain
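
Because a request may carry at most 50 texts, larger collections have to be split on the client. The sketch below shows one way to do that; `chunked`, `embed_all`, and `fake_embed_batch` are hypothetical names, and the fake function stands in for a real HTTP call so the example runs offline.

```python
from typing import Callable, Iterator, List

MAX_TEXTS = 50  # documented per-request limit


def chunked(texts: List[str], size: int = MAX_TEXTS) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` texts."""
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


def embed_all(
    texts: List[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
) -> List[List[float]]:
    """Embed any number of texts by calling `embed_batch` once per chunk."""
    vectors: List[List[float]] = []
    for batch in chunked(texts):
        vectors.extend(embed_batch(batch))
    return vectors


# Stand-in for a real HTTP call: returns a one-dimensional "embedding"
# holding each text's length, so the sketch needs no network access.
def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    assert len(batch) <= MAX_TEXTS
    return [[float(len(text))] for text in batch]


texts = [f"doc {i}" for i in range(120)]
print([len(batch) for batch in chunked(texts)])  # [50, 50, 20]
print(len(embed_all(texts, fake_embed_batch)))   # 120
```

In real use, `embed_batch` would POST one chunk to an `/embed/...` endpoint and return the `embeddings` field. For on-demand models, the first chunk may take noticeably longer while the model loads.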

## 🔧 Advanced Usage

### LangChain Integration

```python
from typing import List

import requests

try:
    from langchain_core.embeddings import Embeddings  # current LangChain releases
except ImportError:
    from langchain.embeddings.base import Embeddings  # older LangChain releases


class MultilingualEmbeddings(Embeddings):
    """LangChain integration for multilingual embeddings"""

    def __init__(self, endpoint: str = "jina-v3"):
        """
        Initialize with specific endpoint

        Args:
            endpoint: One of "jina-v3", "roberta-ca", "jina", "robertalex", "legal-bert"
        """
        self.api_url = f"https://aurasystems-spanish-embeddings-api.hf.space/embed/{endpoint}"
        self.endpoint = endpoint

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        embeddings: List[List[float]] = []
        # The API accepts at most 50 texts per request, so send them in batches
        for start in range(0, len(texts), 50):
            response = requests.post(
                self.api_url,
                json={"texts": texts[start:start + 50], "normalize": True}
            )
            response.raise_for_status()
            embeddings.extend(response.json()["embeddings"])
        return embeddings

    def embed_query(self, text: str) -> List[float]:
        return self.embed_documents([text])[0]


# Usage examples
multilingual_embeddings = MultilingualEmbeddings("jina-v3")
catalan_embeddings = MultilingualEmbeddings("roberta-ca")
spanish_legal_embeddings = MultilingualEmbeddings("robertalex")
```

### Semantic Search

```python
import numpy as np
import requests
from typing import List, Tuple


def semantic_search(
    query: str, documents: List[str], endpoint: str = "jina-v3", top_k: int = 5
) -> List[Tuple[int, float]]:
    """Semantic search using specific model endpoint"""

    # The query and documents share one request, so together they must
    # stay within the 50-text limit
    response = requests.post(
        f"https://aurasystems-spanish-embeddings-api.hf.space/embed/{endpoint}",
        json={"texts": [query] + documents, "normalize": True}
    )
    response.raise_for_status()

    embeddings = np.array(response.json()["embeddings"])
    query_embedding = embeddings[0]
    doc_embeddings = embeddings[1:]

    # Calculate cosine similarities (already normalized)
    similarities = np.dot(doc_embeddings, query_embedding)
    top_indices = np.argsort(similarities)[::-1][:top_k]

    return [(int(idx), float(similarities[idx])) for idx in top_indices]


# Example: Multilingual search
documents = [
    "Python programming language",
    "Lenguaje de programación Python",
    "Llenguatge de programació Python",
    "Language de programmation Python"
]

results = semantic_search("código en Python", documents, "jina-v3")
for idx, score in results:
    print(f"{score:.4f}: {documents[idx]}")
```
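
The search above uses a plain dot product because, with `"normalize": true`, the returned vectors have unit length, and for unit vectors the dot product equals the cosine similarity. The standard-library sketch below illustrates that identity on toy 2-dimensional vectors; the helper names are illustrative, not part of the API.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
print(round(cosine(a, b), 6))                           # 0.96
print(round(dot(l2_normalize(a), l2_normalize(b)), 6))  # 0.96
```

If a request is sent with `"normalize": false`, the dot product is no longer a cosine similarity and the vectors should be normalized on the client first.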

## 🚨 Error Handling

### HTTP Status Codes

| Code | Description |
|------|-------------|
| 200 | Success |
| 400 | Bad Request (validation error) |
| 422 | Unprocessable Entity (schema error) |
| 500 | Internal Server Error (model loading failed) |
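
A client can branch on these codes to decide whether the request itself needs fixing. The sketch below is one possible grouping; the `classify_status` function and its category labels are assumptions made for illustration, not behavior defined by the API.

```python
def classify_status(code: int) -> str:
    """Group the documented status codes into coarse client-side categories."""
    if code == 200:
        return "ok"
    if code in (400, 422):
        # Validation or schema problem: the request body must change
        return "fix_request"
    if code >= 500:
        # Server-side failure, e.g. a model that could not be loaded
        return "server_error"
    return "unexpected"


for code in (200, 400, 422, 500):
    print(code, classify_status(code))
```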

### Common Errors

```python
import requests

# Handle errors properly
try:
    response = requests.post(
        "https://aurasystems-spanish-embeddings-api.hf.space/embed/jina-v3",
        json={"texts": ["text"], "normalize": True}
    )
    response.raise_for_status()
    result = response.json()
except requests.exceptions.HTTPError as e:
    # Raised by raise_for_status() for 4xx/5xx responses
    print(f"HTTP error: {e}")
    print(f"Response: {e.response.text}")
except requests.exceptions.RequestException as e:
    # Connection problems, timeouts, and other transport-level failures
    print(f"Request error: {e}")
```

## 📊 Model Status Check

```python
import requests

# Check which models are loaded
health = requests.get("https://aurasystems-spanish-embeddings-api.hf.space/health")
status = health.json()

print(f"API Status: {status['status']}")
print(f"Startup model loaded: {status['startup_model_loaded']}")
print(f"Available models: {status['available_models']}")
print(f"Models loaded: {status['models_count']}/5")

# Check endpoint status
for model, endpoint_status in status['endpoints'].items():
    print(f"{model}: {endpoint_status}")
```

## 🔒 Authentication & Rate Limits

- **Authentication**: None required (open API)
- **Rate limits**: Generous limits on Hugging Face Spaces
- **CORS**: Enabled for all origins
- **Usage**: Free for research and commercial use

## 🏗️ Architecture

### Endpoint-Per-Model Design

- **Startup model**: `jina-v3` loads at application startup for fastest response
- **On-demand loading**: Other models load when first requested
- **Memory optimization**: Progressive loading reduces startup time
- **Model caching**: Once loaded, models remain in memory for fast inference
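
The loading strategy described above (one model loaded eagerly, the rest on first use, all cached afterwards) can be sketched with a small registry. This illustrates the pattern only and is not the server's actual code; `ModelRegistry`, `make_loader`, and the placeholder "models" are hypothetical.

```python
from typing import Callable, Dict, List


class ModelRegistry:
    """Loads one model eagerly and the rest on first use, then caches them."""

    def __init__(self, loaders: Dict[str, Callable[[], object]], startup: str):
        self._loaders = loaders
        self._cache: Dict[str, object] = {}
        self.get(startup)  # eager load of the startup model

    def get(self, name: str) -> object:
        if name not in self._loaders:
            raise KeyError(f"unknown model: {name}")
        if name not in self._cache:
            # Slow path: runs only the first time a model is requested
            self._cache[name] = self._loaders[name]()
        return self._cache[name]

    def loaded(self) -> List[str]:
        return sorted(self._cache)


calls: List[str] = []


def make_loader(name: str) -> Callable[[], object]:
    def load() -> object:
        calls.append(name)        # record each (expensive) load
        return f"<model {name}>"  # placeholder for a real model object
    return load


names = ["jina-v3", "roberta-ca", "jina", "robertalex", "legal-bert"]
registry = ModelRegistry({n: make_loader(n) for n in names}, startup="jina-v3")
print(registry.loaded())    # ['jina-v3']
registry.get("roberta-ca")
registry.get("roberta-ca")  # served from the cache, no second load
print(calls)                # ['jina-v3', 'roberta-ca']
```

A production version serving concurrent requests would also need a lock around the slow path so the same model is not loaded twice in parallel.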

### Technical Stack

- **FastAPI**: Modern async web framework
- **Transformers**: Hugging Face model library
- **PyTorch**: Deep learning backend
- **Docker**: Containerized deployment
- **Hugging Face Spaces**: Cloud hosting platform

## 📄 Model Licenses

- **Jina models**: Apache 2.0
- **RoBERTa models**: MIT/Apache 2.0
- **Legal-BERT**: Apache 2.0

## 🤝 Support & Contributing

- **Issues**: [Hugging Face Discussions](https://huggingface.co/spaces/AuraSystems/spanish-embeddings-api/discussions)
- **Interactive Docs**: [FastAPI Swagger UI](https://aurasystems-spanish-embeddings-api.hf.space/docs)
- **Model Papers**: Check individual model pages on Hugging Face

---

Built with ❤️ using **FastAPI** and **Hugging Face Transformers**