"""Speech recognition service interface.
This module defines the interface for speech-to-text (STT) services that convert
audio content into textual representation. The interface supports multiple STT
models and providers with consistent error handling.
The interface is designed to be:
- Model-agnostic: Works with any STT implementation (Whisper, Parakeet, etc.)
- Language-aware: Handles multiple languages and dialects
- Error-resilient: Provides detailed error information for debugging
- Performance-conscious: Supports both batch and streaming transcription
"""
from abc import ABC, abstractmethod
from typing import TYPE_CHECKING

# Model classes are imported only for static type checking so they are not
# loaded at runtime.
if TYPE_CHECKING:
    from ..models.audio_content import AudioContent
    from ..models.text_content import TextContent


class ISpeechRecognitionService(ABC):
    """Interface for speech recognition services.

    This interface defines the contract for converting audio content to text
    using various STT models and providers. Implementations should handle
    different audio formats, languages, and quality levels.

    Example:
        ```python
        # Use through dependency injection
        stt_service = container.resolve(ISpeechRecognitionService)

        # Transcribe audio
        text_result = stt_service.transcribe(
            audio=audio_content,
            model="whisper-large"
        )
        print(f"Transcribed: {text_result.text}")
        print(f"Language: {text_result.language}")
        print(f"Confidence: {text_result.confidence}")
        ```
    """

    @abstractmethod
    def transcribe(self, audio: 'AudioContent', model: str) -> 'TextContent':
        """Transcribe audio content to text using the specified STT model.

        Converts audio data into a textual representation with language detection
        and confidence scoring. The method should handle various audio formats
        and quality levels gracefully.

        Implementation considerations:
        - Audio preprocessing (noise reduction, normalization)
        - Language detection and handling
        - Confidence scoring and quality assessment
        - Memory management for large audio files
        - Timeout handling for long audio content

        Args:
            audio: The audio content to transcribe. Must contain valid audio data
                in a supported format (WAV, MP3, FLAC, etc.) with an appropriate
                sample rate and duration.
            model: The STT model identifier to use for transcription. Examples:
                - "whisper-small": Fast, lower accuracy
                - "whisper-large": Slower, higher accuracy
                - "parakeet": Real-time optimized
                Must be supported by the implementation.

        Returns:
            TextContent: The transcription result containing:
                - text: The transcribed text content
                - language: Detected or specified language code
                - confidence: Overall transcription confidence (0.0-1.0)
                - metadata: Additional information such as word-level timestamps,
                  alternative transcriptions, and processing time

        Raises:
            SpeechRecognitionException: If transcription fails due to:
                - Unsupported audio format or quality
                - Model loading or inference errors
                - Network issues (for cloud-based models)
                - Insufficient system resources
            ValueError: If input parameters are invalid:
                - Empty or corrupted audio data
                - Unsupported model identifier
                - Invalid audio format specifications

        Example:
            ```python
            # Load audio file
            with open("speech.wav", "rb") as f:
                audio = AudioContent(
                    data=f.read(),
                    format="wav",
                    sample_rate=16000,
                    duration=30.0
                )

            # Transcribe with high-accuracy model
            try:
                result = service.transcribe(audio, "whisper-large")
                if result.confidence > 0.8:
                    print(f"High confidence: {result.text}")
                else:
                    print(f"Low confidence: {result.text} ({result.confidence:.2f})")
            except SpeechRecognitionException as e:
                print(f"Transcription failed: {e}")
            ```
        """
        pass
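

# ---------------------------------------------------------------------------
# Illustrative sketch only: a minimal stub implementation of the interface
# above, useful for tests or wiring checks. The class name, the supported
# model set, and the TextContent keyword arguments are assumptions based on
# the docstrings in this module, not part of the real service code.
# ---------------------------------------------------------------------------
class StubSpeechRecognitionService(ISpeechRecognitionService):
    """Hypothetical stub that returns a canned transcription result."""

    SUPPORTED_MODELS = {"whisper-small", "whisper-large", "parakeet"}

    def transcribe(self, audio: 'AudioContent', model: str) -> 'TextContent':
        # Import at call time; at module level the model classes are only
        # imported for static type checking (see the TYPE_CHECKING block).
        from ..models.text_content import TextContent

        # Input validation mirroring the ValueError cases documented above.
        if audio is None or not audio.data:
            raise ValueError("Empty or corrupted audio data")
        if model not in self.SUPPORTED_MODELS:
            raise ValueError(f"Unsupported model identifier: {model}")

        # A real implementation would run an STT engine here (Whisper,
        # Parakeet, a cloud API, ...); the stub returns a fixed result.
        return TextContent(
            text="stub transcription",
            language="en",
            confidence=1.0,
        )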