---
tags:
- financial-filings
- classification
- xgboost
- jina-embeddings-v3
- finance
- nlp
library_name: xgboost
metrics:
- f1: 0.935
- accuracy: 0.95
model-index:
- name: hierarchical-filing-classifier
results:
- task:
type: text-classification
name: Financial Document Classification
metrics:
- type: f1
value: 0.935
name: Weighted F1
- type: accuracy
value: 0.973
name: Top-2 Router Accuracy
---
# Financial Reports Hierarchical Classifier
This is a production-grade hierarchical cascade classifier that categorizes global and European financial filings into **29 distinct classes**. It powers the classification engine for **FinancialReports**.
## Performance Highlights
| Metric | Score | Interpretation |
| :--- | :--- | :--- |
| **Global Weighted F1** | **93.5%** | State-of-the-art performance for unstructured financial text. |
| **Top-2 Router Accuracy** | **97.3%** | The correct specialist is consulted 97.3% of the time. |
| **Call Transcript Precision** | **100%** | Zero false positives for transcripts. |
| **Delisting Precision** | **100%** | High-precision signal for critical negative corporate events. |
### Detailed Performance by Filing Type
*Scores based on a hold-out test set of ~5,500 documents.*
| Filing Type | Precision | Recall | F1-Score |
| :--- | :--- | :--- | :--- |
| **Interest Rate Update/Notice** | 98.9% | 98.1% | **0.99** |
| **Proxy Solicitation** | 98.6% | 94.4% | **0.96** |
| **Annual Report** | 96.7% | 95.6% | **0.96** |
| **Investor Presentation** | 97.3% | 94.2% | **0.96** |
| **Voting Results** | 94.9% | 96.4% | **0.96** |
| **Audit Report** | 94.7% | 96.4% | **0.96** |
| **Director's Dealing** | 95.9% | 95.0% | **0.95** |
| **Dividend Notice** | 97.8% | 93.0% | **0.95** |
| **Fund Factsheet** | 96.0% | 94.4% | **0.95** |
| **Net Asset Value (NAV)** | 92.6% | 97.6% | **0.95** |
| **Interim / Quarterly Report** | 93.7% | 96.3% | **0.95** |
| **AGM Information** | 95.0% | 93.9% | **0.94** |
| **Remuneration Info** | 97.3% | 91.4% | **0.94** |
| **Report Publication Announcement** | 93.5% | 94.8% | **0.94** |
| **Earnings Release** | 93.2% | 94.0% | **0.94** |
| **ESG / Sustainability Info** | 96.3% | 90.7% | **0.93** |
| **Governance Info** | 97.1% | 89.5% | **0.93** |
| **Capital/Financing Update** | 97.0% | 89.2% | **0.93** |
| **Call Transcript** | **100.0%** | 86.7% | **0.93** |
| **Major Shareholding Notification** | 93.0% | 92.5% | **0.93** |
| **Board/Management Info** | 91.8% | 93.4% | **0.93** |
| **Transaction in Own Shares** | 90.0% | 94.8% | **0.92** |
| **Legal Proceedings** | 92.7% | 90.3% | **0.91** |
| **Regulatory Filings (Generic)** | 89.3% | 93.2% | **0.91** |
| **Management Reports** | 90.8% | 88.2% | **0.89** |
| **M&A Activity** | 95.1% | 81.4% | **0.88** |
| **Share Issue/Capital Change** | 86.0% | 89.3% | **0.88** |
| **Delisting Announcement** | **100.0%** | 75.9% | **0.86** |
---
## Architecture
The system uses a **2-Stage Soft-Routing Architecture** to break the "Semantic Ceiling" often found in flat classifiers:
1. **Level 1 (The Router):** A Jina-V3 embedding model feeds an XGBoost Router that predicts one of 8 Main Categories (e.g., "Financial Reporting", "Equity Info").
2. **Level 2 (The Specialists):** The document is passed to the top-2 most likely Specialist Models, which compete to assign the final fine-grained label.
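The routing logic can be sketched as follows. This is a minimal illustration that assumes sklearn-style XGBoost classifiers and a simple product of router and specialist probabilities; the packaged `inference_wrapper.py` implements the production version.

```python
# Minimal sketch of the 2-stage soft-routing cascade (illustrative only).
# The model objects and the probability-combination rule are assumptions;
# the shipped inference_wrapper.py contains the production logic.
import numpy as np

def cascade_predict(features, router, specialists):
    """Route one feature vector through the top-2 specialist models."""
    x = np.asarray(features).reshape(1, -1)

    # Level 1: the router scores the 8 main categories.
    router_probs = router.predict_proba(x)[0]
    top2 = np.argsort(router_probs)[-2:][::-1]

    # Level 2: the two candidate specialists compete for the final label.
    best_category, best_label, best_score = None, None, -1.0
    for idx in top2:
        category = router.classes_[idx]
        spec = specialists[category]
        spec_probs = spec.predict_proba(x)[0]
        label_idx = int(np.argmax(spec_probs))
        # Weight the specialist's confidence by the router's confidence.
        score = float(router_probs[idx] * spec_probs[label_idx])
        if score > best_score:
            best_category = category
            best_label = spec.classes_[label_idx]
            best_score = score
    return {"category": best_category, "label": best_label, "score": best_score}
```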
## Critical Usage Note: The "Wrapper Effect"
Financial documents are often massive (500+ pages) but must be truncated to fit into GPU memory for embedding. However, **Document Length** is a critical feature for distinguishing a full *Annual Report* from a short *Press Release* announcing it.
**To achieve the reported ~93% accuracy, you must decouple the text you embed from the length feature you engineer:**
1. **Embedding (GPU):** Pass the truncated text (e.g., first 32k characters) to Jina-V3.
2. **Feature Vector (XGBoost):** Calculate `log1p(length)` using the **True Original Length** of the document, not the truncated string length.
*If you do not provide the original length, the model will assume the document is short and may misclassify massive Annual Reports as simple Press Releases.*
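A minimal sketch of that decoupling follows, assuming Jina-V3 is loaded directly via `transformers` and that the XGBoost feature vector is the embedding with `log1p(length)` appended; the exact feature layout used in training is handled for you by the packaged wrapper.

```python
# Embed the TRUNCATED text, but derive the length feature from the ORIGINAL
# document. The "embedding + appended log1p(length)" layout is an assumption
# for illustration; the shipped wrapper builds the real feature vector.
import numpy as np
from transformers import AutoModel

encoder = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v3", trust_remote_code=True
).to("cuda")

def build_features(full_text: str, max_chars: int = 32_000) -> np.ndarray:
    truncated = full_text[:max_chars]              # fits the GPU embedding pass
    embedding = encoder.encode([truncated])[0]     # 1024-dim Jina-V3 vector
    length_feature = np.log1p(len(full_text))      # TRUE length, not len(truncated)
    return np.concatenate([embedding, [length_feature]])
```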
## Usage
```python
from huggingface_hub import snapshot_download
import sys
# 1. Download Models
model_path = snapshot_download(repo_id="FinancialReports/hierarchical-filing-classifier")
# 2. Add path and import wrapper
sys.path.append(model_path)
from inference_wrapper import FinancialFilingClassifier
# 3. Initialize
classifier = FinancialFilingClassifier(model_path)
# 4. Scenario: a massive Annual Report (true length: 2.5 million characters)
real_doc_length = 2_500_000
truncated_text = "ACME CORP ANNUAL REPORT 2024... [Truncated at 32k chars]"

# 5. Predict -- pass the TRUE original length, not len(truncated_text).
#    The keyword name below is illustrative; check inference_wrapper.py for the
#    exact argument your version exposes. The wrapper should apply log1p to this
#    value (never to len(truncated_text)) before handing it to XGBoost.
result = classifier.predict(
    text=truncated_text,
    original_length=real_doc_length,
)
print(result)
# Output:
# {
# 'category': 'Financial Reporting',
# 'label': 'Annual Report',
# 'score': 0.985,
# }
```
## Taxonomy (29 Classes)
The model classifies documents into this hierarchy:
| **Financial Reporting** | **Equity Information** | **Listing & Regulatory** |
| :--- | :--- | :--- |
| • Annual Report<br>• Earnings Release<br>• Interim / Quarterly Report<br>• Audit Report | • Major Shareholding Notification<br>• Transaction in Own Shares (Buyback)<br>• Share Issue / Capital Change<br>• Notice of Dividend Amount | • Regulatory Filings (RNS)<br>• Delisting Announcement<br>• Prospectus<br>• Registration Form |

| **AGM Information** | **Management** | **Investor Comm** |
| :--- | :--- | :--- |
| • AGM Information (Pre/Post)<br>• Voting Results<br>• Proxy Solicitation | • Director's Dealing<br>• Management Reports<br>• Remuneration Info<br>• Board Changes | • Investor Presentation<br>• Call Transcript<br>• Report Publication Announcement |

| **M&A and Legal** | **Debt Information** | **Investment Vehicle** |
| :--- | :--- | :--- |
| • M&A Activity<br>• Legal Proceedings Report | • Capital/Financing Update<br>• Interest Rate Notice | • Net Asset Value (NAV)<br>• Fund Factsheet |
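For programmatic use, the same hierarchy can be written as a plain mapping. The entries below are transcribed from the table above; the exact label strings emitted by the classifier may differ slightly, so treat this as a reference rather than the authoritative label set.

```python
# FRCF hierarchy as a main-category -> fine-grained-label mapping,
# transcribed from the taxonomy table above (reference only).
FRCF_TAXONOMY = {
    "Financial Reporting": ["Annual Report", "Earnings Release",
                            "Interim / Quarterly Report", "Audit Report"],
    "Equity Information": ["Major Shareholding Notification",
                           "Transaction in Own Shares (Buyback)",
                           "Share Issue / Capital Change",
                           "Notice of Dividend Amount"],
    "Listing & Regulatory": ["Regulatory Filings (RNS)", "Delisting Announcement",
                             "Prospectus", "Registration Form"],
    "AGM Information": ["AGM Information (Pre/Post)", "Voting Results",
                        "Proxy Solicitation"],
    "Management": ["Director's Dealing", "Management Reports",
                   "Remuneration Info", "Board Changes"],
    "Investor Comm": ["Investor Presentation", "Call Transcript",
                      "Report Publication Announcement"],
    "M&A and Legal": ["M&A Activity", "Legal Proceedings Report"],
    "Debt Information": ["Capital/Financing Update", "Interest Rate Notice"],
    "Investment Vehicle": ["Net Asset Value (NAV)", "Fund Factsheet"],
}
```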
## The Standard: Financial Reporting Classification Framework (FRCF)
The taxonomy used by this model is based on the **[Financial Reporting Classification Framework (FRCF)](https://financialreports.eu/financial-reporting-classification-framework/)**, an open-source standard designed to organize corporate disclosures in a consistent, cross-jurisdictional format.
Unlike fragmented regulatory schemes, the FRCF organizes disclosures by **functional purpose**, ensuring comparability across markets (e.g., mapping a US *10-K* and a European *Annual Financial Report* to the same standardized `Annual Report` category).
* **[Explore the Framework](https://financialreports.eu/financial-reporting-classification-framework/)**
* **[Download Methodology (PDF)](https://financialreports.eu/download/frcf-methodology.pdf)**
## Training Data
The model was trained on a proprietary **Golden Dataset of 27,671 financial filings**, manually curated to represent the diverse landscape of global corporate reporting.
* **Source:** Real-world filings from listed companies across **Europe (primary focus)**, North America, and Asia.
* **Multilingual:** Includes documents in English, French, German, and other major European languages (leveraging the multilingual capabilities of Jina-V3).
* **Diversity:** The dataset preserves the natural "long-tail" distribution of financial data, ranging from massive 500+ page **Annual Reports** to single-page **Press Releases** and complex **ESG Disclosures**.
* **Quality Control:** Mapped to a strict 2-level hierarchy to resolve semantic ambiguities common in regulatory filings (e.g., distinguishing a *Share Buyback* announcement from a *Director's Dealing* notification).
## Deployment & Hardware
This model is optimized for **GPU Inference** due to the heavy 8192-token context window of the Jina encoder. While CPU inference is possible, it is significantly slower.
### Recommended Configuration
| Component | Recommendation | Notes |
| :--- | :--- | :--- |
| **GPU** | **NVIDIA T4 (16GB)** | The "Sweet Spot" for cost/performance. Capable of ~50 docs/sec in batch mode. |
| **Alternative** | NVIDIA L4 / A10 | Recommended for high-concurrency production APIs. |
| **VRAM** | 16 GB Minimum | Required to embed long documents without OOM errors. |
| **System RAM** | 16 GB+ | Standard requirement for PyTorch + XGBoost overhead. |
### Critical Environment Settings
To load the underlying Jina-V3 model, you **must** enable remote code execution via an environment variable in your deployment environment (Docker, Kubernetes, or Hugging Face Endpoints):
```bash
HF_TRUST_REMOTE_CODE=True
```
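If you cannot set the variable at the container level, the same setting can be applied in-process before the wrapper loads the encoder; and if you load Jina-V3 yourself, passing `trust_remote_code=True` to `from_pretrained` is the equivalent.

```python
# Set the flag before any model loading happens in this process.
import os
os.environ["HF_TRUST_REMOTE_CODE"] = "True"
```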
### Throughput Benchmarks (T4 GPU)
* **Live API Latency:** ~200–500 ms per document.
* **Batch Processing:** ~40–50 documents per second (Batch Size: 64).
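For offline workloads, batching can be as simple as the loop below. The `predict_batch` method shown is hypothetical; check `inference_wrapper.py` for whatever batch entry point your version actually exposes, and fall back to per-document `predict` calls otherwise.

```python
# Illustrative offline batching loop. `predict_batch` is a HYPOTHETICAL
# method name; the fallback branch calls predict() once per document.
def classify_corpus(classifier, texts, batch_size=64):
    results = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        if hasattr(classifier, "predict_batch"):
            results.extend(classifier.predict_batch(chunk))
        else:
            results.extend(classifier.predict(text=t) for t in chunk)
    return results
```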