Financial Reports Hierarchical Classifier

This is a production-grade Hierarchical Cascade Classifier designed to categorize Global and European financial filings into 29 distinct classes. It powers the classification engine for FinancialReports.

🚀 Performance Highlights

| Metric | Score | Interpretation |
|---|---|---|
| Global Weighted F1 | 93.5% | State-of-the-art performance for unstructured financial text. |
| Top-2 Router Accuracy | 97.3% | The correct specialist is consulted 97.3% of the time. |
| Call Transcript Precision | 100% | Zero false positives for transcripts. |
| Delisting Precision | 100% | High-precision signal for critical negative corporate events. |

Detailed Performance by Filing Type

Scores based on a hold-out test set of ~5,500 documents.

| Filing Type | Precision | Recall | F1-Score |
|---|---|---|---|
| Interest Rate Update/Notice | 98.9% | 98.1% | 0.99 |
| Proxy Solicitation | 98.6% | 94.4% | 0.96 |
| Annual Report | 96.7% | 95.6% | 0.96 |
| Investor Presentation | 97.3% | 94.2% | 0.96 |
| Voting Results | 94.9% | 96.4% | 0.96 |
| Audit Report | 94.7% | 96.4% | 0.96 |
| Director's Dealing | 95.9% | 95.0% | 0.95 |
| Dividend Notice | 97.8% | 93.0% | 0.95 |
| Fund Factsheet | 96.0% | 94.4% | 0.95 |
| Net Asset Value (NAV) | 92.6% | 97.6% | 0.95 |
| Interim / Quarterly Report | 93.7% | 96.3% | 0.95 |
| AGM Information | 95.0% | 93.9% | 0.94 |
| Remuneration Info | 97.3% | 91.4% | 0.94 |
| Report Publication Announcement | 93.5% | 94.8% | 0.94 |
| Earnings Release | 93.2% | 94.0% | 0.94 |
| ESG / Sustainability Info | 96.3% | 90.7% | 0.93 |
| Governance Info | 97.1% | 89.5% | 0.93 |
| Capital/Financing Update | 97.0% | 89.2% | 0.93 |
| Call Transcript | 100.0% | 86.7% | 0.93 |
| Major Shareholding Notification | 93.0% | 92.5% | 0.93 |
| Board/Management Info | 91.8% | 93.4% | 0.93 |
| Transaction in Own Shares | 90.0% | 94.8% | 0.92 |
| Legal Proceedings | 92.7% | 90.3% | 0.91 |
| Regulatory Filings (Generic) | 89.3% | 93.2% | 0.91 |
| Management Reports | 90.8% | 88.2% | 0.89 |
| M&A Activity | 95.1% | 81.4% | 0.88 |
| Share Issue/Capital Change | 86.0% | 89.3% | 0.88 |
| Delisting Announcement | 100.0% | 75.9% | 0.86 |

πŸ—οΈ Architecture

The system uses a 2-Stage Soft-Routing Architecture to break the "Semantic Ceiling" often found in flat classifiers:

  1. Level 1 (The Router): A Jina-V3 embedding model feeds an XGBoost Router that predicts one of 8 Main Categories (e.g., "Financial Reporting", "Equity Info").
  2. Level 2 (The Specialists): The document is passed to the top-2 most likely Specialist Models, which compete to assign the final fine-grained label.
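
The routing logic itself is compact; the sketch below illustrates it under stated assumptions (`router`, `specialists`, and the feature vector are stand-ins for components bundled in the released inference wrapper, and the score-combination rule is one plausible choice rather than the exact production formula):

```python
import numpy as np

def route_and_classify(features, router, specialists, top_k=2):
    """Minimal sketch of 2-stage soft routing.

    features    : 1-D feature vector (embedding + engineered features)
    router      : fitted classifier with predict_proba over the main categories
    specialists : dict mapping category index -> fitted fine-grained classifier
    """
    x = features.reshape(1, -1)

    # Stage 1: the router scores the main categories; keep the top-k (top-2 in production).
    cat_probs = router.predict_proba(x)[0]
    top_cats = np.argsort(cat_probs)[-top_k:][::-1]

    # Stage 2: the corresponding specialists compete for the final fine-grained label.
    # Weighting the specialist score by router confidence is an illustrative choice.
    best = None
    for cat in top_cats:
        label_probs = specialists[cat].predict_proba(x)[0]
        label = int(np.argmax(label_probs))
        score = float(cat_probs[cat] * label_probs[label])
        if best is None or score > best["score"]:
            best = {"category": int(cat), "label": label, "score": score}
    return best
```

Consulting the two most likely specialists instead of only the first is what the 97.3% Top-2 Router Accuracy above refers to: the correct specialist is almost always among the pair that gets to vote.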

⚠️ Critical Usage Note: The "Wrapper Effect"

Financial documents are often massive (500+ pages) but must be truncated to fit into GPU memory for embedding. However, Document Length is a critical feature for distinguishing a full Annual Report from a short Press Release announcing it.

To achieve 93% accuracy, you must decouple the text you embed from the length feature you engineer:

  1. Embedding (GPU): Pass the truncated text (e.g., first 32k characters) to Jina-V3.
  2. Feature Vector (XGBoost): Calculate log1p(length) using the True Original Length of the document, not the truncated string length.

If you do not provide the original length, the model will assume the document is short and may misclassify massive Annual Reports as simple Press Releases.
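
A minimal sketch of that decoupling, assuming a sentence-transformers-style `encode` call for Jina-V3 (the names here are illustrative; the bundled wrapper performs the equivalent steps internally):

```python
import numpy as np

MAX_EMBED_CHARS = 32_000  # truncation budget for the GPU encoder

def build_feature_vector(full_text: str, jina_model) -> np.ndarray:
    """Embed the truncated text, but derive the length feature from the ORIGINAL document."""
    truncated = full_text[:MAX_EMBED_CHARS]

    # Embedding (GPU): only the truncated text is encoded.
    embedding = jina_model.encode([truncated])[0]

    # Feature vector (XGBoost): log1p of the true, untruncated length.
    length_feature = np.log1p(len(full_text))  # NOT np.log1p(len(truncated))

    return np.concatenate([embedding, [length_feature]])
```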

💻 Usage

```python
from huggingface_hub import snapshot_download
import sys

# 1. Download the model repository
model_path = snapshot_download(repo_id="FinancialReports/hierarchical-filing-classifier")

# 2. Add the path and import the bundled wrapper
sys.path.append(model_path)
from inference_wrapper import FinancialFilingClassifier

# 3. Initialize
classifier = FinancialFilingClassifier(model_path)

# 4. Scenario: a 2 MB Annual Report
real_doc_length = 2_500_000  # 2.5 million characters in the original document
truncated_text = "ACME CORP ANNUAL REPORT 2024... [Truncated at 32k chars]"

# 5. Predict. The wrapper must receive the TRUE original length so it can apply
#    log1p(original_length) for XGBoost instead of log1p(len(truncated_text)).
#    The parameter name below is assumed; check inference_wrapper.py in the
#    downloaded repository for the exact signature.
result = classifier.predict(
    text=truncated_text,
    original_length=real_doc_length,
)

print(result)
# Output:
# {
#   'category': 'Financial Reporting',
#   'label': 'Annual Report',
#   'score': 0.985,
# }
```

📂 Taxonomy (29 Classes)

The model classifies documents into this hierarchy:

Financial Reporting
  • Annual Report
  • Earnings Release
  • Interim / Quarterly Report
  • Audit Report

Equity Information
  • Major Shareholding Notification
  • Transaction in Own Shares (Buyback)
  • Share Issue / Capital Change
  • Notice of Dividend Amount

Listing & Regulatory
  • Regulatory Filings (RNS)
  • Delisting Announcement
  • Prospectus
  • Registration Form

AGM Information
  • AGM Information (Pre/Post)
  • Voting Results
  • Proxy Solicitation

Management
  • Director's Dealing
  • Management Reports
  • Remuneration Info
  • Board Changes

Investor Comm
  • Investor Presentation
  • Call Transcript
  • Report Publication Announcement

M&A and Legal
  • M&A Activity
  • Legal Proceedings Report

Debt Information
  • Capital/Financing Update
  • Interest Rate Notice

Investment Vehicle
  • Net Asset Value (NAV)
  • Fund Factsheet

📜 The Standard: Financial Reporting Classification Framework (FRCF)

The taxonomy used by this model is based on the Financial Reporting Classification Framework (FRCF), an open-source standard designed to organize corporate disclosures in a consistent, cross-jurisdictional format.

Unlike fragmented regulatory schemes, the FRCF organizes disclosures by functional purpose, ensuring comparability across markets (e.g., mapping a US 10-K and a European Annual Financial Report to the same standardized Annual Report category).
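
For illustration only, such a mapping can be expressed as a simple lookup from jurisdiction-specific form types to FRCF classes (the dictionary below is a hypothetical example, not an official FRCF artifact):

```python
# Hypothetical cross-jurisdictional mapping onto standardized FRCF classes.
FRCF_MAPPING = {
    "US 10-K": "Annual Report",
    "EU Annual Financial Report": "Annual Report",
    "US 10-Q": "Interim / Quarterly Report",
    "UK RNS Announcement": "Regulatory Filings (RNS)",
}
```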

📚 Training Data

The model was trained on a proprietary Golden Dataset of 27,671 financial filings, manually curated to represent the diverse landscape of global corporate reporting.

  • Source: Real-world filings from listed companies across Europe (primary focus), North America, and Asia.
  • Multilingual: Includes documents in English, French, German, and other major European languages (leveraging the multilingual capabilities of Jina-V3).
  • Diversity: The dataset preserves the natural "long-tail" distribution of financial data, ranging from massive 500+ page Annual Reports to single-page Press Releases and complex ESG Disclosures.
  • Quality Control: Mapped to a strict 2-level hierarchy to resolve semantic ambiguities common in regulatory filings (e.g., distinguishing a Share Buyback announcement from a Director's Dealing notification).

βš™οΈ Deployment & Hardware

This model is optimized for GPU Inference due to the heavy 8192-token context window of the Jina encoder. While CPU inference is possible, it is significantly slower.

Recommended Configuration

| Component | Recommendation | Notes |
|---|---|---|
| GPU | NVIDIA T4 (16 GB) | The "sweet spot" for cost/performance; capable of ~50 docs/sec in batch mode. |
| Alternative | NVIDIA L4 / A10 | Recommended for high-concurrency production APIs. |
| VRAM | 16 GB minimum | Required to embed long documents without OOM errors. |
| System RAM | 16 GB+ | Standard requirement for PyTorch + XGBoost overhead. |

Critical Environment Settings

To load the underlying Jina-V3 model, you must enable remote code execution via an environment variable in your runtime (Docker, Kubernetes, or Hugging Face Endpoints):

```
HF_TRUST_REMOTE_CODE=True
```
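
If you load the Jina-V3 encoder yourself rather than relying on the environment variable, the equivalent switch in Python is the `trust_remote_code` argument (a minimal sketch; the bundled inference wrapper normally handles this for you):

```python
from transformers import AutoModel

# Jina-V3 ships custom modeling code, so loading it requires trust_remote_code=True.
encoder = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)
```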

Throughput Benchmarks (T4 GPU)

  • Live API Latency: ~200ms – 500ms per document.
  • Batch Processing: ~40 – 50 documents per second (Batch Size: 64).
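
A rough sketch of batch mode is shown below; the `predict_batch` method is assumed for illustration (check `inference_wrapper.py` for the actual batch API, or fall back to looping over `predict`):

```python
BATCH_SIZE = 64  # matches the benchmark configuration above

def classify_corpus(classifier, texts):
    """Process documents in fixed-size batches to keep the GPU saturated."""
    results = []
    for start in range(0, len(texts), BATCH_SIZE):
        batch = texts[start:start + BATCH_SIZE]
        # Hypothetical batch call; substitute a loop over classifier.predict(...)
        # if the wrapper only exposes single-document inference.
        results.extend(classifier.predict_batch(batch))
    return results
```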