license: apache-2.0
language: en
tags:
  - text-classification
  - financial-text
  - boilerplate-detection
  - analyst-reports
  - transformers
pipeline_tag: text-classification
widget:
  - text: >-
      EEA - The securities and related financial instruments described herein
      may not be eligible for sale in all jurisdictions or to certain categories
      of investors.
    example_title: Legal Disclaimer
  - text: >-
      This report contains forward-looking statements that involve risks and
      uncertainties regarding future events.
    example_title: Forward-Looking Statement
  - text: >-
      Our revenue increased by 15% compared to last quarter due to strong demand
      in emerging markets.
    example_title: Business Performance
  - text: >-
      The information contained herein is confidential and proprietary and may
      not be disclosed without written permission.
    example_title: Confidentiality Notice
  - text: >-
      We launched three innovative products this quarter that exceeded our
      initial sales projections by 40%.
    example_title: Product Update

Boilerplate Detection Model for Financial Documents

This model detects boilerplate (formulaic/repetitive) text in financial analyst reports, distinguishing it from substantive business content.

Model Description

The model was developed to support the analysis of corporate culture discussions in analyst reports. It filters out standardized boilerplate content, including legal disclaimers, forward-looking statements, and other formulaic language.

Research Context

This model was developed as part of the research paper "Dissecting Corporate Culture Using Generative AI" to preprocess analyst reports for culture analysis. It identifies boilerplate segments so that they can be removed before they add noise to the analysis of substantive content.

Training Methodology

Data Collection:
- 2.4 million analyst reports from Thomson One's Investext (2000-2020)
- Reports from the top 20 brokers by volume were analyzed systematically
Training Data:
- Positive examples (boilerplate): Top 10% most frequently repeated segments per broker-year, appearing ≥5 times
- Negative examples: Randomly selected non-repeated segments
- Dataset: 547,790 examples (54,779 boilerplate, 493,011 non-boilerplate)
- Split: 80/10/10 for train/validation/test
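
The labeling rule for positive examples can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `boilerplate_candidates`, the choice to rank distinct segments by raw repeat count, and the handling of ties are all hypothetical.

```python
from collections import Counter


def boilerplate_candidates(segments, top_percent=10, min_count=5):
    """Pick likely boilerplate from the segments of ONE broker-year.

    `segments` holds every segment from that broker-year, duplicates included.
    A segment qualifies if it ranks in the top `top_percent` of distinct
    segments by repeat count AND occurs at least `min_count` times.
    """
    counts = Counter(segments)
    ranked = counts.most_common()  # distinct segments, most frequent first
    # Integer arithmetic avoids floating-point surprises at the cutoff
    cutoff = max(1, len(ranked) * top_percent // 100)
    return {segment for segment, n in ranked[:cutoff] if n >= min_count}
```

Applying both conditions together means a small broker-year with little repetition yields no positive examples, even though some segment is always the most frequent one.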
Architecture Design:
- Embedding Layer: Frozen sentence-transformers/all-mpnet-base-v2
- Pooling: Mean pooling over token embeddings
- Classification Head: Lightweight 3-layer MLP (768 → 16 → 8 → 2)
- Strategy: Frozen embeddings preserve semantic understanding while classification head learns boilerplate patterns
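
A rough PyTorch sketch of the pooling step and classification head described above. The class name `BoilerplateHead`, the ReLU activations, and the absence of dropout are assumptions; the card only specifies the layer sizes and mean pooling. Random tensors stand in for the frozen encoder's token embeddings so that the block runs without downloading any weights.

```python
import torch
from torch import nn


def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over real (non-padding) tokens."""
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # guard against division by zero
    return summed / counts


class BoilerplateHead(nn.Module):
    """Hypothetical 768 -> 16 -> 8 -> 2 MLP; the ReLU activations are assumed."""

    def __init__(self, hidden_size=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 2),
        )

    def forward(self, token_embeddings, attention_mask):
        return self.mlp(mean_pool(token_embeddings, attention_mask))


# Stand-in for the frozen encoder output: 2 segments, 5 tokens, 768 dimensions
token_embeddings = torch.randn(2, 5, 768)
attention_mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
logits = BoilerplateHead()(token_embeddings, attention_mask)  # shape (2, 2)
```

With biases, a head of this shape has 12,458 trainable parameters, which is consistent with the "lightweight" description.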
Performance Metrics:
- Test AUC: 0.966
- False Positive Rate: 0.093
- False Negative Rate: 0.073
- Decision threshold: 0.22 (median probability)
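
The false positive and false negative rates depend on the cutoff applied to the predicted boilerplate probability. The sketch below shows how both rates could be computed at a given threshold. The helper name `error_rates` is hypothetical, the default of 0.22 simply mirrors the threshold reported above, and counting a probability exactly at the threshold as boilerplate is an assumption.

```python
def error_rates(probabilities, labels, threshold=0.22):
    """Return (false_positive_rate, false_negative_rate) at a threshold.

    probabilities: predicted P(boilerplate) for each segment
    labels: 1 for boilerplate, 0 for non-boilerplate
    """
    tp = fp = tn = fn = 0
    for p, y in zip(probabilities, labels):
        predicted = 1 if p >= threshold else 0  # tie handling is an assumption
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr
```

Lowering the cutoff flags more segments as boilerplate, so it can only raise the false positive rate and lower the false negative rate, or leave them unchanged.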

Intended Uses

Primary Use Cases

- Preprocessing financial analyst reports for content analysis
- Filtering boilerplate from earnings call transcripts
- Cleaning regulatory filings for substantive information extraction
- Preparing financial text for sentiment analysis or topic modeling

Out-of-Scope Uses

- General web content filtering (trained on financial documents)
- Non-English text classification
- Real-time streaming applications (optimized for batch processing)

Usage Examples

Using the Transformers Pipeline (Recommended)

import torch
from transformers import pipeline

# Load the model (requires trust_remote_code=True for custom architecture)
classifier = pipeline(
    "text-classification",
    model="maifeng/boilerplate_detection",
    trust_remote_code=True,
    device=0 if torch.cuda.is_available() else -1
)

# Single text classification
text = "This report contains forward-looking statements that involve risks and uncertainties."
result = classifier(text)
print(result)
# Output: [{'label': 'BOILERPLATE', 'score': 0.9987}]

# Batch classification for efficiency
texts = [
    "Revenue increased by 15% this quarter driven by strong product demand.",
    "The securities described herein may not be eligible for sale in all jurisdictions.",
    "Our new AI initiative has reduced operational costs by 30%.",
    "Past performance is not indicative of future results.",
]

results = classifier(texts, batch_size=32)
for text, result in zip(texts, results):
    label = result['label']
    score = result['score']
    print(f"{'[BOILERPLATE]' if label == 'BOILERPLATE' else '[CONTENT]    '} "
          f"(confidence: {score:.1%}) {text[:60]}...")

Direct Model Usage

from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer with trust_remote_code
model = AutoModel.from_pretrained(
    "maifeng/boilerplate_detection",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("maifeng/boilerplate_detection")

# Prepare input
texts = ["Your text here", "Another example"]
inputs = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

# Get predictions
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    
# Process results
for i, text in enumerate(texts):
    probs = probabilities[i].numpy()
    label = "BOILERPLATE" if probs[1] > 0.5 else "NOT_BOILERPLATE"
    confidence = probs[1] if label == "BOILERPLATE" else probs[0]
    print(f"{label}: {confidence:.2%} - {text[:50]}...")

Integration in Document Processing Pipeline

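from transformers import pipeline

def filter_boilerplate(documents, threshold=0.5):
    """Return the segments that are not confidently classified as boilerplate.

    A segment is dropped only when its label is BOILERPLATE and its
    score is at least `threshold`.
    """
    # Loading the pipeline is slow; build it once and reuse it when
    # filtering many batches.
    classifier = pipeline(
        "text-classification",
        model="maifeng/boilerplate_detection",
        trust_remote_code=True
    )

    results = classifier(documents, batch_size=32)

    # Keep NOT_BOILERPLATE segments plus low-confidence BOILERPLATE ones
    return [
        doc for doc, result in zip(documents, results)
        if result['label'] == 'NOT_BOILERPLATE' or result['score'] < threshold
    ]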

# Example usage
analyst_reports = [...]  # Your document segments
substantive_content = filter_boilerplate(analyst_reports)
print(f"Retained {len(substantive_content)}/{len(analyst_reports)} segments")

Model Limitations

- Domain Specificity: Optimized for financial analyst reports; performance may degrade on other document types
- Temporal Bias: Trained on 2000-2020 data; newer boilerplate patterns may not be recognized
- Language: English-only model
- Context Window: Maximum 512 tokens per segment
- Binary Classification: Does not distinguish between types of boilerplate
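
Because of the 512-token context window, long reports need to be split into segments before classification. The helper below is one possible way to do that, not the segmentation used by the authors, which this card does not describe. The name `split_into_segments` and the `max_words=200` default are hypothetical, and word count is only a rough proxy for the tokenizer's token count.

```python
def split_into_segments(document, max_words=200):
    """Split a document into paragraph-based segments of at most max_words words.

    Paragraphs are assumed to be separated by blank lines. Word count is
    only a rough proxy for the model's 512-token limit.
    """
    segments = []
    for paragraph in document.split("\n\n"):
        words = paragraph.split()
        # Break overly long paragraphs into consecutive word windows;
        # empty paragraphs produce no segments
        for start in range(0, len(words), max_words):
            segments.append(" ".join(words[start:start + max_words]))
    return segments
```

With `truncation=True`, as in the direct-usage example, text beyond the limit is cut off and never reaches the classifier, so splitting first keeps that content available for classification.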

Ethical Considerations

- Transparency: Users should understand that substantive content may occasionally be misclassified as boilerplate
- Bias: Training data from top brokers may not represent all financial communication styles
- Use Case: Should not be used as the sole method for regulatory compliance or legal document analysis

Citation

@article{mai2024dissecting,
  title={Dissecting Corporate Culture Using Generative AI},
  author={Mai, Feng and others},
  journal={Working Paper},
  year={2024}
}

Technical Requirements

- Python 3.7+
- PyTorch 1.9+
- Transformers 4.20+
- CUDA (optional, for GPU acceleration)

License

Apache 2.0 - See LICENSE file for details

Contact

For questions or issues, please open an issue on the model repository.