maifeng committed on
Commit 282c553 · verified · 1 Parent(s): 89f8ecc

Upload folder using huggingface_hub
README.md CHANGED
@@ -1,62 +1,211 @@
  ---
  license: apache-2.0
  tags:
  - text-classification
- - pytorch
  - transformers
- - mean-pooling
  pipeline_tag: text-classification
  widget:
- - text: "This report contains forward-looking statements that involve risks and uncertainties."
- - text: "Our revenue increased by 15% compared to last quarter due to strong demand."
  ---

- # Text Classification Model with Mean Pooling

- This model combines a transformer encoder with mean pooling and a custom classification head for text classification tasks.

- ## Model Architecture

- - **Base Model**: Transformer encoder (default: all-mpnet-base-v2)
- - **Pooling**: Mean pooling over token embeddings
- - **Classification Head**: 3-layer MLP (768 → 16 → 8 → 2)
- - **Task**: Binary text classification

- ## Usage

  ```python
  from transformers import AutoTokenizer, AutoModel
  import torch

- # Load model and tokenizer
- model = AutoModel.from_pretrained("maifeng/boilerplate_detection", trust_remote_code=True)
  tokenizer = AutoTokenizer.from_pretrained("maifeng/boilerplate_detection")

- # Prepare text
- texts = ["Your text here", "Another text example"]

  # Get predictions
- predictions = model.predict(texts, tokenizer)
- print(predictions)

- # Or use the pipeline
- from transformers import pipeline
- classifier = pipeline("text-classification", model="maifeng/boilerplate_detection", tokenizer="maifeng/boilerplate_detection")
- results = classifier(texts)
- print(results)
  ```

- ## Training Details

- - Trained with PyTorch and Ignite
- - Uses cross-entropy loss with class weights
- - Sample weighting for handling class imbalance
- - Early stopping based on validation AUC

- ## Performance

- [Add your performance metrics here]
- - AUC: [Your score]
- - F1: [Your score]
- - Precision: [Your score]
- - Recall: [Your score]
  ---
  license: apache-2.0
+ language: en
  tags:
  - text-classification
+ - financial-text
+ - boilerplate-detection
+ - analyst-reports
  - transformers
  pipeline_tag: text-classification
  widget:
+ - text: "EEA - The securities and related financial instruments described herein may not be eligible for sale in all jurisdictions or to certain categories of investors."
+   example_title: "Legal Disclaimer"
+ - text: "This report contains forward-looking statements that involve risks and uncertainties regarding future events."
+   example_title: "Forward-Looking Statement"
+ - text: "Our revenue increased by 15% compared to last quarter due to strong demand in emerging markets."
+   example_title: "Business Performance"
+ - text: "The information contained herein is confidential and proprietary and may not be disclosed without written permission."
+   example_title: "Confidentiality Notice"
+ - text: "We launched three innovative products this quarter that exceeded our initial sales projections by 40%."
+   example_title: "Product Update"
  ---

+ # Boilerplate Detection Model for Financial Documents

+ This model detects boilerplate (formulaic/repetitive) text in financial analyst reports, distinguishing it from substantive business content.

+ ## Model Description

+ The model was developed to support analysis of corporate culture discussions in analyst reports by filtering out standardized boilerplate content, including legal disclaimers, forward-looking statements, and other formulaic language.

+ ### Research Context
+
+ This model was developed as part of the research paper "Dissecting Corporate Culture Using Generative AI" to preprocess analyst reports for culture analysis. The model identifies and removes boilerplate segments that would otherwise introduce noise into substantive content analysis.
+
+ ### Training Methodology
+
+ 1. **Data Collection**:
+    - 2.4 million analyst reports from Thomson One's Investext (2000-2020)
+    - Reports from the top 20 brokers by volume were analyzed systematically
+
+ 2. **Training Data** (see the labeling sketch after this list):
+    - **Positive examples (boilerplate)**: Top 10% most frequently repeated segments per broker-year, appearing ≥5 times
+    - **Negative examples**: Randomly selected non-repeated segments
+    - **Dataset**: 547,790 examples (54,779 boilerplate, 493,011 non-boilerplate)
+    - **Split**: 80/10/10 for train/validation/test
+
+ 3. **Architecture Design**:
+    - **Embedding Layer**: Frozen sentence-transformers/all-mpnet-base-v2
+    - **Pooling**: Mean pooling over token embeddings
+    - **Classification Head**: Lightweight 3-layer MLP (768 → 16 → 8 → 2)
+    - **Strategy**: Frozen embeddings preserve semantic understanding while the classification head learns boilerplate patterns
+
+ 4. **Performance Metrics**:
+    - **Test AUC**: 0.966
+    - **False Positive Rate**: 0.093
+    - **False Negative Rate**: 0.073
+    - **Decision threshold**: 0.22 (median probability; see the thresholding sketch under Usage Examples)
+
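+ The labeling rule in step 2 can be approximated with a simple frequency count. The sketch below is illustrative only: the toy data and column names are hypothetical, and the paper's exact segmentation procedure is not reproduced here.
+
+ ```python
+ import pandas as pd
+
+ # Toy stand-in for parsed report segments
+ segments = pd.DataFrame({
+     "broker":  ["A"] * 6,
+     "year":    [2010] * 6,
+     "segment": ["Past performance is no guarantee."] * 5 + ["Revenue rose 5%."],
+ })
+
+ # Count how often each segment repeats within a broker-year
+ counts = (segments.groupby(["broker", "year", "segment"])
+           .size().rename("n").reset_index())
+
+ # Boilerplate = top 10% most repeated per broker-year AND appearing >= 5 times
+ cutoff = counts.groupby(["broker", "year"])["n"].transform(lambda s: s.quantile(0.9))
+ counts["boilerplate"] = (counts["n"] >= cutoff) & (counts["n"] >= 5)
+ print(counts)
+ ```
+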
+ ## Intended Uses
+
+ ### Primary Use Cases
+ - Preprocessing financial analyst reports for content analysis
+ - Filtering boilerplate from earnings call transcripts
+ - Cleaning regulatory filings for substantive information extraction
+ - Preparing financial text for sentiment analysis or topic modeling
+
+ ### Out-of-Scope Uses
+ - General web content filtering (the model was trained on financial documents)
+ - Non-English text classification
+ - Real-time streaming applications (the model is optimized for batch processing)
+
+ ## Usage Examples
+
+ ### Using the Transformers Pipeline (Recommended)
+
+ ```python
+ import torch
+ from transformers import pipeline
+
+ # Load the model (requires trust_remote_code=True for the custom architecture)
+ classifier = pipeline(
+     "text-classification",
+     model="maifeng/boilerplate_detection",
+     trust_remote_code=True,
+     device=0 if torch.cuda.is_available() else -1
+ )
+
+ # Single text classification
+ text = "This report contains forward-looking statements that involve risks and uncertainties."
+ result = classifier(text)
+ print(result)
+ # Output: [{'label': 'BOILERPLATE', 'score': 0.9987}]
+
+ # Batch classification for efficiency
+ texts = [
+     "Revenue increased by 15% this quarter driven by strong product demand.",
+     "The securities described herein may not be eligible for sale in all jurisdictions.",
+     "Our new AI initiative has reduced operational costs by 30%.",
+     "Past performance is not indicative of future results.",
+ ]
+
+ results = classifier(texts, batch_size=32)
+ for text, result in zip(texts, results):
+     label = result['label']
+     score = result['score']
+     print(f"{'[BOILERPLATE]' if label == 'BOILERPLATE' else '[CONTENT]    '} "
+           f"(confidence: {score:.1%}) {text[:60]}...")
+ ```
+
+ ### Direct Model Usage

  ```python
  from transformers import AutoTokenizer, AutoModel
  import torch

+ # Load model and tokenizer with trust_remote_code
+ model = AutoModel.from_pretrained(
+     "maifeng/boilerplate_detection",
+     trust_remote_code=True
+ )
  tokenizer = AutoTokenizer.from_pretrained("maifeng/boilerplate_detection")

+ # Prepare input
+ texts = ["Your text here", "Another example"]
+ inputs = tokenizer(
+     texts,
+     padding=True,
+     truncation=True,
+     max_length=512,
+     return_tensors="pt"
+ )

  # Get predictions
+ model.eval()
+ with torch.no_grad():
+     outputs = model(**inputs)
+     probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
+
+ # Process results
+ for i, text in enumerate(texts):
+     probs = probabilities[i].numpy()
+     label = "BOILERPLATE" if probs[1] > 0.5 else "NOT_BOILERPLATE"
+     confidence = probs[1] if label == "BOILERPLATE" else probs[0]
+     print(f"{label}: {confidence:.2%} - {text[:50]}...")
+ ```
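+
+ The examples above use a 0.5 cutoff on the softmax probabilities, while training selected a decision threshold of 0.22. A minimal sketch of applying that threshold to the `probabilities` tensor computed above (whether 0.22 suits your corpus is an assumption to verify):
+
+ ```python
+ # Classify as boilerplate when P(class 1) >= 0.22 instead of using argmax
+ THRESHOLD = 0.22
+ for text, probs in zip(texts, probabilities):
+     is_boilerplate = probs[1].item() >= THRESHOLD
+     print(f"{'BOILERPLATE' if is_boilerplate else 'NOT_BOILERPLATE'}: {text[:50]}...")
+ ```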

+ ### Integration in a Document Processing Pipeline
+
+ ```python
+ from transformers import pipeline
+
+ def filter_boilerplate(documents, threshold=0.5):
+     """Filter out boilerplate segments from a list of document segments."""
+     classifier = pipeline(
+         "text-classification",
+         model="maifeng/boilerplate_detection",
+         trust_remote_code=True
+     )
+
+     results = classifier(documents, batch_size=32)
+
+     # Keep a segment unless it is labeled boilerplate with confidence >= threshold
+     filtered_docs = []
+     for doc, result in zip(documents, results):
+         if result['label'] == 'NOT_BOILERPLATE' or result['score'] < threshold:
+             filtered_docs.append(doc)
+
+     return filtered_docs
+
+ # Example usage
+ analyst_reports = [...]  # Your document segments
+ substantive_content = filter_boilerplate(analyst_reports)
+ print(f"Retained {len(substantive_content)}/{len(analyst_reports)} segments")
  ```

+ ## Model Limitations
+
+ 1. **Domain Specificity**: Optimized for financial analyst reports; performance may degrade on other document types
+ 2. **Temporal Bias**: Trained on 2000-2020 data; newer boilerplate patterns may not be recognized
+ 3. **Language**: English-only model
+ 4. **Context Window**: Maximum 512 tokens per segment (see the chunking sketch after this list)
+ 5. **Binary Classification**: Does not distinguish between types of boilerplate
+
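+ A minimal sketch for working within the 512-token limit: split long documents into window-sized chunks with the model's own tokenizer before classifying. The helper below and its chunk/stride values are illustrative assumptions, not part of the released model.
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("maifeng/boilerplate_detection")
+
+ def chunk_text(text, max_tokens=510, stride=64):
+     """Split text into overlapping chunks that fit the model's context window.
+
+     510 content tokens leave room for the special tokens added at encode time.
+     """
+     ids = tokenizer(text, add_special_tokens=False)["input_ids"]
+     chunks = []
+     for start in range(0, len(ids), max_tokens - stride):
+         chunks.append(tokenizer.decode(ids[start:start + max_tokens]))
+         if start + max_tokens >= len(ids):
+             break
+     return chunks
+ ```
+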
+ ## Ethical Considerations
+
+ - **Transparency**: Users should understand that substantive content may occasionally be misclassified as boilerplate
+ - **Bias**: Training data from top brokers may not represent all financial communication styles
+ - **Use Case**: Should not be used as the sole method for regulatory compliance or legal document analysis
+
+ ## Citation
+
+ ```bibtex
+ @article{mai2024dissecting,
+   title={Dissecting Corporate Culture Using Generative AI},
+   author={Mai, Feng and others},
+   journal={Working Paper},
+   year={2024}
+ }
+ ```
+
+ ## Technical Requirements
+
+ - Python 3.7+
+ - PyTorch 1.9+
+ - Transformers 4.20+
+ - CUDA (optional, for GPU acceleration)
+
+ ## License
+
+ Apache 2.0 - see the LICENSE file for details

+ ## Contact

+ For questions or issues, please open an issue on the [model repository](https://huggingface.co/maifeng/boilerplate_detection).
config.json CHANGED
@@ -1,14 +1,23 @@
  {
    "architectures": [
-     "TextClassifierModel"
    ],
    "base_model_name": "sentence-transformers/all-mpnet-base-v2",
-   "classifier_hidden_dims": [
      16,
      8
    ],
    "dropout": 0.05,
-   "model_type": "text-classifier",
    "torch_dtype": "float32",
    "transformers_version": "4.53.3"
  }
  {
    "architectures": [
+     "BoilerplateDetector"
    ],
    "base_model_name": "sentence-transformers/all-mpnet-base-v2",
+   "classifier_dims": [
      16,
      8
    ],
    "dropout": 0.05,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "NOT_BOILERPLATE",
+     "1": "BOILERPLATE"
+   },
+   "label2id": {
+     "BOILERPLATE": 1,
+     "NOT_BOILERPLATE": 0
+   },
+   "model_type": "boilerplate",
    "torch_dtype": "float32",
    "transformers_version": "4.53.3"
  }
configuration_boilerplate.py ADDED
@@ -0,0 +1,24 @@
+ """Configuration for boilerplate detection model"""
+
+ from transformers import PretrainedConfig
+
+
+ class BoilerplateConfig(PretrainedConfig):
+     model_type = "boilerplate"
+
+     def __init__(
+         self,
+         base_model_name="sentence-transformers/all-mpnet-base-v2",
+         num_labels=2,
+         hidden_size=768,
+         classifier_dims=[16, 8],
+         dropout=0.05,
+         **kwargs
+     ):
+         super().__init__(num_labels=num_labels, **kwargs)
+         self.base_model_name = base_model_name
+         self.hidden_size = hidden_size
+         self.classifier_dims = classifier_dims
+         self.dropout = dropout
+         self.id2label = {0: "NOT_BOILERPLATE", 1: "BOILERPLATE"}
+         self.label2id = {"NOT_BOILERPLATE": 0, "BOILERPLATE": 1}
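
A quick round-trip check for this configuration class (a sketch; it assumes the file above is importable as `configuration_boilerplate.py`):

```python
from configuration_boilerplate import BoilerplateConfig

# Instantiate with an override, save, and reload to verify serialization
cfg = BoilerplateConfig(dropout=0.1)
cfg.save_pretrained("./boilerplate_ckpt")
reloaded = BoilerplateConfig.from_pretrained("./boilerplate_ckpt")
assert reloaded.dropout == 0.1 and reloaded.model_type == "boilerplate"
```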
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:67706fb1e7c79194c16ac9ff4a651e37e2c939b04b57639b19a06901ebf5f2e1
- size 438020384
  version https://git-lfs.github.com/spec/v1
+ oid sha256:d30e88acc6da21ba6c12a67e26c2fdd11e87976c0c3f1ae06c773ee5f19bbfe2
+ size 438020320
modeling_boilerplate.py ADDED
@@ -0,0 +1,99 @@
+ """Custom model definition for boilerplate detection"""
+
+ import torch
+ import torch.nn as nn
+ from transformers import PreTrainedModel, PretrainedConfig, AutoModel
+ from transformers.modeling_outputs import SequenceClassifierOutput
+
+
+ class BoilerplateConfig(PretrainedConfig):
+     model_type = "boilerplate"
+
+     def __init__(
+         self,
+         base_model_name="sentence-transformers/all-mpnet-base-v2",
+         num_labels=2,
+         hidden_size=768,
+         classifier_dims=[16, 8],
+         dropout=0.05,
+         **kwargs
+     ):
+         super().__init__(num_labels=num_labels, **kwargs)
+         self.base_model_name = base_model_name
+         self.hidden_size = hidden_size
+         self.classifier_dims = classifier_dims
+         self.dropout = dropout
+         self.id2label = {0: "NOT_BOILERPLATE", 1: "BOILERPLATE"}
+         self.label2id = {"NOT_BOILERPLATE": 0, "BOILERPLATE": 1}
+
+
+ class BoilerplateDetector(PreTrainedModel):
+     config_class = BoilerplateConfig
+
+     def __init__(self, config):
+         super().__init__(config)
+         self.config = config
+
+         # Load frozen SBERT
+         self.transformer = AutoModel.from_pretrained(config.base_model_name)
+         for param in self.transformer.parameters():
+             param.requires_grad = False
+
+         # Classification head
+         self.dropout = nn.Dropout(config.dropout)
+         self.fc1 = nn.Linear(config.hidden_size, config.classifier_dims[0])
+         self.fc2 = nn.Linear(config.classifier_dims[0], config.classifier_dims[1])
+         self.fc3 = nn.Linear(config.classifier_dims[1], config.num_labels)
+
+         self.init_weights()
+
+     def mean_pooling(self, model_output, attention_mask):
+         token_embeddings = model_output[0]
+         input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+         return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
+             input_mask_expanded.sum(1), min=1e-9
+         )
+
+     def forward(
+         self,
+         input_ids=None,
+         attention_mask=None,
+         labels=None,
+         return_dict=None,
+         **kwargs
+     ):
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         outputs = self.transformer(
+             input_ids=input_ids,
+             attention_mask=attention_mask,
+             return_dict=True,
+             **kwargs
+         )
+
+         sentence_embeddings = self.mean_pooling(outputs, attention_mask)
+
+         # Forward through classification head with dropout only during training
+         x = torch.nn.functional.relu(self.fc1(sentence_embeddings))
+         if self.training:
+             x = self.dropout(x)
+         x = torch.nn.functional.relu(self.fc2(x))
+         if self.training:
+             x = self.dropout(x)
+         logits = self.fc3(x)
+
+         loss = None
+         if labels is not None:
+             loss_fct = nn.CrossEntropyLoss()
+             loss = loss_fct(logits.view(-1, self.config.num_labels), labels.view(-1))
+
+         if not return_dict:
+             output = (logits,) + outputs[2:]
+             return ((loss,) + output) if loss is not None else output
+
+         return SequenceClassifierOutput(
+             loss=loss,
+             logits=logits,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+         )
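
To use these custom classes without `trust_remote_code` (for example, after vendoring the two files above into your own project), they can be registered with the Auto classes. A minimal sketch, assuming `modeling_boilerplate.py` is importable:

```python
from transformers import AutoConfig, AutoModel
from modeling_boilerplate import BoilerplateConfig, BoilerplateDetector

# Map the "boilerplate" model_type onto the custom config and model classes
AutoConfig.register("boilerplate", BoilerplateConfig)
AutoModel.register(BoilerplateConfig, BoilerplateDetector)

# AutoModel can now resolve the custom architecture from config.json alone
model = AutoModel.from_pretrained("maifeng/boilerplate_detection")
```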