maifeng committed
Commit c3614a5 · verified · 1 Parent(s): 282c553

Upload folder using huggingface_hub

Files changed (3):
  1. README.md +44 -163
  2. config.json +2 -2
  3. model.safetensors +1 -1
README.md CHANGED
@@ -6,206 +6,87 @@ tags:
   - financial-text
   - boilerplate-detection
   - analyst-reports
-  - transformers
 pipeline_tag: text-classification
 widget:
-  - text: "EEA - The securities and related financial instruments described herein may not be eligible for sale in all jurisdictions or to certain categories of investors."
-    example_title: "Legal Disclaimer"
-  - text: "This report contains forward-looking statements that involve risks and uncertainties regarding future events."
-    example_title: "Forward-Looking Statement"
   - text: "Our revenue increased by 15% compared to last quarter due to strong demand in emerging markets."
-    example_title: "Business Performance"
-  - text: "The information contained herein is confidential and proprietary and may not be disclosed without written permission."
-    example_title: "Confidentiality Notice"
-  - text: "We launched three innovative products this quarter that exceeded our initial sales projections by 40%."
-    example_title: "Product Update"
 ---
 
-# Boilerplate Detection Model for Financial Documents
 
-This model detects boilerplate (formulaic/repetitive) text in financial analyst reports, distinguishing it from substantive business content.
 
 ## Model Description
 
-Developed for analyzing corporate culture discussions in analyst reports by filtering out standardized boilerplate content, including legal disclaimers, forward-looking statements, and other formulaic language.
 
-### Research Context
 
-This model was developed as part of the research paper "Dissecting Corporate Culture Using Generative AI" to preprocess analyst reports for culture analysis. It identifies and removes boilerplate segments that would otherwise introduce noise into substantive content analysis.
 
-### Training Methodology
 
-1. **Data Collection**:
-   - 2.4 million analyst reports from Thomson One's Investext (2000-2020)
-   - Reports from the top 20 brokers by volume were analyzed systematically
-
-2. **Training Data** (a sketch of this labeling heuristic follows the list):
-   - **Positive examples (boilerplate)**: Top 10% most frequently repeated segments per broker-year, appearing ≥5 times
-   - **Negative examples**: Randomly selected non-repeated segments
-   - **Dataset**: 547,790 examples (54,779 boilerplate, 493,011 non-boilerplate)
-   - **Split**: 80/10/10 for train/validation/test
-
-3. **Architecture Design**:
-   - **Embedding Layer**: Frozen sentence-transformers/all-mpnet-base-v2
-   - **Pooling**: Mean pooling over token embeddings
-   - **Classification Head**: Lightweight 3-layer MLP (768 → 16 → 8 → 2)
-   - **Strategy**: Frozen embeddings preserve semantic understanding while the classification head learns boilerplate patterns
-
-4. **Performance Metrics**:
-   - **Test AUC**: 0.966
-   - **False Positive Rate**: 0.093
-   - **False Negative Rate**: 0.073
-   - **Decision threshold**: 0.22 (median probability)
-
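For concreteness, a minimal sketch of the repetition-based labeling in step 2, assuming segments are already grouped by broker and year; the function name, `segments_by_broker_year`, `min_count`, and the handling of the 10% cutoff are illustrative, not taken from the repo:

```python
from collections import Counter

def label_candidates(segments_by_broker_year, min_count=5, top_frac=0.10):
    """Illustrative reconstruction of the labeling heuristic: the top 10% most
    repeated segments per broker-year (appearing >= min_count times) become
    positives; non-repeated segments form the negative pool."""
    positives, negatives = [], []
    for (broker, year), segments in segments_by_broker_year.items():
        counts = Counter(segments)
        # Unique segments ranked by repetition count within this broker-year
        repeated = [s for s, c in counts.most_common() if c >= min_count]
        cutoff = max(1, int(len(counts) * top_frac))
        positives.extend(repeated[:cutoff])
        negatives.extend(s for s, c in counts.items() if c == 1)
    return positives, negatives
```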
-## Intended Uses
 
-### Primary Use Cases
-- Preprocessing financial analyst reports for content analysis
-- Filtering boilerplate from earnings call transcripts
-- Cleaning regulatory filings for substantive information extraction
-- Preparing financial text for sentiment analysis or topic modeling
 
-### Out-of-Scope Uses
-- General web content filtering (trained on financial documents)
-- Non-English text classification
-- Real-time streaming applications (optimized for batch processing)
 
-## Usage Examples
 
-### Using the Transformers Pipeline (Recommended)
 
-```python
-from transformers import pipeline
-import torch
-
-# Load the model (requires trust_remote_code=True for custom architecture)
-classifier = pipeline(
-    "text-classification",
-    model="maifeng/boilerplate_detection",
-    trust_remote_code=True,
-    device=0 if torch.cuda.is_available() else -1
-)
-
-# Single text classification
-text = "This report contains forward-looking statements that involve risks and uncertainties."
-result = classifier(text)
-print(result)
-# Output: [{'label': 'BOILERPLATE', 'score': 0.9987}]
-
-# Batch classification for efficiency
 texts = [
-    "Revenue increased by 15% this quarter driven by strong product demand.",
     "The securities described herein may not be eligible for sale in all jurisdictions.",
-    "Our new AI initiative has reduced operational costs by 30%.",
-    "Past performance is not indicative of future results.",
 ]
 
-results = classifier(texts, batch_size=32)
-for text, result in zip(texts, results):
-    label = result['label']
-    score = result['score']
-    print(f"{'[BOILERPLATE]' if label == 'BOILERPLATE' else '[CONTENT] '} "
-          f"(confidence: {score:.1%}) {text[:60]}...")
-```
-
-### Direct Model Usage
-
-```python
-from transformers import AutoTokenizer, AutoModel
-import torch
-
-# Load model and tokenizer with trust_remote_code
-model = AutoModel.from_pretrained(
-    "maifeng/boilerplate_detection",
-    trust_remote_code=True
-)
-tokenizer = AutoTokenizer.from_pretrained("maifeng/boilerplate_detection")
-
-# Prepare input
-texts = ["Your text here", "Another example"]
-inputs = tokenizer(
-    texts,
-    padding=True,
-    truncation=True,
-    max_length=512,
-    return_tensors="pt"
-)
-
-# Get predictions
-model.eval()
-with torch.no_grad():
-    outputs = model(**inputs)
-    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
-
-# Process results
-for i, text in enumerate(texts):
-    probs = probabilities[i].numpy()
-    label = "BOILERPLATE" if probs[1] > 0.5 else "NOT_BOILERPLATE"
-    confidence = probs[1] if label == "BOILERPLATE" else probs[0]
-    print(f"{label}: {confidence:.2%} - {text[:50]}...")
-```
-
-### Integration in Document Processing Pipeline
-
-```python
-from transformers import pipeline
-
-def filter_boilerplate(documents, threshold=0.5):
-    """Filter out boilerplate segments from documents."""
-    classifier = pipeline(
-        "text-classification",
-        model="maifeng/boilerplate_detection",
-        trust_remote_code=True
-    )
-
-    results = classifier(documents, batch_size=32)
-
-    filtered_docs = []
-    for doc, result in zip(documents, results):
-        # Keep segments classified as content, or low-confidence boilerplate
-        if result['label'] == 'NOT_BOILERPLATE' or result['score'] < threshold:
-            filtered_docs.append(doc)
-
-    return filtered_docs
-
-# Example usage
-analyst_reports = [...]  # Your document segments
-substantive_content = filter_boilerplate(analyst_reports)
-print(f"Retained {len(substantive_content)}/{len(analyst_reports)} segments")
 ```
 
 ## Model Limitations
 
-1. **Domain Specificity**: Optimized for financial analyst reports; performance may degrade on other document types
-2. **Temporal Bias**: Trained on 2000-2020 data; newer boilerplate patterns may not be recognized
-3. **Language**: English-only model
-4. **Context Window**: Maximum 512 tokens per segment
-5. **Binary Classification**: Does not distinguish between types of boilerplate
-
-## Ethical Considerations
-
-- **Transparency**: Users should understand that substantive content may occasionally be misclassified as boilerplate
-- **Bias**: Training data from top brokers may not represent all financial communication styles
-- **Use Case**: Should not be used as the sole method for regulatory compliance or legal document analysis
 
 ## Citation
 
 ```bibtex
-@article{mai2024dissecting,
   title={Dissecting Corporate Culture Using Generative AI},
-  author={Mai, Feng and others},
-  journal={Working Paper},
-  year={2024}
 }
 ```
 
-## Technical Requirements
-
-- Python 3.7+
-- PyTorch 1.9+
-- Transformers 4.20+
-- CUDA (optional, for GPU acceleration)
-
 ## License
 
-Apache 2.0 - See LICENSE file for details
-
-## Contact
-
-For questions or issues, please open an issue on the [model repository](https://huggingface.co/maifeng/boilerplate_detection).
 
   - financial-text
   - boilerplate-detection
   - analyst-reports
 pipeline_tag: text-classification
 widget:
+  - text: "The securities and related financial instruments described herein may not be eligible for sale in all jurisdictions or to certain categories of investors."
   - text: "Our revenue increased by 15% compared to last quarter due to strong demand in emerging markets."
+  - text: "This report contains forward-looking statements that involve risks and uncertainties."
+  - text: "We launched three innovative products this quarter that exceeded our sales projections by 40%."
 ---
 
+# Boilerplate Detection for Financial Text
+
+This model identifies boilerplate (formulaic, repetitive) language in financial documents, distinguishing it from substantive business content. It was developed to preprocess analyst reports for research on corporate culture analysis.
+
 ## Model Description
 
+The model uses a frozen sentence transformer (all-mpnet-base-v2) combined with a lightweight classification head to identify boilerplate text segments. Training data consisted of analyst reports from 2000-2020, where boilerplate examples were identified as frequently repeated segments across reports from the same brokerage house.
+
+The architecture combines mean-pooled embeddings from the sentence transformer with a simple 3-layer neural network (768 → 16 → 8 → 2) for classification. This approach preserves semantic understanding while learning patterns specific to financial boilerplate language.
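As a mental model of this architecture, here is a minimal PyTorch sketch; the shapes (768 → 16 → 8 → 2), mean pooling, and the 0.05 dropout come from this card and config.json, while the class name, activation choice, and dropout placement are assumptions, not the repo's `modeling_boilerplate.py`:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class BoilerplateHeadSketch(nn.Module):
    """Illustrative only: frozen 768-d mpnet embedding -> 16 -> 8 -> 2 logits."""
    def __init__(self, dropout=0.05):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("sentence-transformers/all-mpnet-base-v2")
        for p in self.encoder.parameters():
            p.requires_grad = False  # embeddings stay frozen; only the head trains

        self.head = nn.Sequential(
            nn.Linear(768, 16), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(16, 8), nn.ReLU(),
            nn.Linear(8, 2),
        )

    def forward(self, input_ids, attention_mask):
        token_emb = self.encoder(input_ids=input_ids,
                                 attention_mask=attention_mask).last_hidden_state
        # Mean pooling over non-padding tokens
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (token_emb * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.head(pooled)  # (batch, 2) logits
```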
 
+## Usage
+
+Since this model uses a custom architecture, you need to use the direct loading approach rather than the pipeline interface:
+
+```python
+import sys
+import huggingface_hub
+from transformers import AutoTokenizer
+import torch
+
+# Load model components
+model_path = huggingface_hub.snapshot_download('maifeng/boilerplate_detection')
+sys.path.insert(0, model_path)
+
+from modeling_boilerplate import BoilerplateDetector, BoilerplateConfig
+
+# Initialize model
+config = BoilerplateConfig.from_pretrained('maifeng/boilerplate_detection')
+model = BoilerplateDetector.from_pretrained('maifeng/boilerplate_detection')
+tokenizer = AutoTokenizer.from_pretrained('maifeng/boilerplate_detection')
+
+model.eval()
+
+# Classify texts
 texts = [
     "The securities described herein may not be eligible for sale in all jurisdictions.",
+    "Revenue increased by 15% this quarter due to strong market demand.",
+    "This report contains forward-looking statements involving risks.",
+    "Our new product line exceeded initial sales expectations significantly."
 ]
 
+results = []
+for text in texts:
+    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
+
+    with torch.no_grad():
+        outputs = model(**inputs)
+        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]
+
+    label = 'BOILERPLATE' if probs[1] > 0.5 else 'NOT_BOILERPLATE'
+    confidence = probs[1].item() if label == 'BOILERPLATE' else probs[0].item()
+
+    results.append({'text': text, 'label': label, 'confidence': confidence})
+
+for result in results:
+    print(f"{result['label']:>15}: {result['confidence']:.1%} - {result['text'][:60]}...")
 ```
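The loop above uses a 0.5 cutoff, whereas the earlier revision of this card reported a decision threshold of 0.22 (the median boilerplate probability). A minimal variant that reuses `model` and `tokenizer` from the block above, should you want that operating point; the helper name is illustrative:

```python
# Illustrative: score with the 0.22 threshold reported in the earlier card
# revision instead of the 0.5 cutoff used above.
def is_boilerplate(text, threshold=0.22):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    p_boiler = torch.nn.functional.softmax(logits, dim=-1)[0, 1].item()
    return p_boiler >= threshold, p_boiler

flag, p = is_boilerplate("Past performance is not indicative of future results.")
print(f"boilerplate={flag} (p_boilerplate={p:.2f})")
```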
 
 ## Model Limitations
 
+This model is specifically trained on financial analyst reports from 2000-2020 and performs best on similar English-language financial documents. It may not generalize well to other domains or document types. The model processes text segments up to 512 tokens and provides binary classification only.
 
 ## Citation
 
 ```bibtex
+@article{li2025dissecting,
   title={Dissecting Corporate Culture Using Generative AI},
+  author={Li, Kai and Mai, Feng and Shen, Rui and Yang, Chelsea and Zhang, Tengfei},
+  journal={Review of Financial Studies},
+  year={2025}
 }
 ```
 
 ## License
 
+Apache 2.0
config.json CHANGED
@@ -8,6 +8,7 @@
     8
   ],
   "dropout": 0.05,
+  "dtype": "float32",
   "hidden_size": 768,
   "id2label": {
     "0": "NOT_BOILERPLATE",
@@ -18,6 +19,5 @@
     "NOT_BOILERPLATE": 0
   },
   "model_type": "boilerplate",
-  "torch_dtype": "float32",
-  "transformers_version": "4.53.3"
+  "transformers_version": "4.56.1"
 }
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d30e88acc6da21ba6c12a67e26c2fdd11e87976c0c3f1ae06c773ee5f19bbfe2
+oid sha256:ae32f779630e735df1f705d9b7d4743541c6d7f604d00b61e59e44cad7c25dca
 size 438020320