<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>BIS Reasoning 1.0 - Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning</title>
    <meta name="description" content="The first large-scale Japanese dataset for evaluating belief-inconsistent syllogistic reasoning in large language models">
    <link rel="stylesheet" href="styles.css">
    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&family=Playfair+Display:wght@400;600;700&display=swap" rel="stylesheet">
    <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
</head>
<body>
    <!-- Navigation -->
    <nav class="navbar">
        <div class="nav-container">
            <div class="nav-logo">
                <h2>BIS Reasoning 1.0</h2>
            </div>
            <ul class="nav-menu">
                <li><a href="#home" class="nav-link">Home</a></li>
                <li><a href="#abstract" class="nav-link">Abstract</a></li>
                <li><a href="#dataset" class="nav-link">Dataset</a></li>
                <li><a href="#results" class="nav-link">Results</a></li>
                <li><a href="#resources" class="nav-link">Resources</a></li>
            </ul>
            <div class="hamburger">
                <span></span>
                <span></span>
                <span></span>
            </div>
        </div>
    </nav>

    <!-- Hero Section -->
    <section id="home" class="hero">
        <div class="hero-container">
            <div class="hero-content">
                <h1 class="hero-title">
                    BIS Reasoning 1.0
                    <span class="hero-subtitle">The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning</span>
                </h1>
                <p class="hero-description">
                    Evaluating the logical reasoning capabilities of Large Language Models when faced with conclusions that contradict common beliefs
                </p>
                <div class="hero-stats">
                    <div class="stat-item">
                        <span class="stat-number" data-target="5000">0</span>
                        <span class="stat-label">Syllogistic Problems</span>
                    </div>
                    <div class="stat-item">
                        <span class="stat-number" data-target="7">0</span>
                        <span class="stat-label">LLMs Evaluated</span>
                    </div>
                    <div class="stat-item">
                        <span class="stat-number" data-target="46">0</span>
                        <span class="stat-label">Semantic Categories</span>
                    </div>
                </div>
                <div class="hero-buttons">
                    <a href="#dataset" class="btn btn-primary">Explore Dataset</a>
                    <a href="#results" class="btn btn-secondary">View Results</a>
                </div>
            </div>
            <div class="hero-visual">
                <div class="syllogism-example">
                    <div class="premise">
                        <span class="premise-label">Premise 1:</span>
                        <span class="premise-text">All charcoal is processed biomass fuel</span>
                    </div>
                    <div class="premise">
                        <span class="premise-label">Premise 2:</span>
                        <span class="premise-text">All ceramics are charcoal</span>
                    </div>
                    <div class="conclusion">
                        <span class="conclusion-label">Conclusion:</span>
                        <span class="conclusion-text">All ceramics are processed biomass fuel</span>
                        <span class="validity-badge">Logically Valid</span>
                    </div>
                </div>
            </div>
        </div>
    </section>

    <!-- Abstract Section -->
    <section id="abstract" class="abstract">
        <div class="container">
            <h2 class="section-title">Abstract</h2>
            <div class="abstract-content">
                <p class="abstract-text">
                    We present <strong>BIS Reasoning 1.0</strong>, the first large-scale Japanese dataset of syllogistic reasoning problems explicitly designed to evaluate belief-inconsistent reasoning in large language models (LLMs). Unlike prior datasets such as NeuBAROCO and JFLD, which focus on general or belief-aligned reasoning, BIS introduces logically valid yet belief-inconsistent syllogisms to uncover reasoning biases in LLMs trained on human-aligned corpora.
                </p>
                <p class="abstract-text">
                    We benchmark state-of-the-art models—including GPT models, Claude models, and leading Japanese LLMs—revealing significant variance in performance, with GPT-4o achieving 79.54% accuracy. Our analysis identifies critical weaknesses in current LLMs when handling logically valid but belief-conflicting inputs.
                </p>
                <div class="key-contributions">
                    <h3>Key Contributions</h3>
                    <div class="contributions-grid">
                        <div class="contribution-item">
                            <div class="contribution-icon">📊</div>
                            <h4>First Japanese Benchmark</h4>
                            <p>5,000 carefully curated belief-inconsistent syllogisms in Japanese</p>
                        </div>
                        <div class="contribution-item">
                            <div class="contribution-icon">🤖</div>
                            <h4>Comprehensive Evaluation</h4>
                            <p>Systematic comparison of 7 state-of-the-art LLMs</p>
                        </div>
                        <div class="contribution-item">
                            <div class="contribution-icon">🔍</div>
                            <h4>Bias Analysis</h4>
                            <p>Detailed investigation of reasoning biases and failure modes</p>
                        </div>
                        <div class="contribution-item">
                            <div class="contribution-icon">⚕️</div>
                            <h4>Real-world Implications</h4>
                            <p>Critical insights for high-stakes applications</p>
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </section>

    <!-- Dataset Section -->
    <section id="dataset" class="dataset">
        <div class="container">
            <h2 class="section-title">Dataset Overview</h2>
            <div class="dataset-content">
                <div class="dataset-description">
                    <h3>BIS Dataset Construction</h3>
                    <p>
                        The BIS dataset consists of 5,000 carefully constructed syllogistic reasoning problems designed to test the robustness of logical inference in LLMs under conditions of belief inconsistency. Each example comprises two premises and one conclusion that is strictly entailed by syllogistic rules, but deliberately conflicts with general knowledge.
                    </p>
                </div>
                
                <div class="example-section">
                    <h3>Example from Dataset</h3>
                    <div class="example-container">
                        <img src="example.png" alt="Example syllogism from BIS dataset" class="example-image">
                        <div class="example-explanation">
                            <p>This example illustrates a belief-inconsistent syllogism where the conclusion is logically valid but contradicts common real-world beliefs about ceramics and biomass fuel.</p>
                        </div>
                    </div>
                </div>
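Each BIS item follows the Barbara syllogism form shown above (All A are B; All C are A; therefore All C are B), so validity depends only on structure, never on whether the conclusion is believable. A minimal sketch of that check (the tuple representation here is illustrative, not the dataset's actual schema):

```python
# Minimal sketch: checking the Barbara syllogism form used in the
# example above. Terms are plain strings; validity is purely structural.

def is_valid_barbara(premise1, premise2, conclusion):
    """Each statement is a (subject, predicate) pair meaning
    'All <subject> are <predicate>'. The inference
    All A are B; All C are A => All C are B
    holds regardless of whether the conclusion matches world knowledge."""
    a, b = premise1      # All A are B
    c, a2 = premise2     # All C are A
    c2, b2 = conclusion  # All C are B
    return a == a2 and c == c2 and b == b2

# The belief-inconsistent example: logically valid, factually odd.
print(is_valid_barbara(
    ("charcoal", "processed biomass fuel"),
    ("ceramics", "charcoal"),
    ("ceramics", "processed biomass fuel"),
))  # True
```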

                <div class="categories-section">
                    <h3>Dataset Categories</h3>
                    <div class="categories-visual">
                        <img src="combined_analysis.png" alt="Dataset category analysis" class="categories-image">
                        <div class="categories-description">
                            <p>The dataset covers 46 distinct semantic categories, consolidated into 10 broader final categories including Human/Body/Senses, Animals/Organisms, Structure/Logic, and Natural Phenomena/Matter.</p>
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </section>

    <!-- Results Section -->
    <section id="results" class="results">
        <div class="container">
            <h2 class="section-title">Results & Analysis</h2>
            
            <div class="results-overview">
                <h3>Model Performance Overview</h3>
                <div class="performance-table">
                    <table>
                        <thead>
                            <tr>
                                <th>Model</th>
                                <th>BIS Accuracy (%)</th>
                                <th>NeuBAROCO Accuracy (%)</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr class="top-performer">
                                <td>GPT-4o</td>
                                <td>79.54</td>
                                <td>94.01</td>
                            </tr>
                            <tr>
                                <td>llm-jp-3-13b</td>
                                <td>59.86</td>
                                <td>67.66</td>
                            </tr>
                            <tr>
                                <td>GPT-4-turbo</td>
                                <td>59.48</td>
                                <td>67.66</td>
                            </tr>
                            <tr>
                                <td>llm-jp-3-13b-instruct3</td>
                                <td>40.90</td>
                                <td>38.32</td>
                            </tr>
                            <tr>
                                <td>stockmark-13b</td>
                                <td>40.34</td>
                                <td>47.90</td>
                            </tr>
                            <tr>
                                <td>Claude-3-sonnet</td>
                                <td>20.34</td>
                                <td>78.44</td>
                            </tr>
                            <tr>
                                <td>Claude-3-opus</td>
                                <td>7.18</td>
                                <td>61.07</td>
                            </tr>
                        </tbody>
                    </table>
                </div>
            </div>
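The belief-bias effect can be read off the table as the per-model drop from NeuBAROCO (belief-aligned) to BIS (belief-inconsistent) accuracy; a quick sketch using the figures above:

```python
# Per-model accuracy drop from NeuBAROCO to BIS, using the table's figures.
scores = {
    "GPT-4o":                 (79.54, 94.01),
    "llm-jp-3-13b":           (59.86, 67.66),
    "GPT-4-turbo":            (59.48, 67.66),
    "llm-jp-3-13b-instruct3": (40.90, 38.32),
    "stockmark-13b":          (40.34, 47.90),
    "Claude-3-sonnet":        (20.34, 78.44),
    "Claude-3-opus":          (7.18, 61.07),
}

def bis_drop(model):
    """Points lost when conclusions conflict with belief."""
    bis, neubaroco = scores[model]
    return round(neubaroco - bis, 2)

# Claude-3-opus loses 53.89 points under belief inconsistency,
# GPT-4o only 14.47 -- the variance discussed in the findings below.
for model in scores:
    print(f"{model}: {bis_drop(model):+.2f}")
```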

            <div class="prompt-analysis">
                <h3>Prompt Engineering Analysis</h3>
                <div class="prompt-charts">
                    <div class="chart-container">
                        <h4>Japanese Prompts - Error Recovery Rate</h4>
                        <img src="accuracy.png" alt="Error sample accuracy by prompt type" class="chart-image">
                    </div>
                    <div class="chart-container">
                        <h4>Japanese vs English Prompts Comparison</h4>
                        <img src="english_retest.png" alt="Prompt type accuracy comparison" class="chart-image">
                    </div>
                </div>
                <div class="prompt-insights">
                    <div class="insight-item">
                        <h4>Chain-of-Thought Effectiveness</h4>
                        <p>Chain-of-thought prompting recovered 87% of previously failed samples, demonstrating GPT-4o's latent reasoning capabilities when explicitly guided.</p>
                    </div>
                    <div class="insight-item">
                        <h4>Logic-Focused Instructions</h4>
                        <p>Prompts emphasizing logical evaluation and belief inconsistency recovered 76% of previously failed samples, showing sensitivity to explicit instructional framing.</p>
                    </div>
                    <div class="insight-item">
                        <h4>Language Impact</h4>
                        <p>English prompts showed similar patterns but with less pronounced gaps, likely due to GPT-4o's extensive English training.</p>
                    </div>
                </div>
            </div>
        </div>
    </section>

    <!-- Key Findings Section -->
    <section id="findings" class="findings">
        <div class="container">
            <h2 class="section-title">Key Findings</h2>
            <div class="findings-grid">
                <div class="finding-item">
                    <div class="finding-icon">🎯</div>
                    <h3>Performance Variance</h3>
                    <p>Significant variance in performance across models, with GPT-4o leading at 79.54% accuracy while Claude models underperformed dramatically on BIS despite strong NeuBAROCO results.</p>
                </div>
                <div class="finding-item">
                    <div class="finding-icon">🧠</div>
                    <h3>Belief Bias Impact</h3>
                    <p>LLMs struggle disproportionately with belief-inconsistent problems, often overriding logical inference in favor of plausibility heuristics.</p>
                </div>
                <div class="finding-item">
                    <div class="finding-icon">📝</div>
                    <h3>Prompt Sensitivity</h3>
                    <p>Strategic prompt design significantly impacts performance, with chain-of-thought and logic-focused instructions dramatically improving accuracy.</p>
                </div>
                <div class="finding-item">
                    <div class="finding-icon">⚖️</div>
                    <h3>Scale vs. Reasoning</h3>
                    <p>Model size alone doesn't guarantee reasoning performance. Training approach and architectural biases are more critical factors.</p>
                </div>
                <div class="finding-item">
                    <div class="finding-icon">🏥</div>
                    <h3>High-Stakes Implications</h3>
                    <p>Critical vulnerabilities revealed for deployment in law, healthcare, and scientific research where logical consistency is paramount.</p>
                </div>
                <div class="finding-item">
                    <div class="finding-icon">🔬</div>
                    <h3>Future Research</h3>
                    <p>Need for bias-resistant model design and comprehensive evaluation beyond standard benchmarks for reliable AI systems.</p>
                </div>
            </div>
        </div>
    </section>

    <!-- Resources Section -->
    <section id="resources" class="resources">
        <div class="container">
            <h2 class="section-title">Resources</h2>
            <div class="resources-content">
                <div class="resource-links">
                    <div class="resource-item">
                        <h3>📊 Dataset</h3>
                        <p>Access the complete BIS Reasoning 1.0 dataset</p>
                        <a href="https://hf.co/datasets/nguyenthanhasia/BIS_Reasoning_v1.0" class="resource-link" target="_blank">Hugging Face Dataset</a>
                    </div>
                    <div class="resource-item">
                        <h3>📄 Research Paper</h3>
                        <p>Read the full academic paper with detailed methodology and analysis</p>
                        <a href="https://arxiv.org/pdf/2506.06955" class="resource-link" target="_blank">Download Paper (PDF)</a>
                    </div>                   
                </div>
                
                <div class="citation">
                    <h3>Citation</h3>
                    <div class="citation-box">
                        <pre><code>@article{nguyen2025bis,
  title={BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning},
  author={Ha-Thanh Nguyen and Chaoran Liu and Qianying Liu and Hideyuki Tachibana and Su Myat Noe and Yusuke Miyao and Koichi Takeda and Sadao Kurohashi},
  year={2025},
  eprint={2506.06955},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.06955}
}</code></pre>
                    </div>
                </div>

                <div class="contact">
                    <h3>Contact</h3>
                    <p>For questions about the research or dataset:</p>
                    <p><strong>Corresponding Author:</strong> Ha-Thanh Nguyen</p>
                    <p><strong>Email:</strong> nguyenhathanh@nii.ac.jp</p>
                    <p><strong>Affiliation:</strong> Research and Development Center for Large Language Models, NII, Tokyo, Japan</p>
                </div>
            </div>
        </div>
    </section>

    <!-- Footer -->
    <footer class="footer">
        <div class="container">
            <div class="footer-content">
                <div class="footer-section">
                    <h3>BIS Reasoning 1.0</h3>
                    <p>Advancing the evaluation of logical reasoning in Large Language Models</p>
                </div>
                <div class="footer-section">
                    <h4>Research</h4>
                    <ul>
                        <li><a href="#abstract">Abstract</a></li>
                        <li><a href="#dataset">Dataset</a></li>
                        <li><a href="#results">Results</a></li>
                    </ul>
                </div>
                <div class="footer-section">
                    <h4>Resources</h4>
                    <ul>
                        <li><a href="https://hf.co/datasets/nguyenthanhasia/BIS_Reasoning_v1.0" target="_blank">Dataset</a></li>
                        <li><a href="https://arxiv.org/pdf/2506.06955" target="_blank">Paper</a></li>
                        <li><a href="#">Code</a></li>
                    </ul>
                </div>
            </div>
            <div class="footer-bottom">
                <p>&copy; 2025 Research and Development Center for Large Language Models, NII. All rights reserved.</p>
            </div>
        </div>
    </footer>

    <script src="script.js"></script>
</body>
</html>