---
title: Code Similarity Visualization with GraphCodeBERT
emoji: 🧠
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 5.38.0
app_file: app.py
pinned: false
license: mit
short_description: Augmenting the Interpretability of GraphCodeBERT
---

# Code Similarity Visualization with GraphCodeBERT

This interactive application visualizes token-level embeddings generated by [GraphCodeBERT](https://huggingface.co/microsoft/graphcodebert-base) for classical sorting algorithms. It supports pairwise comparison of algorithms based on their representation in the model’s embedding space, using PCA for dimensionality reduction.

## ✒️ Reference

Martinez-Gil, J. (2025). **Augmenting the Interpretability of GraphCodeBERT for Code Similarity Tasks**. *International Journal of Software Engineering and Knowledge Engineering*, 35(05), 657–678.

## 🚀 Features

- Select two classical sorting algorithms.
- Automatic tokenization and embedding via GraphCodeBERT.
- PCA-based projection into 2D space for visualization.
- Clear matplotlib plots showing token-level distribution differences.
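
The tokenization and embedding step can be sketched roughly as follows. This is a minimal illustration, not the code in `app.py`: it assumes the standard `AutoTokenizer` and `AutoModel` classes from `transformers`, feeds only the token sequence (no data-flow graph input), and takes the last hidden state as the per-token embedding. The helper names `embed_tokens` and `drop_special_tokens` are hypothetical, and dropping special tokens is an optional choice made here for cleaner plots.

```python
MODEL_NAME = "microsoft/graphcodebert-base"  # model named in this README


def embed_tokens(code, model_name=MODEL_NAME):
    """Return (tokens, embeddings); embeddings has shape (n_tokens, hidden_size)."""
    # Imported lazily so the heavy dependencies load only when embeddings are requested.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Truncate to the 512-token limit of the base model.
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state has shape (batch, seq_len, hidden_size); keep the single batch item.
    embeddings = outputs.last_hidden_state[0].numpy()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return tokens, embeddings


def drop_special_tokens(tokens, embeddings, special=("<s>", "</s>", "<pad>")):
    """Remove entries that belong to RoBERTa-style special tokens (assumed optional step)."""
    keep = [i for i, tok in enumerate(tokens) if tok not in special]
    return [tokens[i] for i in keep], [embeddings[i] for i in keep]
```

Calling `embed_tokens(source_code)` downloads the model weights on first use, so a real application would likely load the tokenizer and model once at startup rather than on every call.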

## 🧠 Technical Overview

- **Model**: [`microsoft/graphcodebert-base`](https://huggingface.co/microsoft/graphcodebert-base)
- **Embedding Layer**: Last hidden state
- **Reduction**: Principal Component Analysis (PCA)
- **Interface**: Gradio
- **Language**: Python 3.10+
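
The reduction and plotting steps might look like the sketch below. It is an assumption-laden illustration rather than the app's actual code: it fits a single scikit-learn `PCA` on both embedding matrices stacked together so the two algorithms share the same axes (the app may instead fit them separately), and it uses random matrices as stand-ins for real token embeddings. The function names `project_pair` and `plot_pair` and the algorithm labels are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend, suitable for a server-side app
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA


def project_pair(emb_a, emb_b):
    """Fit one 2-component PCA on both matrices and return the two 2D projections."""
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    stacked = np.vstack([emb_a, emb_b])
    points = PCA(n_components=2).fit_transform(stacked)
    return points[: len(emb_a)], points[len(emb_a):]


def plot_pair(points_a, points_b, label_a="Algorithm A", label_b="Algorithm B"):
    """Scatter both projections on shared axes and return the figure."""
    fig, ax = plt.subplots(figsize=(6, 5))
    ax.scatter(points_a[:, 0], points_a[:, 1], alpha=0.7, label=label_a)
    ax.scatter(points_b[:, 0], points_b[:, 1], alpha=0.7, label=label_b)
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    ax.legend()
    return fig


# Synthetic stand-ins for two token-embedding matrices
# (768 is the hidden size of the base model).
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(40, 768))
emb_b = rng.normal(loc=0.5, size=(55, 768))

points_a, points_b = project_pair(emb_a, emb_b)
fig = plot_pair(points_a, points_b, "Bubble sort", "Quick sort")
```

Fitting PCA jointly keeps the two point clouds directly comparable. Fitting it per algorithm would give each cloud its own axes, which makes visual comparison harder to interpret.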

## 🛠 Dependencies

All required libraries are listed in `requirements.txt`:

```
transformers
torch
scikit-learn
numpy
matplotlib
gradio
Pillow
```

## 🖥️ Intended Use

- Academic teaching and demonstration of code embeddings
- Qualitative evaluation of pretrained models for source code
- Supplementary visualization for software engineering publications

## 📬 Contact

**Jorge Martinez-Gil**

Senior Research Scientist in Computer Science