---
title: Code Similarity Visualization with GraphCodeBERT
emoji: 🧠
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 5.38.0
app_file: app.py
pinned: false
license: mit
short_description: Augmenting the Interpretability of GraphCodeBERT
---
# Code Similarity Visualization with GraphCodeBERT
This interactive application visualizes token-level embeddings generated by [GraphCodeBERT](https://huggingface.co/microsoft/graphcodebert-base) for classical sorting algorithms. You pick two algorithms and compare how their tokens are distributed in the model’s embedding space, with PCA reducing the embeddings to two dimensions for plotting.
## ✒️ Reference
Martinez-Gil, J. (2025).
**Augmenting the Interpretability of GraphCodeBERT for Code Similarity Tasks**.
*International Journal of Software Engineering and Knowledge Engineering*, 35(05), 657–678.
## 🚀 Features
- Select two classical sorting algorithms.
- Automatic tokenization and embedding via GraphCodeBERT.
- PCA-based projection into 2D space for visualization.
- Matplotlib plots showing how the token-level distributions of the two algorithms differ.
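
The PCA projection step can be sketched as follows. This is a minimal illustration, not the code in `app.py`: it assumes both token-embedding matrices are projected into one shared 2D space (fitting the projection on the stacked tokens), and it uses a plain NumPy SVD in place of a library PCA so the example has a single dependency. `sklearn.decomposition.PCA(n_components=2)` would give the same coordinates up to sign. The function name `project_pair_2d` and the random stand-in embeddings are hypothetical.

```python
import numpy as np


def project_pair_2d(emb_a, emb_b):
    """Project two (tokens x hidden_size) matrices into one shared 2D PCA space."""
    # Assumption: a single projection is fitted on the tokens of both algorithms
    stacked = np.vstack([emb_a, emb_b])
    centered = stacked - stacked.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    # Split the projected points back into the two algorithms
    return coords[: len(emb_a)], coords[len(emb_a):]


# Random stand-ins for real token embeddings (768 is the base model's hidden size)
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(12, 768))
emb_b = rng.normal(loc=0.5, size=(15, 768))

points_a, points_b = project_pair_2d(emb_a, emb_b)
print(points_a.shape, points_b.shape)
```

The two returned arrays can be passed directly to `matplotlib.pyplot.scatter`, one call per algorithm, to get an overlaid plot with a different color for each.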
## 🧠 Technical Overview
- **Model**: [`microsoft/graphcodebert-base`](https://huggingface.co/microsoft/graphcodebert-base)
- **Embedding Layer**: Last hidden state
- **Reduction**: Principal Component Analysis (PCA)
- **Interface**: Gradio
- **Language**: Python 3.10+
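
As a rough sketch of how the model and embedding layer listed above fit together, the snippet below loads the checkpoint through the generic `transformers` `AutoTokenizer` and `AutoModel` classes and returns the last hidden state, one row per token. It is an illustration under assumptions rather than the application's actual code: the helper `embed_tokens`, the sample `BUBBLE_SORT` snippet, and the `max_length` value are hypothetical, and the input is plain source text without the data-flow inputs used in the original GraphCodeBERT pipeline.

```python
MODEL_NAME = "microsoft/graphcodebert-base"

# Hypothetical sample input: one of the classical sorting algorithms as source text
BUBBLE_SORT = '''
def bubble_sort(items):
    items = list(items)
    for i in range(len(items)):
        for j in range(len(items) - 1 - i):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
    return items
'''


def embed_tokens(code, model_name=MODEL_NAME, max_length=512):
    """Return (tokens, embeddings) taken from the model's last hidden state."""
    # Imported lazily so the heavy dependencies load only when the function is called
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Assumption: source text only, truncated to the model's input limit
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=max_length)
    with torch.no_grad():
        outputs = model(**inputs)

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    # Shape: (number_of_tokens, hidden_size)
    embeddings = outputs.last_hidden_state[0].numpy()
    return tokens, embeddings
```

Calling `embed_tokens(BUBBLE_SORT)` downloads the checkpoint on first use, so it needs network access. The returned matrix has one row per token, including the special start and end tokens, which you may want to drop before plotting.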
## 🛠 Dependencies
All required libraries are listed in `requirements.txt`:
```
transformers
torch
scikit-learn
numpy
matplotlib
gradio
Pillow
```
## 🖥️ Intended Use
- Academic teaching and demonstration of code embeddings
- Qualitative evaluation of pretrained models for source code
- Supplementary visualization for software engineering publications
## 📬 Contact
**Jorge Martinez-Gil**, Senior Research Scientist in Computer Science