---
title: Code Similarity Visualization with GraphCodeBERT
emoji: 🧠
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 5.38.0
app_file: app.py
pinned: false
license: mit
short_description: Augmenting the Interpretability of GraphCodeBERT
---

# Code Similarity Visualization with GraphCodeBERT

This interactive application visualizes token-level embeddings generated by [GraphCodeBERT](https://huggingface.co/microsoft/graphcodebert-base) for classical sorting algorithms. It supports pairwise comparison of algorithms based on their representation in the model’s embedding space, using PCA for dimensionality reduction.

## ✒️ Reference

Martinez-Gil, J. (2025). **Augmenting the Interpretability of GraphCodeBERT for Code Similarity Tasks**. *International Journal of Software Engineering and Knowledge Engineering*, 35(05), 657–678.

## 🚀 Features

- Select two classical sorting algorithms.
- Automatic tokenization and embedding via GraphCodeBERT.
- PCA-based projection into 2D space for visualization.
- Clear matplotlib plots showing token-level distribution differences.

## 🧠 Technical Overview

- **Model**: [`microsoft/graphcodebert-base`](https://huggingface.co/microsoft/graphcodebert-base)
- **Embedding Layer**: Last hidden state
- **Reduction**: Principal Component Analysis (PCA)
- **Interface**: Gradio
- **Languages**: Python 3.10+

## 🛠 Dependencies

All required libraries are listed in `requirements.txt`:

```
transformers
torch
scikit-learn
numpy
matplotlib
gradio
Pillow
```

## 🖥️ Intended Use

- Academic teaching and demonstration of code embeddings
- Qualitative evaluation of pretrained models for source code
- Supplementary visualization for software engineering publications

## 📬 Contact

**Jorge Martinez-Gil**
Senior Research Scientist in Computer Science
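
## 🧪 Example Sketch

The pipeline summarized in the Technical Overview (tokenize, take GraphCodeBERT's last hidden state, reduce to 2D with PCA) can be sketched roughly as follows. This is a minimal illustration, not the code in `app.py`: the function names `embed_tokens` and `project_pair_2d` are hypothetical, fitting a single PCA jointly on both snippets is an assumption, and the projection uses a plain NumPy SVD in place of scikit-learn's `PCA` so the helper runs with NumPy alone.

```python
import numpy as np


def embed_tokens(code, model_name="microsoft/graphcodebert-base"):
    """Return (tokens, last-hidden-state matrix) for one code snippet.

    Hypothetical helper; the real app may batch, cache, or pool differently.
    """
    # Imported lazily so the projection helper below works without torch.
    import torch
    from transformers import AutoModel, AutoTokenizer

    # Loading on every call keeps the sketch short; a real app would cache these.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        # Shape: (n_tokens, hidden_size), one vector per subword token.
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return tokens, hidden.numpy()


def project_pair_2d(emb_a, emb_b):
    """Fit one PCA on both embedding matrices and return 2D coordinates for each.

    Assumption: both snippets share the same axes so their points are comparable.
    """
    stacked = np.vstack([emb_a, emb_b])
    centered = stacked - stacked.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    return coords[: len(emb_a)], coords[len(emb_a):]
```

Calling `embed_tokens` on two sorting implementations (which downloads the model weights on first use) and passing the two matrices to `project_pair_2d` would give one 2D point per token for each algorithm, ready to scatter with matplotlib.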