Francesco
Gentile

TL;DR PhD student at the University of Trento decoding the internal logic of vision models through mechanistic interpretability.

Index.01
Personal Profile

Profile.

Academic background, research motivations, and the journey toward model transparency.

Timeline

2024 — Present

PhD in AI

University of Trento

Supervised by Prof. Elisa Ricci

2022 — 2024

MSc Artificial Intelligence

University of Trento

2019 — 2022

BSc Computer Science

University of Trento

Research Mapping

Mechanistic Interpretability (Current)
Temporal Graph Learning, Topological DL, Scene Understanding, Vision Models

I am a PhD student at the University of Trento, supervised by Prof. Elisa Ricci. My work sits at the intersection of deep learning and model transparency, focusing on how to better understand the decisions made by complex neural architectures.

My current research focuses on weight-based mechanistic interpretability within vision models. I am specifically interested in decomposing the weights of pre-trained models into simpler, more manageable components. By analyzing how these subcomponents compose and interact, I aim to provide a more granular view of the computations performed by vision architectures.

Prior to my PhD, I completed both my Bachelor's in Computer Science and my Master's in Artificial Intelligence at the University of Trento. During these years, my research interests were quite broad, covering areas such as Temporal Graph Learning, Topological Deep Learning, and Scene Understanding. These projects provided a foundation for my current focus on the internal structure and compositionality of deep learning models.

Francesco Gentile
Index.02
Updated 2026.06

Papers.

A formal record of peer-reviewed research papers, exploring the intersections of computer vision and mechanistic interpretability.

2026
1 Paper
Ref. 2026.03
New

Published Venue

Computer Vision and Pattern Recognition (CVPR)

From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition

As vision-language models are deployed at scale, understanding their internal mechanisms becomes increasingly critical. Existing interpretability methods predominantly rely on activations, making them dataset-dependent, vulnerable to data bias, and often restricted to coarse head-level explanations. We introduce SITH (Semantic Inspection of Transformer Heads), a fully data-free, training-free framework that directly analyzes CLIP's vision transformer in weight space. For each attention head, we decompose its value-output matrix into singular vectors and interpret each one via COMP (Coherent Orthogonal Matching Pursuit), a new algorithm that explains them as sparse, semantically coherent combinations of human-interpretable concepts. We show that SITH yields coherent, faithful intra-head explanations, validated through reconstruction fidelity and interpretability experiments. This allows us to use SITH for precise, interpretable weight-space model edits that amplify or suppress specific concepts, improving downstream performance without retraining. Furthermore, we use SITH to study model adaptation, showing how fine-tuning primarily reweights a stable semantic basis rather than learning entirely new features.
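The two steps named in the abstract, decomposing a head's value-output matrix into singular vectors and explaining each vector as a sparse combination of concepts, can be illustrated with a minimal NumPy sketch. This is not the SITH implementation: the weights and the concept dictionary are random placeholders, the function names are hypothetical, and plain orthogonal matching pursuit stands in for COMP, whose coherence mechanism is not described here.

```python
import numpy as np

def head_singular_vectors(w_v, w_o, k):
    """Top-k singular values and right singular vectors of W_V @ W_O."""
    w_vo = w_v @ w_o                      # (d_model, d_model), rank <= d_head
    _, s, vt = np.linalg.svd(w_vo, full_matrices=False)
    return s[:k], vt[:k]                  # rows of vt are unit-norm vectors

def omp(target, dictionary, n_nonzero):
    """Plain orthogonal matching pursuit (a stand-in, not COMP).

    Greedily picks dictionary rows so that target ~ coeffs @ dictionary[support].
    """
    residual = target.copy()
    support = []
    coeffs = np.zeros(0)
    for _ in range(n_nonzero):
        scores = np.abs(dictionary @ residual)
        if support:
            scores[support] = -np.inf     # never pick the same atom twice
        support.append(int(np.argmax(scores)))
        atoms = dictionary[support]       # (len(support), d_model)
        # Re-fit all selected coefficients jointly by least squares.
        coeffs, *_ = np.linalg.lstsq(atoms.T, target, rcond=None)
        residual = target - coeffs @ atoms
    return support, coeffs, residual

# Assumption: random stand-ins for one head's weights and for concept embeddings.
rng = np.random.default_rng(0)
d_model, d_head, n_concepts = 16, 4, 50
w_v = rng.standard_normal((d_model, d_head))
w_o = rng.standard_normal((d_head, d_model))
concepts = rng.standard_normal((n_concepts, d_model))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

sigma, vecs = head_singular_vectors(w_v, w_o, k=2)
support, coeffs, residual = omp(vecs[0], concepts, n_nonzero=5)
print("concept indices:", support)
print("relative residual:", np.linalg.norm(residual) / np.linalg.norm(vecs[0]))
```

In this toy setting the selected indices carry no meaning; with real concept embeddings, each index would map back to a human-readable concept and the residual would indicate how much of the singular vector the sparse explanation leaves unexplained.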

Index.03
Code & Artifacts

Projects.

A collection of technical artifacts, from research software to academic milestones.

REF.01
Course Work Completed

Continuous-Time Dynamic Graphs for Information Cascade Prediction

This work makes two contributions. First, we propose a novel approach to information cascade prediction that leverages continuous-time dynamic graphs to capture the complex interactions and dynamics among cascades on a global scale. Second, we introduce a mechanism for selecting the most informative neighbors when making predictions in memory-based models for continuous-time dynamic graphs.

#Topological DL #Temporal Graphs #Information Cascade Prediction
Coming Soon
REF.02
Thesis Completed

Reviving Graph Neural Networks for Human-Object Interaction Detection

This thesis addresses the challenges of Human-Object Interaction (HOI) detection by introducing three novel methodologies for complex scene understanding. It explores a graph-based approach to structure entity relationships, a hypergraph model to capture higher-order interactions beyond simple pairs, and a vision-language framework that leverages large language models to guide visual attention.

#HOI Detection #Topological DL #Scene Understanding #MSc Thesis
Coming Soon
REF.03
Course Work Completed

Estimation of Distribution using ENergy-based models

This work introduces Estimation of Distribution using ENergy-based models (EDEN), a novel Estimation of Distribution Algorithm (EDA) for black-box optimization. EDEN leverages a neural network equipped with hypergraph convolutions to approximate a population's fitness landscape as an energy-based probability model. Candidate solutions are subsequently generated by sampling this model using modified Langevin dynamics with adaptive noise.
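The sampling step can be illustrated with a minimal sketch of Langevin dynamics on a toy energy. Everything here is an assumption made for illustration: the quadratic energy replaces the learned hypergraph network, and the geometric `noise_decay` schedule is a hypothetical placeholder for the adaptive noise, whose actual form is not specified above.

```python
import numpy as np

def energy_grad(x, center):
    """Gradient of the toy energy E(x) = 0.5 * ||x - center||^2."""
    return x - center

def langevin_sample(x0, center, n_steps=500, step=0.05, noise_decay=0.99, seed=0):
    """Unadjusted Langevin updates with a decaying noise scale.

    Each step: x <- x - (step / 2) * grad E(x) + sqrt(step) * noise_scale * xi,
    where xi is standard Gaussian noise. noise_decay is a hypothetical schedule.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    noise_scale = 1.0
    for _ in range(n_steps):
        xi = rng.standard_normal(x.shape)
        x = x - 0.5 * step * energy_grad(x, center) + np.sqrt(step) * noise_scale * xi
        noise_scale *= noise_decay        # shrink exploration over time
    return x

# A population of 64 two-dimensional candidates, all starting at the origin.
center = np.array([3.0, -2.0])            # minimum of the toy energy
population = np.zeros((64, 2))
samples = langevin_sample(population, center)
print("mean of sampled candidates:", samples.mean(axis=0))
```

Decaying the noise makes the chain explore early and settle near low-energy regions later, which suits an optimization setting where the goal is good candidates rather than exact samples from the distribution.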

#Optimization #Energy-based Models #Topological DL
Coming Soon
REF.04
Course Work Completed

Referring Expression Comprehension as Scene Graph Grounding

This project introduces a novel, task-specific architecture for Referring Expression Comprehension (REC) that explicitly models complex scene semantics. Instead of standard word-to-image matching, the model extracts a textual scene graph to generate structured queries. Using a custom DETR-like decoder with Graph Attention and optimized bounding box proposals, the network accurately localizes target entities by verifying their specific visual relationships.

#REC #Scene Understanding #DETR #Graph Neural Networks
Coming Soon
REF.05
Thesis Completed

Autoencoder with Multi-Scale Spatio-Temporal Attention for Skeleton-based Action Recognition

This project introduces a single-stream Transformer architecture for skeleton-based human action recognition that processes joint and bone features concurrently to leverage their natural dependencies. To overcome the limitations of independent spatial and temporal modeling, it utilizes a novel Multi-Scale Spatio-Temporal Attention (MS-STA) module. By integrating efficient FastFormer mechanics alongside structural and temporal embeddings, the model is designed to dynamically capture complex cross-spacetime relationships with linear computational complexity.

#Skeleton-based Action Recognition #Graph Neural Networks #BSc Thesis
Coming Soon
Index.04
Bulletins & Field Notes

News.

A stream of updates, paper acceptances, and technical milestones from the field.

I will be in Denver for CVPR 2026 to present SITH. See you Friday, June 5th at 10:45am at poster 267!

SITH has been accepted to CVPR 2026.