I'm a student in the Master in High Performance Computing (MHPC) programme at SISSA / ICTP, Trieste. My work focuses on parallel and GPU-accelerated scientific computing — from cache-aware single-core kernels up to distributed solvers running on Leonardo and LUMI. More recently I've been applying that toolkit to ML/AI workloads: fine-tuning LLMs with rule-based RL, running multi-agent LLM systems on cluster GPUs, and Bayesian inference on real scientific data.
The repositories below collect coursework, group projects, and a few solo experiments. The HPC ones build with CMake or Make and ship with benchmarks where they make sense; the ML ones are reproducible end-to-end (data fetched from public archives, training reports checked in).
🌐 For the full experience visit the live portfolio — animated lattice background, project cards, terminal hero, the works.
Skills: languages · parallelism · scientific libraries · ML / AI tooling
The six projects below are pinned with ML/AI + HPC engineering roles in mind. The full set (9 projects, including atmospheric simulation, eigenvalue solvers, and modern C++) lives on the live portfolio site.
From-scratch REINFORCE / RLOO with PPO-style clipping and a KL penalty, applied to fine-tuning Llama-3.2-1B-Instruct for <think>…</think> reasoning on GSM8K. A miniature, didactic version of the recipe behind DeepSeek-R1-style reasoners — rule-based reward, no preference data, no learned reward model. Includes a small ablation over KL strength and group size with a discussion of the format-vs-reasoning reward-hacking trade-off. PyTorch, Transformers, TRL.
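The core of that recipe (a leave-one-out baseline plus a clipped policy-gradient update) fits in a few lines. A minimal sketch under my own names (`rloo_advantages`, `clipped_pg_loss` are not the repository's API), with the rule-based reward assumed to come from a format/answer checker:

```python
import math

def rloo_advantages(rewards):
    """Leave-one-out baseline: each sample's advantage is its reward minus
    the mean reward of the *other* samples in its group (group size > 1)."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

def clipped_pg_loss(logp_new, logp_old, adv, clip_eps=0.2):
    """PPO-style clipped objective for one sample; a KL penalty against the
    reference policy would be added on top of this in the full recipe."""
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps) * adv
    return -min(unclipped, clipped)  # minimise the negative objective
```

With one correct completion in a group of four, only that sample gets a positive advantage and the clip caps how far a single update can push the ratio.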
Treat the LLM as a particle in a statistical-mechanics system: sweep the Qwen2.5 family from 0.5B to 14B across eight temperatures, studying single-agent output distributions and N=4 multi-agent opinion dynamics over R=10 rounds. Headline finding: the genuine control parameter is model size, not sampling temperature — the Ising-style phase picture only fits the small models. Local Ollama, CrewAI for the multi-agent layer, SLURM template for CINECA Leonardo.
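As a toy illustration of the Ising-style round dynamics the project studies — binary opinions, a temperature-controlled alignment probability — and emphatically not the actual LLM pipeline (all names here are mine):

```python
import math
import random

def debate_round(opinions, temperature, rng):
    """One synchronous round: each agent re-samples a binary opinion with a
    logistic probability in the mean 'field' of the group -- a toy
    Ising-style update, not the CrewAI/Ollama pipeline."""
    n = len(opinions)
    h = (2 * sum(opinions) - n) / n          # net group opinion in [-1, 1]
    z = 2 * h / max(temperature, 1e-9)
    z = max(min(z, 700.0), -700.0)           # avoid math.exp overflow at T -> 0
    p_pos = 1.0 / (1.0 + math.exp(-z))
    return [1 if rng.random() < p_pos else 0 for _ in opinions]

def simulate(n_agents=4, rounds=10, temperature=0.5, seed=0):
    """N agents, R rounds, matching the N=4 / R=10 setting above."""
    rng = random.Random(seed)
    opinions = [rng.randint(0, 1) for _ in range(n_agents)]
    for _ in range(rounds):
        opinions = debate_round(opinions, temperature, rng)
    return opinions
```

At low temperature a unanimous group stays locked; at high temperature opinions decouple from the group — the qualitative picture the project tests against real model outputs.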
Quasi-periodic Gaussian Process built from scratch — hand-written kernel, log marginal likelihood via Cholesky factorisation, scipy.optimize for hyperparameters, posterior predictive from the standard equations — fit to a real Kepler stellar light curve to recover a candidate rotation period. Lomb–Scargle periodogram seeds the period prior. Data fetched live from the STScI archive; nothing to download.
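The hand-written pieces mentioned above — the quasi-periodic kernel and the Cholesky-based log marginal likelihood — look roughly like this NumPy sketch (parameter names are my own, not the repository's):

```python
import numpy as np

def quasi_periodic_kernel(t1, t2, amp, period, l_evol, l_per):
    """Squared-exponential envelope times an ExpSineSquared periodic term --
    the standard quasi-periodic form for stellar rotation signals."""
    dt = t1[:, None] - t2[None, :]
    return amp**2 * np.exp(
        -0.5 * (dt / l_evol) ** 2
        - 2.0 * np.sin(np.pi * dt / period) ** 2 / l_per**2
    )

def log_marginal_likelihood(t, y, yerr, params):
    """log p(y | t, params) via Cholesky factorisation of K + diag(yerr^2)."""
    K = quasi_periodic_kernel(t, t, *params) + np.diag(yerr**2)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(t) * np.log(2.0 * np.pi))
```

Maximising this over `(amp, period, l_evol, l_per)` — e.g. by handing its negative to `scipy.optimize.minimize` — is what recovers the rotation period, with the Lomb–Scargle peak seeding the period initial guess.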
GPU programming portfolio progressing from first CUDA kernels to a production-quality GPU-accelerated Lattice Boltzmann fluid solver. Shared-memory transpose with bank-conflict avoidance, distributed matrix multiplication via MPI + cuBLAS (Cannon-style), and a 2D Jacobi solver across multiple GPUs. The same kernel-writing fluency that's needed for fast ML training and inference paths.
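The tiled access pattern behind the shared-memory transpose can be sketched in NumPy. In the CUDA kernel each tile is staged through shared memory (padded to 33 columns so threads in a warp hit distinct banks); this sketch only mirrors the tiling, not the staging:

```python
import numpy as np

TILE = 32  # mirrors the usual 32x32 CUDA thread-block tile

def tiled_transpose(a, tile=TILE):
    """Transpose tile by tile: read a contiguous block, write its transpose
    to the mirrored position. In CUDA both the global read and the global
    write stay coalesced because the transpose happens inside the tile."""
    n_rows, n_cols = a.shape
    out = np.empty((n_cols, n_rows), dtype=a.dtype)
    for i in range(0, n_rows, tile):
        for j in range(0, n_cols, tile):
            block = a[i:i + tile, j:j + tile]       # coalesced read
            out[j:j + tile, i:i + tile] = block.T   # coalesced write
    return out
```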
Eight progressive HPC projects: distributed identity and matmul, Cannon's algorithm, OpenMP fundamentals, Jacobi solvers (pure MPI → hybrid MPI+OpenMP → parallel HDF5 output), and a 3D diffusion solver with FFTW3-MPI. Each project compares blocking, non-blocking, and collective communication strategies — the foundations behind any distributed-training stack.
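The Jacobi sweep those projects parallelise is tiny in serial form. A NumPy sketch of one step for the 2D Laplace problem (the MPI variants split the grid into slabs and precede each sweep with a halo exchange of boundary rows — blocking, non-blocking, or collective):

```python
import numpy as np

def jacobi_step(u):
    """One Jacobi sweep: each interior point becomes the average of its
    four neighbours; boundary values are held fixed."""
    new = u.copy()
    new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1]
                              + u[1:-1, :-2] + u[1:-1, 2:])
    return new
```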
Annotated C examples building intuition for why code performs the way it does on modern hardware: cache hierarchies (memory mountains, blocked transpose, AoS vs. SoA), branch prediction, loop reordering and unrolling for ILP, software prefetching, sparse-matrix layouts, FP rounding. Benchmarks across laptop, Leonardo, and LUMI. The microarchitectural floor every fast ML kernel sits on top of.
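The repository's examples are in C, but the AoS vs. SoA contrast can be shown in a few lines of NumPy — the structured-array field view below has the same strided access pattern that wastes cache lines in the C version:

```python
import numpy as np

n = 1_000

# Array-of-Structs: x, y, z interleaved in memory. Touching only x
# strides over the whole 24-byte record, dragging y and z through cache.
aos = np.zeros(n, dtype=[("x", "f8"), ("y", "f8"), ("z", "f8")])

# Struct-of-Arrays: each field contiguous. Touching x streams through
# memory at unit stride and vectorises cleanly.
soa_x = np.zeros(n)

aos["x"] += 1.0   # strided update: 24-byte step per element
soa_x += 1.0      # unit-stride update: 8-byte step per element
```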
→ finishing the MHPC programme at SISSA / ICTP
→ looking for ML/AI + HPC engineering roles — GPU programming, distributed training infrastructure, LLM systems
→ open to research collaborations on numerical PDEs, parallel solvers, and applied / scientific ML
