Gabriel-Pedde/README.md


parallel scientific computing — gpu kernels for ml + hpc — mpi · openmp · cuda · openacc — pytorch · llms · rloo · gp — numerical solvers for PDEs — leonardo · lumi · ulysses

MHPC Email GitHub

Live portfolio site


⟨ about ⟩

I'm a student in the Master in High Performance Computing (MHPC) programme at SISSA / ICTP, Trieste. My work focuses on parallel and GPU-accelerated scientific computing — from cache-aware single-core kernels up to distributed solvers running on Leonardo and LUMI. More recently I've been applying that toolkit to ML/AI workloads: fine-tuning LLMs with rule-based RL, running multi-agent LLM systems on cluster GPUs, and Bayesian inference on real scientific data.

The repositories below collect coursework, group projects, and a few solo experiments. The HPC ones build with CMake or Make and ship with benchmarks where they make sense; the ML ones are reproducible end-to-end (data fetched from public archives, training reports checked in).

🌐 For the full experience visit the live portfolio — animated lattice background, project cards, terminal hero, the works.


⟨ stack ⟩

languages

C++ C Fortran Python

parallelism

CUDA OpenACC MPI OpenMP cuBLAS

scientific libraries

NetCDF PETSc SLEPc HDF5 FFTW ParaView

ml / ai

PyTorch Transformers TRL CrewAI Ollama scikit-learn

tooling

CMake Linux SLURM Git Bash

benchmarked on: Leonardo LUMI Ulysses


⟨ featured projects ⟩

The six below are pinned for ML/AI + HPC engineering roles. The full set (9 projects, including atmospheric simulation, eigenvalue solvers, and modern C++) lives on the live portfolio site.

From-scratch REINFORCE / RLOO with PPO-style clipping and a KL penalty, applied to fine-tuning Llama-3.2-1B-Instruct for `<think>…</think>` reasoning on GSM8K. A miniature, didactic version of the recipe behind DeepSeek-R1-style reasoners — rule-based reward, no preference data, no learned reward model. Includes a small ablation over KL strength and group size with a discussion of the format-vs-reasoning reward-hacking trade-off. PyTorch, Transformers, TRL.
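The core of RLOO is its variance-reduction trick: each sampled completion's reward is baselined against the mean reward of the *other* completions in its group. A minimal numpy sketch of that advantage computation and the PPO-style clipped surrogate — illustrative only, not the repo's code; function names are my own:

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out advantages for one prompt group.

    rewards: shape (K,) — scalar rule-based rewards for K sampled
    completions of the same prompt. Each sample's baseline is the
    mean reward of the other K-1 samples, so the advantages sum to 0.
    """
    K = rewards.shape[0]
    baseline = (rewards.sum() - rewards) / (K - 1)  # leave-one-out mean
    return rewards - baseline

def ppo_clip_surrogate(ratio: float, adv: float, eps: float = 0.2) -> float:
    """PPO-style clipped objective (to be maximised): clipping the
    importance ratio caps how far one update can push the policy."""
    return float(np.minimum(ratio * adv,
                            np.clip(ratio, 1 - eps, 1 + eps) * adv))
```

With rewards `[1, 0, 0, 1]` the winning samples get advantage +2/3 and the losing ones −2/3; no learned value network is needed, which is what makes the recipe cheap enough for a 1B model on a single node.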

Treat the LLM as a particle in a statistical-mechanics system: sweep the qwen2.5 family from 0.5B to 14B across eight temperatures, studying single-agent output distributions and N=4 multi-agent opinion dynamics over R=10 rounds. Headline finding: the genuine control parameter is model size, not sampling temperature — the Ising-style phase picture only fits the small models. Local Ollama, CrewAI for the multi-agent layer, SLURM template for CINECA Leonardo.
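The "temperature" knob being swept is just the softmax temperature in token sampling: dividing the logits by T before normalising flattens the distribution as T grows and sharpens it toward argmax as T → 0. A short sketch of that mechanism (generic, not taken from the repo):

```python
import numpy as np

def softmax_with_temperature(logits, T: float) -> np.ndarray:
    """Token-sampling distribution at temperature T.

    High T flattens the distribution (more random answers);
    T -> 0 concentrates all mass on the argmax token.
    """
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()              # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```

In the heat-bath analogy this is exactly a Boltzmann distribution over tokens, which is what makes the Ising comparison natural; the finding above says the analogy only carries for small models.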

Quasi-periodic Gaussian Process built from scratch — hand-written kernel, log marginal likelihood via Cholesky factorisation, scipy.optimize for hyperparameters, posterior predictive from the standard equations — fit to a real Kepler stellar light curve to recover a candidate rotation period. Lomb–Scargle periodogram seeds the period prior. Data fetched live from the STScI archive; nothing to download.
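The numerically delicate piece is the log marginal likelihood, which the Cholesky factorisation makes both stable and cheap: log|K| falls out of the factor's diagonal and K⁻¹y becomes two triangular solves. A condensed numpy sketch under the standard quasi-periodic kernel (hyperparameter names are mine, not necessarily the repo's):

```python
import numpy as np

def quasi_periodic_kernel(t1, t2, amp, P, l_per, l_evo):
    """k(tau) = amp^2 * exp( -sin^2(pi*tau/P) / (2*l_per^2)
                             - tau^2 / (2*l_evo^2) )
    Periodic term encodes the rotation period P; the squared-exponential
    envelope lets spot patterns evolve on timescale l_evo."""
    tau = t1[:, None] - t2[None, :]
    periodic = np.sin(np.pi * tau / P) ** 2 / (2 * l_per ** 2)
    envelope = tau ** 2 / (2 * l_evo ** 2)
    return amp ** 2 * np.exp(-periodic - envelope)

def log_marginal_likelihood(t, y, sigma, **hyp):
    """GP evidence via Cholesky:
    -1/2 y^T K^-1 y - 1/2 log|K| - n/2 log(2*pi)."""
    K = quasi_periodic_kernel(t, t, **hyp) + sigma ** 2 * np.eye(len(t))
    L = np.linalg.cholesky(K)                     # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^-1 y
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()            # 1/2 log|K|
            - 0.5 * len(t) * np.log(2 * np.pi))
```

This is the quantity `scipy.optimize` maximises over the hyperparameters; seeding P from the Lomb–Scargle peak keeps the optimiser out of harmonics of the true period.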

GPU programming portfolio progressing from first CUDA kernels to a production-quality GPU-accelerated Lattice Boltzmann fluid solver. Shared-memory transpose with bank-conflict avoidance, distributed matrix multiplication via MPI + cuBLAS (Cannon-style), and a 2D Jacobi solver across multiple GPUs. The same kernel-writing fluency that's needed for fast ML training and inference paths.

Eight progressive HPC projects: distributed identity and matmul, Cannon's algorithm, OpenMP fundamentals, Jacobi solvers (pure MPI → hybrid MPI+OpenMP → parallel HDF5 output), and a 3D diffusion solver with FFTW3-MPI. Each project compares blocking, non-blocking, and collective communication strategies — the foundations behind any distributed-training stack.
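The Jacobi update at the heart of those projects is the same whatever the parallelisation: each interior point is replaced by the average of its four neighbours (plus the source term). A serial numpy sketch for orientation — the MPI versions keep this interior update verbatim and add only halo exchange of the boundary rows/columns between ranks:

```python
import numpy as np

def jacobi_step(u: np.ndarray, f: np.ndarray, h: float) -> np.ndarray:
    """One Jacobi sweep for -laplacian(u) = f on a uniform grid with
    Dirichlet boundaries (the boundary ring of u is held fixed)."""
    new = u.copy()
    new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1]     # north, south
                              + u[1:-1, :-2] + u[1:-1, 2:]   # west, east
                              + h * h * f[1:-1, 1:-1])       # source term
    return new
```

Because each sweep reads only the previous iterate, the update is embarrassingly parallel within a sweep; the communication cost lives entirely in the halo exchange, which is exactly what the blocking vs. non-blocking vs. collective comparisons measure.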

Annotated C examples building intuition for why code performs the way it does on modern hardware: cache hierarchies (memory mountains, blocked transpose, AoS vs. SoA), branch prediction, loop reordering and unrolling for ILP, software prefetching, sparse-matrix layouts, FP rounding. Benchmarks across laptop, Leonardo, and LUMI. The microarchitectural floor every fast ML kernel sits on top of.


⟨ currently ⟩

→ finishing the MHPC programme at SISSA / ICTP
→ looking for ML/AI + HPC engineering roles — GPU programming, distributed training infrastructure, LLM systems
→ open to research collaborations on numerical PDEs, parallel solvers, and applied / scientific ML

⟨ contact ⟩

Email GitHub Portfolio

Pinned

  1. Best-Practice-MHPC (Public)

    2D atmospheric fluid dynamics simulation of a thermal bubble rising in dry air, implemented in Fortran with progressive parallelisation strategies: pure MPI, hybrid MPI+OpenMP, hybrid MPI+OpenACC (…

    Fortran 1

  2. SLEPc_Schrodinger_Solver (Public)

    Code to solve the Schrödinger equation in both 2D and 3D, with configurable boundary conditions and potentials, based on the SLEPc and PETSc packages

    C 1

  3. gp-kepler-from-scratch (Public)

    A quasi-periodic Gaussian Process built from scratch — no `sklearn.gaussian_process`, no `gpytorch`, no `george` — and fit to a real Kepler stellar light curve to recover a candidate rotation period.

    Jupyter Notebook

  4. llama-rloo-reasoning (Public)

    Fine-tuning a Llama model with rule-based RL to follow an XML formatting standard on GSM8K math problems, trained and benchmarked on the Leonardo supercomputer. Also contains REINFORCE wit…

    Jupyter Notebook

  5. llm-opinion-dynamics (Public)

    A small research study: treat AI agents as particles interacting in a heat bath (sampling temperature playing the role of bath temperature), ask them the same arithmetic question, and study h…

    Python

  6. Parallel-computing (Public)

    HPC implementations in MPI, OpenMP, and hybrid MPI+OpenMP. Covers distributed matrix operations, Jacobi solvers, parallel I/O with HDF5, and spectral PDE solving with FFTW-MPI. Benchmarked on the L…

    Makefile