
EchoFlow


Do you have terabytes of unprocessed sonar data? Use this for easy viewing and inference!

EchoFlow is a three-stage containerised pipeline that converts raw Kongsberg EK80 echosounder files into echograms and DINO ViT attention maps.

Stages:

  1. Conversion (raw) — decodes .raw pings to volume-backscattering strength NetCDF files via pyEcholab.
  2. Pre-processing (preprocessing) — contrast-stretches and tiles echograms as PNGs.
  3. Inference (infer) — runs a DINO Vision Transformer to produce per-patch attention heat-maps.

Features

  • Conversion: Decodes raw EK80 pings into NetCDF backscatter files.
  • Preprocessing: Contrast-stretches and tiles echograms into PNGs ready for inference.
  • Inference: Utilises DINO for visual inference and attention inspection.
  • Docker Support: Run the pipeline in an isolated Docker environment for consistency and ease of use.

Repository Structure

.
├── raw_consumer/                 # Stage 1: raw → xarray/NetCDF conversion
│   ├── Dockerfile.raw            # Dockerfile for volume-backscatter computation with pyEcholab
│   └── preprocessing.py          # Converts .raw files to volume-backscatter cubes
├── preprocessing/                # Stage 2: preprocessing components
│   ├── Dockerfile.preprocessing  # Dockerfile for preprocessing
│   └── preprocessing.py          # Preprocesses input data for inference
├── inference/                    # Stage 3: inference components
│   ├── Dockerfile.infer          # Dockerfile for inference
│   ├── attention_inspect.py      # Inspects attention maps
│   ├── inspect_attention.py      # Main script for running inference and inspection
│   ├── requirements.txt          # Python dependencies for the inference demo
│   ├── utils.py                  # Utility functions for the inference demo
│   └── vision_transformer.py     # DINO Vision Transformer model implementation
├── docker-compose.yml            # Docker Compose file to run the entire pipeline
├── entrypoint.sh                 # Entrypoint script for the Docker containers
├── infer.py                      # Main script to run inference outside Docker
├── run_docker.sh                 # Script to run the pipeline using Docker
└── watchdog.py                   # Progress monitor: watches the pipeline's data directories

Installation

Prerequisites

  • Docker ≥ 24
  • Docker Compose v2 (docker compose — note: no hyphen)
  • Git
  • AWS CLI (for downloading the sample input file)

Clone with submodules

git clone --recurse-submodules https://github.com/erlingdevold/EchoFlow.git

If you have already cloned without submodules:

git submodule update --init --recursive

Without Docker (native install)

The raw_consumer stage requires HDF5 and netCDF C libraries:

  • macOS: brew install hdf5 netcdf
  • Ubuntu/Debian: apt-get install libhdf5-dev libnetcdf-dev
  • Docker images include these already — no action needed for containerised use.

Then install Python dependencies for the stages you need:

pip install -r raw_consumer/requirements.txt
pip install -r preprocessing/requirements.txt
pip install -r inference/requirements.txt

Populate input

The commands below fetch a publicly available NOAA EK80 test file (~105 MB) into data/input/, create an empty placeholder checkpoint for the inference stage, and sync git submodules before the Docker images are built.

aws s3 cp --no-sign-request \
  "s3://noaa-wcsd-pds/data/raw/Bell_M._Shimada/SH2306/EK80/Hake-D20230811-T165727.raw" \
  data/input/

touch ./inference/checkpoint.pth

git submodule sync --recursive

Setup

Running with Docker

Run the full pipeline

docker compose up --build

This builds and starts all three stages (Conversion, Pre-processing, Inference) plus the progress monitor.

Run individual stages

You can also run each stage independently:

Stage                      Command
Stage 1 — Conversion       docker compose up --build raw
Stage 2 — Pre-processing   docker compose up --build preprocessing
Stage 3 — Inference        docker compose up --build infer
All stages (no monitor)    docker compose up --build raw preprocessing infer

Each stage reads from and writes to bind-mounted directories under ./data/, so stages can be run in sequence without rebuilding upstream containers.

Monitor dashboard

Once the pipeline is running, a progress monitor is available at http://localhost:8050.

The dashboard is a pipeline progress monitor — it tracks file counts and tail-logs for each stage. It is not an inference viewer. Actual outputs are written to:

  • data/raw_consumer/ — converted NetCDF files (Stage 1 output)
  • data/preprocessing/ — echogram PNGs (Stage 2 output)
  • data/inference/ — attention map PNGs (Stage 3 output)

Performance / parallelism

Stages 1 (conversion) and 2 (pre-processing) use process pools to parallelise work across files, controlled by the MAX_WORKERS environment variable (default: CPU count). Stage 3 (inference) runs batched GPU inference with a configurable BATCH_SIZE (default: 4) and auto-detects CUDA when available. Because .raw files are large, datagram-based files, file I/O is the primary bottleneck for stages 1–2; adding CPU cores within a node yields near-proportional throughput gains until I/O saturates. Parallelism is intra-node only; there is no built-in cluster or scheduler integration.
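A minimal sketch of the worker-pool pattern described above, assuming a stand-in convert function (the real per-file work lives in the stage scripts):

```python
import os
from concurrent.futures import ProcessPoolExecutor

def convert(path):
    """Stand-in for per-file work (e.g. decoding one .raw file)."""
    return path.upper()

def run_pool(files):
    # MAX_WORKERS caps the process pool; fall back to CPU count,
    # mirroring the default behaviour described above.
    workers = int(os.environ.get("MAX_WORKERS", os.cpu_count() or 1))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert, files))

if __name__ == "__main__":
    print(run_pool(["a.raw", "b.raw"]))
```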

Benchmarking

The benchmark script runs the full three-stage pipeline inside Docker on real EK80 data, varying MAX_WORKERS to measure scaling:

# 1. Download 64 test echograms (~6.4 GB) from NOAA public bucket
bash download_bench_data.sh

# 2. Build Docker images
docker compose build

# 3. Install requirements
pip install tyro

# 4. Run benchmark (default: 1,2,4,8 workers)
python benchmark.py

See python benchmark.py --help for options. Reference results on a 12-core Intel i7-12700, 32 GB RAM, RTX 3090 (64 files, 6.7 GB):

Workers   Stage 1 (s)   Stage 2 (s)   Stage 3 (s)   Total (s)   s/file
1             139.6         514.0         393          1046.6     16.35
2              86.5         275.7         393           755.2     11.80
4              56.3         156.2         393           605.5      9.46
8              56.9          97.2         393           547.1      8.55

Stages 1–2 scale with MAX_WORKERS; stage 3 is GPU-bound. In production, stages overlap via the watchdog. Run python benchmark.py on your hardware to regenerate.
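As a sanity check, the speedups implied by the table can be computed directly (plain arithmetic on the 1- vs 8-worker rows; nothing here comes from benchmark.py itself):

```python
def speedup(t1, tn):
    """Parallel speedup of an n-worker run relative to 1 worker."""
    return t1 / tn

# Stage 2 times from the benchmark table (seconds)
s2 = speedup(514.0, 97.2)        # ≈ 5.3× with 8 workers
# Stage 3 is GPU-bound at 393 s, so end-to-end gains flatten out
total = speedup(1046.6, 547.1)   # ≈ 1.9× overall
```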

ENV variables

Global

  • MAX_WORKERS — max parallel workers for stages 1–2 (default: CPU count)

watchdog.py

  • LOG_DIR (default: "/data/log")
  • INPUT_DIR (default: "/data/sonar")
  • OUTPUT_DIR (default: "/data/processed")

raw.py (Stage 1)

  • INPUT_DIR (default: "/data/sonar")
  • OUTPUT_DIR (default: "/data/processed")
  • LOG_DIR (default: "log")

preprocessing.py (Stage 2)

  • INPUT_DIR (default: "/data/processed")
  • OUTPUT_DIR (default: "/data/test_imgs")
  • LOG_DIR (default: ".")
  • KEEP_INTERMEDIATES — preserve .nc files after processing (default: "true")

inspect_attention.py (Stage 3)

  • INPUT_DIR (default: "/data/test_imgs")
  • OUTPUT_DIR (default: "/data/inference")
  • LOG_DIR (default: ".")
  • PATCH_SZ (default: 8)
  • ARCH (default: 'vit_small')
  • DOWNSAMPLE_SIZE (default: 5000)
  • BATCH_SIZE — images per GPU forward pass (default: 4)
  • DEVICE — "cuda", "cpu", or auto-detect (default: auto)
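As a sketch of how a stage might consume these variables (the reading pattern below is illustrative; the actual scripts may differ):

```python
import os

# Defaults mirror the inspect_attention.py table above.
INPUT_DIR = os.environ.get("INPUT_DIR", "/data/test_imgs")
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "/data/inference")
PATCH_SZ = int(os.environ.get("PATCH_SZ", "8"))
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "4"))

def resolve_device(requested=None):
    """DEVICE semantics: 'cuda', 'cpu', or unset for auto-detect."""
    requested = requested or os.environ.get("DEVICE")
    if requested in ("cuda", "cpu"):
        return requested
    try:
        import torch  # imported only for CUDA auto-detection
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"
```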

Output

The output of the inference step, including generated attention maps and transformed images, is saved in data/inference/. Each run creates a subdirectory named after the input file, keeping runs separated.
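A minimal sketch of that layout (the helper name and root path here are illustrative, not the repository's actual code):

```python
from pathlib import Path

def run_output_dir(input_file, root="data/inference"):
    # One subdirectory per run, named after the input file's stem.
    return Path(root) / Path(input_file).stem

print(run_output_dir("Hake-D20230811-T165727.raw").as_posix())
# → data/inference/Hake-D20230811-T165727
```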

Contributing

See CONTRIBUTING.md for guidelines on reporting bugs, suggesting features, and submitting pull requests.

License

Licensed under the MIT License — see LICENSE for details.

Acknowledgements

This pipeline uses the DINO Vision Transformer for attention-based image analysis. The implementation is based on research from the original DINO paper by Facebook AI Research (FAIR).