3. System Design and Implementation

This section presents the technical design and implementation of the autonomous aiming system. The system is structured as a multi-stage pipeline that processes camera frames through detection, tracking, 3D localisation, and predictive aiming before issuing gimbal commands. Each subsection describes one stage of the pipeline, with emphasis on the engineering decisions that shaped the design.

3.1 System Architecture

The system is implemented in Python and runs as a set of concurrent threads, each responsible for a distinct stage of the pipeline. The entry point (run_aimbot.py) instantiates the AimbotPipeline class, which creates all worker threads and the communication channels between them. Figure 3-1 illustrates the complete data flow.

Figure 3-1: System Pipeline Architecture

flowchart TB
    subgraph HW ["HARDWARE"]
        direction LR
        CAM["🎥 Hikvision Camera\n200 FPS"]
        IMU_HW["📐 Serial IMU\n460800 baud"]
        GIMBAL["🎯 Gimbal MCU\nUSB Serial"]
    end

    subgraph DETECT ["DETECTION THREAD"]
        direction TB
        DET["YOLO11-nano Detection\n+ Traditional CV Refinement"]
        TRK["Kalman Tracker\n4 independent filters\nCenter · Radius · Angle"]
        PNP["PnP Solver — IPPE\n5-point planar geometry"]
        TRF["IMU Coordinate Transform\nCamera Frame → World Frame"]
        DET -->|"ModelResult\n[plates + center]"| TRK
        TRK -->|"PowerRune state\n[angles, IDs, radii]"| PNP
        PNP -->|"tvec, rvec\n[3D pose]"| TRF
    end

    subgraph PREDICT ["PREDICTION THREAD"]
        direction TB
        POS["Position Filter\nStatic Kalman Filter in 3D\n+ EMA Normal Smoothing"]
        VEL["Velocity Estimator\nAsync scipy curve_fit\na·sin(ωt+φ)+b"]
        LEAD["Lead Angle Computation\nAnalytical Integration\n+ Bullet Drop ½gt²"]
        FIRE["Fire Decision\nDesperation Curve\nPrecision vs Timing"]
        POS -->|"stable center\n+ plane normal"| VEL
        VEL -->|"fitted params\nor constant ⅓π"| LEAD
        LEAD -->|"yaw, pitch\n(absolute)"| FIRE
    end

    subgraph COMM ["COMMUNICATION THREAD"]
        TX["Serial TX — Yaw · Pitch · Fire\nCRC-16 Protected Packets"]
        RX["Serial RX — Projectile Speed\nReferee System Telemetry"]
    end

    CAM -->|"Mailbox ‹FramePacket›"| DET
    IMU_HW -->|"Mailbox ‹IMUPacket›"| TRF
    IMU_HW -.->|"peek · latest IMU"| LEAD
    TRF -->|"Mailbox ‹PowerRune›"| POS
    FIRE -->|"Mailbox ‹PredictionOut›"| TX
    RX -.->|"Mailbox ‹ProjectileSpeed›"| VEL
    TX --> GIMBAL
    GIMBAL --> RX

    style HW fill:#2d2d3f,stroke:#7c7caa,color:#ddd,stroke-width:2px
    style DETECT fill:#1b2838,stroke:#4a90d9,color:#ddd,stroke-width:2px
    style PREDICT fill:#1b2838,stroke:#4a90d9,color:#ddd,stroke-width:2px
    style COMM fill:#2d2d3f,stroke:#7c7caa,color:#ddd,stroke-width:2px

The pipeline consists of five principal workers:

CameraWorker captures frames from either a Hikvision industrial camera or a pre-recorded video file, and publishes them for downstream processing.
IMUWorker reads orientation data from a serial inertial measurement unit and publishes it for use by the coordinate transformer and predictor.
DetectionRunner consumes frames and IMU data, runs the detection model, tracks the rune state, solves PnP for 3D pose, and applies the IMU-based world-frame transform.
PredictionWorker consumes the stabilised rune state, estimates the velocity profile, computes the angular lead, applies ballistic corrections, and decides when to fire.
GimbalCommunicationWorker transmits the final aiming commands over USB serial and receives projectile speed telemetry from the robot's referee system module.

Inter-thread communication is handled exclusively through a custom Mailbox class. Unlike a standard FIFO queue, the Mailbox is a single-slot container: each put() overwrites the previous value, and get() consumes it. This design was a deliberate engineering choice driven by the real-time nature of the system. At 200 frames per second, any stage that falls behind would accumulate a growing backlog of stale data in a traditional queue. The Mailbox ensures that every worker always operates on the most recent data available, eliminating latency accumulation at the cost of discarding intermediate values. The Mailbox is implemented using a threading.Condition variable, providing thread-safe blocking with timeout support for both get() and non-consuming peek() operations.
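A single-slot mailbox of the kind described above can be sketched as follows. This is a minimal illustration built on threading.Condition, not the project's actual class: the method bodies, the private helper _wait, and the choice to return None on timeout are assumptions.

```python
import threading
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


class Mailbox(Generic[T]):
    """Single-slot container: put() overwrites, get() consumes, peek() does not."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._value: Optional[T] = None
        self._has_value = False

    def put(self, value: T) -> None:
        with self._cond:
            self._value = value          # overwrite any unread value
            self._has_value = True
            self._cond.notify_all()      # wake any blocked readers

    def _wait(self, timeout: Optional[float]) -> bool:
        # Caller must already hold the condition's lock.
        return self._cond.wait_for(lambda: self._has_value, timeout=timeout)

    def get(self, timeout: Optional[float] = None) -> Optional[T]:
        with self._cond:
            if not self._wait(timeout):
                return None              # timed out with an empty slot
            value, self._value = self._value, None
            self._has_value = False
            return value

    def peek(self, timeout: Optional[float] = None) -> Optional[T]:
        with self._cond:
            if not self._wait(timeout):
                return None
            return self._value           # slot stays filled
```

If a producer calls put() twice before the consumer wakes, the consumer sees only the second value, which is the freshness-over-completeness trade-off described above.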

The system is configured through two JSON configuration files: detector_config.json for the camera, model, tracker, PnP, and IMU parameters, and predictor_config.json for the position filter, velocity estimator, and fire control parameters. This separation allows detector tuning and predictor tuning to be performed independently.

3.2 Detection Module

The detection module is responsible for identifying the Power Rune's components in each camera frame: the five illuminated arm plates and the central hub. It uses a hybrid approach combining a neural network for coarse localisation with traditional computer vision for refinement.

3.2.1 Dataset Preparation and Model Training

The detection model was trained on a custom dataset curated from annotated Power Rune footage. The original annotations were in YOLO keypoint format (bounding boxes with associated keypoint coordinates). A conversion pipeline (prepare_dataset.py) was developed to transform these into standard horizontal bounding box (HBB) detection labels with three classes:

Class 0 (Target): Plates that need to be shot (previously classes 0 and 2 in the source data, merged).
Class 1 (Hit): Plates that have already been activated (previously classes 1 and 3, merged).
Class 2 (Center): A bounding box for the rune's central hub, which does not exist in the original annotations and had to be synthesised.

The center bounding box was generated by computing the centroid of all keypoints that fall outside any existing plate bounding box (these correspond to keypoints on the central hub structure). The box dimensions were set proportionally to the average plate width using the physical ratio of the center hub diameter to the plate width (33:122), producing tight, correctly scaled labels.

The YOLO11-nano model was trained for 150 epochs at 640×640 resolution with augmentations tailored to the Power Rune domain. Full 360-degree rotation augmentation (degrees=180, i.e. rotations sampled from ±180°) was applied because the rune can appear at any orientation as it spins. Strong HSV saturation and brightness jitter was used to handle the contrast between the brightly glowing LED plates and the dark arena background. Mosaic augmentation was enabled throughout training to improve detection of small objects at varying distances. The model was trained with the AdamW optimiser and cosine learning rate scheduling, with early stopping after 30 epochs without improvement.
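For reference, the training settings stated above could be collected as keyword arguments for an Ultralytics training call. This is a sketch only: the weights file name, the dataset YAML name, and the mosaic and close_mosaic values are assumptions, and the HSV gains are omitted because the text does not give numbers for them.

```python
# Keyword arguments limited to what the text states, plus labelled assumptions.
TRAIN_ARGS = {
    "epochs": 150,
    "imgsz": 640,
    "degrees": 180,        # rotation sampled from +/-180 degrees
    "mosaic": 1.0,         # assumed mosaic probability
    "close_mosaic": 0,     # assumed: mosaic is never switched off near the end
    "optimizer": "AdamW",
    "cos_lr": True,        # cosine learning rate schedule
    "patience": 30,        # early-stopping patience in epochs
}

# Hypothetical usage (needs the ultralytics package and a dataset YAML):
# from ultralytics import YOLO
# YOLO("yolo11n.pt").train(data="rune.yaml", **TRAIN_ARGS)
```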

3.2.2 Inference Pipeline

During inference, each frame undergoes three stages of processing:

Gamma preprocessing. Before being passed to the neural network, the frame is darkened using a lookup table (LUT) based gamma correction. This suppresses ambient lighting and enhances the contrast of the glowing LED plates against the background, making the model's task substantially easier. The gamma value is configurable (default γ = 3.0).

YOLO inference. The darkened frame is processed by the YOLO11-nano model, which outputs bounding boxes with class predictions. For each detection of class 2 (center), the bounding box centroid is used directly as the target point. For plate detections (classes 0 and 1), a more precise localisation is needed.

Traditional CV refinement. For plates, the system extracts the region of interest from the original (non-darkened) frame and applies a brightness threshold (pixel value > 200) to isolate the glowing LED pattern. After morphological opening to remove noise, the bounding rectangle of the remaining bright pixels is computed, and its centroid is used as the refined plate center. This consistently produces more accurate center estimates than the YOLO bounding box centroid, because the bounding box includes non-illuminated structural elements of the arm. Additionally, HSV colour masking is applied within each plate ROI to classify the plate as Red or Blue. The hue ranges are tuned for the specific LED colours used in the RoboMaster arena and are expressed on OpenCV's 8-bit hue scale, which stores hue in half-degrees (0–179): red wraps around the 0/180 boundary, while blue falls in the 90–140 range. A minimum pixel threshold of 5% of the ROI area prevents spurious classifications on low-saturation regions.
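The gamma preprocessing step can be illustrated with a small lookup-table sketch. The function names are hypothetical, and NumPy indexing is used here in place of an OpenCV call so the example has no OpenCV dependency; cv2.LUT(frame, lut) would apply the same table.

```python
import numpy as np


def build_gamma_lut(gamma: float = 3.0) -> np.ndarray:
    """256-entry table mapping each 8-bit value v to 255 * (v / 255) ** gamma."""
    levels = np.arange(256, dtype=np.float64) / 255.0
    return np.round(255.0 * levels ** gamma).astype(np.uint8)


def darken(frame: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """Apply the table to every pixel of a uint8 image."""
    return lut[frame]
```

With γ greater than 1, mid-range values are pushed towards black while fully saturated pixels stay at 255, which is why the glowing plates survive the darkening.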

3.3 Tracking Module

Raw frame-by-frame detections are noisy, intermittent, and lack temporal continuity. The tracker is responsible for maintaining a stable, continuous model of the rune's state across frames, including the center position, radius, angular state, rotation speed, and the identity of each arm.

3.3.1 Kalman Filter Design

The tracker employs four independent Kalman Filters, each modelling a 2D state vector of [position, velocity]:

Center X and Center Y: Track the pixel coordinates of the rune's center. High process noise (configurable, default Q = 1000) allows the filter to follow sudden position changes caused by gimbal motion, while the measurement noise (R = 1.0) is kept relatively low because the YOLO center detection is reasonably accurate.
Radius: Tracks the pixel distance from center to plates. Low process noise (Q = 1.0) is used because the physical radius is rigid and should not change; only measurement noise causes variation.
Reference Angle: Tracks the angular position of "Arm 0" (the reference arm), from which all other arms are offset by 72°. High process noise (Q = 500) is necessary to track the sinusoidal velocity profile of the Large Rune without excessive lag.

Each filter's state transition matrix and process noise covariance are recomputed at every step using the actual measured time delta between frames (dt), rather than assuming a fixed frame rate. This is important because the camera frame rate is not perfectly constant, and using a fixed dt would introduce systematic errors in the velocity estimates.
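A minimal sketch of one such filter is shown below, with the transition matrix and process noise rebuilt from the measured dt at every step. The class name, the initial covariance, and the white-noise-acceleration form of Q are assumptions; the text does not specify how Q is discretised.

```python
import numpy as np


class ConstantVelocityKF:
    """1D Kalman filter over [position, velocity] with a per-step dt."""

    def __init__(self, q: float, r: float) -> None:
        self.x = np.zeros(2)              # state: [position, velocity]
        self.P = np.eye(2) * 1e3          # large initial uncertainty (assumed)
        self.q = q                        # process noise intensity
        self.r = r                        # measurement noise variance
        self.H = np.array([[1.0, 0.0]])   # only position is measured

    def predict(self, dt: float) -> None:
        F = np.array([[1.0, dt], [0.0, 1.0]])
        # Discretised white-noise-acceleration covariance (assumed form).
        Q = self.q * np.array([[dt**4 / 4, dt**3 / 2],
                               [dt**3 / 2, dt**2]])
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + Q

    def update(self, z: float) -> None:
        y = z - (self.H @ self.x)[0]                    # innovation
        S = (self.H @ self.P @ self.H.T)[0, 0] + self.r
        K = (self.P @ self.H.T)[:, 0] / S               # Kalman gain, shape (2,)
        self.x = self.x + K * y
        self.P = (np.eye(2) - np.outer(K, self.H[0])) @ self.P
```

In use, the caller would pass the difference between consecutive frame timestamps to predict() before each update().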

3.3.2 Buffered Initialisation

Before steady-state tracking can begin, the tracker must establish reliable initial estimates for the rune's angular velocity and rotation direction. Rather than attempting to initialise from a single frame, the tracker implements a buffered initialisation phase. During this phase, it collects timestamped angle measurements for at least 200 milliseconds and a minimum of 5 frames. It then applies numpy.unwrap to handle angle wrap-around (e.g., a measurement sequence crossing from 359° to 1°), followed by linear regression (numpy.polyfit) on the unwrapped angles to compute the initial angular velocity as the slope of the best-fit line. This approach is robust to both high and low frame rates: at 200 FPS, the 200-millisecond window provides approximately 40 data points for a reliable regression; at 30 FPS the same window yields only about 6, and the 5-frame minimum keeps the regression from becoming degenerate if the frame rate drops further or frames are lost.
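The unwrap-then-regress step can be sketched as below. The function name and its None return for an insufficient buffer are assumptions for illustration; the two thresholds are the ones given in the text.

```python
import numpy as np


def estimate_initial_velocity(timestamps, angles_rad,
                              min_duration=0.2, min_frames=5):
    """Return angular velocity in rad/s from buffered samples, or None if the
    buffer does not yet span min_duration seconds and min_frames samples."""
    t = np.asarray(timestamps, dtype=float)
    theta = np.asarray(angles_rad, dtype=float)
    if len(t) < min_frames or (t[-1] - t[0]) < min_duration:
        return None
    unwrapped = np.unwrap(theta)              # remove 2*pi jumps
    slope, _intercept = np.polyfit(t - t[0], unwrapped, 1)
    return float(slope)                       # sign gives rotation direction
```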

During initialisation, the tracker also classifies the rune as "small" or "large" by counting the average number of simultaneously illuminated target plates. The Small Rune illuminates one target at a time, while the Large Rune illuminates two. If the average count exceeds 1.4, the rune is classified as "large." This classification determines the velocity model used downstream by the predictor.

3.3.3 Measurement Association and Plate Identity

In steady-state tracking, the tracker must associate each detected plate with one of the five arm slots (0 through 4). This is done through angular gating. The filter first predicts the reference angle for the current timestep, then computes the expected angle for each arm as ref_angle + i × 72° × direction. Each detected plate's measured angle is compared to all five expected positions, and the detection is assigned to the slot with the smallest angular difference, provided the difference is below a configurable gating threshold (default: 0.26 radians, approximately 15°). Detections that fall outside all gating windows are rejected as noise or false positives.
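The gating rule above can be sketched as a small function. The names are hypothetical, and direction is assumed to be +1 or −1.

```python
import math


def wrap_to_pi(angle: float) -> float:
    """Wrap an angle to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def assign_slot(measured: float, ref_angle: float, direction: int,
                gate: float = 0.26):
    """Return the arm index (0-4) nearest to `measured`, or None if no
    expected arm position lies within the gate (all angles in radians)."""
    step = math.radians(72.0)
    best_slot, best_diff = None, gate
    for i in range(5):
        expected = ref_angle + i * step * direction
        diff = abs(wrap_to_pi(measured - expected))
        if diff < best_diff:
            best_slot, best_diff = i, diff
    return best_slot
```

For example, with the reference arm at 0° and positive direction, a plate measured at 146° falls 2° from arm 2 and is accepted, while a plate at 36° is more than 15° from every arm and is rejected.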

To avoid the circular mean problem when averaging reference angle measurements (where naively averaging 1° and 359° would yield 180°), the tracker uses circular statistics: it computes the mean of the sine and cosine components of the measured angles and recovers the mean angle via arctan2. The Kalman Filter is then updated using the shortest angular distance between this mean and the current state, preventing discontinuities at the 0°/360° boundary.
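Both operations in this paragraph are short enough to show directly. The function names are illustrative, not taken from the codebase.

```python
import math


def circular_mean(angles_rad):
    """Mean direction via the averaged sine and cosine components."""
    s = sum(math.sin(a) for a in angles_rad) / len(angles_rad)
    c = sum(math.cos(a) for a in angles_rad) / len(angles_rad)
    return math.atan2(s, c)


def shortest_angular_distance(target: float, current: float) -> float:
    """Signed difference target - current, wrapped to [-pi, pi)."""
    return (target - current + math.pi) % (2.0 * math.pi) - math.pi
```

Averaging 1° and 359° this way gives approximately 0° rather than 180°, and the distance from 359° to 1° comes out as +2° rather than −358°.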

3.4 3D Localisation and World-Frame Stabilisation

Pixel-space tracking provides the rune's angular state and plate positions in the image, but the predictor requires the rune's position in three-dimensional space to compute bullet flight time and aiming angles. This is achieved through a three-stage process: PnP pose estimation, IMU-based coordinate transformation, and world-frame position filtering.

3.4.1 PnP Solver

The PnP solver takes the five tracked plate positions (in pixel coordinates) and the known physical geometry of the rune (five points equally spaced at 72° intervals on a circle of radius 0.285 metres, all lying on the Z = 0 plane) and computes the translation vector (tvec) and rotation vector (rvec) that describe the rune's pose in the camera's coordinate frame. The OpenCV SOLVEPNP_IPPE method is used, which is specifically optimised for planar point configurations and avoids the degeneracies that general-purpose PnP solvers can encounter when all points are coplanar. The 3D position of each plate in the camera frame (position_cam) is then computed by applying the solved rotation and translation to the known object-space coordinates.
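The object-space model and the final camera-frame projection can be sketched as follows. The starting angle and ordering of the five points are assumptions, as are the function names. With OpenCV available, the points would be passed to cv2.solvePnP with flags=cv2.SOLVEPNP_IPPE, and the returned rvec converted to a matrix with cv2.Rodrigues before calling to_camera_frame.

```python
import numpy as np

RUNE_RADIUS_M = 0.285  # plate circle radius stated in the text


def rune_object_points(radius: float = RUNE_RADIUS_M) -> np.ndarray:
    """Five points at 72-degree spacing on a circle in the Z = 0 plane."""
    angles = np.deg2rad(72.0 * np.arange(5))
    return np.stack([radius * np.cos(angles),
                     radius * np.sin(angles),
                     np.zeros(5)], axis=1).astype(np.float32)


def to_camera_frame(object_points, R, tvec) -> np.ndarray:
    """position_cam = R @ p + t for each object-space point p."""
    return object_points @ np.asarray(R).T + np.asarray(tvec).reshape(1, 3)
```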

3.4.2 IMU-Based Coordinate Transformation

The camera-frame coordinates produced by PnP are relative to the camera's current orientation, which changes as the gimbal moves. To obtain coordinates that remain stable regardless of turret motion, an IMU mounted on the gimbal streams real-time Euler angles (pitch, roll, yaw) over a 460800-baud serial link.

A critical engineering challenge here is temporal synchronisation. The IMU and camera run on independent clocks and at different rates. The transformer maintains a circular buffer of the 200 most recent IMU packets and, for each incoming frame, performs a nearest-timestamp lookup to find the IMU reading closest to the frame's capture time. This is a linear search over a small buffer, which is fast enough for real-time operation.
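A minimal sketch of the buffer and lookup, assuming a hypothetical IMUSample record in place of the project's IMUPacket:

```python
from collections import deque
from typing import NamedTuple, Optional


class IMUSample(NamedTuple):      # hypothetical stand-in for IMUPacket
    timestamp: float
    pitch: float
    roll: float
    yaw: float


class IMUBuffer:
    def __init__(self, capacity: int = 200) -> None:
        # A bounded deque drops the oldest sample once capacity is reached.
        self._samples = deque(maxlen=capacity)

    def push(self, sample: IMUSample) -> None:
        self._samples.append(sample)

    def nearest(self, frame_time: float) -> Optional[IMUSample]:
        """Linear search for the sample closest in time to the frame."""
        if not self._samples:
            return None
        return min(self._samples, key=lambda s: abs(s.timestamp - frame_time))
```

Because timestamps arrive in order, a binary search would also work, but with only 200 entries the linear scan keeps the code simple.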

The Euler angles are then mapped from the IMU's physical coordinate frame to the OpenCV camera convention. This mapping is non-trivial because the IMU and camera axes do not align: the IMU's X-axis (forward) corresponds to the camera's Z-axis (forward), and the vertical axes point in opposite directions (IMU Z-up vs. camera Y-down). The implemented mapping is:

Camera pitch = negated IMU pitch (opposite vertical conventions)
Camera yaw = negated IMU yaw (opposite vertical axis for rotation)
Camera roll = IMU roll directly (same forward axis)

The combined rotation matrix R = R_yaw × R_pitch × R_roll is then applied to transform tvec, rvec, and all plate positions from camera frame to world frame.
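The mapping and composition above can be sketched as follows. The axis assignment (pitch about camera X, yaw about camera Y, roll about camera Z) follows the usual OpenCV convention of X right, Y down, Z forward and is an assumption here, as is the function name.

```python
import numpy as np


def camera_to_world_rotation(imu_pitch: float, imu_roll: float,
                             imu_yaw: float) -> np.ndarray:
    """Build R = R_yaw @ R_pitch @ R_roll from IMU Euler angles (radians)."""
    # Sign mapping from the text: pitch and yaw negated, roll unchanged.
    pitch, yaw, roll = -imu_pitch, -imu_yaw, imu_roll
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cr, sr = np.cos(roll), np.sin(roll)
    R_pitch = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])   # about X
    R_yaw = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])     # about Y
    R_roll = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])    # about Z
    return R_yaw @ R_pitch @ R_roll
```

A camera-frame vector p then maps to the world frame as R @ p.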

3.4.3 World-Frame Position Filter

Even after IMU-based stabilisation, the world-frame coordinates exhibit transient jitter during fast gimbal movements, caused by slight timestamp misalignment and PnP solver sensitivity. A secondary Kalman Filter (RunePositionFilter) is applied in the world frame to suppress this jitter. This filter uses a static model (zero-velocity assumption), reflecting the physical fact that the Power Rune is bolted to the arena wall and does not translate. The process noise is set very low (Q = 0.001) to enforce stability. Because the measurement noise (R = 1.0) is three orders of magnitude larger, the filter weights each new measurement lightly and therefore suppresses sudden coordinate jumps during gimbal snaps.

The rune's plane orientation (its normal vector) is filtered separately using an exponential moving average (α = 0.05), with re-normalisation at every step to maintain unit length. This filtered normal vector is later used by the predictor to define the plane of rotation for the angular lead computation.
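The normal smoothing step amounts to a blend followed by re-normalisation. A sketch, with a hypothetical function name and an assumed fallback for the degenerate case where the blend collapses to zero length:

```python
import numpy as np


def smooth_normal(previous, measured, alpha: float = 0.05) -> np.ndarray:
    """Exponential moving average of a unit normal, re-normalised each step."""
    prev = np.asarray(previous, dtype=float)
    blended = (1.0 - alpha) * prev + alpha * np.asarray(measured, dtype=float)
    norm = np.linalg.norm(blended)
    if norm < 1e-12:          # degenerate blend; keep the previous estimate
        return prev
    return blended / norm
```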

3.5 Prediction and Fire Control

The predictor is the final computational stage, responsible for computing where to aim and when to fire. It receives the stabilised rune state from the detector and must account for the target's future rotation, bullet flight time, processing latency, gimbal actuation delay, and gravitational drop.

3.5.1 Lead Time Estimation

The total lead time is the sum of three components:

Bullet flight time: Distance to rune center divided by bullet speed (default 23 m/s, configurable, optionally updated from live referee system telemetry via EMA filtering with α = 0.1).
Processing latency: The time elapsed between frame capture and prediction output, measured per frame and filtered via EMA (α = 0.2) to smooth out OS scheduling jitter.
Gimbal actuation delay: A fixed constant (default 20 ms) representing the mechanical response time of the turret motors. This is the parameter most directly affecting shot accuracy and is tuned empirically.
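The three components listed above combine as a plain sum, with EMA filters supplying the bullet speed and latency terms. The class and function names below are illustrative only.

```python
class Ema:
    """Exponential moving average: value <- (1 - alpha) * value + alpha * sample."""

    def __init__(self, alpha: float, initial: float) -> None:
        self.alpha = alpha
        self.value = initial

    def update(self, sample: float) -> float:
        self.value = (1.0 - self.alpha) * self.value + self.alpha * sample
        return self.value


def total_lead_time(distance_m: float, bullet_speed: float,
                    latency_s: float, actuation_delay_s: float = 0.020) -> float:
    """Bullet flight time plus processing latency plus gimbal actuation delay."""
    return distance_m / bullet_speed + latency_s + actuation_delay_s
```

For a rune 6.9 m away at the default 23 m/s with 10 ms of processing latency, the lead time is 0.30 + 0.01 + 0.02 = 0.33 s.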

3.5.2 Velocity Estimation and Angular Lead

For the Small Rune, the angular lead is simply the rule-mandated constant speed (⅓π rad/s) multiplied by the total lead time.

For the Large Rune, the velocity follows v(t) = a·sin(ωt + φ) + b with b = 2.090 − a. The parameters a, ω, and φ are unknown and randomised each time the rune becomes available. The VelocityEstimator fits these parameters online by collecting timestamped speed measurements in a circular buffer (capacity: 400 points). Once at least 50 data points are available, a fit is triggered using scipy.optimize.curve_fit with the trust-region-reflective (trf) method and parameter bounds derived directly from the competition rules (a ∈ [0.780, 1.045], ω ∈ [1.884, 2.000], φ ∈ [−π, π]).

The fitting is computationally expensive (up to 1000 iterations) and would block the prediction loop if run synchronously. It is therefore executed asynchronously in a single-worker ThreadPoolExecutor. When the fit completes, the resulting parameters are swapped into the shared state atomically under a lock. The prediction loop continues using the most recently available parameters without interruption.
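The asynchronous fit described in the two paragraphs above might look like the sketch below. The class name, the choice of the bound midpoints as the initial guess, and passing max_nfev=1000 as the evaluation cap are assumptions; the model, bounds, method, and minimum point count come from the text.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from scipy.optimize import curve_fit

B_OFFSET = 2.090


def rune_speed(t, a, omega, phi):
    """Large Rune speed model with b tied to a by b = 2.090 - a."""
    return a * np.sin(omega * t + phi) + (B_OFFSET - a)


class AsyncVelocityFitter:
    LOWER = (0.780, 1.884, -np.pi)   # bounds on (a, omega, phi)
    UPPER = (1.045, 2.000, np.pi)

    def __init__(self, min_points: int = 50) -> None:
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._lock = threading.Lock()
        self._params = None
        self._min_points = min_points

    def submit(self, times, speeds):
        """Queue a fit; returns a Future, or None if there is too little data."""
        if len(times) < self._min_points:
            return None
        t = np.array(times, dtype=float)    # copies, so the caller may keep appending
        v = np.array(speeds, dtype=float)
        return self._pool.submit(self._fit, t, v)

    def _fit(self, t, v) -> None:
        p0 = [(lo + hi) / 2.0 for lo, hi in zip(self.LOWER, self.UPPER)]
        try:
            popt, _ = curve_fit(rune_speed, t, v, p0=p0,
                                bounds=(self.LOWER, self.UPPER),
                                method="trf", max_nfev=1000)
        except RuntimeError:
            return                           # no convergence: keep old parameters
        with self._lock:
            self._params = tuple(float(p) for p in popt)

    @property
    def params(self):
        """Latest (a, omega, phi), or None before the first successful fit."""
        with self._lock:
            return self._params
```

The prediction loop would read params on every iteration and fall back to a default model while it is still None.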

Once parameters are available, the angular lead is computed by analytically integrating the velocity function over the flight window [t_now, t_now + t_lead]:

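∫ v(t) dt = [−(a/ω)·cos(ωt + φ) + b·t], evaluated between t_now and t_now + t_lead
          = (a/ω)·[cos(ω·t_now + φ) − cos(ω·(t_now + t_lead) + φ)] + b·t_lead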

This closed-form integral avoids the error accumulation of numerical integration methods and is computationally trivial.
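Evaluated between the two limits of the flight window, the antiderivative gives a one-line function. The function name and the optional b argument are illustrative.

```python
import math


def angular_lead(a: float, omega: float, phi: float,
                 t_now: float, t_lead: float, b: float = None) -> float:
    """Integral of a*sin(omega*t + phi) + b over [t_now, t_now + t_lead]."""
    if b is None:
        b = 2.090 - a                      # constraint stated in the text
    t_end = t_now + t_lead
    return (a / omega) * (math.cos(omega * t_now + phi)
                          - math.cos(omega * t_end + phi)) + b * t_lead
```

Setting a = 0 and b = π/3 reduces this to the Small Rune case of constant speed multiplied by the lead time.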

3.5.3 Spatial Lead Computation

The angular lead is converted to a 3D aim point as follows. The vector from the stable rune center to the target plate's current world position is first projected onto the stable rune plane (removing any out-of-plane jitter). This projected vector is then rotated around the plane's normal vector by the computed lead angle using the Rodrigues rotation formula (cv2.Rodrigues). The result is the predicted world-space position of the target plate at the moment of impact.
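The projection and rotation can be sketched with the Rodrigues formula written out explicitly in NumPy, rather than through cv2.Rodrigues as the implementation does. The function name is hypothetical.

```python
import numpy as np


def predict_plate_position(center, plate, normal, lead_angle: float) -> np.ndarray:
    """Rotate the in-plane centre-to-plate vector about the plane normal."""
    center = np.asarray(center, dtype=float)
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    r = np.asarray(plate, dtype=float) - center
    r_in_plane = r - np.dot(r, n) * n          # drop the out-of-plane component
    # Rodrigues: v*cos + (n x v)*sin + n*(n.v)*(1 - cos); the last term is
    # zero here because r_in_plane is perpendicular to n.
    rotated = (r_in_plane * np.cos(lead_angle)
               + np.cross(n, r_in_plane) * np.sin(lead_angle))
    return center + rotated
```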

Bullet drop is then applied: the vertical component of the aim point is adjusted upward by ½ × g × t², where t is the bullet flight time and g = 9.81 m/s². The corrected world-space position is converted to absolute yaw and pitch angles using arctan2, which are then differenced against the current gimbal orientation (from the IMU) to produce relative yaw and pitch error commands for the turret motors.
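A sketch of the drop compensation and angle conversion follows. The world-frame axis convention used here (X right, Y down, Z forward, so that "upward" means decreasing Y and positive pitch aims up) is an assumption carried over from the OpenCV camera convention; the actual signs depend on the frame the implementation uses.

```python
import math

G = 9.81  # m/s^2


def aim_angles(x: float, y: float, z: float, flight_time: float):
    """Return (yaw, pitch) in radians towards the drop-compensated aim point."""
    drop = 0.5 * G * flight_time ** 2
    y_aim = y - drop                     # raise the aim point (Y points down)
    yaw = math.atan2(x, z)
    pitch = math.atan2(-y_aim, math.hypot(x, z))
    return yaw, pitch


def gimbal_error(target: float, current: float) -> float:
    """Relative command: target minus current, wrapped to [-pi, pi)."""
    return (target - current + math.pi) % (2.0 * math.pi) - math.pi
```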

3.5.4 Fire Decision

The fire decision balances precision with timing pressure. The should_fire function computes the total angular error between the current and commanded gimbal positions and compares it against an allowed error threshold derived from the physical angular radius of the plate at the rune's distance (plate_radius / distance, using the small-angle approximation).

The allowed error varies over time according to a "desperation curve." During the first portion of the shooting window (default: 60%), the system holds out for a precise shot, requiring the error to fall within the inner 20% of the plate radius. As the window expires, the threshold relaxes quadratically, expanding to 100% of the plate radius by the deadline. The time limits are set according to the competition rules: 2.5 seconds for the first target, 1.0 second for the second target of a Large Rune pair. This strategy maximises the chance of landing a center hit (which yields better buffs per the ring-based scoring system) while ensuring that a shot is taken before time runs out.
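One way to express the desperation curve is shown below. The exact interpolation (a quadratic in the normalised time remaining after the hold phase) and the function signatures are assumptions; the 60% hold fraction, the 20% inner fraction, and the small-angle plate radius come from the text.

```python
import math


def allowed_error(elapsed: float, time_limit: float,
                  plate_radius: float, distance: float,
                  hold_fraction: float = 0.6,
                  inner_fraction: float = 0.2) -> float:
    """Allowed angular error (radians) at `elapsed` seconds into the window."""
    plate_angle = plate_radius / distance          # small-angle approximation
    progress = min(max(elapsed / time_limit, 0.0), 1.0)
    if progress <= hold_fraction:
        scale = inner_fraction                     # hold out for a precise shot
    else:
        u = (progress - hold_fraction) / (1.0 - hold_fraction)
        scale = inner_fraction + (1.0 - inner_fraction) * u ** 2
    return scale * plate_angle


def should_fire(yaw_err: float, pitch_err: float, elapsed: float,
                time_limit: float, plate_radius: float, distance: float) -> bool:
    total_error = math.hypot(yaw_err, pitch_err)
    return total_error <= allowed_error(elapsed, time_limit, plate_radius, distance)
```

With a 2.5 s limit, the threshold stays at 20% of the plate's angular radius until 1.5 s, reaches 40% at 2.0 s, and reaches 100% at the deadline.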

3.6 Hardware Communication

The system communicates with the robot's gimbal controller via a custom serial protocol over USB at 115200 baud. Each outgoing packet consists of a header byte (0xA5), a 16-bit payload size, a message type byte, the payload data, and a CRC-16 checksum (CCITT/XMODEM polynomial 0x1021).

The primary outgoing message (type 0x01) contains three fields: yaw error (32-bit float), pitch error (32-bit float), and a fire flag (8-bit unsigned integer). The gimbal controller interprets these as relative corrections to apply to the current turret orientation.

The protocol also supports incoming telemetry: message type 0x02 carries the projectile speed as a 32-bit float, measured by the referee system's barrel speed sensor. This value is pushed to the predictor's speed_q Mailbox, enabling real-time adaptation of the bullet flight time estimate.
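The framing described in this section can be sketched with the struct module. Little-endian byte order, a size field that counts payload bytes only, and a CRC computed over everything from the header byte through the payload are assumptions; the text specifies the field order, the polynomial, and the message types but not these details.

```python
import struct

HEADER = 0xA5
MSG_AIM = 0x01
MSG_PROJECTILE_SPEED = 0x02


def crc16_xmodem(data: bytes) -> int:
    """CRC-16/XMODEM: polynomial 0x1021, initial value 0, no reflection."""
    crc = 0x0000
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def build_packet(msg_type: int, payload: bytes) -> bytes:
    body = struct.pack("<BHB", HEADER, len(payload), msg_type) + payload
    return body + struct.pack("<H", crc16_xmodem(body))


def build_aim_packet(yaw_err: float, pitch_err: float, fire: bool) -> bytes:
    return build_packet(MSG_AIM, struct.pack("<ffB", yaw_err, pitch_err, int(fire)))


def parse_packet(packet: bytes):
    """Return (msg_type, payload), or None if framing or the CRC check fails."""
    if len(packet) < 6 or packet[0] != HEADER:
        return None
    size, msg_type = struct.unpack_from("<HB", packet, 1)
    if len(packet) != 4 + size + 2:
        return None
    (crc,) = struct.unpack_from("<H", packet, 4 + size)
    if crc != crc16_xmodem(packet[:4 + size]):
        return None
    return msg_type, packet[4:4 + size]
```

Under these assumptions an aim packet is 15 bytes long: 4 bytes of header fields, 9 bytes of payload, and 2 bytes of CRC.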

3.7 Summary

Taken together, the system forms a complete closed-loop pipeline from raw sensor input to physical turret actuation. Every stage was designed around the constraints of the problem: the Mailbox architecture prioritises data freshness over completeness, the hybrid detection scheme trades a small amount of added complexity for substantially better localisation accuracy, the tracker's buffered initialisation adapts to variable frame rates, and the asynchronous curve fitting keeps the prediction loop responsive while a computationally expensive model converges in the background. The following section describes how this system was tested and what results were observed.
