This repository contains the official implementation of a high-fidelity, monocular vision-based perception pipeline designed for autonomous Unmanned Aerial Vehicle (UAV) landing in unstructured environments.
By leveraging Vision Transformers (OneFormer) and a robust temporal stabilization architecture, this system semantically decouples safe landing zones (e.g., drivable roads, walkways) from hazards without relying on heavy sensor fusion arrays (LiDAR/Stereo Depth) or cooperative infrastructure (ArUco tags).
- Overview
- System Architecture
- Key Features
- Hardware & Software Requirements
- Installation
- Usage
- Results & Performance
- Future Work
The primary goal of this project is to develop a robust, modular, monocular vision-based perception system for autonomous UAV landing in unstructured, non-cooperative environments.
By eliminating the reliance on heavy sensor arrays (LiDAR/Stereo) and cooperative infrastructure (ArUco tags), this system addresses three critical gaps in current literature:
- The Approach: Addresses the lack of annotated aerial imagery by developing a procedural synthetic data generation pipeline in Blender (template provided).
- The Validation: Achieves Sim-to-Real zero-shot transfer on the real-world UAVid benchmark (Mean F1: 0.7006), exhibiting superior fine-grained boundary adherence compared to human annotations.
- The Approach: Replaces traditional CNNs with a Vision Transformer (
OneFormerin our case) to capture global scene context and prevent the over-smoothing of critical geometric boundaries. - The Validation: Successfully decouples physically similar but semantically distinct surfaces (e.g., active traffic lanes vs. walkways), establishing a High-Precision Safety Profile (83.90% Safe Landing Precision).
- The Approach: Overcomes the "Edge Computing Gap" by pairing the Transformer backbone with deterministic Euclidean Distance Transforms and Kalman Filtering to translate pixel probabilities into flight vectors.
The pipeline processes monocular RGB video feeds through a 6-stage architecture optimized for edge hardware (e.g., NVIDIA Jetson AGX).
- Frame Preprocessing (CLAHE): Applies Contrast Limited Adaptive Histogram Equalization to mitigate shadow artifacts and extreme illumination variance in urban canyons.
- Semantic Segmentation: Utilizes a lightweight Vision Transformer (
OneFormer Swin-Tiny) to classify the environment into unified Super Classes (Buildings, Trees, Roads, Walkways, Objects). - Temporal Mask Stabilization: Employs OpenCV DIS Optical Flow (Fast Dense Flow) combined with Temporal Mode Smoothing to warp and align masks, eliminating inter-frame semantic flickering.
- Dynamic Mask Merging: Contextual heuristics merge safe classes (Walkways, empty Roads) into a Binary Safe Landing Zone (SLZ) mask, while isolating dynamic agents and static hazards into No-Fly Zones (NFZ).
- Landing Zone Detection: Extracts the optimal landing centroid and maximum safe radius via spatial filtering and morphological closing operations.
- Trajectory Smoothing: A 6-state Kalman Filter equipped with Euclidean distance gating (Outlier Rejection) tracks the final descent vector to prevent coordinate teleportation.
- Transformer-Based Perception: Surpasses traditional CNNs (DeepLabv3+) and Single-Stage detectors (YOLOv11-seg) in resolving long-range global context and fine-grained structural details (e.g., electric poles).
- Sim-to-Real Generalization: Validated on proprietary DJI Mini suburban datasets and UAVid urban benchmarks without site-specific fine-tuning.
- Edge-Feasible: Integrates OpenCV backends for spatial and temporal filtering, allowing the heavy ViT backbone to operate effectively on constrained computing budgets.
- Aerospace-Grade Stabilization: Prevents erratic flight controller inputs through optical flow mask warping and kinetic state tracking.
Suburban Domain
|
Urban Domain
|
- Workstation Prototyping: Tested on RTX A6000 for ~1 Segmentation/s - Performance is Model dependant.
- Edge Deployment: NVIDIA Jetson AGX Orin / Jetson Orin Nano.
- Ubuntu 20.04 / 22.04
- Python 3.8+
- CUDA 11.8+
torch,torchvisionopencv-python,numpy,scipy
-
Clone the repository:
git clone [https://github.com/ShrikB/Targeted-Landing.git](https://github.com/ShrikB/Targeted-Landing.git) cd Targeted-Landing -
Set up the virtual environment:
python3 -m venv venv source venv/bin/activate -
Install dependencies:
pip install -r requirements.txt
-
Model Weights: Place your custom fine-tuned OneFormer weights in the
model/directory.Targeted-Landing/ ├── model/ │ └── model11_cusdat/
The main pipeline can be executed via the primary inference script. Configuration for video inputs, model paths, and CLAHE parameters can be adjusted directly in the script header.
python Simulated_Modular_Test.metric.pyThe pipeline automatically generates timed analytics and distinct output folders for debugging and verification:
outputs/
├── extracted_frames/ # Raw & CLAHE-enhanced RGB frames
├── semantic_output/ # Stabilized OneFormer segmentation masks
├── masked_output/ # Binary SLZ/NFZ merged masks
├── landing_zones/ # Final vector targeting overlays
└── processing_timing.json # Stage-by-stage latency analytics
For dataset generation, a template for generating raw and masks is provided given the City Generation Add-On is enabled. Simply activate Generate.py file followed by the Maskup.py file. The generic template for both can be adjusted for more classes, frames, variances, plus any specific changed enabled by the Blender Engine.
The pipeline was subjected to a Binary Suitability Analysis to evaluate its viability as a Conservative Filter.
- Ignored/NFZ Recall:
99.39%(Near-perfect hazard avoidance). - Safe/SLZ Precision:
83.90%(High confidence in selected landing zones). - Mean Safety Precision:
0.8499(Good to Excellent confidence interval).
- Desktop Workstation (RTX A6000): ~1.17s / frame.
- Jetson AGX Edge Hardware: ~6.67s / frame. (Note: Landing zone designation is a discrete, low-frequency event. The precision of the ViT justifies the latency over standard high-frequency obstacle avoidance).
To scale this architecture from a research prototype to a field-deployable system, the following optimizations are on the roadmap:
- NVIDIA TensorRT Integration: Migrating the PyTorch inference backend to optimized TensorRT engines to maximize tensor core utilization on the Jetson AGX.
- Docker Containerization: Encapsulating the ROS, CUDA, and PyTorch dependency chain to ensure seamless portability across heterogeneous UAV platforms.
- Dynamic Garbage Collection: Implementing an automated storage management protocol for edge devices. To prevent storage exhaustion during sustained flights via rolling buffer.
This project is licensed under the MIT License - see the LICENSE file for details.




