This article reviews the paper '3D Common Corruptions and Data Augmentation' by Oguzhan Fatih Kar, Teresa Yeo, Andrei Atanov, and Amir Zamir of the Swiss Federal Institute of Technology Lausanne (EPFL).
Let us start with an overview of the paper.
•Computer vision (CV) models are usually trained on images designed to help the models understand the real world. But when deployed in the real world, they tend to encounter naturally occurring distribution shifts from their training data. A distribution shift means that the test data does not come from the same distribution as the training data, which can cause a generalization gap in real-world performance.
•These shifts range from lower-level distortions, such as motion blur and illumination changes, to semantic ones, like object occlusion. We will talk more about these later.
•Each of them represents a possible point of failure for a CV model, so models need to be tested for these vulnerabilities before deployment.
•This paper introduces a set of image transformations that can be used as corruptions to evaluate the robustness of models, and as data augmentation mechanisms to help reduce the generalisation gap.
The main focus of this paper is to present a set of corruptions that mirrors reality more closely. Before we go into this, let us first discuss what this paper tries to improve on.
In 2019, an earlier paper established rigorous benchmarks to test image classifiers' robustness to corrupted real-life query images. The benchmark was built on ImageNet, and its corruptions are commonly called 2D Common Corruptions (2DCC). It consists of 15 diverse corruption types applied to ImageNet images, drawn from four main categories: noise, blur, weather, and digital.
The main problem is that reality is 3D, while these corruptions do not incorporate 3D information. Even after augmenting with them, a distribution shift remains between the training data and reality, and the model will perform poorly under it. These techniques also generate uniform distortions, which again are not representative of reality.
As an example, consider a self-driving car. Would an understanding of 3D depth not serve the autonomous system better than 2D information alone?
Other families of methods have been proposed to improve model robustness, such as data augmentation with corrupted data, texture changes, image compositions, and transformations. While these methods do help generalisation, the performance gains are non-uniform.
Photorealistic image synthesis involves techniques to generate realistic images. Some of these techniques have recently been used to create corruption data (e.g., GANs or Stable Diffusion). Some of the 3D transformations proposed in the paper are instantiations of these methods, with the downstream goal of testing and improving model robustness in a unified framework with a wide set of corruptions.
Adversarial corruptions add imperceptible worst-case shifts to the input to fool a model. Most of the failure cases of models in the real world are not the result of adversarial corruptions but rather of naturally occurring distribution shifts. Thus, the authors' focus in this paper is to generate corruptions that are likely to occur in the real world.
The proposed corruptions are generated via algorithms with exposed parameters, enabling fine-grained analysis of robustness, e.g., by continuously increasing the 3D motion blur. They are efficient to compute and can be computed on-the-fly during training as data augmentation with a small increase in computational cost. They are also extendable, i.e., they can be applied to standard vision datasets, e.g., ImageNet (discussed later), that do not come with 3D labels.
Coming to 3DCC specifically, the proposed shifts incorporate 3D information to generate corruptions that are consistent with the scene geometry. This leads to shifts that are more likely to occur in the real world. The resulting set includes 20 corruptions, each representing a distribution shift from training data.
3DCC addresses several aspects of the real world, such as camera motion, weather, occlusions, depth of field, and lighting.
The corruptions in 3DCC are more diverse and realistic than those of 2D-only approaches, as the comparison here shows.
The bottom row shows sample 2D corruptions applied uniformly over the image. This leads to corruptions that are unlikely to happen in the real world, e.g. having the same motion blur over the entire image irrespective of the distance to camera (top left). The top row shows their 3D counterparts from 3DCC. The circled regions highlight the effect of incorporating 3D information.
More specifically, in 3DCC,
- motion blur has a motion parallax effect where objects further away from the camera seem to move less,
- defocus blur has a depth of field effect, akin to a large aperture effect in real cameras, where certain regions of the image can be selected to be in focus,
- lighting takes the scene geometry into account when illuminating the scene and casts shadows on objects,
- fog gets denser further away from the camera,
- occlusions of a target object, e.g. fridge (blue mask), are created by changing the camera's viewpoint and having its view naturally obscured by another object, e.g. the plant (red mask). This is in contrast to its 2D counterpart that randomly discards patches.
Now let us look at what these 3D corruptions actually are. The authors define 8 corruption types, namely depth of field, camera motion, lighting, video, weather, view changes, semantics, and noise, resulting in 20 corruptions in 3DCC. Let us go through these types.
Depth of field corruptions create refocused images. They keep a part of the image in focus while blurring the rest.
The authors have considered a layered approach that splits the scene into multiple layers. For each layer, the corresponding blur level is computed using a pinhole camera model. The blurred layers are then composited with alpha blending.
They have generated near focus and far focus corruptions by randomly changing the focus region to the near or far part of the scene.
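As a rough illustration of the layered idea, the sketch below bins depths into layers and assigns each depth a blur size that grows with its distance from the focus plane. It is not the authors' code: the blur formula is the textbook thin-lens circle of confusion, swapped in here as an assumption, and the function names, lens parameters, and depth values are all hypothetical.

```python
def split_into_layers(depths, num_layers):
    """Assign each depth to one of `num_layers` equally wide depth bins."""
    d_min, d_max = min(depths), max(depths)
    width = (d_max - d_min) / num_layers or 1.0  # avoid zero width on flat scenes
    return [min(int((d - d_min) / width), num_layers - 1) for d in depths]


def blur_diameter(depth, focus_depth, focal_length=0.05, aperture=0.025):
    """Thin-lens circle of confusion (metres) for a point at `depth` metres.

    Assumed formula, not taken from the paper: it is zero at the focus plane
    and grows as the point moves away from it.
    """
    return (aperture * focal_length * abs(depth - focus_depth)
            / (depth * (focus_depth - focal_length)))


# Hypothetical depths (metres) for five pixels, nearest to farthest.
depths = [0.5, 1.0, 2.0, 4.0, 8.0]
layers = split_into_layers(depths, num_layers=3)

# Near focus keeps the closest pixel sharp, far focus the most distant one.
near_focus = [blur_diameter(d, focus_depth=min(depths)) for d in depths]
far_focus = [blur_diameter(d, focus_depth=max(depths)) for d in depths]
```

In a full pipeline, each layer would be blurred with its own kernel size and the layers composited with alpha blending, as described above.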
Camera motion creates blurry images due to camera movement during exposure. To generate this effect, the input image is first transformed into a point cloud using the depth information. Then, a trajectory (camera motion) is defined, and novel views rendered along this trajectory.
The generated views are then combined to obtain parallax-consistent motion blur. XY-motion blur and Z-motion blur are defined for the cases where the main camera motion is along the image XY-plane or the Z-axis, respectively.
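The parallax effect behind XY-motion blur can be sketched with a pinhole camera: when the camera translates sideways, a point's image shifts by an amount inversely proportional to its depth, so nearby points smear into longer streaks. This is a minimal sketch, not the paper's renderer; the function names, focal length, and trajectory are hypothetical.

```python
def parallax_shift_px(depth, camera_shift, focal_length_px):
    """Image displacement (pixels) of a point at `depth` metres when a pinhole
    camera translates sideways by `camera_shift` metres."""
    return focal_length_px * camera_shift / depth


def streak_length_px(depth, camera_positions, focal_length_px=500.0):
    """Length of the streak a point traces while the camera visits the given
    lateral positions during the exposure."""
    shifts = [parallax_shift_px(depth, p, focal_length_px) for p in camera_positions]
    return max(shifts) - min(shifts)


# Hypothetical trajectory: the camera slides 3 cm along X during the exposure.
trajectory = [0.00, 0.01, 0.02, 0.03]
near_streak = streak_length_px(1.0, trajectory)  # point 1 m away
far_streak = streak_length_px(5.0, trajectory)   # point 5 m away
```

With these values the near point smears over roughly 15 pixels and the far point over roughly 3, which is the depth-dependent blur that a uniform 2D motion blur cannot reproduce.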
Lighting corruptions change scene illumination by adding new light sources and modifying the original illumination. Blender was used to place these new light sources and compute the corresponding illumination for a given viewpoint in the 3D mesh.
For the flash corruption, a light source is placed at the camera's location, while for the shadow corruption, it is placed at random, diverse locations outside the camera frustum. Likewise, for the multi-illumination corruption, the illumination is computed from a set of random light sources with different locations and luminosities.
Video corruptions arise during the processing and streaming of videos. Using the 3D scene, the authors create a multi-frame video from a single image by defining a camera trajectory, as for motion blur. The idea is that most video corruptions occur due to imperfect lossy compression.
The authors generate average bit rate (ABR) and constant rate factor (CRF) corruptions as compression artifacts, and a bit error corruption to capture the effect of imperfect video transmission channels. After applying the corruptions to the video, a single frame is picked as the final corrupted image.
View changes are due to variations in the camera extrinsics and focal length. The proposed framework enables rendering RGB images conditioned on several changes, such as field of view, camera roll and camera pitch, using Blender. This enables analysis of the sensitivity of models to various view changes in a controlled manner.
The authors have also generated images with view jitter that can be used to analyze if models' predictions flicker with slight changes in viewpoint.
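To make the view-change parameters concrete, the sketch below relates focal length to field of view and applies a camera roll to an image-plane point. It only shows the underlying geometry; the paper renders these changes with Blender, and the function names and sensor values here are hypothetical.

```python
import math


def field_of_view_deg(sensor_width, focal_length):
    """Horizontal field of view of a pinhole camera (same units for both inputs)."""
    return math.degrees(2.0 * math.atan(sensor_width / (2.0 * focal_length)))


def roll_point(x, y, roll_deg):
    """Rotate an image-plane point about the optical axis (camera roll)."""
    a = math.radians(roll_deg)
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))


# A shorter focal length widens the view (hypothetical 36 mm wide sensor).
wide_fov = field_of_view_deg(36.0, 18.0)
narrow_fov = field_of_view_deg(36.0, 50.0)

# Smoothly changing roll angles, in the spirit of a controlled sensitivity sweep.
roll_sweep = [roll_point(1.0, 0.0, angle) for angle in range(0, 91, 15)]
```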
Noise corruptions arise from imperfect camera sensors, i.e. mostly from hardware. 2DCC also had noise corruptions (Gaussian, Poisson and Impulse), but different noise types are introduced here.
For low-light noise, the authors decreased the pixel intensities and added Poisson-Gaussian distributed noise to reflect the low-light imaging setting. ISO noise also follows a Poisson-Gaussian distribution, with a fixed photon noise (modelled by a Poisson distribution) and varying electronic noise (modelled by a Gaussian distribution). Color quantization is another corruption that reduces the bit depth of the RGB image.
Note: this is the only subset of the proposed corruptions that is not based on 3D information.
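A minimal sketch of two of these noise corruptions on a single 8-bit channel value follows. The parameter names and default values (`darkening`, `photons_at_white`, `read_sigma`) are hypothetical choices for illustration, not the paper's settings, and the Poisson sampler is Knuth's simple algorithm.

```python
import math
import random


def quantize_color(value, bits):
    """Reduce an 8-bit channel value to `bits` bits, keeping the 0-255 scale."""
    step = 1 << (8 - bits)
    return (value // step) * step


def sample_poisson(lam, rng):
    """Knuth's algorithm; adequate for the small means used here."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def low_light_noise(value, darkening=0.25, photons_at_white=50.0,
                    read_sigma=2.0, rng=None):
    """Darken an 8-bit value, then add Poisson (photon) and Gaussian
    (electronic) noise. All default values are assumptions."""
    rng = rng or random.Random()
    dark = value * darkening
    expected_photons = dark / 255.0 * photons_at_white
    shot = sample_poisson(expected_photons, rng) / photons_at_white * 255.0
    noisy = shot + rng.gauss(0.0, read_sigma)
    return max(0, min(255, round(noisy)))


rng = random.Random(0)
samples = [low_light_noise(200, rng=rng) for _ in range(2000)]
coarse = quantize_color(200, bits=3)  # only 8 levels remain per channel
```

On average the noisy samples sit near the darkened intensity (200 × 0.25 = 50), while individual samples scatter around it.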
For semantics, the authors considered occlusion and scale changes. In the occlusion corruption, views of an object are generated in which it is occluded by other objects. This contrasts with random 2D masking of pixels, which creates an unnatural occlusion effect irrespective of image content. The occlusion rate can be controlled to probe model robustness against occlusion changes.
Similarly, in scale corruption, the views of an object are rendered with varying distances from the camera location. The objects can be selected by randomly picking a point in the scene or using the semantic annotations.
Weather corruptions degrade visibility by obscuring parts of the scene due to disturbances in the medium. This paper expands on the fog distortion presented in the 2DCC paper and proposes fog3D. The underlying model is as follows.
The standard optical model for fog is used:
$$I(x) = R(x)\,t(x) + A\,(1 - t(x))$$
where I(x) is the resulting foggy image at pixel x, R(x) is the clean image, A is atmospheric light, and t(x) is the transmission function describing the amount of light that reaches the camera.
When the medium is homogeneous, the transmission depends on the distance from the camera:
$$t(x) = \exp(-\beta\, d(x))$$
where d(x) is the scene depth and β is the attenuation coefficient controlling the fog thickness.
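The fog model above, with a transmission that decays exponentially with depth, can be sketched in a few lines. This is an illustrative implementation on nested lists rather than the authors' code; the function name `fog3d_sketch` and the example values for β and A are assumptions.

```python
import math


def fog3d_sketch(clean, depth, beta=0.5, airlight=0.8):
    """Apply I(x) = R(x) t(x) + A (1 - t(x)) with t(x) = exp(-beta * d(x)).

    clean: 2D list of intensities in [0, 1]; depth: 2D list of the same shape.
    """
    foggy = []
    for pixel_row, depth_row in zip(clean, depth):
        out_row = []
        for r, d in zip(pixel_row, depth_row):
            t = math.exp(-beta * d)  # transmission falls with distance
            out_row.append(r * t + airlight * (1.0 - t))
        foggy.append(out_row)
    return foggy


# One scanline with the same clean intensity at increasing depths (metres).
clean = [[0.2, 0.2, 0.2]]
depth = [[0.0, 2.0, 10.0]]
foggy = fog3d_sketch(clean, depth, beta=0.5, airlight=0.8)
```

The pixel at depth 0 is unchanged, while farther pixels move towards the atmospheric light value, which is why the fog looks denser in the distance. Raising β thickens the fog everywhere.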
To implement these corruptions, the authors used the 16k Taskonomy test images. For all corruptions except those in view changes and semantics, which change the scene, they followed the 2DCC protocol and defined 5 shift intensities, resulting in approximately 1 million corrupted images.
For view changes and semantics, the authors rendered 32k images with smoothly changing parameters, e.g., roll angle, using the Replica dataset.
Here we have a visualisation of the effect of shift intensities on some of these corruptions:
Top: Increasing the shift intensity results in larger blur, less illumination, and denser fog.
Bottom: The object becomes more occluded or shrinks in size using calculated viewpoint changes. The blue mask denotes the amodal visible parts of the fridge/couch, and the red mask is the occluded part. The leftmost column shows the clean images.
3DCC can also be applied to standard datasets that lack 3D information. The authors demonstrate this on the ImageNet and COCO validation sets by leveraging depth predictions from the MiDaS model, a state-of-the-art depth estimator. The generated images are physically plausible, showing that the community can use 3DCC to generate a diverse set of image corruptions for other datasets.
We can see how the objects in the circled regions go from sharp to blurry depending on the focus region and scene geometry.
While benchmarking uses corrupted images as test data, we can also use them as augmentations of training data to build invariances towards these corruptions. This is one of the main goals of this paper. 3DCC is designed to capture corruptions that are more likely to appear in the real world, hence it has a sensible augmentation value as well. Thus, in addition to benchmarking robustness, the framework can also be viewed as a set of new data augmentation strategies that take the 3D scene geometry into account. The augmentations can be generated efficiently on-the-fly during training using parallel implementations. The paper reports the following timings.
For example, the depth of field augmentations take 0.87 seconds (wall clock time) on a single V100 GPU for a batch size of 128 images with 224 × 224 resolution. For comparison, applying 2D defocus blur requires 0.54 seconds, on average. It is also possible to precompute certain selected parts of the augmentation process, e.g., the illuminations for lighting augmentations, to increase efficiency.
Now that we have covered the concepts behind these techniques, let us turn to the experimental results. For these experiments, UNet and DPT models were trained on Taskonomy. A likelihood loss with a Laplacian prior was optimised using AMSGrad. The authors also experimented with DPT models trained on Omnidata, which mixes a diverse set of training datasets.
Several popular data augmentation strategies were implemented: DeepAugment, style augmentation, Cross Domain Ensembles (X-DE) and adversarial training. Finally, they trained a model with corruptions from 2DCC as augmentation (2DCC augmentation), and another model with 3D data augmentation on top of that (2DCC + 3D augmentation).
Results - Exposing Vulnerabilities
The models were tested on the 3DCC images to understand their vulnerabilities on two tasks: surface normal estimation and depth estimation.
Existing robustness mechanisms are found to be insufficient for addressing the real-world corruptions approximated by 3DCC. The performance of models with different robustness mechanisms under 3DCC is shown for the surface normals (left) and depth estimation (right) tasks.
All models here are UNets and are trained with Taskonomy data.
Each bar shows the L1 error averaged over all 3DCC corruptions (lower is better). The black error bars show the error at the lowest and highest shift intensity. The red line denotes the performance of the baseline model on clean (uncorrupted) data. The results show that existing robustness mechanisms, including those with diverse augmentations, perform poorly under 3DCC.
These mechanisms improved the performance over the baseline but are still far from the performance on clean data. This suggests that 3DCC exposes robustness issues and can serve as a challenging testbed for models.
The 2DCC augmentation model yields a slightly lower L1 error, indicating that diverse 2D data augmentation only partially helps against 3D corruptions.
Results - Effect of Datasets and Architectures
The baseline UNet and DPT models trained on Taskonomy have similar performance, especially on the view change corruptions. By training with larger and more diverse data with Omnidata, the DPT performance improves. Similar observations were made on vision transformers for classification.
The numbers in the legend are the average performance over all the corruptions. We can see that all the models are sensitive to 3D corruptions, e.g., z-motion blur and shadow.
Thus, combining architectural advancements with diverse and large training data can play an important role in robustness against 3DCC. Furthermore, when combined with 3D augmentations, they improve robustness to real-world corruptions.
Results - Semantic Tasks
The previous benchmarking results were focused on surface normals and depth estimation tasks. In addition to them, the paper also shows benchmarking on panoptic segmentation and object recognition tasks as additional illustrative 3DCC evaluations.
For panoptic segmentation, semantic corruptions are used, and for object classification, ImageNet-3DCC is used by applying corruptions from 3DCC to the ImageNet validation set.
The models are trained on Omnidata and Taskonomy datasets with the occlusion corruption from 3DCC to check the robustness for panoptic segmentation.
The figure quantifies the effect of occlusion on the predictions of models, i.e. how the models' intersection over union (IoU) scores change with increasing occlusion, for selected objects. This is computed on the test scenes from Replica.
The occlusion ratio is defined as the number of occluded pixels divided by the sum of occluded and visible pixels of the object.
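The occlusion ratio defined above, together with the IoU score used to measure prediction quality, can be computed from binary masks as in the sketch below. The masks are tiny made-up examples, the helper names are hypothetical, and scoring the prediction against the visible mask is an assumption for illustration.

```python
def count(mask):
    """Number of pixels set to 1 in a binary mask given as nested lists."""
    return sum(sum(row) for row in mask)


def occlusion_ratio(visible_mask, occluded_mask):
    """Occluded pixels divided by occluded plus visible pixels of the object."""
    visible, occluded = count(visible_mask), count(occluded_mask)
    total = visible + occluded
    return occluded / total if total else 0.0


def iou(pred_mask, target_mask):
    """Intersection over union of two binary masks of the same shape."""
    pairs = [(p, t) for pred_row, target_row in zip(pred_mask, target_mask)
             for p, t in zip(pred_row, target_row)]
    intersection = sum(1 for p, t in pairs if p and t)
    union = sum(1 for p, t in pairs if p or t)
    return intersection / union if union else 1.0


# Made-up 2x3 masks: 3 visible object pixels, 1 pixel hidden by an occluder.
visible = [[1, 1, 0],
           [1, 0, 0]]
occluded = [[0, 0, 1],
            [0, 0, 0]]
prediction = [[1, 1, 0],
              [0, 0, 0]]

ratio = occlusion_ratio(visible, occluded)
score = iou(prediction, visible)
```

Here one of the object's four pixels is hidden, so the occlusion ratio is 0.25, and the prediction recovers two of the three visible pixels.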
The plots expose the occlusion handling capabilities of the models and show that the Omnidata trained model is generally more robust than the Taskonomy one. The degradation in model predictions is class specific and becomes more severe with higher occlusion ratios.
The trends are class-specific possibly due to shape of the objects and their scene context, e.g., fridge predictions remain unchanged up until 0.50 occlusion ratio, while couch predictions degrade more linearly for Omnidata model.
Results - Robustness on ImageNet-3DCC
The performance of the robust ImageNet models from the ImageNet-2DCC leaderboards was compared. Following 2DCC, the mean corruption error (mCE) is computed by dividing the models' errors by AlexNet errors and averaging over corruptions. The performance of the models degrades significantly, including those with diverse augmentations.
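The mCE computation can be sketched as follows, using the normalisation from the 2DCC protocol: for each corruption, the model's errors summed over severities are divided by AlexNet's, and the ratios are averaged. The error values below are made up purely to show the arithmetic, and the function and dictionary names are hypothetical.

```python
def mean_corruption_error(model_errors, alexnet_errors):
    """mCE in percent.

    Both arguments map a corruption name to a list of top-1 error rates,
    one per shift intensity.
    """
    ratios = []
    for name, errors in model_errors.items():
        ratios.append(sum(errors) / sum(alexnet_errors[name]))
    return 100.0 * sum(ratios) / len(ratios)


# Made-up error rates for two corruptions at five shift intensities.
model_errors = {
    "fog_3d": [0.2, 0.3, 0.4, 0.5, 0.6],
    "near_focus": [0.1, 0.2, 0.3, 0.4, 0.5],
}
alexnet_errors = {
    "fog_3d": [0.4, 0.5, 0.6, 0.7, 0.8],
    "near_focus": [0.5, 0.5, 0.6, 0.7, 0.7],
}

mce = mean_corruption_error(model_errors, alexnet_errors)
```

A value below 100 means the model makes fewer errors than AlexNet on average; dividing by AlexNet's errors keeps hard corruptions from dominating the mean.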
As expected, the general trends are similar between the two benchmarks, since 2D and 3D corruptions are not completely disjoint. Still, 3DCC exposes trends and vulnerabilities that 2DCC does not capture, which can be informative during model development. For example, ANT has a better mCE than AugMix on 2DCC, while the two perform similarly on 3DCC.
Let us now see how 3DCC compares with a commercial tool. 3DCC aims to expose a model's vulnerabilities to certain real-world corruptions, which requires the corruptions it generates to resemble real corrupted data. As such labelled data is scarce and expensive to produce, the authors use a proxy evaluation: they compare the realism of 3DCC to synthesis made with Adobe After Effects (AE), a commercial product that generates high-quality photorealistic data but often relies on expensive, manual processes.
This comparison was done using the Hypersim dataset, which comes with high-resolution z-depth labels. 200 near- and far-focused images were generated using both 3DCC and AE. We can see that 3DCC generates corruptions similar to AE.
Quantitatively, the prediction errors of a baseline surface normal model are computed when the input comes from 3DCC or from AE. The scatter plot of L1 errors demonstrates a strong correlation, 0.80, between the two approaches.
For calibration and control, scatter plots for some corruptions from 2DCC (defocus blur) are also provided to show the significance of the correlations. These have significantly lower correlations with AE, indicating that the depth of field effect created via 3DCC matches AE-generated data reasonably well.
Shot noise (right) is a control baseline, i.e. a randomly selected corruption, to calibrate the significance of the correlation measure.
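A correlation of this kind can be computed from paired per-image errors as sketched below. The text does not say which correlation coefficient was used, so Pearson correlation is assumed here, and the error values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up per-image L1 errors of one model on the same images corrupted
# by two different pipelines.
errors_3dcc = [0.10, 0.14, 0.20, 0.26, 0.31]
errors_ae = [0.11, 0.13, 0.22, 0.24, 0.33]

r = pearson(errors_3dcc, errors_ae)
```

A high value means the model finds the same images hard under both pipelines, which is the sense in which the two sets of corruptions are taken to be similar.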
As discussed earlier, one of the main goals of this paper is to introduce 3DCC as an augmentation technique, and the paper demonstrates the effectiveness of the proposed augmentations both qualitatively and quantitatively. The authors evaluate UNet and DPT models trained on Taskonomy (T+UNet, T+DPT) and a DPT model trained on Omnidata (O+DPT) to see the effect of training dataset and model architecture. For the other models, they initialise from the O+DPT model and train with 2DCC augmentations (O+DPT+2DCC), and with 3D augmentations on top of that (O+DPT+2DCC+3D), which is the final proposed model. The proposed model was also trained using cross-task consistency (X-TC) constraints, denoted as (Ours+X-TC) in the results. Lastly, they evaluated a model trained on OASIS training data.
Qualitative:
Four image sets were considered to test whether augmentation helps: (i) OASIS validation images, (ii) AE corrupted data, (iii) manually collected DSLR data, and (iv) in-the-wild YouTube videos.
In the figure, the ground truth is gray when it is not available, e.g., for YouTube. The predictions in the last two rows are from the O+DPT+2DCC+3D (Ours) model and from the same model further trained with cross-task consistency (X-TC) constraints (Ours+X-TC). They are noticeably sharper and more accurate.
Quantitative:
The table reports the errors made by the models on 2DCC, 3DCC, AE, and the OASIS validation set (no fine-tuning). Again, the proposed models yield lower errors across datasets, showing the effectiveness of the augmentations. Note that robustness against corrupted data is improved without sacrificing performance on in-the-wild clean data, i.e., OASIS.
L1 errors are multiplied by 100 for readability. The O+DPT+2DCC+3D model is denoted by Ours. The authors also trained this model using cross-task consistency (X-TC) constraints (Ours+X-TC). 'Ours' models yield lower errors across the benchmarks. 2DCC and 3DCC are applied on the same Taskonomy test images.
Demo:
Finally, I want to show a demo of how this augmentation improves model performance. Here is a picture of me from some time back.
We do see a marked improvement in the model which was trained with these corrupted images as augmentations in the training dataset.
To conclude, 3DCC is a framework to test and improve model robustness against real-world distribution shifts, particularly those centred around 3D.
Experiments demonstrate that the proposed 3D Common Corruptions is a challenging benchmark that exposes model vulnerabilities under real world plausible corruptions. Furthermore, the proposed data augmentation leads to more robust predictions compared to baselines.
Since the paper was published in 2022, ImageNet-3DCC has become a part of Shift Happens (ICML 2022) as well as of the RobustBench benchmarks.
While the authors did an excellent job covering a broad range of aspects, here are some limitations listed by them and some by me:
3D quality: 3DCC is upper-bounded by the quality of 3D data. The current 3DCC is an imperfect but useful approximation of real-world 3D corruptions, as was shown. The fidelity is expected to improve with higher resolution sensory data and better depth prediction models.
Non-exhaustive set: This set of 3D corruptions and augmentations is not exhaustive. It instead serves as a starter set for researchers to experiment with. The framework can be employed to generate more domain-specific distribution shifts with minimal manual effort.
Large-scale evaluation: While the paper does evaluate some recent robustness approaches in its analyses, the authors' main goal was to show that 3DCC successfully exposes vulnerabilities. A proper scientific evaluation of a large variety of models in different domains was not done.
Use cases of augmentations: While the authors focussed on robustness, investigating the augmentations' usefulness in other applications, e.g., self-supervised learning, could be worthwhile.
Evaluation tasks: The experiments were done with dense regression tasks. However, 3DCC can be applied to different tasks, including classification and other semantic ones. Investigating failure cases of semantic models, e.g. on smoothly changing occlusion rates for several objects, could provide more insights.
Unknown Scenes: The OmniData networks were trained on several million images of the starter dataset, featuring mostly general indoor scenes (rather than faces, humans, landscapes, etc). If the query image severely deviates from the training data, the performance is expected to degrade.
One limitation I ran into while trying to run the code and the demos: the authors claim that the methods are easy to implement and run to generate corruptions for any image, but this was unfortunately not my experience. The corruption-generating code does not work on all machines (in my case, an M2 Mac), and it is not easy to isolate individual techniques without running at least part of their pipeline code.
Pratik Sen
The reviewer is a Master's student in Data Engineering and Analytics at TUM.
His favourite topics of interest include Natural Language Processing, Machine Learning, Ethics in AI, Computer Vision, and Explainability in Machine Understanding.