A little knowledge about cameras/optics will be useful to understand the post; I tried to include supplementary links where helpful.
This past year, I worked on a project for a Computer Vision class to construct depth maps of a scene using a stationary camera by changing aperture and focus only, via a Monte Carlo method. This is a writeup of my work; comment below to discuss/ask questions!
Images borrowed from a paper by Jacobs et al. The goal of this project was to take a scene (shown left) and construct a depth map (shown right). For this project, I was interested in creating relative depth maps (where a pixel is darker/lighter if it is closer to the camera than another pixel) as opposed to an absolute depth map (where the intensity of a pixel is proportional to its distance from the camera).
In a camera, the aperture refers to the hole in front of the camera that controls how much light hits the camera sensors. A larger aperture lets in more light, leading to a brighter image. Camera apertures are measured by f-stops; for this project, my camera had a minimum aperture of f/8 and a maximum aperture of f/2.8. The focus of a camera controls the depth at which the picture is the least blurry.
By taking the same picture of a scene with different aperture/focus settings for a camera, an image stack can be composed, varying either aperture or focus throughout the stack. In an aperture stack, the exposure, or amount of lighting, in the scene remains constant, but the aperture of the camera varies across the aperture stack. So that the exposure remains constant, the shutter speed of the camera varies inversely with the aperture size throughout the stack. In a focus stack, the focus distance of the camera varies throughout the stack.
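To make the constant-exposure constraint concrete, here is a minimal sketch of the arithmetic. It relies on the standard photographic relation that exposure is proportional to shutter time divided by the square of the f-number, and it assumes ISO stays fixed. The base shutter time of 1/30 s and the list of f-stops are hypothetical values for illustration, not the settings used in this project.

```python
def shutter_time_for_constant_exposure(base_time_s, base_f_number, new_f_number):
    """Shutter time that keeps exposure constant when the f-number changes.

    Exposure is proportional to time / f_number**2, so the time scales with
    the square of the f-number ratio (ISO assumed fixed).
    """
    return base_time_s * (new_f_number / base_f_number) ** 2


# Hypothetical aperture stack: f/8 at 1/30 s, opening up to f/2.8.
f_numbers = [8.0, 5.6, 4.0, 2.8]
times = [shutter_time_for_constant_exposure(1 / 30, 8.0, n) for n in f_numbers]

for n, t in zip(f_numbers, times):
    print(f"f/{n}: {t:.5f} s")
```

Each full stop wider (f/8 to f/5.6, f/5.6 to f/4, and so on) roughly halves the required shutter time.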
My goal was to create depth maps of a scene, which encode the distance of a pixel from the camera, using aperture and focus stacks only. Here are the two scenes that I created depth maps for:
Why is this interesting/hard?
Typically, special hardware such as a Kinect is used to create depth maps. Using 2d images only, a standard approach is to take a picture of a scene from two different camera viewpoints and use image disparity between the two images to create a depth map. However, in this project, my camera was stationary. There is research on creating depth maps from focus stacks (depth from focus), but very little work on depth maps from aperture stacks only, combining depth maps from aperture and focus stacks, and using a Monte Carlo method for improved depth maps. I worked on these three improvements to existing work.
Focal Stack Depth Maps
Depth maps from image stacks rely on the fact that the local focus of a region changes as the camera’s aperture or focus distance changes, and that this change is correlated with the region’s depth relative to the camera. A focus measure describes the local focus of a region: it is high when the region is sharp, clear, and in focus, and low when the region is blurry and out of focus. There are a variety of focus measures, but almost all require a radius that defines the region that the focus measure should be calculated over. The relationship between focus measure and object distance is straightforward in a focal stack: an object should have maximal focus in the frame where the camera is focused at a distance closest to the object’s distance. Consider the two regions labeled below, and their focus measure plotted against frames where focus distance varies from very near to far away:
The closer region (curve 1) peaks at max focus (the y-axis) at an earlier frame (the x-axis) than the further region (curve 2), indicating that the region described by curve 1 is closer than the region described by curve 2. This is the standard relationship exploited to create depth maps from focal stacks, and will be similarly used in my techniques. The depth maps exploiting this relationship were decent:
In my depth maps, white = close to the camera
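The peak-frame rule for focal stacks described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the code used for this project: the focus measure here is the absolute response of a 3×3 Laplacian averaged over a square window, the `radius` default is arbitrary, and the function names are hypothetical. The stack is assumed to be grayscale, aligned, and ordered from nearest to farthest focus distance.

```python
import numpy as np


def laplacian_abs(img):
    """Absolute response of the 4-neighbour 3x3 Laplacian (edges replicated)."""
    p = np.pad(np.asarray(img, dtype=np.float64), 1, mode="edge")
    lap = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
           - 4.0 * p[1:-1, 1:-1])
    return np.abs(lap)


def box_mean(img, radius):
    """Mean over a (2*radius+1)-wide square window, via an integral image."""
    k = 2 * radius + 1
    p = np.pad(img, radius, mode="edge")
    ii = np.pad(p, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    total = ii[k:, k:] - ii[:-k, k:] - ii[k:, :-k] + ii[:-k, :-k]
    return total / (k * k)


def focus_depth_map(stack, radius=4):
    """Depth map from a focal stack of shape (frames, height, width).

    Each pixel is assigned the frame where its local focus measure peaks.
    The result is scaled so that 1.0 (white) means closest to the camera.
    """
    focus = np.stack([box_mean(laplacian_abs(frame), radius) for frame in stack])
    best_frame = focus.argmax(axis=0)
    return 1.0 - best_frame / max(len(stack) - 1, 1)
```

The radius trades detail against noise: a small window follows fine structure but reacts to sensor noise, while a large window is stable but smears small objects.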
Aperture Stack Depth Maps
The relationship between focus measure and object distance is more complicated in an aperture stack, as the focus distance doesn’t change in an aperture stack. However, as the aperture size increases, objects further away from the focus distance become blurred, because more light rays from different locations pass through the larger aperture and hit the same point on the sensor. This phenomenon decreases the depth of field, or region where objects are acceptably sharp, as aperture size increases. Note that in this image all depths of field, regardless of their size, are still centered around the same focus distance, which doesn’t change throughout the aperture stack.
We can exploit this relationship to determine how far away an object is from the camera’s focus distance. An object that blurs more when aperture size increases is further away from the focus plane, and an object that blurs less when aperture size increases is closer to the focus plane. By minimizing the focus distance to 0.11m, we ensure that no objects are between the camera and its focus plane. This removes the ambiguity of an object far away from the focus plane but close to the camera. Consider the two regions labeled below, and their focus measure plotted against frames as the aperture size of the camera changes from f/8 (smallest aperture, largest depth of field) to f/2.8 (largest aperture, smallest depth of field):
The closer region (curve 1) is more resilient to blurring than the further region (curve 2) as aperture size increases, implying that region 1 is closer to the camera than region 2. I modeled this resilience with a simple heuristic: dividing the minimum focus response of a region by its maximum focus response in the aperture stack. We expect the maximum focus response at the smallest aperture (frame 1) and the minimum focus response at the largest aperture (frame 10), but maximizing and minimizing over the whole aperture stack makes our aperture depth map more robust. Using a straightforward application of this heuristic with a focus measure such as Tenengrad yields reasonable depth maps:
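The min/max heuristic above can be sketched as follows. This is a hedged illustration rather than the project code: Tenengrad is taken here as the squared Sobel gradient magnitude without a threshold, the region is a rectangle given by hypothetical `top`, `left`, `height`, `width` parameters, and `eps` only guards against division by zero in flat regions.

```python
import numpy as np


def tenengrad(img):
    """Squared Sobel gradient magnitude per pixel (edges replicated)."""
    p = np.pad(np.asarray(img, dtype=np.float64), 1, mode="edge")
    gx = ((p[:-2, 2:] + 2.0 * p[1:-1, 2:] + p[2:, 2:])
          - (p[:-2, :-2] + 2.0 * p[1:-1, :-2] + p[2:, :-2]))
    gy = ((p[2:, :-2] + 2.0 * p[2:, 1:-1] + p[2:, 2:])
          - (p[:-2, :-2] + 2.0 * p[:-2, 1:-1] + p[:-2, 2:]))
    return gx ** 2 + gy ** 2


def region_resilience(stack, top, left, height, width, eps=1e-12):
    """Minimum over maximum focus response of one region across the stack.

    Values near 1 mean the region barely blurs as the aperture opens, which
    (with the focus plane in front of every object) suggests it is close.
    """
    responses = [
        tenengrad(frame)[top:top + height, left:left + width].mean()
        for frame in stack
    ]
    return min(responses) / (max(responses) + eps)
```

Comparing `region_resilience` for two regions then gives their relative order: the larger value is the region judged closer to the camera.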
Monte Carlo Depth Maps
The techniques used to generate aperture and focus depth maps in current literature use an absolute measure of the depth of a pixel. For example, the depth of a pixel for the focus depth map is the frame where the pixel is most focused, while the depth of a pixel for the aperture depth map is the minimal focus divided by the maximal focus of that pixel throughout the aperture stack. The depth maps generated by these techniques are rough, as seen previously.
I propose constructing depth maps using a comparative, Monte Carlo, random sampling approach for better depth maps. For both focus and aperture depth maps, I run 500,000 iterations of selecting random rectangles in the image with width and height of at most 40 pixels. At each iteration, I use the same heuristics for either focus or aperture described in the previous sections to determine which rectangle is closer to the camera. I subtract from the depth of all the pixels in the closer rectangle, and add to the depth of all the pixels in the further rectangle. For my Monte Carlo algorithm, I use a simple 3×3 Laplacian, averaged across the pixels in my randomly selected rectangles, as a focus measure for those regions. This comparative approach converges to a global depth map, as regions that are globally far away will be considered further away relative to most other random regions.
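The sampling loop described above might look roughly like the sketch below. Several details are assumptions made for illustration and are not specified in this post: two rectangles are compared per iteration, the vote added or subtracted is 1, ties are skipped, and the function and parameter names are hypothetical. The `closeness` argument maps a region's focus curve (one value per frame) to a score where higher means closer, so the same loop serves both stack types.

```python
import numpy as np


def laplacian_abs(img):
    """Absolute response of the 4-neighbour 3x3 Laplacian (edges replicated)."""
    p = np.pad(np.asarray(img, dtype=np.float64), 1, mode="edge")
    lap = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
           - 4.0 * p[1:-1, 1:-1])
    return np.abs(lap)


def random_rect(rng, height, width, max_side):
    """Random rectangle inside the image, returned as a pair of slices."""
    h = int(rng.integers(1, min(max_side, height) + 1))
    w = int(rng.integers(1, min(max_side, width) + 1))
    top = int(rng.integers(0, height - h + 1))
    left = int(rng.integers(0, width - w + 1))
    return slice(top, top + h), slice(left, left + w)


def focus_closeness(curve):
    """Focal stack (near to far): an earlier focus peak means closer."""
    return -int(np.argmax(curve))


def aperture_closeness(curve):
    """Aperture stack: a higher min/max ratio means closer."""
    return float(curve.min() / (curve.max() + 1e-12))


def monte_carlo_depth(stack, closeness, iterations=500_000, max_side=40, seed=0):
    """Accumulate pairwise votes; lower values in the result mean closer."""
    rng = np.random.default_rng(seed)
    response = np.stack([laplacian_abs(frame) for frame in stack])
    _, height, width = response.shape
    depth = np.zeros((height, width))
    for _ in range(iterations):
        a = random_rect(rng, height, width, max_side)
        b = random_rect(rng, height, width, max_side)
        # Focus curve of each rectangle: mean Laplacian response per frame.
        score_a = closeness(response[:, a[0], a[1]].mean(axis=(1, 2)))
        score_b = closeness(response[:, b[0], b[1]].mean(axis=(1, 2)))
        if score_a == score_b:
            continue  # no evidence either way (assumed tie handling)
        closer, further = (a, b) if score_a > score_b else (b, a)
        depth[closer] -= 1.0
        depth[further] += 1.0
    return depth
```

Because the result stores lower values for closer pixels, it would need to be inverted and normalized before display with the white = close convention.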
I implemented some optimizations to improve the quality of the Monte Carlo depth maps. I skipped over rectangles if they didn’t have enough texture for a focus measure to be accurate. In addition, smooth surfaces are problematic for aperture depth map reconstructions because they do not blur much as aperture increases, even when the surfaces are far from the depth of field. In the Monte Carlo, comparison approach, this can be accounted for when comparing two regions by subtracting a texture factor from the smoother region before considering the increase in blurriness of our two regions. As seen below, these Monte Carlo depth maps were of higher quality than depth maps generated by standard techniques.
Monte Carlo Focus Depth Maps:
Monte Carlo Aperture Depth Maps:
Combining Aperture and Focus Depth Maps
In the last sections, we constructed depth maps from a scene using an aperture stack only, or a focus stack only. Now, I combine the depth maps from both of these stacks for a higher quality combination depth map. There are intuitive reasons for doing so, as each depth map encodes some parts of the scene more accurately than the other. For example, in the first scene, the aperture depth map misidentifies the blue cylinder as far away when it is actually the closest object in the image, and the focus depth map misidentifies the leftmost tree as far away when it is actually a medium distance away from the camera. The focus depth map in general seems to be of higher quality than the aperture depth map. To combine them, we use a simple strategy with the focus depth map as the baseline. For each pixel, we consider its standard deviation from its neighbors within 5 pixels of it. If the standard deviation is large, there is a chance we misidentified its depth, so we replace its depth with a weighted average of the focus depth and the aperture depth based on its standard deviation. The combination depth maps, reproduced below, yield improvements on the Monte Carlo depth maps from focus and aperture stacks:
I described methods to create aperture and focus depth maps, a Monte Carlo approach to creating depth maps, and a way to combine aperture and focus depth maps. In this section, I will evaluate all of the depth maps generated. One approach would be to compare, pixel-by-pixel, the difference between a generated depth map and a ground truth depth map obtained from a Kinect scanner. Since I didn’t have a Kinect scanner, I used a comparative approach to evaluate depth maps. I manually labeled regions in each scene, ordered by increasing depth. Here are my labeled ground truths, where thicker rectangles denote regions that are closer to the camera than thinner rectangles:
To evaluate each depth map, I randomly select 500,000 pairs of points from these regions. I tally whether each depth map correctly predicted which point was closer to the camera, based on the labeled ground truths. The percentage of correct predictions, the depth map accuracy, was the measure of how good a depth map was. Note that a depth map only encodes useful information if its accuracy is significantly above 50%, the rate of random guessing. I compared my Monte Carlo techniques against a variety of focus measures: Tenengrad, Helmli’s mean method, DCT modified, Modified Laplacian, and Brenner, with radius varying from 10 pixels to 640 pixels. I used Pertuz’s implementations of these measures.
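One way to read the combination strategy above is sketched below. The post does not give the exact weighting, so the choices here are assumptions: both maps are normalized to [0, 1] with the same orientation, the local standard deviation is computed on the focus map over a square window of radius 5, pixels below a hypothetical `threshold` keep their focus depth, and the aperture weight grows linearly with the standard deviation.

```python
import numpy as np


def box_mean(img, radius):
    """Mean over a (2*radius+1)-wide square window, via an integral image."""
    k = 2 * radius + 1
    p = np.pad(img, radius, mode="edge")
    ii = np.pad(p, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    total = ii[k:, k:] - ii[:-k, k:] - ii[k:, :-k] + ii[:-k, :-k]
    return total / (k * k)


def local_std(depth, radius=5):
    """Standard deviation of each pixel's neighbourhood."""
    mean = box_mean(depth, radius)
    mean_sq = box_mean(depth ** 2, radius)
    return np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))


def combine_depth_maps(focus_depth, aperture_depth, radius=5, threshold=0.1):
    """Blend toward the aperture map where the focus map is locally inconsistent."""
    std = local_std(focus_depth, radius)
    # Weight of the aperture map: 0 in consistent areas, up to 1 at the
    # least consistent pixel (assumed linear weighting).
    weight = np.where(std > threshold, std / (std.max() + 1e-12), 0.0)
    return (1.0 - weight) * focus_depth + weight * aperture_depth
```

With this weighting, smooth areas of the focus map pass through unchanged, and only neighbourhoods with high variation are pulled toward the aperture estimate.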
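The pairwise evaluation described above can be sketched like this. The details are assumptions for illustration: labeled regions are rectangles given as `(top, left, height, width)` and listed from closest to farthest, the depth map uses the white = close convention (higher value means closer), pairs drawn from the same region are discarded because they have no labeled order, and ties in the depth map count as wrong predictions.

```python
import numpy as np


def pairwise_accuracy(depth_map, regions, n_pairs=500_000, seed=0):
    """Fraction of random point pairs whose depth order is predicted correctly.

    regions: list of (top, left, height, width), ordered closest to farthest.
    depth_map: 2D array where a higher value means closer to the camera.
    """
    rng = np.random.default_rng(seed)
    rects = np.asarray(regions, dtype=np.int64)
    a = rng.integers(0, len(rects), size=n_pairs)
    b = rng.integers(0, len(rects), size=n_pairs)
    keep = a != b  # same-region pairs have no ground-truth order
    a, b = a[keep], b[keep]
    if len(a) == 0:
        raise ValueError("need at least two regions to form ordered pairs")

    def sample_values(idx):
        # One uniformly random pixel inside each selected rectangle.
        rows = rects[idx, 0] + (rng.random(len(idx)) * rects[idx, 2]).astype(np.int64)
        cols = rects[idx, 1] + (rng.random(len(idx)) * rects[idx, 3]).astype(np.int64)
        return depth_map[rows, cols]

    value_a, value_b = sample_values(a), sample_values(b)
    a_truly_closer = a < b
    correct = np.where(a_truly_closer, value_a > value_b, value_b > value_a)
    return float(correct.mean())
```

A map that orders every labeled pair correctly scores 1.0, and a map with no depth information scores around 0.5 or below, depending on how many ties it produces.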
Top focus depth maps
Top aperture depth maps
Top depth maps, overall
My three improvements to depth maps from aperture and focus stacks were successful. My first task, generating depth maps from an aperture stack only, leveraged the fact that objects further away from the focus distance become more blurred as the aperture of the camera increases. By constraining the focus distance to very close to the camera (0.11m) I was able to generate a depth map, as the focus distance was a weak approximation of the camera center. However, my aperture depth maps were of lower quality than my focus depth maps. This could have been because my assumption that no objects were between the focal plane and the camera was not entirely correct, which is why the blue object on the left of my first scene is colored darker than the close grass in the aperture depth maps.
My second contribution was using a Monte Carlo approach to generate my depth maps. The depth maps generated by the Monte Carlo approach are of higher quality, in both detail and accuracy, than the absolute depth maps generated. I argue that this approach is better for two main reasons. First, depth maps generated the standard way require a focus measure of a certain radius to be specified. This ideal radius can vary based on the texture of the image, as too large a radius will ignore small objects, while too small a radius may capture image noise or texture instead of the focus of a region. The random sampling approach selects rectangles of varying size, making the focus measure more robust to these changes. Second, simplifying the depth map problem from an absolute depth measure to a relative comparison problem between two textures allows for optimizations impossible in the more difficult problem space.
My third contribution was combining aperture and focus Monte Carlo depth maps for a higher quality, combination depth map. I was successful at this as well, because I noticed that the aperture depth map was better able to encode certain parts of the scene than the focus depth map. Because the focus depth map was of higher quality than the aperture depth map, I used the focus depth map as a baseline and filled in inconsistencies using an aperture depth map, which resulted in a combination depth map better than either depth map, evidenced by its top ranking among all depth maps.
There are a couple of enhancements that I can implement to improve the quality of my depth maps. My camera jittered between frames of my focus and aperture stacks, so using a tripod would remove this jitter and increase the quality of the depth maps. Also, the camera I used had a small range of different aperture sizes and focus distances. A better camera that could change aperture from f/2.8 to f/16, for example, would have been better able to highlight differences in a region’s focus through a larger aperture stack. I can also post process my depth maps to make them smoother, using the prior that depth in the real world is relatively smooth except at edges. This technique should be particularly helpful for my Monte Carlo depth maps, which have rectangular artifacts since the regions I compare are rectangular.