In the past few years, a number of computer vision and machine learning researchers have tried to take the groundbreaking inference capabilities of convolutional neural networks (CNNs, maybe you’ve heard of them?) and apply them to 360° images. Initially, the results were surprisingly poor.
It turns out that there are important differences between 360° images and the typical central-perspective images we’re used to. Over the past few years, a number of papers have shed a great deal of light on the problem, and the solutions are quite varied.
In this post, I look to introduce the problem, provide an overview of the myriad solutions, and paint a picture of what I believe to be the most promising path forward. In short, I aim to provide you with all the things you never thought you needed to know about modern 360° computer vision.
This article is long and covers a lot, but I hope it will serve as a useful reference for anyone looking to understand the 360° image domain in the context of computer vision and deep learning research. I also hope that this will inspire you, the reader, to develop new techniques for and applications of this exciting technology.
Author’s Note: Much of this article is derived from my own dissertation on this topic. I have done my best to avoid any authorship bias in presenting relevant research but, of course, my conclusions are impacted by my own experiences working on this technology.
[All images and figures in this article are my own creation unless otherwise cited in-line or at the end.]
TABLE OF CONTENTS
= What is 360° Computer Vision?
**** Applications
**** Spherical distortion
**** 3 guiding principles
= Understanding Spherical Images
**** Map projections
**** Image representation matters!
**** Translational equivariance and convolution
= Recent Solutions (that don’t use the icosahedron)
**** Learning-based methods
**** Convolution reparameterization methods
**** Location-adaptive methods (for equirectangular images)
= The Icosahedral Sphere
**** What is a subdivided icosahedron?
**** Why use the subdivided icosahedron?
**** So many ways to convolve on the icosahedron!
**** Representational drawbacks
**** A potential compromise solution: tangent images
What is 360° computer vision?
For this article, I am going to define 360° computer vision as:
Performing inference, estimation, and modeling tasks using images that capture a scene with a 180° x 360° field of view.
These images go by a variety of different names: omnidirectional images, panoramas, spherical images, 360° images, etc. At the end of the day, each is largely describing the same thing. Here’s an example of such an image:
For the purposes of this article, I’m going to use the definition of a spherical image put forth by Krishnan and Nayar, 2009:
- a 4π steradian field of view
- a single (effective) center of projection
- a uniform resolution in every direction
These images can be captured in a variety of ways.
- 360° cameras like the Ricoh Theta or Insta360 One
- Polydioptric, or multi-lens, cameras like Facebook’s Surround360 or Google’s now defunct Jump VR camera
- 360° panoramas stitched from a set of perspective images, like those used in Google Street View
- Virtual renderings from high-quality 3D models, like images in the popular Stanford2D-3D-S dataset
Any one of these approaches is a viable way to create an omnidirectional image, although each has its own pros and cons. Regardless, once we have our 360° images or video, the applications are endless.
Applications
- 360° imaging facilitates virtual tourism, like these videos from National Geographic (perhaps a nice escape from pandemic-induced house arrest)
- 360 can help us navigate, like through Google’s Street View service or Mapillary
- Looking for a new house? 360 is moving into e-commerce too, like Zillow’s 3D Home platform for showcasing real-estate listings
Now, all of the above share the common theme of “immersive experiences.” But there are other important use cases too:
- Medical applications, such as wide field of view endoscopies (like this one from Saneso or the innovative, swallowable CapsoCam) that can help physicians save more lives by improving cancer detection.
- Robotics and autonomous vehicle applications. Tesla’s autopilot has a full 360° horizontal field of view, while Dyson’s 360 Eye vacuum leverages the power of omnidirectional vision to navigate its surroundings.
- Indoor 3D modeling. This is more of a research task that supports many of the applications already described, but it’s become very popular to take advantage of 360° images for room layout estimation (Yang and Zhang, CVPR 2016, Xu et al., WACV 2017, Yang et al., CVPR 2018, Zou et al., CVPR 2018, Yang et al., CVPR 2019, Fernandez-Labrador et al., Robotics and Automation 2020)
- And there are many more! Think of anything that could benefit from eyes in the back of its head.
With all of these applications, it’s no wonder that computer vision researchers are trying to incorporate these images into their algorithms!
But what makes these images so unique that they have launched their own line of research?
The simple answer is spherical distortion.
Spherical distortion
In some ways, spherical distortion can be considered a type of lens distortion, but its origin is actually quite different. While lens distortion is caused by physical properties of the lens in the camera,
Spherical distortion is a function of the choice of image representation.
Spherical distortion results in particularly heavy content deformation in 360° images. Why is this the case? Well, it turns out that we model 360° images with a different camera model than the central-perspective images we know and love.
We replace the concept of an image plane with that of an image sphere.
Now, if we want to use any of our existing computer vision algorithms or CNNs on one of these spherical images, we first have to find a way to represent it in a planar way (i.e. like the standard images we’re familiar with).
The problem is that there is no way to do this without distorting the image.
Why, you ask?
Because:
A sphere is not isometric to a plane.
This is a consequence of Gauss’s Theorema Egregium. We all know Gauss from his influential work on *nearly every mathematical application* you can think of. It seems that our favorite polymath is still impacting technological discoveries well into the 21st century.
This lack of an isometry between the image sphere and the nice planar representations we desire means that, no matter what we do, we’re always going to have some distortion in our image.
It turns out that this distortion has a significant, detrimental impact on our ability to apply convolutional neural networks, as well as other computer vision algorithms, to 360° image inputs. This outcome is due to an important assumption we’ve made when designing many of these groundbreaking tools: namely, that our inputs are undistorted (or undistort-able) images. As a result, we often see 360° image inputs result in worse performance than their traditional, central-perspective counterparts, leading to a performance gap.
3 Guiding Principles
A number of researchers have sought ways to close this performance gap by modifying the algorithms themselves. Others have focused on finding spherical image representations with less distortion. While there are pros and cons to both approaches, I suggest that any practical solution to this problem should satisfy 3 conditions:
- Distortion must be sufficiently addressed.
- The solution must efficiently scale to high resolution spherical images.
- The transfer of central-perspective image computer vision algorithms to spherical images must require minimal additional effort.
Distortion needs to be addressed for the reason described above: the unavoidable content deformation driving the performance gap.
Efficient scalability is also key for this wide field of view format. Discretizing a field of view into a pixel grid is an inherently lossy operation. To capture the world with the same level of detail and granularity of a central-perspective image, a spherical image must have a much higher pixel resolution.
Finally, computer vision is a field with over 50 years of existing work. Any solution for 360° images should not reinvent the wheel. It should be possible to make use of the decades of existing research with minimal additional effort.
Over the rest of this article, we will look into where, specifically, spherical distortion comes from and why it is so problematic, especially for CNNs. We will also highlight some of the recent methods proposed to reduce distortion’s impact, and go into some of their pros and cons. Finally, I will wrap this up with my own thoughts on a path forward for 360° computer vision (HINT: it involves focusing on better spherical image representations, rather than modifying existing algorithms).
I hope, by the end of it, you will have found this article both inspirational and informative, and you, too, will be eager to push the limits of the 360° imaging domain.
Understanding Spherical Images
When developing computer vision algorithms for 360° images, the crux of the problem boils down to sufficiently addressing the distortion induced by mapping the sphere to a plane. It turns out this struggle predates computer vision by a lot. Long before computer vision set its sights on spherical images, cartographers were developing novel mathematical models, map projections, to project the Earth onto flat maps. As it happens, their efforts for over 2,000 years have provided a good basis to understand our spherical image problems today.
Map projections
Take a look at the two most prevalent spherical image representations:
The cube map, shown unfolded on the left, is very popular for graphics applications like environment mapping, and it enjoys hardware support in most GPUs. The equirectangular image, shown on the right, is more popular for computer vision applications, thanks to its contiguous layout and perhaps simply because it is a rectangular image.
Although popular, both of these representations are quite distorted. A useful tool to visualize spherical distortion is Tissot’s indicatrix. It demonstrates how a perfect circle on the sphere is deformed into an ellipse on the plane. How this circle deforms (i.e. the area, eccentricity, and orientation of the ellipse) tells us the type and extent of local distortion.
Here are those two spherical image representations again, this time with Tissot’s indicatrices super-imposed. For clarity, I’m only showing one face of the cube, but the distortion characteristics are the same for all 6:
Contrast those with the sphere below. Notice the strong horizontal distortion effects in the equirectangular image and how they increase as we get closer to the top and bottom of the image. Observe how, in the cube map, distortion increases towards the corners of each face, though less extreme than in the equirectangular image.
To understand why these representations are so distorted, it helps to consider how these representations are formed.
Most map projections seek to map the surface of the sphere to a developable surface, basically a shape that can be unraveled to a plane without stretching or warping of any kind.
The cube map, for example, is formed by the gnomonic projection, also known as the rectilinear projection, which projects a sphere onto a tangent plane:
In the case of the cube map, these tangent planes are the 6 faces of the inscribing cube. The projections on the +/-X and +/-Z faces are centered at the equator, and hence are projected in the equatorial aspect, while the +/-Y faces are centered at the poles, and are thus projected in the polar aspect.
Observe in the image above how the polar aspect leads to content radiating outward from the center of the face. [Note: I will point out later how this effect actually makes cube maps a not-so-great representation for convolution.]
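If it helps to see the projection concretely, here’s a minimal sketch of the forward gnomonic projection (the function name and signature are my own; the formulas are the standard ones you’ll find in Snyder’s manual, cited later in this article). Each cube face is just this projection with the tangent point placed at that face’s center:

```python
import numpy as np

def gnomonic_projection(lat, lon, lat0=0.0, lon0=0.0):
    """Forward gnomonic projection of spherical coordinates (radians) onto
    the plane tangent to the unit sphere at (lat0, lon0).
    Points with cos_c <= 0 lie on the far hemisphere and have no valid
    projection onto this tangent plane."""
    cos_c = (np.sin(lat0) * np.sin(lat)
             + np.cos(lat0) * np.cos(lat) * np.cos(lon - lon0))
    x = np.cos(lat) * np.sin(lon - lon0) / cos_c
    y = (np.cos(lat0) * np.sin(lat)
         - np.sin(lat0) * np.cos(lat) * np.cos(lon - lon0)) / cos_c
    return x, y
```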
Equirectangular images are formed in a different way. They are an equirectangular projection, which projects the sphere onto a cylinder.
A quirk of equirectangular projections is that they are a type of equidistant projection, preserving distances between lines of latitude across the image. This is why distortion only shows up horizontally.
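Concretely, for an equatorial-aspect equirectangular image of width W and height H (assuming the usual top-left pixel origin), a point at longitude λ and latitude φ lands at

$$u = \frac{\lambda + \pi}{2\pi} W, \qquad v = \frac{\pi/2 - \varphi}{\pi} H.$$

Vertical pixel spacing is uniform in latitude (the equidistant property), while a one-pixel horizontal step covers a distance on the sphere that shrinks by a factor of cos φ toward the poles, which is exactly the horizontal stretching visible in the indicatrices above.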
Image representation matters!
As it happens, there are an infinite number of ways to project a sphere onto a plane. However, how we choose to project it and the surface we choose to project it onto make a big difference in the type of distortion we will see as a result.
Generally there are three properties to consider with regards to distortion:
- Equal area projections preserve the relative areal scale of regions on a map. When considering a spherical image, this implies that the amount of information in a certain region of the image is going to be equivalent to the amount of information at another, equally-sized region. However, equal area projections will result in distortions elsewhere in the image.
- Conformal projections preserve local angles, which means that at any single point on the map, shapes are locally accurate. For a spherical image, this can be interpreted as ensuring the compactness and regularity of information in the image. Put differently, the image content may be deformed by distortion, but that deformation will be equal in all directions at any given point. It is impossible for a spherical projection to be both conformal and equal area.
- Equidistant projections preserve the distance between locations on the sphere. A projection cannot be equidistant in every direction (lest it be an isometry of the sphere), but equidistant projections preserve distance along some direction of lines in the resulting map. As we saw, an equirectangular image is an example of an equidistant projection.
For a little perspective, here are some examples of each kind of projection with Tissot’s indicatrices super-imposed (you’ll likely be familiar with the conformal example, as it’s been hung in many school classrooms for decades and the subject of plenty of controversy).
Now, you might be wondering why I’ve indulged this tangent into cartography. My goal here is to demonstrate that, when it comes to 360° (spherical) images:
Our choice of map projection (i.e. our image representation) has a significant impact on spherical image distortion.
[Note: If you have further questions about or interest in map projections, I highly encourage you to take a look at the manual published for the US Geological Survey by John P. Snyder, the modern “godfather of map projections.”]
Translational equivariance and convolution
So now let’s get back to the issue I mentioned before about distortion and convolution. Namely, why does too much distortion disrupt the proper function of CNNs?
To answer this question, we first need to remember how CNNs function.
Unlike fully-connected networks or multi-layer perceptrons, which require a parameter for every input, convolutional neural networks enable location-independent parameter sharing via the convolutional filtering operation. This is a revolutionary concept that has helped to power the success of deep learning in computer vision. However, it’s also the reason CNNs see performance degrade on 360° images.
This parameter-sharing design relies on a concept called translational equivariance. This means that if we were to shift the input image in some way, the output of the filtering operation should shift equivalently. Mathematically, we can write this as:
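$$h(t(x)) = t(h(x)),$$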
for some filter h(•) and translation function t(•).
But spherical distortion disrupts this translational equivariance. Take a look at the following example.
Let’s say we have this picture of a dog:
On the left figure below, I’ve projected this image onto a hemisphere at the equator and mapped it to an image via the gnomonic projection. Look at what happens when I then try to shift it up by 45°, in the right image:
We see the location-dependent distortion effects of the spherical image come into play.
Convolution is a local operation. With this location-dependent scaling and warping, we can’t expect the convolutional filter to have the same outputs after this translation. In other words, spherical distortion breaks the translational equivariance required for proper CNN function.
So why does this happen? It might help to look at the math behind convolution.
Recall the discrete convolution operation (shown here in the 1D form):
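$$(f * h)[n] = \sum_{k} h[k] \left( \sum_{m} f[m]\, \delta[(n - k) - m] \right)$$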
This operation is really two parts, a sampling followed by a weighted sum.
Here, δ[•] represents the discrete Dirac delta, or impulse, function:
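$$\delta[n] = \begin{cases} 1 & \text{if } n = 0 \\ 0 & \text{otherwise} \end{cases}$$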
Now, although our images are discretized by a pixel grid, it’s common in computer vision to treat them as continuous, interpolating between pixels as necessary. With that in mind, let’s consider our convolutional sampling as continuous as well:
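$$(f * h)(x) = \int h(\tau) \left( \int f(t)\, \delta\big((x - \tau) - t\big)\, dt \right) d\tau$$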
Note that now, we use δ(•) to represent the continuous impulse function:
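$$\delta(x) = 0 \;\; \text{for } x \neq 0, \qquad \int_{-\infty}^{\infty} \delta(x)\, dx = 1$$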
So, our convolutional kernel must sample the same area at each location in the data, otherwise the summation becomes biased by unequal information. Similarly, the distribution of data at each point must be equivalent. The first constraint implies that the image must be equiareal, while the second implies the need for conformality. But, as we know, it is not possible for spherical images to satisfy both properties.
A quick aside: In the continuous sense, the impulse function can equivalently be expressed as the area of a zero-centered, isotropic Gaussian distribution at the limit where the variance goes to zero:
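$$\delta(\mathbf{x}) = \lim_{\sigma^2 \to 0} \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{\lVert \mathbf{x} \rVert^2}{2\sigma^2}\right)$$

(written here for the 2D, isotropic case)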
In 2D, such a distribution is represented by an infinitely small circle. Hence, when convolving a filter with a spherical image, each datum sampled from the sphere can be modeled as the area bounded by an infinitesimal circle on the sphere’s surface. This is why Tissot’s indicatrix is such a useful tool for understanding the effects of distortion on convolution.
Alright, so at this point we hopefully understand 3 things:
- Spherical distortion cannot be removed.
- Spherical images == map projections, and the choice of map projection dictates how distorted our 360° image will be.
- Spherical distortion breaks the translational equivariance required for proper CNN function.
Now that we’ve established the fundamental problem, let’s take a look at some of the solutions!
Recent Solutions (that don’t use the icosahedron)
For something of a niche topic within the grand scope of computer vision, the spherical distortion problem has attracted a surprising amount of interest from the academic community. Perhaps more surprising is just how wide-ranging the proposed solutions have been. I’ve broken the related work into 3 categories. It’s not a perfect division, but it’s a fairly MECE (mutually exclusive, collectively exhaustive) breakdown.
In this section, I’m going to go through some of the more impactful papers on spherical image convolution in light detail, and provide a decently comprehensive reading list for the interested reader.
[Note: I’ve left out a key subset of the most recent work here (icosahedral methods), because we’re going to devote the entire next section to looking at them.]
Learning-based methods
This category of work aims to learn the distortion function from the data itself, mainly focusing on transfer learning methods. Perhaps not coincidentally, much of this work was done by the same few researchers.
The paper to highlight in this category is: Kernel Transformer Networks for Compact Spherical Convolution (Su and Grauman, CVPR 2019).
This work proposes a mini-network that learns to modify convolutional filters in a location-dependent manner. The idea here is to provide a simple module that, when trained once on a specific type of distortion, can be plugged into any existing network for efficient inference-time transfer to spherical image inputs. A key idea behind this approach is that it works by focusing on adapting the convolutional kernel, rather than the content of the image. In some ways, this is a learned variation of the location-adaptive methods we will look at shortly.
Other useful papers:
- Learning Spherical Convolution for Fast Features from 360 Imagery (Su and Grauman, NeurIPS 2017)
This paper seeks to address equirectangular distortion by learning a row-dependent adaptation for each kernel in a network.
- Snap Angle Prediction for 360 Panoramas (Xiong and Grauman, ECCV 2018)
Here, the authors develop a learned, content-aware rotation of the spherical data in the cube map format that “snaps-to” a view where the most relevant information has minimal distortion.
This category of work is especially interesting because, unlike the other categories, it doesn’t attempt to change or redefine any aspect of the standard CNN pipeline. It uses all the same tools of traditional CNNs and takes the input in common spherical image formats. This makes these methods scalable and transferable. The drawback here, though, is that there is no guarantee that a network will sufficiently model the image distortion enough to restore the necessary equivariance.
Convolution reparameterization methods
This category of methods seeks to generalize or approximate convolution on the sphere. Typically, these solutions are designed to be agnostic to the choice of image representation and focus on the theoretical question of processing spherical signals.
In this category, the paper to highlight is Spherical CNNs (Cohen et al., ICLR 2018, Best Paper) [Code].
This was the paper that introduced the world to the importance of translational equivariance in the context of spherical images. Or perhaps more accurately, this was the paper that demonstrated that the spherical distortion problem could be solved by ensuring rotational equivariance. Cohen et al. provide a solution in the form of a generalized convolution on SO(3) rotations. The crux of this method is to convolve over the rotation space of the spherical inputs (rather than the translation space on a planar image). For efficient operation, the convolution is performed in the frequency domain, and it is implemented using a generalized fast Fourier transform. As the paper is quite heavy on the theory, I will refer the interested reader to it directly.
While some of the related work preceded this paper, it’s perhaps fair to credit “Spherical CNNs” with laying the theoretical foundation for a lot of the subsequent work.
Other useful papers:
- Learning SO(3) Equivariant Representations with Spherical CNNs (Esteves et al., ECCV 2018, Oral) [Code]
In some ways scooped by “Spherical CNNs,” this work proposes a similar SO(3) equivariant convolution based on the Spherical Fourier Transform, decomposing the spherical signal and filters into their spherical harmonic bases. It’s worth noting that this approach is faster than “Spherical CNNs.”
- DeepSphere: Efficient Spherical Convolutional Neural Network with HEALPix Sampling for Cosmological Applications (Perraudin et al., Astronomy and Computing, 2018 and Defferrard et al., ICLR Workshops, 2019) [Code]
This work looks to avoid the computational expense of the Fourier transform and also seeks a solution that does not require a full sphere as input. The authors propose an efficient graph convolution on the HEALPix representation of the sphere. It’s not entirely clear how this fares compared to other methods, as the authors do not provide a comparison.
By design, these methods circumvent the spherical distortion problem by focusing on the spherical input signal directly. Furthermore, these methods are often quite efficient thanks to fast spectral or graph operations. However, the primary downside to these reparameterization approaches is that they can’t reuse existing networks. While you can implement a ResNet-50 spherical CNN, for example, you can’t then initialize it with pre-trained weights from some off-the-shelf ResNet-50 model. As the parameterization has changed as well, we can’t apply our existing insight into kernel sizes and network design either.
Location-adaptive methods (for equirectangular images)
This category consists of methods that seek to explicitly encode spherical distortion into the convolutional kernel’s sampling function in a location-dependent format. The idea is that, if distortion deforms the content, why not just deform the kernel accordingly?
The paper to highlight in this category is SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images (Coors et al., ECCV 2018).
In this work, the authors try to directly address distortion in the equirectangular image by changing where the convolutional kernel samples from the image in a location-dependent way. Considering a convolutional filter as a tangent plane to a sphere, the authors use the inverse gnomonic projection* to project the kernel grid onto an equirectangular image. Additionally, Coors et al. also address the issue of oversampling the sphere that results from the pixel redundancy caused by distortion. They evaluate the use of spiral spherical sampling to apply the filter in a more uniform way on the image. This is actually an important concept and one that will be touched on in more depth by the icosahedral methods presented in the next block. The tasks addressed here are classification and object detection.
* As it happens, this is not the ideal projection for an equirectangular image. In my paper, Mapped Convolutions, I show that the inverse equirectangular projection is actually the more appropriate choice for this approach. That being said, the improvements are marginal (1–2%).
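To make the idea concrete, here’s a rough sketch (my own illustration, not the authors’ code) of computing where a kernel centered at a given equirectangular pixel should sample, using the inverse gnomonic projection. The helper names, the pixel-to-angle conventions, and the kernel spacing are all assumptions made for illustration:

```python
import numpy as np

def inverse_gnomonic(x, y, lat0, lon0):
    """Map tangent-plane coordinates (x, y) back to spherical coordinates
    (lat, lon) for a plane tangent at (lat0, lon0). Standard formulas
    (see Snyder's manual); all angles in radians."""
    rho = np.sqrt(x ** 2 + y ** 2)
    c = np.arctan(rho)
    rho = np.where(rho == 0.0, 1e-12, rho)  # avoid 0/0 at the tangent point
    lat = np.arcsin(np.cos(c) * np.sin(lat0)
                    + y * np.sin(c) * np.cos(lat0) / rho)
    lon = lon0 + np.arctan2(
        x * np.sin(c),
        rho * np.cos(lat0) * np.cos(c) - y * np.sin(lat0) * np.sin(c))
    return lat, lon

def kernel_sampling_locations(row, col, H, W, ksize=3):
    """Hypothetical helper: (fractional) pixel locations that a ksize x ksize
    kernel centered at equirectangular pixel (row, col) should sample."""
    # Spherical coordinates of the kernel center (latitude +pi/2 at the top
    # row, longitude -pi at the left column)
    lat0 = (0.5 - (row + 0.5) / H) * np.pi
    lon0 = ((col + 0.5) / W - 0.5) * 2.0 * np.pi
    # Regular kernel grid on the tangent plane; spacing of one pixel's
    # angular step at the equator is an assumption for illustration
    step = np.tan(2.0 * np.pi / W)
    offsets = (np.arange(ksize) - ksize // 2) * step
    xs, ys = np.meshgrid(offsets, -offsets)  # kernel row 0 points toward +latitude
    lat, lon = inverse_gnomonic(xs, ys, lat0, lon0)
    # Back to pixel coordinates on the equirectangular image
    rows = (0.5 - lat / np.pi) * H - 0.5
    cols = (lon / (2.0 * np.pi) + 0.5) * W - 0.5
    return rows, cols
```

Near the equator, these locations are essentially the regular 3x3 grid; near the poles, they spread out horizontally, mirroring the distortion of the underlying image.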
Other useful papers:
- OmniDepth: Dense Depth Estimation for Indoors Spherical Panoramas (Zioulis et al., ECCV 2018) [Code][Dataset]
This paper seeks to address the horizontal effects of equirectangular distortion through the use of rectangular filter banks on the first few layers of the network. It also publishes a large, rendered, indoor 360° depth estimation dataset.
- Distortion-Aware Convolutional Filters for Dense Prediction in Panoramic Images (Tateno et al., ECCV 2018)
This paper concurrently proposed the same inverse gnomonic kernel as Coors et al. The big difference is that this work focused on dense prediction tasks, namely depth estimation and semantic segmentation. This work also compares to cube map representations (spoiler alert: location-adaptive kernels on equirectangular images work better than regular grids on cube maps).
- Corners for Layout: End-to-End Layout Recovery from 360 Images (Fernandez-Labrador et al., arXiv, 2019)
Like Coors et al. and Tateno et al., this work also deforms the kernel using the inverse gnomonic projection. This paper is different, however, in that it focuses on one of the big research applications of 360° panoramas: indoor 3D reconstruction and layout prediction. As far as I am aware, this is actually the only layout prediction paper to leverage any novel kernels or representations of the 360° image.
Personally, I am a fan of this class of approaches, because they take advantage of a priori knowledge, like the closed form expression for the projection function. The ability of a network to learn to adapt is certainly interesting, but it’s better if we can also encode any additional information at our disposal.
There are a couple of drawbacks to these approaches, too. First, it’s not abundantly clear from the papers whether just modifying the kernel’s sampling function goes far enough to fully address distortion (spoiler alert: it doesn’t). Second, many of these methods also modify the convolution operation, which means that existing super-efficient library implementations of convolution can’t be used anymore. In practice, this impacts network speed and scalability.
The Icosahedral Sphere
2019 marked a change in many of the approaches to spherical image convolutions. That year saw 6 papers come out touting the benefits of using the subdivided icosahedron to represent the sphere.
In this section, we’ll start out with what the icosahedron is and why it is so useful (HINT: it comes back to reducing distortion), and then we’ll highlight each of these recent papers in light detail.
What is a subdivided icosahedron?
Let’s start out with what an icosahedron is. Well, it’s a 20-face, convex, regular polyhedron. Take a look at an example here:
An important property for us is that it’s the Platonic solid with the most faces. Now why is that important?
Think about the classical method of exhaustion for approximating a circle with inscribed regular polygons (i.e. polygons where each side is equal length):
In 3D, we can look to approximate a sphere in the same way. But, instead of using polygons (a 2D concept), we should use convex polyhedra (the 3D equivalent). Although there is an infinite number of regular 2D polygons, there is actually a finite number of regular convex polyhedra (i.e. each face has the same shape and area). Specifically, we have the 5 Platonic solids. So, it makes sense that we ought to use the one with the most faces. Enter the icosahedron.
Now, the icosahedron has another useful property. It is a subdivision surface. This means that we can divide each face into a number of smaller, equal-area faces. So the 20-face icosahedron (left) can become an 80-face icosahedron (right) by subdividing each face:
This is interesting, but how does it change anything?
Well, it can become an even better spherical approximation if we’re willing to budge a little on the perfectly equal-sized faces. Using Loop subdivision (Loop, 1987), we can interpolate the vertices and start to approximate a sphere:
If we do this ad infinitum, our icosahedron will eventually be indistinguishable from the sphere. In effect, we can consider this to be a 3D method of exhaustion.
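If you’d like to play with this yourself, here’s a small sketch that builds the level 0 icosahedron and refines it. For simplicity, it uses plain midpoint subdivision and pushes the new vertices out to the unit sphere, a common stand-in for the Loop-based refinement described above; the function names are my own.

```python
import numpy as np

def icosahedron():
    """Vertices and triangular faces of a unit (level 0) icosahedron."""
    t = (1.0 + np.sqrt(5.0)) / 2.0  # golden ratio
    verts = np.array([
        [-1,  t,  0], [ 1,  t,  0], [-1, -t,  0], [ 1, -t,  0],
        [ 0, -1,  t], [ 0,  1,  t], [ 0, -1, -t], [ 0,  1, -t],
        [ t,  0, -1], [ t,  0,  1], [-t,  0, -1], [-t,  0,  1],
    ], dtype=np.float64)
    verts /= np.linalg.norm(verts, axis=1, keepdims=True)
    faces = np.array([
        [0, 11, 5], [0, 5, 1], [0, 1, 7], [0, 7, 10], [0, 10, 11],
        [1, 5, 9], [5, 11, 4], [11, 10, 2], [10, 7, 6], [7, 1, 8],
        [3, 9, 4], [3, 4, 2], [3, 2, 6], [3, 6, 8], [3, 8, 9],
        [4, 9, 5], [2, 4, 11], [6, 2, 10], [8, 6, 7], [9, 8, 1],
    ])
    return verts, faces

def subdivide(verts, faces):
    """One subdivision level: each triangle becomes four, and each new
    (midpoint) vertex is projected onto the unit sphere."""
    verts = [np.asarray(v, dtype=np.float64) for v in verts]
    cache = {}
    def midpoint(i, j):
        key = (min(i, j), max(i, j))
        if key not in cache:
            m = (verts[i] + verts[j]) / 2.0
            cache[key] = len(verts)
            verts.append(m / np.linalg.norm(m))  # push out to the sphere
        return cache[key]
    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_faces += [[a, ab, ca], [ab, b, bc], [ca, bc, c], [ab, bc, ca]]
    return np.array(verts), np.array(new_faces)
```

Starting from 20 faces and 12 vertices, each call to subdivide quadruples the face count: one call gives level 1 (80 faces), two give level 2 (320 faces), and so on.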
A quick note on nomenclature. When talking about subdivided icosahedra, I’m going to refer to the 20-face regular icosahedron as a level 0 icosahedron, while after one subdivision, it will be called level 1, etc. This is simply an easy way to refer to different subdivision levels.
Why use the subdivided icosahedron?
If it’s not immediately clear from the subdivision pictures above, the answer to this question is pretty simple. Summarizing from one of the most prominent minds in modern cartography (Kimerling et al., 1999):
The subdivided icosahedron is among the least-distorted spherical representations.
As we talked about before: cartographers have been looking at this problem for millennia. Let’s not try to reinvent the wheel and, instead, leverage some existing insight into the matter.
Take a look at the figure below. Here, I’ve projected our Earth image onto the faces of the original icosahedron via the gnomonic projection, and I’ve unfolded the icosahedron to its net. I’ve also super-imposed Tissot’s indicatrices on the faces so you can visualize the distortion characteristics.
Look at how nearly-uniform those circles are. Again, our goal is to have perfect circles that are all the same size. I’d say we get pretty close here.
So, clearly the icosahedron is great for reducing distortion. But, there is also another reason it’s become so popular:
Subdivision parallels image up-sampling
Take a look at the equations below. These show how to compute the number of faces and vertices for each successive subdivision level.
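$$F_{\ell} = 20 \cdot 4^{\ell}, \qquad V_{\ell} = 10 \cdot 4^{\ell} + 2,$$

where ℓ is the subdivision level.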
Take note of that factor of 4. A subdivision turns 1 face into 4 faces, just like how image up-sampling turns 1 pixel into 4 pixels.
For vertices, it’s not quite the same, but it’s fairly close.
Now remember, many of our favorite fully-convolutional network architectures (like FCNs, U-Nets, and ResNets) incorporate these down-sampling and up-sampling operations to encode and then decode learned features. Because subdivision has this nice parallel with these operations, the subdivided icosahedron can fit nicely into our existing CNN paradigm.
So many ways to convolve on the icosahedron!
As I pointed out before, 6 different papers proposed novel analysis using the icosahedron in 2019. Let’s take a shallow dive into each of these papers to see, empirically, how and why the icosahedron works, and what considerations we have to make to use it.
I will start this section with my own contribution to this approach, because I directly compare different spherical representations.
Mapped Convolutions (Eder et al., arXiv 2019) [Code]
Convolutions on Spherical Images (Eder and Frahm, CVPR Workshops 2019, Oral)
The key contribution of these papers is to tout the representational benefits of the subdivided icosahedron and to provide a solution to convolve on its surface. In these works, I look at the tasks of depth estimation and semantic segmentation using an equirectangular image, a cube map, and a level 7 subdivided icosahedron. To provide an apples-to-apples comparison, I generalize the location-adaptive methods by developing a “mapped convolution” operation that accepts an adjacency list to determine where a convolutional kernel should sample. This approach allows us to map the kernel to the faces of the subdivided icosahedron without changing the convolution operation in any way. The results show a 12.6% improvement in overall semantic segmentation mean IOU and a nearly 17% improvement in absolute error for depth estimation,
Simply by resampling the image to the subdivided icosahedron!
These outcomes reinforce the importance of the choice of spherical image representation.
An additional result is that the cube map is not a great choice of spherical image representation, due to the orientation inconsistency at the +/-Y faces that we highlighted earlier (remember how content radiates from the center?) and filter ambiguity at the corners.
The problem with my mapped convolution approach in these works is that it can get quite slow as network depth or image resolution increases.
Spherical CNNs on Unstructured Grids (Jiang et al., ICLR 2019) [Code]
This paper is where my mutually-exclusive division of the related work falls apart. This approach uses the subdivided icosahedron, but it is also a type of reparameterization method. This work, often abbreviated as UG-SCNN, reparameterizes convolution as a linear combination of differential operators on the surface of an icosahedral mesh. This was the first work to attempt to take advantage of the lower distortion present in the icosahedron representation as well as the efficiency gains that can be achieved by reparameterizing the convolution. This method circumvents the scalability problem of kernel modifications by leveraging fast differential computations to approximate convolution, and it scales better to higher resolution spherical images as a result. This scalability comes at the cost of transferability, however. Because it no longer uses the traditional convolution operation, it does not permit network reuse. That being said, it still provides some of the best performance to-date for spherical image tasks.
Gauge Equivariant Convolutional Networks and the Icosahedral CNN (Cohen et al., ICML 2019)
I’m not going to go into too much depth on this paper, because someone else has already written a great explanatory overview of it. The high-level, though, is that Cohen et al. recognize that the icosahedral net consists of five parallelogram strips that can have one of six orientations, or gauges, around the icosahedron. They use these strips to define an atlas of charts that relate the planar parallelograms to locations on the icosahedron. They are then able to apply standard 2D 3x3 convolutional filters (masking out 2 weights) on the charts, using a gauge transform to ensure consistent feature orientation. Like the other methods in this section, this paper addresses the issue of filter orientability. However, perhaps speaking to the credit of the author, Taco Cohen, this work, like his previous Spherical CNNs paper, formalizes the problem in a rigorous way.
There are a couple of things to observe from this method. First, the use of the standard 2D convolution operator. This is great, because many of the other approaches either modify the convolutional kernel (i.e. location-adaptive methods) or approximate it (i.e. reparameterization methods). Those changes either slow things down, inhibiting scalability, or break transferability. By using the 2D convolution, this approach can take advantage of efficient convolution implementations. The drawback of this approach is that the charts mean we can’t use Loop subdivision to more closely approximate the sphere. Unlike Mapped Convolutions and UG-SCNN, which operate on a mesh representation, we are limited here to the distortion-reduction properties of the level 0 icosahedron (which are still pretty good, by the way).
SpherePHD: Applying CNNs on a Spherical PolyHeDron Representation of 360 Degree Images (Lee et al., CVPR 2019)
Like the three laid out above, this paper also proposed the use of the subdivided icosahedron as a spherical image representation. This work analyzes the distortion properties of the representation and defines new, orientation-dependent kernels for convolution and spatial pooling. The authors also propose a weight-sharing design to address the differing orientations of faces across the icosahedron. In some ways, because Lee et al. redefine the convolution operator, this can be considered a reparameterization method as well.
Orientation-Aware Semantic Segmentation on Icosahedron Spheres (Zhang et al., ICCV 2019)
This final icosahedral paper of 2019 proposes a method that makes use of the icosahedral net. The authors define a special hexagonal convolution on the vertices of the icosahedron that can interpolate to and from the standard 2D convolution operation. In this way, they provide transferability for existing networks, while still operating on the triangularly tessellated icosahedron.
Focused on the task of semantic segmentation, this approach demonstrates the best scalability and transferability of the 2019 papers (and the ones that came before them). Like Cohen et al. (2019), it is limited to the distortion properties of the level 0 icosahedron because it represents an image on the icosahedral net. Once again, though, let’s remember that even the level 0 icosahedron is a lot better than the cube map and equirectangular image in terms of distortion. This shows up in the high accuracy and IOU scores for semantic segmentation results in this approach.
Representational drawbacks
Despite all this recent work touting the benefits of the subdivided icosahedron, it has some drawbacks as well.
First, it’s comprised of (“tessellated by”) a bunch of triangles. The 2D convolution we know and love is built for pixel grids. As a result, these methods have to either:
(1) modify the convolutional kernel, which means we can’t take advantage of super-efficient implementations provided by many popular deep learning libraries (e.g. PyTorch, TensorFlow, cuDNN, etc.), or
(2) add some extra operations like special padding, transforms, or interpolation to the pipeline
With these changes, we typically either run into scalability issues for high resolution images and deep architectures or we lose or impede network transferability.
The second drawback actually comes from something we thought was a benefit. It turns out that our beautiful analogy between up-sampling and subdivision is an albatross.
With this analogy, if we want to represent high resolution spherical images, we need to use higher and higher subdivision levels to get there. There’s a reason most of the listed work operated at levels 5 (“UG-SCNN,” “Gauge Equivariant CNNs”), 7 (“SpherePHD,” “Mapped Convolutions”), and, most recently, 8 (“Orientation-Aware Semantic Segmentation”).
Our subdivision-to-up-sampling analogy does not scale!
But, this is a problem! Remember what we said earlier:
Discretizing the world into pixels is a lossy operation.
A measure of how much detail is preserved by any pixel representation is the angular resolution of the image. This is simply the field of view in a certain direction divided by the number of pixels along that dimension. A lower angular resolution (fewer degrees per pixel) means a more detailed image.
Take a look at the angular resolution of a central-perspective VGA image below. It has a much lower angular resolution than the levels 5, 7, and 8 spherical images. It’s most similar to a level 10, which is 4 times larger than the highest resolution examined by the icosahedral methods we just reviewed.
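As a rough, back-of-the-envelope comparison (the 60° horizontal field of view assumed for the VGA camera here is my own choice):

```python
def angular_resolution(fov_deg, pixels):
    """Degrees of field of view covered by each pixel along one dimension."""
    return fov_deg / pixels

print(angular_resolution(60, 640))    # VGA with a 60 deg FOV: ~0.094 deg/pixel
print(angular_resolution(360, 1024))  # 1024-wide equirectangular: ~0.35 deg/pixel
print(angular_resolution(360, 4096))  # 4096-wide ("4k") equirectangular: ~0.088 deg/pixel
```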
If we want to provide our network with the same high level of detail that is available to our state-of-the-art networks trained on central-perspective images, we need to scale to much higher resolution spherical images than we currently do. But this subdivision/up-sampling analogy is holding us back!
Let’s quickly recall why we wanted to use the icosahedron to begin with: distortion reduction.
It turns out that we’re about as good as we’re going to get after ~3 subdivisions.
Any additional subdividing is only necessary to match the spherical image resolution. This is problematic if we think back to those 3 guiding principles I laid out at the beginning of this article. We need scalability for 360° images. But this analogy is not the way to get there.
A potential compromise solution: tangent images
In the last part of this section, I am going to explain my most recent work on this subject. I am a big believer that approaches that focus on new representations, like the one I present here, provide the best bet for an all-around solution for 360° computer vision.
Tangent Images for Mitigating Spherical Distortion (Eder et al., CVPR 2020) [Code]
This work addresses many of the shortcomings of the aforementioned research (representational difficulties, scalability limitations, and transferability concerns), and:
It provides the best scalability and transfer performance of any approach so far, by a wide margin.
One of the reasons for this is that, in this approach, I depart slightly from the true icosahedron. Instead I propose a new representation, derived from the icosahedron. I call this the tangent image representation.
What are tangent images?
Tangent images are the gnomonic projection of a spherical image onto square, oriented pixel grids set tangent to the sphere at the center of each icosahedral face.
To create tangent images, we first set a base level, b, of subdivision. This determines our distortion characteristic, the number of tangent images we’re going to generate (it’s equal to the number of faces of that base level), and the field of view of each tangent image.
The dimension of each square tangent image, d, will depend on the resolution of our spherical image, s (in terms of equivalent subdivision level), by the relation:
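$$d = 2^{\,s - b}$$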
So let’s say we have a 4k equirectangular image, which is roughly equivalent to a level 10 icosahedron. With a base level of 1, we will create a set of 80 512x512 pixel images.
The nice thing about this design is that it actually preserves the factor-of-four scaling we get from image up-sampling and subdivision, but without tethering subdivision to 360° image resolution.
With this design, tangent images decouple subdivision level from image resolution, resolving the problem of scalability. As they are derived from the icosahedron, they provide a very low distortion characteristic as well:
And finally, because they are a pixel grid representation, they require no changes to the convolutional kernel, nor any special padding, transformation, or interpolation during inference, which makes network transfer extremely simple.
In fact, using them is easy. Simply resample to tangent images, run whatever algorithm you want, and then resample back to the sphere. Two resampling operations are all you need:
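As a minimal sketch of that workflow (the two resampling helpers and the function signature here are hypothetical placeholders, not a published API):

```python
def resample_to_tangent_images(equirect_image, base_level, sphere_level):
    """Hypothetical placeholder for the sphere-to-tangent-image resampling."""
    raise NotImplementedError

def resample_to_sphere(outputs, base_level, sphere_level):
    """Hypothetical placeholder for the tangent-image-to-sphere resampling."""
    raise NotImplementedError

def run_on_sphere(equirect_image, network, base_level=1, sphere_level=10):
    # 1. Resample the spherical image to a set of tangent images
    tangent_imgs = resample_to_tangent_images(equirect_image, base_level, sphere_level)
    # 2. Run any off-the-shelf, central-perspective algorithm on each one
    outputs = [network(img) for img in tangent_imgs]
    # 3. Resample the per-image outputs back onto the sphere
    return resample_to_sphere(outputs, base_level, sphere_level)
```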
I’ll point you to the paper for the nitty-gritty details, but suffice it to say that, with tangent images:
- We can efficiently operate at at least 4x higher resolution than any prior work (level 10 spherical images)
- We see a 10% improvement in network transfer performance, without fine-tuning
- And we even unlock improved performance thanks to the extra-wide field of view of 360° images, compared to an equivalent network trained only on central-perspective images.
The final, really cool thing about tangent images is:
They’re not just for deep learning!
Check out how tangent images improve SIFT keypoint detection (Lowe, 1999), which is a key step in structure from motion or sparse SLAM:
Even low-level tasks, like edge detection, are improved using tangent images:
This is possible because:
We have addressed spherical distortion through the representation, yet we still use a pixel grid.
This facilitates very easy usage and a very broad application scope.
Now, I don’t want to end this on a low note, but there are still some drawbacks to tangent images.
Perhaps the most important one is that we are, in some ways, giving up the biggest benefit of 360: the wide field of view. You can think of tangent images as modeling a polydioptric rig with each camera arranged in a specific way, with a specific field of view, sharing the same center of projection.
Although the initial results are exciting, there’s still a lot of interesting work to be done for more advanced ways to share information between views during the learning process.
The Path Forward
I want to wrap up with some of my own thoughts on the subject, distilled from a few years of full immersion into the 360° image problem.
First and foremost, there are some really great emergent technologies that are seeking to leverage 360° and wide field of view imaging. Immersive experiences seem to have hit the market first, with established companies like Google, Facebook, and Zillow leading the way. However, many smaller companies have incorporated this technology as well (e.g., Threshold360, Virtually Anywhere).
Other applications are not far behind. In addition to the exciting medical and robotics applications we talked about at the top of this article, products are also quickly appearing for remote inspection and documentation (OpenSpace, Holo Builder), leveraging the 360° field of view for an easier user experience.
Commodity 360° cameras are a fairly recent innovation. With the limitless bounds of human creativity and the desire to capture the world in a more expansive way, this technology is only going to grow.
Yet if we are to enable this growth, we need to build effective tools to do so.
Deep learning has been a revolutionary innovation. If we want to leverage it in support of 360° technologies, we need to get rid of this 360° image performance gap.
From my own experience:
I believe that the best path forward in this regard is through the development of better 360° image representations that leverage existing optimizations.
We can modify every existing computer vision or deep learning algorithm for use with 360° images, but this will be a slow and painstaking process. A more effective, general solution will be the development of improved or novel image representations, like tangent images, that can enable improvements for all algorithms, across the board.
2D convolution is so ubiquitous that it is being physically embedded in chips. Our solution to the 360° image problem shouldn’t fight this progress. It should find a way to make it work for us.
So, for any researchers who may be reading this to gather background on the subject, I recommend thinking about the image representation first and letting that guide your solution to the problem.
Associated Reading List:
Here is the aggregated list of the papers discussed in this article. To really get an overview of what’s been done, I highly suggest diving into the papers themselves.
- Cohen, T., Weiler, M., Kicanaoglu, B., and Welling, M. (2019). Gauge equivariant convolutional networks and the icosahedral CNN. In International Conference on Machine Learning, pages 1321–1330.
- Cohen, T. S., Geiger, M., Köhler, J., and Welling, M. (2018). Spherical CNNs. In International Conference on Learning Representations.
- Coors, B., Condurache, A. P., and Geiger, A. (2018). Spherenet: Learning spherical representations for detection and classification in omnidirectional images. In European Conference on Computer Vision (ECCV), pages 525–541. Springer.
- Eder, M. and Frahm, J. M. (2019). Convolutions on spherical images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1–5.
- Eder, M., Price, T., Vu, T., Bapat, A., and Frahm, J. M. (2019). Mapped Convolutions. arXiv preprint arXiv:1906.11096.
- Eder, M., Shvets, M., Lim, J., and Frahm, J. M. (2020). Tangent Images for Mitigating Spherical Distortion. Accepted to The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Esteves, C., Allen-Blanchette, C., Makadia, A., and Daniilidis, K. (2018). Learning SO(3) equivariant representations with spherical CNNs. In Proceedings of the European Conference on Computer Vision (ECCV), pages 52–68.
- Fernandez-Labrador, C., Facil, J. M., Perez-Yus, A., Demonceaux, C., Civera, J., and Guerrero, J. J. (2020). Corners for layout: End-to-end layout recovery from 360 images. IEEE Robotics and Automation Letters, 5(2), 1255–1262.
- Jiang, C. M., Huang, J., Kashinath, K., Prabhat, Marcus, P., and Niessner, M. (2019). Spherical CNNs on unstructured grids. In International Conference on Learning Representations.
- Perraudin, N., Defferrard, M., Kacprzak, T., and Sgier, R. (2018). Deepsphere: Efficient spherical convolutional neural network with healpix sampling for cosmological applications. Astronomy and Computing.
- Su, Y.-C. and Grauman, K. (2017). Learning spherical convolution for fast features from 360 imagery. In Advances in Neural Information Processing Systems, pages 529–539.
- Su, Y.-C. and Grauman, K. (2019). Kernel transformer networks for compact spherical convolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Tateno, K., Navab, N., and Tombari, F. (2018). Distortion-aware convolutional filters for dense prediction in panoramic images. In European Conference on Computer Vision (ECCV), pages 732–750. Springer.
- Xiong, B. and Grauman, K. (2018). Snap angle prediction for 360 panoramas. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–18.
- Zhang, C., Liwicki, S., Smith, W., and Cipolla, R. (2019). Orientation-aware semantic segmentation on icosahedron spheres. In Proceedings of the IEEE International Conference on Computer Vision, pages 3533–3541.
- Zioulis, N., Karakottas, A., Zarpalas, D., and Daras, P. (2018). Omnidepth: Dense depth estimation for indoors spherical panoramas. In Proceedings of the European Conference on Computer Vision (ECCV), pages 448–465.
Additional Citations
Images
Arnold, Karen. Golden Retriever Dog. www.publicdomainpictures.net/en/view-image.php?image=35696&picture=golden-retriever-dog. Accessed 05–12–2020. Public domain.
Hastings-Trew, James. Earth Texture Map. http://planetpixelemporium.com/earth8081.html. Accessed 04–16–2019. Used with written permission of creator.
Citations
Kimerling, J. A., Sahr, K., White, D., and Song, L. (1999). Comparing geometrical properties of global grids. Cartography and Geographic Information Science, 26(4):271–288.
Krishnan, G. and Nayar, S. K. (2009). Towards a true spherical camera. In Human Vision and Electronic Imaging XIV, volume 7240, page 724002. International Society for Optics and Photonics.
Loop, C. (1987). Smooth subdivision surfaces based on triangles. Master’s thesis, University of Utah, Department of Mathematics.
Snyder, J. P. (1987). Map projections — A working manual (Vol. 1395). US Government Printing Office.