Six Colors

Apple, technology, and other stuff


By Joe Rosensteel

Going in-depth on iPhone Spatial Video

The iOS 17.2 beta has brought the ability to shoot Spatial Video for the forthcoming Vision Pro, and a handful of press participated in a demo where they could view Spatial Video on the Vision Pro headset. While the footage recorded by Apple with the cameras in the Vision Pro headset naturally had better stereo separation than the iPhone footage, most members of the press seemed impressed by the content taken from a device that’s far more likely to be available to capture memories. (I’m more than a little curious to see a demo like that myself, but I’d settle for some good sushi.)

Earlier this summer I gave a quick overview of stereoscopic terms and filmmaking. Part of that post had to do with guessing at what Spatial Video was. Apple’s marketing materials show third-person vantages of people experiencing perfectly separated, holographic recreations of things, but the reality is that Spatial Video has a lot more in common with the left-eye/right-eye combo of traditional stereoscopic video.

In my piece this summer I linked to Chris Flick’s WWDC video, which covers general stereo terms and how Apple is handling streaming stereo content. The file container has a left-eye video stream and then metadata covering the differences between the two eyes in order to reconstruct the right-eye view. When Apple unveiled the iPhone 15 Pro and Pro Max, they touted that a beta update would bring the ability to shoot that spatial video, but they didn’t get into details, and showed another sci-fi hologram thing.

Computer, arch.
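If you’re curious whether a given file even carries that second-eye information, a rough AVFoundation sketch like the one below can check for it. This is my own illustration, not Apple’s sample code, and the `containsStereoMultiviewVideo` media characteristic is my understanding of the iOS 17 SDK constant, so treat the specifics as assumptions:

```swift
import AVFoundation

// Minimal sketch: ask whether any video track in the file advertises the
// stereo multiview characteristic that spatial videos are expected to carry.
// The exact constant name is an assumption on my part.
func isLikelySpatialVideo(at url: URL) async throws -> Bool {
    let asset = AVURLAsset(url: url)
    let videoTracks = try await asset.loadTracks(withMediaType: .video)
    for track in videoTracks {
        let characteristics = try await track.load(.mediaCharacteristics)
        if characteristics.contains(.containsStereoMultiviewVideo) {
            return true
        }
    }
    return false
}
```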

The iPhone 15 Pro’s Wide and Ultra Wide cameras are arranged so that when the iPhone is held horizontally, the software can crop in on them and get two similar-ish views. A reminder of the tech specs for the iPhone 15 Pro lenses and sensors being combined for Spatial Video:

  • 48MP Main: 24 mm, ƒ/1.78 aperture, second‑generation sensor‑shift optical image stabilization, 100% Focus Pixels, support for super‑high‑resolution photos (24MP and 48MP)
  • 12MP Ultra Wide: 13 mm, ƒ/2.2 aperture and 120° field of view, 100% Focus Pixels

I wondered what Apple would do to augment the left- and right-eye video capture to match them better, since anyone with an iPhone knows there is a perceptible quality difference between the 0.5x and 1.0x lenses. But as my friend Dan Sturm pointed out on Mastodon, Apple doesn’t seem to be doing a whole heck of a lot:

First things first, I have to admit, I’ve been obsessing over trying to pull this stuff apart since the beta came out. It’s so easy to get caught up in the excitement around these types of things because it’s a new, magical experience. But there is no magic. This is exactly what you would expect Stereo3D footage from an iPhone to look like.

It’s very interesting to me how many [Stereo3D] “rules” they’re just ignoring here. The [depth of field] on the lenses does not match. The detail, color, compression, stabilization (or lack thereof) does not match. The final image is not what one would call “good”, but it does work. It is [Stereo3D] footage from an iPhone.

Admittedly, for many people, it will feel like magic.

Dan’s referring to the slight differences between the two vantage points, which were among the problems with iPhone video capture that I described back in June. Stu Maschwitz and others found similar results, so it’s pretty safe to say it’s not a fluke.

To capture good 3D video, you ideally want identical lenses and sensors synchronously capturing what’s happening, so that the only difference is the horizontal offset. Any differences in color, value, or softness will seem to shimmer as your brain combines the two images. It’ll still have the illusion of being 3D, but it will be fatiguing or uncomfortable to watch.
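As one crude way to see those mismatches for yourself, here’s a hypothetical Core Image helper (my own sketch, unrelated to anything Apple ships) that boils a frame down to its average exposure. Run it on a matching left-eye and right-eye frame and the gap between the two numbers is the sort of difference your brain otherwise registers as shimmer; it won’t capture color, detail, or stabilization mismatches, just overall brightness.

```swift
import CoreImage
import CoreImage.CIFilterBuiltins

// Hypothetical helper: average the whole frame down to one pixel and weight
// it as Rec. 709 luma. Comparing the result for corresponding left- and
// right-eye frames gives a rough exposure-mismatch number.
func averageLuma(of frame: CIImage, using context: CIContext) -> Double {
    let average = CIFilter.areaAverage()
    average.inputImage = frame
    average.extent = frame.extent

    var pixel = [UInt8](repeating: 0, count: 4)
    context.render(average.outputImage!,
                   toBitmap: &pixel,
                   rowBytes: 4,
                   bounds: CGRect(x: 0, y: 0, width: 1, height: 1),
                   format: .RGBA8,
                   colorSpace: CGColorSpaceCreateDeviceRGB())

    return 0.2126 * Double(pixel[0]) + 0.7152 * Double(pixel[1]) + 0.0722 * Double(pixel[2])
}
```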

Without personally having access to a Vision Pro, I can only tell you things based on these videos we pulled apart using Spatialify, an iOS app that’s available only via a TestFlight beta. It is possible that visionOS is doing some additional processing of these videos as it decodes them, though my hunch is that it will continue to be exactly what it appears to be: two images from two very different cameras, put together.

There is also the matter of these videos being limited to 1080p30. I understand that the different focal lengths require substantial cropping on the Ultra Wide camera, with a corresponding drop in quality, but I’m less certain why the crop is exactly 1920×1080, since that’s not even the sensor’s aspect ratio. This video isn’t going into a 2000s-era TV; it’s meant to be viewed in an infinite canvas.
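For a rough sense of how much that crop costs, here’s some back-of-the-envelope math using the equivalent focal lengths from Apple’s spec sheet above. These are my own ballpark numbers, not anything Apple has published about its actual processing:

```swift
// Ballpark numbers only, based on Apple’s published 35mm-equivalent focal
// lengths; the real pipeline is unknown.
let mainFocalLength = 24.0       // Main (1.0x) camera, mm equivalent
let ultraWideFocalLength = 13.0  // Ultra Wide (0.5x) camera, mm equivalent
let ultraWideMegapixels = 12.0

// To match the Main camera’s narrower field of view, the Ultra Wide image
// has to be cropped by roughly the ratio of the focal lengths in each axis.
let cropFactor = mainFocalLength / ultraWideFocalLength            // ≈ 1.85
let remainingMegapixels = ultraWideMegapixels / (cropFactor * cropFactor)

print(cropFactor, remainingMegapixels)
// ≈ 3.5 MP survive the crop, versus the ~2.1 MP a 1920×1080 frame needs:
// enough pixels on paper, but not much headroom before noise and softness
// creep in.
```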

This limitation, more than anything else, undercuts the case for capturing Spatial Video right now. No, resolution isn’t everything—but it’s also not nothing. People also tend to shoot vertical video because of how we hold our phones for both recording and viewing. This feature is asking people to choose between sharing a video optimized for phone viewing and recording something that’s going to be part of a personal viewing experience.

Also consider that Apple isn’t letting the iPhone 15 Pro capture Spatial Photos. Stereoscopic photography has been around longer than motion picture film. That there’s no function to take a photo suggests something about the quality of the imagery. After all, it’s very easy to scrutinize a single still frame, and a lot easier to forgive flaws in a constantly moving image.

I’m not saying that Apple’s Spatial Video implementation is bad. But I would be hesitant to recommend anyone switch their Camera app over to Spatial Video and shoot all of their videos with it right now. For the time being, I think people generally would be happier if they continued to shoot and share video as they normally do. You can always watch a video floating on a card in space with a Vision Pro headset, and at 4K resolution you can make it fill as much of your field of view as you might like.

So you still want to do it, huh?

If someone really does want to shoot Spatial Video, I’d recommend considering the subject matter first. In the demos Apple provided, they had a sushi chef making sushi for the journalists to record. The chef was near enough to the camera to have internal depth, as well as depth relative to the environment. Apple’s other videos also centered on people in environments.

From CNET’s Scott Stein:

The camera app makes recommendations on turning the camera sideways, and staying a certain distance from a subject. I was told to stay within 3 to 8 feet of what I was shooting for a good spatial video, but when I shot my test recording of someone making sushi at a table I got up closer and it looked perfectly fine. I also recorded in a well-lit room, but apparently the spatial video recording mode prevents adjustments on brightness and contrast, which means low-light recording may end up grainier than normal videos.

Shooting something like a wide-open space, with nothing in the foreground or midground, is not going to look or feel like much of anything. Please, I beg you, don’t shoot Spatial Video of fireworks—there will be no depth at all. Just because you think “it’ll be in 3D” doesn’t mean the fireworks have any internal depth at that distance. You want that? Then record someone holding a sparkler.

Jason Snell took his iPhone 15 Pro to a Cal game to shoot some Spatial Video. (See my video breaking it down.) Being in a stadium might feel big and grand, because you’re immersed in a large space that surrounds you—but it’s not something the iPhone can really capture. Spatial Video doesn’t surround you at all. You’re looking into a window at a stadium, and you’re very much separated from it rather than immersed in it. It will feel pretty flat without someone in the foreground as a subject. So definitely keep that window metaphor in mind.

Shooting something extremely close would likely cause issues with things breaking frame, but you could get close to a subject provided it was “sticking out” toward camera and not crossing the entire field of view.

Apple tries to mitigate frame-breaking by applying a fuzzy falloff at the edges instead of a hard termination, where something exiting the frame would otherwise be visible in only one eye and potentially cause strain. Be mindful of that, because you’re not going to see a soft-matte edge as you’re recording in the Camera app.
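To illustrate what that soft matte is doing, here’s a rough Core Image sketch of the general idea. It’s my own approximation, not Apple’s implementation: feather each eye’s frame to black near the edges so an object leaving frame fades out instead of cutting off hard in one eye only.

```swift
import CoreImage
import CoreImage.CIFilterBuiltins

// Illustrative only: fade a frame to black near its edges. Applied to both
// eyes, this softens the "window violation" where something exits the frame
// in one eye before the other.
func softMatted(_ frame: CIImage, feather: CGFloat = 40) -> CIImage {
    let extent = frame.extent

    // A white rectangle inset from the frame edge, blurred to create a
    // feathered falloff, becomes the blend mask.
    let inset = extent.insetBy(dx: feather, dy: feather)
    let blur = CIFilter.gaussianBlur()
    blur.inputImage = CIImage(color: .white).cropped(to: inset)
    blur.radius = Float(feather / 2)
    let mask = blur.outputImage!.cropped(to: extent)

    // Show the frame where the mask is bright, black where it falls off.
    let blend = CIFilter.blendWithMask()
    blend.inputImage = frame
    blend.backgroundImage = CIImage(color: .black).cropped(to: extent)
    blend.maskImage = mask
    return blend.outputImage!.cropped(to: extent)
}
```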

Try to always record Spatial Video in well-lit areas. There will still be subtle shifts in everything, but the lenses and sensors match better when they don’t have to compensate for high-ISO sensor noise and differences in aperture.

Eyes toward the future

I might sound pessimistic, but I’m not—I’m only skeptical of what’s currently on offer. It’s early days for Spatial Video. The first pass at Portrait Mode was really, really bad, but it’s now refined enough that it can be executed as a post-processing option. Improvements over many years in hardware and software got us to this point, and if Spatial Video is a focus for Apple, I’m sure it’ll get there too.

I would caution that I’m only making that comparison to Portrait Mode in terms of the march of progress, not because I ever envision Apple shipping an entirely post-processed Spatial Video mode. As I mentioned in my previous post, when you offset something in post you’re cutting it up into layers, and every place where there’s something in front of something else is an occlusion. Where there are occlusions, there’s no data—and that has to be filled in. Could Apple do that with generative fill that’s stable across a frame range? Maybe, but that seems like more of a Google thing.

Maybe in a future iPhone we’ll have a better Ultra Wide camera sensor, lens, and optical stabilization? Perhaps we’ll see more advanced machine learning to transform the more detailed Wide camera data to cover over the inconsistencies in the Ultra Wide’s view? Maybe there will be the option to record 4K and Spatial Video with a later iPhone so you feel like you have the best of both worlds?

When Apple lets you take a still Spatial Photo, that will be the signal that they’re confident in the image quality and not just the emotional content.

There’s no FOMO here, not right now, for Spatial Video. If the primary reason you shoot video is to remember moments with people, then keep that in mind. The people are more important than whether the video is 2D or 3D.

[Joe Rosensteel is a VFX artist and writer based in Los Angeles.]

