John Carmack on Developing the Netflix App for Oculus

The Netflix Tech Blog

Hi, this is Anthony Park, VP of Engineering at Netflix. We've been working with Oculus to develop a Netflix app for Samsung Gear VR. The app includes a Netflix Living Room, allowing members to get the Netflix experience from the comfort of a virtual couch, wherever they bring their Gear VR headset. It's available to Oculus users today. We've been working closely with John Carmack, CTO of Oculus and programmer extraordinaire, to bring our TV user interface to the Gear VR headset. Well, honestly, John did most of the development himself(!), so I've asked him to be a guest blogger today and share his experience with implementing the new app. Here's a sneak peek at the experience, and I'll let John take it from here...

Netflix Living Room on Gear VR



The Netflix Living Room

Despite all the talk of hardcore gamers and abstract metaverses, a lot of people want to watch movies and shows in virtual reality. In fact, during the development of Gear VR, Samsung internally referred to it as the HMT, for "Head Mounted Theater." Current VR headsets can't match a high end real world home theater, but in many conditions the "best seat in the house" may be in the Gear VR that you pull out of your backpack.
Some of us from Oculus had a meeting at Netflix HQ last month, and when things seemed to be going well, I blurted out "Grab an engineer, let's do this tomorrow!"
That was a little bit optimistic, but when Vijay Gondi and Anthony Park came down from Netflix to Dallas the following week, we did get the UI running in VR on the second day, and video playing shortly thereafter.
The plan of attack was to take the Netflix TV codebase and present it on a virtual TV screen in VR. Ideally, the Netflix code would be getting events and drawing surfaces, not even really aware that it wasn't showing up on a normal 2D screen.
I wrote a "VR 2D Shell" application that functioned like a very simplified version of our Oculus Cinema application; the big screen is rendered with our peak-quality TimeWarp layer support, and the environment gets a neat dynamic lighting effect based on the screen contents. Anything we could get into a texture could be put on the screen.
The core Netflix application uses two Android Surfaces – one for the user interface layer, and one for the decoded video layer. To present these in VR I needed to be able to reference them as OpenGL textures, so the process was: create an OpenGL texture ID, use that to initialize a SurfaceTexture object, then use that to initialize a Surface object that could be passed to Netflix.
For the UI surface, this worked great – when the Netflix code does a swapbuffers, the VR code can have the SurfaceTexture do an update, which latches the latest image into an EGL external image that can then be texture mapped onto geometry by the GPU.
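For reference, here is a minimal sketch of that texture chain using the stock Android SurfaceTexture / Surface APIs. This is illustrative rather than the actual app code; the EGL context setup and the Netflix-side plumbing are omitted.

```java
import android.graphics.SurfaceTexture;
import android.opengl.GLES11Ext;
import android.opengl.GLES20;
import android.view.Surface;

// Sketch of wiring an Android Surface to an OpenGL external texture,
// assuming a current EGL context on the calling thread.
public class SurfaceToTexture {
    private final int textureId;
    private final SurfaceTexture surfaceTexture;
    private final Surface surface;

    public SurfaceToTexture(int width, int height) {
        // 1. Create an OpenGL texture ID for the external image target.
        int[] ids = new int[1];
        GLES20.glGenTextures(1, ids, 0);
        textureId = ids[0];
        GLES20.glBindTexture(GLES11Ext.GL_TEXTURE_EXTERNAL_OES, textureId);
        GLES20.glTexParameteri(GLES11Ext.GL_TEXTURE_EXTERNAL_OES,
                GLES20.GL_TEXTURE_MIN_FILTER, GLES20.GL_LINEAR);
        GLES20.glTexParameteri(GLES11Ext.GL_TEXTURE_EXTERNAL_OES,
                GLES20.GL_TEXTURE_MAG_FILTER, GLES20.GL_LINEAR);

        // 2. Use the texture ID to initialize a SurfaceTexture.
        surfaceTexture = new SurfaceTexture(textureId);
        surfaceTexture.setDefaultBufferSize(width, height);

        // 3. Use the SurfaceTexture to initialize a Surface that the
        //    producer (here, the Netflix code) can render into as if it
        //    were a normal window.
        surface = new Surface(surfaceTexture);
    }

    public Surface getSurface() {
        return surface;
    }

    // Called by the VR renderer: latch the most recently queued frame into
    // the EGL external image bound to textureId, so it can be texture
    // mapped onto the virtual screen geometry.
    public void latchLatestFrame() {
        surfaceTexture.updateTexImage();
    }

    public int getTextureId() {
        return textureId;
    }
}
```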
The video surface was a little more problematic. To provide smooth playback, the video frames are queued a half second ahead, tagged with a "release time" that the Android window compositor will use to pick the best frame each update. The SurfaceTexture interface that I could access as a normal user program only had an "Update" method that always returned the very latest frame submitted. This meant that the video came out a half second ahead of the audio, and stuttered a lot.
To fix this, I had to make a small change in the Netflix video decoding system so it would call out to my VR code right after it submitted each frame, letting me know that it had submitted something with a particular release time. I could then immediately update the surface texture and copy it out to my own frame queue, storing the release time with it. This is an unfortunate waste of memory, since I am duplicating over a dozen video frames that are also being buffered on the surface, but it gives me the timing control I need.
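A rough sketch of what such a release-time-tagged frame queue might look like is below. The class and method names are mine, not the shipping code, and the GL work of copying the latched external image into a private texture is elided; only the queue bookkeeping is shown.

```java
import java.util.ArrayDeque;

// Illustrative frame queue keyed by release time. The copy of each latched
// external image into a private texture happens elsewhere; this class only
// tracks (texture, releaseTime) pairs and picks the right one per frame.
public class TimedFrameQueue {
    private static final class Frame {
        final int textureId;      // our private copy of the decoded frame
        final long releaseTimeNs; // when the compositor should show it
        Frame(int textureId, long releaseTimeNs) {
            this.textureId = textureId;
            this.releaseTimeNs = releaseTimeNs;
        }
    }

    private final ArrayDeque<Frame> queue = new ArrayDeque<>();

    // Called from the hook in the video decoding path, right after a frame
    // with the given release time has been submitted and copied.
    public synchronized void onFrameSubmitted(int copiedTextureId, long releaseTimeNs) {
        queue.addLast(new Frame(copiedTextureId, releaseTimeNs));
    }

    // Called once per rendered frame: return the newest frame whose release
    // time has passed, or -1 to keep showing the previous frame. Older
    // eligible frames are simply skipped here; a real implementation would
    // recycle their textures.
    public synchronized int pickFrameForDisplay(long nowNs) {
        int best = -1;
        while (!queue.isEmpty() && queue.peekFirst().releaseTimeNs <= nowNs) {
            best = queue.pollFirst().textureId;
        }
        return best;
    }
}
```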
Initially, input was handled with a Bluetooth joypad emulating the LRUD / OK buttons of a remote control, but it was important to be able to control it using just the touchpad on the side of Gear VR. Our preferred VR interface is "gaze and tap", where a cursor floats in front of you in VR, and tapping is like clicking a mouse. For most things, this is better than gamepad control, but not as good as a real mouse, especially if you have to move your head a significant amount. Netflix has support for cursors, but it assumes you can turn the cursor on and off, which we don't really have a way to do.
We wound up with some heuristics driving the behavior. I auto-hide the cursor when the movie starts playing, inhibit cursor updates briefly after swipes, and send actions on touch up instead of touch down so you can perform swipes without also triggering touches. It isn't perfect, but it works pretty well.
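Those heuristics amount to a small amount of state around the touchpad events. The sketch below is illustrative rather than the actual implementation; the tap threshold and cursor-inhibit duration are made-up values.

```java
// Illustrative touchpad heuristics: classify a touch sequence as either a
// swipe (LRUD navigation) or a tap (click at the gaze cursor), and only
// dispatch on touch-up so a swipe never also triggers a tap.
public class TouchpadInput {
    private static final float TAP_MAX_MOVE = 0.05f;   // normalized pad units (assumed)
    private static final long CURSOR_INHIBIT_MS = 300; // cursor pause after a swipe (assumed)

    private float downX, downY;
    private long cursorInhibitUntilMs;
    private boolean cursorVisible = true;

    public void onTouchDown(float x, float y) {
        downX = x;
        downY = y;
        // Nothing is dispatched yet; wait for touch-up to decide.
    }

    public void onTouchUp(float x, float y, long nowMs) {
        float dx = x - downX;
        float dy = y - downY;
        if (Math.abs(dx) < TAP_MAX_MOVE && Math.abs(dy) < TAP_MAX_MOVE) {
            dispatchClickAtGazeCursor();
        } else {
            // A swipe maps to the remote's left/right/up/down keys, and the
            // gaze cursor is briefly frozen so the head motion that usually
            // accompanies a swipe doesn't also move the selection.
            dispatchDirectionalKey(dx, dy);
            cursorInhibitUntilMs = nowMs + CURSOR_INHIBIT_MS;
        }
    }

    public void onPlaybackStarted() {
        cursorVisible = false; // auto-hide the cursor when the movie starts
    }

    public boolean shouldUpdateCursor(long nowMs) {
        return cursorVisible && nowMs >= cursorInhibitUntilMs;
    }

    private void dispatchClickAtGazeCursor() { /* send OK / select to the UI */ }
    private void dispatchDirectionalKey(float dx, float dy) { /* send LRUD key */ }
}
```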

Layering of the Android Surfaces within the Netflix Living Room



Display

The screens on the phones that support Gear VR are all 2560x1440 resolution, which is split in half to give each eye a 1280x1440 view that covers approximately 90 degrees of your field of view. If you have tried previous Oculus headsets, that is more than twice the pixel density of DK2, and four times the pixel density of DK1. That sounds like a pretty good resolution for videos until you consider that very few people want a TV screen to occupy a 90 degree field of view. Even quite large screens are usually placed far enough away to be about half of that in real life.
The optics in the headset that magnify the image and allow your eyes to focus on it introduce both a significant spatial distortion and chromatic aberration that needs to be corrected. The distortion compresses the pixels together in the center and stretches them out towards the outside, which has the positive effect of giving a somewhat higher effective resolution in the middle where you tend to be looking, but it also means that there is no perfect resolution for content to be presented in. If you size it for the middle, it will need mip maps and waste pixels on the outside. If you size it for the outside, it will be stretched over multiple pixels in the center.
For synthetic environments on mobile, we usually size our 3D renderings close to the outer range, about 1024x1024 pixels per eye, and let it be a little blurrier in the middle, because we care a lot about performance. On high end PC systems, even though the actual headset displays are lower resolution than Gear VR, sometimes higher resolution scenes are rendered to extract the maximum value from the display in the middle, even if the majority of the pixels wind up being blended together in a mip map for display.
The Netflix UI is built around a 1280x720 resolution image. If that were rendered to a giant virtual TV covering 60 degrees of your field of view in the 1024x1024 eye buffer, you would have a very poor quality image, since you would only be seeing about a quarter of the pixels. With mip maps it would be a blurry mess; without them, all the text would alias, fizzing in and out as your head made tiny movements each frame.
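As a rough back-of-the-envelope check (assuming the virtual screen spans about 60 of the roughly 90 degrees the eye buffer covers): 60/90 × 1024 ≈ 683 eye-buffer pixels across for the 1280-pixel-wide UI, and a similar ratio vertically for the 16:9 screen (roughly 34 of 90 degrees, about 387 eye-buffer pixels for 720 UI pixels), so only around 0.53 × 0.54, a bit over a quarter of the UI pixels, actually land in eye-buffer pixels.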
The technique we use to get around this is to have special code for just the screen part of the view that can directly sample a single textured rectangle after the necessary distortion calculations have been done, and blend that with the conventional eye buffers. These are our "Time Warp Layers". This has limited flexibility, but it gives us the best possible quality for virtual screens (and also the panoramic cube maps in Oculus 360 Photos). If you have a joypad bound to the phone, you can toggle this feature on and off by pressing the start button. It makes an enormous difference for the UI, and is a solid improvement for the video content.
Still, it is drawing a 1280 pixel wide UI over maybe 900 pixels on the screen, so something has to give. Because of the nature of the distortion, the middle of the screen winds up stretching the image slightly, and you can discern every single pixel in the UI. As you get towards the outer edges, and especially the corners, more and more of the UI pixels get blended together. Some of the Netflix UI layout is a little unfortunate for this; small text in the corners is definitely harder to read.
So forget 4K, or even full-HD. 720p HD is the highest resolution video you should even consider playing in a VR headset today.
This is where content protection comes into the picture. Most studios insist that HD content only be played in a secure execution environment to reduce opportunities for piracy. Modern Android systems' video CODECs can decode into special memory buffers that literally can't be read by anything other than the video screen scanning hardware; untrusted software running on the CPU and GPU has no ability to snoop into the buffer and steal the images. This happens at the hardware level, and is much more difficult to circumvent than software protections.
The problem for us is that to draw a virtual TV screen in VR, the GPU fundamentally needs to be able to read the movie surface as a texture. On some of the more recent phone models we have extensions to allow us to move the entire GPU framebuffer into protected memory and then get the ability to read a protected texture, but because we can't write anywhere else, we can't generate mip maps for it. We could get the higher resolution for the center of the screen, but then the periphery would be aliasing, and we lose the dynamic environment lighting effect, which is based on building a mip map of the screen down to 1x1. To top it all off, the user-space timing queue that keeps the audio synced up wouldn't be possible.
The reasonable thing to do was just limit the streams to SD resolution – 720x480. That is slightly lower than I would have chosen if the need for a secure execution environment weren't an issue, but not too much. Even at that resolution, the extreme corners are doing a little bit of pixel blending.

Flow diagram for SD video frames to allow composition with VR

In an ideal world, the bitrate / resolution tradeoff would be made slightly differently for VR. On a retina class display, many compression artifacts aren't really visible, but the highly magnified pixels in VR put them much more in your face. There is a hard limit to how much resolution is useful, but every visible compression artifact is correctable with more bitrate.



Power Consumption

For a movie viewing application, power consumption is a much bigger factor than for a short action game. My target was to be able to watch a two hour movie in VR starting at 70% battery. We hit this after quite a bit of optimization, but the VR app still draws over twice as much power as the standard Netflix Android app.
When a modern Android system is playing video, the application is only shuffling the highly compressed video data from the network to the hardware video CODEC, which decompresses it to private buffers, which are then read by the hardware composer block that performs YUV conversion and scaling directly as it feeds it to the display, without ever writing intermediate values to a framebuffer. The GPU may even be completely powered off. This is pretty marvelous – it wasn't too long ago when a PC might use 100x the power to do it all in software.
For VR, in addition to all the work that the standard application is doing, we are rendering stereo 3D scenes with tens of thousands of triangles and many megabytes of textures in each one, and then doing an additional rendering pass to correct for the distortion of the optics.
When I first brought up the system in the most straightforward way with the UI and video layers composited together every frame, the phone overheated to the thermal limit in less than 20 minutes. It was then a process of finding out what work could be avoided with minimal loss in quality.
The bulk of a viewing experience should be pure video. In that case, we only need to mip-map and display a 720x480 image, instead of composing it with the 1280x720 UI. There were no convenient hooks in the Netflix codebase to say when the UI surface was completely transparent, so I read back the bottom 1x1 pixel mip map from the previous frame's UI composition and look at the alpha channel: 0 means the UI was completely transparent, and the movie surface can be drawn by itself. 255 means the UI is solid, and the movie can be ignored. Anything in between means they need to be composited together. This gives the somewhat surprising result that subtitles cause a noticeable increase in power consumption.
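One way to do that readback is to attach the smallest mip level of the UI texture to a framebuffer and read a single pixel, something along these lines. This is an illustrative sketch, not the shipping code; it assumes a GLES 3.0 context (which permits attaching a non-zero mip level) and that mipmaps were generated when the previous frame's composition was rendered.

```java
import android.opengl.GLES30;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative readback of the 1x1 mip level of the UI composition texture.
public final class UiAlphaProbe {
    private UiAlphaProbe() {}

    /**
     * Returns the alpha of the 1x1 mip level: 0 means the UI layer is fully
     * transparent (draw the movie surface by itself), 255 means it is fully
     * opaque (skip the movie), anything in between means composite both.
     */
    public static int readUiAlpha(int uiTextureId, int smallestMipLevel) {
        int[] fbo = new int[1];
        GLES30.glGenFramebuffers(1, fbo, 0);
        GLES30.glBindFramebuffer(GLES30.GL_FRAMEBUFFER, fbo[0]);
        GLES30.glFramebufferTexture2D(GLES30.GL_FRAMEBUFFER,
                GLES30.GL_COLOR_ATTACHMENT0, GLES30.GL_TEXTURE_2D,
                uiTextureId, smallestMipLevel);

        // Read back the single RGBA pixel of the 1x1 mip level.
        ByteBuffer pixel = ByteBuffer.allocateDirect(4).order(ByteOrder.nativeOrder());
        GLES30.glReadPixels(0, 0, 1, 1, GLES30.GL_RGBA, GLES30.GL_UNSIGNED_BYTE, pixel);

        GLES30.glBindFramebuffer(GLES30.GL_FRAMEBUFFER, 0);
        GLES30.glDeleteFramebuffers(1, fbo, 0);

        return pixel.get(3) & 0xFF; // alpha channel
    }
}
```

Because the probe reads the previous frame's composition, the pipeline stall from glReadPixels stays off the critical path for the current frame.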
I had initially implemented the VR gaze cursor by drawing it into the UI composition surface, which was a useful check on my intersection calculations, but it meant that the UI composition had to happen every single frame, even when the UI was completely static. Moving the gaze cursor back to its own 3D geometry allowed the screen to continue reusing the previous composition when nothing was changing, which is usually more than half of the frames when browsing content.
One of the big features of our VR system is the "Asynchronous Time Warp", where redrawing the screen and distortion correcting in response to your head movement is decoupled from the application's drawing of the 3D world. Ideally, the app draws 60 stereo eye views a second in sync with Time Warp, but if the app fails to deliver a new set of frames then Time Warp will reuse the most recent one it has, re-projecting it based on the latest head tracking information. For looking around in a static environment, this works remarkably well, but it starts to show the limitations when you have smoothly animating objects in view, or your viewpoint moves sideways in front of a surface.
Because the video content is 30 or 24 fps and there is no VR viewpoint movement, I cut the scene update rate to 30 fps during movie playback for a substantial power savings. The screen is still redrawn at 60 fps, so it doesn't feel any choppier when you look around. I go back to 60 fps when the lights come up, because the gaze cursor and UI scrolling animations look significantly worse at 30 fps.
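Conceptually, the throttle is just a per-vsync decision about whether to generate new eye buffers. The sketch below is a placeholder formulation, not actual Oculus SDK calls; Time Warp itself runs independently of this loop.

```java
// Illustrative 30 Hz scene-update throttle. The VR compositor ("Time Warp")
// still re-projects and redraws the display at 60 Hz from the most recent
// eye buffers; we just skip generating new eye buffers every other vsync
// while a movie is playing.
public class SceneRateThrottle {
    private boolean moviePlaying;
    private long frameIndex;

    public void setMoviePlaying(boolean playing) {
        moviePlaying = playing; // back to 60 fps scene updates when the lights come up
    }

    // Called once per 60 Hz vsync.
    public void onVsync() {
        frameIndex++;
        boolean skipSceneRender = moviePlaying && (frameIndex % 2 == 1);
        if (!skipSceneRender) {
            renderEyeBuffers(); // new stereo views of the living room and screen
        }
        // Time Warp always runs, re-projecting the latest eye buffers with
        // the newest head-tracking pose, so looking around stays smooth.
    }

    private void renderEyeBuffers() { /* scene rendering elided */ }
}
```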
If you really don't care about the VR environment, you can go into a "void theater", where everything is black except the video screen, which obviously saves additional power. You could even go all the way to a face-locked screen with no distortion correction, which would be essentially the same power draw as the normal Netflix application, but it would be ugly and uncomfortable.



A year ago, I had a short list of the top things that I felt Gear VR needed to be successful. One of them was Netflix. It was very rewarding to be able to do this work right before Oculus Connect and make it available to all of our users in such a short timeframe. Plus, I got to watch the entire season of Daredevil from the comfort of my virtual couch. Because testing, of course.
-John