Notes about low latency video

I recently read a few rather extraordinary marketing claims from OnLive about their new server-side gaming technology.  Since they don’t really make much technical information available, one is left to speculate what they mean by “1 ms” latency, especially when it is directly compared to “500 ms to 750 ms lag” in video conferencing.  I’m sure that statement makes sense from someone’s perspective, just not from a video compression perspective.

I sat around over a beer last evening and wrote the following to the Schrödinger mailing list, because someone asked. Rather than answering the question, I decided to talk randomly about how low-latency video encoding works.

One key point about low latency video encoding is that the output bits that represent a pixel have to appear somewhere in the bitstream between the time the encoder gets the pixel from the camera and N ms later, where N is the latency.
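Here is a rough sketch of what that constraint implies. It assumes a constant-bit-rate link, which the text above does not state. If every bit for a pixel captured at time t must be sent between t and t + N, then all the bits for pixels captured during some span of time fall inside a window of (span + N) on the link. The function names and the 1 Mbit/s figure are made up for illustration.

```python
def max_bits_for_span(bitrate_bps, span_ms, latency_ms):
    """Upper bound on bits available to code pixels captured during span_ms.

    Assumes a constant-rate link (an assumption, not from the post). Every
    bit for a pixel captured at time t is sent between t and t + latency_ms,
    so all bits for the span fit in a window of span_ms + latency_ms.
    """
    return bitrate_bps * (span_ms + latency_ms) / 1000.0


def average_bits_for_span(bitrate_bps, span_ms):
    """Bits the link carries during the span itself (the long-run average)."""
    return bitrate_bps * span_ms / 1000.0


def burst_headroom(span_ms, latency_ms):
    """At most how many times the average budget a hard-to-code span can use."""
    return (span_ms + latency_ms) / span_ms


if __name__ == "__main__":
    bitrate = 1_000_000  # hypothetical 1 Mbit/s link
    for latency in (1.0, 500.0):
        print(
            "latency %.0f ms: up to %.0f bits for a 1 ms span (%.0fx average)"
            % (
                latency,
                max_bits_for_span(bitrate, 1.0, latency),
                burst_headroom(1.0, latency),
            )
        )
```

Under this model, a 1 ms stretch of difficult picture can borrow at most 2x the average budget at 1 ms latency, but up to 501x at 500 ms latency. The shorter the latency, the less room there is to move bits to where they are needed.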

One method of very low-latency compression works on a scanline basis. An example is the low-delay profile of Dirac.  A camera reads out a few scan lines (say, 16), the encoder compresses them, and then sends those bits out over ethernet or ASI or whatever.  The latency is on the order of a few scan lines, say 16*2 + a small number.  Why 16*2? Because the encoder takes 16 line times to read in a 16-line chunk, then spends the time it takes to read in the next chunk encoding the first chunk and sending it out over the wire.  Simultaneously, the decoder reads in the data and decodes it.  Then, during the third set of 16 lines, the decoder scans out the uncompressed lines.  So the decoder scans out line 0 as the camera is scanning out line 32.  Real encoders need a bit of extra time for synchronization, so 32 lines is the ideal case.  Of course, in a real system there is network latency, but we’ll make someone else worry about that. 32 lines works out to be about 1 ms for 1080p at 30 frames per second, depending on exactly the system you’re using.  Compression ratios are purposefully low, since you can’t spread around worst-case bits at all, and because this kind of compression is only really useful for studio work.
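The chunk timing above can be sketched as a small schedule. This is an idealized model, not a description of any real encoder: it assumes that encoding, transmission and decoding of a chunk together take exactly one chunk period and that the network adds nothing. The function names are made up, and the 1125 total lines per frame (1080 active lines plus blanking) is an assumption about the raster.

```python
CHUNK_LINES = 16  # lines per chunk, the figure used in the text


def pipeline_schedule(num_chunks, chunk_lines=CHUNK_LINES):
    """Return (chunk, capture_start, encode_start, scanout_start) tuples.

    All times are measured in scan-line periods from the start of capture.
    Idealized: encode + send + decode of a chunk takes one chunk period.
    """
    schedule = []
    for k in range(num_chunks):
        capture_start = k * chunk_lines
        encode_start = capture_start + chunk_lines  # chunk fully read in
        scanout_start = encode_start + chunk_lines  # decoder starts output
        schedule.append((k, capture_start, encode_start, scanout_start))
    return schedule


def lines_to_ms(lines, total_lines_per_frame, fps):
    """Convert a delay measured in scan lines to milliseconds."""
    line_rate = total_lines_per_frame * fps  # scan lines per second
    return 1000.0 * lines / line_rate


if __name__ == "__main__":
    for row in pipeline_schedule(3):
        print("chunk %d: capture@%d encode@%d scanout@%d" % row)

    # Chunk 0 is scanned out 32 line periods after its capture began.
    latency_lines = pipeline_schedule(1)[0][3]

    # Assumed raster sizes: 1125 total lines, or 1080 if blanking is ignored.
    for total_lines in (1125, 1080):
        ms = lines_to_ms(latency_lines, total_lines, 30)
        print("%d lines/frame: %.3f ms" % (total_lines, ms))
```

Either raster assumption lands slightly under 1 ms for 32 lines at 30 frames per second, which is where the "about 1 ms" figure comes from.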

Note that cameras with a few-scanline latency start at USD 10,000, and an encoder/decoder pair for DiracPro is about USD 4,000, IIRC. This is not the kind of technology you roll out in a consumer product.

Another method is similar, but using an entire frame instead of a few scan lines.  In this case, you get a theoretical latency of 2 frames, or about 60 ms for 30 fps video.  I’ve seen companies advertising encoder/decoder pairs that claim 70 ms latency (of course, without any network latency), and I can pretty much believe this number.  Again, you can’t get away with cheap hardware — my DV camera has an internal latency somewhere between 90 and 120 ms, and HDV cameras are much worse.
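For reference, the frame-based arithmetic works out as follows. This is only a back-of-the-envelope sketch; the function name is made up, and the 70 ms figure is the advertised number mentioned above.

```python
def frame_latency_ms(frames, fps):
    """Latency in milliseconds of a pipeline that is `frames` frames deep."""
    return 1000.0 * frames / fps


if __name__ == "__main__":
    theoretical = frame_latency_ms(2, 30)  # one frame in, one frame to code
    advertised = 70.0                      # vendor claim quoted in the text
    print("theoretical: %.1f ms" % theoretical)
    print("margin in the advertised figure: %.1f ms" % (advertised - theoretical))
```

Two frames at 30 fps is closer to 67 ms than 60 ms, so an advertised 70 ms leaves only a few milliseconds for everything beyond the two frame periods.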

In a frame-based low-latency system, it’s much more realistic to use motion compensation, in which you use the previous one or two frames as reference pictures.  Since the general point of using motion compensation is to decrease the bit rate, a scene change (where the reference frames are no help) causes compression artifacts that clear up after a few frames.  This is very characteristic of the technique.

Due to the way that Dirac puts together pictures, the non-low-delay profiles of Dirac have an approximate latency of 4 pictures for a simple implementation, although you can decrease this to nearly 2 pictures with more complex algorithms.  Schroedinger implements the simple algorithm, and with suitable modifications (it does not do this by default) you can get close to 4 frames of latency.  Schro’s implementation of Low-Delay Profile also has 4 frames of latency, since it uses the same code.

Entropy Wave has implementations of the more complex algorithm for Simple and Intra profiles, as well as an actual low delay implementation of Low-Delay profile, with latencies that are very near the theoretical latencies.  These are not open source.  Unfortunately, since all the code that currently can use these codecs is frame based, there’s very little advantage over Schroedinger unless you write a bunch of custom code.

It should be obvious at this point that the “1 ms” number has very little to do with video compression, and a lot more to do with how game engines work.

3 Responses to “Notes about low latency video”

  1. romulo says:

    The guys at the OnLive conference didn’t mean 1 ms latency on the video compression. What they said is that since the players’ “clients” run on the same computer, they will see no lag between themselves. It’s not about video streaming, it’s about how players interact with each other inside the server.

  2. Being involved in low-delay audio transmission (through CELT and other projects), I also got interested in low-delay video transmission. After all, it’s no use having a conference with a low-delay audio codec if your video path has 200 ms delay (either audio and video are out of sync or you have to delay the audio). It turns out that systems that work on whole frames are basically hopeless for low-delay video. In your example above of 60 ms encoder-decoder delay, the problem is that there are a lot of other delays that need to be added. Even a “perfect” one-frame-at-a-time camera will have 30 ms delay, then the display side will have another 30 ms at least. Then you’ll have another ~30 ms for the network buffer. Finally, the general idea of compressing is to have a size that’s close to your network bandwidth. When that’s the case, then the time it takes to transfer an entire frame is also 30 ms. Add to that the speed of light, maybe 10 ms if you’re close, and you get a total of *at least* 160 ms. In practice, it’ll be very hard to go below 200 ms with that kind of approach.

    So not only do we need input and output hardware that works on scan lines, but we also need codecs that can handle that. As far as I understand (I could be wrong), the Theora bit-stream requires the encoder to have seen the entire frame before it can output anything. That’s a fundamental limitation that Dirac does not have. The standard version of Dirac will still have a delay of several lines because of the vertical wavelet transform, but not that bad. IIRC Dirac Pro can use a simpler Haar transform that adds less delay.

  3. Gregory Maxwell says:

    If someone is looking for a fun project: Elphel makes a line of fully open-hardware video cameras. They use a CMOS sensor with an electronic rolling shutter. The image data is streamed into an FPGA, where on-chip demosaicing and JPEG compression are performed. The Verilog for the FPGA is available.

    If someone was interested they could have these fairly inexpensive cameras producing output with a few scanlines latency with some development work.