YCbCr Gamut Checking

October 7th, 2009

I recently added a pattern to GStreamer’s videotestsrc that can be used to check YCbCr to RGB conversion is being done correctly as part of video output.  It is the result of a clever hack — some YCbCr values, when converted to RGB, are out of range, so as part of the conversion process, they are clamped to the nearest RGB value.  The pattern generator creates a checkerboard pattern of a color (say, red) and a YCbCr value that upon correct conversion will result in the same color.  Thus the pattern should be invisible.  Usefully, these out-of-gamut YCbCr values are preserved by video codecs, so I can present to you a Theora video demonstrating this:

Firefox does the conversion correctly, so it’s unlikely you’ll see the pattern. However, some video display drivers still get this wrong, so you might see the pattern when playing the video in a standalone program that uses XV. For those of you with working kit, I created a demonstration video that simulates a bad conversion:

Sometimes it’s possible to see the pattern very faintly due to rounding in even a correct conversion. This is unavoidable because the RGB->YCbCr->RGB round trip is lossy.
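As a rough illustration of the clamping trick described above, here is a minimal Python sketch. It assumes 8-bit video-range BT.601 coefficients (the post does not say which matrix is used), and the sample YCbCr triples are hypothetical values chosen for illustration, not the ones videotestsrc actually emits. The wrapping converter is likewise just one imagined failure mode, not a description of any particular driver.

```python
def ycbcr_to_rgb_unclamped(y, cb, cr):
    """Convert 8-bit video-range YCbCr to R'G'B' floats (assumed BT.601), no clamping."""
    c = 1.164 * (y - 16)
    r = c + 1.596 * (cr - 128)
    g = c - 0.392 * (cb - 128) - 0.813 * (cr - 128)
    b = c + 2.017 * (cb - 128)
    return (r, g, b)


def clamp8(v):
    """Clamp to the 0..255 range, then round to an integer."""
    return int(round(min(255.0, max(0.0, v))))


def ycbcr_to_rgb(y, cb, cr):
    """Correct conversion: out-of-range components are clamped."""
    return tuple(clamp8(v) for v in ycbcr_to_rgb_unclamped(y, cb, cr))


def ycbcr_to_rgb_wrapping(y, cb, cr):
    """Hypothetical broken conversion: overflow wraps modulo 256 instead of clamping."""
    return tuple(int(round(v)) % 256 for v in ycbcr_to_rgb_unclamped(y, cb, cr))


# Hypothetical checkerboard pair: a near-red value and an out-of-gamut partner.
near_red = (81, 90, 240)
out_of_gamut = (81, 90, 255)

if __name__ == "__main__":
    print("unclamped:", ycbcr_to_rgb_unclamped(*out_of_gamut))
    print("clamped:  ", ycbcr_to_rgb(*near_red), ycbcr_to_rgb(*out_of_gamut))
    print("wrapping: ", ycbcr_to_rgb_wrapping(*out_of_gamut))
```

With clamping, the two squares land within one code value of each other, which is the faint rounding difference mentioned above. With the wrapping converter the out-of-gamut square comes out a completely different color, so the checkerboard becomes obvious.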

Orc moved to code.entropywave.com

October 2nd, 2009

The git repository for Orc has moved to code.entropywave.com, where it will also likely obtain an actual web page soon.  code is a new website for open-source and free software projects sponsored by Entropy Wave.

Cog in gst-plugins-bad

September 19th, 2009

I finally moved my collection of Orc-based GStreamer plugins (codename “Cog”) into gst-plugins-bad, since they’ve moved on from being an experiment.  Orc is a runtime compiler for a simple cross-platform assembly-like language that specifically targets SIMD instructions for several processors.  Orc is very effective inside its domain, which is small but growing.

One such application that is covered is chroma subsampling and color matrixing for video, semi-incorrectly referred to as “colorspace conversion” in GStreamer.  There has been a colorspace element in Cog (cogcolorspace) for some time, but I never really bothered to do any speed comparisons between it and the default GStreamer colorspace element (ffmpegcolorspace), which is based on code copied from FFMpeg.  However, recently I did, and was somewhat surprised (although I shouldn’t have been) that cogcolorspace is the same speed as, or much faster than, ffmpegcolorspace for almost all operations.  (Please note that the FFMpeg code was forked a long time ago and heavily modified, so it does not reflect FFMpeg itself, only GStreamer’s ffmpegcolorspace.)

This is a scatter plot of the run time (in ms) for converting 1000 frames of 320×240 video between a variety of uncompressed video formats:

Colorspace element execution time scatter plot

The axes are execution time (in ms), with cogcolorspace on the horizontal axis and ffmpegcolorspace on the vertical axis.  The green line represents equal execution time; thus for points below the line, ffmpegcolorspace was faster, and for those above, cogcolorspace was faster.  Most of the points clustered around the green line are statistically the same as the green line, since my timing method is quite crude.  Things to observe from this graph are that 1) many cases are very similar in speed, indicating that both ffmpegcolorspace and cogcolorspace are using similar code paths, 2) in some cases, cogcolorspace is a lot faster, probably indicating that there isn’t an assembly fast path in ffmpegcolorspace for that conversion, and 3) in a few cases (which, not coincidentally, are the most heavily used cases) ffmpegcolorspace is slightly faster than cogcolorspace.

The conclusions to draw from this are that 1) by writing very generic code with Orc, you can get very similar results to hand-crafted assembly code, 2) a developer can cover a lot more cases with a small amount of work, and 3) there are a few cases where special-case Orc code would be beneficial.

This is only the low quality mode that cogcolorspace supports, which is similar or identical in quality to ffmpegcolorspace.  Higher-quality conversion is also implemented in most cases, and is only slightly slower.  This is the real advantage of Orc: it takes care of a huge number of combinations of options, and produces good SIMD code for all of them.

Orc-0.4.0

May 31st, 2009

Lately, I’ve been working on a side project called Orc as a replacement for liboil.  Liboil’s first major problem has always been that it doesn’t scale well — every software package that wanted to use liboil typically required several new liboil functions, and then someone would need to actually write assembly code for those functions on several architectures.  My original plan was to develop a critical mass of functions, and then additions would be “simple”.  This never happened.  The second major problem is that liboil’s compilation is terribly fragile.  Thousands of lines of inline assembly code that depends on specific compilers, compiler versions, libtool internals, and random snippets of code such as “if $user != msmith” do not lead to a maintainable project.

Orc is now to the point where it can not only reproduce about 90% of the code that is currently in liboil, but also generate 90% of the code that should be in liboil, but nobody ever wrote.  At runtime.  And the Orc language allows you to describe your own liboil-style functions.  At runtime.  Or, you can also use it like a normal compiler, converting Orc language source into N different assembly source files for every possible vector instruction set combination.

A large part of the decoding path in Schroedinger has been converted to optionally use Orc, where speed is either slightly faster or 20-30% faster than the previous liboil code.  The real benefit is that it takes only a few minutes to convert code that took weeks to develop originally.  A side project of mine, Cog, has turned into a showcase for Orc, with demonstrations of video processing GStreamer elements, such as format and colorspace conversion and scaling.  I’ve found that since it is so easy and fast to create vectorized code, it now becomes possible to offer additional features to users, such as quality vs. speed tradeoffs.

Orc can generate code for MMX and SSE on x86 and x86_64, and Altivec on PowerPC, as well as NEON for ARM and c64x+DSP code.  The NEON and c64x+ backends are not currently open source.

Download 0.4.0.  Online documentation.

Entropy Wave

April 27th, 2009

I see Christian outed my new company, Entropy Wave.  The mission of the new company is to create video post-production tools using open media technology for a wide range of users, including high-end studios, professional video editors, and hobbyists.  Most of our products will be based on open-source code, including projects I’ve been heavily involved with such as GStreamer, Schroedinger, Orc, and various Xiph projects.

Existing and upcoming products include:

  • A GStreamer-based Media SDK that allows developers to rapidly create and deploy applications on major platforms (Windows, Linux, OS/X)
  • QuickTime plugins for DiracPro (SMPTE VC-2)
  • A video encoder application geared toward content producers putting video on the web
  • A capture application compatible with Numedia’s line of DiracPro hardware encoders

In addition, Entropy Wave can provide support and custom development services in a variety of areas including open media.

Notes about low latency video

March 27th, 2009

I recently read a few rather extraordinary marketing claims from OnLive about their new server-side gaming technology.  Since they don’t really make much technical information available, one is left to speculate what they mean by “1 ms” latency, especially when it is directly compared to “500 ms to 750 ms lag” in video conferencing.  I’m sure that statement makes sense from someone’s perspective, just not from a video compression perspective.

I sat around over a beer last evening and wrote the following to the Schrödinger mailing list, because someone asked. Rather than answering the question, I decided to talk randomly about how low-latency video encoding works.

One key point about low latency video encoding is that the output bits that represent the pixel have to exist somewhere in the bitstream between the time the encoder gets the pixel from the camera, and N ms later, where N is the latency.

One method of very low-latency compression works on a scanline basis. An example is the low-delay profile of Dirac.  A camera reads out a few scan lines (say, 16), the encoder compresses them, and then sends those bits out over ethernet or ASI or whatever.  The latency is on the order of a few scan lines, say 16*2 + a small number.  Why 16*2? Because the encoder takes 16 line times to read in the 16-line chunk, then spends the time that it takes to read in the next chunk encoding the first chunk and sending it out over the wire.  Simultaneously, the decoder reads in the data and decodes.  Then during the third set of 16 lines, the decoder scans out the uncompressed lines.  So the decoder scans out line 0 as the camera is scanning out line 32.  Real encoders need a bit of extra time for synchronization, so 32 is the ideal figure.  Of course, in a real system there is network latency, but we’ll make someone else worry about that. 32 lines works out to be about 1 ms for 1080p at 30 frames per second, depending on exactly which system you’re using.  Compression ratios are purposefully low, since you can’t spread around worst-case bits at all, and because this kind of compression is only really useful for studio work.

Note that cameras that have a few-scanline latency start at USD 10,000 and an encoder/decoder pair for DiracPro is about USD 4,000, IIRC. This is not the kind of technology you roll out in a consumer product.

Another method is similar, but using an entire frame instead of a few scan lines.  In this case, you get a theoretical latency of 2 frames, or about 67 ms for 30 fps video.  I’ve seen companies advertising encoder/decoder pairs that claim 70 ms latency (of course, without any network latency), and I can pretty much believe this number.  Again, you can’t get away with cheap hardware: my DV camera has an internal latency somewhere between 90 and 120 ms, and HDV cameras are much worse.
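The latency figures quoted so far can be checked with a little arithmetic. The following Python sketch is only a back-of-the-envelope model: it assumes a 1080p raster with 1125 total lines per frame (1080 active lines plus blanking), and the function names are made up for this example.

```python
# Assumed 1080p raster: 1125 total lines per frame, including blanking.
TOTAL_LINES_1080P = 1125


def line_time_ms(fps, total_lines=TOTAL_LINES_1080P):
    """Time taken to scan one line, in milliseconds."""
    return 1000.0 / (fps * total_lines)


def scanline_latency_ms(chunk_lines, fps, stages=2, total_lines=TOTAL_LINES_1080P):
    """Ideal latency of a chunked scanline codec.

    stages=2 models the 16*2 argument: one chunk time to read the lines in,
    one chunk time to encode, transmit and decode before scan-out starts.
    """
    return stages * chunk_lines * line_time_ms(fps, total_lines)


def frame_latency_ms(frames, fps):
    """Latency of a frame-based codec that buffers a whole number of frames."""
    return frames * 1000.0 / fps


if __name__ == "__main__":
    print("16-line chunks at 1080p30: %.3f ms" % scanline_latency_ms(16, 30))
    print("2 frames at 30 fps:        %.1f ms" % frame_latency_ms(2, 30))
```

Counting only the 1080 active lines instead of 1125 changes the scanline result by a few percent, which is why the exact number depends on the system.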

In a frame-based low-latency system, it’s much more realistic to use motion compensation, in which you use the previous one or two frames as reference pictures.  Since the general point of using motion compensation is to decrease the bit rate, this causes compression artifacts immediately after scene changes that clear up after a few frames, and is very characteristic of the technique.

Due to the way that Dirac puts together pictures, the non-low-delay profiles of Dirac have an approximate latency of 4 pictures for a simple implementation, although you can decrease this to nearly 2 pictures with more complex algorithms.  Schroedinger implements the simple algorithm, and with suitable modifications (it does not do this by default) you can get close to 4 frames latency.  Schro’s implementation of Low-Delay Profile is also 4 frames, since it uses the same code.

Entropy Wave has implementations of the more complex algorithm for Simple and Intra profiles, as well as an actual low delay implementation of Low-Delay profile, with latencies that are very near the theoretical latencies.  These are not open source.  Unfortunately, since all the code that currently can use these codecs is frame based, there’s very little advantage over Schroedinger unless you write a bunch of custom code.

It should be obvious at this point that the “1 ms” number has very little to do with video compression, and a lot more to do with how game engines work.

A different kind of release

January 6th, 2009

InVisible cover

Seven Seas publishing just released the latest book of comic goodness, InVisible, written by my partner Tristan.  Unlike releasing software, where gratification is instant, Tristan finished this project months ago but the books took the slow boat from wherever the printing was done.  Or something.  Gratuitous Amazon link (buy now!)

Tristan also has a story with our friend Atticus Wolrab in Comic Book Tattoo, a book that can only be described as a tome.  12″x12″x2″, it has dozens of amazing stories based loosely on Tori Amos songs.  It came out last summer coinciding with San Diego Comicon, where I got to see the Tori Amos fan base in all their glory.

Another One Bites the Dust

November 10th, 2008

I have an interesting history with wireless routers and hubs.  I always keep a spare around, since they seem to die at random.  Some are dead on arrival, which often prompts me to buy two at the same time.  The death seems to follow the same pattern — first, it drops wireless connections every few days, then every few hours, then all the time.  Belkin, Linksys, Netgear, D-Link, Apple, I have a pile of hardware with broken radios.

I am now on my way to buy two more wireless routers.  I’d pay extra to ensure that I’d get one that won’t die in 6-12 months, but that technology is apparently not for sale.

Clear-cutting the Jungle

September 24th, 2008

Lennart is one of the few people thinking about audio on Linux at a high enough level to define and sort out the problems. I endorse this message.

Well, except for the part about portability: GStreamer works quite well on OS/X and Windows. Now if only it had a good raw audio subsystem, like what you would use in a game engine…

Dirac in the news

September 20th, 2008

The release of VLC with Dirac support (via Schrödinger) and the release of the Dirac research codec (confusingly named dirac-1.0.0, sorry) has caused a bit of news in the geek press. I’ve noticed a few uninformed comments out there, and figured it would be wise to provide real information from a Dirac developer.

  • Decoding Dirac takes a lot of CPU. This is true, depending on your definition of “a lot”.  It is also on purpose.  MPEG-4/ASP uses more CPU than MPEG-2, but gets better compression.  Likewise with MPEG-4/AVC vs. MPEG-4/ASP.  However, it doesn’t matter.  A video stream either plays on a CPU or doesn’t.  And most new computers (that aren’t specifically underpowered) are fully capable of playing Dirac at 1080p/30.
  • Encoding Dirac can be slow.  Right now, you either get slow and good (dirac-research) or fast and crappy (Schrödinger).  This is an area of active development.
  • Dirac and Theora will likely coexist.  For a variety of historical and technical reasons, Theora encoder development has been concentrated on SD and smaller sizes (and corresponding bit rates) and Dirac encoder development has concentrated on SD and HD sizes and bit rates.  And each currently appears to be better than the other in its respective area.  Given limited developer resources, I imagine this trend will continue.
  • Tools exist for Dirac.  Several will be released in the next few months, including both DirectShow and QuickTime plugins.  (There are currently a few showstopper bugs remaining.)
  • Comparing apples to oranges still doesn’t make sense.  Many video encoders cannot be compared to each other because they focus on different problem domains.  “1 Mbit/sec” is not a full description of how a particular video was encoded.  That could mean CBR with a strict buffer model, or simply (file_size/duration), which are completely different creatures.
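To make the last point concrete, here is a small Python sketch with two made-up streams that both average exactly 1 Mbit/sec by the (file_size/duration) definition, yet need very different amounts of buffering to be sent over a constant-rate channel. The leaky-bucket model and the frame sizes are simplified assumptions for illustration, not any codec’s actual buffer model.

```python
def average_bitrate(frame_bits, fps):
    """Average bit rate in bits/s: total bits divided by duration."""
    return sum(frame_bits) * fps / len(frame_bits)


def peak_buffer_bits(frame_bits, fps, channel_bps):
    """Peak fullness of a simple leaky bucket drained at a constant rate.

    Each frame is added to the buffer in one go, then the channel removes
    channel_bps / fps bits before the next frame arrives.
    """
    drain_per_frame = channel_bps / fps
    fullness = 0.0
    peak = 0.0
    for bits in frame_bits:
        fullness += bits
        peak = max(peak, fullness)
        fullness = max(0.0, fullness - drain_per_frame)
    return peak


FPS = 25
CHANNEL_BPS = 1_000_000

# Hypothetical streams: evenly sized frames vs. one large frame every ten.
flat = [40_000] * 30
bursty = ([310_000] + [10_000] * 9) * 3

if __name__ == "__main__":
    for name, stream in (("flat", flat), ("bursty", bursty)):
        print(name,
              "average:", average_bitrate(stream, FPS),
              "peak buffer:", peak_buffer_bits(stream, FPS, CHANNEL_BPS))
```

Both streams report the same average, but the bursty one needs a much larger buffer (and therefore more delay) to fit a strict constant-bit-rate channel, so the two encodes are not directly comparable.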

Also, there are Dirac demo videos here.