Orc-0.4.0

May 31st, 2009

Lately, I’ve been working on a side project called Orc as a replacement for liboil.  Liboil’s first major problem has always been that it doesn’t scale well: every software package that wanted to use liboil typically required several new liboil functions, and then someone would need to actually write assembly code for those functions on several architectures.  My original plan was to develop a critical mass of functions, and then additions would be “simple”.  This never happened.  The second major problem is that liboil’s compilation is terribly fragile.  Thousands of lines of inline assembly code that depend on specific compilers, compiler versions, libtool internals, and random snippets of code such as “if $user != msmith” do not lead to a maintainable project.

Orc is now to the point where it can not only reproduce about 90% of the code that is currently in liboil, but also generate 90% of the code that should be in liboil, but nobody ever wrote.  At runtime.  And the Orc language allows you to describe your own liboil-style functions.  At runtime.  Or, you can also use it like a normal compiler, converting Orc language source into N different assembly source files for every possible vector instruction set combination.

A large part of the decoding path in Schroedinger has been converted to optionally use Orc, where speed is either slightly faster or 20-30% faster than the previous liboil code.  The real benefit is that it takes only a few minutes to convert code that took weeks to develop originally.  A side project of mine, Cog, has turned into a showcase for Orc, with demonstrations of video processing GStreamer elements, such as format and colorspace conversion and scaling.  I’ve found that since it is so easy and fast to create vectorized code, it now becomes possible to offer additional features to users, such as quality vs. speed tradeoffs.

Orc can generate code for MMX and SSE on x86 and x86_64, and Altivec on PowerPC, as well as NEON for ARM and c64x+DSP code.  The NEON and c64x+ backends are not currently open source.

Download 0.4.0.  Online documentation.

Entropy Wave

April 27th, 2009

I see Christian outed my new company, Entropy Wave.  The mission of the new company is to create video post-production tools using open media technology for a wide range of users, including high-end studios, professional video editors, and hobbyists.  Most of our products will be based on open-source code, including projects I’ve been heavily involved with such as GStreamer, Schroedinger, Orc, and various Xiph projects.

Existing and upcoming products include:

  • A GStreamer-based Media SDK that allows developers to rapidly create and deploy applications on major platforms (Windows, Linux, OS/X)
  • QuickTime plugins for DiracPro (SMPTE VC-2)
  • A video encoder application geared toward content producers putting video on the web
  • A capture application compatible with Numedia’s line of DiracPro hardware encoders

In addition, Entropy Wave can provide support and custom development services in a variety of areas including open media.

Notes about low latency video

March 27th, 2009

I recently read a few rather extraordinary marketing claims from OnLive about their new server-side gaming technology.  Since they don’t really make much technical information available, one is left to speculate what they mean by “1 ms” latency, especially when it is directly compared to “500 ms to 750 ms lag” in video conferencing.  I’m sure that statement makes sense from someone’s perspective, just not from a video compression perspective.

I sat around over a beer last evening and wrote the following to the Schrödinger mailing list, because someone asked. Rather than answering the question, I decided to talk randomly about how low-latency video encoding works.

One key point about low latency video encoding is that the output bits that represent the pixel have to exist somewhere in the bitstream between the time the encoder gets the pixel from the camera, and N ms later, where N is the latency.

One method of very low-latency compression works on a scanline basis. An example is the low-delay profile of Dirac.  A camera reads out a few scan lines (say, 16), the encoder compresses them, and then sends those bits out over ethernet or ASI or whatever.  The latency is on the order of a few scan lines, say 16*2 + a small number.  Why 16*2? Because it takes 16 lines to read in the 16-line chunk, and then the encoder spends the time that it takes to read in the next chunk encoding the first chunk and sending it out over the wire.  Simultaneously, the decoder reads in the data and decodes.  Then during the third set of 16 lines, the decoder scans out the uncompressed lines.  So the decoder scans out line 0 as the camera is scanning out line 32.  Real encoders need a bit of extra time for synchronization, so 32 lines is the ideal case.  Of course, in a real system there is network latency, but we’ll make someone else worry about that. 32 lines works out to be about 1 ms for 1080p at 30 frames per second, depending on exactly the system you’re using.  Compression ratios are purposefully low, since you can’t spread around worst-case bits at all, and because this kind of compression is only really useful for studio work.
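The 32-line figure can be checked with a little arithmetic. The sketch below is only an illustration, and it assumes a 1080p raster with 1125 total lines per frame (1080 active lines plus blanking); the function names are made up for this example.

```python
def line_time_ms(total_lines_per_frame: int, frames_per_second: float) -> float:
    """Time the camera spends scanning out one line, in milliseconds."""
    return 1000.0 / (frames_per_second * total_lines_per_frame)


def scanline_latency_ms(chunk_lines: int,
                        total_lines_per_frame: int,
                        frames_per_second: float,
                        stages: int = 2) -> float:
    """Ideal latency of a chunked scanline codec.

    One chunk time is spent reading the chunk in, and a second chunk time
    is spent encoding, transmitting, and decoding it, so stages defaults
    to 2. Network latency and synchronization overhead are ignored.
    """
    per_line = line_time_ms(total_lines_per_frame, frames_per_second)
    return stages * chunk_lines * per_line


# Assumption: 1125 total lines per frame for 1080p, at 30 frames per second.
latency_ms = scanline_latency_ms(16, 1125, 30.0)
print(f"16-line chunks, 1080p30: {latency_ms:.3f} ms")
```

Counting only the 1080 active lines instead of 1125 changes the result slightly, but it still lands just under 1 ms.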

Note that cameras with a few-scanline latency start at USD 10,000 and an encoder/decoder pair for DiracPro is about USD 4,000, IIRC. This is not the kind of technology you roll out in a consumer product.

Another method is similar, but using an entire frame instead of a few scan lines.  In this case, you get a theoretical latency of 2 frames, or about 67 ms for 30 fps video.  I’ve seen companies advertising encoder/decoder pairs that claim 70 ms latency (of course, without any network latency), and I can pretty much believe this number.  Again, you can’t get away with cheap hardware: my DV camera has an internal latency somewhere between 90 and 120 ms, and HDV cameras are much worse.

In a frame-based low-latency system, it’s much more realistic to use motion compensation, in which you use the previous one or two frames as reference pictures.  Since the general point of using motion compensation is to decrease the bit rate, this causes compression artifacts immediately after scene changes that clear up after a few frames, and is very characteristic of the technique.

Due to the way that Dirac puts together pictures, the non-low-delay profiles of Dirac have an approximate latency of 4 pictures for a simple implementation, although you can decrease this to nearly 2 pictures with more complex algorithms.  Schroedinger implements the simple algorithm, and with suitable modifications (it does not do this by default) you can get close to 4 frames latency.  Schro’s implementation of Low-Delay Profile is also 4 frames, since it uses the same code.
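To put the frame and picture counts into milliseconds, here is a minimal sketch. It assumes 30 fps video and ignores capture, display, and network delays; the function name is invented for illustration.

```python
def picture_latency_ms(pictures: float, frames_per_second: float) -> float:
    """Codec latency in milliseconds for a delay measured in pictures."""
    return pictures * 1000.0 / frames_per_second


# Frame-based low-latency coding (2 frames) versus the simple
# non-low-delay Dirac implementation (4 pictures), both at 30 fps.
for pictures in (2, 4):
    ms = picture_latency_ms(pictures, 30.0)
    print(f"{pictures} pictures at 30 fps: {ms:.1f} ms")
```

At 30 fps, two pictures is roughly 67 ms and four pictures is roughly 133 ms, which is why the simple implementation is a poor fit for interactive use.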

Entropy Wave has implementations of the more complex algorithm for Simple and Intra profiles, as well as an actual low delay implementation of Low-Delay profile, with latencies that are very near the theoretical latencies.  These are not open source.  Unfortunately, since all the code that currently can use these codecs is frame based, there’s very little advantage over Schroedinger unless you write a bunch of custom code.

It should be obvious at this point that the “1 ms” number has very little to do with video compression, and a lot more to do with how game engines work.

A different kind of release

January 6th, 2009

InVisible cover

Seven Seas publishing just released the latest book of comic goodness, InVisible written by my partner Tristan.  Unlike releasing software, where gratification is instant, Tristan finished this project months ago but the books took the slow boat from wherever the printing was done.  Or something.  Gratuitous Amazon link (buy now!)

Tristan also has a story with our friend Atticus Wolrab in Comic Book Tattoo, a book that can only be described as a tome.  12″x12″x2″, it has dozens of amazing stories based loosely on Tori Amos songs.  It came out last summer coinciding with San Diego Comicon, where I got to see the Tori Amos fan base in all their glory.

Another One Bites the Dust

November 10th, 2008

I have an interesting history with wireless routers and hubs.  I always keep a spare around, since they seem to die at random.  Some are dead on arrival, which often prompts me to buy two at the same time.  The death seems to follow the same pattern — first, it drops wireless connections every few days, then every few hours, then all the time.  Belkin, Linksys, Netgear, D-Link, Apple, I have a pile of hardware with broken radios.

I am now on my way to buy two more wireless routers.  I’d pay extra to ensure that I’d get one that won’t die in 6-12 months, but that technology is apparently not for sale.

Clear-cutting the Jungle

September 24th, 2008

Lennart is one of the few people thinking about audio on Linux at a high enough level to define and sort out the problems. I endorse this message.

Well, except for the part about portability: GStreamer works quite well on OS/X and Windows. Now if only it had a good raw audio subsystem, like what you would use in a game engine…

Dirac in the news

September 20th, 2008

The release of VLC with Dirac support (via Schrödinger) and the release of the Dirac research codec (confusingly named dirac-1.0.0, sorry) has caused a bit of news in the geek press. I’ve noticed a few uninformed comments out there, and figured it would be wise to provide real information from a Dirac developer.

  •  Decoding Dirac takes a lot of CPU. This is true, depending on your definition of “a lot”.  It is also on purpose.  MPEG-4/ASP uses more CPU than MPEG-2, but gets better compression.  Likewise with MPEG-4/AVC vs. MPEG-4/ASP.  However, it doesn’t matter.  A video stream either plays on a CPU or doesn’t.  And most new computers (that aren’t specifically underpowered) are fully capable of playing Dirac at 1080p/30.
  • Encoding Dirac can be slow.  Right now, you either get slow and good (dirac-research) or fast and crappy (Schrödinger).  This is an area of active development.
  • Dirac and Theora will likely coexist.  For a variety of historical and technical reasons, Theora encoder development has been concentrated on SD and smaller sizes (and corresponding bit rates) and Dirac encoder development has concentrated on SD and HD sizes and bit rates.  And each currently appears to be better than the other in its respective area.  Given limited developer resources, I imagine this trend will continue.
  • Tools exist for Dirac.  Several will be released in the next few months, including both DirectShow and QuickTime plugins.  (There are currently a few showstopper bugs remaining.)
  • Comparing apples to oranges still doesn’t make sense.  Many video encoders cannot be compared to each other because they focus on different problem domains.  “1 Mbit/sec” is not a full description of how a particular video was encoded.  That could mean CBR with a strict buffer model, or simply (file_size/duration), which are completely different creatures.
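To illustrate the last point, here is a rough sketch of how two streams with the same (file_size/duration) figure can behave very differently under a buffer constraint. The buffer check is a deliberately simplified leaky-bucket model written for this example, not the buffer model of any particular codec, and the frame sizes are made up.

```python
from typing import List


def average_bitrate(file_size_bytes: int, duration_s: float) -> float:
    """The naive figure: total bits divided by duration."""
    return file_size_bytes * 8 / duration_s


def fits_cbr_buffer(frame_sizes_bits: List[int],
                    bitrate: float,
                    frames_per_second: float,
                    buffer_bits: int) -> bool:
    """Simplified leaky-bucket check (assumed model, for illustration only).

    The decoder buffer starts full, each frame removes its bits when it is
    decoded, and the channel refills the buffer at a constant bit rate.
    Returns False if a frame is larger than what the buffer holds.
    """
    fullness = float(buffer_bits)
    refill_per_frame = bitrate / frames_per_second
    for size in frame_sizes_bits:
        if size > fullness:
            return False  # buffer underflow: the frame has not fully arrived
        fullness = min(float(buffer_bits), fullness - size + refill_per_frame)
    return True


# Two hypothetical one-second streams at 25 fps, each exactly 1,000,000 bits.
steady = [40_000] * 25
bursty = [520_000] + [20_000] * 24

for name, frames in (("steady", steady), ("bursty", bursty)):
    rate = average_bitrate(sum(frames) // 8, 1.0)
    ok = fits_cbr_buffer(frames, 1_000_000, 25.0, 200_000)
    print(f"{name}: average {rate:.0f} bit/s, fits 200 kbit buffer: {ok}")
```

Both streams report "1 Mbit/sec" as an average, but only the steady one passes the buffer check in this model.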

Also, there are Dirac demo videos here.

Random Updates

August 29th, 2008

I’m at this annoying stage with Orc and Liboil where I need to add a feature to Liboil to support a new feature in Schrödinger, except that the new feature would be really easy to write if Orc was further along. Sigh. So I continue to muddle along not working on either.

In other news, jirac, the Java Dirac decoder written by Bart Wiegmans as a Google Summer of Code project, is pretty much feature complete and integrated into Cortado. There are a few showstopper bugs remaining, but I’m hoping for a release soon.

And in yet other news, I will be attending IBC in Amsterdam in a few weeks to meet up with various people to talk about Dirac and GStreamer. If you would like to meet up, let me know. There will be at least 3 booths related to Dirac: the BBC will be demonstrating Dirac as used for high-definition video distribution, NHK will be demonstrating Super Hi-vision with Dirac compression (that’s 4320p, kids), and Numedia will be demoing their hardware that handles Dirac Pro (SMPTE VC-2), which is Dirac for studio compression.

I’ve been seeing comments on teh Internets about “when Dirac is ready…” Just wanted to let you all know that Dirac is ready now.

Dirac news

July 15th, 2008

I haven’t written about the Dirac project recently.  We’ve reorganized a bit: most information is on diracvideo.org now, “Dirac” is now used to describe the overall project, with various subprojects like Schrödinger, dirac-research (formerly the Dirac codebase, intermingled with the specification), and a bunch of others.  I’m finding that a lot of my time lately is spent cat herding all the subprojects and creating a coherent whole.

One such subproject is Bart Wiegmans’ GSoC project to implement Dirac in Java and get it working in Cortado.  It’s moving along quite nicely, and has reached the feel-good milestone of creating a pretty picture.  Right now, it only handles a very limited set of bitstreams — intra pictures only, variable length coding, and a specific choice of wavelet filter.  I think Bart has fixed a bug related to chroma handling since the time I created this screenshot, which is why it’s in greyscale.

Schro Java Screenshot

Another subproject, and another GSoC project, is Mattias Bolte’s OpenGL decoding backend for Schrödinger.  This is roughly similar to the existing CUDA backend, although it will hopefully work on a wider range of hardware.  He’s uncovering a number of resource usage issues in Schrödinger, such as the decoder using up to half a GB of RAM for temporary storage in some cases.  Alas, my hardware does not handle GLSL, so there will be no screenshots here.  The scope of the current project will require the use of GLSL, since serious signal processing is beyond clever texture hacks.

On a related note, progress is being made in gst-plugins-gl, the GStreamer OpenGL plugins.  Julien Isorce has done some amazing infrastructure work expanding on my original gst-plugins-gl code, which was really just an experiment.  At some point, the schrodec GStreamer element will decode directly in hardware using OpenGL, and then deliver the pictures as textures to downstream GStreamer elements, which will render it directly to the display.  Fillipo Argolias has been writing some cool GStreamer video filters that use OpenGL, but on a different branch than Julien’s work.  (My current task is merging these branches.) These filters will be used for Cheese.

Maarten Lankhorst has been working on a DirectShow filter for encoding/decoding Dirac streams based on Schrödinger.  It currently only works with Dirac inside Ogg, and is nearing the point where it is useful for people to use.  We’ll be doing a testing release sometime soon.

I have been working on a QuickTime component in whatever time I’ve had remaining after trying to keep up with everyone else’s work.  It works only with the QuickTime container, but it encodes and decodes pretty consistently with the Apple tools.  It’s pretty much ready for a testing release as well; just need the time to get it out the door.

Update: A screenie of Julien’s glfiltercube example (on a Mac, FTW):

gst-plugins-gl screenshot

Introducing Orc

May 23rd, 2008

Orc is a new sub-project of Liboil that I’ve been working on for a few weeks.  Orc stands for Oil Runtime Compiler, which further expands to Optimized Inner Loop Runtime Compiler.  As its name implies, the Orc library compiles code into a runnable function, and does it using SIMD instructions when available.  The “code” is currently an intermediate form that is roughly a platform-agnostic assembly language that understands variables and simple arrays.  It’s an intermediate form in the sense that it’s currently only stored as a list of structs — there isn’t a parser yet.  Orc can generate MMX and Altivec code for a few simple functions that is as fast as the corresponding liboil function.  There are also software fallbacks for those not-so-mainstream architectures.

There have been many motivations for creating Orc, which I will go into at another time.

One of the primary goals of Orc is to create a simple compiler (actually, it’s a really fancy assembler that understands register allocation) that is user-expandable: an Orc user can create additional opcodes, variable types, and rules for translating those opcodes into machine code.  One of the potential applications for Orc is pixman, which would require adding types (ARGB pixels) and opcodes (compositing operations).
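To give a feel for the idea, here is a toy sketch in Python of a program stored as a list of instruction records, run by a software fallback, with a user-extensible opcode table. This is not Orc's actual API or data layout: every name below (Instruction, OPCODES, register_opcode, the opcode names) is hypothetical, and the "rules" here are plain functions rather than machine-code emitters.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Instruction:
    """One step of the intermediate form: dest[i] = opcode(src1[i], src2[i])."""
    opcode: str
    dest: str
    src1: str
    src2: str


# Opcode table mapping a name to a per-element rule. A real runtime compiler
# would translate each opcode into SIMD machine code; in this sketch the rule
# is a Python function that plays the role of the software fallback.
OPCODES: Dict[str, Callable[[int, int], int]] = {
    "add16": lambda a, b: (a + b) & 0xFFFF,
    "sub16": lambda a, b: (a - b) & 0xFFFF,
}


def register_opcode(name: str, rule: Callable[[int, int], int]) -> None:
    """Let a user of the library add a new opcode and its rule."""
    OPCODES[name] = rule


def run(program: List[Instruction], arrays: Dict[str, List[int]], n: int) -> None:
    """Software fallback: apply the whole program to each of n elements."""
    for i in range(n):
        for insn in program:
            a = arrays[insn.src1][i]
            b = arrays[insn.src2][i]
            arrays[insn.dest][i] = OPCODES[insn.opcode](a, b)


# A user-defined opcode: rounded average of two values.
register_opcode("avg", lambda a, b: (a + b + 1) >> 1)

program = [
    Instruction("add16", "t", "s1", "s2"),  # t = s1 + s2
    Instruction("avg", "d", "t", "s2"),     # d = avg(t, s2)
]
arrays = {"s1": [1, 2, 3], "s2": [10, 20, 30], "t": [0] * 3, "d": [0] * 3}
run(program, arrays, 3)
print(arrays["d"])
```

The point of the design is that the program is data: the same list of instructions could be handed to an MMX or Altivec code generator instead of the interpreter loop, and new opcodes only need a new entry in the table.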