I previously wrote about how I was starting a Metro application, for learning and fun’s sake. I talked about how I wanted to exercise some knowledge in multimedia and get comfortable with all the WinRT goodies. I also covered that I wanted to make a video editing application. That idea, even now, is pretty abstract. The loose goal is “something like Adobe Premiere, Avid for iPad or iMovie”. I’m making this app as I go along. This specific post digs into a lot of Media Foundation, and I sometimes make a lot of assumptions about the reader’s knowledge (and willingness to read about something as boring as COM media APIs). My hope is to cover some of the thought process, and a few bumps in the road, so others don’t have to endure them.
Making a Video Editing Engine
Any video editing application is going to need a way of dealing with media. The term “engine” might be a little strong in this case; in reality it is a set of components. These components are suited to specific tasks related to processing media: reading media attributes, processing an audio/video stream and decompressing it, encoding, mixing audio and video, and of course playback. Metro exposes a subset of Media Foundation that facilitates a large portion of the multimedia features. If you are familiar with Media Foundation, you will notice the subset is very trimmed down from its desktop counterpart. My initial reaction was, “OMG! How am I going to do anything without XYZ?!”, but I eventually found the smaller surface and its tradeoffs to be beneficial and just easier. Some areas required some flexing of the API, but it’s hard to complain when you get so much more for “free”.
My first step in designing this editing engine was to see what infrastructure already exists in Metro. There’s a transcode API and MediaElement, both of which can take a custom Media Foundation Transform (MFT) to add effects. I can see using the transcode API in some situations in this app, but in reality, by the time we are done, we’ll have “rewritten” all that functionality (plus more). Unfortunately, transcode and MediaElement do not allow the kind of flexibility we want in this application. To get the control we really need, we’ll mainly use the IMFSourceReader, IMFSinkWriter, a custom media source, Direct3D and the IMFMediaEngine.
The Media Type Detector
All media container formats (WMV, MOV, AVI, MKV, etc.) contain important information about the media they house. For this app, we need to know things like, “What kind of media streams do you have?”, “What compression format does each stream use?” and “What is the duration of the stream?”. This kind of data is needed to actually decode the media, but it is also nice to have so it can be displayed to the user. With this component we can discover video resolutions, so we can properly initialize other parts of the application. It is also possible to discover multiple audio tracks in case the user wishes to include those within the video editing session.
DirectShow had a built-in component for this, called IMediaDet. Media Foundation has the same functionality, but it’s rolled into the IMFSourceReader. The source reader handles quite a bit, which I’ll cover next, but for this requirement we only use its GetCurrentMediaType/GetNativeMediaType functionality.
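To make the idea concrete, here is a rough sketch of that type-detection component (error handling trimmed, and assuming MFStartup has already been called and the reader created). Duration comes from the underlying media source via a presentation attribute; per-stream formats come from GetNativeMediaType:

```cpp
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <mferror.h>

HRESULT InspectMedia(IMFSourceReader* pReader)
{
    // Duration (in 100-ns units) of the whole presentation.
    PROPVARIANT var;
    HRESULT hr = pReader->GetPresentationAttribute(
        MF_SOURCE_READER_MEDIASOURCE, MF_PD_DURATION, &var);
    if (SUCCEEDED(hr))
    {
        LONGLONG hnsDuration = var.uhVal.QuadPart;
        PropVariantClear(&var);
    }

    // Walk every stream and read its native (compressed) type.
    for (DWORD i = 0; ; ++i)
    {
        IMFMediaType* pType = nullptr;
        hr = pReader->GetNativeMediaType(i, 0, &pType);
        if (hr == MF_E_INVALIDSTREAMNUMBER)
            break; // no more streams
        if (FAILED(hr))
            return hr;

        GUID major = {}, subtype = {};
        pType->GetGUID(MF_MT_MAJOR_TYPE, &major); // audio or video?
        pType->GetGUID(MF_MT_SUBTYPE, &subtype);  // e.g. MFVideoFormat_H264

        if (major == MFMediaType_Video)
        {
            UINT32 width = 0, height = 0;
            MFGetAttributeSize(pType, MF_MT_FRAME_SIZE, &width, &height);
        }
        pType->Release();
    }
    return S_OK;
}
```

The loop over stream indices is also how you would notice multiple audio tracks: each one shows up as another stream whose major type is MFMediaType_Audio.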
Note on Metro and IMFSourceReader: Because of the sandbox restrictions, you may find creating an IMFSourceReader using MFCreateSourceReaderFromURL problematic. You will want to use MFCreateSourceReaderFromByteStream, which takes an IMFByteStream. You can use any WinRT random access stream by first passing it to MFCreateMFByteStreamOnStreamEx, which wraps the WinRT stream in an IMFByteStream. That function is also useful in other areas of Media Foundation that use or require IMFByteStream.
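In practice the wrapping is just two calls. A minimal sketch (C++/CX, error handling omitted, assuming the stream came from something like a FileOpenPicker):

```cpp
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <wrl/client.h>

HRESULT CreateReaderFromRandomAccessStream(
    Windows::Storage::Streams::IRandomAccessStream^ stream,
    IMFSourceReader** ppReader)
{
    Microsoft::WRL::ComPtr<IMFByteStream> spByteStream;

    // Wrap the WinRT stream in an IMFByteStream.
    HRESULT hr = MFCreateMFByteStreamOnStreamEx(
        reinterpret_cast<IUnknown*>(stream), &spByteStream);
    if (FAILED(hr)) return hr;

    // Hand the byte stream to the source reader instead of a URL.
    return MFCreateSourceReaderFromByteStream(
        spByteStream.Get(), nullptr, ppReader);
}
```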
The Media Readers
Reading media is where things get a little more complicated. This component needs to handle everything from parsing the container to reading media samples and decoding them. In the case of video, the GPU should be leveraged as much as possible, for accelerated decoding and colorspace conversion. For this, like the media type detector, we use the IMFSourceReader. The source reader handles almost everything you need to read and decode media automatically, at least for media supported by installed codecs. If it’s a video stream, you can tell it to decode to a specific colorspace (e.g. RGB32). If it’s audio, you can have it down/up sample, and even convert the number of channels. This is extremely helpful when you need a specific format to do audio/video mixing, and also for feeding specific encoders.
The IMFSourceReader has a method called ReadSample, which returns the next consecutive media sample. If your media contains an audio track and a video track, ReadSample(…) could return an audio sample or a video sample, in the order it was interleaved into the media container. This is great for some cases, but we have an issue. What if a user wishes to move an audio track +/- 5 seconds relative to the video? What if the user only wants to use the video stream from the media? Or just an audio track? The solution here is to write an “Audio Sample Reader” and a “Video Sample Reader”. Each is independent of the other, which allows for the greatest flexibility and independent seeking, and avoids a situation where the IMFSourceReader queues up media samples if you do not call ReadSample on a specific stream.
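A sketch of what that separation looks like for the video side (the audio reader mirrors it with MF_SOURCE_READER_FIRST_AUDIO_STREAM). Each reader owns its own IMFSourceReader over the same file, with only its stream selected, so ReadSample never returns, or queues, samples from the other stream:

```cpp
#include <mfapi.h>
#include <mfreadwrite.h>

// Deselect everything, then re-select just the first video stream.
HRESULT ConfigureVideoOnly(IMFSourceReader* pReader)
{
    HRESULT hr = pReader->SetStreamSelection(
        MF_SOURCE_READER_ALL_STREAMS, FALSE);
    if (SUCCEEDED(hr))
        hr = pReader->SetStreamSelection(
            MF_SOURCE_READER_FIRST_VIDEO_STREAM, TRUE);
    return hr;
}

// Each read then targets the video stream explicitly.
HRESULT ReadNextFrame(IMFSourceReader* pReader, IMFSample** ppSample)
{
    DWORD actualStream = 0, flags = 0;
    LONGLONG timestamp = 0; // 100-ns units
    return pReader->ReadSample(MF_SOURCE_READER_FIRST_VIDEO_STREAM, 0,
                               &actualStream, &flags, &timestamp, ppSample);
}
```

Because the two readers are separate objects, seeking one (via SetCurrentPosition) has no effect on the other, which is exactly what the “shift audio +/- 5 seconds” scenario needs.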
With the “Video Sample Reader”, I needed to make sure performance was the priority. I previously wrote about how to achieve GPU acceleration with IMFSourceReader. Not all codecs allow for GPU decoding, but H.264 and VC-1 should in most cases, along with YUV → RGB conversions. When you have a GPU-enabled IMFSourceReader, ReadSample(…) will give you an IMFDXGIBuffer among the IMFSample’s buffers. I have settled on GPU surfaces as the base format returned by my “Video Sample Reader”. They work for rendering to XAML, and they work for sending to encoders.
The “Audio Sample Reader” is slightly less complex. There is no GPU acceleration for audio decoding in Media Foundation. I simply let the consumer set the output format (channels, sample rate, bits per channel), and Media Foundation diligently decompresses and converts to that format. Setting all audio streams to decode to a single uncompressed format will simplify the process of making our own custom audio mixer later.
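Forcing a single output format is one SetCurrentMediaType call. A sketch (the stereo/44.1 kHz/16-bit values are purely illustrative, not a recommendation):

```cpp
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <wrl/client.h>

HRESULT ConfigurePcmOutput(IMFSourceReader* pReader)
{
    Microsoft::WRL::ComPtr<IMFMediaType> spType;
    HRESULT hr = MFCreateMediaType(&spType);
    if (FAILED(hr)) return hr;

    // Ask for uncompressed PCM in one fixed layout.
    spType->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Audio);
    spType->SetGUID(MF_MT_SUBTYPE, MFAudioFormat_PCM);
    spType->SetUINT32(MF_MT_AUDIO_NUM_CHANNELS, 2);
    spType->SetUINT32(MF_MT_AUDIO_SAMPLES_PER_SECOND, 44100);
    spType->SetUINT32(MF_MT_AUDIO_BITS_PER_SAMPLE, 16);

    // The reader inserts whatever decoder/resampler is needed.
    return pReader->SetCurrentMediaType(
        MF_SOURCE_READER_FIRST_AUDIO_STREAM, nullptr, spType.Get());
}
```

With every audio reader configured this way, the mixer only ever sees one sample layout, so mixing reduces to summing buffers.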
The Media Writer
To encode media, I am using the IMFSinkWriter. With this interface one can simply specify an output stream or file, configure encoder parameters for the audio and video streams, and just send uncompressed samples to it. The rest is magically handled for you. Though this API is very easy to use, there are plenty of “gotchas”. The first is that you need to be very aware of which container formats can hold which streams. For instance, an MP4 container cannot hold a VC-1 video stream. WMV (aka ASF) can technically hold “any” format, but compatibility with many players is minimal. The second issue is that you must be very aware of what parameters to use for the codec. Some audio codecs only work with very specific bitrates, and not all of them are fully published either. You can find valid WMA bitrates and related information in my rant here.
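The basic setup pattern looks like this sketch: one media type describes what you want encoded (the output), a second describes what you will feed it (the input), and the writer wires up the encoder in between. The size, frame rate and bitrate values are illustrative only:

```cpp
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <wrl/client.h>

HRESULT CreateMp4Writer(LPCWSTR url, IMFSinkWriter** ppWriter,
                        DWORD* pStreamIndex)
{
    Microsoft::WRL::ComPtr<IMFSinkWriter> spWriter;
    HRESULT hr = MFCreateSinkWriterFromURL(url, nullptr, nullptr, &spWriter);
    if (FAILED(hr)) return hr;

    // Output (encoded) type: H.264, which MP4 can legally contain.
    Microsoft::WRL::ComPtr<IMFMediaType> spOut;
    MFCreateMediaType(&spOut);
    spOut->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video);
    spOut->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_H264);
    spOut->SetUINT32(MF_MT_AVG_BITRATE, 4000000);
    MFSetAttributeSize(spOut.Get(), MF_MT_FRAME_SIZE, 1280, 720);
    MFSetAttributeRatio(spOut.Get(), MF_MT_FRAME_RATE, 30, 1);
    spOut->SetUINT32(MF_MT_INTERLACE_MODE, MFVideoInterlace_Progressive);

    DWORD stream = 0;
    hr = spWriter->AddStream(spOut.Get(), &stream);
    if (FAILED(hr)) return hr;

    // Input (uncompressed) type: what our video reader hands us.
    Microsoft::WRL::ComPtr<IMFMediaType> spIn;
    MFCreateMediaType(&spIn);
    spIn->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video);
    spIn->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_RGB32);
    MFSetAttributeSize(spIn.Get(), MF_MT_FRAME_SIZE, 1280, 720);
    MFSetAttributeRatio(spIn.Get(), MF_MT_FRAME_RATE, 30, 1);
    spIn->SetUINT32(MF_MT_INTERLACE_MODE, MFVideoInterlace_Progressive);
    hr = spWriter->SetInputMediaType(stream, spIn.Get(), nullptr);
    if (FAILED(hr)) return hr;

    hr = spWriter->BeginWriting();
    if (FAILED(hr)) return hr;

    *pStreamIndex = stream;
    *ppWriter = spWriter.Detach();
    return S_OK;
}
```

After BeginWriting, the loop is just WriteSample per frame and Finalize at the end.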
There is also a more subtle “gotcha” that I came across. When I first tested the writer, the output video looked very “jerky”. I double- and triple-checked the time stamps, but everything looked fine. It turned out that every time I sent a video sample to the writer, it would not encode immediately; the writer needs at least a few samples in its queue to do its job. In essence, I kept modifying the output buffer before it got encoded. To solve this, I created an IMFVideoSampleAllocator. Before writing a video sample, I first grab a sample from the allocator, copy the source sample into it, then send the copy to the sink writer. Once the sink writer is finished with the sample, it is returned to the allocator.
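A sketch of that copy-before-write pattern (assuming the allocator has already been set up with IMFVideoSampleAllocator::InitializeSampleAllocator, and using a plain CPU-side buffer copy for clarity; a GPU sample would rather be copied surface-to-surface):

```cpp
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <wrl/client.h>
#include <cstring>

HRESULT WriteVideoSampleCopy(IMFSinkWriter* pWriter, DWORD streamIndex,
                             IMFVideoSampleAllocator* pAllocator,
                             IMFSample* pSrc)
{
    // Grab a pooled sample; it returns to the pool when the writer
    // releases it.
    Microsoft::WRL::ComPtr<IMFSample> spCopy;
    HRESULT hr = pAllocator->AllocateSample(&spCopy);
    if (FAILED(hr)) return hr;

    // Carry the timing over to the copy.
    LONGLONG time = 0, duration = 0;
    pSrc->GetSampleTime(&time);
    pSrc->GetSampleDuration(&duration);
    spCopy->SetSampleTime(time);
    spCopy->SetSampleDuration(duration);

    // Copy the pixel data into the pooled sample's buffer.
    Microsoft::WRL::ComPtr<IMFMediaBuffer> spSrcBuf, spDstBuf;
    pSrc->ConvertToContiguousBuffer(&spSrcBuf);
    spCopy->GetBufferByIndex(0, &spDstBuf);

    BYTE* pSrcData = nullptr; DWORD srcLen = 0;
    BYTE* pDstData = nullptr; DWORD dstMax = 0;
    spSrcBuf->Lock(&pSrcData, nullptr, &srcLen);
    spDstBuf->Lock(&pDstData, &dstMax, nullptr);
    if (dstMax >= srcLen)
        memcpy(pDstData, pSrcData, srcLen);
    spDstBuf->Unlock();
    spSrcBuf->Unlock();
    spDstBuf->SetCurrentLength(srcLen);

    // The writer now owns a stable copy we will never touch again.
    return pWriter->WriteSample(streamIndex, spCopy.Get());
}
```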
The Media Sample Player
Now I have a “media type detector”, “media readers” and a “media writer”. These are the bare essentials for being able to transcode any format to any format, which is important when a user wishes to compile their editing project to a media file as fast as the device can. What we need now is a way to show a “live preview” to the user. The “media readers” just read as fast as they can; they do not render audio and do not keep the audio and video in sync. Audio rendering is not a problem with the Metro APIs, but writing clocks, syncing streams and doing correct timing is not easy or fun. The solution I came up with is to leverage the IMFMediaEngine and write a custom media source.
The IMFMediaEngine can be configured to use a custom media source. Typically folks use this functionality to implement a custom protocol (e.g. RTSP, RTP, etc.) that plugs right into the Media Foundation pipeline. For our situation, we want to make a “virtual media source”. By “virtual”, I mean the media source’s streams don’t really exist as a file; they are created on the fly from the user’s video editing session (picture a non-linear video editor UI). The source receives output from the audio and video mixers, which I have yet to discuss. So think of this virtual source as a virtual file: when the IMFMediaEngine asks for the next sample, the mixers compose an audio or video sample based on the current media time. This could involve reading a new sample from the “media readers”. If the IMFMediaEngine is told to seek, the application will virtually seek within the editing session, rendering audio and video from the mixers.
One “gotcha” with this setup is that IMFMediaEngine will not let you directly instantiate your custom media source. That is a requirement for me, as I need to be able to pump it with media samples to be rendered. The work-around is to implement IMFMediaEngineExtension and register it with the MF_MEDIA_ENGINE_EXTENSION attribute when you create the IMFMediaEngine. When you then tell the IMFMediaEngine to open a custom URI, e.g. “myproto://”, it will call into your extension, where you can instantiate your custom source, keep a reference to it and pass it back to the IMFMediaEngine.
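The registration side looks roughly like this sketch. The extension and notify objects (m_spExtension, m_spNotify) are hypothetical members implementing IMFMediaEngineExtension and IMFMediaEngineNotify respectively, and “myproto://session” is just my made-up scheme:

```cpp
#include <mfapi.h>
#include <mfmediaengine.h>
#include <wrl/client.h>

HRESULT CreateEngineWithVirtualSource(
    IMFMediaEngineExtension* pExtension,  // our extension (m_spExtension)
    IMFMediaEngineNotify* pNotify,        // event callback (m_spNotify)
    IMFMediaEngine** ppEngine)
{
    Microsoft::WRL::ComPtr<IMFMediaEngineClassFactory> spFactory;
    HRESULT hr = CoCreateInstance(CLSID_MFMediaEngineClassFactory, nullptr,
                                  CLSCTX_INPROC_SERVER,
                                  IID_PPV_ARGS(&spFactory));
    if (FAILED(hr)) return hr;

    Microsoft::WRL::ComPtr<IMFAttributes> spAttr;
    MFCreateAttributes(&spAttr, 2);
    spAttr->SetUnknown(MF_MEDIA_ENGINE_CALLBACK, pNotify);
    spAttr->SetUnknown(MF_MEDIA_ENGINE_EXTENSION, pExtension);

    Microsoft::WRL::ComPtr<IMFMediaEngine> spEngine;
    hr = spFactory->CreateInstance(0, spAttr.Get(), &spEngine);
    if (FAILED(hr)) return hr;

    // Pointing the engine at our private scheme triggers the
    // extension's BeginCreateObject, where we hand back the
    // already-created virtual media source.
    BSTR url = SysAllocString(L"myproto://session");
    hr = spEngine->SetSource(url);
    SysFreeString(url);
    if (FAILED(hr)) return hr;

    *ppEngine = spEngine.Detach();
    return S_OK;
}
```

Inside the extension, BeginCreateObject/EndCreateObject follow the standard Media Foundation async pattern; the only “trick” is that the object you complete with is the source instance you created yourself and kept a reference to.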
Note: When IMFMediaEngine calls your custom IMFMediaStream::RequestSample(…), DO NOT synchronously call IMFSourceReader::ReadSample(…). It will deadlock. Instead, put the IMFSourceReader in asynchronous mode, or use something like the Microsoft PPL library to call IMFSourceReader::ReadSample asynchronously.
At this moment, this is all I have written. I plan on tackling the infrastructure for the video mixer next, which involves fun stuff like Direct3D/2D. I will post about it when I have something more solid. That will be exciting for me, as I can finally demonstrate the output of the system as it stands. Right now, to you, it’s just a long, rambling blog post.