Saturday, January 17, 2015

Handling audio and video

As you know, I have been working on WebcamStudio and ScreenStudio for the last few years.  The main issue I faced while developing those projects was keeping audio and video synchronized.


You have to keep in mind that you are always limited by the power of the CPU.  A faster machine means that you will be able to capture larger displays and encode into a more compressed format.  Capturing a webcam at 320x240 is actually quite easy to do.  Capturing your whole desktop is another thing.  The video format also puts a lot of stress on the CPU, as you need to encode the video source (and audio) in realtime.

Assuming that the capture is done at 30 fps, you will have about 33 milliseconds (1000 / 30) to capture one frame, apply any video effect on the image and encode it into the video stream.  I'm telling you, 33 milliseconds is really not a long time to do all this work.  For each frame, WebcamStudio has to:
  • Capture a single frame from all video sources (screen, webcam, image, etc.)
  • Apply effects on each frame
  • Merge all captured frames into a single frame by compositing each image one over the other in the proper order
  • Push the final output frame to the encoder buffer
  • Encode the frames from the buffer into the video file
  • Synchronize captured audio with the video stream in the video file
The same goes for audio in WebcamStudio.  Frankly, a lot is happening in a 33-millisecond timespan.
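
To put numbers on that budget, here is a minimal Python sketch.  The stage names and their costs are made up for illustration; they are not measurements from WebcamStudio, and the helper functions are hypothetical.

```python
def frame_budget_ms(fps):
    """Time available to process one frame, in milliseconds."""
    return 1000.0 / fps


def fits_budget(stage_ms, fps):
    """Return (total_ms, budget_ms, ok) for a dict of per-stage costs."""
    total = sum(stage_ms.values())
    budget = frame_budget_ms(fps)
    return total, budget, total <= budget


# Hypothetical per-frame costs in milliseconds (illustrative only).
stages = {
    "capture": 8.0,
    "effects": 6.0,
    "composite": 5.0,
    "encode": 12.0,
}

total, budget, ok = fits_budget(stages, fps=30)
print(f"budget {budget:.1f} ms, used {total:.1f} ms, fits: {ok}")
# prints "budget 33.3 ms, used 31.0 ms, fits: True"
```

With the same made-up costs at 60 fps, the budget drops to about 16.7 ms and the frame no longer fits, which is why a slower machine has to lower the frame rate or the capture size.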



ScreenStudio has a simpler approach.  It relies on FFMPEG or AVCONV to capture the desktop into a video stream.  Where WebcamStudio does all the work itself, ScreenStudio delegates that job to FFMPEG/AVCONV.
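
As a rough idea of what delegating to FFMPEG can look like on Linux, here is a small Python sketch that builds a capture command line.  This is not the command ScreenStudio actually runs; the function name, the X11 display ":0.0", the PulseAudio source "default" and the output file name are all assumptions for illustration.

```python
def build_capture_command(width, height, fps, output_path, display=":0.0"):
    """Build an ffmpeg argument list for an X11 desktop + PulseAudio capture.

    Assumes a Linux machine with an X11 display and a PulseAudio source
    named "default"; adjust both for a real setup.
    """
    return [
        "ffmpeg",
        "-f", "x11grab",                    # X11 screen grabbing input
        "-framerate", str(fps),
        "-video_size", f"{width}x{height}",
        "-i", display,
        "-f", "pulse",                      # PulseAudio input
        "-i", "default",
        output_path,
    ]


command = build_capture_command(1280, 720, 30, "capture.mkv")
print(" ".join(command))
```

The list can be handed to `subprocess.run(command)` to start the capture; the point is only that a single external process takes care of grabbing, encoding and muxing.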

The biggest issue when capturing audio and video is synchronizing them.  Look around the web: the issue of unsynchronized audio and video is everywhere.  Why is it so hard to do?

At the low level, when you capture a single image frame from a video source or a single audio frame from an audio source, a timestamp is attached to the frame.  This timestamp (PTS: Presentation TimeStamp) tells the encoder when to use this frame in the final video stream.  The capturing software relies on the driver to set the proper timestamp on each frame.  Sadly, each driver (video or audio) will provide one of two kinds of value:
  • Zero based:  The first captured frame is 0, the second is 1, and so on...
  • Time based: Each captured frame is using the current clock time.
When using multiple capture sources (audio and video), it becomes harder to synchronize them if they are using different kinds of timestamp values.  So you might think that we just need a conversion to set every input stream to the same type of PTS...  Wrong!  Technically it would work in an ideal world, but we are not living in one.
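
For reference, the naive conversion mentioned above can be sketched in a few lines of Python.  The function names and the sample values are hypothetical; the sketch only shows how both kinds of PTS can be brought to "seconds since the first frame", which is exactly the step that is not sufficient in practice.

```python
def zero_based_to_seconds(frame_index, fps):
    """Zero-based PTS (0, 1, 2, ...) converted to seconds since the first frame."""
    return frame_index / fps


def clock_based_to_seconds(timestamp, first_timestamp):
    """Clock-based PTS converted to seconds since the first frame."""
    return timestamp - first_timestamp


# Hypothetical video stream: zero-based frame counter at 25 fps.
video_pts = [zero_based_to_seconds(i, fps=25) for i in range(3)]

# Hypothetical audio stream: clock time in seconds.
audio_clock = [1000.0, 1000.5, 1001.0]
audio_pts = [clock_based_to_seconds(t, audio_clock[0]) for t in audio_clock]

print(video_pts)
print(audio_pts)
```

Both streams now start at 0, but nothing in the result says whether their first frames describe the same real-world instant.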

For example, when capturing images from a webcam, you are totally dependent on the webcam driver.  The first image provided by the driver can have the right timestamp if it's based on the clock time, but by converting this timestamp to a zero-based one, you are losing the exact time of that frame.  There is no way to know if the first image was captured by the webcam just now, a few milliseconds ago or a whole second ago.  This is due to the fact that the driver is buffering a few frames before making them available to the capturing software.
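
Here is a small Python simulation of that problem.  All the numbers are invented, including the 0.25 second driver buffer delay; the sketch only illustrates how rebasing timestamps to zero hides the delay from the capturing software.

```python
def rebase(timestamps):
    """Shift a list of timestamps so that the first one becomes 0."""
    first = timestamps[0]
    return [t - first for t in timestamps]


# Hypothetical delay added by the webcam driver's buffer, in seconds.
driver_buffer_delay = 0.25

# Moments when the webcam really captured each image (unknown to the software).
video_captured_at = [10.0, 10.5, 11.0]
# Moments when the driver hands the images over to the software.
video_delivered_at = [t + driver_buffer_delay for t in video_captured_at]

# Audio that starts arriving at the same time as the first delivered image.
audio_captured_at = [10.25, 10.75, 11.25]

video_pts = rebase(video_delivered_at)
audio_pts = rebase(audio_captured_at)

print(video_pts)  # [0.0, 0.5, 1.0]
print(audio_pts)  # [0.0, 0.5, 1.0]

# The timestamps look perfectly aligned, yet the first image is older
# than the first audio sample by the driver delay.
hidden_offset = audio_captured_at[0] - video_captured_at[0]
print(hidden_offset)  # 0.25
```

Once both streams are rebased, the 0.25 second gap is no longer visible anywhere in the timestamps, so the encoder has no way to correct it.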


Have you noticed that when displaying your webcam with any software (Cheese, PhotoBooth, etc.), there is a slight delay before seeing the first image?  In some cases, you can even see a really small delay between what you are doing and what is actually displayed on your computer screen.  Just showing the webcam on the computer screen does not require a lot of processing power.  Capture the image and paint it on the screen.  Easy, right?



Just do this simple test:  Display your webcam on your screen, and count to 5 with your voice and fingers at the same time.  You will see a small delay between what you hear from your voice and what is displayed on your computer screen.  In my experience, the difference for this live capture is always around 100 milliseconds.  It's not really big and it's almost unnoticeable, but it can be a real pain when developing software that will capture your computer screen, your webcam and the microphone in realtime.

To achieve perfect synchronization when doing a realtime audio and video capture, many techniques can be used.  And frankly, sometimes, it's almost impossible to do...
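
One of the simplest techniques is to shift one stream by a fixed, measured offset.  The Python sketch below is only an illustration and not what WebcamStudio or ScreenStudio does; the function name is hypothetical, and the 100 millisecond value is an assumption that would have to be measured for each setup.

```python
def shift_pts(pts_ms, offset_ms):
    """Return a new list of timestamps (in milliseconds) shifted by offset_ms."""
    return [t + offset_ms for t in pts_ms]


# Hypothetical audio timestamps in milliseconds, one every 20 ms.
audio_pts_ms = [0, 20, 40, 60]

# Assumed video latency; delaying the audio by the same amount
# lines it up with the late video frames.
video_latency_ms = 100

delayed_audio_pts_ms = shift_pts(audio_pts_ms, video_latency_ms)
print(delayed_audio_pts_ms)  # [100, 120, 140, 160]
```

A fixed offset only helps when the delay stays constant; if the driver's buffering varies during the capture, the streams will drift apart again.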

Hope you enjoyed!

Patrick