Parallelizing FLAC Encoding

Fred Akalin

One thing I noticed ever since getting a multi-core system was that the reference FLAC encoder is not multi-threaded. This isn't a huge problem for most people as you can simply encode multiple files at the same time but I usually rip my audio CDs into a single audio file with a cue sheet instead of separate track files and so I am usually encoding a single large audio file instead of multiple smaller ones. Even so, encoding a CD-length audio file takes under a minute but I thought it would be a fun and useful weekend project to see if I could parallelize the simpler example encoder. The format specification indicates that input blocks are encoded independently which makes the problem embarassingly parallel and trawling through the FLAC mailing lists reveals that no one has had the time nor the inclination to look into it.

However, I was able to write a multithreaded FLAC encoder that achieves near-linear speedup with only minor hacks to the libFLAC API. Here are some encode times on an 8-core 2.8 GHz Xeon 5400 for a 636 MB wave file (some caveats are discussed below):

baseline34.906s
1 threads31.424s
2 threads16.936s
4 threads10.173s
8 threads6.808s

I took the simple approach of sharding the input file into n roughly equal pieces and passing them to n encoder threads, assembling the output file from the n output buffers. In general this is not a good way of partitioning the workload as time is wasted if one shard takes significantly more time to process but for my use case this isn't a significant problem.

The best way to share the input file among the encoding threads is to map it into memory. In fact, memory-mapped file I/O has so many advantages in general that I'm surprised at how little I see it used, although it does have the disadvantage of requiring a bit more bookkeeping. Here is how I use it in my multithreaded encoder (slightly paraphrased):

#include <fcntl.h> /* open() */
#include <sys/mman.h> /* mmap()/munmap() */
#include <sys/stat.h> /* stat() */
#include <unistd.h> /* close() */

int main(int argc, char *argv[]) {
  int fdin;
  struct stat buf;
  char *bufin;

  fdin = open(argv[1], O_RDONLY);
  fstat(fdin, &buf);
  bufin = mmap(NULL, buf.st_size, PROT_READ, MAP_SHARED, fdin, 0);

  /* The input file (passed in via argv[1]) is now mapped read-only to
     the memory region in bufin up to bufin + buf.st_size. */

  /* Note that you can work directly with the mapped input file
     instead of fread()ing the header into a buffer. */
  if((buf.st_size < WAV_HEADER_SIZE) ||
     memcmp(bufin, "RIFF", 4) ||
     memcmp(bufin+8, "WAVEfmt \020\000\000\000\001\000\002\000", 16) ||
     memcmp(bufin+32, "\004\000\020\000data", 8)) {
    /* Invalid input file: print error and exit. */
  }

  for (i = 0; i < num_threads; ++i) {
    shard_infos[i].bufin = bufin + WAV_HEADER_SIZE + i * bytes_per_thread;
    /* bufsize for the last thread may be slightly larger. */
    shard_infos[i].bufsize = bytes_per_thread;
  }

  /* Spawn encode threads (which calls encode_shard() below) passing
     an element of shard_infos to each. */

  ...

  munmap(bufin, buf.st_size);
  close(fdin);
}

FLAC__bool encode_shard(struct shard_info *shard_info) {
  FLAC__StreamEncoder *encoder = FLAC__stream_encoder_new();

  ...

  /* The input file is paged in lazily as this function accesses
     bufin from shard_info->bufin. */
  FLAC__stream_encoder_process_interleaved(encoder,
                                           shard_info->bufin,
                                           shard_info->bufsize);

  ...

  FLAC__stream_encoder_delete(encoder);
}

However, handling the output file is a bit trickier. Since the encoded FLAC data output by the threads vary in size we have to wait until all encoding threads are done before we know the right offsets to write the output data. A convenient and fast way to handle this is to use asynchronous I/O; we know where to write the output data for a shard as soon as the encoding thread for all previous shards finish so we simply wait for the encoding threads in shard order and queue up a write request after each thread finishes. Here I use the POSIX asynchronous I/O API in my multithreaded encoder (again, slightly paraphrased):

#include <aio.h> /* aio_*() */
#include <pthread.h> /* pthread_*() */
#include <string.h> /* memset() */

int main(int argc, char *argv[]) {
  int fdout;
  pthread_t threads[MAX_THREADS];
  struct aiocb aiocbs[MAX_THREADS];
  unsigned long byte_offset = 0;

  /* mmap input file in. */

  ...

  fdout = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC);

  /* Spawn encode threads passing an element of shard_infos to
     each. */

  ...

  /* Wait for each thread in sequence and queue up output writes. */

  /* We need to zero out any aiocb struct that we use before we fill
     in any members. */
  memset(aiocbs, 0, num_threads * sizeof(*aiocbs));
  for (i = 0; i < num_threads; ++i) {
    pthread_join(threads[i], NULL);
    aiocbs[i].aio_buf = shard_infos[i].bufout;
    aiocbs[i].aio_nbytes = shards_infos[i].bytes_written;
    aiocbs[i].aio_offset = byte_offset;
    aiocbs[i].aio_fildes = fdout;
    aio_write(&aiocbs[i]);
    byte_offset += shard_infos[i].bytes_written;
  }

  /* Wait for all output writes to finish. */

  for (i = 0; i < num_threads; ++i) {
    const struct aiocb *aiocbp = &aiocbs[i];
    aio_suspend(&aiocbp, 1, NULL);
    aio_return(&aiocbs[i]);
  }

  close(fdout);
}

The POSIX API is a bit unwieldy for this use case; ideally, there would be a version of aio_suspend() that would suspend the calling process until all of the specified requests have completed. As it is now the simplest way is to loop through the requests as above, especially since the maximum number of simultaneous asynchronous I/O requests is usually quite small (16 on my system).

Also, I found that the OS X implementation of aio_write() did not obey this part of the specified behavior:

  If O_APPEND is set for aiocbp->aio_fildes, aio_write() operations append
  to the file in the same order as the calls were made.  If O_APPEND is not
  set for the file descriptor, the write operation will occur at the abso-
  lute position from the beginning of the file plus aiocbp->aio_offset.

but it was just as easy (and clearer) to explicitly set the correct offset.

I had to hack up libFLAC a bit to implement my multithreaded encoder. I exposed the update_metadata_() to make it easy to write the correct number of total samples in the metadata field and also to zero out the min/max framesize fields. I also exposed the FLAC__stream_encoder_set_do_md5() function (which it should have been in the first place) so that I can turn off the writing of md5 field in the metadata. Finally, I added the function FLAC__stream_encoder_set_current_frame_number() so that the correct frame numbers are written at encode time.

For comparison purposes I turn off md5 calculation in my multithreaded encoder as well as the baseline one. Since calling FLAC__stream_encoder_set_current_frame_number() causes crashes with vericiation turned on I also turn that off. The numbers above reflect that so they're underestimates of how a production multithreaded encoder would perform. However, the essential behavior of the program shouldn't change much.

Here is a patch file for the flac 1.2.1 source that implements the hacks I described above. Here is the source for my multithreaded FLAC encoder. I've tested it with i686-apple-darwin9-gcc-4.0.1 and i686-apple-darwin9-gcc-4.2.1 on Mac OS X. I got the above numbers compiling mt_encode.c with gcc 4.2.1 and the switches -Wall -Werror -g -O2 -ansi.