Digital Audio Basics

dgatwood is out. Leave a message.
What do we mean when we say "digital audio"? At its most basic, digital audio is audio in any format that consists of ones and zeroes. However, there are several different formats that are all called "digital audio", and they are not all equal. This brief explanation describes some of the common audio formats and how they differ.



LPCM (linear pulse code modulation)


This is the most common type of digital audio. For the most part, if an audio file is a WAV file, an AIFF file, an SDII file, etc., you're using PCM audio. An audio CD uses PCM, as do Blu-Ray discs and even a handful of DVDs (rare because of the bit rate).

PCM audio can be represented in several different forms:

  • Unsigned integer—values range from 0-255 (8-bit), 0-65,535 (16-bit), or in theory, 0-16,777,215 (24-bit). The minimum (largest negative) and maximum (largest positive) input voltages supported by the particular audio interface are mapped onto the minimum and maximum values, respectively, with a 0 volt signal falling right in the middle of the range. This format is relatively uncommon in practice because it was mostly used for 8-bit audio, which is also relatively uncommon these days.
  • Signed integer—values range from -128 to 127 (8-bit), -32,768 to 32,767 (16-bit), or -8,388,608 to 8,388,607 (24-bit). The minimum (largest negative) and maximum (largest positive) input voltages supported by the particular audio interface are mapped onto the minimum and maximum values, respectively, with a 0 volt signal falling right in the middle of the range. This is a relatively common storage format, and is the format used for most audio files on disk. CDs use this format, as do the uncompressed formats on DVDs and Blu-Ray discs.
  • Floating point—values range from some ridiculously large negative value to some ridiculously large positive value. A 0 volt signal is encoded as 0.0. The minimum (largest negative) and maximum (largest positive) input voltages supported by the particular audio interface are mapped onto -1.0 and +1.0. This format is almost nonexistent as an on-disk format, but is heavily used in audio plug-in architectures. The reason for using it for audio plug-ins is that it reduces the need to normalize the output of each plug-in during processing. A 32-bit floating-point value can hold a 24-bit integer value with both headroom and additional precision for smaller values. The result is that if a plug-in produces too loud or too soft a signal on its output, you can compensate for it in a subsequent gain stage without added noise or clipping.
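As a concrete sketch of how the integer and floating-point representations relate, here is one common way a signed 16-bit sample might be mapped into the float range used by plug-ins. The 32768 scale factor and the hard clipping are illustrative assumptions; real hosts and converters differ in the details:

```python
# Map between signed 16-bit PCM and the nominal -1.0..+1.0 float range.
# The 32768 scale factor and hard clipping are illustrative assumptions;
# real hosts and converters handle the asymmetric range differently.

def int16_to_float(sample: int) -> float:
    """Map -32768..32767 onto roughly -1.0..+1.0."""
    return sample / 32768.0

def float_to_int16(sample: float) -> int:
    """Map a float sample back to signed 16-bit, clipping out-of-range input."""
    scaled = int(round(sample * 32768.0))
    return max(-32768, min(32767, scaled))

print(int16_to_float(-32768))  # -1.0
print(int16_to_float(16384))   # 0.5
print(float_to_int16(2.0))     # 32767 -- too-loud input clips at the rails
```

Note how a float well above 1.0 survives intact until the final conversion back to integers; that is exactly the headroom argument made above.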
"But wait a minute," you say. "I thought everything was ones and zeroes. Now you're telling me that it can be 65,535 or 0.5. What's going on?"

With computers, numbers are represented in a format called binary. This means that everything is physically either a zero or a one. (Well, to be precise, everything is a voltage level that is either close to zero or close to Vcc, but that's not important right now.)

A single zero or one is called a "bit". Individually, these are interesting only occasionally in computer science—mostly when dealing with permissions, hardware drivers, or crypto. For the most part, when you're writing software, you aren't concerned with bits individually. Computers don't generally do much with a single zero or one at a time. Instead, they group them into collections of bits with special names.

A group of four bits is called a "nibble". These are mostly interesting for notational convenience, though in historical architectures they were sometimes more significant, if memory serves. A nibble is commonly represented as a single base-16 (hexadecimal) digit. In base 16, you count 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, 10, ... 19, 1A, 1B, ... 1F, 20, 21, and so on. This is important, and you'll see why later.

A group of (usually) eight bits is called a "byte". These are the fundamental unit of data flow in most computer architectures. A byte is two digits in base 16.
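For instance, the bit/nibble/byte relationship can be seen directly (Python shown here purely as an illustration vehicle):

```python
# One byte is 8 bits, which is 2 hex digits; each hex digit is one nibble.
value = 0b10110100           # eight bits, i.e. one byte
print(format(value, '08b'))  # '10110100' -- the raw bits
print(format(value, '02x'))  # 'b4'       -- two nibbles, one hex digit each
high_nibble = value >> 4     # 0b1011 == 0xb
low_nibble = value & 0x0f    # 0b0100 == 0x4
```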

Computers also work with larger groups of bits. A "word" is a group of several bytes (the actual number depends on the type of CPU you're talking about) that is usually the same size as the CPU's registers and memory addresses. So for a 32-bit CPU, the word length is typically 32 bits; for a 64-bit CPU, the word length is 64 bits; and so on.

Computers store data into memory in different ways depending on whether the numbers are unsigned integers, signed integers, or floating-point values, and depending on certain aspects of the CPUs themselves.

Before we go further, I should point out that to a computer, all of these are a series of bytes. The only thing that determines whether a number is an unsigned value, a signed value, a floating-point value, etc. is that the programmer has told the computer to treat that number as a signed value or whatever. Otherwise, they're just bytes.


Unsigned Integers

First, we'll look at unsigned integers. Unsigned integers are easy. To see an unsigned integer as a computer sees it, you simply convert the number into base 16 (hexadecimal).

Then, you must pad the left end of the number with zeroes. If you are working with an 8-bit value, you must have exactly two hex digits (one byte) in total. If you are working with a 16-bit value, you must have four hex digits. A 24-bit value is generally stored on disk with three bytes (six hex digits), but when the processor is working on it, the value gets copied into a special area of temporary storage inside your microprocessor (commonly known as a register) that is four bytes (32 bits) long, so the way the computer sees it depends on whether you're talking about the way it looks in memory or inside the CPU.

Finally, you must cut the number up into groups of two hex digits each. Each of these is stored as a separate, individually-addressable byte inside your computer.
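The pad-and-split steps above can be sketched like this (the value 1000 is an arbitrary example):

```python
# Convert an unsigned integer to hex, pad to the full 16-bit width,
# then cut it into two-digit (one-byte) groups.
value = 1000
hex16 = format(value, '04x')                            # '03e8'
byte_groups = [hex16[i:i + 2] for i in range(0, 4, 2)]  # ['03', 'e8']
print(byte_groups)
```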

Now the order in which you list these numbers depends on your CPU. With Intel computers (and most other modern CPUs, such as ARM in its usual configuration), data is stored in what is known as "little endian" format. This means that the first pair of digits stored in your computer's memory is the pair on the right end of that number, moving towards your left. Other architectures (PowerPC, SPARC, 68K, and most network byte ordering) store data in "big endian" format, which means that data is stored in the order that most humans normally read numbers (from left to right).

Note: There are actually some CPUs that use other orders besides big or little endian, but these are the only two you are likely to encounter in practice.
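To see both byte orders concretely, here's a small sketch using Python's struct module:

```python
import struct

# The same 16-bit value, 1000 (0x03e8), stored both ways.
little = struct.pack('<H', 1000)  # least significant byte first
big = struct.pack('>H', 1000)     # most significant byte first
print(little.hex())  # 'e803' -- the right-hand byte pair comes first
print(big.hex())     # '03e8' -- reads left to right, as humans do
```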


Signed Integers

Now for signed integers. Signed integers are stored in one of two ways: ones' complement or two's complement, depending on the architecture. Most modern computers have settled on two's complement, so that's all I'm going to describe here.

With two's complement, positive values are stored as-is, so a 37 is stored as 37, or 0x25. Negative values are stored by taking the bitwise inverse of the corresponding positive value and adding one; the leftmost bit of a negative number is always 1. So 13 in an 8-bit binary value is 0b00001101, and -13 in binary is 0b11110011. Thus, the leftmost digit tells you whether the value is positive or negative. When you look at this in hex, 13 is 0x0d, and -13 is 0xf3. (You will notice that we generally precede numbers in binary with 0b as a notational convention, and similarly prefix numbers in hexadecimal with 0x.)

This encoding is convenient because you can do things like add and subtract numbers without worrying about whether you are mixing signs. If you add the two numbers above together, you get all zeroes (with a carry of 1, which you ignore because it falls outside the allowable range for an 8-bit value).

These numbers are similarly cut up and stored in either big or little endian format, and are padded much like unsigned integers, with one exception: because the leftmost bit determines sign, a signed integer must be padded with the leftmost digit. Thus, a positive number is padded with zeroes, and a negative number is padded with ones in binary, 0xff bytes in hex.
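The two's complement behavior described above can be verified with a short sketch (again, Python purely as a demonstration vehicle):

```python
import struct

# -13 in two's complement: invert the bits of 13 and add one.
assert (~13 + 1) & 0xff == 0xf3

# Packed as signed 8-bit and 16-bit values; the wider one is padded
# ("sign-extended") with 0xff.
print(struct.pack('>b', -13).hex())  # 'f3'
print(struct.pack('>h', -13).hex())  # 'fff3'

# Adding 13 and -13 in 8 bits gives all zeroes; the carry bit falls off.
print((0x0d + 0xf3) & 0xff)          # 0
```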


Floating-Point Numbers

Floating-point numbers are stored in a number of different formats internally (with Intel architectures having some particularly oddball variations), but ultimately all of the interesting ones tend to boil down to dividing a series of bits (for example, 32 bits for a 32-bit floating-point value) into three parts: a sign bit (0 for positive, 1 for negative), an exponent, and a fraction, generally in that order.

The fraction is multiplied by 2 raised to the exponent minus the exponent bias (if applicable). So for an example, consider the following IEEE 32-bit (4 byte) value:

0xff800003

In binary, this is

0b11111111 10000000 00000000 00000011

The top bit is the sign bit. The next eight are the exponent, and the remaining bits are the fraction. Thus, the fractional part is 3 and the exponent part is 8 bits of 1s, which is 255. In this case, we will treat that exponent part as being an unsigned value with a bias of 127, so subtract this bias and you get 128. The sign bit is set, so the value is negative. Thus, the bit pattern above is roughly equivalent to:

-3 * (2^128)

Which is negative and on the order of a one with 39 zeroes after it.... (Strictly speaking, IEEE 754 reserves an all-ones exponent for infinities and NaN values, and normal numbers carry an implicit leading 1 on the fraction, so treat this as a simplified reading of the bit pattern.)

The details of the number of bits for the exponent part and the fraction part differ depending on whether you are talking about IEEE 32-bit floating-point values, 64-bit values, 128-bit values, or that weird 80-bit format the Intel chips support. Also, in some hardware supporting IEEE floating point, instead of using an unsigned integer with a bias for the exponent part, it treats those bits as a signed integer (of a very weird length). In short, if you wanted to know why this is not commonly written out to disk, there's your answer.
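To make the sign/exponent/fraction split concrete, here's a sketch that pulls apart the single-precision encoding of an ordinary value (-6.5) and rebuilds it, including the implicit leading 1 that normal IEEE numbers carry:

```python
import struct

# Reinterpret the bits of the 32-bit float -6.5 as an unsigned integer.
bits = struct.unpack('>I', struct.pack('>f', -6.5))[0]

sign = bits >> 31               # 1 means negative
exponent = (bits >> 23) & 0xff  # stored with a bias of 127
fraction = bits & 0x7fffff      # the low 23 bits

# Normal numbers prepend an implicit 1 to the fraction bits.
value = (-1) ** sign * (1 + fraction / 2 ** 23) * 2 ** (exponent - 127)
print(value)  # -6.5
```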


Other formats

There are a number of common compressed audio formats, many of which are variants on other formats. This list is not exhaustive. In fact, it's not even close.


MPEG-1/MPEG-2 Layer 3 (MP3)

This is a common compressed format. It uses perceptual coding to throw away portions of the audio signal that most people won't be able to hear anyway. Its primary advantages are that it stores sound compactly and that it is commonly available. Its primary disadvantage is that there's a lot of distortion (swishing) in the high-frequency content, particularly at lower bit rates.

A related format, MPEG Layer 2 (MP2), is used for some compressed audio tracks on DVDs (in addition to AC-3 and maybe a couple of others). IIRC, MP3 can also be used in Blu-Ray audio, but I don't think this use is common. MP2 and MP3 are not ideal as acquisition formats because they are lossy.


AAC

This is another common compressed format, also based on the principle of perceptual coding. It is commonly used in conjunction with H.264 in AVCHD video recordings, Blu-Ray discs, etc. It is also the default ripping format used by iTunes, and the format sold on the iTunes store. It offers subjectively better sound quality than MP3 at most bit rates. However, it is marginally less commonly available. Like MP3, AAC is not ideal as an acquisition format because it is lossy.


AC-3

This is the Dolby Digital format. It is commonly used for compressing multichannel audio on DVDs and Blu-Ray discs. You probably won't encounter it anywhere else.

 