Where's the audio in an audio file? Please enlighten me.

paw1

New member
One thing I don't really understand about digital audio is how/where the actual sounds are stored/recreated. Hear me out.

Let's say that we've got a 16-bit audio clip at 44.1 kHz:
-16 bits (two bytes) gives each sample a dynamic resolution of 65536 steps from zero to maximum "gain". That's all it does (I think).
-44.1 kHz gives 44100 samples per second.

If I'm not mistaken, this is all the data an audio clip contains.
So, for each sample we get the dynamic level, but a dynamic level of what? There's no more information.
Where's the data that gives pitch and timbre, and what about multiple layers of sounds?

:wtf:
 
All there is is a sequence of "dynamic levels." It's a representation of the electrical signal you want the amplifier to send to your speakers. To be more specific about the electrical signal: what an amplifier does is produce an AC voltage that varies continuously between some positive level and a similar negative level. That makes one or more physical parts of the speaker move in an (approximately) matching manner, which produces a series of pressure variations in the air. That's all sound is: pressure variations in the air that propagate through it, kind of like ripples on the surface of a pond.

Really fast up-and-down values in the file -> really fast up-and-down voltage -> speaker cone (or other driver) moving back and forth really fast -> really quick variations in air pressure = what we call a "high pitched sound" -> really quick movements of your ear drum -> really quick movements of little bones in your ear -> really quick movement of some fluid in a chamber in your ear -> movement of particular little hairs in that fluid that are structured so that they bend in response to quick movements of fluid around them -> electrical signals from nerves attached to those hairs running through a series of cells to your brain -> a perception in your mind of a high pitched sound.

Pitch arises from the frequency of the wave (specifically, in musical terms, the fundamental, or the slowest and largest variation in a periodically-repeating series of values, or "wave").
Timbre arises from the shape of the wave not being just a simple sine curve or, to put it another way, from it having a fundamental plus some other quicker and smaller variations overlaid on top of it (harmonics).
Multiple layers of sound occur the same way they do out in the air: you overlay one series of values (or series of varying air pressures) on top of another. To aid in distinguishing among them, we generally use two sound sources in different places, aka stereo.
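To make that concrete, here's a minimal Python sketch (the frequencies and amplitudes are just illustrative choices) showing that pitch, timbre, and layering all come out of nothing but a single stream of sample values:

```python
import math

SAMPLE_RATE = 44100  # samples per second

def tone(freq_hz, n_samples, amp=0.3):
    # A pure sine wave: just one "dynamic level" per sample.
    return [amp * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(n_samples)]

n = SAMPLE_RATE // 10                 # a tenth of a second
fundamental = tone(440, n)            # pitch: the fundamental, A above middle C
harmonic    = tone(880, n, amp=0.15)  # timbre: a quieter 2nd harmonic on top
other_note  = tone(330, n)            # a second "layer" of sound

# Mixing is just adding sample values together, exactly as
# overlapping pressure waves add in the air.
mixed = [a + b + c for a, b, c in zip(fundamental, harmonic, other_note)]
```

The final `mixed` list is, again, just a sequence of levels — nothing in it says "this part is the harmonic" or "this part is the other note."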

So am I right if I assume that the speaker cone moves "with" the signal?
In the example below, +/- 1 V is the dynamic range of the signal. Amplified, +/- 1 V pushes the speaker exactly to its excursion limits.


Sample 1:   0 V    (signal rising)   ->  speaker in neutral position
Sample 2:  +0.5 V  (signal rising)   ->  halfway outward excursion
Sample 3:  +1 V    (signal falling)  ->  maximum outward excursion
Sample 4:  +0.5 V  (signal falling)  ->  halfway outward excursion
Sample 5:   0 V    (signal falling)  ->  speaker in neutral position
Sample 6:  -0.5 V                    ->  halfway inward excursion


Would this give a correct representation of the signal/speaker relationship?
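My table amounts to a trivially simple mapping. Assuming an idealized, perfectly linear driver (a simplification, of course), it could be sketched like this:

```python
def cone_position(voltage, full_scale=1.0):
    # Fraction of maximum excursion for an idealized linear driver:
    # +1.0 = full outward, -1.0 = full inward, 0.0 = rest position.
    return voltage / full_scale

cone_position(0.0)   # -> 0.0   neutral position
cone_position(0.5)   # -> 0.5   halfway outward
cone_position(-0.5)  # -> -0.5  halfway inward
```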
 
Yeah, that's pretty much right.

The only caveat is that you normally wouldn't think in terms of going all the way to the maximum excursion of the speaker driver. What you're typically trying to do is just move the speaker enough that it produces a sound of the volume (SPL) you want to listen to. That's usually well under the maximum SPL that speaker can produce.

Speakers will generally lose fidelity (i.e. move in ways that fail to produce sound pressure waves that correspond to the voltage waves they receive) as they reach their physical limits. A speaker that's full-on hitting its physical maximum excursion would probably sound pretty bad.

As an aside: perhaps this is just a way of saying that the maximum excursion - if you mean, by "maximum," the furthest it can go and still preserve acceptable fidelity - is less than the physical maximum. Another aside: the example you threw out describes a signal with a full wavelength of 8 samples. At 44.1k samples per second, the sound would have a frequency of about 5.5 kHz. That's well above the range where you find the fundamentals of musical notes. It's where you find percussive sounds, the articulation of consonants in speech and singing, and that sort of thing. For additional context: the fundamental of A above middle C - as most people know - has a "standard" tuning of 440 Hz, which means each cycle plays out over about 100 samples. Sometimes people don't realize quite how fast a 44.1k sampling rate actually is.
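The arithmetic behind those two figures is just the sample rate divided by the cycle length (or frequency):

```python
SAMPLE_RATE = 44100

# A wave whose full cycle spans 8 samples:
freq_8_sample_wave = SAMPLE_RATE / 8        # 5512.5 Hz -- about 5.5 kHz

# Concert A (440 Hz): how many samples per full cycle?
samples_per_cycle_a440 = SAMPLE_RATE / 440  # about 100 samples
```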
 

Yeah, I didn't mean that the cone would go to the physical maximum of the speaker, it was just easier to make a comparison when simplified that way. The way I imagine it, the speaker moves "with" the signal, but in a non-linear fashion.

I think I've got the basic idea of digital audio reproduction after you explained it to me. I thought I already did, but after trying to explain it to a friend of mine, I realized a piece of the puzzle was missing :)

Superb explanation by the way!
 
It is not exactly true that the sample level/voltage corresponds directly to the position of the speaker. It's more about the velocity or acceleration of the speaker cone, but honestly that's where my head explodes.
 
Yeah, the speaker will have mechanical limitations due to its weight, magnet, etc., and subsequent ballistics, which is why we have intermodulation distortion. It simply cannot reproduce an amplified waveform with 100% accuracy. This is a major hurdle in speaker design.

In terms of digital audio, the incoming AC voltage from the analogue domain, say from a microphone preamp into an AD converter, is sampled 44100 times per second (in the case of a 44.1 kHz sampling rate), and its travel is stored in digital "words" (groups of bits). The dynamic resolution of these words can be thought of as the number of "steps" available. This is determined by the word length, and in the case of 16-bit, as you mentioned, we have 65536 steps. This basically serves to "smooth out" the travel of the waveform throughout its dynamic swing across the zero line. The more steps we have, the higher the resolution of the wave's travel. This method of encoding is called Pulse Code Modulation (PCM). More information here: Pulse-code modulation - Wikipedia, the free encyclopedia.
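A rough sketch of that quantization step in Python (this is uniform quantization as in plain PCM; real converters add dithering and other refinements not shown here):

```python
def quantize_16bit(x):
    # Map an instantaneous voltage, normalized to [-1.0, 1.0],
    # onto one of the 65536 integer steps of a signed 16-bit word.
    x = max(-1.0, min(1.0, x))  # clip anything beyond full scale
    return max(-32768, min(32767, round(x * 32767)))

# Signed 16-bit samples span -32768 .. 32767: 65536 steps total.
quantize_16bit(0.0)   # -> 0       the zero line
quantize_16bit(1.0)   # -> 32767   positive full scale
quantize_16bit(-1.0)  # -> -32767  negative full scale
```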

Upon playback, the data (samples stored in bits) are retrieved from the storage medium and reconstructed by the DA converter and reconstruction filter. The waveform is then smoothed back out on the output end into the analogue domain (voltage) as a continuous waveform.
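As a sketch of how those samples end up on the storage medium, here's one second of a 440 Hz tone packed into a standard WAV container using Python's built-in wave module (the tone frequency and half-scale amplitude are just placeholder choices):

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44100

# One second of a 440 Hz sine at half amplitude, as signed 16-bit samples.
pcm = b"".join(
    struct.pack("<h", round(32767 * 0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)))
    for n in range(SAMPLE_RATE)
)

buf = io.BytesIO()
with wave.open(buf, "wb") as f:
    f.setnchannels(1)           # mono
    f.setsampwidth(2)           # 16 bits = 2 bytes per sample
    f.setframerate(SAMPLE_RATE)
    f.writeframes(pcm)

wav_bytes = buf.getvalue()      # a complete .wav file, ready to save or play
```

Everything a player needs to reconstruct the voltage waveform is in there: the sample rate, the word length, and the raw sequence of "dynamic levels."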

Hope that helps.

Cheers :)
 

Definitely helps :) Thank you!
 