Ok. Having just finished typing this, I feel compelled to tell you that I won't be hurt if you don't want to read it. Just let me know and I'll stop.
What the numbers mean:
Your audio is like a film. Film is not actually moving images, it's just a whole bunch of still images that, when looked at back to back to back in quick succession create the illusion of movement.
The 44.1 vs. 96k (and for the record, kHz was right after all - Hz just means "per second" and the k means thousand) is the number of samples taken per second - the number of still images in the film per second. More samples means more detail and less audio smudging (think of filming a hand waving rapidly: if you only have, say, ten frames in a second, it's just going to look like a big blur. Twenty frames and it becomes a little less blurry. Etc...). Like I said before, you most likely won't notice the difference between a single track recorded at 44.1 vs. 96, but as you layer multiple tracks into a mix, that added detail makes itself more apparent by allowing more tracks to sit cleanly together.
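(If you like seeing it in numbers, here's a tiny Python sketch - purely for illustration, the song length is made up - that just multiplies out how many of those "still images" you end up with:)

```python
# Rough arithmetic: how many samples ("still images") per second,
# and over a whole (hypothetical) 3:30 song.
samples_per_second_cd = 44_100
samples_per_second_hi = 96_000
song_length_seconds = 3 * 60 + 30

print(samples_per_second_cd * song_length_seconds)   # 9,261,000 samples
print(samples_per_second_hi * song_length_seconds)   # 20,160,000 samples
```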
The 24-bit vs. 16-bit thing goes hand in hand with the sample rate: every sample that gets taken (whether you're taking 44.1K or 96K of them per second) is stored using that many bits. A bit is your basic piece of computer information, and it can only carry the minimum amount of information - on or off (1 or 0). It'll help later if you start thinking of this number as the bit depth.
So, if your interface did AD conversion at only 1-bit, then each sample that was taken could have only one of two possible depths - 1 or 0. Bump that up to 2 bits, and you now have samples that can be 00, 01, 10, or 11 - four possible depths. At 4 bits, suddenly you've got 16 possible depths: 0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010, 1011, 1100, 1101, 1110, 1111 (write them all out if you don't believe me). Every bit you add doubles the count.
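Here's that doubling as a quick Python sketch, if that's easier than writing out binary by hand - the bit counts are the only real numbers here, the rest is just illustration:

```python
# Each extra bit doubles how many possible "depths" a single sample can have.
for bits in (1, 2, 4, 16, 24):
    print(bits, "bits ->", 2 ** bits, "possible values per sample")
# 16 bits gives 65,536 possible values; 24 bits gives 16,777,216.
```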
So, keep in mind that you are trying to use these two pieces of information to approximate a sound wave. Try drawing a sound-wave-type curve (just, you know, a sine-wave-looking thing). Pretend that the curve you just drew is exactly one second of audio. Slice it into eight even pieces along the horizontal axis, and draw a point at the height of the curve at each of those slices. You now have a picture of how your audio would be represented if you recorded at a sample rate of 8 (as opposed to the standard 44,100 - we're simplifying because 44,100 slices is a lot to draw). Now, slice it into four even pieces along the vertical axis. For each of the points you drew at the horizontal slices, move it to the nearest of those four heights. This is what your audio would look like if you recorded at 2 bits with eight samples per second. If you connect these eight dots, you'll notice that they don't look very much like your curve.
As you increase these numbers, notice how the approximation starts to resemble the actual sound wave more closely. More slices along the vertical axis (higher bit depth) means it's easier to differentiate between two adjacent points that have similar height. More slices along the horizontal axis (higher sample rate) means the line you end up drawing to connect the dots gets closer to the roundness of an actual curve.
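If you'd rather let a computer do the drawing exercise, here's a rough Python sketch of the same thing - a made-up 1 Hz "curve," 8 samples per second, 2 bits. The exact numbers are just illustration, but the snapping is the real idea:

```python
import math

SAMPLE_RATE = 8    # samples per second (absurdly low, like the drawing)
BIT_DEPTH = 2      # 2 bits -> 4 possible heights
LEVELS = 2 ** BIT_DEPTH
STEP = 2 / (LEVELS - 1)   # spacing between the 4 allowed heights, from -1 to +1

# One second of a made-up 1 Hz sine wave, standing in for the curve you drew.
for n in range(SAMPLE_RATE):
    t = n / SAMPLE_RATE
    value = math.sin(2 * math.pi * t)                  # the real curve
    stored = round((value + 1) / STEP) * STEP - 1      # snapped to the nearest allowed height
    print(f"t={t:.3f}  curve={value:+.3f}  stored={stored:+.3f}")
```

Compare the "curve" column to the "stored" column and you'll see the same mismatch as in your drawing.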
So, a couple notes. The bit depth ultimately determines your dynamic range. This is because the higher the bit depth, the more differentiation there is between absolute silence (all zeros) and clipping (all ones).
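(For the curious: the usual rule of thumb is roughly 6 dB of dynamic range per bit. Here's that math as a little Python sketch, assuming the standard 20*log10 decibel formula:)

```python
import math

# Rule-of-thumb dynamic range: the ratio between the biggest value the bits can
# hold and one step above silence, in decibels (about 6.02 dB per bit).
for bits in (16, 24):
    dynamic_range_db = 20 * math.log10(2 ** bits)
    print(f"{bits}-bit: about {dynamic_range_db:.0f} dB")   # ~96 dB and ~144 dB
```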
Also, this explains why you absolutely never want to clip in a digital setting. Once you've reached the absolute top of your dynamic range (all ones), you can't cram any more information in there. You can't just suddenly jump to 25 bits, and any individual bit can only tell you one thing. So digital clipping just completely cuts off the tops of your sound waves and makes for ugly-sounding distortion.
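And a tiny sketch of what that does to the numbers - the gain value is made up, but the flattening is exactly the "cut off the tops" I mean:

```python
import math

CEILING = 1.0   # "all ones" in a normalized -1..+1 world
GAIN = 1.8      # pretend we pushed the signal way too hot (made-up number)

for n in range(8):
    wanted = GAIN * math.sin(2 * math.pi * n / 8)
    stored = max(-CEILING, min(CEILING, wanted))   # tops and bottoms get cut flat
    print(f"wanted {wanted:+.2f}  stored {stored:+.2f}")
```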
... Jesus that was a lot.

I'll answer the rest of what you asked after I smoke a cigarette. And let me know if anything I wrote wasn't clear - I can try and draw a picture, see if it helps any.