SonicAlbert said:
Even if the sample rates were not synced properly (which is possible) the result would most likely be that the audio would play back at the wrong pitch, not that it would be distorted or otherwise messed up.
That shouldn't be possible unless either A. the drivers or hardware suck, or B. you have misconfigured the interface to take input from S/PDIF while using the internal clock. Make sure the interface is set to "External Clock from S/PDIF" or some such.
The receiving device should ALWAYS be synchronized to the clock from the sending device unless either A. the sending device is synchronized to the clock from the receiving device (with a word clock cable, for example), or B. both devices are synchronized to some third source.
As for the "just the wrong pitch" thing, no, it wouldn't. If the sample rates of the two devices differ, you would not just get the wrong pitch (though you would get that). What you say would only be true if the buffer size for the audio interface were infinitely large and if the receiving device always waited until it had received enough data to play to the end before it started playing. However, in the real world, the buffer in the receiving device is of a finite size, and the receiving device always starts playing immediately. For this reason, in addition to a pitch problem, you would also get all sorts of distortion.
Suppose Device A sends out 48 kHz worth of data and Device B expects 44.1 kHz. Device B then gets overruns, because more data arrives than it can play. The data from the source either wraps around and overwrites unplayed data, or gets dropped on the floor to prevent this from occurring. Either way, the net effect is that 3900 samples per second (48000 - 44100) get thrown away, either as a large chunk or as a lot of smaller chunks, depending on the interface's buffer size, packet size, read head offset, etc.
In the opposite case, Device A sends out 44.1 kHz worth of data and Device B expects 48 kHz, so Device B ends up getting underruns. That means that it needs more data to play, but no data has arrived. Thus, it plays silence for 3900 samples per second, either as a single chunk or as a lot of smaller chunks, depending on the interface's buffer size, packet size, read head offset, etc.
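If you want to see the numbers fall out, here is a rough Python sketch of the two cases above. It is only a toy model, not how any particular interface or driver works: the function name `simulate`, the 1024-sample buffer, and the 10 ms tick are all assumptions I made up for illustration. It models a finite buffer that starts empty, a receiver that starts playing immediately, and extra incoming samples being dropped (rather than wrapping around).

```python
def simulate(src_rate, dst_rate, capacity=1024, seconds=1, ticks_per_second=100):
    """Toy model of a receiver running on its own clock.

    Returns (dropped, silence): samples thrown away on overrun and
    samples of silence played on underrun. All parameters other than
    the two sample rates are hypothetical illustration values.
    """
    produce = src_rate // ticks_per_second  # samples arriving per tick
    consume = dst_rate // ticks_per_second  # samples the receiver plays per tick
    fill = 0       # samples currently waiting in the buffer
    dropped = 0    # overrun: samples that did not fit
    silence = 0    # underrun: samples the receiver had to invent

    for _ in range(seconds * ticks_per_second):
        # Sender writes; anything beyond the free space is dropped.
        written = min(produce, capacity - fill)
        dropped += produce - written
        fill += written

        # Receiver plays; anything it cannot get from the buffer is silence.
        played = min(consume, fill)
        silence += consume - played
        fill -= played

    return dropped, silence


if __name__ == "__main__":
    print("48 kHz -> 44.1 kHz (dropped, silence):", simulate(48000, 44100))
    print("44.1 kHz -> 48 kHz (dropped, silence):", simulate(44100, 48000))
```

In the 44.1 kHz into 48 kHz case this gives 3900 samples of silence per second right from the start. In the 48 kHz into 44.1 kHz case the buffer soaks up the excess for a moment, and once it is full the loss settles at 3900 dropped samples for every additional second. Changing `capacity` or the tick length changes how the loss is chunked up, not the total rate.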