Another factor here is just how little meaning can be read into an RMS measurement of an entire song. A volume figure averaged across three minutes or so tells us almost nothing about the actual dynamics within any given part of those three minutes. While this is an extreme example, it makes the point: three minutes of sine wave or pink noise at a steady RMS of -9dBFS will sound entirely different from a three-minute track whose running RMS reads -18dBFS 50% of the time and 0dBFS the other 50% of the time, yet the latter's readings will also average out to -9dBFS for the entire track.
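To put rough numbers on that, here is a minimal Python sketch. The lists of readings are made up for illustration (one short-term RMS reading in dBFS per equal-length window), and the function names are my own, not taken from any meter or library. It compares a plain average of the dB readings with a power-based RMS over the whole track.

```python
import math

# Hypothetical short-term RMS readings in dBFS, one per equal-length window.
steady = [-9.0] * 100                 # sine wave / pink noise held at -9 dBFS
split = [-18.0] * 50 + [0.0] * 50     # half the track quiet, half at full scale


def mean_of_db_readings(readings):
    """Plain arithmetic average of the dB readings."""
    return sum(readings) / len(readings)


def whole_track_rms_db(readings):
    """Average the power of each window, then convert back to dB."""
    mean_power = sum(10 ** (r / 10) for r in readings) / len(readings)
    return 10 * math.log10(mean_power)


for name, readings in (("steady", steady), ("split", split)):
    print(name,
          round(mean_of_db_readings(readings), 2),
          round(whole_track_rms_db(readings), 2))
```

Averaging the dB readings gives -9dBFS for both tracks. A power-based RMS over the whole file would put the split track closer to -3dBFS, because the loud half dominates. Either way, the single number says nothing about the 18dB swing inside the track.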
For a more real-life example, let's say that we have two songs which, if we were to exclude the peaks, have the same average volume. One song is at 120BPM, the other at 180BPM, and on every beat there's a heavy peak shooting up to 0dBFS. Basically they have the same kind of dynamic content, but because the second song has 50% more peaks, and therefore 50% more peak energy, than the first, that will drag the average RMS reading for the second song up. Same kind of dynamics, same point-to-point crest factor, yet for the overall track the RMS will read higher and the crest factor will therefore read lower.
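The two-song comparison can be sketched the same way. Everything numeric here is an assumption for illustration: a non-peak "bed" level of -12dBFS, a hit that sits at 0dBFS for 20ms on every beat, and a crest factor taken as the 0dBFS peak minus the whole-track RMS.

```python
import math


def overall_rms_db(bpm, bed_db=-12.0, hit_db=0.0, hit_seconds=0.02):
    """Whole-track RMS for a steady bed with one short hit on every beat.

    bed_db, hit_db and hit_seconds are illustrative assumptions.
    """
    beat_seconds = 60.0 / bpm
    hit_fraction = hit_seconds / beat_seconds   # share of time spent in hits
    mean_power = (hit_fraction * 10 ** (hit_db / 10)
                  + (1 - hit_fraction) * 10 ** (bed_db / 10))
    return 10 * math.log10(mean_power)


rms_120 = overall_rms_db(120)
rms_180 = overall_rms_db(180)

# Both songs peak at 0 dBFS, so crest factor = 0 - RMS.
crest_120 = 0.0 - rms_120
crest_180 = 0.0 - rms_180

print(round(rms_120, 2), round(crest_120, 2))
print(round(rms_180, 2), round(crest_180, 2))
```

With these assumed values the 180BPM song reads roughly three quarters of a dB higher in whole-track RMS, and lower in crest factor by the same amount, even though each individual beat is built exactly the same way in both songs.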
This also means that the average non-peak volume of the second song could actually be lowered, increasing its real-time crest factors and making the song more "dynamic", yet the average RMS and crest factor readings for the entire track would be no different from those of the louder, lower-crest-factor first track. Higher dynamics, more perceived "range", but no change in the numbers.
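As a rough check on that, the sketch below works out how far the non-peak level of the 180BPM song would have to drop for its whole-track RMS to match the 120BPM song. The same assumptions apply as before and are mine, not measured data: hits at 0dBFS lasting 20ms on every beat, and a -12dBFS bed in the slower song.

```python
import math

HIT_SECONDS = 0.02   # assumed length of each 0 dBFS hit
BED_120_DB = -12.0   # assumed non-peak level of the 120 BPM song


def hit_fraction(bpm):
    """Share of the track's running time occupied by hits."""
    return HIT_SECONDS / (60.0 / bpm)


def overall_power(bpm, bed_db):
    """Mean power of the whole track (hits at 0 dBFS have power 1.0)."""
    f = hit_fraction(bpm)
    return f * 1.0 + (1 - f) * 10 ** (bed_db / 10)


def bed_db_for_target(bpm, target_power):
    """Bed level that gives the requested whole-track mean power."""
    f = hit_fraction(bpm)
    bed_power = (target_power - f) / (1 - f)
    if bed_power <= 0:
        raise ValueError("the hits alone already exceed the target power")
    return 10 * math.log10(bed_power)


target = overall_power(120, BED_120_DB)
matched_bed_db = bed_db_for_target(180, target)

rms_120_db = 10 * math.log10(target)
rms_180_db = 10 * math.log10(overall_power(180, matched_bed_db))

print(round(matched_bed_db, 2), round(rms_120_db, 2), round(rms_180_db, 2))
```

Under these assumptions the faster song's bed ends up a bit over 1.5dB quieter than the slower song's, so every hit stands further above what surrounds it, while the whole-track RMS and crest factor of the two songs come out identical.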
G.