Test case
The other day I was profiling the difference between floating-point and integer multiplication. To measure the cost of the multiplication itself with as little loop overhead as possible, each test function performs several multiplications per loop iteration.
floating point
The test for the floating-point case was as follows:
double PrfMultiply(size_t nMax, double d, double dNumber)
{
    for (size_t n = 0; n != nMax; ++n)
    {
        d *= dNumber;
        d *= dNumber;
        d *= dNumber;
        d *= dNumber;
        d += (d < 100 ? 100.0 : 0);
    }
    return d;
}
For x64, the Visual Studio compiler uses SSE instructions for floating-point arithmetic, so it is no surprise to see four scalar multiplication instructions in the generated code:
00007FF7C0D41700 mulsd xmm7,xmm1
00007FF7C0D41704 mulsd xmm7,xmm1
00007FF7C0D41708 mulsd xmm7,xmm1
00007FF7C0D4170C mulsd xmm7,xmm1
Note: the compiler uses only the scalar SSE instruction (mulsd) and does not invoke the packed variant (mulpd). Each multiplication depends on the result of the previous one, so there are no independent values to process side by side. From this table one can also see that a double-precision floating-point multiplication has a latency of around 5 CPU cycles, which makes it a fairly fast instruction.
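The four separate mulsd instructions are what one would expect under the default floating-point model, because floating-point multiplication is not associative: computing dNumber^4 first and multiplying once can give a different result than four multiplications in sequence. The following is a minimal sketch of that difference, not part of the original test. The names MultiplySequential, MultiplyCoalesced and ResultsDiffer are hypothetical, and the sketch assumes the compiler is not running in a relaxed mode such as /fp:fast.

```cpp
// Four dependent multiplications, in the same order as the profiled loop body.
double MultiplySequential(double d, double dNumber)
{
    d *= dNumber;
    d *= dNumber;
    d *= dNumber;
    d *= dNumber;
    return d;
}

// Hypothetical "coalesced" form: compute dNumber^4 first, then multiply once.
// The intermediate power can overflow even when the sequential result does not.
double MultiplyCoalesced(double d, double dNumber)
{
    const double dPower = dNumber * dNumber * dNumber * dNumber;
    return d * dPower;
}

// Returns true when the two evaluation orders disagree for the given inputs.
bool ResultsDiffer(double d, double dNumber)
{
    return MultiplySequential(d, dNumber) != MultiplyCoalesced(d, dNumber);
}
```

For small exact values such as d = 1.0 and dNumber = 3.0, both forms return 81. For d = 1e-300 and dNumber = 1e100, the sequential form stays finite while the coalesced form overflows to infinity, so the compiler may not swap one for the other on its own.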
integer
For integers, the function looks almost the same:
size_t PrfIntegerMultiply(size_t nMax, size_t n, size_t nNumber)
{
    for (size_t n2 = 0; n2 != nMax; ++n2)
    {
        n *= nNumber;
        n *= nNumber;
        n *= nNumber;
        n *= nNumber;
        n -= (n > 1000 ? 995 : 0);
    }
    return n;
}
It turns out that Visual Studio 2019 (in release mode) coalesces the four multiplication steps into a single multiplication instruction. The multiplier in this test was the constant 3, so the compiler multiplies once by 3^4 == 81 == 51h:
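The compiler is allowed to do this because unsigned integer arithmetic wraps around modulo 2^N, so multiplying four times by 3 always gives exactly the same result as multiplying once by 81, even on overflow. A minimal sketch of that equivalence follows; the function names MultiplyFourTimes and MultiplyOnceBy81 are hypothetical and only serve as illustration.

```cpp
#include <cstddef>

// The four multiplication steps as written in the test function.
std::size_t MultiplyFourTimes(std::size_t n, std::size_t nNumber)
{
    n *= nNumber;
    n *= nNumber;
    n *= nNumber;
    n *= nNumber;
    return n;
}

// What the optimizer emits when nNumber is known to be 3: one multiply by 3^4.
std::size_t MultiplyOnceBy81(std::size_t n)
{
    return n * 81;  // 81 == 0x51
}
```

Both functions agree for every input, including values large enough to overflow, which is why the single imul is a valid replacement.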
00007FF61C131840 imul rbx,rbx,51h
Conclusion
When profiling, it's always advisable to look at the generated assembly code. Often the optimizer is smarter than you think, and it may even remove a function invocation completely. This is especially the case with fixed, predefined numbers and (static) functions within a single translation unit.
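One common way to keep a fixed number from being folded into the code is to hide it from the optimizer, for example by reading it through a volatile object. The sketch below is an assumption about how such a test could be set up, not the harness used for the measurements above; RunWithOpaqueMultiplier is a hypothetical helper.

```cpp
#include <cstddef>

// Same loop body as the integer test above.
std::size_t PrfIntegerMultiply(std::size_t nMax, std::size_t n, std::size_t nNumber)
{
    for (std::size_t n2 = 0; n2 != nMax; ++n2)
    {
        n *= nNumber;
        n *= nNumber;
        n *= nNumber;
        n *= nNumber;
        n -= (n > 1000 ? 995 : 0);
    }
    return n;
}

// Hypothetical helper: the multiplier is read through a volatile object, so the
// compiler has to treat its value as unknown at compile time.
std::size_t RunWithOpaqueMultiplier(std::size_t nMax, std::size_t nStart, std::size_t nMultiplier)
{
    volatile std::size_t nOpaque = nMultiplier;
    const std::size_t nNumber = nOpaque;  // volatile read, not a compile-time constant
    return PrfIntegerMultiply(nMax, nStart, nNumber);
}
```

This only removes the compile-time constant. The optimizer is still free to compute nNumber^4 once outside the loop and multiply by that, so checking the generated assembly remains necessary.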