Profiling
The other day I was profiling an SSE-optimized distance function, and the more optimized form was 3 times as slow as the basic variant. The code looked roughly like this (skipping the SSE variant):
template <typename T>
class Point
{
public:
    constexpr Point(T x, T y);
    constexpr T GetX() const;
    constexpr T GetY() const;

private:
    T m_x;
    T m_y;
};

template <typename T>
constexpr Point<T>::Point(T x, T y)
    : m_x(x)
    , m_y(y)
{
}

template <typename T>
constexpr T Point<T>::GetX() const
{
    return m_x;
}

template <typename T>
constexpr T Point<T>::GetY() const
{
    return m_y;
}

// explicit (DLL) instantiation
template class __declspec(dllexport) Point<double>;

double DistSqr(const Point<double>& rpt1, const Point<double>& rpt2)
{
    const double dx = rpt1.GetX() - rpt2.GetX();
    const double dy = rpt1.GetY() - rpt2.GetY();
    return (dx * dx) + (dy * dy);
}
It turned out that calling exported functions carries a major performance hit, much larger than the cost of the two multiplications in the 'DistSqr' function. The explicit exported template instantiation exports all member functions, even constexpr and inline ones. Calling an exported function is slower because:
- the invocation is a real function call instead of a single memory read instruction
- the call suppresses other optimizations
- more instructions are needed to pass data from the 'Point' class to the function
After removing the export, the function was 3 times faster than with the export attribute, because accessor functions such as 'GetX' are then inlined.
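A minimal sketch of the layout without the class-wide export, assuming the accessors are defined inline in the header and only the free function needs to be visible outside the DLL. The macro name 'GEOM_API' is hypothetical and not part of the original code; it expands to nothing on non-Windows builds so the snippet compiles anywhere.

```cpp
// Hypothetical export macro (an assumption for this sketch).
#if defined(_WIN32)
#  define GEOM_API __declspec(dllexport)
#else
#  define GEOM_API
#endif

template <typename T>
class Point
{
public:
    constexpr Point(T x, T y) : m_x(x), m_y(y) {}

    // Defined in the class body and not exported, so the compiler
    // is free to inline them at every call site.
    constexpr T GetX() const { return m_x; }
    constexpr T GetY() const { return m_y; }

private:
    T m_x;
    T m_y;
};

// Only the free function is exported; there is no exported
// explicit instantiation of Point<double>.
GEOM_API double DistSqr(const Point<double>& rpt1, const Point<double>& rpt2)
{
    const double dx = rpt1.GetX() - rpt2.GetX();
    const double dy = rpt1.GetY() - rpt2.GetY();
    return (dx * dx) + (dy * dy);
}
```

Whether the accessors really end up inlined still depends on the compiler and its optimization settings, so the generated assembly remains the final check.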
Lessons learned:
- DLL exports and function calls that are not inlined can harm performance
- inspect the assembly
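To compare two variants before digging into the assembly, a small timing helper can be enough. This is only a rough sketch: the name 'NanosecondsPerCall' and its parameters are made up for illustration, and a real profiler gives far more reliable numbers.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: runs f(i) for i in [0, iterations) and returns
// the average wall-clock time per call in nanoseconds.
template <typename F>
double NanosecondsPerCall(F&& f, std::size_t iterations)
{
    // Writing to a volatile keeps the compiler from discarding the results.
    volatile double sink = 0.0;

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
    {
        sink = sink + f(i);
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

The callable would wrap the function under test, for example a lambda that builds two points from the loop index and returns their 'DistSqr'. Run it several times with a large iteration count, since single measurements are noisy.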
Note: accessor functions like 'GetX' are prescribed by OOP, but be aware of their potential performance cost when they cannot be inlined.