Sunday, February 13, 2022

Careful with profiling

Profiling

 The other day I was profiling a SSE optimized distance function and the more optimized form was 3 times as slow as the basic variant. The code was a bit like this (skipping the SSE variant):

template <typename T>
class Point
{
public:
   constexpr      Point     (T x, T y);

   constexpr T    GetX      () const;
   constexpr T    GetY      () const;

private:
   T              m_x;
   T              m_y;
};

template <typename T>
constexpr Point<T>::Point(T x, T y)
:  m_x(x)
,  m_y(y)
{
}

template <typename T>
constexpr T Point<T>::GetX() const
{
   return m_x;
}

template <typename T>
constexpr T Point<T>::GetY() const
{
   return m_y;
}

// explicit (DLL) instantiation
template class __declspec(dllexport) Point<double>;

double DistSqr(const Point<double>& rpt1, const Point<double>& rpt2)
{
    const double dx = rpt1.GetX() - rpt2.GetX();
    const double dy = rpt1.GetY() - rpt2.GetY();
   
    return (dx * dx) + (dy * dy);
}

It turned out that exported functions take a major performance hit; much larger than the two multiplications in 'DistSqr' function. The explicit exported template instantiation exports all functions; even constexpr and inline functions.The reason that calling exported function is slower:

  • invocation is a call instead of one simple memory read instruction
  • it suppresses other optimizations
  • just plain more instructions needed to transform data from 'Point' class to function

Removing the export the function was 3 times faster than without the exported attribute. The accessor functions 'GetX' are then inlined.

Lessons learned:

  • DLL and call invocations can harm performance
  • inspect the assembly

Note: accessor functions like 'GetX'  are prescribed by OOP but be aware of their potential performance cost.



No comments:

Post a Comment

Careful with std::ranges

<ranges>   C++20 has added the the ranges library. Basically it works on ranges instead of iterators but added some subtle constraint...