Sunday, January 31, 2021

Multiplication and the optimizer

Test case

 the other day I was profiling the difference between floating point and integer multiplication. To test the multiplication contribution with the least amount of overhead contribution the functions did a number of multiplications. 

floating point

The test for floating point case was as follows:


double PrfMultiply(size_t nMax, double d, double dNumber)
{
   for (size_t n = 0; n != nMax; ++n)
   {
      d *= dNumber;
      d *= dNumber;
      d *= dNumber;
      d *= dNumber;

      d += (d < 100 ? 100.0 : 0);
   }

   return d;
}

  For x64 the Visual Studio compiler uses SSE instructions for floating point for numbers so it's no surprise then to see four (scalar) multiplication instructions in the generated code:


00007FF7C0D41700  mulsd       xmm7,xmm1  
00007FF7C0D41704  mulsd       xmm7,xmm1  
00007FF7C0D41708  mulsd       xmm7,xmm1  
00007FF7C0D4170C  mulsd       xmm7,xmm1 

Note: it's also remarkable that the compiler only uses the scalar SSE instruction and not invokes the packed variant (mulpd); From this table one can also see that (double) floating point multiplication lasts around 5 CPU cycles which makes it a fairly fast instruction

integer

  For integer the function looks almost the same:


size_t PrfIntegerMultiply(size_t nMax, size_t n, size_t nNumber)
{
   for (size_t n2 = 0; n2 != nMax; ++n2)
   {
      n *= nNumber;
      n *= nNumber;
      n *= nNumber;
      n *= nNumber;

      n -= (n > 1000 ? 995 : 0);
   }

   return n;
}

 It turns out that Visual Studio 2019 (in release mode) already optimized the four multiplication steps and coalesced them in one multiplication instruction (3^4 == 81 == 51h):


00007FF61C131840  imul        rbx,rbx,51h  

Conclusion

When profiling it's always advisable to look at the generated assembly code. Often the optimizer is smarter than you think or may even completely removes function invocation. This is especially the case with fixed predefined numbers and (static) functions in translation units.

Tuesday, January 12, 2021

CComPtr

IUnknown

 COM is Microsoft's component framework. It was created in the 90's but still used for native development and UWP. All components in this framework must support the IUnknown interface:


interface IUnknown
{
   virtual HRESULT STDMETHODCALLTYPE QueryInterface(REFIID riid, void** ppvObject) = 0;
   virtual ULONG   STDMETHODCALLTYPE AddRef() = 0;
   virtual ULONG   STDMETHODCALLTYPE Release() = 0;
};

This interface has three functions:

  • QueryInterface to query and access the components supported interfaces
  • AddRef to increment the reference count
  • Release to decrement the reference count. When the reference count goes to zero the component is destroyed.

 Client code must adhere to the protocol of incrementing the interface when using it and releasing the interface when done. Mostly components returned from functions are already incremented so that client code only need to decrement the reference count.

 Example

  For the example here the function to create error info is used. The address of a pointer must supplied and when successful the interface must be released:


ICreateErrorInfo* pErrorInfo = nullptr;

HRESULT hr = ::CreateErrorInfo(&pErrorInfo);

if (pErrorInfo)
{
   pErrorInfo->Release();
}

 One of the major error causes when using COM is that reference counts are not administered correctly:

  • when reference counts are more released than incremented by client code it causes cashes and access violations. This may happen beyond the fault location (in time and space).
  • when too few reference counts are released it may cause resource leaks. For example on my work in the past a colleague released one VMR9 interface too little. This resulted in a full thread leak since the VMR9 is a fat object.

 Smart pointer

 Luckily Microsoft has acknowledged this problem and created smart pointer classes for COM interfaces. There are two flavor's. One comes from the compiler support classes '_com_ptr_t'. The other one is CComPtr and comes from the ATL library. The CComPtr is desmontrated in the followign examples.

Code in above example becomes easier especially when there would be multiple return paths:


CComPtr<ICreateErrorInfo> ptrErrorInfo;
HRESULT hr = ::CreateErrorInfo(&ptrErrorInfo);

if (ptrErrorInfo)
{
   //no release necessary
}

 As usual with smart pointers they work transparent. One can assign them to other smart pointers or even return from functions:


CComPtr<ICreateErrorInfo> CreateErrorInfoPtr()
{
  CComPtr<ICreateErrorInfo> ptrErrorInfo;
  HRESULT hr = ::CreateErrorInfo(&ptrErrorInfo);

  return ptrErrorInfo;
}

const CComPtr<ICreateErrorInfo> ptr = CreateMyErrorInfoPtr();

 With returning raw pointers a leak is easily created when the client code ignores the created interface return value. In above case with smart pointers it is suboptimal but it wouldn't hurt when client code ignores the return value since the returned temporary object goes out of scope and releasing thereby the created interface.

 Implementation

  A possible implementation could be as follow (borrowed and modified from the original source):


template <class T>
class CComPtr
{
public:
   CComPtr()
      : m_p(nullptr)
   {
   }

   CComPtr(T* p)
      : m_p(p)
   {
      if (m_p != nullptr)
         m_p->AddRef();
   }

   CComPtr(const CComPtr& rptr)
      : CComPtr(rptr.m_p)
   {
   }

   ~CComPtr()
   {
      if (m_p)
         m_p->Release();
   }

   CComPtr& operator=(const CComPtr& rptr)
   {
      if (m_p != rptr.m_p)
      { 
         if (m_p) 
            m_p->Release();
         
         m_p = rptr.m_p;
         m_p->AddRef();
      }

      return *this;
   }

private:
   T*    m_p;
};

Conclusion

  CComPtr is one the best addition to COM programming. It greatly simplifies COM client implementaitons and solves almost completely all reference counting and tracking issues. It's actually hard to do wrong with CComPtr since it also asserts when a contained interface is overwritten accidently.

 'Effective COM' mentions in item 22 that 'smart interface pointers add at least as much complexity as they remove'. I strongly disagree as can be read from this article. The book mentions a small problem with old CComPtr which is also solved in the latest release of ATL.


Sunday, January 10, 2021

C++ solutions for some C issues

The C language

 The C language was co-developed with UNIX and played and important part in the ICT history. It was a small language with little overhead. Unfortunately in the use of it it had some issues as well:

  • dangling pointers
  • memory leaks
  • buffer overruns

 C++

  C++ original goal was to offer and object oriented programming language compatible with C. Later it incorporated generics as well in the form of 'templates'. 

 Modern C++ offers solutions for above issues:

  • smart pointers like unique_ptr and shared_ptr own a memory resource. The smart pointer releases the memory when the last reference to the smart pointer is going out of scope. This solves the problems of dangling pointers and memory leaks. Note that shared_ptr's are not completely opaque: they still have some sharp edges as well like the circular reference problem and the inability to used shared_from_this from a constructor
  • std::vector offers a safe way to manage a contiguous buffer. It has iterators for access and it automatically grows when elements are added. The memory is released when going out of scope. Again it offers an alternative for all problems above.
  • std::string is comparable with std::vector but is specialized for strings. C uses character arrays and they are prone for all of the above mentioned problems

Good C++ code can be as fast or even faster than corresponding C code. As usual one has to know the idioms and read the standard C++ books.

Example 

Suppose you have a function which fills a variable length buffer and do some processing on it. It has multiple early out paths.

 C case

In C this could be:


#include <stdlib.h>

bool f()
{
   size_t nLen = GetBufferLength();

   int* p = malloc(nLen * sizeof(int));
   
   if (!GetBuffer(p, nLen))
   {
      free(p);
      return false;
   }

   if (!EncodeBuffer(p, nLen))
   {
      free(p);
      return false;
   }
   
   free(p);
   return true;
}

 C++ case

 In C++ one can use std:vector as variable length buffer:


#include <vector>

bool f()
{
   const size_t nLen = GetBufferLength();

   std::vector<int>	vec(nLen);
   
   if (!GetBuffer(vec.data(), vec.size()))
   {
      return false;
   }

   if (!EncodeBuffer(vec.data(), vec.size()))
   {
      return false;
   }
   
   return true;
}

There is only one small performance drawback in using std::vector: its elements get default or zero initialized which may be an issue in case a huge buffer is allocated.

Sunday, January 3, 2021

Visual Studio std::pow implementations

pow

After an upgrade of Visual Studio 2017 to 2019 it was noticed that regression tests were failing with the new version. There were multiple causes; one of them was that (yet again) the implementation of std::pow had changed.

 The Visual Studio 2017 implementation uses a different code path for the common power 2 (square) case: it issues a simple multiplication. The implementation is something like this:


_Check_return_ inline double pow(_In_ double _Xx, _In_ int _Yx) noexcept
       {
       if (_Yx == 2)
              return (_Xx * _Xx);

       return (_CSTD pow(_Xx, static_cast<double>(_Yx)));
       }

 The Visual Studio 2019 implementation isn't available in source code form but the exceptional code path for calculating the square seems not present anymore. This gives (small) differences with some numbers, e.g. the square of '0.10000000055703842' gives a different result.

Alternative

Luckily boost offers an alternative for calculating squared and other integer powers known at compile time in its math library:


#include <boost/math/special_functions/pow.hpp>

constexpr double d = 0.10000000055703842;

const double d2 = boost::math::pow<2>(d);

 Using this function should give stable result for the coming upgrades of Visual Studio. It has also the extra benefit of better performance. Results of a test case with running many power calculations:

Function Time (s)
boost::math::pow<2> 0.241
std::pow 9.395

  Note that instead of using a power function direct multiplication is ofc also possible. Often though these power functions are fed with another calculated value which otherwise has to be duplicated or write down explicitly:


const double d = std::sqrt(boost::math::pow<2>(pt.x - x) + boost::math::pow<2>(pt.y - y));

 Boost's math::pow has the extra benefit of doing the least amount of multiplications in case the power is larger than 3.

Saturday, January 2, 2021

Careful with std::mutex

 recursive_mutex

 std::recursive_mutex can be locked again by the owning thread without blocking. A normal std::mutex doesn't support this behavior and attempt to lock it again from the owning thread is even undefined behavior.

 Still C++ experts promote std::mutex over std::recursive_mutex. The locking is more clear but programmers need to take extra care now.

Example

 Suppose a class with two data members (A and B) and the data needs to be protected from access by multiple threads. The class has two functions; one to change A and one to change A and B. The following implementation is wrong:


#include <mutex>

class Dummy
{
public:
   void SetA(int a)
   {
      std::unique_lock<std::mutex> lck(m_mtx);

      m_a = a;
   }

   void SetAB(int a, int b)
   {
      std::unique_lock<std::mutex> lck(m_mtx);

      m_b = b;

      SetA(a);  // error: mutex is still locked
   }

private:
   std::mutex  m_mtx;
   int         m_a;
   int         m_b;
};

  In the function SetAB the mutex is still locked when invoking SetA leading to undefined behavior.

 Fixing this by adding a SetB function and invoking this function is also incorrect:


class Dummy
{
public:
   void SetA(int a)
   {
      std::unique_lock<std::mutex> lck(m_mtx);

      m_a = a;
   }

   void SetB(int b)
   {
      std::unique_lock<std::mutex> lck(m_mtx);

      m_b = b;
   }

   void SetAB(int a, int b)
   {
      SetA(a);  
      SetB(b);  
   }

   // rest same as before
};

  This is wrong because of another reason: A and B are now not modified under the same lock. A race condition may emerge where A is already changed but B not yet. This may break an invariant if A and B need to be updated together.

 There are multiple solutions to this issue:

  • differentiate between locking and non locking (implementation) functions. Drawback is that the number of (private) member functions are increasing and thereby obfuscating the design
  • set member data A and B directly without using member or other functions. For simple functions like class above this is okay but many times the SetXxx member function does extra things (e.g. notifying clients). Duplicating these code is not attractive.

 Example of differentiate between locking and non locking functions:

class Dummy
{
public:
   void SetA(int a)
   {
      std::unique_lock<std::mutex> lck(m_mtx);

      SetAImpl(a);
   }

   void SetB(int b)
   {
      std::unique_lock<std::mutex> lck(m_mtx);

      SetBImpl(b);
   }

   void SetAB(int a, int b)
   {
      std::unique_lock<std::mutex> lck(m_mtx);

      SetAImpl(a);
      SetBImpl(b);
   }

private:
   void SetAImpl(int a)
   {
      m_a = a;
   }

   //etc.
   
   // rest same as before
};

  Above example is a simplified version. In real code classes may have dozens of member functions. Also the invocation of another member function when the mutex is already locked may happen indirect; e.g. through an event notification.

Thursday, December 31, 2020

OOP and performance

Encapsulation and information hiding

  Encapsulation and information hiding are major concepts of object oriented programming (OOP). From a maintenance perspective it is beneficial not to be dependent or know about the implementation of an object. Not considering these aspects though can also harm the performance. Example given here may be obvious but is used for illustration purposes.

Example

  Consider the following sample class which uses a std::vector to store its data elements.The sample has the following member functions:

  • Add to add a datum
  • Clear to clear all contained data
  • Sum (or something else) to extract the data

#include <numeric>
#include <vector>

class Sample
{
public:
   void Add(double d)
   {
      m_vecData.push_back(d);
   }

   void Clear()
   {
      m_vecData.clear();
   }

   double Sum() const
   {
      return std::accumulate(m_vecData.cbegin(), m_vecData.cend(), 0.0);
   }

private:
   std::vector<double>  m_vecData;
};

Suppose there is some recorder class which produces sample data. There are a number of ways samples can be retrieved from the recorder:

  1. a function GetSample which returns the sample.
  2. a function FillSample which clear and fills a supplied sample.

class Recorder
{
public:
   Sample GetSample() const
   {
      Sample s;
      s.Add(m_dValue); // in real life the data comes from e.g. a buffer, file or socket

      return s;
   }

   void FillSample(Sample* pSample) const
   {
      pSample->Clear();
      pSample->Add(m_dValue);
   }
   
private:
   double   m_dValue = 1.0;
};

A client use is that many samples are extracted from the recorder in e.g. a loop.


// example 1:
double SampleGet(const Recorder& rRecorder, size_t nMax)
{
   double dTotal = 0.0;

   for (size_t n = 0; n != nMax; ++n)
   {
      const Sample s = rRecorder.GetSample();

      dTotal += s.Sum();
   }

   return dTotal;
}

// example 2:
double SampleGet(const Recorder& rRecorder, size_t nMax)
{
   double dTotal = 0.0;

   Sample s;

   for (size_t n = 0; n != nMax; ++n)
   {
      s = rRecorder.GetSample();

      dTotal += s.Sum();
   }

   return dTotal;
}

// example 3:
double SampleFill(const Recorder& rRecorder, size_t nMax)
{
   double dTotal = 0.0;

   Sample s;

   for (size_t n = 0; n != nMax; ++n)
   {
      rRecorder.FillSample(&s);

      dTotal += s.Sum();
   }

   return dTotal;
}

std::vector only reallocates memory when the capacity is too small. This steers the performance for a large part since memory allocations are performance wise relative heavy:

  1. using GetSample is the cleanest solution from a C++ perspective. Although NRVO may prevent superfluous sample copies, the sample (and thereby the vector) still needs to be created inside the function which may hamper the performance.
  2. using GetSample with the sample outside the loop is not much better.
  3. using FillSample is not the cleanest interface but is favorable from a performance perspective in this case since memory (re)allocation will only occur if the supplied Sample' vector cannot hold enough data elements

Above aspects are an issue due to the implementation details of the 'Sample' class and when performance considerations need to be weighted.

Counter argument

  The classical counter argument from an OO perspective would be that the interface should reflect and prevent bad client use. For this case one could delete the copy- and move constructors and assignment operators. Not sure if that is a good direction. In itself it's not a bad thing that samples get copied. After all they may be treated as value objects Also deleting these functions would prevent them of storing them in the preferred STL container std::vector.

    Conclusion

    As usual in engineering aspects have to be judged and balanced. The OO principles are good guidelines but it's good to know their limitations and break them when necessary.

    'Effective C++' (third edition) mentions a similar case in Item 26 'Postpone variable definitions as long as possible'.

    Wednesday, December 30, 2020

    inline in C++

    Keyword

     It's a common myth that the inline specifier in C++ is ignored by the compiler. While the standard states that it's only a hint for the compiler at least on Visual Studio 2019 the inlining process is influenced by the keyword.

    Test case

     For the test case the following class is used:

    
    struct Dummy
    {
        explicit      Dummy       (int n);
        
        int           Get         () const;  // defined in cpp
        int           GetInline   () const;
    
    private:
        int           m_n;
    };
    
    inline int Dummy::GetInline() const
    {
       return m_n;
    }
    
    

    Assembly

      The use of inlining by the compiler can be studied by looking at the produced assembly in release mode. Tip to use DebugBreak() under debugger to stop around call location.

    For normal function invocations it looks like:

    
    00007FF71ECF166C  lea         rcx,[rsp+30h]  
    00007FF71ECF1671  call        Dummy::Get (07FF71ECF1730h)  
    00007FF71ECF1676
    00007FF71ECF167C
    00007FF71ECF1680  add         ecx,eax  
    
    

     The inlined version looks like:

    
    00007FF71ECF1676
    00007FF71ECF167C  add         ecx,dword ptr [rsp+38h]  
    
    

     For function invocation a call instruction is issued while the inlined version uses an add of the memory content to register ecx. The assembly code for the non inlined function definition is a plain memory copy instruction so performance wise the call is just overhead.

    Results

     Tests take place on Visual Studio 2019 version 16.8.3 with a x64 build in release mode. The following cases are tested:

    1. use class inside a module
    2. use class inside a module with /LTCG (i.e. 'link time code generation')
    3. use exported class (i.e. the class is exported from a DLL and used in another module)

    The results are as follow:

    Case Get GetInline
    use inside a module not inlined inlined
    use inside a module with /LTCG inlined inlined
    use exported class not inlined inlined

    Conclusion

      The inline keyword is not ignored by Visual Studio 2019 and used as a hint. Recommendation to use it where applicable (e.g. one line 'getters' and performance intensive use cases). 

      The use of inline in a DLL context is food for discussion but it's okay as long as you rebuild the DLL (and its clients) when you change the source code of the DLL. MFC and boost are standard examples. 

     Other aspects not discussed here are use in (exported) templates and constexpr constructors (with noexcept).

    External links

    •  https://devblogs.microsoft.com/cppblog/inlining-decisions-in-visual-studio/

    Careful with std::ranges

    <ranges>   C++20 has added the the ranges library. Basically it works on ranges instead of iterators but added some subtle constraint...