Sunday, January 31, 2021

Multiplication and the optimizer

Test case

The other day I was profiling the difference between floating point and integer multiplication. To measure the multiplication itself with as little loop overhead as possible, each test function performs several multiplications per iteration.

floating point

The test for the floating point case was as follows:


double PrfMultiply(size_t nMax, double d, double dNumber)
{
   for (size_t n = 0; n != nMax; ++n)
   {
      d *= dNumber;
      d *= dNumber;
      d *= dNumber;
      d *= dNumber;

      d += (d < 100 ? 100.0 : 0);
   }

   return d;
}

For x64 the Visual Studio compiler uses SSE instructions for floating point numbers, so it's no surprise to see four (scalar) multiplication instructions in the generated code:


00007FF7C0D41700  mulsd       xmm7,xmm1  
00007FF7C0D41704  mulsd       xmm7,xmm1  
00007FF7C0D41708  mulsd       xmm7,xmm1  
00007FF7C0D4170C  mulsd       xmm7,xmm1 

Note: it's also worth noting that the compiler only uses the scalar SSE instruction and does not invoke the packed variant (mulpd). Each multiplication depends on the result of the previous one, which leaves little room for vectorization. From this table one can also see that (double) floating point multiplication takes around 5 CPU cycles, which makes it a fairly fast instruction.

integer

For the integer case the function looks almost the same:


size_t PrfIntegerMultiply(size_t nMax, size_t n, size_t nNumber)
{
   for (size_t n2 = 0; n2 != nMax; ++n2)
   {
      n *= nNumber;
      n *= nNumber;
      n *= nNumber;
      n *= nNumber;

      n -= (n > 1000 ? 995 : 0);
   }

   return n;
}

It turns out that Visual Studio 2019 (in release mode) optimized the four multiplication steps and coalesced them into one multiplication instruction. The multiplier in the test was the constant 3, hence 3^4 == 81 == 51h:


00007FF61C131840  imul        rbx,rbx,51h  

Conclusion

When profiling it's always advisable to look at the generated assembly code. Often the optimizer is smarter than you think and it may even remove a function invocation completely. This is especially the case with fixed predefined numbers and (static) functions in translation units.
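One way to keep the optimizer from folding a constant multiplier is to route the value through a volatile variable. The sketch below is only an assumption about how such a harness could look, not the harness used for the measurements above; TimeIntegerMultiply, nOpaque and nSink are hypothetical names.

```cpp
#include <chrono>
#include <cstddef>

// Same loop as PrfIntegerMultiply above.
std::size_t PrfIntegerMultiply(std::size_t nMax, std::size_t n, std::size_t nNumber)
{
   for (std::size_t n2 = 0; n2 != nMax; ++n2)
   {
      n *= nNumber;
      n *= nNumber;
      n *= nNumber;
      n *= nNumber;

      n -= (n > 1000 ? 995 : 0);
   }

   return n;
}

// Hypothetical timing helper: returns the elapsed time in seconds.
double TimeIntegerMultiply(std::size_t nMax)
{
   // Reading from a volatile hides the value 3 from the optimizer, so the
   // multiplier cannot be treated as a compile-time constant.
   volatile std::size_t nOpaque = 3;
   const std::size_t nNumber = nOpaque;

   const auto tStart = std::chrono::steady_clock::now();
   const std::size_t nResult = PrfIntegerMultiply(nMax, 1, nNumber);
   const auto tEnd = std::chrono::steady_clock::now();

   // Writing the result to a volatile keeps the call from being removed.
   volatile std::size_t nSink = nResult;
   (void)nSink;

   return std::chrono::duration<double>(tEnd - tStart).count();
}
```

Even with this precaution the compiler is free to compute the product of the four factors once outside the loop, since unsigned integer multiplication may be reassociated. Checking the disassembly remains the only way to know what is really being timed.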

Tuesday, January 12, 2021

CComPtr

IUnknown

COM is Microsoft's component framework. It was created in the 90's but is still used for native development and UWP. All components in this framework must support the IUnknown interface:


interface IUnknown
{
   virtual HRESULT STDMETHODCALLTYPE QueryInterface(REFIID riid, void** ppvObject) = 0;
   virtual ULONG   STDMETHODCALLTYPE AddRef() = 0;
   virtual ULONG   STDMETHODCALLTYPE Release() = 0;
};

This interface has three functions:

  • QueryInterface to query and access the component's supported interfaces
  • AddRef to increment the reference count
  • Release to decrement the reference count. When the reference count goes to zero the component is destroyed.

Client code must adhere to the protocol of incrementing the reference count when it starts using an interface and releasing the interface when done. Interfaces returned from functions usually have their reference count already incremented, so client code only needs to decrement it.

 Example

For the example here the function to create error info is used. The address of a pointer must be supplied, and when the call succeeds the interface must be released:


ICreateErrorInfo* pErrorInfo = nullptr;

HRESULT hr = ::CreateErrorInfo(&pErrorInfo);

if (pErrorInfo)
{
   pErrorInfo->Release();
}

 One of the major error causes when using COM is that reference counts are not administered correctly:

  • when client code releases more references than it added, it causes crashes and access violations. These may surface far away from the fault location (in time and space).
  • when too few references are released it may cause resource leaks. For example, at my work in the past a colleague released a VMR9 interface one time too few. This resulted in a full thread leak since the VMR9 is a fat object.
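The under-release case can be made visible with a small mock. This is a sketch under stated assumptions: MockComponent is a hypothetical stand-in that only mimics the AddRef/Release part of IUnknown, it is not thread safe, and it is not real COM code.

```cpp
#include <cassert>

// Hypothetical stand-in for a COM component; counts live instances.
class MockComponent
{
public:
   static int s_nLive;   // number of instances not yet destroyed

   // Like most COM factory functions, hands out a pointer whose reference
   // count has already been incremented on behalf of the caller.
   static MockComponent* Create()
   {
      return new MockComponent();
   }

   unsigned long AddRef()
   {
      return ++m_nRef;
   }

   unsigned long Release()
   {
      assert(m_nRef > 0);   // releasing too often would be caught here
      const unsigned long nRef = --m_nRef;
      if (nRef == 0)
         delete this;       // last reference gone: destroy the component
      return nRef;
   }

private:
   MockComponent() : m_nRef(1) { ++s_nLive; }
   ~MockComponent() { --s_nLive; }

   unsigned long m_nRef;
};

int MockComponent::s_nLive = 0;

// Balanced use: every reference is released exactly once.
void UseCorrectly()
{
   MockComponent* p = MockComponent::Create();
   p->AddRef();    // an extra reference, e.g. for a stored copy
   p->Release();   // the extra reference is released again
   p->Release();   // releases the reference handed out by Create
}

// Under-release: one Release is missing. The pointer is returned only so
// that the caller of this demo can inspect and clean up the leak.
MockComponent* UseWithLeak()
{
   MockComponent* p = MockComponent::Create();
   p->AddRef();    // count is 2
   p->Release();   // count is 1: the reference from Create is never released
   return p;
}
```

After UseCorrectly the live counter is back at zero; after UseWithLeak one instance stays alive until somebody calls the missing Release.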

 Smart pointer

Luckily Microsoft has acknowledged this problem and created smart pointer classes for COM interfaces. There are two flavors. One is '_com_ptr_t' from the compiler support classes. The other one is CComPtr and comes from the ATL library. CComPtr is demonstrated in the following examples.

The code in the above example becomes simpler, especially when there are multiple return paths:


CComPtr<ICreateErrorInfo> ptrErrorInfo;
HRESULT hr = ::CreateErrorInfo(&ptrErrorInfo);

if (ptrErrorInfo)
{
   //no release necessary
}

As usual with smart pointers they work transparently. One can assign them to other smart pointers or even return them from functions:


CComPtr<ICreateErrorInfo> CreateErrorInfoPtr()
{
  CComPtr<ICreateErrorInfo> ptrErrorInfo;
  HRESULT hr = ::CreateErrorInfo(&ptrErrorInfo);

  return ptrErrorInfo;
}

const CComPtr<ICreateErrorInfo> ptr = CreateErrorInfoPtr();

When returning raw pointers a leak is easily created when the client code ignores the returned interface. In the above case with smart pointers it is suboptimal but harmless when client code ignores the return value: the returned temporary object goes out of scope and thereby releases the created interface.

 Implementation

A possible implementation could be as follows (borrowed and modified from the original source):


template <class T>
class CComPtr
{
public:
   CComPtr()
      : m_p(nullptr)
   {
   }

   CComPtr(T* p)
      : m_p(p)
   {
      if (m_p != nullptr)
         m_p->AddRef();
   }

   CComPtr(const CComPtr& rptr)
      : CComPtr(rptr.m_p)
   {
   }

   ~CComPtr()
   {
      if (m_p)
         m_p->Release();
   }

   CComPtr& operator=(const CComPtr& rptr)
   {
      if (m_p != rptr.m_p)
      {
         // AddRef the new interface first (it may be null), then release the old one
         if (rptr.m_p)
            rptr.m_p->AddRef();

         if (m_p)
            m_p->Release();

         m_p = rptr.m_p;
      }

      return *this;
   }

private:
   T*    m_p;
};
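To see what the implementation above does to the reference count, it can be exercised against a mock interface. The sketch below is illustrative only: MockInterface and SimpleComPtr are hypothetical names, SimpleComPtr is a reduced copy of the class above with a null-safe assignment operator, and none of this is ATL code.

```cpp
#include <cassert>

// Hypothetical mock exposing only AddRef/Release; nRef is public so the
// effect of each smart pointer operation can be inspected.
struct MockInterface
{
   unsigned long nRef = 0;

   unsigned long AddRef()  { return ++nRef; }
   unsigned long Release() { assert(nRef > 0); return --nRef; }
};

// Reduced smart pointer following the implementation above.
template <class T>
class SimpleComPtr
{
public:
   SimpleComPtr() : m_p(nullptr) {}

   SimpleComPtr(T* p) : m_p(p)
   {
      if (m_p)
         m_p->AddRef();
   }

   SimpleComPtr(const SimpleComPtr& rptr) : SimpleComPtr(rptr.m_p) {}

   ~SimpleComPtr()
   {
      if (m_p)
         m_p->Release();
   }

   SimpleComPtr& operator=(const SimpleComPtr& rptr)
   {
      if (m_p != rptr.m_p)
      {
         // AddRef the new interface first, then release the old one.
         if (rptr.m_p)
            rptr.m_p->AddRef();
         if (m_p)
            m_p->Release();
         m_p = rptr.m_p;
      }
      return *this;
   }

private:
   T* m_p;
};

// Returns the reference count observed while two smart pointers share obj.
unsigned long CountWhileShared(MockInterface& obj)
{
   SimpleComPtr<MockInterface> ptr1(&obj);   // count 1
   SimpleComPtr<MockInterface> ptr2(ptr1);   // count 2
   SimpleComPtr<MockInterface> ptrEmpty;

   ptr2 = ptrEmpty;                          // back to 1, no null dereference
   ptr2 = ptr1;                              // count 2 again

   return obj.nRef;
}
```

When CountWhileShared returns, all three smart pointers have gone out of scope and the count of the mock is back at zero without a single explicit Release in client code.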

Conclusion

CComPtr is one of the best additions to COM programming. It greatly simplifies COM client implementations and solves almost all reference counting and tracking issues. It's actually hard to go wrong with CComPtr since it also asserts when a contained interface is accidentally overwritten.

'Effective COM' mentions in item 22 that 'smart interface pointers add at least as much complexity as they remove'. I strongly disagree, as can be read from this article. The book mentions a small problem with the old CComPtr, which has been solved in the latest release of ATL.


Sunday, January 10, 2021

C++ solutions for some C issues

The C language

The C language was co-developed with UNIX and played an important part in ICT history. It was a small language with little overhead. Unfortunately, in practice it had some issues as well:

  • dangling pointers
  • memory leaks
  • buffer overruns

 C++

C++'s original goal was to offer an object oriented programming language compatible with C. Later it incorporated generics as well in the form of 'templates'.

Modern C++ offers solutions for the above issues:

  • smart pointers like unique_ptr and shared_ptr own a memory resource. A unique_ptr releases the memory when it goes out of scope; a shared_ptr releases it when the last shared_ptr referring to the resource goes out of scope. This solves the problems of dangling pointers and memory leaks. Note that shared_ptr's are not completely foolproof: they still have some sharp edges, like the circular reference problem and the inability to use shared_from_this from a constructor
  • std::vector offers a safe way to manage a contiguous buffer. It has iterators for access and it automatically grows when elements are added. The memory is released when the vector goes out of scope. Again it offers an alternative for all problems above.
  • std::string is comparable with std::vector but is specialized for strings. C uses character arrays and they are prone to all of the above mentioned problems
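The circular reference problem mentioned in the first bullet can be sketched as follows. Node and the two functions are hypothetical names chosen for this illustration; the first function breaks its own cycle afterwards so that the demo itself does not leak.

```cpp
#include <cassert>
#include <memory>

// Hypothetical node type that counts live instances.
struct Node
{
   static int s_nLive;

   Node()  { ++s_nLive; }
   ~Node() { --s_nLive; }

   std::shared_ptr<Node> ptrNext;   // owning link
   std::weak_ptr<Node>   ptrPrev;   // non-owning back link
};

int Node::s_nLive = 0;

// Two nodes owning each other: returns the number of nodes still alive
// after all local shared_ptr's have gone out of scope.
int LiveAfterSharedCycle()
{
   std::weak_ptr<Node> wpA;
   {
      auto ptrA = std::make_shared<Node>();
      auto ptrB = std::make_shared<Node>();

      ptrA->ptrNext = ptrB;
      ptrB->ptrNext = ptrA;   // cycle: the counts can never reach zero
      wpA = ptrA;
   }

   const int nLive = Node::s_nLive;

   // Break the cycle by hand so the demo cleans up after itself.
   if (auto ptrA = wpA.lock())
      ptrA->ptrNext.reset();

   return nLive;
}

// Same structure, but the back link is a weak_ptr: no cycle of owners.
int LiveAfterWeakLink()
{
   {
      auto ptrA = std::make_shared<Node>();
      auto ptrB = std::make_shared<Node>();

      ptrA->ptrNext = ptrB;
      ptrB->ptrPrev = ptrA;   // does not keep ptrA alive
      assert(ptrB->ptrPrev.lock() == ptrA);
   }

   return Node::s_nLive;
}
```

With two owning links both nodes survive the end of the scope; with the weak back link both are destroyed as expected.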

Good C++ code can be as fast or even faster than corresponding C code. As usual one has to know the idioms and read the standard C++ books.

Example 

Suppose you have a function which fills a variable length buffer and does some processing on it. It has multiple early out paths.

 C case

In C this could be:


#include <stdbool.h>
#include <stdlib.h>

bool f()
{
   size_t nLen = GetBufferLength();

   int* p = malloc(nLen * sizeof(int));

   if (!p)
   {
      return false;   /* allocation failed */
   }
   
   if (!GetBuffer(p, nLen))
   {
      free(p);
      return false;
   }

   if (!EncodeBuffer(p, nLen))
   {
      free(p);
      return false;
   }
   
   free(p);
   return true;
}

 C++ case

In C++ one can use std::vector as a variable length buffer:


#include <vector>

bool f()
{
   const size_t nLen = GetBufferLength();

   std::vector<int> vec(nLen);
   
   if (!GetBuffer(vec.data(), vec.size()))
   {
      return false;
   }

   if (!EncodeBuffer(vec.data(), vec.size()))
   {
      return false;
   }
   
   return true;
}

There is only one small performance drawback in using std::vector: its elements get value-initialized (zeroed for int), which may be an issue when a huge buffer is allocated.
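If the zeroing pass really matters, one possible alternative is a std::unique_ptr<int[]> created with new int[n], which leaves the elements uninitialized (C++20 offers std::make_unique_for_overwrite for the same purpose). The sketch below assumes that GetBuffer writes every element before anything reads it; the three helper functions are hypothetical stand-ins for the ones used in the example above.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical stand-ins for the functions used in the example above.
std::size_t GetBufferLength()
{
   return 16;
}

bool GetBuffer(int* p, std::size_t nLen)
{
   for (std::size_t n = 0; n != nLen; ++n)
      p[n] = static_cast<int>(n);   // every element is written here

   return true;
}

bool EncodeBuffer(int* p, std::size_t nLen)
{
   for (std::size_t n = 0; n != nLen; ++n)
      p[n] ^= 0x55;

   return true;
}

bool f()
{
   const std::size_t nLen = GetBufferLength();

   // new int[nLen] default-initializes: the elements are left uninitialized,
   // so no zeroing pass is done. The unique_ptr still frees the buffer on
   // every return path.
   const std::unique_ptr<int[]> ptrBuffer(new int[nLen]);

   if (!GetBuffer(ptrBuffer.get(), nLen))
   {
      return false;
   }

   if (!EncodeBuffer(ptrBuffer.get(), nLen))
   {
      return false;
   }

   return true;
}
```

The price is that the size has to be carried around separately and that reading an element before it is written is undefined behavior, so std::vector stays the better default.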

Sunday, January 3, 2021

Visual Studio std::pow implementations

pow

After an upgrade of Visual Studio 2017 to 2019 it was noticed that regression tests were failing with the new version. There were multiple causes; one of them was that (yet again) the implementation of std::pow had changed.

 The Visual Studio 2017 implementation uses a different code path for the common power 2 (square) case: it issues a simple multiplication. The implementation is something like this:


_Check_return_ inline double pow(_In_ double _Xx, _In_ int _Yx) noexcept
       {
       if (_Yx == 2)
              return (_Xx * _Xx);

       return (_CSTD pow(_Xx, static_cast<double>(_Yx)));
       }

The Visual Studio 2019 implementation isn't available in source code form but the special code path for calculating the square seems not to be present anymore. This gives (small) differences with some numbers, e.g. the square of '0.10000000055703842' gives a different result.

Alternative

Luckily Boost offers an alternative in its math library for calculating squares and other integer powers known at compile time:


#include <boost/math/special_functions/pow.hpp>

constexpr double d = 0.10000000055703842;

const double d2 = boost::math::pow<2>(d);

Using this function should give stable results across coming upgrades of Visual Studio. It also has the extra benefit of better performance. Results of a test case running many power calculations:

Function              Time (s)
boost::math::pow<2>   0.241
std::pow              9.395

Note that instead of using a power function, direct multiplication is of course also possible. Often though these power functions are fed with another calculated value, which otherwise has to be duplicated or written down explicitly:


const double d = std::sqrt(boost::math::pow<2>(pt.x - x) + boost::math::pow<2>(pt.y - y));

Boost's math::pow has the extra benefit of needing fewer multiplications than repeated multiplication when the power is larger than 3.
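The idea behind saving multiplications is exponentiation by squaring. The following is a minimal sketch of that technique, assuming a C++14 compiler; PowN is a hypothetical name and this is not Boost's actual implementation.

```cpp
// Hypothetical helper: raises x to the compile-time power N using
// exponentiation by squaring.
template <unsigned N>
constexpr double PowN(double x)
{
   double dResult = 1.0;
   double dBase = x;

   for (unsigned n = N; n != 0; n /= 2)
   {
      if (n % 2 != 0)
         dResult *= dBase;   // this bit of the exponent is set

      if (n > 1)
         dBase *= dBase;     // square the base for the next bit
   }

   return dResult;
}
```

For N equal to 4 the base is squared twice, where repeated multiplication needs three multiplications. For N equal to 2 the result is 1.0 * (x * x), which is exactly the plain product x * x.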

Saturday, January 2, 2021

Careful with std::mutex

 recursive_mutex

std::recursive_mutex can be locked again by the owning thread without blocking. A normal std::mutex doesn't support this behavior; an attempt to lock it again from the owning thread is even undefined behavior.
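A minimal sketch of what std::recursive_mutex permits is shown here; the Counter class is a hypothetical example, not taken from real code. Replacing the mutex type with std::mutex would make IncrementTwice undefined behavior.

```cpp
#include <mutex>

class Counter
{
public:
   void Increment()
   {
      std::lock_guard<std::recursive_mutex> lck(m_mtx);

      ++m_n;
   }

   void IncrementTwice()
   {
      std::lock_guard<std::recursive_mutex> lck(m_mtx);

      Increment();   // locks m_mtx again on the owning thread: allowed
      Increment();
   }

   int Get() const
   {
      std::lock_guard<std::recursive_mutex> lck(m_mtx);

      return m_n;
   }

private:
   mutable std::recursive_mutex m_mtx;
   int                          m_n = 0;
};
```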

Still, C++ experts promote std::mutex over std::recursive_mutex. The locking is clearer, but programmers need to take extra care.

Example

 Suppose a class with two data members (A and B) and the data needs to be protected from access by multiple threads. The class has two functions; one to change A and one to change A and B. The following implementation is wrong:


#include <mutex>

class Dummy
{
public:
   void SetA(int a)
   {
      std::unique_lock<std::mutex> lck(m_mtx);

      m_a = a;
   }

   void SetAB(int a, int b)
   {
      std::unique_lock<std::mutex> lck(m_mtx);

      m_b = b;

      SetA(a);  // error: mutex is still locked
   }

private:
   std::mutex  m_mtx;
   int         m_a;
   int         m_b;
};

  In the function SetAB the mutex is still locked when invoking SetA leading to undefined behavior.

 Fixing this by adding a SetB function and invoking this function is also incorrect:


class Dummy
{
public:
   void SetA(int a)
   {
      std::unique_lock<std::mutex> lck(m_mtx);

      m_a = a;
   }

   void SetB(int b)
   {
      std::unique_lock<std::mutex> lck(m_mtx);

      m_b = b;
   }

   void SetAB(int a, int b)
   {
      SetA(a);  
      SetB(b);  
   }

   // rest same as before
};

This is wrong for another reason: A and B are now not modified under the same lock. A race condition may emerge where A is already changed but B not yet. This may break an invariant if A and B need to be updated together.

 There are multiple solutions to this issue:

  • differentiate between locking and non-locking (implementation) functions. The drawback is that the number of (private) member functions increases, which obfuscates the design
  • set member data A and B directly without using member or other functions. For simple functions like in the class above this is okay, but many times the SetXxx member function does extra things (e.g. notifying clients). Duplicating this code is not attractive.

Example of differentiating between locking and non-locking functions:

class Dummy
{
public:
   void SetA(int a)
   {
      std::unique_lock<std::mutex> lck(m_mtx);

      SetAImpl(a);
   }

   void SetB(int b)
   {
      std::unique_lock<std::mutex> lck(m_mtx);

      SetBImpl(b);
   }

   void SetAB(int a, int b)
   {
      std::unique_lock<std::mutex> lck(m_mtx);

      SetAImpl(a);
      SetBImpl(b);
   }

private:
   void SetAImpl(int a)
   {
      m_a = a;
   }

   //etc.
   
   // rest same as before
};

The above example is a simplified version. In real code classes may have dozens of member functions. Also, the invocation of another member function while the mutex is already locked may happen indirectly, e.g. through an event notification.
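For the event notification case, one possible approach is to copy the callback while holding the lock and to invoke it only after the lock has been released. The sketch below shows this under stated assumptions: the Notifier class, its Callback type and SetCallback are hypothetical and only illustrate the idea.

```cpp
#include <functional>
#include <mutex>
#include <utility>

class Notifier
{
public:
   using Callback = std::function<void(int)>;

   void SetCallback(Callback fn)
   {
      std::lock_guard<std::mutex> lck(m_mtx);

      m_fnOnChanged = std::move(fn);
   }

   int GetA() const
   {
      std::lock_guard<std::mutex> lck(m_mtx);

      return m_a;
   }

   void SetA(int a)
   {
      Callback fn;

      {
         std::lock_guard<std::mutex> lck(m_mtx);

         m_a = a;
         fn = m_fnOnChanged;   // copy the callback while holding the lock
      }

      // The mutex is released here, so the callback may safely call GetA
      // or SetA on this object again.
      if (fn)
         fn(a);
   }

private:
   mutable std::mutex m_mtx;
   int                m_a = 0;
   Callback           m_fnOnChanged;
};
```

The trade-off is that the notification is no longer delivered under the lock: another thread may change A again between the unlock and the callback, so the callback should not assume that GetA still returns the value it was notified about.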

Careful with std::ranges

<ranges>   C++20 has added the ranges library. Basically it works on ranges instead of iterators but added some subtle constraint...