Showing posts with label Performance. Show all posts
Showing posts with label Performance. Show all posts

Sunday, May 7, 2023

Careful with std::shared_ptr

std::shared_ptr

std::shared_ptr is a C++ smart pointer who takes shared ownership of the pointee. It solves some of the memory problems associated with the C language which are memory leaks; buffer overruns and dangling pointers. The first two can be solved by using std::vector; the first and last one by using std::shared_ptr. shared_ptr has though some sharp edges:

  • one can create cycles between std::shared_ptr (i.e. A points to B; B points to A). The memory isn't released then. Solution is to use std::weak_ptr to break the cycle. Alternatively one can use a raw pointer to point back to the owner.
  • never assign a 'raw' resource to two std::shared_ptr's. Instead once a resource is assigned to a std::shared_ptr use the std::shared_ptr to share that resource.
  • use enabled_shared_from_this to hand out a std::shared_ptr of yourself. This fails in constructor because the std::shared_ptr structure is build after the constructor returns.

Garbage collectors don't suffer from these issues but the runtime price one pays for it is large. Also their non-deterministic destruction may another big hurdle to coop with.

Tuesday, August 9, 2022

Watch out for MSVC's std::map's move constructor

move constructor

 The other day I found out that MSVC's std::map move constructor doesn't have noexcept:

map(map&& _Right) : _Mybase(_STD move(_Right)) {}

This means that it can throw when moved and it's not eligible for move_if_noexcept which is used in vector reallocation's. I filed a report but it got rejected. Take care when you want map's to be moveable (which can be a factor compared to copying all nodes).

Thursday, April 21, 2022

Careful with mixing types in loop variables

Loop

 The other day I stumbled upon code which accidentally used a double as end condition:

#include <cstdint>
#include <cstdlib>

constexpr double g_dW = 10;

int main()
{
int res = 0;

for (int n = 0; n < g_dW; ++n)
{
res += std::rand(); // use rand just to have a valid loop
}

return res;
}

One would think that the optimizer Visual Studio 2019 16.13 is smart enough to use integer comparison but this is not the case. VS2019 issues a relative expensive integer to double conversion and comparison:

   
    for (int n = 0; n < g_dW; ++n)
00007FF7C9A01011  movsd       xmm6,mmword ptr [__real@4024000000000000 (07FF7C9A02240h)]  
00007FF7C9A01019  mov         edi,ebx  
00007FF7C9A0101B  nop         dword ptr [rax+rax]  
    {
        res += std::rand();
00007FF7C9A01020  call        qword ptr [__imp_rand (07FF7C9A02188h)]  
00007FF7C9A01026  inc         edi  
00007FF7C9A01028  add         ebx,eax  
00007FF7C9A0102A  movd        xmm0,edi  
00007FF7C9A0102E  cvtdq2pd    xmm0,xmm0  
00007FF7C9A01032  comisd      xmm6,xmm0  
00007FF7C9A01036  ja          main+20h (07FF7C9A01020h)  
    }
    
    return res;
 

The solution is simple to use an integer as block condition:

 constexpr int g_dW = 10;

Therefore watch out that VS2019 does not optimize things automatically.

 

Wednesday, February 16, 2022

Watch out for virtual function invocation

virtual function

 In some heavily invoked code the following construct was used:

interface ITable
{
   virtual std::string GetData() const = 0;
};

class Table : public ITable
{
public:
   virtual std::string GetData() const override
   {
      return std::string{"1"};
   }
};

class TableCt : public Table
{
public:
   long GetDataCt() const
   {
      const std::string strData = GetData();  // watch out
      return std::stol(strData);
   }
};

Client code invokes 'GetDataCt', e.g.:

void f()
{
    TableCt tbl;
    tbl.GetDataCt();  
}

in the real case TableCt wasn't the final class but GetData will not be overriden anywhere else. The code is functional correct but it takes a performance hit since base class function 'GetData' is invoked through vtable:

    TableCt tbl;
    tbl.GetDataCt();
00007FF653F710BC  lea         rdx,[rsp+30h]  
00007FF653F710C1  lea         rcx,[tbl]  
00007FF653F710C6  call        rax  

  On closer thought this is logical since the compiler must deal with 'GetData' being overwritten by a derived class. Therefore explicitly calling the base class function is needed to tell the compiler that it must use the base class:

class TableCt : public Table
{
   long GetDataCt() const
   {
      return std::stol(__super::GetData());
   }
};

 Using the keyword final could also do the trick too although it depends then on the cleverness of the compiler.

Sunday, February 13, 2022

Careful with profiling

Profiling

 The other day I was profiling a SSE optimized distance function and the more optimized form was 3 times as slow as the basic variant. The code was a bit like this (skipping the SSE variant):

template <typename T>
class Point
{
public:
   constexpr      Point     (T x, T y);

   constexpr T    GetX      () const;
   constexpr T    GetY      () const;

private:
   T              m_x;
   T              m_y;
};

template <typename T>
constexpr Point<T>::Point(T x, T y)
:  m_x(x)
,  m_y(y)
{
}

template <typename T>
constexpr T Point<T>::GetX() const
{
   return m_x;
}

template <typename T>
constexpr T Point<T>::GetY() const
{
   return m_y;
}

// explicit (DLL) instantiation
template class __declspec(dllexport) Point<double>;

double DistSqr(const Point<double>& rpt1, const Point<double>& rpt2)
{
    const double dx = rpt1.GetX() - rpt2.GetX();
    const double dy = rpt1.GetY() - rpt2.GetY();
   
    return (dx * dx) + (dy * dy);
}

It turned out that exported functions take a major performance hit; much larger than the two multiplications in 'DistSqr' function. The explicit exported template instantiation exports all functions; even constexpr and inline functions.The reason that calling exported function is slower:

  • invocation is a call instead of one simple memory read instruction
  • it suppresses other optimizations
  • just plain more instructions needed to transform data from 'Point' class to function

Removing the export the function was 3 times faster than without the exported attribute. The accessor functions 'GetX' are then inlined.

Lessons learned:

  • DLL and call invocations can harm performance
  • inspect the assembly

Note: accessor functions like 'GetX'  are prescribed by OOP but be aware of their potential performance cost.



Saturday, February 12, 2022

Mapping enums to enums

Mapping enums to enums

 There is sometimes the need to translate one enumerate value to another. For example let's have the case of enum 'A' and 'B':

enum A
{
   eA0 = 0,
   eA1,
   eA2,
};

enum B
{
   eB0 = 0,
   eB1,
   eB2,
};

A simple translation without asserting on non existing values could work as follow:

B Translate(A eA)
{
    B e = eB0;

    switch (eA)
    {
    case eA0: e = eB0;  break;
    case eA1: e = eB1;  break;
    case eA2: e = eB2;  break;
    }

    return e;
}

This is a straightforward one to one mapping. With Compiler explorer one can see that the Visual Studio 2019 compiler translates this in a bunch of compare and jump statements:

B Translate(A) PROC
        test    ecx, ecx
        je      SHORT $LN4@Translate
        sub     ecx, 1
        je      SHORT $LN5@Translate
        cmp     ecx, 1
        jne     SHORT $LN4@Translate
        mov     eax, 2
        ret     0
$LN5@Translate:
        mov     eax, 1
        ret     0
$LN4@Translate:
        xor     eax, eax
        ret     0
B Translate(A) ENDP   

In this case the value are exact the same so a more optimal solution would be that it just uses the value:

B Translate2(A eA)
{
    return static_cast<B>(eA);
}

In Compiler explorer it can be seen now that this is denser and faster:

B Translate2(EA) PROC
        mov     eax, ecx
        ret     0
EB Translate2(EA) ENDP  

It seems that Visual Studio doesn't recognize this optimization by itself.

More robust

Above cast can be dangerous if values do not correspond. One can trap for this and for accidental changes by adding some static_asserts

enum A
{
   eAbegin = 0,
   eA0     = ABegin,
   eA1,
   eA2,
   eAend
};

B Translate2(A eA)
{
    static_assert(eAbegin == eBbegin);
    static_assert(eAend == eBend);

    // or (verbose and requires updating after adding an enum):
    static_assert(eA0 == eB0);
    static_assert(eA1 == eB1);
    static_assert(eA2 == eB2);
    
    return static_cast<B>(eA);
}
 

Sunday, August 15, 2021

Mapping enums to value

Mapping enums

  The STL has map and  std::unorderd_map to map enums to values. For example:

#include <string>
#include <unordered_map>

enum E
{
   e0,
   e1,
};

struct Foo
{
   Foo()
   {
      m_umap.emplace(e0, "Test1");
      m_umap.emplace(e1, "Test2");
   }

   std::string Find(E e) const
   {
      auto it = m_umap.find(e);
      
      return it != m_umap.cend() ? it->second : std::string{};
   }
   
   std::unordered_map<E, std::string>  m_umap;
};
 std::unordered_maps are fast with on average O(c) lookup (besides the hash function). Still it can be more optimal by using an array and use the enum as index:

#include <array>
#include <string>

enum E
{
   e0 = 0,
   e1,
   eEnd,
};

struct Foo
{
   Foo()
   {
      m_a[e0] = "Test1"
      m_a[e1] = "Test2"
   }

   std::string Find(E e) const
   {
      return m_a[e];
   }
   
   std::array<std::string, eEnd>  m_a;
};

This is the optimal form since with array's the values are in contiguous memory and the lookup is O(c) without a the need to calculate a hash beforehand. 

 Other alternatives are using a switch-case or linear lookup in a  std::vector.

 Some performance measurements with a 10 value enum: 


Method Time (s)
array 0.32
map 2.40
switch-case 1.11
unordered_map 1.60

Sunday, April 11, 2021

Return value optimization and assignment

Return value optimization

 'Return Value Optimization' (RVO) is a compiler optimization where it can apply copy elision in case a function returns an object by value. There are two flavors:

  • RVO
  • NRVO

 In the following example a function returns a class X:

class X
{
};

X Get()
{
   return X{};
}

int main()
{
   X x1 = Get();  // 1 constructor call
}

 Without RVO in above case there would be two constructor calls: one for creating the anonymous temporary and one to copy construct that to the call side.

 With RVO the compiler can directly construct the object in the call side location and circumvent the extra copy. In C++ 17 this is only guaranteed when returning anonymous variables.

 The mechanism works as if the function 'Get' has a hidden argument in which a class of type 'X' can be created directly. The MSDN article in the link below does a great job of explaining it.

Named return value optimization

With 'Named Return Value Optimization' (NRVO) the return variable inside the function has a name. Still the compiler is allowed to optimize this one away as well:

X Get()
{
   X x;  // named value

   return x;
}

int main()
{
   X x1 = Get();
}

  Be aware that the compiler is allowed but not required to remove the extra copy. Observations on Visual Studio 2019:

  • in _DEBUG mode two constructor calls: one inside the function and one copy constructor to the final destination.
  • in RELEASE mode only one constructor call. NRVO was applied here.

Assignment 

Things are less ideal when an object is assigned instead of constructed. Consider the following example:

int main()
{
   X x1;         // 1 constructor call
   x1 = Get();   // 1 constructor and one (move) assignment call
}

The code needs two constructor- and one assignment calls: The 'Get' function needs again a type X for its hidden argument and the compiler creates a temporary hidden object to be filled inside the function. This temporary is then assigned to call side variable 'x1'. 

 From a performance perspective this is less ideal to fill an object; especially if default constructors are relative heavy. It can be considered to switch back to the classic way of filling an object through argument. This can be an attractive alternative when assignment is a dominant use case:

void Fill(X* p)
{
}

int main()
{
   X x1;         // 1 constructor call
   Fill(&x1);
}

Links

Sunday, March 7, 2021

Enums in a container

Enums

 Enumerates (enums) are often used to distinguish mutual exclusive properties. An enum variable can have than only one value. There are other use cases where more than one enum value can be active at the same time. If these enums are unique and ascending it gives a possibility for optimized storage by using std::bitset over other STL containers.

 Container

 To allow non mutual exclusive enums and store them in a container the following options exist:

  • use a std::vector
  • use a std::bitset
  • enums are represented by unique
 using a std:vector is the traditional solution but using the alternative of a std::bitset is more optimal; both in performance as in less memory consumption. The last option is effectively the same as using bitset but enums are more naturllly sequentially ordened.

Example 

Suppose you have the case where used file types are represented by enums and there is a 'Content' class which gives an oversight of all used file types.


enum EFileTypes
{
   eFtBegin = 0,
   eFtData  = eFtBegin,
   eFtBitmap,
   eFtConfig,
   eFtEnd,
};

using std::vector 

Normally one uses a std::vector to store elements. Above example would then become like this:


#include <algorithm>
#include <vector>


struct Content
{
   void AddFileType(EFileTypes eFt)
   {
      if (!HasFileType(eFt))
      {
         m_vecTypes.push_back(eFt);
      }
   }

   bool HasFileType(EFileTypes eFt) const
   {
      return std::find(m_vecTypes.cbegin(), m_vecTypes.cend(), eFt) != m_vecTypes.cend();
   }

   std::vector<EFileTypes>	m_vecTypes;
};

using std::bitset

 with a bitset one can use the (unique) enum value as position in a bitset:


#include <bitset>


struct Content
{
   void AddFileType(EFileTypes eFt)
   {
      m_btsTypes.set(eFt);
   }

   bool HasFileType(EFileTypes eFt) const
   {
      return m_btsTypes.test(eFt);
   }

   std::bitset<eFtEnd>	m_btsTypes;
};

 Using a bitset has thus the following advantages:

  • O (c) lookup to check if an enum is present
  • memory dense when there are only a couple of enums. Only when you have a large enum range (e.g. > 64) this can become a less attractive option since the bitset is always the size of the largest enum value
  • equality operator is easily made

 

Sunday, January 31, 2021

Multiplication and the optimizer

Test case

 the other day I was profiling the difference between floating point and integer multiplication. To test the multiplication contribution with the least amount of overhead contribution the functions did a number of multiplications. 

floating point

The test for floating point case was as follows:


double PrfMultiply(size_t nMax, double d, double dNumber)
{
   for (size_t n = 0; n != nMax; ++n)
   {
      d *= dNumber;
      d *= dNumber;
      d *= dNumber;
      d *= dNumber;

      d += (d < 100 ? 100.0 : 0);
   }

   return d;
}

  For x64 the Visual Studio compiler uses SSE instructions for floating point for numbers so it's no surprise then to see four (scalar) multiplication instructions in the generated code:


00007FF7C0D41700  mulsd       xmm7,xmm1  
00007FF7C0D41704  mulsd       xmm7,xmm1  
00007FF7C0D41708  mulsd       xmm7,xmm1  
00007FF7C0D4170C  mulsd       xmm7,xmm1 

Note: it's also remarkable that the compiler only uses the scalar SSE instruction and not invokes the packed variant (mulpd); From this table one can also see that (double) floating point multiplication lasts around 5 CPU cycles which makes it a fairly fast instruction

integer

  For integer the function looks almost the same:


size_t PrfIntegerMultiply(size_t nMax, size_t n, size_t nNumber)
{
   for (size_t n2 = 0; n2 != nMax; ++n2)
   {
      n *= nNumber;
      n *= nNumber;
      n *= nNumber;
      n *= nNumber;

      n -= (n > 1000 ? 995 : 0);
   }

   return n;
}

 It turns out that Visual Studio 2019 (in release mode) already optimized the four multiplication steps and coalesced them in one multiplication instruction (3^4 == 81 == 51h):


00007FF61C131840  imul        rbx,rbx,51h  

Conclusion

When profiling it's always advisable to look at the generated assembly code. Often the optimizer is smarter than you think or may even completely removes function invocation. This is especially the case with fixed predefined numbers and (static) functions in translation units.

Sunday, January 10, 2021

C++ solutions for some C issues

The C language

 The C language was co-developed with UNIX and played and important part in the ICT history. It was a small language with little overhead. Unfortunately in the use of it it had some issues as well:

  • dangling pointers
  • memory leaks
  • buffer overruns

 C++

  C++ original goal was to offer and object oriented programming language compatible with C. Later it incorporated generics as well in the form of 'templates'. 

 Modern C++ offers solutions for above issues:

  • smart pointers like unique_ptr and shared_ptr own a memory resource. The smart pointer releases the memory when the last reference to the smart pointer is going out of scope. This solves the problems of dangling pointers and memory leaks. Note that shared_ptr's are not completely opaque: they still have some sharp edges as well like the circular reference problem and the inability to used shared_from_this from a constructor
  • std::vector offers a safe way to manage a contiguous buffer. It has iterators for access and it automatically grows when elements are added. The memory is released when going out of scope. Again it offers an alternative for all problems above.
  • std::string is comparable with std::vector but is specialized for strings. C uses character arrays and they are prone for all of the above mentioned problems

Good C++ code can be as fast or even faster than corresponding C code. As usual one has to know the idioms and read the standard C++ books.

Example 

Suppose you have a function which fills a variable length buffer and do some processing on it. It has multiple early out paths.

 C case

In C this could be:


#include <stdlib.h>

bool f()
{
   size_t nLen = GetBufferLength();

   int* p = malloc(nLen * sizeof(int));
   
   if (!GetBuffer(p, nLen))
   {
      free(p);
      return false;
   }

   if (!EncodeBuffer(p, nLen))
   {
      free(p);
      return false;
   }
   
   free(p);
   return true;
}

 C++ case

 In C++ one can use std:vector as variable length buffer:


#include <vector>

bool f()
{
   const size_t nLen = GetBufferLength();

   std::vector<int>	vec(nLen);
   
   if (!GetBuffer(vec.data(), vec.size()))
   {
      return false;
   }

   if (!EncodeBuffer(vec.data(), vec.size()))
   {
      return false;
   }
   
   return true;
}

There is only one small performance drawback in using std::vector: its elements get default or zero initialized which may be an issue in case a huge buffer is allocated.

Thursday, December 31, 2020

OOP and performance

Encapsulation and information hiding

  Encapsulation and information hiding are major concepts of object oriented programming (OOP). From a maintenance perspective it is beneficial not to be dependent or know about the implementation of an object. Not considering these aspects though can also harm the performance. Example given here may be obvious but is used for illustration purposes.

Example

  Consider the following sample class which uses a std::vector to store its data elements.The sample has the following member functions:

  • Add to add a datum
  • Clear to clear all contained data
  • Sum (or something else) to extract the data

#include <numeric>
#include <vector>

class Sample
{
public:
   void Add(double d)
   {
      m_vecData.push_back(d);
   }

   void Clear()
   {
      m_vecData.clear();
   }

   double Sum() const
   {
      return std::accumulate(m_vecData.cbegin(), m_vecData.cend(), 0.0);
   }

private:
   std::vector<double>  m_vecData;
};

Suppose there is some recorder class which produces sample data. There are a number of ways samples can be retrieved from the recorder:

  1. a function GetSample which returns the sample.
  2. a function FillSample which clear and fills a supplied sample.

class Recorder
{
public:
   Sample GetSample() const
   {
      Sample s;
      s.Add(m_dValue); // in real life the data comes from e.g. a buffer, file or socket

      return s;
   }

   void FillSample(Sample* pSample) const
   {
      pSample->Clear();
      pSample->Add(m_dValue);
   }
   
private:
   double   m_dValue = 1.0;
};

A client use is that many samples are extracted from the recorder in e.g. a loop.


// example 1:
double SampleGet(const Recorder& rRecorder, size_t nMax)
{
   double dTotal = 0.0;

   for (size_t n = 0; n != nMax; ++n)
   {
      const Sample s = rRecorder.GetSample();

      dTotal += s.Sum();
   }

   return dTotal;
}

// example 2:
double SampleGet(const Recorder& rRecorder, size_t nMax)
{
   double dTotal = 0.0;

   Sample s;

   for (size_t n = 0; n != nMax; ++n)
   {
      s = rRecorder.GetSample();

      dTotal += s.Sum();
   }

   return dTotal;
}

// example 3:
double SampleFill(const Recorder& rRecorder, size_t nMax)
{
   double dTotal = 0.0;

   Sample s;

   for (size_t n = 0; n != nMax; ++n)
   {
      rRecorder.FillSample(&s);

      dTotal += s.Sum();
   }

   return dTotal;
}

std::vector only reallocates memory when the capacity is too small. This steers the performance for a large part since memory allocations are performance wise relative heavy:

  1. using GetSample is the cleanest solution from a C++ perspective. Although NRVO may prevent superfluous sample copies, the sample (and thereby the vector) still needs to be created inside the function which may hamper the performance.
  2. using GetSample with the sample outside the loop is not much better.
  3. using FillSample is not the cleanest interface but is favorable from a performance perspective in this case since memory (re)allocation will only occur if the supplied Sample' vector cannot hold enough data elements

Above aspects are an issue due to the implementation details of the 'Sample' class and when performance considerations need to be weighted.

Counter argument

  The classical counter argument from an OO perspective would be that the interface should reflect and prevent bad client use. For this case one could delete the copy- and move constructors and assignment operators. Not sure if that is a good direction. In itself it's not a bad thing that samples get copied. After all they may be treated as value objects Also deleting these functions would prevent them of storing them in the preferred STL container std::vector.

    Conclusion

    As usual in engineering aspects have to be judged and balanced. The OO principles are good guidelines but it's good to know their limitations and break them when necessary.

    'Effective C++' (third edition) mentions a similar case in Item 26 'Postpone variable definitions as long as possible'.

    Wednesday, December 30, 2020

    inline in C++

    Keyword

     It's a common myth that the inline specifier in C++ is ignored by the compiler. While the standard states that it's only a hint for the compiler at least on Visual Studio 2019 the inlining process is influenced by the keyword.

    Test case

     For the test case the following class is used:

    
    struct Dummy
    {
        explicit      Dummy       (int n);
        
        int           Get         () const;  // defined in cpp
        int           GetInline   () const;
    
    private:
        int           m_n;
    };
    
    inline int Dummy::GetInline() const
    {
       return m_n;
    }
    
    

    Assembly

      The use of inlining by the compiler can be studied by looking at the produced assembly in release mode. Tip to use DebugBreak() under debugger to stop around call location.

    For normal function invocations it looks like:

    
    00007FF71ECF166C  lea         rcx,[rsp+30h]  
    00007FF71ECF1671  call        Dummy::Get (07FF71ECF1730h)  
    00007FF71ECF1676
    00007FF71ECF167C
    00007FF71ECF1680  add         ecx,eax  
    
    

     The inlined version looks like:

    
    00007FF71ECF1676
    00007FF71ECF167C  add         ecx,dword ptr [rsp+38h]  
    
    

     For function invocation a call instruction is issued while the inlined version uses an add of the memory content to register ecx. The assembly code for the non inlined function definition is a plain memory copy instruction so performance wise the call is just overhead.

    Results

     Tests take place on Visual Studio 2019 version 16.8.3 with a x64 build in release mode. The following cases are tested:

    1. use class inside a module
    2. use class inside a module with /LTCG (i.e. 'link time code generation')
    3. use exported class (i.e. the class is exported from a DLL and used in another module)

    The results are as follow:

    Case Get GetInline
    use inside a module not inlined inlined
    use inside a module with /LTCG inlined inlined
    use exported class not inlined inlined

    Conclusion

      The inline keyword is not ignored by Visual Studio 2019 and used as a hint. Recommendation to use it where applicable (e.g. one line 'getters' and performance intensive use cases). 

      The use of inline in a DLL context is food for discussion but it's okay as long as you rebuild the DLL (and its clients) when you change the source code of the DLL. MFC and boost are standard examples. 

     Other aspects not discussed here are use in (exported) templates and constexpr constructors (with noexcept).

    External links

    •  https://devblogs.microsoft.com/cppblog/inlining-decisions-in-visual-studio/

    Careful with std::ranges

    <ranges>   C++20 has added the the ranges library. Basically it works on ranges instead of iterators but added some subtle constraint...