C++: Implicit conversions to std::string, dynamic allocations and performance

Introduction

When designing functions or APIs in C++, special care has to be taken over whether a function should accept an std::string or a plain char array as an argument. This matters especially in environments such as embedded development, where latency and memory allocation are tightly constrained. This article explains and demonstrates why the choice is important and which pitfalls it involves.

Demonstration of the problem

In this example a logging facility is to be implemented. To keep it simple, it is a single function called "log". For some reason the log function accepts an std::string reference as its argument. For the demonstration it only logs to std::cout.

#include <iostream>
#include <string>

void log(const std::string& input) {
    // Log to file or console..
    std::cout << input << std::endl;
}

Clients of the log function usually call it either with string literals that are already known at compile time or with values such as error codes that are computed at runtime.

int main() {
    log("Initialize app..");
    
    // Do something ..
    log("An error occurred");

    // Do something else ..
    log("Quit");
    return 0;
}

To understand what is happening under the hood when compiling and executing this code the difference between string literals, char arrays and std::string in C++ is clarified first.

C++ string literals vs. char arrays vs. std::string

String literals

A string literal is a sequence of zero or more characters "inside double quotes". String literals are unnamed sequences of characters:

log("Quit"); // "Quit" is a string literal
std::string x = "test"; // "test" is a string literal which is used to initialize x
const char* y = "abc"; // y points to the first character of the literal "abc"

The type of a string literal is an array of const char (const char[N]). Compilers typically place literals in the read-only data section of the object file. String literals therefore exist for the entire lifetime of the program (static storage duration) and are not allocated on the stack or the heap. [cppreference.com]

A null terminator ('\0') is automatically appended to every string literal. This can easily be checked by printing the size of x below, which prints 4.

const char x[] = "123";
std::cout << sizeof(x) << std::endl; // output: 4

If you want to verify that the literal is placed in a read-only section of the object file, you can look at the generated assembly. You will find the literal "123" in the .rodata section. To generate assembler output, run gcc with the option -S, which creates a .s file:

gcc -O0 -S show_assembler.c

Assembler output in show_assembler.s:

...
	.section	.rodata
	.type	x, @object
	.size	x, 4
x:
	.string	"123"
...

You can also check the .rodata section of the ELF file by using the readelf -x .rodata command on the resulting ELF executable:

readelf -x .rodata show_assembler

Hex dump of section '.rodata':
  0x000006e0 01000200 31323300                   ....123.

As you can see here, the null terminator 0x00 is automatically appended after "123". The bytes 0x31, 0x32 and 0x33 are the ASCII codes of the characters '1', '2' and '3'.

Char arrays

A char array is simply an array of characters. As shown in the previous section, a char array can be initialized using a string literal. In that case the size of the array is the number of characters in the literal plus one, because the compiler automatically appends the null terminator.

char x[] = "123"; // Array x has a size of 4

If the array length is specified by the programmer, it must be the number of characters plus one so that the null terminator \0 can be stored. If the array is allocated on the heap, the programmer must also take care of freeing the memory afterwards.

char* x = new char[4];
strcpy(x, "123");
std::cout << x << std::endl;
delete[] x;

std::string

std::string is a class that contains and manages the char array for the programmer. std::string is a typedef for the class template specialization std::basic_string<char>.

If you look at the declaration of std::basic_string in libstdc++, you will find something similar to this.

Source: https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/basic_string.h#L215-L219

Pseudocode for explanation:

template <...>
class basic_string {

  pointer data;
  size_type string_length;
  union
  {
	  char_type local_buffer[16];
	  size_type allocated_capacity;
  };

};

So an std::string in this implementation has three data members, which are used differently depending on the length of the contained string. If you store "christian" in an std::string, the characters are placed directly into the local_buffer char array in the union, which has a size of 16 (room for 15 characters plus the null terminator). The size of this buffer depends on the standard library implementation. This means that small strings like "christian" need no heap allocation, which avoids the overhead of allocating memory on the heap for many small strings. This technique is called small string optimization.

If a longer string is stored in an std::string, memory is allocated dynamically through the string's allocator (by default std::allocator, which uses operator new). The pointer "data" then points to the allocated memory, and the union stores the allocated capacity instead of the characters. For example, "christian christian" (19 characters) does not fit into the char array of size 16, so memory is allocated on the heap instead.

The length of the currently contained data is tracked by "string_length" and is returned via the size() or length() member functions.

Caution and fun fact: sizeof applied to an std::string always returns the same value (32 with libstdc++ on a 64-bit platform), independent of the data actually contained in the std::string. Can you figure out why?

Some advantages of using std::string over char arrays are:

  • No heap allocation for small strings (small string optimization)
  • std::string manages its own memory and grows as needed, which avoids many buffer overflows caused by programming errors
  • Easier to read and use. Many functions for string manipulation are available.

Implicit conversions and converting constructors in C++

Another important aspect to understand is implicit conversions and converting constructors in C++.

In the logging example above we have a function that accepts a std::string reference and we try to call the function with string literals which are of the type const char[N].

void log(const std::string&) {..}
..
log("Initialize app..");

Even though the types do not match (const char array vs. std::string), this example compiles fine. This is because the compiler generates a conversion for the programmer. Such compiler-generated conversions are called implicit conversions.

"Implicit conversions are performed whenever an expression of some type T1 is used in context that does not accept that type, but accepts some other type T2; in particular: when the expression is used as the argument when calling a function that is declared with T2 as parameter" (implicit conversions, cppreference.com)

Since the compiler has to know how to convert a type T1 into the target type T2, the compiler tests available conversions in a specific order, e.g. it will look first for available standard conversions and afterwards for user-defined conversions. One example of such a user-defined conversion is a converting constructor. If you search in the implementation of std::string you will find a converting constructor which takes a char pointer as an input and constructs an std::string from it.

_GLIBCXX20_CONSTEXPR
basic_string(const _CharT* __s, const _Alloc& __a = _Alloc())
  : _M_dataplus(_M_local_data(), __a)
  {
    // ... implementation
  }

Source: basic_string.h Line 634

That means the compiler automatically generates code that calls this converting constructor in order to create the std::string object that is passed to the function "log". We can verify this assumption with Compiler Explorer by looking at the assembler generated by the compiler. The example can be found here.

When we call the function log and pass a string literal, you can see that the compiler automatically adds the code (line 2) to call the converting constructor of std::basic_string<char> and passes the pointer to the string literal data. That means that if we design our function log to accept an std::string as an argument, every call with a string literal pays the overhead of constructing (and later destroying) a temporary std::string object.

...
(1)   mov     rdi, rsp
(2)   call    std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) [complete object constructor]
(3)   mov     rdi, rsp
(4)   call    log(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
(5)   jmp     .L13
...

As it was explained in the preceding chapter the calls to the constructor of std::basic_string<char> may involve a dynamic memory allocation depending on the length of the string literal, i.e. if small string optimization can be used or not.

Measuring the performance for two alternative functions

The performance of accepting an std::string or a char array as an argument can be roughly compared using quick-bench.com. It's a tool to compare the performance of multiple C++ snippets and measure the performance ratio between the different solutions.

For this comparison two functions are created: log_string and log_char_array. The first function takes a const std::string& as an argument and the second a const char*. Both functions are called with the literal "aaabbbcccdddeeefffggg" (21 characters), which is too long for the small string optimization of the std::string implementation shown above.

This is the benchmark code:

#include <iostream>
#include <string>

int log_string(const std::string& output) {
  // std::cout << output << std::endl;
  return 0;
}

int log_char_array(const char* output) {
  // std::cout << output << std::endl;
  return 0;
}

static void LogString(benchmark::State& state) {
  // Code inside this loop is measured repeatedly
  for (auto _ : state) {
    // Make sure the variable is not optimized away by compiler
    benchmark::DoNotOptimize(log_string("aaabbbcccdddeeefffggg"));
  }
}
// Register the function as a benchmark
BENCHMARK(LogString);

static void LogCharArray(benchmark::State& state) {
  for (auto _ : state) {
    // Make sure the variable is not optimized away by compiler
    benchmark::DoNotOptimize(log_char_array("aaabbbcccdddeeefffggg"));
  }
}
BENCHMARK(LogCharArray);

After running the benchmark in quick-bench a bar chart is generated showing the performance ratio between the two solutions. The lower the bar the faster the solution is, i.e. the runtime of the solution is lower. The expectation is that accepting a char array as a parameter is faster than accepting an std::string because the overhead of calling the converting constructor of std::string is not needed.

In the result you can see that accepting an std::string is about 47 times slower than accepting a char array. The left bar shows the CPU time/noop time ratio for the string solution and the right bar for the char array solution. The test was compiled with -O3.

quick bench result

Another comparison is done with the smaller literal "abc", where small string optimization should apply. Even so, the additional overhead of calling the string constructor makes the std::string version about 5 times slower.

quick bench result 2

Summary

In conclusion, C++ offers several options for working with strings. It is crucial to understand what happens under the hood, for example which code the compiler generates (calls to converting constructors for implicit conversions) and where additional dynamic memory allocations may happen. Both can affect the performance and behavior of the program or violate the requirements and constraints of your project; one example of such requirements is the AUTOSAR C++14 guidelines for safety-related projects. As always, it is a good idea to consider the options carefully and choose the most appropriate solution. Even though std::string may be the most comfortable solution, it is not always needed.