Friday, October 16, 2020

Does C++ still deserve the bad rap it has had for so long?

Traditionally C++ has been seen by many (and you know who you are) as just plain bad: the code is unreadably verbose, error messages are indecipherable, it's unsafe, compilation takes forever and so on. In fact, making fun of C++ is a popular pastime for some. All of this was certainly true in the 90s and even as recently as 10 years ago. But is it still the case? What would be a good way to determine this?

A practical experiment

Let's write a fairly simple program that solves a sort-of-real-worldish problem and see what comes out of it. The problem we want to solve is the following:

Find all files with the extension .txt recursively in the subdirectories of the current directory, count the number of times each word appears in them and print the ten most common words and the number of times they are used.

We're not going to aim for extreme performance or handle every possible corner case; instead we'll go for a reasonable, straightforward implementation. The full code can be found by following this link.

The first thing we need is a way to detect files with a given extension and words within text. This calls for regular expressions:

const std::regex fname_regex("\\.txt$", std::regex_constants::icase);

const std::regex word_regex("([a-z]{2,})", std::regex_constants::icase);

This might be a bit verbose, but it is quite readable. It only works for ASCII text, but for the purposes of this experiment that is sufficient. To calculate the number of times each word is seen, we need a hash table:


std::unordered_map<std::string, int> word_counts;


Now we need to iterate over all files in the current directory subtree.


for(const auto &e: std::filesystem::recursive_directory_iterator("."))


Skip everything that is not a .txt file.


if(!e.is_regular_file()) {

    continue;

}

if(!std::regex_search(e.path().string(), fname_regex)) {

    continue;

}


Process the file line by line:


std::ifstream current_file(e.path());

for (std::string line; std::getline(current_file, line); )


For each line we need to find all words that match the regular expression.


std::sregex_iterator word_end;

for(auto it = std::sregex_iterator(line.begin(), line.end(), word_regex); it != word_end; ++it)


This is a bit verbose, but it takes care of all the state tracking needed to run the regex multiple times on the same input string. Doing the same by hand with plain std::regex_search is fairly tedious and error prone; for comparison, a rough sketch of the manual version (not part of the actual program) might look like this:
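
std::smatch match;
auto search_start = line.cbegin();
while(std::regex_search(search_start, line.cend(), match, word_regex)) {
    // Handle match[1] here, then restart the search just past the match.
    search_start = match.suffix().first;
}


Now that we have the matches, we need to convert them to standalone lowercase words.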


std::string word{it->str()};

for(auto &c : word) {

    c = std::tolower(c);

}
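
The same can also be written as a single call to std::transform from the algorithm header; a minimal equivalent sketch, assuming the same ASCII-only input:


std::transform(word.begin(), word.end(), word.begin(),
               [](unsigned char c) { return static_cast<char>(std::tolower(c)); });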


Lowercasing strings is a bit cumbersome either way, granted, but at least both versions are fairly readable. Then on to incrementing the word count.


++word_counts[word];


That's all that is needed to count the words. The remaining task is to pick out the ten most common ones. That can't be done directly in the hash map, so we need to convert the data to an array. This needs some setup code.


struct WordCount {

    std::string word;

    int count;

};


std::vector<WordCount> word_array;

word_array.reserve(word_counts.size());


Since we know how many entries will be in the array, we can reserve enough space for it in advance. Then we need to iterate over all entries and add them to the array.


for(const auto &[word, count]: word_counts) {

    word_array.emplace_back(WordCount{word, count});

}


This uses the new structured bindings feature, which is a lot more readable than the old way of dealing with iterator objects or pairs with their firsts and their seconds and all that horror. Fortunately no more.
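
For reference, the old way looked roughly like this (a pre-C++17 sketch, equivalent to the loop above):


for(std::unordered_map<std::string, int>::const_iterator it = word_counts.begin();
    it != word_counts.end(); ++it) {
    word_array.emplace_back(WordCount{it->first, it->second});
}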


The simple way of getting the 10 most used entries is to sort the array and then grab the first 10 elements. Let's do something slightly swaggier instead [0]. We'll do a partial sort and discard all entries after the first 10. For that we need a descending sorting criterion as a lambda.


auto count_order_desc = [](const WordCount &w1, const WordCount &w2) { return w2.count < w1.count; };


Using it is straightforward.


const auto final_size = std::min(10, (int)word_array.size());

std::partial_sort(word_array.begin(), word_array.begin() + final_size, word_array.end(), count_order_desc);

word_array.erase(word_array.begin() + final_size, word_array.end());


All that remains is to print the final result.


for(const auto& word_info: word_array) {

    std::cout << word_info.count << " " << word_info.word << "\n";

}

Safety

As safety and security are important features in current software development, let's examine how safe this program is. There are two main safety concerns: thread safety and memory safety. As this program has only one thread, it can be formally proven to be thread safe. Memory safety divides into two main things to note: use-after-frees (or dangling references) and out-of-bounds array accesses.


In this particular program there are no dangling references. All data is held in value types and the only references are iterators. They are all scoped so that they never live past their usages. Most are not even stored in named variables but instead passed directly as function arguments (see especially the call to partial_sort). Thus they cannot be accessed after they have expired. There is no manual resource management of any kind; all resources are freed automatically. Running the code under Valgrind yields zero errors and zero memory leaks.


There is one place in the code where an out-of-bounds access is possible: when creating the iterator pointing to 10 elements past the beginning of the array [2]. If you were to forget to check that the array has at least 10 elements, that would be an immediate error. Fortunately most standard libraries have a way to enable checks for this. Here's what GCC reports at runtime if you get this wrong:


Error: attempt to advance a dereferenceable (start-of-sequence) iterator 100000 steps, which falls outside its valid range.
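
That diagnostic comes from libstdc++'s debug mode; assuming GCC, it can be enabled by compiling with an extra preprocessor define (other standard libraries have similar switches):


g++ -D_GLIBCXX_DEBUG -o cppexample -O2 -g -Wall -std=c++17 cppexample.cpp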


Not really a perfect safety record, but overall it's fairly good and certainly a lot better than implementing the same by manually indexing arrays.

Problems

Compiling the program takes several seconds, which is not great. This is mostly due to the regex header, which is known to be heavyweight. Some of the compiler messages encountered during development were needlessly verbose. The actual errors were quite silly, though, such as passing arguments to functions in the wrong order. This could definitely be better. It is reported that the use of concepts in C++20 will fix most of this issue, but there are no practical real-world usage reports yet.
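
As a rough illustration of what concepts promise (a minimal sketch of my own, unrelated to this program): constraining a template up front turns a page-long instantiation backtrace into a short diagnostic.


#include <concepts>

// Only integral types are accepted. Calling twice(std::string{}) fails
// with a short "constraints not satisfied" message instead of a long
// template instantiation backtrace.
template<std::integral T>
T twice(T value) {
    return value + value;
}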

Unique features

This code compiles, runs and works [1] on all major platforms using only the default toolchain provided by the OS vendor, without any external dependencies or downloads. This is something that no other programming language can provide as of today. The closest you can get is plain C, but its standard library does not have the necessary functionality. Compiling the code is also simple and can be done with a single command:


g++ -o cppexample -O2 -g -Wall -std=c++17 cppexample.cpp

Conclusions

If we try to look at the end result with objective eyes, it seems fairly nice. The entire implementation takes fewer than 60 lines of code. There's nothing immediately terrible that pops up. What is happening at each step is fairly understandable and readable. There's nothing that one could immediately hate and despise for being "awful", as is tradition.


This is not to say you can't still hate C++. If you want to, go right ahead. There are certainly reasons for it. But if you choose to do that, make sure you hate it for the real, actual reasons of today, rather than for things that were broken decades ago but have been fixed since.


[0] There is an even more efficient solution using a priority queue that does not need the intermediate array. Implementing it is left as an exercise to the reader (though one possible sketch appears below).

[1] Reader beware, I have only proven this code to be portable, not tried it.

[2] The implementation in [0] does not have this problem as no iterators are ever modified by hand.
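
For the impatient, here is one possible sketch of the priority queue approach from [0], reusing WordCount and count_order_desc from above (my own untested addition, so treat it as an outline rather than a reference implementation):


#include <queue>

// Keep a min-heap of at most ten entries, keyed on the count, so the
// least common word collected so far is always on top and easy to evict.
std::priority_queue<WordCount, std::vector<WordCount>,
                    decltype(count_order_desc)> top_words(count_order_desc);
for(const auto &[word, count]: word_counts) {
    top_words.push(WordCount{word, count});
    if(top_words.size() > 10) {
        top_words.pop();
    }
}
// Popping top_words now yields the ten most common words, least common first.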



