std::thread – Antti Juustila

Java file management, hall of fame and a nice surprise

In the previous post I mocked the Java app that was hardcoded to use Windows C: disk as the default place to open files.

What then is the recommended way? One is to start looking from the user home directory:

JFileChooser fileChooser = new JFileChooser();

fileChooser.setCurrentDirectory(new File(System.getProperty("user.home")));

Or pick the documents directory. It is also a nice thing to save the selected directory and use that if the user would like to continue with the already selected directory next time.

What else is happening? It is now the last week of the course Data Structures and Algorithms I am responsible teacher. Students have been analyzing algorithm correctness and time complexity, implementing basic data structures, more advanced ones like hash tables and hash functions, binary search trees and binary search.

Lectures addressed graphs and graph algorithms too, but implementation of these was in an optional course project only, Mazes. When students finish that, they can play a game a bit like the PacMan.

They get to choose from two course projects: either that Mazes project or optionally implement the classic task: count the unique words from a book file, ignoring some words from another file. The largest test file is around 16 MB, having almost 100 000 unique words.

Processing the largest file, using naively two nested for loops takes on my Mac Mini M1 with 16MB of memory around two minutes to process. The fast versions (hash tables, binary search trees) take less than a second.

I have implemented several different solutions for comparison with three different programming languages, Java, Swift and C++. Each week in lectures I demonstrated some of these, and in the end we had this table for comparison (sorry, in Finnish only).

Hall of Fame of the different Books and Words implementations. Swift versions can be found from the link above to GitHub.

As you can see, the C++ with 8 threads was the fastest one. Next after C++ came a couple of Java versions. Swift implementations were not so fast as I expected. After some profiling, I suspect the reason is in the way Unicode chars are handled in Swift. All the book files are UTF-8 and students were expected to handle them correctly. I do not like that mostly in teaching programming languages the default is to stick with ascii and conveniently forget the existence of different languages and character sets.

Well, anyways, for some reason, processing these UTF-8 text files takes a lot of time with Swift. Maybe later I have time to find out if the issue is in my code and/or is there anything that can be done to speed things up.

Something very nice happened a week ago — the student guild of our study program, Blanko, awarded me this diploma for being a quality teacher. Apparently they had a vote and I somehow managed to come first this time. The diploma was accompanied by this Italian themed small gift box. A really nice surprise! I was so astonished to receive this, thank you so much if anyone of you is reading!

Nice award from the students for quality teaching.

To thread or not to thread

There’s a distributed C++ system I made, used as a “patient” in a course on Software architectures. It includes a command line tool TestDataGenerator, which I implemented to test the performance and reliability of the system. The tool generates random data in memory buffers and then writes four test data files which are read and handled by the system’s distributed nodes. An earlier blog post discussed the tool’s implementation details.

The generator was single threaded, writing the four data files in sequence, in the main thread. But then this stupid idea popped in my head — what if the four test data files are written to disk in parallel? Would it be faster? How much if any?

Threading is absolutely not needed in this case: generating test data for 5000 students takes about 250ms using my MacBook Pro (13-inch, 2018), 2.3 GHz four core Intel Core i5, 1 Tb SSD disk. On machines with HDDs this could be somewhat slower.

However, I wanted to see how much of execution time (if any) I can squeeze off with the four threads, each writing to their own data file from the RAM buffers. Also an opportunity to learn more about threads. Those horrible, evil things everyone is saying nobody should use…

My first implementation where the threads were created and executed when the memory buffer was full, and saving the file done in a lambda function:

 if (bufferCounter >= bufSize) {
   std::thread thread1( [&isFirstWrite, &STUDENT_BASIC_INFO_FILE, &basicInfoBuffer] {
     saveBuffer(isFirstWrite, STUDENT_BASIC_INFO_FILE, basicInfoBuffer);
   });
// ...

But creating a thread takes time. Lots of time, thousands of processor cycles, depending on your setup (see e.g. this blog post). If the tool startup parameters are -s 50000 -b 500 (create 50000 records with buffer size of 500), this would mean 50000/500 = 100 thread creations per file, so 400 threads would be created during the execution of the tool. Not very good for performance.

I changed the implementation to create the four threads only once, before filling and saving the memory buffers:

   // For coordination between main thread and writer threads
   std::atomic<int> threadsFinished{0};
   // Prepare four threads that save the data.
   std::vector<std::thread> savers;
   savers.push_back(std::thread(&threadFuncSavingData, std::ref(threadsFinished), std::cref(STUDENT_BASIC_INFO_FILE), std::ref(basicInfoBuffer)));
   savers.push_back(std::thread(&threadFuncSavingData, std::ref(threadsFinished), std::cref(EXAM_INFO_FILE), std::ref(examInfoBuffer)));
   // ... and same for the remaining two threads.

and then woken up every time the data buffers were full:

if (bufferCounter >= bufSize) {
   if (verbose) std::cout << std::endl << "Activating buffer writing threads..." << std::endl;
   // Prepare variables for the file saving threads.
   startWriting = true;
   threadsFinished = 0;
   int currentlyFinished = 0;
   // And launch the file writing threads.
   launchWrite.notify_all();

And then the main thread waits for the writers to finish their job before filling the memory buffers again.

   // Wait for the writer threads to finish.
   while (threadsFinished < 4) {
      std::unique_lock<std::mutex> ulock(fillBufferMutex);
      writeFinished.wait(ulock, [&] {
         return currentlyFinished != threadsFinished;
      });
      currentlyFinished = threadsFinished;
   }

Obviously the file writing threads notify the main thread about them finishing the file operations using a condition variable and a counter the main thread can use to keep track of if all the writer threads finished:

// Thread function saving data in parallel when notified that buffers are full.
void threadFuncSavingData(std::atomic<int> & finishCount, const std::string & fileName, std::vector<std::string> & buffer) {
   bool firstRound = true;
   while (running) {
      // Wait for the main thread to notify the buffers are ready to be written to disk.
      std::unique_lock<std::mutex> ulock(writeMutex);
      launchWrite.wait(ulock, [&] {
         return startWriting || !running;
      });
      // We are still running and writing, so do it.
      if (buffer.size() > 0 && startWriting && running) {
         saveBuffer(firstRound, fileName, buffer);
         buffer.clear();
         firstRound = false;
         // Update the counter that this thread is now ready.
         // Main thread waits that four threads have finished (count is 4).
         finishCount++;
      }
      // Notify the main thread.
      writeFinished.notify_one();
   }
}

Then to measurements. I created a script which executes the tool 20 times, first using threads and then sequentially; not using threads (command line parameter -z disables the threading code and uses sequential code):

echo "Run this in the build directory of TestDataGenerator."
echo "Removing output files..."
rm test-*.txt
echo "Running threaded tests..."
for ((i = 0; i < 20; i++)); do ./GenerateTestData -s 50000 -e 10 -b 500 >> test-par.txt; done
echo "Running sequential tests..."
for ((i = 0; i < 20; i++)); do ./GenerateTestData -zs 50000 -e 10 -b 500 >> test-seq.txt; done
echo "-- Tests done -- "
open test-*.txt

Just to compare, I executed the tests in two machines. MacBook Pro 2.3 GHz Intel Core i5 with four cores, 1 Tb SSD and iMac 2015 with HDD. Next, I took the output files and from there the amount of milliseconds the tool took each time, to a Numbers file and generated these graphics from the test data:

Comparison of sequential and threaded execution in two machines. — Comparison of sequential and threaded execution in two machines

As you can see, there is no difference in writing in threads (parallel) or writing sequentially. Here you can see how the threads take turns and execute in parallel in the cores of the processor of the MacBook Pro:

Profiler showing threads executing. — Blue areas show when the threads are active, executing.

Profiling the execution shows that having multiple threads doing the work won’t make a difference. In the trace below you can see that most the time the threads are either waiting for their turn to flush the data to disk or actually flushing the data. Most of the time in the selected saveBuffer method is spent in flushing data.

Profiler screenshot shows where time was spent, flushing and waiting. — Selected lines show where the most of the time was spent.

Also, in the sequential execution, where the single main thread does all, time is spend in flushing to disk:

Single threaded execution spent most of the time flushing data to disk.

Creating threads to speed up writing to disk — definitely not a good idea in this case. If this would be an app with GUI, then writing large amounts of data in a thread could very well be a good idea. If writing would take more than a couple of hundred milliseconds, user would notice the GUI lagging/not being responsive. So whether to use threads or not to write data to disk, depends on your use case.

This oldish article from DrDobbs is also an interesting read. Writing several files in threads is not necessarily helpful (unless using RAID), and that one should make threading configurable (like the -z parameter in my implementation) because they may in some situations even slow down the app. Also this discussion on when to apply threads is a good one:

Using multiple threads is most helpful when your program is CPU bound and you want to parallelise your program to use multiple CPU cores.
This is not the case for I/O bound problems, such as your scenario. Multiple threads will likely not speed up your system at all.

Remember to join

When I forget to join (or detach) a C++ std::thread, it’ll crash with SIGABRT(6) on macOS. And obviously the stack dump does not tell me what is going on. So I hunt the bug for some hours, digging the log files, then finally remember that I just implemented a thread…

***** FATAL SIGNAL RECEIVED *******
Received fatal signal: SIGABRT(6)	PID: 17403

***** SIGNAL SIGABRT(6)

*******	STACKDUMP *******
	stack dump [1]  1   libg3logger.1.3.2-80.dylib          0x0000000100cb2163 _ZN12_GLOBAL__N_113signalHandlerEiP9__siginfoPv + 83
	stack dump [2]  2   libsystem_platform.dylib            0x00007fff72f73b1d _sigtramp + 29
	stack dump [3]  3   ???                                 0x0000000000000400 0x0 + 1024
	stack dump [4]  4   libsystem_c.dylib                   0x00007fff72e49a1c abort + 120
	stack dump [5]  5   libc++abi.dylib                     0x00007fff6fef2bc8 __cxa_bad_cast + 0
	stack dump [6]  6   libc++abi.dylib                     0x00007fff6fef2ca6 _ZL28demangling_terminate_handlerv + 48
	stack dump [7]  7   libc++abi.dylib                     0x00007fff6feffda7 _ZSt11__terminatePFvvE + 8
	stack dump [8]  8   libc++abi.dylib                     0x00007fff6feffd68 _ZSt9terminatev + 56
	stack dump [9]  9   BasicInfoGUI                        0x0000000100af1ffa _ZN11OHARStudent14StudentHandler8readFileEv + 42
	stack dump [10]  10  BasicInfoGUI                        0x0000000100af28df _ZN11OHARStudent14StudentHandler7consumeERN8OHARBase7PackageE + 2223
	stack dump [11]  11  BasicInfoGUI                        0x0000000100a6d101 _ZN8OHARBase13ProcessorNode14passToHandlersERNS_7PackageE + 897
	stack dump [12]  12  BasicInfoGUI                        0x0000000100a6a69e _ZN8OHARBase13ProcessorNode10threadFuncEv + 2462
	stack dump [13]  13  BasicInfoGUI                        0x0000000100a79061 _ZNSt3__1L8__invokeIMN8OHARBase13ProcessorNodeEFvvEPS2_JEvEEDTcldsdeclsr3std3__1E7forwardIT0_Efp0_Efp_spclsr3std3__1E7forwardIT1_Efp1_EEEOT_OS6_DpOS7_ + 113
	stack dump [14]  14  BasicInfoGUI                        0x0000000100a78f6e _ZNSt3__1L16__thread_executeINS_10unique_ptrINS_15__thread_structENS_14default_deleteIS2_EEEEMN8OHARBase13ProcessorNodeEFvvEJPS7_EJLm2EEEEvRNS_5tupleIJT_T0_DpT1_EEENS_15__tuple_indicesIJXspT2_EEEE + 62
	stack dump [15]  15  BasicInfoGUI                        0x0000000100a78796 _ZNSt3__114__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEMN8OHARBase13ProcessorNodeEFvvEPS8_EEEEEPvSD_ + 118
	stack dump [16]  16  libsystem_pthread.dylib             0x00007fff72f7ed36 _pthread_start + 125
	stack dump [17]  17  libsystem_pthread.dylib             0x00007fff72f7b58f thread_start + 15

Exiting after fatal event  (FATAL_SIGNAL). Fatal type:  SIGABRT
Log content flushed sucessfully to sink

So I am writing this down, just to remember next time. Join or detach. Join or detach….

   void StudentHandler::readFile() {
      std::thread( [this] {
         StudentFileReader reader(*this);
         using namespace std::chrono_literals;
         std::this_thread::sleep_for(50ms);
         reader.read(node.getDataFileName());
      }).join();
   }

Edit: Actually, I changed join() to detach(). With join, the calling thread waited for this file reading thread to finish before continuing to handle incoming data from network. The file was read totally in memory, and only then data from network was combined with data from file and send ahead to next node in the network. With 5000 test records, all of them were held in memory waiting for the data from network to arrive when using join().

When I switched to use detach(), calling thread could continue reading data from network, while simultaneously file reading thread was reading data from file. Whenever a match was found in a list, either one of these threads, the data was combined and send ahead to the next node in the network. So with join, maximum of 5000 records were held in memory all the time, as with detach(), about 1300-1800 records were held in memory at most. Because combined data could be send ahead to next node in the network and discarded from this node. A significant change in the amount of memory the nodes use. So it does matter which you use, depending on the purpose of the threads in your app.