There’s a distributed C++ system I made, used as a “patient” in a course on software architectures. It includes a command line tool, TestDataGenerator, which I implemented to test the performance and reliability of the system. The tool generates random data in memory buffers and then writes four test data files, which are read and handled by the system’s distributed nodes. An earlier blog post discussed the tool’s implementation details.
The generator was single threaded, writing the four data files in sequence in the main thread. But then this stupid idea popped into my head: what if the four test data files were written to disk in parallel? Would it be faster? By how much, if any?
Threading is absolutely not needed in this case: generating test data for 5000 students takes about 250 ms on my MacBook Pro (13-inch, 2018) with a 2.3 GHz quad-core Intel Core i5 and a 1 TB SSD. On machines with HDDs this could be somewhat slower.
However, I wanted to see how much execution time (if any) I could shave off with four threads, each writing its own data file from the RAM buffers. It was also an opportunity to learn more about threads. Those horrible, evil things everyone says nobody should use…
My first implementation created and launched the threads whenever a memory buffer was full, with the file saving done in a lambda function:
if (bufferCounter >= bufSize) {
   std::thread thread1([&isFirstWrite, &STUDENT_BASIC_INFO_FILE, &basicInfoBuffer] {
      saveBuffer(isFirstWrite, STUDENT_BASIC_INFO_FILE, basicInfoBuffer);
   });
   // ...
But creating a thread takes time. Lots of time, thousands of processor cycles, depending on your setup (see e.g. this blog post). If the tool’s startup parameters are -s 50000 -b 500 (create 50000 records with a buffer size of 500), this would mean 50000/500 = 100 thread creations per file, so 400 threads would be created during the execution of the tool. Not very good for performance.
I changed the implementation to create the four threads only once, before filling and saving the memory buffers:
// For coordination between main thread and writer threads
std::atomic<int> threadsFinished{0};
// Prepare four threads that save the data.
std::vector<std::thread> savers;
savers.push_back(std::thread(&threadFuncSavingData, std::ref(threadsFinished),
                             std::cref(STUDENT_BASIC_INFO_FILE), std::ref(basicInfoBuffer)));
savers.push_back(std::thread(&threadFuncSavingData, std::ref(threadsFinished),
                             std::cref(EXAM_INFO_FILE), std::ref(examInfoBuffer)));
// ... and same for the remaining two threads.
and then woken up every time the data buffers were full:
if (bufferCounter >= bufSize) {
   if (verbose)
      std::cout << std::endl << "Activating buffer writing threads..." << std::endl;
   // Prepare variables for the file saving threads. The flags are set
   // while holding the mutex, so the change cannot race with the
   // predicate check in the writer threads' wait.
   {
      std::lock_guard<std::mutex> guard(writeMutex);
      startWriting = true;
      threadsFinished = 0;
   }
   int currentlyFinished = 0;
   // And launch the file writing threads.
   launchWrite.notify_all();
The main thread then waits for the writers to finish their job before filling the memory buffers again.
// Wait for the writer threads to finish.
while (threadsFinished < 4) {
   std::unique_lock<std::mutex> ulock(fillBufferMutex);
   writeFinished.wait(ulock, [&] { return currentlyFinished != threadsFinished; });
   currentlyFinished = threadsFinished;
}
// Lower the flag so the writers block on their next wait instead of
// looping straight through it again.
{
   std::lock_guard<std::mutex> guard(writeMutex);
   startWriting = false;
}
The file writing threads notify the main thread when they have finished their file operations, using a condition variable and a counter the main thread can use to track whether all the writer threads have finished:
// Thread function saving data in parallel when notified that buffers are full.
void threadFuncSavingData(std::atomic<int> & finishCount,
                          const std::string & fileName,
                          std::vector<std::string> & buffer) {
   bool firstRound = true;
   while (running) {
      // Wait for the main thread to notify that the buffers are ready
      // to be written to disk.
      std::unique_lock<std::mutex> ulock(writeMutex);
      launchWrite.wait(ulock, [&] { return startWriting || !running; });
      // We are still running and writing, so do it.
      if (buffer.size() > 0 && startWriting && running) {
         saveBuffer(firstRound, fileName, buffer);
         buffer.clear();
         firstRound = false;
         // Update the counter that this thread is now ready.
         // Main thread waits until four threads have finished (count is 4).
         finishCount++;
      }
      // Notify the main thread.
      writeFinished.notify_one();
   }
}
Then, on to measurements. I created a script which executes the tool 20 times, first using threads and then sequentially, without threads (the command line parameter -z disables the threading code and uses the sequential code):
echo "Run this in the build directory of TestDataGenerator."
echo "Removing output files..."
rm test-*.txt
echo "Running threaded tests..."
for ((i = 0; i < 20; i++)); do
   ./GenerateTestData -s 50000 -e 10 -b 500 >> test-par.txt
done
echo "Running sequential tests..."
for ((i = 0; i < 20; i++)); do
   ./GenerateTestData -zs 50000 -e 10 -b 500 >> test-seq.txt
done
echo "-- Tests done --"
open test-*.txt
Just to compare, I executed the tests on two machines: the MacBook Pro with the 2.3 GHz quad-core Intel Core i5 and 1 TB SSD, and a 2015 iMac with an HDD. Next, I copied the millisecond timings from the output files of each run into a Numbers file and generated these graphics from the test data:

As you can see, there is no difference between writing in threads (in parallel) and writing sequentially. Here you can see how the threads take turns and execute in parallel on the cores of the MacBook Pro’s processor:

Profiling the execution shows why having multiple threads doing the work doesn’t make a difference. In the trace below you can see that most of the time the threads are either waiting for their turn to flush the data to disk or actually flushing the data. Most of the time in the selected saveBuffer method is spent flushing data.

Also, in the sequential execution, where the single main thread does all the work, the time is spent flushing to disk:

Creating threads to speed up writing to disk is definitely not a good idea in this case. If this were an app with a GUI, then writing large amounts of data in a thread could very well be a good idea: if writing took more than a couple of hundred milliseconds, the user would notice the GUI lagging or becoming unresponsive. So whether to use threads to write data to disk depends on your use case.
This oldish article from Dr. Dobb’s is also an interesting read. It points out that writing several files in threads is not necessarily helpful (unless you’re using RAID), and that threading should be made configurable (like the -z parameter in my implementation), because threads may in some situations even slow the app down. This discussion on when to apply threads is also a good one:
Using multiple threads is most helpful when your program is CPU bound and you want to parallelise your program to use multiple CPU cores.
This is not the case for I/O bound problems, such as your scenario. Multiple threads will likely not speed up your system at all.