In the previous post I mocked a Java app that was hardcoded to use the Windows C: drive as the default location for opening files.
What, then, is the recommended way? One option is to start from the user's home directory:
JFileChooser fileChooser = new JFileChooser();
fileChooser.setCurrentDirectory(new File(System.getProperty("user.home")));
Or pick the Documents directory. It is also a good idea to save the selected directory and offer it again if the user wants to continue from the same directory next time.
What else is happening? It is now the last week of the Data Structures and Algorithms course I am the responsible teacher for. Students have been analyzing algorithm correctness and time complexity, and implementing basic data structures as well as more advanced ones: hash tables and hash functions, binary search trees, and binary search.
Lectures addressed graphs and graph algorithms too, but implementing these was left to an optional course project, Mazes. When students finish that, they can play a game a bit like Pac-Man.
They get to choose from two course projects: either the Mazes project or, alternatively, the classic task of counting the unique words in a book file while ignoring words listed in another file. The largest test file is around 16 MB and contains almost 100,000 unique words.
Processing the largest file naively with two nested for loops takes around two minutes on my Mac mini M1 with 16 GB of memory. The fast versions (hash tables, binary search trees) take less than a second.
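To illustrate why the hash table versions are so much faster: each lookup and insert into a hash table is O(1) on average, so one pass over the words is roughly linear instead of the quadratic cost of comparing every word against every previously seen word. A minimal sketch in C++ (the function name is illustrative, not from the course material):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Count word occurrences with a hash table. One linear pass; each
// operator[] lookup/insert is O(1) on average, so the whole pass is
// roughly O(n) instead of the O(n^2) of two nested loops.
std::unordered_map<std::string, int> countWords(const std::vector<std::string>& words) {
    std::unordered_map<std::string, int> counts;
    for (const auto& w : words) {
        ++counts[w];  // inserts with value 0 on first sight, then increments
    }
    return counts;
}
```

The number of unique words is then simply the size of the resulting map.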
I have implemented several different solutions for comparison in three different programming languages: Java, Swift and C++. Each week in lectures I demonstrated some of these, and in the end we had this table for comparison (sorry, in Finnish only).
Hall of Fame of the different Books and Words implementations. The Swift versions can be found via the GitHub link above.
As you can see, the C++ version with 8 threads was the fastest one. Next after C++ came a couple of Java versions. The Swift implementations were not as fast as I expected. After some profiling, I suspect the reason lies in the way Unicode characters are handled in Swift. All the book files are UTF-8, and students were expected to handle them correctly. I do not like that teaching programming languages mostly defaults to sticking with ASCII, conveniently forgetting the existence of different languages and character sets.
Well, anyway, for some reason processing these UTF-8 text files takes a lot of time with Swift. Maybe later I will have time to find out whether the issue is in my code and/or whether anything can be done to speed things up.
Something very nice happened a week ago: the student guild of our study program, Blanko, awarded me this diploma for being a quality teacher. Apparently they held a vote and I somehow managed to come first this time. The diploma was accompanied by a small Italian-themed gift box. A really nice surprise! I was so astonished to receive this. Thank you so much if any of you are reading this!
Nice award from the students for quality teaching.
There’s a distributed C++ system I made that is used as a “patient” in a course on software architectures. It includes a command line tool, TestDataGenerator, which I implemented to test the performance and reliability of the system. The tool generates random data in memory buffers and then writes four test data files, which are read and handled by the system’s distributed nodes. An earlier blog post discussed the tool’s implementation details.
The generator was single threaded, writing the four data files in sequence in the main thread. But then this stupid idea popped into my head: what if the four test data files were written to disk in parallel? Would it be faster? If so, by how much?
Threading is absolutely not needed in this case: generating test data for 5,000 students takes about 250 ms on my MacBook Pro (13-inch, 2018) with a 2.3 GHz quad-core Intel Core i5 and a 1 TB SSD. On machines with HDDs this could be somewhat slower.
However, I wanted to see how much execution time (if any) I could shave off with four threads, each writing to its own data file from the RAM buffers. It was also an opportunity to learn more about threads. Those horrible, evil things everyone says nobody should use…
In my first implementation, the threads were created and executed whenever the memory buffer was full, with the file saving done in a lambda function:
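The original listing is not reproduced here, so below is a rough, hypothetical reconstruction of that first approach. All names are illustrative, not the tool's real identifiers, and this saveBuffer is a simplified stand-in for the real saving function:

```cpp
#include <fstream>
#include <functional>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Illustrative stand-in for the tool's buffer-saving function:
// append the buffered lines to the file and clear the buffer.
void saveBuffer(const std::string& fileName, std::vector<std::string>& buffer) {
    std::ofstream out(fileName, std::ios::app);
    for (const auto& line : buffer) {
        out << line << '\n';
    }
    buffer.clear();
}

// Every time the buffers fill up, create four brand new threads
// (one per output file), write in parallel, join, and throw the
// threads away. This is the costly part: with 100 flushes per file,
// 400 threads are created over one run.
void flushBuffersInNewThreads(
        std::vector<std::pair<std::string, std::vector<std::string>*>>& files) {
    std::vector<std::thread> savers;
    for (auto& [name, buf] : files) {
        savers.emplace_back(saveBuffer, std::cref(name), std::ref(*buf));
    }
    for (auto& t : savers) {
        t.join();  // wait until all files are written before refilling buffers
    }
}
```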
But creating a thread takes time. Lots of time: thousands of processor cycles, depending on your setup (see e.g. this blog post). If the tool's startup parameters are -s 50000 -b 500 (create 50000 records with a buffer size of 500), this means 50000/500 = 100 thread creations per file, so 400 threads would be created during one execution of the tool. Not very good for performance.
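The cost is easy to see with a rough micro-benchmark; a sketch (the function name is made up, and the absolute numbers vary wildly by OS and hardware):

```cpp
#include <chrono>
#include <thread>

// Measure how long it takes to create and join n threads that do
// nothing at all. Even an empty thread has nonzero creation cost,
// which is why 400 thread creations per run is wasteful.
long long threadChurnMicros(int n) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < n; ++i) {
        std::thread t([] {});  // thread with an empty body
        t.join();
    }
    const auto end = clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}
```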
I changed the implementation to create the four threads only once, before filling and saving the memory buffers:
// For coordination between main thread and writer threads
std::atomic<int> threadsFinished{0};
// Prepare four threads that save the data.
std::vector<std::thread> savers;
savers.push_back(std::thread(&threadFuncSavingData, std::ref(threadsFinished),
                             std::cref(STUDENT_BASIC_INFO_FILE), std::ref(basicInfoBuffer)));
savers.push_back(std::thread(&threadFuncSavingData, std::ref(threadsFinished),
                             std::cref(EXAM_INFO_FILE), std::ref(examInfoBuffer)));
// ... and same for the remaining two threads.
The threads are then woken up every time the data buffers are full:
if (bufferCounter >= bufSize) {
    if (verbose)
        std::cout << std::endl << "Activating buffer writing threads..." << std::endl;
    // Prepare variables for the file saving threads.
    startWriting = true;
    threadsFinished = 0;
    int currentlyFinished = 0;
    // And launch the file writing threads.
    launchWrite.notify_all();
And then the main thread waits for the writers to finish their job before filling the memory buffers again.
// Wait for the writer threads to finish.
while (threadsFinished < 4) {
    std::unique_lock<std::mutex> ulock(fillBufferMutex);
    writeFinished.wait(ulock, [&] { return currentlyFinished != threadsFinished; });
    currentlyFinished = threadsFinished;
}
The file writing threads, of course, notify the main thread when they have finished their file operations, using a condition variable and a counter the main thread can use to keep track of whether all the writer threads have finished:
// Thread function saving data in parallel when notified that buffers are full.
void threadFuncSavingData(std::atomic<int>& finishCount,
                          const std::string& fileName,
                          std::vector<std::string>& buffer) {
    bool firstRound = true;
    while (running) {
        // Wait for the main thread to notify the buffers are ready to be written to disk.
        std::unique_lock<std::mutex> ulock(writeMutex);
        launchWrite.wait(ulock, [&] { return startWriting || !running; });
        // We are still running and writing, so do it.
        if (buffer.size() > 0 && startWriting && running) {
            saveBuffer(firstRound, fileName, buffer);
            buffer.clear();
            firstRound = false;
            // Update the counter that this thread is now ready.
            // Main thread waits that four threads have finished (count is 4).
            finishCount++;
        }
        // Notify the main thread.
        writeFinished.notify_one();
    }
}
Then to the measurements. I created a script that executes the tool 20 times, first using threads and then sequentially (the command line parameter -z disables the threading code and uses the sequential path):
echo "Run this in the build directory of TestDataGenerator."
echo "Removing output files..."
rm test-*.txt
echo "Running threaded tests..."
for ((i = 0; i < 20; i++)); do ./GenerateTestData -s 50000 -e 10 -b 500 >> test-par.txt; done
echo "Running sequential tests..."
for ((i = 0; i < 20; i++)); do ./GenerateTestData -zs 50000 -e 10 -b 500 >> test-seq.txt; done
echo "-- Tests done --"
open test-*.txt
Just to compare, I executed the tests on two machines: the MacBook Pro with the 2.3 GHz quad-core Intel Core i5 and 1 TB SSD, and a 2015 iMac with an HDD. I then collected the millisecond timings from the output files into a Numbers file and generated these charts from the test data:
Comparison of sequential and threaded execution on the two machines
As you can see, there is no difference between writing in threads (in parallel) and writing sequentially. Here you can see how the threads take turns and execute in parallel on the cores of the MacBook Pro's processor:
Blue areas show when the threads are active, executing.
Profiling the execution shows why having multiple threads doing the work makes no difference. In the trace below you can see that most of the time the threads are either waiting for their turn to flush data to disk or actually flushing it. Most of the time in the selected saveBuffer method is spent flushing data.
Selected lines show where the most of the time was spent.
Also in the sequential execution, where the single main thread does all the work, the time is spent flushing to disk:
Single threaded execution spent most of the time flushing data to disk.
Creating threads to speed up writing to disk is definitely not a good idea in this case. If this were an app with a GUI, then writing large amounts of data in a thread could very well be a good idea: if writing took more than a couple of hundred milliseconds, the user would notice the GUI lagging or becoming unresponsive. So whether to use threads to write data to disk depends on your use case.
This oldish article from Dr. Dobb's is also an interesting read. It notes that writing several files in threads is not necessarily helpful (unless you are using RAID), and that threading should be made configurable (like the -z parameter in my implementation) because in some situations it may even slow the app down. This discussion on when to apply threads is also a good one:
Using multiple threads is most helpful when your program is CPU bound and you want to parallelise your program to use multiple CPU cores.
This is not the case for I/O bound problems, such as your scenario. Multiple threads will likely not speed up your system at all.
When I forget to join (or detach) a C++ std::thread, the app crashes with SIGABRT (6) on macOS, and of course the stack dump does not tell me what is going on. So I hunt the bug for hours, digging through the log files, until I finally remember that I just implemented a thread…
Edit: Actually, I changed join() to detach(). With join(), the calling thread waited for the file reading thread to finish before continuing to handle incoming data from the network. The file was read completely into memory, and only then was the data from the network combined with the data from the file and sent ahead to the next node in the network. With 5,000 test records, all of them were held in memory waiting for the network data to arrive when using join().
When I switched to detach(), the calling thread could continue reading data from the network while the file reading thread was simultaneously reading data from the file. Whenever a match was found in a list by either of these threads, the data was combined, sent ahead to the next node in the network, and discarded from this node. So with join(), a maximum of 5,000 records were held in memory at all times, while with detach(), only about 1,300–1,800 records were held at most. That is a significant change in the amount of memory the nodes use. So it does matter which one you use, depending on the purpose of the threads in your app.
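The difference boils down to this minimal sketch (names are hypothetical, not from my system): join() blocks the caller until the worker finishes, detach() lets the caller continue immediately, and a std::thread destroyed while still joinable calls std::terminate, which is the SIGABRT crash described above.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> fileReadDone{false};

// Stand-in for the file reading thread.
void readFileIntoMemory() {
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    fileReadDone = true;
}

// join(): the caller blocks here until the worker finishes;
// nothing else happens in the caller while the file is being read.
void readWithJoin() {
    std::thread t(readFileIntoMemory);
    t.join();
}

// detach(): the caller continues immediately and can keep processing
// network data while the worker runs on its own. Forgetting BOTH
// join() and detach() makes ~t call std::terminate.
void readWithDetach() {
    std::thread t(readFileIntoMemory);
    t.detach();
}
```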