I read an interesting question about the Monitor structure on the Swift Forums: how does the Java implementation, using synchronized methods, compare to Swift concurrency and actors?
Well, since this was interesting and I wanted to learn more, there was simply no option but to pick up the code editor and start coding…
Another post about data structures, time performance, and the count-the-words-in-a-book-file-fast case I have written about before.
I did a C++ implementation of the books and words problem. Earlier, I implemented several solutions to the problem in Swift and Java. This time I used the C++ standard library std::map and wanted to see whether parallel processing in several threads would speed up the processing.
It did: the execution time of the multithreaded version was 74% of the single-threaded one. The single-threaded version processed the sample files in 665 ms, while the multithreaded version took only 491 ms. Nice!
But then I saw in the documentation that std::map keeps its keys in sorted order as elements are added to the map/dictionary.
This is not needed in my case! Keeping the keys sorted surely takes time, so dropping that requirement gives me another possibility for optimising the time performance.
I changed the std::map in the single-threaded implementation to std::unordered_map, and behold: with an execution time of 446 ms, it was faster than the multithreaded version!
So mind the map. There are many, and some may be more suitable to your use case than others.
Just in case anyone is reading: I updated the code related to the previous post, which was about using Swift async tasks to process data in parallel, comparing it to the single-threaded functional programming implementation. Due to updates to Swift (I assume?), the code started to fail (the printout was messy, nothing too bad), so I fixed it. Probably my mistake, revealed by some change in the language or the libraries I use.
The code in FunctionalParallel now uses the async version of the Swift ArgumentParser library. It is also otherwise improved and (at least for now) works correctly. I will measure the speed on the M1 Mini when I get to it and make any necessary changes to the performance comparison table in the repository readme file.
Right now I am away from home, looking after my granddaughter as she sleeps. Her parents (including my daughter) are out celebrating my son’s birthday. Happy birthday, son!
The Swift code below goes through an array of words (Strings), counting unique words and their frequencies while ignoring the words listed in wordsToFilter (another array of Strings). The resulting dictionary (a map data structure of word/count pairs) is then sorted by word frequency in descending order. Finally, the top 100 most frequent words are printed out.
Functional programming approach to count the frequency of unique words in a text file.
With my test book file of 17.1 MB, containing 2 378 668 words of which 97 115 are unique, the code above takes 1.226099 secs to process the file. The time includes reading the text files and splitting them into the word arrays. For details of the measuring, see the end of this post.
Could it be faster if using the Swift async tasks? Let’s try and see!
Below is the code doing the same in eight async tasks. The code for printing out the result is omitted here and shown later below.
Async tasks counting word frequencies.
In the code, the slice size is first calculated at line 66. For example, if the array has 1000 words and is divided into eight slices, each slice contains 125 words. Then, in a for loop, a task group executes eight async tasks (lines 79-85). Each async task counts the word frequencies of its own slice of the array and returns to the task group a dictionary containing the word/frequency pairs of that slice.
In lines 88-96, the task group awaits the tasks to finish. As they do, it combines each task’s partial result dictionary into the task group’s own dictionary, wordCounts. This happens on a single thread, so no data corruption can occur: the async tasks themselves never write to the final dictionary holding all the word/frequency pairs.
Finally, after the task group has merged the results from the tasks, the result is sorted and printed out from the wordCounts dictionary:
Printing out the results; the main thread waits for the async tasks to finish.
Why the semaphore? This is a console app, and after the async tasks were launched, the main thread would just continue. What would happen in the end? The main thread would run past the end of the function, return to the main function, finish, and quit the process while the async tasks were still executing. Not good.
To avoid that, 1) the main thread stops to wait on a semaphore, and 2) the task group signals the same semaphore when it has finished working. Only then does the main thread proceed to finish.
So, is this any faster? Any use at all in having this more complicated code?
Executed with the same files as above, the run now takes 0.694983 secs. That is 57% of the original execution time of the single-threaded implementation!
Though the absolute times and time differences are not large, the relative difference is very significant. Consider data sizes hundreds or thousands of times larger than this test file, or this process running repeatedly over thousands of files, continuously. Then the difference would be significant in absolute time too, not only relatively, even if the files were smaller.
When you take a look at the time profiler in Xcode Instruments, you can easily see where the speed difference comes from:
Xcode Instruments showing eight parallel threads working on the task.
As you can see, all the work that was earlier done in sequence is now executed in parallel, asynchronously.
So the answer to the question “Could it be faster if using the Swift async tasks?” is: yes, absolutely.
The measurements were taken on an Apple Mac Mini M1 (Apple Silicon) with 16GB of RAM and 1 TB of SSD storage.
Why slice the array into eight? The M1 has eight processor cores, and each one is put to work. As you can see, the OS and other processes also need the cores, so the cores are not running this process’s threads at 100% all the time.
The code can be found in my BooksAndWords repository on GitHub. The single-threaded implementation is in the Functional directory, and the async one is in the FunctionalParallel directory.
In the previous post I mocked the Java app that was hardcoded to use Windows C: disk as the default place to open files.
What, then, is the recommended way? One option is to start looking from the user’s home directory:
import java.io.File;
import javax.swing.JFileChooser;
JFileChooser fileChooser = new JFileChooser();
fileChooser.setCurrentDirectory(new File(System.getProperty("user.home")));
Or pick the documents directory. It is also a nice touch to save the selected directory and use it as the starting point the next time, in case the user wants to continue from the same directory.
What else is happening? It is now the last week of the Data Structures and Algorithms course I am the responsible teacher for. Students have been analyzing algorithm correctness and time complexity, and implementing basic data structures as well as more advanced ones: hash tables and hash functions, binary search trees, and binary search.
Lectures addressed graphs and graph algorithms too, but implementing these was left to an optional course project, Mazes. When students finish it, they can play a game a bit like Pac-Man.
They get to choose between two course projects: the Mazes project, or the classic task of counting the unique words in a book file while ignoring the words listed in another file. The largest test file is around 16 MB and has almost 100 000 unique words.
Processing the largest file naively with two nested for loops takes around two minutes on my Mac Mini M1 with 16 GB of memory. The fast versions (hash tables, binary search trees) take less than a second.
I have implemented several different solutions for comparison, in three programming languages: Java, Swift, and C++. Each week in the lectures I demonstrated some of these, and in the end we had this comparison table (sorry, in Finnish only).
Hall of Fame of the different Books and Words implementations. Swift versions can be found via the GitHub link above.
As you can see, C++ with eight threads was the fastest one. Next after C++ came a couple of Java versions. The Swift implementations were not as fast as I expected. After some profiling, I suspect the reason is the way Unicode characters are handled in Swift. All the book files are UTF-8, and students were expected to handle them correctly. I do not like that, in teaching programming languages, the default is mostly to stick with ASCII and conveniently forget the existence of different languages and character sets.
Anyway, for some reason, processing these UTF-8 text files takes a lot of time with Swift. Maybe later I will have time to find out whether the issue is in my code and/or whether there is anything that can be done to speed things up.
Something very nice happened a week ago: the student guild of our study program, Blanko, awarded me this diploma for being a quality teacher. Apparently they held a vote, and I somehow managed to come first this time. The diploma was accompanied by a small Italian-themed gift box. A really nice surprise! I was so astonished to receive this; thank you so much if any of you are reading!
Nice award from the students for quality teaching.
There’s a distributed C++ system I made that is used as a “patient” in a course on software architectures. It includes a command line tool, TestDataGenerator, which I implemented for testing the performance and reliability of the system. The tool generates random data in memory buffers and then writes four test data files, which are read and handled by the system’s distributed nodes. An earlier blog post discussed the tool’s implementation details.
The generator was single-threaded, writing the four data files in sequence in the main thread. But then this stupid idea popped into my head: what if the four test data files were written to disk in parallel? Would it be faster? And by how much, if at all?
Threading is absolutely not needed in this case: generating test data for 5000 students takes about 250 ms on my MacBook Pro (13-inch, 2018) with a 2.3 GHz four-core Intel Core i5 and a 1 TB SSD. On machines with HDDs this could be somewhat slower.
However, I wanted to see how much execution time (if any) I could squeeze off with four threads, each writing its own data file from the RAM buffers. It was also an opportunity to learn more about threads. Those horrible, evil things everyone says nobody should use…
In my first implementation, the threads were created and executed whenever a memory buffer was full, with the saving of the file done in a lambda function.
But creating a thread takes time. Lots of time: thousands of processor cycles, depending on your setup (see e.g. this blog post). With the tool startup parameters -s 50000 -b 500 (create 50000 records with a buffer size of 500), this means 50000/500 = 100 thread creations per file, so 400 threads created during one execution of the tool. Not very good for performance.
I changed the implementation to create the four threads only once, before filling and saving the memory buffers:
// For coordination between main thread and writer threads
std::atomic<int> threadsFinished{0};
// Prepare four threads that save the data.
std::vector<std::thread> savers;
savers.push_back(std::thread(&threadFuncSavingData, std::ref(threadsFinished),
    std::cref(STUDENT_BASIC_INFO_FILE), std::ref(basicInfoBuffer)));
savers.push_back(std::thread(&threadFuncSavingData, std::ref(threadsFinished),
    std::cref(EXAM_INFO_FILE), std::ref(examInfoBuffer)));
// ... and same for the remaining two threads.
and then woken up every time the data buffers were full:
if (bufferCounter >= bufSize) {
    if (verbose) std::cout << std::endl << "Activating buffer writing threads..." << std::endl;
    // Prepare variables for the file saving threads.
    startWriting = true;
    threadsFinished = 0;
    int currentlyFinished = 0;
    // And launch the file writing threads.
    launchWrite.notify_all();
And then the main thread waits for the writers to finish their job before filling the memory buffers again.
// Wait for the writer threads to finish.
while (threadsFinished < 4) {
    std::unique_lock<std::mutex> ulock(fillBufferMutex);
    writeFinished.wait(ulock, [&] { return currentlyFinished != threadsFinished; });
    currentlyFinished = threadsFinished;
}
The file writing threads, in turn, notify the main thread when they have finished their file operations, using a condition variable and a counter that the main thread uses to keep track of whether all the writer threads have finished:
// Thread function saving data in parallel when notified that buffers are full.
void threadFuncSavingData(std::atomic<int> & finishCount,
                          const std::string & fileName,
                          std::vector<std::string> & buffer) {
    bool firstRound = true;
    while (running) {
        // Wait for the main thread to notify that the buffers are ready
        // to be written to disk.
        std::unique_lock<std::mutex> ulock(writeMutex);
        launchWrite.wait(ulock, [&] { return startWriting || !running; });
        // We are still running and writing, so do it.
        if (buffer.size() > 0 && startWriting && running) {
            saveBuffer(firstRound, fileName, buffer);
            buffer.clear();
            firstRound = false;
            // Update the counter that this thread is now ready.
            // Main thread waits until four threads have finished (count is 4).
            finishCount++;
        }
        // Notify the main thread.
        writeFinished.notify_one();
    }
}
Then on to the measurements. I created a script that executes the tool 20 times, first using threads and then sequentially without threads (the command line parameter -z disables the threading code and uses the sequential code):
echo "Run this in the build directory of TestDataGenerator."
echo "Removing output files..."
rm test-*.txt
echo "Running threaded tests..."
for ((i = 0; i < 20; i++)); do ./GenerateTestData -s 50000 -e 10 -b 500 >> test-par.txt; done
echo "Running sequential tests..."
for ((i = 0; i < 20; i++)); do ./GenerateTestData -zs 50000 -e 10 -b 500 >> test-seq.txt; done
echo "-- Tests done --"
open test-*.txt
Just to compare, I executed the tests on two machines: the MacBook Pro with the 2.3 GHz four-core Intel Core i5 and a 1 TB SSD, and a 2015 iMac with an HDD. Next, I took from the output files the number of milliseconds the tool took on each run, put them into a Numbers file, and generated these graphics from the test data:
Comparison of sequential and threaded execution in two machines
As you can see, there is no difference between writing in threads (in parallel) and writing sequentially. Here you can see how the threads take turns and execute in parallel on the cores of the MacBook Pro’s processor:
Blue areas show when the threads are active, executing.
Profiling the execution shows why having multiple threads doing the work makes no difference. In the trace below you can see that most of the time the threads are either waiting for their turn to flush the data to disk, or actually flushing it. Most of the time in the selected saveBuffer method is spent flushing data.
Selected lines show where most of the time was spent.
Also in the sequential execution, where the single main thread does everything, the time is spent flushing to disk:
Single threaded execution spent most of the time flushing data to disk.
Creating threads to speed up writing to disk is definitely not a good idea in this case. If this were an app with a GUI, then writing large amounts of data in a thread could very well be a good idea: if writing took more than a couple of hundred milliseconds, the user would notice the GUI lagging and becoming unresponsive. So whether or not to use threads when writing data to disk depends on your use case.
This oldish article from Dr. Dobb’s is also an interesting read. Writing several files in threads is not necessarily helpful (unless you are using RAID), and one should make threading configurable (like the -z parameter in my implementation), because threads may in some situations even slow the app down. This discussion on when to apply threads is also a good one:
Using multiple threads is most helpful when your program is CPU bound and you want to parallelise your program to use multiple CPU cores.
This is not the case for I/O bound problems, such as your scenario. Multiple threads will likely not speed up your system at all.
The last time I held the Software Architectures course, I wanted to demonstrate to students how to test the performance and reliability quality attributes of a distributed software system. The system already had a feature for processing data as a batch, by reading data files and processing the data in the networked nodes. All I needed to do was generate data files with thousands of records for the system to read and process. So I implemented a small tool app to generate this test data.
First of all, when generating thousands of records, I wanted to preallocate the necessary buffers to make sure that no unnecessary buffer allocations happen during data creation, making data generation faster:
std::vector<int> generatedStudentNumbers;
if (verbose) std::cout <<"Creating numbers for students..."<< std::endl;
generatedStudentNumbers.resize(studentCount);
resize() allocates a large enough vector for the data. For creating the student numbers (ints), std::iota and std::shuffle are quite useful:
// Generate student numbers starting from one to studentCount.
std::iota(generatedStudentNumbers.begin(), generatedStudentNumbers.end(), 1);
// Shuffle the numbers randomly.
std::shuffle(generatedStudentNumbers.begin(), generatedStudentNumbers.end(), std::mt19937{std::random_device{}()});
std::iota fills the container with consecutive values, starting from 1 in this case. std::shuffle puts the numbers in random order. Voilà: with only four lines of code you have a long vector of randomly ordered student numbers to use in the data generation!
Next, I needed random names for the students in the data set. For that, I needed a vector of names and a way to randomly pick a name from that vector when creating the student records:
std::vector<std::string> firstNames;
firstNames = {"Antti", "Tiina", "Pentti", "Risto", "Päivi", "Jaana", "Jani", "Esko", "Hanna", "Oskari"};
// Initialize the random engine
std::random_device rd;
std::default_random_engine generator(rd());
// Generate a random int from a range
int generateInt(int maxValue) {
    std::uniform_int_distribution<int> distribution(0, maxValue);
    return distribution(generator);
}
// Pick one random name
const std::string & getFirstName() {
    int index = generateInt(firstNames.size() - 1);
    return firstNames[index];
}
The generateInt() helper function is used to pick a random name from the firstNames vector. The same procedure was used to generate a last name and a study program name for the student. All these pieces of information were then stored into a record, basically a tab-separated std::string. The records, in turn, were kept in a vector of strings.
What is left, then, is storing the test data records into a file:
std::ofstream datafile(fileName, isFirstRound ? std::ios::trunc : std::ios::app);
// Shuffle the records randomly.
std::shuffle(buffer.begin(), buffer.end(), std::mt19937{std::random_device{}()});
auto save = [&datafile](const std::string & entry) { if (entry.length() > 0) datafile << entry << std::endl; };
std::for_each(buffer.begin(), buffer.end(), save);
datafile.close();
After opening the file stream, std::shuffle is again used to put the data into random order. The save lambda function defines what saving a record means; passing it to std::for_each tells what to do with each of the data records: save them into the std::ofstream.
Finally I made the data generator tool configurable with command line parameters, using Sarge:
Sarge sarge;
sarge.setUsage("./GenerateTestData -[hv]s <number> [-e <number>]");
sarge.setDescription("A test data generator for StudentPassing system. (c) Antti Juustila, 2019.\nUses Sarge Copyright (c) 2019, Maya Posch All rights reserved.");
sarge.setArgument("h", "help", "Display help for using GenerateTestData.", false);
sarge.setArgument("v", "verbose", "Display detailed messages of test data generation process.", false);
sarge.setArgument("s", "students", "Number of students to generate in test data files.", true);
sarge.setArgument("e", "exercises", "Number of exercises generated, default is 6 if option not provided.", true);
sarge.setArgument("b", "bufsize", "Size of the buffer used in generating data", true);
I used the test data generator tool to generate up to 10 000 records, and used those test data files to see, and to demonstrate to students, how the system manages high data throughput and with what performance. It was also interesting to see where the performance bottlenecks in the system were.
Next year (the last time I teach this course) I’ll demonstrate how to use a thread to read the data files while another thread reads test data from the network at the same time. Whether you use std::thread’s join() or detach() to control how the networking and data file reading threads cooperate has a large impact.