I’ve solved all the puzzles so far, both parts 1 and 2. Some have been quite tricky to solve; the other issues I’ve had were mostly silly parsing problems.
An effective parser would be something that helps. Also, as someone suggested on the Swift forums, a generic grid would help, since apparently many puzzles have a grid to navigate.
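Something like this minimal generic grid could be a starting point (a sketch of my own, not code from any puzzle solution):

struct Grid<Element> {
    let width: Int
    let height: Int
    private var cells: [Element]

    init(width: Int, height: Int, initial: Element) {
        self.width = width
        self.height = height
        self.cells = Array(repeating: initial, count: width * height)
    }

    // Row-major storage: the element at (x, y) lives at index y * width + x.
    subscript(x: Int, y: Int) -> Element {
        get { cells[y * width + x] }
        set { cells[y * width + x] = newValue }
    }

    // The orthogonal neighbours of (x, y) that are inside the grid bounds.
    func neighbours(of x: Int, _ y: Int) -> [(Int, Int)] {
        [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
            .filter { $0.0 >= 0 && $0.0 < width && $0.1 >= 0 && $0.1 < height }
    }
}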
Already today I thought that the puzzle could be done using a graph. I went for maps (dictionaries) and arrays again, though. Unfortunately, all the graph implementations I could reuse are in either Java or C++, and I’ve been doing everything so far in Swift. Maybe I’ll prepare a Graph struct/class in Swift just to be ready when it is needed…
Today’s part 2 was hard in the sense that the solution took almost eight minutes to run on my Mac Mini M1. Surely there is a way to optimise it, but I won’t do that; I already passed the challenge. I am still busy with teaching, so all these puzzles are done in my spare time.
Anyways, today I rose to position 16 on the private AoC Swift leaderboard. Even though I am not competing, and skipping a day or failing a puzzle is no issue to me, it is nice to see that I am doing relatively well, so far…
By now I’ve used Sets, Dictionaries, arrays, ranges, string manipulation, string splitting, enumerations, and those grid operations. Let’s see what comes up tomorrow…
So, you can use a Binary Search Tree (BST) to store and quickly access key-value pairs. The key determines the position of the value element in the tree: smaller keys go into the left side and larger keys into the right side of the tree.
A sample unbalanced BST with 13 nodes. The key in the BST is the person’s name, in ascending order.
Finding by key is an O(log n) operation. Starting from the root node of the tree and comparing keys, you can choose which subtree to continue searching in. If the BST is balanced, you need at most O(log n) comparisons to reach the key-value pair you are looking for. For example, searching for Jouni Kappalainen takes four comparisons from the root.
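As a sketch, assuming a simple reference-type node which also stores its child count (the field names here are mine, not from the course code), the search by key looks like this:

final class Node {
    let key: String      // the person's name in this example
    var value: String
    var childCount = 0   // number of descendants; used later for indexed access
    var left: Node?
    var right: Node?
    init(key: String, value: String) {
        self.key = key
        self.value = value
    }
}

// Walk down from the root, picking a subtree by comparing keys.
func find(_ key: String, from root: Node?) -> Node? {
    var current = root
    while let node = current {
        if key == node.key { return node }
        current = key < node.key ? node.left : node.right
    }
    return nil
}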
The tree contents shown in a Java Swing table view, in BST key order.
Note that the number in each tree node in the picture above is the number of children of that specific node. For example, Kevin Bacon has zero children, Samir Numminen has five, and the root node has 12 children.
What if you want to treat the BST as an ordered collection, with each item in the tree accessible by index, as you would do with an array? Like, “give me the 6th element in the tree (the element at index 5), when navigating in-order, from the smallest to the largest key in the tree”. In the sample tree above, Kappalainen would be at index 5 (with zero-based indexing). Kevin Bacon’s index would be zero, since it is the first element in-order. The root node would be at index 1 in this sample tree.
The obvious thing to do would be to traverse the tree in-order, starting the count from the lowest left-side node (Bacon) at index 0 (zero) and counting from there until you reach index 5 (Kappalainen).
Sample tree with in-order indices added in blue. Compare this to the GUI table order and you see the indices are correct.
This works, but the time complexity of this operation is O(n). If you have, let’s say, a million or more elements in the tree, that is a lot of steps. Especially if you need to do this repeatedly. Is there a faster way?
The faster way is to use the child counts of the tree nodes to compute the index of the current node (starting from the root), and then use that information to decide whether to proceed to the left or the right subtree to continue searching for the desired index. This makes finding the element at an index an O(log n) operation, which is considerably faster than the linear O(n) in-order traversal.
For example, suppose you wish to find the element at index 5. Since the root node’s left child has zero children, we can deduce that the root node’s index is zero plus one, which equals one. Therefore, at the root you already know that the element at index 5 must be in the right subtree of the root. There is no use navigating the left subtree.
Also, when going to the right child from the root, you can calculate the index of that node, Töllikkö. Since Töllikkö comes after the root node, add one to the index of the root node, making the current index two (2). Then, since all the nodes in Töllikkö’s left subtree are at earlier indices, add the child count of Töllikkö’s left child, which is 8, to the current index, giving a current index of ten (10). And since the left child node itself must be counted too, not just its children, we add another one (1), making the current index 11. So the index of Töllikkö is 11.
Now, since we are looking for index 5, we know it must be in the left subtree of Töllikkö, so we continue there and ignore Töllikkö’s right subtree.
Moving then to Kantonen, we subtract one (1) from the current index, since we moved towards earlier indices from Töllikkö; the current index is now 10. This is not yet Kantonen’s index. We also need to subtract the number of children in Kantonen’s right child + 1 from the current index, getting 10 - 6, which is 4. That is Kantonen’s index.
Since 4 is smaller than 5, we know we need to proceed to the right subtree of Kantonen to find index 5. Advancing there, to Numminen, we add 1 to the current index, since we are going “forward” towards bigger indices, as well as the count of children + 1 (the child node itself) in Numminen’s left subtree, so the current index (Numminen’s) is now 6.
Since 5 < 6, we know to continue searching in Numminen’s left subtree, so there we go (to Kappalainen), subtracting one (1) from the current index, which is now 5. Earlier, when moving left (to Kantonen), we also subtracted the number of children in Kantonen’s right subtree. But since Kappalainen has no right subtree, we subtract nothing here.
Now we see that the current index is 5, the one we are searching for. So we can stop the search and return Kappalainen as the element at index 5 in the tree.
Using this principle of knowing the count of children in each tree node, we can proceed directly to the correct subtree, either left or right, thereby replacing the linear in-order search with a faster logarithmic search. With large data sets, this is considerably faster.
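In sketch form, this is the general order-statistics idea expressed recursively, reusing the Node type from the earlier sketch (this is not the pseudocode given to the students):

// Number of nodes in a subtree, derived from the stored child count.
func size(_ node: Node?) -> Int {
    guard let node = node else { return 0 }
    return node.childCount + 1
}

// Return the node at the given in-order index, or nil if out of range.
func element(at index: Int, from root: Node?) -> Node? {
    guard let root = root else { return nil }
    let leftSize = size(root.left)   // elements that come before root in-order
    if index < leftSize {
        return element(at: index, from: root.left)   // target is in the left subtree
    } else if index == leftSize {
        return root                                  // root itself is the index-th element
    } else {
        // Skip the left subtree and the root itself; continue in the right subtree.
        return element(at: index - leftSize - 1, from: root.right)
    }
}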
While demonstrating this to students using the course project app TIRA Coders, we saw that after loading 1 000 000 coders into the app, the O(n) index-finding operation (required by the Java Swing table view) made the app quite slow and sluggish. The app kept scrolling for a long time after I lifted my hand off the mouse wheel. The Swing table view was constantly fetching the objects at the visible indices while scrolling, and as this was an O(n) operation, everything was really, really slow.
When replaced by the O(log n) operation described above, using the app was a breeze even with the same one million data elements in the BST; there was no lag in scrolling the table view or in selecting rows in the table.
So, if you need to access BST elements by index, consider using the algorithm described above. Obviously, for this to work, you need to update the child counts in parent nodes while adding nodes to the BST. If your BST supports removing nodes and balancing, you need to update the child counts in those situations too.
(I won’t share any code yet, since the students are currently implementing this (optional) feature with the given pseudocode in their BST programming task.)
I stumbled upon a YouTube video bashing those “quote unquote Clean Code” people for creating badly performing code. Going through the Clean Code principles one by one, he showed how not using them produced more performant code.
Everything he did to make the code perform better in time, I agree with: he is right.
Otherwise, he is either deeply ignorant or was deliberately trolling in a very nasty way. That is why I will not share a link to the video. He had even disabled commenting on the video, a sign that he is not after genuine discussion of the topic.
Of course there are conflicting requirements in software development. To satisfy some (many) of those requirements, principles of Clean Code are very useful.
For some other requirements, such as time performance, not using Clean Code principles in some areas of the code is a good thing.
Many times you have to compromise, and then pay the price for that compromise. This should be obvious to the person who made that video, too.
Another post about data structures, time performance and the count-the-words-in-a-book-file-fast case I’ve been writing about before.
I did a C++ implementation of the books and words problem. Earlier, I implemented several solutions to the problem in Swift and Java. This time I used the C++ standard library std::map and wanted to see whether parallel processing in several threads would speed up the processing.
Obviously it did. The execution time of the multithreaded version was 74% of the single-threaded one: the single-threaded version processed the sample files in 665 ms, while the multithreaded version took only 491 ms. Nice!
But then I saw, from the documentation of std::map, that it keeps the keys in order as elements are added to the map/dictionary.
But this is not needed in my case! Surely this ordering also takes time, so dropping it gives me an additional opportunity to optimise the time performance.
I changed the std::map in the single-threaded implementation to std::unordered_map, and behold, it was faster than the multithreaded version, with a 446 ms execution time!
So mind the map. There are many, and some of them may be more suitable for your use case than others.
My sorting algorithm learning tool / demo app SortSpectacle now includes a new method, Block sort.
Block sort is a variant of Merge sort. Merge sort uses additional arrays to recursively divide the original array into smaller arrays, which are then sorted and combined into a final sorted array.
Block sort is an in-place variant of Merge sort. It uses a small cache to move elements around. Thus it is more memory efficient than Merge sort. And according to the demo app, also quite fast.
I still need to implement the step-by-step version of the algorithm; without it the execution is not animated. But if you take a look at the implementation (a Swift port I did based on the Java version of WikiSort implemented by Mike McFadden), you can see that it will be quite a challenge to do…
Also, it was quite nice to read on Twitter that someone is actually using my sorting algorithms in their own Swift/SwiftUI app to demonstrate the new SwiftUI Charts in action!
Anyhow, here’s a YouTube demo video showing the new version in action:
Just in case anyone is reading: I updated the code related to the previous post. It was about using Swift async tasks to process data in parallel, comparing it to the single-threaded functional programming implementation. Due to updates to Swift (I assume?), the code started to fail (the printout was messy, nothing too bad), so I fixed it. Probably my mistake, revealed by some change in the language or the libraries I use.
The code in FunctionalParallel now uses the async version of the Swift ArgumentParser library. It is also otherwise improved and (at least for now) works correctly. I will measure the speed on the M1 Mini when I get to it and make any necessary changes to the performance comparison table in the repository readme file.
Now I am out of the house, looking after my granddaughter as she sleeps. Her parents (including my daughter) are out celebrating my son’s birthday. Happy birthday, son!
The Swift code below goes through an array of words (Strings) and counts unique words and their frequencies, ignoring the words in wordsToFilter (another array of Strings). The resulting dictionary (a map data structure consisting of word/count pairs) is then sorted by word frequency in descending order. Finally, the top 100 most frequent words are printed out.
Functional programming approach to count the frequency of unique words in a text file.
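The embedded code is not reproduced here, but the idea is roughly the following sketch (variable names are assumed; I use a Set for the filter words for fast lookups, while the post describes an array):

import Foundation

let words: [String] = ["the", "quick", "the", "fox"]   // in reality, parsed from the book file
let wordsToFilter: Set<String> = ["the"]               // words to ignore

let wordCounts = words
    .filter { !wordsToFilter.contains($0) }            // drop the ignored words
    .reduce(into: [String: Int]()) { counts, word in
        counts[word, default: 0] += 1                  // count each unique word
    }

// Sort by descending frequency and print the top 100.
for (word, count) in wordCounts.sorted(by: { $0.value > $1.value }).prefix(100) {
    print("\(word): \(count)")
}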
With my test book file of 17.1 MB, with 2 378 668 words and 97 115 unique words, this functional implementation takes 1.226099 secs to process the file. The time includes reading and splitting the words from the text files into the arrays. For details of the measuring, see the end of this post.
Could it be faster if using the Swift async tasks? Let’s try and see!
Below is the code doing the same in eight async tasks. Code for printing out the result is omitted, shown later below.
Async tasks counting word frequencies.
In the code, the slice size is first calculated at line 66. For example, if the array has 1000 words, it is divided into eight slices, each containing 125 words. Then, in a for loop, a task group with eight async tasks executes (lines 79-85). Each async task calculates the word frequencies of its own slice of the array and returns a dictionary to the task group, containing the word / frequency count pairs of that slice.
In lines 88-96, the task group awaits the tasks to finish. As they do, the task group combines the partial-result dictionary provided by each task into the task group’s own dictionary, wordCounts. This happens in a single thread, so no data corruption occurs: the async tasks themselves never write to the final dictionary holding all the word / frequency pairs.
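In sketch form, with assumed names and simplified slicing, the task group part could look like this:

func countInParallel(_ words: [String], tasks: Int = 8) async -> [String: Int] {
    let sliceSize = max(1, words.count / tasks)        // size of each task's slice
    return await withTaskGroup(of: [String: Int].self) { group in
        for start in stride(from: 0, to: words.count, by: sliceSize) {
            let slice = Array(words[start..<min(start + sliceSize, words.count)])
            group.addTask {
                var partial = [String: Int]()
                for word in slice {                    // each task counts its own slice
                    partial[word, default: 0] += 1
                }
                return partial
            }
        }
        var wordCounts = [String: Int]()
        for await partial in group {                   // merge results one task at a time
            wordCounts.merge(partial) { $0 + $1 }      // no concurrent writes to wordCounts
        }
        return wordCounts
    }
}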
Finally the result is sorted and printed out from the wordCounts dictionary, after the task group has merged the results from the tasks:
Printing out the results; the main thread waits for the async tasks to finish.
Why the semaphore? This is a console app, and the main thread would otherwise continue to the end after the async tasks were launched. What would happen then? The main thread would run past the end of the function, return to the main function, and finish and quit the process, while the async tasks were still executing. Not good.
So, to avoid that: 1) the main thread stops to wait on the semaphore, and 2) the task group uses the same semaphore to signal when it has finished working. The main thread then proceeds to finish.
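Roughly like this (a sketch, continuing the one above; the actual code in the post differs in detail):

import Dispatch

let semaphore = DispatchSemaphore(value: 0)

Task {
    let wordCounts = await countInParallel(words)   // words from the earlier sketch
    // ... sort and print the top 100 here ...
    semaphore.signal()                              // tell the main thread we are done
}

semaphore.wait()                                    // main thread blocks here until signalled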
So, is this any faster? Any use at all in having this more complicated code?
Executing with the same files as above, the execution now takes 0.694983 secs. That is 57% of the original execution time of the single threaded implementation!
Though the absolute times and time differences are not large here, the relative difference is very significant. Consider data sizes hundreds or thousands of times larger than this test file, or this process being run repeatedly over thousands of files, continuously. Then the difference would be significant in absolute time too, not only relatively, even with smaller files.
When you take a look at the Xcode Instruments time profiler view, you can easily see the reason for the speed difference:
Xcode Instruments showing eight parallel threads working on the task.
As you can see, all the work that was earlier done in sequence is now executed in parallel, asynchronously.
So the answer to the question “Could it be faster if using the Swift async tasks?”, is: yes, absolutely.
The measurements were taken on an Apple Mac Mini M1 (Apple Silicon) with 16GB of RAM and 1 TB of SSD storage.
The reason for slicing the array into eight? The M1 processor has eight cores, and each one is put to work. As you can see, the OS and other processes also need the cores, so they are not running this process’s threads at 100% all the time.
The code can be found in my BooksAndWords repository at GitHub. The single-threaded implementation is in the Functional directory, while the async one is in the FunctionalParallel directory.
What if you need to define a tree-like structure of data elements in Swift? For example, you might use a Binary Search Tree to keep track of the unique words in a book file:
An example of a binary search tree with unique word counts from a text file.
Since value types are often preferred in Swift, you could use a struct. The Node struct would contain the word, its count in the book, a key to manage the tree, and optional left and right child nodes:
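Sketched out, with assumed field names, the struct would look something like this:

struct Node {
    let key: Int        // e.g. the word's hash value, used to order the tree
    let word: String
    var count: Int
    var left: Node?     // error: value type 'Node' cannot have a stored
    var right: Node?    // property that recursively contains it
}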
But as you can see from the error message “Value type ‘Node’ cannot have a stored property that recursively contains it”, recursive value types are not supported in Swift. A Node struct in the tree cannot contain left and right child nodes of its own type when using value types.
What to do? You have (at least) two options:
Use the enum type with associated values.
Use classes.
With Swift enums, you can define two states for the enumeration: either a) the node in the tree is empty (there is no node), or b) it has associated values in a node: the word, the word count, the key used to arrange the nodes in the tree (the word’s hash value), and the optional left and right subtrees:
A tree node as an enumeration with associated values.
When defining recursive enumerations, you must use the indirect keyword to indicate recursion in the enumeration.
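A sketch of such an enum, with the same assumed fields as in the struct above:

// indirect tells the compiler the enum refers to itself in its associated values.
indirect enum TreeNode {
    case empty
    case node(key: Int, word: String, count: Int, left: TreeNode, right: TreeNode)
}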
The other option is to use classes, which are reference types in Swift:
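And the class-based node, sketched with the same fields:

final class NodeClass {
    let key: Int
    let word: String
    var count: Int
    var left: NodeClass?     // recursion is fine here: these are references
    var right: NodeClass?

    init(key: Int, word: String, count: Int = 1) {
        self.key = key
        self.word = word
        self.count = count
    }
}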
Check out the full implementation of both class based and enum based solutions from this GitHub repository.
So, are there any other differences between the enum and class implementations than the differences in the code?
Let’s check. The first run below uses the enum implementation, and the second one the class-based implementation.
> swift build -c release
> .build/release/bstree ../samples/tiny.txt ..samples/ignore-words.txt 100
...
Count of words: 44, count of unique words: 32
>>>> Time 0.0008840560913085938 secs.
> swift build -c release
> .build/release/bstree ../samples/tiny.txt ..samples/ignore-words.txt 100
...
Count of words: 44, count of unique words: 32
>>>> Time 0.0009189844131469727 secs.
So far so good. Both implementations work (not all output is shown above) and seem to be quite fast. The tiny text file contains only 44 words, of which 32 are unique.
But when executing both implementations with a larger 16 MB file with 2674582 words, of which 97152 are unique…:
> .build/release/bstree ../samples/Bulk.txt ..samples/ignore-words.txt 100
...
Count of words: 2674582, count of unique words: 97152
>>>> Time 16.52852702140808 secs.
> .build/release/bstree ../samples/Bulk.txt ..samples/ignore-words.txt 100
Count of words: 2674582, count of unique words: 97152
>>>> Time 3.5031620264053345 secs.
You can see that the first, enum-based implementation took 16.5 secs to process the same file that the class-based implementation processed in only 3.5 secs. This is a significant difference. Why does this happen?
Swift enums are value types. When the algorithm reads a word from the book file, it checks whether the word already exists in the tree. If it does, the old enum value is replaced with the new one in the tree. This results in copying, since the tree is mutated. The Swift book says:
“All structures and enumerations are value types in Swift. This means that any structure and enumeration instances you create—and any value types they have as properties—are always copied when they’re passed around in your code.”
— Swift Language Guide
So whenever a node in the tree is modified, or the tree is changed by adding to it, the tree is copied. With classes this does not happen. The excessive copying of tree nodes is a performance killer with very large data sets.
There is also a performance penalty in using classes: each time a class instance is accessed, the retain/release count is updated. But as you can see, this implementation is still much faster than copying structures with value types.
Summarizing, enums are a nice way to implement recursive data structures. If you have large data sets and/or the tree or tree nodes are updated often, consider using classes instead.
In the previous post I mocked the Java app that was hardcoded to use Windows C: disk as the default place to open files.
What, then, is the recommended way? One option is to start looking from the user’s home directory:
import java.io.File;
import javax.swing.JFileChooser;

// Start the chooser in the user's home directory instead of a hardcoded path.
JFileChooser fileChooser = new JFileChooser();
fileChooser.setCurrentDirectory(new File(System.getProperty("user.home")));
Or pick the documents directory. It is also a nice touch to save the selected directory and use it as the starting point the next time, if the user would like to continue from the already selected directory.
What else is happening? It is now the last week of the Data Structures and Algorithms course for which I am the responsible teacher. Students have been analyzing algorithm correctness and time complexity, and implementing basic data structures as well as more advanced ones: hash tables and hash functions, binary search trees, and binary search.
Lectures addressed graphs and graph algorithms too, but implementing these was part of an optional course project only, Mazes. When students finish it, they can play a game a bit like Pac-Man.
They get to choose between two course projects: the Mazes project, or the classic task of counting the unique words in a book file, ignoring the words listed in another file. The largest test file is around 16 MB and has almost 100 000 unique words.
Processing the largest file naively, using two nested for loops, takes around two minutes on my Mac Mini M1 with 16 GB of memory. The fast versions (hash tables, binary search trees) take less than a second.
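To make the contrast concrete, here is a rough Swift sketch of that kind of naive approach (names are mine): for every word, the list of words seen so far is scanned linearly, which makes the whole thing O(n²):

func countNaively(_ words: [String]) -> ([String], [Int]) {
    var uniqueWords = [String]()
    var counts = [Int]()
    for word in words {                                // outer loop over all words
        if let i = uniqueWords.firstIndex(of: word) {  // hidden inner loop: a linear scan
            counts[i] += 1
        } else {
            uniqueWords.append(word)
            counts.append(1)
        }
    }
    return (uniqueWords, counts)
}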
I have implemented several different solutions for comparison in three programming languages: Java, Swift, and C++. Each week in lectures I demonstrated some of these, and in the end we had this table for comparison (sorry, in Finnish only).
Hall of Fame of the different Books and Words implementations. The Swift versions can be found via the GitHub link above.
As you can see, the C++ version with 8 threads was the fastest one. Next after C++ came a couple of Java versions. The Swift implementations were not as fast as I expected. After some profiling, I suspect the reason lies in the way Unicode characters are handled in Swift. All the book files are UTF-8, and students were expected to handle them correctly. I do not like that, in teaching programming languages, the default is mostly to stick with ASCII and conveniently forget the existence of different languages and character sets.
Well, anyways, for some reason processing these UTF-8 text files takes a lot of time in Swift. Maybe later I will have time to find out whether the issue is in my code and/or whether anything can be done to speed things up.
Something very nice happened a week ago: the student guild of our study program, Blanko, awarded me this diploma for being a quality teacher. Apparently they held a vote and I somehow managed to come first this time. The diploma was accompanied by a small Italian-themed gift box. A really nice surprise! I was so astonished to receive this; thank you so much if any of you are reading!
Nice award from the students for quality teaching.
I wanted to make a separate, more detailed post of the item in my previous post about how the order of doing things may greatly influence the time performance of code.
There is an app which reads a large text file to find out how many unique words it has. The app also reports the top 100 unique words in the file and their counts.
One file had 2 735 307 words, of which 99 130 were unique.
Since the programmer was clever, s/he used a binary search tree to keep track of the unique words. The words were then transferred to a table and sorted by frequency to print out the top-100 list.
Inserting words into the binary search tree, or adding the appearance (frequency) counter of the word if already in the tree.
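The code itself is shown only as an image; a rough Swift sketch of the same wasteful pattern (names assumed, modeled on the newnode mentioned below) could look like this:

final class BSTNode {
    let word: String
    var count = 1
    var left: BSTNode?
    var right: BSTNode?
    init(word: String) { self.word = word }
}

func insertWasteful(_ word: String, into root: BSTNode) {
    let newNode = BSTNode(word: word)   // heap allocation on *every* call...
    var node = root
    while true {
        if word == node.word {
            node.count += 1             // ...even though here newNode is never used
            return
        } else if word < node.word {
            if let left = node.left { node = left } else { node.left = newNode; return }
        } else {
            if let right = node.right { node = right } else { node.right = newNode; return }
        }
    }
}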
When measuring the time performance on this one file, the app reported that it could do the job in 2.550396 seconds*).
When looking closely at the code inserting the word into the tree (above), one can see that a node is allocated and then freed even when the word is already in the tree, without anything useful being done with the newnode. The node is needed only when the word is not already in the tree.
So, why not move the allocation of the node after the loop, to where the node is actually needed:
Allocating the newnode only when it is actually necessary.
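In the same Swift sketch, the fix means allocating only when the word turns out to be new:

func insert(_ word: String, into root: BSTNode) {
    var node = root
    while true {
        if word == node.word {
            node.count += 1                         // existing word: no allocation at all
            return
        } else if word < node.word {
            if let left = node.left {
                node = left
            } else {
                node.left = BSTNode(word: word)     // allocate only when actually inserting
                return
            }
        } else {
            if let right = node.right {
                node = right
            } else {
                node.right = BSTNode(word: word)
                return
            }
        }
    }
}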
This small change, some may say a pedantic one, took 30% off the execution time of the whole app when handling the same file mentioned above. After this change, the execution time of the app dropped to 1.867050 seconds. Repeated runs produced similar timings for both versions.
Allocating from the heap, done repeatedly, is slow.
*) Measurements done on an Apple Mac Mini M1 2020: 16 GB RAM, 8 cores, 1 TB SSD.