data structures – Antti Juustila

Vacation!

I usually start my summer vacation from Midsummer. This year that’ll mean that the last workday is today 🥳 I’ll be returning to work in the beginning of August.

So far, I’ve been busy in several fronts, including supervising and evaluating MSc theses and planning and organizing the teaching schedules and resourcing for the Fall. We eventually got one half time student assistant to help us in the courses for the Fall semester.

We were required to provide at least some face-to-face teaching for 1^st year students, since complaints this study year about only online/remote teaching offered. It is “a bit” difficult to organize classroom teaching when the total number of students is 900+ and there are only two full time teachers and two other teachers who have only some hours allocated for these three large courses (plus one smaller) we need to organize. Especially when the remote possibility is a must, not an option, since these courses have Open University students and students from other Finnish universities participating.

Anyways, we again tried to fulfill these requirements with the little resources we’ve got. In some earlier years, we have had two halftime student assistants, last year only one and that’ll be the case for next Fall too. Not even comparing to 15 years ago when we had like ten+ student assistants helping out in various courses. Things really have changed from those days. Let’s see how we cope this time…

Another big thing I’ve been working on is implementing new programming tasks for my Data Structures and Algorithms (DSA) course. The passing ratio of the course is too low. I am again trying to find new ways to make the course more straightforward to pass.

Last year I made explicit paths to pass the course so that the students could take an easier path with less programming to get the grade 1. If someone wanted a higher grade, then they’d do additional programming tasks. And if someone wanted a high grade (4-5/5), then they were also required to participate in one-to-one evaluation discussion with me, talking through the algorithms and data structures they implemented and analyzed in their reports.

Before last year’s changes, most students got a grade 3-5. After this change, most students got the grade 1, then some the grade 2 and minority got grades 3-5. Not sure which change contributed to what extent to this change in the course output.

Unfortunately, the passing rate did not improve regardless of these changes. Most of the students that fail actually either never start doing anything, or they do at most the 1-2 first programming tasks (which actually were based on the previous prerequisite course contents) and then disappeared. Maybe they just decided that the course is too challenging considering their prerequisite knowledge and skills? Or they had more important courses they prioritized?

One reason may be that DSA is no longer a prerequisite for any other course in the curriculum. A couple of years ago it still was a prerequisite and students needed to pass DSA to get into Programming 3 (server side stuff). That was then required to enter Programming 4 (GUI programming) and via that path, getting in the BSc Project course was not possible without passing DSA first in the chain. As the prerequisite DSA ⇒ Programming 3 was removed, it may well be that DSA will be one of those last ones they try to pass before getting their BSc degrees, together with the compulsory Swedish language course… Maybe the hypothesis was that DSA is useless for the courses coming after it. Not that I agree with that…

The issues of plagiarism and using AI to avoid learning and doing things have somewhat increased during the last 2-3 years. These issues are not too widespread yet (?), but still clearly on the rise. I’ve now used the same programming tasks for two years, and it is apparent that some are getting the solutions copied from somewhere.

Luckily there are lots of students that are willing to do the work and walk the path, regardless of the challenges — always very motivating to the teachers to see. It is very rewarding for us teachers to discuss issues with students, walk them through their problems, seeing their joy of finally solving difficult issues and at the same time see them learn and gain skills that are important for their professional development as software developers.

Items from above were the motivation for me to implement new programming tasks for the DSA course. Most of the task learning objectives are the same as before. I decided to drop the first task that was just revisiting the insertion sort algorithm taught already in the prerequisite course. That algorithm is now given to the students, as they are supposed to know it already. So no two weeks of course time is spent on that anymore.

Instead, the first task focuses on correctness, first of the two important learning goals for the DSA course, the other being time complexity (to some extent also memory complexity). They’ll learn to use invariants to ensure code is behaving correctly, including pre- and postconditions, as well as loop and class invariants, implemented with asserts and thrown exceptions.

I’ve brought back the partitioning algorithm task back from 2021 course implementation. The binary search tree task will be a bit easier; the current task is too challenging because the context of applying it makes it too large a task to do. So that’ll be simplified.

Furthermore, all of the tasks where they apply the generic algorithm or data structure they first implement, will change. There will no longer be a single example app using all those algorithms and data structures they implement. Some students did value this concrete example with a GUI; seeing a real world context where all this they are supposed to learn, is actually used and needed. But for many, this context was too large, too much code to digest and understand, they got confused and felt lost. In the next DSA, each task they go through will be in a separate, small and simple(r) component, focusing only on a specific algorithm or data structure and how to apply it and analyze it.

Another of my goals is also to move the course exams from Moodle to controlled computer exam class. It is apparent that the Moodle exam is too easy to do in a way that doesn’t measure the students’ actual knowledge of the topics. Maybe the new setup does this better, if I manage to prepare the new exams in the new environment in time for Fall…

The new set of programming tasks has 11 different tasks, and I’ve currently (re)implemented five of them:

Assess and support code correctness of algorithms by using asserts, pre- and postconditions and loop and class invariants.
Recursive binary search, comparing time efficiency with linear search.
Partitioning an array using a predicate, filtering utilizing the partitioning algorithm.
Quicksort, comparing it with insertion sort and partitioning/filtering, when sorting is not actually necessary.
Stack, using stack to check XML correctness (parsing code is given).

That leaves six tasks to be finished in August, before the course begins in September:

Queue.
Linked list.
Binary search tree, applying it to count unique words in Project Gutenberg books / frequency of person names in datasets.
Hash tables and hash functions, possible application is the same as with binary search tree, to compare time/memory efficiency of these two ?
Using the Set in Java (not implementing it).
Something about Graphs (breath/depth first searches?), or leave this to theory only, depending on the estimated student workload.

As usual, all tasks will have automated unit tests and also the code is analyzed by using tools that I’ve discussed in some of my previous posts.

To get all this done in time, it’ll be a busy month in August. Especially when I want to avoid having errors with the material given to students. I should reserve time to test and assess the code and the instructions so that I do not have to post too much errata when the course starts…

In the past, I did use the summer time to do these kinds of teaching development work, but since the past 2-3 years, I’ve decided not to sacrifice myself to work. Vacation time is my time, not the employers.

Hopefully you, dear reader, have also the time to relax during your vacation time, whenever that might be. Have a nice Summer! 🌞 (or Winter if you locate at the other side of the globe, don’t actually know if it is dry or rainy season in between 😃).

Fast Index Order in Binary Search Tree

So, you can use a Binary Search Tree (BST) to store and quickly access key-value pairs. Usually the key gives the position of the value element in the tree. You place smaller values in the left side, and larger values in the right side of the tree.

A sample BST unbalanced, having 13 nodes in the tree. Key in the BST is the person’s name in ascending order.

Finding by key value is O(log n) operation. Starting from the root node of the tree, comparing the key, you can choose which subtree to continue searching for the key. If the BST is balanced, you need to only do O(log n) comparisons at most to get to the key-value pair you wish to find. Like searching for Jouni Kappalainen takes four comparisons from the root and it is found.

The tree contents in key order shown in a Java Swing table view. Order is by BST key order.

Note that the numbers in each tree node picture above is the number of children of that specific node. For example, Kevin Bacon has zero children, as Samir Numminen has five, and the root node has 12 children.

How about if you want to treat the BST as an ordered collection, each item in the tree accessible by index, as you would do with an array? Like, “give me the 6th element in the tree (element at index 5), when navigating in-order, from smallest to largest key in the tree”. In the sample tree above, Kappalainen would be in the index 5 (zero based indexing). Kevin Bacon’s index would be zero since it is the first element in-order. Root node would be at index 1 in this sample tree.

The obvious thing to do would be to traverse the tree in-order, counting indices from the lowest left side node (Bacon) with index 0 (zero) and then continue counting from there until you are at the index 5 (Kappalainen).

Sample tree with in-order indices added in blue. Compare this to the GUI table order and you see the indices are correct.

This works, but the time complexity of this operation is O(n). If you have, let’s say a million or more elements in the three, that is a lot of steps. Especially if you need to do this repeatedly. Is there a faster way to do this?

The faster way is to use the child counts of the tree nodes to count the index of the current node (starting from the root), and then use that information to decide if you should proceed to left or right subtree to continue finding the desired index. This makes finding the element with an index O(log n) operation, which is considerably faster than linear in-order traversal with O(n).

For example, consider you wish to find element at the index 5. Since the root node’s left child has zero children, we can then deduce that the root node’s index is zero plus one, equals one. Therefore at the root you already know that the element at index 5 must be at the right subtree of the root. No use navigating the left subtree.

Also, when going to the right child from the root, you can calculate the index of that node, Töllikkö. Since Töllikkö is after the root node, add one to the index of root node, therefore the current index being two (2). Then, since all Töllikkö’s left children are in earlier indices, add the child count of Töllikkö’s left subtree, which is 8, to the current index, having current index of ten (10). And since the left child node must be counted too, not just its children, we add another one (1) to current index, it now being 11. So the index of Töllikkö is 11.

Now we know that since we are finding the index 5, it must be in the left subtree of Töllikkö, so we continue there and ignore Töllikkö’s right subtree.

Moving then to Kantonen, we need to deduct one (1) from the current index, since we moved to earlier indices from Töllikkä, current Index being now 10. This is not yet Kantonen’s index. We need to deduct the number of children in Kantonen’s right child + 1 from the current index, and will get 10 – 6, having 4. That is Kantonen’s index.

Since 4 is smaller than 5, we know we need to proceed to the right subtree of Kantonen to find the index 5. Advancing there, in Numminen, we need to add 1 to the current index since we are going “forward” in the index order to bigger indices, as well as the count of children + 1 (the subtree itself) in Numminen’s left subtree, in so current index (Numminen’s) is now 6.

Since 6 < 5 we know to continue searching the index from Numminen’s left subtree there we go (to Kappalainen), deducing one (1) from current index, it now being 5. Now, earlier, when moving to left (to Kantonen), we also deducted the number of children in Kantonen’s right subtree. But since Kappalainen has no right subtree, we deduct nothing.

Now we see that current index is 5, the same we are searching for. So, we can stop the search and return Kappalainen as the element in the index 5 in the tree.

Using this principle of knowing the count of children in each tree node, we can directly proceed to the correct subtree, either left or right, and therefore replacing the linear in-order search with faster logarithmic search. With large data sets, this is considerably faster.

While demonstrating this to students using the course project app TIRA Coders, we saw that loading 1 000 000 coders to the course TIRA Coders app, the O(n) index finding operation (required by the Java Swing table view) made the app quite slow and sluggy. When scrolling, the app kept scrolling long time after I lifted my hand off the mouse wheel. Swing Table view was getting the objects in the visible indices constantly while scrolling and as this is O(n) operation, everything was really, really slow.

When replaced by the O(log n) operation described above, using the app was a breeze even with the same number of one million data elements in the BST; there was no lags in scrolling the table view nor in selecting rows in the table.

So, if you need to access BST elements by the index, consider using the algorithm described above. Obviously, for this to work, you need to update the child counts in parent nodes while adding nodes to the BST. If your BST supports removing nodes and balancing the BST, you need to update the child counts also in those situations.

(I won’t share any code yet, since the students are currently implementing this (optional) feature with the given pseudocode in their BST programming task.)

New GUI for teaching data structures and algorithms

Next Fall is the fifth time when I teach the course Data Structures and Algorithms. I received the course on 2019 and after a year of familiarising with the course, started developing it.

First I switched from C to Java and implemented automated unit tests to verify the correctness of student implementations. Some unit tests also measured time performance with varying numbers of n. The measurements were then used to support the time complexity analysis of the implementations, in addition to analysing the pseudocode and implemented algorithms. Comparing the implemented fast algorithms and data structures to slower ones was also one of the things students did.

All this to support the students to learn to see the difference between algorithms and helping them to decide which algorithm or data structure to use in whatever situation. Many times a simple algorithm suffices and simple array and plain loops are enough. But sometimes, something more is needed.

In the last two years, one or two exercise tasks had a graphical user interface (GUI) instead of a console UI, most of the tasks have. The feedback from the students was that seeing the actual effect on using faster and better data structures and algorithms in a GUI was a very nice thing to have. It really gave them a tangible view on the benefits of using these algorithms (like partitioning a large array) and data structures (finding a path using Dijkstra’s algorithm in a maze game).

Based on this student feedback, I decided to implement a GUI to all exercise tasks for the Fall 2023 course. The experiment I want to make is to see if having a GUI helps in motivating the students to work on and learn from the course exercises, compared to the more common console UIs and seeing the results of unit tests passing (or not).

Yesterday the first version of the GUI was finished, including all the exercise tasks to teach concepts like:

sorting an array with simple (and slow, with large n) insertion sort algorithm,
sorting to both ascending and descending order using the Java Comparator interface,
searching using linear search algorithm,
after sorting, implementing faster search using binary search,
implementing a Stack to verify a JSON file has no issues with parentheses nor quotation characters,
implementing a Queue to implement a “call list” (either using an array or a linked list as the internal data structure of the Queue),
when dealing with larger data sets, using faster sorting algorithms like quick sort, heap sort or merge sort,
to make handling large data sets faster in adding data elements and searching for them, using a binary search tree as the data structure,
using a hash table to count the number of unique words used in a text file and list the top 100 words with counts (a classical problem in data structures and algorithms), and finally,
using a graph data structure and associated algorithms (like Dijkstra’s) to find a shortest path between two data items.

So in addition to the already existing unit tests and time efficiency measurements, to verify the correctness and time performance of their data structure and algorithm implementations, next Fall they will also see these working in a “real life” application, the TIRA Coders Phonebook app:

TIRA Coders Phonebook app screenshot. Picture includes text highlighting how all the various algorithms and data structures are included in the app. — TIRA Coders Phonebook app screenshot

Each of the smaller windows and dialogs appear when the feature is used from the menu of the application.

On the lower right corner, there is a log window displaying time performance data when executing the different algorithms; filling the data structures, sorting the data, searching for data, etc.

Students are able to load phonebooks of different sizes, from 100 coders up to 1 000 000 coders, and see how their implementations of different data structures and algorithms cope with various numbers of n.

I still need to work on this app more, making sure that when I remove all the solutions for the data structures and algorithms I’ve implemented, the app is still functional enough to be a starting point for the students to start working on the course exercise tasks.

I also need to carefully point to the students the parts of the code they need to work upon, and parts of the code they do not need to care about, not to get confused by the things not relevant in learning the course topics.

It will be exiting to see, in a couple of months, how this works out and if having the GUI helps in learning or motivates students to work on the tasks.

Swift tree structures: value types, enums and classes

What if you need to define a tree like structure of data elements in Swift? For example you might use a Binary Search Tree to keep track of unique words in a book file:

A tree structure where each node in the tree has optional two child nodes, left and right. Each node has two values: the word found in the book, and the count how many times the word appeared in the book. — *An example of a binary search tree with unique word counts from a text file.*

Since value types are often preferred in Swift, you could use a struct. The Node struct contains the word, the count of it in the book, a key to manage the tree, and optional left and right child nodes of the tree.

struct Node {
let key: Int
let word: String
var count: Int

var leftChild: Node?
var rightChild: Node?

init(_ word: String) {
key = word.hashValue
self.word = word
count = 1
}
}

But as you can see from the error message “Value type ‘Node’ cannot have a stored property that recursively contains it” — recursive value types are not supported in Swift. A Node in the tree struct cannot contain the left and right child nodes when using value types.

What to do? You have (at least) two options:

Use the enum type with associated values.
Use classes.

With Swift enums, you can define two states for the enumeration. Either a) the node in the tree is Empty (there is no node) or b) it has associated values in a Node — the word, the word count, key used to arrange the nodes in the tree by the word hash value, and the optional left and right subtrees:

indirect enum EnumTreeNode {
case Empty
case Node(left: EnumTreeNode, hash: Int, word: String, count: Int, right: EnumTreeNode)

init() {
self = .Empty
}

init(_ word: String) {
self = .Node(left: .Empty, hash: word.hashValue, word: word, count: 1, right: .Empty)
}

func accept(_ visitor: Visitor) throws {
try visitor.visit(node: self)
}
} — *A tree node as an enumeration with associated values.*

When defining recursive enumerations, you must use the indirect keyword to indicate recursion in the enumeration.

The other option is to use classes, which are reference type elements in Swift:

final class TreeNode {
let key: Int
let word: String
var count: Int

var left: TreeNode?
var right: TreeNode?

init(_ word: String) {
self.key = word.hashValue
self.word = word
count = 1
}

You can read more about Swift structs and classes from here if you are unfamiliar with them.

Check out the full implementation of both class based and enum based solutions from this GitHub repository.

So, are there any other differences in the enum and class implementations, than the differences in code?

Let’s check out. First run below is using the enum implementation, and the second one is executed using the class based implementation.

> swift build -c release
> .build/release/bstree ../samples/tiny.txt ..samples/ignore-words.txt 100
...
Count of words: 44, count of unique words: 32
>>>> Time 0.0008840560913085938 secs.


> swift build -c release
> .build/release/bstree ../samples/tiny.txt ..samples/ignore-words.txt 100
...
Count of words: 44, count of unique words: 32
>>>> Time 0.0009189844131469727 secs.

So far so good. Both implementations work (not all results not shown above) and seem to be quite fast. The tiny text file contains only 44 words of which 32 are unique.

But when executing both implementations with a larger 16MB file with 2674582 words of which 97152 are unique…:

> .build/release/bstree ../samples/Bulk.txt ..samples/ignore-words.txt 100
...
Count of words: 2674582, count of unique words: 97152
 >>>> Time 16.52852702140808 secs.


> .build/release/bstree ../samples/Bulk.txt ..samples/ignore-words.txt 100
Count of words: 2674582, count of unique words: 97152
 >>>> Time 3.5031620264053345 secs.

You can see that the first enum based implementation took 16.5 secs to process the same file the class based implementation only took 3.5 secs to process. This is a significant difference. Why this happens?

Swift enums behave like value types. When the algorithm reads a word from the book file, it searches if it already exists in the tree. If yes, the old enum value is replaced with the new one in the tree. This results in copying the tree since it is now mutated. The Swift book says:

“All structures and enumerations are value types in Swift. This means that any structure and enumeration instances you create—and any value types they have as properties—are always copied when they’re passed around in your code.”
— Swift Language Guide

So whenever a node in the tree is modified, that results in copying the tree. When you change the tree by adding to it, the tree is copied. Using classes this does not happen. The excessive copying of the tree nodes is a performance killer when having very large data sets to handle.

There is also a performance penalty in using classes — each time a class instance is accessed, the retain / release count is updated. But as you can see, still the implementation is much faster compared to copying the structure with value types.

Summarizing, enums are a nice way to implement recursive data structures. If you have large data sets and/or the tree or tree nodes are updated often, consider using classes instead.

Teaching season started

Fall teaching season has started with one old course already ongoing. Devices and networks. My part is the networks, so currently I can focus on two new courses starting in October. Data structures and algorithms is an old course but I’ll take responsibility for it this year.

Another course, Platforms and ecosystems is a new course. Meaning I have hands full in creating material for the new course together with two other teachers. And at the same time, familiarising myself with the data structures course.

For the data structures course, I’ve worked earlier with the demo sorting app implemented using Swift. No time to add additional sorting methods there, but should build it with new Xcode, Swift and iOS 14 to see if there are any (breaking) changes.

What I did recently is that I implemented simple (and stupid) array and linked list classes, both in C++ and Swift, to demonstrate the effectiveness of creating and accessing arrays in comparison to linked lists. Will be using that when discussing why different data structures have different (preferred) usage situations. And that there may be conflicting requirements for the data structure in an app. Then you just have to make compromises.

Another important thing to show to the students is that you need to build the release version before comparing or measuring performance.