Antti Juustila – Page 8 – Mostly about programming

Java file management, hall of fame and a nice surprise

In the previous post I mocked the Java app that was hardcoded to use Windows C: disk as the default place to open files.

What then is the recommended way? One is to start looking from the user home directory:

JFileChooser fileChooser = new JFileChooser();

fileChooser.setCurrentDirectory(new File(System.getProperty("user.home")));

Or pick the documents directory. It is also a nice thing to save the selected directory and use that if the user would like to continue with the already selected directory next time.

What else is happening? It is now the last week of the course Data Structures and Algorithms I am responsible teacher. Students have been analyzing algorithm correctness and time complexity, implementing basic data structures, more advanced ones like hash tables and hash functions, binary search trees and binary search.

Lectures addressed graphs and graph algorithms too, but implementation of these was in an optional course project only, Mazes. When students finish that, they can play a game a bit like the PacMan.

They get to choose from two course projects: either that Mazes project or optionally implement the classic task: count the unique words from a book file, ignoring some words from another file. The largest test file is around 16 MB, having almost 100 000 unique words.

Processing the largest file, using naively two nested for loops takes on my Mac Mini M1 with 16MB of memory around two minutes to process. The fast versions (hash tables, binary search trees) take less than a second.

I have implemented several different solutions for comparison with three different programming languages, Java, Swift and C++. Each week in lectures I demonstrated some of these, and in the end we had this table for comparison (sorry, in Finnish only).

Hall of Fame of the different Books and Words implementations. Swift versions can be found from the link above to GitHub.

As you can see, the C++ with 8 threads was the fastest one. Next after C++ came a couple of Java versions. Swift implementations were not so fast as I expected. After some profiling, I suspect the reason is in the way Unicode chars are handled in Swift. All the book files are UTF-8 and students were expected to handle them correctly. I do not like that mostly in teaching programming languages the default is to stick with ascii and conveniently forget the existence of different languages and character sets.

Well, anyways, for some reason, processing these UTF-8 text files takes a lot of time with Swift. Maybe later I have time to find out if the issue is in my code and/or is there anything that can be done to speed things up.

Something very nice happened a week ago — the student guild of our study program, Blanko, awarded me this diploma for being a quality teacher. Apparently they had a vote and I somehow managed to come first this time. The diploma was accompanied by this Italian themed small gift box. A really nice surprise! I was so astonished to receive this, thank you so much if anyone of you is reading!

Nice award from the students for quality teaching.

Liukkaat kadut ale

Liukkaat kadut Apple Watch -sovellus on Black Friday -alessa 24.-27.11. hintaan 0€.

Kyllä, alennus on kokonaista 100%! Säästät jopa noin euron!

Huomaa että sovellus toimii vain Apple Watchissa ja seuraavissa kaupungeissa: Helsinki, Jyväskylä, Kuopio, Lahti ja Oulu. Liukkausvaroitukset tulevat kyseisten kaupunkien katujen kunnossapidosta vastaavilta. Sovellus toimii suomeksi, englanniksi ja ruotsiksi.

Slippery Cities Apple Watch app is on Black Friday sale from Nov 24th until Nov 27th with a price of 0€! Slippery Cities provides alerts of slippery weather for pedestrians in the following cities: Helsinki, Jyväskylä, Kuopio, Lahti and Oulu. The alerts are provided to your Apple Watch app by the service when the maintenance personnell of a city issues a slippery weather alert.

Write once, run everywhere

Is this Java code “write once, run everywhere”?

The checkFileHash method will calculate a hash for the file and check if the correctHash matches with the calculated hash. If the hashes match, the file is identical to what was expected.

boolean checkFileHash(String fileName, String correctHash) {
...
   File file = new File(fileName);
   MessageDigest md = MessageDigest.getInstance("SHA");
   FileInputStream fis = new FileInputStream(file);
   byte[] dataBytes = new byte[1024];
   
   int nread = 0;
   while ((nread = fis.read(dataBytes)) != -1) {
      md.update(dataBytes, 0, nread);
   }
   byte[] mdbytes = md.digest();
   StringBuilder sb = new StringBuilder();
   for (int i = 0; i < mdbytes.length; i++) {
      sb.append(Integer.toString((mdbytes[i] & 0xff) + 0x100, 16).substring(1));
   }
   fis.close();
   System.out.println("\nHash is " + sb.toString());
   return sb.toString().equals(correctHash);

Well this does compile, run and produce a file hash, compare it against the correct one and returns the result.

If the correctHash was calculated earlier on a same OS, hashes will even match. But if the corrrectHash was calculated in a different OS with a different default character encoding, the hashes do not match.

You need to explicitly specify that UTF-8 encoding is used when reading the file — which also should be saved UTF-8 encoded. And when calculating the expected correctHash the UTF-8 encoding should also be used. And when the hash for the file is calculated when file is inspected in this method:

boolean checkFileHash(String fileName, String correctHash) {
...
   // Use SHA for hashing, does not need to be secure.
   MessageDigest md = MessageDigest.getInstance("SHA");
   // Use buffered reader since we need to read explicitly using UTF-8.
   BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(fileName), "UTF-8"));

   // Read the lines from the line to a String.
   String line = null;
   StringBuilder sb = new StringBuilder();
   while ((line = br.readLine()) != null) {
      sb.append(line);
   }
   br.close();
   // Calculate the SHA digest (hash) from the string
   byte [] mdbytes = md.digest(sb.toString().getBytes("UTF-8"));

   // Convert the hash into a String.
   StringBuilder hsb = new StringBuilder();
   for (int i = 0; i < mdbytes.length; i++) {
      hsb.append(Integer.toString((mdbytes[i] & 0xff) + 0x100, 16).substring(1));
   }
   String calculatedHash = hsb.toString();
   System.out.println("\nHash is " + calculatedHash.toString());
   // Check if the hash matches with the expected correct one.
   boolean matches = calculatedHash.equals(correctHash);
   if (!matches) {
      System.out.println("Correct has is: " + correctHash);
   }
   return matches;

Using a platform independent programming language does not guarantee platform independent code.

Obviously UTF-8 specifically is not required, but using a same encoding everywhere.

Once I used (on a macOS) a Java app with file open dialog, that was hardcoded to show directory C:\ by default. Sooooo platform independent…

Running Windows on Monterey in VirtualBox

Just a note to myself: if the Windows 10 on macOS Monterey in VirtualBox does not launch, do:

sudo kmutil load -p '/Library/Application Support/VirtualBox/VBoxDrv.kext'

And it should run. At least until the next time. Maybe someone somewhere fixes this sometimes.

Work work work

youtube.com/watch

Just checked the first commit date of the completely renovated Data structures and algorithms course repository. It was around December 2020.

Since then I have designed and implemented about ten exercises with unit tests in Java. And two course projects with unit tests. Plus about ten additional demos with Swift. All with commentary and instructions.

In a week the new course begins. I’ve recorded about 20 hours of new lecture materials. Additional videos about graph programming with C++ will be included from last year demos.

One of the reasons for not posting too much lately ?

The video is about one of the course project implementing a graph data structure with various algorithms including Dijkstra’s shortest path finding algorithm. I thought applying this in a game context would perhaps motivate the students to dwell into the world of data structures and algorithms. We’ll see….

There was an app

Wanted to make a separate more detailed post of my previous post item about how the order of things may greatly influence the time performance of code.

There is an app which reads a large text file to find out how many unique words it has. The app also reports on the top 100 unique words in the file, and the count of them.

One file had 2 735 307 words, of which unique were 99 130.

Since the programmer was clever, s/he used a binary search tree to do the job of keeping book of the unique words. Then it transferred the job to a table to sort the words in order of frequency to print out the top-100 list.

Inserting words into the binary search tree, or adding the appearance (frequency) counter of the word if already in the tree.

When measuring the time performance of the app to handle this one file, the app reported that it could do the job in 2.550396 seconds^*).

When looking at the code inserting the word to the tree (above) closely, one can see that a node is allocated and then free’d even when the word is already in the tree, without anything useful done with the newnode. The node is needed only when the word is not already in the tree.

So, why not move the allocation of the node after the loop, when the node is actually needed:

Allocating the newnode only when it is actually necessary.

This small, some may say a pedantic change, took off 30% of the execution time of the whole app, when handling that same file mentioned above. After this change, the execution time of the app dropped down to 1.867050 seconds. Repeated runs produced similar timings for both solutions.

Allocating from the heap, done repeatedly, is slow.

^*) Measurements done on Apple Mac Mini M1 2020, 16Gb RAM, 8 cores, 1 Tb SSD.

Today’s thoughts

Someone mentioned the book Exercises in Programming Style by Cristina Lopes in Twitter and I bought it. Interesting read, can recommend. I can bring some things from the book into my course next Fall. We should actually have a course focusing mainly on these topics but alas, do not have one. I just need to implement those things I could apply in Java, maybe as demonstrations, since the book is in Python.

Another thought I need to embed into course material is this nice example of how order of things may greatly influence the time performance of code. I was evaluating a student submission of a course project. A function added nodes (unique word counts of very large text files) into a binary search tree. First a node was allocated using malloc after which the tree was searched in a loop where to put the node. And if a node with the value already existed, the frequency count of the existing node was increased and the node just created could be freed and function would return. And if the value of the node was not found in the tree, then the node just allocated was inserted into the tree in the place that was found for it. By just moving the malloc after the loop (where the node was actually needed) got rid of 30% of the execution time of the whole app. Boom! What an improvement with such a small change. Which is obvious if you have some experience but not so for many students who are in the beginning. This will be a fun demonstration…

Final thought of the day popped to my mind because of that minor piece of code having a large effect, and that is this saying:

The idiom “the straw that broke the camel’s back“, alluding to the proverb “it is the last straw that breaks the camel’s back”, describes the seemingly minor or routine action that causes an unpredictably large and sudden reaction, because of the cumulative effect of small actions.
Wikipedia

Programming as a mental activity

shapeof.com/archives/2020/12/brain-activity_of_people_coding.html

Interesting post and the article there, and something I agree with.

…coding task mainly activated the so-called multiple demand network. This network, whose activity is spread throughout the frontal and parietal lobes of the brain, is typically recruited for tasks that require holding many pieces of information in mind at once, …

I hold the different things required to implement “a thing” in my mind somehow visually. The different items and places of code are in a three dimensional kind of a map in my mind. Difficult to describe this, but this 3D kind-of-a-space is a good enough description. Then I need to keep in my mind information which parts are still not finished, which have issues to solve, and how they connect to each other. This mental map is then somehow connected to the IDE where the action happens in text editors, but that is not my actual “map” of the thing under construction.

Programming is different. If you haven’t tried it because you think you suck at math, you should try anyway. You’d be in good company (I suck at math too).

Agree with this too, personally…

Merging two JSON files

Having two large JSON files with one common element, how can you merge them? Editing by hand not so nice if the files contain hundreds of elements.

For example, you have a large JSON file of locations:

{
"name": "Oulu",
"lat": 65.013784817,
"lon": 25.47209907
},

And then you also have a large JSON file of location codes for the commune:

And you would like to merge these into a one JSON file and parse the commune code, name, and location data at one go.

You can do this with jq using the following command:

./jq-osx-amd64 --slurpfile file2 codes.json.txt '
INDEX($file2[]; .name) as $dict
| $dict[.name] as $x
| if $x then . + $x else empty end
' coords.json.txt > out.json

And as a result, you get your JSON element merged in one file:

{
"name": "Oulu",
"lat": 65.013784817,
"lon": 25.47209907,
"code": "564"
},

Much faster than copying and pasting or implementing your own JSON merger.

Stripping XML

xmlstarlet ed -N d="http://www.w3.org/2005/Atom" -N c="urn:oasis:names:tc:emergency:cap:1.2" -d "d:feed/d:entry/d:content/c:alert/c:info/c:area/c:polygon" fmi-alerts.xml > fmi-stripped.xml

I tried to make sense of a very, very large XML file with repeating elements of hundreds and hundreds of GPS coordinates, among other elements.

Luckily I found xmlstarlet. Installation was easy with homebrew. I just specified the namespaces and the elements to remove. The original 20Mb XML file was stripped to mere 140kb. Much easier to read!