Java file management, hall of fame and a nice surprise

In the previous post I mocked the Java app that was hardcoded to use Windows C: disk as the default place to open files.

What then is the recommended way? One is to start looking from the user home directory:

JFileChooser fileChooser = new JFileChooser();

fileChooser.setCurrentDirectory(new File(System.getProperty("user.home")));

Or pick the documents directory. It is also nice to save the directory the user selected and open the file chooser there the next time, in case the user wants to continue from the same place.

What else is happening? It is now the last week of the Data Structures and Algorithms course, for which I am the responsible teacher. Students have been analyzing algorithm correctness and time complexity, and implementing basic data structures as well as more advanced topics such as hash tables and hash functions, binary search trees and binary search.

Lectures addressed graphs and graph algorithms too, but these were implemented only in an optional course project, Mazes. When students finish it, they can play a game that is a bit like Pac-Man.

They get to choose from two course projects: either the Mazes project or the classic task of counting the unique words in a book file while ignoring the words listed in another file. The largest test file is around 16 MB and has almost 100 000 unique words.

On my Mac Mini M1 with 16 GB of memory, processing the largest file naively with two nested for loops takes around two minutes. The fast versions (hash tables, binary search trees) take less than a second.
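The difference comes from the lookup cost: the nested loops scan a list of already seen words for every word read, while a hash table finds the word in constant time on average. Here is a minimal C++ sketch of the hash table idea. It is not the course's reference solution; the function name countWords is made up for this illustration, and splitting on whitespace only is a simplifying assumption (the real book files need proper UTF-8 aware tokenizing).

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical helper: counts how often each word occurs in `text`,
// skipping any word found in `ignored`. Words are split on whitespace only.
std::unordered_map<std::string, std::size_t>
countWords(std::istream & text, const std::unordered_set<std::string> & ignored) {
   std::unordered_map<std::string, std::size_t> counts;
   std::string word;
   while (text >> word) {
      if (ignored.find(word) == ignored.end()) {
         ++counts[word];   // average constant time lookup, no scanning of a list
      }
   }
   return counts;
}
```

The function reads from any std::istream, so it can be tried with a std::istringstream before pointing it at a std::ifstream opened on a book file. The number of unique words is then simply the size of the returned map.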

I have implemented several different solutions for comparison in three programming languages: Java, Swift and C++. Each week in lectures I demonstrated some of these, and in the end we had this table for comparison (sorry, in Finnish only).

Hall of Fame of the different Books and Words implementations. Swift versions can be found from the link above to GitHub.

As you can see, C++ with 8 threads was the fastest one. Next after C++ came a couple of Java versions. The Swift implementations were not as fast as I expected. After some profiling, I suspect the reason is in the way Unicode characters are handled in Swift. All the book files are UTF-8 and students were expected to handle them correctly. I do not like that in teaching programming the default is mostly to stick with ASCII and conveniently forget the existence of different languages and character sets.

Well, anyway, for some reason processing these UTF-8 text files takes a lot of time with Swift. Maybe later I will have time to find out whether the issue is in my code and whether anything can be done to speed things up.

Something very nice happened a week ago: the student guild of our study program, Blanko, awarded me this diploma for being a quality teacher. Apparently they had a vote and I somehow managed to come first this time. The diploma was accompanied by this small Italian themed gift box. A really nice surprise! I was so astonished to receive this, thank you so much if any of you are reading!

Nice award from the students for quality teaching.

Adopting some newer C++ features

I’ve been continuously updating my C++ skills by adopting features from newer versions of the language, such as C++17. For fun and learning, I’ve been updating some older apps to use these features and implementing some new tools with them, for example std::variant, algorithms (instead of traditional loops) and attributes. Some examples below.

Attributes

Instead of commenting in a switch/case structure that fallthrough is OK, use the [[fallthrough]] attribute:

      switch (argc) {
         case 4:
            outputFileName = argv[3];
            [[fallthrough]];
            
         case 3:

The reader is then aware that the break; is not missing by accident, but left out intentionally. This improves code readability and quality, and silences the compiler warning about the missing break.
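The snippet above is cut short, so here is a complete sketch of the same idea. The Options struct, the parseArguments function and the meaning given to each argument are assumptions made up for this illustration, not the code of the actual tool.

```cpp
#include <string>

// Hypothetical settings struct for illustration only.
struct Options {
   std::string inputFileName;
   std::string indexFileName;
   std::string outputFileName = "output.txt";
};

// Fills in the options depending on how many arguments were given.
// Each case deliberately continues into the next one.
Options parseArguments(int argc, const char * const argv[]) {
   Options options;
   switch (argc) {
      case 4:
         options.outputFileName = argv[3];
         [[fallthrough]];
      case 3:
         options.indexFileName = argv[2];
         [[fallthrough]];
      case 2:
         options.inputFileName = argv[1];
         break;
      default:
         break;
   }
   return options;
}
```

With four arguments all three names are set; with two, only the input file name is set and the output file name keeps its default.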

To make sure the caller of the function handles the return value, use the [[nodiscard]] attribute:

[[nodiscard]]
int readFile(const std::string & fileName, std::vector<std::string> & entries);

The compiler will warn you if the return value is not handled. This again improves code quality.

The nodiscard attribute warns you that an essential return value is not handled.
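To make the declaration above concrete, here is one possible body for readFile. Only the signature comes from the code above; the body and the meaning of the return value (number of lines read, or -1 on failure) are assumptions for this sketch.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Returns the number of lines read, or -1 if the file could not be opened.
// (The meaning of the return value is an assumption for this sketch.)
[[nodiscard]]
int readFile(const std::string & fileName, std::vector<std::string> & entries) {
   std::ifstream file(fileName);
   if (!file) {
      return -1;
   }
   int count = 0;
   std::string line;
   while (std::getline(file, line)) {
      entries.push_back(line);
      ++count;
   }
   return count;
}
```

Calling readFile("data.txt", entries); as a plain statement makes the compiler warn about the discarded value, while if (readFile("data.txt", entries) < 0) { ... } compiles cleanly because the value is used.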

Using using instead of typedef

I wanted to use a shorter name for a complex data structure. This is usually done with typedef. The using keyword, more familiar from namespaces, is a neater alternative:

using queue_package_type = std::map<std::string, std::pair<int,int>>;
queue_package_type queuePackageCounts;

Or similarly:

using NodeContainer = std::vector<NodeView>;
NodeContainer nodes;
// ...
SPConfigurator::NodeContainer nodes = configurator->getNodes();
std::for_each(std::begin(nodes), std::end(nodes), [this](const NodeView & node) {
   std::string description = node.getInputAddressWithPort() + "\t" + node.getName() + "\t" + node.getOutputAddressWithPort();
   QString logEntry = QString::fromStdString(description);
   ui->LogView->appendPlainText(logEntry);
});

A small thing, but it makes for better looking code, in my opinion. The alias declaration with using also works with templates (the alias can itself be a template), whereas the C style typedef cannot.
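As a small sketch of that template point: an alias declared with using can take template parameters of its own. The names NamedPairs and makeCounts are made up for this illustration.

```cpp
#include <map>
#include <string>
#include <utility>

// An alias template: the value type is left open as a parameter.
// This cannot be written with typedef alone.
template <typename T>
using NamedPairs = std::map<std::string, std::pair<T, T>>;

// Hypothetical usage of the alias with int values.
NamedPairs<int> makeCounts() {
   NamedPairs<int> counts;
   counts["queue-a"] = {1, 2};
   return counts;
}
```

The same alias can then be used as NamedPairs<double>, NamedPairs<long> and so on, without repeating the full std::map type each time.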

Algorithms

In a recent post, I mentioned algorithms like std::iota and std::shuffle, useful in generating test data. When handling containers (vectors, lists), the “old way” is to use either indexes or iterators to handle the items. Implementing these carelessly may lead to bugs. The better alternative is to use algorithms from the standard library, which are readily available, rigorously tested and written with performance in mind. Here is an example from a small tool app I recently made, which checks whether id values read from one file are contained in lines read from another file:

std::for_each(std::begin(indexes), std::end(indexes), [&matchCount, &dataEntries, &output](const std::string & index) {
   std::any_of(std::begin(dataEntries), std::end(dataEntries), [&matchCount, &index, &output](const std::string & dataEntry) {
      if (dataEntry.find(index) != std::string::npos) {
         *output << matchCount+1 << "   " << dataEntry << std::endl;
         matchCount++;
         return true; // Not returning from the app but from the lambda function.
      }
      return false;   // Not returning from the app but from the lambda function.
   });
});

std::for_each replaces loops created by using iterators (or indexes to the container), and when some additional logic is needed, std::any_of is a nice solution to end the search when a match is found.
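Here is a self-contained variant of the same pattern that can be compiled and tried as is. The function name firstMatches and collecting the matches into a vector (instead of writing them to an output stream) are assumptions for this sketch.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// For each id, finds the first data line containing it and collects that line.
// std::any_of stops scanning the lines as soon as the lambda returns true.
std::vector<std::string> firstMatches(const std::vector<std::string> & ids,
                                      const std::vector<std::string> & lines) {
   std::vector<std::string> matches;
   std::for_each(std::begin(ids), std::end(ids), [&](const std::string & id) {
      // The result only tells whether a match existed; it is not needed here.
      (void)std::any_of(std::begin(lines), std::end(lines), [&](const std::string & line) {
         if (line.find(id) != std::string::npos) {
            matches.push_back(line);
            return true;    // stop: first match for this id was found
         }
         return false;      // keep looking
      });
   });
   return matches;
}
```

Note that only the first matching line per id is collected, since std::any_of ends the inner search at the first true.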

A slightly more complicated example uses std::find_if, std::all_of and a boolean predicate object that assists the search when calling std::find_if. In this example (full source code is here), a composite design pattern is implemented for handling hierarchical key-value pairs. The code sample below implements removing a specific key-value pair from the object hierarchy.

/**
 A helper struct to assist in finding an Entity with a given name. Used
 in EntityComposite::remove(const std::string &) to find an Entity with a given name.
 */
struct ElementNameMatches {
   ElementNameMatches(const std::pair<std::string,std::string> & nameValue) {
      searchNameValue = nameValue;
   }
   std::pair<std::string,std::string> searchNameValue;
   bool operator() (const Entity * e) {
      return (e->getName() == searchNameValue.first && e->getValue() == searchNameValue.second);
   }
};

/**
 Removes and deletes a child entity from this Entity.
 If the child is not an immediate child of this entity, then it is given
 to the children to be removed from there, if it is found.
 If the child is a Composite, removes and deletes the children too.
 @param nameValue A child with the equal name and value properties to remove from this entity.
 @return Returns true if the entity was removed, otherwise false.
 */
bool EntityComposite::remove(const std::pair<std::string,std::string> & nameValue) {
   bool returnValue = false;
   auto iter = std::find_if(children.begin(), children.end(), ElementNameMatches(nameValue));
   if (iter != children.end()) {
      Entity * entity = *iter;
      children.remove(*iter);
      delete entity;
      returnValue = true;
   } else {
      // child was not an immediate child. Check if one of the children (or their child) has the child.
      // Use a lambda function to go through the children to find and delete the child.
      // std::all_of can be stopped when the child is found by returning false from the lambda.
      std::all_of(children.begin(), children.end(), [nameValue, &returnValue](Entity * entity) {
         if (entity->remove(nameValue)) {
            returnValue = true;
            return false;
         } else {
            return true;
         }
      });
   }
   return returnValue;
}

// And then call remove() like this, for example, with 
// key "customer", and value "Antti Juustila":
newComposite->remove({"customer", "Antti Juustila"});

What you get is more robust code, without the bugs you might introduce yourself in “custom” loops with indexes and iterators.
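The predicate struct can also be replaced with a lambda. Below is a minimal sketch of that using a simplified stand-in for the Entity class; Item and removeFirst are hypothetical names for this illustration, and the sketch assumes the children are kept in a std::list of plain values rather than pointers.

```cpp
#include <algorithm>
#include <list>
#include <string>

// Simplified stand-in for the Entity class; hypothetical, for illustration only.
struct Item {
   std::string name;
   std::string value;
};

// Removes the first item whose name and value both match.
// Returns true if an item was removed, otherwise false.
bool removeFirst(std::list<Item> & items, const std::string & name, const std::string & value) {
   auto iter = std::find_if(items.begin(), items.end(), [&](const Item & item) {
      return item.name == name && item.value == value;
   });
   if (iter == items.end()) {
      return false;
   }
   items.erase(iter);   // erase by iterator, no second search needed
   return true;
}
```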

std::variant from C++17

What if your app has some data that can be manipulated in two formats? For example, first you get the data from the network in JSON, and then later you parse the JSON string and create an application specific object holding that parsed data. Later on, you again export the data from the internal object type to JSON to be sent over the network.

You could implement this so that you have both the JSON/string object and the application internal class object in memory. Then just add logic to know which currently has the data and should be used, and ignore the other variable until it is needed. An alternative is to use the good old union to handle this, if you want to save memory. This could be quite complicated to implement.

C++17 provides a better managed option: std::variant. When using a union, you have to keep track of what the union contains yourself, but a variant knows which type of object it is currently holding, and you can check that.

Following the scenario above, a class could have a member variable holding the JSON in a string, or alternatively, after parsing it, in an application specific object, within a unique pointer assisting with memory management:

std::variant<std::string, std::unique_ptr<DataItem>> payload;

In the class containing the payload member variable, you can initialise it to an empty string:

Package::Package()
: payload("")

Then you can provide setters to change from one representation of the data to another:

// Set the data to be a JSON string:
void Package::setPayload(const std::string & d) {
   payload = d;
}
// ...or a DataItem object, parsed from the string:
void Package::setPayload(std::unique_ptr<DataItem> item) {
   payload = std::move(item);
}

When you access the data to use it somewhere, you can check what is actually stored in the variant and return it. If the representation is not the one requested, return an empty value or null pointer to indicate to the caller that the requested representation of the data is not available currently:

// Get the string, using std::get_if:
const std::string & Package::getPayloadString() const {
   auto item = std::get_if<std::string>(&payload);
   if (item) {
      return *item;
   }
   return emptyString;
}
// Get the DataItem, using std::get_if
const DataItem * Package::getPayloadObject() const {
   auto item = std::get_if<std::unique_ptr<DataItem>>(&payload);
   if (item) {
      return item->get();
   }
   return nullptr;
}
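The pieces above can be tried out in isolation with a sketch like the following. The DataItem struct here is a made up minimal stand-in (the real class is not shown in this post), and the Payload alias and describe function are likewise hypothetical names for this illustration.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <variant>

// Hypothetical minimal DataItem; the real class is not shown in the post.
struct DataItem {
   std::string name;
};

using Payload = std::variant<std::string, std::unique_ptr<DataItem>>;

// Describes what the variant currently holds.
std::string describe(const Payload & payload) {
   if (std::holds_alternative<std::string>(payload)) {
      return "json:" + std::get<std::string>(payload);
   }
   // Not a string, so it must be the other alternative.
   const auto & item = std::get<std::unique_ptr<DataItem>>(payload);
   if (item) {
      return "object:" + item->name;
   }
   return "object:(null)";
}
```

std::holds_alternative answers the question “what is in there right now?” without touching the value, while std::get_if, used in the getters above, combines the check and the access in one call.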

Next I’d like to take a look at how to use the new async programming features of C++, as well as the Boost asio library…

Remember to join

When I forget to join (or detach) a C++ std::thread, the app crashes with SIGABRT(6) on macOS. And obviously the stack dump does not tell me what is going on. So I hunt the bug for some hours, digging through the log files, then finally remember that I just implemented a thread…

***** FATAL SIGNAL RECEIVED *******
Received fatal signal: SIGABRT(6)	PID: 17403

***** SIGNAL SIGABRT(6)

*******	STACKDUMP *******
	stack dump [1]  1   libg3logger.1.3.2-80.dylib          0x0000000100cb2163 _ZN12_GLOBAL__N_113signalHandlerEiP9__siginfoPv + 83
	stack dump [2]  2   libsystem_platform.dylib            0x00007fff72f73b1d _sigtramp + 29
	stack dump [3]  3   ???                                 0x0000000000000400 0x0 + 1024
	stack dump [4]  4   libsystem_c.dylib                   0x00007fff72e49a1c abort + 120
	stack dump [5]  5   libc++abi.dylib                     0x00007fff6fef2bc8 __cxa_bad_cast + 0
	stack dump [6]  6   libc++abi.dylib                     0x00007fff6fef2ca6 _ZL28demangling_terminate_handlerv + 48
	stack dump [7]  7   libc++abi.dylib                     0x00007fff6feffda7 _ZSt11__terminatePFvvE + 8
	stack dump [8]  8   libc++abi.dylib                     0x00007fff6feffd68 _ZSt9terminatev + 56
	stack dump [9]  9   BasicInfoGUI                        0x0000000100af1ffa _ZN11OHARStudent14StudentHandler8readFileEv + 42
	stack dump [10]  10  BasicInfoGUI                        0x0000000100af28df _ZN11OHARStudent14StudentHandler7consumeERN8OHARBase7PackageE + 2223
	stack dump [11]  11  BasicInfoGUI                        0x0000000100a6d101 _ZN8OHARBase13ProcessorNode14passToHandlersERNS_7PackageE + 897
	stack dump [12]  12  BasicInfoGUI                        0x0000000100a6a69e _ZN8OHARBase13ProcessorNode10threadFuncEv + 2462
	stack dump [13]  13  BasicInfoGUI                        0x0000000100a79061 _ZNSt3__1L8__invokeIMN8OHARBase13ProcessorNodeEFvvEPS2_JEvEEDTcldsdeclsr3std3__1E7forwardIT0_Efp0_Efp_spclsr3std3__1E7forwardIT1_Efp1_EEEOT_OS6_DpOS7_ + 113
	stack dump [14]  14  BasicInfoGUI                        0x0000000100a78f6e _ZNSt3__1L16__thread_executeINS_10unique_ptrINS_15__thread_structENS_14default_deleteIS2_EEEEMN8OHARBase13ProcessorNodeEFvvEJPS7_EJLm2EEEEvRNS_5tupleIJT_T0_DpT1_EEENS_15__tuple_indicesIJXspT2_EEEE + 62
	stack dump [15]  15  BasicInfoGUI                        0x0000000100a78796 _ZNSt3__114__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEMN8OHARBase13ProcessorNodeEFvvEPS8_EEEEEPvSD_ + 118
	stack dump [16]  16  libsystem_pthread.dylib             0x00007fff72f7ed36 _pthread_start + 125
	stack dump [17]  17  libsystem_pthread.dylib             0x00007fff72f7b58f thread_start + 15

Exiting after fatal event  (FATAL_SIGNAL). Fatal type:  SIGABRT
Log content flushed sucessfully to sink

So I am writing this down, just to remember next time. Join or detach. Join or detach….

   void StudentHandler::readFile() {
      std::thread( [this] {
         StudentFileReader reader(*this);
         using namespace std::chrono_literals;
         std::this_thread::sleep_for(50ms);
         reader.read(node.getDataFileName());
      }).join();
   }
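The reason for the abort is that the destructor of a std::thread that is still joinable calls std::terminate, which is what shows up in the stack dump above. One way to avoid forgetting is a small wrapper that joins in its own destructor. The sketch below is only an illustration; JoiningThread and runGuarded are made up names, not part of my app. C++20 also offers std::jthread, which joins automatically.

```cpp
#include <atomic>
#include <thread>
#include <utility>

// Hypothetical RAII wrapper: joins the thread in its destructor if it is
// still joinable, so a forgotten join()/detach() cannot reach std::terminate.
class JoiningThread {
public:
   explicit JoiningThread(std::thread t) : thread(std::move(t)) {}
   ~JoiningThread() {
      if (thread.joinable()) {
         thread.join();
      }
   }
   JoiningThread(const JoiningThread &) = delete;
   JoiningThread & operator=(const JoiningThread &) = delete;
private:
   std::thread thread;
};

// Runs a small task on a guarded thread and returns its result.
int runGuarded() {
   std::atomic<int> result{0};
   {
      JoiningThread worker(std::thread([&result] { result = 42; }));
   }  // the destructor joins here, before result is read
   return result.load();
}
```

This only covers the join case. If the thread should keep running independently, it still has to be detached explicitly.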

Edit: Actually, I changed join() to detach(). With join(), the calling thread waited for this file reading thread to finish before continuing to handle incoming data from the network. The file was read into memory in full, and only then was data from the network combined with data from the file and sent ahead to the next node in the network. With 5000 test records, all of them were held in memory waiting for the data from the network to arrive when using join().

When I switched to detach(), the calling thread could continue reading data from the network while the file reading thread was simultaneously reading data from the file. Whenever either of these threads found a match in a list, the data was combined and sent ahead to the next node in the network. So with join(), a maximum of 5000 records were held in memory all the time, whereas with detach(), about 1300-1800 records were held in memory at most, because combined data could be sent ahead to the next node in the network and discarded from this node. That is a significant change in the amount of memory the nodes use. So it does matter which you use, depending on the purpose of the threads in your app.