Multiple Git repository status awareness & grading app – updates

In last week’s post I mentioned that I’ve added new features to the tool I use in the Data Structures and Algorithms course to supervise, support, test and grade the students’ programming course work. I’ll describe the new features here and provide some (hopefully) interesting implementation details about them.

Some background about the course that helps in understanding the role of this tool:

  • students each have their own GitHub or GitLab repository, into which they fork the initial state of the course work from the teacher’s repository,
  • the repository contains the task instructions, skeleton code to implement, unit test code, and other supporting code given ready for the students,
  • the repository’s tests enable testing the student’s implementations of the data structures and algorithms they are required to implement in the course,
  • teachers have access to the student repositories to assist students with any issues, and finally to grade the work.

Before going into details, a comment about the motivation for this tool to exist in the first place: the course the tool is used in has 300+ enrolled students and only two full-time teachers (sometimes we have 1-2 assistants with a small number of hours, like 50 hours in total for the whole Autumn), who are simultaneously teaching 2-3 other 200-300 student courses. That totals around 900+ active students taught simultaneously, so providing individual support and attention to everyone is practically impossible, at least without taking automation (and other support tools) into use.

So the critical thing in managing all this work is that we need automated tools to:

  • keep track of how the students are progressing in the course and help those who desperately need it, and
  • after the last deadline, evaluate and grade the students’ programming submissions as fast as possible.

The grading must be done within three weeks of the last deadline, so it cannot be done manually in that time. The aim is to use this tool during the course, trying to catch those students who are not doing OK, and then to contact and help them in getting their coursework going, spotting any possible mistakes in their work so that they can correct them before the final deadline.

Next Fall, students will receive the same unit tests and the same code analysis tools (as Java unit tests) that they can also use themselves. However, this is not enough, especially if the students do not contact the teachers about their issues soon enough, or at all. A problem that is too common, unfortunately.

The new features added to the teachers’ tool during this Winter and Spring are:

  1. Automatic source code analysis:
    • results of the source code analysis are stored in a database table; the repository table has a column where problems found in the code are flagged,
    • the tool checks for Java imports that are not allowed, notifies about this, and sets an error flag in the database table,
    • the tool also uses the Antlr parser generator to check whether method implementations in the source code violate any rules or fail to do the things they should do.
  2. Remote repo checks for both GitHub and GitLab:
    • checks that the remote is not public, a rule of the course,
    • checks that the remote has only allowed contributors.
  3. Testing and grading enhancements:
    • tests are now grouped into grade-level tests,
    • when all tests in a grade group pass, the grade is stored in the repository table,
    • grade tests can be executed in any order, since the grade value is a bit flag,
    • tests for a certain student repository can be disabled if they cause issues, like never finishing or taking far too much time,
    • if git pull updates the code, the grade is set to zero and the test results are cleared.
  4. Status reporting:
    • the tool now generates a textual status report that can be sent to the student by email, listing which tests pass, whether any forbidden features are used or required features are missing, git statistics, etc.

In this post, let’s take a closer look at the code analysis done by the tool. The other features are left for possible future posts.

Analyzing whether the source code uses “forbidden” imports was implemented with simple text parsing techniques. The tool goes through the Java source files in the directory where the student code is located, reads the .java files, and checks if the source uses imports that are not allowed.

For example, when implementing a Stack data structure, the student must not import java.util.Stack and simply use it, instead of implementing their own stack from scratch using an array as the internal data structure. Or when sorting an array, java.util.Arrays must not be imported and Arrays.sort must not be used, because the student should use their own implementation of the sorting algorithms.

These kinds of mistakes can be found by analyzing the imports in the student’s .java files. You cannot call Arrays.sort without having import java.util.Arrays in the .java file (strictly speaking, a fully qualified call like java.util.Arrays.sort() needs no import, but that can be caught with the code rules described later).

The tool first reads a rules file that contains the import rules for the course:

[allowedImports]
*,import java.util.Comparator
*,import java.util.function.Predicate
*,import oy.interact.tira.
*,import java.lang.Math;
*,import java.util.concurrent.atomic.AtomicInteger;
CodeWordsCounter,import java.io.
CodeWordsCounter,import java.nio.
SetImplementation,import java.util.Iterator

For example, java.util.Comparator may be used by the students in all of their .java files (*). The same goes for java.util.function.Predicate and everything in the package oy.interact.tira (the course source packages). The class SetImplementation may also import java.util.Iterator.

Et cetera. As you can see, this is an allow list, not a deny list. The imports that are allowed in the course are small in number, so it is easier to maintain an allow list than a deny list.

Reading the allow list and analyzing the source files was straightforward to implement with simple file handling and text parsing. First, the tool reads the rules file and parses the import rules into this structure:

fileprivate
struct SourceRequirementsSpecification {
   var allowedImports: [String: [String]] = [:]
}

The Swift dictionary allowedImports has the class as the key (either * or the name of the class the rule is about), and the value is the array of imports that class may contain.

Then the tool starts reading the student’s .java files and handles all lines beginning with import, checking whether that class may import that specific thing or not. If a .java file imports something that is not allowed, a specific bit (a “flag”) is set to 1 (true) in an integer column of the database table row for the student’s repository. This indicates that the student’s code imported something it was not supposed to.
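
As a sketch (my simplification for this post, not the tool’s actual code), the check for a single import line could look like this:

// A simplified sketch of the import check, assuming the parsed rules are in
// the SourceRequirementsSpecification structure shown above.
func isImportAllowed(_ importLine: String, inClass className: String,
                     spec: SourceRequirementsSpecification) -> Bool {
   // Imports allowed everywhere (*) plus imports allowed for this specific class.
   let allowed = (spec.allowedImports["*"] ?? []) + (spec.allowedImports[className] ?? [])
   // A rule matches when the import line begins with an allowed prefix, so
   // "import oy.interact.tira." also allows "import oy.interact.tira.util.Pair;".
   return allowed.contains { importLine.hasPrefix($0) }
}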

This is the data structure keeping track of these repository flags, whether they are set or not:

struct Flags: OptionSet {
   typealias RawValue = Int16
   var rawValue: Int16

   static let remoteFails = Flags(rawValue: 1 << 0)
   static let buildFails = Flags(rawValue: 1 << 1)
   static let usesForbiddenFeatures = Flags(rawValue: 1 << 2)
   static let repositoryIsPublic = Flags(rawValue: 1 << 3)
   static let repositoryHasOutsideCollaborators = Flags(rawValue: 1 << 4)
   static let repositoryTestingIsOnHold = Flags(rawValue: 1 << 5)
}

The Swift OptionSet is a protocol that presents a mathematical set interface to a bit set. Each flag is a single bit in the 16-bit integer, either off (0) or on (1). So instead of having six boolean columns in the database table, I can have one 16-bit integer column storing up to 16 different flags, most of which are still available for future needs.
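
As an illustration (my example, not code from the tool), the flags are then manipulated like members of a set:

var flags: Flags = []                    // no flags set, raw value is 0
flags.insert(.usesForbiddenFeatures)     // set one bit
flags.insert(.buildFails)
let failed = flags.contains(.buildFails) // true
flags.remove(.buildFails)                // clear the bit again
let raw = flags.rawValue                 // the Int16 value stored in the database column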

The flag usesForbiddenFeatures is set in this case, when student code uses forbidden imports (or other forbidden features, discussed below). The other flags track whether:

  • the remote repository cannot be accessed at all or is empty (remoteFails),
  • compiling the code fails (buildFails),
  • the repository is public even though it must be private (repositoryIsPublic), and finally
  • the repository has collaborators other than the student owning the repo and the course teachers (repositoryHasOutsideCollaborators).

These are all issues that we have encountered in the course during previous years.

The final flag, repositoryTestingIsOnHold, can be set manually from the tool’s user interface for repositories that mess up the testing, either because the tests get stuck or take far too much time; their tests should not be executed before the repository is fixed.

The flags struct’s raw value is a 16-bit integer, the data type of the database column storing the flags for each student repository. If the code has issues related to these rules, the corresponding flag is set. When the git repository is updated from the remote, the user needs to run the code analysis again. If a rule violation was fixed, the flag is then reset to zero (false), provided no other violations are found.

Checking the imports from the source code is not enough, though.

Sometimes students make mistakes in method implementations. For example, when implementing a recursive binary search, they implement an iterative binary search. Or when sorting a table in descending order, they first sort in ascending order and then reverse the table. Since time efficiency is the major topic of the course, this is not what they are supposed to do. Instead, they should use a comparator to sort the table in descending order in the first place; then no reverse operation is needed at all, making the operation faster.

For checking these kinds of issues the import rules are not enough, since these mistakes can be made without importing anything. Therefore, the same rules file also contains rules like this:

[codeRules]
# Algorithms.binarySearchRecursive must contain word binarySearchRecursive
Algorithms.binarySearchRecursive,must,binarySearchRecursive
# Algorithms.binarySearchRecursive must not contain word while
Algorithms.binarySearchRecursive,mustnot,while
Algorithms.binarySearchIterative,must,while
CodeWordsCounter.topCodeWords,must,Algorithms.fastSort
CodeWordsCounter.topCodeWords,mustnot,Algorithms.reverse

Here, lines beginning with # are comments and are ignored by the tool. Illustrating the examples of mistakes above, the rule:

Algorithms.binarySearchRecursive,mustnot,while

specifies that the method binarySearchRecursive in the class Algorithms must not contain the keyword while. Breaking this rule implies that the student has implemented the recursive algorithm using an iterative approach. Therefore the learning goal of implementing a recursive binary search has not been met, and the code needs to be fixed.

These rules are also parsed from the rules file into the above-mentioned rules structure, which now looks like this:

enum Rule {
   case must
   case mustNot
}

struct CodeRule {
   let methodName: String
   let rule: Rule
   let stringToMatch: String
}

fileprivate
struct SourceRequirementsSpecification {
   var allowedImports: [String: [String]] = [:]
   var rules: [String: [CodeRule]] = [:] // ClassName : CodeRules
}

In addition to the allowedImports rules, the structure now also contains rules for a class (the Swift dictionary key String is the class name), where the CodeRule structure has the name of the method the rule is about, whether the method must or must not (Rule) contain the string, and the string to match.
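
Parsing one such rules line could be sketched like this (my simplification with error handling omitted; the tool’s own parsing code may differ):

// A sketch of parsing one [codeRules] line, for example
// "Algorithms.binarySearchRecursive,mustnot,while". Not the tool's actual code.
func parseCodeRule(from line: String) -> (className: String, rule: CodeRule)? {
   let parts = line.split(separator: ",").map(String.init)
   guard parts.count == 3 else { return nil }
   // The first field is ClassName.methodName.
   let target = parts[0].split(separator: ".").map(String.init)
   guard target.count == 2 else { return nil }
   let rule: Rule = (parts[1] == "must") ? .must : .mustNot
   return (target[0], CodeRule(methodName: target[1], rule: rule, stringToMatch: parts[2]))
}

The result would then be appended to the rules dictionary under the class name key.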

The current implementation is very simple, and as I develop this further it may change into something different. For example, it should be possible to say that the recursive binary search algorithm must contain neither the while nor the for keyword.

The second example, realizing that descending order can be implemented with a specific comparator (instead of sorting in ascending order and then reversing, which is a more time-consuming operation), is checked by the rule that the method topCodeWords in the class CodeWordsCounter (where the work of this coding task is done) must not call Algorithms.reverse.

If a student uses the Java Collections.reverse or Apache Commons ArrayUtils.reverse for this, that mistake is already caught by the import rules.

For implementing this feature, I used the Antlr parser generator. Antlr has existing grammars for many languages, including Java, and it also supports generating parsers for Swift. So all I needed to do was to get the Antlr tool and the Java grammar files (Java20Lexer.g4 and Java20Parser.g4) and then generate the Java parser and lexer for Swift:

$ antlr -Dlanguage=Swift -message-format gnu -o Autogen Java20Lexer.g4
$ antlr -Dlanguage=Swift -message-format gnu -o Autogen Java20Parser.g4

Those commands place the generated Swift source code files in a subdirectory named Autogen.

Using the generated Swift code from Autogen, the tool can then parse Java code. All I needed was to implement a class that extends the Antlr-generated Java20ParserBaseListener, give the rules to this class, and start parsing the source string read from the student’s .java file:

import Antlr4

class Java20MethodBodyChecker: Java20ParserBaseListener {

   private let file: String
   private let rules: [CodeRule]
   private var currentMethod: String = ""
   private var stream: ANTLRInputStream? = nil

   // Rule violations found during parsing are collected here.
   var violations: [String] = []

   func start(source: String) throws {
      stream = ANTLRInputStream(source)
      if let stream {
         // Lex and parse the Java source, then walk the parse tree;
         // the walker calls the overridden listener methods of this class.
         let lexer = Java20Lexer(stream)
         let parser = try Java20Parser(CommonTokenStream(lexer))
         let parserTree = try parser.compilationUnit()
         let walker = ParseTreeWalker()
         try walker.walk(self, parserTree)
      }
   }
// ...

Analysing the code happens in the overridden handler for Java method declarations, enterMethodDeclaration. There the implementation checks, using the rules, whether the method body contains any keywords that should not be there, or is missing keywords that should be there:

override func enterMethodDeclaration(_ ctx: Java20Parser.MethodDeclarationContext) {
   if let identifier = ctx.methodHeader()?.methodDeclarator()?.identifier() {
      currentMethod = identifier.getText()
      // Pick the rules that concern this method, if any.
      let ruleSet = rules.filter( { $0.methodName == currentMethod } )
      if !ruleSet.isEmpty {
         if let body = ctx.methodBody() {
            let bodyText = body.getText()
            for rule in ruleSet {
               switch rule.rule {
               case .must:
                  if !bodyText.contains(rule.stringToMatch) {
                     violations.append("\(file).\(rule.methodName) does not use \(rule.stringToMatch), even though it should")
                  }
               case .mustNot:
                  if bodyText.contains(rule.stringToMatch) {
                     violations.append("\(file).\(rule.methodName) uses \(rule.stringToMatch), even though it should not")
                  }
               }
            }
         }
      }
   }
}

The violations array then contains all the things found to be against the rules.

This worked, except for one thing: if the code has comments that do or do not contain the keywords specified in the rules, the code analysis checked those comments too. Even though the comments do not influence the implementation and are therefore totally irrelevant.

So if the comments of the recursive binary search implementation contain the keyword while, the tool would report that as a rule violation! Or if a required keyword appears only in the comments, not in the actual implementation, the analysis would be satisfied. End result: not satisfactory at all.

At first I started to fix this by writing code myself to remove the comments from the source String before the analysis, but soon I realized that the Antlr parsing tools should somehow already handle this.

What I did instead (after some searching on the internet) was to modify the Antlr lexer file for Java so that it does not parse the comments at all, but skips them:

// AJU: COMMENT: '/*' .*? '*/' -> channel(HIDDEN);
COMMENT : '/*' .*? '*/' -> skip;

// AJU: LINE_COMMENT: '//' ~[\r\n]* -> channel(HIDDEN);
LINE_COMMENT : '//' ~[\r\n]* -> skip;

Then I regenerated the Swift parser and lexer source files and added them to the tool project.

Using the new lexer/parser, the parsed Java code no longer includes comments, and I do not have to consider them at all! This works perfectly.

Without asking anything about any “AI” tools out there. As usual, I do not touch them even with a long stick. I prefer to discover and learn myself, as long as the internet is not polluted with “AI” generated crap which makes everything impossible.

Below you can see the tool reporting imports in a student repository that should not be in the code, and reporting that the code in CodeWordsCounter.topCodeWords does not call Algorithms.fastSort, even though it should, based on the rules in the rules file.

As you can see, the first violation occurs because the student has misnamed the class as setImplementation. In Java, class names (and therefore also class file names) should always begin with a capital letter, so the name should be SetImplementation. This is a mistake, but it does not actually violate the import rule; the Set data structure implementation may use the Iterator. Therefore, when the rules are found to be violated, human rechecking is often required to make sure no unfair judgments are made by the automation.

Considering the 300-something students in the course, with perhaps each of them making at least 3-5 mistakes the code analysis could find, this results in 900-1500 mistakes in total. Combing through the hundreds of lines of code in each of the student projects, repeatedly during the course, doing this “code quality control” manually is very time-consuming and therefore not a realistic task for the teachers. Especially considering the available teaching resources mentioned in the beginning.

Hopefully the new features of this tool, and the similar tests given to students as Java test code, will improve the support the teachers can provide during the course, leading the students onto the correct path of implementing the required data structures and algorithms.

The post is getting quite long, so the rest of the new features will perhaps be discussed in future posts. Unless I decide to focus on other more interesting topics. Have a nice day!

Multiple Git repository status awareness app

A year ago I started to implement a GUI app that I could use to track the 250-300+ student projects of my Data structures and algorithms course. The motivation for the tool was to answer the question:

How can you manage four courses with nearly 1000 students when you only have two full time teachers and two part-time supportive teachers to manage all teaching, evaluating and supporting the students in their learning in all of these courses?

Me

The plan was that, using this tool, I could more easily be aware of the status of the many student projects in this one course, and then, when necessary, do something about possible issues.

Like knowing that a student has fallen behind (no commits to the project in the last week or two), which gives me the possibility to ask if there is anything I as a teacher could do to help the student proceed in the course.

And if one or several of the unit tests of the various course programming tasks fail, especially if the task was supposed to be finished already, I can go and check the reason(s), either by asking the student or by taking a look at the student’s git repository myself.

Last year’s version required me to run shell scripts that automatically retrieved the students’ up-to-date code from their remote repositories using git pull. Output was directed to a log file that the app then parsed to view the content. Similarly, tests were executed in batch from shell scripts with Maven (mvn test), with results written to another log file. Again, the app would parse that log file to show in a graph how the tests succeeded or failed.

A while ago I decided that I would like to integrate all the functionality within the GUI app and ditch the separate shell scripts. I also decided that the information about the commits and test results would be saved into a database. This way, app launch is now much faster, since the app does not need to parse the various log files to show content every time it is launched.

Instead, when launched, the app reads the repository information, git log data, and the test results from a database file, so it starts with the data immediately visible. When I want to update the commit data for all student repositories, I just press the refresh button in the app toolbar, and the app starts to pull data for each repository from the remote repository URLs.

Anonymised student repositories with commit and test data.

When I want to update the test data, pressing another button (the hammer button in the toolbar) starts executing the tests that are due, saving the test results in the app database. I can also refresh an individual repo from the remote, or execute tests for a single project, to get a quick update on the status of that student’s project.

Until now I’ve focused on getting the tabular data visible. Yesterday, I implemented a commit grid view similar to the one you can see in a person’s GitHub profile. This color grid visualises the commits of all student repositories:

Commit grid visualisation.

In this grid, time proceeds vertically from the top towards the bottom of the grid view. Each repository is one thin column in the grid. The timeline starts from the date the course started, proceeding down.

The grid ends at the current date at the bottom. So as the course proceeds, the grid gets taller. Since the course displayed here started in September, the grid is 42 days “tall”. This gives me a very high-level visual awareness of the level of activity among all the students in the course.
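
How is such a grid drawn? Below is a rough sketch of the idea with SwiftUI’s Canvas and GraphicsContext (my simplification, not the app’s actual drawing code):

import SwiftUI

// A simplified commit grid: one thin column per repository, one row per
// day since the course started. Not the app's actual drawing code.
struct CommitGridSketch: View {
   // repositories[column][day] is true if that repository had a commit on that day.
   let repositories: [[Bool]]
   let cellSize: CGFloat = 4

   var body: some View {
      Canvas { context, _ in
         for (column, days) in repositories.enumerated() {
            for (day, hasCommit) in days.enumerated() {
               let cell = CGRect(x: CGFloat(column) * cellSize,
                                 y: CGFloat(day) * cellSize, // time runs from top to bottom
                                 width: cellSize - 1, height: cellSize - 1)
               context.fill(Path(cell), with: .color(hasCommit ? .green : Color.gray.opacity(0.2)))
            }
         }
      }
   }
}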

Obviously the usefulness of the tool depends on students pushing their commits to the remote repository often, preferably daily or at least weekly. If they do not push, we teachers cannot see the commits and this tool does not show them. As you can see, some of the repositories are in red, meaning there has not been a commit in over 14 days. Orange repositories indicate that the most recent commit is over a week old.

Red and orange colors may mean that the student has decided to do most of the course programming just before the deadline. A common, though not the best, choice. So seeing lots of red at this point does not necessarily mean big issues with passing the course. We’ll see…

I’ve been constantly trying to motivate the students to commit and push often. It is beneficial for them too – you get a backup of your work in the remote, and when you need the teachers’ assistance, the teachers have immediate access to the code from the remote.

Since the first real deadline of the course is two weeks away, it is time for me to start using the information from the tool:

  • I can easily send email to a student who has no recent commits by clicking on the (redacted) blue text where the student’s name is displayed – that initiates an email I can send to the student.
  • Clicking on the repository URL (also redacted) opens the browser and takes me to the remote repository web page of that student.
  • Another (redacted) blue link opens the local folder in Finder, so I can start looking at the student’s project directory on my machine.

The repositories in blue are all OK, and I (probably) do not need to be concerned with those, especially if the scheduled tests also pass. The purpose of the tool is to let me focus on those students who probably need attention to be able to pass the course in time.

Some implementation details:

  • Implemented on macOS using Xcode, Swift and SwiftUI.
  • Uses Core Data as the app database, with five database tables and ten relationships between the tables.
  • Executes git and mvn test commands in child processes using the Process class (see the sketch after this list).
  • SwiftUI Table used for tabular content.
  • The commit grid is drawn in a SwiftUI view using Canvas and GraphicsContext (a simplified sketch was shown in the commit grid section above).
  • Uses the SwiftUI redacted(reason:) function to hide student information for demonstration purposes.
  • Student and repository data is imported from a tab-separated text file exported from Moodle. There, students enter their information and the remote repository URL in a questionnaire form. The form data is then exported as a tsv file, which is imported into the app, and all student data is written to the database. No need to enter the student data manually.
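
As an example of the child process execution mentioned in the list above, here is a minimal sketch of pulling one repository with Foundation’s Process (my simplification; the app’s real code handles errors and reads output asynchronously):

import Foundation

// A minimal sketch of running "git pull" for one repository with Process.
// Not the app's actual code; error handling and async output reading omitted.
func runGitPull(in repositoryPath: String) throws -> String {
   let process = Process()
   process.executableURL = URL(fileURLWithPath: "/usr/bin/git")
   process.arguments = ["pull"]
   process.currentDirectoryURL = URL(fileURLWithPath: repositoryPath)

   let pipe = Pipe()
   process.standardOutput = pipe
   process.standardError = pipe

   try process.run()
   process.waitUntilExit()

   let data = pipe.fileHandleForReading.readDataToEndOfFile()
   return String(data: data, encoding: .utf8) ?? ""
}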

Coverage fight

Well, it was a fight, but I got code coverage working with VS Code / Java and the Coverage Gutters extension. I’m doing this for an exercise in preparation for next Fall’s course on Data Structures and Algorithms. Probably this will just be a demo, or something students may take a look at, not a required task.

Code coverage visible, YEAH!!

What I have in the Maven pom.xml currently is

<argLine>-Xmx2048m</argLine>

within the properties element, and in build/plugins:

<plugin>
    <artifactId>maven-failsafe-plugin</artifactId>
    <version>2.22.2</version>
    <configuration>
        <argLine>${argLine} -Xmx4096m -XX:MaxPermSize=512M ${itCoverageAgent}</argLine>
    </configuration>
</plugin>

and the whole jacoco plugin structure is here, under the build/plugins structure (with some commented out sections):

<plugin>
    <groupId>org.jacoco</groupId>
    <artifactId>jacoco-maven-plugin</artifactId>
    <version>0.8.8</version>
    <executions>
        <execution>
            <id>pre-unit-test</id>
            <goals>
                <goal>prepare-agent</goal>
            </goals>
            <configuration>
                <!-- Sets the path to the file which contains the execution data. -->
                <dataFile>${project.build.directory}/jacoco.exec</dataFile>
                <outputDirectory>${project.reporting.outputDirectory}/jacoco</outputDirectory>
                <propertyName>surefireArgLine</propertyName>
            </configuration>
        </execution>
        <!-- Ensures that the code coverage report for unit tests is created 
              after unit tests have been run. -->
        <execution>
            <id>post-unit-test</id>
            <phase>test</phase>
            <goals>
                <goal>report</goal>
            </goals>
            <configuration>
                <!-- Sets the path to the file which contains the execution data. -->
                <dataFile>${sonar.jacoco.reportPath}</dataFile>
                <!-- Sets the output directory for the code coverage report. -->
                <outputDirectory>${jacoco.path}</outputDirectory>
            </configuration>
        </execution>
        <execution>
            <id>default-report</id>
            <goals></goals>
            <configuration>
                <dataFile>${project.build.directory}/jacoco.xml</dataFile>
                <outputDirectory>${project.reporting.outputDirectory}/jacoco</outputDirectory>
            </configuration>
        </execution>
        <execution>
            <id>jacoco-check</id>
            <goals>
                <goal>check</goal>
            </goals>
            <configuration>
                <rules>
                    <rule>
                        <element>PACKAGE</element>
                        <limits>
                            <limit>
                                <counter>LINE</counter>
                                <value>COVEREDRATIO</value>
                                <minimum>0.0</minimum>
                            </limit>
                        </limits>
                    </rule>
                </rules>
            </configuration>
        </execution>
    </executions>
</plugin>

and finally, you need to build from the command line with this Maven command:

mvn jacoco:prepare-agent package jacoco:report

When viewing the code after this, you can see the code coverage colors in the code editor sidebar on the left.

Busy busy busy

I’ve been really busy evaluating student coursework in two courses. Some of this work was postponed in Feb/March, since I was participating in the international MSc program selection process then.

I have been considering posting shorter, news-like things, or just small snippets of work I have been doing. Just to keep this blog alive.

So here is a screenshot of something I did in between the course evaluations — potentially a new exercise to be used in the Data structures and algorithms course next Fall. A small utility that converts plain Finnish text to braille symbols.

Finnish epic Kalevala in braille.

The job for the students is to generate the braille version faster than the naive lookup-table looping implementation they get ready-made, by using hash tables.
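
Roughly, the idea is this (my sketch, not the exercise code): a hash table gives the braille symbol for each character in constant time, instead of looping through a lookup table for every character:

// A sketch of the idea, not the exercise code: map characters to braille
// symbols with a hash table (a Swift dictionary) in O(1) per character.
let braille: [Character: Character] = ["a": "⠁", "b": "⠃", "k": "⠅"] // and so on

func toBraille(_ text: String) -> String {
   String(text.lowercased().map { braille[$0] ?? $0 })
}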

Another thing I did was for a course on GUI design & programming:

Demo about automated GUI testing

The purpose of this stupid card guessing game was to demonstrate how to implement simple GUI test automation and execute the tests in a simulator environment.

That’s it for today. Now back to evaluating those student programming assignments…

Work work work

youtube.com/watch

Just checked the first commit date of the completely renovated Data structures and algorithms course repository. It was around December 2020.

Since then, I have designed and implemented about ten exercises with unit tests in Java, and two course projects with unit tests, plus about ten additional demos in Swift. All with commentary and instructions.

In a week the new course begins. I’ve recorded about 20 hours of new lecture material. Additional videos about graph programming with C++ will be included from last year’s demos.

One of the reasons for not posting much lately?

The video is about one of the course projects, implementing a graph data structure with various algorithms, including Dijkstra’s shortest path algorithm. I thought applying this in a game context would perhaps motivate the students to delve into the world of data structures and algorithms. We’ll see…

Generating test data with C++

The last time I held the Software Architectures course, I wanted to demonstrate to students how to test the performance and reliability quality attributes of a distributed software system. The system already had a feature to process data as a batch, by reading data files and processing that data in the networked nodes. All I needed to do was to generate data files with thousands of records to read and process in the system. I implemented a small tool app to generate this test data.

First of all, when generating thousands of records, I wanted to preallocate the necessary buffers to make sure that no unnecessary buffer allocations happen during data creation, making data generation faster:

std::vector<int> generatedStudentNumbers;
if (verbose) std::cout << "Creating numbers for students..." << std::endl;
generatedStudentNumbers.resize(studentCount);

resize() allocates a big enough vector for the data. For creating student numbers (int), std::iota and std::shuffle are quite useful:

// Generate student numbers starting from one to studentCount.
std::iota(generatedStudentNumbers.begin(), generatedStudentNumbers.end(), 1);
// Shuffle the numbers randomly.
std::shuffle(generatedStudentNumbers.begin(), generatedStudentNumbers.end(), std::mt19937{std::random_device{}()});

std::iota fills the container with consecutive values, starting from 1 in this case. std::shuffle puts the numbers in random order. Voilà, you have a long vector of randomly ordered student numbers to use in the data generation, with only four lines of code!

Next, I needed random names for the students in the data set. For that, I needed a vector of names, from which a name is picked randomly when creating the student records:

std::vector<std::string> firstNames;
firstNames = {"Antti", "Tiina", "Pentti", "Risto", "Päivi", "Jaana", "Jani", "Esko", "Hanna", "Oskari"};

// Initialize the random engine
std::random_device rd;
std::default_random_engine generator(rd());

// Generate a random int from a range
int  generateInt(int maxValue) {
   std::uniform_int_distribution<int> distribution(0,maxValue);
   return distribution(generator);
}

// Pick one random name
const std::string & getFirstName() {
   int index = generateInt(firstNames.size()-1);
   return firstNames[index];
}

The generateInt() helper function is used to get a random name from the firstNames vector. The same procedure was used to generate a last name and the study program name for the student. All these pieces of information were then stored into a record, basically a tab-separated std::string. The records, in turn, were contained in a vector of strings.

What is then left is storing the test data records into a file:

std::ofstream datafile(fileName, isFirstRound ? std::ios::trunc : std::ios::app);

// Shuffle the records randomly.
std::shuffle(buffer.begin(), buffer.end(), std::mt19937{std::random_device{}()});
auto save = [&datafile](const std::string & entry) { if (entry.length() > 0) datafile << entry << std::endl; };
std::for_each(buffer.begin(), buffer.end(), save);
datafile.close();

After opening the file stream, std::shuffle is again used to put the data into random order. The save lambda function then defines what saving a record means, and this lambda is passed to std::for_each to tell what to do with each of the data records: save them into the std::ofstream.

Finally I made the data generator tool configurable with command line parameters, using Sarge:

Sarge sarge;
sarge.setUsage("./GenerateTestData -[hv]s <number> [-e <number>]");
sarge.setDescription("A test data generator for StudentPassing system. (c) Antti Juustila, 2019.\nUses Sarge Copyright (c) 2019, Maya Posch All rights reserved.");
sarge.setArgument("h", "help", "Display help for using GenerateTestData.", false);
sarge.setArgument("v", "verbose", "Display detailed messages of test data generation process.", false);
sarge.setArgument("s", "students", "Number of students to generate in test data files.", true);
sarge.setArgument("e", "exercises", "Number of exercises generated, default is 6 if option not provided.", true);
sarge.setArgument("b", "bufsize", "Size of the buffer used in generating data", true);

I used the test data generator tool to generate up to 10 000 records and used those test data files to see, and to demonstrate to students, how the system manages high data throughput and with what performance. It was also interesting to see where the performance bottlenecks in the system were.

Next year (the last time I teach this course) I’ll demonstrate how to use a thread to read the data files while, at the same time, reading test data from the network in another thread. Whether the networking and data file reading threads are controlled with std::thread’s join() or detach() has a large impact on how they cooperate.