Generating test data with C++

The last time I held the Software Architectures course, I wanted to demonstrate to students how to test the performance and reliability quality attributes of a distributed software system. The system already had a batch-processing feature that reads data files and processes the data in the networked nodes. All I needed was data files with thousands of records for the system to read and process, so I implemented a small tool to generate this test data.

First of all, when generating thousands of records, I wanted to preallocate the necessary buffers so that no unnecessary allocations happen during data creation, which makes data generation faster:

std::vector<int> generatedStudentNumbers;
if (verbose) std::cout << "Creating numbers for students..." << std::endl;
generatedStudentNumbers.resize(studentCount);

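resize() allocates a big enough vector for the data and fills it with studentCount zero-initialized elements, which are then overwritten with the real values. For creating student numbers (int), std::iota and std::shuffle are quite useful: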

// Generate student numbers starting from one to studentCount.
std::iota(generatedStudentNumbers.begin(), generatedStudentNumbers.end(), 1);
// Shuffle the numbers randomly.
std::shuffle(generatedStudentNumbers.begin(), generatedStudentNumbers.end(), std::mt19937{std::random_device{}()});

std::iota fills the container with consecutive values, starting from 1 in this case. std::shuffle puts the numbers in random order. Voilà, you have a long vector of randomly ordered student numbers you can use in the data generation with only four lines of code!
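
As a self-contained illustration of the same idea, here is a minimal sketch that wraps the steps into one function. The function name makeStudentNumbers and the explicit seed parameter are assumptions made for this example, not part of the original tool. A fixed seed is one possible way to make a generated data set repeatable between test runs.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper (not from the original tool): returns the numbers
// 1..studentCount in random order. The same seed gives the same order.
std::vector<int> makeStudentNumbers(std::size_t studentCount, unsigned int seed) {
    std::vector<int> numbers(studentCount);          // allocate and size in one step
    std::iota(numbers.begin(), numbers.end(), 1);    // 1, 2, ..., studentCount
    std::mt19937 engine(seed);                       // seeded engine for repeatability
    std::shuffle(numbers.begin(), numbers.end(), engine);
    return numbers;
}
```

Calling makeStudentNumbers(1000, 42) twice should give the same sequence both times, whereas seeding from std::random_device, as in the snippet above, gives a different order on each run.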

Next, I needed random names for the students in the data set. For that, I needed a vector of names from which to pick one at random when creating each student record:

std::vector<std::string> firstNames;
firstNames = {"Antti", "Tiina", "Pentti", "Risto", "Päivi", "Jaana", "Jani", "Esko", "Hanna", "Oskari"};

// Initialize the random engine
std::random_device rd;
std::default_random_engine generator(rd());

// Generate a random int from the closed range [0, maxValue]
int generateInt(int maxValue) {
   std::uniform_int_distribution<int> distribution(0, maxValue);
   return distribution(generator);
}

// Pick one random name (firstNames must not be empty)
const std::string & getFirstName() {
   int index = generateInt(static_cast<int>(firstNames.size()) - 1);
   return firstNames[index];
}

The generateInt() helper function is used to pick a random name from the firstNames vector. The same procedure was used to generate a last name and the study program name for the student. All these pieces of information were then stored into a record, basically a tab-separated std::string. Records, in turn, were kept in a vector of strings.
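
The record building step is not shown above, so here is a rough sketch of how it could look. The function names pickRandom and makeRecord, the field order, and passing the name lists and the random engine as parameters are all assumptions for this example. The original tool's record layout may differ.

```cpp
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper: returns a random element of values.
// Precondition: values must not be empty.
const std::string& pickRandom(const std::vector<std::string>& values,
                              std::mt19937& engine) {
    std::uniform_int_distribution<std::size_t> distribution(0, values.size() - 1);
    return values[distribution(engine)];
}

// Hypothetical record layout (an assumption):
// studentNumber <TAB> firstName <TAB> lastName <TAB> studyProgram
std::string makeRecord(int studentNumber,
                       const std::vector<std::string>& firstNames,
                       const std::vector<std::string>& lastNames,
                       const std::vector<std::string>& programs,
                       std::mt19937& engine) {
    std::string record = std::to_string(studentNumber);
    record += '\t';
    record += pickRandom(firstNames, engine);
    record += '\t';
    record += pickRandom(lastNames, engine);
    record += '\t';
    record += pickRandom(programs, engine);
    return record;
}
```

Taking the engine as a parameter, instead of using a global generator, is a design choice made here only to keep the example easy to test with a fixed seed.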

What is left is storing the test data records into a file:

std::ofstream datafile(fileName, isFirstRound ? std::ios::trunc : std::ios::app);

// Shuffle the records randomly.
std::shuffle(buffer.begin(), buffer.end(), std::mt19937{std::random_device{}()});
auto save = [&datafile](const std::string & entry) { if (entry.length() > 0) datafile << entry << std::endl; };
std::for_each(buffer.begin(), buffer.end(), save);
datafile.close();

After opening the file stream, std::shuffle is again used to put the records into random order. The save lambda defines what saving a record means: a non-empty entry is written to the std::ofstream on its own line. Passing this lambda to std::for_each applies it to each of the data records.

Finally, I made the data generator tool configurable with command line parameters, using Sarge:

Sarge sarge;
sarge.setUsage("./GenerateTestData -[hv]s <number> [-e <number>]");
sarge.setDescription("A test data generator for StudentPassing system. (c) Antti Juustila, 2019.\nUses Sarge Copyright (c) 2019, Maya Posch All rights reserved.");
sarge.setArgument("h", "help", "Display help for using GenerateTestData.", false);
sarge.setArgument("v", "verbose", "Display detailed messages of test data generation process.", false);
sarge.setArgument("s", "students", "Number of students to generate in test data files.", true);
sarge.setArgument("e", "exercises", "Number of exercises generated, default is 6 if option not provided.", true);
sarge.setArgument("b", "bufsize", "Size of the buffer used in generating data", true);

I used the test data generator tool to generate up to 10 000 records, and used those test data files to see, and to demonstrate to students, how the system manages high data throughput and with what performance. It was also interesting to see where the performance bottlenecks in the system were.

Next year (the last time I teach this course) I’ll demonstrate how to use a thread to read the data files while another thread is reading test data from the network at the same time. Whether you use std::thread::join() or std::thread::detach() has a large impact on how the networking and data file reading threads cooperate.
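
To give an idea of the join() side of that comparison, here is a minimal sketch. It is not the course system's code: the function name runBothReaders is hypothetical, and the two loops only stand in for reading records from a file and from the network.

```cpp
#include <atomic>
#include <thread>

// Hypothetical sketch: two worker threads count "records" into a shared
// atomic counter. The loops are placeholders for real file and network reads.
int runBothReaders(int fileRecords, int networkRecords) {
    std::atomic<int> processed{0};

    std::thread fileReader([&processed, fileRecords] {
        for (int i = 0; i < fileRecords; ++i) {
            ++processed;   // stands in for processing one record from a file
        }
    });
    std::thread networkReader([&processed, networkRecords] {
        for (int i = 0; i < networkRecords; ++i) {
            ++processed;   // stands in for processing one record from the network
        }
    });

    // join() blocks until each thread has finished, so the counter is
    // complete and still alive when it is read below.
    fileReader.join();
    networkReader.join();
    return processed.load();
}
```

With detach() instead of join(), the function could return while the threads are still running, and they would then touch the local counter after it has been destroyed. That is one reason the choice between the two matters.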
