Occurrence of Unicode code points used in Project Gutenberg texts

I think it was while having my morning coffee and loitering around the internets that I saw this discussion somewhere:

“Here’s something that would be really useful for PG development going forward: a count of the occurrences of Unicode code points used in PG texts.”

PG refers to Project Gutenberg, where they publish out-of-copyright books for anyone to read for free, in various languages and formats. Highly recommended, btw, if you didn’t know about them before!

Since I’ve done something like this before, just with words rather than Unicode code points, I thought I’d give it a try. Especially since I use several books from Project Gutenberg as test material in my Data Structures and Algorithms course. Gotta give back, as you know.

I know just the basics of Unicode, like: it’s hard and complex, full of rabbit holes to fall into if you are ignorant or assume too much. On the other hand, Swift should be one of the best languages for handling Unicode correctly, with its String and Character types.

The tool I made (ducking other responsibilities, again!) is available on GitHub. I commented about the tool in the PG issue discussion, as well as about my noobness in correct Unicode processing. Hopefully this and/or the other implementations and data offered to them are usable for the purpose. If you know Unicode better than I do, please tell me if there’s something to improve in the tool.

A useful site I found while doing this is Unicode Explorer. Worth a bookmark.

As for the PG folks’ discussion on “…A recent issue #271 notes that the 2em dash is often used but missing in many typefaces”: my implementation found that the 2em dash (“⸺”, code point U+2E3A) occurs 8 414 times in the provided book dataset.

Going to mark the hours of this little job in the category of “contributing to culture & society” in my university work plan 🙌🏼✌🏼 (the hands are code points U+1F64C U+1F3FC and U+270C U+1F3FC, the symbol itself plus the Fitzpatrick skin type/tone modifier…).
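As a small aside, Swift makes it easy to peek behind these symbols. Here’s a minimal sketch (my own illustration, not part of the tool) printing the scalars that make up a Character:

import Foundation

// One visible symbol can be one or many Unicode scalars: the 2em dash is a
// single scalar, while the raised hands emoji with a skin tone modifier is a
// single Character composed of two scalars.
let symbols: [Character] = ["⸺", "🙌🏼"]
for symbol in symbols {
    let codePoints = symbol.unicodeScalars
        .map { String(format: "U+%X", $0.value) }
        .joined(separator: " ")
    print("\(symbol): \(codePoints)")
}
// Prints:
// ⸺: U+2E3A
// 🙌🏼: U+1F64C U+1F3FC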

To end this post, here are some technical details on the implementation. The books are processed concurrently, using a Swift task group (some details removed from the actual code):

try await withThrowingTaskGroup(of: [Character: Int].self) { group in

    if let enumerator = fileManager.enumerator(atPath: booksDirectory) {
        while let file = enumerator.nextObject() as? String {
            if file.hasSuffix(".txt") {
                fileCount += 1
                // Add a task to the task group to handle one file.
                group.addTask { () -> [Character: Int] in
                    var taskCodePoints: [Character: Int] = [:]
                    let fullPath = booksDirectory + file
                    let fileContents = try String(contentsOf: URL(fileURLWithPath: fullPath), encoding: .utf8)
                    // Details removed here; the gist is to count
                    // each Character of the file:
                    for character in fileContents {
                        taskCodePoints[character, default: 0] += 1
                    }
                    // taskCodePoints now contains the result of one file.
                    return taskCodePoints
                }
            }
        }
    }

    // Combine the results from the concurrent tasks, totaling the result
    // from all files:
    for try await partial in group {
        for (key, value) in partial {
            codePointsUsage[key, default: 0] += value
        }
    }
}

The data structure [Character: Int] is a Swift dictionary of key-value pairs. The key (Character) is a unique character read from the files, and the value (Int) is the occurrence count of that character: how many times it occurred in the books.

Each subtask collects the character occurrence counts from one book, and those are then merged together in the for try await partial in group loop as the subtasks finish.
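As a side note, the same combining step could be written a bit more compactly with the dictionary’s merge(_:uniquingKeysWith:) method; a minimal sketch:

// Equivalent to the nested loops above: when both dictionaries contain the
// same Character key, combine the two counts by summing them.
for try await partial in group {
    codePointsUsage.merge(partial, uniquingKeysWith: +)
}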

Watching the macOS Activity Monitor, I saw that the app typically had nine threads, processing the books in parallel on my university work laptop (MacBook Pro M2). For some reason, my own older Mac mini M1 was usually much faster at the processing than the M2 MacBook.

Advent of Code day 1

The accompanying challenge doggie

For the first time, I started the Advent of Code challenge. I’ll be working in Swift, using the AoC project template from the Apple folks on GitHub, announced in the Swift forums.

Got the two stars from the first day of the challenge, and rewarded myself with the traditional Friday pizza and red wine.

Part two of day one required some interpretation. I finally got the idea and finished it successfully.

I don’t want to spoil the challenge, but I did part two in two different ways:

  • Going through the input string with String.Index, character by character, picking up number words (“one” etc.) and converting them to digits (“1”) in a result string; there’s a sketch of this after the list. This should be O(n), where n is the number of characters in the input string.
  • Replacing words with Swift’s String replacingOccurrences(of:with:), using two different dictionaries. This has two consecutive for loops iterating the two dictionaries, replacing text in the input string that matches the keys (e.g. “oneight”) with the values (“18”). This should be O(n*m), where m is the number of entries in the two dictionaries (circa 15, give or take) and n is the number of characters in the input string.
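Here is a minimal sketch of the first approach. This is my illustration rather than the actual solution code; the words table and the digits(in:) helper are made up for the example:

// Walk the string one Character at a time with String.Index. Digits are
// copied to the result as-is; for letters, check if a spelled-out number
// starts at the current index. Advancing only one position per step keeps
// overlapping words like "oneight" working ("1" and "8" are both found).
let words = ["one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
             "six": "6", "seven": "7", "eight": "8", "nine": "9"]

func digits(in line: String) -> String {
    var result = ""
    var index = line.startIndex
    while index < line.endIndex {
        if line[index].isNumber {
            result.append(line[index])
        } else {
            for (word, digit) in words where line[index...].hasPrefix(word) {
                result.append(digit)
                break
            }
        }
        index = line.index(after: index)
    }
    return result
}

// digits(in: "xtwone3four") == "2134"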

Surprisingly, the first option took 12 seconds, while the second one, using the higher-level APIs, took only 0.007 seconds. Maybe I did the first one wrong somehow, or Swift String index operations are really slow here because of Unicode correctness. I’ve understood that the collection APIs used with strings are not so picky about Unicode correctness.

Otherwise, I used a conditional removal (filter) of the letter characters from the string, the map algorithm to convert the strings containing numbers to integers, and the reduce algorithm to calculate the sum; a sketch of this follows below.
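A sketch of that pipeline could look like the following, assuming the input rows are already in a lines: [String] array (a hypothetical name) where the number words have been converted to digits:

// For each line: keep only the digit characters, build the two-digit value
// from the first and last digit, and sum everything up with reduce.
let sum = lines
    .map { line in line.filter { $0.isNumber } }
    .compactMap { digits -> Int? in
        guard let first = digits.first, let last = digits.last else { return nil }
        return Int("\(first)\(last)")
    }
    .reduce(0, +)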

Challenges like this are a good way to brush up my skills. And learn more about Swift. I added myself to the Swift leaderboard to see how other Swift programmers do.

Tomorrow is day two. I should have time for it in the morning, since I wake up early nowadays, both because of myself and the dog. He is already 14+ years old and unfortunately has health issues, meaning early wake-ups every now and then.

The end part of AoC may be a real challenge because of all the Christmas hassle in the house, and the busy end of the semester at the university. It will be interesting to see how far I get, and with how many gaps.

Mind the map

Another post about data structures, time performance, and the count-the-words-in-a-book-file-fast case I’ve written about before.

I did a C++ implementation of the books-and-words problem. Earlier, I had implemented several solutions to the problem in Swift and Java. This time I used the C++ standard library std::map, and wanted to see if parallel processing in several threads would speed things up.

Obviously it did. The execution time of the multithreaded version was 74% of the single-threaded version’s: the sample files were processed by the single-threaded version in 665 ms, while the multithreaded version took only 491 ms. Nice!

But then I saw, from the documentation of std::map, that it keeps the keys in sorted order as elements are added to the map/dictionary.

But this is not needed in my case! Keeping the keys ordered surely takes time too, so it gives me an additional opportunity to optimise the time performance.

So I changed the std::map in the single-threaded implementation to std::unordered_map, and behold: it was faster than the multithreaded version, with an execution time of 446 ms!
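The change itself is tiny. A minimal C++ sketch of the idea (not the actual project code):

// std::map keeps keys sorted in a balanced tree: lookups and insertions are
// O(log n). std::unordered_map is a hash table: O(1) on average, and the
// word counts don't need to stay sorted while counting.
#include <string>
#include <unordered_map>   // was: #include <map>

// was: std::map<std::string, int> wordCounts;
std::unordered_map<std::string, int> wordCounts;

void count(const std::string& word) {
    ++wordCounts[word];
}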

So mind the map. There are many kinds of maps, and some of them may be more suitable for your use case than others.

For details, see the project on GitHub.