Occurrence of Unicode code points used in Project Gutenberg texts

I think it was over my morning coffee, while loitering around the internets, that I came across this discussion:

“Here something that woud be really useful for PG development going forward: A count of the occurrence of unicode code points used in PG texts.”

PG refers to Project Gutenberg, where they publish out-of-copyright books for anyone to read for free, in various languages and formats. Highly recommended, btw, if you didn’t know about them before!

Since I’ve done something like this before, just with words instead of Unicode code points, I thought I’d give this a try. Especially since I use several books from Project Gutenberg as test material in my Data Structures and Algorithms course. Gotta give back, you know.

I only know the basics of Unicode, like: it’s hard and complex, full of rabbit holes to fall into if you are ignorant or assume too much. On the other hand, Swift should be one of the best languages for handling Unicode correctly, with its String and Character types.

The tool I made (ducking other responsibilities, again!) is available on GitHub. I commented about the tool in the PG issue discussion, and also about my noobness in correct Unicode processing. Hopefully this and/or the other implementations and data offered to them are usable for the purpose. If you know Unicode better than me, please tell me if there’s something to improve in the tool.

A useful site found while doing this is Unicode Explorer. Worth a bookmark.

As for the PG folks’ discussion on “…A recent issue #271 notes that the 2em dash is often used but missing in many typefaces.”: my implementation found that the 2em dash (“⸺”, Unicode code point U+2E3A) occurs 8 414 times in the provided book dataset.

Going to mark the hours of this little job in the category of “contributing to the culture & society” in my university work plan 🙌🏼✌🏼(the hands are codepoints U+1F64C U+1F3FC and U+270C U+1F3FC, the symbol itself plus the Fitzpatrick skin type / color…).
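To see how those hands break down into code points, here is a small sketch of my own (not part of the tool; codePointLabels is a name made up for this example). It assumes the emoji are written exactly as the four code points listed above. Note that Swift treats each hand plus its skin tone modifier as a single Character, so counting Characters and counting code points can give different numbers.

```swift
// Hypothetical helper for illustration: formats every Unicode scalar
// (code point) of a string as "U+XXXX".
func codePointLabels(of text: String) -> [String] {
    text.unicodeScalars.map { scalar in
        "U+" + String(scalar.value, radix: 16, uppercase: true)
    }
}

// The two hands, spelled out as the code points mentioned in the text.
let hands = "\u{1F64C}\u{1F3FC}\u{270C}\u{1F3FC}"

print(hands.count)                 // Characters (grapheme clusters): 2
print(hands.unicodeScalars.count)  // code points: 4
print(codePointLabels(of: hands))  // ["U+1F64C", "U+1F3FC", "U+270C", "U+1F3FC"]
```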

To end this post, here are some technical details on the implementation. Processing the books is done concurrently using a Swift task group (some details removed from the actual code):

try await withThrowingTaskGroup(of: [Character: Int].self) { group in

   if let enumerator = fileManager.enumerator(atPath: books) {
      while let file = enumerator.nextObject() as? String {
         if file.hasSuffix(".txt") {
            fileCount += 1
            // Add a task to the task group to handle one file
            group.addTask { () -> [Character: Int] in
               var taskCodePoints: [Character: Int] = [:]
               let fullPath = booksDirectory + file
               let fileContents = try String(contentsOf: URL(fileURLWithPath: fullPath), encoding: .utf8)
               // ...taskCodePoints contains the result of one file
               return taskCodePoints
            }
         }
      }
   }

   // Combine the results from the concurrent tasks into the total
   // for all files:
   for try await partial in group {
      for (key, value) in partial {
         codePointsUsage[key, default: 0] += value
      }
   }
}

The data structure [Character: Int] is a Swift dictionary of key-value pairs. Each key (Character) is a unique character read from the files, and each value (Int) is that character’s occurrence count, that is, how many times it occurred in the books.
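As a minimal sketch of how such a dictionary can be filled (my simplified illustration, not the code removed from the listing above; both function names are made up here): the first function counts Swift Characters, the second counts individual Unicode scalars. The difference matters because a Character is an extended grapheme cluster and may consist of several code points, so if strictly per-code-point numbers are wanted, iterating unicodeScalars is one option.

```swift
// Hypothetical helper: counts each Character (grapheme cluster) in a text.
func countCharacters(in text: String) -> [Character: Int] {
    var counts: [Character: Int] = [:]
    for character in text {
        counts[character, default: 0] += 1
    }
    return counts
}

// Hypothetical helper: counts each Unicode scalar (code point) in a text.
func countScalars(in text: String) -> [Unicode.Scalar: Int] {
    var counts: [Unicode.Scalar: Int] = [:]
    for scalar in text.unicodeScalars {
        counts[scalar, default: 0] += 1
    }
    return counts
}

let sample = "aab\u{2E3A}\u{2E3A}"   // "aab" followed by two 2em dashes
let characterCounts = countCharacters(in: sample)
print(characterCounts["a", default: 0])          // 2
print(characterCounts["\u{2E3A}", default: 0])   // 2

// Two emoji with the same skin tone modifier: the modifier is hidden
// inside the Characters, but shows up when counting scalars.
let scalarCounts = countScalars(in: "\u{1F64C}\u{1F3FC}\u{270C}\u{1F3FC}")
print(scalarCounts["\u{1F3FC}", default: 0])     // 2
```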

Each subtask collects the character occurrence counts from one book, and those are then merged together in the for try await partial in group loop as the subtasks finish.
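The same collect-and-merge pattern can be tried out without any files. The sketch below is my own simplification: it uses in-memory strings instead of books, the non-throwing withTaskGroup, and made-up function names (characterCounts, totalCounts). It also assumes Swift 5.7 or later, so that await can be used at the top level of a script or main.swift.

```swift
// Hypothetical helper: counts each Character in one text (one "book").
func characterCounts(of text: String) -> [Character: Int] {
    var counts: [Character: Int] = [:]
    for character in text {
        counts[character, default: 0] += 1
    }
    return counts
}

// Counts every text in its own child task, then merges the partial
// dictionaries in whatever order the tasks happen to finish.
func totalCounts(for texts: [String]) async -> [Character: Int] {
    return await withTaskGroup(of: [Character: Int].self,
                               returning: [Character: Int].self) { group in
        for text in texts {
            group.addTask { characterCounts(of: text) }
        }
        var total: [Character: Int] = [:]
        for await partial in group {
            // Sum the counts for keys that appear in both dictionaries.
            total.merge(partial, uniquingKeysWith: +)
        }
        return total
    }
}

let total = await totalCounts(for: ["aab", "abc"])
print(total["a", default: 0])   // 3
print(total["c", default: 0])   // 1
```

Since adding counts together gives the same result in any order, it does not matter which subtask finishes first.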

Watching from the Mac Activity Monitor, I saw that the app typically had nine threads processing the books in parallel on my Uni work laptop (MacBook Pro M2). For some reason, my own older MacBook Mini M1 was usually much faster at processing than the M2 MacBook.