
Reading time estimates in Pandoc Lua

Recently, Abinav posted an article on how to estimate reading time in Pandoc-based blogs, titled "Reading Time Estimates for Pandoc Based Blog Generators".

But somehow, he only included Haskell-powered blogs in his definition of "Pandoc-based"!

My Lua-powered Pandoc-based blog begs to differ! Such debasement of terms will not be left unchecked! 😂

[Image: Non-wikimedian protesting. A person holds a sign "Pandoc Lua's Pandoc too" at a gathering labeled "Pandoc convention", in front of a speaker presenting "Estimating reading time in Haskell".]

In protest, I have modified my website to display a word count for all blog articles, which you can now see live on my Blog page.

Lua filter code

The filter code I needed to count the words and add that to the document's metadata is delightfully short:

  1. Pass the document through pandoc.utils.stringify to turn it into a string
  2. Split the string into words with string.gmatch
  3. Count up the words
  4. Optional: calculate reading time

local wpm = 220  -- average reading speed, in words per minute

function Pandoc(doc)
  local word_count = 0
  local text = pandoc.utils.stringify(doc.blocks)
  for _ in text:gmatch('[^ .,?!\n\t()—%-]+') do
    word_count = word_count + 1
  end
  doc.meta.word_count = word_count
  doc.meta.reading_time_string = string.format('%.1f min', word_count / wpm)
  return doc  -- return the modified document so pandoc picks up the new metadata
end
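
If you want to sanity-check the filter without a full site build, recent pandoc releases ship a pandoc lua interpreter with the pandoc module preloaded. Here is a minimal sketch under my own assumptions (the sample text and file name are made up, and the filter code above is assumed to sit in the same file):

-- check.lua: paste the filter code above this line, then run `pandoc lua check.lua`
local doc = pandoc.read("Hello *brave new* world!", "markdown")
doc = Pandoc(doc)
print(doc.meta.word_count)           -- should print 4
print(doc.meta.reading_time_string)  -- should print 0.0 min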

In this case, I have opted to count words as sequences of characters that are not spaces or basic punctuation, since Lua's gmatch doesn't have great Unicode support, and I do have pages written in Bulgarian/Cyrillic. In particular, the script above counts "it's" as one word, not two.
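
Here is that behavior in isolation, as a standalone snippet you can paste into any Lua interpreter (the sample strings are mine):

-- same pattern as in the filter above
local function count(s)
  local n = 0
  for _ in s:gmatch('[^ .,?!\n\t()—%-]+') do n = n + 1 end
  return n
end

print(count("it's a test"))     -- 3: "it's" stays a single word
print(count("Здравей, свят!"))  -- 2: the pattern works on bytes, and UTF-8
                                --    Cyrillic bytes aren't in the excluded set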

Possible optimizations (and complications)

To squeeze out a bit of extra performance by not requiring pandoc.utils.stringify to allocate the whole document as one string (and to also count raw HTML blocks correctly), you can instead count words per text-carrying block or inline element, similar to what Abinav does:

local wpm = 220  -- words per minute, as before

function Pandoc(doc)
  local word_count = 0
  local function count_words(text)
    for _ in text:gmatch('[^ .,?!\n\t()—%-]+') do
      word_count = word_count + 1
    end
  end
  local function count_block_or_inline_text(b)
    count_words(b.text)
  end
  local function count_block_or_inline_raw(r)
    if r.format == "html" then
      -- strip HTML tags before counting; see below
      count_words(r.text:gsub('<[^>]+>', ''))
    else
      count_words(r.text)
    end
  end
  -- walk's return value is ignored: the handlers only count, they change nothing
  doc.blocks:walk({
    CodeBlock = count_block_or_inline_text,
    Str = count_block_or_inline_text,
    Code = count_block_or_inline_text,
    Math = count_block_or_inline_text,
    RawBlock = count_block_or_inline_raw,
    RawInline = count_block_or_inline_raw,
  })
  doc.meta.word_count = word_count
  doc.meta.reading_time_string = string.format('%.1f min', word_count / wpm)
  return doc
end

HTML tag removal is based on simply deleting character sequences that look like <...>, which should generally be reliable, though not as robust as actually parsing the HTML.
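
Concretely, here is what that substitution does, including where it falls short (the sample HTML is mine):

local function strip(s)
  return (s:gsub('<[^>]+>', ''))  -- extra parens drop gsub's second return value
end

print(strip('<p>Hello <strong>world</strong>!</p>'))  -- Hello world!

-- caveat: a literal '>' inside an attribute value ends the "tag" early
print(strip('<a title="a > b">link</a>'))  -- prints ` b">link` instead of `link`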

Yet observe that this is still waay shorter than Abinav's Haskell code (27 vs 83 lines), thanks to Pandoc's Blocks:walk 😁 *dramatically drops glove on floor*

Other updates

I've also changed the source code link at the bottom right of each page (next to the Netlify link) to lead directly to the markdown source of a page and not to the whole repository. Small difference, but I hope that link is marginally more useful now.

As for reading time estimates... I'd much rather see the raw word count instead, so I won't be including them for now. Plus, the code does produce double-digit estimated minutes for my longer articles, and I don't like that 😂😂


This has been my post 7 of #100DaysToOffload. My first direct reply to another blog post!