Reading time estimates in Pandoc Lua
Recently, Abinav posted an article on how to estimate reading time in Pandoc-based blogs, titled "Reading Time Estimates for Pandoc Based Blog Generators".
But somehow, he only included Haskell-powered blogs in his definition of "Pandoc-based"!
My Lua-powered Pandoc-based blog begs to differ! Such debasement of terms will not go unchecked! 😂
[Comic: A person holds a sign reading "Pandoc Lua's Pandoc too" at a gathering labeled "Pandoc convention", in front of a speaker presenting "Estimating reading time in Haskell".]
In protest, I have modified my website to display a word count for all blog articles, which you can now see live on my Blog page.
Lua filter code
The filter code I needed to count the words and add that to the document's metadata is delightfully short:
- Pass the document through pandoc.utils.stringify to turn it into a string
- Split the string into words with string.gmatch
- Count up the words
- Optional: calculate reading time
local wpm = 220

function Pandoc(doc)
  local word_count = 0
  local text = pandoc.utils.stringify(doc.blocks)
  -- Replace em dashes with spaces up front: Lua patterns are byte-based,
  -- so a multi-byte "—" inside a character class would also exclude UTF-8
  -- continuation bytes that legitimately occur in Cyrillic letters.
  text = text:gsub('—', ' ')
  for _ in text:gmatch('[^ .,?!\n\t()%-]+') do
    word_count = word_count + 1
  end
  doc.meta.word_count = word_count
  doc.meta.reading_time_string = string.format('%.1f min', word_count / wpm)
  -- Return the modified document so the new metadata actually takes effect.
  return doc
end
In this case, I have opted to count words as sequences of characters that are neither spaces nor basic punctuation, since Lua's gmatch doesn't have great Unicode support, and I do have pages written in Bulgarian/Cyrillic. In particular, the script above counts "it's" as one word, not two.
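The splitting behaviour is easy to sanity-check in plain Lua, outside of any Pandoc filter. This is a minimal sketch mirroring the filter's approach; the helper name is mine:

```lua
-- Count "words" as runs of characters that aren't spaces or basic
-- punctuation, as in the filter above.
local function count_words(text)
  local n = 0
  for _ in text:gmatch('[^ .,?!\n\t()%-]+') do
    n = n + 1
  end
  return n
end

print(count_words("it's a contraction"))  -- apostrophes don't split: 3
print(count_words("well-known fact"))     -- hyphens do split: 3
```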
Possible optimizations (and complications)
To squeeze out a bit of extra performance by not requiring pandoc.utils.stringify to allocate extra memory for the whole document as a string (and to also be able to count raw HTML blocks correctly), you can instead count words per text-based block or inline element, similar to what Abinav does:
function Pandoc(doc)
  local word_count = 0

  local function count_words(text)
    -- Normalize em dashes first (Lua patterns are byte-based; a multi-byte
    -- "—" in a character class would also exclude UTF-8 continuation bytes).
    for _ in text:gsub('—', ' '):gmatch('[^ .,?!\n\t()%-]+') do
      word_count = word_count + 1
    end
  end

  local function count_block_or_inline_text(b)
    count_words(b.text)
  end

  local function count_block_or_inline_raw(r)
    if r.format == "html" then
      count_words(r.text:gsub('<[^>]+>', ''))
    else
      count_words(r.text)
    end
  end

  doc.blocks:walk({
    CodeBlock = count_block_or_inline_text,
    Str = count_block_or_inline_text,
    Code = count_block_or_inline_text,
    Math = count_block_or_inline_text,
    RawBlock = count_block_or_inline_raw,
    RawInline = count_block_or_inline_raw,
  })

  doc.meta.word_count = word_count
  doc.meta.reading_time_string = string.format('%.1f min', word_count / wpm)
  -- Return the modified document so the new metadata actually takes effect.
  return doc
end
HTML tag removal is based on simply deleting sequences of characters that look like <..>
, which should generally be reliable, though not as great as actually parsing the HTML.
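The substitution itself can be tried in a plain Lua prompt; the sample strings here are mine:

```lua
-- Strip anything that looks like an HTML tag from a raw HTML fragment.
local html = '<p>Hello, <strong>world</strong>!</p>'
print((html:gsub('<[^>]+>', '')))  -- Hello, world!

-- The approach is naive by design: a ">" inside an attribute value
-- (e.g. <p title="a>b">) would cut the tag short, which is the
-- "not as great as actually parsing" trade-off mentioned above.
```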
Yet, observe that this is still waay shorter than Abinav's Haskell code (27 vs 83 lines) thanks to Pandoc's Blocks:walk
😁 dramatically drops glove on floor
Other updates
I've also changed the source code link at the bottom right of each page (next to the Netlify link) to lead directly to the markdown source of a page and not to the whole repository. Small difference, but I hope that link is marginally more useful now.
As for reading time estimates...I'd much rather see the raw word count instead—so I won't be including them for now. Plus, the code does produce double-digit estimated minutes for my longer articles, and I don't like that 😂😂
This has been my post 7 of #100DaysToOffload. My first direct reply to another blog post!