Implementing a heuristic to infer document titles in single-H1 Markdown files #11462
-
|
I know nothing about Lua, so I asked my friendly neighborhood AI agent to take a crack at it. On the surface it seems to work, but I have no idea how well it's written or what problems it might have:

    -- infer-h1-as-doctitle.lua

    -- Main filter function
    function Pandoc(doc)
      -- Count H1s and extract text from the first one
      local h1_count = 0
      local h1_text = nil
      -- Local helper, so no global Header function is defined at run time
      local function count_h1(el)
        if el.level == 1 then
          h1_count = h1_count + 1
          if h1_count == 1 then
            h1_text = pandoc.utils.stringify(el)
          end
        end
      end
      doc:walk({Header = count_h1})

      -- Process only if there's exactly one H1
      if h1_count == 1 then
        doc.meta.title = pandoc.MetaString(h1_text)

        -- Remove the first top-level H1 from the body
        local h1_removed = false
        local new_blocks = {}
        for _, block in ipairs(doc.blocks) do
          if block.t == 'Header' and block.level == 1 and not h1_removed then
            h1_removed = true
          else
            table.insert(new_blocks, block)
          end
        end
        doc.blocks = new_blocks

        -- Shift the remaining headings up by 1 level
        local function shift_up(el)
          if el.level > 1 then
            el.level = el.level - 1
          end
          return el
        end
        doc = doc:walk({Header = shift_up})
      end
      return doc
    end
-
|
I would think that the most straightforward solution would be to use a shell script wrapper. The script can check how the document starts:

    #!/bin/sh -e
    # get the first input filename:
    file="$(pandoc --dump-args "$@" | tail -1 | head -1)"
    firstline="$(head -1 "$file")"
    if echo "$firstline" | grep -q "^%"
    then pandoc "$@"
    elif echo "$firstline" | grep -q "^# "
    then pandoc --shift-heading-level-by=-1 "$@"
    else pandoc -M title="$file" "$@"
    fi
-
|
@jgm - thanks for the suggestion! I tried it out on some of our Markdown files, but found that we also have files with YAML-style title metadata:

    ---
    title: Document Title   <----
    ---
    # Topic 1 Title
    Here is some text.
    # Topic 2 Title
    Here is some text.

So, given the task of adding an additional title-inference rule to the heuristics, I took another crack at a Lua filter for this, and here is what I came up with:

    --[[
    Infer document title from the first heading when no explicit title exists.
    If the document has no title (from % format or YAML metadata), find the first
    heading in the document. If it is the only heading at that level, use it as
    the title and remove it from the body. If another heading at the same level
    is found, exit with no changes.
    Related discussion here: https://github.com/jgm/pandoc/discussions/11462
    ]]
    function Pandoc(doc)
      -- Skip entirely if document already has a title (% or YAML metadata)
      local existing_title = doc.meta.title
      if existing_title then
        local title_text = pandoc.utils.stringify(existing_title)
        if title_text and #title_text > 0 then
          return doc
        end
      end

      -- Walk through the document, find the first heading (of any level), and
      -- determine if it is the only one at its level
      local first_level = nil
      local first_header = nil
      local first_header_is_only = true
      local function check_header(el)
        if first_header == nil then
          first_level = el.level
          first_header = el
        elseif el.level == first_level then
          first_header_is_only = false
        end
      end
      doc:walk({Header = check_header})

      -- If the first heading is the only one at its level, use it as the document title
      if first_header and first_header_is_only then
        doc.meta.title = pandoc.MetaString(pandoc.utils.stringify(first_header))

        -- Remove the promoted heading element from the body
        local removed = false
        local new_blocks = {}
        for _, block in ipairs(doc.blocks) do
          if block.t == 'Header' and block.level == first_level and not removed then
            removed = true
          else
            table.insert(new_blocks, block)
          end
        end
        doc.blocks = new_blocks
      end
      return doc
    end

Fortunately, our downstream HTML processing pipeline normalizes heading levels, so I don't need to worry about resequencing the levels here. I'm sure there is also a way to make the file-string-searching approach work, but processing the AST feels more reliable to me.
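
For comparison, here is a rough sketch of how the string-based wrapper approach might also account for YAML title metadata. The helper names `has_yaml_title` and `title_case` are hypothetical (they are not part of pandoc), and the sketch assumes the YAML block starts on the very first line, uses LF line endings, and has `title:` as an unindented key:

```sh
#!/bin/sh
# Sketch only: has_yaml_title is a hypothetical helper, not part of pandoc.
# Succeeds if FILE opens with a "---" YAML block that contains a non-empty,
# unindented "title:" key before the closing "---" or "..." line.
has_yaml_title() {
  awk '
    NR == 1 { if ($0 !~ /^---[ \t]*$/) exit; next }  # no YAML block: stop
    /^(---|\.\.\.)[ \t]*$/ { exit }                  # end of the YAML block
    /^title:[ \t]*[^ \t]/ { found = 1; exit }        # non-empty title key
    END { exit !found }
  ' "$1"
}

# Hypothetical helper: report which title rule applies to FILE.
# Prints "explicit", "first-line-h1", or "filename".
title_case() {
  firstline="$(head -n 1 "$1")"
  if has_yaml_title "$1"; then
    echo explicit
  else
    case "$firstline" in
      %*)    echo explicit ;;       # pandoc % title block
      "# "*) echo first-line-h1 ;;  # document opens with an H1
      *)     echo filename ;;       # nothing usable: fall back to file name
    esac
  fi
}
```

A wrapper could branch on the output of `title_case "$file"` in the same way the script above branches on the first line. Note that a file with a YAML block lacking a title, followed by a heading, still falls through to `filename` here, which is one more reason the AST-based filter may be the safer choice.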
-
|
You probably want to replace:

    doc.meta.title = pandoc.MetaString(pandoc.utils.stringify(first_header))

with:

    doc.meta.title = first_header.content

Otherwise, all rich text formatting will be stripped from the title.
|
Beta Was this translation helpful? Give feedback.
-
|
@badumont - thanks for the heads-up! I will do some testing to confirm what works best for our content.

@jgm - before I switched to Python, I was a long-time Perl guy (20+ years). Over the years, I've been burned by "stringy" approaches not being 100% reliable when the input data is arbitrary and unstructured, which unfortunately is my situation here. For example,

I did not enumerate these factors in my original question because I'm encountering them as I try out the solutions in this discussion on my document set. It's been an interesting exercise!
-
Hello fellow Pandoc users!
I have tens of thousands of Markdown files from many sources in many formats. I would like to implement a heuristic to infer document titles as follows:
If a % title exists, then use it as the document title.
Else, if a single # exists, then use it as the document title (i.e., --shift-heading-level-by=-1).
Else, use the file name as the document title.

The problem is that (1) I don't know what structure the document will have when I call Pandoc, and (2) if I always specify --shift-heading-level-by=-1, it corrupts the first and third cases.

Is there a way to conditionally apply --shift-heading-level-by=-1 only when there is a single H1 in the document?
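
To make the "single H1" condition concrete, here is a minimal shell sketch of one way it might be tested before calling pandoc. The helpers `count_h1` and `has_single_h1` are hypothetical names, and the sketch assumes ATX-style headings only (a line starting with `# `); setext headings and indented code blocks are not handled:

```sh
#!/bin/sh
# Sketch only: count_h1 is a hypothetical helper, not a pandoc feature.
# Counts ATX-style level-1 headings ("# ...") in FILE, skipping lines inside
# fenced code blocks so that shell comments are not mistaken for headings.
count_h1() {
  awk '
    /^[`~][`~][`~]/ { in_fence = !in_fence; next }  # toggle at a fence line
    !in_fence && /^# / { n++ }                      # H1 outside any fence
    END { print n + 0 }
  ' "$1"
}

# Succeeds only when FILE contains exactly one such heading.
has_single_h1() {
  [ "$(count_h1 "$1")" -eq 1 ]
}
```

With this, `--shift-heading-level-by=-1` could be passed only when `has_single_h1 "$file"` succeeds. Because this is a text-level approximation, it can disagree with what pandoc's reader actually parses on unusual input.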