Implementing a heuristic to infer document titles in single-H1 Markdown files #11462

chrispy-snps · 2026-02-13T00:52:50Z

chrispy-snps
Feb 13, 2026

Hello fellow Pandoc users!

I have tens of thousands of Markdown files from many sources in many formats. I would like to implement a heuristic to infer document titles as follows:

If a % title exists, then use it as the document title:

% Document Title <----

# Topic 1 Title

Here is some text.

# Topic 2 Title

Here is some text.

Else, if a single # exists, then use it as the document title (i.e., --shift-heading-level-by=-1):

# Document Title <----

## Topic 1 Title

Here is some text.

## Topic 2 Title

Here is some text.

Else, use the file name as the document title:

# Topic 1 Title

Here is some text.

# Topic 2 Title

Here is some text.

The problem is that (1) I don't know what structure the document will have when I call Pandoc, and (2) if I always specify --shift-heading-level-by=-1, it corrupts the first and third cases.

Is there a way to conditionally apply --shift-heading-level-by=-1 only when there is a single H1 in the document?

chrispy-snps · 2026-02-13T01:21:56Z

chrispy-snps
Feb 13, 2026
Author

I know nothing about Lua, so I asked my friendly neighborhood AI agent to take a crack at it. On the surface, it seems to work, but I have no idea how well it's written or what problems it might have:

-- infer-h1-as-doctitle.lua

-- Main filter function
function Pandoc(doc)
  -- Count H1s and extract text from the first one
  local h1_count = 0
  local h1_text = nil

  -- This pass only collects information, so the result of walk is discarded
  doc:walk({
    Header = function(el)
      if el.level == 1 then
        h1_count = h1_count + 1
        if h1_count == 1 then
          h1_text = pandoc.utils.stringify(el)
        end
      end
    end
  })

  -- Process only if there's exactly one H1
  if h1_count ~= 1 then
    return doc
  end

  doc.meta.title = pandoc.MetaString(h1_text)

  -- Remove the H1 (returning an empty list deletes the element, even when it
  -- is nested inside another block) and shift the remaining headings up
  return doc:walk({
    Header = function(el)
      if el.level == 1 then
        return {}
      end
      el.level = el.level - 1
      return el
    end
  })
end

0 replies

jgm · 2026-02-13T09:25:25Z

jgm
Feb 13, 2026
Maintainer

I would think that the most straightforward solution would be to use a shell script wrapper. The script can check whether the document starts with %, with # followed by a space, or with neither. If it starts with %, the script calls pandoc normally; if it starts with # , it calls pandoc with --shift-heading-level-by=-1; otherwise, it calls pandoc with -M title="$filename".

#!/bin/sh -e

# get the input filename: the last line printed by --dump-args
# (this is the input file when only one is given)
file="$(pandoc --dump-args "$@" | tail -n 1)"

firstline="$(head -n 1 "$file")"

if echo "$firstline" | grep -q "^%"
then pandoc "$@"
elif echo "$firstline" | grep -q "^# "
then pandoc --shift-heading-level-by=-1 "$@"
else pandoc -M title="$file" "$@"
fi

0 replies

chrispy-snps · 2026-02-13T12:55:22Z

chrispy-snps
Feb 13, 2026
Author

@jgm - thanks for the suggestion! I tried it out on some of our Markdown files, but found that we also have files with YAML-style title metadata:

---
title: Document Title <----
---

# Topic 1 Title

Here is some text.

# Topic 2 Title

Here is some text.

So the heuristic needs one more title-inference rule. In order of priority:

1. Use an explicitly specified title (% or title:)
2. Infer the highest-level heading as the title if it is the only heading at that level
3. Fall back to the filename as the title

I took another crack at a Lua filter for this, and here is what I came up with:

--[[
  Infer document title from the first heading when no explicit title exists.

  If the document has no title (from % format or YAML metadata), find the first
  heading in the document. If it is the only heading at that level, use it as
  the title and remove it from the body. If another heading at the same level
  is found, exit with no changes.

  Related discussion here: https://github.com/jgm/pandoc/discussions/11462
  ]]

function Pandoc(doc)
  -- Skip entirely if document already has a title (% or YAML metadata)
  local existing_title = doc.meta.title
  if existing_title then
    local title_text = pandoc.utils.stringify(existing_title)
    if title_text and #title_text > 0 then
      return doc
    end
  end

  -- Walk through the document, find the first heading (of any level), and
  -- determine if it is the only one at its level
  local first_level = nil
  local first_header = nil
  local first_header_is_only = true

  local function check_header(el)
    if first_header == nil then
      first_level = el.level
      first_header = el
    elseif el.level == first_level then
      first_header_is_only = false
    end
  end
  doc:walk({Header = check_header})

  -- If the first heading is the only one at its level, use it as the document title
  if first_header and first_header_is_only then
    doc.meta.title = pandoc.MetaString(pandoc.utils.stringify(first_header))

    -- Remove the promoted heading element from the body
    local removed = false
    local new_blocks = {}
    for _, block in ipairs(doc.blocks) do
      if block.t == 'Header' and block.level == first_level and not removed then
        removed = true
      else
        table.insert(new_blocks, block)
      end
    end

    doc.blocks = new_blocks
  end

  return doc
end

Fortunately our downstream HTML processing pipeline normalizes heading levels, so I don't need to worry about resequencing the levels here.

I'm sure there was also a way to make the file-string-searching approach work, but processing the AST feels more reliable to me.

1 reply

jgm Feb 13, 2026
Maintainer

Well, you could just make a tiny adjustment to my script, checking for --- as well as %, on the assumption that there is always going to be a title in a YAML metadata block.
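
A minimal sketch of that adjustment, assuming that a YAML metadata block always opens with --- on the first line and always carries a title. The function names title_mode and convert are hypothetical and not part of the script above; as in that script, only the first line of the file is inspected.

```shell
#!/bin/sh
# title_mode FILE: classify FILE by its first line only (hypothetical helper).
# Prints "explicit" for a % title line or a line opening a YAML block (---),
# "shift" for an ATX level-1 heading, and "filename" otherwise.
title_mode() {
  firstline=
  # IFS= and -r keep the line intact; "|| true" tolerates an empty file
  # or a last line without a trailing newline
  IFS= read -r firstline < "$1" || true
  case "$firstline" in
    '%'*)  echo explicit ;;
    '---') echo explicit ;;   # assumption: the YAML block contains a title
    '# '*) echo shift ;;
    *)     echo filename ;;
  esac
}

# convert FILE [PANDOC-OPTIONS...]: hypothetical wrapper around pandoc.
# The input file must be the first argument.
convert() {
  file=$1
  case "$(title_mode "$file")" in
    explicit) pandoc "$@" ;;
    shift)    pandoc --shift-heading-level-by=-1 "$@" ;;
    filename) pandoc -M title="$file" "$@" ;;
  esac
}
```

For example, convert input.md -o output.html would pick the pandoc invocation from the first line of input.md. Like the original script, this does not count headings, so a file that starts with # but contains several level-1 headings is still classified as shift.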

badumont · 2026-02-13T21:11:24Z

badumont
Feb 13, 2026

You probably want to replace:

doc.meta.title = pandoc.MetaString(pandoc.utils.stringify(first_header))

with:

doc.meta.title = first_header.content

Otherwise, all rich text formatting will be stripped from the title.

0 replies

chrispy-snps · 2026-02-14T12:53:42Z

chrispy-snps
Feb 14, 2026
Author

@badumont - thanks for the heads-up! I will do some testing to confirm what works best for our content.

@jgm - before I switched to Python, I was a long-time Perl guy (20+ years). Over the years, I've been burned by "stringy" approaches not being 100% reliable when the input data is arbitrary and unstructured, which unfortunately is my situation here. For example:

- The document might use "setext" headings instead of ATX headings for H1s and H2s
- There might be an arbitrary blank line before the first H1
- There could be spurious text content before the first H1 (e.g., copyright or page headers) due to upstream conversion issues
- Some documents might skip H1s and start with H2s purely for cosmetic reasons (I found a document like this yesterday)

I did not enumerate these factors in my original question because I'm encountering them as I try out the solutions in this discussion on my document set. It's been an interesting exercise!
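
To make the first three cases concrete, here is a small sketch that runs a first-line check against hypothetical sample files. The helper name starts_with_atx_h1 and the sample contents are assumptions made for illustration; the check mirrors the grep "^# " test from the wrapper script and is not taken from any real document set.

```shell
#!/bin/sh
# starts_with_atx_h1 FILE: succeed if the first line of FILE begins with "# ".
starts_with_atx_h1() {
  head -n 1 "$1" | grep -q '^# '
}

demo_dir=$(mktemp -d)   # scratch directory for the hypothetical samples

# ATX heading on line 1: the case the check was written for
printf '%s\n' '# Document Title' '' '## Topic 1 Title' > "$demo_dir/atx.md"
# Setext heading: the same H1, but line 1 carries no "# " marker
printf '%s\n' 'Document Title' '==============' > "$demo_dir/setext.md"
# Blank line before the first H1
printf '%s\n' '' '# Document Title' > "$demo_dir/blank.md"
# Spurious text (e.g., a page header) before the first H1
printf '%s\n' 'Copyright notice' '' '# Document Title' > "$demo_dir/header.md"

for name in atx setext blank header; do
  if starts_with_atx_h1 "$demo_dir/$name.md"; then
    echo "$name: detected"
  else
    echo "$name: missed"
  fi
done

rm -r "$demo_dir"
```

Only atx is reported as detected; the other three samples contain the same level-1 heading but are reported as missed. A filter that works on the parsed AST sees a Header element in all four.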

0 replies
