Custom writer for TASVideos forum markup (BBCode-like) #11434
Impossible-to-represent inputs: since it's a custom writer, you can do as you like. But our approach in normal pandoc writers is not to raise an error, but to skip the content and emit a log message.

Language class: in Text.Pandoc.Highlighting, we do this: `` case msum (map (`lookupSyntax` syntaxmap) classes) of ``, which means that we take the first class that has a defined syntax. But I don't think this is exposed in the Lua API. You could just grab a list of supported languages from

Embedding resources: other ways of triggering options that would be accessible to the custom writer are environment variables and metadata fields.
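For the language-class point above, the "first class that has a defined syntax" rule can be approximated in plain Lua. This is only a sketch under stated assumptions: `SUPPORTED` is a hypothetical, hand-maintained set of language names (it is not something pandoc's Lua API provides), and `first_supported_class` is an illustrative helper name. The case-insensitive lookup is also an assumption.

```lua
-- Hypothetical set of language names the target markup is assumed to accept.
local SUPPORTED = { lua = true, python = true, haskell = true, c = true }

-- Return the first class that names a supported language, or nil if none does.
-- `classes` is any array of strings, such as the classes of a code block.
local function first_supported_class(classes, supported)
  for _, class in ipairs(classes) do
    if supported[class:lower()] then
      return class
    end
  end
  return nil
end
```

In a writer, `classes` would be the `el.attr.classes` of a `pandoc.CodeBlock`. Classes such as `numberLines` are skipped simply because they are not in the set, so no separate list of special classes is needed.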
Something similar may be possible with built-in writers, under some circumstances. Here is one I found. If you read an HTML input that has
But with a certain arrangement of spans to break up the
But I don't understand entirely what's going on, because in this case there seems to be some interference by the
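The general effect, where a pattern split across adjacent fragments escapes per-fragment escaping, can be reproduced in plain Lua without pandoc. The sketch below borrows the token shapes and the name `consolidate_tokens` from the original post, but the function bodies are a guess at the idea rather than the writer's actual code. `escape` and `render` are hypothetical helpers, and `render` ignores the tag-dependent escaping rules and `blankline` tokens for brevity.

```lua
-- Hypothetical escaper: wrap each "://" so the forum does not autolink it.
-- "://" contains no Lua pattern magic characters, so it matches literally.
local function escape(text)
  return (text:gsub("://", "[noparse]://[/noparse]"))
end

-- Merge runs of adjacent text tokens into one; other tokens pass through.
-- The input table is not modified.
local function consolidate_tokens(tokens)
  local out = {}
  for _, token in ipairs(tokens) do
    local last = out[#out]
    if token.type == "text" and last and last.type == "text" then
      out[#out] = { type = "text", text = last.text .. token.text }
    else
      out[#out + 1] = token
    end
  end
  return out
end

-- Render a token sequence, escaping each text token on its own.
local function render(tokens)
  local parts = {}
  for _, token in ipairs(tokens) do
    if token.type == "text" then
      parts[#parts + 1] = escape(token.text)
    elseif token.type == "start_tag" then
      parts[#parts + 1] = "[" .. token.tag .. "]"
    elseif token.type == "end_tag" then
      parts[#parts + 1] = "[/" .. token.tag .. "]"
    end
  end
  return table.concat(parts)
end

-- A URL split into two text tokens, as happens when spans break it up.
local tokens = {
  { type = "start_tag", tag = "i" },
  { type = "text", text = "http:" },
  { type = "text", text = "//example.com/" },
  { type = "end_tag", tag = "i" },
}

local unmerged = render(tokens)                    -- escape, then concatenate
local merged = render(consolidate_tokens(tokens))  -- concatenate, then escape
```

Here `unmerged` leaves the URL unescaped because `://` never appears inside a single text token, while `merged` contains `http[noparse]://[/noparse]example.com/`.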
I've made a custom writer for the markup format used on the TASVideos forum, which is based on BBCode. The format is defined by the upstream parser, which is BbParser.cs.
The most notable aspect of the writer is probably how it does escaping in order to prevent accidental syntax in the input from being interpreted as BBCode in the output. For example, this Markdown input:
Becomes this forum markup output:
- `[i]` in the input, if copied verbatim to the output, would be interpreted as a start tag for italics. The writer prevents that incorrect interpretation by wrapping the left square bracket in `[noparse]`/`[/noparse]`.
- A URL-like string is similarly protected from autolinking by wrapping the `://` in `[noparse]`/`[/noparse]`.

This is my first time making a custom writer. I wonder if someone more experienced can comment on the implementation and whether it fits with typical practice. I have these specific questions:

- `pandoc.scaffolding.Writer` and instead invented a custom abstraction.
- `error` or `assert` functions.
- `el.attr.classes`, in order to render a code block?

Escaping when a pattern spans AST nodes
I started off trying to use `pandoc.scaffolding.Writer`, where you provide a bunch of type-specific callback functions for AST nodes, each returning a fragment of the complete output in the form of a string or a `Doc`. But I had to give up that approach. The problem is that the rules for escaping text differ depending on what BBCode tags are currently open (in particular, whether the most recently opened tag allows nested child tags or not). I didn't see a way to pass that necessary information into the `pandoc.scaffolding.Writer` callbacks. Instead, I implemented a similar framework of callback functions, but with every callback taking an additional `stack` parameter. Each callback function did appropriate escaping on its fragment of the output, depending on the state of the stack:

But even this I had to change. The reason has to do with escaping URLs to prevent autolinking. When there's a URL-like string in the input that is not actually a link, we want to escape the string in the output so it remains plain text and does not get autolinked. We do that by searching for the substring `://` in text and converting it to `[noparse]://[/noparse]`. With the scaffolding-like approach, every callback function independently escapes the output fragment that it returns. The problem arises in an input like this HTML:

Because the TASVideos forum markup does not have a special representation for `pandoc.Span`, the `Inlines.Span` callback just returned its contents without any surrounding markup. The fact that the second URL is broken up by spans means that the `http:` and `//example.com/` parts of the URL were handled by two different calls to `Inlines.Str`. Because the pattern `://` doesn't appear in either part in its entirety, it didn't get escaped by either call. The second URL was output in a way that would, incorrectly, cause it to be autolinked:

In short, here we have a situation where the "escape then concatenate" paradigm of `pandoc.scaffolding.Writer` gives different and incorrect results, in comparison with "concatenate then escape".

To fix this problem I modified the framework further. Now, instead of callbacks returning a pre-escaped fragment of the output document, they yield a sequence of typed "tokens". A token may be something like `start_tag`, `end_tag`, or `text`. A top-level `consolidate_tokens` function iterates over the preliminary sequence of tokens produced by the callbacks and merges adjacent `text` tokens before they are escaped.

With the above example, the preliminary sequence of tokens is as follows:

```lua
{type = "start_tag", tag = "i"}
{type = "text", text = "http://example.com/"}
{type = "end_tag", tag = "i"}
{type = "blankline"}
{type = "start_tag", tag = "i"}
{type = "text", text = "http:"}
{type = "text", text = "//example.com/"}
{type = "end_tag", tag = "i"}
```

After merging adjacent `text` tokens, the sequence becomes:

```lua
{type = "start_tag", tag = "i"}
{type = "text", text = "http://example.com/"}
{type = "end_tag", tag = "i"}
{type = "blankline"}
{type = "start_tag", tag = "i"}
{type = "text", text = "http://example.com/"}
{type = "end_tag", tag = "i"}
```

The top-level `render_tokens` function iterates over the consolidated token sequence and emits BBCode tags, escaped text, etc., as appropriate for each tag. Following this approach, the output has both URLs properly escaped:

This technique is effective, but it makes me wonder if I'm overlooking something. I may be reinventing a wheel. It seems like a general problem, the need to look at the content of text nodes in different places in the AST that become adjacent in the output. Is there some existing function of `Doc`, for example, that does what I'm trying to do, marking certain text as needing to be merged and escaped before finally being output? I notice that Text.DocLayout does something similar internally with `FlatDoc`.

Dealing with impossible-to-represent inputs
There are two cases where an input document may be impossible to represent in this BBCode variant.

1. When we're inside a tag, such as `[code]`, that does not allow nested child tags, and we're asked to output text that would look like its corresponding end tag (`[/code]`). For example, consider this Markdown input:

   If it were converted naively, there would be no way to prevent the inner `[/code]` from prematurely ending the code block:

   Not even `[noparse]` is allowed inside tags that forbid nesting, so the usual way of escaping cannot be used. Inside such a tag, square brackets are copied to the output verbatim, so the lack of escaping is not actually a problem, except in the particular case of an end tag for the most recently opened tag.

2. When we're asked to output a BBCode start tag parameter (the `param` in `[tag=param]`) that contains unbalanced `[` and `]` characters. The `param` part of a start tag is terminated by the first `]` character that is not balanced by an earlier `[` character. Consider this Markdown input:

   If such a parameter with unbalanced brackets were rendered naively, it would make the parser see the content of the parameter as being shorter or longer than it should be:

Both cases are narrow, but they can happen. It's really only with the `[code]` tag that they can happen, because `[code]` is the only tag used by the writer that both doesn't allow nesting and contains something other than a URL. The only other non-void tags that don't allow nesting, `[img]` and `[url]`, both contain a URL, and URL percent escaping changes `[` and `]` into `%5b` and `%5d`, which avoids the problem.

When one of these situations occurs, I have the writer exit with an error: case (1), case (2).
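The two checks could look roughly like the following. This is a minimal sketch, not the writer's actual code: `contains_end_tag` and `param_is_balanced` are hypothetical names, and matching the end tag case-insensitively is an assumption about the upstream parser, chosen so that the check errs on the side of rejecting.

```lua
-- Hypothetical check for case (1): true if `text` contains the end tag for
-- `tag`, e.g. "[/code]" for tag "code". Case-insensitive by assumption.
local function contains_end_tag(text, tag)
  local needle = "[/" .. tag:lower() .. "]"
  -- The fourth argument (plain = true) turns off pattern matching,
  -- so "[" and "]" are treated as literal characters.
  return text:lower():find(needle, 1, true) ~= nil
end

-- Hypothetical check for case (2): true if the square brackets in `param`
-- are balanced, meaning no "]" appears without an earlier unmatched "[",
-- and no "[" is left open at the end.
local function param_is_balanced(param)
  local depth = 0
  for ch in param:gmatch("[%[%]]") do
    if ch == "[" then
      depth = depth + 1
    else
      if depth == 0 then
        return false -- this "]" would end the parameter early
      end
      depth = depth - 1
    end
  end
  return depth == 0 -- a leftover "[" would swallow the tag's closing "]"
end
```

When a check fails, the writer can raise an error as described above, or skip the element and emit a warning instead.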
Is it expected for a custom writer to behave like this, exiting with an error when it hits an impossible representation? Or should it rather attempt to soldier on no matter what, even if it would result in an output that a parser will misinterpret?
Language class in code blocks
When a fenced code block has a language class, the writer includes it as a parameter on the BBCode `[code]` tag:

becomes

The problem is when there is more than one class set on the code block:

All the classes end up together in the `attr.classes` of a `pandoc.CodeBlock`. The TASVideos forum markup allows at most one parameter on a code block, so which do we choose? What I'm doing is taking the first class that does not have a known special interpretation, like `numberLines` and `sourceCode`. Is there a better way?

Embedding resources
I like the `--embed-resources` feature of Pandoc, and I wanted to support something similar to embed image data without an external link. But the `--embed-resources` command-line option is not exposed to custom writers. So I implemented it as a format-specific extension, `embed_resources`. Can you suggest an alternative way of controlling an option like this one?

TeX math inlines
The custom writer uses the hack for math inlines from #11399. If a math formula is simple enough, it can be rendered with BBCode syntax. Otherwise the TeX syntax is output in a `[tt]` or `[code]` tag (depending on whether the math is inline or display).

The above converts to: