DOM Parsing

Note

I am seeking feedback for this feature. If there are any changes or additions to the API you feel would make it more useful, or if you have any other suggestions, please let me know (or via email if you prefer).

Be aware that I may make updates to the md4c.domparser API in response to feedback I receive. When that is no longer the case, I will remove this message and make a note in the changelog.

What does “DOM parsing” mean?

In the world of XML, there are two general types of parsers: SAX (i.e. event-based) and DOM (i.e. tree-based). SAX parsers traverse the document, and as each tag or bit of content is parsed, the appropriate event is emitted (enter-element, leave-element, characters) and a callback handles it. DOM parsers construct a tree representation of the entire document for the caller.

While the concepts were originally conceived for XML, most parsers for any markup language usually fit into these same two categories. The MD4C C library and the main md4c Python module take a definitive SAX-like approach to parsing (and the MD4C library is proud of it). This is clear from the GenericParser API.

The md4c.domparser module provides a DOM-like API for use cases where that style is more appropriate. It produces an AST where each paragraph, heading, link, block quote, etc. is represented by an ASTNode.

Why use DOM-like parsing?

You may find that the HTMLRenderer and SAX-like parsers do not provide the flexibility you need. A typical use-case would be if you want to manipulate the input document before it is rendered. For example, maybe you want to convert every occurrence of a certain word to a hyperlink, except in code blocks. Or you want to delete everything after the first paragraph under each heading.

The tradeoff for this flexibility is speed:

  • DOM parsers are more resource-intensive than SAX parsers in general, due to the overhead from producing a tree representation of the entire document in memory.

  • Furthermore, the SAX-like parsers in PyMD4C are a thin layer on top of MD4C, which is heavily optimized C code. DOMParser is implemented in Python on top of ParserObject.

Generating an AST

Since many applications will not need them, the DOM-like parser and the classes for the AST are all in a separate module: md4c.domparser. The parser itself is the md4c.domparser.DOMParser class. In the most simple case, it is used like this:

import md4c.domparser

with open('README.md', 'r') as f:
    markdown = f.read()

parser = md4c.domparser.DOMParser()
ast = parser.parse(markdown)

At this point, ast is the root Document node of the tree. You can render the tree as HTML:

html = ast.render()

Or you can traverse the tree:

def traverse(ast_node):
    # Do stuff on this node before traversing to children

    try:
        for child in ast_node.children:
            traverse(child)
    except AttributeError:
        # No children
        pass

    # Do stuff on this node after traversing to children

traverse(ast)

AST Node Objects

Each type of Markdown element (i.e. each type of block, span, and text) has an associated AST type. For example, Paragraph is for BlockType.P. See Base AST Classes for the full list. For Markdown elements with additional details attached to them (see Details Dicts), each detail becomes an attribute in the object. For instance, a Heading object hdg would have attribute hdg.level.

There are a few base classes that the AST classes inherit from:

md4c.domparser.ASTNode

All AST classes inherit from this class. It provides the type and parent attributes. This is also the class that should be used to construct all AST node objects, no matter their type. More on that in the later sections.

md4c.domparser.ContainerNode

All AST classes that are not leaf nodes inherit from this class. That is, all blocks and inlines except HorizontalRule. It provides the children attribute and the append() and insert() methods for adding new children.

md4c.domparser.TextNode

All AST classes associated with md4c.TextTypes inherit from this. It provides the text attribute containing the unprocessed text from the parser.

AST Manipulation

One of the primary benefits of using a DOM-like parser is you can do AST manipulations on the parsed document before rendering it in HTML. Below are a couple examples of AST manipulations you could do.

Linkify a Keyword

Now a slightly more involved example: You want to replace every instance of your company name, “Example, Inc.” with a link to its homepage, but only in normal text (i.e. not code blocks, raw HTML, etc.). You might do that as follows:

import md4c
import md4c.domparser

# Parse document
with open('document.md', 'r') as f:
    markdown = f.read()
parser = md4c.domparser.DOMParser()
ast = parser.parse(markdown)

def linkify_name(parent, i):
    """If there are any instances of the company name in child
    i of the parent, linkify them and return the index of the
    last inserted child. If there are not, return i."""
    text = parent.children[i].text
    before, name, after = text.partition('Example, Inc.')
    if name == '':
        # Name not present.
        return i

    # Remove old child
    parent.children.pop(i)

    # Add the before portion, if not empty
    if before != '':
        before_node = md4c.domparser.ASTNode(
            md4c.TextType.NORMAL, text=before)
        parent.insert(i, before_node)
        i += 1

    # Add the link
    link_node = md4c.domparser.ASTNode(
        md4c.SpanType.A,
        href=[(md4c.TextType.NORMAL,
               'https://example.com/')])
    link_node.append(md4c.domparser.ASTNode(
        md4c.TextType.NORMAL, text=name))
    parent.insert(i, link_node)

    # Add the after portion and check for more instances,
    # if not empty
    if after != '':
        i += 1
        after_node = md4c.domparser.ASTNode(
            md4c.TextType.NORMAL, text=after)
        parent.insert(i, after_node)
        return linkify_name(parent, i)
    return i

def find_and_linkify_name(ast_node):
    """Traverse the AST looking for normal text nodes,
    then linkify the company name"""
    try:
        i = 0
        while i < len(ast_node.children):
            child = ast_node.children[i]
            if child.type is md4c.TextType.NORMAL:
                i = linkify_name(ast_node, i)
            else:
                find_and_linkify_name(child)
            i += 1
    except AttributeError:
        # No children
        pass

# Linkify company name and render
find_and_linkify_name(ast)
html = ast.render()

Some points to note about this example:

  • The insert() method is like append(), except it lets you pick where in the parent’s children list to insert the new child node.

  • There is no special method to remove a child. Just pop it from the parent’s children list.

  • Be careful when modifying the children list as you iterate over it. It’s not safe to use a for loop on a list that you intend to insert or remove items from.

  • The node type is identified with child.type is md4c.TextType.NORMAL, not isinstance(child, md4c.domparser.NormalText). The former works even if using a custom AST class to handle normal text, while the latter only works with the default NormalText class.

Warning

This example was just a demonstration. If you wanted to do something like this in production code, you should consider that 1) normal text can appear in places where the link replacement shouldn’t happen (e.g. inside the text of an existing link), and 2) numeric entities (e.g. &#x45; for E) can be used to foil the matching.

Using Custom AST Classes

You can customize the classes used for the AST. The main reason to do so is for customizing the rendering functionality, either to tailor the HTML generation to your particular application or generate another output format altogether.

To provide an example, suppose you wanted to use MathJax to render your equations. The default InlineMath and DisplayMath classes render <x-equation> tags, but you need them to render \(...\) and \[...\] instead. Here is how you could do that:

import md4c
import md4c.domparser

# Create custom AST classes for InlineMath and DisplayMath

class InlineMathJax(md4c.domparser.InlineMath,
                    element_type=md4c.SpanType.LATEXMATH):
    def render_pre(self, **kwargs):
        return '\\('

    def render_post(self, **kwargs):
        return '\\)'

class DisplayMathJax(md4c.domparser.DisplayMath,
                     element_type=md4c.SpanType.LATEXMATH_DISPLAY):
    def render_pre(self, **kwargs):
        return '\\['

    def render_post(self, **kwargs):
        return '\\]'

# Parse and render document

with open('document.md', 'r') as f:
    markdown = f.read()
parser = md4c.domparser.DOMParser(latex_math_spans=True)
ast = parser.parse(markdown)
html = ast.render()

The magic here is in the class parameters: Alongside the parent class, we have an element_type parameter. So long as one of our class’s ancestors is ASTNode and element_type is provided, ASTNode will register our new class as the one to construct for that element type. This needs to be done before calling the parse() method.

Some additional notes about the AST classes:

  • Most of the block and span classes (all except HorizontalRule) inherit from ContainerNode. For these, you can almost always rely on the default render() method as-is and just customize render_pre() and render_post(). They run before and after the children are rendered, respectively.

  • The CommonMark spec allows most span elements to occur inside an image element. HTML does not allow this, since the image text becomes the alt text attribute. To handle this, most of the span and text elements accept an image_nesting_level argument for their render() method. If image_nesting_level > 0, they render without HTML tags.

  • Normally, text nodes appear in the regular text of a document. But sometimes, they appear in URL contexts (link targets and image sources). In those contexts, the render function for text nodes is passed an additional keyword argument: url_escape. When True, normal text and entities must process their output through their url_escape() method.

Using bytes as the Input

All the examples above have assumed UTF-8 input. As with all the other parsers in PyMD4C, DOMParser will parse bytes objects as well. In that case, the render() method on the resulting AST will also return a bytes object.

There are some additional caveats to be aware of when modifying ASTs generated from bytes input:

  • When constructing a new ASTNode, you must set use_bytes=True in the constructor, for example:

    heading_node = md4c.domparser.ASTNode(md4c.BlockType.H,
                                          level=1,
                                          use_bytes=True)
    
  • Text for any TextNode must be a bytes object:

    link_node = md4c.domparser.ASTNode(
        md4c.SpanType.A,
        href=[(md4c.TextType.NORMAL, b'http://www.example.com/')],
        use_bytes=True)
    link_node.append(md4c.domparser.ASTNode(
        md4c.TextType.Normal,
        text=b'Example Link Text',
        use_bytes=True)
    
  • When using custom ASTNode subclasses, make sure any overridden render(), render_pre(), or render_post() methods return bytes objects when the self.bytes attribute is True:

    class InlineMathJax(md4c.domparser.InlineMath,
                        element_type=md4c.SpanType.LATEXMATH):
        def render_pre(self, **kwargs):
            if self.bytes:
                return b'\\('
            return '\\('
    
        def render_post(self, **kwargs):
            if self.bytes:
                return b'\\)'
            return '\\)'