I am seeking feedback for this feature. If there are any changes or additions to the API you feel would make it more useful, or if you have any other suggestions, please let me know (or via email if you prefer).
Be aware that I may make updates to the md4c.domparser API in response to feedback I receive. When that is no longer the case, I will remove this message and make a note in the changelog.
What does “DOM parsing” mean?
In the world of XML, there are two general types of parsers: SAX (i.e. event-based) and DOM (i.e. tree-based). SAX parsers traverse the document, and as each tag or bit of content is parsed, the appropriate event is emitted (enter-element, leave-element, characters) and a callback handles it. DOM parsers construct a tree representation of the entire document for the caller.
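To make the distinction concrete, here is a small sketch that uses Python's standard-library XML parsers rather than anything from PyMD4C. The EventLogger class and the sample document are made up for illustration; only xml.sax and xml.dom.minidom are real APIs.

```python
import xml.sax
from xml.dom import minidom

XML = b"<doc><p>Hello</p><p>World</p></doc>"


class EventLogger(xml.sax.ContentHandler):
    """Hypothetical SAX handler that records each event as it is emitted."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("enter-element", name))

    def endElement(self, name):
        self.events.append(("leave-element", name))

    def characters(self, content):
        self.events.append(("characters", content))


# SAX style: the parser pushes events to the handler; no tree is kept.
handler = EventLogger()
xml.sax.parseString(XML, handler)

# DOM style: the parser returns a tree that can be queried afterwards.
dom = minidom.parseString(XML)
paragraphs = [p.firstChild.data for p in dom.getElementsByTagName("p")]
```

With the SAX handler, anything you want to remember has to be stored by your own callbacks. With the DOM tree, the whole document stays available for later queries and edits.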
While the concepts were originally conceived for XML, parsers for most markup languages fit into these same two categories. The MD4C C library and the main md4c Python module take a decidedly SAX-like approach to
parsing (and the MD4C library is proud of it). This is clear from the
md4c.domparser module provides a DOM-like API for use cases where
that style is more appropriate. It produces an AST where each paragraph,
heading, link, block quote, etc. is represented by an
Why use DOM-like parsing?
You may find that the HTMLRenderer and SAX-like parsers do not provide the flexibility you need. A typical use case is manipulating the input document before it is rendered. For example, maybe you want to convert every occurrence of a certain word to a hyperlink, except in
code blocks. Or you want to delete everything after the first paragraph under
The tradeoff for this flexibility is speed:
DOM parsers are more resource-intensive than SAX parsers in general, due to the overhead from producing a tree representation of the entire document in memory.
Furthermore, the SAX-like parsers in PyMD4C are a thin layer on top of MD4C, which is heavily optimized C code.
DOMParser is implemented in Python on top of
Generating an AST
Since many applications will not need them, the DOM-like parser and the classes for the AST are all in a separate module: md4c.domparser. The parser itself is the md4c.domparser.DOMParser class. In the simplest case, it is used like this:
import md4c.domparser

with open('README.md', 'r') as f:
    markdown = f.read()

parser = md4c.domparser.DOMParser()
ast = parser.parse(markdown)
At this point, ast is the root Document node of the tree. You can render the tree as HTML:
html = ast.render()
Or you can traverse the tree:
def traverse(ast_node):
    # Do stuff on this node before traversing to children
    try:
        for child in ast_node.children:
            traverse(child)
    except AttributeError:
        # No children
        pass
    # Do stuff on this node after traversing to children

traverse(ast)
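As a concrete variation on the traversal above, the sketch below tallies node types in a small tree. So that it runs without PyMD4C installed, it uses two hypothetical stand-in classes (Leaf and Container) that only mimic the type and children attributes; they are not md4c classes. It also uses getattr() instead of try/except, an optional design choice that keeps an AttributeError raised by your own per-node code from being silently swallowed.

```python
from collections import Counter


class Leaf:
    """Stand-in for a node without children (hypothetical, not an md4c class)."""

    def __init__(self, type_):
        self.type = type_


class Container(Leaf):
    """Stand-in for a node that has a children list (hypothetical)."""

    def __init__(self, type_, children=()):
        super().__init__(type_)
        self.children = list(children)


def count_types(node, counts=None):
    """Depth-first traversal that tallies how often each node type occurs."""
    if counts is None:
        counts = Counter()
    counts[node.type] += 1
    # Leaf nodes have no children attribute, so fall back to an empty tuple
    for child in getattr(node, "children", ()):
        count_types(child, counts)
    return counts


tree = Container("document", [
    Container("paragraph", [Leaf("text"), Container("link", [Leaf("text")])]),
    Leaf("horizontal rule"),
])
counts = count_types(tree)
```

The same function shape should carry over to a real AST by passing the root node returned by the parser, assuming its nodes expose type and children as described in this document.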
AST Node Objects
Each type of Markdown element (i.e. each type of block, span, and text) has an associated AST type. For example, Paragraph is for BlockType.P. See Base AST Classes for the full list.
For Markdown elements with additional details attached to them (see
Details Dicts), each detail becomes an attribute in the object. For instance,
hdg would have attribute
There are a few base classes that the AST classes inherit from:
ASTNode
All AST classes inherit from this class. It provides the type and parent attributes. This is also the class that should be used to construct all AST node objects, no matter their type. More on that in the later sections.
ContainerNode
All AST classes that are not leaf nodes inherit from this class. That is, all blocks and inlines except HorizontalRule. It provides the children attribute and the append() and insert() methods for adding new children.
TextNode
All AST classes associated with md4c.TextType values inherit from this. It provides the text attribute containing the unprocessed text from the parser.
One of the primary benefits of using a DOM-like parser is that you can manipulate the AST of the parsed document before rendering it as HTML. Below are a couple of examples of such manipulations.
Add a Copyright Notice
Suppose you wanted to add a horizontal rule and then a copyright notice at the end of the document. (This probably doesn’t require generating the full AST, but it serves as a simple example.) You could do that like this:
import md4c
import md4c.domparser

# Parse document
with open('document.md', 'r') as f:
    markdown = f.read()
parser = md4c.domparser.DOMParser()
ast = parser.parse(markdown)

# Generate horizontal rule and copyright notice paragraph
hr = md4c.domparser.ASTNode(md4c.BlockType.HR)
p = md4c.domparser.ASTNode(md4c.BlockType.P)

# Add copyright notice text to the paragraph
p.append(md4c.domparser.ASTNode(
    md4c.TextType.NORMAL, text='Copyright '))
p.append(md4c.domparser.ASTNode(
    md4c.TextType.ENTITY, text='&copy;'))
p.append(md4c.domparser.ASTNode(
    md4c.TextType.NORMAL, text=' 2021 John Doe'))

# Add the horizontal rule and copyright notice to the end of the document
ast.append(hr)
ast.append(p)

# Render
html = ast.render()
There are several important points to note:
New nodes are always constructed using the ASTNode constructor, no matter the type. It will construct the appropriate subclass depending on the node type enum member passed in as the first argument.
Additional arguments for the ASTNode constructor, when given, must be keyword-only. For text nodes, this must be a single text argument. For nodes with details, these would correspond with the keys for the details dict.
Nodes can be added as children by calling the append() method on the parent. That appends the node to the parent’s children list and sets the child node’s parent.
Neither the horizontal rule node nor any of the text nodes can accept children. They do not have
Linkify a Keyword
Now a slightly more involved example: You want to replace every instance of your company name, “Example, Inc.” with a link to its homepage, but only in normal text (i.e. not code blocks, raw HTML, etc.). You might do that as follows:
import md4c
import md4c.domparser

# Parse document
with open('document.md', 'r') as f:
    markdown = f.read()
parser = md4c.domparser.DOMParser()
ast = parser.parse(markdown)

def linkify_name(parent, i):
    """If there are any instances of the company name in child i of the
    parent, linkify them and return the index of the last inserted child.
    If there are not, return i."""
    text = parent.children[i].text
    before, name, after = text.partition('Example, Inc.')
    if name == '':
        # Name not present.
        return i

    # Remove old child
    parent.children.pop(i)

    # Add the before portion, if not empty
    if before != '':
        before_node = md4c.domparser.ASTNode(
            md4c.TextType.NORMAL, text=before)
        parent.insert(i, before_node)
        i += 1

    # Add the link
    link_node = md4c.domparser.ASTNode(
        md4c.SpanType.A,
        href=[(md4c.TextType.NORMAL, 'https://example.com/')])
    link_node.append(md4c.domparser.ASTNode(
        md4c.TextType.NORMAL, text=name))
    parent.insert(i, link_node)

    # Add the after portion and check for more instances,
    # if not empty
    if after != '':
        i += 1
        after_node = md4c.domparser.ASTNode(
            md4c.TextType.NORMAL, text=after)
        parent.insert(i, after_node)
        return linkify_name(parent, i)
    return i

def find_and_linkify_name(ast_node):
    """Traverse the AST looking for normal text nodes, then linkify the
    company name"""
    try:
        i = 0
        while i < len(ast_node.children):
            child = ast_node.children[i]
            if child.type is md4c.TextType.NORMAL:
                i = linkify_name(ast_node, i)
            else:
                find_and_linkify_name(child)
            i += 1
    except AttributeError:
        # No children
        pass

# Linkify company name and render
find_and_linkify_name(ast)
html = ast.render()
Some points to note about this example:
The insert() method is like append(), except it lets you pick where in the parent’s children list to insert the new child node.
There is no special method to remove a child. Just pop it from the parent’s children list.
Be careful when modifying the children list as you iterate over it. It’s not safe to use a for loop on a list that you intend to insert or remove items from.
The node type is identified with
child.type is md4c.TextType.NORMAL, not
isinstance(child, md4c.domparser.NormalText). The former works even if using a custom AST class to handle normal text, while the latter only works with the default
This example was just a demonstration. If you wanted to do something like
this in production code, you should consider that 1) normal text can appear
in places where the link replacement shouldn’t happen (e.g. inside the text
of an existing link), and 2) numeric entities (e.g.
be used to foil the matching.
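The caution above about modifying the children list while iterating over it applies to any Python list. The sketch below shows the index-based while loop pattern on a plain list of strings, which stands in for a children list here; split_on is a hypothetical helper written for this illustration, not part of PyMD4C.

```python
def split_on(items, sep):
    """Split every string that contains sep into two list entries, in place."""
    i = 0
    # len(items) is re-evaluated on every pass, so growth of the list is seen
    while i < len(items):
        before, found, after = items[i].partition(sep)
        if found:
            # Replace one entry with two; the list grows by one
            items[i:i + 1] = [before, after]
        # Step to the next entry. After a split that is the "after" part,
        # so any further separators in it are handled on the next pass.
        i += 1
    return items


pieces = split_on(["a|b|c", "d"], "|")
```

Because the loop condition and the index are both under your control, insertions and removals do not cause entries to be skipped or visited twice, which is the risk with a for loop over the same list.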
Using Custom AST Classes
You can customize the classes used for the AST. The main reason to do so is for customizing the rendering functionality, either to tailor the HTML generation to your particular application or generate another output format altogether.
To provide an example, suppose you wanted to use MathJax to render your equations. The default InlineMath and DisplayMath classes render <x-equation> tags, but you need them to render \(...\) and \[...\] instead. Here is how you could do that:
import md4c
import md4c.domparser

# Create custom AST classes for InlineMath and DisplayMath
class InlineMathJax(md4c.domparser.InlineMath,
                    element_type=md4c.SpanType.LATEXMATH):
    def render_pre(self, **kwargs):
        return '\\('

    def render_post(self, **kwargs):
        return '\\)'

class DisplayMathJax(md4c.domparser.DisplayMath,
                     element_type=md4c.SpanType.LATEXMATH_DISPLAY):
    def render_pre(self, **kwargs):
        return '\\['

    def render_post(self, **kwargs):
        return '\\]'

# Parse and render document
with open('document.md', 'r') as f:
    markdown = f.read()
parser = md4c.domparser.DOMParser(latex_math_spans=True)
ast = parser.parse(markdown)
html = ast.render()
The magic here is in the class parameters: alongside the parent class, we have the
element_type parameter. So long as one of our class’s ancestors is
element_type is provided,
ASTNode will register our new class as the one to
construct for that element type. This needs to be done before calling the
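For readers unfamiliar with class keyword parameters, the sketch below shows one way such a registration mechanism can be built in plain Python with __init_subclass__. It is an illustration under assumptions, not PyMD4C's actual implementation: the Node class, its _registry dict, the create() method, and the string "latexmath" are all hypothetical stand-ins.

```python
class Node:
    """Hypothetical base class that keeps a registry of element types."""

    _registry = {}

    def __init_subclass__(cls, element_type=None, **kwargs):
        super().__init_subclass__(**kwargs)
        if element_type is not None:
            # A later subclass replaces an earlier one for the same type
            Node._registry[element_type] = cls

    @classmethod
    def create(cls, element_type):
        """Instantiate whichever class is registered for element_type."""
        return Node._registry[element_type]()

    def render_pre(self):
        return ""


class InlineMath(Node, element_type="latexmath"):
    """Stand-in for a default class registered by the library."""

    def render_pre(self):
        return "<x-equation>"


default_node = Node.create("latexmath")


class InlineMathJax(InlineMath, element_type="latexmath"):
    """Custom class; defining it re-registers the element type."""

    def render_pre(self):
        return "\\("


custom_node = Node.create("latexmath")
```

In this sketch, default_node was created before the custom class existed, so it is still an InlineMath instance. That illustrates why a registration of this kind has to happen before the nodes you care about are constructed.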
Some additional notes about the AST classes:
Most of the block and span classes (all except HorizontalRule) inherit from ContainerNode. For these, you can almost always rely on the default render() method as-is and just customize render_pre() and render_post(). They run before and after the children are rendered, respectively.
The CommonMark spec allows most span elements to occur inside an image element. HTML does not allow this, since the image text becomes the alt text attribute. To handle this, most of the span and text elements accept an
image_nesting_level argument for their
image_nesting_level > 0, they render without HTML tags.
Normally, text nodes appear in the regular text of a document. But sometimes, they appear in URL contexts (link targets and image sources). In those contexts, the render function for text nodes is passed an additional keyword argument:
url_escape. When True, normal text and entities must process their output through their
bytes as the Input
All the examples above have assumed UTF-8 input. As with all the other parsers, DOMParser will parse bytes input as well. In that case, the render() method on the resulting AST will also return a bytes object.
There are some additional caveats to be aware of when modifying ASTs generated from bytes input:
When constructing a new ASTNode, you must set use_bytes=True in the constructor, for example:
heading_node = md4c.domparser.ASTNode(md4c.BlockType.H, level=1, use_bytes=True)
Text for any TextNode must be a bytes object:
link_node = md4c.domparser.ASTNode(
    md4c.SpanType.A,
    href=[(md4c.TextType.NORMAL, b'http://www.example.com/')],
    use_bytes=True)
link_node.append(md4c.domparser.ASTNode(
    md4c.TextType.NORMAL, text=b'Example Link Text', use_bytes=True))
When using custom ASTNode subclasses, make sure any overridden render methods return bytes objects when the self.bytes attribute is True:
class InlineMathJax(md4c.domparser.InlineMath,
                    element_type=md4c.SpanType.LATEXMATH):
    def render_pre(self, **kwargs):
        if self.bytes:
            return b'\\('
        return '\\('

    def render_post(self, **kwargs):
        if self.bytes:
            return b'\\)'
        return '\\)'