Pure Parsing

SAX-Like Parsing

Applications that want to do their own rendering or don’t need to generate output at all can use the GenericParser class. It is implemented in C on top of the bare MD4C parser and provides a similar SAX-like interface.

Here is an example:

import md4c

# Counters

block_counts = dict()
span_counts = dict()
char_count = 0

# Callbacks

def enter_block(block_type, details):
    try:
        block_counts[block_type] += 1
    except KeyError:
        block_counts[block_type] = 1

def leave_block(block_type, details):
    pass

def enter_span(span_type, details):
    try:
        span_counts[span_type] += 1
    except KeyError:
        span_counts[span_type] = 1

def leave_span(span_type, details):
    pass

def process_text(text_type, text):
    global char_count
    char_count += len(text)

# Parsing

# Read the document as text; the explicit encoding avoids locale surprises
with open('README.md', 'r', encoding='utf-8') as f:
    markdown = f.read()

parser = md4c.GenericParser()
parser.parse(markdown,
             enter_block, leave_block,
             enter_span, leave_span,
             process_text)

for block_type, count in block_counts.items():
    print(block_type.name, ':', count)
for span_type, count in span_counts.items():
    print(span_type.name, ':', count)
print('Characters', ':', char_count)

This counts the occurrences of each type of block and span, totals the length of the text they contain, and prints a summary at the end.

There is a fair amount to digest here, so let’s break it into parts.

Constructing the Parser

We will come back to the callbacks defined at the beginning of the code. The first step in the actual parsing is to construct a parser:

parser = md4c.GenericParser()

Much like the HTMLRenderer, the constructor accepts options (see Option Flags), either through keyword arguments:

parser = md4c.GenericParser(tables=True,
                            strikethrough=True)

or a positional argument:

parser = md4c.GenericParser(
    md4c.MD_FLAG_TABLES | md4c.MD_FLAG_STRIKETHROUGH)

Only parser options are accepted, since there is no rendering step; this is why the constructor takes a single positional argument.

Actual Parsing

The parse() method does the actual parsing. This is where “SAX-like” comes into play: Rather than producing an abstract syntax tree in memory, MD4C provides a callback interface. As it digests the Markdown document from top to bottom, it calls a callback for any of these five events:

  • Entering a new block

  • Leaving a block

  • Entering a new inline/span

  • Leaving an inline/span

  • Adding text inside the current element

Accordingly, the parse() call in the example above has six parameters. The first is the Markdown document to parse, and the other five are the functions to use as callbacks:

parser.parse(markdown,
             enter_block, leave_block,
             enter_span, leave_span,
             process_text)

The Markdown document may be a str or bytes, whichever is convenient. The parsed text will be provided using the same type.
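
If an application does want a tree, the event stream is enough to rebuild one with a stack, since blocks and spans are entered and left in properly nested order. Here is a minimal sketch of that idea. The TreeBuilder class and its dict-based nodes are hypothetical (they are not part of PyMD4C), and the events are replayed by hand, with plain strings standing in for the BlockType, SpanType, and TextType members, so that the sketch runs on its own:

```python
class TreeBuilder:
    """Rebuild a nested tree from SAX-like enter/leave/text events."""

    def __init__(self):
        # The root node collects the top-level elements.
        self.root = {'type': None, 'details': {}, 'children': []}
        self._stack = [self.root]

    def enter(self, element_type, details):
        # Used for both blocks and spans: open a node and descend into it.
        node = {'type': element_type, 'details': dict(details), 'children': []}
        self._stack[-1]['children'].append(node)
        self._stack.append(node)

    def leave(self, element_type, details):
        # Close the innermost open node.
        self._stack.pop()

    def text(self, text_type, text):
        # Text becomes a leaf of the innermost open node.
        self._stack[-1]['children'].append((text_type, text))


# An assumed event sequence for a heading such as "# Hi *there*".
builder = TreeBuilder()
builder.enter('DOC', {})
builder.enter('H', {'level': 1})
builder.text('NORMAL', 'Hi ')
builder.enter('EM', {})
builder.text('NORMAL', 'there')
builder.leave('EM', {})
builder.leave('H', {'level': 1})
builder.leave('DOC', {})

heading = builder.root['children'][0]['children'][0]
print(heading['type'], heading['details'], len(heading['children']))
```

With a real parser, the same object could be wired in by passing builder.enter and builder.leave for both the block and the span callbacks, and builder.text as the text callback.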

Now, let’s look at how the callbacks work.

Callbacks

Each parse() call requires five callbacks:

enter_block_callback(block_type, details)

Called whenever MD4C enters a new block.

Parameters:
  • block_type – An instance of BlockType

  • details – The details dict

leave_block_callback(block_type, details)

Called whenever MD4C leaves a block.

Parameters:
  • block_type – An instance of BlockType

  • details – The details dict

enter_span_callback(span_type, details)

Called whenever MD4C enters a new span/inline.

Parameters:
  • span_type – An instance of SpanType

  • details – The details dict

leave_span_callback(span_type, details)

Called whenever MD4C leaves a span/inline.

Parameters:
  • span_type – An instance of SpanType

  • details – The details dict

text_callback(text_type, text)

Called whenever MD4C has text to add to the current block or inline element.

Parameters:
  • text_type – An instance of TextType

  • text – A string or bytes containing the text to be added

The first four callbacks work similarly. Each must accept a BlockType or SpanType as its first parameter and a details dict as its second. The details dict, described in the next section, carries additional properties of the element, such as a heading’s level or a link’s destination. If you were writing your own rendering code, these callbacks would write opening or closing HTML tags to the output stream.

The fifth callback accepts a TextType as the first parameter and some text as the second parameter. The text’s type will match that of the original Markdown input (str or bytes). The text is unprocessed; for example, HTML entities are left in &...; form. If you were writing your own rendering function, this callback would write the text to the output stream (potentially after some translation).

Callbacks do not need to return anything specific—their return values are ignored. To cancel parsing, callbacks can raise the StopParsing exception. The parse() method will catch it and immediately halt parsing quietly. All other exceptions raised in callbacks will abort parsing and propagate back to parse()’s caller.
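
To see how the five callbacks cooperate, here is a sketch of a heading collector. HeadingCollector is a hypothetical helper, not part of PyMD4C. It assumes str input and that headings arrive as a block type whose name is 'H'. A small stand-in enum replaces md4c.BlockType so the sketch runs without the library:

```python
from enum import Enum


class HeadingCollector:
    """Collect (level, text) pairs for every heading in a document."""

    def __init__(self):
        self.headings = []
        self._level = None   # Level of the heading being read, if any
        self._parts = []     # Text pieces seen inside that heading

    def enter_block(self, block_type, details):
        if block_type.name == 'H':
            self._level = details['level']
            self._parts = []

    def leave_block(self, block_type, details):
        if block_type.name == 'H':
            self.headings.append((self._level, ''.join(self._parts)))
            self._level = None

    def enter_span(self, span_type, details):
        pass  # Spans inside a heading contribute only their text

    def leave_span(self, span_type, details):
        pass

    def text(self, text_type, text):
        # Keep text only while a heading is open.
        if self._level is not None:
            self._parts.append(text)


# Stand-in for md4c.BlockType (an assumption made for this sketch only).
class FakeBlockType(Enum):
    H = 1
    P = 2


collector = HeadingCollector()
collector.enter_block(FakeBlockType.H, {'level': 2})
collector.text(None, 'Pure ')
collector.text(None, 'Parsing')
collector.leave_block(FakeBlockType.H, {'level': 2})
collector.enter_block(FakeBlockType.P, {})
collector.text(None, 'Body text is ignored.')
collector.leave_block(FakeBlockType.P, {})

print(collector.headings)  # [(2, 'Pure Parsing')]
```

A collector that only needs the first heading could raise the StopParsing exception from leave_block once headings is no longer empty.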

Details Dicts

The block and span callbacks each accept a details dict. This is where extra properties of the block or span are provided. The details available depend on the type of block or span (and for some, it will be empty). Keys will always be strings, and the values to expect are listed in the tables below.

Any block or span type for which there is no table will receive an empty details dict.

Details dict for UL

Key         Value type          Description
'is_tight'  Bool                Whether the list is tight or not
'mark'      Single-char string  The character (*, -, +) used as a bullet point

Details dict for OL

Key               Value type          Description
'start'           Int                 Start index of the ordered list
'is_tight'        Bool                Whether the list is tight or not
'mark_delimiter'  Single-char string  The character ('.' or ')') used as the number delimiter

Details dict for LI

Key                 Value type          Description
'is_task'           Bool                Whether the list item is a task list item
'task_mark'         Single-char string  The character (X, x, space) used to mark the task. Only present if 'is_task' is True.
'task_mark_offset'  Int                 The offset of the task mark character between the []. Only present if 'is_task' is True.

Details dict for H

Key      Value type  Description
'level'  Int         Heading level (1-6)

Details dict for CODE

Key           Value type          Description
'info'        Attribute*          Info string. Only present for fenced code blocks.
'lang'        Attribute*          Language string. Only present for fenced code blocks.
'fence_char'  Single-char string  Fence character (backtick or tilde). None for indented code blocks.

Details dict for TABLE

Key               Value type  Description
'col_count'       Int         Number of columns in the table
'head_row_count'  Int         Number of rows in the table head
'body_row_count'  Int         Number of rows in the table body

Details dict for TH and TD

Key      Value type  Description
'align'  md4c.Align  Cell alignment

Details dict for A

Key      Value type  Description
'href'   Attribute*  Link URL
'title'  Attribute*  Link title

Details dict for IMG

Key      Value type  Description
'src'    Attribute*  Image URL
'title'  Attribute*  Image title

Details dict for WIKILINK

Key       Value type  Description
'target'  Attribute*  Wikilink target

* Attribute values are described below.
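
As an illustration of reading these dicts, the sketch below turns a block type name and its details into an HTML opening tag. The open_tag function is a hypothetical helper, not part of PyMD4C. It takes the block type's name as a string so that it runs without the library, it assumes str input, and the task-list-item class name is an arbitrary choice. Only the dict keys come from the tables above.

```python
def open_tag(block_name, details):
    """Return an HTML opening tag for a few block types."""
    if block_name == 'H':
        return '<h{}>'.format(details['level'])
    if block_name == 'OL':
        # Only emit a start attribute when the list does not begin at 1.
        if details['start'] != 1:
            return '<ol start="{}">'.format(details['start'])
        return '<ol>'
    if block_name == 'LI':
        # 'task_mark' is only present when 'is_task' is True.
        if details['is_task']:
            checked = details['task_mark'] in ('x', 'X')
            box = '<input type="checkbox" disabled{}>'.format(
                ' checked' if checked else '')
            return '<li class="task-list-item">' + box
        return '<li>'
    # Simplification: other block types get a plain lowercase tag.
    return '<{}>'.format(block_name.lower())


print(open_tag('H', {'level': 3}))
print(open_tag('OL', {'start': 4, 'is_tight': True, 'mark_delimiter': '.'}))
print(open_tag('LI', {'is_task': True, 'task_mark': 'x', 'task_mark_offset': 3}))
print(open_tag('UL', {'is_tight': False, 'mark': '-'}))
```

Inside an enter_block callback, this could be called as open_tag(block_type.name, details).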

Attributes

MD4C uses “attributes” for details that are text, such as link URLs and fenced code block info strings. Such text cannot contain span/inline elements, but it may contain HTML entities or null characters, and an attribute is how MD4C represents that mix.

PyMD4C represents attributes as either None or a list of 2-tuples (text_type, text) where text_type is a member of TextType and text is the actual text as a str or bytes (whichever one the Markdown input was).

For example, this string:

Copyright &copy; John Doe

would be represented as an attribute like this:

[(md4c.TextType.NORMAL, 'Copyright '),
 (md4c.TextType.ENTITY, '&copy;'),
 (md4c.TextType.NORMAL, ' John Doe')]

Currently, the only TextType types allowed in an attribute are NORMAL, ENTITY, and NULLCHAR.
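
Code that only needs the plain text of an attribute can flatten the list. The following is a minimal sketch under some assumptions: attribute_to_str is a hypothetical helper, the input is str rather than bytes, entities are decoded with the standard library's html.unescape() instead of PyMD4C's own helper, and null characters are replaced with U+FFFD. A stand-in enum replaces md4c.TextType so the sketch runs without the library.

```python
import html
from enum import Enum


# Stand-in for md4c.TextType (member values here are arbitrary).
class FakeTextType(Enum):
    NORMAL = 0
    NULLCHAR = 1
    ENTITY = 2


def attribute_to_str(attribute):
    """Flatten an attribute (None or a list of 2-tuples) into one str."""
    if attribute is None:
        return ''
    parts = []
    for text_type, text in attribute:
        if text_type.name == 'ENTITY':
            # Decode the entity with the standard library.
            parts.append(html.unescape(text))
        elif text_type.name == 'NULLCHAR':
            # Substitute the Unicode replacement character.
            parts.append('\ufffd')
        else:
            parts.append(text)
    return ''.join(parts)


title = [(FakeTextType.NORMAL, 'Copyright '),
         (FakeTextType.ENTITY, '&copy;'),
         (FakeTextType.NORMAL, ' John Doe')]
print(attribute_to_str(title))
```

Because the function dispatches on text_type.name, it should work unchanged with real TextType members, provided their names match those listed above.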

Entity Helper

PyMD4C provides a helper function lookup_entity() to assist with translating HTML entities to their corresponding UTF-8 character(s):

import md4c

md4c.lookup_entity('&lt;')  # Returns '<'

Object-Oriented Parsing

PyMD4C provides a more object-oriented wrapper for GenericParser for applications which might find that useful: the ParserObject class. This is a base class that defines the five callbacks as member functions.

To use it, define a subclass that overrides the callback methods as necessary. The constructor accepts the same arguments as GenericParser (unless it is overridden). Then, you can call the parse() method, which only requires the Markdown input as an argument.

Here is the same example program as above, implemented using a ParserObject instead of GenericParser:

import md4c

class CountingParser(md4c.ParserObject):
    def __init__(self, *args, **kwargs):
        # Pass parser options to ParserObject
        super().__init__(*args, **kwargs)

        self.block_counts = dict()
        self.span_counts = dict()
        self.char_count = 0

    def enter_block(self, block_type, details):
        try:
            self.block_counts[block_type] += 1
        except KeyError:
            self.block_counts[block_type] = 1

    def enter_span(self, span_type, details):
        try:
            self.span_counts[span_type] += 1
        except KeyError:
            self.span_counts[span_type] = 1

    def text(self, text_type, text):
        self.char_count += len(text)

# Read the document as text; the explicit encoding avoids locale surprises
with open('README.md', 'r', encoding='utf-8') as f:
    markdown = f.read()

parser = CountingParser()
parser.parse(markdown)

for block_type, count in parser.block_counts.items():
    print(block_type.name, ':', count)
for span_type, count in parser.span_counts.items():
    print(span_type.name, ':', count)
print('Characters', ':', parser.char_count)

Notice that with this approach the counts can be instance variables instead of global variables, and the callbacks for leaving blocks and spans can be omitted entirely because the example does not need them.

For more information, see the ParserObject API.