Architecture of MMMD
Prelude
It needs to be stressed that making the parser modular while keeping it relatively simple was a laborious undertaking. There has not been a standard more hostile towards the people who dare attempt to implement it than CommonMark. It should also be noted that, despite being titled a "Standard" in this document, it is less widely adopted than the GitHub Flavored Markdown syntax. GitHub Flavored Markdown, however, is merely a subset of this parser's model, albeit one requiring a few extensions.
Current state (as of March 02, 2025)
This parser processes text in what can be boiled down to three phases.
- Block/Line phase
- Overlay phase
- Inline phase
It should be noted that all phases have their own related parser classes, and a shared behaviour system, where each parser takes control at some point, and may win ambiguous cases by having higher priority (see the #define_child and #define_overlay methods for the priority parameter).
Block/Line phase
The first phase breaks down blocks, line by line, into block structures. Blocks (preferably inherited from the Block class) can contain other blocks (e.g. QuoteBlock, ULBlock, OLBlock). Other blocks (known as leaf blocks) may not contain anything else (except inline content, more on that later).
Blocks are designed to be parsed independently. This means that it should be possible to tear out any standard block and make it not get parsed. This, however, isn't thoroughly tested for.
Blocks as proper, real classes have a certain lifecycle to follow when being constructed:
- Open condition
  - A block needs to find its first marker on the current line to open (see the #begin? method)
  - Once it's open, it's immediately initialized and fed the line it just read (but now as an object, not as a class) (see the #consume method)
- Marker/Line consumption
  - While it should be kept open, the block parser instance will keep reading input through the #consume method, returning a pair of the modified line (after consuming its tokens from it) and a boolean value indicating permission of lazy continuation (if it's a block like a QuoteBlock or ULBlock that can be lazily overflowed). Every line the parser needs to record needs to be pushed through the #push method.
- Closure
  - If the current line no longer belongs to the current block (if the block should have been closed on the previous line), it simply needs to return a pair of nil and a boolean value for permission of lazy continuation
  - If a block should be closed on the current line, it should capture it, keep track of the "closed" state, then return nil on the next call of #consume
  - Once a block is closed, it:
    - Receives its content from the parser
    - Parser receives the "close" method call
    - (optional) Parser may have a callable method #applyprops. If it exists, it gets called with the current constructed block.
    - (optional) All overlays assigned to this block's class are processed on the contents of this block (more on that in Overlay phase)
    - (optional) Parser may return a different class, which the current block should be cast into (Overlays may change the class as well)
    - (optional) If a block can respond to the #parse_inner method, it will get called, allowing the block to parse its own contents.
    - After this point, the block is no longer touched until the document fully gets processed.
- Inline processing
  - (Applies only to Paragraph and any child of LeafBlock) When the document gets fully processed, the contents of the current block are taken, assigned to an InlineRoot instance, and then parsed in Inline mode
- Completion
  - The resulting document is then returned.
While there is a lot of functionality available in designing blocks, it is not necessary for the simplest of the block kinds available. The simplest example of a block parser is likely the ThematicBreakParser class, which implements only the two methods needed for a block parser to function.
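The lifecycle above can be illustrated with a minimal sketch of a thematic-break-style block parser. This is a toy stand-in and not the actual ThematicBreakParser: the class name, the regular expression, the constructor, and the exact signatures of begin?, consume and push (including the lazy: and parent: keywords) are assumptions made for illustration.

```ruby
# Toy block parser sketch; NOT PointBlank's ThematicBreakParser.
# All signatures are assumptions based on the lifecycle described above.
class ToyThematicBreakParser
  # Three or more matching -, _ or * characters, optionally separated by blanks.
  MARKER = /\A {0,3}([-_*])(?:[ \t]*\1){2,}[ \t]*\z/

  attr_reader :lines

  # Open condition: called on the class, before any instance exists.
  def self.begin?(line)
    MARKER.match?(line.chomp)
  end

  def initialize
    @lines = []
    @closed = false
  end

  # Records a line that belongs to this block.
  def push(line)
    @lines << line
  end

  # Returns [remaining_line, lazy_continuation_allowed].
  # A nil first element signals that the line is not part of this block.
  def consume(line, lazy: false, parent: nil)
    return [nil, false] if @closed

    push(line)
    @closed = true # a thematic break spans exactly one line
    ['', false]
  end
end

# Hypothetical driver: open on the first matching line, then feed lines.
input = ["---\n", "next paragraph\n"]
parser = ToyThematicBreakParser.new if ToyThematicBreakParser.begin?(input[0])
first = parser.consume(input[0])   # line captured, block marks itself closed
second = parser.consume(input[1])  # nil line: the block ended on the previous line
```

The block captures its only line on the first call and reports closure on the next one, which mirrors the "capture, remember the closed state, return nil on the next call" rule from the Closure step.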
While parsing text, a block may use additional info:
- In the consume method: the lazy hash argument, set if the current line is being processed in lazy continuation mode (likely only ever matters for Paragraph); and parent, the parent block containing this block.
Block interpretations are tried in decreasing order of their priority value, as applied using the #define_child method.
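The ordering rule can be sketched as follows. This is a toy registry, assuming that define_child takes a parser class and a numeric priority; the real signature may differ, and every class name below is hypothetical.

```ruby
# Toy registry sketch; define_child's real signature in PointBlank may differ.
class ToyBlock
  # Each class keeps its own list of [parser, priority] pairs.
  def self.children
    @children ||= []
  end

  # Assumed signature: a parser class plus a numeric priority.
  def self.define_child(parser, priority = 0)
    children << [parser, priority]
  end

  # Candidate parsers sorted by decreasing priority value.
  def self.candidates
    children.sort_by { |_parser, priority| -priority }.map(&:first)
  end

  # First parser (in priority order) whose open condition accepts the line.
  def self.parser_for(line)
    candidates.find { |parser| parser.begin?(line) }
  end
end

class ToyParagraphParser
  def self.begin?(line)
    !line.strip.empty?
  end
end

class ToyHeadingParser
  def self.begin?(line)
    line.start_with?('#')
  end
end

class ToyDocument < ToyBlock
  define_child ToyParagraphParser, 0
  define_child ToyHeadingParser, 100
end

ToyDocument.parser_for('# Title')    # => ToyHeadingParser
ToyDocument.parser_for('plain text') # => ToyParagraphParser
```

Both toy parsers would accept the line "# Title", so the priority value is what resolves the ambiguity in favour of the heading.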
For blocks to be properly indexed, they need to be a valid child or a valid descendant (meaning reachable through child chain) of the Document class.
Overlay phase
The overlay phase doesn't start at a specific point in time. Rather, it happens for every block individually, when that block closes.
The overlay mechanism can be applied to any DOMObject type, so long as its close method is called at some point (this may not be of interest to people that do not implement custom syntax, as it generally translates to "only block level elements get their overlays processed").
The overlay mechanism provides the ability to perform some action on the block right after it gets closed and right before it gets interpreted by the inline phase. Overlays may do the following:
- Change the block's class (by returning a class from the #process method)
- Change the block's content (by directly editing it)
- Change the block's properties (by modifying its properties hash)
Overlay interpretations are tried in decreasing order of their priority value, as defined using the #define_overlay method.
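A minimal sketch of the three things an overlay may do, under stated assumptions: the class names are hypothetical, define_overlay is assumed to take an overlay class and a numeric priority, and this toy runs every overlay and keeps the last class returned (the real parser may stop earlier or cast the block differently).

```ruby
# Toy overlay sketch; the real #process and #define_overlay signatures may differ.
class ToyHeading; end

class ToyParagraph
  attr_accessor :content, :properties

  def initialize(content)
    @content = content
    @properties = {}
  end

  def self.overlays
    @overlays ||= []
  end

  # Assumed signature: an overlay class plus a numeric priority.
  def self.define_overlay(overlay, priority = 0)
    overlays << [overlay, priority]
  end

  # Runs overlays in decreasing priority order and returns the class the
  # block should become (its own class when no overlay asks for a change).
  def close
    ordered = self.class.overlays.sort_by { |_overlay, priority| -priority }
    ordered.reduce(self.class) do |klass, (overlay, _priority)|
      result = overlay.process(self)
      result.is_a?(Class) ? result : klass
    end
  end
end

# Setext-style overlay: a trailing line of "=" turns the paragraph into a heading.
class ToyUnderlineOverlay
  def self.process(block)
    lines = block.content.lines.map(&:chomp)
    return nil unless lines.size > 1 && lines.last.match?(/\A=+\z/)

    block.content = lines[0...-1].join("\n") # change the content
    block.properties[:level] = 1             # change the properties
    ToyHeading                               # request a class change
  end
end

ToyParagraph.define_overlay ToyUnderlineOverlay, 10

block = ToyParagraph.new("Title\n=====")
new_class = block.close
```

After close, the toy block holds only the heading text and a level property, and the returned class tells the caller what the block should be cast into.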
Inline phase
Once all blocks have been processed, and all overlays have been applied to their respective block types, the hook in the Document class's #parser method executes the inline parsing phase of all leaf blocks (descendants of the Leaf class) and paragraphs.
The outer class encompassing all inline children of a block is InlineRoot. As such, if an inline element is to ever appear within the text, it needs to be reachable as a child or a descendant of InlineRoot.
Inline parsing works in three parts:
- First, the contents are tokenized (every parser marks its own tokens)
- Second, the forward walk procedure is called
- Third, the reverse walk procedure is called
This process is repeated for every group of parsers with equal priority. Only parsers of equal priority run in the same step. The process then moves on to the next step, which runs the parsers with the next higher priority value. As counter-intuitive as this is, a higher priority value here means a lower priority.
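The stepping order can be sketched like this. The parser names, the priority values, and the Struct used to hold them are all hypothetical; only the grouping and ordering rule comes from the description above.

```ruby
# Toy scheduling sketch; names and priority values are illustrative only.
ToyInlineParser = Struct.new(:name, :priority)

parsers = [
  ToyInlineParser.new(:emphasis, 20),
  ToyInlineParser.new(:code, 10),
  ToyInlineParser.new(:link, 20)
]

# Group parsers by priority value; groups with lower values run first.
steps = parsers.group_by(&:priority).sort_by { |priority, _group| priority }

order = steps.map do |priority, group|
  # In the real parser, tokenization and both walks would run here for this group.
  [priority, group.map(&:name)]
end
# order => [[10, [:code]], [20, [:emphasis, :link]]]
```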
At the very end of the process, the remaining strings are concatenated within the mixed array of inlines and strings, and turned into Text nodes, after which the contents of the array are appended as children to the root node.
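A minimal sketch of this final merge, with hypothetical ToyText and ToyInline structs standing in for the real node classes. It additionally assumes that a token which was never matched falls back to its literal text, which the description above does not state.

```ruby
# Toy stand-ins for the real DOM node classes.
ToyText = Struct.new(:content)
ToyInline = Struct.new(:children)

# Joins runs of adjacent strings and wraps each run in a ToyText node;
# already-built inline elements pass through untouched.
def merge_strings(parts)
  # Assumption: an unmatched token falls back to its literal text (first element).
  flattened = parts.map { |part| part.is_a?(Array) ? part.first : part }
  flattened.chunk_while { |a, b| a.is_a?(String) && b.is_a?(String) }.flat_map do |run|
    run.first.is_a?(String) ? [ToyText.new(run.join)] : run
  end
end

mixed = ['plain ', ['_', :toy_parser, :open], 'text ', ToyInline.new([]), ' tail']
merged = merge_strings(mixed)
# merged holds ToyText("plain _text "), the ToyInline, then ToyText(" tail")
```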
This process is recursively applied to all elements which may have child elements. This is ensured when an inline parser calls the "build" utility method.
The inline parser is a class that implements static methods tokenize and either forward_walk or reverse_walk. Both may be implemented at the same time, but this isn't advisable.
The tokenization process is characterized by calling every parser in the current group with every string in tokens array using the tokenize method. It is expected that the parser breaks the string down into an array of other strings and tokens. A token is an array where the first element is the literal text representation of the token, the second value is the class of the parser, and the last value (not third) is the :close or :open symbol (though functionally it may hold any symbol value). Any additional information the parser may need in later stages may be stored between the last element and the second element.
Example:
Input:
"_this _is a string of_ tokens_"
Output:
[["_", ::PointBlank::Parsing::EmphInline, :open],
"this ",
["_", ::PointBlank::Parsing::EmphInline, :open],
"is a string of",
["_", ::PointBlank::Parsing::EmphInline, :close],
" tokens",
["_", ::PointBlank::Parsing::EmphInline, :close]]
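A tokenizer that produces the output above could be sketched as follows. ToyEmphInline is a hypothetical stand-in for EmphInline, and the open/close rule used here (an underscore at the start of the string or after whitespace opens, any other underscore closes) is a deliberate simplification, not the CommonMark flanking rules.

```ruby
# Toy tokenizer sketch; NOT the real EmphInline implementation.
class ToyEmphInline
  # Splits a string into plain strings and [text, parser, kind] tokens.
  def self.tokenize(string)
    parts = []
    buffer = +''
    string.each_char.with_index do |char, index|
      if char == '_'
        parts << buffer unless buffer.empty?
        buffer = +''
        # Simplified rule: "_" at the start or after whitespace opens.
        opens = index.zero? || string[index - 1].match?(/\s/)
        parts << ['_', self, opens ? :open : :close]
      else
        buffer << char
      end
    end
    parts << buffer unless buffer.empty?
    parts
  end
end

tokens = ToyEmphInline.tokenize('_this _is a string of_ tokens_')
```

With this input the result has the same shape as the example: two opening tokens, then two closing tokens, with the plain strings kept in between.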
The forward walk is characterized by calling parsers which implement the #forward_walk method. When the main class encounters an opening token in forward_walk, it will call the #forward_walk method of the class that represents this token. It is expected that the parser class will then attempt to build the first available occurrence of the inline element it represents, after which it will return the array of all tokens and strings that it was passed, where the first element will be the newly constructed inline element. If it is unable to close the inline element, it should simply return the original contents, unmodified.
Example:
Original text:
this is outside the inline `this is inside the inline` and this
is right after the inline `and this is the next inline`
Input:
[["`", ::PointBlank::Parsing::CodeInline, :open],
"this is inside the inline",
["`", ::PointBlank::Parsing::CodeInline, :close],
" and this is right after the inline ",
["`", ::PointBlank::Parsing::CodeInline, :open],
"and this is the next inline",
["`", ::PointBlank::Parsing::CodeInline, :close]]
Output:
[<::PointBlank::DOM::InlineCode
@content = "this is inside the inline">,
" and this is right after the inline ",
["`", ::PointBlank::Parsing::CodeInline, :open],
"and this is the next inline",
["`", ::PointBlank::Parsing::CodeInline, :close]]
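The example above can be reproduced with a small sketch. ToyCodeInline and ToyInlineCode are hypothetical stand-ins for the real parser and DOM classes, and the sketch assumes the first element of the array is always an opening token of this parser.

```ruby
# Toy forward walk sketch; NOT the real CodeInline implementation.
ToyInlineCode = Struct.new(:content)

class ToyCodeInline
  # parts[0] is assumed to be an opening token belonging to this parser.
  def self.forward_walk(parts)
    closing = parts.index do |part|
      part.is_a?(Array) && part[1] == self && part.last == :close
    end
    return parts if closing.nil? # no closer: hand the input back untouched

    # Everything between opener and closer becomes literal code text.
    inner = parts[1...closing].map { |part| part.is_a?(Array) ? part.first : part }
    [ToyInlineCode.new(inner.join), *parts[(closing + 1)..]]
  end
end

tick = ->(kind) { ['`', ToyCodeInline, kind] }
input = [tick.(:open), 'this is inside the inline', tick.(:close),
         ' and this is right after the inline ',
         tick.(:open), 'and this is the next inline', tick.(:close)]
output = ToyCodeInline.forward_walk(input)
```

Only the first occurrence is built; the second pair of tokens is left in place for a later call, as in the example output.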
The reverse walk is characterized by calling parsers which implement the #reverse_walk method when the main class encounters a closing token for this class (the one that contains the :close symbol in the last position of the token information array). After that the main class will call the parser's #reverse_walk method with the current list of tokens, inlines and strings. It is expected that the parser will then collect all the blocks, strings and inlines that fit within the block closed by the last element in the list, and once it encounters the appropriate opening token for the closing token in the last position of the array, it will then replace the elements fitting within that inline with a class containing all the collected elements. If it is unable to find a matching opening token for the closing token in the last position, it should simply return the original contents, unmodified.
Example:
Original text:
blah blah something something lots of text before the emphasis
_this is emphasized `and this is an inline` but it's still
emphasized_
Input:
["blah blah something something lots of text before the emphasis",
["_", ::PointBlank::Parsing::EmphInline, :open],
"this is emphasized ",
<::PointBlank::DOM::InlineCode
@content = "and this is an inline">,
" but it's still emphasized",
["_", ::PointBlank::Parsing::EmphInline, :close]]
Output:
["blah blah something something lots of text before the emphasis",
<::PointBlank::DOM::InlineEmphasis,
children = [...,
<::PointBlank::DOM::InlineCode ...>,
...]>]
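A matching sketch for the reverse walk, again with hypothetical names (ToyEmphWalker, ToyEmphasis, ToyCodeSpan). It assumes the last element of the array is always a closing token of this parser and that the nearest preceding opening token is the one to pair with.

```ruby
# Toy reverse walk sketch; NOT the real EmphInline implementation.
ToyEmphasis = Struct.new(:children)
ToyCodeSpan = Struct.new(:content)

class ToyEmphWalker
  # parts.last is assumed to be a closing token belonging to this parser.
  def self.reverse_walk(parts)
    opening = parts.rindex do |part|
      part.is_a?(Array) && part[1] == self && part.last == :open
    end
    return parts if opening.nil? # no opener: hand the input back untouched

    # Everything between the opener and the trailing closer becomes children.
    children = parts[(opening + 1)...-1]
    [*parts[0...opening], ToyEmphasis.new(children)]
  end
end

emph = ->(kind) { ['_', ToyEmphWalker, kind] }
code = ToyCodeSpan.new('and this is an inline')
input = ['text before the emphasis ', emph.(:open), 'this is emphasized ',
         code, " but it's still emphasized", emph.(:close)]
output = ToyEmphWalker.reverse_walk(input)
```

The already-built code span is carried over as a child of the new emphasis element, while the text before the opening token stays where it was.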
Both #forward_walk and #reverse_walk are not restricted to making just the changes discussed above, and can arbitrarily modify the token arrays. That, however, should be done with great care, so as to not accidentally break compatibility with other parsers.
To ensure that the collected tokens in #reverse_walk and #forward_walk are processed correctly, the collected arrays of tokens, blocks and inlines should be built into an object that represents this parser using the build method (it will automatically attempt to find the correct class to construct using the #define_parser directive in the DOMObject subclass definition).