Add Architrecture
parent
ace9815864
commit
e470c1da72

Architecture of madness
=======================

Prelude
-------

It needs to be stressed that making the parser modular while keeping it
relatively simple was a laborious undertaking. There has not been a standard
more hostile towards the people who dare attempt to implement it than
CommonMark. It should also be noted that, despite being titled a
"Standard" in this document, it is less widely adopted than the GitHub
Flavored Markdown syntax. GitHub Flavored Markdown, however, is merely
a subset of this parser's model, albeit one requiring a few extensions.

Current state (as of March 02, 2025)
------------------------------------

This parser processes text in what can be boiled down to three phases:

- Block/Line phase
- Overlay phase
- Inline phase

All phases have their own related parser classes and a shared behaviour
system, where each parser takes control at some point and may win
ambiguous cases by having higher priority (see the priority parameter of
the `#define_child` and `#define_overlay` methods).


### Block/Line phase ###

The first phase breaks down blocks, line by line, into block structures.
Some blocks (preferably inherited from the Block class) can contain other
blocks (e.g. QuoteBlock, ULBlock, OLBlock). Other blocks (known as leaf
blocks) may not contain anything else (except inline content, more on that
later).

Blocks are designed to be parsed independently. This means that it *should*
be possible to tear out any standard block so that it no longer gets parsed.
This, however, isn't thoroughly tested for.


Blocks, as proper, real classes, have a certain lifecycle to follow when
being constructed:

1. Open condition
   - A block needs to find its first marker on the current line to open
     (see the `#begin?` method).
   - Once it's open, it is immediately initialized and fed the line it just
     read, now as an object rather than as a class (see the `#consume`
     method).
2. Marker/Line consumption
   - While it should be kept open, the block parser instance keeps
     reading input through the `#consume` method, returning a pair
     of the modified line (after consuming its tokens from it) and
     a boolean value indicating permission for lazy continuation
     (relevant for blocks like QuoteBlock or ULBlock that can be lazily
     overflowed).
     Every line the parser needs to record must be pushed
     through the `#push` method.
3. Closure
   - If the current line no longer belongs to the current block
     (that is, the block should have been closed on the previous line),
     the block simply needs to `return` a pair of `nil` and a boolean value
     for permission of lazy continuation.
   - If a block should be closed on the current line, it should capture
     that line, keep track of the "closed" state, then `return` `nil` on
     the next call of `#consume`.
   - Once a block is closed:
     1. The block receives its content from the parser.
     2. The parser receives the "close" method call.
     3. (optional) The parser may have a callable method `#applyprops`. If
        it exists, it gets called with the currently constructed block.
     4. (optional) All overlays assigned to this block's class are
        processed on the contents of this block (more on that in
        Overlay phase).
     5. (optional) The parser may return a different class, which
        the current block should be cast into (overlays may change
        the class as well).
     6. (optional) If the block responds to the `#parse_inner` method, it
        gets called, allowing the block to parse its own contents.
   - After this point, the block is no longer touched until the document
     is fully processed.
4. Inline processing
   - (Applies only to Paragraph and any child of LeafBlock.)
     When the document is fully processed, the contents of the current
     block are taken, assigned to an InlineRoot instance, and then parsed
     in Inline mode.
5. Completion
   - The resulting document is then returned.

While there is a lot of functionality available for designing blocks, it is
not necessary for the simplest of the block kinds available. The simplest
example of a block parser is likely the ThematicBreakParser class, which
implements the only 2 methods needed for a block parser to function.

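As a rough illustration of the lifecycle above, here is a minimal sketch of a single-line block parser. It is not the actual ThematicBreakParser: the class name, the regular expression, and the exact signatures of `begin?`, `consume` and `push` are assumptions made for this example, based only on the description in this document.

```ruby
# Hypothetical stand-in for a single-line block parser; this is NOT
# PointBlank's actual ThematicBreakParser, and the signatures are assumed.
class SketchThematicBreakParser
  # Three or more matching -, * or _ characters, optionally space-separated.
  MARKER = /\A {0,3}([-*_])(?: *\1){2,} *\z/

  # Open condition: checked on the class, before any instance exists.
  def self.begin?(line)
    MARKER.match?(line.chomp)
  end

  attr_reader :lines

  def initialize
    @closed = false
    @lines = []
  end

  # Records a line that belongs to this block.
  def push(line)
    @lines << line
  end

  # Returns [remaining_line, lazy_continuation_allowed].
  # A nil first element tells the caller that the line is not ours.
  def consume(line, parent = nil, lazy: false)
    return [nil, false] if @closed

    @closed = true # the block ends on the very line that opened it
    push(line)
    ["", false]
  end
end
```

The first `consume` call captures the marker line and remembers the "closed" state; the second call returns `nil`, which matches the closure rule described in step 3.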

While parsing text, a block may use additional info:

- In the `#consume` method: the `lazy` hasharg, set if the current line is
  being processed in lazy continuation mode (likely only ever matters for
  Paragraph); and `parent`, the parent block containing this block.

Block interpretations are tried in decreasing order of their priority
value, as applied using the `#define_child` method.

For blocks to be properly indexed, they need to be a valid child or
a valid descendant (meaning reachable through the child chain) of the
Document class.


### Overlay phase ###

The Overlay phase doesn't start at some specific point in time. Rather,
it happens for every block individually, when that block closes.

The overlay mechanism can be applied to any DOMObject type, so long as its
close method is called at some point (this may not be of interest to
people who do not implement custom syntax, as it generally translates
to "only block level elements get their overlays processed").

The overlay mechanism provides the ability to perform some action on the
block right after it gets closed and right before it gets interpreted by
the inline phase. Overlays may do the following:

- Change the block's class
  (by returning a class from the `#process` method)
- Change the block's content (by directly editing it)
- Change the block's properties (by modifying its `properties` hash)

Overlay interpretations are tried in decreasing order of their priority
value, as defined using the `#define_overlay` method.

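The three overlay capabilities listed above can be sketched as follows. Everything here is hypothetical: `SketchBlock`, `SketchParagraph`, `SketchNote`, `NoteOverlay` and `apply_overlays` are stand-ins invented for illustration, and the sketch assumes that every overlay is tried in turn, which the real implementation may not do.

```ruby
# Hypothetical stand-ins; PointBlank's real DOM classes are not used here.
SketchBlock = Struct.new(:content, :properties)
class SketchParagraph; end
class SketchNote; end

class NoteOverlay
  PREFIX = "NOTE: "

  # Returns a replacement class, or nil to leave the class unchanged.
  def process(block)
    return nil unless block.content.start_with?(PREFIX)

    block.content = block.content.delete_prefix(PREFIX) # edit the content
    block.properties[:kind] = "note"                     # edit the properties
    SketchNote                                           # change the class
  end
end

# Tries overlays from the highest to the lowest priority value and returns
# the class the block should end up with (assumed driver, for illustration).
def apply_overlays(block, klass, overlays)
  overlays.sort_by { |entry| -entry[:priority] }.each do |entry|
    replacement = entry[:overlay].process(block)
    klass = replacement if replacement
  end
  klass
end

block = SketchBlock.new("NOTE: read this first", {})
final_class = apply_overlays(block, SketchParagraph,
                             [{ overlay: NoteOverlay.new, priority: 10 }])
```

In this sketch `final_class` ends up as `SketchNote`, the prefix is removed from the content, and the `properties` hash gains a `:kind` entry.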

### Inline phase ###

Once all blocks have been processed, and all overlays have been applied
to their respective block types, the hook in the Document class's
`#parser` method executes the inline parsing phase of all leaf blocks
(descendants of the `Leaf` class) and paragraphs.

The outer class encompassing all inline children of a block is
`InlineRoot`. As such, if an inline element is to ever appear within the
text, it needs to be reachable as a child or a descendant of InlineRoot.

Inline parsing works in three parts:

- First, the contents are tokenized (every parser marks its own tokens)
- Second, the forward walk procedure is called
- Third, the reverse walk procedure is called

This process is repeated for every group of parsers with equal priority.
At any one time, only parsers of equal priority run in the same step.
Then the process moves on to the next step, with the parsers of the next
higher priority value. As counter-intuitive as this is, this means that
it moves on to the parsers of _lower_ priority.

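The stepping order described above can be sketched with a small grouping helper. The registry contents, the priority numbers and the `parser_steps` name are all made up for this example; only the ordering rule (smaller priority value runs first) comes from the text.

```ruby
# Hypothetical registry entries: [parser name, priority value].
REGISTRY = [
  [:CodeInline, 0],
  [:EmphInline, 2],
  [:LinkInline, 1],
  [:StrongInline, 2]
].freeze

# Groups parsers by priority value and orders the groups so that the
# smallest value (that is, the highest priority) runs first.
def parser_steps(registry)
  registry.group_by { |_name, priority| priority }
          .sort_by { |priority, _entries| priority }
          .map { |_priority, entries| entries.map(&:first) }
end

steps = parser_steps(REGISTRY)
# steps == [[:CodeInline], [:LinkInline], [:EmphInline, :StrongInline]]
```

Each inner array is one step: tokenization, the forward walk and the reverse walk would run for that group before the next group is considered.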

At the very end of the process, the remaining strings are concatenated
within the mixed array of inlines and strings, and turned into Text
nodes, after which the contents of the array are appended as children to
the root node.

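A minimal sketch of that final concatenation step, assuming the array holds only plain strings and already-built inline objects by this point. `SketchText`, `SketchInline` and `finalize_children` are hypothetical names, not part of the parser.

```ruby
SketchText = Struct.new(:content) # hypothetical stand-in for a Text node
SketchInline = Struct.new(:name)  # hypothetical stand-in for any built inline

# Joins each run of adjacent strings and wraps it in a text node, leaving
# already-built inline objects where they are.
def finalize_children(parts)
  parts.chunk_while { |a, b| a.is_a?(String) && b.is_a?(String) }
       .map { |run| run.first.is_a?(String) ? SketchText.new(run.join) : run.first }
end

children = finalize_children(["plain ", "text ", SketchInline.new(:code), " tail"])
```

The resulting `children` array is what would be appended to the root node: one text node per run of strings, with inlines kept in their original positions.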

This process is recursively applied to all elements which may have child
elements. This is ensured when an inline parser calls the "build"
utility method.

An inline parser is a class that implements the static methods `tokenize`
and either `forward_walk` or `reverse_walk`. Both may be implemented at
the same time, but this isn't advisable.

The tokenization process consists of calling every parser in the
current group with every string in the tokens array using the `tokenize`
method. The parser is expected to break the string down into an
array of other strings and tokens. A token is an array where the first
element is the literal text representation of the token, the second
element is the class of the parser, and the _last_ element (_not third_) is
the `:close` or `:open` symbol (though functionally it may hold any
symbol value). Any additional information the parser may need in later
stages may be stored between the second element and the last element.

Example:

Input:

    "_this _is a string of_ tokens_"

Output:

    [["_", ::PointBlank::Parsing::EmphInline, :open],
     "this ",
     ["_", ::PointBlank::Parsing::EmphInline, :open],
     "is a string of",
     ["_", ::PointBlank::Parsing::EmphInline, :close],
     " tokens",
     ["_", ::PointBlank::Parsing::EmphInline, :close]]

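A `tokenize` method that produces the token layout from the example above might look roughly like this. `EmphSketch` is a hypothetical stand-in for the real emphasis parser, and the open/close rule used here is a deliberate simplification, not the CommonMark flanking rules.

```ruby
# Hypothetical emphasis tokenizer; EmphSketch stands in for the real
# parser class and is not part of PointBlank.
class EmphSketch
  # Splits a string into plain strings and [text, parser, kind] tokens.
  def self.tokenize(string)
    parts = []
    pieces = string.split(/(_)/) # the capture group keeps the delimiters
    pieces.each_with_index do |piece, index|
      next if piece.empty?

      if piece == "_"
        following = pieces[index + 1]
        # Assumption: a marker followed by a non-space character opens,
        # any other marker closes.
        kind = following && following.match?(/\A\S/) ? :open : :close
        parts << ["_", self, kind]
      else
        parts << piece
      end
    end
    parts
  end
end

tokens = EmphSketch.tokenize("_this _is a string of_ tokens_")
```

With the input from the example, `tokens` has the same shape as the output shown above, with `EmphSketch` in place of the parser class.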

The forward walk consists of calling parsers which implement the
`#forward_walk` method. When the main class encounters an opening token
during the forward walk, it calls the `#forward_walk` method of the class
that represents this token. The parser class is expected to attempt to
build the first available occurrence of the inline element it represents,
and then return the array of all tokens and strings that it was passed,
in which the consumed tokens and strings are replaced by the newly
constructed inline element, which becomes the first element. If it is
unable to close the element, it should simply return the original
contents, unmodified.

Example:

Original text:

    this is outside the inline `this is inside the inline` and this
    is right after the inline `and this is the next inline`

Input:

    [["`", ::PointBlank::Parsing::CodeInline, :open],
     "this is inside the inline",
     ["`", ::PointBlank::Parsing::CodeInline, :close],
     " and this is right after the inline ",
     ["`", ::PointBlank::Parsing::CodeInline, :open],
     "and this is the next inline",
     ["`", ::PointBlank::Parsing::CodeInline, :close]]

Output:

    [<::PointBlank::DOM::InlineCode
      @content = "this is inside the inline">,
     " and this is right after the inline ",
     ["`", ::PointBlank::Parsing::CodeInline, :open],
     "and this is the next inline",
     ["`", ::PointBlank::Parsing::CodeInline, :close]]

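The forward walk contract described above can be sketched as follows. `CodeSketch` and `SketchInlineCode` are hypothetical stand-ins, and the sketch assumes that only strings and tokens sit between the opening and closing markers; the real parser would go through the `build` utility instead of constructing the object directly.

```ruby
SketchInlineCode = Struct.new(:content) # hypothetical DOM stand-in

class CodeSketch
  # parts[0] is expected to be this parser's opening token.
  def self.forward_walk(parts)
    closer = (1...parts.length).find do |i|
      part = parts[i]
      part.is_a?(Array) && part[1] == self && part.last == :close
    end
    return parts unless closer # nothing to close: hand back the input as is

    # Everything between the two tokens becomes literal code text; tokens
    # of other parsers contribute their literal text (their first element).
    content = parts[1...closer].map { |p| p.is_a?(Array) ? p.first : p }.join
    [SketchInlineCode.new(content)] + parts.drop(closer + 1)
  end
end

opening = ["`", CodeSketch, :open]
closing = ["`", CodeSketch, :close]
input = [opening, "this is inside the inline", closing,
         " and this is right after the inline ",
         opening, "and this is the next inline", closing]
output = CodeSketch.forward_walk(input)
```

Only the first occurrence is built; the second pair of tokens is left untouched for a later call, as in the example above.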

The reverse walk consists of calling parsers which implement the
`#reverse_walk` method. When the main class encounters a closing token
for a parser class (a token that contains the `:close` symbol in the last
position of the token information array), it calls that parser's
`#reverse_walk` method with the current list of tokens, inlines and
strings. The parser is expected to collect all the blocks, strings and
inlines that fit within the inline closed by the last element in the
list. Once it encounters the opening token matching the closing token in
the last position of the array, it replaces the elements fitting within
that inline with an object containing all the collected elements. If it
is unable to find a matching opening token for the closing token in the
last position, it should simply return the original contents, unmodified.

Example:

Original text:

    blah blah something something lots of text before the emphasis
    _this is emphasized `and this is an inline` but it's still
    emphasized_

Input:

    ["blah blah something something lots of text before the emphasis",
     ["_", ::PointBlank::Parsing::EmphInline, :open],
     "this is emphasized ",
     <::PointBlank::DOM::InlineCode
      @content = "and this is an inline">,
     " but it's still emphasized",
     ["_", ::PointBlank::Parsing::EmphInline, :close]]

Output:

    ["blah blah something something lots of text before the emphasis",
     <::PointBlank::DOM::InlineEmphasis,
      children = [...,
                  <::PointBlank::DOM::InlineCode ...>
                  ...]>]

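A comparable sketch of the reverse walk contract. `EmphWalkSketch` and `SketchEmphasis` are hypothetical names, and the collected children are stored as they are; the real parser would pass them through the `build` utility so that they get processed recursively.

```ruby
SketchEmphasis = Struct.new(:children) # hypothetical DOM stand-in

class EmphWalkSketch
  # parts[-1] is expected to be this parser's closing token.
  def self.reverse_walk(parts)
    # Scan backwards from the element before the closing token.
    opener = (parts.length - 2).downto(0).find do |i|
      part = parts[i]
      part.is_a?(Array) && part[1] == self && part.last == :open
    end
    return parts unless opener # no matching opener: return the input as is

    children = parts[(opener + 1)...-1] # everything between the two tokens
    parts.take(opener) + [SketchEmphasis.new(children)]
  end
end

walk_input = ["text before the emphasis ",
              ["_", EmphWalkSketch, :open],
              "this is emphasized ",
              :already_built_inline, # placeholder for a built inline object
              " but it's still emphasized",
              ["_", EmphWalkSketch, :close]]
walk_output = EmphWalkSketch.reverse_walk(walk_input)
```

Because the scan runs backwards from the closing token, the nearest opening token is the one that gets matched, which keeps nested elements properly paired.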

Neither `#forward_walk` nor `#reverse_walk` is restricted to making
just the changes discussed above; both can arbitrarily modify the token
arrays. That, however, should be done with great care, so as not to
accidentally break compatibility with other parsers.

To ensure that the tokens collected in `#reverse_walk` and
`#forward_walk` are processed correctly, the collected arrays of
tokens, blocks and inlines should be built into an object that
represents this parser using the `build` method (it will automatically
attempt to find the correct class to construct using the
`#define_parser` directive in the DOMObject subclass definition).