diff --git a/Architrecture.md b/Architrecture.md
new file mode 100644
index 0000000..5bdef83
--- /dev/null
+++ b/Architrecture.md
@@ -0,0 +1,278 @@
+Architecture of madness
+=======================
+
+Prelude
+-------
+
+It needs to be stressed that making the parser modular while keeping it
+relatively simple was a laborious undertaking. There has not been a standard
+more hostile towards the people who dare attempt to implement it than
+CommonMark. It should also be noted that, despite it being titled a
+"Standard" in this document, it is less widely adopted than the GitHub
+Flavored Markdown syntax. GitHub Flavored Markdown, however, is but
+a mere subset of this parser's model, albeit requiring a few extensions.
+
+Current state (as of March 02, 2025)
+------------------------------------
+
+This parser processes text in what can be boiled down to three phases:
+
+- Block/Line phase
+- Overlay phase
+- Inline phase
+
+It should be noted that all phases have their own related parser
+classes, and a shared behaviour system, where each parser takes control
+at some point, and may win ambiguous cases by having higher priority
+(see `#define_child`, `#define_overlay` methods for the priority parameter).
+
+### Block/Line phase ###
+
+The first phase breaks the input down, line by line, into block structures.
+Blocks (preferably inherited from the Block class) can contain other blocks
+(e.g. QuoteBlock, ULBlock, OLBlock). Other blocks (known as leaf blocks)
+may not contain anything else (except inline content, more on that later).
+
+Blocks are designed to be parsed independently. This means that it *should*
+be possible to tear out any standard block so that it is no longer parsed.
+This, however, isn't thoroughly tested for.
+
+Blocks as proper, real classes have a certain lifecycle to follow when
+being constructed:
+
+1. Open condition
+   - A block needs to find its first marker on the current line to open
+     (see `#begin?` method)
+   - Once it's open, it's immediately initialized and fed the line it just
+     read (but now as an object, not as a class) (see `#consume` method)
+2. Marker/Line consumption
+   - While it should be kept open, the block parser instance will
+     keep reading input through the `#consume` method, returning a pair
+     of the modified line (after consuming its tokens from it) and
+     a boolean value indicating permission of lazy continuation
+     (if it's a block like a QuoteBlock or ULBlock that can be lazily
+     overflowed).
+     Every line the parser needs to record needs to be pushed
+     through the `#push` method.
+3. Closure
+   - If the current line no longer belongs to the current block
+     (if the block should have been closed on the previous line),
+     it simply needs to `return` a pair of `nil` and a boolean value for
+     permission of lazy continuation
+   - If a block should be closed on the current line, it should capture it,
+     keep track of the "closed" state, then `return` `nil` on the next call
+     of `#consume`
+   - Once a block is closed, the following happens:
+     1. The block receives its content from the parser
+     2. The parser receives the "close" method call
+     3. (optional) Parser may have a callable method `#applyprops`. If
+        it exists, it gets called with the currently constructed block.
+     4. (optional) All overlays assigned to this block's class are
+        processed on the contents of this block (more on that in
+        Overlay phase)
+     5. (optional) Parser may return a different class, which
+        the current block should be cast into (Overlays may change
+        the class as well)
+     6. (optional) If a block can respond to the `#parse_inner` method, it
+        will get called, allowing the block to parse its own contents.
+   - After this point, the block is no longer touched until the document
+     is fully processed.
+4. Inline processing
+   - (Applies only to Paragraph and any child of LeafBlock)
+     When the document is fully processed, the contents of the current
+     block are taken, assigned to an InlineRoot instance, and then parsed
+     in Inline mode
+5. Completion
+   - The resulting document is then returned.
+
+While there is a lot of functionality available when designing blocks, it is
+not necessary for the simplest of the block kinds available. The simplest
+example of a block parser is likely the ThematicBreakParser class, which
+implements only the 2 methods needed for a block parser to function.
+
+While parsing text, a block may use additional info:
+
+- In the `#consume` method: `lazy` hash argument, if the current line is being
+  processed in lazy continuation mode (likely only ever matters for Paragraph);
+  and `parent` - the parent block containing this block.
+
+Block interpretations are tried in decreasing order of their priority
+value, as applied using the `#define_child` method.
+
+For blocks to be properly indexed, they need to be a valid child or
+a valid descendant (meaning reachable through the child chain) of the
+Document class.
+
+### Overlay phase ###
+
+The Overlay phase doesn't start at some specific point in time. Rather,
+it happens for every block individually - when that block
+closes.
+
+The overlay mechanism can be applied to any DOMObject type, so long as its
+close method is called at some point (this may not be of interest to
+people that do not implement custom syntax, as it generally translates
+to "only block level elements get their overlays processed").
+
+The overlay mechanism provides the ability to perform some action on the block
+right after it gets closed and right before it gets interpreted by the
+inline phase. Overlays may do the following:
+
+- Change the block's class
+  (by returning a class from the `#process` method)
+- Change the block's content (by directly editing it)
+- Change the block's properties (by modifying its `properties` hash)
+
+Overlay interpretations are tried in decreasing order of their priority
+value, as defined using the `#define_overlay` method.
+
+### Inline phase ###
+
+Once all blocks have been processed, and all overlays have been applied
+to their respective block types, the hook in the Document class's
+`#parser` method executes the inline parsing phase of all leaf blocks
+(descendants of the `Leaf` class) and paragraphs.
+
+The outer class encompassing all inline children of a block is
+`InlineRoot`. As such, if an inline element is to ever appear within the
+text, it needs to be reachable as a child or a descendant of InlineRoot.
+
+Inline parsing works in three parts:
+
+- First, the contents are tokenized (every parser marks its own tokens)
+- Second, the forward walk procedure is called
+- Third, the reverse walk procedure is called
+
+This process is repeated for every group of parsers with equal priority.
+At any given step, only parsers of equal priority run
+together. Then, the process goes to the next step, of parsers of
+higher priority value. As counter-intuitive as this is, this means that
+it goes to the parsers of _lower_ priority.
+
+At the very end of the process, the remaining strings are concatenated
+within the mixed array of inlines and strings, and turned into Text
+nodes, after which the contents of the array are appended as children to
+the root node.
+
+This process is recursively applied to all elements which may have child
+elements. This is ensured when an inline parser calls the `build`
+utility method.
+
+The inline parser is a class that implements static methods `tokenize`
+and either `forward_walk` or `reverse_walk`. Both may be implemented at
+the same time, but this isn't advisable.
+
+The tokenization process is characterized by calling every parser in the
+current group with every string in the tokens array using the `tokenize`
+method. It is expected that the parser breaks the string down into an
+array of other strings and tokens. A token is an array where the first
+element is the literal text representation of the token, the second
+value is the class of the parser, and the _last_ value (_not third_) is
+the `:close` or `:open` symbol (though functionally it may hold any
+symbol value). Any additional information the parser may need in later
+stages may be stored between the second element and the last element.
+
+Example:
+
+    Input:
+
+        "_this _is a string of_ tokens_"
+
+    Output:
+
+        [["_", ::PointBlank::Parsing::EmphInline, :open],
+         "this ",
+         ["_", ::PointBlank::Parsing::EmphInline, :open],
+         "is a string of",
+         ["_", ::PointBlank::Parsing::EmphInline, :close],
+         " tokens",
+         ["_", ::PointBlank::Parsing::EmphInline, :close]]
+
+The forward walk is characterized by calling parsers which implement the
+`#forward_walk` method. When the main class encounters an opening token
+during the forward walk, it will call the `#forward_walk` method of the class
+that represents this token. It is expected that the parser class will
+then attempt to build the first available occurrence of the inline
+element it represents, after which it will return the array of all
+tokens and strings that it was passed, where the first element will be
+the newly constructed inline element. If it is unable to close the
+inline, it should simply return the original contents, unmodified.
+
+Example:
+
+    Original text:
+
+        this is outside the inline `this is inside the inline` and this
+        is right after the inline `and this is the next inline`
+
+    Input:
+
+        [["`", ::PointBlank::Parsing::CodeInline, :open],
+         "this is inside the inline",
+         ["`", ::PointBlank::Parsing::CodeInline, :close],
+         " and this is right after the inline ",
+         ["`", ::PointBlank::Parsing::CodeInline, :open],
+         "and this is the next inline",
+         ["`", ::PointBlank::Parsing::CodeInline, :close]]
+
+    Output:
+
+        [<::PointBlank::DOM::InlineCode
+           @content = "this is inside the inline">,
+         " and this is right after the inline ",
+         ["`", ::PointBlank::Parsing::CodeInline, :open],
+         "and this is the next inline",
+         ["`", ::PointBlank::Parsing::CodeInline, :close]]
+
+The reverse walk is characterized by calling parsers which implement the
+`#reverse_walk` method when the main class encounters a closing token
+for this class (the one that contains the `:close` symbol in the last
+position of the token information array). At that point the main class will
+call the parser's `#reverse_walk` method with the current list of
+tokens, inlines and strings. It is expected that the parser will then
+collect all the blocks, strings and inlines that fit within the inline
+closed by the last element in the list, and once it encounters the
+appropriate opening token for the closing token in the last position of
+the array, it will then replace the elements fitting within that inline
+with an object containing all the collected elements. If it is unable to
+find a matching opening token for the closing token in the last
+position, it should simply return the original contents, unmodified.
+
+Example:
+
+    Original text:
+
+        blah blah something something lots of text before the emphasis
+        _this is emphasized `and this is an inline` but it's still
+        emphasized_
+
+
+    Input:
+
+        ["blah blah something something lots of text before the emphasis",
+         ["_", ::PointBlank::Parsing::EmphInline, :open],
+         "this is emphasized ",
+         <::PointBlank::DOM::InlineCode
+           @content = "and this is an inline">,
+         " but it's still emphasized",
+         ["_", ::PointBlank::Parsing::EmphInline, :close]]
+
+    Output:
+
+        ["blah blah something something lots of text before the emphasis",
+         <::PointBlank::DOM::InlineEmphasis
+           children = [...,
+                       <::PointBlank::DOM::InlineCode ...>,
+                       ...]>]
+
+Neither `#forward_walk` nor `#reverse_walk` is restricted to making
+just the changes discussed above, and both can arbitrarily modify the token
+arrays. That, however, should be done with great care, so as to not
+accidentally break compatibility with other parsers.
+
+To ensure that the collected tokens in the `#reverse_walk` and
+`#forward_walk` are processed correctly, the collected arrays of
+tokens, blocks and inlines should be built into an object that
+represents this parser using the `build` method (it will automatically
+attempt to find the correct class to construct using the
+`#define_parser` directive in the DOMObject subclass definition).
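
The token shape and the reverse walk described in the Inline phase section can be illustrated with a small, self-contained Ruby sketch. It is not PointBlank code: `ToyEmphParser`, `ToyEmphasis` and `run_reverse_walk` are hypothetical names invented for this illustration, the open/close rule is a deliberate simplification (it is not CommonMark's delimiter-run logic), and a real parser would presumably construct its node through the `build` method rather than a plain `Struct`.

```ruby
# Hypothetical stand-in for a DOM node; not a PointBlank class.
ToyEmphasis = Struct.new(:children)

# Toy inline parser following the described contract: static `tokenize`
# plus `reverse_walk`. All names here are assumptions.
class ToyEmphParser
  # Break a string into plain strings and tokens of the form
  # [literal text, parser class, :open or :close].
  def self.tokenize(string)
    out = []
    buffer = String.new
    string.each_char.with_index do |char, index|
      if char == "_"
        out << buffer unless buffer.empty?
        buffer = String.new
        following = string[index + 1]
        # Simplified rule: "_" opens when a non-space character follows it.
        kind = following && following !~ /\s/ ? :open : :close
        out << ["_", self, kind]
      else
        buffer << char
      end
    end
    out << buffer unless buffer.empty?
    out
  end

  # True if `part` is a token of this parser whose last element is `kind`.
  def self.token?(part, kind)
    part.is_a?(Array) && part[1] == self && part.last == kind
  end

  # `parts` ends with a closing token. Find the nearest opening token and
  # replace everything from it to the end with one ToyEmphasis node.
  # Returns `parts` unmodified when there is no matching opener.
  def self.reverse_walk(parts)
    opener = parts[0...-1].rindex { |part| token?(part, :open) }
    return parts unless opener

    parts[0...opener] + [ToyEmphasis.new(parts[(opener + 1)...-1])]
  end
end

# Hypothetical driver: scans left to right and hands the list collected so
# far to the parser whenever one of its closing tokens is encountered.
def run_reverse_walk(parser, parts)
  parts.each_with_object([]) do |part, done|
    done << part
    next unless parser.token?(part, :close)

    done.replace(parser.reverse_walk(done))
  end
end

tokens = ToyEmphParser.tokenize("_this _is a string of_ tokens_")
p run_reverse_walk(ToyEmphParser, tokens)
```

Because the closing token is always the last element handed to `reverse_walk`, the parser only ever has to search backwards, which is why nested emphasis resolves from the inside out in this sketch.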