Add Architrecture

2025-04-03 22:35:08 +00:00 · 2025-04-03 22:35:08 +00:00 · e470c1da72
parent ace9815864
commit e470c1da72
1 changed files with 278 additions and 0 deletions
--- a/Architrecture.md
+++ b/Architrecture.md
@@ -0,0 +1,278 @@
+Architecture of madness
+=======================
+
+Prelude
+-------
+
+It needs to be stressed that making the parser modular while keeping it
+relatively simple was a laborious undertaking. There has not been a standard
+more hostile towards the people who dare attempt to implement it than
+CommonMark. It should also be noted that, despite being called a
+"Standard" in this document, it is less widely adopted than the GitHub
+Flavored Markdown syntax. GitHub Flavored Markdown, however, is merely
+a subset of this parser's model, albeit one requiring a few extensions.
+
+Current state (as of March 02, 2025)
+------------------------------------
+
+This parser processes text in what can be boiled down to three phases.
+
+- Block/Line phase
+- Overlay phase
+- Inline phase
+
+All phases have their own related parser classes and a shared behaviour
+system, where each parser takes control at some point and may win
+ambiguous cases by having a higher priority (see the priority parameter
+of the `#define_child` and `#define_overlay` methods).
+
+### Block/Line phase ###
+
+The first phase breaks the input down, line by line, into block structures.
+Some blocks (preferably inherited from the Block class) can contain other
+blocks (e.g. QuoteBlock, ULBlock, OLBlock). Other blocks (known as leaf
+blocks) may not contain anything else (except inline content, more on that
+later).
+
+Blocks are designed to be parsed independently. This means that it *should*
+be possible to remove any standard block from the parser so that its syntax
+is no longer recognized. This, however, isn't thoroughly tested.
+
+Blocks as proper, real classes have a certain lifecycle to follow when
+being constructed:
+
+1. Open condition
+    - A block needs to find its first marker on the current line to open
+      (see `#begin?` method)
+    - Once it's open, it's immediately initialized and fed the line it just
+      read (but now as an object, not as a class) (see `#consume` method)
+2. Marker/Line consumption
+    - While it should be kept open, the block parser instance will
+      keep reading input through the `#consume` method, returning a pair
+      of the modified line (after consuming its tokens from it) and
+      a boolean value indicating permission of lazy continuation
+      (if it's a block like a QuoteBlock or ULBlock that can be lazily
+      overflowed).
+      Every line the parser needs to record must be pushed
+      through the `#push` method.
+3. Closure
+    - If the current line no longer belongs to the current block
+      (if the block should have been closed on the previous line),
+      it simply needs to `return` a pair of `nil`, and a boolean value for
+      permission of lazy continuation
+    - If a block should be closed on the current line, it should capture it,
+      keep track of the "closed" state, then `return` `nil` on the next call
+      of `#consume`
+    - Once a block is closed, the following happens in order:
+        1. The block receives its content from the parser
+        2. The parser receives the "close" method call
+        3. (optional) Parser may have a callable method `#applyprops`. If
+           it exists, it gets called with the current constructed block.
+        4. (optional) All overlays assigned to this block's class are
+           processed on the contents of this block (more on that in
+           Overlay phase)
+        5. (optional) Parser may return a different class, which
+           the current block should be cast into (Overlays may change
+           the class as well)
+        6. (optional) If a block can respond to `#parse_inner` method, it
+           will get called, allowing the block to parse its own contents.
+    - After this point, the block is no longer touched until the document
+      fully gets processed.
+4. Inline processing
+    - (Applies only to Paragraph and any child of LeafBlock)
+      When the document gets fully processed, the contents of the current
+      block are taken, assigned to an InlineRoot instance, and then parsed
+      in Inline mode
+5. Completion
+    - The resulting document is then returned.
+
+While there is a lot of functionality available when designing blocks, it is
+not necessary for the simplest of the block kinds available. The simplest
+example of a block parser is likely the ThematicBreakParser class, which
+implements only the two methods needed for a block parser to function.
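
A minimal sketch of what such a two-method block parser could look like, assuming the two methods are `begin?` and `consume` as described in the lifecycle above. The class name `SketchThematicBreakParser`, the marker regex and the exact `consume` signature are illustrative assumptions, not the project's actual code.

```ruby
# Hypothetical stand-in for a minimal leaf block parser. The real
# ThematicBreakParser may use different names and signatures.
class SketchThematicBreakParser
  # Up to three spaces, then three or more identical markers (-, _ or *),
  # optionally separated by spaces or tabs.
  MARKER = /\A {0,3}([-_*])(?:[ \t]*\1){2,}[ \t]*\z/

  # Open condition: asked on the class, before any instance exists.
  def self.begin?(line)
    MARKER.match?(line.chomp)
  end

  def initialize
    @closed = false
  end

  # Returns a pair: [remaining line or nil, lazy continuation allowed].
  # `parent` and `lazy` are accepted but unused in this sketch.
  def consume(_line, _parent = nil, lazy: false)
    return [nil, false] if @closed # the block ended on the previous line

    @closed = true                 # a thematic break spans a single line
    ['', false]
  end
end

lines = ["***\n", "text\n"]
parser = SketchThematicBreakParser.new if SketchThematicBreakParser.begin?(lines[0])
first = parser.consume(lines[0])   # => ["", false]
second = parser.consume(lines[1])  # => [nil, false]
```

The second call shows the "capture the line, remember the closed state, return `nil` on the next call" pattern from the Closure step.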
+
+While parsing text, a block may use additional info:
+
+- In the `#consume` method: the `lazy` hash argument, set if the current
+  line is being processed in lazy continuation mode (likely only ever
+  matters for Paragraph); and `parent`, the parent block containing this
+  block.
+
+Block interpretations are tried in decreasing order of their priority
+value, as applied using the `#define_child` method.
+
+For blocks to be properly indexed, they need to be a valid child or
+a valid descendant (meaning reachable through the child chain) of the
+Document class.
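
The priority ordering and the "reachable through the child chain" rule can be sketched as follows. Only the name `define_child` comes from the text above; the `ChildRegistry` module, the `Sketch*` classes, the argument order and the priority values are assumptions made for illustration.

```ruby
# Hypothetical registry mixin. Only `define_child` is named in the text.
module ChildRegistry
  def children
    @children ||= []
  end

  def define_child(klass, priority = 0)
    children << [priority, klass]
  end

  # Candidate classes in decreasing order of priority value.
  def ordered_children
    children.sort_by { |priority, _| -priority }.map(&:last)
  end

  # Every class reachable through the child chain, each listed once.
  def descendants(seen = [])
    ordered_children.each do |klass|
      next if seen.include?(klass)

      seen << klass
      klass.descendants(seen) if klass.respond_to?(:descendants)
    end
    seen
  end
end

class SketchParagraph; extend ChildRegistry; end
class SketchCodeBlock; extend ChildRegistry; end
class SketchQuote; extend ChildRegistry; end
class SketchDocument; extend ChildRegistry; end

SketchQuote.define_child(SketchCodeBlock, 500)
SketchDocument.define_child(SketchParagraph, 1)
SketchDocument.define_child(SketchQuote, 600)

order = SketchDocument.ordered_children # => [SketchQuote, SketchParagraph]
reachable = SketchDocument.descendants
```

Here `SketchCodeBlock` is not a direct child of the document, but it is still reachable through `SketchQuote`, so it ends up in `reachable`.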
+
+### Overlay phase ###
+
+The overlay phase doesn't start at some specific point in time. Rather,
+it happens for every block individually, when that block closes.
+
+The overlay mechanism can be applied to any DOMObject type, so long as its
+close method is called at some point (this may not be of interest to
+people who do not implement custom syntax, as it generally translates
+to "only block level elements get their overlays processed").
+
+The overlay mechanism provides the ability to perform some action on the
+block right after it gets closed and right before it gets interpreted by
+the inline phase. Overlays may do the following:
+
+- Change the block's class
+  (by returning a class from the `#process` method)
+- Change the block's content (by directly editing it)
+- Change the block's properties (by modifying its `properties` hash)
+
+Overlay interpretations are tried in decreasing order of their priority
+value, as defined using the `#define_overlay` method.
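
As an illustration of the three things an overlay may do, here is a rough sketch. Only `process` and the `properties` hash are taken from the text; the `Sketch*` classes, the class-level `process` signature and the `apply_overlays` driver (which applies every overlay rather than stopping at the first match) are assumptions.

```ruby
# Hypothetical DOM classes; the real DOMObject hierarchy is not shown here.
class SketchBlock
  attr_accessor :content, :properties

  def initialize(content)
    @content = content
    @properties = {}
  end
end

class SketchParagraph < SketchBlock; end
class SketchHeading < SketchBlock; end

# Turns a two-line "Title" / "=====" paragraph into a level 1 heading.
class SketchSetextOverlay
  def self.process(block)
    lines = block.content.lines.map(&:chomp)
    return nil unless lines.size == 2 && lines[1].match?(/\A=+\z/)

    block.content = lines[0]       # change the block's content
    block.properties[:level] = 1   # change the block's properties
    SketchHeading                  # ask for a class change
  end
end

# Assumed driver: runs overlays in decreasing order of priority value and
# casts the block whenever an overlay returns a class.
def apply_overlays(block, overlays)
  overlays.sort_by { |priority, _| -priority }.each do |_, overlay|
    result = overlay.process(block)
    next unless result.is_a?(Class)

    cast = result.new(block.content)
    cast.properties = block.properties
    block = cast
  end
  block
end

heading = apply_overlays(SketchParagraph.new("Title\n====="),
                         [[100, SketchSetextOverlay]])
plain = apply_overlays(SketchParagraph.new("just text"),
                       [[100, SketchSetextOverlay]])
```

Returning `nil` from `process` leaves the block's class untouched, which is why `plain` stays a paragraph.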
+
+### Inline phase ###
+
+Once all blocks have been processed, and all overlays have been applied
+to their respective block types, the hook in the Document class's
+`#parser` method executes the inline parsing phase for all leaf blocks
+(descendants of the `Leaf` class) and paragraphs.
+
+The outer class encompassing all inline children of a block is
+`InlineRoot`. As such, if an inline element is to ever appear within the
+text, it needs to be reachable as a child or a descendant of InlineRoot.
+
+Inline parsing works in three parts:
+
+- First, the contents are tokenized (every parser marks its own tokens)
+- Second, the forward walk procedure is called
+- Third, the reverse walk procedure is called
+
+This process is repeated for every group of parsers with equal priority.
+Only parsers of equal priority run in the same step. The process then
+moves on to the next step, with the parsers of the next higher priority
+value. As counter-intuitive as this is, this means that it moves on to
+the parsers of _lower_ priority.
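
The grouping described above amounts to bucketing parsers by priority value and visiting the buckets in ascending order. A small sketch, in which the parser names and priority values are made up for illustration:

```ruby
# Hypothetical [priority_value, parser_name] pairs.
parsers = [[2, :emphasis], [0, :code], [2, :strikethrough], [1, :link]]

# One step per distinct priority value, lowest value first.
steps = parsers.group_by(&:first)
               .sort_by { |priority, _| priority }
               .map { |_, group| group.map(&:last) }
# steps => [[:code], [:link], [:emphasis, :strikethrough]]
```

Each inner array is one step: tokenization, forward walk and reverse walk all run for that group before the next group starts.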
+
+At the very end of the process, the remaining strings are concatenated
+within the mixed array of inlines and strings, and turned into Text
+nodes, after which the contents of the array are appended as children to
+the root node.
+
+This process is recursively applied to all elements which may have child
+elements. This is ensured when an inline parser calls the "build"
+utility method.
+
+The inline parser is a class that implements static methods `tokenize`
+and either `forward_walk` or `reverse_walk`. Both may be implemented at
+the same time, but this isn't advisable.
+
+The tokenization process is characterized by calling every parser in the
+current group with every string in the tokens array using the `tokenize`
+method. It is expected that the parser breaks the string down into an
+array of other strings and tokens. A token is an array where the first
+element is the literal text representation of the token, the second
+value is the class of the parser, and the _last_ value (_not third_) is
+the `:close` or `:open` symbol (though functionally it may hold any
+symbol value). Any additional information the parser may need in later
+stages may be stored between the second element and the last element.
+
+Example:
+
+    Input:
+    
+    "_this _is a string of_ tokens_"
+
+    Output:
+
+    [["_", ::PointBlank::Parsing::EmphInline, :open],
+     "this ",
+     ["_", ::PointBlank::Parsing::EmphInline, :open],
+     "is a string of",
+     ["_", ::PointBlank::Parsing::EmphInline, :close],
+     " tokens",
+     ["_", ::PointBlank::Parsing::EmphInline, :close]]
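
A tokenizer producing output of the shape shown above might look roughly like this. `SketchEmphInline`, the `classify` helper, the `tokenize_all` driver and the simplified open/close rule (whitespace on one side, non-whitespace on the other) are all assumptions. The real parser presumably follows the full CommonMark flanking rules.

```ruby
# Hypothetical emphasis tokenizer; not the project's EmphInline class.
class SketchEmphInline
  # Splits one string into strings and [text, parser, :open/:close] tokens.
  def self.tokenize(string)
    parts = []
    buffer = +''
    string.each_char.with_index do |char, index|
      kind = char == '_' ? classify(string, index) : nil
      if kind
        parts << buffer unless buffer.empty?
        buffer = +''
        parts << [char, self, kind]
      else
        buffer << char
      end
    end
    parts << buffer unless buffer.empty?
    parts
  end

  # Simplified rule: an underscore after whitespace and before text opens,
  # one after text and before whitespace closes. Anything else stays text.
  def self.classify(string, index)
    before = index.zero? ? ' ' : string[index - 1]
    after = string[index + 1] || ' '
    return :open if before.match?(/\s/) && after.match?(/\S/)
    return :close if before.match?(/\S/) && after.match?(/\s/)

    nil
  end
end

# Assumed driver: only strings are tokenized, existing tokens are kept.
def tokenize_all(parts, parser)
  parts.flat_map { |part| part.is_a?(String) ? parser.tokenize(part) : [part] }
end

tokens = SketchEmphInline.tokenize('_this _is a string of_ tokens_')
```

With this rule, `tokens` has the same shape as the example output above, with `SketchEmphInline` in place of the real parser class.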
+
+The forward walk is characterized by calling parsers which implement the
+`#forward_walk` method. When the main class encounters an opening token
+during the forward walk, it will call the `#forward_walk` method of the
+class that represents this token. It is expected that the parser class
+will then attempt to build the first available occurrence of the inline
+element it represents, after which it will return the array of all
+tokens and strings that it was passed, with the consumed elements
+replaced so that the first element is the newly constructed inline
+element. If it is unable to close the element, it should simply return
+the original contents, unmodified.
+
+Example:
+
+    Original text:
+
+    this is outside the inline `this is inside the inline` and this
+    is right after the inline `and this is the next inline`
+
+    Input:
+
+    [["`", ::PointBlank::Parsing::CodeInline, :open],
+     "this is inside the inline",
+     ["`", ::PointBlank::Parsing::CodeInline, :close],
+     " and this is right after the inline ",
+     ["`", ::PointBlank::Parsing::CodeInline, :open],
+     "and this is the next inline",
+     ["`", ::PointBlank::Parsing::CodeInline, :close]]
+
+    Output:
+
+    [<::PointBlank::DOM::InlineCode
+      @content = "this is inside the inline">,
+     " and this is right after the inline ",
+     ["`", ::PointBlank::Parsing::CodeInline, :open],
+     "and this is the next inline",
+     ["`", ::PointBlank::Parsing::CodeInline, :close]]
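
The contract of the forward walk can be sketched like this. `SketchCodeInline` and `SketchInlineCode` are hypothetical stand-ins, and the sketch builds a plain `Struct` directly instead of going through the project's `build` utility.

```ruby
# Hypothetical DOM node for an inline code span.
SketchInlineCode = Struct.new(:content)

class SketchCodeInline
  # `parts` is expected to start with this parser's opening token.
  def self.forward_walk(parts)
    close_at = (1...parts.size).find do |index|
      part = parts[index]
      part.is_a?(Array) && part[1] == self && part.last == :close
    end
    return parts unless close_at # cannot close: hand back the input as is

    inner = parts[1...close_at]
    # Inside a code span, tokens of other parsers become plain text again.
    text = inner.map { |part| part.is_a?(Array) ? part.first : part.to_s }.join
    [SketchInlineCode.new(text)] + parts[(close_at + 1)..]
  end
end

input = [['`', SketchCodeInline, :open],
         'this is inside the inline',
         ['`', SketchCodeInline, :close],
         ' and this is right after the inline ',
         ['`', SketchCodeInline, :open],
         'and this is the next inline',
         ['`', SketchCodeInline, :close]]
output = SketchCodeInline.forward_walk(input)
```

Only the first span is built. The second opening token is left in place, to be handled when the main loop reaches it.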
+
+The reverse walk is characterized by calling parsers which implement the
+`#reverse_walk` method. When the main class encounters a closing token
+(one that contains the `:close` symbol in the last position of the token
+information array), it calls the `#reverse_walk` method of the parser
+class that represents this token, passing the current list of tokens,
+inlines and strings. It is expected that the parser will then walk
+backwards from the closing token in the last position of the list,
+collecting all the blocks, strings and inlines that belong to the
+element being closed. Once it encounters the matching opening token, it
+replaces the collected elements with an instance of the class that
+contains them. If it is unable to find a matching opening token for the
+closing token in the last position, it should simply return the original
+contents, unmodified.
+
+Example:
+
+    Original text:
+
+    blah blah something something lots of text before the emphasis
+    _this is emphasized `and this is an inline` but it's still
+    emphasized_
+
+
+    Input:
+
+    ["blah blah something something lots of text before the emphasis",
+     ["_", ::PointBlank::Parsing::EmphInline, :open],
+     "this is emphasized ",
+     <::PointBlank::DOM::InlineCode
+      @content = "and this is an inline">,
+     " but it's still emphasized",
+     ["_", ::PointBlank::Parsing::EmphInline, :close]]
+
+    Output:
+
+    ["blah blah something something lots of text before the emphasis",
+     <::PointBlank::DOM::InlineEmphasis,
+      children = [...,
+      <::PointBlank::DOM::InlineCode ...>
+      ...]>]
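
A reverse walk matching the description above could be sketched as follows. `SketchEmphInline`, `SketchInlineEmphasis` and `SketchInlineCode` are hypothetical names, and a `Struct` stands in for whatever the project's `build` method would construct.

```ruby
# Hypothetical DOM nodes.
SketchInlineEmphasis = Struct.new(:children)
SketchInlineCode = Struct.new(:content)

class SketchEmphInline
  # `parts` is expected to end with this parser's closing token.
  def self.reverse_walk(parts)
    # Search backwards for the nearest opening token of this parser.
    open_at = (parts.size - 2).downto(0).find do |index|
      part = parts[index]
      part.is_a?(Array) && part[1] == self && part.last == :open
    end
    return parts unless open_at # no opener: hand back the input as is

    children = parts[(open_at + 1)...-1]
    parts[0...open_at] + [SketchInlineEmphasis.new(children)]
  end
end

code = SketchInlineCode.new('and this is an inline')
input = ['text before the emphasis ',
         ['_', SketchEmphInline, :open],
         'this is emphasized ',
         code,
         " but it's still emphasized",
         ['_', SketchEmphInline, :close]]
output = SketchEmphInline.reverse_walk(input)
```

Everything between the two tokens, including the already built inline code node, ends up as children of the new emphasis node, while the text before the opening token is left untouched.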
+
+Both `#forward_walk` and `#reverse_walk` are not restricted to making
+just the changes discussed above, and can arbitrarily modify the token
+arrays. That, however, should be done with great care, so as to not
+accidentally break compatibility with other parsers.
+
+To ensure that the collected tokens in the `#reverse_walk` and
+`#forward_walk` are processed correctly, the collected arrays of
+tokens, blocks and inlines should be built into an object that
+represents this parser using the `build` method (it will automatically
+attempt to find the correct class to construct using the
+`#define_parser` directive in the DOMObject subclass definition).