Initial revision

author: jcid <devnull@localhost> 2007-10-07 00:36:34 +0200
committer: jcid <devnull@localhost> 2007-10-07 00:36:34 +0200
commit: 93715c46a99c96d6c866968312691ec9ab0f6a03 (patch)
tree: 573f19ec6aa740844f53a7c0eb7114f04096bf64 /doc/HtmlParser.txt
1 files changed, 116 insertions, 0 deletions
diff --git a/doc/HtmlParser.txt b/doc/HtmlParser.txt
new file mode 100644
index 00000000..ec64164d
--- /dev/null
+++ b/doc/HtmlParser.txt
@@ -0,0 +1,116 @@
+ October 2001, --Jcid
+ Last update: Dec 2004
+
+                        ---------------
+                        THE HTML PARSER
+                        ---------------
+
+
+   Dillo's  parser is more than just a HTML parser, it does XHTML
+and  plain  text  also.  It  has  parsing 'modes' that define its
+behaviour while working:
+
+   typedef enum {
+     DILLO_HTML_PARSE_MODE_INIT,
+     DILLO_HTML_PARSE_MODE_STASH,
+     DILLO_HTML_PARSE_MODE_STASH_AND_BODY,
+     DILLO_HTML_PARSE_MODE_BODY,
+     DILLO_HTML_PARSE_MODE_VERBATIM,
+     DILLO_HTML_PARSE_MODE_PRE
+   } DilloHtmlParseMode;
+
+
+   The  parser  works  upon a token-grained basis, i.e., the data
+stream is parsed into tokens and the parser is fed with them. The
+process  is  simple:  whenever  the  cache  has new data, it gets
+passed to Html_write, which groups data into tokens and calls the
+appropriate functions for the token type (TAG, SPACE or WORD).
+
+   Note:   when  in  DILLO_HTML_PARSE_MODE_VERBATIM,  the  parser
+doesn't  try  to  split  the  data stream into tokens anymore, it
+simply collects until the closing tag.
+
+------
+TOKENS
+------
+
+  * A chunk of WHITE SPACE     --> Html_process_space
+
+
+  * TAG                        --> Html_process_tag
+
+    The tag-start is defined by two adjacent characters:
+
+      first : '<'
+      second: ALPHA | '/' | '!' | '?'
+
+      Note: comments are discarded ( <!-- ... --> )
+
+
+    The tag's end is not as easy to find, nor to deal with!:
+
+    1)  The  HTML  4.01  sec.  3.2.2 states that "Attribute/value
+    pairs appear before the final '>' of an element's start tag",
+    but it doesn't define how to discriminate the "final" '>'.
+
+    2) '<' and '>' should be escaped as '&lt;' and '&gt;' inside
+    attribute values.
+
+    3)  The XML SPEC for XHTML states:
+      AttrValue ::== '"' ([^<&"] | Reference)* '"'   |
+                     "'" ([^<&'] | Reference)* "'"
+
+    Current  parser  honors  the XML SPEC.
+
+    As  it's  a  common  mistake  for human authors to mistype or
+    forget  one  of  the  quote  marks of an attribute value; the
+    parser   solves  the  problem  with  a  look-ahead  technique
+    (otherwise  the  parser  could  skip significative amounts of
+    well written HTML).
+
+
+
+  * WORD                       --> Html_process_word
+
+    A  word is anything that doesn't start with SPACE, and that's
+    outside  of  a  tag, up to the first SPACE or tag start.
+
+    SPACE = ' ' | \n | \r | \t | \f | \v
+
+
+-----------------
+THE PARSING STACK 
+-----------------
+
+  The parsing state of the document is kept in a stack:
+
+  struct _DilloHtml {
+     [...]
+     DilloHtmlState *stack;
+     gint stack_top; /* Index to the top of the stack [0 based] */
+     gint stack_max;
+     [...]
+  };
+
+  struct _DilloHtmlState {
+    char *tag;
+    DwStyle *style, *table_cell_style;
+    DilloHtmlParseMode parse_mode;
+    DilloHtmlTableMode table_mode;
+    gint list_level;
+    gint list_number;
+    DwWidget *page, *table;
+    gint32  current_bg_color;
+  };
+
+
+  Basically,  when a TAG is processed, a new state is pushed into
+the  'stack'  and  its  'style'  is  set  to  reflect the desired
+appearance (details in DwStyle.txt).
+
+  That way, when a word is processed later (added to the Dw), all
+the information is within the top state.
+
+  Closing TAGs just pop the stack.
+
+
author	jcid <devnull@localhost>	2007-10-07 00:36:34 +0200
committer	jcid <devnull@localhost>	2007-10-07 00:36:34 +0200
commit	93715c46a99c96d6c866968312691ec9ab0f6a03 (patch)
tree	573f19ec6aa740844f53a7c0eb7114f04096bf64 /doc/HtmlParser.txt