Title: Missing entity expansion inside PRE
Author: rodarima
Created: Sun, 11 Aug 2024 12:48:34 +0000
State: closed

From https://pubs.opengroup.org/onlinepubs/9699919799/utilities/make.html

The `<pre>` content is rendered:

    .SUFFIXES: .o .c .y .l .a .sh .f .c .y .l .sh .f

While the `&#152;` entity must be rendered as `~`, even inside a `pre` block.

Source: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/pre

> If you have to display reserved characters such as <, >, &, and " within the
> `<pre>` tag, the characters must be escaped using their respective character
> references.
>
> `<pre>` elements commonly contain `<code>`, `<samp>`, and `<kbd>` elements, to
> represent computer code, computer output, and user input, respectively.

--%--
From: rodarima
Date: Sun, 11 Aug 2024 14:23:32 +0000

Dillo reports:

> HTML warning: line 11, Numeric character reference `'&#152;'` is not valid.

Which is handled by `Html_parse_numeric_charref()`:

```c
   if ((codepoint < 0x20 && codepoint != '\t' && codepoint != '\n' &&
        codepoint != '\f') ||
       (codepoint >= 0x7f && codepoint <= 0x9f) ||
       (codepoint >= 0xd800 && codepoint <= 0xdfff) ||
       codepoint > 0x10ffff ||
       ((codepoint & 0xfffe) == 0xfffe) ||
       (!(html->DocType == DT_HTML && html->DocTypeVersion >= 5.0f) &&
        codepoint > 0xffff)) {
      /* this catches null bytes, errors, codes out of range, disallowed
       * control chars, permanently undefined chars, and surrogates.
       */
      char c = *s;
      *s = '\0';
      BUG_MSG("Numeric character reference '%s' is not valid.", tok);
      *s = c;

      codepoint = (codepoint >= 145 && codepoint <= 151) ?
                  Html_ms_stupid_quotes_2ucs(codepoint) : -1;
   }
```

However, the tilde character seems to have the Unicode value U+007E, or 126 in
decimal:

    >>> hex(ord('~'))
    '0x7e'

Which matches the [ISO-8859-1 character set](https://en.wikipedia.org/wiki/ISO/IEC_8859-1).

From Wikipedia:

> The popular Windows-1252 character set adds all the missing characters
> provided by ISO/IEC 8859-15, plus a number of typographic symbols, by
> replacing the rarely used C1 controls in the range 128 to 159 (hex 80 to 9F).
> It is very common to mislabel Windows-1252 text as being in ISO-8859-1. A
> common result was that all the quotes and apostrophes (produced by "smart
> quotes" in word-processing software) were replaced with question marks or
> boxes on non-Windows operating systems, making text difficult to read. Many
> Web browsers and e-mail clients will interpret ISO-8859-1 control codes as
> Windows-1252 characters, and that behavior was later standardized in
> HTML5.[20]

In the [Windows-1252](https://en.wikipedia.org/wiki/Windows-1252) table I can
see that the symbol at position 152 is not the common tilde `~` but a "small
tilde", U+02DC `˜`.

So, this seems to be one of those cases where the charset is wrongly set to
ISO-8859-1 instead of Windows-1252. The document content type seems to be wrong:

```html
```

Also, based on the [HTML 4.01 spec](https://www.w3.org/TR/html4/charset.html),
numeric entities must refer to the "document character set":

> Occasional characters that fall outside this encoding may still be represented
> by character references. These always refer to the document character set, not
> the character encoding.

And the character set is *not* the `charset`, but Unicode:

> The ASCII character set is not sufficient for a global information system such
> as the Web, so HTML uses the much more complete character set called the
> Universal Character Set (UCS), defined in [ISO10646]. This standard defines a
> repertoire of thousands of characters used by communities all over the world.

So the entity `&#152;` is pointing to the
[Unicode symbol for "Start Of String"](https://www.codetable.net/decimal/152),
which is non-printable.

Therefore, there is no bug on the Dillo side, but two bugs on the POSIX manual
page:

- The entity for small tilde must be `&#732;` or `&tilde;`.
- They probably mean `~`, not the small tilde.
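
The three readings of the number 152 discussed above can be checked with a short Python sketch. This is only an illustration using the standard library, not anything taken from Dillo; the names `CODEPOINT` and `describe` are made up for the example.

```python
import html
import unicodedata

CODEPOINT = 152  # decimal value of the numeric reference in the POSIX page


def describe(ch):
    """Return 'U+XXXX NAME (category)' for one character (hypothetical helper).

    Control characters have no Unicode name, so a placeholder is used.
    """
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X} {name} ({unicodedata.category(ch)})"


# Read as a Unicode code point (the HTML 4.01 document character set),
# 152 is U+0098, a C1 control character (category Cc).
as_unicode = chr(CODEPOINT)

# Read as a Windows-1252 byte, 0x98 decodes to the small tilde U+02DC.
as_cp1252 = bytes([CODEPOINT]).decode("cp1252")

print(describe(as_unicode))
print(describe(as_cp1252))
print(describe("~"))  # the ASCII tilde is a different character again

# References that do name the intended characters.
print(html.unescape("&#732;") == as_cp1252)   # small tilde, numeric
print(html.unescape("&tilde;") == as_cp1252)  # small tilde, named
print(html.unescape("&#126;") == "~")         # plain ASCII tilde
```

The point of the sketch is that the same number yields a control character or a small tilde depending on whether it is taken as a Unicode code point or as a Windows-1252 byte, and neither is the ASCII tilde `~`.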