Merge with main repo.

author: Sebastian Geerken <devnull@localhost> 2015-06-03 22:34:03 +0200
committer: Sebastian Geerken <devnull@localhost> 2015-06-03 22:34:03 +0200
commit: da88ede8c401449729ae9c4b78f4c9914d81fa09 (patch)
tree: a66a438f5523269bf470b087d1f38410a688bb1a /devdoc/dw-line-breaking.doc
parent: d03e77b0aa3f1f70dcc1d1377b2262ad674ad87e (diff)
parent: 59b76c75b64578edac35d19c914067a0bd7791e9 (diff)
1 files changed, 470 insertions, 0 deletions
diff --git a/devdoc/dw-line-breaking.doc b/devdoc/dw-line-breaking.doc
new file mode 100644
index 00000000..14ab97c4
--- /dev/null
+++ b/devdoc/dw-line-breaking.doc
@@ -0,0 +1,470 @@
+/** \page dw-line-breaking Changes in Line-Breaking and Hyphenation
+
+<div style="border: 2px solid #ffff00; margin-bottom: 0.5em;
+padding: 0.5em 1em; background-color: #ffffe0"><b>Info:</b>
+Should be incorporated into dw::Textblock.</div>
+
+Introduction
+============
+
+For the implementation of hyphenation in dillo, not only a
+hyphenation algorithm was implemented, but also, the line breaking was
+changed to a simple optimization per line. Aside from the improvement
+by this change per se, an important aspect is the introduction of
+"penalties". Before this change, dillo put all words into a line which
+fitted into it; now, a "badness" is calculated for a possible
+breakpoint, and the best breakpoint, i.&nbsp;e. the breakpoint with the
+smallest value for "badness", is chosen. This can be simply refined
+to define "good" and "bad" breakpoints by assigning a "penalty"; the
+best breakpoint is then the one with the smallest value of "badness +
+penalty". Details can be found below.
+
+Example: Normal spaces have a penalty of 0, while hyphenation points
+get a penalty of, say, 1, since hyphenation is generally considered as
+a bit "ugly" and should rather be avoided. Consider a situation where
+the word "dillo" could be hyphenated, with the following badnesses:
+
+- before "dillo": 0.6;
+- between "dil-" and "lo": 0.2;
+- after "dillo": 0.5.
+
+Since the penalty is added, the last value is the best one, so "dillo"
+is put at the end of the line, without hyphenation.
+
+Under other circumstances (e.&nbsp;g. narrower lines), the values
+might be different:
+
+- before "dillo": infinite;
+- between "dil-" and "lo": 0.3;
+- after "dillo": 1.5.
+
+In this case, even the addition of the penalty makes hyphenation the
+best choice.
+
+
+Literature
+==========
+
+Breaking Paragraphs Into Lines
+------------------------------
+
+Although dillo does not (yet?) implement the algorithm T<sub>E</sub>X
+uses for line breaking, this document shares much of the notation used
+by the article *Breaking Paragraphs Into Lines* by Donald E. Knuth and
+Michael F. Plass; originally published in: Software -- Practice and
+Experience **11** (1981), 1119-1184; reprinted in: *Digital
+Typography* by Donalt E. Knuth, CSLI Publications 1999.  Anyway an
+interesting reading.
+
+Hyphenation
+-----------
+
+Dillo uses the algorithm by Frank Liang, which is described in his
+doctoral dissertation found at http://www.tug.org/docs/liang/. There
+is also a description in chapter H ("Hyphenation") of *The
+T<sub>E</sub>Xbook* by Donald E. Knuth, Addison-Wesley 1984.
+
+Pattern files can be found at
+http://www.ctan.org/tex-archive/language/hyphenation.
+
+
+Overview of Changes
+===================
+
+Starting with this change, dw/textblock.cc has been split up; anything
+related to line breaking has been moved into
+dw/textblock_linebreaking.cc. This will also be done for other aspects
+like floats. (Better, however, would be a clean logical split.)
+
+An important change relates to the way that lines are added: before,
+dillo would add a line as soon as a new word for this line was
+added. Now, a line is added not before the *last* word of this line is
+known. This has two important implications:
+
+- Some values in dw::Textblock::Line, which represented values
+  accumulated within the line, could be removed, since now, these
+  values can be calculated simply in a loop.
+- On the other hand, this means that some words may not belong to any
+  line. For this reason, in some cases (e.&nbsp;g. in
+  dw::Textblock::sizeRequestImpl) dw::Textblock::showMissingLines is
+  called, which creates temporary lines, which must, under other
+  circumstances, be removed again by
+  dw::Textblock::removeTemporaryLines, since they have been created
+  based on limited information, and so possibly in a wrong way. (See
+  below for details.)
+
+When a word can be hyphenated, an instance of dw::Textblock::Word is
+used for each part. Notice that soft hyphens are evaluated
+immediately, but automatic hyphenation is done in a lazy way (details
+below), so the number of instances may change. There are some new
+attributes: only when dw::Textblock::Word::canBeHyphenated is set to
+*true*, automatic hyphenation is allowed; it is set to false when soft
+hyphens are used for a word, and (of course) by the automatic
+hyphenation itself. Furthermore, dw::Textblock::Word::hyphenWidth
+(more details in the comment there) has to be included when
+calculating line widths.
+
+Some values should be configurable: dw::Textblock::HYPHEN_BREAK, the
+penalty for hyphens. Also dw::Textblock::Word::stretchability,
+dw::Textblock::Word::shrinkability, which are both set in
+dw::Textblock::addSpace.
+
+
+Criteria for Line-Breaking
+==========================
+
+Before these changes to line breaking, a word (represented by
+dw::Textblock::Word) had the following attributes related to
+line-breaking:
+
+- the width of the word itself, represented by
+  dw::Textblock::Word::size;
+- the width of the space following the word, represented by
+  dw::Textblock::Word::origSpace.
+
+In a more mathematical notation, the \f$i\f$th word has a width
+\f$w_i\f$ and a space \f$s_i\f$.
+
+A break was possible, when there was a space between the two words,
+and the first possible break was chosen.
+
+With hyphenation, the criteria are refined. Hyphenation should only be
+used when otherwise line breaking results in very large spaces. We
+define:
+
+- the badness \f$\beta\f$ of a line, which is greater the more the
+  spaces between the words differ from the ideal space;
+- a penalty \f$p\f$ for any possible break point.
+
+The goal is to find those break points, where \f$\beta + p\f$ is
+minimal.
+
+Examples for the penalty \f$p\f$:
+
+- 0 for normal line breaks (between words);
+- \f$\infty\f$ to prevent a line break at all costs;
+- \f$-\infty\f$ to force a line
+- a positive, but finite, value for hyphenation points.
+
+So we need the following values:
+
+- \f$w_i\f$ (the width of the word \f$i\f$ itself);
+- \f$s_i\f$ (the width of the space following the word \f$i\f$);
+- the stretchability \f$y_i\f$, a value denoting how much the space
+  after word\f$i\f$ can be stretched (typically \f${1\over 2} s_i\f$
+  for justified text; otherwise 0, since the spaces are not
+  stretched);
+- the shrinkability \f$y_i\f$, a value denoting how much the space
+  after word\f$i\f$ can be shrunken (typically \f${1\over 3} s_i\f$
+  for justified text; otherwise 0, since the spaces are not shrinked);
+- the penalty \f$p_i\f$, if the line is broken after word \f$i\f$;
+- a width \f$h_i\f$, which is added, when the line is broken after
+  word \f$i\f$.
+
+\f$h_i\f$ is the width of the hyphen, if the word \f$i\f$ is a part of
+the hyphenated word (except the last part); otherwise 0.
+
+Let \f$l\f$ be the (ideal) width (length) of the line, which is
+e.&nbsp;at the top given by the browser window width. Furthermore, all words
+from \f$a\f$ to \f$b\f$ are added to the line. \f$a\f$ is fixed: we do
+not modify the previous lines anymore; but our task is to find a
+suitable \f$b\f$.
+
+We define:
+
+\f[W_a^b = \sum_{i=a}^{b} w_i + \sum_{i=a}^{b-1} s_i + h_b\f]
+
+\f[Y_a^b = {Y_0}_a^b + \sum_{i=a}^{b-1} y_i\f]
+
+\f[Z_a^b = {Z_0}_a^b + \sum_{i=a}^{b-1} z_i\f]
+
+
+\f$W_a^b\f$ is the total width, \f$Y_a^b\f$ the total stretchability,
+and \f$Z_a^b\f$ the total shrinkability. \f${Y_0}_a^b\f$ and
+\f${Z_0}_a^b\f$ are the stretchability and shrinkability defined per
+line, and applied at the borders; they are 0 for justified text, but
+\f${Y_0}_a^b\f$ has a positive value otherwise, see below for details.
+
+Furthermore the *adjustment ratio* \f$r_a^b\f$:
+
+- in the ideal case that \f$W_a^b = l\f$: \f$r_a^b = 0\f$;
+- if \f$W_a^b < l\f$: \f$r_a^b = (l - W_a^b) / Y_a^b\f$
+  (\f$r_a^b < 0\f$ in this case);
+- if \f$W_a^b > l\f$: \f$r_a^b = (l - W_a^b) / Z_a^b\f$
+  (\f$r_a^b < 0\f$ in this case).
+
+The badness \f$\beta_a^b\f$ is defined as follows:
+
+- if \f$r_a^b\f$ is undefined or \f$r_a^b < -1\f$: \f$\beta_a^b = \infty\f$;
+- otherwise: \f$\beta_a^b = |r_a^b|^3\f$
+
+The goal is to find the value of \f$b\f$ where \f$\beta_a^b + p_b\f$
+is minimal. (\f$a\f$ is given, since we do not modify the previous
+lines.)
+
+After a couple of words, it is not predictable whether this minimum
+has already been reached. There are two cases where this is possible
+for a given \f$b'\f$:
+
+- \f$\beta_{b'}^a = \infty\f$ (line gets too tight):
+  \f$a \le b < b'\f$, the minimum has to be searched between these two
+  values;
+- \f$p_{b'} = -\infty\f$ (forced line break):
+  \f$a \le b \le b'\f$ (there may be another minimum of
+  \f$\beta_a^b\f$ before; note the \f$\le\f$ instead of \f$<\f$).
+
+This leads to a problem that the last words of a text block are not
+displayed this way, since they do not fulfill these rules for being
+added to a line. For this reason, there are "temporary" lines already
+described above.
+
+(Note that the actual calculation differs from this description, since
+integer arithmetic is used for performance, which make the actual
+code more complicated. See dw::Textblock::BadnessAndPenalty for
+details.)
+
+Ragged Borders
+--------------
+
+For other than justified text (left-, right-aligned and centered), the
+spaces between the words are not shrinked or stretched (so \f$y_i\f$
+and \f$z_i\f$ are 0), but additional space is added to the left or
+right border or to both. For this reason, an additional stretchability
+\f${Y_0}_a^b\f$ is added (see definition above). Since this space at
+the border is 0 in an ideal case (\f$W_a^b = l\f$), it cannot be
+shrunken, so \f${Z_0}_a^b\f$ is 0.
+
+This is not equivalent to the calculation of the total stretchability
+as done for justified text, since in this case, the stretchability
+depends on the number of words: consider the typical case that all
+spaces and stretchabilities are equal (\f$y_a = y_{a + 1} = \ldots =
+y_b\f$). With \f$n\f$ words, the total strechability would be \f$n
+\cdot y_a\f$, so increase with an increasing number of words
+(\f$y_a\f$ is constant). This is correct for justified text, but for
+other alignments, where only one space (or two, for centered text) is
+changed, this would mean that a line with many narrow words is more
+stretchable than a line with few wide words.
+
+It is obvious that left-aligned text can be handled in the same way as
+right-aligned text. [... Centered text? ...]
+
+The default value for the stretchability is the line height without
+the space between the lines (more precisely: the maximum of all word
+heights). The exact value not so important when comparing different
+possible values for the badness \f$\beta_a^b\f$, when \f${Y_0}_a^b\f$
+is nearly constant for different \f$b\f$ (which is the case for the
+actual value), but it is important for the comparison with penalties,
+which are constant. To be considered is also that for non-justified
+text, hyphenation is differently (less) desirable; this effect can be
+achieved by enlarging the stretchability, which will lead to a smaller
+badness, and so make hyphenation less likely. The user can configure
+the stretchability by changing the preference value
+*stretchability_factor* (default: 1.0).
+
+(Comparison to T<sub>E</sub>X: Knuth and Plass describe a method for
+ragged borders, which is effectively the same as described here (Knuth
+1999, pp.&nbsp;93--94). The value for the stretchability of the line
+is slightly less, 1&nbsp;em (ibid., see also p.&nbsp;72 for the
+definition of the units). However, this article suggests a value for
+the hyphenation penalty, which is ten times larger than the value for
+justified text; this would suggest a larger value for
+*stretchability_factor*.)
+
+
+Hyphens
+=======
+
+Words (instances of dw::Textblock::Word), which are actually part of a
+hyphenated word, are always drawn as a whole, not seperately. This
+way, the underlying platform is able to apply kerning, ligatures, etc.
+
+Calculating the width of such words causes some problems, since it is
+not required that the width of text "AB" is identical to the width of
+"A" plus the width of "B", just for the reasons mentioned above. It
+gets even a bit more complicated, since it is required that a word
+part (instance of dw::Textblock::Word) has always the same length,
+independent of whether hyphenation is applied or not. Furthermore, the
+hyphen length is fixed for a word; for practical reasons, it is always
+the width of a hyphen, in the given font.
+
+For calculating the widths, consider a word of four syllables:
+A-B-C-D. There are 3 hyphenation points, and so 2<sup>3</sup> = 8
+possible ways of hyphenation: ABCD, ABC-D, AB-CD, AB-C-D, A-BCD,
+A-BC-D, A-B-CD, A-B-C-D. (Some of them, like the last one, are only
+probable for very narrow lines.)
+
+Let w(A), w(B), w(C), w(D) be the word widths (part of
+dw::Textblock::Word::size), which have to be calculated, and l be a
+shorthand for dw::core::Platform::textWidth. Without considering
+this problem, the calculation would be simple: w(A) = l(A)
+etc. However, it gets a bit more complicated. Since all
+non-hyphenations are drawn as a whole, the following conditions can be
+concluded:
+
+- from drawing "ABCD" (not hyphenated at all): w(A) + w(B) + w(C) +
+  w(D) = l(ABCD);
+- from drawing "BCD", when hyphenated as "A-BCD" ("A-" is not
+  considered here): w(B) + w(C) + w(D) = l(BCD);
+- likewise, from drawing "CD" (cases "AB-CD" and "A-B-CD"): w(C) +
+  w(D) = l(CD);
+- finally, for the cases "ABC-D", "AB-C-D", "A-BC-D", and "A-B-C-D":
+  w(D) = l(D).
+
+So, the calculation is simple:
+
+- w(D) = l(D)
+- w(C) = l(CD) - w(D)
+- w(B) = l(BCD) - (w(C) + w(D))
+- w(A) = l(ABCD) - (w(B) + w(C) + w(D))
+
+For calculation the hyphen widths, the exact conditions would be
+over-determined, even when the possibility for individual hyphen
+widths (instead of simply the text width of a hyphen character) would
+be used. However, a simple approach of fixed hyphen widths will have
+near-perfect results, so this is kept simple.
+
+
+Automatic Hyphenation
+=====================
+
+When soft hyphens are used, words are immediately divided into
+different parts, and so different instances of
+dw::Textblock::Word. Automatic hyphenation (using Liang's algorithm)
+is, however, not applied always, but only when possibly needed, after
+calculating a line without hyphenation:
+
+- When the line is tight, the last word of the line is hyphenated;
+  possibly this will result in a line with less parts of this word,
+  and so a less tight line.
+- When the line is loose, and there is another word (for the next
+  line) available, this word is hyphenated; possibly, some parts of
+  this word are taken into this line, making it less loose.
+
+After this, the line is re-calculated.
+
+A problem arrises when the textblock is rewrapped, e.&nbsp;g. when the
+user changes the window width. In this case, some new instances of
+dw::Textblock::Word must be inserted into the word list,
+dw::Textblock::words. This word list is implemented as an array, which
+is dynamically increased; a simple approach would involve moving all
+of the <i>n</i> elements after position <i>i</i>, so
+<i>n</i>&nbsp;-&nbsp;<i>i</i> steps are necessary. This would not be a
+problem, since O(n) steps are necessary; however, this will be
+necessary again for the next hyphenated word (at the end of a
+following line), and so on, so that
+(<i>n</i>&nbsp;-&nbsp;<i>i</i><sub>1</sub>) +
+(<i>n</i>&nbsp;-&nbsp;<i>i</i><sub>2</sub>) + ..., with
+<i>i</i><sub>1</sub>&nbsp;&lt;&nbsp;<i>i</i><sub>2</sub>&nbsp;&lt;&nbsp;...,
+which results in O(n<sup>2</sup>) steps. For this reason, the word
+list is managed by the class lout::misc::NotSoSimpleVector, which uses
+a trick (a second array) to deal with exactly this problem. See there
+for more details.
+
+
+Tests
+=====
+
+There are test HTML files in the <i>test</i> directory. Also, there is
+a program testing automatic hyphenation, <i>test/liang</i>, which can
+be easily extended.
+
+
+Bugs and Things Needing Improvement
+===================================
+
+High Priority
+-------------
+
+None.
+
+Medium Priority
+---------------
+
+None.
+
+Low Priority
+------------
+
+**Mark the end of a paragraph:** Should dw::core::Content::BREAK still
+be used? Currently, this is redundant to
+dw::Textblock::BadnessAndPenalty.
+
+Solved (Must Be Documented)
+---------------------------
+
+These have been solved recently and should be documented above.
+
+*Bugs in hyphenation:* There seem to be problems when breaking words
+containing hyphens already. Example: "Abtei-Stadt", which is divided
+into "Abtei-" and "Stadt", resulting possibly in
+&quot;Abtei-<span></span>-[new line]Stadt&quot;. See also below under
+"Medium Priority", on how to deal with hyphens and dashes.
+
+**Solution:** See next.
+
+*Break hyphens and dashes:* The following rules seem to be relevant:
+
+- In English, an em-dash is used with no spaces around. Breaking
+  before and after the dash should be possible, perhaps with a
+  penalty > 0. (In German, an en-dash (Halbgeviert) with spaces around
+  is used instead.)
+- After a hyphen, which is part of a compound word, a break should be
+  possible. As described above ("Abtei-Stadt"), this collides with
+  hyphenation.
+
+Where to implement? In the same dynamic, lazy way like hyphenation? As
+part of hyphenation?
+
+Notice that Liang's algorithm may behave different regarding hyphens:
+"Abtei-Stadt" is (using the patterns from CTAN) divided into "Abtei-"
+and "Stadt", but "Nordrhein-Westfalen" is divided into "Nord",
+"rhein-West", "fa", "len": the part containing the hyphen
+("rhein-West") is untouched. (Sorry for the German words; if you have
+got English examples, send them me.)
+
+**Solution for both:** This has been implemented in
+dw::Textblock::addText, in a similar way to soft hyphens. Liang's
+algorithm now only operates on the parts: "Abtei" and "Stadt";
+"Nordrhein" and "Westfalen".
+
+*Hyphens in adjacent lines:* It should be simple to assign a larger
+penalty for hyphens, when the line before is already hyphenated. This
+way, hyphens in adjacent lines are penalized further.
+
+**Solved:** There are always two penalties. Must be documented in
+detail.
+
+*Incorrect calculation of extremes:* The minimal width of a text block
+(as part of the width extremes, which are mainly used for tables) is
+defined by everything between two possible breaks. A possible break
+may also be a hyphenation point; however, hyphenation points are
+calculated in a lazy way, when the lines are broken, and not when
+extremes are calculated. So, it is a matter of chance whether the
+calculation of the minimal width will take the two parts "dil-" and
+"lo" into account (when "dillo" has already been hyphenated), or only
+one part, "dillo" (when "dillo" has not yet been hyphenated),
+resulting possibly in a different value for the minimal width.
+
+Possible strategies to deal with this problem:
+
+- Ignore. The implications should be minimal.
+- Any solution will make it neccessary to hyphenate at least some
+  words when calculating extremes. Since the minimal widths of all
+  words are used to calculate the minimal width of the text block, the
+  simplest approach will hyphenate all words. This would, of course,
+  eliminate the performance gains of the current lazy approach.
+- The latter approach could be optimized in some ways. Examples: (i)
+  If a word is already narrower than the current accumulated value for
+  the minimal width, it makes no sense to hyphenate it. (ii) In other
+  cases, heuristics may be used to estimate the number of syllables,
+  the width of the widest of them etc.
+
+**Solved:** Hyphenated parts of a word are not considered anymore for
+width extremes, but only whole words. This is also one reason for the
+introduction of the paragraphs list.
+
+**Also:**
+
+- Configuration of penalties.
+
+*/
author	Sebastian Geerken <devnull@localhost>	2015-06-03 22:34:03 +0200
committer	Sebastian Geerken <devnull@localhost>	2015-06-03 22:34:03 +0200
commit	da88ede8c401449729ae9c4b78f4c9914d81fa09 (patch)
tree	a66a438f5523269bf470b087d1f38410a688bb1a /devdoc/dw-line-breaking.doc
parent	d03e77b0aa3f1f70dcc1d1377b2262ad674ad87e (diff)
parent	59b76c75b64578edac35d19c914067a0bd7791e9 (diff)