aboutsummaryrefslogtreecommitdiff
path: root/doc
diff options
context:
space:
mode:
Diffstat (limited to 'doc')
-rw-r--r--doc/dw-line-breaking.doc280
1 files changed, 210 insertions, 70 deletions
diff --git a/doc/dw-line-breaking.doc b/doc/dw-line-breaking.doc
index b861680f..913ae14e 100644
--- a/doc/dw-line-breaking.doc
+++ b/doc/dw-line-breaking.doc
@@ -2,59 +2,137 @@
<div style="border: 2px solid #ff0000; margin-bottom: 0.5em;
padding: 0.5em 1em; background-color: #ffe0e0"><b>Warning:</b>
-Unsorted collection of notes. Should be incorporated into
-dw::Textblock.</div>
+Should be incorporated into dw::Textblock.</div>
+
+<h2>Introduction</h2>
+
+For the implementation of hyphenation in dillo, not only an
+hyphenation algorithm was implemented, but also, the line breaking was
+changed to a simple optimization per line. Aside from the improvement
+by this change per se, an important aspect is the introduction of
+"penalties". Before this change, dillo put all words into a line which
+fitted into it; now, a "badness" is calculated for a possible
+breakpoint, and the best breakpoint, i.&nbsp;e. the breakoint with the
+smallest value for "badness", is choosen. This can be simply refined
+to define "good" and "bad" breakpoints by assigning a "penalty"; the
+best breakpoint is then the one with the smallest value of "badness +
+penalty". Details can be found below.
+
+Example: Normal spaces have a penalty of 0, while hyphenation points
+get a penalty of, say, 1, since hyphenation is generally considered as
+a bit "ugly" and should rather be avoided. Consider a situation where
+the word "dillo" could be hyphenated, with the following badnesses:
<ul>
-<li>Motivation: opimized line breaking and introduction of penalties.</li>
-<li>Definition of word: hyphenationed word consists of multiple words.</li>
-<li>Splitting up: dw/textblock.cc and dw/textblock_linebreaking.cc</li>
-<li>Test HTML files in the <i>test</i> directory.</li>
-<li>Adjust dw::Textblock::HYPHEN_BREAK. Also
-dw::Textblock::Word::stretchability,
-dw::Textblock::Word::shrinkability, which are both set in
-dw::Textblock::addSpace().</li>
-<li>Creating a line as a whole, "invisible" words.</li>
-<li>Automatic, dynamic hyphenation by Liang's algorithm.</li>
+<li>before "dillo": 0.6;
+<li>between "dil-" and "lo": 0.2;
+<li>after "dillo": 0.5.
</ul>
+Since the penalty is added, the last value is the best one, so "dillo"
+is put at the end of the line, without hyphenation.
+
+Under other circumstances (e.&nbsp;g. narrower lines), the values
+might be different:
+
+<ul>
+<li>before "dillo": infinite;
+<li>between "dil-" and "lo": 0.3;
+<li>after "dillo": 1.5.
+</ul>
+
+In this case, even the addition of the penalty makes hyphenation the
+best choice.
+
+
<h2>Literature</h2>
<h3>Breaking Paragraphs Into Lines</h3>
-Although dillo does not (yet?) implement the algorithm TeX uses for
-line breaking, this document shares much of the notation used by the
-article <i>Breaking Paragraphs Into Lines</i> by Donald E. Knuth and
-Michael F. Plass; originally published in: Software -- Practice and
-Experience <b>11</b> (1981), 1119-1184; reprinted in: <i>Digital
-Typography</i> by Donalt E. Knuth, CSLI Publications 1999. Anyway an
-interesting reading.
+Although dillo does not (yet?) implement the algorithm T<sub>E</sub>X
+uses for line breaking, this document shares much of the notation used
+by the article <i>Breaking Paragraphs Into Lines</i> by Donald
+E. Knuth and Michael F. Plass; originally published in: Software --
+Practice and Experience <b>11</b> (1981), 1119-1184; reprinted in:
+<i>Digital Typography</i> by Donalt E. Knuth, CSLI Publications 1999.
+Anyway an interesting reading.
<h3>Hyphenation</h3>
-Dillo uses the algorithm by Frank Liang, which is described at
-http://www.tug.org/docs/liang/. Pattern files can be found at
+Dillo uses the algorithm by Frank Liang, which is described his
+doctoral dissertation found at http://www.tug.org/docs/liang/. There
+is also a description in chapter H ("Hyphenation") of <i>The
+T<sub>E</sub>Xbook</i> by Donald E. Knuth, Addison-Wesley 1984.
+
+Pattern files can be found at
http://www.ctan.org/tex-archive/language/hyphenation.
-<h2>Criteria for Line-Breaking</h2>
-Currently, a word (represented by dw::Textblock::Word) has the
-following attributes related to line-breaking:
+<h2>Overview over Changes</h2>
+
+Starting with this change, dw/textblock.cc has been split up; anything
+related to line breaking has been moved into
+dw/textblock_linebreaking.cc. This will also be done for other aspects
+like floats. (Better, however, would be a clean logical split.)
+
+An important change relates to the way how lines are added: before,
+dillo would add a line as soon as a new word for this line was
+added. Now, a line is added not before the <i>last</i> word of this
+line is known. This has two important implications:
<ul>
+<li>Some values in dw::Textblock::Line, which represented values
+accumulated within the line, could be removed, since now, these values
+can be calculated simply in a loop.
+<li>On the other hand, this means that some words may not belong to
+any line. For this reason, in some cases (e.&nbsp;g. in
+dw::Textblock::sizeRequestImpl) dw::Textblock::showMissingLines is
+called, which creates temporary lines, which must, under other
+circumstances, removed again by dw::Textblock::removeTemporaryLines,
+since they have been created based on limited informations, and so
+possibly in a wrong way. (See below for details.)
+</ul>
+
+When a word can be hyphenated, an instance of dw::Textblock::Word is
+used for each part. Notice that soft hyphens are evaluated
+immediately, but automatic hyphenation is done in a lazy way (details
+below), so the number of instances may change. There are some new
+attributes: only when dw::Textblock::Word::canBeHyphenated is set to
+<i>true</i>, automatic hyphenation is allowed; it is set to false when
+soft hyphens are used for a word, and (of course) by the automatic
+hyphenation itself. Furthermore, dw::Textblock::Word::hyphenWidth (more
+details in the comment there) has to be included when calculating line
+widths.
+
+Some values should be configurable: dw::Textblock::HYPHEN_BREAK, the
+penalty for hyphens. (Unfortunately, this value is currently not well
+defined, because of the integer arithmetics in
+dw::Textblock::BadnessAndPenalty.) Also
+dw::Textblock::Word::stretchability,
+dw::Textblock::Word::shrinkability, which are both set in
+dw::Textblock::addSpace.
+
-<li> the width of the word itself, represented by dw::Textblock::Word::size;
-<li> the width of the space following the word, represented by
+<h2>Criteria for Line-Breaking</h2>
+
+Before these changes to line breaking, a word (represented by
+dw::Textblock::Word) had the following attributes related to
+line-breaking:
+
+<ul>
+<li>the width of the word itself, represented by dw::Textblock::Word::size;
+<li>the width of the space following the word, represented by
dw::Textblock::Word::origSpace.
</ul>
-[TODO: When is breaking between two words not permitted?]
-
In a more mathematical notation, the \f$i\f$th word has got a width
\f$w_i\f$ and a space \f$s_i\f$.
+A break was possible, when there was a space between the two words,
+and the first possible break was choosen.
+
With hyphenation, the criteria are refined. Hyphenation should only be
-used when otherwise line breaking results in very large spaces. So, we
+used when otherwise line breaking results in very large spaces. We
define:
<ul>
@@ -71,11 +149,10 @@ Examples for the penalty \f$p\f$:
<ul>
<li>0 for normal line breaks (between words);
<li>\f$\infty\f$ to prevent a line break at all costs;
-<li>\f$-\infty\f$ to force a line <li>a value > 0 for hyphens.
+<li>\f$-\infty\f$ to force a line
+<li>a positive, but finite, value for hyphenation points.
</ul>
-[TODO: which value exactly for hyphens?]
-
So we need the following values:
<ul>
@@ -143,54 +220,117 @@ b'\f$, the minimun has to be searched between these two values;
\f$\le\f$ instead of \f$<\f$).
</ul>
-This also means, that all lines must be finalized with a forced line
-break.
-
-[TODO: Change: only adding complete lines, which leads to "hanging"
-words; temporary lines.]
-
-<h2>Soft Hyphens</h2>
+This leads to a problem that the last words of a text block are not
+displayed this way, since they do not fulfill these rules for being
+added to a line. For this reason, there are "temporary" lines already
+described above.
+
+(Note that the actual calculation differs from this description, since
+integer arithmetics are used for performace, which make the actual
+code more complicated. See dw::Textblock::BadnessAndPenalty for
+details.)
+
+
+<h2>Hyphens</h2>
+
+Words (instances of dw::Textblock::Word), which are actually part of a
+hyphenated word, are always drawn as a whole, not seperately. This
+way, the underlying platform is able to apply kerning, ligatures, etc.
+
+Calculating the width of such words causes some problems, since it is
+not required that the width of text "AB" is identical to the width of
+"A" plus the width of "B", just for the reasons mentioned above. It
+gets even a bit more complicated, since it is required that a word
+part (instance of dw::Textblock::Word) has always the same length,
+independant of whether hyphenation is applied or not. Furthermore, the
+hyphen length is fixed for a word; for practical reasons, it is always
+the width of a hyphen, in the given font.
+
+For calculating the widths, consider a word of for syllables:
+A-B-C-D. There are 3 hyphenation points, and so 2<sup>3</sup> = 8
+possible ways of hyphenation: ABCD, ABC-D, AB-CD, AB-C-D, A-BCD,
+A-BC-D, A-B-CD, A-B-C-D. (Some of them, like the last one, are only
+probable for very narrow lines.)
+
+Let w(A), w(B), W(C), W(D) be the word widths (part of
+dw::Textblock::Word::size), which have to be calculated, and l be a
+shorthand for of dw::core::Platform::textWidth. Without considering
+this problem, the calculatin would be simple: w(A) = l(A)
+etc. However, it gets a bit more complicated. Since all
+non-hyphenations are drawn as a whole, the following conditions can be
+concluded:
-Calculating the width causes some problems, since it is not required
-that the width of text "AB" is identical to the width of "A" plus the
-width of "B". Instead, kerning, ligatures etc. may lead to other
-results. For this reason, a word is also always drawn as a whole.
-
-[TODO: Translate the following text to English.]
-
-<pre>
-Als Beispiel Wort aus vier trennbaren Silben: A-B-C-D
+<ul>
+<li>from drawing "ABCD" (not hyphenated at all): w(A) + w(B) + w(C) +
+w(D) = l(ABCD);
+<li>from drawing "BCD", when hyphenated as "A-BCD" ("A-" is not
+considered here): w(B) + w(C) + w(D) = l(BCD);
+<li>likewise, from drawing "CD" (cases "AB-CD" and "A-B-CD"): w(C) +
+w(D) = l(CD);
+<li>finally, for the cases "ABC-D", "AB-C-D", "A-BC-D", and "A-B-C-D":
+w(D) = l(D).
+</ul>
-3 mögliche Trennstelle, daher 8 mögliche Trennungen (allerdings einige
-unwahrscheinlich):
+So, the calculation is simple:
-ABCD, ABC-D, AB-CD, AB-C-D, A-BCD, A-BC-D, A-B-CD, A-B-C-D,
+<ul>
+<li>w(D) = l(D)
+<li>w(C) = l(CD) - w(D)
+<li>w(B) = l(BCD) - (w(C) + w(D))
+<li>w(A) = l(ABCD) - (w(B) + w(C) + w(D))
+</ul>
-w sei Wortbreite, l das Ergebnis von textWidth. Zwingende Bedingungen
-(bezieht sich auf die Teile, wo kein Trennstrich vorkommt; bei
-Trennstrichen kann man eventuell improvisieren):
+For calculation the hyphen widths, the exact conditions would be
+over-determined, even when the possibility for individual hyphen
+widths (instead of simply the text width of a hyphen character) would
+be used. However, a simple approach of fixed hyphen widths will have
+near-perfect results, so this is kept simple.
-ABCD => w(A) + w(B) + w(C) + w(D) = l(ABCD)
-A-BCD => w(B) + w(C) + w(D) = l(BCD)
-AB-CD, A-B-CD => w(C) + w(D) = l(CD)
-ABC-D, AB-C-D, A-BC-D, A-B-C-D => w(D) = l(D)
-Also einfache Berechnung:
+<h2>Automatic Hyphenation</h2>
-w(D) = l(D)
-w(C) = l(CD) - w(D)
-w(B) = l(BCD) - (w(C) + w(D))
-w(A) = l(ABCD) - (w(B) + w(C) + w(D))
+When soft hyphens are used, words are immediately divided into
+different parts, and so different instances of
+dw::Textblock::Word. Atomatic hyphenation (using Liang's algorithm)
+is, however, not applied always, but only when possibly needed, after
+calculating a line without hyphenation:
-Bei den Breiten inklusive Trennstrich ergibt sich eine
-Überbestimmtheit (bei konsanter Trennstrichbreite pro Wort). Daher
-wird einfach eine feste Trennstrichbreite angenommen:
+<ul>
+<li>When the line is tight, the last word of the line is hyphenated;
+possibly this will result in a line with less parts of this word, and
+so a less tight line.
+<li>When the line is loose, and there is another word (for the
+next line) availible, this word is hyphenated; possibly, some parts of
+this word are taken into this line, making it less loose.
+</ul>
-w(A-) = w(A) + l(-)
-w(AB-) = w(A) + w(B) + l(-)
+After this, the line is re-calculated.
+
+A problem arrises when the textblock is rewrapped, e.&nbsp;g. when the
+user changes the window width. In this case, some new instances of
+dw::Textblock::Word must be inserted into the word list,
+dw::Textblock::words. This word list is implemented as an array, which
+is dynamically increased; a simple approach would involve moving all
+of the <i>n</i> elements after position <i>i</i>, so
+<i>n</i>&nbsp;-&nbsp;<i>i</i> steps are necessary. This would not be a
+problem, since O(n) steps are neccessary; however, this will be
+neccessary again for the next hyphenated word (at the end of a
+following line), and so on, so that
+(<i>n</i>&nbsp;-&nbsp;<i>i</i><sub>1</sub>) +
+(<i>n</i>&nbsp;-&nbsp;<i>i</i><sub>2</sub>) + ..., with
+<i>i</i><sub>1</sub>&nbsp;&lt;&nbsp;<i>i</i><sub>2</sub>&nbsp;&lt;&nbsp;...,
+which results in O(n<sup>2</sup>) steps. For this reason, the word
+list is managed by the class lout::misc::NotSoSimpleVector, which uses
+a trick (a second array) to deal with exactly this problem. See there
+for more details.
+
+
+<h2>Tests</h2>
+
+There are test HTML files in the <i>test</i> directory. Also, there is
+a program testing automatic hyphenation, <i>test/liang</i>, which can
+be easily extended.
-usw.
-</pre>
<h2>Bugs and Things Needing Improvement</h2>