diff options
Diffstat (limited to 'doc/dw-line-breaking.doc')
-rw-r--r-- | doc/dw-line-breaking.doc | 280 |
1 files changed, 210 insertions, 70 deletions
diff --git a/doc/dw-line-breaking.doc b/doc/dw-line-breaking.doc index b861680f..913ae14e 100644 --- a/doc/dw-line-breaking.doc +++ b/doc/dw-line-breaking.doc @@ -2,59 +2,137 @@ <div style="border: 2px solid #ff0000; margin-bottom: 0.5em; padding: 0.5em 1em; background-color: #ffe0e0"><b>Warning:</b> -Unsorted collection of notes. Should be incorporated into -dw::Textblock.</div> +Should be incorporated into dw::Textblock.</div> + +<h2>Introduction</h2> + +For the implementation of hyphenation in dillo, not only an +hyphenation algorithm was implemented, but also, the line breaking was +changed to a simple optimization per line. Aside from the improvement +by this change per se, an important aspect is the introduction of +"penalties". Before this change, dillo put all words into a line which +fitted into it; now, a "badness" is calculated for a possible +breakpoint, and the best breakpoint, i. e. the breakoint with the +smallest value for "badness", is choosen. This can be simply refined +to define "good" and "bad" breakpoints by assigning a "penalty"; the +best breakpoint is then the one with the smallest value of "badness + +penalty". Details can be found below. + +Example: Normal spaces have a penalty of 0, while hyphenation points +get a penalty of, say, 1, since hyphenation is generally considered as +a bit "ugly" and should rather be avoided. Consider a situation where +the word "dillo" could be hyphenated, with the following badnesses: <ul> -<li>Motivation: opimized line breaking and introduction of penalties.</li> -<li>Definition of word: hyphenationed word consists of multiple words.</li> -<li>Splitting up: dw/textblock.cc and dw/textblock_linebreaking.cc</li> -<li>Test HTML files in the <i>test</i> directory.</li> -<li>Adjust dw::Textblock::HYPHEN_BREAK. Also -dw::Textblock::Word::stretchability, -dw::Textblock::Word::shrinkability, which are both set in -dw::Textblock::addSpace().</li> -<li>Creating a line as a whole, "invisible" words.</li> -<li>Automatic, dynamic hyphenation by Liang's algorithm.</li> +<li>before "dillo": 0.6; +<li>between "dil-" and "lo": 0.2; +<li>after "dillo": 0.5. </ul> +Since the penalty is added, the last value is the best one, so "dillo" +is put at the end of the line, without hyphenation. + +Under other circumstances (e. g. narrower lines), the values +might be different: + +<ul> +<li>before "dillo": infinite; +<li>between "dil-" and "lo": 0.3; +<li>after "dillo": 1.5. +</ul> + +In this case, even the addition of the penalty makes hyphenation the +best choice. + + <h2>Literature</h2> <h3>Breaking Paragraphs Into Lines</h3> -Although dillo does not (yet?) implement the algorithm TeX uses for -line breaking, this document shares much of the notation used by the -article <i>Breaking Paragraphs Into Lines</i> by Donald E. Knuth and -Michael F. Plass; originally published in: Software -- Practice and -Experience <b>11</b> (1981), 1119-1184; reprinted in: <i>Digital -Typography</i> by Donalt E. Knuth, CSLI Publications 1999. Anyway an -interesting reading. +Although dillo does not (yet?) implement the algorithm T<sub>E</sub>X +uses for line breaking, this document shares much of the notation used +by the article <i>Breaking Paragraphs Into Lines</i> by Donald +E. Knuth and Michael F. Plass; originally published in: Software -- +Practice and Experience <b>11</b> (1981), 1119-1184; reprinted in: +<i>Digital Typography</i> by Donalt E. Knuth, CSLI Publications 1999. +Anyway an interesting reading. <h3>Hyphenation</h3> -Dillo uses the algorithm by Frank Liang, which is described at -http://www.tug.org/docs/liang/. Pattern files can be found at +Dillo uses the algorithm by Frank Liang, which is described his +doctoral dissertation found at http://www.tug.org/docs/liang/. There +is also a description in chapter H ("Hyphenation") of <i>The +T<sub>E</sub>Xbook</i> by Donald E. Knuth, Addison-Wesley 1984. + +Pattern files can be found at http://www.ctan.org/tex-archive/language/hyphenation. -<h2>Criteria for Line-Breaking</h2> -Currently, a word (represented by dw::Textblock::Word) has the -following attributes related to line-breaking: +<h2>Overview over Changes</h2> + +Starting with this change, dw/textblock.cc has been split up; anything +related to line breaking has been moved into +dw/textblock_linebreaking.cc. This will also be done for other aspects +like floats. (Better, however, would be a clean logical split.) + +An important change relates to the way how lines are added: before, +dillo would add a line as soon as a new word for this line was +added. Now, a line is added not before the <i>last</i> word of this +line is known. This has two important implications: <ul> +<li>Some values in dw::Textblock::Line, which represented values +accumulated within the line, could be removed, since now, these values +can be calculated simply in a loop. +<li>On the other hand, this means that some words may not belong to +any line. For this reason, in some cases (e. g. in +dw::Textblock::sizeRequestImpl) dw::Textblock::showMissingLines is +called, which creates temporary lines, which must, under other +circumstances, removed again by dw::Textblock::removeTemporaryLines, +since they have been created based on limited informations, and so +possibly in a wrong way. (See below for details.) +</ul> + +When a word can be hyphenated, an instance of dw::Textblock::Word is +used for each part. Notice that soft hyphens are evaluated +immediately, but automatic hyphenation is done in a lazy way (details +below), so the number of instances may change. There are some new +attributes: only when dw::Textblock::Word::canBeHyphenated is set to +<i>true</i>, automatic hyphenation is allowed; it is set to false when +soft hyphens are used for a word, and (of course) by the automatic +hyphenation itself. Furthermore, dw::Textblock::Word::hyphenWidth (more +details in the comment there) has to be included when calculating line +widths. + +Some values should be configurable: dw::Textblock::HYPHEN_BREAK, the +penalty for hyphens. (Unfortunately, this value is currently not well +defined, because of the integer arithmetics in +dw::Textblock::BadnessAndPenalty.) Also +dw::Textblock::Word::stretchability, +dw::Textblock::Word::shrinkability, which are both set in +dw::Textblock::addSpace. + -<li> the width of the word itself, represented by dw::Textblock::Word::size; -<li> the width of the space following the word, represented by +<h2>Criteria for Line-Breaking</h2> + +Before these changes to line breaking, a word (represented by +dw::Textblock::Word) had the following attributes related to +line-breaking: + +<ul> +<li>the width of the word itself, represented by dw::Textblock::Word::size; +<li>the width of the space following the word, represented by dw::Textblock::Word::origSpace. </ul> -[TODO: When is breaking between two words not permitted?] - In a more mathematical notation, the \f$i\f$th word has got a width \f$w_i\f$ and a space \f$s_i\f$. +A break was possible, when there was a space between the two words, +and the first possible break was choosen. + With hyphenation, the criteria are refined. Hyphenation should only be -used when otherwise line breaking results in very large spaces. So, we +used when otherwise line breaking results in very large spaces. We define: <ul> @@ -71,11 +149,10 @@ Examples for the penalty \f$p\f$: <ul> <li>0 for normal line breaks (between words); <li>\f$\infty\f$ to prevent a line break at all costs; -<li>\f$-\infty\f$ to force a line <li>a value > 0 for hyphens. +<li>\f$-\infty\f$ to force a line +<li>a positive, but finite, value for hyphenation points. </ul> -[TODO: which value exactly for hyphens?] - So we need the following values: <ul> @@ -143,54 +220,117 @@ b'\f$, the minimun has to be searched between these two values; \f$\le\f$ instead of \f$<\f$). </ul> -This also means, that all lines must be finalized with a forced line -break. - -[TODO: Change: only adding complete lines, which leads to "hanging" -words; temporary lines.] - -<h2>Soft Hyphens</h2> +This leads to a problem that the last words of a text block are not +displayed this way, since they do not fulfill these rules for being +added to a line. For this reason, there are "temporary" lines already +described above. + +(Note that the actual calculation differs from this description, since +integer arithmetics are used for performace, which make the actual +code more complicated. See dw::Textblock::BadnessAndPenalty for +details.) + + +<h2>Hyphens</h2> + +Words (instances of dw::Textblock::Word), which are actually part of a +hyphenated word, are always drawn as a whole, not seperately. This +way, the underlying platform is able to apply kerning, ligatures, etc. + +Calculating the width of such words causes some problems, since it is +not required that the width of text "AB" is identical to the width of +"A" plus the width of "B", just for the reasons mentioned above. It +gets even a bit more complicated, since it is required that a word +part (instance of dw::Textblock::Word) has always the same length, +independant of whether hyphenation is applied or not. Furthermore, the +hyphen length is fixed for a word; for practical reasons, it is always +the width of a hyphen, in the given font. + +For calculating the widths, consider a word of for syllables: +A-B-C-D. There are 3 hyphenation points, and so 2<sup>3</sup> = 8 +possible ways of hyphenation: ABCD, ABC-D, AB-CD, AB-C-D, A-BCD, +A-BC-D, A-B-CD, A-B-C-D. (Some of them, like the last one, are only +probable for very narrow lines.) + +Let w(A), w(B), W(C), W(D) be the word widths (part of +dw::Textblock::Word::size), which have to be calculated, and l be a +shorthand for of dw::core::Platform::textWidth. Without considering +this problem, the calculatin would be simple: w(A) = l(A) +etc. However, it gets a bit more complicated. Since all +non-hyphenations are drawn as a whole, the following conditions can be +concluded: -Calculating the width causes some problems, since it is not required -that the width of text "AB" is identical to the width of "A" plus the -width of "B". Instead, kerning, ligatures etc. may lead to other -results. For this reason, a word is also always drawn as a whole. - -[TODO: Translate the following text to English.] - -<pre> -Als Beispiel Wort aus vier trennbaren Silben: A-B-C-D +<ul> +<li>from drawing "ABCD" (not hyphenated at all): w(A) + w(B) + w(C) + +w(D) = l(ABCD); +<li>from drawing "BCD", when hyphenated as "A-BCD" ("A-" is not +considered here): w(B) + w(C) + w(D) = l(BCD); +<li>likewise, from drawing "CD" (cases "AB-CD" and "A-B-CD"): w(C) + +w(D) = l(CD); +<li>finally, for the cases "ABC-D", "AB-C-D", "A-BC-D", and "A-B-C-D": +w(D) = l(D). +</ul> -3 mögliche Trennstelle, daher 8 mögliche Trennungen (allerdings einige -unwahrscheinlich): +So, the calculation is simple: -ABCD, ABC-D, AB-CD, AB-C-D, A-BCD, A-BC-D, A-B-CD, A-B-C-D, +<ul> +<li>w(D) = l(D) +<li>w(C) = l(CD) - w(D) +<li>w(B) = l(BCD) - (w(C) + w(D)) +<li>w(A) = l(ABCD) - (w(B) + w(C) + w(D)) +</ul> -w sei Wortbreite, l das Ergebnis von textWidth. Zwingende Bedingungen -(bezieht sich auf die Teile, wo kein Trennstrich vorkommt; bei -Trennstrichen kann man eventuell improvisieren): +For calculation the hyphen widths, the exact conditions would be +over-determined, even when the possibility for individual hyphen +widths (instead of simply the text width of a hyphen character) would +be used. However, a simple approach of fixed hyphen widths will have +near-perfect results, so this is kept simple. -ABCD => w(A) + w(B) + w(C) + w(D) = l(ABCD) -A-BCD => w(B) + w(C) + w(D) = l(BCD) -AB-CD, A-B-CD => w(C) + w(D) = l(CD) -ABC-D, AB-C-D, A-BC-D, A-B-C-D => w(D) = l(D) -Also einfache Berechnung: +<h2>Automatic Hyphenation</h2> -w(D) = l(D) -w(C) = l(CD) - w(D) -w(B) = l(BCD) - (w(C) + w(D)) -w(A) = l(ABCD) - (w(B) + w(C) + w(D)) +When soft hyphens are used, words are immediately divided into +different parts, and so different instances of +dw::Textblock::Word. Atomatic hyphenation (using Liang's algorithm) +is, however, not applied always, but only when possibly needed, after +calculating a line without hyphenation: -Bei den Breiten inklusive Trennstrich ergibt sich eine -Überbestimmtheit (bei konsanter Trennstrichbreite pro Wort). Daher -wird einfach eine feste Trennstrichbreite angenommen: +<ul> +<li>When the line is tight, the last word of the line is hyphenated; +possibly this will result in a line with less parts of this word, and +so a less tight line. +<li>When the line is loose, and there is another word (for the +next line) availible, this word is hyphenated; possibly, some parts of +this word are taken into this line, making it less loose. +</ul> -w(A-) = w(A) + l(-) -w(AB-) = w(A) + w(B) + l(-) +After this, the line is re-calculated. + +A problem arrises when the textblock is rewrapped, e. g. when the +user changes the window width. In this case, some new instances of +dw::Textblock::Word must be inserted into the word list, +dw::Textblock::words. This word list is implemented as an array, which +is dynamically increased; a simple approach would involve moving all +of the <i>n</i> elements after position <i>i</i>, so +<i>n</i> - <i>i</i> steps are necessary. This would not be a +problem, since O(n) steps are neccessary; however, this will be +neccessary again for the next hyphenated word (at the end of a +following line), and so on, so that +(<i>n</i> - <i>i</i><sub>1</sub>) + +(<i>n</i> - <i>i</i><sub>2</sub>) + ..., with +<i>i</i><sub>1</sub> < <i>i</i><sub>2</sub> < ..., +which results in O(n<sup>2</sup>) steps. For this reason, the word +list is managed by the class lout::misc::NotSoSimpleVector, which uses +a trick (a second array) to deal with exactly this problem. See there +for more details. + + +<h2>Tests</h2> + +There are test HTML files in the <i>test</i> directory. Also, there is +a program testing automatic hyphenation, <i>test/liang</i>, which can +be easily extended. -usw. -</pre> <h2>Bugs and Things Needing Improvement</h2> |