Diffstat (limited to 'doc')
-rw-r--r-- | doc/dw-line-breaking.doc | 297 |
1 files changed, 141 insertions, 156 deletions
diff --git a/doc/dw-line-breaking.doc b/doc/dw-line-breaking.doc
index 6492a579..2967e98f 100644
--- a/doc/dw-line-breaking.doc
+++ b/doc/dw-line-breaking.doc
@@ -4,7 +4,8 @@
 padding: 0.5em 1em; background-color: #ffffe0"><b>Info:</b>
 Should be incorporated into dw::Textblock.</div>

-<h2>Introduction</h2>
+Introduction
+============

 For the implementation of hyphenation in dillo, not only a
 hyphenation algorithm was implemented, but also, the line breaking was
@@ -23,11 +24,9 @@ get a penalty of, say, 1, since hyphenation is generally considered as
 a bit "ugly" and should rather be avoided. Consider a situation where
 the word "dillo" could be hyphenated, with the following badnesses:

-<ul>
-<li>before "dillo": 0.6;
-<li>between "dil-" and "lo": 0.2;
-<li>after "dillo": 0.5.
-</ul>
+- before "dillo": 0.6;
+- between "dil-" and "lo": 0.2;
+- after "dillo": 0.5.

 Since the penalty is added, the last value is the best one, so "dillo"
 is put at the end of the line, without hyphenation.
@@ -35,40 +34,42 @@ is put at the end of the line, without hyphenation.
 Under other circumstances (e. g. narrower lines), the values might be
 different:

-<ul>
-<li>before "dillo": infinite;
-<li>between "dil-" and "lo": 0.3;
-<li>after "dillo": 1.5.
-</ul>
+- before "dillo": infinite;
+- between "dil-" and "lo": 0.3;
+- after "dillo": 1.5.

 In this case, even the addition of the penalty makes hyphenation the
 best choice.


-<h2>Literature</h2>
+Literature
+==========

-<h3>Breaking Paragraphs Into Lines</h3>
+Breaking Paragraphs Into Lines
+------------------------------

 Although dillo does not (yet?) implement the algorithm T<sub>E</sub>X
 uses for line breaking, this document shares much of the notation used
-by the article <i>Breaking Paragraphs Into Lines</i> by Donald
-E. Knuth and Michael F. Plass; originally published in: Software --
-Practice and Experience <b>11</b> (1981), 1119-1184; reprinted in:
-<i>Digital Typography</i> by Donalt E. Knuth, CSLI Publications 1999.
-Anyway an interesting reading.
+by the article *Breaking Paragraphs Into Lines* by Donald E. Knuth and
+Michael F. Plass; originally published in: Software -- Practice and
+Experience **11** (1981), 1119-1184; reprinted in: *Digital
+Typography* by Donalt E. Knuth, CSLI Publications 1999. Anyway an
+interesting reading.

-<h3>Hyphenation</h3>
+Hyphenation
+-----------

 Dillo uses the algorithm by Frank Liang, which is described in his
 doctoral dissertation found at http://www.tug.org/docs/liang/. There
-is also a description in chapter H ("Hyphenation") of <i>The
-T<sub>E</sub>Xbook</i> by Donald E. Knuth, Addison-Wesley 1984.
+is also a description in chapter H ("Hyphenation") of *The
+T<sub>E</sub>Xbook* by Donald E. Knuth, Addison-Wesley 1984.

 Pattern files can be found at
 http://www.ctan.org/tex-archive/language/hyphenation.


-<h2>Overview of Changes</h2>
+Overview of Changes
+===================

 Starting with this change, dw/textblock.cc has been split up; anything
 related to line breaking has been moved into
@@ -77,32 +78,31 @@ like floats. (Better, however, would be a clean logical split.)

 An important change relates to the way that lines are added: before,
 dillo would add a line as soon as a new word for this line was
-added. Now, a line is added not before the <i>last</i> word of this
-line is known. This has two important implications:
-
-<ul>
-<li>Some values in dw::Textblock::Line, which represented values
-accumulated within the line, could be removed, since now, these values
-can be calculated simply in a loop.
-<li>On the other hand, this means that some words may not belong to
-any line. For this reason, in some cases (e. g. in
-dw::Textblock::sizeRequestImpl) dw::Textblock::showMissingLines is
-called, which creates temporary lines, which must, under other
-circumstances, be removed again by dw::Textblock::removeTemporaryLines,
-since they have been created based on limited information, and so
-possibly in a wrong way. (See below for details.)
-</ul>
+added. Now, a line is added not before the *last* word of this line is
+known. This has two important implications:
+
+- Some values in dw::Textblock::Line, which represented values
+  accumulated within the line, could be removed, since now, these
+  values can be calculated simply in a loop.
+- On the other hand, this means that some words may not belong to any
+  line. For this reason, in some cases (e. g. in
+  dw::Textblock::sizeRequestImpl) dw::Textblock::showMissingLines is
+  called, which creates temporary lines, which must, under other
+  circumstances, be removed again by
+  dw::Textblock::removeTemporaryLines, since they have been created
+  based on limited information, and so possibly in a wrong way. (See
+  below for details.)

 When a word can be hyphenated, an instance of dw::Textblock::Word is
 used for each part. Notice that soft hyphens are evaluated
 immediately, but automatic hyphenation is done in a lazy way (details
 below), so the number of instances may change. There are some new
 attributes: only when dw::Textblock::Word::canBeHyphenated is set to
-<i>true</i>, automatic hyphenation is allowed; it is set to false when
-soft hyphens are used for a word, and (of course) by the automatic
-hyphenation itself. Furthermore, dw::Textblock::Word::hyphenWidth (more
-details in the comment there) has to be included when calculating line
-widths.
+*true*, automatic hyphenation is allowed; it is set to false when soft
+hyphens are used for a word, and (of course) by the automatic
+hyphenation itself. Furthermore, dw::Textblock::Word::hyphenWidth
+(more details in the comment there) has to be included when
+calculating line widths.

 Some values should be configurable: dw::Textblock::HYPHEN_BREAK, the
 penalty for hyphens. Also dw::Textblock::Word::stretchability,
@@ -110,17 +110,17 @@ dw::Textblock::Word::shrinkability, which are both set in
 dw::Textblock::addSpace.


-<h2>Criteria for Line-Breaking</h2>
+Criteria for Line-Breaking
+==========================

 Before these changes to line breaking, a word (represented by
 dw::Textblock::Word) had the following attributes related to
 line-breaking:

-<ul>
-<li>the width of the word itself, represented by dw::Textblock::Word::size;
-<li>the width of the space following the word, represented by
-dw::Textblock::Word::origSpace.
-</ul>
+- the width of the word itself, represented by
+  dw::Textblock::Word::size;
+- the width of the space following the word, represented by
+  dw::Textblock::Word::origSpace.

 In a more mathematical notation, the \f$i\f$th word has a width
 \f$w_i\f$ and a space \f$s_i\f$.
@@ -132,37 +132,31 @@ With hyphenation, the criteria are refined. Hyphenation should only be
 used when otherwise line breaking results in very large spaces. We
 define:

-<ul>
-<li>the badness \f$\beta\f$ of a line, which is greater the more the
-spaces between the words differ from the ideal space;
-<li>a penalty \f$p\f$ for any possible break point.
-</ul>
+- the badness \f$\beta\f$ of a line, which is greater the more the
+  spaces between the words differ from the ideal space;
+- a penalty \f$p\f$ for any possible break point.

 The goal is to find those break points, where \f$\beta + p\f$ is
 minimal.

 Examples for the penalty \f$p\f$:

-<ul>
-<li>0 for normal line breaks (between words);
-<li>\f$\infty\f$ to prevent a line break at all costs;
-<li>\f$-\infty\f$ to force a line
-<li>a positive, but finite, value for hyphenation points.
-</ul>
+- 0 for normal line breaks (between words);
+- \f$\infty\f$ to prevent a line break at all costs;
+- \f$-\infty\f$ to force a line
+- a positive, but finite, value for hyphenation points.

 So we need the following values:

-<ul>
-<li> \f$w_i\f$ (the width of the word \f$i\f$ itself);
-<li> \f$s_i\f$ (the width of the space following the word \f$i\f$);
-<li> the stretchability \f$y_i\f$, a value denoting how much the space
-after word\f$i\f$ can be stretched (typically \f${1\over 2} s_i\f$);
-<li> the shrinkability \f$y_i\f$, a value denoting how much the space
-after word\f$i\f$ can be shrunken (typically \f${1\over 3} s_i\f$);
-<li> the penalty \f$p_i\f$, if the line is broken after word \f$i\f$;
-<li> a width \f$h_i\f$, which is added, when the line is broken after
-word \f$i\f$.
-</ul>
+- \f$w_i\f$ (the width of the word \f$i\f$ itself);
+- \f$s_i\f$ (the width of the space following the word \f$i\f$);
+- the stretchability \f$y_i\f$, a value denoting how much the space
+  after word\f$i\f$ can be stretched (typically \f${1\over 2} s_i\f$);
+- the shrinkability \f$y_i\f$, a value denoting how much the space
+  after word\f$i\f$ can be shrunken (typically \f${1\over 3} s_i\f$);
+- the penalty \f$p_i\f$, if the line is broken after word \f$i\f$;
+- a width \f$h_i\f$, which is added, when the line is broken after
+  word \f$i\f$.

 \f$h_i\f$ is the width of the hyphen, if the word \f$i\f$ is a part of
 the hyphenated word (except the last part); otherwise 0.
@@ -185,22 +179,18 @@ We define:
 \f$W_a^b\f$ is the total width, \f$Y_a^b\f$ the total stretchability,
 and \f$Z_a^b\f$ the total shrinkability.

-Furthermore the <i>adjustment ratio</i> \f$r_a^b\f$:
+Furthermore the *adjustment ratio* \f$r_a^b\f$:

-<ul>
-<li>in the ideal case that \f$W_a^b = l\f$: \f$r_a^b = 0\f$;
-<li>if \f$W_a^b < l\f$: \f$r_a^b = (l - W_a^b) / Y_a^b\f$ (\f$r_a^b < 0\f$ in
-this case);
-<li>if \f$W_a^b > l\f$: \f$r_a^b = (l - W_a^b) / Z_a^b\f$ (\f$r_a^b < 0\f$ in
-this case).
-</ul>
+- in the ideal case that \f$W_a^b = l\f$: \f$r_a^b = 0\f$;
+- if \f$W_a^b < l\f$: \f$r_a^b = (l - W_a^b) / Y_a^b\f$
+  (\f$r_a^b < 0\f$ in this case);
+- if \f$W_a^b > l\f$: \f$r_a^b = (l - W_a^b) / Z_a^b\f$
+  (\f$r_a^b < 0\f$ in this case).

 The badness \f$\beta_a^b\f$ is defined as follows:

-<ul>
-<li>if \f$r_a^b\f$ is undefined or \f$r_a^b < -1\f$: \f$\beta_a^b = \infty\f$;
-<li>otherwise: \f$\beta_a^b = |r_a^b|^3\f$
-</ul>
+- if \f$r_a^b\f$ is undefined or \f$r_a^b < -1\f$: \f$\beta_a^b = \infty\f$;
+- otherwise: \f$\beta_a^b = |r_a^b|^3\f$

 The goal is to find the value of \f$b\f$ where \f$\beta_a^b + p_b\f$
 is minimal. (\f$a\f$ is given, since we do not modify the previous
@@ -210,13 +200,12 @@ After a couple of words, it is not predictable whether this minimum
 has already been reached. There are two cases where this is possible
 for a given \f$b'\f$:

-<ul>
-<li>\f$\beta_{b'}^a = \infty\f$ (line gets too tight): \f$a \le b <
-b'\f$, the minimum has to be searched between these two values;
-<li>\f$p_{b'} = -\infty\f$ (forced line break): \f$a \le b \le b'\f$
-(there may be another minimum of \f$\beta_a^b\f$ before; note the
-\f$\le\f$ instead of \f$<\f$).
-</ul>
+- \f$\beta_{b'}^a = \infty\f$ (line gets too tight):
+  \f$a \le b < b'\f$, the minimum has to be searched between these two
+  values;
+- \f$p_{b'} = -\infty\f$ (forced line break):
+  \f$a \le b \le b'\f$ (there may be another minimum of
+  \f$\beta_a^b\f$ before; note the \f$\le\f$ instead of \f$<\f$).

 This leads to a problem that the last words of a text block are not
 displayed this way, since they do not fulfill these rules for being
@@ -229,7 +218,8 @@ code more complicated. See dw::Textblock::BadnessAndPenalty for
 details.)


-<h2>Hyphens</h2>
+Hyphens
+=======

 Words (instances of dw::Textblock::Word), which are actually part of a
 hyphenated word, are always drawn as a whole, not seperately. This
@@ -258,25 +248,21 @@ etc. However, it gets a bit more complicated.
 Since all non-hyphenations are drawn as a whole, the following
 conditions can be concluded:

-<ul>
-<li>from drawing "ABCD" (not hyphenated at all): w(A) + w(B) + w(C) +
-w(D) = l(ABCD);
-<li>from drawing "BCD", when hyphenated as "A-BCD" ("A-" is not
-considered here): w(B) + w(C) + w(D) = l(BCD);
-<li>likewise, from drawing "CD" (cases "AB-CD" and "A-B-CD"): w(C) +
-w(D) = l(CD);
-<li>finally, for the cases "ABC-D", "AB-C-D", "A-BC-D", and "A-B-C-D":
-w(D) = l(D).
-</ul>
+- from drawing "ABCD" (not hyphenated at all): w(A) + w(B) + w(C) +
+  w(D) = l(ABCD);
+- from drawing "BCD", when hyphenated as "A-BCD" ("A-" is not
+  considered here): w(B) + w(C) + w(D) = l(BCD);
+- likewise, from drawing "CD" (cases "AB-CD" and "A-B-CD"): w(C) +
+  w(D) = l(CD);
+- finally, for the cases "ABC-D", "AB-C-D", "A-BC-D", and "A-B-C-D":
+  w(D) = l(D).

 So, the calculation is simple:

-<ul>
-<li>w(D) = l(D)
-<li>w(C) = l(CD) - w(D)
-<li>w(B) = l(BCD) - (w(C) + w(D))
-<li>w(A) = l(ABCD) - (w(B) + w(C) + w(D))
-</ul>
+- w(D) = l(D)
+- w(C) = l(CD) - w(D)
+- w(B) = l(BCD) - (w(C) + w(D))
+- w(A) = l(ABCD) - (w(B) + w(C) + w(D))

 For calculation the hyphen widths, the exact conditions would be
 over-determined, even when the possibility for individual hyphen
@@ -285,7 +271,8 @@ be used. However, a simple approach of fixed hyphen widths will have
 near-perfect results, so this is kept simple.


-<h2>Automatic Hyphenation</h2>
+Automatic Hyphenation
+=====================

 When soft hyphens are used, words are immediately divided into
 different parts, and so different instances of
@@ -293,14 +280,12 @@ dw::Textblock::Word. Automatic hyphenation (using Liang's algorithm)
 is, however, not applied always, but only when possibly needed, after
 calculating a line without hyphenation:

-<ul>
-<li>When the line is tight, the last word of the line is hyphenated;
-possibly this will result in a line with less parts of this word, and
-so a less tight line.
-<li>When the line is loose, and there is another word (for the
-next line) available, this word is hyphenated; possibly, some parts of
-this word are taken into this line, making it less loose.
-</ul>
+- When the line is tight, the last word of the line is hyphenated;
+  possibly this will result in a line with less parts of this word,
+  and so a less tight line.
+- When the line is loose, and there is another word (for the next
+  line) available, this word is hyphenated; possibly, some parts of
+  this word are taken into this line, making it less loose.

 After this, the line is re-calculated.

@@ -323,36 +308,38 @@ a trick (a second array) to deal with exactly this problem. See there
 for more details.


-<h2>Tests</h2>
+Tests
+=====

 There are test HTML files in the <i>test</i> directory. Also, there
 is a program testing automatic hyphenation, <i>test/liang</i>, which
 can be easily extended.


-<h2>Bugs and Things Needing Improvement</h2>
+Bugs and Things Needing Improvement
+===================================

-<h3>High Priority</h3>
+High Priority
+-------------

-<b>Bugs in hyphenation:</b> There seem to be problems when breaking
+**Bugs in hyphenation:** There seem to be problems when breaking
 words containing hyphens already. Example: "Abtei-Stadt", which is
 divided into "Abtei-" and "Stadt", resulting possibly in
 "Abtei-<span></span>-[new line]Stadt". See also below under "Medium
 Priority", on how to deal with hyphens and dashes.

-<h3>Medium Priority</h3>
+Medium Priority
+---------------

-<b>Break hyphens and dashes:</b> The following rules seem to be relevant:
+**Break hyphens and dashes:** The following rules seem to be relevant:

-<ol>
-<li>In English, an em-dash is used with no spaces around. Breaking
-before and after the dash should be possible, perhaps with a penalty >
-0. (In German, an en-dash (Halbgeviert) with spaces around is used
-instead.)</li>
-<li>After a hyphen, which is part of a compound word, a break should
-be possible. As described above ("Abtei-Stadt"), this collides with
-hyphenation.</li>
-</ol>
+- In English, an em-dash is used with no spaces around. Breaking
+  before and after the dash should be possible, perhaps with a
+  penalty > 0. (In German, an en-dash (Halbgeviert) with spaces around
+  is used instead.)
+- After a hyphen, which is part of a compound word, a break should be
+  possible. As described above ("Abtei-Stadt"), this collides with
+  hyphenation.

 Where to implement? In the same dynamic, lazy way like hyphenation? As
 part of hyphenation?
@@ -364,7 +351,7 @@ and "Stadt", but "Nordrhein-Westfalen" is divided into "Nord",
 ("rhein-West") is untouched. (Sorry for the German words; if you have
 got English examples, send them me.)

-<b>Incorrect calculation of extremes:</b> The minimal width of a text
+**Incorrect calculation of extremes:** The minimal width of a text
 block (as part of the width extremes, which are mainly used for
 tables) is defined by everything between two possible breaks. A
 possible break may also be a hyphenation point; however, hyphenation
@@ -377,30 +364,28 @@ resulting possibly in a different value for the minimal width.

 Possible strategies to deal with this problem:

-<ol>
-<li>Ignore. The implications should be minimal.
-<li>Any solution will make it neccessary to hyphenate at least some
-words when calculating extremes. Since the minimal widths of all words
-are used to calculate the minimal width of the text block, the
-simplest approach will hyphenate all words. This would, of course,
-eliminate the performance gains of the current lazy approach.
-<li>The latter approach could be optimized in some ways. Examples: (i)
-If a word is already narrower than the current accumulated value for
-the minimal width, it makes no sense to hyphenate it. (ii) In other
-cases, heuristics may be used to estimate the number of syllables, the
-width of the widest of them etc.
-</ol>
-
-<h3>Low Priority</h3>
-
-<b>Mark the end of a paragraph:</b> Should dw::core::Content::BREAK
-still be used? Currently, this is redundant to
+- Ignore. The implications should be minimal.
+- Any solution will make it neccessary to hyphenate at least some
+  words when calculating extremes. Since the minimal widths of all
+  words are used to calculate the minimal width of the text block, the
+  simplest approach will hyphenate all words. This would, of course,
+  eliminate the performance gains of the current lazy approach.
+- The latter approach could be optimized in some ways. Examples: (i)
+  If a word is already narrower than the current accumulated value for
+  the minimal width, it makes no sense to hyphenate it. (ii) In other
+  cases, heuristics may be used to estimate the number of syllables,
+  the width of the widest of them etc.
+
+Low Priority
+------------
+
+**Mark the end of a paragraph:** Should dw::core::Content::BREAK still
+be used? Currently, this is redundant to
 dw::Textblock::BadnessAndPenalty.

-<b>Other than justified text:</b> The calculation of badness is
-designed for justified text. For other alignments, it may be
-modified. The point is the definition of stretchability and for the
-line.
+**Other than justified text:** The calculation of badness is designed
+for justified text. For other alignments, it may be modified. The
+point is the definition of stretchability and for the line.

 Consider left-aligned text. Most importantly, not the spaces between
 the words, but the space on the right border is adjusted. If the
@@ -425,8 +410,8 @@ lines will, when spaces are shrunken, get too long!)
 Analogous considerations must be made for right-aligned and centered
 text. (For centered texts, there are two adjustable spaces.)

-<b>Hyphens in adjacent lines:</b> It should be simple to assign a
-larger penalty for hyphens, when the line before is already
-hyphenated. This way, hyphens in adjacent lines are penalized further.
+**Hyphens in adjacent lines:** It should be simple to assign a larger
+penalty for hyphens, when the line before is already hyphenated. This
+way, hyphens in adjacent lines are penalized further.

 */
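Reader's note on the "Criteria for Line-Breaking" section of the patched file: the rule it describes (pick the break b that minimizes badness plus penalty, and stop searching once the line gets too tight or a forced break is reached) might be sketched in C++ roughly as follows. This is a minimal sketch and not dillo's code. `WordMetrics`, `badness` and `chooseBreak` are hypothetical names. The hunks do not show the formulas for W, Y and Z, so the sketch assumes the usual sums: word widths plus the spaces between them plus the hyphen width at the break, and the stretch and shrink values of those spaces. A forced break is simply taken as soon as it is reached, which is a simplification. dillo's own handling of infinities lives in dw::Textblock::BadnessAndPenalty and is not reproduced here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

const double kInf = std::numeric_limits<double>::infinity();

// Hypothetical per-word record mirroring the values w_i, s_i, p_i, h_i and
// the stretch/shrink amounts from the text; this is NOT the layout of
// dw::Textblock::Word.
struct WordMetrics {
    double width;        // w_i
    double space;        // s_i, space after the word
    double stretch;      // how far that space can be stretched
    double shrink;       // how far that space can be shrunken
    double penalty;      // p_i, may be +/- infinity
    double hyphenWidth;  // h_i, 0 unless the word ends at a hyphenation point
};

// Badness of a line with natural width W, total stretchability Y and total
// shrinkability Z, set in a line of width l: |r|^3, or infinity when the
// adjustment ratio r is undefined or below -1.
double badness(double W, double Y, double Z, double l) {
    if (W == l) return 0.0;                 // ideal case, r = 0
    double flex = (W < l) ? Y : Z;          // stretch if loose, shrink if tight
    if (flex <= 0.0) return kInf;           // r is undefined
    double r = (l - W) / flex;
    if (r < -1.0) return kInf;              // cannot be shrunken enough
    return std::pow(std::fabs(r), 3.0);
}

// Returns the index b >= a after which the line starting at word a should
// be broken, or -1 if no candidate with finite cost was seen.
int chooseBreak(const std::vector<WordMetrics>& words, std::size_t a, double l) {
    assert(a < words.size());
    double natural = 0.0, Y = 0.0, Z = 0.0; // sums over words a .. b-1
    double bestCost = kInf;
    int best = -1;
    for (std::size_t b = a; b < words.size(); ++b) {
        const WordMetrics& w = words[b];
        double W = natural + w.width + w.hyphenWidth;
        if (W - Z > l) break;               // too tight: the minimum lies before b
        if (w.penalty == -kInf)             // forced break (assumed to always win)
            return static_cast<int>(b);
        double cost = badness(W, Y, Z, l) + w.penalty;
        if (cost < bestCost) {
            bestCost = cost;
            best = static_cast<int>(b);
        }
        natural += w.width + w.space;       // the space after word b is now inside the line
        Y += w.stretch;
        Z += w.shrink;
    }
    return best;
}
```

With five words of width 10, space 5 and stretchability 2.5, and a line width of 40, this sketch picks the break after the third word (index 2), where the natural width equals the line width.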