diff options
Diffstat (limited to 'doc/dw-line-breaking.doc')
-rw-r--r-- | doc/dw-line-breaking.doc | 470 |
1 files changed, 0 insertions, 470 deletions
diff --git a/doc/dw-line-breaking.doc b/doc/dw-line-breaking.doc deleted file mode 100644 index 14ab97c4..00000000 --- a/doc/dw-line-breaking.doc +++ /dev/null @@ -1,470 +0,0 @@ -/** \page dw-line-breaking Changes in Line-Breaking and Hyphenation - -<div style="border: 2px solid #ffff00; margin-bottom: 0.5em; -padding: 0.5em 1em; background-color: #ffffe0"><b>Info:</b> -Should be incorporated into dw::Textblock.</div> - -Introduction -============ - -For the implementation of hyphenation in dillo, not only a -hyphenation algorithm was implemented, but also, the line breaking was -changed to a simple optimization per line. Aside from the improvement -by this change per se, an important aspect is the introduction of -"penalties". Before this change, dillo put all words into a line which -fitted into it; now, a "badness" is calculated for a possible -breakpoint, and the best breakpoint, i. e. the breakpoint with the -smallest value for "badness", is chosen. This can be simply refined -to define "good" and "bad" breakpoints by assigning a "penalty"; the -best breakpoint is then the one with the smallest value of "badness + -penalty". Details can be found below. - -Example: Normal spaces have a penalty of 0, while hyphenation points -get a penalty of, say, 1, since hyphenation is generally considered as -a bit "ugly" and should rather be avoided. Consider a situation where -the word "dillo" could be hyphenated, with the following badnesses: - -- before "dillo": 0.6; -- between "dil-" and "lo": 0.2; -- after "dillo": 0.5. - -Since the penalty is added, the last value is the best one, so "dillo" -is put at the end of the line, without hyphenation. - -Under other circumstances (e. g. narrower lines), the values -might be different: - -- before "dillo": infinite; -- between "dil-" and "lo": 0.3; -- after "dillo": 1.5. - -In this case, even the addition of the penalty makes hyphenation the -best choice. - - -Literature -========== - -Breaking Paragraphs Into Lines ------------------------------- - -Although dillo does not (yet?) implement the algorithm T<sub>E</sub>X -uses for line breaking, this document shares much of the notation used -by the article *Breaking Paragraphs Into Lines* by Donald E. Knuth and -Michael F. Plass; originally published in: Software -- Practice and -Experience **11** (1981), 1119-1184; reprinted in: *Digital -Typography* by Donalt E. Knuth, CSLI Publications 1999. Anyway an -interesting reading. - -Hyphenation ------------ - -Dillo uses the algorithm by Frank Liang, which is described in his -doctoral dissertation found at http://www.tug.org/docs/liang/. There -is also a description in chapter H ("Hyphenation") of *The -T<sub>E</sub>Xbook* by Donald E. Knuth, Addison-Wesley 1984. - -Pattern files can be found at -http://www.ctan.org/tex-archive/language/hyphenation. - - -Overview of Changes -=================== - -Starting with this change, dw/textblock.cc has been split up; anything -related to line breaking has been moved into -dw/textblock_linebreaking.cc. This will also be done for other aspects -like floats. (Better, however, would be a clean logical split.) - -An important change relates to the way that lines are added: before, -dillo would add a line as soon as a new word for this line was -added. Now, a line is added not before the *last* word of this line is -known. This has two important implications: - -- Some values in dw::Textblock::Line, which represented values - accumulated within the line, could be removed, since now, these - values can be calculated simply in a loop. -- On the other hand, this means that some words may not belong to any - line. For this reason, in some cases (e. g. in - dw::Textblock::sizeRequestImpl) dw::Textblock::showMissingLines is - called, which creates temporary lines, which must, under other - circumstances, be removed again by - dw::Textblock::removeTemporaryLines, since they have been created - based on limited information, and so possibly in a wrong way. (See - below for details.) - -When a word can be hyphenated, an instance of dw::Textblock::Word is -used for each part. Notice that soft hyphens are evaluated -immediately, but automatic hyphenation is done in a lazy way (details -below), so the number of instances may change. There are some new -attributes: only when dw::Textblock::Word::canBeHyphenated is set to -*true*, automatic hyphenation is allowed; it is set to false when soft -hyphens are used for a word, and (of course) by the automatic -hyphenation itself. Furthermore, dw::Textblock::Word::hyphenWidth -(more details in the comment there) has to be included when -calculating line widths. - -Some values should be configurable: dw::Textblock::HYPHEN_BREAK, the -penalty for hyphens. Also dw::Textblock::Word::stretchability, -dw::Textblock::Word::shrinkability, which are both set in -dw::Textblock::addSpace. - - -Criteria for Line-Breaking -========================== - -Before these changes to line breaking, a word (represented by -dw::Textblock::Word) had the following attributes related to -line-breaking: - -- the width of the word itself, represented by - dw::Textblock::Word::size; -- the width of the space following the word, represented by - dw::Textblock::Word::origSpace. - -In a more mathematical notation, the \f$i\f$th word has a width -\f$w_i\f$ and a space \f$s_i\f$. - -A break was possible, when there was a space between the two words, -and the first possible break was chosen. - -With hyphenation, the criteria are refined. Hyphenation should only be -used when otherwise line breaking results in very large spaces. We -define: - -- the badness \f$\beta\f$ of a line, which is greater the more the - spaces between the words differ from the ideal space; -- a penalty \f$p\f$ for any possible break point. - -The goal is to find those break points, where \f$\beta + p\f$ is -minimal. - -Examples for the penalty \f$p\f$: - -- 0 for normal line breaks (between words); -- \f$\infty\f$ to prevent a line break at all costs; -- \f$-\infty\f$ to force a line -- a positive, but finite, value for hyphenation points. - -So we need the following values: - -- \f$w_i\f$ (the width of the word \f$i\f$ itself); -- \f$s_i\f$ (the width of the space following the word \f$i\f$); -- the stretchability \f$y_i\f$, a value denoting how much the space - after word\f$i\f$ can be stretched (typically \f${1\over 2} s_i\f$ - for justified text; otherwise 0, since the spaces are not - stretched); -- the shrinkability \f$y_i\f$, a value denoting how much the space - after word\f$i\f$ can be shrunken (typically \f${1\over 3} s_i\f$ - for justified text; otherwise 0, since the spaces are not shrinked); -- the penalty \f$p_i\f$, if the line is broken after word \f$i\f$; -- a width \f$h_i\f$, which is added, when the line is broken after - word \f$i\f$. - -\f$h_i\f$ is the width of the hyphen, if the word \f$i\f$ is a part of -the hyphenated word (except the last part); otherwise 0. - -Let \f$l\f$ be the (ideal) width (length) of the line, which is -e. at the top given by the browser window width. Furthermore, all words -from \f$a\f$ to \f$b\f$ are added to the line. \f$a\f$ is fixed: we do -not modify the previous lines anymore; but our task is to find a -suitable \f$b\f$. - -We define: - -\f[W_a^b = \sum_{i=a}^{b} w_i + \sum_{i=a}^{b-1} s_i + h_b\f] - -\f[Y_a^b = {Y_0}_a^b + \sum_{i=a}^{b-1} y_i\f] - -\f[Z_a^b = {Z_0}_a^b + \sum_{i=a}^{b-1} z_i\f] - - -\f$W_a^b\f$ is the total width, \f$Y_a^b\f$ the total stretchability, -and \f$Z_a^b\f$ the total shrinkability. \f${Y_0}_a^b\f$ and -\f${Z_0}_a^b\f$ are the stretchability and shrinkability defined per -line, and applied at the borders; they are 0 for justified text, but -\f${Y_0}_a^b\f$ has a positive value otherwise, see below for details. - -Furthermore the *adjustment ratio* \f$r_a^b\f$: - -- in the ideal case that \f$W_a^b = l\f$: \f$r_a^b = 0\f$; -- if \f$W_a^b < l\f$: \f$r_a^b = (l - W_a^b) / Y_a^b\f$ - (\f$r_a^b < 0\f$ in this case); -- if \f$W_a^b > l\f$: \f$r_a^b = (l - W_a^b) / Z_a^b\f$ - (\f$r_a^b < 0\f$ in this case). - -The badness \f$\beta_a^b\f$ is defined as follows: - -- if \f$r_a^b\f$ is undefined or \f$r_a^b < -1\f$: \f$\beta_a^b = \infty\f$; -- otherwise: \f$\beta_a^b = |r_a^b|^3\f$ - -The goal is to find the value of \f$b\f$ where \f$\beta_a^b + p_b\f$ -is minimal. (\f$a\f$ is given, since we do not modify the previous -lines.) - -After a couple of words, it is not predictable whether this minimum -has already been reached. There are two cases where this is possible -for a given \f$b'\f$: - -- \f$\beta_{b'}^a = \infty\f$ (line gets too tight): - \f$a \le b < b'\f$, the minimum has to be searched between these two - values; -- \f$p_{b'} = -\infty\f$ (forced line break): - \f$a \le b \le b'\f$ (there may be another minimum of - \f$\beta_a^b\f$ before; note the \f$\le\f$ instead of \f$<\f$). - -This leads to a problem that the last words of a text block are not -displayed this way, since they do not fulfill these rules for being -added to a line. For this reason, there are "temporary" lines already -described above. - -(Note that the actual calculation differs from this description, since -integer arithmetic is used for performance, which make the actual -code more complicated. See dw::Textblock::BadnessAndPenalty for -details.) - -Ragged Borders --------------- - -For other than justified text (left-, right-aligned and centered), the -spaces between the words are not shrinked or stretched (so \f$y_i\f$ -and \f$z_i\f$ are 0), but additional space is added to the left or -right border or to both. For this reason, an additional stretchability -\f${Y_0}_a^b\f$ is added (see definition above). Since this space at -the border is 0 in an ideal case (\f$W_a^b = l\f$), it cannot be -shrunken, so \f${Z_0}_a^b\f$ is 0. - -This is not equivalent to the calculation of the total stretchability -as done for justified text, since in this case, the stretchability -depends on the number of words: consider the typical case that all -spaces and stretchabilities are equal (\f$y_a = y_{a + 1} = \ldots = -y_b\f$). With \f$n\f$ words, the total strechability would be \f$n -\cdot y_a\f$, so increase with an increasing number of words -(\f$y_a\f$ is constant). This is correct for justified text, but for -other alignments, where only one space (or two, for centered text) is -changed, this would mean that a line with many narrow words is more -stretchable than a line with few wide words. - -It is obvious that left-aligned text can be handled in the same way as -right-aligned text. [... Centered text? ...] - -The default value for the stretchability is the line height without -the space between the lines (more precisely: the maximum of all word -heights). The exact value not so important when comparing different -possible values for the badness \f$\beta_a^b\f$, when \f${Y_0}_a^b\f$ -is nearly constant for different \f$b\f$ (which is the case for the -actual value), but it is important for the comparison with penalties, -which are constant. To be considered is also that for non-justified -text, hyphenation is differently (less) desirable; this effect can be -achieved by enlarging the stretchability, which will lead to a smaller -badness, and so make hyphenation less likely. The user can configure -the stretchability by changing the preference value -*stretchability_factor* (default: 1.0). - -(Comparison to T<sub>E</sub>X: Knuth and Plass describe a method for -ragged borders, which is effectively the same as described here (Knuth -1999, pp. 93--94). The value for the stretchability of the line -is slightly less, 1 em (ibid., see also p. 72 for the -definition of the units). However, this article suggests a value for -the hyphenation penalty, which is ten times larger than the value for -justified text; this would suggest a larger value for -*stretchability_factor*.) - - -Hyphens -======= - -Words (instances of dw::Textblock::Word), which are actually part of a -hyphenated word, are always drawn as a whole, not seperately. This -way, the underlying platform is able to apply kerning, ligatures, etc. - -Calculating the width of such words causes some problems, since it is -not required that the width of text "AB" is identical to the width of -"A" plus the width of "B", just for the reasons mentioned above. It -gets even a bit more complicated, since it is required that a word -part (instance of dw::Textblock::Word) has always the same length, -independent of whether hyphenation is applied or not. Furthermore, the -hyphen length is fixed for a word; for practical reasons, it is always -the width of a hyphen, in the given font. - -For calculating the widths, consider a word of four syllables: -A-B-C-D. There are 3 hyphenation points, and so 2<sup>3</sup> = 8 -possible ways of hyphenation: ABCD, ABC-D, AB-CD, AB-C-D, A-BCD, -A-BC-D, A-B-CD, A-B-C-D. (Some of them, like the last one, are only -probable for very narrow lines.) - -Let w(A), w(B), w(C), w(D) be the word widths (part of -dw::Textblock::Word::size), which have to be calculated, and l be a -shorthand for dw::core::Platform::textWidth. Without considering -this problem, the calculation would be simple: w(A) = l(A) -etc. However, it gets a bit more complicated. Since all -non-hyphenations are drawn as a whole, the following conditions can be -concluded: - -- from drawing "ABCD" (not hyphenated at all): w(A) + w(B) + w(C) + - w(D) = l(ABCD); -- from drawing "BCD", when hyphenated as "A-BCD" ("A-" is not - considered here): w(B) + w(C) + w(D) = l(BCD); -- likewise, from drawing "CD" (cases "AB-CD" and "A-B-CD"): w(C) + - w(D) = l(CD); -- finally, for the cases "ABC-D", "AB-C-D", "A-BC-D", and "A-B-C-D": - w(D) = l(D). - -So, the calculation is simple: - -- w(D) = l(D) -- w(C) = l(CD) - w(D) -- w(B) = l(BCD) - (w(C) + w(D)) -- w(A) = l(ABCD) - (w(B) + w(C) + w(D)) - -For calculation the hyphen widths, the exact conditions would be -over-determined, even when the possibility for individual hyphen -widths (instead of simply the text width of a hyphen character) would -be used. However, a simple approach of fixed hyphen widths will have -near-perfect results, so this is kept simple. - - -Automatic Hyphenation -===================== - -When soft hyphens are used, words are immediately divided into -different parts, and so different instances of -dw::Textblock::Word. Automatic hyphenation (using Liang's algorithm) -is, however, not applied always, but only when possibly needed, after -calculating a line without hyphenation: - -- When the line is tight, the last word of the line is hyphenated; - possibly this will result in a line with less parts of this word, - and so a less tight line. -- When the line is loose, and there is another word (for the next - line) available, this word is hyphenated; possibly, some parts of - this word are taken into this line, making it less loose. - -After this, the line is re-calculated. - -A problem arrises when the textblock is rewrapped, e. g. when the -user changes the window width. In this case, some new instances of -dw::Textblock::Word must be inserted into the word list, -dw::Textblock::words. This word list is implemented as an array, which -is dynamically increased; a simple approach would involve moving all -of the <i>n</i> elements after position <i>i</i>, so -<i>n</i> - <i>i</i> steps are necessary. This would not be a -problem, since O(n) steps are necessary; however, this will be -necessary again for the next hyphenated word (at the end of a -following line), and so on, so that -(<i>n</i> - <i>i</i><sub>1</sub>) + -(<i>n</i> - <i>i</i><sub>2</sub>) + ..., with -<i>i</i><sub>1</sub> < <i>i</i><sub>2</sub> < ..., -which results in O(n<sup>2</sup>) steps. For this reason, the word -list is managed by the class lout::misc::NotSoSimpleVector, which uses -a trick (a second array) to deal with exactly this problem. See there -for more details. - - -Tests -===== - -There are test HTML files in the <i>test</i> directory. Also, there is -a program testing automatic hyphenation, <i>test/liang</i>, which can -be easily extended. - - -Bugs and Things Needing Improvement -=================================== - -High Priority -------------- - -None. - -Medium Priority ---------------- - -None. - -Low Priority ------------- - -**Mark the end of a paragraph:** Should dw::core::Content::BREAK still -be used? Currently, this is redundant to -dw::Textblock::BadnessAndPenalty. - -Solved (Must Be Documented) ---------------------------- - -These have been solved recently and should be documented above. - -*Bugs in hyphenation:* There seem to be problems when breaking words -containing hyphens already. Example: "Abtei-Stadt", which is divided -into "Abtei-" and "Stadt", resulting possibly in -"Abtei-<span></span>-[new line]Stadt". See also below under -"Medium Priority", on how to deal with hyphens and dashes. - -**Solution:** See next. - -*Break hyphens and dashes:* The following rules seem to be relevant: - -- In English, an em-dash is used with no spaces around. Breaking - before and after the dash should be possible, perhaps with a - penalty > 0. (In German, an en-dash (Halbgeviert) with spaces around - is used instead.) -- After a hyphen, which is part of a compound word, a break should be - possible. As described above ("Abtei-Stadt"), this collides with - hyphenation. - -Where to implement? In the same dynamic, lazy way like hyphenation? As -part of hyphenation? - -Notice that Liang's algorithm may behave different regarding hyphens: -"Abtei-Stadt" is (using the patterns from CTAN) divided into "Abtei-" -and "Stadt", but "Nordrhein-Westfalen" is divided into "Nord", -"rhein-West", "fa", "len": the part containing the hyphen -("rhein-West") is untouched. (Sorry for the German words; if you have -got English examples, send them me.) - -**Solution for both:** This has been implemented in -dw::Textblock::addText, in a similar way to soft hyphens. Liang's -algorithm now only operates on the parts: "Abtei" and "Stadt"; -"Nordrhein" and "Westfalen". - -*Hyphens in adjacent lines:* It should be simple to assign a larger -penalty for hyphens, when the line before is already hyphenated. This -way, hyphens in adjacent lines are penalized further. - -**Solved:** There are always two penalties. Must be documented in -detail. - -*Incorrect calculation of extremes:* The minimal width of a text block -(as part of the width extremes, which are mainly used for tables) is -defined by everything between two possible breaks. A possible break -may also be a hyphenation point; however, hyphenation points are -calculated in a lazy way, when the lines are broken, and not when -extremes are calculated. So, it is a matter of chance whether the -calculation of the minimal width will take the two parts "dil-" and -"lo" into account (when "dillo" has already been hyphenated), or only -one part, "dillo" (when "dillo" has not yet been hyphenated), -resulting possibly in a different value for the minimal width. - -Possible strategies to deal with this problem: - -- Ignore. The implications should be minimal. -- Any solution will make it neccessary to hyphenate at least some - words when calculating extremes. Since the minimal widths of all - words are used to calculate the minimal width of the text block, the - simplest approach will hyphenate all words. This would, of course, - eliminate the performance gains of the current lazy approach. -- The latter approach could be optimized in some ways. Examples: (i) - If a word is already narrower than the current accumulated value for - the minimal width, it makes no sense to hyphenate it. (ii) In other - cases, heuristics may be used to estimate the number of syllables, - the width of the widest of them etc. - -**Solved:** Hyphenated parts of a word are not considered anymore for -width extremes, but only whole words. This is also one reason for the -introduction of the paragraphs list. - -**Also:** - -- Configuration of penalties. - -*/ |