diff options
author | Sebastian Geerken <devnull@localhost> | 2015-06-03 22:34:03 +0200 |
---|---|---|
committer | Sebastian Geerken <devnull@localhost> | 2015-06-03 22:34:03 +0200 |
commit | da88ede8c401449729ae9c4b78f4c9914d81fa09 (patch) | |
tree | a66a438f5523269bf470b087d1f38410a688bb1a /devdoc/dw-line-breaking.doc | |
parent | d03e77b0aa3f1f70dcc1d1377b2262ad674ad87e (diff) | |
parent | 59b76c75b64578edac35d19c914067a0bd7791e9 (diff) |
Merge with main repo.
Diffstat (limited to 'devdoc/dw-line-breaking.doc')
-rw-r--r-- | devdoc/dw-line-breaking.doc | 470 |
1 files changed, 470 insertions, 0 deletions
diff --git a/devdoc/dw-line-breaking.doc b/devdoc/dw-line-breaking.doc new file mode 100644 index 00000000..14ab97c4 --- /dev/null +++ b/devdoc/dw-line-breaking.doc @@ -0,0 +1,470 @@ +/** \page dw-line-breaking Changes in Line-Breaking and Hyphenation + +<div style="border: 2px solid #ffff00; margin-bottom: 0.5em; +padding: 0.5em 1em; background-color: #ffffe0"><b>Info:</b> +Should be incorporated into dw::Textblock.</div> + +Introduction +============ + +For the implementation of hyphenation in dillo, not only a +hyphenation algorithm was implemented, but also, the line breaking was +changed to a simple optimization per line. Aside from the improvement +by this change per se, an important aspect is the introduction of +"penalties". Before this change, dillo put all words into a line which +fitted into it; now, a "badness" is calculated for a possible +breakpoint, and the best breakpoint, i. e. the breakpoint with the +smallest value for "badness", is chosen. This can be simply refined +to define "good" and "bad" breakpoints by assigning a "penalty"; the +best breakpoint is then the one with the smallest value of "badness + +penalty". Details can be found below. + +Example: Normal spaces have a penalty of 0, while hyphenation points +get a penalty of, say, 1, since hyphenation is generally considered as +a bit "ugly" and should rather be avoided. Consider a situation where +the word "dillo" could be hyphenated, with the following badnesses: + +- before "dillo": 0.6; +- between "dil-" and "lo": 0.2; +- after "dillo": 0.5. + +Since the penalty is added, the last value is the best one, so "dillo" +is put at the end of the line, without hyphenation. + +Under other circumstances (e. g. narrower lines), the values +might be different: + +- before "dillo": infinite; +- between "dil-" and "lo": 0.3; +- after "dillo": 1.5. + +In this case, even the addition of the penalty makes hyphenation the +best choice. + + +Literature +========== + +Breaking Paragraphs Into Lines +------------------------------ + +Although dillo does not (yet?) implement the algorithm T<sub>E</sub>X +uses for line breaking, this document shares much of the notation used +by the article *Breaking Paragraphs Into Lines* by Donald E. Knuth and +Michael F. Plass; originally published in: Software -- Practice and +Experience **11** (1981), 1119-1184; reprinted in: *Digital +Typography* by Donalt E. Knuth, CSLI Publications 1999. Anyway an +interesting reading. + +Hyphenation +----------- + +Dillo uses the algorithm by Frank Liang, which is described in his +doctoral dissertation found at http://www.tug.org/docs/liang/. There +is also a description in chapter H ("Hyphenation") of *The +T<sub>E</sub>Xbook* by Donald E. Knuth, Addison-Wesley 1984. + +Pattern files can be found at +http://www.ctan.org/tex-archive/language/hyphenation. + + +Overview of Changes +=================== + +Starting with this change, dw/textblock.cc has been split up; anything +related to line breaking has been moved into +dw/textblock_linebreaking.cc. This will also be done for other aspects +like floats. (Better, however, would be a clean logical split.) + +An important change relates to the way that lines are added: before, +dillo would add a line as soon as a new word for this line was +added. Now, a line is added not before the *last* word of this line is +known. This has two important implications: + +- Some values in dw::Textblock::Line, which represented values + accumulated within the line, could be removed, since now, these + values can be calculated simply in a loop. +- On the other hand, this means that some words may not belong to any + line. For this reason, in some cases (e. g. in + dw::Textblock::sizeRequestImpl) dw::Textblock::showMissingLines is + called, which creates temporary lines, which must, under other + circumstances, be removed again by + dw::Textblock::removeTemporaryLines, since they have been created + based on limited information, and so possibly in a wrong way. (See + below for details.) + +When a word can be hyphenated, an instance of dw::Textblock::Word is +used for each part. Notice that soft hyphens are evaluated +immediately, but automatic hyphenation is done in a lazy way (details +below), so the number of instances may change. There are some new +attributes: only when dw::Textblock::Word::canBeHyphenated is set to +*true*, automatic hyphenation is allowed; it is set to false when soft +hyphens are used for a word, and (of course) by the automatic +hyphenation itself. Furthermore, dw::Textblock::Word::hyphenWidth +(more details in the comment there) has to be included when +calculating line widths. + +Some values should be configurable: dw::Textblock::HYPHEN_BREAK, the +penalty for hyphens. Also dw::Textblock::Word::stretchability, +dw::Textblock::Word::shrinkability, which are both set in +dw::Textblock::addSpace. + + +Criteria for Line-Breaking +========================== + +Before these changes to line breaking, a word (represented by +dw::Textblock::Word) had the following attributes related to +line-breaking: + +- the width of the word itself, represented by + dw::Textblock::Word::size; +- the width of the space following the word, represented by + dw::Textblock::Word::origSpace. + +In a more mathematical notation, the \f$i\f$th word has a width +\f$w_i\f$ and a space \f$s_i\f$. + +A break was possible, when there was a space between the two words, +and the first possible break was chosen. + +With hyphenation, the criteria are refined. Hyphenation should only be +used when otherwise line breaking results in very large spaces. We +define: + +- the badness \f$\beta\f$ of a line, which is greater the more the + spaces between the words differ from the ideal space; +- a penalty \f$p\f$ for any possible break point. + +The goal is to find those break points, where \f$\beta + p\f$ is +minimal. + +Examples for the penalty \f$p\f$: + +- 0 for normal line breaks (between words); +- \f$\infty\f$ to prevent a line break at all costs; +- \f$-\infty\f$ to force a line +- a positive, but finite, value for hyphenation points. + +So we need the following values: + +- \f$w_i\f$ (the width of the word \f$i\f$ itself); +- \f$s_i\f$ (the width of the space following the word \f$i\f$); +- the stretchability \f$y_i\f$, a value denoting how much the space + after word\f$i\f$ can be stretched (typically \f${1\over 2} s_i\f$ + for justified text; otherwise 0, since the spaces are not + stretched); +- the shrinkability \f$y_i\f$, a value denoting how much the space + after word\f$i\f$ can be shrunken (typically \f${1\over 3} s_i\f$ + for justified text; otherwise 0, since the spaces are not shrinked); +- the penalty \f$p_i\f$, if the line is broken after word \f$i\f$; +- a width \f$h_i\f$, which is added, when the line is broken after + word \f$i\f$. + +\f$h_i\f$ is the width of the hyphen, if the word \f$i\f$ is a part of +the hyphenated word (except the last part); otherwise 0. + +Let \f$l\f$ be the (ideal) width (length) of the line, which is +e. at the top given by the browser window width. Furthermore, all words +from \f$a\f$ to \f$b\f$ are added to the line. \f$a\f$ is fixed: we do +not modify the previous lines anymore; but our task is to find a +suitable \f$b\f$. + +We define: + +\f[W_a^b = \sum_{i=a}^{b} w_i + \sum_{i=a}^{b-1} s_i + h_b\f] + +\f[Y_a^b = {Y_0}_a^b + \sum_{i=a}^{b-1} y_i\f] + +\f[Z_a^b = {Z_0}_a^b + \sum_{i=a}^{b-1} z_i\f] + + +\f$W_a^b\f$ is the total width, \f$Y_a^b\f$ the total stretchability, +and \f$Z_a^b\f$ the total shrinkability. \f${Y_0}_a^b\f$ and +\f${Z_0}_a^b\f$ are the stretchability and shrinkability defined per +line, and applied at the borders; they are 0 for justified text, but +\f${Y_0}_a^b\f$ has a positive value otherwise, see below for details. + +Furthermore the *adjustment ratio* \f$r_a^b\f$: + +- in the ideal case that \f$W_a^b = l\f$: \f$r_a^b = 0\f$; +- if \f$W_a^b < l\f$: \f$r_a^b = (l - W_a^b) / Y_a^b\f$ + (\f$r_a^b < 0\f$ in this case); +- if \f$W_a^b > l\f$: \f$r_a^b = (l - W_a^b) / Z_a^b\f$ + (\f$r_a^b < 0\f$ in this case). + +The badness \f$\beta_a^b\f$ is defined as follows: + +- if \f$r_a^b\f$ is undefined or \f$r_a^b < -1\f$: \f$\beta_a^b = \infty\f$; +- otherwise: \f$\beta_a^b = |r_a^b|^3\f$ + +The goal is to find the value of \f$b\f$ where \f$\beta_a^b + p_b\f$ +is minimal. (\f$a\f$ is given, since we do not modify the previous +lines.) + +After a couple of words, it is not predictable whether this minimum +has already been reached. There are two cases where this is possible +for a given \f$b'\f$: + +- \f$\beta_{b'}^a = \infty\f$ (line gets too tight): + \f$a \le b < b'\f$, the minimum has to be searched between these two + values; +- \f$p_{b'} = -\infty\f$ (forced line break): + \f$a \le b \le b'\f$ (there may be another minimum of + \f$\beta_a^b\f$ before; note the \f$\le\f$ instead of \f$<\f$). + +This leads to a problem that the last words of a text block are not +displayed this way, since they do not fulfill these rules for being +added to a line. For this reason, there are "temporary" lines already +described above. + +(Note that the actual calculation differs from this description, since +integer arithmetic is used for performance, which make the actual +code more complicated. See dw::Textblock::BadnessAndPenalty for +details.) + +Ragged Borders +-------------- + +For other than justified text (left-, right-aligned and centered), the +spaces between the words are not shrinked or stretched (so \f$y_i\f$ +and \f$z_i\f$ are 0), but additional space is added to the left or +right border or to both. For this reason, an additional stretchability +\f${Y_0}_a^b\f$ is added (see definition above). Since this space at +the border is 0 in an ideal case (\f$W_a^b = l\f$), it cannot be +shrunken, so \f${Z_0}_a^b\f$ is 0. + +This is not equivalent to the calculation of the total stretchability +as done for justified text, since in this case, the stretchability +depends on the number of words: consider the typical case that all +spaces and stretchabilities are equal (\f$y_a = y_{a + 1} = \ldots = +y_b\f$). With \f$n\f$ words, the total strechability would be \f$n +\cdot y_a\f$, so increase with an increasing number of words +(\f$y_a\f$ is constant). This is correct for justified text, but for +other alignments, where only one space (or two, for centered text) is +changed, this would mean that a line with many narrow words is more +stretchable than a line with few wide words. + +It is obvious that left-aligned text can be handled in the same way as +right-aligned text. [... Centered text? ...] + +The default value for the stretchability is the line height without +the space between the lines (more precisely: the maximum of all word +heights). The exact value not so important when comparing different +possible values for the badness \f$\beta_a^b\f$, when \f${Y_0}_a^b\f$ +is nearly constant for different \f$b\f$ (which is the case for the +actual value), but it is important for the comparison with penalties, +which are constant. To be considered is also that for non-justified +text, hyphenation is differently (less) desirable; this effect can be +achieved by enlarging the stretchability, which will lead to a smaller +badness, and so make hyphenation less likely. The user can configure +the stretchability by changing the preference value +*stretchability_factor* (default: 1.0). + +(Comparison to T<sub>E</sub>X: Knuth and Plass describe a method for +ragged borders, which is effectively the same as described here (Knuth +1999, pp. 93--94). The value for the stretchability of the line +is slightly less, 1 em (ibid., see also p. 72 for the +definition of the units). However, this article suggests a value for +the hyphenation penalty, which is ten times larger than the value for +justified text; this would suggest a larger value for +*stretchability_factor*.) + + +Hyphens +======= + +Words (instances of dw::Textblock::Word), which are actually part of a +hyphenated word, are always drawn as a whole, not seperately. This +way, the underlying platform is able to apply kerning, ligatures, etc. + +Calculating the width of such words causes some problems, since it is +not required that the width of text "AB" is identical to the width of +"A" plus the width of "B", just for the reasons mentioned above. It +gets even a bit more complicated, since it is required that a word +part (instance of dw::Textblock::Word) has always the same length, +independent of whether hyphenation is applied or not. Furthermore, the +hyphen length is fixed for a word; for practical reasons, it is always +the width of a hyphen, in the given font. + +For calculating the widths, consider a word of four syllables: +A-B-C-D. There are 3 hyphenation points, and so 2<sup>3</sup> = 8 +possible ways of hyphenation: ABCD, ABC-D, AB-CD, AB-C-D, A-BCD, +A-BC-D, A-B-CD, A-B-C-D. (Some of them, like the last one, are only +probable for very narrow lines.) + +Let w(A), w(B), w(C), w(D) be the word widths (part of +dw::Textblock::Word::size), which have to be calculated, and l be a +shorthand for dw::core::Platform::textWidth. Without considering +this problem, the calculation would be simple: w(A) = l(A) +etc. However, it gets a bit more complicated. Since all +non-hyphenations are drawn as a whole, the following conditions can be +concluded: + +- from drawing "ABCD" (not hyphenated at all): w(A) + w(B) + w(C) + + w(D) = l(ABCD); +- from drawing "BCD", when hyphenated as "A-BCD" ("A-" is not + considered here): w(B) + w(C) + w(D) = l(BCD); +- likewise, from drawing "CD" (cases "AB-CD" and "A-B-CD"): w(C) + + w(D) = l(CD); +- finally, for the cases "ABC-D", "AB-C-D", "A-BC-D", and "A-B-C-D": + w(D) = l(D). + +So, the calculation is simple: + +- w(D) = l(D) +- w(C) = l(CD) - w(D) +- w(B) = l(BCD) - (w(C) + w(D)) +- w(A) = l(ABCD) - (w(B) + w(C) + w(D)) + +For calculation the hyphen widths, the exact conditions would be +over-determined, even when the possibility for individual hyphen +widths (instead of simply the text width of a hyphen character) would +be used. However, a simple approach of fixed hyphen widths will have +near-perfect results, so this is kept simple. + + +Automatic Hyphenation +===================== + +When soft hyphens are used, words are immediately divided into +different parts, and so different instances of +dw::Textblock::Word. Automatic hyphenation (using Liang's algorithm) +is, however, not applied always, but only when possibly needed, after +calculating a line without hyphenation: + +- When the line is tight, the last word of the line is hyphenated; + possibly this will result in a line with less parts of this word, + and so a less tight line. +- When the line is loose, and there is another word (for the next + line) available, this word is hyphenated; possibly, some parts of + this word are taken into this line, making it less loose. + +After this, the line is re-calculated. + +A problem arrises when the textblock is rewrapped, e. g. when the +user changes the window width. In this case, some new instances of +dw::Textblock::Word must be inserted into the word list, +dw::Textblock::words. This word list is implemented as an array, which +is dynamically increased; a simple approach would involve moving all +of the <i>n</i> elements after position <i>i</i>, so +<i>n</i> - <i>i</i> steps are necessary. This would not be a +problem, since O(n) steps are necessary; however, this will be +necessary again for the next hyphenated word (at the end of a +following line), and so on, so that +(<i>n</i> - <i>i</i><sub>1</sub>) + +(<i>n</i> - <i>i</i><sub>2</sub>) + ..., with +<i>i</i><sub>1</sub> < <i>i</i><sub>2</sub> < ..., +which results in O(n<sup>2</sup>) steps. For this reason, the word +list is managed by the class lout::misc::NotSoSimpleVector, which uses +a trick (a second array) to deal with exactly this problem. See there +for more details. + + +Tests +===== + +There are test HTML files in the <i>test</i> directory. Also, there is +a program testing automatic hyphenation, <i>test/liang</i>, which can +be easily extended. + + +Bugs and Things Needing Improvement +=================================== + +High Priority +------------- + +None. + +Medium Priority +--------------- + +None. + +Low Priority +------------ + +**Mark the end of a paragraph:** Should dw::core::Content::BREAK still +be used? Currently, this is redundant to +dw::Textblock::BadnessAndPenalty. + +Solved (Must Be Documented) +--------------------------- + +These have been solved recently and should be documented above. + +*Bugs in hyphenation:* There seem to be problems when breaking words +containing hyphens already. Example: "Abtei-Stadt", which is divided +into "Abtei-" and "Stadt", resulting possibly in +"Abtei-<span></span>-[new line]Stadt". See also below under +"Medium Priority", on how to deal with hyphens and dashes. + +**Solution:** See next. + +*Break hyphens and dashes:* The following rules seem to be relevant: + +- In English, an em-dash is used with no spaces around. Breaking + before and after the dash should be possible, perhaps with a + penalty > 0. (In German, an en-dash (Halbgeviert) with spaces around + is used instead.) +- After a hyphen, which is part of a compound word, a break should be + possible. As described above ("Abtei-Stadt"), this collides with + hyphenation. + +Where to implement? In the same dynamic, lazy way like hyphenation? As +part of hyphenation? + +Notice that Liang's algorithm may behave different regarding hyphens: +"Abtei-Stadt" is (using the patterns from CTAN) divided into "Abtei-" +and "Stadt", but "Nordrhein-Westfalen" is divided into "Nord", +"rhein-West", "fa", "len": the part containing the hyphen +("rhein-West") is untouched. (Sorry for the German words; if you have +got English examples, send them me.) + +**Solution for both:** This has been implemented in +dw::Textblock::addText, in a similar way to soft hyphens. Liang's +algorithm now only operates on the parts: "Abtei" and "Stadt"; +"Nordrhein" and "Westfalen". + +*Hyphens in adjacent lines:* It should be simple to assign a larger +penalty for hyphens, when the line before is already hyphenated. This +way, hyphens in adjacent lines are penalized further. + +**Solved:** There are always two penalties. Must be documented in +detail. + +*Incorrect calculation of extremes:* The minimal width of a text block +(as part of the width extremes, which are mainly used for tables) is +defined by everything between two possible breaks. A possible break +may also be a hyphenation point; however, hyphenation points are +calculated in a lazy way, when the lines are broken, and not when +extremes are calculated. So, it is a matter of chance whether the +calculation of the minimal width will take the two parts "dil-" and +"lo" into account (when "dillo" has already been hyphenated), or only +one part, "dillo" (when "dillo" has not yet been hyphenated), +resulting possibly in a different value for the minimal width. + +Possible strategies to deal with this problem: + +- Ignore. The implications should be minimal. +- Any solution will make it neccessary to hyphenate at least some + words when calculating extremes. Since the minimal widths of all + words are used to calculate the minimal width of the text block, the + simplest approach will hyphenate all words. This would, of course, + eliminate the performance gains of the current lazy approach. +- The latter approach could be optimized in some ways. Examples: (i) + If a word is already narrower than the current accumulated value for + the minimal width, it makes no sense to hyphenate it. (ii) In other + cases, heuristics may be used to estimate the number of syllables, + the width of the widest of them etc. + +**Solved:** Hyphenated parts of a word are not considered anymore for +width extremes, but only whole words. This is also one reason for the +introduction of the paragraphs list. + +**Also:** + +- Configuration of penalties. + +*/ |