Diffstat (limited to 'doc')
-rw-r--r-- doc/dw-line-breaking.doc 297
1 files changed, 141 insertions, 156 deletions
diff --git a/doc/dw-line-breaking.doc b/doc/dw-line-breaking.doc
index 6492a579..2967e98f 100644
--- a/doc/dw-line-breaking.doc
+++ b/doc/dw-line-breaking.doc
@@ -4,7 +4,8 @@
padding: 0.5em 1em; background-color: #ffffe0"><b>Info:</b>
Should be incorporated into dw::Textblock.</div>
-<h2>Introduction</h2>
+Introduction
+============
For the implementation of hyphenation in dillo, not only a
hyphenation algorithm was implemented, but also, the line breaking was
@@ -23,11 +24,9 @@ get a penalty of, say, 1, since hyphenation is generally considered as
a bit "ugly" and should rather be avoided. Consider a situation where
the word "dillo" could be hyphenated, with the following badnesses:
-<ul>
-<li>before "dillo": 0.6;
-<li>between "dil-" and "lo": 0.2;
-<li>after "dillo": 0.5.
-</ul>
+- before "dillo": 0.6;
+- between "dil-" and "lo": 0.2;
+- after "dillo": 0.5.
Since the penalty is added, the last value is the best one, so "dillo"
is put at the end of the line, without hyphenation.
@@ -35,40 +34,42 @@ is put at the end of the line, without hyphenation.
Under other circumstances (e.&nbsp;g. narrower lines), the values
might be different:
-<ul>
-<li>before "dillo": infinite;
-<li>between "dil-" and "lo": 0.3;
-<li>after "dillo": 1.5.
-</ul>
+- before "dillo": infinite;
+- between "dil-" and "lo": 0.3;
+- after "dillo": 1.5.
In this case, even the addition of the penalty makes hyphenation the
best choice.
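Written out, the two examples amount to the following sums. This is only a worked illustration; it assumes a penalty of 1 at the hyphenation point and a penalty of 0 for breaks between words:

\f[
\min(0.6 + 0,\; 0.2 + 1,\; 0.5 + 0) = \min(0.6,\; 1.2,\; 0.5) = 0.5
\f]

\f[
\min(\infty + 0,\; 0.3 + 1,\; 1.5 + 0) = \min(\infty,\; 1.3,\; 1.5) = 1.3
\f]

In the first case the minimum belongs to the break after "dillo"; in the second case it belongs to the hyphenation point.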
-<h2>Literature</h2>
+Literature
+==========
-<h3>Breaking Paragraphs Into Lines</h3>
+Breaking Paragraphs Into Lines
+------------------------------
Although dillo does not (yet?) implement the algorithm T<sub>E</sub>X
uses for line breaking, this document shares much of the notation used
-by the article <i>Breaking Paragraphs Into Lines</i> by Donald
-E. Knuth and Michael F. Plass; originally published in: Software --
-Practice and Experience <b>11</b> (1981), 1119-1184; reprinted in:
-<i>Digital Typography</i> by Donalt E. Knuth, CSLI Publications 1999.
-Anyway an interesting reading.
+by the article *Breaking Paragraphs Into Lines* by Donald E. Knuth and
+Michael F. Plass; originally published in: Software -- Practice and
+Experience **11** (1981), 1119-1184; reprinted in: *Digital
+Typography* by Donald E. Knuth, CSLI Publications 1999. It is an
+interesting read in any case.
-<h3>Hyphenation</h3>
+Hyphenation
+-----------
Dillo uses the algorithm by Frank Liang, which is described in his
doctoral dissertation found at http://www.tug.org/docs/liang/. There
-is also a description in chapter H ("Hyphenation") of <i>The
-T<sub>E</sub>Xbook</i> by Donald E. Knuth, Addison-Wesley 1984.
+is also a description in chapter H ("Hyphenation") of *The
+T<sub>E</sub>Xbook* by Donald E. Knuth, Addison-Wesley 1984.
Pattern files can be found at
http://www.ctan.org/tex-archive/language/hyphenation.
-<h2>Overview of Changes</h2>
+Overview of Changes
+===================
Starting with this change, dw/textblock.cc has been split up; anything
related to line breaking has been moved into
@@ -77,32 +78,31 @@ like floats. (Better, however, would be a clean logical split.)
An important change relates to the way that lines are added: before,
dillo would add a line as soon as a new word for this line was
-added. Now, a line is added not before the <i>last</i> word of this
-line is known. This has two important implications:
-
-<ul>
-<li>Some values in dw::Textblock::Line, which represented values
-accumulated within the line, could be removed, since now, these values
-can be calculated simply in a loop.
-<li>On the other hand, this means that some words may not belong to
-any line. For this reason, in some cases (e.&nbsp;g. in
-dw::Textblock::sizeRequestImpl) dw::Textblock::showMissingLines is
-called, which creates temporary lines, which must, under other
-circumstances, be removed again by dw::Textblock::removeTemporaryLines,
-since they have been created based on limited information, and so
-possibly in a wrong way. (See below for details.)
-</ul>
+added. Now, a line is not added until the *last* word of this line is
+known. This has two important implications:
+
+- Some values in dw::Textblock::Line, which represented values
+ accumulated within the line, could be removed, since now, these
+ values can be calculated simply in a loop.
+- On the other hand, this means that some words may not belong to any
+ line. For this reason, in some cases (e.&nbsp;g. in
+ dw::Textblock::sizeRequestImpl) dw::Textblock::showMissingLines is
+ called, which creates temporary lines, which must, under other
+ circumstances, be removed again by
+ dw::Textblock::removeTemporaryLines, since they have been created
+ based on limited information, and so possibly in a wrong way. (See
+ below for details.)
When a word can be hyphenated, an instance of dw::Textblock::Word is
used for each part. Notice that soft hyphens are evaluated
immediately, but automatic hyphenation is done in a lazy way (details
below), so the number of instances may change. There are some new
attributes: only when dw::Textblock::Word::canBeHyphenated is set to
-<i>true</i>, automatic hyphenation is allowed; it is set to false when
-soft hyphens are used for a word, and (of course) by the automatic
-hyphenation itself. Furthermore, dw::Textblock::Word::hyphenWidth (more
-details in the comment there) has to be included when calculating line
-widths.
+*true*, automatic hyphenation is allowed; it is set to false when soft
+hyphens are used for a word, and (of course) by the automatic
+hyphenation itself. Furthermore, dw::Textblock::Word::hyphenWidth
+(more details in the comment there) has to be included when
+calculating line widths.
Some values should be configurable: dw::Textblock::HYPHEN_BREAK, the
penalty for hyphens. Also dw::Textblock::Word::stretchability,
@@ -110,17 +110,17 @@ dw::Textblock::Word::shrinkability, which are both set in
dw::Textblock::addSpace.
-<h2>Criteria for Line-Breaking</h2>
+Criteria for Line-Breaking
+==========================
Before these changes to line breaking, a word (represented by
dw::Textblock::Word) had the following attributes related to
line-breaking:
-<ul>
-<li>the width of the word itself, represented by dw::Textblock::Word::size;
-<li>the width of the space following the word, represented by
-dw::Textblock::Word::origSpace.
-</ul>
+- the width of the word itself, represented by
+ dw::Textblock::Word::size;
+- the width of the space following the word, represented by
+ dw::Textblock::Word::origSpace.
In a more mathematical notation, the \f$i\f$th word has a width
\f$w_i\f$ and a space \f$s_i\f$.
@@ -132,37 +132,31 @@ With hyphenation, the criteria are refined. Hyphenation should only be
used when otherwise line breaking results in very large spaces. We
define:
-<ul>
-<li>the badness \f$\beta\f$ of a line, which is greater the more the
-spaces between the words differ from the ideal space;
-<li>a penalty \f$p\f$ for any possible break point.
-</ul>
+- the badness \f$\beta\f$ of a line, which is greater the more the
+ spaces between the words differ from the ideal space;
+- a penalty \f$p\f$ for any possible break point.
The goal is to find those break points, where \f$\beta + p\f$ is
minimal.
Examples for the penalty \f$p\f$:
-<ul>
-<li>0 for normal line breaks (between words);
-<li>\f$\infty\f$ to prevent a line break at all costs;
-<li>\f$-\infty\f$ to force a line
-<li>a positive, but finite, value for hyphenation points.
-</ul>
+- 0 for normal line breaks (between words);
+- \f$\infty\f$ to prevent a line break at all costs;
+- \f$-\infty\f$ to force a line break;
+- a positive, but finite, value for hyphenation points.
So we need the following values:
-<ul>
-<li> \f$w_i\f$ (the width of the word \f$i\f$ itself);
-<li> \f$s_i\f$ (the width of the space following the word \f$i\f$);
-<li> the stretchability \f$y_i\f$, a value denoting how much the space
-after word\f$i\f$ can be stretched (typically \f${1\over 2} s_i\f$);
-<li> the shrinkability \f$y_i\f$, a value denoting how much the space
-after word\f$i\f$ can be shrunken (typically \f${1\over 3} s_i\f$);
-<li> the penalty \f$p_i\f$, if the line is broken after word \f$i\f$;
-<li> a width \f$h_i\f$, which is added, when the line is broken after
-word \f$i\f$.
-</ul>
+- \f$w_i\f$ (the width of the word \f$i\f$ itself);
+- \f$s_i\f$ (the width of the space following the word \f$i\f$);
+- the stretchability \f$y_i\f$, a value denoting how much the space
+ after word \f$i\f$ can be stretched (typically \f${1\over 2} s_i\f$);
+- the shrinkability \f$z_i\f$, a value denoting how much the space
+ after word \f$i\f$ can be shrunk (typically \f${1\over 3} s_i\f$);
+- the penalty \f$p_i\f$, if the line is broken after word \f$i\f$;
+- a width \f$h_i\f$, which is added, when the line is broken after
+ word \f$i\f$.
\f$h_i\f$ is the width of the hyphen, if the word \f$i\f$ is a part of
the hyphenated word (except the last part); otherwise 0.
@@ -185,22 +179,18 @@ We define:
\f$W_a^b\f$ is the total width, \f$Y_a^b\f$ the total stretchability, and
\f$Z_a^b\f$ the total shrinkability.
-Furthermore the <i>adjustment ratio</i> \f$r_a^b\f$:
+Furthermore the *adjustment ratio* \f$r_a^b\f$:
-<ul>
-<li>in the ideal case that \f$W_a^b = l\f$: \f$r_a^b = 0\f$;
-<li>if \f$W_a^b < l\f$: \f$r_a^b = (l - W_a^b) / Y_a^b\f$ (\f$r_a^b < 0\f$ in
-this case);
-<li>if \f$W_a^b > l\f$: \f$r_a^b = (l - W_a^b) / Z_a^b\f$ (\f$r_a^b < 0\f$ in
-this case).
-</ul>
+- in the ideal case that \f$W_a^b = l\f$: \f$r_a^b = 0\f$;
+- if \f$W_a^b < l\f$: \f$r_a^b = (l - W_a^b) / Y_a^b\f$
+ (\f$r_a^b > 0\f$ in this case);
+- if \f$W_a^b > l\f$: \f$r_a^b = (l - W_a^b) / Z_a^b\f$
+ (\f$r_a^b < 0\f$ in this case).
The badness \f$\beta_a^b\f$ is defined as follows:
-<ul>
-<li>if \f$r_a^b\f$ is undefined or \f$r_a^b < -1\f$: \f$\beta_a^b = \infty\f$;
-<li>otherwise: \f$\beta_a^b = |r_a^b|^3\f$
-</ul>
+- if \f$r_a^b\f$ is undefined or \f$r_a^b < -1\f$: \f$\beta_a^b = \infty\f$;
+- otherwise: \f$\beta_a^b = |r_a^b|^3\f$
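The definitions of the adjustment ratio and the badness can be sketched in a few lines of C++. This is an illustration only, not the code in dw::Textblock; the function name `badness` and its parameters are hypothetical, and the totals \f$W_a^b\f$, \f$Y_a^b\f$ and \f$Z_a^b\f$ are assumed to have been computed by the caller.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: badness of a line with natural width `width` (W),
// total stretchability `stretch` (Y) and total shrinkability `shrink` (Z),
// set into a line of width `lineWidth` (l).
double badness(double lineWidth, double width, double stretch, double shrink) {
    const double inf = std::numeric_limits<double>::infinity();
    if (width == lineWidth) return 0.0;  // ideal case: r = 0
    // A loose line is stretched, a tight line is shrunk.
    const double flexibility = width < lineWidth ? stretch : shrink;
    if (flexibility <= 0.0) return inf;  // r is undefined
    const double r = (lineWidth - width) / flexibility;
    if (r < -1.0) return inf;            // shrunk by more than the shrinkability
    return std::pow(std::fabs(r), 3.0);  // |r|^3
}
```

For example, a line of natural width 90 in a line width of 100 with a total stretchability of 20 gives an adjustment ratio of 0.5 and a badness of 0.125.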
The goal is to find the value of \f$b\f$ where \f$\beta_a^b + p_b\f$
is minimal. (\f$a\f$ is given, since we do not modify the previous
@@ -210,13 +200,12 @@ After a couple of words, it is not predictable whether this minimum
has already been reached. There are two cases where this is possible
for a given \f$b'\f$:
-<ul>
-<li>\f$\beta_{b'}^a = \infty\f$ (line gets too tight): \f$a \le b <
-b'\f$, the minimum has to be searched between these two values;
-<li>\f$p_{b'} = -\infty\f$ (forced line break): \f$a \le b \le b'\f$
-(there may be another minimum of \f$\beta_a^b\f$ before; note the
-\f$\le\f$ instead of \f$<\f$).
-</ul>
+- \f$\beta_a^{b'} = \infty\f$ (line gets too tight):
+ \f$a \le b < b'\f$, the minimum has to be searched between these two
+ values;
+- \f$p_{b'} = -\infty\f$ (forced line break):
+ \f$a \le b \le b'\f$ (there may be another minimum of
+ \f$\beta_a^b\f$ before; note the \f$\le\f$ instead of \f$<\f$).
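The two stopping rules above can be sketched as a small search function. This is a simplified illustration under stated assumptions, not the logic of dw::Textblock::BadnessAndPenalty: the struct `BreakCandidate`, the flag `tooTight`, and the functions `cost` and `findBreak` are hypothetical names, candidate 0 stands for \f$b = a\f$, and a forced break is treated as winning under the plain \f$\beta + p\f$ reading.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

const double kInf = std::numeric_limits<double>::infinity();

// One possible break position b >= a (hypothetical structure).
struct BreakCandidate {
    double badness;  // beta_a^b, may be +infinity
    double penalty;  // p_b, may be +infinity or -infinity
    bool tooTight;   // true when the badness is infinite because r_a^b < -1
};

// beta + p; a forced break is mapped to -infinity directly, which avoids
// the undefined sum "infinity + (-infinity)".
double cost(const BreakCandidate& c) {
    if (c.penalty == -kInf) return -kInf;
    return c.badness + c.penalty;
}

// Returns the index of the best break, or -1 if the words seen so far do
// not yet allow a decision.
int findBreak(const std::vector<BreakCandidate>& candidates) {
    for (std::size_t bPrime = 0; bPrime < candidates.size(); ++bPrime) {
        const bool forced = candidates[bPrime].penalty == -kInf;
        if (!forced && !candidates[bPrime].tooTight) continue;
        // Forced break: search a <= b <= b'. Too tight: search a <= b < b'.
        const std::size_t end = forced ? bPrime + 1 : bPrime;
        // Assumption: a single word that is already too wide is still
        // placed on the line, since every line needs at least one word.
        if (end == 0) return static_cast<int>(bPrime);
        std::size_t best = 0;
        for (std::size_t b = 1; b < end; ++b)
            if (cost(candidates[b]) < cost(candidates[best])) best = b;
        return static_cast<int>(best);
    }
    return -1;  // neither rule applies yet
}
```

The return value -1 corresponds to the situation where neither rule has been met yet, as happens for the last words of a text block.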
This leads to a problem that the last words of a text block are not
displayed this way, since they do not fulfill these rules for being
@@ -229,7 +218,8 @@ code more complicated. See dw::Textblock::BadnessAndPenalty for
details.)
-<h2>Hyphens</h2>
+Hyphens
+=======
Words (instances of dw::Textblock::Word), which are actually part of a
hyphenated word, are always drawn as a whole, not separately. This
@@ -258,25 +248,21 @@ etc. However, it gets a bit more complicated. Since all
non-hyphenations are drawn as a whole, the following conditions can be
concluded:
-<ul>
-<li>from drawing "ABCD" (not hyphenated at all): w(A) + w(B) + w(C) +
-w(D) = l(ABCD);
-<li>from drawing "BCD", when hyphenated as "A-BCD" ("A-" is not
-considered here): w(B) + w(C) + w(D) = l(BCD);
-<li>likewise, from drawing "CD" (cases "AB-CD" and "A-B-CD"): w(C) +
-w(D) = l(CD);
-<li>finally, for the cases "ABC-D", "AB-C-D", "A-BC-D", and "A-B-C-D":
-w(D) = l(D).
-</ul>
+- from drawing "ABCD" (not hyphenated at all): w(A) + w(B) + w(C) +
+ w(D) = l(ABCD);
+- from drawing "BCD", when hyphenated as "A-BCD" ("A-" is not
+ considered here): w(B) + w(C) + w(D) = l(BCD);
+- likewise, from drawing "CD" (cases "AB-CD" and "A-B-CD"): w(C) +
+ w(D) = l(CD);
+- finally, for the cases "ABC-D", "AB-C-D", "A-BC-D", and "A-B-C-D":
+ w(D) = l(D).
So, the calculation is simple:
-<ul>
-<li>w(D) = l(D)
-<li>w(C) = l(CD) - w(D)
-<li>w(B) = l(BCD) - (w(C) + w(D))
-<li>w(A) = l(ABCD) - (w(B) + w(C) + w(D))
-</ul>
+- w(D) = l(D)
+- w(C) = l(CD) - w(D)
+- w(B) = l(BCD) - (w(C) + w(D))
+- w(A) = l(ABCD) - (w(B) + w(C) + w(D))
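The calculation above generalizes to any number of parts. A minimal sketch, assuming the text widths of all suffixes ("ABCD", "BCD", "CD", "D") have already been measured; the function name `partWidths` and its parameter are hypothetical and not part of dw::Textblock:

```cpp
#include <cstddef>
#include <vector>

// suffixWidths[i] is l(part i ... last part), i.e. the width of the text
// from part i to the end of the word, drawn as a whole.
// Returns w(part i) for each part.
std::vector<int> partWidths(const std::vector<int>& suffixWidths) {
    std::vector<int> widths(suffixWidths.size());
    int widthOfFollowingParts = 0;  // w(i+1) + ... + w(last)
    // Work backwards, starting with the last part: w(D) = l(D).
    for (std::size_t i = suffixWidths.size(); i-- > 0;) {
        widths[i] = suffixWidths[i] - widthOfFollowingParts;
        widthOfFollowingParts += widths[i];
    }
    return widths;
}
```

With l(ABCD) = 40, l(BCD) = 31, l(CD) = 19 and l(D) = 10, this yields w(A) = 9, w(B) = 12, w(C) = 9 and w(D) = 10.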
For calculating the hyphen widths, the exact conditions would be
over-determined, even when the possibility for individual hyphen
@@ -285,7 +271,8 @@ be used. However, a simple approach of fixed hyphen widths will have
near-perfect results, so this is kept simple.
-<h2>Automatic Hyphenation</h2>
+Automatic Hyphenation
+=====================
When soft hyphens are used, words are immediately divided into
different parts, and so different instances of
@@ -293,14 +280,12 @@ dw::Textblock::Word. Automatic hyphenation (using Liang's algorithm)
is, however, not applied always, but only when possibly needed, after
calculating a line without hyphenation:
-<ul>
-<li>When the line is tight, the last word of the line is hyphenated;
-possibly this will result in a line with less parts of this word, and
-so a less tight line.
-<li>When the line is loose, and there is another word (for the
-next line) available, this word is hyphenated; possibly, some parts of
-this word are taken into this line, making it less loose.
-</ul>
+- When the line is tight, the last word of the line is hyphenated;
+ possibly this will result in a line with fewer parts of this word,
+ and so a less tight line.
+- When the line is loose, and there is another word (for the next
+ line) available, this word is hyphenated; possibly, some parts of
+ this word are taken into this line, making it less loose.
After this, the line is re-calculated.
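The choice of the word to hyphenate can be sketched as follows. This is only an illustration of the two rules; the enum `LineFit` and the function `wordToHyphenate` are hypothetical, and how a line is classified as tight or loose is left to the caller. The vector stands in for the dw::Textblock::Word::canBeHyphenated flags.

```cpp
#include <vector>

// Hypothetical classification of a line calculated without hyphenation.
enum class LineFit { Tight, Good, Loose };

// Returns the index of the word to which automatic hyphenation should be
// applied, or -1 if no word qualifies.
int wordToHyphenate(LineFit fit, int lastWordOfLine,
                    const std::vector<bool>& canBeHyphenated) {
    int candidate = -1;
    if (fit == LineFit::Tight)
        candidate = lastWordOfLine;      // last word of this line
    else if (fit == LineFit::Loose)
        candidate = lastWordOfLine + 1;  // first word of the next line
    const int numWords = static_cast<int>(canBeHyphenated.size());
    if (candidate < 0 || candidate >= numWords) return -1;
    // Words divided by soft hyphens, or already hyphenated automatically,
    // are not hyphenated again.
    return canBeHyphenated[candidate] ? candidate : -1;
}
```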
@@ -323,36 +308,38 @@ a trick (a second array) to deal with exactly this problem. See there
for more details.
-<h2>Tests</h2>
+Tests
+=====
There are test HTML files in the <i>test</i> directory. Also, there is
a program testing automatic hyphenation, <i>test/liang</i>, which can
be easily extended.
-<h2>Bugs and Things Needing Improvement</h2>
+Bugs and Things Needing Improvement
+===================================
-<h3>High Priority</h3>
+High Priority
+-------------
-<b>Bugs in hyphenation:</b> There seem to be problems when breaking
+**Bugs in hyphenation:** There seem to be problems when breaking
words containing hyphens already. Example: "Abtei-Stadt", which is
divided into "Abtei-" and "Stadt", resulting possibly in
&quot;Abtei-<span></span>-[new line]Stadt&quot;. See also below under
"Medium Priority", on how to deal with hyphens and dashes.
-<h3>Medium Priority</h3>
+Medium Priority
+---------------
-<b>Break hyphens and dashes:</b> The following rules seem to be relevant:
+**Break hyphens and dashes:** The following rules seem to be relevant:
-<ol>
-<li>In English, an em-dash is used with no spaces around. Breaking
-before and after the dash should be possible, perhaps with a penalty >
-0. (In German, an en-dash (Halbgeviert) with spaces around is used
-instead.)</li>
-<li>After a hyphen, which is part of a compound word, a break should
-be possible. As described above ("Abtei-Stadt"), this collides with
-hyphenation.</li>
-</ol>
+- In English, an em-dash is used with no spaces around. Breaking
+ before and after the dash should be possible, perhaps with a
+ penalty > 0. (In German, an en-dash (Halbgeviert) with spaces around
+ is used instead.)
+- After a hyphen, which is part of a compound word, a break should be
+ possible. As described above ("Abtei-Stadt"), this collides with
+ hyphenation.
Where to implement? In the same dynamic, lazy way like hyphenation? As
part of hyphenation?
@@ -364,7 +351,7 @@ and "Stadt", but "Nordrhein-Westfalen" is divided into "Nord",
("rhein-West") is untouched. (Sorry for the German words; if you have
got English examples, send them to me.)
-<b>Incorrect calculation of extremes:</b> The minimal width of a text
+**Incorrect calculation of extremes:** The minimal width of a text
block (as part of the width extremes, which are mainly used for
tables) is defined by everything between two possible breaks. A
possible break may also be a hyphenation point; however, hyphenation
@@ -377,30 +364,28 @@ resulting possibly in a different value for the minimal width.
Possible strategies to deal with this problem:
-<ol>
-<li>Ignore. The implications should be minimal.
-<li>Any solution will make it neccessary to hyphenate at least some
-words when calculating extremes. Since the minimal widths of all words
-are used to calculate the minimal width of the text block, the
-simplest approach will hyphenate all words. This would, of course,
-eliminate the performance gains of the current lazy approach.
-<li>The latter approach could be optimized in some ways. Examples: (i)
-If a word is already narrower than the current accumulated value for
-the minimal width, it makes no sense to hyphenate it. (ii) In other
-cases, heuristics may be used to estimate the number of syllables, the
-width of the widest of them etc.
-</ol>
-
-<h3>Low Priority</h3>
-
-<b>Mark the end of a paragraph:</b> Should dw::core::Content::BREAK
-still be used? Currently, this is redundant to
+- Ignore. The implications should be minimal.
+- Any solution will make it necessary to hyphenate at least some
+ words when calculating extremes. Since the minimal widths of all
+ words are used to calculate the minimal width of the text block, the
+ simplest approach will hyphenate all words. This would, of course,
+ eliminate the performance gains of the current lazy approach.
+- The latter approach could be optimized in some ways. Examples: (i)
+ If a word is already narrower than the current accumulated value for
+ the minimal width, it makes no sense to hyphenate it. (ii) In other
+ cases, heuristics may be used to estimate the number of syllables,
+ the width of the widest of them etc.
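Optimization (i) could look roughly like this. It is a sketch under assumptions, not existing dillo code: the function `minimalWidth` and the callback `hyphenate` are hypothetical, and the callback is assumed to return the widths of the parts of a word, hyphen widths included, or an empty vector if the word cannot be divided.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Minimal width of a text block: the widest unbreakable part.
int minimalWidth(
    const std::vector<int>& wordWidths,
    const std::function<std::vector<int>(std::size_t)>& hyphenate) {
    int minWidth = 0;
    for (std::size_t i = 0; i < wordWidths.size(); ++i) {
        // (i) A word that is not wider than the value accumulated so far
        // cannot increase it, so hyphenating it would be wasted work.
        if (wordWidths[i] <= minWidth) continue;
        const std::vector<int> parts = hyphenate(i);
        const int widest = parts.empty()
            ? wordWidths[i]
            : *std::max_element(parts.begin(), parts.end());
        minWidth = std::max(minWidth, widest);
    }
    return minWidth;
}
```

How many calls are saved depends on the order of the words: once a wide part has been found, all narrower words after it are skipped.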
+
+Low Priority
+------------
+
+**Mark the end of a paragraph:** Should dw::core::Content::BREAK still
+be used? Currently, this is redundant to
dw::Textblock::BadnessAndPenalty.
-<b>Other than justified text:</b> The calculation of badness is
-designed for justified text. For other alignments, it may be
-modified. The point is the definition of stretchability and for the
-line.
+**Other than justified text:** The calculation of badness is designed
+for justified text. For other alignments, it may be modified. The
+point is the definition of stretchability and shrinkability for the
+line.
Consider left-aligned text. Most importantly, not the spaces between
the words, but the space on the right border is adjusted. If the
@@ -425,8 +410,8 @@ lines will, when spaces are shrunken, get too long!)
Analogous considerations must be made for right-aligned and centered
text. (For centered texts, there are two adjustable spaces.)
-<b>Hyphens in adjacent lines:</b> It should be simple to assign a
-larger penalty for hyphens, when the line before is already
-hyphenated. This way, hyphens in adjacent lines are penalized further.
+**Hyphens in adjacent lines:** It should be simple to assign a larger
+penalty for hyphens, when the line before is already hyphenated. This
+way, hyphens in adjacent lines are penalized further.
*/