Plan 9 from Bell Labs’s /usr/web/sources/contrib/steve/root/sys/src/cmd/tex/web2c/misc/patgen.web

Copyright © 2021 Plan 9 Foundation.
Distributed under the MIT License.
Download the Plan 9 distribution.


% This is PATGEN.WEB in text format, as of October 24, 1996.
% Version 1.0 was finished in 1983.
% Version 2.0 major revision for `8-bit TeX' (November 8, 1991).
% Version 2.1 allows left/right_hypen_min from terminal (April, 1992).
% Version 2.2 added `close_in(dictionary)' (August, 1996).
% Version 2.3 avoided division by zero - Karl Berry (October, 1996).

% Here is TeX material that gets inserted after \input webmac
\def\hang{\hangindent 3em\indent\ignorespaces}
\def\PASCAL{Pascal}

\def\title{PATGEN}
\def\contentspagenumber{45} % should be odd
\def\topofcontents{
  \line{\tenit Appendix\hfil \mainfont\contentspagenumber}
  \vfill
  \null\vskip 40pt
  \centerline{\titlefont {\ttitlefont PAT}tern {\ttitlefont GEN}eration
    program}
  \vskip 8pt
  \centerline{\titlefont for the \TeX 82 hyphenator}
  \vskip 15pt
  \centerline{(Version 2.3, October 1996)}
  \vfill}
\pageno=\contentspagenumber \advance\pageno by 1

@* Introduction.
This program takes a list of hyphenated words and generates a set of
patterns that can be used by the \TeX 82 hyphenation algorithm.

The patterns consist of strings of letters and digits, where a digit
indicates a `hyphenation value' for some intercharacter position.  For
example, the pattern \.{3t2ion} specifies that if the string \.{tion}
occurs in a word, we should assign a hyphenation value of 3 to the
position immediately before the \.{t}, and a value of 2 to the position
between the \.{t} and the \.{i}.

To hyphenate a word, we find all patterns that match within the word and
determine the hyphenation values for each intercharacter position.  If
more than one pattern applies to a given position, we take the maximum of
the values specified (i.e., the higher value takes priority).  If the
resulting hyphenation value is odd, this position is a feasible
breakpoint; if the value is even or if no value has been specified, we are
not allowed to break at this position.

In order to find quickly the patterns that match in a given word and to
compute the associated hyphenation values, the patterns generated by this
program are compiled by \.{INITEX} into a compact version of a finite
state machine.  For further details, see the \TeX 82 source.

The |banner| string defined here should be changed whenever \.{PATGEN}
gets modified.

@d banner=='This is PATGEN, Version 2.3' {printed when the program starts}

@ The original version~1 of \.{PATGEN} was written by Frank~M. Liang
@^Liang, Franklin Mark@>
in 1982; a major revision (version~2) by Peter Breitenlohner in 1991
@^Breitenlohner, Peter@>
is mostly related to the new features of `8-bit \TeX' (version~3 of
\TeX 82). The differences between versions~1 and~2 fall into several
categories (all of Liang's algorithms have been left essentially
unchanged): (1)~enhancements related to 8-bit \TeX, e.g., the
introduction of 8-bit |ASCII_code| values and of \.{\\lefthyphenmin} and
\.{\\righthyphenmin}; (2)~a modification of the input and output
procedures which should make language specific modifications of this
program unnecessary (information about the external representation of
all `letters' used by a particular language is obtained from the
|translate| file); (3)~removal of ANSI standard \PASCAL\ and range check
violations; (4)~removal of uninitialized variables; (5)~minor
modifications in order to simplify system-dependent modifications.
@^range check violations@>

@ This program is written in standard \PASCAL, except where it is
necessary to use extensions.  All places where nonstandard constructions
are used have been listed in the index under ``system dependencies.''
@!@^system dependencies@>

The program uses \PASCAL's standard |input| and |output| files to read
from and write to the user's terminal.

@d print(#)==write(output,#)
@d print_ln(#)==write_ln(output,#)
@d get_input(#)==read(input,#)
@d get_input_ln(#)==
  begin if eoln(input) then read_ln(input);
  read(input,#);
  end
@#
@d end_of_PATGEN=9999

@p @<Compiler directives@>@/
program PATGEN(@!dictionary,@!patterns,@!translate,@!patout);
label end_of_PATGEN;
const @<Constants in the outer block@>@/
type @<Types in the outer block@>@/
var @<Globals in the outer block@>@/
procedure initialize; {this procedure gets things started properly}
  var @<Local variables for initialization@>@/
  begin print_ln(banner);@/
  @<Set initial values@>@/
  end;

@ The patterns are generated in a series of sequential passes through the
dictionary.  In each pass, we collect count statistics for a particular
type of pattern, taking into account the effect of patterns chosen in
previous passes.  At the end of a pass, the counts are examined and new
patterns are selected.

Patterns are chosen one level at a time, in order of increasing
hyphenation value.  In the sample run shown below, the parameters
|hyph_start| and |hyph_finish| specify the first and last levels,
respectively, to be generated.

Patterns at each level are chosen in order of increasing pattern length
(usually starting with length~2).  This is controlled by the parameters
|pat_start| and |pat_finish| specified at the beginning of each level.

Furthermore patterns of the same length applying to different
intercharacter positions are chosen in separate passes through the
dictionary.  Since patterns of length $n$ may apply to $n+1$ different
positions, choosing a set of patterns of lengths $2$ through $n$ for a
given level requires $(n+1)(n+2)/2-3$ passes through the word list.

At each level, the selection of patterns is controlled by the three
parameters |good_wt|, |bad_wt|, and |thresh|.  A hyphenating pattern will
be selected if |good*good_wt-bad*bad_wt>=thresh|, where |good| and
|bad| are the number of times the pattern could and could not be
hyphenated, respectively, at a particular point.  For inhibiting patterns,
|good| is the number of errors inhibited, and |bad| is the number of
previously found hyphens inhibited.

@<Globals...@>=
@!pat_start, @!pat_finish: dot_type;
@!hyph_start, @!hyph_finish: val_type;
@!good_wt, @!bad_wt, @!thresh: integer;

@ The proper choice of the parameters to achieve a desired degree of
hyphenation is discussed in Chapter~4.  Below we show part of a sample run
of \.{PATGEN}, with the user's inputs underlined.
$$\vbox{\halign{\.{#\hfil}\cr
$\underline{\smash{\.{ex patgen}}}$\cr
DICTIONARY : $\underline{\smash{\.{murray.hyf}}}$\cr
PATTERNS   : $\underline{\smash{\.{nul:}}}$\cr
TRANSLATE  : $\underline{\smash{\.{nul:}}}$\cr
PATOUT     : $\underline{\smash{\.{murray.pat}}}$\cr
This is PATGEN, Version 2.0\cr
left\_hyphen\_min = 2, right\_hyphen\_min = 3, 26 letters\cr
0 patterns read in\cr
pattern trie has 256 nodes, trie\_max = 256, 0 outputs\cr
hyph\_start, hyph\_finish: $\underline{\.{1 1}}$\cr
pat\_start, pat\_finish: $\underline{\.{2 3}}$\cr
good weight, bad weight, threshold: $\underline{\.{1 3 3}}$\cr
processing dictionary with pat\_len = 2, pat\_dot = 1\cr
\cr
0 good, 0 bad, 3265 missed\cr
 0.00 \%,  0.00 \%,  100.00 \%\cr
338 patterns, 466 nodes in count trie, triec\_max = 983\cr
46 good and 152 bad patterns added (more to come)\cr
finding 715 good and 62 bad hyphens, efficiency =  10.72\cr
pattern trie has 326 nodes, trie\_max = 509, 2 outputs\cr
processing dictionary with pat\_len = 2, pat\_dot = 0\cr
\cr
\hskip 1.5em ...\cr
\cr
1592 nodes and 39 outputs deleted\cr
total of 220 patterns at hyph\_level 1\cr
hyphenate word list? $\underline{\smash{\.{y}}}$\cr
writing pattmp.1\cr
\cr
2529 good, 243 bad, 736 missed\cr
 77.46 \%,  7.44 \%,  22.54 \%\cr}}$$

@ Note that before beginning a pattern selection run, a file of existing
patterns may be read in.  In order for pattern selection to work properly,
this file should only contain patterns with hyphenation values less than
|hyph_start|.  Each word in the dictionary is hyphenated according to the
existing set of patterns (including those chosen on previous passes of the
current run) before pattern statistics are collected.

Also, a hyphenated word list may be written out at the end of a run.  This
list can be read back in as the `dictionary' to continue pattern selection
from this point.  In addition to ordinary hyphens (|'-'|) the new list
will contain two additional kinds of ``hyphens'' between letters, namely
hyphens that have been found by previously generated patterns, as well
as erroneous hyphens that have been inserted by those patterns.  These
are represented by the symbols |'*'| and |'.'|, respectively. The three
characters |'-'|, |'*'|, and |'.'| are, in fact, just the default values
used to represent the three kinds of hyphens, the |translate| file may
specify different characters to be used instead of them.

In addition, a word list can include hyphen weights, both for entire words
and for individual hyphen positions.  (The syntax for this is explained in
the dictionary processing routines.)  Thus common words can be weighted
more heavily, or, more generally, words can be weighted according to their
frequency of occurrence, if such information is available.  The use of
hyphen weights combined with an appropriate setting of the pattern
selection threshold can be used to guarantee hyphenation of certain words
or certain hyphen positions within a word.

@ Below we show the first few lines of a typical word list,
before and after generating a level of patterns.
$$\vbox{\halign{\tabskip 1in\.{#\hfil}&\.{#\hfil}\cr
abil-i-ty&  abil*i*ty\cr
ab-sence&  ab*sence\cr
ab-stract&  ab*stract\cr
ac-a-dem-ic&  ac-a-d.em-ic\cr
ac-cept&  ac*cept\cr
ac-cept-able&  ac*cept-able\cr
ac-cept-ed&  ac*cept*ed\cr
\hskip 1.5em ...&\hskip 1.5em ...\cr
}}$$

@ We augment \PASCAL 's control structures a bit using |goto|\unskip's
and the following symbolic labels.

@d exit=10 {go here to leave a procedure}
@d continue=22 {go here to resume a loop}
@d done=30 {go here to exit a loop}
@d found=40 {go here when you've found it}
@d not_found=41 {go here when you've found something else}

@ Here are some macros for common programming idioms.

@d incr(#)==#:=#+1 {increase a variable by unity}
@d decr(#)==#:=#-1 {decrease a variable by unity}
@#
@d Incr_Decr_end(#)==#
@d Incr(#)==#:=#+Incr_Decr_end {we use |Incr(a)(b)| to increase \dots}
@d Decr(#)==#:=#-Incr_Decr_end {\dots\ and |Decr(a)(b)| to decrease
  variable |a| by |b|; this can be optimized for some compilers}
@#
@d loop == @+ while true do@+ {repeat over and over until a |goto| happens}
@d do_nothing == {empty statement}
@d return==goto exit {terminate a procedure call}
@f return==nil
@f loop == xclause

@ In case of serious problems \.{PATGEN} will give up, after issuing an
error message about what caused the error. Such errors might be
discovered inside of subroutines inside of subroutines, so a \.{WEB}
macro called |jump_out| has been introduced. This macro, which transfers
control to the label |end_of_PATGEN| at the end of the program, contains
the only non-local |@!goto| statement in \.{PATGEN}. Some \PASCAL\
compilers do not implement non-local |goto| statements. In such cases
the |goto end_of_PATGEN| in the definition of |jump_out| should simply
be replaced by a call on some system procedure that quietly terminates
the program.
@^system dependencies@>

An overflow stop occurs if \.{PATGEN}'s tables aren't large enough.

@d jump_out==goto end_of_PATGEN {terminates \.{PATGEN}}
@#
@d error(#)==begin print_ln(#); jump_out; end
@d overflow(#)==error('PATGEN capacity exceeded, sorry [',#,'].')
@.PATGEN capacity exceeded ...@>

@ @<Compiler directives@>=
@{@&$C-,A+,D-@} {no range check, catch arithmetic overflow, no debug overhead}
@^system dependencies@>

@* The character set.
Since different \PASCAL\ systems may use different character sets, we use
the name |text_char| to stand for the data type of characters appearing in
external text files.  We also assume that |text_char| consists of the
elements |chr(first_text_char)| through |chr(last_text_char)|, inclusive.
The definitions below should be adjusted if necessary.
@^system dependencies@>
@^character set dependencies@>

Internally, characters will be represented using the type |ASCII_code|.
Note, however, that only some of the standard ASCII characters are
assigned a fixed |ASCII_code|; all other characters are assigned an
|ASCII_code| dynamically when they are first read from the |translate|
file specifying the external representation of the `letters' used by a
particular language. For the sake of generality the standard version of
this program allows for 256 different |ASCII_code| values, but 128 of
them would probably suffice for all practical purposes.

@d first_text_char=0 {ordinal number of the smallest element of |text_char|}
@d last_text_char=255 {ordinal number of the largest element of |text_char|}
@#
@d last_ASCII_code=255 {the highest allowed |ASCII_code| value}

@<Types...@>=
@!text_char=char; {the data type of characters in text files}
@!ASCII_code=0..last_ASCII_code; {internal representation of input characters}
@!text_file=text;

@ Some \PASCAL s can store only signed eight-bit quantities (|-128..127|)
but not unsigned ones (|0..255|) in one byte. If storage is tight we
must, for such \PASCAL s, either restrict |ASCII_code| to the range
|0..127| (with some loss of generality) or convert between |ASCII_code|
and |packed_ASCII_code| and vice versa by subtracting or adding an
offset. (Or we might define |packed_ASCII_code| as |char| and use
suitable typecasts for the conversion.) Only the type |packed_ASCII_code|
will be used for large arrays and the \.{WEB} macros |si| and |so| will
always be used to convert an |ASCII_code| into a |packed_ASCII_code| and
vice versa.
@^system dependencies@>

@d min_packed=0 {change this to `$\\{min\_packed}=-128$' when necessary;
  and don't forget to change the definitions of |si| and |so| below
  accordingly}
@#
@d si(#)==# {converts |ASCII_code| to |packed_ASCII_code|}
@d so(#)==# {converts |packed_ASCII_code| to |ASCII_code|}

@<Types...@>=
@!packed_ASCII_code=min_packed..last_ASCII_code+min_packed;

@ We want to make sure that the ``constants'' defined in this program
satisfy all the required relations. Some of them are needed to avoid
time-consuming checks while processing the dictionary and\slash or to
prevent range check and array bound violations.
@^range check violations@>

Here we check that the definitions of |ASCII_code| and
|packed_ASCII_code| are consistent with those of |si| and |so|.

@<Set init...@>=
bad:=0;@/
if last_ASCII_code<127 then bad:=1;
if (si(0)<>min_packed)or(so(min_packed)<>0) then bad:=2;@/
@<Check the ``constant'' values for consistency@>@;
if bad>0 then error('Bad constants---case ',bad:1);
@.Bad constants@>

@ @<Local variables for init...@>=
@!bad:integer;
@!i:text_char;
@!j:ASCII_code;

@ We convert between |ASCII_code| and the user's external character set by
means of arrays |xord| and |xchr| that are analogous to \PASCAL's |ord|
and |chr| functions.

@<Globals...@>=
@!xord: array [text_char] of ASCII_code;
  {specifies conversion of input characters}
@!xchr: array [ASCII_code] of text_char;
  {specifies conversion of output characters}

@ The following code initializes the |xchr| array with some of the
standard ASCII characters.

@<Set init...@>=
for j:=0 to last_ASCII_code do xchr[j]:=' ';
xchr["."]:='.';@/
xchr["0"]:='0'; xchr["1"]:='1'; xchr["2"]:='2'; xchr["3"]:='3';
xchr["4"]:='4'; xchr["5"]:='5'; xchr["6"]:='6'; xchr["7"]:='7';
xchr["8"]:='8'; xchr["9"]:='9';@/
xchr["A"]:='A'; xchr["B"]:='B'; xchr["C"]:='C'; xchr["D"]:='D';
xchr["E"]:='E'; xchr["F"]:='F'; xchr["G"]:='G'; xchr["H"]:='H';
xchr["I"]:='I'; xchr["J"]:='J'; xchr["K"]:='K'; xchr["L"]:='L';
xchr["M"]:='M'; xchr["N"]:='N'; xchr["O"]:='O'; xchr["P"]:='P';
xchr["Q"]:='Q'; xchr["R"]:='R'; xchr["S"]:='S'; xchr["T"]:='T';
xchr["U"]:='U'; xchr["V"]:='V'; xchr["W"]:='W'; xchr["X"]:='X';
xchr["Y"]:='Y'; xchr["Z"]:='Z';@/
xchr["a"]:='a'; xchr["b"]:='b'; xchr["c"]:='c'; xchr["d"]:='d';
xchr["e"]:='e'; xchr["f"]:='f'; xchr["g"]:='g'; xchr["h"]:='h';
xchr["i"]:='i'; xchr["j"]:='j'; xchr["k"]:='k'; xchr["l"]:='l';
xchr["m"]:='m'; xchr["n"]:='n'; xchr["o"]:='o'; xchr["p"]:='p';
xchr["q"]:='q'; xchr["r"]:='r'; xchr["s"]:='s'; xchr["t"]:='t';
xchr["u"]:='u'; xchr["v"]:='v'; xchr["w"]:='w'; xchr["x"]:='x';
xchr["y"]:='y'; xchr["z"]:='z';

@ The following system-independent code makes the |xord| array contain a
suitable inverse to the information in |xchr|.

@d invalid_code=0 {|ASCII_code| that should not appear}
@d tab_char=@'11 {|ord| of tab character; tab characters seem to be
  unavoidable with files from UNIX systems}
@^system dependencies@>
@^character set dependencies@>

@<Set init...@>=
for i:=chr(first_text_char) to chr(last_text_char) do
  xord[i]:=invalid_code;
for j:=0 to last_ASCII_code do xord[xchr[j]]:=j;
xord[' ']:=" "; xord[chr(tab_char)]:=" ";

@ So far each invalid |ASCII_code| has been assigned the character |' '|
and all invalid characters have been assigned |ASCII_code=invalid_code|.
The |get_ASCII| function, used only while reading the |translate| file,
returns the |ASCII_code| corresponding to a character, assigning a new
|ASCII_code| first if necessary.

@d num_ASCII_codes=last_ASCII_code+1 {number of different |ASCII_code| values}

@p function get_ASCII(@!c:text_char):ASCII_code;
label found;
var i: ASCII_code;
begin i:=xord[c];
if i=invalid_code then
  begin while i<last_ASCII_code do
    begin incr(i);
    if (xchr[i]=' ')and(i<>" ") then goto found;
    end;
  overflow(num_ASCII_codes:1,' characters');
  found: xord[c]:=i; xchr[i]:=c;
  end;
get_ASCII:=i;
end;

@ The \TeX 82 hyphenation algorithm operates on `hyphenable words'
converted temporarily to lower case, i.e., they may consist of up to
255 different `letters' corresponding to \.{\\lccode}s |1..255|. These
\.{\\lccode}s could, in principle, be language dependent but this might
lead to undesirable results when hyphenating multilingual paragraphs.
No more than 245 different letters can occur in hyphenation patterns
since the characters |'0'..'9'| and |'.'| play a special r\^^Dole when
reading patterns. For the purpose of this program each letter is
represented internally by a unique |internal_code>=2| (|internal_code=1|
is the |edge_of_word| indicator); |internal_code| values |2..127| will
probably suffice for all practical purposes, but we allow the range
|2..last_ASCII_code| for the sake of generality. Syntactically
|internal_code| and |ASCII_code| are the same, we will use one or the
other name according to the semantic context.

@d edge_of_word=1 {|internal_code| for start and end of a word}

@<Types...@>=
@!internal_code=ASCII_code;
@!packed_internal_code=packed_ASCII_code;

@ Note that an |internal_code| used by this program is in general quite
different from the |ASCII_code| (or rather \.{\\lccode}) used by \TeX
82. This program allows the input of characters (from the |dictionary|
and |patterns| file) corresponding to an |internal_code| in either lower
or upper case form; the output (to the |patout| and |pattmp| file) will
always be in lower case form.

Unfortunately there does not (yet?) exist a standardized and widely
accepted 8-bit character set (or a unique one-to-one translation between
such sets). On the other hand macro expansion takes place in \TeX 82
when reading hyphenable words and when reading patterns. Thus the lower
and upper case versions of all `letters' used by a particular language
can (and for the sake of portability should) be represented entirely in
terms of the standard ASCII character set; either directly as characters
or via macros (or active characters) with or without arguments. The
macro definitions for such a representation will in general be language
dependent.

For the purpose of this program the external representation of the lower
and upper case version of a letter (i.e., |internal_code|) consists of a
unique sequence of characters (or \\{ASCII\_codes}), the only restriction
being that no such sequence must be a subsequence of an other one.
Moreover such sequences must not start with |' '|, |'.'|, |'0'..'9'| or
with one of the three characters (|'-'|, |'*'|, and |'.'|) representing
hyphens in the |dictionary| file; a sequence may, however, end with a
mandatory |' '| as, e.g., the sequence |'\ss '|.

The language dependent values of \.{\\lefthyphenmin} and
\.{\\righthyphenmin} as well as the external representation of the lower
and upper case letters and their collating sequence are specified in the
|translate| file, thus making any language dependent modifications of
this program unnecessary. If the |translate| file is empty (or does not
exist) the values \.{\\lefthyphenmin=2} and \.{\\righthyphenmin=3} and
|internal_code| values |2..27| with the one character external
representations |'a'..'z'| and |'A'..'Z'| will be used as defaults.

Incidentally this program can be used to convert a |dictionary| and
|patterns| file from one (``upper case'') to another (``lower case'')
external representation of letters.

@ When reading the |dictionary| (and |patterns|) file sequences of
characters must be recognized and converted to their corresponding
|internal_code|. This conversion is part of \.{PATGEN}s inner loop and
@^inner loop@>
must therefore be done as efficient as possible. Thus we will
mostly bypass the conversion from character to |ASCII_code| and convert
directly to the corresponding |internal_code| using the |xclass|
and |xint| arrays. Six types of characters are distinguished by their
|xclass|:

\yskip\hang |space_class| character |' '| terminates a pattern or word.

\yskip\hang |digit_class| characters |'0'..'9'| are hyphen values for a
pattern or hyphen weights for a word; their |xint| is the corresponding
numeric value |0..9|.

\yskip\hang |hyf_class| characters (|'.'|, |'-'|, and |'*'|) are `dots'
and indicate hyphens in a word; their |xint| is the corresponding
numeric value |err_hyf..found_hyf|.

\yskip\hang |letter_class| characters represent a letter; their |xint|
is the corresponding |internal_code|.

\yskip\hang |escape_class| characters indicate the start of a
multi-character sequence representing a letter.

\yskip\hang |invalid_class| characters should not occur except as part
of multi-character sequences.

@d space_class=0 {the character |' '|}
@d digit_class=1 {the characters |'0'..'9'|}
@d hyf_class=2 {the `hyphen' characters (|'.'|, |'-'|, and |'*'|)}
@d letter_class=3 {characters representing a letter}
@d escape_class=4 {characters that start a multi-character sequence
  representing a letter}
@d invalid_class=5 {characters that normally should not occur}
@#
@d no_hyf=0 {no hyphen}
@d err_hyf=1 {erroneous hyphen}
@d is_hyf=2 {hyphen}
@d found_hyf=3 {found hyphen}

@<Types...@>=
@!class_type=space_class..invalid_class; {class of a character}
@!digit=0..9; {a hyphen weight (or word weight)}
@!hyf_type=no_hyf..found_hyf; {type of a hyphen}

@ In addition we will use the |xext|, |xdig|, and |xdot| arrays to
convert from the internal representation to the corresponding
characters.

@<Globals...@>=
@!xclass: array [text_char] of class_type;
  {specifies the class of a character}
@!xint: array [text_char] of internal_code;
  {specifies the |internal_code| for a character}
@!xdig: array [0..9] of text_char;
  {specifies conversion of output characters}
@!xext: array [internal_code] of text_char;
  {specifies conversion of output characters}
@!xhyf: array [err_hyf..found_hyf] of text_char;
  {specifies conversion of output characters}

@ @<Set init...@>=
for i:=chr(first_text_char) to chr(last_text_char) do
  begin xclass[i]:=invalid_class; xint[i]:=0;
  end;
xclass[' ']:=space_class;
for j:=0 to last_ASCII_code do xext[j]:=' ';
xext[edge_of_word]:='.';
for j:=0 to 9 do
  begin xdig[j]:=xchr[j+"0"];
  xclass[xdig[j]]:=digit_class; xint[xdig[j]]:=j;
  end;
xhyf[err_hyf]:='.'; xhyf[is_hyf]:='-'; xhyf[found_hyf]:='*';
  {default representation for hyphens}

@ We assume that words use only the letters |cmin+1| through |cmax|.
This allows us to save some time on trie operations that involve
searching for packed transitions belonging to a particular state.

@d cmin=edge_of_word

@<Globals...@>=
@!cmax: internal_code; {largest |internal_code| or |ASCII_code|}

@* Data structures.
The main data structure used in this program is a dynamic packed trie.
In fact we use two of them, one for the set of patterns selected so far,
and one for the patterns being considered in the current pass.

For a pattern $p_1\ldots p_k$, the information associated with that
pattern is accessed by setting |@t$t_1$@>:=trie_root+@t$p_1$@>| and
then, for |1<i<=k|, setting |@t$t_i$@>:=trie_link(@t$t_{i-1}$@>)+
@t$p_i$@>|; the pattern information is then stored in a location addressed
by |@t$t_k$@>|.  Since all trie nodes are packed into a single array, in
order to distinguish nodes belonging to different trie families, a special
field is provided such that |trie_char@t$(t_i)=si(p_i)$@>| for all |i|.

In addition the trie must support dynamic insertions and deletions.  This
is done by maintaining a doubly linked list of unoccupied cells and
repacking trie families as necessary when insertions are made.

Each trie node consists of three fields:  the character |trie_char|, and
the two link fields |trie_link| and |trie_back|.  In addition there is a
separate boolean array |trie_base_used|.  When a node is unoccupied,
|trie_char=min_packed| and the link fields point to the next and previous
unoccupied nodes, respectively, in the doubly linked list.  When a node is
occupied, |trie_link| points to the next trie family, and |trie_back|
(renamed |trie_outp|) contains the output associated with this transition.
The |trie_base_used| bit indicates that some family has been packed at
this base location, and is used to prevent two families from being packed
at the same location.

@ The sizes of the pattern tries may have to be adjusted depending
on the particular application (i.e., the parameter settings and the
size of the dictionary).  The sizes below were sufficient to generate
the original set of english \TeX 82 hyphenation patterns (file
\.{hyphen.tex}).

@<Constants...@>=
@!trie_size=55000; {space for pattern trie}
@!triec_size=26000; {space for pattern count trie, must be less than
 |trie_size| and greater than the number of occurrences of any pattern in
 the dictionary}
@!max_ops=4080; {size of output hash table, should be a multiple of 510}
@!max_val=10; {maximum number of levels$+1$, also used to denote bad patterns}
@!max_dot=15; {maximum pattern length, also maximum length of external
  representation of a `letter'}
@!max_len=50; {maximum word length}
@!max_buf_len=80; {maximum length of input lines, must be at least |max_len|}

@ @<Check the ``constant'' values for consistency@>=
if (triec_size<4096)or(trie_size<triec_size) then bad:=3;
if max_ops>trie_size then bad:=4;
if max_val>10 then bad:=5;
if max_buf_len<max_len then bad:=6;

@ @<Types...@>=
@!q_index=1..last_ASCII_code; {number of transitions in a state}
@!val_type=0..max_val; {hyphenation values}
@!dot_type=0..max_dot; {dot positions}
@!op_type=0..max_ops; {index into output hash table}
@!word_index=0..max_len; {index into |word|}
@!trie_pointer=0..trie_size;
@!triec_pointer=0..triec_size;@/
@!op_word=packed record dot: dot_type; val: val_type; op: op_type end;

@ Trie is actually stored with its components in separate packed arrays,
in order to save space and time (although this depends on the computer's
word size and the size of the trie pointers).

@<Globals...@>=
@!trie_c: packed array[trie_pointer] of packed_internal_code;
@!trie_l, @!trie_r: packed array[trie_pointer] of trie_pointer;
@!trie_taken: packed array[trie_pointer] of boolean;
@!triec_c: packed array[triec_pointer] of packed_internal_code;
@!triec_l, @!triec_r: packed array[triec_pointer] of triec_pointer;
@!triec_taken: packed array[triec_pointer] of boolean;
@!ops: array[op_type] of op_word; {output hash table}

@ When some trie state is being worked on, an unpacked version of the
state is kept in positions |1..qmax| of the global arrays |trieq_c|,
|trieq_l|, and |trieq_r|. The character fields need not be in any
particular order.

@<Globals...@>=
@!trieq_c: array[q_index] of internal_code; {character fields of a
  single trie state}
@!trieq_l, @!trieq_r: array[q_index] of trie_pointer; {link fields}
@!qmax: q_index; {number of transitions in an unpacked state}
@!qmax_thresh: q_index; {controls density of first-fit packing}

@ Trie fields are accessed using the following macros.

@d trie_char(#)==trie_c[#]
@d trie_link(#)==trie_l[#]
@d trie_back(#)==trie_r[#]
@d trie_outp(#)==trie_r[#]
@d trie_base_used(#)==trie_taken[#]
@#
@d triec_char(#)==triec_c[#]
@d triec_link(#)==triec_l[#]
@d triec_back(#)==triec_r[#]
@d triec_good(#)==triec_l[#]
@d triec_bad(#)==triec_r[#]
@d triec_base_used(#)==triec_taken[#]
@#
@d q_char(#)==trieq_c[#]
@d q_link(#)==trieq_l[#]
@d q_back(#)==trieq_r[#]
@d q_outp(#)==trieq_r[#]
@#
@d hyf_val(#)==ops[#].val
@d hyf_dot(#)==ops[#].dot
@d hyf_nxt(#)==ops[#].op

@* Routines for pattern trie.
The pattern trie holds the set of patterns chosen prior to the current
pass, including bad or ``hopeless'' patterns at the current level that
occur too few times in the dictionary to be of use.  Each transition of
the trie includes an output field pointing to the hyphenation information
associated with this transition.

@<Globals...@>=
@!trie_max: trie_pointer; {maximum occupied trie node}
@!trie_bmax: trie_pointer; {maximum base of trie family}
@!trie_count: trie_pointer; {number of occupied trie nodes, for space usage
 statistics}
@!op_count: op_type; {number of outputs in hash table}

@ Initially, the dynamic packed trie has just one state, namely the root,
with all transitions present (but with null links).  This is convenient
because the root will never need to be repacked and also we won't have to
check that the base is nonnegative when packing other states.
Moreover in many cases we need not check for a vanishing link field:
if |trie_link(t)=0| then a subsequent test for
|trie_char(trie_link(t)+c)=si(c)| will always fail due to |trie_root=1|.

@d trie_root=1

@p procedure init_pattern_trie;
var c: internal_code; @!h: op_type;
begin  for c:=0 to last_ASCII_code do
  begin  trie_char(trie_root+c):=si(c); {indicates node occupied;
      fake for |c=0|}
    trie_link(trie_root+c):=0;
    trie_outp(trie_root+c):=0;
    trie_base_used(trie_root+c):=false;
  end;
  trie_base_used(trie_root):=true;
  trie_bmax:=trie_root;
  trie_max:=trie_root+last_ASCII_code;
  trie_count:=num_ASCII_codes;@/
  qmax_thresh:=5;@/
  trie_link(0):=trie_max+1;
  trie_back(trie_max+1):=0;@/
  {|trie_link(0)| is used as the head of the doubly linked list of
    unoccupied cells}
  for h:=1 to max_ops do hyf_val(h):=0; {clear output hash table}
  op_count:=0;
end;

@ The |first_fit| procedure finds a hole in the packed trie into which the
state in |trieq_c|, |trieq_l|, and |trieq_r| will fit.  This is normally
done by going through the linked list of unoccupied cells and testing if
the state will fit at each position.  However if a state has too many
transitions (and is therefore unlikely to fit among existing
transitions) we don't bother and instead just pack it immediately to the
right of the occupied region (starting at |trie_max+1|).

@p function first_fit: trie_pointer;
label found, not_found;
var s, @!t: trie_pointer; @!q: q_index;
begin  @<Set |s| to the trie base location at which this state should be
    packed@>;
  for q:=1 to qmax do {pack it}
  begin  t:=s+q_char(q);@/
    trie_link(trie_back(t)):=trie_link(t);
    trie_back(trie_link(t)):=trie_back(t); {link around
      filled cell}
    trie_char(t):=si(q_char(q));
    trie_link(t):=q_link(q);
    trie_outp(t):=q_outp(q);
    if t>trie_max then trie_max:=t;
  end;
  trie_base_used(s):=true;
  first_fit:=s
end;

@ The threshold for large states is initially 5 transitions.  If more than
one level of patterns is being generated, the threshold is set to 7 on
subsequent levels because the pattern trie will be sparser after bad
patterns are deleted (see |delete_bad_patterns|).

@<Set |s| to the trie base location at which this state should be packed@>=
if qmax>qmax_thresh then t:=trie_back(trie_max+1) @+else t:=0;
loop  begin t:=trie_link(t); s:=t-q_char(1); {get next unoccupied cell}
  @<Ensure |trie| linked up to |s+num_ASCII_codes|@>;
  if trie_base_used(s) then goto not_found;
  for q:=qmax downto 2 do {check if state fits here}
  if trie_char(s+q_char(q))<>min_packed then goto not_found;
  goto found;
  not_found: end;
found:

@ The trie is only initialized (as a doubly linked list of empty cells) as
far as necessary.  Here we extend the initialization if necessary, and
check for overflow.

@<Ensure |trie| linked up to |s+num_ASCII_codes|@>=
if s>trie_size-num_ASCII_codes then
  overflow(trie_size:1,' pattern trie nodes');
while trie_bmax<s do
  begin incr(trie_bmax);
  trie_base_used(trie_bmax):=false;
  trie_char(trie_bmax+last_ASCII_code):=min_packed;
  trie_link(trie_bmax+last_ASCII_code):=trie_bmax+num_ASCII_codes;
  trie_back(trie_bmax+num_ASCII_codes):=trie_bmax+last_ASCII_code;
  end

@ The |unpack| procedure finds all transitions associated with the state
with base |s|, puts them into the arrays |trieq_c|, |trieq_l|, and
|trieq_r|, and sets |qmax| to one more than the number of transitions
found.  Freed cells are put at the beginning of the free list.

@p procedure unpack(@!s: trie_pointer);
var c: internal_code; @!t: trie_pointer;
begin  qmax:=1;
for c:=cmin to cmax do {search for transitions belonging to this state}
  begin  t:=s+c;
  if so(trie_char(t))=c then {found one}
    begin q_char(qmax):=c;
    q_link(qmax):=trie_link(t);
    q_outp(qmax):=trie_outp(t);
    incr(qmax);@/
    {now free trie node}
    trie_back(trie_link(0)):=t;
    trie_link(t):=trie_link(0);
    trie_link(0):=t;
    trie_back(t):=0;
    trie_char(t):=min_packed;
    end;
  end;
trie_base_used(s):=false;
end;

@ The function |new_trie_op| returns the `opcode' for the output
consisting of hyphenation value~|v|, hyphen position |d|, and next output
|n|.  The hash function used by |new_trie_op| is based on the idea that
313/510 is an approximation to the golden ratio [cf.\ {\sl The Art of
Computer Programming \bf3} (1973), 510--512]; but the choice is
comparatively unimportant in this particular application.

@p function new_trie_op(@!v: val_type; @!d: dot_type; @!n: op_type): op_type;
label exit;
var h: op_type;
begin  h:=((n+313*d+361*v) mod max_ops)+1; {trial hash location}
loop  begin if hyf_val(h)=0 then {empty position found}
    begin incr(op_count);
    if op_count=max_ops then overflow(max_ops:1,' outputs');
    hyf_val(h):=v; hyf_dot(h):=d; hyf_nxt(h):=n; new_trie_op:=h; return;
    end;
  if (hyf_val(h)=v) and (hyf_dot(h)=d) and
     (hyf_nxt(h)=n) then {already in hash table}
    begin  new_trie_op:=h; return;
    end;
  if h>1 then decr(h) @+else h:=max_ops; {try again}
  end;
exit: end;

@ @<Globals...@>=
@!pat: array[dot_type] of internal_code; {current pattern}
@!pat_len: dot_type; {pattern length}

@ Now that we have provided the necessary routines for manipulating the
dynamic packed trie, here is a procedure that inserts a pattern of length
|pat_len|, stored in the |pat| array, into the pattern trie.  It also adds
a new output.

@p procedure insert_pattern(@!val: val_type; @!dot: dot_type);
var i: dot_type; @!s, @!t: trie_pointer;
begin  i:=1;
  s:=trie_root+pat[i]; t:=trie_link(s);
  while (t>0) and (i<pat_len) do {follow existing trie}
  begin  incr(i); Incr(t)(pat[i]);
    if so(trie_char(t))<>pat[i] then
      @<Insert critical transition, possibly repacking@>;
    s:=t; t:=trie_link(s);
  end;
  q_link(1):=0; q_outp(1):=0; qmax:=1;
  while i<pat_len do {insert rest of pattern}
  begin  incr(i); q_char(1):=pat[i];
    t:=first_fit;
    trie_link(s):=t;
    s:=t+pat[i];
    incr(trie_count);
  end;
  trie_outp(s):=new_trie_op(val,dot,trie_outp(s));
end;

@ We have accessed a transition not in the trie.  We insert it, repacking
the state if necessary.

@<Insert critical transition, possibly repacking@>=
begin  if trie_char(t)=min_packed then
  begin  {we're lucky, no repacking needed}
  trie_link(trie_back(t)):=trie_link(t);
  trie_back(trie_link(t)):=trie_back(t);@/
  trie_char(t):=si(pat[i]);
  trie_link(t):=0;
  trie_outp(t):=0;
  if t>trie_max then trie_max:=t;
  end
else  begin  {whoops, have to repack}
  unpack(t-pat[i]);@/
  q_char(qmax):=pat[i];
  q_link(qmax):=0;
  q_outp(qmax):=0;@/
  t:=first_fit;
  trie_link(s):=t;
  Incr(t)(pat[i]);
  end;
incr(trie_count);
end

@* Routines for pattern count trie.
The pattern count trie is used to store the set of patterns considered in
the current pass, along with the counts of good and bad instances.  The
fields of this trie are the same as the pattern trie, except that there is
no output field, and leaf nodes are also used to store counts
(|triec_good| and |triec_bad|).  Except where noted, the following
routines are analogous to the pattern trie routines.

@<Globals...@>=
@!triec_max, @!triec_bmax, @!triec_count: triec_pointer; {same as for pattern
  trie}
@!triec_kmax: triec_pointer; {shows growth of trie during pass}
@!pat_count: integer; {number of patterns in count trie}

@ [See |init_pattern_trie|.]  The variable |triec_kmax| always contains
the size of the count trie rounded up to the next multiple of 4096, and is
used to show the growth of the trie during each pass.

@d triec_root=1

@p procedure init_count_trie;
var c: internal_code;
begin  for c:=0 to last_ASCII_code do
  begin  triec_char(triec_root+c):=si(c);@/
    triec_link(triec_root+c):=0;
    triec_back(triec_root+c):=0;
    triec_base_used(triec_root+c):=false;
  end;
  triec_base_used(triec_root):=true;
  triec_bmax:=triec_root; triec_max:=triec_root+last_ASCII_code;
  triec_count:=num_ASCII_codes; triec_kmax:=4096;@/
  triec_link(0):=triec_max+1; triec_back(triec_max+1):=0;@/
  pat_count:=0;
end;

@ [See |first_fit|.]

@p function firstc_fit: triec_pointer;
label found, not_found;
var a, @!b: triec_pointer; @!q: q_index;
begin  @<Set |b| to the count trie base location at which this state should
    be packed@>;
  for q:=1 to qmax do {pack it}
  begin  a:=b+q_char(q);@/
    triec_link(triec_back(a)):=triec_link(a);
    triec_back(triec_link(a)):=triec_back(a);@/
    triec_char(a):=si(q_char(q));
    triec_link(a):=q_link(q);
    triec_back(a):=q_back(q);
    if a>triec_max then triec_max:=a;
  end;
  triec_base_used(b):=true;
  firstc_fit:=b
end;

@ The threshold for attempting a first-fit packing is 3 transitions, which
is lower than for the pattern trie because speed is more important here.

@<Set |b| to the count trie base location...@>=
if qmax>3 then a:=triec_back(triec_max+1) @+else a:=0;
loop  begin a:=triec_link(a); b:=a-q_char(1);@/
  @<Ensure |triec| linked up to |b+num_ASCII_codes|@>;
  if triec_base_used(b) then goto not_found;
  for q:=qmax downto 2 do
  if triec_char(b+q_char(q))<>min_packed then goto not_found;
  goto found;
  not_found: end;
found:

@ @<Ensure |triec| linked up to |b+num_ASCII_codes|@>=
if b>triec_kmax-num_ASCII_codes then
  begin if triec_kmax=triec_size then
    overflow(triec_size:1,' count trie nodes');
  print(triec_kmax div 1024:1, 'K ');
  if triec_kmax>triec_size-4096 then triec_kmax:=triec_size
    else Incr(triec_kmax)(4096);
  end;
while triec_bmax<b do
  begin incr(triec_bmax);
  triec_base_used(triec_bmax):=false;
  triec_char(triec_bmax+last_ASCII_code):=min_packed;
  triec_link(triec_bmax+last_ASCII_code):=triec_bmax+num_ASCII_codes;
  triec_back(triec_bmax+num_ASCII_codes):=triec_bmax+last_ASCII_code;
  end

@ [See |unpack|.]

@p procedure unpackc(@!b: triec_pointer);
var c: internal_code; @!a: triec_pointer;
begin  qmax:=1;
for c:=cmin to cmax do {search for transitions belonging to this state}
  begin  a:=b+c;
  if so(triec_char(a))=c then {found one}
    begin q_char(qmax):=c;
    q_link(qmax):=triec_link(a);
    q_back(qmax):=triec_back(a);
    incr(qmax);@/
    triec_back(triec_link(0)):=a;
    triec_link(a):=triec_link(0);
    triec_link(0):=a; triec_back(a):=0;
    triec_char(a):=min_packed;
    end;
  end;
triec_base_used(b):=false;
end;

@ [See |insert_pattern|.]  Patterns being inserted into the count trie are
always substrings of the current word, so they are contained in the array
|word| with length |pat_len| and finishing position |fpos|.

@p function insertc_pat(@!fpos: word_index): triec_pointer;
var spos: word_index; @!a, @!b: triec_pointer;
begin  spos:=fpos-pat_len; {starting position of pattern}
  incr(spos); b:=triec_root+word[spos]; a:=triec_link(b);
  while (a>0) and (spos<fpos) do {follow existing trie}
  begin  incr(spos); Incr(a)(word[spos]);
    if so(triec_char(a))<>word[spos] then
    @<Insert critical count transition, possibly repacking@>;
    b:=a; a:=triec_link(a);
  end;
  q_link(1):=0; q_back(1):=0; qmax:=1;
  while spos<fpos do {insert rest of pattern}
  begin  incr(spos); q_char(1):=word[spos];
    a:=firstc_fit;
    triec_link(b):=a;
    b:=a+word[spos];
    incr(triec_count);
  end;
  insertc_pat:=b;
  incr(pat_count);
end;

@ @<Insert critical count transition, possibly repacking@>=
begin  if triec_char(a)=min_packed then {lucky}
  begin  triec_link(triec_back(a)):=triec_link(a);
  triec_back(triec_link(a)):=triec_back(a);
  triec_char(a):=si(word[spos]);
  triec_link(a):=0;
  triec_back(a):=0;
  if a>triec_max then triec_max:=a;
  end
else  begin  {have to repack}
  unpackc(a-word[spos]);@/
  q_char(qmax):=word[spos];
  q_link(qmax):=0;
  q_back(qmax):=0;
  a:=firstc_fit;
  triec_link(b):=a;
  Incr(a)(word[spos]);
  end;
incr(triec_count);
end

@* Input and output.
For some \PASCAL\ systems output files must be closed before the program
terminates; it may also be necessary to close input files. Since
standard \PASCAL\ does not provide for this, we use \.{WEB} macros and
will say |close_out(f)| resp.\ |close_in(f)|; these macros should not
produce errors or system messages, even if a file could not be opened
successfully.
@^system dependencies@>

@d close_out(#)==close(#) {close an output file}
@d close_in(#)==do_nothing {close an input file}

@<Globals...@>=
@!dictionary, @!patterns, @!translate, @!patout, @!pattmp: text_file;

@ When reading a line from one of the input files (|dictionary|,
|patterns|, or |translate|) the characters read from that line (padded
with blanks if necessary) are to be placed into the |buf| array. Reading
lines from the |dictionary| file should be as efficient as possible
since this is part of \.{PATGEN}'s ``inner loop''. Standard \PASCAL,
unfortunately, does not provide for this; consequently the \.{WEB} macro
|read_buf| defined below should be optimized if possible. For many
\PASCAL's this can be done with |read_ln(f,buf)| where |buf| is declared
as \PASCAL\ string (i.e., as \&{packed} \&{array} |[1..any]| \&{of}
|char|), for others a string type with dynamic length can be used.
@^inner loop@>@^system dependencies@>

@d read_buf(#)== {reads a line from input file |#| into |buf| array}
  begin buf_ptr:=0;
  while not eoln(#) do
    begin if (buf_ptr>=max_buf_len) then bad_input('Line too long');
@.Line too long@>
    incr(buf_ptr); read(#,buf[buf_ptr]);
    end;
  read_ln(#);
  while buf_ptr<max_buf_len do
    begin incr(buf_ptr); buf[buf_ptr]:=' ';
    end;
  end

@<Globals...@>=
@!buf: array[1..max_buf_len] of text_char; {array to hold lines of input}
@!buf_ptr: 0..max_buf_len; {index into |buf|}

@ When an error is caused by bad input data we say |bad_input(#)| in
order to disply the contents of the |buf| array before terminating with
an error message.

@d print_buf== {print contents of |buf| array}
  begin buf_ptr:=0;
  repeat incr(buf_ptr); print(buf[buf_ptr]);
  until buf_ptr=max_buf_len;
  print_ln(' ');
  end
@d bad_input(#)==begin print_buf; error(#); end

@ The |translate| file may specify the values of \.{\\lefthyphenmin} and
\.{\\righthyphenmin} as well as the external representation and
collating sequence of the `letters' used by the language. In addition
replacements may be specified for the characters |'-'|, |'*'|, and |'.'|
representing hyphens in the word list. If the |translate| file is empty
(or does not exist) default values will be used.

@p procedure read_translate;
label done;
var c: text_char;
@!n: integer;
@!j: ASCII_code;
@!bad: boolean;
@!lower: boolean;
@!i: dot_type; @!s, @!t: trie_pointer;
begin imax:=edge_of_word;
reset(translate);
if eof(translate) then
  @<Set up default character translation tables@>
else  begin read_buf(translate); @<Set up hyphenation data@>;
  cmax:=last_ASCII_code-1;
  while not eof(translate) do @<Set up representation(s) for a letter@>;
  end;
close_in(translate);
print_ln('left_hyphen_min = ',left_hyphen_min:1,
         ', right_hyphen_min = ',right_hyphen_min:1,
         ', ',imax-edge_of_word:1,' letters');
cmax:=imax;
end;

@ @<Globals...@>=
@!imax: internal_code; {largest |internal_code| assigned so far}
@!left_hyphen_min, @!right_hyphen_min: dot_type;

@ @<Set up default...@>=
begin left_hyphen_min:=2; right_hyphen_min:=3;
for j:="A" to "Z" do
  begin incr(imax);
  c:=xchr[j+"a"-"A"]; xclass[c]:=letter_class; xint[c]:=imax;
  xext[imax]:=c;
  c:=xchr[j]; xclass[c]:=letter_class; xint[c]:=imax;
  end;
end

@ The first line of the |translate| file must contain the values
of \.{\\lefthyphenmin} and \.{\\righthyphenmin} in columns 1--2 and
3--4. In addition columns~5, 6, and~7 may (optionally) contain
replacements for the default characters |'.'|, |'-'|, and |'*'|
respectively, representing hyphens in the word list.
If the values specified for \.{\\lefthyphenmin} and \.{\\righthyphenmin}
are invalid (e.g., blank) new values are read from the terminal.

@<Set up hyphenation...@>=
bad:=false;
if buf[1]=' ' then n:=0
else if xclass[buf[1]]=digit_class then n:=xint[buf[1]]@+
else bad:=true;
if xclass[buf[2]]=digit_class then n:=10*n+xint[buf[2]]@+
else bad:=true;
if (n>=1)and(n<max_dot) then left_hyphen_min:=n@+else bad:=true;
if buf[3]=' ' then n:=0
else if xclass[buf[3]]=digit_class then n:=xint[buf[3]]@+
else bad:=true;
if xclass[buf[4]]=digit_class then n:=10*n+xint[buf[4]]@+
else bad:=true;
if (n>=1)and(n<max_dot) then right_hyphen_min:=n@+
else bad:=true;
if bad then
  begin bad:=false;
  repeat print('left_hyphen_min, right_hyphen_min: '); get_input(n1,n2);@/
    if (n1>=1)and(n1<max_dot)and(n2>=1)and(n2<max_dot) then
      begin left_hyphen_min:=n1; right_hyphen_min:=n2;
      end
    else  begin n1:=0;
      print_ln('Specify 1<=left_hyphen_min,right_hyphen_min<=',
                max_dot-1:1,' !');
      end;
  until n1>0;
  end;
for j:=err_hyf to found_hyf do
  begin if buf[j+4]<>' ' then xhyf[j]:=buf[j+4];
  if xclass[xhyf[j]]=invalid_class then xclass[xhyf[j]]:=hyf_class@+
  else bad:=true;
  end;
xclass['.']:=hyf_class; {in case the default has been changed}
if bad then bad_input('Bad hyphenation data')
@.Bad hyphenation data@>

@ Each following line is either a comment or specifies the external
representations for one `letter' used by the language. Comment lines
start with two equal characters (e.g., are blank) and are ignored.
Other lines contain the external representation of the lower case
version and an arbitrary number of `upper case versions' of a letter
preceded and separated by a delimiter and followed by two consecutive
delimiters; the delimiter may be any character not occuring in either
version.

@<Set up repres...@>=
begin read_buf(translate); buf_ptr:=1; lower:=true;
while not bad do {lower and then upper case version}
  begin pat_len:=0;
  repeat if buf_ptr<max_buf_len then incr(buf_ptr) @+ else bad:=true;
    if buf[buf_ptr]=buf[1] then
      if pat_len=0 then goto done
      else  begin if lower then
          begin if imax=last_ASCII_code then
            begin print_buf; overflow(num_ASCII_codes:1,' letters');
            end;
          incr(imax); xext[imax]:=xchr[pat[pat_len]];
          end;
        c:=xchr[pat[1]];
        if pat_len=1 then
          begin if xclass[c]<>invalid_class then bad:=true;
          xclass[c]:=letter_class; xint[c]:=imax;
          end
        else @<Insert a letter into pattern trie@>;
        end
    else if pat_len=max_dot then bad:=true
    else  begin incr(pat_len); pat[pat_len]:=get_ASCII(buf[buf_ptr]);
      end;
  until (buf[buf_ptr]=buf[1])or bad;
  lower:=false;
  end;
done: if bad then bad_input('Bad representation');
@.Bad representation@>
end

@ When the (lower or upper case) external representation of a letter
consists of more than one character and the corresponding |ASCII_code|
values have been placed into the |pat| array we store them in
the pattern trie.  [See |insert_pattern|.] Since this `external subtrie'
starts at |trie_link(trie_root)| it does not interfere with normal
patterns. The output field of leaf nodes contains the |internal_code|
and the link field distinguishes between lower and upper case letters.

@<Insert a letter...@>=
begin if xclass[c]=invalid_class then xclass[c]:=escape_class;
if xclass[c]<>escape_class then bad:=true;
i:=0; s:=trie_root; t:=trie_link(s);
while (t>trie_root) and (i<pat_len) do {follow existing trie}
  begin  incr(i); Incr(t)(pat[i]);
  if so(trie_char(t))<>pat[i] then
    @<Insert critical transition, possibly repacking@>
  else if trie_outp(t)>0 then bad:=true;
  s:=t; t:=trie_link(s);
  end;
if t>trie_root then bad:=true;
q_link(1):=0; q_outp(1):=0; qmax:=1;
while i<pat_len do {insert rest of pattern}
  begin  incr(i); q_char(1):=pat[i];
  t:=first_fit;
  trie_link(s):=t;
  s:=t+pat[i];
  incr(trie_count);
  end;
trie_outp(s):=imax;
if not lower then trie_link(s):=trie_root;
end

@ The |get_letter| \.{WEB} macro defined here will be used in
|read_word| and |read_patterns| to obtain the |internal_code|
corresponding to a letter externally represented by a multi-character
sequence (starting with an |escape_class| character).

@d get_letter(#)==
  begin t:=trie_root;
  loop  begin t:=trie_link(t)+xord[c];
    if so(trie_char(t))<>xord[c] then bad_input('Bad representation');
@.Bad representation@>
    if trie_outp(t)<>0 then
      begin #:=trie_outp(t); goto done;
      end;
    if buf_ptr=max_buf_len then c:=' '
    else  begin incr(buf_ptr); c:=buf[buf_ptr];
      end;
    end;
  done: end

@ In order to prepare for the output phase we store all but the last of
the \\{ASCII\_codes} of the external representation of each `lower case
letter' in the pattern count trie which is no longer used at that time.
The recursive |find_letters| procedure traverses the `external subtrie'.

@p procedure find_letters(@!b: trie_pointer; @!i: dot_type);@/
  {traverse subtries of family |b|; |i| is current depth in trie}
var c: ASCII_code; {a local variable that must be saved on recursive calls}
@!a: trie_pointer; {does not need to be saved}
@!j: dot_type; {loop index}
@!l: triec_pointer;
begin if i=1 then init_count_trie;
for c:=cmin to last_ASCII_code do {find transitions belonging to this family}
  begin a:=b+c;
  if so(trie_char(a))=c then {found one}
    begin pat[i]:=c;
    if trie_outp(a)=0 then find_letters(trie_link(a),i+1)
    else if trie_link(a)=0 then {this is a lower case letter}
      @<Insert external representation for a letter into count trie@>;
    end;
  end;
end;

@ Starting from |triec_root+trie_outp(a)| we proceed through link fields
and store all \\{ASCII\_codes} except the last one in the count trie;
the last character has already been stored in the |xext| array.

@<Insert external...@>=
begin l:=triec_root+trie_outp(a);
for j:=1 to i-1 do
  begin if triec_max=triec_size then
    overflow(triec_size:1,' count trie nodes');
  incr(triec_max); triec_link(l):=triec_max; l:=triec_max;
  triec_char(l):=si(pat[j]);
  end;
triec_link(l):=0;
end

@ During the output phase we will say |write_letter(i)(f)| and
|write(f,xext[i])| to write the lower case external representation of
the letter with internal code |i| to file |f|: |xext[i]| is the last
character of the external representation whereas the \.{WEB} macro
|write_letter| defined here writes all preceding characters (if any).

@d write_letter_end(#)==while l>0 do
  begin write(#,xchr[so(triec_char(l))]); l:=triec_link(l);
  end
@d write_letter(#)==l:=triec_link(triec_root+#); write_letter_end

@* Routines for traversing pattern tries.
At the end of a pass, we traverse the count trie using the following
recursive procedure, selecting good and bad patterns and inserting them
into the pattern trie.

@p procedure traverse_count_trie(@!b: triec_pointer; @!i: dot_type);@/
{traverse subtries of family |b|; |i| is current depth in trie}
var c: internal_code; {a local variable that must be saved on recursive calls}
    @!a: triec_pointer; {does not need to be saved}
begin
for c:=cmin to cmax do {find transitions belonging to this family}
  begin  a:=b+c;
  if so(triec_char(a))=c then {found one}
    begin  pat[i]:=c;
    if i<pat_len then traverse_count_trie(triec_link(a),i+1)
    else @<Decide what to do with this pattern@>;
    end;
  end;
end;

@ When we have come to the end of a pattern, |triec_good(a)| and
|triec_bad(a)| contain the number of times this pattern helps or hinders
the cause.  We use the counts to determine if this pattern should be
selected, or if it is hopeless, or if we can't decide yet.  In the latter
case, we set |more_to_come| true to indicate that there might still be
good patterns extending the current type of patterns.

@<Decide what to do...@>=
if good_wt*triec_good(a)<thresh then {hopeless pattern}
begin   insert_pattern(max_val,pat_dot);
  incr(bad_pat_count)
end else
if good_wt*triec_good(a)-bad_wt*triec_bad(a)>=thresh then {good pattern}
begin   insert_pattern(hyph_level,pat_dot);
  incr(good_pat_count);
  Incr(good_count)(triec_good(a));
  Incr(bad_count)(triec_bad(a));
end else
  more_to_come:=true

@ Some global variables are used to accumulate statistics about the
performance of a pass.

@<Globals...@>=
@!good_pat_count, @!bad_pat_count: integer; {number of patterns added at end
  of pass}
@!good_count, @!bad_count, @!miss_count: integer; {hyphen counts}
@!level_pattern_count: integer; {number of good patterns at level}
@!more_to_come: boolean;

@ The recursion in |traverse_count_trie| is initiated by the following
procedure, which also prints some statistics about the patterns chosen.
The ``efficiency'' is an estimate of pattern effectiveness.

@d bad_eff==(thresh/good_wt)

@p procedure collect_count_trie;
begin  good_pat_count:=0; bad_pat_count:=0;
  good_count:=0; bad_count:=0;
  more_to_come:=false;
  traverse_count_trie(triec_root,1); @/
  print(good_pat_count:1,' good and ',
    bad_pat_count:1,' bad patterns added');
  Incr(level_pattern_count)(good_pat_count);
  if more_to_come then print_ln(' (more to come)') @+else print_ln(' ');
  print('finding ',good_count:1,' good and ',bad_count:1,' bad hyphens');
  if good_pat_count>0 then
  print_ln(', efficiency = ',
    good_count/(good_pat_count+bad_count/bad_eff):1:2)
  else print_ln(' ');
  print_ln('pattern trie has ',trie_count:1,' nodes, ',@|
    'trie_max = ',trie_max:1,', ',op_count:1,' outputs');
end;

@ At the end of a level, we traverse the pattern trie and delete bad
patterns by removing their outputs.  If no output remains, the node is
also deleted.

@p function delete_patterns(@!s: trie_pointer): trie_pointer;@/
{delete bad patterns in subtrie |s|, return 0 if entire subtrie freed,
  otherwise |s|}
var c: internal_code; @!t: trie_pointer; @!all_freed: boolean;
  {must be saved on recursive calls}
    @!h, @!n: op_type; {do not need to be saved}
begin  all_freed:=true;
  for c:=cmin to cmax do {find transitions belonging to this family}
  begin  t:=s+c;
    if so(trie_char(t))=c then
    begin  @<Link around bad outputs@>;
      if trie_link(t)>0 then
        trie_link(t):=delete_patterns(trie_link(t));
      if (trie_link(t)>0) or (trie_outp(t)>0) or (s=trie_root) then
        all_freed:=false
      else
        @<Deallocate this node@>;
    end;
  end;
  if all_freed then {entire state is freed}
  begin  trie_base_used(s):=false;
    s:=0;
  end;
  delete_patterns:=s;
end;

@ @<Link around bad outputs@>=
begin  h:=0;
  hyf_nxt(0):=trie_outp(t);
  n:=hyf_nxt(0);
  while n>0 do
  begin  if hyf_val(n)=max_val then hyf_nxt(h):=hyf_nxt(n)
      else h:=n;
    n:=hyf_nxt(h);
  end;
  trie_outp(t):=hyf_nxt(0);
end

@ Cells freed by |delete_patterns| are put at the end of the free list.

@<Deallocate this node@>=
begin  trie_link(trie_back(trie_max+1)):=t;
  trie_back(t):=trie_back(trie_max+1);
  trie_link(t):=trie_max+1;
  trie_back(trie_max+1):=t;
  trie_char(t):=min_packed;@/
  decr(trie_count);
end

@ The recursion in |delete_patterns| is initiated by the following
procedure, which also prints statistics about the number of nodes deleted,
and zeros bad outputs in the hash table.  Note that the hash table may
become somewhat disorganized when more levels are added, but this defect
isn't serious.

@p procedure delete_bad_patterns;
var old_op_count: op_type;
    @!old_trie_count: trie_pointer;
    @!t: trie_pointer; @!h: op_type;
begin  old_op_count:=op_count;
  old_trie_count:=trie_count;@/
  t:=delete_patterns(trie_root);
  for h:=1 to max_ops do
  if hyf_val(h)=max_val then
  begin  hyf_val(h):=0; decr(op_count);
  end;
  print_ln(old_trie_count-trie_count:1,' nodes and ',@|
    old_op_count-op_count:1,' outputs deleted');
  qmax_thresh:=7; {pattern trie will be sparser because of deleted
    patterns}
end;

@ After all patterns have been generated, we will traverse the pattern
trie and output all patterns.  Note that if a pattern appears more than
once, only the maximum value at each position will be output.

@p procedure output_patterns(@!s: trie_pointer; @!pat_len: dot_type);@/
{output patterns in subtrie |s|; |pat_len| is current depth in trie}
var c: internal_code; {must be saved on recursive calls}
@!t: trie_pointer; @!h: op_type; @!d: dot_type;@/
@!l: triec_pointer; {for |write_letter|}
begin for c:=cmin to cmax do
  begin t:=s+c;
  if so(trie_char(t))=c then
    begin pat[pat_len]:=c;
    h:=trie_outp(t);
    if h>0 then @<Output this pattern@>;
    if trie_link(t)>0 then output_patterns(trie_link(t),pat_len+1);
    end;
  end;
end;

@ @<Output this pattern@>=
begin  for d:=0 to pat_len do hval[d]:=0;
  repeat  d:=hyf_dot(h);
    if hval[d]<hyf_val(h) then hval[d]:=hyf_val(h);
    h:=hyf_nxt(h);
  until h=0;
  if hval[0]>0 then write(patout,xdig[hval[0]]);
  for d:=1 to pat_len do
  begin  write_letter(pat[d])(patout); write(patout,xext[pat[d]]);
    if hval[d]>0 then write(patout,xdig[hval[d]]);
  end;
  write_ln(patout);
end

@* Dictionary processing routines.
The procedures in this section are the ``inner loop'' of the pattern
generation process.  To speed the program up, key parts of these routines
could be coded in machine language.
@^inner loop@>

@<Globals...@>=
@!word: array[word_index] of internal_code; {current word}
@!dots: array[word_index] of hyf_type; {current hyphens}
@!dotw: array[word_index] of digit; {dot weights}
@!hval: array[word_index] of val_type; {hyphenation values}
@!no_more: array[word_index] of boolean; {positions `knocked out'}
@!wlen: word_index; {length of current word}
@!word_wt: digit; {global word weight}
@!wt_chg: boolean; {indicates |word_wt| has changed}

@ The words in the |dictionary| consist of the `letters' used by the
language. ``Dots'' between letters can be one of four possibilities:
|'-'| indicating a hyphen, |'*'| indicating a found hyphen, |'.'|
indicating an error, or nothing; these are represented internally by the
four values |is_hyf|, |found_hyf|, |err_hyf|, and |no_hyf| respectively.
When reading a word we will, however, convert |err_hyf| into |no_hyf|
and |found_hyf| into |is_hyf| thus ignoring whether a hyphen has or
has not been found by a previous set of patterns.

@<Prepare to read dictionary@>=
xclass['.']:=invalid_class; {in case the default has been changed}
xclass[xhyf[err_hyf]]:=hyf_class; xint[xhyf[err_hyf]]:=no_hyf;
xclass[xhyf[is_hyf]]:=hyf_class; xint[xhyf[is_hyf]]:=is_hyf;
xclass[xhyf[found_hyf]]:=hyf_class; xint[xhyf[found_hyf]]:=is_hyf;

@ Furthermore single-digit word weights are allowed.  A digit at
the beginning of a word indicates a global word weight that is to be
applied to all following words (until the next global word weight).  A
digit at some intercharacter position indicates a weight for that position
only.

The |read_word| procedure scans a line of input representing a word,
and places the letters into the array |word|, with |word[1]=word[wlen]=
edge_of_word|.  The dot appearing between |word[dpos]| and |word[dpos+1]|
is placed in |dots[dpos]|, and the corresponding dot weight in
|dotw[dpos]|.

@p procedure read_word;
label done, found;
var c: text_char;
@!t: trie_pointer;
begin read_buf(dictionary);
word[1]:=edge_of_word; wlen:=1; buf_ptr:=0;
repeat incr(buf_ptr); c:=buf[buf_ptr];
  case xclass[c] of
  space_class: goto found;
  digit_class:
    if wlen=1 then {global word weight}
      begin if xint[c]<>word_wt then wt_chg:=true;
      word_wt:=xint[c];
      end
    else dotw[wlen]:=xint[c]; {dot weight}
  hyf_class: dots[wlen]:=xint[c]; {record the dot |c|}
  letter_class: {record the letter |c|}
    begin incr(wlen);
    if wlen=max_len then
      begin print_buf; overflow('word length=',max_len:1);
      end;
    word[wlen]:=xint[c]; dots[wlen]:=no_hyf; dotw[wlen]:=word_wt;
    end;
  escape_class: {record a multi-character sequence starting with |c|}
    begin incr(wlen);
    if wlen=max_len then
      begin print_buf; overflow('word length=',max_len:1);
      end;
    get_letter(word[wlen]); dots[wlen]:=no_hyf; dotw[wlen]:=word_wt;
    end;
  invalid_class: bad_input('Bad character');
@.Bad character@>
  end;
until buf_ptr=max_buf_len;
found: incr(wlen); word[wlen]:=edge_of_word;
end;

@ Here is a procedure that uses the existing patterns to hyphenate the
current word.  The hyphenation value applying between the characters
|word[dpos]| and |word[dpos+1]| is stored in |hval[dpos]|.

In addition, |no_more[dpos]| is set to |true| if this position is
``knocked out'' by either a good or bad pattern at this level.  That is,
if the pattern with current length and hyphen position is a superstring of
either a good or bad pattern at this level, then we don't need to collect
count statistics for the pattern because it can't possibly be chosen in
this pass.  Thus we don't even need to insert such patterns into the count
trie, which saves a good deal of space.

@p procedure hyphenate;
label done;
var spos, @!dpos, @!fpos: word_index;
    @!t: trie_pointer; @!h: op_type; @!v: val_type;
begin
for spos:=wlen-hyf_max downto 0 do
  begin  no_more[spos]:=false; hval[spos]:=0;
    fpos:=spos+1; t:=trie_root+word[fpos];
    repeat  h:=trie_outp(t);
      while h>0 do @<Store output |h| in the |hval| and
        |no_more| arrays, and advance |h|@>;
      t:=trie_link(t);
      if t=0 then goto done;
      incr(fpos); Incr(t)(word[fpos]);
    until so(trie_char(t))<>word[fpos];
  done:
  end;
end;

@ In order to avoid unnecessary test (and range check violations) the
globals |hyf_min|, |hyf_max|, and |hyf_len| are set up such that only
positions from |hyf_min| up to |wlen-hyf_max| of the |word| array need
to be checked, and that words with |wlen<hyf_len| need not to be checked
at all.

@<Globals...@>=
@!hyf_min, @!hyf_max, @!hyf_len: word_index; {limits for legal hyphens}

@ @<Prepare to read dictionary@>=
hyf_min:=left_hyphen_min+1; hyf_max:=right_hyphen_min+1;
hyf_len:=hyf_min+hyf_max;

@ @<Store output |h| in the |hval| and |no_more| arrays, and advance |h|@>=
begin  dpos:=spos+hyf_dot(h);
  v:=hyf_val(h);
  if (v<max_val) and (hval[dpos]<v) then hval[dpos]:=v;
  if (v>=hyph_level) then {check if position knocked out}
  if ((fpos-pat_len)<=(dpos-pat_dot))and((dpos-pat_dot)<=spos) then
    no_more[dpos]:=true;
  h:=hyf_nxt(h);
end

@ The |change_dots| procedure updates the |dots| array representing the
printing values of the hyphens.  Initially, hyphens (and correctly
found hyphens) in the word list are represented by |is_hyf| whereas
non-hyphen positions (and erroneous hyphens) are represented by |no_hyf|. A
Here these values are increased by one for each hyphen found by the
current patterns, thus changing |no_hyf| into |err_hyf| and |is_hyf|
into |found_hyf|. The routine also collects statistics about the number
of good, bad, and missed hyphens.

@d incr_wt(#)==Incr(#)(dotw[dpos])

@p procedure change_dots;
var dpos: word_index;
begin  for dpos:=wlen-hyf_max downto hyf_min do
  begin if odd(hval[dpos]) then incr(dots[dpos]);
  if dots[dpos]=found_hyf then incr_wt(good_count)
  else if dots[dpos]=err_hyf then incr_wt(bad_count)
  else if dots[dpos]=is_hyf then incr_wt(miss_count);
  end;
end;

@ The following procedure outputs the word as hyphenated by the current
patterns, including any word weights. Hyphens inhibited by the values of
\.{\\lefthyphenmin} and \.{\\righthyphenmin} are output as well.

@p procedure output_hyphenated_word;
var dpos: word_index;@/
@!l: triec_pointer; {for |write_letter|}
begin if wt_chg then {output global word weight}
  begin write(pattmp,xdig[word_wt]); wt_chg:=false
  end;
for dpos:=2 to wlen-2 do
  begin write_letter(word[dpos])(pattmp); write(pattmp,xext[word[dpos]]);
  if dots[dpos]<>no_hyf then write(pattmp,xhyf[dots[dpos]]);
  if dotw[dpos]<>word_wt then write(pattmp,xdig[dotw[dpos]]);
  end;
write_letter(word[wlen-1])(pattmp); write_ln(pattmp,xext[word[wlen-1]]);
end;

@ For each dot position in the current word, the |do_word| routine first
checks to see if we need to consider it.  It might be knocked out or a dot
we don't care about.  That is, when considering hyphenating patterns, for
example, we don't need to count hyphens already found.  If a relevant dot
is found, we increment the count in the count trie for the corresponding
pattern, inserting it first if necessary.  At this point of the program
range check violations may occur if these counts are incremented beyond
|triec_max|; it would, however, be too expensive to prevent this.
@^range check violations@>

@p procedure do_word;
label continue, done;
var spos, @!dpos, @!fpos: word_index; @!a: triec_pointer;
    @!goodp: boolean;
begin  for dpos:=wlen-dot_max downto dot_min do
  begin  spos:=dpos-pat_dot;
    fpos:=spos+pat_len;
    @<Check this dot position and |goto continue| if don't care@>;
    incr(spos); a:=triec_root+word[spos];
    while spos<fpos do
    begin  {follow existing count trie}
      incr(spos);
      a:=triec_link(a)+word[spos];
      if so(triec_char(a))<>word[spos] then
      begin   {insert new count pattern}
        a:=insertc_pat(fpos);
        goto done;
      end;
    end;
  done:  if goodp then incr_wt(triec_good(a))
    @+else incr_wt(triec_bad(a));
  continue:
  end;
end;

@ The globals |good_dot| and |bad_dot| will be set to |is_hyf| and
|no_hyf|, or |err_hyf| and |found_hyf|, depending on whether the current
level is odd or even, respectively. The globals |dot_min|, |dot_max|,
and |dot_len| are analogous to |hyf_min|, |hyf_max|, and |hyf_len|
defined earlier.

@<Globals...@>=
@!good_dot, @!bad_dot: hyf_type; {good and bad hyphens at current level}
@!dot_min, @!dot_max, @!dot_len: word_index; {limits for legal dots}

@ @<Prepare to read dictionary@>=
if procesp then
  begin dot_min:=pat_dot; dot_max:=pat_len-pat_dot;
  if dot_min<hyf_min then dot_min:=hyf_min;
  if dot_max<hyf_max then dot_max:=hyf_max;
  dot_len:=dot_min+dot_max;
  if odd(hyph_level) then
    begin good_dot:=is_hyf; bad_dot:=no_hyf;
    end
  else begin good_dot:=err_hyf; bad_dot:=found_hyf;
    end;
  end;

@ If the dot position |dpos| is out of bounds, knocked out, or a ``don't
care'', we skip this position.  Otherwise we set the flag |goodp|
indicating whether this is a good or bad dot.

@<Check this dot position...@>=
if no_more[dpos] then goto continue;
if dots[dpos]=good_dot then goodp:=true else
if dots[dpos]=bad_dot then goodp:=false else goto continue;

@ If |hyphp| is set to |true|, |do_dictionary| will write out a copy of
the dictionary as hyphenated by the current set of patterns.  If |procesp|
is set to |true|, |do_dictionary| will collect pattern statistics for
patterns with length |pat_len| and hyphen position |pat_dot|, at level
|hyph_level|.

@<Globals...@>=
@!procesp, @!hyphp: boolean;
@!pat_dot: dot_type; {hyphen position, measured from beginning of pattern}
@!hyph_level: val_type; {hyphenation level}
@!filnam: packed array[1..8] of char; {for |pattmp|}

@ The following procedure makes a pass through the word list, and also
prints out statistics about number of hyphens found and storage used by
the count trie.

@p procedure do_dictionary;
begin  good_count:=0; bad_count:=0; miss_count:=0;
  word_wt:=1; wt_chg:=false;
  reset(dictionary);@/
  @<Prepare to read dictionary@>@;@/
  if procesp then
  begin  init_count_trie;
    print_ln('processing dictionary with pat_len = ',pat_len:1,
      ', pat_dot = ',pat_dot:1);
  end;
  if hyphp then
    begin filnam:='pattmp. ';
    filnam[8]:=xdig[hyph_level];
    rewrite(pattmp,filnam);
    print_ln('writing pattmp.', xdig[hyph_level]);
    end;
  @<Process words until end of file@>;@/
  close_in(dictionary);@/
  print_ln(' ');
  print_ln(good_count:1,' good, ',bad_count:1,' bad, ',
    miss_count:1,' missed');
  if (good_count+miss_count)>0 then
    print_ln((100*good_count/(good_count+miss_count)):1:2,' %, ',
      (100*bad_count/(good_count+miss_count)):1:2,' %, ',
      (100*miss_count/(good_count+miss_count)):1:2,' %');
  if procesp then
    print_ln(pat_count:1,' patterns, ',triec_count:1,
      ' nodes in count trie, ','triec_max = ',triec_max:1);
  if hyphp then close_out(pattmp);
end;

@ @<Process words...@>=
while not eof(dictionary) do
  begin read_word;
  if wlen>=hyf_len then {short words are never hyphenated}
    begin hyphenate; change_dots;
    end;
  if hyphp then if wlen>2 then output_hyphenated_word;
     {empty words are ignored}
  if procesp then if wlen>=dot_len then do_word;
  end

@* Reading patterns.
Before beginning a run, we can read in a file of existing patterns.  This
is useful for extending a previous pattern selection run to get some more
levels.  (Since these runs are quite time-consuming, it is convenient to
choose patterns one level at a time, pausing to look at the results of the
previous level, and possibly amending the dictionary.)

@p procedure read_patterns;
label done, found;
var c: text_char;
@!d: digit;
@!i: dot_type;
@!t: trie_pointer;
begin xclass['.']:=letter_class; xint['.']:=edge_of_word;
level_pattern_count:=0; max_pat:=0;
reset(patterns);
while not eof(patterns) do
  begin  read_buf(patterns);
  incr(level_pattern_count);@/
  @<Get pattern and dots and |goto found|@>;
  found: @<Insert pattern@>;
  end;
close_in(patterns);
print_ln(level_pattern_count:1,' patterns read in');@/
print_ln('pattern trie has ',trie_count:1,' nodes, ',@|
  'trie_max = ',trie_max:1,', ',op_count:1,' outputs');
end;

@ The global variable |max_pat| keeps track of the largest hyphenation
value found in any pattern.

@<Globals...@>=
@!max_pat: val_type;

@ When a new pattern has been input into |buf|, we extract the letters of
the pattern, and insert the hyphenation values (digits) into the |hval|
array.

@<Get pattern...@>=
pat_len:=0; buf_ptr:=0; hval[0]:=0;
repeat incr(buf_ptr); c:=buf[buf_ptr];
  case xclass[c] of
  space_class: goto found;
  digit_class:
    begin d:=xint[c];
    if d>=max_val then bad_input('Bad hyphenation value');
@.Bad hyphenation value@>
    if d>max_pat then max_pat:=d;
    hval[pat_len]:=d;
    end;
  letter_class:
    begin incr(pat_len); hval[pat_len]:=0; pat[pat_len]:=xint[c];
    end;
  escape_class: {record a multi-character sequence starting with |c|}
    begin incr(pat_len); hval[pat_len]:=0; get_letter(pat[pat_len]);
    end;
  hyf_class, invalid_class: bad_input('Bad character');
@.Bad character@>
  end;
until buf_ptr=max_buf_len

@ Then we insert the pattern for each non-vanishing hyphenation value.
In addition we check that |edge_of_word| (i.e., |'.'|) occurs only as
first or last character; otherwise we would have to perform a time
consuming test for the end of a word in the |hyphenate| procedure.

@<Insert pattern@>=
if pat_len>0 then {avoid spurious patterns}
  for i:=0 to pat_len do
    begin if hval[i]<>0 then insert_pattern(hval[i],i);
    if i>1 then if i<pat_len then if pat[i]=edge_of_word then
      bad_input('Bad edge_of_word');
@.Bad edge_of_word@>
    end

@* The main program.
This is where \.{PATGEN} actually starts.  We initialize the pattern trie,
get |hyph_level| and |pat_len| limits from the terminal, and generate
patterns.

@p begin initialize;
init_pattern_trie;
read_translate;
read_patterns;
procesp:=true; hyphp:=false;@/
repeat print('hyph_start, hyph_finish: '); get_input(n1,n2);@/
  if (n1>=1)and(n1<max_val)and(n2>=1)and(n2<max_val) then
    begin hyph_start:=n1; hyph_finish:=n2;
    end
  else  begin n1:=0;
    print_ln('Specify 1<=hyph_start,hyph_finish<=',max_val-1:1,' !');
    end;
until n1>0;
hyph_level:=max_pat; {in case |hyph_finish<hyph_start|}
for i:=hyph_start to hyph_finish do
  begin  hyph_level:=i; level_pattern_count:=0;
  if hyph_level>hyph_start then print_ln(' ')
  else if hyph_start<=max_pat then
    print_ln('Largest hyphenation value ',max_pat:1,
             ' in patterns should be less than hyph_start');@/
@.Largest hyphenation value@>
  repeat print('pat_start, pat_finish: '); get_input(n1,n2);@/
    if (n1>=1)and(n1<=n2)and(n2<=max_dot) then
      begin pat_start:=n1; pat_finish:=n2;
      end
    else  begin n1:=0;
      print_ln('Specify 1<=pat_start<=pat_finish<=',max_dot:1,' !');
      end;
  until n1>0;
  repeat print('good weight, bad weight, threshold: ');
    get_input(n1,n2,n3);@/
    if (n1>=1)and(n2>=1)and(n3>=1) then
      begin good_wt:=n1; bad_wt:=n2; thresh:=n3;
      end
    else  begin n1:=0;
      print_ln('Specify good weight, bad weight, threshold>=1 !');
      end;
  until n1>0;
  @<Generate a level@>;
  delete_bad_patterns;
  print_ln('total of ',level_pattern_count:1,
    ' patterns at hyph_level ',hyph_level:1);
  end;
find_letters(trie_link(trie_root),1); {prepare for output}
rewrite(patout);
output_patterns(trie_root,1);
close_out(patout);@/
@<Make final pass to hyphenate word list@>;
end_of_PATGEN:
end.

@ The patterns of a given length (at a given level) are chosen with dot
positions ordered in an ``organ-pipe'' fashion.  For example, for
|pat_len=4| we choose patterns for different dot positions in the order 2,
1, 3, 0, 4.  The variable |dot1| controls this iteration in a clever
manner.

@<Globals...@>=
@!n1, @!n2, @!n3: integer; {accumulators}
@!i: val_type; {loop index: hyphenation level}
@!j: dot_type; {loop index: pattern length}
@!k: dot_type; {loop index: hyphen position}
@!dot1: dot_type;
@!more_this_level: array[dot_type] of boolean;

@ The array |more_this_level| remembers which positions are permanently
``knocked out''.  That is, if there aren't any possible good patterns
remaining at a certain dot position, we don't need to consider longer
patterns at this level containing that position.

@<Generate a level@>=
for k:=0 to max_dot do more_this_level[k]:=true;
for j:=pat_start to pat_finish do
  begin  pat_len:=j; pat_dot:=pat_len div 2; dot1:=pat_dot*2;
  repeat pat_dot:=dot1-pat_dot; dot1:=pat_len*2-dot1-1;
    if more_this_level[pat_dot] then
      begin do_dictionary; collect_count_trie;
      more_this_level[pat_dot]:=more_to_come;
      end;
  until pat_dot=pat_len;
  for k:=max_dot downto 1 do
    if not more_this_level[k-1] then more_this_level[k]:=false;
  end

@ When all patterns have been found, the user has a chance to see what
they do.  The resulting \.{pattmp} file can be used as the new
`dictionary' if we want to continue pattern generation from this point.

@<Make final pass to hyphenate word list@>=
procesp:=false; hyphp:=true;@/
print('hyphenate word list? ');
get_input_ln(buf[1]);
if (buf[1]='Y') or (buf[1]='y') then do_dictionary

@* System-dependent changes.
This section should be replaced, if necessary, by changes to the program
that are necessary to make \.{PATGEN} work at a particular installation.
It is usually best to design your change file so that all changes to
previous sections preserve the section numbering; then everybody's version
will be consistent with the printed program. More extensive changes,
which introduce new sections, can be inserted here; then only the index
itself will get a new section number.
@^system dependencies@>

@* Index.
Pointers to error messages appear here together with the section numbers
where each ident\-i\-fier is used.

Bell Labs OSI certified Powered by Plan 9

(Return to Plan 9 Home Page)

Copyright © 2021 Plan 9 Foundation. All Rights Reserved.
Comments to webmaster@9p.io.