Pascal Implementation by Steven Pemberton and Martin Daniels

Chapter 1: Input and Lexical Analysis

Lines: [303-549]

Routines: endofline, error, insymbol, nextch, options

This group of routines is responsible for reading the input, producing a listing, reporting errors, and splitting the input stream into distinct 'symbols' to be passed on to the next stage of the compiler.


The structure of the lexical analyser

Routine nextch, lines [358-371]

All input in the compiler is done by this procedure. Its purpose is to deliver the next character of the input in variable ch, dealing with the listing, and the end of line and end of file events. The following non-local variables are affected:

the current character to be dealt with
the position of the character in the current line
a buffer for the eoln test, so that all processing of the line can be completed before printing errors relating to the line
rather untidily used to deal with eof.
If end of line was met before this character then call the procedure endofline.
If the input is not exhausted then save the state of eoln in eol, read the next character, print it if required, and increment chcnt, giving the position of this character in this line.
Otherwise, the input is exhausted, and so a warning is printed, and test is set to false.


  1. The variable eol is used to delay processing of the end of line until all processing of the line itself has been completed, mainly to ensure that all error messages relating to the last item on the line appear with that line.

    When eoln(input) is true, the next read(input,ch) will read a space into ch, as required by Pascal. A space terminates all lexical items, except strings, where eol is tested explicitly [486]. Thus the end of line terminates all items.

  2. Ideally the statement if list then writeln(output) [360] should be the first statement of procedure endofline where it belongs.
  3. The "eof encountered" message [368] is split in two because the P4 compiler restricts strings to a maximum of 16 characters.
  4. The assignment to test [369] is part of the rather messy treatment of end of file. This will be discussed later when talking about procedure insymbol.
  5. chcnt is reset to zero at each call of endofline [335].

Routine error, lines [338-347]

Whenever the compiler reports an error, this procedure is called with an integer parameter representing the error number. The procedure saves a maximum of 10 errors in a buffer errlist, each element consisting of the error number (ferrnr) and the position at which it occurred (chcnt). Errinx (error index) counts the number of errors reported about the current line; if this number ever exceeds 9, then error 255 -- further errors suppressed -- is saved, and no others are saved for this line. Procedure endofline prints these saved error messages.


  1. Since after an error it is no further use to produce code for the program being compiled, the global variable prcode which governs code production could usefully be set to false within this procedure. This would reduce compilation time for incorrect programs.
  2. Even though the nmr field is declared as 1..400 there are two places where error is called with a value of more than 400 ([664, 1921]).
  3. Error numbers are used more or less as defined in the Pascal User Manual (see the appendix). The major exceptions are 399 which is meant to mean 'implementation restriction', and so should really be 398, and 400, 500 and 501 meaning 'compiler error'.

Routine endofline, lines [303-336]

This procedure processes end of line: listing errors for the current line, incrementing the line count, printing the two numbers that precede each line of the listing, and resetting chcnt.

If errors reported then print them on the listing.
If more than one error refers to the same character position, the error numbers are separated by commas.
An upward arrow ^ is printed as close to the error position as possible, considering the errors already printed.
The minimum width required to print the error number is calculated.
Before each line there are printed the line number and either ic the instruction counter, that is, the number of instructions produced so far, or lc, the location counter, the number of locations of data-store reserved so far for the current routine. Which of these is printed is determined by dp which is set to true when compiling the declaration part of a routine, and false otherwise [3567, 3582].


  1. If list is false, then only the error messages are printed with no indication of which line they are for; it would seem a good idea to print linecount along with the asterisks at line [307].
  2. The calculation of f, the width needed to print the error number [321-3], is only necessary for the freepos calculation since the width specification in a write statement states the minimum width required. Therefore

    write(output, currnmr: 1)

    would have the required effect, taking as much space as necessary to print currnmr.

  3. The lexical analyser is one of the bottle-necks of any compiler, and nextch in particular is here, as it is called once for each character in the source program (more than 160,000 times when compiling itself!). One way to reduce this bottle-neck would be to skip the initial blanks on every input line here in procedure endofline. When compiling the compiler this would save around 60,000 calls to nextch.

Routine options, lines [373-394]

This routine deals with 'pragmatic comments' like the one on line [1]. The format of the options is supposed to be a number of single options separated by commas, each option consisting of a letter followed by a sign. No contained spaces are allowed, and the four options possible are:

controls printing of internal tables after each routine compiled.
controls listing of the compiled program.
controls production of extra code to check array bounds, dereferencing of nil, subranges, etc.
controls production of code.

The defaults for these are t-, l+, d+, c-, set at line [3800] in procedure initscalars.


  1. Observe that '+' turns any option on, any other character turns it off. You might prefer the following:

    if ch = '+' then prcode:=true
    else if ch = '-' then prcode:=false
  2. Any divergence from the required format causes the rest of the comment to be ignored. Unfortunately, unrecognised options are treated differently to recognised ones. Consider how the following would be treated:


    You might also like to consider the effect of


    and of


    and indeed of

    (*$l+,c+, *)

    and consider how to prevent the problems these cause.

Routine insymbol, lines [349-356, 396-549]

This procedure is the heart of the lexical analyser. Its purpose is to skip spaces and comments (interpreting pragmatic comments on the way) and to collect the basic symbols of the program for use by the next phase of the compiler.

The different sorts of symbols are defined by the type symbol, [80-86], and may be split into several classes:

A symbol is represented by the following group of global variables:

The kind of symbol currently held
identifies which operator when sy = relop, mulop, or addop
id, kk
the characters of the last identifier met, and its length
the value of the last constant met.

The routine works by getting the next character, deciding on the basis of this which sort of token it is going to be, and then collecting characters as long as they can be part of that sort of token.

The routine may be split into several distinct parts:

get next non-space character
deal with unrecognised characters
identifiers and reserved words
integers and reals
characters and strings
punctuation and operators, with [532-539] for comments

Space and Illegal Character Skipping [397-405]

Label 1 is used once at line [539] for skipping comments. When a comment has been skipped, a jump is made back to here, the beginning of insymbol
This piece of code skips spaces and newlines until the next relevant character, or the end of file. Because of the tricky use of the variable test it is quite difficult to understand this short piece of code; remember that nextch may set test to false.

These 4 lines of code deal with unknown characters, reporting error 399, and setting sy to othersy, which is a symbol not recognised by the syntax analyser.

The array chartp is an array that defines the 'character type' of each character in the available character set.

You will find chartp declared [284] as

array [char] of chtp

where chtp is defined as [90-91]

(letter, number, special, illegal,
chstrquo, chcolon, chperiod, chlt, chgt, chlparen, chspace)

You will find chartp initialised in procedure chartypes, [3908-44]: here all elements are filled with illegal before filling in elements that are known, thus leaving illegal in unknown elements.


  1. The P4 compiler attempts as far as possible to be portable between different machines, and therefore tries to assume as little as possible about the character set available. The character types beginning 'ch' (chstrquo, chcolon, etc.) are all later additions attempting to increase this character independence, since they require fewer explicit character literals.

    Unfortunately, these additions have not been applied uniformly: for example, the character set might include a second 'space' character such as 'tab', so it would be better to skip spaces with

    while (chartp[ch] = chspace) and not eol
    do nextch

    rather than the present

    while (ch=' ') and not eol
    do nextch

    This goes for most other character literals in the lexical scanner.

  2. As mentioned above, the character types beginning ch, chstrquo, chcolon, etc., were later additions. As a choice of names, they are not very good, because they are a different style to the other four names. Names more in the style of the others should have been chosen, such as quote, although it is recognised that these have to be chosen with care so that they do not clash with existing names.
  3. The section of code to skip spaces and newlines, [398-401], is so complicated that it deserves a rewrite; furthermore, test is a very poor choice of identifier name, since it reveals little about its use (this may have been a later addition to fix a bug). If its name is changed to fileended, and the assignment fileended:=false is retained in nextch, verify that this section can be replaced by the following code:

    while (ch =' ') and not fileended do nextch.

    Remember that when eol is true, ch = ' '.

  4. The if statement, [402-6], is quite unnecessary. Far better to include illegal in the case statement that follows it:

    case chartp[ch] of
    illegal: begin
    letter: ...
  5. Since the syntax analyser does not recognise the symbol othersy, such a symbol will always cause a syntax error. Consequently, illegal characters will generate two error messages, one from the lexical analyser, and one from the syntax analyser. It would really be better then not to generate an error here, but leave it to the syntax analyser. In any case, an illegal character is not really an implementation restriction; error 6 ('illegal symbol') would be more appropriate.

Symbol Collection [407-547]


The kind of symbol that is going to be analysed depends on its initial character, so the character type is selected using a case:

Identifiers and Reserved Words [408-424]

Collect up to 8 characters of the identifier in the array id, terminating when ch is a character that cannot continue an identifier.

Pad out the rest of id with spaces, k is the length of this identifier, kk the length of the previous identifier. If k is greater than kk, then the rest of id already contains spaces, and kk can be set to k. Otherwise, the remaining characters of id that are not spaces, (those from k to kk) must be made so.

For instance, if the last identifier was maxint, there is the following situation:


If the next identifier is special, it becomes:


Whereas if the following identifier had been char, it would have been


leaving the final nt of maxint to be made into spaces.


Determine if the identifier is a reserved word or not. The reserved words are stored, in ascending length order, in the array rw (reserved words) declared [285], initialised in procedure reswords [3824].

The array frw (first reserved word) is set up so that frw[n] points to the first reserved word of length n in rw, for n = 1 to 8.

Then the search for a particular identifier in the reserved words list may be restricted to the reserved words of the length of that identifier, i.e. frw[k] to frw[k+1] - 1.

If the reserved word is found, the arrays rsy (reserved symbol) and rop (reserved operator) are used to deliver the kind of symbol and the kind of operator that it is.



[423] If the word is not a reserved word, then it is just an identifier.


  1. [411] The constant 8, denoting the maximum of characters used to distinguish identifiers, while being a standard for Pascal, could well be a constant, say maxid. There would be nothing to lose by this added bit of documentation, and it would facilitate changing it if wanted. In fact, there is a good argument for making all literal integers into constants (with the possible exception of 0 and 1).
  2. The test, [414-5]

    until chartp[ch] in [special, illegal, chstrquo, chcolon, chperiod, chlt, chgt, chlparen, chspace]

    is shorter and more obvious if written as the equivalent

    until not (chartp[ch] in [letter, number]).

  3. [416-9] It is doubtful whether the optimisations involved in this bit of code are really worth it. They certainly obscure its workings. The clearest alternative would be

    for i:=k+1 to 8 do id[i]:=' ';

    although there would now be no use for the separate k and kk, and the kk:=k could be eliminated.

    Less obvious, but retaining most of the optimisation would be

    for i:=k+i to kk do id[i]:=' ';

    but the simpler first alternative seems preferable, since it depends less on its surroundings. The optimisations applied here really seem to gain only a microscopic amount of efficiency, and should be avoided.

  4. Note that kk is initialised [3802] to 8. Convince yourself that this prevents the need for id to be initialised.
  5. Note that since the reserved word procedure has 9 letters, it is only recognised from its first 8.
  6. Forward should not be a reserved word; nil should be.
  7. The maximum search length for reserved words is 8 for three lettered words. If the search was done using the first letter rather than the length of the word, the maximum search would be reduced to three (for letters p, t, and f -- not including forward).

Real and Integer Numbers [425-80]

The overview of this section is:

Collect the integer part of the number
Collect the fractional part, if any: the period and following digits.
Collect the exponent, if any: the 'e', the optional sign, and the following digits.
Construct the real constant.
Construct the integer constant.

Before examining this section in detail, it is first necessary to study how constants are stored.

How constants are stored

The values of integer, real, character, and string literals produced by the lexical analyser are stored in global variable val of type valu.

Valu is declared, [105-8], as a variant record:

 valu = record case intval: boolean of
    true: (ival: integer)
    false: (valp: csp)

where field ival is for storing integer and character literals holding the integer value or the ordinal value of the character, and field valp for all other types of literal.

The comment 'intval never set nor tested' [105] reflects the fact that the compiler knows from context when it has an integer or character literal and when it does not.

The field valp is of type csp, (constant pointer), where csp is constant and constant is a further variant record with three variants, (reel, strg, and pset), one for reals, one for strings, and one for set literals. (Set literals are not constructed by the lexical analyser, but by the syntax analyser, line [2830] onwards). Sets are held as a set of (small) integers, strings as a short string (up to a maximum of strglgth characters) with an index giving its length, and reals as the string of characters making up the number. Reals are held in this way, so that the compiler itself does not have to do any real arithmetic, which makes it easier to implement initially.

So integers and characters are held


reals, strings, sets


Now to return to the commentary.

Collect the individual digits of the integer part of the number in the array digit until ch is not a digit. Note that a number, including fractional part and exponent, may only be a maximum of digmax characters long.
If the integer part is followed by '.' or 'e' then the number may well be a real number. If it is followed by 'e', it certainly is, but the period may turn out to be part of a construct such as 1..10.
So in case it is such a construct, the value of i is left untouched as the length of the integer part, and k is used for the length of the real number.
If the character was a period, then it is stored, and the next character is obtained.
If this character is a period, then it is a 1..10 construct, and so ch is changed to '..' for future processing, when insymbol is next called, and there is a jump to label 3 to deal with the integer part of the number [466].
To get here there was only one period after the integer part, and so it is definitely a real number. Pascal demands that at least one digit follow the period of a real number -- if not, error 201 is reported.
The digits of the fractional part are gathered.
Whether or not there was a fractional part, if the number is followed by an 'e', the 'e' is stored and the exponent is collected.
The optional sign is saved.
A series of digits is expected and saved.

Here the real constant is saved.

Lvp is a ^constant; its cclass is set to reel, and the digits of the number are transferred from digit to rval, after initialising rval to all spaces.

If the number of characters in the number was greater than digmax then error 203 is reported and the number 0.0 is stored.
The constant is saved in val.

Integer Numbers [466-79]

Label 3 comes from [434] in constructs like 1..10.
Complains if the number is too long and saves 0.

Calculates the value of the integer, making sure that it is not too big.

Ordint is an array [char] of integer, only used for the characters '0' to '9' holding the value of each of these digits, i.e. 0 to 9; initialised at lines [3940-3].


  1. Type valu is split into two (one record for integers and characters, one for all others) as a space saving optimisation. Ideally, it would be defined as

    cstclass = (int, reel, pset, strg, chrctr);
    value = record case cclass: cstclass of
        int: (ival integer);
        reel: (rval: packed array [1..strglgth] of char)
        pset: (pval: set of setlow..sethigh)
        chrctr: (cval: char)
        strg: (slgth:0..strglgth
              sval: packed array [1..strglgth] of char)

    However, this would mean that every object of type valu would need to be large enough to accommodate the largest of these variants (probably strg). Obviously, the designers of the compiler anticipated more integer and character literals than other kinds and so treated them specially to save space. Thus reals, strings and sets need space for one extra pointer -- valp (and of course, extra time to access via valp) -- but integers and characters need very much less space.

    An additional space saving advantage is that a constant can be created using new with the necessary size, for example, new(lvp, reel) [454].

  2. The treatment of the 1..10 case [434] is verging on the abominable. The jump from the then part of an if statement into its else part is probably the worst bit of programming in the compiler. It is unlikely that Pascal even allows it, and many Pascal compilers refuse it. In this edition we have rewritten it without a goto, leaving the original code as a comment.

  3. digmax = strglgth -1 (see line [3808]) so there is no problem copying digit into rval [458].
  4. rval[1] is left blank [458] to allow for an optional sign which may get filled by the syntax analyser, should this number be part of a constant declaration such as

    const z =-3.14

    (several places in procedure constant, starting line [864]).

  5. The test if ival <= mxint10 [473] is meant to ensure overflow does not occur in the next line when ival is multiplied by ten. Mxint10 is initialised [3808] to the value maxint div 10.

    However, the test is not correct: for example if maxint = 1000, ival = 100, and k = '1'. It is necessary to ensure that the new value of ival will not cause overflow, i.e. that

    ival * 10 + ordint[digit[k]] <= maxint

    that is,

    ival * 10 <= maxint - ordint[digit[k]]

    that is,

    ival <= (maxint - ordint[digit[k]]) div 10.

    This is the test wanted, and mxint10 can be eliminated in the process.

  6. Pascal guarantees that the characters '0' to '9' are all in one group in the character set. Therefore the declaration of ordint [296]

    ordint: array[char] of integer

    could better be declared as

    ordint: array['0'.. '9'] of 0..9

    as this exactly reflects its use. However, because of Pascal's guarantee, the value of a digit character can be calculated using the expression

    ord(ch) - ord('0').

    Therefore, line [474] which reads

    ival:=ival - 10 + ordint[digit[k]]

    may be changed to

    ival: = ival * 10 + (ord(digit[k]) - ord ('0'))

    (the brackets are essential to avoid causing overflow in critical cases). With this change, ordint may be eliminated entirely.

  7. Error 203 really means 'integer constant exceeds range'. However there is no complementary one for reals.

String and Character Literals [481-502]


Collect the characters of the strings. If the string extends over a line, error 202 results.

The outer loop deals with quotes within strings, which are represented as two single quotes, e.g. '''': the inner loop terminates when a single quote is met, and this quote is saved. The outer loop then continues if a further quote follows, but this quote is not saved. If a second quote does not follow, then the double loop terminates.

Lgth is decremented to remove the last saved quote.
Pascal does not have zero length strings.
A string of length one is a character literal, and the ord of this character is stored in val.ival.
The string, of length >= 2, is stored in sval and slgth (string value and string length) of lvp, which is then stored in val.valp. If the string is longer than the maximum allowed, (strglgth), the implementation restriction error 399 results.


  1. That strings can only be a maximum of strglgth characters long is one of the more inconvenient restrictions of this compiler. One of the many possible solutions would be to store strings as a list of pieces, for instance:

    type constant = record case cclass: cstclass
    strg: (slth: integer;
    sval: piece)
    piece = record chars: array [1..piecemax] of char;
    nextpiece: ^piece
  2. As before with spaces, the termination tests would best be

    (chart[ch] = chstrquo).

Punctuation and Operators [503-547]

This deals with colon and colon-equals (:=).
This deals with period (.) and range (..).
Less than (lt: <), less than or equal (le: <=), and not equals (ne: <>).
Greater than (gt: >) and greater or equal (ge: >=).
Deals with open bracket '('; if it is followed by '*', then it is the start of a comment. If this comment starts with '$', then it is a pragmatic comment, and so procedure options is called. Then the rest of the comment up to '*)' is skipped. Remember that '*' may appear within the comment not followed by ')'. If there was a comment, there is a jump to label 1, right at the beginning of insymbol [397], to start searching for a symbol again.

This deals with all single character symbols and operators not already dealt with: + - * / ) = , [ ] ^ ; .

Ssy yields the kind of symbol, sop the kind of operator if it is one, noop otherwise.

Ssy is initialised [3858-64], sop [3873-5].

Due to the loop [398-401], this will only be reached if ch happens to be a space when end of file is reached. Othersy is not recognised by the syntax analyser and so will cause a syntax error.


  1. Since when dealing with integers directly followed by '..', ch gets set to ':' [434], '..' sometimes gets treated as a colon. Consider why then, '..=' never gets accepted as a substitute for ':='.


    type index=0:9

    ever get accepted? How about for labels?


  2. Label 1 is really just a loop

    while commentskipped do


    repeat... until symbolfound.

    Consider the changes necessary.

    Another solution would be to replace the goto 1 by a (recursive) call to insymbol.

  3. Note in the initialisation of ssy [3858...] that the assignments for ( $ space . ' : < > are all unnecessary, since these single character symbols and operators are dealt with individually elsewhere. Similarly with sop [3873...] for < and >.

General Notes on the Lexical Analyser

  1. Throughout the compiler there are several groups of variables that really form single units. Such groups are for example

    ch + chcnt, a character and its position.

    errlist + errinx, the error list and its index

    For documentation reasons, it would seem quite a good idea to physically group these together as records, for example:

    var ch: record val: char;
    pos: integer
    error: record numbers: array[1..10] of
    record pos: integer;
    nmr: 1..400
    index: 0..10
    options: record list, debug, check, printtables: boolean end;

    and access them

    if option.list then

    while ch.val=' ' do ....

  2. While about it, another field could be added to the ch record: class: chtp and ch could always be updated with its type. Then the following could be written:

    while (ch.class = chspace) and not fileended do nextch

  3. Throughout the lexical analyser, there seems to be a tacit assumption that eoln will directly precede eof. For example [484]

    repeat nextch
    until (eol) or (ch='''')

    Pascal does not require this to be true. One solution would be in nextch to write

    if eof(input) then begin 
    writeln(output, '*** eof', 'encountered') 
    ch.class := illegal; 
    ch.val := ' '; 
    fileended := true 
    end else

    Then, most symbols would terminate with the illegal character class, and those that would not (strings and comments) could explicitly test:

    repeat nextch; ... 
    until eol or fileended or (ch.class = chstrquo)

    Another solution would be to completely revise the treatment of file events. If, instead of the two variables eol and fileended there was a single variable:

    var filestate: (normal, lineended, fileended)

    firstly this would more clearly show that eol and fileended will never occur simultaneously; but more importantly it simplifies a great many tests, for now may be written

    while (filestate = normal) and (ch.class = digit)
    do ...


    repeat nextch; .... 
    until (filestate <> normal) or (ch.class = chstrquo)

    to include both eoln and eof.

  4. Throughout the compiler, there are procedures that are really too long. Here, insymbol provides a good example: its final end is some 300 lines away from its heading, and this makes for quite difficult reading.

    It seems a good idea to split such large routines up. For example, insymbol into procedures skipspaces, inidentifier, innumber, instring, etc.

    Then the body of insymbol could be:

    case ch.class of 
    letter: inidentifier; 
    number: innumber; 
    chstrquo: instring; 

    Similar reasoning could be made for a procedure skipcomment to replace [533-40].

    It could be argued that such routines, called from only one place, are a bad idea on the grounds of efficiency. However the increased readability easily compensates for the slight decrease in efficiency. Besides, there are several precedents: for example endofline.

  5. Similarly throughout the compiler there are several pieces of very repetitious code that ideally should be put in one place as a single routine.

    Examples are at lines [427, 433, 437, 442, 445, 450] where there is something like

    k:= k+1; 
    if k <= digmax then digit[k] := ch; 

    It would be better to make this sequence into a procedure, say savedigit, and write

    repeat savedigit
    until ch.class <> number

    which is very much more readable. (Adopting this, [431] would need to be changed to i := k.)

    At lines [378...] there could be a procedure

    procedure setoption(var option: boolean); 
    if ch.val <> '*' then nextch; 
    if ch.val = '+' then option := true 
    else if ch.val = '-' then option := false 

    Then in procedure options:

    if ch = 't' then setoption(prtables) 
    else if ch = 'l' then setoption(list) 
  6. There is a problem with the parameterisation of the target machine [34-56]. The identifiers maxint, ordminchar, and ordmaxchar are used here to specify properties of the target machine, but this is a mistake since these three identifiers are used as a standard to mean values of the machine that the program is running on.

    This probably reflects an original assumption that the compiler would run on the same machine that it was generating code for. However, this is not necessarily the case, and to partly solve the problem there should really be three other values, say intmax, charmin, and charmax to represent the target machine's properties, and then be very careful about which was used where.

    An example of such a use is in the initialisation of the array chartp on line [3911] - Here it is the source machine's character set, i.e. ordminchar and ordmaxchar as such that is meant. Later it will be seen that the use of ordminchar and ordmaxchar on line [639] the target machine's characters are meant.

    A larger, more pervasive problem with the differences between the source and target machine is if the target machine has a value of maxint larger than the source machine can handle. Since the values of integer literals are calculated within the compiler, only major surgery could solve this.

    A similar problem occurs with sets if the source machine has smaller sets than the target machine. A solution to this would be to hold pval of type constant as an ordered list of integers.

    However, possibly the worst problem in the mismatch between source and target machines is that for accesses to arrays whose index types are character, and for case statements on characters, wrong code is generated for machines whose character set is different to the source machine's. This is discussed in the relevant sections.

Pascal Implementation by Steven Pemberton and Martin Daniels