Pascal Implementation by Steven Pemberton and Martin Daniels

Chapter 5: Compiling Expressions

Lines: [2084-2226, 2738-3102]

Routines: factor, term, simpleexpression, expression, selector, constant.

Routine factor, lines [2748-896]

Factors are the basic operands of expressions.

[2752-5]
Check that the next symbol may start a factor. Facbegsys (factor begin symbols) contains the symbols that may start a factor, and is initialised at [3819]. In case skip skips to a context symbol (that is, no factor found) the type of this factor is set to nil.
[2756]

This while loop is for error recovery purposes only: normally there will only be one factor here. However if after the factor at [2893] the next symbol is not in the context, a following factor will still be fully analysed rather than just skipped.

[2759-84] Identifiers. These may be functions [2762-70], where procedure call deals with the parameters, constants [2772-6], or variables and fields of records [2778-83], where selector deals with indexes (for example, a[i]), field designators (for example, r.x), file buffers (input^), and referenced variables (ptr^).

At [2764-9] and [2779-82] subrange types are reduced to their range types. For instance in a:=i where i is of type 1..10, the expression type for i is reduced to integer so that later

if (lattr.typtr = intptr)

can be used instead of

if comptypes(lattr.typtr, intptr)

(see for example [2905]).

[2785-817] Constants. Integer [2786-92], real [2794-800], and string [2802-17].

The index type (inxtype) of strings is nil [2810] since they may not be indexed.

[2818-21] Bracketed expressions.

[2822-8] The boolean not operator. Generates the code for the operand followed by a NOT instruction. For example, not test would produce

LODI 0 5
NOT

[2825-7] Checks that the operand was boolean.

[2829-91] Set values. A set value is a list of expressions enclosed in square brackets. Each expression may be a constant, or a value calculated at run time: all the constant values are collected together in a variable called cstpart (constant part), while code is generated to create the set for the variable values. Then code is generated to add the constant part in one go.

For example, the code for

[1, [[a, 2, b, 3, c]

would be

 LODI 0 8    Load a
 SGS         Create a singleton set of a
 LODI 0 7    Load b
 SGS
 UNI         Unite the two sets
 LODI 0 6    Load c
 SGS
 UNI
 LDC( l 2 3) Load set constant
 UNI

[2830] Constant part is empty; no variable part yet.

[2831-3] Create a structure for a set. Element type unknown at present (always unknown for the empty set).

[2834-9] Deal with the empty set.

[2842] This repeat statement (ends [2869]) deals with each expression in the list.

[2843-5] Check the expression is a scalar.

[2847] Check the type of the expression matches the type of the preceding expression. Lsp^.elset is initially nil, but comptypes always returns true if one of its parameters is nil.

[2849-54] If the expression was a constant, check that it is in set range, and if so add it to the constant part.

[2856-65] Deal with non-constant expressions.

[2857] If it is not an integer expression, generate an ORD instruction to convert it to integer.

[2859] SGS generates a single element set from the last expression.

[2860] If there have already been some non-constant expressions, a UNI (set union) instruction is generated to unite the last two created sets.

[2863] Save the type of the last expression.

[2866] The last two expressions were not compatible.

[2875-84] If there was both a variable part and a constant part, code is generated to load the constant part and unite it with the variable part.

[2887-90] If there was only a constant part, its value is saved in gattr.

[2893] Check the next symbol will be acceptable.

Notes

  1. 0..47 at line [2750] should be setlow..sethigh.
  2. One consequence of the error recovery at [2893] is that if there is a missing semicolon between two assignment statements, e.g.

    a:=1
    b:=0

    then the variable of the second assignment will get consumed by the call of factor for the right hand side of the first assignment.

  3. [2762] If an identifier was meant to be a function, but was undeclared, its call will not be dealt with as well as possible, since it will be assumed to be a variable. A better way of deciding whether the factor is a call is to see whether it is a function or the next symbol is a left bracket.
  4. Note the different styles of [2764-9] and [2779-2].
  5. If charal is not 1 (unlikely, but possible), the calculation of size at [2810] will be wrong, since it takes no account of alignments. The correct calculation (see [1319-23] for other arrays) would be

    size:=charsize;
    align(charptr, size);
    size:=lgth*size

    However, the value charmax [40] represents the aligned size of char, so

    size:=lgth * charmax

    is sufficient. See also [880] in procedure constant for similar code.

  6. Variable test [2867] is actually declared 5 levels out at [853]! This is quite unnecessarily inefficient, and a test local to factor, and every other routine that uses test, should be declared, though finished would be a more descriptive name.

    In fact the formulation

    repeat
        .
        .
        .
        test:=sy <> some symbol;
        if not test then insymbol
    until test
    

    or similar, occurs quite frequently, and is necessary because Pascal has no clean way of terminating loops on the middle. One solution, depending on your taste for functions with side effects, would be a function like

    function isnt(s: symbol): boolean
    begin
        if sy=s then
        begin insymbol;
              isnt:=false end
        else isnt:=true
    end;

    with the following use

    repeat
        .
        .
        .
    until isnt(comma)
  7. Note that the compiler does not handle subranges in set constants, like [1..9]. All that is needed for this is to load the lower and upper bounds of the subrange and then generate a new instruction, say MPS (generate multiple set).
  8. Neither the compiler nor the interpreter checks that nonconstant expressions in sets are in range. One solution is to add before [2859]

    if debug then gen2t(45 (*chk*), setlow, sethigh, intptr)

  9. Nil is treated by the compiler as a standard identifier, therefore allowing it to be redeclared. In fact it should be a reserved word and recognised in this routine.

Routine term, lines [2898-2951]

A term is a factor, or a number of factors separated by mulops (* , /, div, mod, and).

The basic structure of this routine is

factor(...);
while sy=mulop do
begin
    factor(...);
    generate multiply instructions
end

[2899] Compile the first factor.

[2901] There is a mulop, so generate instructions to fully load the factor. Save the attributes of the factor (gattr) in a local variable lattr (you can think of the 'l' as standing for local, or left as it is the attributes for the left operand).

[2902] Compile and load the right operand.

[2903] If there were no errors in either operand then generate the code for the operator.

[2905] The operator was '*'.

[2906] If both operands are integer, it is an integer multiply MPI.

[2909-19] If one operand is integer, convert it to real, and see if that leaves both real for a real multiply MPR.

FLO converts a left hand operand to real, FLT a right hand operand.

[2921-3] The only possibility is for both operands to be compatible set types, when the set intersection instruction INT is generated.

[2924] Otherwise just give an error message.

[2926] Both operands of '/' must be real, so convert them if possible and as necessary.

[2939-48] Deal with div, mod, and and in a similar way.

Notes

  1. Since many of the code instructions have type fields already, it would seem consistent for MPI, MPR, and INT to be MULI, MULR, and MULS, and let the assembler treat them differently if necessary.

    DVI and DVR are other candidates for this treatment.

  2. Code similar to [2909-17] appears four more times in the next three pages. It could be a procedure as long as it took the form of lines [2927-34], with two if statements, rather than one with an else.
  3. It is important to note that on a machine where intsize is different to realsize, the FLO instruction to float a left hand operand will only work if the right hand operand is already real. This is because there is no way for the FLO instruction to tell where its operand is. There is only one place where both operands might need to be converted to real, and that is with the operator '/'. Therefore the order in which the FLT and FLO instructions are generated [2927-34], with the FLT first, is very important.

    For all other operators, at most one operand will be floated.

Routine simpleexpression, lines [2953-3020]

A simple expression is a term, or several terms separated by addops (+, -, or). The first term may be signed.

[2954-64] Deal with the initial sign, if present. A plus is ignored. NGI is integer negate, NGR is real.

[2965-9] This is similar to term.

[2971-90] Operator '+'. Identical treatment to '*' in factor. ADR is real add, ADI integer. UNI is set union.

[2991-3012] Minus as for plus. DIF is set difference.

[3013-6] Or as for and in term.

Notes

  1. Same comment as MULI, MULR, MULS for NGI, NGR which become NEGI, NEGR; ADI, ADR, UNI which become ADDI, ADDR, ADDS; and SBI, SBR, DIF which become SUBI, SUBR, SUBS.
  2. Adopting note 1 above, since the treatment of plus and minus here, and mul in term are so similar, they could be combined into a procedure, say operand, and then:

    case lop of
    plus: operand(ADD)
    minus: operand(SUB)
    orop: ...
    ...
    end;
  3. The treatment of div, mod, and, or, and the latter half of rdiv is the same and therefore a candidate for a procedure.
  4. Rather than generating a negate instruction for constants, the value of the constant could be changed, in the style of [887-927] in procedure constant. Then the code generated for, say,

    a:=-1

    would be

     LDCI -1
     STRI 0 5

    rather than

     LDCI 1
     NGI
     STRI 0 5

Routine expression, lines [3022-102]

An expression is a simple expression, or two of them separated by a relop (in, <, <=, =, <>, >=, >).

[3026-8] If the simple expression is of a type small enough to be loaded then do so; otherwise only load its address. This latter case is normally for arrays and records, but here the only possibility is for strings.

[3030-2] If the operator was in, coerce the left hand side to integer.

[3033] Get the second simple expression.

[3034-6] Load it or its address, as necessary.

[3038-43] If the operator was in and the operands are correct, generate an INN instruction.

[3046-55] If the operand types are different and one is integer, float it.

[3056-88] Select the type field for the instruction.

[3065] For integer and enumerations.

[3068] Only = and <> allowed for pointers.

[3073] < and > are not allowed for sets.

[3075] Only character arrays may be compared.

[3081-7] Records and files may not be compared.

[3089-96] Generate the instruction.

[3098] The operands were not compatible.

[3100] The result is a boolean expression.

Notes

  1. Code similar to

    if gattr.typtr^.form <= power then

    appears everywhere and is far from explicit. Far better would be

    if gattr.typtr^.form in [scalar, subrange, pointer, power]

    or since it appears often, using a variable

    if gattr.typtr^.form in simpleforms

    However, to simplify everything load may as well start

    if gattr.typtr <> nil
    then if gattr.typtr^.form in [arrays, records] then loadaddress
    else ...
  2. There is a similarity between the type indicator code [3058-87] and [1904 et seq.]
  3. You may prefer

    if not (lop in [eqop, neop]) then error(131)

    as closer to the intent of [3068].

  4. A suitably initialised array would make lines [3089-96] simpler:

    gen2(oper[lop], ord(typind), lsize)

    though lsize is not used elsewhere, and so could be expanded and eliminated:

    gen2(oper[lop], ord(typind), leflr.typtr^.size)

Routine selector, lines [2086-225]

This deals with indexing arrays, field selection of records, dereferencing pointers, and windowing a file, for example

 
a[i]
r.f
p^
input^

It is called after reading the initial identifier and doing a searchid for it (e.g., [2236] in procedure variable). Fcp contains the details of this identifier. The code generated for an array access like a[i] is something like

[Code to load the address of a]
[Code to load the value of i make it integer if necessary]
[Optional code to check iin bounds]
DECI lmin Decrement ithe lower bound of a
IXA size  Calculate the address of the element

No special code is generated for field selection, dereferencing, or windowing, all of which are treated as accesses to variables, except for a field selection of a record which is only indirectly accessible. In this case the code to load the address of the record is followed by:

 
INCA offset     increment the address with the offset of the field within the record.

[2089-126] Access the identifier.

[2093-6] Save the address of an actual variable.

[2098-100] Load a formal variable. Since it must be a parameter to a routine, vlev must be greater than one, so an LDO instruction would never be possible.

[2101-12] Identifiers that are the result of a with statement. For example,

with r do f:=0

where f is a field of r. Disx is the value left over from the preceding call of searchid.

[2103-6] If it is a field of a directly accessible record, then just increment the address of the record by the offset of the field.

[2108-12] If the record is indirect, load its address and save the offset of the field. The address of the record will have been stored in a local temporary location. If it is at the outermost level, it can be accessed by an LDO instruction; otherwise an LOD 0 instruction.

[2113] A function identifier is only a variable when assigning to it in the body of the function.

[2115] You cannot assign to a standard function.

[2118] Nor to a formal function.

[2120] Nor to a function, except within the body of that function. (So

function f: integer;
    procedure p;
    begin f:=0 end;
begin p end;

is disallowed).

[2121-3] Calculate the address of the function result (address zero at this level).

[2131-74] Indexes.

[2133] This repeat deals with a list of indexes like a[i, j, k].

[2134-7] Check that it is an array that is being indexed.

[2139] Load the index.

[2141-4] Check that the index is a scalar, and generate an ORD instruction if it is not integer.

[2148] Check that the index type matches the bounds of the array.

[2151-8] Generate code to check the value of the index, and decrement it by the lower bound of the array.

[2161-4] Update the attributes of the expression.

[2165-70] Generate the IXA instruction.

[2176-204] Field selections.

[2180-2] Check that it is a record that is being selected from.

[2184-7] An identifier must follow the period. Locate it as a field of the record.

[2188-9] Error if not located.

[2192-7] Update the type and offsets of gattr.

[2206] Pointer dereference and file window.

[2209-16] If it is a pointer, load it and optionally check its value.

[2218] A file window is just accessed like a variable.

[2219] It is neither a pointer nor a file.

Notes

  1. Constant indexes to arrays, like

    x[2]:=0

    generate code like

    LDA 0 5 Load address of x
    LDCI 2  Load constant 2, the index
    DECI 1  Subtract the lower bound of x
    IXA 2   Index x. 2 is the element size
    LDCI 0  Load the right hand side
    STOI    Store the value

    Ideally the code would be

    LDCI 0
    STRI 0 7

    but this would be difficult, because the address of x has to be loaded before it is known if the index is constant. The best that can be done in the present circumstances is

    LDA 0 5
    INCA 2   Increment the address by 2
    LDCI 0
    STOI

    Consider what would be necessary to effect this.

  2. It is arguable whether there is any advantage of an LDO over an LOD 0, that is between accessing a variable as global or local, and therefore whether the if statement at [2109] serves any useful purpose.
  3. Examine the code at [2120-3] carefully. What do you think the reason is for the begin end round [2121-3]?

    One possibility is that [2120] was an afterthought, and inserted carelessly.

  4. The code at [2154-6] prevents a DEC instruction from having a negative argument like

    DECI -10

    which would rather get produced as

    INCI 10

    even though the interpreter accepts negative arguments (and instructions like LDCI can have negative arguments).

    As the comment at [2157] suggests, these three lines could be replaced by

    gen1t(31 (*dec*), lmin, intptr)

  5. The second parameter of the CHK generated at [2211] (maxaddr), is never used by the interpreter. The first parameter value of 1 means that the pointer value may not be nil.
  6. Error 177 ("You may only assign to the identifier of a function in the body of that function") occurs if a function is called as a statement. This can be confusing.
  7. An IXA instruction is nearly always preceded by a DEC (or INC) instruction (the only exception being when the lower bound is zero). If the format of instructions were not so restrictive, and allowed potentially large values for the P parameter, an instruction like

    IXA 1 2

    could be produced instead of

    DECI l
    IXA 2

    (Actually, CHK produces large values for P, but the assembler has to take special action for this).

  8. Indexing code could be improved. Consider

    i:=a[i]

    producing

    LDA 0 9
    LODI 0 7
    DECI l
    IXA 2
    INDI 0
    STRI 0 7

    Having produced the IXA instruction, an indirect load instruction IND has to be used to get to the contents of the element

    If array accessing were delayed a little, and IXA replaced by a typed instruction, say INX, then the above could be replaced by

    LDA 0 9
    LODI 0 7
    DECI l
    INXI 0
    STRI 0 7

    When an address is really needed from an array access, for instance when assigning to it, IXA would be used.

  9. Consider the following array

    var a: array['0'.. '9'] of integer

    and an access to it of

    a['1']

    The code for this would be something like:

    LDA 0 5
    LDCC 'l'
    ORDC
    CHKI 48 57
    DECI 48
    IXA 2

    The mysterious 48's and the 57 are the integer values of the characters '0' and '9' as represented on the source machine, effectively preventing this code from being assembled on a machine with a different character set.

    It seems astonishing that there should be the trouble taken to generate the character literal in the LDC instruction only to produce integers in the CHK and DEC instructions.

    One solution to this would be to alter the definition of CHKC to use character literals, and generate the following code:

    LDA 0 5
    LDCC '1'
    CHKC '09'
    ORDC
    LDCC '0'
    SBI
    IXA 2

    Another solution would be to change the definition of CHKC and DECC giving

    LDA 0 5
    LDC '1'
    CHKC '09'
    DECC '0'
    ORDC (possibly)
    IXA 2

    (It could be argued that DECC leaves an integer result not a character one, and so the ORDC would be unnecessary.) However, changing DECC like this would cause a problem for the standard function pred which generates code for a character

    DECC 1

    where the parameter is definitely not a character. A possible solution to this would be to introduce yet another instruction, PRD, say.

    A third and possibly the easiest solution would be to parameterise the compiler with the collating sequence of the target machine, so that the integers produced for CHK and DEC were those of the target machine. This would involve an array

    var ordchar: array [char] of minchar..maxchar;

    where minchar and maxchar are the bounds of the target machine's character set. This would then have to be initialised:

    ordchar['0']:=48; ordchar['1']:=49; ... etc.

    but the only other change that would need to be made is in the lexical analyser, where a character constant is formed. Line [491] would then read

    if lgth=1 then val.ival:=ordchar[string[1]]

Routine constant, lines[864-932]

There are certain places in Pascal where only constants are allowed. These are in constant declarations (of course), in subrange type specifications, the case labels of variant records and case statements, and in calls of new.

This procedure compiles such constants. Fsp returns the type of the constant, fvalu its value.

[872-84] Strings. See [2801 et seq.] for some similar code.

[886-924] Integers and reals

[887-91] Collect a sign if present, and set sign accordingly.

[892] The constant is an identifier declared in a constant declaration.

[894-5] Copy its attributes.

[897-8] If the identifier represents an integer constant, and there was a negative sign, negate the value of the constant.

[900-10] The identifier is a real constant. Unlike integers, the real value is held as a pointer, so a copy of the value has to be made, and the sign altered as necessary.

[912] A string may not be signed.

[916-9] Integer constant

[921-4] Real constant.

[926] Some other symbol.

Note

This procedure is in a slightly odd position, and could well be placed later, say before [1025].

Pascal Implementation by Steven Pemberton and Martin Daniels