Pascal Implementation by Steven Pemberton and Martin Daniels
Factors are the basic operands of expressions.
This while loop is for error recovery purposes only: normally there will only be one factor here. However if after the factor at [2893] the next symbol is not in the context, a following factor will still be fully analysed rather than just skipped.
[2759-84] Identifiers. These may be functions [2762-70], where procedure call deals with the parameters, constants [2772-6], or variables and fields of records [2778-83], where selector deals with indexes (for example, a[i]), field designators (for example, r.x), file buffers (input^), and referenced variables (ptr^).
At [2764-9] and [2779-82] subrange types are reduced to their range types. For instance in a:=i where i is of type 1..10, the expression type for i is reduced to integer so that later
if (lattr.typtr = intptr)
can be used instead of
if comptypes(lattr.typtr, intptr)
(see for example [2905]).
[2785-817] Constants. Integer [2786-92], real [2794-800], and string [2802-17].
The index type (inxtype) of strings is nil [2810] since they may not be indexed.
[2818-21] Bracketed expressions.
[2822-8] The boolean not operator. Generates the code
for the operand followed by a NOT
instruction. For example,
not test would produce
LODI 0 5 NOT
[2825-7] Checks that the operand was boolean.
[2829-91] Set values. A set value is a list of expressions enclosed in square brackets. Each expression may be a constant, or a value calculated at run time: all the constant values are collected together in a variable called cstpart (constant part), while code is generated to create the set for the variable values. Then code is generated to add the constant part in one go.
For example, the code for
[1, [[a, 2, b, 3, c]
would be
LODI 0 8 Load a SGS Create a singleton set of a LODI 0 7 Load b SGS UNI Unite the two sets LODI 0 6 Load c SGS UNI LDC( l 2 3) Load set constant UNI
[2830] Constant part is empty; no variable part yet.
[2831-3] Create a structure for a set. Element type unknown at present (always unknown for the empty set).
[2834-9] Deal with the empty set.
[2842] This repeat statement (ends [2869]) deals with each expression in the list.
[2843-5] Check the expression is a scalar.
[2847] Check the type of the expression matches the type of the preceding expression. Lsp^.elset is initially nil, but comptypes always returns true if one of its parameters is nil.
[2849-54] If the expression was a constant, check that it is in set range, and if so add it to the constant part.
[2856-65] Deal with non-constant expressions.
[2857] If it is not an integer expression, generate an ORD
instruction to convert it to integer.
[2859] SGS
generates a single element set from the last
expression.
[2860] If there have already been some non-constant expressions, a
UNI
(set union) instruction is generated to unite the last two
created sets.
[2863] Save the type of the last expression.
[2866] The last two expressions were not compatible.
[2875-84] If there was both a variable part and a constant part, code is generated to load the constant part and unite it with the variable part.
[2887-90] If there was only a constant part, its value is saved in gattr.
[2893] Check the next symbol will be acceptable.
One consequence of the error recovery at [2893] is that if there is a missing semicolon between two assignment statements, e.g.
a:=1 b:=0
then the variable of the second assignment will get consumed by the call of factor for the right hand side of the first assignment.
If charal is not 1 (unlikely, but possible), the calculation of size at [2810] will be wrong, since it takes no account of alignments. The correct calculation (see [1319-23] for other arrays) would be
size:=charsize; align(charptr, size); size:=lgth*size
However, the value charmax [40] represents the aligned size of char, so
size:=lgth * charmax
is sufficient. See also [880] in procedure constant for similar code.
Variable test [2867] is actually declared 5 levels out at [853]! This is quite unnecessarily inefficient, and a test local to factor, and every other routine that uses test, should be declared, though finished would be a more descriptive name.
In fact the formulation
repeat . . . test:=sy <> some symbol; if not test then insymbol until test
or similar, occurs quite frequently, and is necessary because Pascal has no clean way of terminating loops on the middle. One solution, depending on your taste for functions with side effects, would be a function like
function isnt(s: symbol): boolean begin if sy=s then begin insymbol; isnt:=false end else isnt:=true end;
with the following use
repeat . . . until isnt(comma)
MPS
(generate multiple set).Neither the compiler nor the interpreter checks that nonconstant expressions in sets are in range. One solution is to add before [2859]
if debug then gen2t(45 (*chk*), setlow, sethigh, intptr)
A term is a factor, or a number of factors separated by mulops (* , /, div, mod, and).
The basic structure of this routine is
factor(...); while sy=mulop do begin factor(...); generate multiply instructions end
[2899] Compile the first factor.
[2901] There is a mulop, so generate instructions to fully load the factor. Save the attributes of the factor (gattr) in a local variable lattr (you can think of the 'l' as standing for local, or left as it is the attributes for the left operand).
[2902] Compile and load the right operand.
[2903] If there were no errors in either operand then generate the code for the operator.
[2905] The operator was '*'.
[2906] If both operands are integer, it is an integer
multiply MPI
.
[2909-19] If one operand is integer, convert it to
real, and see if that leaves both real for a real multiply
MPR
.
FLO
converts a left hand operand to real, FLT
a
right hand operand.
[2921-3] The only possibility is for both operands to
be compatible set types, when the set intersection instruction
INT
is generated.
[2924] Otherwise just give an error message.
[2926] Both operands of '/' must be real, so convert them if possible and as necessary.
[2939-48] Deal with div, mod, and and in a similar way.
Since many of the code instructions have
type fields already, it would seem consistent for MPI
,
MPR
, and INT
to be MULI
,
MULR
, and MULS
, and let the assembler treat
them differently if necessary.
DVI
and DVR
are other
candidates for this treatment.
It is important to note that on a machine where
intsize is different to realsize, the
FLO
instruction to float a left hand operand will only
work if the right hand operand is already real. This is because there
is no way for the FLO
instruction to tell where its
operand is. There is only one place where both operands might need to
be converted to real, and that is with the operator '/'. Therefore the
order in which the FLT
and FLO
instructions
are generated [2927-34], with the FLT
first, is very
important.
For all other operators, at most one operand will be floated.
A simple expression is a term, or several terms separated by addops (+, -, or). The first term may be signed.
[2954-64] Deal with the initial sign, if present. A plus is
ignored. NGI
is integer negate, NGR
is real.
[2965-9] This is similar to term.
[2971-90] Operator '+'. Identical
treatment to '*' in factor. ADR
is real add, ADI
integer. UNI
is set union.
[2991-3012] Minus as for plus. DIF
is set difference.
[3013-6] Or as for and in term.
MULI
,
MULR
, MULS
for NGI
,
NGR
which become NEGI
, NEGR
;
ADI
, ADR
, UNI
which become
ADDI
, ADDR
, ADDS
; and
SBI
, SBR
, DIF
which become
SUBI
, SUBR
, SUBS
.Adopting note 1 above, since the treatment of plus and minus here, and mul in term are so similar, they could be combined into a procedure, say operand, and then:
case lop of plus: operand(ADD) minus: operand(SUB) orop: ... ... end;
Rather than generating a negate instruction for constants, the value of the constant could be changed, in the style of [887-927] in procedure constant. Then the code generated for, say,
a:=-1
would be
LDCI -1 STRI 0 5
rather than
LDCI 1 NGI STRI 0 5
An expression is a simple expression, or two of them separated by a relop (in, <, <=, =, <>, >=, >).
[3026-8] If the simple expression is of a type small enough to be loaded then do so; otherwise only load its address. This latter case is normally for arrays and records, but here the only possibility is for strings.
[3030-2] If the operator was in, coerce the left hand side to integer.
[3033] Get the second simple expression.
[3034-6] Load it or its address, as necessary.
[3038-43] If the operator was in and
the operands are correct, generate an INN
instruction.
[3046-55] If the operand types are different and one is integer, float it.
[3056-88] Select the type field for the instruction.
[3065] For integer and enumerations.
[3068] Only = and <> allowed for pointers.
[3073] < and > are not allowed for sets.
[3075] Only character arrays may be compared.
[3081-7] Records and files may not be compared.
[3089-96] Generate the instruction.
[3098] The operands were not compatible.
[3100] The result is a boolean expression.
Code similar to
if gattr.typtr^.form <= power then
appears everywhere and is far from explicit. Far better would be
if gattr.typtr^.form in [scalar, subrange, pointer, power]
or since it appears often, using a variable
if gattr.typtr^.form in simpleforms
However, to simplify everything load may as well start
if gattr.typtr <> nil then if gattr.typtr^.form in [arrays, records] then loadaddress else ...
You may prefer
if not (lop in [eqop, neop]) then error(131)
as closer to the intent of [3068].
A suitably initialised array would make lines [3089-96] simpler:
gen2(oper[lop], ord(typind), lsize)
though lsize is not used elsewhere, and so could be expanded and eliminated:
gen2(oper[lop], ord(typind), leflr.typtr^.size)
This deals with indexing arrays, field selection of records, dereferencing pointers, and windowing a file, for example
a[i] r.f p^ input^
It is called after reading the initial identifier and doing a searchid for it (e.g., [2236] in procedure variable). Fcp contains the details of this identifier. The code generated for an array access like a[i] is something like
[Code to load the address of a] [Code to load the value of i make it integer if necessary] [Optional code to check iin bounds] DECI lmin Decrement ithe lower bound of a IXA size Calculate the address of the element
No special code is generated for field selection, dereferencing, or windowing, all of which are treated as accesses to variables, except for a field selection of a record which is only indirectly accessible. In this case the code to load the address of the record is followed by:
INCA offset increment the address with the offset of the field within the record.
[2089-126] Access the identifier.
[2093-6] Save the address of an actual variable.
[2098-100] Load a formal variable. Since it must be a
parameter to a routine, vlev must be greater than one, so
an LDO
instruction would never be possible.
[2101-12] Identifiers that are the result of a with statement. For example,
with r do f:=0
where f is a field of r. Disx is the value left over from the preceding call of searchid.
[2103-6] If it is a field of a directly accessible record, then just increment the address of the record by the offset of the field.
[2108-12] If the record is indirect, load its address
and save the offset of the field. The address of the record will have
been stored in a local temporary location. If it is at the outermost
level, it can be accessed by an LDO
instruction;
otherwise an LOD 0
instruction.
[2113] A function identifier is only a variable when assigning to it in the body of the function.
[2115] You cannot assign to a standard function.
[2118] Nor to a formal function.
[2120] Nor to a function, except within the body of that function. (So
function f: integer; procedure p; begin f:=0 end; begin p end;
is disallowed).
[2121-3] Calculate the address of the function result (address zero at this level).
[2131-74] Indexes.
[2133] This repeat deals with a list of indexes like a[i, j, k].
[2134-7] Check that it is an array that is being indexed.
[2139] Load the index.
[2141-4] Check that the index is a scalar, and
generate an ORD
instruction if it is not integer.
[2148] Check that the index type matches the bounds of the array.
[2151-8] Generate code to check the value of the index, and decrement it by the lower bound of the array.
[2161-4] Update the attributes of the expression.
[2165-70] Generate the IXA
instruction.
[2176-204] Field selections.
[2180-2] Check that it is a record that is being selected from.
[2184-7] An identifier must follow the period. Locate it as a field of the record.
[2188-9] Error if not located.
[2192-7] Update the type and offsets of gattr.
[2206] Pointer dereference and file window.
[2209-16] If it is a pointer, load it and optionally check its value.
[2218] A file window is just accessed like a variable.
[2219] It is neither a pointer nor a file.
Constant indexes to arrays, like
x[2]:=0
generate code like
LDA 0 5 Load address of x LDCI 2 Load constant 2, the index DECI 1 Subtract the lower bound of x IXA 2 Index x. 2 is the element size LDCI 0 Load the right hand side STOI Store the value
Ideally the code would be
LDCI 0 STRI 0 7
but this would be difficult, because the address of x has to be loaded before it is known if the index is constant. The best that can be done in the present circumstances is
LDA 0 5 INCA 2 Increment the address by 2 LDCI 0 STOI
Consider what would be necessary to effect this.
LDO
over an LOD 0
, that is between accessing
a variable as global or local, and therefore whether the if statement at [2109] serves any useful purpose.Examine the code at [2120-3] carefully. What do you think the reason is for the begin end round [2121-3]?
One possibility is that [2120] was an afterthought, and inserted carelessly.
The code at [2154-6] prevents a DEC
instruction from having a negative argument like
DECI -10
which would rather get produced as
INCI 10
even though the interpreter accepts negative arguments (and
instructions like LDCI
can have negative arguments).
As the comment at [2157] suggests, these three lines could be replaced by
gen1t(31 (*dec*), lmin, intptr)
CHK
generated at [2211] (maxaddr), is never used by the
interpreter. The first parameter value of 1 means that the pointer
value may not be nil.An IXA
instruction is nearly always
preceded by a DEC
(or INC
) instruction (the
only exception being when the lower bound is zero). If the format of
instructions were not so restrictive, and allowed potentially large
values for the P parameter, an instruction like
IXA 1 2
could be produced instead of
DECI l IXA 2
(Actually, CHK
produces large
values for P, but the assembler has to take special action for
this).
Indexing code could be improved. Consider
i:=a[i]
producing
LDA 0 9 LODI 0 7 DECI l IXA 2 INDI 0 STRI 0 7
Having produced the IXA
instruction, an indirect load instruction IND
has to be
used to get to the contents of the element
If array accessing were delayed a little, and
IXA
replaced by a typed instruction, say
INX
, then the above could be replaced by
LDA 0 9 LODI 0 7 DECI l INXI 0 STRI 0 7
When an address is really needed from an array
access, for instance when assigning to it, IXA
would be used.
Consider the following array
var a: array['0'.. '9'] of integer
and an access to it of
a['1']
The code for this would be something like:
LDA 0 5 LDCC 'l' ORDC CHKI 48 57 DECI 48 IXA 2
The mysterious 48's and the 57 are the integer values of the characters '0' and '9' as represented on the source machine, effectively preventing this code from being assembled on a machine with a different character set.
It seems astonishing that there should be the trouble taken to
generate the character literal in the LDC
instruction only to
produce integers in the CHK
and DEC
instructions.
One solution to this would be to alter the definition of CHKC
to
use character literals, and generate the following code:
LDA 0 5 LDCC '1' CHKC '09' ORDC LDCC '0' SBI IXA 2
Another solution would be to change the
definition of CHKC
and DECC
giving
LDA 0 5 LDC '1' CHKC '09' DECC '0' ORDC (possibly) IXA 2
(It could be argued that DECC
leaves an integer result not a
character one, and so the ORDC
would be unnecessary.) However,
changing DECC
like this would cause a problem for the standard
function pred which generates code for a character
DECC 1
where the parameter is definitely not a character. A possible
solution to this would be to introduce yet another instruction,
PRD
, say.
A third and possibly the easiest solution would
be to parameterise the compiler with the collating sequence of the
target machine, so that the integers produced for CHK
and
DEC
were those of the target machine. This would involve
an array
var ordchar: array [char] of minchar..maxchar;
where minchar and maxchar are the bounds of the target machine's character set. This would then have to be initialised:
ordchar['0']:=48; ordchar['1']:=49; ... etc.
but the only other change that would need to be made is in the lexical analyser, where a character constant is formed. Line [491] would then read
if lgth=1 then val.ival:=ordchar[string[1]]
There are certain places in Pascal where only constants are allowed. These are in constant declarations (of course), in subrange type specifications, the case labels of variant records and case statements, and in calls of new.
This procedure compiles such constants. Fsp returns the type of the constant, fvalu its value.
[872-84] Strings. See [2801 et seq.] for some similar code.
[886-924] Integers and reals
[887-91] Collect a sign if present, and set sign accordingly.
[892] The constant is an identifier declared in a constant declaration.
[894-5] Copy its attributes.
[897-8] If the identifier represents an integer constant, and there was a negative sign, negate the value of the constant.
[900-10] The identifier is a real constant. Unlike integers, the real value is held as a pointer, so a copy of the value has to be made, and the sign altered as necessary.
[912] A string may not be signed.
[916-9] Integer constant
[921-4] Real constant.
[926] Some other symbol.
This procedure is in a slightly odd position, and could well be placed later, say before [1025].
Pascal Implementation by Steven Pemberton and Martin Daniels
Copyright © 1982, 2002 Steven Pemberton and Martin Daniels, all rights reserved.