Many object identification methods work by grouping procedures based on the type of the arguments they process [34]. Unfortunately, COBOL is an untyped language, blocking this route for object identification purposes. To remedy this problem, we propose to infer types for COBOL variables based on their actual usage [22].
At first sight COBOL may appear to be a typed language. Every variable occurring in the statements of the procedure division must be declared in the data division first. A typical declaration may look as follows:
    01 TAB100.
       05 TAB100-POS    PIC X(01) OCCURS 40.
       05 TAB100-FILLED PIC S9(03) VALUE 0.
Here, three variables are declared. The first is TAB100, which is the name of a record consisting of two fields: TAB100-POS, which is a single character byte (picture "X") occurring 40 times, i.e., an array of length 40; and TAB100-FILLED, which is an integer (picture "9") comprising three bytes initialized with value zero.
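To make the byte layout of this declaration concrete, the following Python sketch models the two fields and computes the storage size of the record. The names `Field`, `picture_bytes` and `record_bytes` are hypothetical helpers introduced for illustration only. The sketch assumes the default DISPLAY usage (one byte per picture position, with the sign `S` taking no extra byte) and handles only pictures of the form `X(n)`, `9(n)` and `S9(n)`.

```python
import re
from dataclasses import dataclass


@dataclass
class Field:
    """One elementary item of a record (hypothetical model, not a COBOL parser)."""
    name: str
    picture: str      # e.g. "X(01)" or "S9(03)"
    occurs: int = 1   # OCCURS count; 1 means the field is not an array


def picture_bytes(picture: str) -> int:
    """Byte length of a simple picture, assuming DISPLAY usage.

    Only X(n), 9(n) and S9(n) are supported; the leading S is assumed
    to be stored in an existing digit position, so it adds no byte.
    """
    match = re.fullmatch(r"(X|S?9)\((\d+)\)", picture)
    if match is None:
        raise ValueError(f"unsupported picture: {picture}")
    return int(match.group(2))


def record_bytes(fields) -> int:
    """Total size of a record: each field's size times its OCCURS count."""
    return sum(picture_bytes(f.picture) * f.occurs for f in fields)


# The TAB100 record from the declaration above.
TAB100 = [
    Field("TAB100-POS", "X(01)", occurs=40),
    Field("TAB100-FILLED", "S9(03)"),
]

print(record_bytes(TAB100))
```

Under these assumptions the record occupies 40 bytes for the array plus 3 bytes for the counter.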
Unfortunately, the variable declarations in the data division suffer from a number of problems, making them unsuitable to fulfill the role of types. First of all, since it is not possible to separate type definitions from variable declarations, the full record construction needs to be repeated whenever two variables with the same record structure are needed. This violates the principle that the type hides the actual representation chosen.
Besides that, the absence of type definitions makes it difficult to group variables that represent the same kind of entities. Although it might well be possible that such variables have the same byte representation, the converse does not hold: One cannot conclude that whenever two variables share the same byte representation, they must represent the same kind of entity.
In addition to these important problems pertaining to type definitions, COBOL only has limited means to accurately indicate the allowed set of values for a variable (i.e., there are no ranges or enumeration types). Moreover, in COBOL, sections or paragraphs that are used as procedures are type-less, and have no explicit parameter declarations.
To solve these problems, we have proposed the use of type inference to find types for COBOL variables based on their actual usage [22]. We start from the situation in which every variable has its own unique primitive type. We then generate relations between these types based on how the variables are used: if variables are compared using some relational operator, we infer that they must belong to the same type; and if an expression is assigned to a variable, the type of the expression must be a subtype of that of the variable.
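As an illustration of these two inference rules, the following Python sketch uses a union-find structure to merge the types of compared variables and records a subtype pair for every assignment. It is a minimal sketch under stated assumptions, not the implementation of [22]: the class `TypeInference`, its method names, and the variable names in the example are all hypothetical, and expressions are simplified to single variables.

```python
class TypeInference:
    """Usage-based type inference over variable names (illustrative sketch)."""

    def __init__(self):
        self.parent = {}       # union-find forest: variable -> parent variable
        self.subtypes = set()  # recorded (source, target) assignment pairs

    def declare(self, var):
        # Initially every variable is the sole member of its own type.
        self.parent.setdefault(var, var)

    def find(self, var):
        """Return the representative of the type that var belongs to."""
        self.declare(var)
        root = var
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[var] != root:
            self.parent[var], var = root, self.parent[var]
        return root

    def compare(self, a, b):
        """a and b occur in one relational expression: merge their types."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def assign(self, target, source):
        """source is assigned to target: type(source) is a subtype of type(target)."""
        self.declare(target)
        self.declare(source)
        self.subtypes.add((source, target))

    def same_type(self, a, b):
        return self.find(a) == self.find(b)

    def subtype_relation(self):
        """Subtype pairs between type representatives, without trivial pairs."""
        pairs = {(self.find(s), self.find(t)) for s, t in self.subtypes}
        return {(s, t) for s, t in pairs if s != t}


# Hypothetical statements:
#   IF CUST-ID = ORDER-CUST-ID ...
#   IF ORDER-CUST-ID = WS-ID ...
#   MOVE WS-ID TO REPORT-FIELD
ti = TypeInference()
ti.compare("CUST-ID", "ORDER-CUST-ID")
ti.compare("ORDER-CUST-ID", "WS-ID")
ti.assign("REPORT-FIELD", "WS-ID")

print(ti.same_type("CUST-ID", "WS-ID"))         # merged through the comparisons
print(ti.same_type("CUST-ID", "REPORT-FIELD"))  # related only by subtyping
```

In this sketch a comparison merges two types symmetrically, whereas an assignment only records a directed subtype pair, so the target of an assignment is not forced into the same type as its source.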