Making sense of source code
Context-specific analysis to manage and mitigate complexity
Software is all around us and becoming ever more important. At the same time, the specialists who create, maintain and extend all this source code are having trouble managing its exponentially increasing complexity, especially after the initial development phase. As a consequence, the software that we use has both lower quality and higher cost than is desirable, and this generally becomes worse as software evolves over time.
The solution to this problem has come from the programmers’ world itself: meta-software, i.e. software (tools) to manage software (under development). Making sense of an existing code base, however, is far more difficult than supporting programmers working in a green field. Analyzing source code in the reverse engineering direction requires its original context in order to trace back its meaning and purpose. Context-specific analysis can help us in understanding software complexity and finding new ways to prevent, manage and mitigate it.
Most of today’s innovation takes place in software, and the prominence of the role of software increases every day. Its impact is profound in technology, but also strongly felt in our society and our personal environments. Sometimes we can see it directly; most of the time it plays its part more discreetly, hidden in the background. Over recent decades, software has become an omnipresent, pervasive and dominant force in our daily lives, and this development is still far from its end.
The sheer volume of software is growing rapidly, due to both strong demand for new applications and extensions to existing systems. At the same time its complexity has been increasing exponentially. Software’s complexity naturally grows much faster than its volume. This is mainly due to lack of time to reconsider the internal design of software so as to fit evolving requirements or remove deprecated functionality. Other factors add even more dimensions of complexity to our code bases. These include the distributed character of today’s software systems, and the freedom and expressive power of modern programming languages and tooling.
The tools to support the forward engineering direction of the software development process are reasonably well understood from a scientific point of view, and form part of every modern development environment. As they are being developed alongside new programming concepts and languages, they allow programmers to build increasingly complex software systems in a highly efficient and well managed way. Provided, that is, that the programmers are working in a green field.
Generally, however, software is read far more often than it is written. During the lifetime of a typical program, bugs need to be found and fixed, some components are exchanged for others, new functionality needs to be incorporated, and – hopefully – existing code can be reused in other packages. At this point software development becomes an evolutionary process, requiring a very different kind of tooling to analyze the intricacies of an organically growing code base.
Unfortunately, the tools to support this reverse engineering direction are far more difficult to develop than those for the forward direction. To provide information at a higher level of abstraction, they need to extract the meaning and purpose of the source code, qualities that mostly existed only in the heads of the original programmers. These people worked their way through the requirements to understand the what, how and why of the program; they interacted with customers, users and colleagues about parts that needed further clarification; and they were probably busy getting things done rather than meticulously logging every administrative detail of the process. In general, reading the resulting code and making sense of it becomes harder and harder as the software changes over time and its original context fades into history.
It should come as no surprise that unmanageable complexity leads to errors, which lead to extra costs at best and accidents at worst.
The solution to this problem is to incorporate knowledge of a very different nature from what we use now into the analysis phase of our reverse engineering tools. This includes technical information on the interfaces and coding standards used, as well as business information on terminology and processes. This additional input is expected to be specific to a particular sector or industry, or even to an individual organization or development process. The result will be an advanced type of software analysis tooling that is domain-specific, or context-specific.
This new direction in the development of tools for reverse engineering analysis is still in its infancy, however. It requires a lot of fundamental research before its benefits can be reaped. So I propose context-specific analysis as an important research area, with the explicit goal of understanding software complexity and finding new ways to prevent, manage and mitigate it.
What we are trying to accomplish is to clearly separate the reusable “heavy lifting” in analysis tools from the hopefully lightweight specializations needed for specific contexts. To do this, we first transform source code into reusable abstract objects and relations, and then analyze these models in conjunction with context-specific information. Examples of the latter are knowledge about which specific APIs, platforms and coding standards are being used, along with the professional terminology and idioms.
That requires us to link up to particular sectors and industries, which often have this type of knowledge available in a formalized or semi-formalized form. Think, for example, of domain-specific languages (DSLs), which provide specialized features for particular domains, and UML (Unified Modeling Language) models representing software systems, workflows and business processes. So reaching out to specific sectors, finding industry partners, and talking to experts are important parts of this quest.
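To make the two-step idea described above more concrete, here is a minimal sketch. It is written in Python purely for illustration and is not taken from Rascal or from our own tooling. The first step turns source text into a context-free relation of function calls; the second step combines that relation with a small table of organization-specific knowledge. All names in it (`Call`, `extract_calls`, `CONTEXT_RULES`, `legacy_db`, `db_pool`) are hypothetical, and the extraction is deliberately simplified: calls inside nested functions are also attributed to the enclosing function.

```python
import ast
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Call:
    """One tuple of the 'calls' relation: caller, callee and source line."""
    caller: str
    callee: str
    line: int


def dotted_name(node: ast.AST) -> Optional[str]:
    """Return 'a.b.c' for simple name/attribute chains, otherwise None."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = dotted_name(node.value)
        return f"{base}.{node.attr}" if base is not None else None
    return None


def extract_calls(source: str) -> Set[Call]:
    """Reusable step: source text -> abstract call relation, no context needed."""
    calls = set()
    for func in ast.walk(ast.parse(source)):
        if isinstance(func, ast.FunctionDef):
            for node in ast.walk(func):
                if isinstance(node, ast.Call):
                    callee = dotted_name(node.func)
                    if callee is not None:
                        calls.add(Call(func.name, callee, node.lineno))
    return calls


# Context-specific step: knowledge that only one (hypothetical) organization has.
CONTEXT_RULES: Dict[str, str] = {
    "legacy_db.connect": "use db_pool.acquire instead",
}


def check(calls: Set[Call], rules: Dict[str, str]) -> List[Tuple[int, str, str, str]]:
    """Join the generic call relation with the context-specific rules."""
    return sorted(
        (c.line, c.caller, c.callee, rules[c.callee])
        for c in calls
        if c.callee in rules
    )


SAMPLE = '''
def load_orders(customer):
    conn = legacy_db.connect("orders")
    return conn.query(customer)

def main():
    print(load_orders(42))
'''

findings = check(extract_calls(SAMPLE), CONTEXT_RULES)
for line, caller, callee, advice in findings:
    print(f"line {line}: {caller} calls {callee}; {advice}")
```

The point of the split is that `extract_calls` can be reused unchanged across projects, while `CONTEXT_RULES` is the only part that has to be written for a particular sector, organization or coding standard.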
Rascal to the rescue
For our research we rely heavily on Rascal (http://www.rascal-mpl.org), a meta-programming system aimed at code analysis, code transformation (e.g. refactoring), and the implementation of domain-specific languages. As a matter of fact, Rascal itself is a domain-specific language catering to meta-programmers.
Reading source code and transforming this into abstractions for analysis can already be done for Java and PHP. We are currently starting on the connectors for C and C++. Developing these connectors has now become an engineering problem rather than a scientific feat.
The main research questions lie around the analysis phase. How do we incorporate context-specific knowledge in our tools, and how do we process the resulting information? At the other end we need to find out how to query these abstractions in a smart and fast way. The latter is a scientific challenge in itself.
What drives us in this direction is the reward this scientific research will hopefully bring. Context-specific analysis will allow us to see not only what the source code says but also what it means. It will let us ask context-specific questions. Since more specific questions lead to more specific answers, those answers will be more relevant to the software engineering experts – and their managers – than current state-of-the-art software tools can provide.
All in all, context-specific analysis is expected to facilitate higher-quality software at lower cost, especially after the initial development phase. Naturally, to get there we need collaboration and an extensive exchange of knowledge between software researchers and software engineers. Making sense of software is eminently a multidisciplinary affair.