Jurgen J. Vinju Blog

“Code”, or “source code”, is a term used in plain language for the texts that programmers write and read in programming languages. There exist thousands of programming languages, but only a dozen of them are popular beyond a few hundred users. Some of them are used by hundreds of thousands of people. In a manner of speaking, computers also “read” these same languages and they can execute the instructions explained in code. Code is first written by people and then later, many times over, executed very efficiently by computer hardware; that seems to be the point of software. It takes time to learn a language. As with a natural language, if you already know one programming language, learning the next one may take around two weeks to become a fluttering novice and around ten years to reach utter fluency as a master of the language.

If you do not know what these programming languages are like at all, then you might think that some kind of message is being transmitted in some kind of “coded” fashion. And as a result you would also conclude that reading this code requires a “decoding” of some kind, like some kind of secret which must be unlocked. The word “code” seems to suggest that programming is some kind of alienating experience. That conclusion is wrong, and that image of programmers working on secrets, tediously writing and reading numerical code, is harmful. That image, however, is constantly projected by the mainstream media and the entertainment industry.

Understanding what programmers really do is important, because software influences our daily lives with increasing scope and intensity. In fact, today it is quite hard to make a list of things we do and experience which are not influenced by some sort of software. Since programmers write all this software, they are accountable and we all must engage in conversation with them. Understanding what “code” is and what “coding” entails paves the way for a good conversation.

The word “code” does not in the least mean the message is intentionally garbled

The term was chosen to indicate that for computers to consume and interpret software, they have to associate numbers with actions. Today’s computer hardware simply cannot process anything but numbers. Hence they are called “digital computers”; they are in a way counting extremely quickly on their abundantly many fingers. When we talk about “the computer”, namely the core thing without any of its peripheral input and output devices, that is all it can do.

To make computers process the instructions we want to give them, these instructions must be numbers. For example, the instruction code could be that, for a specific model of computer hardware, 1 means to copy a number to elsewhere, and 2 means to multiply two numbers, 3 means to jump ahead to another instruction, etc. Since digital computer hardware is (accidentally) binary -a transistor switch is either on (1) or off (0)- the codes do look like 1 for 1, 10 for 2 and 11 for 3. The entire set of codes is called the computer’s “instruction set”. Computers have dozens to hundreds of instructions in their instruction set, depending on their particular model.
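As a rough illustration of such an instruction set, here is a minimal sketch of an imaginary machine, written in Python. The codes 1 (copy), 2 (multiply) and 3 (jump) follow the example above. The code 0 for "stop", the layout of four numbers per instruction, and the names `run`, `program` and `memory` are assumptions made only for this sketch; real hardware is organised differently.

```python
def run(program, memory):
    """Execute a list of numbers as instructions for an imaginary machine.

    Every instruction takes four numbers: an operation code and three operands.
    """
    pc = 0  # position of the next instruction in the program
    while pc < len(program):
        op, a, b, c = program[pc:pc + 4]
        if op == 0:      # stop (assumed code, not taken from the text)
            break
        elif op == 1:    # copy the number in cell a to cell b
            memory[b] = memory[a]
            pc += 4
        elif op == 2:    # multiply cells a and b, store the result in cell c
            memory[c] = memory[a] * memory[b]
            pc += 4
        elif op == 3:    # jump ahead to the instruction at position a
            pc = a
        else:
            raise ValueError(f"unknown operation code {op}")
    return memory


# The whole program is nothing but numbers.
program = [
    1, 0, 2, 0,   # copy cell 0 to cell 2
    3, 12, 0, 0,  # jump ahead to position 12, skipping the next instruction
    1, 1, 2, 0,   # (skipped) copy cell 1 to cell 2
    2, 0, 2, 3,   # multiply cell 0 by cell 2, store the result in cell 3
    0, 0, 0, 0,   # stop
]
print(run(program, [6, 7, 0, 0]))  # [6, 7, 6, 36]
```

Without the comments, the program is an unreadable row of numbers, which is exactly why programmers do not write software this way.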

Programmers never have to type in instruction numbers

Indeed very few -a few dozen perhaps- ever did write binary code. This idea of programmers thinking, reading and writing coded numbers is prevalent in movies or journalistic reports on software. The image of programmers speaking in a freakish binary tongue makes them look outlandish. Illustrating software by showing a stock photo of “The Matrix” makes programmers look like science fiction characters, instead of normal people. These associations can be functional and exciting from the movie theater perspective, but they are without merit in the real world of software engineering. Journalists should certainly avoid them.

For software to be “good”, in any sense of the label, its designers and its users have to come as close together as possible; they have to understand each other as well as humanly possible. A proper conceptual image of what a programmer actually does helps a lot in this regard. When they read and write code, programmers are communicating with computers, but more importantly they are communicating with people. Programmers are trying to communicate as clearly as possible.

Programmers write readable texts in understandable programming languages

Programmers read and write in so-called “high level programming languages”, whose expressions humans can normally understand quite easily. For example:

if (TodayIsMonday) then print("Hello new week!");

This is a typical statement in a typical high level programming language. In fact, it is the goal of professional software engineers to express their intent using these languages in as clear and concise a manner as humanly possible. That is why programmers also use the spacing of the text to structure their expressions even more clearly. This is much like the use of typographical elements in natural written language, such as bullet points, aligned margins and the indentation of paragraphs. For example, to make clear that the above print statement is only executed conditionally we actually write it under and slightly to the right of the governing condition if (TodayIsMonday):

if (TodayIsMonday) then
   print("Hello new week!");

This programmer has also chosen the name TodayIsMonday, as opposed to the nonsensical Tmp13, to clearly indicate her intent. Good naming in source code, next to layout, is one of the prime means of expressing ourselves clearly as programmers. This example shows that professional programmers try everything in their power to make their writings clearer and certainly not less clear to other people. Programmers are much more like the manual writers of IKEA than they are like secret agents.

In summary:

Code is not encoded or secret or hard to grasp, if you speak the programming language it is written in.
Code is just text written in a language you have not learned yet. These languages are limited in a fashion, in the sense that they are mostly suited to expressing commands which can be executed by a computer, and the conditions under which to execute each command. It is as if English were restricted to what you can write in a cookbook recipe; still vast, offering unlimited creative options, yet restricted to a domain and the typical jargon of that domain.
Programmers are just people who can understand others, and express themselves, using a different kind of language.
Programmers can be asked to explain a piece of code and they should be able to express their intent to you in natural language.
Learning a programming language can be a liberating experience, a new way to express yourself and a new way to create beautiful things; things which interact with the people you care about in a profound manner. This is how programmers feel about code when they are blissfully “in the zone”; code is their creative medium. When creativity is not flowing, the other side of this coin appears: frustration.

Three more important things to understand about programmers and code

1. How do programming languages work? Why is it called “source” code?

For a computer to process the instructions written in an actual programming language, a translation down to the original operation codes is necessary. This fully automated translation is software itself; it is called an “interpreter” or a “compiler”:

Interpreters directly execute the expressions as operation codes. An interpreter is like a translator signing the news in sign language for deaf people as it is presented by the news reader.
Compilers first translate the expressions “down” to numeric codes which can then be executed later. A compiler is like a person translating a pictorial IKEA manual down to step-by-step instructions in written English. Programmers usually do not read the output of a compiler. The compiler is a trusted agent which faithfully and literally translates the commands of the higher level language to the lower level language which can be executed by the computer directly. An exception must be made for the language designers who need to dig into all language levels at the same time.
Many of today’s programming languages are first translated by a compiler to some other intermediate programming language, which can then again be either interpreted or compiled further towards computer instructions. Intermediate languages have been introduced for economic reasons. By sharing intermediate languages, we can share the often intricate translator software among different computer hardware designs, and among different designs of high-level programming languages. The intermediate levels make it extremely hard for a programmer to predict the quality of the resulting software; not just her code but also all intermediate translators influence this quality. Although these translators may faithfully reflect the computational effect of her code, they also arbitrarily influence other aspects such as efficiency and security.
Why “source” code? The translation of computer programs down to an executable format for the machine is the reason for the adjective “source” in “source code”. The text is the source of the translation towards the “target” binary form.
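The difference between the two kinds of translation can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any real language is implemented: the "source" is a small arithmetic expression written as nested tuples, the "target" is a list of numbers for an imaginary stack machine, and the numeric codes `PUSH`, `ADD` and `MUL` as well as the function names are invented for this example.

```python
PUSH, ADD, MUL = 1, 2, 3  # assumed numeric codes of an imaginary stack machine


def interpret(expr):
    """Interpreter: walk the source expression and compute the result right away."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    l, r = interpret(left), interpret(right)
    return l + r if op == "add" else l * r


def compile_expr(expr):
    """Compiler: translate the source expression to numeric codes, to be run later."""
    if isinstance(expr, int):
        return [PUSH, expr]
    op, left, right = expr
    return compile_expr(left) + compile_expr(right) + [ADD if op == "add" else MUL]


def execute(code):
    """The 'machine': run the numeric target code produced by the compiler."""
    stack, pc = [], 0
    while pc < len(code):
        if code[pc] == PUSH:
            stack.append(code[pc + 1])
            pc += 2
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(left + right if code[pc] == ADD else left * right)
            pc += 1
    return stack.pop()


source = ("add", 2, ("mul", 3, 4))  # source code for 2 + 3 * 4
target = compile_expr(source)
print(target)             # [1, 2, 1, 3, 1, 4, 3, 2]
print(interpret(source))  # 14
print(execute(target))    # 14
```

Both routes give the same answer. The interpreter needs the source every time it runs, while the compiled list of numbers can be stored and executed later without the source at hand.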

2. Still, why is programming then called “coding” often? What is “coding” anyway?

From a bird’s-eye perspective, programmers do indeed translate expressions in natural languages and expressions in pictorial form towards executable instructions for the machines in programming languages. This design and writing process is sometimes called “coding” colloquially. It is not, however, “encoding”. What programmers do do during this so-called “translation” is learning, designing, reading, writing, refactoring, debugging, testing and demonstrating:

Learn: listen and watch, practise and try, mirror and provide feedback, all to understand from the software’s stakeholders what the goal of the software should be, and what level of quality (safety, robustness, efficiency, privacy, flexibility) is expected. The goal is to formulate precise problems, features and requirements as a result of this learning process. Sometimes we “draw” (mock up, write) prototypes or models to improve the quality of the communication.
Design: model, split up, architect a solution which would hypothetically implement the software. This happens on increasing levels of detail. We talk and draw a lot during this process. Sometimes we make small working prototypes during this process as well.
Read: read source code and documentation of existing software, in order to be able to reuse them or fix them. Learning about existing software is a (very) big part of the activities of today’s programmers. In order to read source code, programmers use code browsers -dubbed IDEs. These browsers show code like webpages, with search facilities and links included. Such a browsing tool helps programmers to find their way without having to read all of the source code.
Write: literally write pieces of the design as source code.
Refactor: revise source code for the purpose of improving clarity or flexibility without changing its observable behavior when executed on a computer. This is like revising and rewriting an existing IKEA manual to avoid reported misunderstandings.
Test: write and run test code which exercises isolated parts of the solution source code and checks its input/output behavior. We do this to find out if mistakes were made which have to be fixed first (known as debugging). Some programmers first write tests and then write the solution to help make sure they write nothing unnecessary and produce nothing which is not tested.
Debug: a complex human activity which mostly involves searching the source code for the cause of an observed failure of the software system and, once it is found, fixing the error which caused the failure, preventing the failure from occurring again.
Demonstrate: try it all out and show the running software to ourselves and others to find out whether it was right or wrong, and then go “back to the drawing board” of course :-)
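Two of these activities, refactoring and testing, are easy to show in miniature. The following Python sketch is only an illustration: the functions `f` and `today_is_monday` and the chosen dates are made up for this example, echoing the TodayIsMonday condition used earlier.

```python
import datetime


def f(d):
    # Before refactoring: correct, but the names say nothing about intent.
    return d.weekday() == 0


def today_is_monday(today):
    """After refactoring: same observable behavior, clearer names."""
    monday = 0  # datetime.date.weekday() counts Monday as 0
    return today.weekday() == monday


def test_today_is_monday():
    monday = datetime.date(2024, 1, 1)   # 1 January 2024 was a Monday
    tuesday = datetime.date(2024, 1, 2)
    assert today_is_monday(monday)
    assert not today_is_monday(tuesday)
    # The refactored version must behave exactly like the original.
    for day in (monday, tuesday):
        assert today_is_monday(day) == f(day)


test_today_is_monday()
print("all tests passed")
```

The test checks input/output behavior only for the two days it tries. It raises confidence, but it says nothing about inputs that were never tested.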

3. Why are simple questions about what particular software actually does often evaded or even lied about?

Although the intended purpose and the design considerations of source code can be clarified by an author or an analyst, predicting the effects of a piece of source code, when it will be executed in arbitrary circumstances and provided with arbitrary input, is an entirely different matter. The effects of all but the most trivial source code are emergent.

To understand why the effects of source code are emergent, it is sufficient to understand that source code explains only to a computer which trivial instruction to execute next, one-by-one, and in which local circumstance. The code need not explain to the computer why these steps are taken. Nor is it required that the code explains to the computer where the path eventually leads. This means that code does not literally contain the answer to the questions: “what will this code do when executed?”, and “why does this code behave like this?” The code simply generates the behavior in a step-by-step fashion when executed by a computer, and nothing else. The conclusion is that information external to the code will always be required to answer these two relevant questions.

For example, this code says that if a-squared plus b-squared equals c-squared, for any numbers named a, b and c, the computer should print “hurray!” and otherwise do nothing for all the other numbers.

   if (square(a) + square(b) == square(c)) then
       print("hurray!"); 

This code works fine for some intended purpose, but it neither makes explicit that a, b and c are the three sides of a triangle, nor that the equation happens to be true only for right-angled triangles. To predict with some level of confidence when “hurray!” will be printed and when it will not be printed, is to understand Pythagoras’ theorem.

Next we can change this example to use an arbitrary fixed exponent n, instead of the square operation:

   if (exp(a, n) + exp(b, n) == exp(c, n)) then
       print("hurray!");

Suddenly we need an understanding of Fermat’s last theorem to predict the effect of the code: for whole numbers a, b and c greater than zero and an exponent n greater than 2, the equation never holds, so “hurray!” is never printed. That should be cause for alarm; understanding Fermat’s last theorem is an entirely different piece of cake from understanding Pythagoras’ theorem. We went from basic but fundamental high-school mathematics to the deepest and least comprehensible mathematical insights via a simple code edit.
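The contrast can be tried out empirically. The Python sketch below is a hypothetical helper, not part of the original example: the function `hurrays` and its `limit` parameter are invented here, and it only considers positive whole numbers with a <= b < c, which the pseudocode above does not state. It uses exact integer arithmetic so that rounding cannot blur the comparison.

```python
def hurrays(n, limit):
    """List every (a, b, c) with 1 <= a <= b < c <= limit for which
    the condition a^n + b^n == c^n holds, so "hurray!" would be printed."""
    return [
        (a, b, c)
        for a in range(1, limit + 1)
        for b in range(a, limit + 1)
        for c in range(b + 1, limit + 1)
        if a**n + b**n == c**n
    ]


print(hurrays(2, 20))  # six triples, starting with (3, 4, 5)
print(hurrays(3, 20))  # []
```

Running the code for small numbers only shows what happens for those numbers. That the second list stays empty for every limit and every exponent above 2 cannot be read from the code, nor established by running it; that knowledge comes from the mathematics outside the code.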

To predict the effects of source code and answer simple valid questions about it, deep and possibly inaccurate analysis is often required, using both theoretical and empirical research methods. Much like a novel can be read and discussed by anybody, source code can be read and explained to another person. But a deeper understanding of its qualitative attributes requires a critic to dig much deeper and much more abstractly, using the technical lore of the academic field of Literature. Perhaps answers to simple questions about the novel are never produced, or the answers are not 100% objective and accurate. Whether you have authored the code, or whether you are analyzing somebody else’s code, answering the question “what will this code do when executed?” inevitably requires serious study. Answering the “why?” question inevitably requires some serious soul searching.

If you would ask a professional programmer what she did to protect your privacy, she would explain to you pieces of source code which purposefully avoid data leakage and she might show you automated tests which assure that particular attack vectors for thieves have been rendered harmless. If you ask the same programmer whether or not the software prevents data leakage in all circumstances, the answer would be evasive or simply “no”. If you then ask her under which circumstances her code might leak data, her correct answer would be “I do not know”. The reason is that analysis of future behavior of code for all possible future contexts and all possible future inputs is hard. It’s not just hard in a boring or tedious way, requiring extensive concentration or patience for an extended period of time. No, it’s hard in a fundamental scientific way, requiring for every new piece of software new creative theory forming or proving hypotheses empirically using new research methods. In many cases it is even theoretically impossible to answer such questions.

Finally, source code can (and often will) grow to be inhumanly large. Simply reading the code then requires more time and energy than is available. This is a fact of life which is hard to accept, and to which we are all trying to find an answer as software engineering researchers. One solution direction is to make code simpler and shorter: to invent new languages. Programmers call this direction “Domain Specific Languages” and “Model Driven Engineering”. This direction helps like a metro map helps. The metro map is not a normal map because it does not show the actual distance and relative position of every metro station; it shows you only how to go from one place to another in an elegant and concise manner. The restriction to the domain and the context of the map reader allows the designer to come up with a language which is sufficient and which solves the problem.

Another solution direction is to automate code queries to be able to help answer questions about it without having to actually read it. This is more akin to diagnostic machinery in medicine. We invent new diagnostic apparatus that help the engineers to answer questions with acceptable accuracy and within acceptable time. Also in this direction, contextuality and domain knowledge can help to improve the tools we make. Like a telescope and a microscope are both made of lenses -but in a different configuration- software analysis tools are made for different tasks -but based on similar principles.
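A very small example of such an automated code query can be written with Python's standard `ast` module, which parses Python source text into a tree without running it. The snippet being analysed and the helper `called_functions` are made up for this sketch; real analysis tools answer far harder questions, but the principle of asking a tool instead of reading every line is the same.

```python
import ast

# A made-up piece of source code to query; it is parsed, never executed.
source = '''
if today_is_monday:
    print("Hello new week!")
save(report)
print("done")
'''


def called_functions(code):
    """Answer the question: which functions does this code call, and on which lines?"""
    tree = ast.parse(code)
    return sorted(
        (node.lineno, node.func.id)
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    )


print(called_functions(source))  # [(3, 'print'), (4, 'save'), (5, 'print')]
```

The query tells us where `save` is called without anyone reading the text. It does not tell us what `save` will do with the report when the code runs; that remains one of the hard questions.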

Knowing that answering questions about software can be truly hard, we must still ask programmers about their code and expect them to provide clear answers. If no clear answer is available, we can expect programmers to explain clearly why the answer would be costly or impossible to provide. After all, programmers are just people speaking a few weird languages more than you do. That does not make them aliens who are never to be trusted. It also does not make them gods who are beyond your judgment by definition. Something in the middle sounds just about right: programmers are people.