Compiled and interpreted languages

Published

Having followed the discussions in this thread for a while I have decided to throw my 2 cents in as well.

Beware, this will be long, but I hope that it is understandable even to people without a degree in computer science.


What is interpretation ?

To me it is the process of mapping (somehow encoded) instructions to the actual operations associated with them, thus giving them meaning. In this way interpretation and execution are the same thing.

Because of this every implementation of any language is interpreted, they only differ in the level at which the interpretation happens, or is most conveniently to look at by us.

Digression

It might have passed unnoticed, but I made a rather important distinction in the last sentence: Languages are not implementations, they have implementations. And properties like interpreted, compiled, etc. belong to the implementation, not to the implemented language.

I believe that some the arguments I saw and the confusion arising from it stem from the intermixing use of these two notions. I guess that this comes, partially at least, from the fact that most (all?) languages seem to have a preferred type of implementation, for example C interpreters exist, but are much less in number than C compilers.

So, if I say xxx language in the future it means that the usual implementation for that language has the property xxx.

I implicitly mentioned different levels of interpretation, but what are they ?

  • At the lowest level we have the hardware. Yes, a (micro)processor is an interpreter, as it associates machine language instructions (encoded in bytes) to operations like clear register, etc.

    This is usually the lowest level of interpretation we see in a computer system, but depending on the implementation of the processor not necessarily the lowest one existing.

    All microprogrammed processors have a small machine inside busily interpreting microinstructions (bit vectors) for the sole task of interpreting/executing the incoming machine instructions. And depending on the complexity they might even have a nanoprogram level (for example the M68xxx series).

    In the end every computation is reduced to this level, but often it is not very convenient to look that deep.

  • One step above we have virtual machines. These are basically interpreters implemented in software instead of hardware.

    They often employ simple machine languages, but not necessarily so.

    An implementation of a domain specific language like the one used by Brent Welch to describe forms via tcl procedures is a virtual machine too, executing form-instructions.

    A virtual machine capable of executing the language of a hardware processor is called an emulator. One application is the simulation of embedded systems on bigger machines, for example to test a new robot control application, and another the emulation of the hardware of long-gone game-consoles to allow the usage of their games even today.

    Back to the simple machine languages: The usual name for them is bytecode, due to the fact that many of them use single-byte instructions. Another thing to note is that most(all?) seem to implement stack-based processors instead of the (usually) register-based hardware processors. The easy compilation (see below) of higher languages into a language so structured seems to be the deciding factor.

There is nothing above virtual machines, just different techniques for implementing them.

Before I am coming to that I should explain the notion of compilation too.


The general notion behind the term is that of a converter, taking a program written in a language X as input and producing a program in another language Y as output. But not all converters are compilers; to be one, some restrictions have to be met.

An application like a cross-referencer is certainly no compiler although its output can be seen as a program in a simple language. The problem is that the output does not contain all of the information which was in its input.

The output of a compiler has to be semantically equivalent to its input. That means, if we execute input and output (remember, both are programs!) on some data D both have to produce the same results.

This definition is a bit broader than a programmer might see it, as it classifies pretty-printers as compilers too, but that is fine with me, they simply belong to the special class of compilers translating from a language X to X itself.


Ok, now we can look at the possible approaches to implement virtual machines.

  1. The interpreting software reads the program in the language to execute, parsing and executing it along the way. During iterations of language elements the input is parsed multiple times.

    This is an immediate interpreter and seems to be usually meant or implied if a language is called interpreted.

  2. Two-stage interpretation:

    In the first stage the input is read completely, compiled into a lower-level language and then executed by the second stage, which is an immediate interpreter for the low-level language.

    This often uses a bytecode as the low-level language, thus reducing/eliminating the parsing overhead for the 2nd stage interpreter. And as the complex and costly parsing of the high-level language now happens only once the combined system is usually faster than an immediate interpreter for the high-level language itself.

    Please note that this implementation uses a compiler, but it is implicitly called by the environment and not seen by the user. Terms used for this are on-the-fly compilation or just-in-time compilation (See comments about Java later on).

  3. Compiler based

    The user has to explicitly invoke a compiler to translate his program in the high-level language into a low-level language. The resulting output is then executed, either on hardware or by a bytecode interpreter.

Both (1) and (2) allow the construction of interactive shells wrapping around the core engine, reading and executing user input as it is typed, with immediate feedback of the results.


An explanation of the term scripting language is still missing. Now that I laid the foundations in the text above I will work on this one.

Many people seem to believe that scripting and (immediate) interpretation are synonyms. I disagree, but have to digress a bit (again) to explain that. Two general notes on this topic before that:

  • The definition of the categories above are based on language implementations, and therefore rather strict. scripting, and its counterpart programming are much more diffuse and more a question of how to use a language, i.e. the way a programmer works with it and for what tasks.

  • Whenever I encountered discussions about scripting vs. programming I had the feeling that some of the participants believe that scripting languages are inferior to programming languages and not to be taken as serious languages, that they are more for hacking on a problem rather than real programming.

    I have to disagree with that opinion. To me a scripting language is as serious as any other programming language. It is just that they have different problem domains to apply them to.

On to the digression. I will talk about various computer languages, their implementations and wether I consider them as a scripting language or not. After that section I will then try to describe the major characteristics of scripting as I see them. This should make clear why the boundary between the scripting / programming is a fuzzy one too.

  • Basic

    is one of the earliest languages, and still alive. The earliest implementations were immediate interpreters, but modern implementations at least use tokenization to speed up execution. I don't know enough about the exact implementations to call it a byte-code and 2-stage interpreter, but it definitely is a precursor for such techniques.

    The only variant I am sure about that it can be compiled is Visual Basic from M$. Another important feature is its extendability via VBX'es and the successor technology OCX / (Radio)ActiveX.

    I am not sure wether the VB-IDE allows interactive development of applications. If yes it would be a scripting language (explanation later, be patient).

  • Pascal, Modula, Algol, C, C++, Ada, FORTRAN, Eiffel, Java.

    The main set of programming languages. They are usually compiled.

  • C

    I have heard that an interpreter exists, but cannot confirm it.

  • Pascal

    The UCSD-Variant of this language was compiled into P-code, a bytecode-language, and then interpreted. Due to the explicit compilation this is no 2-stage interpreter.

  • Java

    This language uses the same technology as did UCSD-Pascal, explicit compilation into a portable bytecode. The so called just-in-time-compilation is basically the same technology which is used by 2-stage interpreters, but one level deeper:

    2-stage Interpreter: Highlevel-Lang -(on demand) -> Bytecode.

    JIT: Bytecode -(on demand) -> Maschinecode.

    AFAIK Sun is developing hardware for the immediate and fast execution of Java-bytecodes.

  • SmallTalk

    One of the first OO-languages. Very interactive, unfortunately a closed system. Not a scripting language.

    Standard implementation technique is a 2-stage interpreter, but I read in the Handbook of programming languages about a commercial variant using JIT-technology to improve performance. This predates its use by Java, AFAIK.

    The earliest implementations were run on hardware, microprogrammed processors who interpreted the then bytecode language immediately (the then because I believe that the bytecode of SmallTalk varies between the first implementations and the standard SmallTalk-80).

  • Lisp

    Another language from the beginnings of the computer era with a very simple, some say non-existent syntax.

    The earliest implementations were immediate interpreters for the list-structures coming from the input reader, but modern ones use 2-stage interpreters and bytecodes.

    There were even machines executing such bytecodes immediately on hardware (TI-Explorer, Symbolics) but they did not survive in the market.

    Examples of this family are Commmon Lisp, Scheme, Emacs Lisp and Guile.

    AFAIK Guile from the GNU project is the first variant (itself a descendant of Scheme) containing extendability features, like loading of shared libraries.

    I see Guile as scripting language, but not the other variants.

    Remark: Thinking back to the introduction the existence of hard- and software implementations for SmallTalk, Lisp and Java shows very clearly how movable the boundary between these two really is and why a comp.scientist sees every execution as an interpretation of some kind.

  • sh, bash, ksh, csh, tcsh, ...

    The bourne shell and its descendants, all implement basically the same language with a simple syntax. All implementations are immediate interpreters and feature interactive shells.

    Basic features for extension: Execution of arbitrary commands, pipes and I/O direction for building more complex applications from basic building blocks.

    All are scripting languages.

  • Tcl

    Before version 8 this language was implemented with an immediate interpreter, but version 8 and higher use bytecodes and a 2-stage interpreter.

    ICEM CFD sells a Tcl2C compiler, thus essentialy providing a compiled implementation. Jacl is a tcl interpreter written in Java, giving us a nice example for the concept of stacking virtual machines on each other: The tcl interpreter is a virtual machine, as it is written in Java it is executed by the Java-VM, which in turn is machine language, interpreted by hardware. SICP contains more nice examples for this type of construction.

    Extensible through facilities to call arbitrary external commands, like in shells, and through loading of shared libraries. Offers an interactive shell.

    A simple grammar, a streamlined (simpler quoting model) and extended (more quoting possibilities) version of sh-syntax.

    Definitely a scripting language.

  • Perl, Python

    Two more scripting languages, both implemented by 2-stage interpreters, from the very start. Their grammars are more complex than the ones of Lisp, Tcl and sh, but less than, say C (At least for Python ;-).

    Extensible like Tcl.

  • Python

    When I heard first from JPython I thought it to be something like Jacl, a Python interpreter implemented in Java. Newer information suggests that it is more of a compiler from Python to Java-Bytecodes, skipping the Python-bytecode language entirely.

So, looking back at that long list of languages, what caused me to label some of them as scripting languages but not others ?

Well, the first important point is interactivity, the ability to play with an application while it is in development (after), to try and compare different possibilities for its implementation in a very short time, to develop it incrementally, starting with a simple implementation which is then refined.

This is a very different style of application development than is used with programming languages whose larger turnaround times in the edit-run/test-edit cycle make such playing around difficult. It is maybe even (part of) the reason why they are sometimes seen as inferior or hackish (see general note 2 before the language section).

Please note that an interactive shell is only one way to provide this element. Neither Perl nor Python feature such a shell, yet they are fast enough to create a good enough approximation of a shell to allow the incremental style of development.

A trade-off can be seen too. All of the languages with an interactive shell (tcl, sh, guile) have a much simpler grammar than the languages without (perl, python). IMO this is because for interactive shells we users have be able to type statements of the language at the shell prompt without thinking too much ahead and/or about syntactical structure, as it might be necessary in full-blown languages with complex grammars. To write an application in a scripting languages without shell we need an editor, thus are not constrained by its basically single-line input.

A universal point is the existence of fast interpreters for the language.

The second thing required for me to consider a language as scripting is easy extensibility, allowing it to act as glue between highly diverse components. That is the reason why I said that Guile is a scripting language, but not the other Lisp-variants. The same is true for M$-VisualBasic, it is a scripting language if and only if the IDE allows interactive development.

I concede that the extensibility features of sh are at the low end of the spectrum but the near universal use of shell-scripts to administrate and control unix systems shows that even they are powerful enough for many and complex tasks.