Incremental compilation

I have a strong feeling that we’re doing this whole incremental compilation thing wrong. I have hands-on experience with the Scala compiler (which is known to be one of the badly written ones), but I think this could be applied to probably all languages.

The starting observation is pretty simple: we compile code much more frequently in our IDEs while we’re developing (i.e. continuously modifying the code) than have a fixed point of the code base to generate binaries. And yet, most compilers known to me are designed and written for the latter, and we later tried to force and trick them to work for the first case. There’s a Hungarian proverb that matches the situation: sitting on the horse backwards. I know, I know. This originates in the old days when we didn’t have IDEs and really only the monolithic compilation existed. A lot has changed since most people write their code in vi and compile it separately though. It’s also known that writing compilers is really complex and hard. Why don’t we make it potentially more complex then? :) Let me pitch you the idea.

The current way of things is that we have the compiler that can be invoked on a bunch of source files. It parses those files, builds ASTs and runs complicated algorithms to check if the text has proper syntax and is semantically correct – whatever that means in the language. After this happens, it generates lower-level representation of the text that can be interpreted (in the meaning of understood, not in the CS ‘interpreted’) by either a virtual machine or a physical one. While a developer is using an IDE (Integrated Development Environment) they modify the text in one or more of those files. When the incremental compilation is started -e.g. the developer started a test- the compiler tries to determine the changes and how it affects the dependencies of the modified text. This information is used to invoke the compiler on the minimum number of entities that must be recompiled (entities here can refer to source files, classes or anything that makes sense in the given language). In the case of languages that have a high level of expressiveness -including Scala-, determining the effects and the dependent entities is really complicated. To be on the safe side we usually end up recompiling entities that might not necessarily have to be recompiled. We also need to keep track of the mapping between the original text and the compiler’s internal representation so we can point out errors we have found. IDEs tend to support other useful features such as syntax highlighting, code navigation and code completion. They need to have some idea of what the text actually means to support these features, so they usually have some kind of representation of the text in an object model. This representation can be really close to what the compiler uses – in some cases they don’t need all the information though, letting them dismiss some minor details.

Instead of going through all these hurdles to support the developer write the code, we should let them edit the AST directly. By this I don’t mean having a WYSIWYG graphical editor to build the AST, but all modifications of the text should be mapped to direct modifications of the AST. Since we need to have the representation of the code in the memory and we need to keep the mapping between the text and the compiler’s internal data structures we should have everything we need to achieve this. We could potentially support operations in the compiler that can be easily mapped to the possible textual modifications. Let me give you a few examples in Scala.
Every project started with an empty source file. When the first package is created, that can be fed to the compiler as a package created command. The same applies to classes, methods, variables and basically everything else when they are first created.
When the developer modifies the type of a parameter to a method, we know we need to rerun the typer and check all uses of the method. If the modification happens in the in-memory AST, we should be able to easily trickle down the effects and regenerate only the sub-trees of the AST that we need. We shouldn’t need to try to figure out these on differences between the text if we modified the AST directly.
There are some types of changes that wouldn’t “redraw” the representation, e.g. moving things around in the file or adding a comment would only need to update the position information.
I believe mapping all possible text modifications to compiler commands shouldn’t be too hard; I haven’t gone through all of them though. It would require major compiler redesign, of course.

This approach could also help implementing the aforementioned IDE features: it would give us immediate error reporting for free, update the data for code completion and syntax highlighting, plus it could potentially support more robust refactoring tools.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s