This paper is about a layering paradigm for enterprise-scale software called a “code forest”. The paradigm is a tree-shaped database of the elements that compose the code base, with their full content and revision history. Developers edit the database rather than the file system. I will get into the details of what that means, but first I want to start with a list of problems that the approach improves upon.
Why we need code forests
In my examples I’m using C#-like code and terminology, but the concept is the same with Java or any language designed for enterprise-scale software. In this context “enterprise” means possibly millions of lines of code and multiple tiers with overlapping legacy and new products.
Some problems with today’s large code bases:
- Depending on the language, there are now three or more competing naming systems permeating the code base. These include the class names and optionally hierarchical namespaces of classes; names in the file system; and names of folders, projects and “solutions”. The only reason for the complication is the history of adding tools on top of other tools; it is not needed. A code forest has only one naming scheme.
- Developers are usually forced to deal with source files. The base unit in programming and compiling is traditionally the file, but it does not have to be that way. A source file is not a meaningful concept in the compiled product. A code forest allows you to work with named elements in the forest, not files.
- Developers are currently forced to deal with deployment considerations when writing classes. A code forest allows deployment decisions to be made completely separately.
- The skills and other team characteristics needed to manage a code base are different from the skills needed to write classes and methods. A code forest helps teams do forest management as distinct from code quality management.
- Documentation and understanding often decline as code bases get bigger. A code forest organizes code with documentation about code structure in the same place as the code itself.
- Unwanted dependencies and friend dependencies creep into code bases as teams get bigger. A code forest makes creating dependencies an action that you have to take explicitly, so there can be no dependencies creeping in accidentally.
- Source control is standard practice now, but technically it is an optional layer on top of a non-source-controlled system. It can be broken, avoided, worked around, misused, or poorly integrated with the development tool, leading to complications. A code forest is source control to begin with, so it is impossible not to use source control with it.
- Developers can spend a great deal of time recompiling “the universe” of code when only one line changed. The compiler is usually too unaware of the code layering to optimize away unnecessary work. A code forest allows compiling to be based on changes only.
- Visibility of class members is not flexible enough, with the options of public, private and protected members. Sometimes you need visibility in more complex ways, but we end up making too much public. Code forests control visibility exactly.
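The point about compiling only what changed can be sketched in a few lines. This is a minimal illustration in Python (chosen only because it is short and runnable), assuming the forest exposes its dependency map as a dictionary from an element name to the names it depends on. The function name `elements_to_recompile` and the sample element names are hypothetical, not part of any existing tool.

```python
from collections import defaultdict, deque


def elements_to_recompile(dependencies, changed):
    """Return the changed element plus everything that depends on it,
    directly or indirectly.

    `dependencies` maps an element name to the set of element names it
    depends on (an assumed stand-in for the forest's dependency map).
    """
    # Invert the map so we can ask "who uses this element?"
    dependents = defaultdict(set)
    for element, deps in dependencies.items():
        for dep in deps:
            dependents[dep].add(element)

    # Breadth-first walk outward from the changed element.
    affected = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for user in dependents[current]:
            if user not in affected:
                affected.add(user)
                queue.append(user)
    return affected


# Hypothetical dependency map for a tiny forest.
dependency_map = {
    "PersistentData.Person.IsSally": {"PersistentData.Person.Name"},
    "PersistentData.Team.Members": {"PersistentData.Person"},
    "UI.Person": {"PersistentData.Person"},
    "UI.ThisUser": {"UI.Person"},
}

print(sorted(elements_to_recompile(dependency_map, "PersistentData.Person.Name")))
```

In this sketch a change to `PersistentData.Person.Name` touches only that element and `IsSally`; nothing under `UI` is rebuilt. Whether a change to a child element should also count as a change to its parent class is a design decision the sketch leaves open.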
Basic definition of the code forest
A code forest is a set of trees (a hierarchy with any number of roots) of nodes, which I’ll call “code elements”, along with the revision history of each element and a map of all dependencies between all elements. It can also include the concepts of code branches, commits, and other source control features.
Each element is composed of an expression in the form “visibility-spec name = element-definition” and a separate area for typing definitional or contract comments. Some example elements are shown here:
- public A = 3
- B = int (string s) { return s.Length; }
- visible C = class { … }
The examples show a variable element, a function element, and a class element, respectively. The class element will have child elements inside it. The only type of element that allows a fairly long definition is the function, because of its body. Since most functions are ideally fewer than 20 lines, that means source control is operating on much smaller units than we are used to.
You may be questioning the function and class syntax. I am not concerned with exact syntax in this paper. There are many function syntaxes, and the one used here is chosen simply because it puts the name on the left of the equals sign so it is consistent with all other definitions. We are assuming that the type of any element is unambiguous from the definition, so in the example, A is known to be an integer. Classes also use the name = class syntax.
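One way to picture an element as stored data is the following sketch. It is written in Python rather than the C#-like notation above, and it assumes each element carries exactly the parts named in the definition: an optional visibility spec, a name, a definition, a comment area, child elements and a revision history. The class name `Element` and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:
    name: str
    definition: str = ""                 # e.g. "3" or a function body
    visibility: Optional[str] = None     # e.g. "public", "visible", or None if unspecified
    comment: str = ""                    # definitional or contract comments
    children: List["Element"] = field(default_factory=list)
    history: List[str] = field(default_factory=list)  # earlier definitions, oldest first

    def expression(self) -> str:
        """Render the element as 'visibility-spec name = element-definition'."""
        prefix = f"{self.visibility} " if self.visibility else ""
        return f"{prefix}{self.name} = {self.definition}"


a = Element("A", "3", visibility="public")
b = Element("B", "int (string s) { return s.Length; }")
c = Element("C", "class", visibility="visible")

for element in (a, b, c):
    print(element.expression())
```

The comment and history live beside the definition rather than inside it, which is what lets source control operate on one small element at a time.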
To organize millions of lines of code, one can think of all those millions of lines in one giant file with a lot of nesting. Of course you would not display it that way because of its size, but that is one logical way to display it. Replacing brackets with indented bullets to indicate the tree shape, that would look like this example:
- PersistentData =
  - public Person = class
    - Name = ""
    - IsSally = bool () { return Name == "Sally"; }
  - Team = class
    - Members = new List<Person>;
  - public Person = class
- UI =
  - Person = PersistentData.Person
  - ThisUser = (Person)null;
A team of developers can be branching, editing and merging elements all at the same time. The editable unit is the element; there is no need to “check out” or edit whole classes as a unit.
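To illustrate the element as the editable unit, here is a small Python sketch. It assumes elements are addressed by a dotted path through the tree and that each element keeps its own list of revisions. The `Node` class, its methods, and the new value written to `Name` are hypothetical, and branching and merging are left out.

```python
class Node:
    """One code element with its own revision history (illustrative only)."""

    def __init__(self, name, definition=""):
        self.name = name
        self.children = {}
        self.revisions = [definition]  # oldest first; the last entry is current

    def add(self, child):
        self.children[child.name] = child
        return child

    def find(self, path):
        """Walk a dotted path such as 'PersistentData.Person.Name'."""
        node = self
        for part in path.split("."):
            node = node.children[part]
        return node

    def edit(self, new_definition):
        # Only this element gains a revision; parents and siblings are untouched.
        self.revisions.append(new_definition)

    @property
    def definition(self):
        return self.revisions[-1]


forest = Node("")  # unnamed holder for the root elements
data = forest.add(Node("PersistentData"))
person = data.add(Node("Person", "class"))
person.add(Node("Name", '""'))
person.add(Node("IsSally", 'bool () { return Name == "Sally"; }'))

# Edit a single element by path; the new value is made up for the example.
forest.find("PersistentData.Person.Name").edit('"Unknown"')

print(forest.find("PersistentData.Person.Name").revisions)
print(len(forest.find("PersistentData.Person.IsSally").revisions))
```

After the edit, `Name` has two revisions while `IsSally` and `Person` still have one each, so two developers changing different members of the same class would not be touching the same unit.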
Layer views
You can also look at a code forest visually, showing boxes for the organizational classes and arrows denoting dependencies. Here is an example:
The example comes from an earlier paper “Megaworkarounds” – http://www.divergentlabs.org/tech/megaworkarounds/
The advantage to this kind of view is that it shows how the code is layered. Tools can also allow you to draw layers and drag elements to change the structure of the code base. For example, you could draw a box around a number of functions dealing with the same thing in an overly complex class and create an encapsulating layer.
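A tool could derive such a layered picture from the dependency map alone. The Python sketch below is one possible approach, not a description of any existing tool. It assumes the dependencies between the boxes contain no cycles, and it places each box one layer above the highest box it depends on. The function name `layer_of` and the `Reports` box are hypothetical.

```python
def layer_of(dependencies):
    """Assign each element a layer number: 0 if it depends on nothing,
    otherwise one more than the highest layer among its dependencies.

    `dependencies` maps an element name to the set of names it depends on.
    Raises ValueError if the dependencies contain a cycle (assumed not to
    happen in a well-layered forest).
    """
    layers = {}
    visiting = set()

    def visit(element):
        if element in layers:
            return layers[element]
        if element in visiting:
            raise ValueError(f"dependency cycle through {element}")
        visiting.add(element)
        deps = dependencies.get(element, ())
        layers[element] = 1 + max((visit(d) for d in deps), default=-1)
        visiting.discard(element)
        return layers[element]

    for element in dependencies:
        visit(element)
    return layers


# UI depends on PersistentData, as in the tree example; Reports is made up.
boxes = {
    "Reports": {"UI", "PersistentData"},
    "UI": {"PersistentData"},
    "PersistentData": set(),
}

for name, layer in sorted(layer_of(boxes).items(), key=lambda item: item[1]):
    print(layer, name)
```

Drawing the boxes in rows by layer number gives a picture in which every arrow points downward, which is the property that makes unwanted dependencies easy to spot.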