Star Ford

Essays on lots of things since 1989.

Code forests

on 2017 June 15

This paper is about a layering paradigm for enterprise scale software called a “code forest”. The paradigm is a tree-shaped database of elements that compose the code base, with their full content and revision history. Developers edit the database rather than the file system. I will get into details on what that means, but first want to start with a list of problems that the approach improves upon.

Why we need code forests

In my examples I’m using C#-like code and terminology but it is the same concept with Java or any language designed for enterprise-scale software. In this context “enterprise” means possibly millions of lines of code and multiple tiers with overlapping legacy and new products.

Some problems with today’s large code bases:

  • Depending on the language there are now three or more competing naming systems permeating the code base. These include the class names and optionally hierarchical namespaces of classes; names in the file system; and names of folders, projects and “solutions”. The only reason for the complication is the history of adding tools on top of other tools; it is not needed. A code forest only has one naming scheme.
  • Developers are usually forced to deal with source files. The base unit in programming and compiling is traditionally the file, but it does not have to be that way. A source file is not a meaningful concept in the compiled product. A code forest allows you to work with named elements in the forest, not files.
  • Developers are currently forced to deal with deployment considerations when writing classes. A code forest allows deployment decisions to be made completely separately.
  • The skills and other team characteristics needed to manage a code base are different than skills for writing classes and methods. A code forest helps teams do forest management as distinct from code quality management.
  • Documentation and understanding often decline as code bases get bigger. A code forest organizes code with documentation about code structure in the same place as the code itself.
  • Unwanted dependencies and friend dependencies creep into code bases as teams get bigger. A code forest makes creating dependencies an action that you have to take explicitly, so there can be no dependencies creeping in accidentally.
  • Source control is standard practice now, but technically it is an optional layer on top of a non-source-controlled system, and it can be broken, avoided, worked around, misused, or not be fully integrated with the development tool, leading to complications. A code forest is source control to begin with, so it is impossible to not use source control with it.
  • Developers can spend a great deal of time recompiling “the universe” of code when only one line changed. The compiler is usually too unaware of the code layering to optimize away unnecessary work. A code forest allows compiling to be based on changes only.
  • Visibility of class members is not flexible enough, with the options of public, private and protected members. Sometimes you need visibility in more complex ways, but we end up making too much public. Code forests control visibility exactly.

Basic definition of the code forest

A code forest is a tree (a directed acyclic graph with any number of roots) of nodes (which I’ll call “code elements”) along with the revision history of each element and a map of all dependencies between all elements. It can also include the concepts of code branches, commits, and other source control features.

Each element is composed of an expression in the form “visibility-spec name = element-definition” and a separate area for typing definitional or contract comments. Some example elements are shown here:

  • public A = 3
  • B = int (string s) { return s.Length; }
  • visible C = class { … }

The examples show a variable element, a function element, and a class element, respectively. The class element will have child elements inside it. The only type of element that allows fairly long definitions is the function body. Since most functions are ideally less than 20 lines, that means source control is operating on much smaller units than we are used to.

You may be questioning the function and class syntax. I am not concerned with exact syntax in this paper. There are many function syntaxes, and the one used here is chosen simply because it puts the name on the left of the equals sign so it is consistent with all other definitions. We are assuming that the type of any element is unambiguous from the definition, so in the example, A is known to be an integer. Classes also use the name = class syntax.

To organize millions of lines of code, one can think of all those millions of lines in one giant file with a lot of nesting. Of course you would not display it that way because of its size, but that is one logical way to display it. Replacing brackets with indented bullets to indicate the tree shape, that would look like this example:

  • PersistentData =
    • public Person = class
      • Name = “”
      • IsSally = bool () { return Name == “Sally”; }
    • Team = class
      • Members = new List<Person>;
  • UI =
    • Person = PersistentData.Person
    • ThisUser = (Person)null;

A team of developers can be branching, editing and merging elements all at the same time. The editable unit is the element; there is no need to “check out” or edit whole classes as a unit.
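To make the data model concrete, here is a minimal sketch in Python of a forest where each element carries its own revision history. The class names (Element, Revision), the field names, and the authors "ann" and "bob" are my own assumptions for illustration; a real forest manager would add branches, commits and a proper database behind it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Revision:
    """One saved version of an element's definition text."""
    author: str
    definition: str


@dataclass
class Element:
    """A forest node: name, visibility spec, comment area, history, children."""
    name: str
    visibility: str = ""      # "", "visible", "public", ... (free text in this sketch)
    comment: str = ""         # the separate definitional/contract comment area
    history: list = field(default_factory=list)
    children: dict = field(default_factory=dict)
    parent: Optional["Element"] = field(default=None, repr=False)

    def add(self, child: "Element") -> "Element":
        child.parent = self
        self.children[child.name] = child
        return child

    def edit(self, author: str, definition: str) -> None:
        """Record a new revision; only this element's history grows."""
        self.history.append(Revision(author, definition))

    @property
    def definition(self) -> str:
        return self.history[-1].definition if self.history else ""

    def path(self) -> str:
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return ".".join(reversed(parts))


# Part of the bulleted example above, with two edits to one function element.
persistent = Element("PersistentData")
person = persistent.add(Element("Person", visibility="public"))
is_sally = person.add(Element("IsSally"))
is_sally.edit("ann", 'bool () { return Name == "Sally"; }')
is_sally.edit("bob", 'bool () { return Name.ToLower() == "sally"; }')
```

The point of the sketch is that the two edits touch only the history of IsSally; the enclosing Person class has no new revision of its own.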

Layer views

You can also look at a code forest visually, showing boxes for the organizational classes and arrows denoting dependencies. Here is an example:

The example comes from an earlier paper “Megaworkarounds” – http://www.divergentlabs.org/tech/megaworkarounds/

The advantage to this kind of view is that it shows how the code is layered. Tools can also allow you to draw layers and drag elements to change the structure of the code base. For example, you could draw a box around a number of functions dealing with the same thing in an overly complex class and create an encapsulating layer.

Sequencing magic

Since there are no source files in a code forest, there is no code formatting or sequencing. In other words the programmer does not decide where to put blank lines and in what order to list elements. You can still format a function body, which is a leaf node in the forest, but you cannot format the class itself. You can view the elements in a class in whatever order you like: alphabetically by name, by visibility, by type, or by some more complex algorithm.

In addition to the flexible options for display sequencing, there is a natural sequence that is derived from dependencies. The elements having no dependencies on other elements are first, and the remaining elements are sequenced so that all declarations appear before any referencing elements. Until cross-dependencies are discussed below, the usual case is all one-way dependencies. This results in sequence constraints: the natural sequence is not completely deterministic, but rather constrained. That is to say, if two elements have no dependencies on each other, they can appear in either order in the natural sequence.

Another definition: A big sister is a sibling that is declared earlier in the natural sequence. In the example above, Person is a big sister to Team, and conversely Team is a little sister to Person. Person is not a sister to UI because they are not declared at the same level.

In the bulleted example, a reference from inside the UI element to PersistentData.Person is valid because (1) PersistentData is a big sister, and (2) Person is a public member of PersistentData.

Now for some nice magic: in the example above, UI displays after PersistentData only because an element inside UI references PersistentData. If you were to remove that reference and add one going the other way, from PersistentData to UI, the sequence would switch as a result. If you tried to make references both ways, you would create a cycle and the code would not compile.

The natural sequence can help optimize compiling, as discussed in another section.
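The natural sequence is essentially a topological sort of siblings by their references. Here is a minimal Python sketch under that assumption; the function name natural_sequence and the alphabetical tie-break are my own choices, since any order that respects the constraints would be equally valid.

```python
def natural_sequence(references):
    """Order sibling elements so each one follows the siblings it references.

    `references` maps each element name to the set of sibling names it depends on.
    Raises ValueError when the references form a cycle.
    """
    remaining = {name: set(deps) for name, deps in references.items()}
    ordered = []
    while remaining:
        # Elements whose dependencies have all been placed can go next.
        ready = sorted(name for name, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cross-dependency among: " + ", ".join(sorted(remaining)))
        for name in ready:
            ordered.append(name)
            del remaining[name]
        for deps in remaining.values():
            deps.difference_update(ready)
    return ordered


# UI references PersistentData, so PersistentData comes first.
first = natural_sequence({"UI": {"PersistentData"}, "PersistentData": set()})
# Reverse the reference and the sequence switches.
second = natural_sequence({"UI": set(), "PersistentData": {"UI"}})
```

Passing references that go both ways raises the error, which corresponds to the "will not compile" case.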

Naming and referencing rules (basic)

The basic rules of names are:

  • Every element has a single-word name.
  • Every name must be distinct from all of its big sisters and all of its ancestors.
  • You can reference a big sister or ancestor with its single name. (You do not qualify ancestors with dot syntax.)
  • You can reference any visible child of a big sister or ancestor by chaining names with dot syntax (such as with PersistentData.Person above)

Saying that a name is “in scope” means it is a valid reference.

There is support for friends, discussed below, but it should be the exception rather than the rule. A clean forest has almost all one-way dependencies.
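As a rough sketch of the single-name rule, the following Python computes which names are in scope for a given element. I am assuming, based on the UI example, that big sisters of ancestors count as well as big sisters of the element itself. The Node class and the hidden root that holds the top-level elements are hypothetical helpers, not part of the paradigm.

```python
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []      # kept in natural sequence
        if parent is not None:
            parent.children.append(self)


def names_in_scope(node):
    """Single-word names `node` may reference: its ancestors, plus the big
    sisters of the node and of each ancestor. Nearest declaration wins."""
    scope = {}
    current = node
    while current.parent is not None:
        parent = current.parent
        for sibling in parent.children:
            if sibling is current:
                break           # later siblings are little sisters: out of scope
            scope.setdefault(sibling.name, sibling)
        if parent.parent is not None:   # skip the hidden root container
            scope.setdefault(parent.name, parent)
        current = parent
    return scope


# The bulleted example from earlier, in natural sequence.
forest = Node("<forest>")
pd = Node("PersistentData", forest)
person = Node("Person", pd)
team = Node("Team", pd)
ui = Node("UI", forest)
ui_person = Node("Person", ui)
this_user = Node("ThisUser", ui)
```

Here ThisUser sees Person (the alias inside UI), UI and PersistentData, while nothing inside PersistentData can name UI.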

The reference map

The code forest manager needs to build a reference map – a database of references from and to each element – to aid in generating the natural sequence, for browsing and debugging, to facilitate graphic display of layers and dependencies, and also to optimize compiling. It has to know the language syntax and has to look inside the definitions as well as inside the function bodies to find the references.

Those references can be cached at the element level, and persisted as part of the database. There is a giant optimization simply in not having to recompute the reference map each time the development environment starts, as is usually done today. When the element definition changes, the reference list for that element is recomputed, but if it has not changed then the map as a whole does not need to change.
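The caching idea can be sketched in Python as follows. The ReferenceMap class, its method names, and the regular expression standing in for a real language parser are all assumptions for illustration only; the scans counter exists just to show when a re-parse actually happens.

```python
import hashlib
import re


class ReferenceMap:
    def __init__(self):
        self._cache = {}    # element path -> (definition hash, referenced names)
        self.scans = 0      # how many times a definition was actually re-parsed

    def references(self, path, definition):
        """Return the names referenced by an element, re-parsing only on change."""
        digest = hashlib.sha256(definition.encode("utf-8")).hexdigest()
        cached = self._cache.get(path)
        if cached is not None and cached[0] == digest:
            return cached[1]
        self.scans += 1
        # Placeholder "parser": treat every identifier-like token as a reference.
        names = set(re.findall(r"[A-Za-z_][A-Za-z0-9_.]*", definition))
        self._cache[path] = (digest, names)
        return names

    def referrers_of(self, name):
        """Inverse lookup: which cached elements mention `name`?"""
        return sorted(p for p, (_, names) in self._cache.items() if name in names)


refmap = ReferenceMap()
refmap.references("UI.Person", "PersistentData.Person")
refmap.references("UI.Person", "PersistentData.Person")   # unchanged: served from cache
```

Persisting the cache dictionary alongside the elements is what would let the development environment start without rebuilding the whole map.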

Class inheritance

To be clear, the code forest is all about naming scope, not about class inheritance. Thus a subclass can be declared anywhere that has visibility to the superclass. While the development tool can search for sub- and super-class relationships, that is not part of the code forest system.

Visibility rules

Visibility specifiers in C# are private, protected, public, and internal. Without going on a long rant, “internal” makes syntax depend on how the compilation is divided into assemblies, which should not be intertwined with the concept of visibility. There are other limitations to the system, mainly the cases where certain parts of a system need visibility into certain parts of classes, but other areas of the system should not have access.

The code forest has these levels of visibility:

  • Without a visibility specifier, an element is visible to its little sisters only. This is like private in C#. In the example, this is why PersistentData does not need to be public and can still be referenced in the UI layer.
  • Use “visible” to indicate that an element is also visible to its direct parent. This does not make it visible to more distant ancestors.
  • Use “visible(name)” to indicate the ancestor level that the element is also visible to. Name must be an ancestor, not a sister. All intervening ancestors between the element and the named ancestor can also see the element.
  • Use “public” to indicate visibility to ancestors all the way up the scope to the top level.

Since visibility to derived classes is a separate concept from visibility to scope ancestors, we use different keywords for that:

  • Use “open” to indicate the element is visible to derived classes (like protected in C#)
  • Use “sealed” to indicate that an element is not visible to derived classes; you would only need to use this if it had been declared open in the superclass.
  • Without open or sealed, the element is not visible to derived classes (as with private in C#)

You can combine scope and subclass visibility in rational ways like “visible sealed” or “visible open”. When using public, you cannot also add open or sealed. None of the specifiers can take away visibility to little sisters, which is always guaranteed.
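One way to read the scope visibility levels is as a height: how many ancestor levels up an element can be seen. The Python sketch below follows that reading, which is my own interpretation rather than a settled rule, and it leaves out open and sealed entirely. The function names and the example chains are hypothetical.

```python
import math


def exposure_height(spec, ancestors):
    """How many levels up an element can be seen.

    `ancestors` lists the element's ancestor names, nearest first.
    Accepts "", "visible", "visible(Name)" and "public".
    """
    if spec == "":
        return 0
    if spec == "visible":
        return 1
    if spec == "public":
        return math.inf
    if spec.startswith("visible(") and spec.endswith(")"):
        target = spec[len("visible("):-1]
        return ancestors.index(target) + 1   # ValueError if not an ancestor
    raise ValueError("unknown visibility spec: " + spec)


def chain_is_visible(chain):
    """Check a dotted reference Head.c1.c2...cn.

    `chain` lists (name, spec, ancestors) for c1..cn. Element ci sits i levels
    below Head, so it must be exposed at least i levels up.
    """
    return all(
        exposure_height(spec, ancestors) >= depth
        for depth, (_name, spec, ancestors) in enumerate(chain, start=1)
    )


# Head.Inner.Leaf fails when Leaf is only "visible", because Leaf is two
# levels below Head; "visible(Head)" or "public" would make it reachable.
too_shallow = chain_is_visible([
    ("Inner", "visible", ["Head"]),
    ("Leaf", "visible", ["Inner", "Head"]),
])
deep_enough = chain_is_visible([
    ("Inner", "visible", ["Head"]),
    ("Leaf", "visible(Head)", ["Inner", "Head"]),
])
```

Under this reading, an element with no specifier has height zero and can never appear in a dotted chain, which matches it being visible to little sisters only.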

Classes, singletons, and layers

Note the two similar syntaxes:

  • A = class {…}
  • B = {…}

The contents of the brackets for A and B – child elements in the code forest – can be the same in both cases. A and B can both contain data members, classes, etc. The difference is that A is the name of an instantiable type, while B is the name of a singleton instance of an anonymous type. Singletons have several related uses, but they are really only one thing:

  • Singletons define scoping and layers. If B contains classes and other layers, the singleton can be called a “layer” and is used for isolating names and encapsulating parts of the code base.
  • Singletons define objects. If B contains data members, then the singleton is essentially a static class (but not exactly)
  • Singletons isolate parts of a class. If the parent scope of A and B is itself a class, then B defines a subset of class members. Use this with visibility specifiers to fine tune encapsulation within a class.

Because singletons are supported in this way, there is no need for the keyword “static”. Static classes are unfortunate to begin with because you sometimes think it is a good idea, then later have to refactor a lot to make it instantiable. Also they are bug attractants.

There’s some interesting and useful side-effect magic of the singleton pattern in the code forest. That is, the singleton is only single within the parent scope. So if the parent is a class then the singleton is not really static; its data members are separate for each instance of the parent. If you really want something to be one single instance, you have to declare it up high in the scope outside of any classes.
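The per-instance behavior can be imitated in an ordinary language today. This Python sketch is only an analogy, with hypothetical names (Parent, Counters, GlobalCounters): a namespace object created in the constructor plays the role of a singleton declared inside a class, and one created at module level plays the role of a singleton declared outside any class.

```python
from types import SimpleNamespace


class Parent:
    def __init__(self):
        # The nested "singleton" is created once per Parent instance.
        self.Counters = SimpleNamespace(hits=0)

    def touch(self):
        self.Counters.hits += 1
        return self.Counters.hits


# A singleton declared outside any class exists once for the whole program.
GlobalCounters = SimpleNamespace(hits=0)

a = Parent()
b = Parent()
a.touch()
a.touch()
b.touch()
```

After these calls a and b hold separate counts, which is the sense in which the singleton is only single within its parent scope.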

Since by defining code forests we are not messing with the whole concept of object orientation, element references have to work in the expected way. That is, function calls and data references made inside a class all operate on the same object, and references to a member of a parent class cannot be allowed, since there is not necessarily any instance of that class attached to the child class.

Aliases

The alias syntax requires no keywords and is simply done like this:

  • A = class {…}
  • B = A
  • C = Libraries.Graphics.Circle

Saying B = A is saying that B is another name for the class A. Likewise, C is another name for Libraries.Graphics.Circle. You can alias singletons or classes. This is a primary technique for simplifying references to layers that might otherwise have long qualified names. Aliases scope like all other elements.

In the case where an element is a variable or function, declaring a reference to it like this is not the same concept:

  • F = int() {…}
  • G = F
  • M = 2
  • N = M

In these examples, we are not creating aliases, but rather creating a new reference (G) to a function (F), or a new variable (N) whose value is copied from M. Assuming C#, in these cases G can later be pointed to a different function, and N can be set to a new value.

Aliases by contrast are constant and always refer to the class or singleton named in the declaration.

The element types

The element types will depend on the language, but here is a list of what could be supported by a variant of C#.

  • singleton – for all the uses discussed in the singleton section above
  • class – a collection of elements that can be instantiated
  • enumeration
  • data member – this can be supported at the top scope for truly global variables, and also in any other scope
  • constant
  • function
  • property
  • non-text resource – more on this below
  • alias

Functional programming

We noted that only named elements of classes and singletons are managed by the code forest. If we declare a function within a function, that inner function (often called a lambda function) is not a code forest element. Here are two example lines from a function body:

int i = 0;

Action f = () => { ++i; … };

These variables (i and f) are not code forest elements, and their scope is only the containing function. They cannot be referenced from elsewhere. The closure owned by f containing i creates naming complexity and is one of the reasons we cannot subject nested functions, or lambdas, to the code forest’s scheme.

Non-text elements

Because we are working in a database paradigm instead of a text-file paradigm, there is less limitation on the way graphics and other resources are included in compiled code.

For example, an element could be declared like this:

CloseIcon = {};

where the block symbol indicates expandable content – in this case, a graphics file. The file need not be written out to the file system, edited, and re-imported, which gets boring; instead the editor can edit the database record in place.

Language strings, all multi-media files, embedded databases, and anything that is logically a constant can be treated this way. They are all just forest elements.

Source control

Source control in a code forest has to be more robust than we are used to. In particular it needs to handle renaming and repositioning elements automatically without losing history.

A nice source control side effect of the code forest paradigm is that fewer spurious changes are recorded. In typical programming, adding blank lines and other such edits that do not change the compilation are treated as changes, but when formatting is mostly removed from the system, this will happen less.

Refactoring tools

A code forest will make refactor operations faster and easier to automate. Here are some examples.

  • Suggestions
    • The tool can find suspicious dependencies, such as when there is only one reference controlling the natural sequence of a large part of the code base.
    • The tool can suggest explicit encapsulation where there are no dependencies from outside a group of elements.
  • Reorganization
    • The tool can allow drawing a box around existing layers (demoting child elements), or erasing a layer (promoting child elements)
  • Visualization
    • During a preview of a refactor operation such as a name change or visibility change, the tool can highlight layers in the layer graphic view which will be affected by the change.

Complex visibility

Here is an example showing the need for more complex visibility than we normally have access to.

In this example we define a base class for business entities (clients, sales, etc), and we want to have some shared functionality for all those entities, like knowing if the record is being edited by a user, whether it has been archived, or other such system-level or generic considerations. But in some cases we only want the code dealing with management of that concern to be able to see the relevant properties of the object.

Here is a snippet of the code forest illustrating how complex visibility works to our advantage. In this hypothetical system, primary keys for new records are created one way, while old records from a legacy system are imported without a key and their keys get generated another way.

  • BusinessEntities
    • Generic
      • EntityBase = class
        • visible open IsImported = false
        • visible PrimaryKey = 0
      • ImportManager = class
        • AssignKey = void (EntityBase rec) { if (rec.IsImported) rec.PrimaryKey = GenerateKey(); }
    • ClientLayer
      • Client = class : EntityBase
        • Validate = void () { … }

In the above example, note that BusinessEntities, Generic, and ClientLayer are singletons, used here just to create layering. The function Client.Validate cannot see EntityBase.PrimaryKey: that element is visible to ImportManager within the Generic layer, but unlike IsImported it is not open to derived classes. This restriction in visibility allows us to implement generic handling of primary keys in a layer, without having to make it part of the EntityBase class itself, and without having to over-expose it.

As another example, consider a class method that should only be called by one other method. An example is when there are steps in a sequence, and you want to write each step as a separate function but ensure they are only called in the right order by a master function:

  • SomeProcess = class
    • Sequence
      • StepOne = void() {…}
      • StepTwo = void() {…}
      • visible Run = void() { StepOne(); StepTwo(); }
    • Foo = void() { Sequence.Run(); … }

In that example, Sequence is a singleton for scoping reasons, but all four functions are still members of the class SomeProcess. Foo can run the sequence of steps because Run is marked visible, but it cannot see StepOne or StepTwo. This allows the class designer to ensure that inappropriate calls would be out of scope.

Friend elements

So far I have ignored the need for cross-dependencies, also known as friend elements. Friend classes can reference each other. Likewise, friend functions within a class can call each other. While some languages like Pascal went to great lengths to prohibit all friend elements, other languages like C# default to everything being friends and have limited ways to prevent friends. The overly generous nature of modern languages lets cross-dependencies creep in too easily, and this leads to a developer culture of almost totally ignoring layers as an organizational concept.

There are a few reasons for elements to be friends, the main ones being:

  • Two or more classes are designed to be containers and members of containers, such as classes for Person and Family, or generic classes for Tree and Node.
  • Two or more business entity classes refer to each other in the same way database records can be cross-dependent.
  • Recursion, when indirect (method A calls B, which calls A)

A code forest provides syntax for creating friend elements like this:

  • DataStructures
    • Trees = [Tree, Node]
      • Tree = class {…}
      • Node = class {…}

In this example, the [Tree, Node] syntax element declares a cross-dependency between the child elements. This is the only case where a forward reference is allowed – that is, a reference to a child declared later in the natural sequence.

This architecture for friend elements makes it hard to set up cross-dependencies between classes that are distant in the code forest. They have to be siblings to be friends, which is how it should be.

An indirect recursion example shows how friend methods would work as expected:

  • StorageLocation = class
    • public Products = List<ProductQty>
    • public Sublocations = List<StorageLocation>
  • Inventory = class [CountAt, CountOfChildren]
    • CountAt = int (StorageLocation loc) { return loc.Products.Sum(p => p.Qty) + CountOfChildren(loc); }
    • CountOfChildren = int (StorageLocation loc) { return loc.Sublocations.Sum(s => CountAt(s)); }

Without the friend specifier [CountAt, CountOfChildren], CountAt would not be able to call CountOfChildren.
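A forest manager could enforce this by rejecting any cross-dependency that is not covered by a friend declaration. Below is a minimal Python sketch of such a check, assuming the reference map is available as a dictionary from each element to the names it references. The function names are hypothetical, and direct self-recursion is deliberately not treated as a cross-dependency.

```python
def reachable(start, references):
    """All elements reachable from `start` by following references."""
    seen, pending = set(), list(references.get(start, ()))
    while pending:
        name = pending.pop()
        if name not in seen:
            seen.add(name)
            pending.extend(references.get(name, ()))
    return seen


def undeclared_cross_dependencies(references, friend_groups):
    """Return sorted pairs of elements that depend on each other (directly or
    indirectly) without sharing a declared friend group."""
    reach = {name: reachable(name, references) for name in references}
    problems = []
    for a in sorted(references):
        for b in sorted(references):
            if a < b and b in reach[a] and a in reach[b]:
                if not any(a in group and b in group for group in friend_groups):
                    problems.append((a, b))
    return problems


# The indirect recursion example: each function calls the other.
calls = {"CountAt": {"CountOfChildren"}, "CountOfChildren": {"CountAt"}}
without_friends = undeclared_cross_dependencies(calls, [])
with_friends = undeclared_cross_dependencies(calls, [{"CountAt", "CountOfChildren"}])
```

Without the friend group the pair is reported as a problem; with it, the list of problems is empty.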

Contract programming

Loosely, contract programming is the set of patterns that establish the usage of a class through clearly articulated and enforced constraints, so that the user of the class never has to inspect the implementation to be an effective user.

A code forest does not change how contract programming is implemented, but it can facilitate the clear articulation of class usage. Each element has an associated description field, which can have formatting and hyperlinks allowing the programmer to document right next to the code. Readers of the code can access that formatted text and ideally do not need to navigate to the implementation. I have not been showing any comments in the examples here because they are hard to show using the nested bullet format, but in an interactive graphical forest view, they could be shown prominently.

Compiling

One of the reasons Pascal is still in use in 2017 is that it compiles very fast, and it does that because it can produce machine code from source code in a single pass. This is possible because of aspects of the language syntax, mainly the limited support for friend elements.

A code forest can take advantage of that same idea, but go further. Assuming C# for now, a leaf element in the forest can be compiled into intermediate code with debug information, and cached at the level of that element. Since the tools know specifically which element changed, that cached intermediate code can stay cached as long as the element does not change.

Then, non-leaf elements (those having children) can combine the compiled bits of their children in natural sequence to create and cache the compiled bit for those elements. And so on up the tree.

A team of developers working on branches of a code base will not change 90% of it over the course of a day, so we can take advantage of cached compilations in a hugely efficient manner. Suppose a function changes, but its signature does not change. Since the signature has not changed, all calls to the function are still valid. Then the compiler only needs to update the compiled bit for that element and recompose (but not recompile) the ancestor elements, which could be done in a matter of milliseconds even for a million-line code base.

If a signature changes or a visible element is added or removed, then there could be some recompiling necessary of the rest of the code base. But since the reference map is already known, anything that is not dependent on the change does not need to recompile.
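Here is a small Python sketch of that caching policy, with the compiler replaced by a stand-in that just builds a string. The class IncrementalBuild and its fields are assumptions for illustration; it recompiles direct callers only when a signature changes, and it does not model recomposing the ancestor elements.

```python
class IncrementalBuild:
    def __init__(self, dependents):
        # dependents: element name -> set of element names that reference it
        self.dependents = dependents
        self.signatures = {}
        self.bodies = {}
        self.compiled = {}
        self.compile_count = 0

    def _compile(self, name):
        # Stand-in for real compilation of one element.
        self.compile_count += 1
        self.compiled[name] = f"<code {name}: {self.signatures[name]} {self.bodies[name]}>"

    def update(self, name, signature, body):
        """Store a new definition and recompile only what the change can affect.
        Returns the list of elements that were recompiled."""
        old_signature = self.signatures.get(name)
        if old_signature == signature and self.bodies.get(name) == body:
            return []                   # nothing changed: the cache stays valid
        self.signatures[name] = signature
        self.bodies[name] = body
        self._compile(name)
        rebuilt = [name]
        if old_signature is not None and old_signature != signature:
            # Callers were compiled against the old signature: redo them too.
            for caller in sorted(self.dependents.get(name, ())):
                if caller in self.bodies:
                    self._compile(caller)
                    rebuilt.append(caller)
        return rebuilt


build = IncrementalBuild({"CountOfChildren": {"CountAt"}})
build.update("CountOfChildren", "int (StorageLocation)", "v1")
build.update("CountAt", "int (StorageLocation)", "v1")
body_only = build.update("CountOfChildren", "int (StorageLocation)", "v2")
signature_change = build.update("CountOfChildren", "long (StorageLocation)", "v2")
```

A body-only edit recompiles just the edited element, while the signature change also pulls in CountAt because the reference map says it is a caller.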

A further optimization comes if servers do compilation and cache it at each commit point. In that scenario, developers can move to the latest branch and not have to recompile at all.

Testing and deployment

In C#, one is currently forced to arrange classes in assemblies, and specify whether an assembly is an entry point or a class library.

By contrast a code forest does not need any such limitation. Any function from anywhere in the forest can be deployed as the entry point, and the tool will package up that function with all its dependencies into the deployable product. All unused deadweight code is excluded from the deployment. It would be common for a large forest to have dozens of semi-independent products that share layers of code all in the same forest. Each product is nothing more than a reference to a function element anywhere in the forest.
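Packaging a product then amounts to walking the reference map from the chosen entry point. The Python sketch below assumes the map is a dictionary from element path to the paths it references; the function name deployment_set and all element paths in the example are invented for illustration.

```python
def deployment_set(entry_point, references):
    """Return every element reachable from `entry_point` through references."""
    needed = set()
    pending = [entry_point]
    while pending:
        name = pending.pop()
        if name in needed:
            continue
        needed.add(name)
        pending.extend(references.get(name, ()))
    return needed


# Two hypothetical products and a test layer sharing code in one forest.
references = {
    "Shop.Main": {"Billing.Invoice", "UI.Window"},
    "Billing.Invoice": {"Core.Money"},
    "Reports.Main": {"Core.Money", "Reports.Chart"},
    "Tests.InvoiceTests": {"Billing.Invoice"},
}
shop = deployment_set("Shop.Main", references)
```

The shop product picks up the shared Core.Money element but none of the reporting or test elements, since nothing reachable from Shop.Main refers to them.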

This feature naturally extends to testing layers, which can be part of the same forest but excluded from deployments. Just as compiling can occur layer by layer up the tree, testing layers can be run at the time their dependent layers are fully compiled, and only when any dependency changes. Test failures can be integrated into real time feedback so the developer knows about the failure shortly after making a code change.

Work process

There are a few notable changes in developer work process that a code forest would allow:

  • In the process of breaking something – such as a rename or other signature change that breaks callers – the developer would quickly see on a graphic map of the code base which layers are broken by the change. Then they would work up one layer at a time to fix the call sites. Usually today, we get thousands of errors when we break something at a low level, but we learn to ignore most of them as spurious, since only the first reported error can be guaranteed to be accurately reported by the compiler. The other ones can often be spurious simply because an intervening dependency cannot compile, but the compiler does not know when to quit. With the structure of a code forest, it would only tell you about errors in the most proximate broken layer.
  • Permissions to modify the forest can be made at the level of elements. Having very fine grained permissions would mean a big team would never have to cordon off a whole separate code base just for permissions concerns. It could all still be in one forest. As an example, you might say that only senior level developers can modify some base library layers that are used everywhere, or you might divide it out by the deployable product with multiple sub-teams having permissions to shared layers.
  • Better connection can be made between layers and branching. Code can be read-only unless you specifically create a branch for editing, and this can be done at the granularity of any element in the forest. For example Jen intends to modify the invoicing layer to refactor it, so she creates the branch on that whole layer, which unlocks it for editing. Other developers can see Jen is editing that particular layer (and why) and might defer potentially conflicting edits. Jen can meanwhile stay current in other parts of the forest. Or she could branch a different part of the forest for some other reason and easily deal with the two projects separately, even if they overlap in time.

Importing code

I have written this so far without mentioning reusing code from outside the owning organization. Of course, references to outside the forest are necessary, and this section shows a paradigm for doing that.

But first I want to discuss version hell and how code based on file systems is a particular invitation to disaster when importing code. Version hell is an artefact of too many naming systems, and an incomplete dependency system, which the code forest is primarily intended to address. A typical version issue is where the name of a class from an outside assembly is not renamed for each release, perhaps the file names are also not renamed, but the version may be stored in a configuration file. Or worst case, it just is not stored anywhere. Someone or some process magically replaces the file with a new version and something breaks. Perhaps part of the code base needs a different version than another part, or there is a problem finding the requested version, or the deployment machine has a different version than the development machine.

Another depressing version problem is when you need to import code and modify it, thus branching irrevocably and locking yourself into an old version.

Stepping back, there are three logically coherent ways to combine and reuse code across teams:

  • Importing a read-only copy of source code or intermediate code into the target. This is the fastest performance and allows the best optimization. It also provides the best developer experience since it can be debugged and foreign classes can be subclassed. Therefore this is the best choice for reasonable sized libraries that may have many incoming calls.
  • Importing compiled object code which is late-bound into the same process (also known as dynamic link libraries). This allows better obfuscation if that is a priority, and the foreign code could be in any language.
  • Calling across processes using operating system marshaling of data streams, such as Windows named pipes or HTTP. This makes the most sense when the foreign process is long-running, or is large, or supports multiple callers, and particularly when the streamed data size is small.

In each case, there is a separate party owning the foreign code, and only a cached read-only version of it is imported. Copy-pasting code is not one of the logically coherent ways to share code!

In order to cause the import to happen, we define an external element thusly:

The difference between the examples is that the “source” version imports source code while the “object” version imports a late-bound object code.

If you are referring to the library in many parts of the forest, you can define it at a high level where it is in scope everywhere, or define all such imports as public elements in a layer which can be aliased by other layers as needed.

There are hybrid approaches that allow for imported source code with an obfuscated object code or with a separate process. The un-obfuscated source is the shell that simplifies the call syntax to the underlying worker.

 
