Some programs have a clear design and coding new features is quick and easy. Other programs are a patchwork quilt of barely comprehensible fragments, bug fixes, and glue. If you have to code new features for such programs, you're often better off rewriting them.
However there is a middle ground that I suspect is pretty common in these days of clean code and automated test suites: you have a good program with a clear design, but as you start to implement a new feature, you realise it's a force fit and you're not sure why. What to do?
I've recently being implementing a feature in Eclipse Virgo which raised this question and I hope sheds some light on how to proceed. Let's take a look.
The New Feature
I've been changing the way the Virgo kernel isolates itself from applications. Previously, Equinox supported nested OSGi frameworks and it was easy to isolate the kernel from applications by putting the applications in a nested framework (a "user region") and sharing selected packages and services with the kernel. However, the nested framework support is being withdrawn in favour of an OSGi standard set of framework hooks. These hooks let you control, or at least limit, the visibility of bundles, packages, and services -- all in a single framework.
So I set about re-basing Virgo on the framework hooks. The future looked good: eventually the hooks could be used to implement multiple user regions and even to rework the way application scoping is implemented in Virgo.
An Initial Implementation
One bundle in the Virgo kernel is responsible for region support, so I set about reworking it to use the framework hooks. After a couple of weeks the kernel and all its tests were running ok. However, the vision of using the hooks to implement multiple user regions and redo application scoping had receded into the distance given the rather basic way I had written the framework hooks. I had the option of ignoring this and calling "YAGNI!" (You Ain't Gonna Need It!). But I was certain that once I merged my branch into master, the necessary generalisation would drop down the list of priorities. Also, if I ever did prioritise the generalisation work, I would have forgotten much of what was then buzzing around my head.
So the first step was to come up with a suitable abstract model. I had some ideas when we were discussing nested frameworks in the OSGi Alliance a couple of years ago: to partition the framework into groups of bundles and then to connect these groups together with one-way connections which would allow certain packages and services to be visible from one group to another.
Using Virgo terminology, I set about defining how I could partition the framework into regions and then connect the regions together using package, service, and bundle filters. At first it was tempting to avoid cycles in the graph, but it soon became clear that cycles are harmless and indeed necessary for modelling Virgo's existing kernel and user region, which need to be connected to each other with appropriate filters.
A Clean Abstraction
Soon I had a reasonably good idea of the kind of graph with filters that was necessary, so it was tempting to get coding and then refactor the thing into a reasonable shape. But I had very little idea of how the filtering of bundles and packages would interact. In the past I've found that refactoring from such a starting point can waste a lot of time, especially when tests have been written and need reworking. Code has inertia to being changed, so its often better to defer coding until I get a better understanding.
To get a clean abstraction and a clear understanding, while avoiding "analysis paralysis", I wrote a formal specification of these connected regions. This is essentially a mathematical model of the state of the graph and the operations on it. This kind of model enables properties of the system to be discovered before it is implemented in code. My friend and colleague, Steve Powell, was keen to review the spec and suggest several simplifications and before long, we had a formal spec with some rather nice algebraic properties for filtering and combining regions.
To give you a feel for how these properties look, take this example which says that "combining" two regions (used when describing the combined appearance of two regions) and then filtering is equivalent to filtering the two regions first and then combining the result:
A New Implementation
I defined a RegionDigraph ("digraph" is short for "directed graph") interface, implemented it, and defined a suit of unit tests to give good code coverage. I then implemented a fresh collection of framework hooks in terms of the region digraph and then ripped out the old framework hooks and code supporting what in retrospect was a poorly formed notion of region membership and replaced this with the new framework hooks underpinned by the region digraph.
I Really Did Need It (IRDNI?)
It took a while to get all the kernel integration tests running again, mainly because the user region needs to be configured so that packages from the system bundle (which belongs in the kernel region) are imported along with some new services such as the region digraph service.
As problems occurred, I could step back and think in terms of the underlying graph. By writing appropriate toString methods on Region and RegionDigraph implementation classes, the model became easier to visualise in the debugger. This gives me hope that if and when other issues arise, I will have a better chance of debugging them because I can understand the underlying model.
A couple of significant issues turned up along the way, both related to the use of "side states" when Virgo deploys applications.
The first is the need to temporarily add bundle descriptions to the user region.
The second is the need to respect the region digraph when diagnosing resolver errors. This is relatively straightforward when deploying and diagnosing failures. It is less straightforward when dumping resolution failure states for offline analysis: the region digraph also needs to be dumped so it can also be used in the offline analysis.
These issues would have been much harder to address in the initial framework hooks implementation. The first issue would have involved some fairly arbitrary code to record and remove bundle descriptions from the user region. The second would have been much trickier as there was a poorly defined and overly static notion of region membership which wouldn't have lent itself to including in a state dump without looking like a gross hack. But with the region digraph it was easy to create a temporary "coregion" to contain the temporary bundle descriptions and it should be straightforward to capture the digraph alongside the state dump.
Ok, so I'm convinced that the region digraph is pulling its weight and isn't a bunch of YAGNI. But someone challenged me the other day by asking "Why do the framework hooks have to be so complex?".
Well, firstly the region digraph ensures consistent behaviour across the five framework hooks (bundle find, bundle event, service find, service event, and resolver hooks), especially regarding filtering behaviour, treatment of the system bundle, and transitive dependencies (i.e. across more than one region connection). This consistency should lead to fewer bugs, more consistent documentation, and ease of understanding for users.
Secondly, the region digraph is much more flexible than hooks based on a static notion of region membership: bundles may be added to the kernel after the user region has been created, application scoping should be relatively straightforward to rework in terms of regions thus giving scoping and regions consistent semantics (fewer bugs, better documentation etc), and multiple user regions should be relatively tractable to implement.
Thirdly, the region digraph should be an excellent basis for implementing the notion of a multi-bundle application. In the OSGi Alliance, we are currently discussing how to standardise the multi-bundle application constructs in Virgo, Apache Aries, the Paremus Service Fabric, and elsewhere. Indeed I regard it as a proof of concept that the framework hooks can be used to implement certain basic kinds of multi-bundle application. As a nice spin-off, the development of the region digraph has resulted in several Equinox bugs being fixed and some clarifications being made to the framework hooks specification.
I am writing this while the region digraph is "rippling" through the Virgo repositories on its way into the 3.0 line. But this item is starting to have a broader impact. Last week I gave a presentation on the region digraph to the OSGi Alliance's Enterprise Expert Group. There was a lot of interest and subsequently there has even been discussion of whether the feature should be implemented in Equinox so that it can be reused by other projects outside Virgo.
Postscript (30 March 2010)
The region digraph is working out well in Virgo. We had to rework the function underlying the admin console because there is no longer a "surrogate" bundle representing the kernel packages and services in the user region. To better represent the connections from the user region to the kernel, the runtime artefact model inside Virgo needs to be upgraded to understand regions directly. This is work in progress in the 3.0 line.
Meanwhile, Tom Watson, an Equinox committer, is working with me to move the region digraph code to Equinox. The rationale is to ensure that multiple users of the framework hooks can co-exist (by using the region digraph API instead of directly using the framework hooks).
Tom contributed several significant changes to the digraph code in Virgo including persistence support. When Virgo dumps a resolution failure state, it also dumps the region digraph. The dumped digraph is read back in later and used to provide a resolution hook for analysing the dumped state, which ensures consistency between the live resolver and the dumped state analysis.