A painful road to Java modularity


A few years ago, we decided to overhaul the internals of our in-house open-source job queue manager JQM, a sort of specialized application server for asynchronous jobs. One goal in particular was to allow customer-specific extensions of the product on a few identified extension points with the least friction possible. We thought we were in an ideal case to implement OSGi, a renowned modularity framework:

the code was already carefully architected, with the extension points clearly marked,
we already had experience with OSGi.

How wrong we were! This post is not about why we chose OSGi and not the alternative, JPMS. It is about everything that factually went wrong, with a healthy dose of (very) tired ranting.

Disease and Medicine: not (always) OSGi's fault

A modularity framework has two main responsibilities: ensure isolation between ‘modules’ (whatever the chosen granularity defining a module is) and manage the lifecycle of said modules.

The first point is the real goal: a module should be independent of the implementations of the contracts it uses. This is especially important for our use case – we need a plugin system, consisting of a set of Java interfaces with an unlimited number of implementations we know nothing about. And the framework should do its utmost to restrict what a module can see inside the others.

The second point is a consequence: as long as we know nothing of implementations and can only use interfaces, we need someone else to instantiate objects and provide instances backing the interfaces. There are a host of different patterns to do this, mostly gravitating around the Inversion of Control and the Registry patterns.
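To make the two patterns concrete, here is a minimal plain-Java sketch, not JQM or OSGi code: a type-keyed registry hands out implementations, and the consumer receives them through its constructor instead of instantiating them. All names (JobRunner, ServiceRegistry, Engine, RegistryDemo) are hypothetical and only serve the illustration.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.stream.Collectors;

/** Hypothetical extension point: the contract that plugins implement. */
interface JobRunner {
    String run(String jobName);
}

/** Minimal type-keyed registry (Registry pattern). */
final class ServiceRegistry {
    private final Map<Class<?>, List<Object>> services = new ConcurrentHashMap<>();

    /** Publishes an implementation under the contract it fulfils. */
    <T> void register(Class<T> contract, T implementation) {
        services.computeIfAbsent(contract, k -> new CopyOnWriteArrayList<>()).add(implementation);
    }

    /** Returns every implementation registered for the contract, in registration order. */
    <T> List<T> lookup(Class<T> contract) {
        return services.getOrDefault(contract, Collections.emptyList()).stream()
                .map(contract::cast)
                .collect(Collectors.toList());
    }
}

/** Consumer that only knows the interface; instances are injected (Inversion of Control). */
final class Engine {
    private final List<JobRunner> runners;

    Engine(List<JobRunner> runners) {
        this.runners = runners;
    }

    String runAll(String jobName) {
        return runners.stream().map(r -> r.run(jobName)).collect(Collectors.joining(","));
    }
}

final class RegistryDemo {
    static String demo() {
        ServiceRegistry registry = new ServiceRegistry();
        // In a real plugin system these registrations would come from separate modules.
        registry.register(JobRunner.class, job -> "shell:" + job);
        registry.register(JobRunner.class, job -> "java:" + job);
        Engine engine = new Engine(registry.lookup(JobRunner.class));
        return engine.runAll("nightly");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A modularity framework adds what this sketch leaves out: it decides when implementations appear and disappear, and it prevents Engine from reaching into the classes behind the interface.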

However, when dealing with existing code bases, this rarely maps well. The fact that modules become fully isolated is especially a bummer – it is so easy to take shortcuts with direct access to fields and methods that break encapsulation. After all, what is the harm when you control both the code being exposed and the code profiting from the lapse in encapsulation? Sadly the answer is: the victim is modularity. Might as well do a single module/package in that case. And after ten years the JQM code was riddled with such shortcuts.

Lifecycle is also an issue in JQM. After all JQM is a specialized sort of application server and has a complicated startup process to ensure that metadata is present, initialized if needed, that all plugins are loaded, etc. Introducing an external control for parts of it is not trivial.

So… everything is fine, OSGi is actually only here to force us to clean up our act. That’s factual, and actually a very good thing for the future maintainability of our code base. So why talk of pain?

Killing the patient

An unwilling patient

It is all well and dandy to ensure perfect modularity of one’s own code, but what happens when external libraries get involved? Mayhem.

First, stupid as it may seem, some common libraries lack an OSGi or JPMS manifest and cannot be used directly by the OSGi framework. Not that many thankfully, and in an OSS world PRs are always possible, but tiring, especially when you have to dabble inside unfamiliar build tooling (who the hell invented the torture named Gradle?). That’s actually the easy part, as the PAX project has a dynamic encapsulation library (itself using the BND tool which, one way or another, will always find its way inside an OSGi project).

Second, all ‘big’ frameworks like JPA2, JAX-RS or JAXB implementations do black magic with class loading. That’s not their fault – they have to work in many different contexts with radically different class loading mechanisms (child first, parent first, as well as the dreaded TCCL – the thread context class loader) AND at the same time be modular themselves with mechanisms like late-binding or SPI/ServiceLoader. Sometimes, the black magic is actually inside the API and not inside the implementation (JAXB I’m looking at you), something a madman thought great on his worst day. Just throw a module restriction on class visibility and headaches ensue.
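For readers who have never met the TCCL, the sketch below shows the two mechanisms in plain Java: ServiceLoader.load(Class) resolves providers through the thread context class loader, while ServiceLoader.load(Class, ClassLoader) takes an explicit loader. The helper withContextClassLoader and the Codec interface are hypothetical names for the illustration, and no provider is registered here, so the lookup simply finds nothing.

```java
import java.util.ServiceLoader;
import java.util.function.Supplier;

final class TcclDemo {

    /** Hypothetical SPI contract; no META-INF/services entry exists for it in this sketch. */
    public interface Codec {
        String name();
    }

    /** Runs an action with the given loader as TCCL and always restores the previous one. */
    static <T> T withContextClassLoader(ClassLoader loader, Supplier<T> action) {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(loader);
        try {
            return action.get();
        } finally {
            current.setContextClassLoader(previous);
        }
    }

    /** Counts providers visible from an explicit loader (no TCCL involved). */
    static int countProviders(ClassLoader loader) {
        int count = 0;
        for (Codec ignored : ServiceLoader.load(Codec.class, loader)) {
            count++;
        }
        return count;
    }

    /** Counts providers through the implicit variant, which reads the TCCL at call time. */
    static int countProvidersViaTccl(ClassLoader loader) {
        return withContextClassLoader(loader, () -> {
            int count = 0;
            for (Codec ignored : ServiceLoader.load(Codec.class)) {
                count++;
            }
            return count;
        });
    }

    /** Checks that the swap is visible inside the action and undone afterwards. */
    static boolean swapsAndRestores() {
        ClassLoader pluginLoader = new ClassLoader(TcclDemo.class.getClassLoader()) { };
        ClassLoader before = Thread.currentThread().getContextClassLoader();
        ClassLoader seenInside = withContextClassLoader(
                pluginLoader, () -> Thread.currentThread().getContextClassLoader());
        return seenInside == pluginLoader
                && Thread.currentThread().getContextClassLoader() == before;
    }

    public static void main(String[] args) {
        System.out.println("swap and restore works: " + swapsAndRestores());
        System.out.println("providers found: " + countProviders(TcclDemo.class.getClassLoader()));
    }
}
```

The trouble described above starts when a library calls the implicit variant deep inside its own code: whatever loader happens to be set on the calling thread decides what gets discovered, and a module framework that restricts class visibility changes that answer.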

These class loading hacks are likely the worst item on the list. Actually, they are a consequence of the original sin of Java class loading, which was not made for modularity. (Funnily enough, this is so true that JPMS, the ‘official Java’ answer to OSGi, has chosen to avoid the issue entirely and not use class loader isolation between modules.) So OSGi is a hack trying to solve a fundamental issue, and like all hacks it works only in most cases and clashes with other hacks.

The most infuriating thing comes when OSGi tries to work around the issue with disasters like the OSGi ServiceLoader Mediator specification (also known as SPI Fly, its only implementation), which tries to set the TCCL dynamically with byte code injection to allow SPI mechanisms to work… The specification is arid, the implementation documentation is a joke and the result is lost nights wondering why the library only half loads or trying to re-package external frameworks. There is a new attempt in OSGi R8 (no implementations yet) with OSGi Connect to ease communication between OSGi and normal Java – let’s wait for the bright future with a big grain of salt…

The only actually reasonable solution here is to forgo all magic configuration systems (implementation auto-discovery, declaration of classes to use in a configuration file…) and use in-code configuration when possible. When migrating a huge code base using standard-compliant magic configuration, this switch is hard to justify cost-wise.

Dubious therapy

As a developer, I want only one build system. In our case, it’s Maven. I’m not against plugging other build systems in (for example an npm build of a JS module) as long as they are controlled by the main one. I certainly do not want to have a packaging system separate from the build system. Yet this is the position OSGi openly takes – it wants to separate the build path from the runtime path. (This, by the way, is a fundamental difference from JPMS and likely its only architectural advantage over OSGi).

Well, I don’t want to have two competing dependency version resolvers. When I update a runtime or test dependency version, I want it to be updated inside the final distribution bundle too.

In the end we have chosen to still use Maven for packaging, with many hacks inside the packaging descriptor. Hacks being plays on scopes and dependency exclusions – the OSGi guys believe so much in their philosophy that they do not care at all about the transitive dependencies of their artifacts, going as far as leaving out non-OSGi sub-dependencies…

Missing information

The documentation is nothing short of a catastrophe. For the whole ecosystem. The main corpus is the OSGi specification (R7 at the time, R8 today). For a specification, it is really cool. But it is not made for users, it’s meant for implementers of the specification. Yet the rest of the documentation is so poor that one is compelled to go back to it again and again. The different framework implementations (the big two being Apache Felix and Eclipse Equinox) hardly have any docs. There are a few websites/blogs (vogella, thank you!) with information that is very often outdated, as the framework has changed quite a bit since its inception in 1999. Very few Stack Overflow users. All in all, little information – OSGi is not a mainstream technology, it is mostly used behind the scenes, inside foundational software. Few users by nature, so few information contributors. The rest of the information will be found inside source code on GitHub. Especially in the tests of the big OSGi frameworks or OSGi-using libraries.

The learning curve is more of a wall as a result. There was an attempt with OSGi En-Route to provide some startup templates but even those… are buggy. It is always funny to clone a sample repository and find it just does not work. It was the right direction, though.

But still: when you have to understand why the modular HTTP service (the HTTP whiteboard in OSGi terms) does not share context with your REST web service inside the same bundle, you do not expect to discover inside a hard-to-find JIRA ticket that multiple servlet contexts are created automatically, and that you have to create one manually and use LDAP filters to bind your elements to it. You would expect either clear documentation or for it to work out of the box. Or you would expect yourself to burn everything with fire.

Another place of missing information is error messages. These are dreary. How could you guess that a NullPointerException inside SPI Fly means ‘there is no JAR manifest’? Or quickly find inside 5 lines of LDAP filters the actual missing package on startup? This is likely the second most important point of this rant: when things go wrong, OSGi makes it really hard to understand why.

Testing

Tests using the actual OSGi mechanisms are complicated to create. There are multiple tools available, but really only one that works in a classic way for Java JUnit users: PAX-Exam. It tries its best to run the tests inside an OSGi framework rather than in the class loader that has started the test JVM. As we had hundreds of pre-existing JUnit tests, it was the only way for us to go.

This is another pain point associated with the ‘OSGi does not care about build class path’ stance – the JUnit tests ‘see’ the whole dependency tree in their class path. It will not be visible inside the OSGi bubble, but all it takes is a weirdly-set TCCL (thank you CXF for randomly changing it) for it to surface. Lovely exceptions ensue, like ‘class X is not an instance of class X’. Setting up logging correctly is especially hard.
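The ‘class X is not an instance of class X’ symptom can be reproduced in a few lines of plain Java, with no OSGi involved. In the JVM a class is identified by its name and its defining class loader, so the same bytes loaded twice give two unrelated types. The sketch below uses a hypothetical child-first loader (IsolatingLoader) and a dummy Payload class; it assumes the class file of Payload is readable as a resource from the parent loader.

```java
import java.io.IOException;
import java.io.InputStream;

final class ClassIdentityDemo {

    /** Hypothetical type, only here to be loaded twice. */
    public static class Payload {
        public Payload() { }
    }

    /** Child-first loader for one class name: defines its own copy instead of delegating. */
    static final class IsolatingLoader extends ClassLoader {
        private final String isolatedName;

        IsolatingLoader(ClassLoader parent, String isolatedName) {
            super(parent);
            this.isolatedName = isolatedName;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (!name.equals(isolatedName)) {
                return super.loadClass(name, resolve); // usual parent-first delegation
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> loaded = findLoadedClass(name);
                if (loaded == null) {
                    String resource = name.replace('.', '/') + ".class";
                    try (InputStream in = getParent().getResourceAsStream(resource)) {
                        if (in == null) {
                            throw new ClassNotFoundException(name);
                        }
                        byte[] bytes = in.readAllBytes();
                        loaded = defineClass(name, bytes, 0, bytes.length);
                    } catch (IOException e) {
                        throw new ClassNotFoundException(name, e);
                    }
                }
                if (resolve) {
                    resolveClass(loaded);
                }
                return loaded;
            }
        }
    }

    /** Loads a second, isolated copy of Payload. */
    static Class<?> loadIsolatedCopy() {
        String name = Payload.class.getName();
        ClassLoader isolating = new IsolatingLoader(ClassIdentityDemo.class.getClassLoader(), name);
        try {
            return Class.forName(name, true, isolating);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Instantiates a class through its public no-argument constructor. */
    static Object newInstance(Class<?> type) {
        try {
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Class<?> copy = loadIsolatedCopy();
        Object instance = newInstance(copy);
        System.out.println("same name: " + copy.getName().equals(Payload.class.getName()));
        System.out.println("same class: " + (copy == Payload.class));
        System.out.println("instance of Payload: " + (instance instanceof Payload));
    }
}
```

In a test run, the role of IsolatingLoader is played by the OSGi bundle class loaders on one side and the flat JUnit class path on the other; a stray TCCL is enough to make a library pick a class from the wrong side.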

But overall, thanks to the PAX Exam coders. It is a great tool, and they even have some documentation! (even if Google will always send you to an older version)

Sad tales of basic malpractice

You know what? The JAXB-OSGI jar works perfectly in version 2.3.3. It stops working in version 2.3.4 with, like, no information at all. (You then learn that they have hard-coded inside the API jar, not the implementation, a mechanism created years ago by the GlassFish team for service discovery, that it is not documented anywhere but yeah sure it’s in the code…)

Partial bundles (fragments, in OSGi terms) and dynamic imports: this is… an extension system for the extension system. Yahoo. Why they were created is understandable, but is it necessarily a good idea to force humans to read a jar manifest to understand what is going on?

LDAP filters to filter objects inside the OSGi registry: downright cruel.

The JAX-RS whiteboard only works with a very specific set of dependencies. Use the Karaf feature or lose hours debugging why your REST service does not start.

Theoretically, OSGi is a dependency resolver. Yet a specific bundle start order (‘start levels’) is still necessary for many things (logger, framework extensions, …).

And to end on a funny note: Apache Felix, in order to start, relies itself on SPI/ServiceLoader. Makes you wonder why you should use more.

Regrets

All in all, we do not regret the work done. Well, ‘done’ is a matter of perspective, as that kind of refactoring is actually never finished, but the worst is behind us. But it is more about the clarity of the resulting code structure than about the framework, because the idea of breaking JQM by simply doing a minor library upgrade is not exactly what we dreamed of. OSGi is a bundle of hacks made by people who were both well-intentioned and great thinkers. It remains a hack, and it is a bit sad to see so much wasted energy on this.

As a final nail in the coffin, the OSGi Alliance has died, and the new Eclipse Foundation overlords of the specification have not yet made their plans clear. So we capitalized on the work done on OSGi to… implement JPMS instead, and removed all traces of OSGi. This will be the subject of a subsequent post, but we can already say: at least we do not regret this final decision.
