Decoupling Data: Strategies and Considerations

Problem

An organization may need to evolve its data's fundamental structure in order to scale operations. In the e-commerce space, this can take several forms. For instance, the catalog may be unable to grow without an unaffordable increase in computational cost; a company that cannot grow its catalog cannot compete effectively, losing market share and ultimately business. Another example is developer velocity grinding to a halt because shipping even the smallest feature requires testing a web of complex dependencies. Slow velocity means inefficient operations, which again means less business. These are two examples of how a data structure's inability to scale manifests as a business problem.

The root cause is access to the data's implementation details. In practice, this means consumers use direct SQL rather than APIs, which ties them closely to the data's technological configuration. They know more about the data than what it represents: they know the tables, the columns, the foreign key relationships, which databases are involved, even hardware details. It is as if every user of Google Search had to know its internal query logic, which is enormously complex and entirely unnecessary for running a simple search; for most people, the search box is enough. In a small company, being close to the implementation details is tolerable, but at scale it creates problems, as we shall see.
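
To make the contrast concrete, here is a minimal TypeScript sketch of the same read performed two ways: once with direct SQL that bakes in the schema, and once through a narrow API that exposes only what the consumer actually needs. All table, column, and function names here are hypothetical.

```typescript
// Hypothetical example: two ways a consumer might read a product's price.

// Coupled: the consumer embeds knowledge of tables, columns, and joins.
// Any change to this schema forces every such consumer to change.
const coupledQuery = `
  SELECT p.price_cents, p.currency
  FROM catalog_db.products p
  JOIN catalog_db.product_prices pp ON pp.product_id = p.id
  WHERE p.sku = $1
`;

// Decoupled: the consumer only knows the shape of the answer, not where it lives.
interface PricingApi {
  getPrice(sku: string): Promise<{ amountCents: number; currency: string }>;
}

async function showPrice(api: PricingApi, sku: string): Promise<void> {
  const { amountCents, currency } = await api.getPrice(sku);
  console.log(`${sku}: ${amountCents / 100} ${currency}`);
}
```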

Because consumers have access to implementation details, their quantity and diversity make it challenging to evolve the data structure. The reasoning is straightforward. Suppose a table has to be moved to another database. Every consumer would have to know about the move and reconfigure its queries to reach the new location. A normally simple task becomes nearly impossible because every consumer must migrate: each depends on the current implementation to access the data, so accommodating the change means a messy, expensive rewrite across many codebases. Something has to be done, because if the refactoring never happens, the company loses the opportunity to scale the operation, and in the worst case it cannot even support its current state.
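
As a rough illustration of the blast radius, the sketch below (hypothetical names and SQL throughout) shows how the same table move plays out when the query lives in one owned access function instead of being copied into every consumer.

```typescript
// Hypothetical sketch of the blast radius of a table move. With direct SQL,
// this statement is copied into every consumer, and each copy must be edited
// when the orders table moves to another database:
//
//   SELECT id, total_cents FROM orders_db.orders WHERE customer_id = $1
//
// With a single access function owned by the data team, the move touches one place.

interface SqlClient {
  query(sql: string, params: unknown[]): Promise<any[]>;
}

export async function getCustomerOrders(
  db: SqlClient,
  customerId: string
): Promise<Array<{ id: string; totalCents: number }>> {
  // Only this SQL changes after the migration; no consumer is aware of it.
  const rows = await db.query(
    "SELECT id, total_cents FROM fulfillment_db.orders WHERE customer_id = $1",
    [customerId]
  );
  return rows.map((r) => ({ id: r.id, totalCents: r.total_cents }));
}
```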

A few more examples illuminate how this situation becomes untenable at scale. To start, consider the existence of multiple sources of truth. At scale this is a significant problem: consistency is hard to maintain, and the implementation details are exposed in many places, many times over, multiplying the coupling. Another example is the challenge of testing amid technical debt. If the codebase is full of 1,000-line classes without tests, verifying a new feature becomes an unaffordable amount of work; needless to say, classes of that size signal one thing, namely coupling to implementation details. In this sense, the need for data refactoring closely resembles the need for general refactoring: a single entity has grown so large that it creates too much coupling. There is no need to elaborate on every example; anyone in the thick of a data restructuring knows the problem is real and will not feel a dearth of evidence.

A Formal Approach
See https://github.com/Adrianjewell91/decoupler-website/blob/main/README.md
Requirements

The first requirement is to make migration easy for consumers; otherwise there is no point in solving the problem. In other words, consumers should be able to continue operating exactly as they did before the changes were made, with no interruption to operations.

Another requirement is to refactor the data structure incrementally. Strictly speaking, incremental wins are not a hard requirement, but they should be, especially if some of the data must be refactored before other parts. It is also good practice to think incrementally: stakeholders usually expect it, and it avoids the temptation of waterfall. This last point is important. Waterfall creates the risk of building a huge product that nobody uses, because it assumes clients will adopt the new platform no matter what, and in practice things do not work that way.

The complexity of the effort also poses a challenge, but thankfully it divides roughly into two types. The first is physical: modifying columns, adding new data, adding validations, and so on. These examples are real; tables genuinely need to be rewritten, refactored, and moved, not only to new databases but to entirely different kinds of databases, which matters a great deal for saving money on software costs. The second kind is just as important: reworking the business domains themselves. This is a conceptual refactoring, in which new business entities emerge from the growing requirements of the business. The two types support and inform each other, but they remain distinct; migrating a database table is a different activity from conceiving of a new business domain, such as a new Catalog or Marketing concept.

A Practical Strategy

Hide the implementation details in a systematic and simple way, and do it without interfering with the software systems that currently consume the data. Practically, this means creating new interfaces, migrating clients to them, and then refactoring behind those interfaces. This greatly simplifies the system by decoupling the data from how it is stored: once the data sits behind interfaces, the underlying structures can evolve independently of one another while the existing relationships are preserved. This is what every company that has successfully scaled has done in order to grow, and the benefits are discussed at length by the people who implemented it, so we know it works. It also works in principle: the current data interfaces are not bad, they simply need to be able to grow and develop independently of one another while keeping the existing functionality. So it is fine to keep them, and to change and grow them under the hood.
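
The sketch below, with hypothetical names, is one way this could look: consumers code against a small interface, a thin wrapper over the existing table makes the interface adoptable immediately, and a later implementation backed by a dedicated service can be swapped in without callers changing.

```typescript
// A minimal sketch of "refactor behind the interface". All names are hypothetical.
// Consumers depend only on CatalogReader; the storage behind it can change freely.

interface Product {
  sku: string;
  title: string;
}

interface CatalogReader {
  getProduct(sku: string): Promise<Product | null>;
}

// Early on: a thin wrapper over the existing monolith table, adoptable immediately.
class LegacyCatalogReader implements CatalogReader {
  constructor(private query: (sql: string, params: unknown[]) => Promise<any[]>) {}

  async getProduct(sku: string): Promise<Product | null> {
    const rows = await this.query(
      "SELECT sku, title FROM catalog_db.products WHERE sku = $1",
      [sku]
    );
    return rows[0] ?? null;
  }
}

// Later: the same contract, now served by a dedicated catalog service.
class CatalogServiceReader implements CatalogReader {
  constructor(private baseUrl: string) {}

  async getProduct(sku: string): Promise<Product | null> {
    const res = await fetch(`${this.baseUrl}/products/${encodeURIComponent(sku)}`);
    return res.ok ? ((await res.json()) as Product) : null;
  }
}

// Callers are written once against the interface and never migrate again.
async function renderTitle(catalog: CatalogReader, sku: string): Promise<string> {
  const product = await catalog.getProduct(sku);
  return product ? product.title : "Unknown product";
}
```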

Implementation Steps
  1. Assume that the starting point is a monolithic software system in which clients are coupled to the implementation details of the data, as expressed by the diagram below.
  2. First, wrap each table behind an interface. This is an agile way to make the new interfaces available for adoption immediately, without too much work, so restructuring can start happening incrementally.
  3. Second, create new domains and migrate data as needed. This assigns each team its correct responsibility: data owners address implementation details, and domain owners take care of the domain restructuring.
  4. Third, migrate existing clients behind the scenes as needed. Inject the new interfaces into the old system in a way that is hidden from consumers, then lock the old interface against new features and require any new work to use the new interfaces. There are many ways to do this, even at the database level. In a SQL world, the strategy would be to map every existing SQL query to an interface under the hood and then lock the set of SQL queries so no new ones can be added. Any new feature then requires adopting the new data structures, because the old queries cannot be changed (a sketch of this locking idea follows this list). The strategy works because it is programmatic rather than negotiated: no effort is spent convincing teams to adopt or not; the rules are set, enforced in code, and adoption follows.
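
One possible shape of the query-locking idea from step 4 is sketched below; the registry, query ids, and SQL are all hypothetical. Existing statements are frozen in an allow-list keyed by id, and anything outside the list is rejected, which programmatically pushes new work onto the new interfaces.

```typescript
// A sketch of "lock the old queries", assuming a registry of today's SQL.
// Every legacy statement is registered once; anything not in the registry
// is rejected, so new features must use the new data interfaces instead.

const legacyQueries = new Map<string, string>([
  // Frozen snapshot of the queries that exist today, keyed by a stable id.
  ["orders.byCustomer", "SELECT id, total_cents FROM orders WHERE customer_id = $1"],
  ["catalog.bySku", "SELECT sku, title FROM products WHERE sku = $1"],
]);

interface SqlClient {
  query(sql: string, params: unknown[]): Promise<any[]>;
}

// The only way the old code path reaches the database: by query id, not raw SQL.
async function runLegacyQuery(
  db: SqlClient,
  queryId: string,
  params: unknown[]
): Promise<any[]> {
  const sql = legacyQueries.get(queryId);
  if (!sql) {
    throw new Error(
      `Query "${queryId}" is not in the legacy registry; use the new data interfaces instead.`
    );
  }
  return db.query(sql, params);
}
```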

Risks

One risk is making life difficult for consumers, chiefly by mandating that they refactor their own code with their own engineering effort. This mistake is easy to make when the company does not eat its own dog food and therefore assumes migration is easy for clients. It rarely is, and the decision costs the company a great deal of time: consumers have plenty of other work to do and will not simply stop everything because an engineering team tells them to migrate to a new interface. It also defeats the whole point of the original problem, which was to stop teams from having to migrate whenever the data structure changes. The end result is a massive risk that consumers simply will not migrate.

A second risk is striving for waterfall wins rather than incremental ones. A company may decide to build a new platform as complex as the previous one, and to build it before anyone has adopted it. That is a lot of work: building it, and figuring out what to build, which requires predicting the intended consumption patterns and the new business domains. This is very hard to do because the future is unpredictable and could change; if it changes, the work is wasted. Consumers may eventually adopt, but no one knows when. That uncertainty puts the whole point of the project, the important work of refactoring the data structure, in question: it will not start until the new platform is built and adopted, which will take a very long time. In the meantime, the legacy data structures remain in place, so there is no progress on the original problem.

A third risk is asking the data owners alone to perform all of the data restructuring. This becomes a time sink, because teams get very little done when they take on too much. It might not seem like a mistake, because the data owners know a lot, including how their data should look in the future, but their picture is incomplete. The data consumers have their own ideas about new domains, the shape they want the data in, and how they want to integrate it. The domain-specific refactoring should therefore rest with the consumers as much as possible, because they know what they want better than anyone else. If the data owners try to do the consumers' work, or vice versa, teams spend too much time asking other teams what they want, and a great deal of time is wasted on discussion and planning instead of building.

If any of these risks is likely to materialize, it is a symptom of a larger problem: the focus has shifted away from the original problem, which still stands, namely refactoring the data structure. A few consumers may still adopt, but the pattern is clear: the company is building a new platform, in a waterfall way, for an imagined, pre-planned, overarching future use case that may or may not exist, and hoping everyone will adopt it. Only if all of that works can the company finally focus on rewriting the data structures under the hood. That day is too far away, and by then it may be too late.

Additional Risks and Management
  1. Performance:
    • The decoupled architecture removes the SQL join as an architectural linchpin, which can be undesirable: the relational model dissolves, and joins are very efficient, so there may be performance implications. This is permissible, however, because the decoupled system offers many alternatives to the join, such as caching and aggregation (a sketch of one such alternative follows this list).
  2. Complexity:
    • A decoupled architecture introduces complexity elsewhere, since it will consist of many micro-services, but that is permissible because they can be observed and monitored with modern observability software.
  3. Cost of Resources:
    • The cost of running the decoupled architecture needs to be cheaper than running the database monolith. This is achievable because the decoupled system will be roughly the same size on average, but only some of its components need to scale, which nets out cheaper than scaling the whole monolith. Additionally, the ROI on decoupling should justify the costs.
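
As a rough illustration of the join alternatives mentioned under Performance, the sketch below (hypothetical interfaces and names) replaces a cross-table JOIN with application-side aggregation over two decoupled interfaces, plus a small in-memory cache so repeated lookups stay cheap.

```typescript
// A sketch of aggregation-plus-caching as a substitute for a SQL JOIN,
// assuming two decoupled read interfaces. All names are hypothetical.

interface OrdersApi {
  getOrders(customerId: string): Promise<Array<{ id: string; sku: string }>>;
}

interface CatalogApi {
  getProduct(sku: string): Promise<{ sku: string; title: string } | null>;
}

const productCache = new Map<string, { sku: string; title: string } | null>();

async function ordersWithTitles(
  orders: OrdersApi,
  catalog: CatalogApi,
  customerId: string
): Promise<Array<{ id: string; title: string }>> {
  const rows = await orders.getOrders(customerId);
  const result: Array<{ id: string; title: string }> = [];
  for (const row of rows) {
    // Cache product lookups so repeated SKUs cost one call, not one per row.
    if (!productCache.has(row.sku)) {
      productCache.set(row.sku, await catalog.getProduct(row.sku));
    }
    const product = productCache.get(row.sku);
    result.push({ id: row.id, title: product ? product.title : "Unknown" });
  }
  return result;
}
```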