The Road to Effective Data Sync in the Post-SaaS Era
The shapes of data
When the word data shows up in a sentence, people tend to think of things like databases, spreadsheets, and the like. Throughout the history of data processing, the table has always been the most intuitive and straightforward format for most people to interact with data. Ancient Sumerians used cuneiform tablets to record agricultural produce and transactions as early as 2400 BCE. Ancient Egyptians used papyrus scrolls to record mathematical computations as early as 2000 BCE. Ancient Chinese even used bamboo strips to make predictions based on astronomical observations as early as 1200 BCE.
However, the shape of data is not constrained to any specific format, no matter how popular that format has been. Data does not even need to be formatted in a way humans can read. This has always been the case, but outside of a relatively small group of data-savvy people, the data that sits below the surface and drives the world goes unseen.
When traders are looking at all the numbers on the screens at Nasdaq, why would they care where the data comes from, since it is already presented to them in the most familiar way possible? How data is stored versus how it is presented is like potatoes versus pretzels.
Things started to change in the past decade, when analytical work came into overwhelming demand and the barrier to entry was torn down by the rise of modern NoSQL databases, data warehouses, BI tools, and human-readable intermediate representations such as CSV and JSON.
Though everything seemed rosy in the early stage, the challenges brought about by these rapid advancements started to kick in, and they intensified the pain of understanding data, especially for non-technical people. Though we live in an age where data is abundant and predominant, the presentation and management of data have never been this scattered.
- Every system that stores data has its own agenda for how the data is laid out, so data at rest is scattered.
- Every query system that loads data also has its own agenda for how the data is presented, so data in motion is scattered as well.
Then comes SaaS, the “almighty savior” that presents data to people from a specific demographic in the most straightforward way possible and gets the job done fast. However, it only intensifies the scattering of data. Over the past few years, we have worked with thousands of different API endpoints on the market, and we have found only weak patterns in their design.
The API connector (a.k.a. Acho CDK) helps us connect to most data sources quickly, but it is still not immune to systems that are too alien to be supported. Why? Because there is no universal standard for data storage, data retrieval, and data presentation. Plenty of standards exist, but they focus on their own ecosystems instead of interoperability.
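To see why a generic connector abstraction only goes so far, here is a minimal sketch of the kind of interface such a connector might expose. The names (`SourceConnector`, `listStreams`, `read`) are hypothetical and do not describe the actual Acho CDK; the point is that every concrete source still has to hide its own auth, pagination, and schema quirks behind a surface like this.

```typescript
// Hypothetical connector interface; not the actual Acho CDK API.

interface RecordBatch {
  records: Record<string, unknown>[];
  // Opaque cursor for the next page; its meaning differs per source
  // (offset, page token, timestamp watermark, ...).
  nextCursor?: string;
}

interface SourceConnector {
  // Discover what "streams" (table-like collections) the source exposes;
  // some APIs have no such concept and the connector has to invent one.
  listStreams(): Promise<string[]>;
  // Pull one batch of records from a stream, resuming from a cursor.
  read(stream: string, cursor?: string): Promise<RecordBatch>;
}

// A toy in-memory connector showing the shape of an implementation.
class InMemoryConnector implements SourceConnector {
  constructor(private data: Record<string, Record<string, unknown>[]>) {}

  async listStreams(): Promise<string[]> {
    return Object.keys(this.data);
  }

  async read(stream: string, cursor?: string): Promise<RecordBatch> {
    const offset = cursor ? parseInt(cursor, 10) : 0;
    const pageSize = 100;
    const rows = this.data[stream] ?? [];
    const records = rows.slice(offset, offset + pageSize);
    const next = offset + pageSize;
    return {
      records,
      nextCursor: next < rows.length ? String(next) : undefined,
    };
  }
}
```

Even with an interface this small, the hard part stays inside each implementation: rate limits, retries, and field mappings are different for every system, which is exactly where the “too alien to be supported” cases come from.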
One model to fit them all?
A universal model that defines how we store and access data might sound very enticing. However, dreaming that some draconian tech giant will standardize everything and make the problem go away has proven to be a joke. No amount of capital can defy the essential complexity of data storage, processing, and presentation. Beyond algorithmic and engineering complexity, there is also politics: who would ever want their data to be stored solely in a single vendor's system?
Though the problem is too hard to overcome outright, the demand is real. So people come along and say: let's fix half of the problem first. That is ETL, our malum necessarium. Let's pull the data somewhere else and transform it into a specific format so that you can consume it without frustration. Boom, here comes headless BI, the problem is solved for data analysts, and data engineers get cushy corporate jobs.
People are hyped. They think the unified data modeling problem is solved, and the modern data stack gang floods the market with tools to celebrate every use case surrounding ETL: data pipeline monitoring, data quality assurance, data exploration, data cataloging, and open-source takes on already-solved problems such as BI tools, data modeling scripts, and toolkits.
It all seems very promising until you find yourself deeply entrenched in 10 SaaS tools, 5 databases, and a freaking data warehouse that contains the same piece of data transformed into fifty shades of tables, siloed behind a hundred-node DAG. Then you start to wonder where you could find the salvation that brings your business data back to sanity.
Distributed data consistency
Disclaimer: the data consistency strategies of robust databases and warehouses, such as sharding, read replicas, master-slave architectures, CQRS, etc., are not what we are discussing here. They are solved problems, and almost all the effort that goes into them is iterative.
The data modeling problem deliberately ignored by the modern data stack is data consistency across systems. The problem has always been there, but it has intensified as SaaS vendors come along and fight for data ownership.
To make things worse, SaaS vendors are not only interested in data ownership. The stronger ones among them are investing heavily in controlling other SaaS systems, and these efforts created a genre called iPaaS. iPaaS vendors have contributed to data consistency to some extent, and nobody can take that away from them. However, there is a conflict of interest: they focus on the most renowned SaaS products as their sources of truth, and they productize rigid workflows that make their customers' business processes even less flexible. Hence we see operations leaders complaining about software like Zapier failing to deliver when they need even the slightest change to pre-packaged behavior.
The market needs a vendor-neutral source of truth for data that can not only pull data from external systems efficiently but also retain control of certain resources on those systems in a non-intrusive manner. We need a framework for data consistency across different systems instead of productized plugins that limit how our customers' businesses operate. The data consistency layer should exhibit the following traits:
- Strong consistency across systems with low latency/rate limit
- Optimistic weak consistency across systems with high latency/rate limit
- Strongly ordered time series audit logging for visibility
- Easy to manage with human-readable declarative configurations (see the sketch after this list)
- Partial consistency to allow non-intrusive operations and flexible workflows
- Continuous monitoring of validation rules to bring awareness of data corruption
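To make these traits concrete, here is a minimal sketch of what such a human-readable declarative configuration could look like. Every name in it (the `ConsistencyConfig` type, the `mode` values, the stream names) is hypothetical and purely illustrative of the list above; it is not an existing Acho or Aden API.

```typescript
// Hypothetical shape of a declarative cross-system consistency config.
// All names are illustrative, not an existing product API.

type ConsistencyMode = "strong" | "optimistic"; // strong for low-latency systems, optimistic for rate-limited ones

interface ValidationRule {
  name: string;
  // An expression evaluated continuously against synced records to detect corruption.
  expression: string;
  onViolation: "alert" | "pause_sync";
}

interface SyncPair {
  source: string;                 // e.g. "postgres.orders"
  target: string;                 // e.g. "hubspot.deals"
  mode: ConsistencyMode;
  fields: Record<string, string>; // source field -> target field; unmapped fields stay untouched (partial consistency)
  rateLimitPerMinute?: number;    // respected when the target API is rate limited
}

interface ConsistencyConfig {
  auditLog: {
    sink: string;                      // where ordered time-series sync events are written
    ordering: "per-pair" | "global";   // strongly ordered audit trail for visibility
  };
  pairs: SyncPair[];
  validations: ValidationRule[];
}

// A small example configuration.
const config: ConsistencyConfig = {
  auditLog: { sink: "s3://audit/sync-events", ordering: "global" },
  pairs: [
    {
      source: "postgres.orders",
      target: "hubspot.deals",
      mode: "optimistic",
      fields: { order_total: "amount", customer_email: "contact_email" },
      rateLimitPerMinute: 90,
    },
  ],
  validations: [
    {
      name: "order-total-matches-deal-amount",
      expression: "postgres.orders.order_total == hubspot.deals.amount",
      onViolation: "alert",
    },
  ],
};
```

The point of a shape like this is that the consistency behavior lives in configuration the operations team can read and change, rather than in a pre-packaged workflow baked into someone else's product.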
Fighting the anti-patterns
The data industry has never been short of talented and capable engineers. The reason a flexible data modeling system has never been built mostly comes down to three factors:
- Lack of supporting business model
- Weak demand from technology-focused enterprises (no open-source initiatives)
- The difficulty of implementation
The idea of a fully configurable, flexible data consistency model is a clear anti-pattern for PLG (product-led growth) products, since it implies weak productization of the core feature set. It is a different story in the open-source community, but starting there is not a good idea either, as the major adopters of open-source dev tools are enterprises with strong IT support and fewer SaaS tools in their stack.
With the difficulty of implementation added on top of all this, it is a very hard decision unless we find a niche. Fortunately, that niche has become clearer and clearer recently: operations leaders in the middle market urgently need something that can automate their processes and cut SaaS seats.
- Schedule a Discovery Call
- Email us directly: contact@adenhq.com