Data-First Architecture
I recently had a light bulb moment when I saw a tweet from Evan Todd. It helped bring together some ideas I have had for a while on software architecture.
Data characteristics excluding software functionality should dictate the system architecture.
The shape, size and rate of change of the data are the most important factors when starting to architect a system. The first thing to do is estimate these characteristics in average and extreme cases.
Functional programming encourages this mindset since the data and functions are kept separate. F# has particular strengths in data-oriented programming.
I am going to make the case with an example. I will argue most asset management systems store and use the wrong data. This limits functionality and increases system complexity.

Traditional Approach
Most asset management systems consider positions , profit and returns to be their primary data. You can see this as they normally have overnight batch processes that generate and save positions for the next day.
This produces an enormous amount of duplicate data. Databases are large and grow rapidly. What is being saved is essentially a chosen set of calculation results.
Worse is that other processes are built on top of this position data such as adjustments, lock down and fund aggregation.
This architecture comes from not investigating the characteristics of the data first and jumping straight to thinking about system entities and functionality.

Data-First Approach
The primary data for asset management is asset terms , price timeseries and trades . All other position data are just calculations based on these. We can ignore these for now and consider caching of calculations at a later stage.
- terms data is complex in structure but relatively small and changes infrequently. Event sourcing works well here for audit and a changing schema.
- timeseries data is simple in structure and can be efficiently compressed down to 10-20% of its original size.
- trades data is a simple list of asset quantity flows from one entity to another. The data is effectively all numeric and fixed size. A ledger style append only structure works well here.
We can use the iShares fund range as an extreme example. They have many funds and trade far more often than most asset managers.
Downloading these funds over a period and focusing on the trade data gives us some useful statistics:
- Total of 280 funds.
- Ranging from 50 to 5000 positions per fund.
- An average of 57 trades per day per fund.
- The average trade values can be stored in less than 128 bytes.
- A fund for 1 year would be around 1.7 MB.
- A fund for 10 years would be around 17 MB.
- 280 funds for 10 years would be around 5 GB.
Now we have a good feel for the data we can start to make some decisions about the architecture.
Given the sizes we can decide to load and cache by whole fund history. This will simplify the code, especially in the data access layer, and give a greater number of profit and return measures that can be offered. Most of these calculations are ideally performed as a single pass through the ordered trades stored in a sensible structure. It turns out with in memory data this requires negligible processing time and can just be done as the screen refreshes.
More advanced functionality can be offered, such as looking at a hierarchy of funds and perform calculations at a parent level, with various degrees of filtering and aggregation. As the data is bitemporal we can easily ask questions such as «what did this report look like previously?» or even «what was responsible for a change in a calculation result?». Since the data is append only we can just update for latest additions and save cloud costs.
Conclusion

By first understanding the data, we can build a system that is simpler, faster, more flexible and cheaper to host.
Software developers cannot always answer questions on the size and characteristics of their system’s data. It has been abstracted away from them. People are often surprised that full fund history can be held in memory and queried.
We are not google. Our extreme cases will be easier to estimate. Infinitely scalable by default leads to complexity and poor performance.
With cloud computing, where architectural costs are obvious, right sizing is essential.
Most of the references I could find come from the games industry. I would be interested to hear about any other examples or counterexamples.
References
FAQ — some questions I’ve been asked
- How do you deal with previously reported values and make sure they will be the same in the future? The data model is bitemporal so we can request any reporting data as at any prior time. Lockdown process design becomes simply storing a timestamp for a reporting period. Reporting can make use of lockdown timestamps to produce a complete view of prior period adjustments with full details. Without a bitemporal data model this often becomes a reconciliation process, leading to further manual steps.
- What about reported values changing due to code changes? Reporting data can be saved when key reports are generated and used in regression testing. Regression testing of all reports using the old and new code can also be automated. This is very good practice for high quality systems and is not very difficult to implement.
Data-First Approach: Redefining Enterprise Architecture in the Digital Era
In the data-driven digital era, businesses are increasingly recognizing the strategic value of data. A paradigm shift is taking place in how we approach enterprise architecture (EA). The data-first approach is emerging as a revolutionary concept, turning traditional EA models on their head. In this new model, data is no longer the end product of a process but the foundation upon which all other elements are built.
The data-first approach begins with defining data products – structured sets of data that provide specific value to the organization. Each data product consists of multiple data points, providing an understanding of the properties and performance of the data, its place in the ecosystem, and much more.
Once the data products are defined, they lay the groundwork for the infrastructure layer. In the context of a multi cloud strategy, this infrastructure serves as a flexible storage and processing layer. It accommodates a wide range of data products, ensuring accessibility and scalability to accommodate increasing data volumes.
Following the infrastructure, APIs and microservices take the stage. They interact with the infrastructure to extract, process, and utilize the data, serving as a bridge between the raw data stored in the infrastructure and the applications that make use of the data.
Applications, the next layer, are designed around the available data. They utilize APIs and microservices to interact with the data, providing value-added services to end-users. In the new data-first architecture, applications are no longer just data generators; they’re also data consumers, providing services based on insights gleaned from the data.
These applications then form the basis for platforms, which deliver interconnected capabilities. Platforms provide a structured way for applications to interact, breaking down silos and enabling a seamless flow of data between different applications.
Lastly, capabilities tie everything together. They represent the concrete business outcomes achieved using the data, infrastructure, APIs, microservices, applications, and platforms.
By placing data at the core of the enterprise architecture, the data-first approach aligns the organization’s technological capabilities with its strategic objectives. This alignment helps to ensure that technology serves the organization’s goals rather than merely supporting them.
In conclusion, the data-first approach is not just a technological shift; it is a strategic transformation. It places data at the center of the enterprise architecture, underlining its role as a valuable strategic asset. This shift demands a new mindset from business and technology leaders – one that views data not just as a byproduct of operations but as a driving force of strategic value.
Why you should take a data first approach to AI

Machine learning systems are born through the marriage of both code and data. The code specifies how the machine should learn and the training data encapsulates what should be learned. Academia mostly focuses on ways to improve the learning algorithms, the how of machine learning. When you come to build practical AI systems though, the dataset you’re training on has at least as much impact on performance as the choice of algorithm.
Although there are a lot of tools for improving machine learning models, there are very few options when it comes to improving your data set. At Humanloop, we’ve given a lot of thought to how you can systematically improve data sets for machine learning.
Improving your dataset can dramatically improve AI performance
In a recent talk, Andrew Ng, shared a story from a project he worked on at Landing AI, building a computer vision system to help find defects in steel. Their first attempt at the system had a baseline performance of 76%. Humans could find defects with 90% accuracy so this wasn’t good enough to put into production. The team working on the project then split in two. One team worked on trying different model types, hyper-parameters and architecture changes. The other team looked to improve the quality of their data set. After a few weeks of iteration, the results came in. The modeling team despite huge effort had not been able to improve performance at all. The data team on the other hand were able to get a 16% performance improvement. Improving the dataset actually led to super-human performance on this task.
By fixing errors in their data set the data-team was able to take their algorithm from worse than human to super-human.
This story isn’t at all unique. I’ve had a similar experience at Humanloop. We worked with a team of lawyers from one of the big-4 accountancy firms to train a document classifier on legal contracts. Similar to finding defects in steel, the task was subtle and required domain expertise. After the first round of labeling and training was complete, the model still wasn’t good enough to match Human level performance. Within Humanloop, there’s a tool to investigate data points where there is disagreement between the AI model and the human annotators. Using this view the team were able to find around 30 misclassifications in a data set of 1000 documents. Fixing just these 30 mistakes was enough to get the AI system to match human-level performance.
What do «data bugs» look like?
There’s a lot of discussion of «data prep» and «data cleaning» but what actually differentiates high quality data from low quality?
Most machine learning systems today are still using supervised learning. That means that the training data consists of (input, output) pairs and we wish the system to be able to take the input and map to the output. For example, the input may be an audio clip and the output could be the transcribed speech. Or the input might be an image of a damaged car and the output could be the locations of all the scratches. At Humanloop we focus on NLP so an example input for us might be a customer service message and the output could be a templated response. Building these training datasets usually requires having humans manually label the inputs for the computer to learn from.
If there’s ambiguity in the way data is labeled then the ML model will need much more data to get to high performance. There are a few different ways that data collection and annotation can go wrong:
-
Simple misannotations The simplest type of mistake is just a misannotation. This is when an annotator, perhaps tired from lots of labeling, accidentally puts a data point in the wrong class. Although it’s a simple error, it’s surprisingly common and can have huge negative impact on AI system performance. A recent investigation of benchmark datasets in computer vision research found that over 3% of all data points were mislabeled. Over 6% of the ImageNet validation data set is mislabeled. How can you expect to get high performance when the benchmark data is wrong?!


Data quality matters even more in small datasets
Most companies and research groups don’t have access to the internet scale datasets that Google, Facebook and other tech giants have. When the dataset is that large you can get away with some noise in your data. However, most teams are operating in domains where they have hundreds to thousands of labeled examples. In this small data regime, data quality becomes even more important.

To get some intuition as to why data quality matters so much, consider the very simple 1-dimensional supervised learning problem shown above. In this case we’re trying to fit a curve to some measured data points. On the left we see a large noisy dataset and on the right a small clean data set. It’s clear that a small number of very low noise data points shows the same curve of a large but noisy dataset. The corollary of this is that noise in small datasets is particularly harmful. Though most machine learning problems are very high dimensional they operate on the same principles as curve fitting and are affected in analogous ways.
Intelligent tooling can make a huge difference to improving data quality
There are lots of tools for improving machine learning models but how can we systematically improve machine learning datasets?
Data Cleaning Tools
A workflow that some teams are beginning to adopt is to iterate between training models and then correcting «data bugs». Tools are emerging to facilitate this workflow such as label noise in context and Aquarium learning or the Humanloop data debugger.
The way these tools work is to use the model being trained to help find «data bugs». This can be done by looking at areas where the model and humans have high disagreement or at classes where there is high disagreement between different annotators. Various different forms of visualization can help find clusters of mistakes and fix them all at once.
Weak Labeling
Another approach to improving datasets is to embrace noise but use heuristic rules to scale annotation.
As we saw in the curve-fitting example above, you can get good results either through very small datasets that are clean or very large datasets that are noisy. The idea behind weak labeling is to automatically generate a very large number of noisy labels. These labels are generated by having subject-matter experts write down heuristic rules.
For example, you may have a rule for an email classifier that says «mark an email as job application if it contains the word ‘cv'». This rule will not be very reliable but can be automatically applied to thousands or millions of examples.
If there are lots of different rules then their labels can be combined and denoised to produce high quality data.
Active Learning
Data cleaning tools still rely on humans to manually find the mistakes in data sets and don’t help us address the problems of class imbalance described above. Active learning is an approach that trains a model as a team annotates and uses that model to search for high value data. Active learning can automatically improve the balance of data sets and help teams get to high performing models with significantly less data.
Data first approaches lead to better team collaboration
As we’ve written about recently, one of the big advantages of adopting a data-first approach to machine learning is that it allows for much better collaboration between all the different teams involved. Improving datasets forces collaboration between the subject-matter experts who are annotating data and the data scientists who are thinking about how to train models.
The increased involvement of non-technical subject-matter experts in training and improving machine learning models is one of the most exciting aspects of machine learning software compared to traditional software. At Humanloop we’ve been working on incorporating machine teaching into the normal workflows of non-technical experts so that they can automate tasks with much less dependence on machine learning engineers. In our next blog, we’ll share some of our lessons on how machine learning will be incorporated into many people’s daily jobs.
Frequently asked questions
What is a data first approach to machine learning?
A data first approach to machine learning focuses on the datasets that you use to train your machine learning models with. It involves looking for «data bugs» in your dataset and fixing them.
Andrew Ng describes how a data-centric approach improved performance by more than 16% while a model-centric approach failed to achieve any improvement in this talk.
What are «data bugs»?
- Misannotations: Incorrect labels. Even widely-used benchmark datasets have these! Check out this website with examples and the accompanying paper.
- Inconsistencies in annotation guidelines: The difference between label classes is often extremely subtle. For example, what exactly is or isn’t a «product» when classifying «Product reviews»?
- Imbalanced data or missing classes: Real world datasets often containing different number of examples for each category being classified. This can cause worse performance but also exacerbates problems of fairness and bias.
How do I adopt a data first approach to machine learning?
- Data cleaning tools such as Label Noise in Context, Aquarium learning, and the Humanloop Data Debugger can improve your workflows by highlighting where the model and humans have high disagreement and by creating visualizations that help spot common mistakes.
- With weak labeling, you can embrace noise and use heuristic rules to scale annotation. With lots of different rules, the labels can be combined and denoised to produce high quality data at massive scale.
- Active learning improves data balance and helps teams get to high performing models with significantly less data.
About the author
Name Raza Habib Role Cofounder and CEO
Raza is the CEO and Cofounder at Humanloop. He was inspired to work on AI as “the most transformative technology in our lifetimes” after studying under Prof David Mackay while doing Physics at Cambridge. Raza was the founding engineer of Monolith AI – applying AI to mechanical engineering, and has built speech systems at Google AI. He has a PhD in Machine Learning from UCL.
Raza is the CEO and Cofounder at Humanloop. He was inspired to work on AI as “the most transformative technology in our lifetimes” after studying under Prof David Mackay while doing Physics at Cambridge. Raza was the founding engineer of Monolith AI – applying AI to mechanical engineering, and has built speech systems at Google AI. He has a PhD in Machine Learning from UCL.
Is Data-First AI the Next Big Thing?
We are roughly a decade removed from the beginnings of the modern machine learning (ML) platform, inspired largely by the growing ecosystem of open-source Python-based technologies for data scientists. It’s a good time for us to reflect back upon the progress that has been made, highlight the major problems enterprises have with existing ML platforms, and discuss what the next generation of platforms will be like. As we’ll discuss, we believe the next disruption in the ML platform market will be the growth of data-first AI platforms.
Essential Components for an ML Solution
It is sometimes easy to forget now (or, tragically, maybe it’s all too real for some), but there was once a time when building machine learning models required a substantial amount of work. In days not too far gone, this would involve implementing your own algorithms, writing tons of code in the process, and hoping you make no crucial errors in translating academic work into a functional library. Now that we have things like scikit-learn, XGBoost, and Tensorflow/PyTorch, a large obstacle has been removed and it’s possible for non-experts to create models with a smaller amount of domain knowledge and coding experience, and perhaps get initial results back in hours. In the process, it can sometimes be easy to forget what is at the essence of a ML solution. Were we inclined to attempt to solve a ML problem from scratch, what would we need? I’ve long believed that there are four key pieces to any ML solution:
- Data: Training data is essential for any ML algorithm.
- Code: Pick your library of choice, but some code will be required to use it.
- Environment: We need somewhere to run the code.
- Model: The thing we’ll use to make predictions.

Image: Continual
The resulting outputs are predictions, which inevitably is what the business is interested in, but there’s non-trivial effort to get there. I bring this up because I’d like to propose it as a way to view the different generations of ML platforms, based upon which of the four elements listed above they focus on:
- Generation 1 is code- and environment-based: The focus is on writing code, collaborating, and making it easy to execute that code.
- Generation 2 is model-based: The focus is on quickly creating and tracking experiments, as well as deploying, monitoring, and understanding models.
- Generation 3 is data-based: The focus is on the construction of features and labels – the truly unique aspect of most use cases – and automation of the rest of the ML workflow.
Platforms can change greatly based upon their focus, as we’ll see below. The one that is right for your organization largely depends on what your goals are: do you need a platform that streamlines the development lifecycle for research-oriented data scientists (Gen 1), or something that helps your business execute AI use cases as quickly as possible with minimal risk (Gen 3)?
Generation 1: Collaborative Data Science
The modern take on the ML platform began to take shape in the late 2000s as an ecosystem of Python open-source libraries were emerging that made the task of developing data science relatively easy. Packages like NumPy (2006), pandas (2008), and scikit-learn (2007) made transforming and modeling data in Python much easier than it previously was, and, combined with tools like matplotlib (2003) and IPython (2001), newly minted data scientists could spin up a local development environment fairly quickly and readily have a multitude of tools at their disposal. Many early data practitioners came from the academic world and were previously accustomed to notebook-oriented tools like Mathematica and Maple, so it was no surprise that the release of IPython Notebooks in 2011 (later renamed to Jupyter Notebooks in 2015) came with much fanfare (Although we’ll take a Python-centric approach for this article, it’s worth noting that RStudio was also released in 2011). By this time, Python package and environment management were also getting easier thanks to Pypa (which maintains tools like virtualenv and pip); and a few years later data scientists got more powerful modeling tools with XGBoost (2014), TensorFlow (2015), and PyTorch (2016). All the pieces were really falling into place for Python data professionals.

Just an average day of using Lorenz Attractors to solve your business problems.
Image: jupyter.org
Notebooks were, and continue to be, one of the main tools that data scientists use on a day-to-day basis. Love them or hate them, they have entrenched themselves in the ML landscape in a way that no other editor technology has. However, as great as they may be, as real companies began adopting notebooks into their technology stacks, they inevitably discovered many gaps (this list is not exhaustive), such as:
- Sharing work and collaborating with peers
- Building a valid environment to run someone else’s notebook
- Getting access to the right infrastructure to run code
- Scheduling notebooks
- Enforcing coding standards
High-tech companies who adopted notebooks early have likely built some sort of bespoke solution that tackles the above challenges (and maybe more!). For the rest, software vendors began to emerge offering “collaborative data science” tools. These tools were built for data scientists: they revolve around notebooks and try to reduce the friction in collaborating and running code at the enterprise level. If we refer back to our original essential components of machine learning, these tools are squarely focused on code and environment. Contemporary solutions are cloud-based and run in containers, abstracting away even more complexity from the data scientist. Generally speaking, they tend to be good at what they do: providing a nice development sandbox for data scientists to explore and collaborate.

Bespoke systems are not always the best means of solving problems.
Over the years, however, these tools have demonstrated that they fall short in several areas (again, not exhaustive):
- Lack of Operational Model: By making the platform as general and flexible as possible, it becomes more difficult to automate common tasks.
- Difficult Path to Production: Notebooks are the core resource for this platform, but they are notoriously difficult to rely upon for production work and are greatly error-prone.
- Data Scientist-Focused: Code-based approach is great for data scientists, but it means other users in your organization will get little value out. Even the majority of people you pay to code (software developers) generally dislike notebooks.
- Encourage Pipeline Jungles: The manual nature of the platform means that any production work is going to necessitate a complex rig of data and API pipelines to ensure that things actually work.
- Harder tasks are “exercises left to the reader”: Feature Engineering, Feature Store, Model Deployment & Monitoring, all of this is either done manually or externally.
Although Gen 1 ML platforms have their use in development cycles, time has proven them to be poor systems for production work. I believe that most of the negative press concerning ML being a notoriously difficult field to operationalize is due to the fault of Gen 1 platforms. Notebooks can be great to prototype out a new, difficult use case, but they should quickly be discarded as ideas mature in order to focus on more robust systems. As such, they are generally not a good starting point for creating a production ML system.
Generation 2: Model-Based Point Solutions
Around the time when data science leaders became frustrated with notebook-based platforms, either built in-house or vendor-acquired, lots of people started talking about the “data science workflow” or Machine Learning Development Life Cycle (MLDLC). The point was to develop a framework akin to the Software Development Life Cycle, which is pretty standard in software development groups and helps guide that team from development into production. Such a thing was/is sorely needed in ML to help struggling teams put the pieces together in order to conceptualize a proper ML production strategy.

The Machine Learning Development Life Cycle
Image: Continual
As we discussed, collaborative data science tools leave many gaps in the path to production, and the MLDLC really highlights this. It will primarily be useful for the early stages of this cycle: around feature engineering and model training. For the rest of the cycle, we’ll need to find other tools.

Notebooks can be useful, but they still leave a lot of the MLDLC uncovered.
Image: Continual
Luckily, new tools were already on the rise: AutoML tools and experimentation trackers for model training, MLOps tools for model deployment and monitoring, explainable AI Tools (XAI) for model insights, and pipeline tools for end-to-end automation. Interestingly, feature store tools have only really begun to make an appearance in the last couple of years, but we’ll discuss those more in Gen 3. In any case, with enough dedication and elbow grease (or cash), you can cover up all the boxes in the MLDLC and feel content that you’ve built a robust ML platform.

This is known as “Bingo” in some circles.
Image: Continual
All these tools solve a specific problem in the MLDLC, and they all focus on models, not code. This represented a pretty big shift in thinking about the ML platform. Instead of expecting data scientists to solve all problems from scratch via coding, perhaps a more reasonable approach is to simply give them tools that help automate various parts of their workflow and make them more productive. Many data science leaders realize that their teams are primarily using algorithms “off the shelf”, so let’s focus on automating the less difficult parts of this process and see if we can arrange the jigsaw pieces into some kind of coherent production process.
This isn’t to say that everyone welcomes these tools with open arms. AutoML, in particular, can be met with a backlash due to data scientists either not trusting the results of the process, or perhaps feeling threatened by its presence. The former is a great case for adopting XAI in step with AutoML, and the latter is something that I believe fades over time as data scientists realize that it’s not competing with them, but rather something they can use to get better and faster results for the business. Nothing should be trusted without scrutiny, but AutoML can be a great tool for automating and templating what is likely going to become a very tedious process as you work through use case after use case.
On the surface, all these model-based point solutions look great. Collect them all like Pokémon and you’ve completed the MLDLC.
However, cobbling together point solutions is also not without its pitfalls:
- Integration Hell: To execute a simple use case, the ML-part of the solution requires four or more different tools. Good luck troubleshooting when something breaks.
- Pipeline Jungles Still Exist: And they’re arguably much worse than they were in Gen 1. You now have your original pipelines going into the ML platform, as well as several more between all your new tools.
- Isolated from Data Plane: These tools are all model-focused and operate on models, not data. You’ll still need a tool like a Gen 1 collaborative notebook to handle any data work that needs to be done, as these don’t provide those capabilities.
- Production is a Complex Web of API/SDK Acrobatics: A realistic production scenario in Gen 2 is: writing a script that generates training data (probably written from a notebook), passing the resulting Dataframe into your AutoML framework via API or SDK, passing the resulting model into your MLOps tool via API or SDK, and then running it through your XAI framework (if you even have one) to generate insights. How do you score new data? Similarly, write another script that leverages more APIs. Run all of this in something like Airflow, or maybe your Gen 1 platform has a scheduler function.
- Harder tasks are “exercises left to the reader”: Feature Engineering, Feature Store, Entity Relationship Mapping, etc… You’re still doing a decent amount of work elsewhere.
- Team of experts required: These tools love to claim that since they are automating parts of the process, they “democratize ML” to make it easy for anybody to self-serve. However, I’ve yet to really find one that places business context first and doesn’t require a team of K8s/cloud engineers, machine learning engineers, and data scientists to operate.
It’s worth noting that Gen 2 platforms have already evolved: more established vendors have either been iterating on new products or acquiring startups to expand offerings. Instead of buying multiple point solutions from multiple vendors, you may be able to buy them all from the same vendor, conveniently dubbed “Enterprise AI”. Unfortunately, what has resulted doesn’t adequately resolve any of the issues listed above, except maybe making integrations slightly less painful (but this is not a given, buyer beware). The main benefit is really just that you can buy all your shiny toys from the same vendor, and when you start working with the tech out of the box you quickly realize that you’re back to square one trying to rig up your own production process across products that share little in common except the name brand.
Don’t confuse this with a Gen 3 approach. There must be a better way.
Generation 3: Declarative Data-First AI
What really is a machine learning model? If we look at it abstractly, it takes data as input and spits out predictions and hopefully also model insights so we can evaluate how well the model is doing. If you accept this as your paradigm for machine learning, it becomes obvious that your ML platform needs to be data-focused. Gen 1 and Gen 2 are unnecessarily concerned with what is happening inside that model, as a result, it becomes nearly impossible for the average company to string together a dependable production process. But, with a data-first approach, this is actually attainable.

Image: Continual
To the credit of the Gen 1 and Gen 2 approaches, Gen 3 wouldn’t exist without them. Both because it builds upon some of the concepts they established, and without people struggling to actually operationalize ML with Gen 1 and Gen 2 tools, it likely never would have come about. At the heart of the data-first approach is the idea that AI has advanced enough that you should be able to simply provide a set of training data to your platform, along with a small amount of metadata or configuration, and the platform will be able to create and deploy your use case into production in hours. No coding is necessary. No pipelining. No struggling with DevOps tools as a data scientist. Operationalizing this workflow couldn’t be easier.
How is this possible? There are three core ingredients:
- Feature Store: Register your features and relationships. Automate feature engineering. Collaborate with peers so you don’t have to recreate the wheel every time you need to transform data. Let the feature store figure out how to serve data for training and inference.
- Declarative AI Engine: Raise the level of abstraction and automate building models and generating predictions. Allow power users to customize the experience via configuration.
- Continual MLOps and XAI: Recognize that the world isn’t static. Automate model deployment and promotion. Automate generating model insights. Allow data scientists to act as gatekeepers that review and approve work but put the rest on autopilot.
If you want to see what this looks like in practice, you can try the declarative data-first AI platform we are building at Continual. It sits on top of your cloud data warehouse and continually builds predictive models that never stop learning from your data. Users can interact with the system via CLI, SDK, or UI, but production use is easily operationalized via simple declarative CLI commands.
We’re not the only ones who are thinking about ML in a data-first approach. This idea has been kicking around in FAANG companies for several years, such as Apple’s Overton and Trinity and Uber’s Ludwig. A recent article on Declarative Machine Learning Systems provides an excellent summary of these efforts. Andrew Ng recently riffed on data-centric AI as has Andrej Karpathy from Tesla. We expect that many more are on their way. We also believe that declarative data-first AI is an essential part of the modern data stack, which promises to reduce the complexity of running a data platform in the cloud.
Data-first AI is an exciting new concept that has the potential to drastically simplify operational AI and help enterprises drive business impact from AI/ML. To highlight a few important consequences of data-first AI:
- Reliable path to production: Simplify production ML via a well-defined operational workflow.
- End-to-End Platform: Accelerate time to value by reducing integration tasks and pipeline jungles.
- Democratization of AI: Provide a system so easy that all data professionals can use it. Guard rails allow data scientists to control the process.
- Accelerate Use Case Adoption: Set up production workflows in days, not weeks or months. Manage more production use cares with less resources.
- Reduce Costs: Buy less stuff and lower the cost of maintenance.
Although we believe data-first platforms will arise to be the predominant ML platforms for everyday AI, they are not without their limits. For truly cutting-edge AI research, there’s probably nothing that can get around the fact that manual work will be needed. This may not be a large concern outside the most technical of companies, but it helps to have a development-focused tool available for such times. We believe that the data-first platform is great at solving 95% of the known ML problems out there, and the other 5% may require more TLC. However, we think it’s a monumental improvement to enable 95% of your use cases to be handled by data engineers/analysts with some oversight by a data scientist, and allow the data science team to focus more on the difficult 5% of problems. To do this, they need a stellar system that automates everything and lets them manage and maintain workflow with little intervention required: ala a data-first platform.
What Tool is Right For Your Team?
We’ve covered a lot of ground in this article and discussed a lot of tooling options. At times, the ML/AI tooling landscape can feel overwhelming. The data-first approach to AI disrupts many preconceived notions and its power is best seen to be believed. At Continual, we’re strong believers that ML/AI solutions should be evaluated using your real world use cases. With many solutions, this can take weeks or months and exposes hype from reality. At Continual, our goal is to enable you to deliver your first production use cases in a day. That’s the power of a declarative data-first approach to AI that integrates natively with your cloud data warehouse. If this sounds intriguing, register for our upcoming webinar or reach out to us for a demo so you can experience it for yourself.