Scalable, Resilient Data Orchestration: The Power of Intelligent Systems
Practical insights for software professionals seeking to design large-scale data orchestration systems that unlock actionable intelligence.
Data is the key driver for any intelligent solution, including AI/ML. The accuracy and quality of any AI/ML model are directly proportional to the quality of data, regardless of whether it takes the form of input data, a prompt, or a pre-trained knowledge dataset. Often, the training datasets for AI/ML models originate from multiple sources and undergo various stages of data processing before being transformed into useful information that models can rely on for training. To achieve a reliable and continuous flow of data from diverse sources, transform data based on business rules, and extract insights and recommendations, we need a data orchestration solution.
This article discusses the unsung hero that orchestrates data flow across multiple components, including pipelines, to deliver the intended outcome. It intentionally focuses on the characteristics and principles of a data orchestrator that remain architecturally stable and resilient over time. The topics covered are technology-agnostic and apply to any industry-recognized data orchestration tool. As engineers, we should let the solution architecture be driven by the required capabilities and non-functional requirements, not purely by the tools available in the market. This article offers a refreshing take on data orchestration and shares my experience with the distinguished community of software professionals on designing large-scale data orchestration systems that ultimately surface insights and power intelligent systems.
What Is Data Orchestration?
Data orchestration is a set of related tasks executed in an order driven by specific use case needs. Data orchestration goes by many names, such as workflow or state machine. Essentially, these are represented as Directed Acyclic Graphs (DAGs), composed of nodes and edges: the nodes represent individual tasks, while the edges represent triggers and dependencies. A DAG can compose multiple subsystems and connect them through events that trigger actions and share data payloads along with execution context, thereby keeping the orchestration well-informed as it charts a course of action.
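To make the DAG idea concrete, below is a minimal, framework-agnostic Python sketch that models tasks as nodes and dependencies as edges, then executes them in topological order while passing a shared execution context. The task names and payload are purely illustrative.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Nodes are tasks; edges express "runs after" dependencies.
dag = {
    "ingest": set(),                 # no upstream dependencies
    "validate": {"ingest"},          # runs after ingest
    "transform": {"validate"},
    "publish_insights": {"transform"},
}

def run_task(name: str, context: dict) -> None:
    # Placeholder for real work: call a pipeline, a service, or a model.
    print(f"running {name} with context {context}")
    context[name] = f"{name}_done"   # share interim results downstream

def execute(dag: dict) -> None:
    context = {"run_id": "demo-001"}  # execution context passed to every task
    for task in TopologicalSorter(dag).static_order():
        run_task(task, context)

if __name__ == "__main__":
    execute(dag)
```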
Data Orchestration vs Data Pipeline
A data pipeline is a component built to process data from one or more heterogeneous data sources and persist the results. A pipeline is often built with a specific technology, such as Apache Spark, Apache Flink, or another engine suited to the processing needs. Data orchestration, however, spans multiple components, and the execution flow is determined by the execution state (hence the term state machine). The tasks that make up the data orchestrator are not purely data processors, meaning they may not process data themselves. They delegate the processing to other services operating from containers, instances, or big data services, often connected by message brokers (such as Kafka) or other interfaces. AWS Step Functions and Apache Airflow are popular off-the-shelf products for data orchestration. The concept of data orchestration also applies to real-time services that make decisions driven by data.
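As a hedged illustration of that delegation pattern, the sketch below shows an orchestration task that does not process data itself: it publishes a job request to a Kafka topic (using the kafka-python client) and leaves the heavy lifting to a downstream processing service such as a Spark or Flink job. The topic name, broker address, and payload fields are assumptions made for the example.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def delegate_processing(run_id: str, dataset_uri: str) -> None:
    """Orchestration task: hands the work to a downstream processor."""
    job_request = {
        "run_id": run_id,            # execution context for lineage/auditing
        "dataset_uri": dataset_uri,  # where the processor should read from
        "operation": "transform",    # hypothetical operation name
    }
    # The orchestrator only records state transitions; the Spark/Flink
    # consumer on this topic performs the actual data processing.
    producer.send("processing-requests", value=job_request)
    producer.flush()

delegate_processing("demo-001", "s3://example-bucket/raw/2024-06-01/")
```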
Traits of Good Data Orchestration Design
Responsive to Triggers
An orchestrator should allow one or more types of triggers to start an execution. The trigger could be an API endpoint, a CLI command, or an event generated by data activity, such as publishing data to a queue/topic, adding or modifying objects in object storage, or receiving push notifications from other sources. The ability of the orchestrator to respond to both events and API calls makes the solution versatile, supporting both automated and on-demand execution. Additionally, the orchestrator is often fronted by API gateways or reverse proxies to receive incoming traffic.
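A minimal sketch of the same orchestration entry point serving both an on-demand API trigger and an automated event trigger, assuming a FastAPI front end and a generic object-storage event payload (both are assumptions, not a prescribed stack):

```python
import uuid
from fastapi import FastAPI  # pip install fastapi uvicorn

app = FastAPI()

def start_execution(source: str, payload: dict) -> str:
    """Single entry point into the orchestrator, whatever the trigger."""
    run_id = str(uuid.uuid4())
    # Here the orchestrator would enqueue or launch the DAG with this context.
    print(f"starting run {run_id} triggered by {source}: {payload}")
    return run_id

@app.post("/executions")
def api_trigger(payload: dict) -> dict:
    """On-demand trigger, typically fronted by an API gateway or reverse proxy."""
    return {"run_id": start_execution("api", payload)}

def on_object_created(event: dict) -> str:
    """Event trigger, e.g., a new object landing in object storage."""
    payload = {"bucket": event.get("bucket"), "key": event.get("key")}
    return start_execution("object-storage-event", payload)
```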
Modular and Composable
An orchestrator can call another orchestrator to extend its functionality. Each orchestrator may represent a sub-system with distinct functionality that can be composed for different business use cases, promoting reusability and preventing the construction of siloed systems that are essentially identical but differ marginally in specifics. It is very important to keep track of the execution lineage as it transitions from one orchestration process to another.
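A short sketch of this composition, with a hypothetical ExecutionContext that carries a lineage trail as one orchestration invokes another; the workflow names are invented for illustration:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class ExecutionContext:
    run_id: str
    lineage: list = field(default_factory=list)  # parent -> child chain

def run_child_orchestration(name: str, parent: ExecutionContext) -> ExecutionContext:
    """Compose a sub-system as a child orchestration, preserving lineage."""
    child = ExecutionContext(
        run_id=str(uuid.uuid4()),
        lineage=parent.lineage + [parent.run_id],  # record where we came from
    )
    print(f"{name}: run {child.run_id}, lineage {child.lineage}")
    return child

root = ExecutionContext(run_id=str(uuid.uuid4()))
enrich = run_child_orchestration("enrichment-workflow", root)
score = run_child_orchestration("scoring-workflow", enrich)
```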
Scalable Execution in Parts or Whole
One or more data events can occur at a time, and the orchestrator should be able to handle those events in parallel. If more parallelism is needed, the orchestrator should bring up additional execution instances to handle the requests, as long as the number of parallel requests stays well under the rate limit. However, scaling a data orchestrator is a little nuanced: the orchestrator may only need to scale a specific task or subsystem to handle a data-intensive operation.
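One way to picture selective scaling is to give each task type its own concurrency limit, as in this asyncio sketch; the limits and sleep times are arbitrary placeholders for real work:

```python
import asyncio

async def light_task(event_id: int, sem: asyncio.Semaphore) -> None:
    async with sem:
        await asyncio.sleep(0.01)  # stand-in for cheap bookkeeping work

async def heavy_task(event_id: int, sem: asyncio.Semaphore) -> None:
    async with sem:
        await asyncio.sleep(0.05)  # stand-in for a data-intensive operation

async def handle_event(event_id: int, light_sem, heavy_sem) -> None:
    await light_task(event_id, light_sem)
    await heavy_task(event_id, heavy_sem)

async def main() -> None:
    # Hypothetical per-task concurrency limits: only the data-intensive task
    # is scaled out; the rest of the workflow keeps a modest footprint.
    light_sem = asyncio.Semaphore(2)
    heavy_sem = asyncio.Semaphore(16)
    # Many events arrive at once; the orchestrator handles them in parallel,
    # but each task type scales only up to its own limit.
    await asyncio.gather(*(handle_event(i, light_sem, heavy_sem) for i in range(100)))

asyncio.run(main())
```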
Serial and Parallel Execution Tasks
Like any DAG or workflow, the orchestrator can have some tasks that execute serially due to dependencies and others that run in parallel because they are independent. An orchestrator can have sequences of one or more serial and parallel task combinations; for example, fanning out to fetch information about a topic from multiple websites and fanning in to generate a summary. Tasks can also be dynamic and data-driven. Each task should receive an execution context that contains common workflow metadata (such as data source connection parameters) as well as interim results shared by upstream tasks, which influence whether the task is in scope and how it executes. To put it simply, the state of the execution should shape the execution flow, and that is exactly what state machines do. Orchestrators work the same way: they are state machines, too.
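The fan-out/fan-in example above might look like the following asyncio sketch, where independent fetch tasks run in parallel and a single summarization task consumes their interim results via the shared execution context; the sources and topic are made up:

```python
import asyncio

async def fetch_source(context: dict, source: str) -> str:
    """Fan-out: each source is fetched independently, in parallel."""
    await asyncio.sleep(0.1)  # stand-in for an HTTP call
    return f"notes about {context['topic']} from {source}"

async def summarize(context: dict, documents: list[str]) -> str:
    """Fan-in: a single downstream task consumes all interim results."""
    return f"summary of {len(documents)} documents on {context['topic']}"

async def run_workflow() -> str:
    context = {"run_id": "demo-001", "topic": "data orchestration"}
    sources = ["site-a.example", "site-b.example", "site-c.example"]
    documents = await asyncio.gather(*(fetch_source(context, s) for s in sources))
    return await summarize(context, list(documents))

print(asyncio.run(run_workflow()))
```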
Retry Mechanism
Any task that is part of the orchestration can fail, and restarting from scratch is an expensive option that can cause unnecessary churn. Imagine you are training a Large Language Model (LLM) and it encounters a failure while classifying a dataset; starting the process all over again is seriously counter-productive. Each task should be able to retry upon failure, with a fixed maximum number of retries, and, most importantly, produce idempotent results, meaning that re-processing the same data should not yield an output different from the first successful attempt.
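A minimal sketch of bounded retries with backoff plus idempotent output, assuming a hypothetical classify_dataset task whose output location is derived deterministically from its input:

```python
import time

MAX_RETRIES = 3

def run_with_retries(task, payload, max_retries: int = MAX_RETRIES):
    """Retry a failed task a bounded number of times with backoff."""
    for attempt in range(1, max_retries + 1):
        try:
            return task(payload)
        except Exception:
            if attempt == max_retries:
                raise                # surface the failure to the orchestrator
            time.sleep(2 ** attempt)  # simple exponential backoff

# Idempotency sketch: derive the output location from the input alone, so
# re-processing the same data overwrites the same result instead of
# producing duplicates.
def classify_dataset(payload: dict) -> str:
    output_key = f"classified/{payload['dataset_id']}.parquet"  # deterministic
    # ... run classification and write results to output_key ...
    return output_key

print(run_with_retries(classify_dataset, {"dataset_id": "batch-42"}))
```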
Reliable Restart Capabilities
Orchestrators can fail or restart at any time, for example during deployments and other maintenance activities. When such events occur, orchestrators should restore the previous state and continue processing from that last known good state. Checkpointing is a common practice in data processing systems, where the system's state is periodically snapshotted to persistent targets such as disk, object storage, or a database, allowing it to resume from where it left off after a restart. Orchestration systems that consume continuous data streams, such as Kafka, should commit the offsets of successfully processed events during checkpointing and then resume from the last committed offset when restarted. It is essential to note that a restarted system may process duplicate events; therefore, the system should ensure its outcome is idempotent unless duplicates are specifically intended. It is recommended to explore patterns such as exactly-once and at-least-once processing, as this topic can be nuanced and a firm understanding is beneficial. Ultimately, a good system should also provide an option to ignore the previous state when a restore is impossible.
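A hedged sketch of the offset-commit pattern using the kafka-python client (topic, group, and broker names are assumptions): the offset is committed only after the processed state has been checkpointed, so a restart resumes from the last committed offset at the cost of possibly reprocessing a few events, which is why idempotent processing matters.

```python
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "orchestration-events",            # assumed topic name
    bootstrap_servers="localhost:9092",
    group_id="data-orchestrator",      # assumed consumer group
    enable_auto_commit=False,          # commit only after checkpointing
)

def process(event_value: bytes) -> None:
    ...  # idempotent processing: duplicates after a restart must be harmless

def checkpoint(state: dict) -> None:
    ...  # snapshot state to disk, object storage, or a database

state: dict = {}
for message in consumer:
    process(message.value)
    state["last_offset"] = message.offset
    checkpoint(state)   # persist the state first...
    consumer.commit()   # ...then commit the offset for this event
```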
Transactional Execution
The requirements of data orchestration can vary from one use case to another. An orchestration system can feed the same input to one or more AI/ML models, accept responses from some by priority or first result, and require the others to stop processing because their outcome is no longer needed. This is a well-known practice in AI/ML model execution called Champion/Challenger, where models are pitted against each other with the same input to assess their characteristics. It is worth terminating a long-running model execution whose outcome will be ignored, to avoid unnecessary cost.
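A simplified Champion/Challenger sketch using asyncio, taking the first completed result and cancelling the rest so we stop paying for executions whose outcome is no longer needed; the model names and latencies are fabricated for illustration:

```python
import asyncio
import random

async def call_model(name: str, payload: dict) -> str:
    """Stand-in for invoking a deployed AI/ML model with the same input."""
    await asyncio.sleep(random.uniform(0.1, 1.0))
    return f"{name} scored {payload['application_id']}"

async def champion_challenger(payload: dict) -> str:
    # Fan the same input out to the champion and the challengers.
    tasks = {
        asyncio.create_task(call_model(name, payload))
        for name in ("champion", "challenger-a", "challenger-b")
    }
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    # Cancel long-running executions whose outcome is no longer expected,
    # avoiding unnecessary compute cost.
    for task in pending:
        task.cancel()
    return next(iter(done)).result()

print(asyncio.run(champion_challenger({"application_id": "cc-123"})))
```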
Auditable System Operations
Most would agree that managing a data orchestration system is orders of magnitude harder than developing one. Operators can consider themselves lucky when the system fails catastrophically and commands immediate attention. Silent failures, such as processing inaccuracies, are much harder to identify, and correcting them requires significant effort for out-of-cycle processing and careful execution that considers all edge cases. This has far-reaching implications if the use case is related to banking, finance, or any other regulation-heavy domain. Nobody wants to be in this situation, trust me.
It is highly advisable to persist execution summaries, such as the tasks involved in approving or declining a credit card application and the tasks skipped during the determination. A big shout-out to AWS Step Functions, whose visual execution summary feature comes in very handy when analyzing the state of an execution instance. For those new to AWS Step Functions, the tool creates an instance of a workflow for each event trigger and provides near-real-time, interactive status updates on the tasks that make up the workflow.
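For systems not built on a managed workflow service, one lightweight approach is to persist a small audit record per task, as in this sketch; the record fields and statuses are assumptions:

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class TaskAuditRecord:
    run_id: str
    task: str
    status: str    # e.g., SUCCEEDED, FAILED, SKIPPED
    detail: str
    timestamp: float

def persist_audit(record: TaskAuditRecord) -> None:
    # In practice this would land in a database, object storage, or a log stream.
    print(json.dumps(asdict(record)))

persist_audit(TaskAuditRecord("run-001", "credit-check", "SUCCEEDED",
                              "score above threshold", time.time()))
persist_audit(TaskAuditRecord("run-001", "manual-review", "SKIPPED",
                              "not required for this applicant", time.time()))
```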
The synergy between data and Artificial Intelligence continues to unlock new frontiers of intelligence, pushing the boundaries of what we thought was possible. Let us not underestimate the importance of the often-overlooked data orchestration work – it's the steady heartbeat that fuels innovation, driving discovery and progress forward.
Trending Practices
Leverage Object Storage Over Databases
When scaling large data systems, throughput is often throttled by external systems, such as databases, that are not easily scalable and come with a heavy price tag. Databases have some niche advantages, making it easy to retrieve and update state. However, as scale increases, keeping up with the desired CPU, memory, storage, and partitions (if the database is sharded) can be operationally heavy, expensive, or both. Based on recent successes, leveraging object storage for data persistence has proven to be a practical alternative. It comes with high availability, theoretically unlimited storage, cross-region replication, and SQL-like query engines such as Presto, as well as notebooks and big data frameworks (Spark, Flink, etc.). As with databases, data can be partitioned into files using shard keys, and old partitions can be pruned using lifecycle policies. For common use cases, choosing object storage over databases appears increasingly beneficial as the gap narrows for large-scale processing.
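A brief sketch of the partitioned-files idea using pyarrow; the dataset, shard key, and root path are illustrative, and with an s3:// root path (plus the s3fs package) the same call writes directly to object storage:

```python
import pyarrow as pa
import pyarrow.parquet as pq  # pip install pyarrow

# Example records keyed by a shard-friendly column (event_date).
table = pa.table({
    "event_date": ["2024-06-01", "2024-06-01", "2024-06-02"],
    "customer_id": [101, 102, 103],
    "amount": [25.0, 49.5, 12.0],
})

# Partition the dataset into files by shard key; old partitions can later be
# pruned with object-storage lifecycle policies and queried in place with
# Presto, Spark, or notebooks.
pq.write_to_dataset(table, root_path="events", partition_cols=["event_date"])
```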
Experiment With File Formats
Traditional text-based file formats, such as CSV and JSON, can be inefficient for storing and retrieving data. Often, data systems are interested in a specific subset of the data, and retrieving that subset without loading and scanning everything is challenging, which incurs unnecessary overhead on CPU, memory, and network I/O. Specialized file formats like Parquet, a columnar data file format, are effective for reading and writing data, offering good compression that keeps costs low and speeding up query performance with less I/O overhead. Parquet acts like a database at the edge and has good community support for querying the data using big data and Hadoop SQL engines. It is not an overstatement to say that Parquet is becoming the de facto choice of data file format in data systems.
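As a small illustration of why columnar formats help, the sketch below reads only two columns from the Parquet dataset written in the previous example, avoiding a full scan; the path and column names are assumptions:

```python
import pyarrow.parquet as pq  # pip install pyarrow

# Columnar formats let a task read only the columns it needs, cutting
# CPU, memory, and I/O compared with scanning a full CSV/JSON file.
table = pq.read_table("events", columns=["customer_id", "amount"])
print(table.num_rows, table.column_names)
```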
Prioritize Data Over Metadata in Streams
Streaming the object location (i.e., metadata about the object) rather than the object content as the data source, and expecting the pipeline to load and process the object content, is generally a bad design in stream-based processing systems. Processing the object content becomes a bottleneck, because parallelism occurs at the object level rather than per record. This can lead to out-of-memory (OOM) issues when the data read from the object location exceeds the memory available on the executor/worker/task to which the big data engine delegates the work. It is recommended instead to ingest the records contained in the objects rather than their locations, thereby forming a true stream and increasing parallelism. This approach leads to better distribution of processing across executors and improved throughput.
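A hedged sketch contrasting the two approaches with the kafka-python producer: publishing object pointers (the anti-pattern) versus publishing the records themselves so the stream processor can parallelize per record. Topic names and record fields are invented for the example.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_object_pointer(bucket: str, key: str) -> None:
    """Anti-pattern: one message per object forces object-level parallelism."""
    producer.send("raw-objects", value={"bucket": bucket, "key": key})

def publish_records(records: list[dict]) -> None:
    """Preferred: stream the records themselves so work is spread per record."""
    for record in records:
        producer.send("raw-records", value=record)
    producer.flush()

publish_records([{"customer_id": 101, "amount": 25.0},
                 {"customer_id": 102, "amount": 49.5}])
```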
Conclusion
Modern data systems have evolved beyond a single, monolithic data pipeline. The abundance of data and the need to unlock its intelligence differentiate industries and services that must remain competitive in the era of Artificial General Intelligence (AGI). There is growing demand to build private AI/ML systems powered by Retrieval Augmented Generation (RAG) that harness enterprise and/or third-party data, making enterprise services smarter, cheaper, and better. While data orchestration systems often receive less attention than they deserve, production-ready, enterprise-compliant AI/ML systems require highly scalable and highly resilient data systems that keep engineering efficient storage and file formats for storing and retrieving data, extracting insights, and computing aggregates faster and cheaper on resources, all while safeguarding data integrity and adhering to data governance practices. It is very important for the engineers who build these systems to focus on what the systems should do rather than on finding a popular tool in the market. A strong understanding of the principles and characteristics of a data orchestration system will yield one that is expandable, iterable, and resilient over time.