From Kubeflow to Flyte: A More Reliable ML Orchestration Foundation

At aiXplain, as our machine learning team worked to build, deploy and manage increasingly complex ML workflows, it became apparent that our initial orchestration framework, Kubeflow, was making us less productive over time. Kubeflow is an open-source ML toolkit created by Google that provides Kubernetes-native pipelines. We initially chose it for its workflow management capabilities and relative stability.

However, significant limitations emerged. Pipelines were fragile and broke frequently, especially when Python dependencies changed. The system was also not modular, which made it hard to test stages independently and hard to scale.

Kubeflow also didn’t match how our data scientists preferred to work. All this complexity slowed down our team. After looking at other options like Flyte, Airflow, and Luigi, we decided Flyte was the best fit. It was simpler and more reliable for our needs.

Kubeflow

We originally selected Kubeflow for its robustness and workflow management capabilities, but over time its limitations became clear and it grew increasingly impractical for us.

One of the major issues we encountered was the fragility of Kubeflow’s pipelines. They proved brittle, particularly when Python dependencies changed, which caused disruptions and slowed development. Kubeflow was also a complex platform, particularly for data scientists. Its intricate, Kubernetes-centric structure made it hard for them to navigate without help from ML engineers. Data scientists often had to create and manage Kubernetes clusters during development and testing, which significantly extended the time required to complete projects.

A philosophical misalignment also emerged between Kubeflow and our data scientists. Kubeflow insisted on running everything on Kubernetes, while our data scientists often preferred the flexibility of local development environments. This clash created friction and slowed down development.

Furthermore, Kubeflow lacked modularity, which made it difficult to test individual pipeline roles and stages independently. This hindered our ability to scale projects efficiently as complexity increased.

Maintenance proved to be another substantial hurdle with Kubeflow. Frequent changes and fixes, often necessitated by shifting dependencies, created significant overhead for the team. Core components sometimes lacked stability, eroding confidence in the platform’s reliability.

Taken together, these challenges (increased costs, scalability issues, a cumbersome developer experience, slow innovation cycles, and a mismatch with our data scientists’ workflow preferences) made it clear that we needed a more seamless and reliable orchestration solution. It was time for a change.

Flyte

Flyte is an open-source platform, created by Lyft, for orchestrating ML workflows in Python. It gained strong momentum within Lyft and had extensive documentation available. Flyte emphasizes local testing and a modular architecture, which aligned well with our data scientists’ preferences. It abstracts away most cluster management, provides the reliability needed for production workflows, and removes the need for data scientists to interact directly with complex Kubernetes components.
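
To make the local-testing point concrete, here is a minimal sketch of what a small workflow can look like with flytekit, Flyte’s Python SDK. The task and workflow names (normalize, mean, preprocess_wf) and the sample data are hypothetical and not taken from our pipelines. The try/except fallback is our own addition, there only so the sketch also runs on a machine where flytekit is not installed.

```python
from typing import List

try:
    from flytekit import task, workflow
except ImportError:
    # Assumption for illustration only: without flytekit installed, fall back
    # to no-op decorators so the functions stay plain Python and still run.
    def task(fn):
        return fn

    def workflow(fn):
        return fn


@task
def normalize(values: List[float]) -> List[float]:
    """Scale values into [0, 1]; a hypothetical preprocessing stage."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero for constant input
    return [(v - lo) / span for v in values]


@task
def mean(values: List[float]) -> float:
    """Average of the values; a hypothetical summary stage."""
    return sum(values) / len(values)


@workflow
def preprocess_wf(values: List[float]) -> float:
    # Inside a workflow, tasks are wired together with keyword arguments.
    scaled = normalize(values=values)
    return mean(values=scaled)


if __name__ == "__main__":
    # Runs the whole workflow in the local Python process, no cluster needed.
    print(preprocess_wf(values=[2.0, 4.0, 6.0]))
```

Calling preprocess_wf(values=...) like a regular Python function executes the workflow locally, and each task can be called the same way on its own. That is the property that lets a single stage be tested in isolation before anything is deployed.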

Flyte’s core philosophical approach of simplifying ML workflow orchestration for data scientists matched our goal of increasing productivity. Its stability, mindshare, and simplicity made it a compelling successor after weighing the various options.

The Migration Process

We migrated from Kubeflow to Flyte gradually, in phases. Our process involved:

  • Researching Flyte and setting up proofs of concept (POCs) to test its viability
  • Translating existing Kubeflow pipelines to Flyte workflows using the Flyte SDK
  • Iteratively migrating and validating critical workflows
  • Retiring Kubeflow pipelines once Flyte was stable
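
One way to approach the "migrating and validating" step in the list above is a parity check: run the old step logic and its ported Flyte task on the same inputs and flag any disagreement. The sketch below is illustrative only. The functions legacy_score, score, and check_parity are hypothetical stand-ins, not our actual components, and the no-op fallback decorator is there only so the sketch runs without flytekit installed.

```python
import math
from typing import List

try:
    from flytekit import task
except ImportError:
    # Assumption for illustration only: no-op decorator when flytekit is absent.
    def task(fn):
        return fn


def legacy_score(features: List[float]) -> float:
    """Hypothetical stand-in for the body of an existing Kubeflow component."""
    return sum(f * f for f in features) ** 0.5


@task
def score(features: List[float]) -> float:
    """The same logic, ported to a Flyte task (hypothetical example)."""
    return math.sqrt(sum(f * f for f in features))


def check_parity(cases: List[List[float]], tol: float = 1e-9) -> List[int]:
    """Return the indices of the cases where old and new outputs disagree."""
    mismatches = []
    for i, case in enumerate(cases):
        old = legacy_score(case)
        new = score(features=case)  # Flyte tasks are called with keyword args
        if not math.isclose(old, new, rel_tol=tol, abs_tol=tol):
            mismatches.append(i)
    return mismatches


if __name__ == "__main__":
    cases = [[3.0, 4.0], [0.0], [1.0, 2.0, 2.0]]
    print("mismatching cases:", check_parity(cases))
```

An empty result means the ported task agreed with the legacy logic on every case. Using a tolerance rather than exact equality is a deliberate choice here, since refactored numeric code can differ in the last few bits without being wrong.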

Challenges included learning Flyte’s architecture, reworking incompatible components, configuring permissions and interfaces, filling documentation gaps, identifying failures, maintaining consistency across systems, and getting organizational buy-in. Despite these hurdles, the investment enabled us to replace the burdensome complexity of Kubeflow with the simplicity of Flyte.

Impact of Moving to Flyte

Flyte’s local testing and modular architecture sped up development cycles and debugging. Data scientists gained more autonomy because they no longer had to deal with Kubernetes’ complexity. Flyte provided the reliability and stability needed for production workflows, and simpler workflows reduced the time we spent on maintenance.

Additional benefits included:

  • Reduced infrastructure costs by avoiding Kubernetes clusters for testing
  • Tighter alignment with data science preferences
  • Smoother scaling of workflows as complexity grows
  • Faster time-to-production with rapid iterations
  • Increased productivity and innovation

By switching to Flyte, we reduced friction, boosted reliability, and improved the velocity and delivery of ML initiatives. The gains underscored the importance of selecting an ML orchestration platform aligned with team skills and philosophical approaches.

Conclusion

Going forward, we plan to keep building proficiency with Flyte as a core part of our stack. We will optimize workflows, make fuller use of Flyte’s UI, explore integrations, contribute to the open-source community, and follow new developments in the Flyte project.

The switch to Flyte gave us a workflow orchestration foundation aligned with our needs. We’re excited to see what new productivity gains and innovations this simpler, more reliable platform unlocks next.