Scalable and reproducible workflows with Pachyderm

by Jon Ander Novella (Uppsala Universitet), available at https://indico.scc.kit.edu/event/427/contributions/4248/.

Description

Data scientists must manage analyses that consist of multiple stages, large datasets and a great number of tools, all while keeping results reproducible. Among the tools available for parallel computation, Pachyderm is an open-source workflow engine and distributed data processing tool that addresses these needs by adding a data pipelining and data versioning layer on top of projects from the container ecosystem. In this workshop you will learn how to:

  • create a simple local Kubernetes infrastructure,
  • install and interact with Pachyderm and
  • implement a scalable and reproducible workflow using containers.
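
To give a rough idea of what the last step can look like, the sketch below builds a pipeline specification as a Python dictionary and prints it as JSON. It is not taken from the workshop material. The field names (pipeline, transform, input.pfs, parallelism_spec) follow the Pachyderm pipeline specification as commonly documented and may differ between Pachyderm versions; the pipeline name, repository name, container image and script path are hypothetical placeholders.

```python
import json


def make_pipeline_spec(name, input_repo, image, cmd, workers=1):
    """Build a dict laid out like a Pachyderm pipeline specification.

    Assumption: field names match the pipeline spec of the Pachyderm
    version in use; check the documentation for your release.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return {
        "pipeline": {"name": name},
        # The container image and command that process the data.
        "transform": {"image": image, "cmd": list(cmd)},
        # The glob "/*" treats each top-level file or directory in the
        # input repository as a separate unit of work, so units can be
        # spread across workers.
        "input": {"pfs": {"repo": input_repo, "glob": "/*"}},
        # Fixed number of parallel workers.
        "parallelism_spec": {"constant": workers},
    }


# All names below are hypothetical and only serve as an illustration.
spec = make_pipeline_spec(
    name="wordcount",
    input_repo="texts",
    image="example/wordcount:1.0",
    # Input data is mounted under /pfs/<repo>; results go to /pfs/out.
    cmd=["python3", "/count.py", "/pfs/texts", "/pfs/out"],
    workers=4,
)

spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Saved to a file, a specification like this would typically be submitted with `pachctl create pipeline -f <file>` after the input repository has been created. Because the code runs inside a fixed container image and reads from a versioned repository, rerunning the pipeline on the same commit is expected to reproduce the same output.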

Keywords: pachyderm workflow management