Scalable and reproducible workflows with Pachyderm

By Jon Ander Novella (Uppsala Universitet), available at https://indico.scc.kit.edu/event/427/contributions/4248/

Skills you will gain

Plan stewardship and sharing of FAIR outputs
Set up and document workflows
Data cleaning, processing and software versioning
Data transformation and integration
Prepare and document for FAIR outputs
Workflow set-up and provenance information mgmt
Use or develop open research tools/services

Description

Data scientists must manage analyses that consist of multiple stages, large datasets and many tools, all while keeping results reproducible. Among the tools available for parallel computation, Pachyderm is an open-source workflow engine and distributed data processing tool that meets these needs by adding a data pipelining and data versioning layer on top of projects from the container ecosystem. In this workshop you will learn how to:

create a simple local Kubernetes infrastructure,
install and interact with Pachyderm and
implement a scalable and reproducible workflow using containers.
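
As a rough illustration of the last step, a Pachyderm pipeline is described by a JSON specification that names an input repository, a container image and the command to run in it. The sketch below builds such a specification in Python and prints it as JSON. The pipeline name, repository name, image tag and script path are hypothetical placeholders, not material from the workshop; the field layout follows the general shape of Pachyderm pipeline specifications and should be checked against the documentation for the version you install.

```python
import json


def make_pipeline_spec(name, input_repo, image, cmd, glob="/*", workers=1):
    """Build a Pachyderm-style pipeline specification as a plain dict.

    All argument values passed below are illustrative assumptions.
    """
    return {
        "pipeline": {"name": name},
        # The glob pattern controls how input data is split into units
        # that can be processed in parallel.
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
        # Pinning an explicit image tag helps keep runs reproducible.
        "transform": {"image": image, "cmd": cmd},
        # Number of workers to run the containerised step on.
        "parallelism_spec": {"constant": workers},
    }


spec = make_pipeline_spec(
    name="wordcount",                # hypothetical pipeline name
    input_repo="texts",              # hypothetical input repository
    image="example/wordcount:1.0",   # hypothetical container image
    # Inputs are mounted under /pfs/<repo>; results go to /pfs/out.
    cmd=["python3", "/code/count.py", "/pfs/texts", "/pfs/out"],
    workers=4,
)

spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Saved to a file, a specification of this kind would typically be submitted with `pachctl create pipeline -f <file>`; the exact fields accepted depend on the Pachyderm release.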

Keywords: Pachyderm, workflow management