Integrated Pipelines
Data pipelines appear in almost every project. If they stand on their own, without an application around them, just use something like Airflow. But they are often embedded in other applications.
The individual steps run asynchronously on workers using Celery, and we use Django for database management. A pipeline consists of multiple steps run by different processors. Each step has requirements on the state of an object passing through the pipeline. Often it is easier to run in batch mode, processing multiple objects at once, and there are also shared resources that must be allocated and cannot be shared between two concurrent jobs, such as the GPU, where we need almost all of the VRAM. Pipelines may be defined only within a sub-graph of the object graph, for example a document, which has multiple pages, chunks, etc. There is then one pipeline per document.
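To make these requirements concrete, here is a minimal sketch of the idea in plain Python. All names (`PipelineStep`, `OcrStep`, `GPU_LOCK`, the `ready`/`run_batch` methods) are hypothetical illustrations, not part of Celery, Django, or any existing library; the actual Celery task wiring and Django models are omitted.

```python
from dataclasses import dataclass, field
from threading import Lock
from typing import Optional

# Hypothetical exclusive resource: only one job may hold the GPU at a time,
# since a single job needs almost all of the VRAM.
GPU_LOCK = Lock()

@dataclass
class Page:
    text: Optional[str] = None  # filled in by the OCR step

@dataclass
class Document:
    # the sub-graph the pipeline operates on: one pipeline per document
    pages: list[Page] = field(default_factory=list)

class OcrStep:
    """One step of the pipeline; in practice this would run as a Celery task."""

    def ready(self, doc: Document) -> bool:
        # requirement on the object's state: pages exist but have no text yet
        return bool(doc.pages) and all(p.text is None for p in doc.pages)

    def run_batch(self, docs: list[Document]) -> None:
        # batch mode: allocate the shared resource once, process many documents
        with GPU_LOCK:
            for doc in docs:
                for page in doc.pages:
                    page.text = "recognized text"  # placeholder for model output

docs = [Document(pages=[Page()]) for _ in range(3)]
step = OcrStep()
assert all(step.ready(d) for d in docs)
step.run_batch(docs)
```

The point of the sketch is the separation of concerns: the step declares its state requirements (`ready`), the runner decides batching, and exclusive resources are modeled as locks that are held for the whole batch rather than per object.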
These are serious challenges, but they are not specific to a single project, which makes them suitable for a shared approach across projects. This note outlines some ideas for a solution.
… WIP