2024-04-17: Technology discussion
Participants
- Matt Fisher
- Trey Stafford
- Matt Jones
- Rushiraj Nenuji
Discussion
Architecture and interface diagrams review
New interface diagram
View new “recipe to OGDC” interface diagram
- Matt J:
- “Looks super compatible” with what ADC is doing already.
- PDG transformation pipelines are driven by configuration files
- “Fetch data as first step” is a mistake. Transferring data as part of a pipeline run can take an extremely long time w/ large datasets.
- For smaller datasets, less of a constraint, but ADC tries to avoid the assumption that data can be fetched from the web.
- How to leverage existing work on PDG services?
- Functions for geospatial transformations are already defined in the PDG repo(s)
- PDG has a deduplication function that lets you choose between two deduplication strategies.
- PDG has service building blocks that are combined with configuration to define pipelines
- Ray and Parsl are largely the same thing; it shouldn’t be hard to switch between them
- Juliet is moving work from Slurm -> Parsl on k8s.
- Biggest challenge isn’t Ray -> Parsl, it’s dealing with the Slurm stuff.
- Matt F:
- Repo2docker-inspired approach to solving the 80% simple cases
- Matt J: Need to account for complex workflows, most of work is on 20% “long tail”
- Performance is extremely cluster-dependent. Our workflow performance is largely limited by I/O (that’s our “Achilles’ heel”)
- None of this comes into play for in-memory datasets
- Streaming?
- Haven’t thought of it at ADC
- QGreenland doesn’t have a use case
- PDG: https://github.com/PermafrostDiscoveryGateway
- GitHub self-hosted runner?
- Use it with GH Enterprise at ADC
- We would exceed the “free” runner limits
- Haven’t tried running it on k8s before, usually use VM and S3.
- Need a release to trigger data changes instead of on every merge to main. Some large products would be technically infeasible.
- Matt J: I like the pangeo-forge model for publicly-debated recipes for derived data products. Deriving our stuff on that is a good way to go.
- Matt J: Test case?
- Matt F: I think we need a broad suite of test cases, maybe 5-10, to represent the variety of data transformation cases we have.
- Matt J: How do we handle triggering processing based on changes to data, when the recipe doesn’t need to change?
- Manage this as part of a versioning scheme?
- https://huggingface.co/spaces/boettiger-lab/leafmap
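On the question above about triggering processing when the data changes but the recipe does not, one possible versioning scheme keys each output by both the recipe version and a fingerprint of the input data. The following is a minimal sketch under that assumption. The names `data_fingerprint`, `output_version`, and `needs_rebuild`, and the version string format, are hypothetical and not taken from any existing OGDC, ADC, or PDG code. Hashing full contents would likely be impractical for large datasets, where a checksum or version published by the data provider could stand in for the fingerprint.

```python
import hashlib


def data_fingerprint(chunks):
    """Return a SHA-256 hex digest computed over an iterable of byte chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def output_version(recipe_version, fingerprint):
    """Combine the recipe version with a shortened data fingerprint (hypothetical format)."""
    return f"{recipe_version}+data.{fingerprint[:12]}"


def needs_rebuild(recipe_version, fingerprint, last_built):
    """Return True when the recipe or the input data differs from the last build."""
    return output_version(recipe_version, fingerprint) != last_built


# Illustrative inputs only: two snapshots of the same source data.
old = data_fingerprint([b"tile-1", b"tile-2"])
new = data_fingerprint([b"tile-1", b"tile-2-updated"])
built = output_version("1.2.0", old)

print(needs_rebuild("1.2.0", old, built))  # same recipe, same data
print(needs_rebuild("1.2.0", new, built))  # same recipe, changed data
```

With a scheme like this, a release could bump the recipe version while a scheduled check compares fingerprints, so either kind of change produces a new output version.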
Workflow technology choice
- ADC hasn’t decided between Ray and Parsl yet
- Connect with PDG workflows team (Luigi is lead); they’re the ones doing the Ray ML evaluation and will have opinions
Why Parsl?
Currently use htop/k8s logs to keep track of success/failure. Is there a more robust monitoring solution we could use?
Juliet was not involved in the decision to use Parsl instead of Ray; should we consider it a firm technology commitment by ADC or an experiment?
- Prior evaluation (suggests use of Ray):
- No firm commitment to Ray or Parsl.
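Since neither framework is a firm commitment, one option is to keep transformation steps as plain functions and isolate the executor behind a small interface. The following is a minimal sketch under that assumption. The `Submit` type, `scale_tile`, and `run_tiles` are hypothetical names for illustration. `ThreadPoolExecutor` from the standard library is only a stand-in backend here; adapters for Parsl or Ray are left out because they would depend on how each is deployed.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Iterable, List

# Hypothetical interface: anything that schedules fn(*args) and returns a Future.
Submit = Callable[..., Future]


def scale_tile(tile: List[float], factor: float) -> List[float]:
    """Plain transformation step with no dependency on a workflow framework."""
    return [value * factor for value in tile]


def run_tiles(submit: Submit, tiles: Iterable[List[float]], factor: float) -> List[Any]:
    """Submit one task per tile, then gather the results in input order."""
    futures = [submit(scale_tile, tile, factor) for tile in tiles]
    return [future.result() for future in futures]


# Stand-in backend: a local thread pool whose submit method matches the interface.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = run_tiles(pool.submit, [[1.0, 2.0], [3.0]], 2.0)

print(results)
```

Switching backends would then mean supplying a different `submit` callable, while `scale_tile` and `run_tiles` stay unchanged.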
Action items
-
- [ ]