Thanks a lot, Jörgen, for the elaborate response - I appreciate it!
Very good feedback points. Let me respond to some of them:
Wouldn't it be even more accurate to describe it as a provenance DAG?
Ah, yes, very good point :) I started thinking about that the other day. My current thinking is that the log could still be allowed to contain duplicates, and that the tool that converts the log into a bash script could deduplicate tasks that result in the same bash command. I will have to think more about this, though.
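To illustrate what I mean, here is a minimal sketch (purely hypothetical types and names, not the actual SciPipe log format): the log-to-script converter could key each task entry on the shell command it rendered and keep only the first occurrence of each:

```go
package main

import "fmt"

// AuditEntry is a hypothetical, simplified stand-in for one task entry
// in the provenance log (not the real SciPipe log format).
type AuditEntry struct {
	TaskName string
	Command  string // the shell command this task invocation executed
}

// dedupByCommand keeps the first entry for each distinct shell command,
// so the generated bash script would run each command only once.
func dedupByCommand(entries []AuditEntry) []AuditEntry {
	seen := map[string]bool{}
	var out []AuditEntry
	for _, e := range entries {
		if !seen[e.Command] {
			seen[e.Command] = true
			out = append(out, e)
		}
	}
	return out
}

func main() {
	log := []AuditEntry{
		{"align", "bwa mem ref.fa reads.fq > out.sam"},
		{"align", "bwa mem ref.fa reads.fq > out.sam"}, // duplicate invocation
		{"varcall", "bcftools call calls.bcf > calls.vcf"},
	}
	for _, e := range dedupByCommand(log) {
		fmt.Println(e.Command)
	}
}
```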
For each task invocation we logged the workflow id, the task type (align, mpileup, varcall, annotate, etc.), a timestamp when the task was started, the time duration of execution, the names and sizes of input and output data files and a few more things.
Great ideas! Will consider adding some of them.
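Just to make it concrete for myself, a rough sketch of what such a per-task record could look like in Go (the field names and JSON layout below are just my assumptions for illustration, not anything implemented in SciPipe):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// DataFile and TaskTrace are hypothetical types sketching the kind of
// per-task provenance record discussed above (not actual SciPipe types).
type DataFile struct {
	Path      string `json:"path"`
	SizeBytes int64  `json:"size_bytes"`
}

type TaskTrace struct {
	WorkflowID string        `json:"workflow_id"`
	TaskType   string        `json:"task_type"` // align, mpileup, varcall, annotate, ...
	StartedAt  time.Time     `json:"started_at"`
	Duration   time.Duration `json:"duration_ns"` // marshals as nanoseconds
	Inputs     []DataFile    `json:"inputs"`
	Outputs    []DataFile    `json:"outputs"`
}

func main() {
	rec := TaskTrace{
		WorkflowID: "wf-0001",
		TaskType:   "align",
		StartedAt:  time.Now(),
		Duration:   42 * time.Second,
		Inputs:     []DataFile{{Path: "reads.fq", SizeBytes: 123456}},
		Outputs:    []DataFile{{Path: "out.sam", SizeBytes: 654321}},
	}
	b, _ := json.MarshalIndent(rec, "", "  ")
	fmt.Println(string(b))
}
```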
Another aspect is that a distributed execution environment produces slightly different provenance information than a local one
Yes, this is a problem. For us, since we run tasks via Slurm's blocking salloc command, queuing time will be included in the execution time recorded by SciPipe, so this is something we'll have to solve.
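One idea (just a sketch under assumptions - the salloc/srun options, file names and helper below are illustrative, not what SciPipe actually does) would be to wrap the inner command in /usr/bin/time inside the allocation, so that the timing file records only wall-clock time on the compute node, even though the outer salloc call still blocks through the queue:

```go
package main

import (
	"fmt"
	"os/exec"
)

// runViaSlurm is a hypothetical helper: it wraps the actual command in
// /usr/bin/time inside the Slurm allocation, so "timing.txt" records only
// the elapsed time on the compute node, not the time spent queuing.
func runViaSlurm(account, timeLimit, command string) error {
	inner := fmt.Sprintf("/usr/bin/time -f %%e -o timing.txt sh -c %q", command)
	full := fmt.Sprintf("salloc -A %s -t %s srun %s", account, timeLimit, inner)
	return exec.Command("bash", "-c", full).Run()
}

func main() {
	err := runViaSlurm("myproject", "1:00:00", "bwa mem ref.fa reads.fq > out.sam")
	if err != nil {
		fmt.Println("error:", err)
	}
}
```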
Thanks for the report about the time command. I remember (from previous work with SciLuigi) that I sometimes made sure to use /usr/bin/time instead of the bash built-in, which helped in some cases, but perhaps we need to ship our own time binary with the workflow, so that we can be sure to get the same behaviour everywhere.
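A small sketch of what I mean by making sure the external binary is used (the fallback path and the example command are assumptions for illustration):

```go
package main

import (
	"fmt"
	"os/exec"
)

// timeBinary resolves an external `time` executable on PATH, falling back
// to /usr/bin/time. exec.LookPath only finds executable files, so this can
// never silently resolve to the bash built-in.
func timeBinary() string {
	if p, err := exec.LookPath("time"); err == nil {
		return p
	}
	return "/usr/bin/time"
}

func main() {
	cmd := "bwa mem ref.fa reads.fq > out.sam" // hypothetical example command
	fmt.Printf("%s -o timing.txt sh -c %q\n", timeBinary(), cmd)
}
```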
... the Open Provenance Model [1] might be worth taking a look at ... ... Another possibility might be the PROV standard [2] ... ... The Taverna community has taken the topic very serious and it's definitely worth looking at how they approached provenance and how they made decisions over time [3]
Thanks for the pointers, will definitely study up a bit on those.
To me there seems to be a bit of a divide between extremely comprehensive workflow systems (Taverna, Pegasus and the like) and more practically oriented ones (Bpipe, Cuneiform, Nextflow, Snakemake, ...).
I'm aware there is a lot of theory developed for provenance in the more comprehensive camp (and I've read quite a bit of it), but I also feel that overly complex provenance models lose focus on being practical enough to implement in a way that relieves the workflow developer of things to think about, rather than piling even more things onto them.
I mean, it is great and important background theory to go back and review, but I also think that work is needed to convert this theory into practical implementations.
It also seems to me that there is a divide between two approaches to reproducibility:
1. Describe all tools, data and parameters with declarative and semantically unambiguous metadata, using ontologies, etc.
2. Simply provide the means to exactly replicate a study and inspect all the code and data used to run it, by including all the source code, tools and data used.
I feel that while approach 1 might be theoretically more "correct", it might also take more work than any hustling scientist is willing to put in, and it might get disconnected from the actual executable code in the same way that code comments can - meaning that it becomes less trustworthy.
Thus, while approach 2 seems less rigorous, I feel it will often be more practical, since it a) requires less extra work for developers and scientists, and b) will always tell the full truth of what was actually run, a.k.a. "the code is the documentation".
In fact, a provenance trace is itself a valid workflow that can be executed by Hi-WAY.
That is very neat indeed! :)