CompEval is a tool for evaluating workflows composed of expensive computation tasks with shared sub-expressions. CompEval uses the Git version control system to represent both the workflow (as a directory tree) and each of its particular instances. This enables easy deployment in computation clusters.
CompEval is not the simplest possible implementation of the concept, but we hope it is the simplest possible implementation of our particular formulation of the design goals:
- Version management. Be sure which version of your implementation computed the given result.
- Easy deployment. Updating and recomputing should be easy and fast (without recomputing what didn't need to be).
- Human-accessible storage. All stages of the computation shall be easily accessible for user inspection.
This mostly rules out using a script to guide the workflow execution, due to (i) the difficulty of eliminating common sub-expressions above the leaf level, and (ii) the difficulty of storing the contents of temporary variables (named or unnamed). Therefore, we opt for functional programming semantics and a very explicit representation of the expression tree.
The computation is a named executable that takes one or more inputs and transforms them to a set of outputs. The inputs and outputs are stored in files and the filenames are passed as arguments to the executable; the format of the files is arbitrary from CompEval's perspective.
Conceptually, the computation will be a computationally intensive task that may take a long time to run, but it is pure in the sense that when run on the same inputs, the computation will always produce the same outputs (up to an isomorphism / with high probability, in case of probabilistic algorithms - the point is, we can safely reuse previously obtained results).
An instance of a computation on given inputs and obtaining given output is a call (i.e., a call is a triplet (computation, inputs, outputs)).
It is named based on .
The workflow is a tree of computation expressions. Each sub-tree is also a workflow (or, if you will, a sub-workflow). A computation in each node of the expression tree obtains its inputs from either
- outputs or output slices of sub-workflows,
- global inputs or input slices of the whole task (see below), or
- literals stored in the expression tree.
A particular workflow expression tree executed with a given set of global task inputs is called a run; it performs calls on tree nodes leaf-first.
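The leaf-first evaluation with reuse of repeated calls can be sketched in a few lines. This is only an illustration of the idea, not CompEval's implementation: it assumes computations are plain Python functions returning a single value (no files, no output slices), and the names `Node` and `evaluate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    kind: str             # "computation", "input" or "literal"
    value: str            # computation name, task input name, or literal value
    children: tuple = ()  # sub-workflows supplying the computation's inputs


def evaluate(node, task_inputs, computations, cache):
    """Evaluate a workflow tree leaf-first, reusing results of identical calls."""
    if node.kind == "input":
        return task_inputs[node.value]
    if node.kind == "literal":
        return node.value
    # Evaluate all sub-workflows first (leaf-first order).
    args = tuple(evaluate(c, task_inputs, computations, cache) for c in node.children)
    # A call is identified by the computation and its inputs; since computations
    # are pure, a previously obtained result can be reused safely.
    key = (node.value, args)
    if key not in cache:
        cache[key] = computations[node.value](*args)
    return cache[key]
```

With a tree such as `add(double(x), double(x))`, the shared sub-expression `double(x)` is computed once and the second occurrence is served from the cache.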
The task represents a whole "program" we want to execute, processing a set of externally-supplied input data, obtaining the final output we desire from the system. The task therefore consists of a workflow with the whole expression tree and a set of inputs (that are globally available within the workflow tree), producing a set of outputs (corresponding to the outputs of the root node of the workflow tree).
An instance of a task on given inputs and obtaining given output is a job (i.e., a job is a triplet (task (= workflow), inputs, outputs)).
A task is a named Git repository with directory structure like
computations/...
workflow/...
inputs
The computations/ subdirectory is usually a submodule and contains a library of computation executables; alternatively, each computation could reside in a submodule. The workflow/ subdirectory tree contains the computation expression tree.
The inputs text file shall contain a line for each input passed, assigning a symbolic name to each input for referral from workflows.
A workflow subdirectory for a computation is organized like:
kind
value
inputs/nn-text/...
inputs/nn_mm-text/...
outputs
The kind file contains a single line "computation". The value file contains a single line with the name of the computation to run.
The inputs/ subdirectory contains a directory for each input (where nn is the number of the input, zero-padded to two digits and numbered from zero, and text is a free-form human-readable description); the directory contains another sub-workflow. In case multiple inputs are generated by this sub-workflow, the nn_mm-text naming convention must be used, describing the range of inputs supplied by this sub-workflow.
The outputs file contains the slice of computation outputs to pass up through the expression tree, one line per output, containing the number of the computation output. E.g. a single line 0 says that just the first computation output is to be used as the first workflow output, while two lines 01 and 00 say that the second computation output will be used as the first workflow output and the first computation output will be used as the second workflow output.
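The slicing rule can be illustrated with a short sketch. The function name `select_outputs` is hypothetical, and the sketch assumes the outputs file holds one decimal number per line, with or without zero-padding.

```python
def select_outputs(computation_outputs, outputs_file_text):
    """Pick and reorder computation outputs as listed in an outputs file."""
    # One output number per line; int() accepts both "0" and "00".
    indices = [int(line) for line in outputs_file_text.split()]
    return [computation_outputs[i] for i in indices]
```

For a computation producing two output files, the text `"0\n"` selects only the first one, while `"01\n00\n"` passes both up in swapped order.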
A workflow subdirectory representing a global input or a literal is organized like
kind
value
with the kind file containing "input" or "literal", respectively, and the value file containing either the symbolic name of the global task input or a raw file name in the global file storage (see below), respectively.
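Reading such a workflow directory back into memory could look roughly like the sketch below. It is not CompEval's own loader: `parse_input_dir` and `load_workflow` are hypothetical names, the nn_mm range is assumed to be inclusive, and the node is represented as a plain dictionary for brevity.

```python
import re
from pathlib import Path

# Matches "nn-text" and "nn_mm-text" input directory names.
INPUT_DIR = re.compile(r"(\d\d)(?:_(\d\d))?-(.*)")


def parse_input_dir(name):
    """Return (first, last, text) for an input directory name."""
    m = INPUT_DIR.fullmatch(name)
    if m is None:
        raise ValueError(f"not an input directory name: {name!r}")
    first = int(m.group(1))
    # Assumption: "nn_mm" covers inputs nn through mm inclusive.
    last = int(m.group(2)) if m.group(2) else first
    return first, last, m.group(3)


def load_workflow(path):
    """Recursively load a workflow subdirectory into a dictionary."""
    path = Path(path)
    node = {
        "kind": (path / "kind").read_text().strip(),
        "value": (path / "value").read_text().strip(),
    }
    if node["kind"] == "computation":
        node["outputs"] = [int(n) for n in (path / "outputs").read_text().split()]
        # Zero-padded names sort in input order.
        subdirs = sorted(p for p in (path / "inputs").iterdir() if p.is_dir())
        node["inputs"] = [(parse_input_dir(p.name), load_workflow(p)) for p in subdirs]
    return node
```

Nodes of kind "input" and "literal" carry only the kind and value entries, mirroring the two-file layout described above.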
The computations subdirectory of a task repository is usually a submodule and the library of available computations is shared between most or all tasks. Each computation is referred to by a name that corresponds to a directory in the computations subdirectory or repository. The computation directory has a structure like this:
inputs
outputs
exec
The inputs file contains one line per required input, each line containing a symbolic name of the input; this information is used for debugging and error reporting. The outputs file contains one line per produced output in the same format.
The exec file shall be executable and will be executed when a call is issued. It receives #inputs + #outputs parameters with the names of files to read inputs from and to write outputs to. Possibly, exec would be a wrapper over the executable itself, e.g. making sure it's checked out and built on the current host. However, this is a different layer and entirely transparent to CompEval. In case this model is used, each exec file version should be tied to a single particular version of the main executable to ensure integrity of the whole evaluation and call reuse.
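Issuing a call against this layout might be sketched as follows. The helper `issue_call` is hypothetical, and the argument order (all input file names first, then all output file names) is an assumption read off the "#inputs + #outputs" description rather than a documented guarantee.

```python
import subprocess
from pathlib import Path


def issue_call(computation_dir, input_files, output_files):
    """Run a computation's exec file on the given input and output file names."""
    comp = Path(computation_dir)
    # The inputs/outputs files list one symbolic name per line; use them
    # to check that the call supplies the expected number of files.
    n_in = len((comp / "inputs").read_text().splitlines())
    n_out = len((comp / "outputs").read_text().splitlines())
    if len(input_files) != n_in or len(output_files) != n_out:
        raise ValueError(
            f"{comp.name}: expected {n_in} inputs and {n_out} outputs, "
            f"got {len(input_files)} and {len(output_files)}"
        )
    # Assumed order: input file names first, then output file names.
    args = [str(comp / "exec"), *map(str, input_files), *map(str, output_files)]
    subprocess.run(args, check=True)
```

With `check=True`, a non-zero exit status of exec raises `subprocess.CalledProcessError`, so a failed call is not mistaken for a finished one.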
All inputs and outputs encountered in the processing of a single job are stored in a "global file storage". This is simply a directory for now, possibly on a network filesystem; its semantics might be enriched in the future to work as a distributed file system.
Each output is assigned a filename that is based on the particular computation that produced it and the inputs that were used for its production:
c_cname/ccid/nn/input0_input1_...
where c_ is literal, cname is the symbolic name of the computation, ccid is the HEAD commit id of the computations/cname subdirectory, input0 etc. are SHA1 hashes of the inputs of the computation and nn is the output number (00 for the first output, etc.). Only the first twelve digits of each hash are used in the filename.
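The naming scheme can be made concrete with a small sketch. The helper names are hypothetical; the sketch assumes the input hashes are SHA1 digests of the input file contents, and it passes the commit id through unchanged because the text does not say whether it is truncated as well.

```python
import hashlib


def content_sha1(data: bytes) -> str:
    """Full hexadecimal SHA1 digest of an input's contents."""
    return hashlib.sha1(data).hexdigest()


def call_output_path(cname, ccid, output_no, input_hashes):
    """Build c_cname/ccid/nn/input0_input1_... for one output of a call."""
    # Only the first twelve hex digits of each input hash are used.
    inputs_part = "_".join(h[:12] for h in input_hashes)
    # ccid is used as given; its truncation is left to the caller (assumption).
    return f"c_{cname}/{ccid}/{output_no:02d}/{inputs_part}"
```

Because the path depends only on the computation, its version and its inputs, two calls with identical inputs map to the same file, which is what makes reuse of earlier results possible.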
Each task input is also stored in the global file storage for future reference, assigned a filename in the format
tinputs/hash2/hash10
where tinputs is literal and, given the SHA1 hash of the contents of the task input, hash2 is its first two digits and hash10 is the next ten digits.
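As a sketch of this layout (the function name `task_input_path` is hypothetical), the storage path of a task input can be derived from its contents alone:

```python
import hashlib


def task_input_path(content: bytes) -> str:
    """Build tinputs/hash2/hash10 from the SHA1 hash of a task input's contents."""
    digest = hashlib.sha1(content).hexdigest()
    # First two hex digits name the directory, the next ten name the file.
    return f"tinputs/{digest[:2]}/{digest[2:12]}"
```

For an empty input, whose SHA1 hash starts with da39a3ee5e6b, this yields tinputs/da/39a3ee5e6b.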
The job outputs are stored in file names of the format
t_tname/tcid/nn/input0_input1_...
where t_ is literal, tname is the symbolic name of the task (determined from the name of the directory holding the Git repo), tcid is the HEAD commit id of the task repository, input0 etc. are SHA1 hashes of the global inputs of the task and nn is the output number (00 for the first output, etc.). Only the first twelve digits of each hash are used in the filename.
The CompEval tool is run on the computer where we wish to carry out the computation, at the root of the task repository. Its first argument is the path to the global file storage. The tool can be used for execution or inspection.
The basic command is ce-run [-n] STORAGE INPUTS... . This will start a job executing the current task's workflow tree, printing out the tree in a friendly form as the execution proceeds together with filenames of inputs / outputs for further manual inspection.
The ce-run -n mode is a slight modification to be used for inspection. It will also walk the current task's workflow tree, but it will only reuse already obtained results and will not start computations for missing results.
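The difference between the two modes comes down to one decision per result, which the following sketch tries to capture. It is an illustration under assumptions, not the tool's code: `obtain_result` and its `compute` callback are hypothetical, and a result is considered "already obtained" simply when its file exists in the storage directory.

```python
from pathlib import Path


def obtain_result(storage, relative_name, compute, dry_run=False):
    """Return the path of a result, computing it only when allowed and needed."""
    target = Path(storage) / relative_name
    if target.exists():
        return target  # reuse a previously obtained result
    if dry_run:
        return None  # inspection mode: report the result as missing
    target.parent.mkdir(parents=True, exist_ok=True)
    compute(target)  # run the (expensive) computation, writing to target
    return target
```

In inspection mode the callback is never invoked, so walking the tree is cheap and has no side effects on the storage.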
Another command for inspection is ce-sym that will just print out the current task's workflow tree in a symbolic form using a LISP-like functional syntax.
A more visual alternative is ce-viz which will render a graphical representation of the task's workflow, with inputs/outputs decorated with human-readable labels.