= Blip.parallel =
An overview of the parallelization strategies available in the blip library
by Fawzi Mohamed
Note that a preprint of an article about this is available at https://github.com/fawzi/blip/raw/master/paraArt.pdf
== Parallelization hierarchy ==
* smp is the first level in the parallelization hierarchy supported in
blip; it is the most basic one and the one used by all the others. Smp offers
shared memory and several execution contexts, and is confined to one machine
(there are ways to extend it further, but I don't think that it is a good
idea).
* mpi is the second level; it stands for message passing interface, and more or
less mirrors what is offered by MPI 1.2: message passing on a fixed number
N of "executors", which can be identified with a simple number from 0 to N-1.
These can be organized in various ways (higher dimensional spaces,
subsetting,...). Losing an executor is not supported. While mpi is called
message passing, and simple point to point message passing is possible, the
emphasis is on collective communication, which is normally implemented
efficiently. Mpi here is just an interface; it can be realized on top of an
MPI library, but it doesn't have to be.
* rpc stands for remote procedure call; it is implemented using some ideas
from xml rpc and rest interfaces, and has an api similar to NeXT distributed
objects (DO): you have a way to publish objects, and you can connect and call
methods using proxies that communicate with the remote objects. Remote objects
are identified through url-like strings, and several low level protocols can be
supported.
* cluster is a small set of utilities useful to handle a dynamic group of
executors communicating through rpc.
== SMP ==
Today's computers are able to execute several threads concurrently, both by having
multiple cores and by having multiple hardware threads per core (Hyperthreading and
similar).
Obviously one wants to take advantage of these computing resources.
Threads are an obvious way to try to use these resources, but in my opinion
they are normally not a very good primitive to expose to the programmer, because
they work well when their number is equal to the number of executing units, and
worse when it is very different. So the optimal choice is to use a number of
threads equal to the number of hardware threads, and subdivide the work so that
all threads are busy.
Subdividing the work into n_threads chunks is a natural choice. This is
possible, but not so easy, as this number is not constant, and the approach is
not robust with respect to other work progressing concurrently (both from your
own program and from the OS). The same task A might be executed both alone and
concurrently with a task B, so finding the correct number of threads to assign
to task A is not easy.
Assuming that task management/switching costs x, it is much better to
subdivide the problem into as many tasks as possible, while making sure that each
task on average still needs y >> x time to execute (for example y ~ 10x),
independently of the number of processors. In this case a scheduler can keep
processors busy by distributing tasks around, and if the load on one processor
changes the work can be automatically moved. This means that the unit of
computation one has to think about is not the thread, but one of these smaller
units, which we call tasks.
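As a rough, made-up illustration of this rule of thumb (a Java sketch, not blip
code), the number of tasks can be derived from the total amount of work and a
grain size chosen to be much larger than the per-task overhead, rather than from
the number of processors:

{{{
import java.util.concurrent.*;

public class GrainSizeDemo {
    public static void main(String[] args) throws Exception {
        int n = 10_000_000;              // total number of elements to process
        double[] data = new double[n];

        // Hypothetical numbers: if switching to a task costs about x "units",
        // pick a grain so that each task does y ~ 10x units of work.
        int grain = 100_000;             // elements per task, independent of core count
        int nTasks = (n + grain - 1) / grain;

        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        CountDownLatch done = new CountDownLatch(nTasks);

        for (int t = 0; t < nTasks; t++) {
            final int start = t * grain;
            final int end = Math.min(start + grain, n);
            pool.submit(() -> {
                for (int i = start; i < end; i++) {
                    data[i] = Math.sqrt(i);   // placeholder work
                }
                done.countDown();
            });
        }
        done.await();                     // all chunks processed
        pool.shutdown();
        System.out.println("tasks used: " + nTasks);
    }
}
}}}

The number of chunks stays the same whether the machine has 2 or 64 hardware
threads; only the grain size, i.e. the ratio between useful work and
per-task overhead, matters.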
Having many tasks also works well as a latency hiding mechanism: many tasks
perform operations that can stall the cpu, such as I/O (network and disk
operations) and, lately, also things like GPU computation. To avoid wasting cpu
cycles one should switch to another task as soon as one of these operations is
started.
There is one thing that speaks against creating as many tasks as possible: task
handling and management uses up resources, and some problems would create a very
large number of tasks. The solution is not to create too many tasks at once, but
to create them lazily. This must be done in concert with the evaluation strategy.
To strike a balance between these two we will consider a recursive
computation. Recursive computations form an important class of computations
(all divide and conquer approaches can be cast in this form, for example), so
having a good solution for them will greatly improve the value of our
solution. A recursive function calls itself, and generates other tasks. To
evaluate this on a single processor there is a well known efficient strategy:
eager evaluation, i.e. evaluate the subtasks before finishing the main task.
This means a depth first evaluation of the tasks. Thus we have our evaluation
strategy for recursive tasks: evaluate subtasks first.
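The same strategy can be expressed with Java's fork/join framework as a
conceptual analogue (this is not blip's task API): a recursive task splits its
range, makes one half stealable and immediately computes the other half itself,
so subtasks are evaluated eagerly and depth first.

{{{
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Recursive parallel sum: each task splits its range until it is small
// enough, then sums it directly. Subtasks are evaluated before the
// parent completes (depth-first, eager evaluation).
class SumTask extends RecursiveTask<Long> {
    static final int CUTOFF = 10_000;   // below this, just compute serially
    final long[] a;
    final int lo, hi;

    SumTask(long[] a, int lo, int hi) { this.a = a; this.lo = lo; this.hi = hi; }

    @Override
    protected Long compute() {
        if (hi - lo <= CUTOFF) {
            long s = 0;
            for (int i = lo; i < hi; i++) s += a[i];
            return s;
        }
        int mid = (lo + hi) >>> 1;
        SumTask left = new SumTask(a, lo, mid);
        SumTask right = new SumTask(a, mid, hi);
        left.fork();                     // make the left half stealable
        long r = right.compute();        // work on the right half ourselves (child first)
        long l = left.join();            // then wait for (or run) the left half
        return l + r;
    }
}

public class RecursiveSum {
    public static void main(String[] args) {
        long[] a = new long[1_000_000];
        for (int i = 0; i < a.length; i++) a[i] = i;
        long sum = new ForkJoinPool().invoke(new SumTask(a, 0, a.length));
        System.out.println(sum);          // 499999500000
    }
}
}}}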
In general we will not get rid of the task switching overhead (this is safely
possible only in a few instances), but we will avoid creating too many tasks at
the same time. The recursive approach is not always correct: independent tasks
should be scheduled on an equal footing, independently of how "deep" each one is.
With this we have the core of the idea behind blip.parallel.smp: efficiently
evaluate several recursive tasks.
This also indirectly explains why threads are not the ideal parallelization
primitive: disregarding technical problems (like wired memory,...), threads have
an implicit "fairness" in them, meaning that all threads have more or less
the same chance of executing (depending on their priority) and should not be
ignored for long times.
This is correct for external tasks, i.e. between different recursive tasks, but
within a recursive task it forces tasks high up in the hierarchy to be executed,
and gives a breadth first execution that will allocate too many tasks.
The execution characteristics within a recursive task and between different tasks
are very different, and it is the programmer's job to choose between them.
Recursive tasks *might* be parallel and should be executed in parallel
if free hardware resources are available, whereas independent tasks *should*
be executed in parallel with some kind of fairness.
Recursive tasks cope well with data parallel workloads, thread-like parallelization
copes well with webserver-like load.
Making some automatic scheduler choose between them is not going to work well.
Now let's look at the ugly details a little bit more.
In general it is better to always try to keep all computation resources busy
(there are some exceptions, like Intel processors with memory intensive tasks
back when they were still memory starved), and to make things more difficult there
is a memory hierarchy: different processors share different levels of the memory
hierarchy, and thus moving things between them has different costs.
To simplify things I will assume that it is never worthwhile to keep a processor
idle. I will take the view of the computational scientist: one who knows that
the problem is big enough to use the whole machine, and is interested in maximum
throughput. Deciding to use only part of the available resources, or putting an
upper bound on the latency of the tasks, is a refinement that can be done later.
Blip.parallel.smp is mainly based on schedulers (queues) fixed at the
processor/thread level that, if idle, try to steal work from the neighboring
schedulers (in the order given by the level of memory hierarchy sharing).
Ideally, when stealing, one should try to equalize the amount of work between the
schedulers. As the execution time of a task is unknown this is impossible to do
exactly, but in general you should try to steal large tasks, which for recursive
tasks means root tasks (those that will spawn other tasks): steal root tasks,
execute child tasks first. The actual stealing amount is randomized a bit, to
increase the chance of a good work distribution.
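The stealing policy can be pictured with a deliberately naive, lock-based
per-worker queue (a sketch of the idea in Java, not blip's implementation): the
owner pushes and pops at one end, so it always runs the most recently created
child task, while an idle thief takes from the other end, where the oldest,
root-like tasks sit.

{{{
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified per-worker task queue illustrating the stealing policy:
// the owner works depth first (newest task = deepest child), while a
// thief steals the oldest task, which is usually a large "root" task
// that will spawn further work. Synchronization is deliberately naive.
class WorkerQueue {
    private final Deque<Runnable> deque = new ArrayDeque<>();

    synchronized void push(Runnable task) {        // owner: new child task
        deque.addLast(task);
    }
    synchronized Runnable popLocal() {             // owner: newest task first
        return deque.pollLast();
    }
    synchronized Runnable steal() {                // thief: oldest (root) task
        return deque.pollFirst();
    }
}
}}}

An idle scheduler would first call popLocal on its own queue and, only when that
is empty, call steal on a neighboring queue, preferring neighbors that share
more of the memory hierarchy.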
Using this approach one has a simple conceptual model to describe concurrent
programs.
One has tasks (see blip.parallel.smp.BasicTasks) that execute a given function and:
- can be created and spawned
- are always spawned in the context of a super task
- are finished only when all their subtasks are finished
It is possible to:
- add operations to execute when a task finishes (see the sketch below)
- wait for the completion of some task
- delay the executing task and then resubmit it later (useful when the task
executes one of those operations that stall the cpu, see Task.delay and
Task.resubmitDelayed)
- ask the current task for a cache that is common to the current numa
node (this is useful for pools, and delete lists in general, see cachedPool,
cachedPoolNext in blip.container.Cache).
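The first two operations, attaching work to a task's completion and waiting for
a task, have close analogues in other libraries; here is a small Java sketch
using CompletableFuture (not blip's API; delay/resubmit and the numa cache have
no direct counterpart here):

{{{
import java.util.concurrent.CompletableFuture;

public class TaskOpsSketch {
    public static void main(String[] args) {
        // Spawn a task...
        CompletableFuture<Integer> task =
                CompletableFuture.supplyAsync(() -> 6 * 7);

        // ...add an operation to execute when it finishes...
        CompletableFuture<Void> withCallback =
                task.thenAccept(v -> System.out.println("task finished with " + v));

        // ...and wait for the completion of the task and of the attached operation.
        withCallback.join();
        System.out.println("result = " + task.join());
    }
}
}}}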
Some languages have special syntax for some of these operations, but
normally a program does not use them often enough to make special syntax
really worthwhile; the interface should be clean, but method calls work well enough.
Using these methods, helpers for the common case of parallel loops on arrays are
available in blip.parallel.smp.PLoopHelpers.
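As an analogue of what such loop helpers provide (again a Java sketch, not the
PLoopHelpers interface itself), a data parallel loop over an array can look like
this:

{{{
import java.util.Arrays;

public class ParallelLoopSketch {
    public static void main(String[] args) {
        double[] data = new double[1_000_000];

        // Fill the array in parallel; the runtime splits the index range
        // into tasks and distributes them over the available workers.
        Arrays.parallelSetAll(data, i -> Math.sin(0.001 * i));

        System.out.println(data[42]);
    }
}
}}}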
Another concept useful for building parallel programs is the dataflow variable.
Dataflow variables are variables that can be set several times, but always to the
same value. Reading a dataflow variable blocks until a value is set. The nice
thing about dataflow variables is that they do not introduce race conditions: a
program using them either always blocks or never does.
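A minimal sketch of such a variable, following the semantics described above
(a hypothetical Java DataflowVar class, not blip's implementation), could be:

{{{
// Minimal dataflow variable sketch: it can be set many times, but only
// to the same value; readers block until some value has been set.
class DataflowVar<T> {
    private T value;
    private boolean bound = false;

    public synchronized void set(T v) {
        if (bound) {
            if (!value.equals(v))
                throw new IllegalStateException("dataflow variable set to a different value");
            return;                       // setting the same value again is a no-op
        }
        value = v;
        bound = true;
        notifyAll();                      // wake up blocked readers
    }

    public synchronized T get() throws InterruptedException {
        while (!bound) wait();            // block until a value is available
        return value;
    }
}

public class DataflowDemo {
    public static void main(String[] args) throws Exception {
        DataflowVar<Integer> x = new DataflowVar<>();
        Thread producer = new Thread(() -> x.set(42));
        producer.start();
        System.out.println(x.get());      // blocks until the producer sets x
        x.set(42);                        // setting the same value again is fine
    }
}
}}}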
Using this you have a solution that works well most of the time. Most of the
time is not always: sometimes you need to ensure that a task stays on one
processor, sometimes you might want to start with a specific task distribution
that you calculated using information about the numa topology. The advantage of
realizing everything as a library and with normal objects is that this is easily
possible.
With this you should understand the main ideas behind parallelization in Blip.
What is left are the "boring" optimizations and implementation details that
really make a difference when using the library, but that should not be needed to
simply understand it.
Go to BlipOverview to see some examples.