README for the meta LU-decomposition program presented in "Locality Aware DAG-Scheduling for LU-Decomposition"

Contents
========
 * LICENSE : the license of this project
 * compile_meta : the compilation script
 * /src and /include : the source files


Changes
=======
 This is the first published version.


Installation
============
 To compile this project, insert the appropriate paths and variables into the compile_meta script, then execute it.

 Three paths have to be set:
  * ${compiler}  : path to a C++ compiler (we use Intel's icpc compiler)
  * ${MKL_path}  : path to the MKL library; this path should contain the folders .../include/ and .../lib/intel64/
  * ${PAPI_path} : path to the PAPI library, which is used to read performance counters (for log files)

 Running the compile script should produce the executable "meta".


Note
====
 * Our current NUMA code works under the assumption that the operating system places
   physical memory pages on the node that first accesses them. This behavior can be
   changed in matrix_desc.h by defining the macro flag FORCE_NUMA. If FORCE_NUMA is
   defined, libnuma is used for memory placement.
 * Our program outputs detailed execution logs. If the execution logs are not needed,
   they can be disabled by undefining the macro flag CREATE_LOG in context.h.
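
 The first-touch assumption from the first note can be illustrated with a small
 sketch. This is not the project's code: the function name first_touch_alloc and
 the chunking scheme are hypothetical, and thread pinning (which a real NUMA-aware
 program needs in order to control which node a thread runs on) is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

// Hypothetical helper (not from the project): allocates n doubles without
// writing to them, then lets each worker thread write its own contiguous
// chunk. Under a first-touch policy the operating system backs each chunk
// with pages on the NUMA node of the thread that writes it first.
double* first_touch_alloc(std::size_t n, unsigned num_threads) {
    // malloc itself does not write to the buffer; pages are typically
    // placed only when they are first accessed.
    double* data = static_cast<double*>(std::malloc(n * sizeof(double)));
    if (data == nullptr || num_threads == 0) return data;

    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        workers.emplace_back([data, begin, end] {
            // First write to this chunk happens on the worker thread.
            for (std::size_t i = begin; i < end; ++i) data[i] = 0.0;
        });
    }
    for (auto& w : workers) w.join();
    return data;  // the caller releases the buffer with std::free
}
```

 If the operating system does not follow a first-touch policy, a pattern like
 this gives no placement guarantee, which is the case FORCE_NUMA is meant for.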


Execution Parameters
====================
 Each line lists the parameter, its default value (in parentheses), and a description.

--threads=<T>                          (read libnuma/2)    Number of physical CPU-cores (without hyperthreading)
--nodes=<N>                            (read libnuma)      Number of nodes (NUMA-nodes)
--n_range=<nstart>:<nstop>:<nstep>     (8192:8192:1)       Matrix sizes
--niter=<I>                            (5)                 Iterations per matrix size
--nb=<NB>                              (256)               Tile size
--numa=                                (rand)              NUMA strategy
       rand                                                  ignore NUMA effects (no optimization)
       col                                                   distribute tiles in a meta-columnwise pattern
       det:<X1>:<X2>                                         distribute the matrix in diagonals (this was a test)
            X1 : T, R                                          panel tasks are executed on
                                                                 T: the node with the top panel
                                                                 R: a random node
            X2 : R, A, B, C                                    Schur-complement jobs are executed on
                                                                 R: a random node
                                                                 A: the node with the panel tile (A_ik)
                                                                 B: the node with the top tile (A_kj)
                                                                 C: the node with the changed tile (A_ij)
--[no]log or --logonce                 (logonce)           [dis/en]ables the generation of a log file
--[no]steal                            (steal)             enables/disables work stealing between different NUMA nodes 
--[a]sync                              (async)             enables/disables working while the task DAG is computed
--path=<PATH>                                              specifies where log files, if any, are written
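
 The X2 choices of the det strategy can be read as picking one tile whose home
 node runs the Schur-complement job. The sketch below only encodes that mapping
 as listed above; the names AnchorTile and schur_anchor_tile are hypothetical and
 do not come from the project, and how a tile is mapped to a node is not shown.

```cpp
// Tile position as (row, col) in the tiled matrix. "fixed" is false when the
// strategy does not tie the job to a particular tile (random node).
struct AnchorTile {
    bool fixed;
    int row;
    int col;
};

// Hypothetical helper (not from the project): for the Schur-complement update
// of tile A_ij in step k, return the tile whose node should execute the job
// under X2 = 'A', 'B' or 'C'. 'R' and unknown letters yield fixed == false.
AnchorTile schur_anchor_tile(char x2, int k, int i, int j) {
    switch (x2) {
        case 'A': return AnchorTile{true, i, k};   // panel tile A_ik
        case 'B': return AnchorTile{true, k, j};   // top tile A_kj
        case 'C': return AnchorTile{true, i, j};   // changed tile A_ij
        default:  return AnchorTile{false, -1, -1};  // random node
    }
}
```

 For example, with k = 2, i = 5, j = 7, choice A refers to tile (5, 2), choice B
 to tile (2, 7), and choice C to tile (5, 7).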

Output
======
 The format of the log file:
 * a header with general information about the iteration (number of threads, matrix size, tile size, number of jobs, execution time)
 * one line per subtask, with the following entries:
    processor_id | task_type (0=P, 1=T, 2=U, 3=X) | iteration (k) | position_x | position_y | size_x | start_time (in ns) | end_time (in ns) | L3_cache_misses | L3_cache_accesses | L3_cache_misses as a percentage of L3_cache_accesses
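
 A subtask line with the eleven entries above could be read as in the following
 sketch. The exact separator used in the log file is an assumption here, so the
 parser accepts both '|' and plain whitespace; the names SubtaskRecord,
 parse_subtask_line and duration_ns are hypothetical and not part of the project.

```cpp
#include <sstream>
#include <string>

// One subtask entry, in the field order listed in this README.
struct SubtaskRecord {
    int processor_id;
    int task_type;        // 0=P, 1=T, 2=U, 3=X
    int iteration;        // k
    int position_x;
    int position_y;
    int size_x;
    long long start_ns;
    long long end_ns;
    long long l3_misses;
    long long l3_accesses;
    double l3_miss_percent;
};

// Parses one line into a record. Returns false if fewer than eleven numeric
// fields can be read. Assumption: fields are separated by '|' or whitespace.
bool parse_subtask_line(const std::string& line, SubtaskRecord& out) {
    std::string cleaned = line;
    for (char& c : cleaned) {
        if (c == '|') c = ' ';  // treat '|' like whitespace
    }
    std::istringstream in(cleaned);
    in >> out.processor_id >> out.task_type >> out.iteration
       >> out.position_x >> out.position_y >> out.size_x
       >> out.start_ns >> out.end_ns
       >> out.l3_misses >> out.l3_accesses >> out.l3_miss_percent;
    return !in.fail();
}

// Wall-clock duration of the subtask in nanoseconds.
long long duration_ns(const SubtaskRecord& r) {
    return r.end_ns - r.start_ns;
}
```

 Grouping the parsed records by processor_id and sorting them by start_ns gives
 a per-core timeline of the run.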

