--- trunk/uoms-doc/doc.tex 2010/05/12 11:23:57 11 +++ trunk/uoms-doc/doc.tex 2010/11/29 18:07:07 14 @@ -32,6 +32,7 @@ \item \texttt{doc/manual.pdf}: This file. User's manual. \item \texttt{COPYING and COPYING.LESSER}: Files containing the use and redistribution terms (license). \item \texttt{changelog.txt}: File with changes in each release. + \item \texttt{Makefile}: Makefile to build the benchmarking suite. It relies on the \texttt{src/Makefile} file. \item \texttt{src/affinity.upc}: UPC code with affinity-related tests. \item \texttt{src/config/make.def.template.*}: Makefile templates for HP UPC and Berkeley UPC. \item \texttt{src/config/parameters.h}: Header with some customizable parameters. @@ -54,6 +55,12 @@ \section{Operations tested} \begin{itemize} +\item \texttt{upc\_forall} (read elements of a shared array) +\item \texttt{upc\_forall} (write elements of a shared array) +\item \texttt{upc\_forall} (read+write elements of a shared array) +\item \texttt{for} (read elements of a shared array) +\item \texttt{for} (write elements of a shared array) +\item \texttt{for} (read+write elements of a shared array) \item \texttt{upc\_barrier} \item \texttt{upc\_all\_broadcast} \item \texttt{upc\_all\_scatter} @@ -107,6 +114,8 @@ \item \texttt{upc\_free} \end{itemize} +The \texttt{upc\_forall} and \texttt{for} benchmarks test the performance of accesses to a shared \texttt{int} array in read, write and read+write operations. The \texttt{upc\_forall} benchmark distributes the whole workload across threads, whereas in the \texttt{for} benchmark all the work is performed by thread 0. This is useful for testing the speed of remote accesses and optimization techniques such as coalescing. The operation performed in read is a sum of a variable on the stack and the current element in the array, to prevent the compiler from dropping the first $N-1$ iterations. The operation performed in write is simply an update of each element with its position in the array.
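A minimal sketch of the shape these access benchmarks could take (this is illustrative UPC, not the suite's actual source; the names \texttt{arr}, \texttt{dummy} and the size \texttt{N} are assumptions):

```c
/* Illustrative UPC sketch -- not the suite's actual code.
 * arr, dummy and N are made-up names. */
#include <upc.h>

#define N 1024

shared int arr[N * THREADS];   /* shared array accessed by the benchmarks */

void access_benchmarks(void)
{
    int dummy = 0;             /* stack variable: summing into it keeps
                                  the compiler from dropping the reads */

    /* upc_forall read: iterations are distributed so that each
       thread accesses the elements with affinity to it */
    upc_forall (int i = 0; i < N * THREADS; i++; &arr[i])
        dummy += arr[i];

    /* upc_forall write: each element is updated with its index */
    upc_forall (int i = 0; i < N * THREADS; i++; &arr[i])
        arr[i] = i;

    /* for read+write on thread 0 only: most accesses are remote,
       which exercises remote-access speed and coalescing */
    if (MYTHREAD == 0)
        for (int i = 0; i < N * THREADS; i++)
            arr[i] += i;
}
```

The affinity expression \texttt{\&arr[i]} makes each \texttt{upc\_forall} iteration run on the thread owning that element, while the plain \texttt{for} loop on thread 0 reaches into every other thread's portion of the array.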
The operation performed in read+write is a sum of the current element and its position in the array. + In bulk memory transfer operations there are two modes: remote and local. Remote mode will copy data from one thread to another, whereas local mode will copy data between two memory regions with affinity to the same thread. @@ -166,6 +175,12 @@ \item \texttt{-input FILE}: Read a user-defined list of benchmarks to run from \texttt{FILE}. Valid benchmark names are: \begin{itemize} +\item \texttt{upc\_forall\_read} +\item \texttt{upc\_forall\_write} +\item \texttt{upc\_forall\_readwrite} +\item \texttt{for\_read} +\item \texttt{for\_write} +\item \texttt{for\_readwrite} \item \texttt{upc\_barrier} \item \texttt{upc\_all\_broadcast} \item \texttt{upc\_all\_scatter} @@ -231,25 +246,30 @@ To compile the suite you have to set up a correct \texttt{src/config/make.def} file. Templates are provided for this purpose. The needed parameters are: \begin{itemize} -\item \texttt{CC}: Defines the C compiler used to compile the C code. Please note this does not involve the resulting C code generated from the UPC code if your UPC compiler is a source to source compiler. -\item \texttt{CFLAGS}: Defines the C flags used to compile the C code. Please note this does not involve the resulting C code generated from the UPC code if your UPC compiler is a source to source compiler +\item \texttt{CC}: Defines the C compiler used to compile the C code. Please note this has nothing to do with the resulting C code generated from the UPC code if your UPC compiler is a source-to-source compiler. +\item \texttt{CFLAGS}: Defines the C flags used to compile the C code. Please note this has nothing to do with the resulting C code generated from the UPC code if your UPC compiler is a source-to-source compiler. \item \texttt{UPCC}: Defines the UPC compiler used to compile the suite -\item \texttt{UPCFLAGS}: Defines the UPC compiler flags used to compile the suite.
Please note you should not specify any number of threads flag at this point +\item \texttt{UPCFLAGS}: Defines the UPC compiler flags used to compile the suite. Please note you should not specify the number of threads flag at this point. \item \texttt{UPCLINK}: Defines the UPC linker used to link the suite \item \texttt{UPCLINKFLAGS}: Defines the UPC linker flags used to link the suite -\item \texttt{THREADS\_SWITCH}: Defines the correct switch to set the desired number of threads. It is compiler dependent, and also includes any blank space after the switch +\item \texttt{THREADS\_SWITCH}: Defines the correct switch to set the desired number of threads. It is compiler dependent, and also includes any blank space after the switch \end{itemize} -Once you have set up your \texttt{make.def} file you can compile the suite as -following: +Once you have set up your \texttt{make.def} file you can compile the suite. \\ + +For a static thread setup type: \texttt{make NTHREADS=NUMBER\_OF\_UPC\_THREADS} E.g., for 128 threads: \texttt{make NTHREADS=128} \\ +For a dynamic thread setup just type: +\texttt{make} @@ -304,7 +324,7 @@ \normalsize -The header indicates the benchmarked function and the number of processes involved. The first column shows the size used for each particular row. It is the size of the data at the root thread, or in any thread in a non-rooted operation. The second column is the number of repetitions performed for that particular message size. The following three columns are, respectively, the minimum, maximum and average latencies. The last column shows the aggregated bandwidth calculated using the maximum latencies. Therefore, the bandwidth reported is the minimum bandwidth achieved in all the repetitions. +The header indicates the benchmarked function and the number of processes involved. The first column shows the block size used for each particular row. The second column is the number of repetitions performed for that particular message size.
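As a concrete illustration of the \texttt{make.def} parameters described above, a configuration for Berkeley UPC might look roughly like this (all values are assumptions for illustration; the templates shipped in \texttt{src/config} are the authoritative starting point):

```makefile
# Hypothetical make.def sketch for Berkeley UPC -- values are examples only,
# adapt them to the local installation.
CC           = gcc           # C compiler for the plain C code
CFLAGS       = -O3
UPCC         = upcc          # UPC compiler
UPCFLAGS     = -O            # no thread-count flag here
UPCLINK      = upcc          # UPC linker
UPCLINKFLAGS =
# Compiler-dependent thread switch; note the value keeps the
# blank space after the switch (invisible trailing space below).
THREADS_SWITCH = -T 
```

With a file like this, \texttt{make NTHREADS=128} would expand the thread switch to something like \texttt{-T 128}, while a plain \texttt{make} builds for a dynamic thread setup.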
The following three columns are, respectively, the minimum, maximum and average latencies. The last column shows the aggregated bandwidth calculated using the maximum latencies. Therefore, the bandwidth reported is the minimum bandwidth achieved in all the repetitions. Moreover, when 2 threads are used, affinity tests are performed. This way you can measure the effects of data locality in NUMA systems, if the 2 threads run on the same machine. This feature may be useful even when the 2 threads run on different machines, e.g., machines with non-uniform access to the network interface, like quad-socket Opteron/Nehalem-based machines, or cell-based machines like HP Integrity servers. The output of these tests is preceded by something like: