[uoms] Diff of /trunk/uoms-doc/doc.tex

-revision 1, Mon Apr  5 17:12:14 2010 UTC
+revision 15, Tue Nov 30 13:05:36 2010 UTC
 Line 2
  You can contact us at:\\
- Dr Guillermo Lopez Taboada
- Computer Architecture Group (CAG)
- University of A Coruña, Spain
- taboada@udc.es\\
  Galicia Supercomputing Center (CESGA)
+ \url{http://www.cesga.es}
  Santiago de Compostela, Spain
- upc@cesga.es
+ upc@cesga.es\\\\
+ PhD. Guillermo Lopez Taboada
+ Computer Architecture Group (CAG)
+ \url{http://gac.des.udc.es/index_en.html}
+ University of A Coruña, Spain
+ taboada@udc.es\\
+ \section{Acknowledgments}
+ This work was funded by Hewlett-Packard Spain and partially supported by the Ministry of Science and Innovation of Spain under Project TIN2007-67537-C03-02 and by the Galician Government (Xunta de Galicia, Spain) under the Consolidation Program of Competitive Research Groups (Ref. 3/2006 DOGA 12/13/2006). We gratefully thank Brian Wibecan for his comments and for sharing his thoughts and knowledge with us. We also thank Jim Bovay for his support, and CESGA for providing access to the FinisTerrae supercomputer.
  \section{Files in this benchmarking suite}
-Line 30
+Line 32
   \item \texttt{doc/manual.pdf}: This file. User's manual.
   \item \texttt{COPYING and COPYING.LESSER}: Files containing the use and redistribution terms (license).
   \item \texttt{changelog.txt}: File with changes in each release.
+  \item \texttt{Makefile}: Makefile to build the benchmarking suite. It relies on the src/Makefile file.
   \item \texttt{src/affinity.upc}: UPC code with affinity-related tests.
   \item \texttt{src/config/make.def.template.*}: Makefile templates for HP UPC and Berkeley UPC.
   \item \texttt{src/config/parameters.h}: Header with some customizable parameters.
-Line 52
+Line 55
  \section{Operations tested}
  \begin{itemize}
+ \item \texttt{upc\_forall} (read elements of a shared array)
+ \item \texttt{upc\_forall} (write elements of a shared array)
+ \item \texttt{upc\_forall} (read+write elements of a shared array)
+ \item \texttt{for} (read elements of a shared array)
+ \item \texttt{for} (write elements of a shared array)
+ \item \texttt{for} (read+write elements of a shared array)
  \item \texttt{upc\_barrier}
  \item \texttt{upc\_all\_broadcast}
  \item \texttt{upc\_all\_scatter}
-Line 89
+Line 98
  \item \texttt{upc\_memput} (local)
  \item \texttt{memcpy} (local)
  \item \texttt{memmove} (local)
+ \item \texttt{upc\_memcpy\_async} (remote)
+ \item \texttt{upc\_memget\_async} (remote)
+ \item \texttt{upc\_memput\_async} (remote)
+ \item \texttt{upc\_memcpy\_async} (local)
+ \item \texttt{upc\_memget\_async} (local)
+ \item \texttt{upc\_memput\_async} (local)
  \item \texttt{upc\_memcpy\_asynci} (remote)
  \item \texttt{upc\_memget\_asynci} (remote)
  \item \texttt{upc\_memput\_asynci} (remote)
-Line 99
+Line 114
  \item \texttt{upc\_free}
  \end{itemize}
+ The \texttt{upc\_forall} and \texttt{for} benchmarks measure the performance of accesses to a shared \texttt{int} array in read, write and read+write operations. The \texttt{upc\_forall} benchmark distributes the whole workload across threads, whereas in the \texttt{for} benchmark all the work is performed by thread 0. This is useful for testing the speed of remote accesses and optimization techniques such as coalescing. The read operation adds the current array element to a variable on the stack, which prevents the compiler from dropping the first $N-1$ iterations. The write operation simply updates each element with its position in the array. The read+write operation sums the current element and its position in the array.
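The three access patterns described above can be sketched in plain C. This is only an illustration on an ordinary (non-shared) array, not the suite's UPC code: the function names are hypothetical, and storing the read+write sum back into the element is an assumption.

```c
#include <stddef.h>

/* Hypothetical helper names; plain C arrays stand in for the shared UPC array. */

/* Read: accumulate every element into a local variable, so that each
   iteration contributes to an observable result. */
long read_kernel(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Write: set each element to its position in the array. */
void write_kernel(int *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = (int)i;
}

/* Read+write: add each element's position to its current value
   (assumption: the sum is stored back into the element). */
void readwrite_kernel(int *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] += (int)i;
}
```

Returning the accumulated sum from the read kernel is what keeps a compiler from discarding the loop body.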
  In bulk memory transfer operations there are two modes: remote and local. Remote mode copies data from one thread to another, whereas local mode copies data from one thread to another memory region with affinity to the same thread.
-Line 111
+Line 128
  \begin{itemize}
  \item \texttt{NUMCORES}: If defined it will override the detection of the number of cores. If not defined the number of cores is set through the \texttt{sysconf(\_SC\_NPROCESSORS\_ONLN)} system call.
  \item \texttt{ASYNC\_MEM\_TEST}: If defined asynchronous memory transfer tests will be built. Default is defined.
+ \item \texttt{ASYNCI\_MEM\_TEST}: If defined, the asynchronous memory transfer tests with implicit handlers will be built. Default is defined.
  \item \texttt{MINSIZE}: The minimum message size to be used in the benchmarking. Default is 4 bytes.
  \item \texttt{MAXSIZE}: The maximum message size to be used in the benchmarking. Default is 16 megabytes.
  \end{itemize}
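As a rough illustration of the \texttt{NUMCORES} behaviour described above, the following C sketch lets a compile-time definition take precedence over run-time detection. The function name is hypothetical and this is not the suite's actual source.

```c
#include <unistd.h>

/* Hypothetical helper: returns the number of cores to assume.
   If NUMCORES is defined at build time (e.g. with -DNUMCORES=8),
   it overrides detection; otherwise the online processor count
   reported by sysconf() is used. */
long detect_num_cores(void) {
#ifdef NUMCORES
    return NUMCORES;
#else
    return sysconf(_SC_NPROCESSORS_ONLN);
#endif
}
```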
-Line 121
+Line 139
  \begin{itemize}
  \item \texttt{-help}: Prints usage information and exits.
  \item \texttt{-version}: Prints the UOMS version and exits.
- \item \texttt{-off\_cache}: Enable cache invalidation. Be aware that the cache invalidation greatly increases the memory consumption. Also, note that for block sizes smaller than the cache line size it will not work.
+ \item \texttt{-off\_cache}: Enable cache invalidation. Be aware that the cache invalidation greatly increases the memory consumption. Also, note that for block sizes smaller than the cache line size it will not have any effect.
  \item \texttt{-warmup}: Enable a warmup iteration.
- \item \texttt{-reduce\_op OP}: Choose the reduce operation to be performed by \texttt{upc\_all\_reduceD} and \texttt{upc\_all\_prefix\_reduceD}. Valid operations are:
+ \item \texttt{-reduce\_op OP}: Choose the reduce operation to be performed by \texttt{upc\_all\_reduceD} and \texttt{upc\_all} \texttt{\_prefix\_reduceD}. Valid operations are:
  \begin{itemize}
  \item \texttt{UPC\_ADD (default)}
  \item \texttt{UPC\_MULT}
-Line 155
+Line 173
  \item \texttt{-maxsize SIZE}: Specifies the maximum block size (in bytes)
+ \item \texttt{-time SECONDS}: Specifies the maximum run time in seconds for each block size. Disabled by default. Important: this setting will not interrupt an ongoing operation
  \item \texttt{-input FILE}: Read user defined list of benchmarks to run from \texttt{FILE}. Valid benchmark names are:
  \begin{itemize}
+ \item \texttt{upc\_forall\_read}
+ \item \texttt{upc\_forall\_write}
+ \item \texttt{upc\_forall\_readwrite}
+ \item \texttt{for\_read}
+ \item \texttt{for\_write}
+ \item \texttt{for\_readwrite}
  \item \texttt{upc\_barrier}
  \item \texttt{upc\_all\_broadcast}
  \item \texttt{upc\_all\_scatter}
-Line 196
+Line 222
  \item \texttt{upc\_all\_prefix\_reduceD}
  \item \texttt{upc\_all\_reduceLD}
  \item \texttt{upc\_all\_prefix\_reduceLD}
+ \item \texttt{upc\_memget\_async}
+ \item \texttt{upc\_memput\_async}
+ \item \texttt{upc\_memcpy\_async}
+ \item \texttt{local\_upc\_memget\_async}
+ \item \texttt{local\_upc\_memput\_async}
+ \item \texttt{local\_upc\_memcpy\_async}
  \item \texttt{upc\_memget\_asynci}
  \item \texttt{upc\_memput\_asynci}
  \item \texttt{upc\_memcpy\_asynci}
-Line 216
+Line 248
  To compile the suite you have to set up a correct \texttt{src/config/make.def} file. Templates are provided for this purpose. The needed parameters are:
  \begin{itemize}
- \item \texttt{CC}: Defines the C compiler used to compile the C code. Please note this does not involve the resulting C code generated from the UPC code if your UPC compiler is a source to source compiler.
+ \item \texttt{CC}: Defines the C compiler used to compile the C code. Please note this has nothing to do with the resulting C code generated from the UPC code if your UPC compiler is a source-to-source compiler.
- \item \texttt{CFLAGS}: Defines the C flags used to compile the C code. Please note this does not involve the resulting C code generated from the UPC code if your UPC compiler is a source to source compiler
+ \item \texttt{CFLAGS}: Defines the C flags used to compile the C code. Please note this has nothing to do with the resulting C code generated from the UPC code if your UPC compiler is a source-to-source compiler
  \item \texttt{UPCC}: Defines the UPC compiler used to compile the suite
- \item \texttt{UPCFLAGS}: Defines the UPC compiler flags used to compile the suite. Please note you should not specify any number of threads flag at this point
+ \item \texttt{UPCFLAGS}: Defines the UPC compiler flags used to compile the suite. Please note you should not specify the number of threads flag at this point
  \item \texttt{UPCLINK}: Defines the UPC linker used to link the suite
  \item \texttt{UPCLINKFLAGS}: Defines the UPC linker flags used to link the suite
- \item \texttt{THREADS\_SWITCH}: Defines the correct switch to set the desired number of threads. It is compiler dependent, and also includes any blank space after the switch
+ \item \texttt{THREADS\_SWITCH}: Defines the correct switch to set the desired number of threads. It is compiler dependent, and also includes any blank space after the switch
  \end{itemize}
- Once you have set up your \texttt{make.def} file you can compile the suite as
+ Once you have set up your \texttt{make.def} file you can compile the suite.
- following:
+ \\
+ For a static thread setup type:
  \texttt{make NTHREADS=NUMBER\_OF\_UPC\_THREADS}
  E.g., for 128 threads:
  \texttt{make NTHREADS=128}
+ \\
+ For a dynamic thread setup just type:
+ \texttt{make}
-Line 289
+Line 326
  \normalsize
- The header indicates the benchmarked function and the number of processes involved. The first column shows the size used for each particular row. It is the size of the data at the root thread, or in any thread in a non-rooted operation. The second column is the number of repetitions performed for that particular message size. The following three columns are, respectively, the minimum, maximum and average latencies. The last column shows the aggregated bandwidth calculated using the maximum latencies. Therefore, the bandwidth reported is the minimum bandwidth achieved in all the repetitions.
+ The header indicates the benchmarked function and the number of processes involved. The first column shows the block size used for each particular row. The second column is the number of repetitions performed for that particular message size. The following three columns are, respectively, the minimum, maximum and average latencies. The last column shows the aggregated bandwidth calculated using the maximum latencies. Therefore, the bandwidth reported is the minimum bandwidth achieved in all the repetitions.
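The relationship between the latency columns and the reported bandwidth can be sketched in C as follows. The type and function names are hypothetical, and the bandwidth here is simply bytes divided by the maximum latency; any scaling across threads that the suite applies to obtain the aggregated figure is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of the latencies measured for one block size. */
typedef struct {
    double min;
    double max;
    double avg;
} latency_stats;

/* Compute minimum, maximum and average over `reps` latency samples. */
latency_stats summarize(const double *lat, size_t reps) {
    assert(reps > 0);
    latency_stats s = { lat[0], lat[0], 0.0 };
    double total = 0.0;
    for (size_t i = 0; i < reps; i++) {
        if (lat[i] < s.min) s.min = lat[i];
        if (lat[i] > s.max) s.max = lat[i];
        total += lat[i];
    }
    s.avg = total / (double)reps;
    return s;
}

/* Bandwidth derived from the slowest repetition, i.e. the minimum
   bandwidth observed (bytes per unit of latency time). */
double min_bandwidth(size_t bytes, const latency_stats *s) {
    return (double)bytes / s->max;
}
```

Dividing by the maximum latency rather than the average gives a conservative figure, which matches the statement that the reported bandwidth is the minimum achieved.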
  Moreover, when 2 threads are used, affinity tests are performed. This way you can measure the effects of data locality in NUMA systems, if the 2 threads run on the same machine. This feature may be useful even when the 2 threads run on different machines, e.g., machines with non-uniform access to the network interface, like quad-socket Opteron/Nehalem-based machines, or cell-based machines like HP Integrity servers. The output of these tests is preceded by something like:
