[uoms] Diff of /trunk/uoms-doc/doc.tex
revision 1, Mon Apr 5 17:12:14 2010 UTC revision 15, Tue Nov 30 13:05:36 2010 UTC
# Line 2  Line 2 
2    
3  You can contact us at:\\  You can contact us at:\\
4    
 Dr Guillermo Lopez Taboada  
   
 Computer Architecture Group (CAG)  
   
 University of A Coruña, Spain  
   
 taboada@udc.es\\  
   
5  Galicia Supercomputing Center (CESGA)  Galicia Supercomputing Center (CESGA)
6    
7    \url{http://www.cesga.es}
8    
9  Santiago de Compostela, Spain  Santiago de Compostela, Spain
10    
11  upc@cesga.es  upc@cesga.es\\\\
12    
13    PhD. Guillermo Lopez Taboada
14    
15    Computer Architecture Group (CAG)
16    
17    \url{http://gac.des.udc.es/index_en.html}
18    
19    University of A Coruña, Spain
20    
21    taboada@udc.es\\
22    
23    
24    \section{Acknowledgments}
25    
26    This work was funded by Hewlett-Packard Spain and partially supported by the Ministry of Science and Innovation of Spain under Project TIN2007-67537-C03-02 and by the Galician Government (Xunta de Galicia, Spain) under the Consolidation Program of Competitive Research Groups (Ref. 3/2006 DOGA 12/13/2006). We gratefully thank Brian Wibecan for his comments and for sharing his thoughts and knowledge with us. We also thank Jim Bovay for his support, and CESGA for providing access to the FinisTerrae supercomputer.
27    
28    
29  \section{Files in this benchmarking suite}  \section{Files in this benchmarking suite}
# Line 30  Line 32 
32   \item \texttt{doc/manual.pdf}: This file. User's manual.   \item \texttt{doc/manual.pdf}: This file. User's manual.
33   \item \texttt{COPYING} and \texttt{COPYING.LESSER}: Files containing the use and redistribution terms (license).   \item \texttt{COPYING} and \texttt{COPYING.LESSER}: Files containing the use and redistribution terms (license).
34   \item \texttt{changelog.txt}: File with changes in each release.   \item \texttt{changelog.txt}: File with changes in each release.
35     \item \texttt{Makefile}: Makefile to build the benchmarking suite. It relies on the src/Makefile file.
36   \item \texttt{src/affinity.upc}: UPC code with affinity-related tests.   \item \texttt{src/affinity.upc}: UPC code with affinity-related tests.
37   \item \texttt{src/config/make.def.template.*}: Makefile templates for HP UPC and Berkeley UPC.   \item \texttt{src/config/make.def.template.*}: Makefile templates for HP UPC and Berkeley UPC.
38   \item \texttt{src/config/parameters.h}: Header with some customizable parameters.   \item \texttt{src/config/parameters.h}: Header with some customizable parameters.
# Line 52  Line 55 
55  \section{Operations tested}  \section{Operations tested}
56    
57  \begin{itemize}  \begin{itemize}
58    \item \texttt{upc\_forall} (read elements of a shared array)
59    \item \texttt{upc\_forall} (write elements of a shared array)
60    \item \texttt{upc\_forall} (read+write elements of a shared array)
61    \item \texttt{for} (read elements of a shared array)
62    \item \texttt{for} (write elements of a shared array)
63    \item \texttt{for} (read+write elements of a shared array)
64  \item \texttt{upc\_barrier}  \item \texttt{upc\_barrier}
65  \item \texttt{upc\_all\_broadcast}  \item \texttt{upc\_all\_broadcast}
66  \item \texttt{upc\_all\_scatter}  \item \texttt{upc\_all\_scatter}
# Line 89  Line 98 
98  \item \texttt{upc\_memput} (local)  \item \texttt{upc\_memput} (local)
99  \item \texttt{memcpy} (local)  \item \texttt{memcpy} (local)
100  \item \texttt{memmove} (local)  \item \texttt{memmove} (local)
101    \item \texttt{upc\_memcpy\_async} (remote)
102    \item \texttt{upc\_memget\_async} (remote)
103    \item \texttt{upc\_memput\_async} (remote)
104    \item \texttt{upc\_memcpy\_async} (local)
105    \item \texttt{upc\_memget\_async} (local)
106    \item \texttt{upc\_memput\_async} (local)
107  \item \texttt{upc\_memcpy\_asynci} (remote)  \item \texttt{upc\_memcpy\_asynci} (remote)
108  \item \texttt{upc\_memget\_asynci} (remote)  \item \texttt{upc\_memget\_asynci} (remote)
109  \item \texttt{upc\_memput\_asynci} (remote)  \item \texttt{upc\_memput\_asynci} (remote)
# Line 99  Line 114 
114  \item \texttt{upc\_free}  \item \texttt{upc\_free}
115  \end{itemize}  \end{itemize}
116    
117    The \texttt{upc\_forall} and \texttt{for} benchmarks test the performance of accesses to a shared \texttt{int} array in read, write and read+write operations. The \texttt{upc\_forall} benchmark distributes the whole workload across threads, whereas in the \texttt{for} benchmark all the work is performed by thread 0. This is useful for testing the speed of remote accesses and optimization techniques such as coalescing. The read operation adds the current array element to a variable on the stack, which prevents the compiler from dropping the first $N-1$ iterations. The write operation simply updates each element with its position in the array. The read+write operation adds each element's position in the array to its current value.
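The three access patterns described for the \texttt{upc\_forall} and \texttt{for} benchmarks can be sketched in plain C. This is only an illustration, not the suite's code: a private \texttt{int} array stands in for the UPC shared array, the function names are hypothetical, and storing the read+write result back into the same element is an assumption.

```c
#include <stddef.h>

/* Hypothetical helpers; a plain C array stands in for the UPC shared array. */

/* Read: add every element to a stack variable so that no load can be dropped. */
long read_kernel(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Write: update each element with its position in the array. */
void write_kernel(int *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = (int)i;
}

/* Read+write: add each element's position to its current value
   (assumed here to be stored back into the same element). */
void readwrite_kernel(int *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] += (int)i;
}
```

Returning the accumulated sum from the read kernel is one way to keep the result live, so that an optimizing compiler cannot discard the loop.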
118    
119  Bulk memory transfer operations have two modes: remote and local. Remote mode copies data from one thread to another, whereas local mode copies data from one thread to another memory region with affinity to the same thread.  Bulk memory transfer operations have two modes: remote and local. Remote mode copies data from one thread to another, whereas local mode copies data from one thread to another memory region with affinity to the same thread.
120    
121    
# Line 111  Line 128 
128  \begin{itemize}  \begin{itemize}
129  \item \texttt{NUMCORES}: If defined it will override the detection of the number of cores. If not defined the number of cores is set through the \texttt{sysconf(\_SC\_NPROCESSORS\_ONLN)} system call.  \item \texttt{NUMCORES}: If defined it will override the detection of the number of cores. If not defined the number of cores is set through the \texttt{sysconf(\_SC\_NPROCESSORS\_ONLN)} system call.
130  \item \texttt{ASYNC\_MEM\_TEST}: If defined asynchronous memory transfer tests will be built. Default is defined.  \item \texttt{ASYNC\_MEM\_TEST}: If defined asynchronous memory transfer tests will be built. Default is defined.
131    \item \texttt{ASYNCI\_MEM\_TEST}: If defined, the tests for asynchronous memory transfers with implicit handlers will be built. Default is defined.
132  \item \texttt{MINSIZE}: The minimum message size to be used in the benchmarking. Default is 4 bytes.  \item \texttt{MINSIZE}: The minimum message size to be used in the benchmarking. Default is 4 bytes.
133  \item \texttt{MAXSIZE}: The maximum message size to be used in the benchmarking. Default is 16 megabytes.  \item \texttt{MAXSIZE}: The maximum message size to be used in the benchmarking. Default is 16 megabytes.
134  \end{itemize}  \end{itemize}
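The \texttt{NUMCORES} override described in the list above can be pictured as a compile-time switch around the \texttt{sysconf} call. The sketch below is an assumption about how such a switch might look, not the suite's actual code; \texttt{detect\_cores} is a hypothetical name.

```c
#include <unistd.h>

/* Hypothetical helper: returns the number of cores the benchmark should assume. */
long detect_cores(void) {
#ifdef NUMCORES
    /* Compile-time override, e.g. -DNUMCORES=8 */
    return NUMCORES;
#else
    /* Ask the system for the number of processors currently online */
    return sysconf(_SC_NPROCESSORS_ONLN);
#endif
}
```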
# Line 121  Line 139 
139  \begin{itemize}  \begin{itemize}
140  \item \texttt{-help}: Prints usage information and exits.  \item \texttt{-help}: Prints usage information and exits.
141  \item \texttt{-version}: Prints the UOMS version and exits.  \item \texttt{-version}: Prints the UOMS version and exits.
142  \item \texttt{-off\_cache}: Enable cache invalidation. Be aware that the cache invalidation greatly increases the memory consumption. Also, note that for block sizes smaller than the cache line size it will not work.  \item \texttt{-off\_cache}: Enable cache invalidation. Be aware that the cache invalidation greatly increases the memory consumption. Also, note that for block sizes smaller than the cache line size it will not have any effect.
143  \item \texttt{-warmup}: Enable a warmup iteration.  \item \texttt{-warmup}: Enable a warmup iteration.
144  \item \texttt{-reduce\_op OP}: Choose the reduce operation to be performed by \texttt{upc\_all\_reduceD} and \texttt{upc\_all\_prefix\_reduceD}. Valid operations are:  \item \texttt{-reduce\_op OP}: Choose the reduce operation to be performed by \texttt{upc\_all\_reduceD} and \texttt{upc\_all} \texttt{\_prefix\_reduceD}. Valid operations are:
145  \begin{itemize}  \begin{itemize}
146  \item \texttt{UPC\_ADD (default)}  \item \texttt{UPC\_ADD (default)}
147  \item \texttt{UPC\_MULT}  \item \texttt{UPC\_MULT}
# Line 155  Line 173 
173    
174  \item \texttt{-maxsize SIZE}: Specifies the maximum block size (in bytes)  \item \texttt{-maxsize SIZE}: Specifies the maximum block size (in bytes)
175    
176    \item \texttt{-time SECONDS}: Specifies the maximum run time in seconds for each block size. Disabled by default. Important: this setting will not interrupt an ongoing operation.
177    
178  \item \texttt{-input FILE}: Read user defined list of benchmarks to run from \texttt{FILE}. Valid benchmark names are:  \item \texttt{-input FILE}: Read user defined list of benchmarks to run from \texttt{FILE}. Valid benchmark names are:
179  \begin{itemize}  \begin{itemize}
180    \item \texttt{upc\_forall\_read}
181    \item \texttt{upc\_forall\_write}
182    \item \texttt{upc\_forall\_readwrite}
183    \item \texttt{for\_read}
184    \item \texttt{for\_write}
185    \item \texttt{for\_readwrite}
186  \item \texttt{upc\_barrier}  \item \texttt{upc\_barrier}
187  \item \texttt{upc\_all\_broadcast}  \item \texttt{upc\_all\_broadcast}
188  \item \texttt{upc\_all\_scatter}  \item \texttt{upc\_all\_scatter}
# Line 196  Line 222 
222  \item \texttt{upc\_all\_prefix\_reduceD}  \item \texttt{upc\_all\_prefix\_reduceD}
223  \item \texttt{upc\_all\_reduceLD}  \item \texttt{upc\_all\_reduceLD}
224  \item \texttt{upc\_all\_prefix\_reduceLD}  \item \texttt{upc\_all\_prefix\_reduceLD}
225    \item \texttt{upc\_memget\_async}
226    \item \texttt{upc\_memput\_async}
227    \item \texttt{upc\_memcpy\_async}
228    \item \texttt{local\_upc\_memget\_async}
229    \item \texttt{local\_upc\_memput\_async}
230    \item \texttt{local\_upc\_memcpy\_async}
231  \item \texttt{upc\_memget\_asynci}  \item \texttt{upc\_memget\_asynci}
232  \item \texttt{upc\_memput\_asynci}  \item \texttt{upc\_memput\_asynci}
233  \item \texttt{upc\_memcpy\_asynci}  \item \texttt{upc\_memcpy\_asynci}
# Line 216  Line 248 
248  To compile the suite you have to set up a correct \texttt{src/config/make.def} file. Templates are provided for this purpose. The required parameters are:  To compile the suite you have to set up a correct \texttt{src/config/make.def} file. Templates are provided for this purpose. The required parameters are:
249    
250  \begin{itemize}  \begin{itemize}
251  \item \texttt{CC}: Defines the C compiler used to compile the C code. Please note this does not involve the resulting C code generated from the UPC code if your UPC compiler is a source to source compiler.  \item \texttt{CC}: Defines the C compiler used to compile the C code. Please note this has nothing to do with the resulting C code generated from the UPC code if your UPC compiler is a source to source compiler.
252  \item \texttt{CFLAGS}: Defines the C flags used to compile the C code. Please note this does not involve the resulting C code generated from the UPC code if your UPC compiler is a source to source compiler  \item \texttt{CFLAGS}: Defines the C flags used to compile the C code. Please note this has nothing to do with the resulting C code generated from the UPC code if your UPC compiler is a source to source compiler
253  \item \texttt{UPCC}: Defines the UPC compiler used to compile the suite  \item \texttt{UPCC}: Defines the UPC compiler used to compile the suite
254  \item \texttt{UPCFLAGS}: Defines the UPC compiler flags used to compile the suite. Please note you should not specify any number of threads flag at this point  \item \texttt{UPCFLAGS}: Defines the UPC compiler flags used to compile the suite. Please note you should not specify the number of threads flag at this point
255  \item \texttt{UPCLINK}: Defines the UPC linker used to link the suite  \item \texttt{UPCLINK}: Defines the UPC linker used to link the suite
256  \item \texttt{UPCLINKFLAGS}: Defines the UPC linker flags used to link the suite  \item \texttt{UPCLINKFLAGS}: Defines the UPC linker flags used to link the suite
257  \item \texttt{THREADS\_SWITCH}: Defines the correct switch to set the desired number of threads. It is compiler dependent, and also includes any blank space after the switch  \item \texttt{THREADS\_SWITCH}: Defines the correct switch to set the desired number of threads. It is compiler dependent, and also includes any blank space after the switch
258  \end{itemize}  \end{itemize}
259    
260  Once you have set up your \texttt{make.def} file you can compile the suite as  Once you have set up your \texttt{make.def} file you can compile the suite.
261  following:  \\
262    
263    For a static thread setup type:
264    
265  \texttt{make NTHREADS=NUMBER\_OF\_UPC\_THREADS}  \texttt{make NTHREADS=NUMBER\_OF\_UPC\_THREADS}
266    
267  E.g., for 128 threads:  E.g., for 128 threads:
268    
269  \texttt{make NTHREADS=128}  \texttt{make NTHREADS=128}
270    \\
271    
272    For a dynamic thread setup just type:
273    
274    \texttt{make}
275    
276    
277    
# Line 289  Line 326 
326    
327  \normalsize  \normalsize
328    
329  The header indicates the benchmarked function and the number of processes involved. The first column shows the size used for each particular row. It is the size of the data at the root thread, or in any thread in a non-rooted operation. The second column is the number of repetitions performed for that particular message size. The following three columns are, respectively, the minimum, maximum and average latencies. The last column shows the aggregated bandwidth calculated using the maximum latencies. Therefore, the bandwidth reported is the minimum bandwidth achieved in all the repetitions.  The header indicates the benchmarked function and the number of processes involved. The first column shows the block size used for each particular row. The second column is the number of repetitions performed for that particular message size. The following three columns are, respectively, the minimum, maximum and average latencies. The last column shows the aggregated bandwidth calculated using the maximum latencies. Therefore, the bandwidth reported is the minimum bandwidth achieved in all the repetitions.
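The relationship between the latency columns and the bandwidth column can be illustrated with a small C helper. This is a sketch under stated assumptions, not the suite's implementation: the names \texttt{row\_stats} and \texttt{summarize} are hypothetical, latencies are assumed to be in seconds, and \texttt{bytes} is assumed to be the total amount of data moved per repetition.

```c
#include <stddef.h>

/* Hypothetical summary of one output row. */
typedef struct {
    double min, max, avg;   /* latencies over all repetitions */
    double bandwidth;       /* bytes per second, based on the maximum latency */
} row_stats;

/* lat: per-repetition latencies; reps must be at least 1. */
row_stats summarize(const double *lat, size_t reps, double bytes) {
    row_stats r = { lat[0], lat[0], 0.0, 0.0 };
    double total = 0.0;
    for (size_t i = 0; i < reps; i++) {
        if (lat[i] < r.min) r.min = lat[i];
        if (lat[i] > r.max) r.max = lat[i];
        total += lat[i];
    }
    r.avg = total / (double)reps;
    /* Dividing by the slowest repetition yields the lowest bandwidth observed. */
    r.bandwidth = bytes / r.max;
    return r;
}
```

Because the divisor is the largest latency, the reported figure is a lower bound on the bandwidth seen across the repetitions.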
330    
331  Moreover, when 2 threads are used, affinity tests are performed. This way you can measure the effects of data locality in NUMA systems, if the 2 threads run on the same machine. This feature may be useful even when the 2 threads run on different machines, e.g., machines with non-uniform access to the network interface, like quad-socket Opteron/Nehalem-based machines, or cell-based machines like HP Integrity servers. The output of these tests is preceded by something like:  Moreover, when 2 threads are used, affinity tests are performed. This way you can measure the effects of data locality in NUMA systems, if the 2 threads run on the same machine. This feature may be useful even when the 2 threads run on different machines, e.g., machines with non-uniform access to the network interface, like quad-socket Opteron/Nehalem-based machines, or cell-based machines like HP Integrity servers. The output of these tests is preceded by something like:
332    
