\section{Contact}

You can contact us at:\\

Galicia Supercomputing Center (CESGA)

\url{http://www.cesga.es}

Santiago de Compostela, Spain

upc@cesga.es\\\\

Dr Guillermo Lopez Taboada

Computer Architecture Group (CAG)

\url{http://gac.des.udc.es/index_en.html}

University of A Coruña, Spain

taboada@udc.es\\


\section{Files in this benchmarking suite}

\begin{itemize}
\item \texttt{doc/manual.pdf}: This file, the user's manual.
\item \texttt{COPYING} and \texttt{COPYING.LESSER}: Files containing the terms of use and redistribution (license).
\item \texttt{changelog.txt}: File with the changes in each release.
\item \texttt{src/affinity.upc}: UPC code with affinity-related tests.
\item \texttt{src/config/make.def.template.*}: Makefile templates for HP UPC and Berkeley UPC.
\item \texttt{src/config/parameters.h}: Header with some customizable parameters.
\item \texttt{src/defines.h}: Header with needed definitions.
\item \texttt{src/headers.h}: Header with the declarations of the HUCB functions.
\item \texttt{src/mem\_manager.upc}: Memory-related functions for allocation and freeing.
\item \texttt{src/UOMS.upc}: Main file. It contains the actual benchmarking code.
\item \texttt{src/init.upc}: Code to initialize some structures and variables.
\item \texttt{src/Makefile}: Makefile to build the benchmarking suite.
\item \texttt{src/timers/timers.c}: Timing functions.
\item \texttt{src/timers/timers.h}: Declarations of the timing functions.
\item \texttt{src/utils/data\_print.upc}: Functions to output the results.
\item \texttt{src/utils/utilities.c}: Auxiliary functions.
\end{itemize}


\section{Operations tested}

\begin{itemize}
\item \texttt{upc\_barrier}
\item \texttt{upc\_all\_broadcast}
\item \texttt{upc\_all\_scatter}
\item \texttt{upc\_all\_gather}
\item \texttt{upc\_all\_gather\_all}
\item \texttt{upc\_all\_permute}
\item \texttt{upc\_all\_exchange}
\item \texttt{upc\_all\_reduceC}
\item \texttt{upc\_all\_prefix\_reduceC}
\item \texttt{upc\_all\_reduceUC}
\item \texttt{upc\_all\_prefix\_reduceUC}
\item \texttt{upc\_all\_reduceS}
\item \texttt{upc\_all\_prefix\_reduceS}
\item \texttt{upc\_all\_reduceUS}
\item \texttt{upc\_all\_prefix\_reduceUS}
\item \texttt{upc\_all\_reduceI}
\item \texttt{upc\_all\_prefix\_reduceI}
\item \texttt{upc\_all\_reduceUI}
\item \texttt{upc\_all\_prefix\_reduceUI}
\item \texttt{upc\_all\_reduceL}
\item \texttt{upc\_all\_prefix\_reduceL}
\item \texttt{upc\_all\_reduceUL}
\item \texttt{upc\_all\_prefix\_reduceUL}
\item \texttt{upc\_all\_reduceF}
\item \texttt{upc\_all\_prefix\_reduceF}
\item \texttt{upc\_all\_reduceD}
\item \texttt{upc\_all\_prefix\_reduceD}
\item \texttt{upc\_all\_reduceLD}
\item \texttt{upc\_all\_prefix\_reduceLD}
\item \texttt{upc\_memcpy} (remote)
\item \texttt{upc\_memget} (remote)
\item \texttt{upc\_memput} (remote)
\item \texttt{upc\_memcpy} (local)
\item \texttt{upc\_memget} (local)
\item \texttt{upc\_memput} (local)
\item \texttt{memcpy} (local)
\item \texttt{memmove} (local)
\item \texttt{upc\_memcpy\_asynci} (remote)
\item \texttt{upc\_memget\_asynci} (remote)
\item \texttt{upc\_memput\_asynci} (remote)
\item \texttt{upc\_memcpy\_asynci} (local)
\item \texttt{upc\_memget\_asynci} (local)
\item \texttt{upc\_memput\_asynci} (local)
\item \texttt{upc\_all\_alloc}
\item \texttt{upc\_free}
\end{itemize}

Bulk memory transfer operations are tested in two modes: remote and local. In remote mode, data is copied from one thread to another, whereas in local mode data is copied from one thread to another memory region with affinity to the same thread.


\section{Customizable parameters}

\subsection{Compile time}
The following parameters can be customized at compile time in the \texttt{src/config/parameters.h} file:

\begin{itemize}
\item \texttt{NUMCORES}: If defined, it overrides the detection of the number of cores. If not defined, the number of cores is obtained by calling \texttt{sysconf(\_SC\_NPROCESSORS\_ONLN)}.
\item \texttt{ASYNC\_MEM\_TEST}: If defined, the asynchronous memory transfer tests are built. It is defined by default.
\item \texttt{MINSIZE}: The minimum message size used in the benchmarks. The default is 4 bytes.
\item \texttt{MAXSIZE}: The maximum message size used in the benchmarks. The default is 16 megabytes.
\end{itemize}
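
As an illustration, the relevant part of \texttt{parameters.h} might look like the sketch below. The macro names are the ones listed above. How the values are written is an assumption (sizes as plain byte counts, a megabyte taken as $2^{20}$ bytes), and the value 16 for \texttt{NUMCORES} is only an example, so check the file shipped with your release.

\begin{verbatim}
/* Hypothetical excerpt of src/config/parameters.h (values are assumptions) */

/* Uncomment to override the automatic detection of the number of cores */
/* #define NUMCORES 16 */

/* Build the asynchronous memory transfer tests */
#define ASYNC_MEM_TEST

/* Minimum and maximum message sizes, assumed to be given in bytes */
#define MINSIZE 4
#define MAXSIZE (16*1024*1024)
\end{verbatim}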

\subsection{Run time}
The following flags can be used on the command line at run time:

\begin{itemize}
\item \texttt{-help}: Prints usage information and exits.
\item \texttt{-version}: Prints the UOMS version and exits.
\item \texttt{-off\_cache}: Enables cache invalidation. Be aware that cache invalidation greatly increases memory consumption. Also note that it does not work for block sizes smaller than the cache line size.
\item \texttt{-warmup}: Enables a warmup iteration.
\item \texttt{-reduce\_op OP}: Chooses the reduce operation to be performed by \texttt{upc\_all\_reduceD} and \texttt{upc\_all\_prefix\_reduceD}. Valid operations are:
\begin{itemize}
\item \texttt{UPC\_ADD} (default)
\item \texttt{UPC\_MULT}
\item \texttt{UPC\_LOGAND}
\item \texttt{UPC\_LOGOR}
\item \texttt{UPC\_AND}
\item \texttt{UPC\_OR}
\item \texttt{UPC\_XOR}
\item \texttt{UPC\_MIN}
\item \texttt{UPC\_MAX}
\end{itemize}

\item \texttt{-sync\_mode MODE}: Chooses the synchronization mode for the collective operations. Valid modes are:
\begin{itemize}
\item \texttt{UPC\_IN\_ALLSYNC|UPC\_OUT\_ALLSYNC} (default)
\item \texttt{UPC\_IN\_ALLSYNC|UPC\_OUT\_MYSYNC}
\item \texttt{UPC\_IN\_ALLSYNC|UPC\_OUT\_NOSYNC}
\item \texttt{UPC\_IN\_MYSYNC|UPC\_OUT\_ALLSYNC}
\item \texttt{UPC\_IN\_MYSYNC|UPC\_OUT\_MYSYNC}
\item \texttt{UPC\_IN\_MYSYNC|UPC\_OUT\_NOSYNC}
\item \texttt{UPC\_IN\_NOSYNC|UPC\_OUT\_ALLSYNC}
\item \texttt{UPC\_IN\_NOSYNC|UPC\_OUT\_MYSYNC}
\item \texttt{UPC\_IN\_NOSYNC|UPC\_OUT\_NOSYNC}
\end{itemize}

\item \texttt{-msglen FILE}: Reads user-defined problem sizes (in bytes) from \texttt{FILE}. If specified, it overrides \texttt{-minsize} and \texttt{-maxsize}.

\item \texttt{-minsize SIZE}: Specifies the minimum block size (in bytes). Sizes increase by a factor of 2.

\item \texttt{-maxsize SIZE}: Specifies the maximum block size (in bytes).

\item \texttt{-input FILE}: Reads a user-defined list of benchmarks to run from \texttt{FILE}. Valid benchmark names are:
\begin{itemize}
\item \texttt{upc\_barrier}
\item \texttt{upc\_all\_broadcast}
\item \texttt{upc\_all\_scatter}
\item \texttt{upc\_all\_gather}
\item \texttt{upc\_all\_gather\_all}
\item \texttt{upc\_all\_exchange}
\item \texttt{upc\_all\_permute}
\item \texttt{upc\_memget}
\item \texttt{upc\_memput}
\item \texttt{upc\_memcpy}
\item \texttt{local\_upc\_memget}
\item \texttt{local\_upc\_memput}
\item \texttt{local\_upc\_memcpy}
\item \texttt{memcpy}
\item \texttt{memmove}
\item \texttt{upc\_all\_alloc}
\item \texttt{upc\_free}
\item \texttt{upc\_all\_reduceC}
\item \texttt{upc\_all\_prefix\_reduceC}
\item \texttt{upc\_all\_reduceUC}
\item \texttt{upc\_all\_prefix\_reduceUC}
\item \texttt{upc\_all\_reduceS}
\item \texttt{upc\_all\_prefix\_reduceS}
\item \texttt{upc\_all\_reduceUS}
\item \texttt{upc\_all\_prefix\_reduceUS}
\item \texttt{upc\_all\_reduceI}
\item \texttt{upc\_all\_prefix\_reduceI}
\item \texttt{upc\_all\_reduceUI}
\item \texttt{upc\_all\_prefix\_reduceUI}
\item \texttt{upc\_all\_reduceL}
\item \texttt{upc\_all\_prefix\_reduceL}
\item \texttt{upc\_all\_reduceUL}
\item \texttt{upc\_all\_prefix\_reduceUL}
\item \texttt{upc\_all\_reduceF}
\item \texttt{upc\_all\_prefix\_reduceF}
\item \texttt{upc\_all\_reduceD}
\item \texttt{upc\_all\_prefix\_reduceD}
\item \texttt{upc\_all\_reduceLD}
\item \texttt{upc\_all\_prefix\_reduceLD}
\item \texttt{upc\_memget\_asynci}
\item \texttt{upc\_memput\_asynci}
\item \texttt{upc\_memcpy\_asynci}
\item \texttt{local\_upc\_memget\_asynci}
\item \texttt{local\_upc\_memput\_asynci}
\item \texttt{local\_upc\_memcpy\_asynci}
\end{itemize}
\end{itemize}
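
As a worked example of how \texttt{-minsize} and \texttt{-maxsize} determine the sizes tested: since sizes increase by a factor of 2, the $k$-th size is
\[
s_k = \texttt{minsize} \cdot 2^{k}, \qquad k = 0, 1, 2, \dots
\]
for as long as $s_k \le \texttt{maxsize}$. That the maximum itself is included is an assumption here, not something stated above. For instance, the sample output in the Output explanation section runs from 4 bytes to 33554432 bytes, where $33554432 = 4 \cdot 2^{23}$, which gives $k = 0, \dots, 23$, that is, 24 sizes.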


\section{Compilation}

To compile the suite you have to set up a valid \texttt{src/config/make.def} file. Templates are provided for this purpose. The required parameters are:

\begin{itemize}
\item \texttt{CC}: Defines the C compiler used to compile the C code. Note that it is not used for the C code generated from the UPC code if your UPC compiler is a source-to-source compiler.
\item \texttt{CFLAGS}: Defines the C flags used to compile the C code. Note that they are not used for the C code generated from the UPC code if your UPC compiler is a source-to-source compiler.
\item \texttt{UPCC}: Defines the UPC compiler used to compile the suite.
\item \texttt{UPCFLAGS}: Defines the UPC compiler flags used to compile the suite. Note that you should not specify any flag that sets the number of threads here.
\item \texttt{UPCLINK}: Defines the UPC linker used to link the suite.
\item \texttt{UPCLINKFLAGS}: Defines the UPC linker flags used to link the suite.
\item \texttt{THREADS\_SWITCH}: Defines the switch that sets the desired number of threads. It is compiler dependent and must include any blank space required after the switch.
\end{itemize}
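
A minimal sketch of such a file for a Berkeley UPC installation is shown below. Only the variable names come from the list above. Every value is an assumption made for illustration: \texttt{gcc} as the C compiler, \texttt{upcc} as compiler and linker, the optimization flags, and \texttt{-T=} as the thread-count switch. It also assumes the usual Makefile assignment syntax, and that the Makefile appends the thread count directly to \texttt{THREADS\_SWITCH}. The templates shipped in \texttt{src/config} and your compiler's documentation are the authoritative reference.

\begin{verbatim}
# Hypothetical src/config/make.def (all values are assumptions)
CC             = gcc
CFLAGS         = -O2
UPCC           = upcc
UPCFLAGS       = -O
UPCLINK        = upcc
UPCLINKFLAGS   =
# Assumed to be joined with the thread count, e.g. "-T=128"
THREADS_SWITCH = -T=
\end{verbatim}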

Once you have set up your \texttt{make.def} file, you can compile the suite as follows:

\texttt{make NTHREADS=NUMBER\_OF\_UPC\_THREADS}

For example, for 128 threads:

\texttt{make NTHREADS=128}


\section{Timers used}

This suite uses high-resolution timers on the IA64 architecture. In particular, it uses the Interval Time Counter (\texttt{AR.ITC}). On other architectures it uses \texttt{hpupc\_ticks\_now} if you are using HP UPC, or \texttt{bupc\_ticks\_now} if you are using Berkeley UPC; their precision depends on the specific architecture. If none of these requirements is met, the suite falls back to the \texttt{gettimeofday} function. However, the granularity of this function only allows measuring microseconds, rather than nanoseconds.
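
To put this granularity in perspective, consider the smallest latency in the sample output of the Output explanation section, $t_{\min} = 19942$ ns. A timer that only resolves microseconds has a granularity of $10^{3}$ ns, so for such a short operation the relative uncertainty of one measurement would be roughly
\[
\frac{1000\ \mathrm{ns}}{19942\ \mathrm{ns}} \approx 0.05,
\]
that is, about 5\%. This is only an illustrative estimate. The actual error also depends on the clock source and on the overhead of the timing call itself.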


\section{Output explanation}

This is an example of the output for the broadcast benchmark:

\small
\begin{verbatim}
#---------------------------------------------------
# Benchmarking upc_all_broadcast
# #processes = 2
#---------------------------------------------------
#bytes   #repetitions t_min[nsec] t_max[nsec] t_avg[nsec] BW_aggregated[MB/sec]
4                  20       19942    48820275  2463315.85                  0.00
8                  20       19942       22922    21457.25                  0.70
16                 20       19942       22397    21420.10                  1.43
32                 20       19942       22235    21626.35                  2.88
64                 20       20277       33610    22886.00                  3.81
128                20       20285       22812    21676.60                 11.22
256                20       20767       22845    22230.50                 22.41
512                20       20767       23020    22314.85                 44.48
1024               20       22777       29255    24169.85                 70.01
2048               20       23705       25425    24603.85                161.10
4096               20       24562       27097    26437.60                302.32
8192               20       29885       33205    32174.35                493.42
16384              20       42492       44735    43919.35                732.49
32768              10       68317       70052    69490.00                935.53
65536              10      121610      123837   122635.00               1058.42
131072             10      227550      231515   229323.50               1132.30
262144             10      437645      444740   441354.00               1178.86
524288             10      861287      871700   867619.70               1202.91
1048576             5     1702722     1704420  1703642.40               1230.42
2097152             5     3417170     3435637  3429128.40               1220.82
4194304             5     6830267     6839535  6834224.40               1226.49
8388608             2    13434382    13469047 13451715.00               1245.61
16777216            2    27310152    27343357 27326755.00               1227.15
33554432            1    54294385    54294385 54294385.00               1236.02
\end{verbatim}

\normalsize

The header indicates the benchmarked function and the number of processes involved. The first column shows the message size used in each row. It is the size of the data at the root thread, or at any thread in a non-rooted operation. The second column is the number of repetitions performed for that message size. The following three columns are, respectively, the minimum, maximum and average latencies. The last column shows the aggregated bandwidth, calculated using the maximum latency. Therefore, the bandwidth reported is the minimum bandwidth achieved across all the repetitions.
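
The figures in the sample output are consistent with the following relation, where $s$ is the message size in bytes, $p = 2$ is the number of processes, $t_{\max}$ is the maximum latency in nanoseconds, and 1 MB is taken as $10^{6}$ bytes:
\[
BW_{\mathrm{aggregated}} = \frac{p \cdot s}{t_{\max}} .
\]
This relation is inferred from the sample numbers, not taken from the source code. In particular, with only two processes the sample cannot distinguish a factor $p$ from a constant factor of 2. For the row with $s = 1048576$ bytes:
\[
\frac{2 \cdot 1048576\ \mathrm{B}}{1704420\ \mathrm{ns}} \approx 1.23042\ \mathrm{B/ns} = 1230.42\ \mathrm{MB/s},
\]
which matches the value in the last column. Likewise, for $s = 8$ bytes, $16\ \mathrm{B} / 22922\ \mathrm{ns} \approx 0.70$ MB/s.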

Moreover, when 2 threads are used, affinity tests are performed. This way you can measure the effects of data locality in NUMA systems, if the 2 threads run on the same machine. This feature may be useful even when the 2 threads run on different machines, for example on machines with non-uniform access to the network interface, like quad-socket Opteron/Nehalem-based machines, or cell-based machines like HP Integrity servers. The output of these tests is preceded by a header like this:

\begin{verbatim}
#---------------------------------------------------------
# using #cores = 0 and 1 (Number of cores per node: 16)
# CPU Mask: 1000000000000000 (core 0), 0100000000000000 (core 1)
#---------------------------------------------------------
\end{verbatim}

All tests after these lines are performed using core 0 (thread 0) and core 1 (thread 1) until another affinity header is shown.
