% doc.tex, revision 11, Wed May 12 11:23:57 2010 UTC, by dalvarez: minor changes in the manual
\section{Contact}

You can contact us at:\\

Galicia Supercomputing Center (CESGA)

\url{http://www.cesga.es}

Santiago de Compostela, Spain

upc@cesga.es\\\\

PhD. Guillermo Lopez Taboada

Computer Architecture Group (CAG)

\url{http://gac.des.udc.es/index_en.html}

University of A Coruña, Spain

taboada@udc.es\\


\section{Acknowledgments}

This work was funded by Hewlett-Packard Spain and partially supported by the Ministry of Science and Innovation of Spain under Project TIN2007-67537-C03-02 and by the Galician Government (Xunta de Galicia, Spain) under the Consolidation Program of Competitive Research Groups (Ref. 3/2006 DOGA 12/13/2006). We gratefully thank Brian Wibecan for his comments and for sharing his thoughts and knowledge with us. We also thank Jim Bovay for his support, and CESGA for providing access to the FinisTerrae supercomputer.


\section{Files in this benchmarking suite}

\begin{itemize}
 \item \texttt{doc/manual.pdf}: This file. User's manual.
 \item \texttt{COPYING} and \texttt{COPYING.LESSER}: Files containing the terms of use and redistribution (license).
 \item \texttt{changelog.txt}: File with changes in each release.
 \item \texttt{src/affinity.upc}: UPC code with affinity-related tests.
 \item \texttt{src/config/make.def.template.*}: Makefile templates for HP UPC and Berkeley UPC.
 \item \texttt{src/config/parameters.h}: Header with some customizable parameters.
 \item \texttt{src/defines.h}: Header with needed definitions.
 \item \texttt{src/headers.h}: Header with HUCB functions headers.
 \item \texttt{src/mem\_manager.upc}: Memory-related functions for allocation and freeing.
 \item \texttt{src/UOMS.upc}: Main file. It contains the actual benchmarking code.
 \item \texttt{src/init.upc}: Code to initialize some structures and variables.
 \item \texttt{src/Makefile}: Makefile to build the benchmarking suite.
 \item \texttt{src/timers/timers.c}: Timing functions.
 \item \texttt{src/timers/timers.h}: Timing functions headers.
 \item \texttt{src/utils/data\_print.upc}: Functions to output the results.
 \item \texttt{src/utils/utilities.c}: Auxiliary functions.
\end{itemize}





\section{Operations tested}

\begin{itemize}
\item \texttt{upc\_barrier}
\item \texttt{upc\_all\_broadcast}
\item \texttt{upc\_all\_scatter}
\item \texttt{upc\_all\_gather}
\item \texttt{upc\_all\_gather\_all}
\item \texttt{upc\_all\_permute}
\item \texttt{upc\_all\_exchange}
\item \texttt{upc\_all\_reduceC}
\item \texttt{upc\_all\_prefix\_reduceC}
\item \texttt{upc\_all\_reduceUC}
\item \texttt{upc\_all\_prefix\_reduceUC}
\item \texttt{upc\_all\_reduceS}
\item \texttt{upc\_all\_prefix\_reduceS}
\item \texttt{upc\_all\_reduceUS}
\item \texttt{upc\_all\_prefix\_reduceUS}
\item \texttt{upc\_all\_reduceI}
\item \texttt{upc\_all\_prefix\_reduceI}
\item \texttt{upc\_all\_reduceUI}
\item \texttt{upc\_all\_prefix\_reduceUI}
\item \texttt{upc\_all\_reduceL}
\item \texttt{upc\_all\_prefix\_reduceL}
\item \texttt{upc\_all\_reduceUL}
\item \texttt{upc\_all\_prefix\_reduceUL}
\item \texttt{upc\_all\_reduceF}
\item \texttt{upc\_all\_prefix\_reduceF}
\item \texttt{upc\_all\_reduceD}
\item \texttt{upc\_all\_prefix\_reduceD}
\item \texttt{upc\_all\_reduceLD}
\item \texttt{upc\_all\_prefix\_reduceLD}
\item \texttt{upc\_memcpy} (remote)             
\item \texttt{upc\_memget} (remote)             
\item \texttt{upc\_memput} (remote)             
\item \texttt{upc\_memcpy} (local)              
\item \texttt{upc\_memget} (local)              
\item \texttt{upc\_memput} (local)
\item \texttt{memcpy} (local)
\item \texttt{memmove} (local)
\item \texttt{upc\_memcpy\_async} (remote)
\item \texttt{upc\_memget\_async} (remote)    
\item \texttt{upc\_memput\_async} (remote)     
\item \texttt{upc\_memcpy\_async} (local)      
\item \texttt{upc\_memget\_async} (local)     
\item \texttt{upc\_memput\_async} (local)
\item \texttt{upc\_memcpy\_asynci} (remote)
\item \texttt{upc\_memget\_asynci} (remote)    
\item \texttt{upc\_memput\_asynci} (remote)     
\item \texttt{upc\_memcpy\_asynci} (local)      
\item \texttt{upc\_memget\_asynci} (local)     
\item \texttt{upc\_memput\_asynci} (local)
\item \texttt{upc\_all\_alloc}
\item \texttt{upc\_free}
\end{itemize}

Bulk memory transfer operations are tested in two modes: remote and local. Remote mode copies data from one thread to another, whereas local mode copies data between two memory regions that both have affinity to the same thread.



\section{Customizable parameters}

\subsection{Compile time}
In the \texttt{src/config/parameters.h} file you can customize some parameters at compile time. They are:

\begin{itemize}
\item \texttt{NUMCORES}: If defined, it overrides the detection of the number of cores. If not defined, the number of cores is obtained through a call to \texttt{sysconf(\_SC\_NPROCESSORS\_ONLN)}.
\item \texttt{ASYNC\_MEM\_TEST}: If defined, the asynchronous memory transfer tests are built. Defined by default.
\item \texttt{ASYNCI\_MEM\_TEST}: If defined, the asynchronous memory transfer tests with implicit handles are built. Defined by default.
\item \texttt{MINSIZE}: The minimum message size to be used in the benchmarking. Default is 4 bytes.
\item \texttt{MAXSIZE}: The maximum message size to be used in the benchmarking. Default is 16 megabytes.
\end{itemize}
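
For illustration, a \texttt{parameters.h} that fixes the core count and keeps the defaults listed above might look like the following sketch. The exact layout of the shipped header may differ; the value \texttt{16} for \texttt{NUMCORES} is only an example, and expressing \texttt{MINSIZE} and \texttt{MAXSIZE} in bytes (with one megabyte taken as $2^{20}$ bytes) is an assumption.

\begin{verbatim}
/* Hypothetical excerpt of src/config/parameters.h (sketch only) */
#define NUMCORES 16                 /* skip core detection; example value  */
#define ASYNC_MEM_TEST              /* build the async transfer tests      */
#define ASYNCI_MEM_TEST             /* build the asynci transfer tests     */
#define MINSIZE 4                   /* assumed to be in bytes              */
#define MAXSIZE (16 * 1024 * 1024)  /* 16 megabytes, assumed to be bytes   */
\end{verbatim}

Removing or commenting out a \texttt{\#define} line leaves the corresponding macro undefined, which for \texttt{NUMCORES} would restore automatic detection.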

\subsection{Run time}
The following flags can be used at run time in the command line:

\begin{itemize}
\item \texttt{-help}: Prints usage information and exits.
\item \texttt{-version}: Prints the UOMS version and exits.
\item \texttt{-off\_cache}: Enables cache invalidation. Be aware that cache invalidation greatly increases memory consumption. Also note that it has no effect for block sizes smaller than the cache line size.
\item \texttt{-warmup}: Enables a warmup iteration.
\item \texttt{-reduce\_op OP}: Chooses the reduce operation to be performed by \texttt{upc\_all\_reduceD} and \texttt{upc\_all\_prefix\_reduceD}. Valid operations are:
\begin{itemize}
\item \texttt{UPC\_ADD} (default)
\item \texttt{UPC\_MULT}
\item \texttt{UPC\_LOGAND}
\item \texttt{UPC\_LOGOR}
\item \texttt{UPC\_AND}
\item \texttt{UPC\_OR}
\item \texttt{UPC\_XOR}
\item \texttt{UPC\_MIN}
\item \texttt{UPC\_MAX}
\end{itemize}

\item \texttt{-sync\_mode MODE}: Choose the synchronization mode for the collective operations. Valid modes are:
\begin{itemize}
\item \texttt{UPC\_IN\_ALLSYNC|UPC\_OUT\_ALLSYNC} (default)
\item \texttt{UPC\_IN\_ALLSYNC|UPC\_OUT\_MYSYNC}
\item \texttt{UPC\_IN\_ALLSYNC|UPC\_OUT\_NOSYNC}
\item \texttt{UPC\_IN\_MYSYNC|UPC\_OUT\_ALLSYNC}
\item \texttt{UPC\_IN\_MYSYNC|UPC\_OUT\_MYSYNC}
\item \texttt{UPC\_IN\_MYSYNC|UPC\_OUT\_NOSYNC}
\item \texttt{UPC\_IN\_NOSYNC|UPC\_OUT\_ALLSYNC}
\item \texttt{UPC\_IN\_NOSYNC|UPC\_OUT\_MYSYNC}
\item \texttt{UPC\_IN\_NOSYNC|UPC\_OUT\_NOSYNC}
\end{itemize}

\item \texttt{-msglen FILE}: Reads user-defined problem sizes (in bytes) from \texttt{FILE}. If specified, it overrides \texttt{-minsize} and \texttt{-maxsize}.

\item \texttt{-minsize SIZE}: Specifies the minimum block size (in bytes). Sizes increase by a factor of 2.

\item \texttt{-maxsize SIZE}: Specifies the maximum block size (in bytes).

\item \texttt{-input FILE}: Reads a user-defined list of benchmarks to run from \texttt{FILE}. Valid benchmark names are:
\begin{itemize}
\item \texttt{upc\_barrier}
\item \texttt{upc\_all\_broadcast}
\item \texttt{upc\_all\_scatter}
\item \texttt{upc\_all\_gather}
\item \texttt{upc\_all\_gather\_all}
\item \texttt{upc\_all\_exchange}
\item \texttt{upc\_all\_permute}
\item \texttt{upc\_memget}
\item \texttt{upc\_memput}
\item \texttt{upc\_memcpy}
\item \texttt{local\_upc\_memget}
\item \texttt{local\_upc\_memput}
\item \texttt{local\_upc\_memcpy}
\item \texttt{memcpy}
\item \texttt{memmove}
\item \texttt{upc\_all\_alloc}
\item \texttt{upc\_free}
\item \texttt{upc\_all\_reduceC}
\item \texttt{upc\_all\_prefix\_reduceC}
\item \texttt{upc\_all\_reduceUC}
\item \texttt{upc\_all\_prefix\_reduceUC}
\item \texttt{upc\_all\_reduceS}
\item \texttt{upc\_all\_prefix\_reduceS}
\item \texttt{upc\_all\_reduceUS}
\item \texttt{upc\_all\_prefix\_reduceUS}
\item \texttt{upc\_all\_reduceI}
\item \texttt{upc\_all\_prefix\_reduceI}
\item \texttt{upc\_all\_reduceUI}
\item \texttt{upc\_all\_prefix\_reduceUI}
\item \texttt{upc\_all\_reduceL}
\item \texttt{upc\_all\_prefix\_reduceL}
\item \texttt{upc\_all\_reduceUL}
\item \texttt{upc\_all\_prefix\_reduceUL}
\item \texttt{upc\_all\_reduceF}
\item \texttt{upc\_all\_prefix\_reduceF}
\item \texttt{upc\_all\_reduceD}
\item \texttt{upc\_all\_prefix\_reduceD}
\item \texttt{upc\_all\_reduceLD}
\item \texttt{upc\_all\_prefix\_reduceLD}
\item \texttt{upc\_memget\_async}
\item \texttt{upc\_memput\_async}
\item \texttt{upc\_memcpy\_async}
\item \texttt{local\_upc\_memget\_async}
\item \texttt{local\_upc\_memput\_async}
\item \texttt{local\_upc\_memcpy\_async}
\item \texttt{upc\_memget\_asynci}
\item \texttt{upc\_memput\_asynci}
\item \texttt{upc\_memcpy\_asynci}
\item \texttt{local\_upc\_memget\_asynci}
\item \texttt{local\_upc\_memput\_asynci}
\item \texttt{local\_upc\_memcpy\_asynci}

\end{itemize}
\end{itemize}
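
As a worked example of the size options, and assuming the sweep includes both end points (as the sample output in the Output explanation section suggests), running with \texttt{-minsize 8 -maxsize 64} would test the block sizes
\[
8,\; 16,\; 32,\; 64
\]
bytes. In general the sizes follow $s_k = s_{\min} \cdot 2^{k}$ for $k = 0, 1, \ldots$ while $s_k \le s_{\max}$, where $s_{\min}$ and $s_{\max}$ are the values given to \texttt{-minsize} and \texttt{-maxsize}. How the suite handles a maximum that cannot be reached by doubling from the minimum is not specified here.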






\section{Compilation}

To compile the suite you have to set up a correct \texttt{src/config/make.def} file. Templates are provided for this purpose. The required parameters are:

\begin{itemize}
\item \texttt{CC}: Defines the C compiler used to compile the C code. Please note that this does not affect the C code generated from the UPC code if your UPC compiler is a source-to-source compiler.
\item \texttt{CFLAGS}: Defines the C flags used to compile the C code. Please note that this does not affect the C code generated from the UPC code if your UPC compiler is a source-to-source compiler.
\item \texttt{UPCC}: Defines the UPC compiler used to compile the suite.
\item \texttt{UPCFLAGS}: Defines the UPC compiler flags used to compile the suite. Please note that you should not specify any number-of-threads flag at this point.
\item \texttt{UPCLINK}: Defines the UPC linker used to link the suite.
\item \texttt{UPCLINKFLAGS}: Defines the UPC linker flags used to link the suite.
\item \texttt{THREADS\_SWITCH}: Defines the switch that sets the desired number of threads. It is compiler dependent, and it must include any blank space required after the switch.
\end{itemize}
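
As an illustration, a \texttt{make.def} for a Berkeley UPC installation could resemble the sketch below. All values are assumptions chosen for the example (compiler names, flags and the spelling of the thread switch); the provided templates and your compiler's documentation remain the authoritative reference.

\begin{verbatim}
# Hypothetical src/config/make.def (sketch, all values are assumptions)
CC             = gcc
CFLAGS         = -O2
UPCC           = upcc
UPCFLAGS       =
UPCLINK        = upcc
UPCLINKFLAGS   =
# Thread-count switch; check the exact spelling for your compiler
THREADS_SWITCH = -T=
\end{verbatim}

The value of \texttt{THREADS\_SWITCH} is presumably placed directly before the thread count given at build time, which is why any blank space the compiler requires after the switch has to be part of the value.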

Once you have set up your \texttt{make.def} file you can compile the suite as
follows:

\texttt{make NTHREADS=NUMBER\_OF\_UPC\_THREADS}

E.g., for 128 threads:

\texttt{make NTHREADS=128}








\section{Timers used}

This suite uses high-resolution timers on the IA64 architecture; in particular, it uses the Interval Time Counter (\texttt{AR.ITC}). On other architectures it uses \texttt{hpupc\_ticks\_now} if you are using HP UPC, or \texttt{bupc\_ticks\_now} if you are using Berkeley UPC; the precision of both depends on the specific architecture. If none of these requirements is met, the suite falls back to the standard \texttt{gettimeofday} function. However, the granularity of this function only allows measurements in microseconds, rather than nanoseconds.
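
To put this granularity in perspective: \texttt{gettimeofday} reports time in whole microseconds, so a single measurement cannot resolve differences smaller than
\[
1\,\mu\mathrm{s} = 1000\,\mathrm{ns}.
\]
For a latency of about $20\,\mu\mathrm{s}$, which is the order of magnitude of the small-message rows in the sample output of the Output explanation section, one microsecond is roughly $1/20 = 5\%$ of the measured value. This is only a rough estimate; the effective resolution of \texttt{gettimeofday} also depends on the operating system and the hardware.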






\section{Output explanation}

The following is an example of the output for the broadcast benchmark:

\small
\begin{verbatim}
#---------------------------------------------------
# Benchmarking upc_all_broadcast  
# #processes = 2                                    
#---------------------------------------------------
       #bytes #repetitions  t_min[nsec]  t_max[nsec]   t_avg[nsec] BW_aggregated[MB/sec]
            4           20        19942     48820275    2463315.85                  0.00               
            8           20        19942        22922      21457.25                  0.70               
           16           20        19942        22397      21420.10                  1.43               
           32           20        19942        22235      21626.35                  2.88               
           64           20        20277        33610      22886.00                  3.81               
          128           20        20285        22812      21676.60                 11.22               
          256           20        20767        22845      22230.50                 22.41               
          512           20        20767        23020      22314.85                 44.48               
         1024           20        22777        29255      24169.85                 70.01               
         2048           20        23705        25425      24603.85                161.10               
         4096           20        24562        27097      26437.60                302.32               
         8192           20        29885        33205      32174.35                493.42               
        16384           20        42492        44735      43919.35                732.49               
        32768           10        68317        70052      69490.00                935.53               
        65536           10       121610       123837     122635.00               1058.42               
       131072           10       227550       231515     229323.50               1132.30               
       262144           10       437645       444740     441354.00               1178.86               
       524288           10       861287       871700     867619.70               1202.91               
      1048576            5      1702722      1704420    1703642.40               1230.42               
      2097152            5      3417170      3435637    3429128.40               1220.82               
      4194304            5      6830267      6839535    6834224.40               1226.49               
      8388608            2     13434382     13469047   13451715.00               1245.61              
     16777216            2     27310152     27343357   27326755.00               1227.15              
     33554432            1     54294385     54294385   54294385.00               1236.02
\end{verbatim}

\normalsize

The header indicates the benchmarked function and the number of processes involved. The first column shows the size used for each particular row. It is the size of the data at the root thread, or in any thread in a non-rooted operation. The second column is the number of repetitions performed for that particular message size. The following three columns are, respectively, the minimum, maximum and average latencies. The last column shows the aggregated bandwidth calculated using the maximum latencies. Therefore, the bandwidth reported is the minimum bandwidth achieved in all the repetitions.
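
The figures in the sample above are consistent with the following relation, which is inferred from the numbers rather than taken from the source code:
\[
\mathrm{BW}_{\mathrm{aggregated}} = \frac{n \cdot b}{t_{\max}},
\]
where $b$ is the value in the \texttt{\#bytes} column, $t_{\max}$ is the maximum latency and $n = 2$. Since the sample uses only two threads, identifying $n$ with the number of threads is an assumption. One megabyte is taken as $10^{6}$ bytes. For the 4096-byte row:
\[
\frac{2 \cdot 4096\ \mathrm{bytes}}{27097\ \mathrm{ns}} \approx 0.30232\ \mathrm{bytes/ns} = 302.32\ \mathrm{MB/s},
\]
which matches the reported value.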

Moreover, when 2 threads are used, affinity tests are performed. This way you can measure the effects of data locality on NUMA systems if the 2 threads run on the same machine. This feature may be useful even when the 2 threads run on different machines, e.g., on machines with non-uniform access to the network interface, such as quad-socket Opteron/Nehalem-based machines or cell-based machines like HP Integrity servers. The output of these tests is preceded by a header like the following:

\begin{verbatim}
#---------------------------------------------------------
# using #cores = 0 and 1 (Number of cores per node: 16)
# CPU Mask: 1000000000000000 (core 0), 0100000000000000 (core 1)
#---------------------------------------------------------
\end{verbatim}

All tests after these lines are performed using core 0 (thread 0) and core 1 (thread 1) until another affinity header is shown.
