
PARADOX Cluster User Guide v2.1

1. Introduction

2. System configuration

2.1. Hardware description

2.2. Filesystems

3. System Access

4. User environment and programming

4.1. Environment Modules

4.1.1. PRACE Common Production Environment (PCPE)

4.2. Batch system and jobs submission

4.2.1. Sequential Job submission

4.2.2. MPI Job submission

4.2.3. OpenMP Job submission

4.2.4. Hybrid job submission

4.2.5. CUDA single node job submission

4.2.6. CUDA MPI Job submission

4.3. Compiling

4.3.1. Compiler flags

C/C++

Fortran

4.3.2. Compiling MPI program

4.3.3. Compiling OpenMP programs

4.3.4. Compiling hybrid programs

4.3.5. Compiling CUDA programs

4.3.6. Compiling CUDA-MPI programs

4.3.7. PGI Compilation example

Contact

1. Introduction

The PARADOX cluster at the Scientific Computing Laboratory of the Institute of Physics Belgrade consists of 106 compute nodes (each with two 8-core Sandy Bridge Xeon 2.6 GHz processors, 32 GB of RAM, and an NVIDIA® Tesla M2090 GPU) interconnected by a QDR InfiniBand network.


Picture 1. PARADOX Cluster

2. System configuration

2.1. Hardware description

PARADOX is an HP ProLiant SL250s based cluster with the following components:

- 106 compute nodes, each with two 8-core Sandy Bridge Xeon 2.6 GHz processors, 32 GB of RAM, an NVIDIA Tesla M2090 GPU card, and a 500 GB local hard disk
- a QDR InfiniBand interconnect
- a shared Lustre file system mounted on /home (see 2.2)

Operating system:

The operating system on the PARADOX cluster is Scientific Linux 6.4.

2.2. Filesystems

There is one Lustre file system on PARADOX, mounted on /home. It is shared between the worker nodes and used both for long-term storage and for cluster job submission. A directory /home/<USERNAME> is created on this file system for each user.

Additionally, there is a local file system available on each worker node: /scratch. This file system should be used only for temporary storage by running jobs (each compute node has a 500 GB local hard disk).
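A common pattern (a minimal sketch only; the output directory name is illustrative and the PBS variables used here are described in section 4.2) is to run the job in a per-job directory on /scratch and copy the results back to /home at the end:

# inside a PBS job script (see section 4.2)
SCRATCHDIR=/scratch/$USER/$PBS_JOBID
mkdir -p $SCRATCHDIR
cd $SCRATCHDIR

cp $PBS_O_WORKDIR/prog .
./prog

# copy results back to the shared /home file system and clean up
cp -r output $PBS_O_WORKDIR/
rm -rf $SCRATCHDIR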

3. System Access

From your local machine within the IPB computer network, use the ssh command to access the login node of the PARADOX cluster, paradox.ipb.ac.rs.

$ ssh username@paradox.ipb.ac.rs

If you need a graphical environment, you have to use the -X option:

$ ssh username@paradox.ipb.ac.rs -X

Secure copy (scp) can be used to transfer data to or from paradox.ipb.ac.rs. This node is used for preparing and submitting jobs to the batch system and for some lightweight testing, but not for long-running computations. To log out from paradox.ipb.ac.rs, use Ctrl-d or the exit command.
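For example, to copy an input file to your home directory on the cluster and to fetch a result file back (the file names are only illustrative):

$ scp input.dat username@paradox.ipb.ac.rs:~/
$ scp username@paradox.ipb.ac.rs:~/results.dat .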

If you are accessing the PARADOX cluster from outside of the IPB computer network, you first need to log in to the user interface machine ui.ipb.ac.rs, and from there log in to the login node paradox.ipb.ac.rs, as in the example below.
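As a two-hop login:

$ ssh username@ui.ipb.ac.rs
$ ssh username@paradox.ipb.ac.rs

or, if your local OpenSSH client supports the -J (jump host) option, in a single step:

$ ssh -J username@ui.ipb.ac.rs username@paradox.ipb.ac.rs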

4. User environment and programming

4.1. Environment Modules

The PARADOX cluster uses Environment Modules[1] to set up user environment variables for various development, debugging, and profiling scenarios. The modules are divided into the applications, environment, compilers, libraries, and tools categories.

Available modules can be listed with the following command:

$ module avail

The list of currently loaded modules can be brought up by typing:

$ module list

Each module can be loaded by executing:

$ module load module_name

Specific modules can be unloaded by calling:

$ module unload module_name

All modules can be unloaded by executing:

$ module purge

The full list of modules currently available on PARADOX can be obtained with the module avail command described above.
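For example, a typical session that prepares a build environment might look like the following (the module versions shown are the ones referenced elsewhere in this guide; which combinations are installed can be checked with module avail):

$ module purge
$ module load intel/14.0.1
$ module load openmpi/1.6.5
$ module list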

4.1.1. PRACE Common Production Environment (PCPE)

PCPE provides a working environment with an interface that is, as much as possible, uniform from the user's point of view, especially for accessing the various software required to build and execute applications on the available platforms, at both Tier-0 and Tier-1 PRACE sites.

4.2. Batch system and jobs submission

Job submission, resource allocation and job launching over the cluster are managed by the batch system (the Torque resource manager with the Maui scheduler). Jobs can be submitted from paradox.ipb.ac.rs using the qsub command.

To submit a batch job, you first have to write a shell script which contains a set of #PBS directives describing the required resources (queue, number of nodes and cores per node, walltime, and the files for standard output and error) and the commands that run your application.

Your job can then be launched by submitting this script to the batch system. The job will enter a batch queue and, when the resources are available, it will be launched over the allocated nodes. The batch system provides monitoring of all submitted jobs.

The standard queue is available for user job submission.

Frequently used PBS commands for getting the status of the system, queues, or jobs are:

qstat                 list information about queues and jobs
qstat -q              list all queues on the system
qstat -Q              list queue limits for all queues
qstat -a              list all jobs on the system
qstat -au userID      list all jobs owned by user userID
qstat -s              list all jobs with status comments
qstat -r              list all running jobs
qstat -f jobID        list all information known about the specified job
qstat -n              list, in addition to the basic information, the nodes allocated to each job
qstat -Qf <queue>     list all information about the specified queue
qstat -B              list summary information about the PBS server
qdel jobID            delete the batch job with the given jobID
qalter                alter a batch job
qsub                  submit a job

4.2.1. Sequential Job submission

Here is a sample sequential job PBS script:

#!/bin/bash
#PBS -q standard
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00
#PBS -e ${PBS_JOBID}.err
#PBS -o ${PBS_JOBID}.out

cd $PBS_O_WORKDIR
chmod +x job.sh
./job.sh

This job can be submitted by issuing the following command:

$ qsub job.pbs

The qsub command will return a result of the type:

<JOB_ID>.paradox.ipb.ac.rs

Where <JOB_ID> is a unique integer used to identify the given job.

To check the status of your job use the following command:

$ qstat <JOB_ID>

This will return an output similar to:

Job ID            Name      User        Time Use  S  Queue
----------------  --------  ----------  --------  -  --------
<JOB_ID>.paradox  job.pbs   <username>  00:01:10  R  standard

Alternatively you can check the status of all your jobs using the following syntax of the qstat command:

$ qstat -u <user_name>

To get detailed information about your job use the following command:

$ qstat -f <JOB_ID>

When your job is finished, the files to which the standard output and standard error of the job were redirected will appear in your work directory.

If, for some reason, you want to cancel a job, the following command should be executed:

$ qdel <JOB_ID>

If qstat <JOB_ID> returns the following line:

qstat: Unknown Job Id <JOB_ID>.paradox

This most likely means that your job has finished.

4.2.2. MPI Job submission

Here is an example of an MPI job submission script:

#!/bin/bash
#PBS -q standard
#PBS -l nodes=2:ppn=16
#PBS -l walltime=10:00:00
#PBS -e ${PBS_JOBID}.err
#PBS -o ${PBS_JOBID}.out

cd $PBS_O_WORKDIR
chmod +x prog

module load openmpi/1.6.5

mpirun ./prog

The MPI launcher, together with the batch system, takes care of properly launching the parallel job, i.e. there is no need to specify the number of MPI processes or a machine file on the command line; the launcher obtains all of this information from the batch system. All of the stated PBS directives are the same as for a sequential job, except for the resource allocation line, which in this case is:

#PBS -l nodes=2:ppn=16

With this statement we are requesting 2 nodes with 16 cores each (2 full nodes, as PARADOX worker nodes are 16-core machines), i.e. 32 MPI processes altogether.
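For a smaller run the same mechanism applies; for example, requesting half of one node will make mpirun start 8 MPI processes:

#PBS -l nodes=1:ppn=8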

The job can be submitted by issuing the following command:

$ qsub job.pbs

By using the qstat command we can view the resources allocated to our parallel job.

$ qstat -n standard

paradox.ipb.ac.rs:

                                                                         Req'd  Req'd   Elap

Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time

-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----

509.paradox.ipb.     johndoe  standard job.pbs           30285     4   4    --  48000 R   --

   Gn106+Gn106+Gn106+Gn106

Job monitoring and canceling are no different than for a sequential job and are described in the sequential job submission section (4.2.1).

4.2.3. OpenMP Job submission

Here is an example of an OpenMP job submission script:

#!/bin/bash
#PBS -q standard
#PBS -l nodes=1:ppn=16
#PBS -l walltime=00:10:00
#PBS -e ${PBS_JOBID}.err
#PBS -o ${PBS_JOBID}.out

cd $PBS_O_WORKDIR
chmod +x prog

export OMP_NUM_THREADS=16

./prog

The executable is compiled with OpenMP support (see section 4.3.3, Compiling OpenMP programs). OpenMP jobs should not use more than one node, as specified in the PBS script:

#PBS -l nodes=1:ppn=16

OpenMP is a shared-memory parallel programming model, so OpenMP threads cannot be spread across multiple machines. An OpenMP job on the PARADOX cluster can therefore use at most 16 CPU cores, since the largest SMP node on the cluster has 16 cores.

The OMP_NUM_THREADS environment variable should be set explicitly, especially when you are not allocating a whole node for your job (i.e. not using ppn=16). In that case, if the number of threads is not specified in the program either, the OpenMP executable will use 16 threads (PARADOX worker nodes have 16 CPU cores) and potentially compete for CPU time with other jobs running on the same node (see the partial-node example below). Job submission, monitoring and canceling are the same as for the other, previously described types of jobs.
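A partial-node OpenMP job requesting 4 cores, for instance, should set the thread count to match (illustrative lines from such a script):

#PBS -l nodes=1:ppn=4

export OMP_NUM_THREADS=4
./prog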

If the job binary was compiled with the Intel compiler, then the appropriate Intel module should be loaded, e.g.:

module load intel/14.0.1

4.2.4. Hybrid job submission

The following example shows a typical hybrid job submission script:

#!/bin/bash
#PBS -q standard
#PBS -l nodes=4:ppn=16
#PBS -l walltime=10:00:00
#PBS -e ${PBS_JOBID}.err
#PBS -o ${PBS_JOBID}.out

module load openmpi/1.8.2

export OMP_NUM_THREADS=16

cd $PBS_O_WORKDIR
chmod +x prog

mpirun -np 4 -npernode 1 --bind-to none ./prog

Take note of the resource allocation line in this script (#PBS -l nodes=4:ppn=16), which determines the nodes across which the processes will be laid out. The ppn parameter must be adjusted in accordance with the OMP_NUM_THREADS setting to avoid oversubscribing a node. The actual number of processes assigned to each node is controlled by the mpirun parameter npernode, since the PBS line only serves for resource allocation.

Another detail is the --bind-to none option, which is needed for Open MPI versions 1.8 and later and allows the threads to spread across the cores. For more information about process and thread mapping, please see the mpirun documentation and the --map-by parameter.
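For example, with the same allocation (4 nodes, 16 cores each), a layout with two MPI processes per node and eight threads per process could be requested as follows (a sketch using only the mpirun options shown above; adjust -np to the total number of processes):

export OMP_NUM_THREADS=8

mpirun -np 8 -npernode 2 --bind-to none ./prog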

4.2.5. CUDA single node job submission

CUDA jobs are very similar to the previous job types, except that the cuda module must be loaded; since the login node has no GPU, these programs can only run on compute nodes.

Here is an example of a CUDA job submission script:

#!/bin/bash
#PBS -q standard
#PBS -l nodes=1:ppn=1
#PBS -l walltime=10:00:00
#PBS -e ${PBS_JOBID}.err
#PBS -o ${PBS_JOBID}.out

cd $PBS_O_WORKDIR
chmod +x prog

module load cuda/5.5

./prog

4.2.6. CUDA MPI Job submission

As with single-node CUDA jobs (4.2.5), the cuda module must be loaded, and since the login node has no GPU, these programs can only run on compute nodes.

Here is an example of a CUDA MPI job submission script:

#!/bin/bash
#PBS -q standard
#PBS -l nodes=4:ppn=1
#PBS -l walltime=10:00:00
#PBS -e ${PBS_JOBID}.err
#PBS -o ${PBS_JOBID}.out

cd $PBS_O_WORKDIR
chmod +x prog

module load openmpi/1.6.5
module load cuda/5.5

mpirun ./prog

4.3. Compiling

4.3.1. Compiler flags

C/C++

Intel compilers: icc and icpc. The compilation options are the same, except for the C language behavior: icpc treats all source files as C++ files, whereas icc distinguishes between C and C++ source files.

Basic flags:

Optimizations:

Preprocessor:

Practical:
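The detailed flag tables are not reproduced here; as a minimal illustration (common icc flags only, see the man page for the complete set), an optimized build with debugging symbols looks like:

$ icc -O2 -g -o prog prog.c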

Fortran

Intel compiler: ifort (Fortran compiler).

Basic flags :

Optimizations:

Run-time check:

Preprocessor:

Practical:
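Similarly, a minimal illustrative ifort invocation (assuming a Fortran 90 source file):

$ ifort -O2 -g -o prog prog.f90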

Please refer to the 'man pages' of the compilers for more information.

4.3.2. Compiling MPI program

Here is an example of an MPI program:

   #include <stdio.h>
   #include <mpi.h>

   int main(int argc, char **argv)
   {
      int num_procs, my_id;
      int len;
      char name[MPI_MAX_PROCESSOR_NAME];

      MPI_Init(&argc, &argv);

      /* find out the process ID, and how many processes were started */
      MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
      MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
      MPI_Get_processor_name(name, &len);

      printf("Hello, world. I'm process %d of %d on %s\n", my_id,
            num_procs, name);

      MPI_Finalize();
      return 0;
   }

MPI implementations provide the mpicc, mpic++, mpif77 and mpif90 compiler wrappers for compiling and linking MPI programs:

$ mpicc -o test test.c                (assuming that mpicc is on your $PATH)
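On PARADOX the wrappers become available after loading an MPI module, e.g. with the Open MPI version used elsewhere in this guide:

$ module load openmpi/1.6.5
$ mpicc -o test test.c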

4.3.3. Compiling OpenMP programs

The Intel and GNU compilers support OpenMP.

Example OpenMP program:

   #include <omp.h>
   #include <stdio.h>
   #include <stdlib.h>

   int main (int argc, char *argv[])
   {
      int nthreads, tid;

      /* Fork a team of threads giving them their own copies of variables */
      #pragma omp parallel private(nthreads, tid)
      {
         /* Obtain thread number */
         tid = omp_get_thread_num();
         printf("Hello World from thread = %d\n", tid);

         /* Only master thread does this */
         if (tid == 0)
         {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
         }
      }   /* All threads join master thread and disband */
   }

Intel compilers flag: -openmp

$ icc -openmp -o prog prog.c

It is recommended to compile OpenMP programs with static linking of the Intel libraries on the paradox.ipb.ac.rs machine before submission:

$ icc -openmp -static-intel -o prog prog.c

GNU compilers flag: -fopenmp

$ gcc -fopenmp -o prog prog.c

4.3.4. Compiling hybrid programs

Hybrid programs are supported by any combination of the installed MPI libraries and compiler suites. The appropriate OpenMP flag should be passed to the compiler of choice, as described in the previous section.

The following is the source code for a simple application, built as an MPI-OpenMP hybrid, which prints the host name and OpenMP thread IDs; when launched with mpirun, each MPI process spawns its own team of OpenMP threads:

#include <omp.h>            /* OpenMP Library */
#include <stdio.h>          /* printf() */
#include <stdlib.h>         /* EXIT_SUCCESS, system() */

int main (int argc, char *argv[]) {
    system("hostname");

    /* Parameters of OpenMP. */
    int O_P;                                 /* number of OpenMP processors */
    int O_T;                                 /* number of OpenMP threads */
    int O_ID;                                /* OpenMP thread ID */

    /* Get a few OpenMP parameters. */
    O_P  = omp_get_num_procs();              /* get number of OpenMP processors */
    O_T  = omp_get_num_threads();            /* get number of OpenMP threads */
    O_ID = omp_get_thread_num();             /* get OpenMP thread ID */
    printf("O_ID:%d  O_P:%d  O_T:%d\n", O_ID, O_P, O_T);

    /* PARALLEL REGION */
    /* Thread IDs range from 0 through omp_get_num_threads()-1. */
    /* We execute identical code in all threads (data parallelization). */
    #pragma omp parallel private(O_T, O_ID)
    {
        O_T  = omp_get_num_threads();        /* get number of OpenMP threads */
        O_ID = omp_get_thread_num();         /* get OpenMP thread ID */
        printf("parallel region:           O_ID=%d  O_T=%d\n", O_ID, O_T);
    }

    /* Exit master thread. */
    printf("O_ID:%d   Exits\n", O_ID);

    return EXIT_SUCCESS;
}

To compile the hybrid code, the following lines should be executed:

$ module load gnu                                # or intel
$ module load openmpi
$ mpicc -fopenmp hybrid_example.c -o prog        # or -openmp (for intel)

4.3.5. Compiling CUDA programs

The CUDA 5.5 and 6.0 toolkits are available; however, the login node does not have a GPU, so the compiled binary can only be run on compute nodes.

The following is a hello world CUDA program:

// This is the REAL "hello world" for CUDA!
// It takes the string "Hello ", prints it, then passes it to CUDA
// with an array of offsets. Then the offsets are added in parallel
// to produce the string "World!"
// By Ingemar Ragnemalm 2010

#include <stdio.h>
#include <stdlib.h>   /* EXIT_SUCCESS */

const int N = 16;
const int blocksize = 16;

__global__
void hello(char *a, int *b)
{
        a[threadIdx.x] += b[threadIdx.x];
}

int main()
{
        char a[N] = "Hello \0\0\0\0\0\0";
        int b[N] = {15, 10, 6, 0, -11, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

        char *ad;
        int *bd;
        const int csize = N*sizeof(char);
        const int isize = N*sizeof(int);

        printf("%s", a);

        cudaMalloc( (void**)&ad, csize );
        cudaMalloc( (void**)&bd, isize );
        cudaMemcpy( ad, a, csize, cudaMemcpyHostToDevice );
        cudaMemcpy( bd, b, isize, cudaMemcpyHostToDevice );

        dim3 dimBlock( blocksize, 1 );
        dim3 dimGrid( 1, 1 );
        hello<<<dimGrid, dimBlock>>>(ad, bd);
        cudaMemcpy( a, ad, csize, cudaMemcpyDeviceToHost );
        cudaFree( ad );
        cudaFree( bd );

        printf("%s\n", a);
        return EXIT_SUCCESS;
}

The code can be compiled with the following commands:

$ module load cuda/5.5

$ nvcc hello.cu -o hello
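The Tesla M2090 cards in the compute nodes are Fermi-class (compute capability 2.0), so the target architecture can also be specified explicitly if needed:

$ nvcc -arch=sm_20 hello.cu -o hello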

4.3.6. Compiling CUDA-MPI programs

As a tutorial, we will go through building and launching a simple application that multiplies two randomly generated vectors of numbers in parallel, using MPI and CUDA.

This is the source for the CUDA kernel, which has been saved into the multiply.cu file:

#include <cuda.h>
#include <cuda_runtime.h>
#include <math.h>

__global__ void kmultiply(const float* a, float* b, int n) {
        int i = threadIdx.x + blockIdx.x*blockDim.x;
        if (i < n)
            b[i] *= a[i];
}

extern "C" void launch_multiply(const float* a, float* b, int n) {
        float* dA;
        float* dB;
        int cerr;
        cerr = cudaMalloc((void**)&dA, n*sizeof(float));
        cerr = cudaMalloc((void**)&dB, n*sizeof(float));
        cerr = cudaMemcpy(dA, a, n*sizeof(float), cudaMemcpyHostToDevice);
        cerr = cudaMemcpy(dB, b, n*sizeof(float), cudaMemcpyHostToDevice);
        /* launch the kernel on the device copies; round the grid size up
           so that all n elements are covered */
        kmultiply<<<(n + 255)/256, 256>>>(dA, dB, n);
        cerr = cudaThreadSynchronize();
        cerr = cudaMemcpy(b, dB, n*sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dA);
        cudaFree(dB);
}

The main program, saved as main.c, contains the following source code:

#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>
#include <math.h>

void launch_multiply(const float* a, float* b, int n);

int main(int argc, char** argv) {
        int rank, nprocs;
        int n = 1000000;
        int chunk;
        float *A, *B;
        float *pA, *pB;
        int len;
        char name[MPI_MAX_PROCESSOR_NAME];

        if (argc > 1) {
            n = atoi(argv[1]);
        }

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        chunk = ceil(1.0*n/nprocs);

        /* allocate chunk*nprocs elements so that MPI_Scatter/MPI_Gather
           stay within bounds when n is not divisible by nprocs */
        A = (float*) malloc(chunk*nprocs*sizeof(float));
        B = (float*) malloc(chunk*nprocs*sizeof(float));
        pA = (float*) malloc(chunk*sizeof(float));
        pB = (float*) malloc(chunk*sizeof(float));

        MPI_Get_processor_name(name, &len);
        printf("process %d of %d on %s\n", rank, nprocs, name);

        if (rank == 0) {
            /* prepare arrays... */
            for (int i = 0; i < n; i++) {
                A[i] = ((float)rand()/RAND_MAX);
                B[i] = ((float)rand()/RAND_MAX);
            }
        }

        MPI_Scatter(A, chunk, MPI_FLOAT, pA, chunk, MPI_FLOAT,
                    0, MPI_COMM_WORLD);
        MPI_Scatter(B, chunk, MPI_FLOAT, pB, chunk, MPI_FLOAT,
                    0, MPI_COMM_WORLD);

        launch_multiply(pA, pB, chunk);

        MPI_Gather(pB, chunk, MPI_FLOAT,
                   B, chunk, MPI_FLOAT,
                   0, MPI_COMM_WORLD);

        free(A);
        free(B);
        free(pA);
        free(pB);

        MPI_Finalize();
        return 0;
}

For the compilation of the tutorial code, the following should be executed:

$ module load cuda/5.5
$ module load openmpi/1.6.5
$ nvcc -c multiply.cu -o multiply.o
$ mpicc -std=c99 -o prog main.c multiply.o -L/usr/local/cuda-5.5/lib64 -lcudart

The job submission script has the same layout as the script given in 4.2.6, with one difference in the last line that launches the program:

mpirun ./prog 1000000

The number passed as an argument to the program sets the length of the vectors, which can be used to make the execution time longer or shorter.

4.3.7. PGI Compilation example

PGI compilers and libraries are available on PARADOX in the modules named pgi and pgi64, the latter being the preferred one. Along with standard C/C++ and Fortran compilation (pgcc, pgcpp, pgfortran), the PGI compilers support accelerator card programming in CUDA for C/C++ and Fortran, and also support OpenACC directives.

The following example was taken from NVIDIA's Parallel Forall blog and demonstrates the use of OpenACC.

/*
*  Copyright 2012 NVIDIA Corporation
*
*  Licensed under the Apache License, Version 2.0 (the "License");
*  you may not use this file except in compliance with the License.
*  You may obtain a copy of the License at
*
*      http://www.apache.org/licenses/LICENSE-2.0
*
*  Unless required by applicable law or agreed to in writing, software
*  distributed under the License is distributed on an "AS IS" BASIS,
*  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
*  See the License for the specific language governing permissions and
*  limitations under the License.
*/

#include <math.h>
#include <string.h>
#include <stdio.h>
#include "timer.h"

int main(int argc, char** argv) {
   int n = 4096;
   int m = 4096;
   int iter_max = 1000;

   const float pi  = 2.0f * asinf(1.0f);
   const float tol = 1.0e-5f;
   float error     = 1.0f;

   float A[n][m];
   float Anew[n][m];
   float y0[n];

   memset(A, 0, n * m * sizeof(float));

   // set boundary conditions
   for (int i = 0; i < m; i++) {
       A[0][i]   = 0.f;
       A[n-1][i] = 0.f;
   }

   for (int j = 0; j < n; j++) {
       y0[j] = sinf(pi * j / (n-1));
       A[j][0] = y0[j];
       A[j][m-1] = y0[j]*expf(-pi);
   }

   printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);

   StartTimer();
   int iter = 0;

#pragma omp parallel for shared(Anew)
   for (int i = 1; i < m; i++) {
       Anew[0][i]   = 0.f;
       Anew[n-1][i] = 0.f;
   }
#pragma omp parallel for shared(Anew)
   for (int j = 1; j < n; j++) {
       Anew[j][0]   = y0[j];
       Anew[j][m-1] = y0[j]*expf(-pi);
   }

    while ( error > tol && iter < iter_max ) {
       error = 0.f;

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels
       for( int j = 1; j < n-1; j++) {
           for( int i = 1; i < m-1; i++ ) {
               Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                       + A[j-1][i] + A[j+1][i]);
               error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
           }
       }

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels
       for( int j = 1; j < n-1; j++) {
           for( int i = 1; i < m-1; i++ ) {
               A[j][i] = Anew[j][i];
           }
       }

       if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
       iter++;
   }

   double runtime = GetTimer();
   printf(" total: %f s\n", runtime / 1000.f);
}

(timer.h can be found in the github repository at the following address: https://github.com/parallel-forall/code-samples.git)

The most straightforward way to compile this example is to use the following commands:

$ module load pgi64
$ pgcc -I../common -acc -ta=nvidia,time -Minfo=accel laplace2d.c -o laplace2d_acc

This creates the OpenACC version. Since the PARADOX login node does not have an accelerator card, this example must be submitted to PBS for execution on the compute nodes. The following script could be used for the example above:

#!/bin/bash
#PBS -q standard
#PBS -l nodes=1:ppn=16
#PBS -l walltime=10:00:00
#PBS -e ${PBS_JOBID}.err
#PBS -o ${PBS_JOBID}.out
#PBS -A example

cd $PBS_O_WORKDIR

module load pgi64
./laplace2d_acc

If static linking is preferred, there are two options: to link everything statically, or to statically link only the PGI libraries. The first can be achieved with the compiler flag -Bstatic and the second with -Bstatic_pgi. With static linking one can avoid having to load the pgi module for execution on compute nodes.
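For example, the compile line from above with only the PGI runtime libraries linked statically:

$ pgcc -I../common -acc -ta=nvidia,time -Minfo=accel -Bstatic_pgi laplace2d.c -o laplace2d_acc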

Contact

If you have any questions or need more information, please contact: hpc-admin@scl.rs


[1] See http://modules.sourceforge.net for more info.