c++ - MPI_Scatter slows down the code?


Folks! I wrote code that computes the scalar product of two huge vectors with MPI. First, the process with rank 0 creates two random vectors and sends them via MPI_Scatter to the rest. After that, the processes compute their partial sums and send them back to rank 0. The main problem is that MPI_Scatter takes a huge amount of time to send the data to the other processes, and therefore the program gets slower with additional processes. I measured with MPI_Wtime(), and the MPI_Scatter() call took in some cases 80% of the computation time. The serial code is faster than any MPI setting I have tried.

These are the results on a dual-core machine for different numbers of processes:

Processes    Time (s)
serial       0,3275
1            0,3453
2            0,4522
4            3,4755
8            5,8645
10           8,9112
20           24,4612
40           63,2633

Do you know how to avoid such bottlenecks? Don't mind the MPI_Allgather()... it is part of the homework :)

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    srand(time(NULL));
    int size, whoami, i;
    int n = 10000000;
    double start, elapsed_time, end;
    double *vec1 = NULL, *vec2 = NULL;

    MPI_Init(&argc, &argv);
    start = MPI_Wtime();

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &whoami);

    if (n % size != 0) {
        printf("Choose a number of processes that divides 10000000\n");
        exit(1);
    }

    int chunk = n / size;

    double *buf1 = malloc(chunk * sizeof(double));          // recv buffer for MPI_Scatter
    double *buf2 = malloc(chunk * sizeof(double));
    double *gatherresult = malloc(size * sizeof(double));   // recv buffer for MPI_Allgather
    double result = 0, finalresult = 0;

    if (whoami == 0) {
        vec1 = malloc(n * sizeof(double));
        vec2 = malloc(n * sizeof(double));
        random_vector(vec1, n);   // fills the vector with random values (defined elsewhere)
        random_vector(vec2, n);
    }

    /* sends the divided arrays to the other processes */
    MPI_Scatter(vec1, chunk, MPI_DOUBLE, buf1, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(vec2, chunk, MPI_DOUBLE, buf2, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (whoami == 0) {
        end = MPI_Wtime();
        elapsed_time = end - start;
        printf("time taken %.4f seconds\n", elapsed_time);
    }

    for (i = 0; i < chunk; ++i) {
        result += buf1[i] * buf2[i];
    }

    printf("the sub result: #%d, %.2f\n", whoami, result);

    /* allgather: (sendbuf, number of elements in sendbuf, send type,
       recvbuf, number of elements received from each process, recv type, comm) */
    MPI_Allgather(&result, 1, MPI_DOUBLE, gatherresult, 1, MPI_DOUBLE, MPI_COMM_WORLD);

    for (i = 0; i < size; i++) {
        finalresult += gatherresult[i];
    }

    MPI_Barrier(MPI_COMM_WORLD);
    end = MPI_Wtime();
    elapsed_time = end - start;

    if (whoami == 0) {
        printf("finalresult is: %.2f\n", finalresult);
        printf("time taken %.4f seconds\n", elapsed_time);
        vecvec_test(n, vec1, vec2, finalresult);  // checks whether the result is correct (defined elsewhere)
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();

    return 0;
}

A distributed computation of the scalar product only makes sense if the vectors are stored in a distributed fashion. Otherwise, pushing the content of the big vectors each time over the network (or whatever other IPC mechanism is in place) from the root to the other processes takes more time than it takes a single-threaded process to do the whole work. The scalar product is a memory-bound problem, which means that current CPU cores are so fast that data coming from main memory, rather than from the CPU cache, arrives at a slower rate than the core is able to process it.
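To put rough numbers on that: with n = 10000000, the two vectors together hold 2 × 10^7 doubles, i.e. about 160 MB, and the two MPI_Scatter calls have to push essentially all of it out of rank 0 before any process can start multiplying. The dot product itself then performs only one multiply-add per 16 bytes read, so on a dual-core machine the scatter traffic alone can plausibly account for the ~80% of the run time you measured.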

What you could do in order to demonstrate how MPI helps in this case is to modify the algorithm so that the vectors are scattered once at the beginning and then the distributed scalar product is computed many times:

MPI_Scatter(vec1, buf1);
MPI_Scatter(vec2, buf2);

// a good idea is to sync the processes before benchmarking
MPI_Barrier();

start = MPI_Wtime();

for (i = 1; i <= 1000; i++) {
   local_result = dotprod(buf1, buf2);
   MPI_Reduce(&local_result, &result, MPI_SUM);
}

end = MPI_Wtime();

printf("time per iteration: %f\n", (end - start) / 1000);

(Pseudocode, not real C++.)

You should see the time per iteration decrease with the number of MPI processes, provided that adding more MPI processes means more CPU sockets and therefore higher aggregate memory bandwidth. Notice the use of MPI_Reduce instead of MPI_Gather followed by a sum.
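For reference, a fully worked-out version of that sketch might look like the following. This is only an illustration under a few assumptions: n is divisible by the number of processes, and the random_vector() helper from the question's code is available. The scatter is done exactly once, and only the repeated dot product plus MPI_Reduce is timed.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

void random_vector(double *v, int n);   /* helper from the question, assumed available */

int main(int argc, char *argv[]) {
    int size, rank, i, iter;
    int n = 10000000;                   /* assumed divisible by the number of processes */
    const int iterations = 1000;
    double *vec1 = NULL, *vec2 = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int chunk = n / size;
    double *buf1 = malloc(chunk * sizeof(double));
    double *buf2 = malloc(chunk * sizeof(double));

    if (rank == 0) {
        vec1 = malloc(n * sizeof(double));
        vec2 = malloc(n * sizeof(double));
        random_vector(vec1, n);
        random_vector(vec2, n);
    }

    /* Distribute the data exactly once. */
    MPI_Scatter(vec1, chunk, MPI_DOUBLE, buf1, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(vec2, chunk, MPI_DOUBLE, buf2, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Sync so all ranks start the benchmark together. */
    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();

    double result = 0.0;
    for (iter = 0; iter < iterations; iter++) {
        double local_result = 0.0;
        for (i = 0; i < chunk; i++)
            local_result += buf1[i] * buf2[i];
        /* Sum of the partial products ends up on rank 0. */
        MPI_Reduce(&local_result, &result, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    }

    double end = MPI_Wtime();
    if (rank == 0)
        printf("result %.2f, time per iteration: %f s\n", result, (end - start) / iterations);

    free(buf1); free(buf2);
    if (rank == 0) { free(vec1); free(vec2); }
    MPI_Finalize();
    return 0;
}

Launched with, for example, mpirun -np 2, the per-iteration time no longer includes the one-time scatter, so it reflects only the distributed dot product itself.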

