Best way to achieve CUDA vector diagonalization


What I want is to feed in an m x n matrix and, in parallel, construct n square diagonal matrices, one for each column of the matrix, perform an operation on each square diagonal matrix, and recombine the results. How do I do this?

So far, I start with an m x n matrix; it is the result of a previous matrix computation in which each element was calculated using the function y = f(g(x)).

This gives me a matrix with n column elements [f1, f2, ..., fn], where each fn represents a column vector of height m.

From here, I want to differentiate each column of the matrix with respect to g(x). Differentiating fn(x) w.r.t. g(x) results in a square matrix with elements f'(x). Under my constraint, this square matrix reduces to a Jacobian whose row elements sit along the diagonal of the square matrix and are equal to fn', with all other elements equal to zero. In other words, for a column of height m the Jacobian is an m x m matrix whose (i, i) entry is fn' evaluated at the i-th element and whose off-diagonal entries are zero.

Hence the reason why it is necessary to construct a diagonal matrix from each of the vectors fn.

To do this, I take a target vector defined as A (hA x 1), extracted from the larger A (m x n) matrix. I have prepared a zeroed matrix defined as C (hA x hA) that is used to hold the diagonals.

The aim is to diagonalize the vector into a square matrix, with each element of A sitting on the diagonal of C and everything else being zero.
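In serial terms this is just the following index mapping (a minimal CPU sketch with my own names, only to illustrate what the kernel below does):

// Minimal CPU reference sketch (illustration only): copy a column vector of
// length hA onto the main diagonal of a zeroed, row-major hA x hA matrix.
void diagonalizeColumn(const float *a, float *C, int hA)
{
    for (int i = 0; i < hA; ++i)
    {
        C[i * hA + i] = a[i];   // element (i, i) of C
    }
}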

There may be more efficient ways to accomplish this using a pre-built routine without building a whole new kernel, but please be aware that for these purposes, this method is necessary.

The kernel code (which works) to accomplish this is shown here:

__global__ void _cudaDiagonalizeTest(float *A, int wA, int hA,
                                     float *C, int wC, int hC)
{
    int ix, iy, idx;

    // global 2D thread coordinates
    ix = blockIdx.x * blockDim.x + threadIdx.x;
    iy = blockIdx.y * blockDim.y + threadIdx.y;

    // linear index into A
    idx = iy * wA + ix;

    // place element idx of A at position (idx, idx) of row-major C
    C[idx * (wC + 1)] = A[idx];
}

_cudaDiagonalizeTest<<<5, 1>>>(d_A, matrix_size.uiWA, matrix_size.uiHA,
                               d_C, matrix_size.uiWC, matrix_size.uiHC);
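To handle all n columns at once, I imagine extending it along these lines (an untested sketch of my own, assuming C is laid out as n consecutive row-major hA x hA blocks):

// Sketch only: one thread per element of A; column k of A writes onto the
// diagonal of the k-th hA x hA block of C.
__global__ void diagonalizeAllColumns(const float *A, int wA, int hA, float *C)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // 0 .. hA-1
    int col = blockIdx.y * blockDim.y + threadIdx.y;  // 0 .. wA-1

    if (row < hA && col < wA)
    {
        // element (row, col) of row-major A goes to position (row, row)
        // inside the col-th hA x hA block of C
        C[col * hA * hA + row * hA + row] = A[row * wA + col];
    }
}

launched with something like dim3 block(16, 16); dim3 grid((hA + 15) / 16, (wA + 15) / 16);.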

I am a bit suspicious of this naive approach to a solution, and was wondering if someone could give an example of how to do the same using

a) reduction

b) Thrust

For vectors with a large row size, I would like to be able to use the GPU's multithreading capabilities to chunk the task into small jobs and combine each result at the end with __syncthreads().

The picture below shows what the desired result is.

I have read NVIDIA's article on reduction, but did not manage to achieve the desired results.

Any assistance or explanation is welcomed.

[image: desired result]

Thanks.

[image: matrix A and the diagonalized matrix B]

Matrix A is the target with 4 columns. I want to take each column and copy its elements into matrix B as a diagonal, iterating through each column.

I created a simple example based on Thrust. It uses column-major order to store the matrices in a thrust::device_vector and should scale well to larger row/column counts.

Another approach would be based on Thrust's strided_range example.
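I have not written that version out in full, but a rough sketch could look like this, assuming the strided_range class from the Thrust examples is copied locally as strided_range.h (it is not part of the Thrust library itself):

// Sketch only: copy each column of a onto the diagonal of the corresponding
// rows x rows block of b using a strided destination range.
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include "strided_range.h"   // from the Thrust examples (assumed available)

int main()
{
    const int rows = 5;
    const int cols = 4;

    thrust::device_vector<int> a(rows * cols);
    thrust::sequence(a.begin(), a.end());

    // b holds cols square blocks of size rows x rows, stored column-major
    thrust::device_vector<int> b(rows * rows * cols, 0);

    typedef thrust::device_vector<int>::iterator Iterator;

    for (int k = 0; k < cols; ++k)
    {
        // in column-major order, the diagonal of the k-th block starts at
        // k*rows*rows and its elements are rows+1 apart
        strided_range<Iterator> diag(b.begin() + k * rows * rows,
                                     b.begin() + (k + 1) * rows * rows,
                                     rows + 1);
        thrust::copy(a.begin() + k * rows, a.begin() + (k + 1) * rows, diag.begin());
    }

    return 0;
}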

The complete example below does what you want (it fills the diagonals based on the input vector). However, depending on how you proceed from the resulting matrix to your "differentiating" step, it might still be worth investigating whether sparse storage (without the 0 entries) is possible, since that would reduce memory consumption and ease iterating (see the short sketch after the example output below).

#include <thrust/device_vector.h>
#include <thrust/scatter.h>
#include <thrust/sequence.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <iostream>

// print a column-major matrix of the given dimensions
template<typename V>
void print_matrix(const V& mat, int rows, int cols)
{
    for (int i = 0; i < rows; ++i)
    {
        for (int j = 0; j < cols; ++j)
        {
            std::cout << mat[i + j*rows] << "\t";
        }
        std::cout << std::endl;
    }
}

// maps the linear (column-major) index of an element of a to the linear
// index of its diagonal position in b
struct diag_index : public thrust::unary_function<int,int>
{
    diag_index(int rows) : rows(rows) {}

    __host__ __device__
    int operator()(const int index) const
    {
        return (index*rows + (index%rows));
    }

    const int rows;
};

int main()
{
    const int rows = 5;
    const int cols = 4;

    // allocate memory and fill with demo data
    // we use column-major order
    thrust::device_vector<int> a(rows*cols);
    thrust::sequence(a.begin(), a.end());

    thrust::device_vector<int> b(rows*rows*cols, 0);

    // fill the diagonals
    thrust::scatter(a.begin(), a.end(),
                    thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                                    diag_index(rows)),
                    b.begin());

    print_matrix(a, rows, cols);
    std::cout << std::endl;
    print_matrix(b, rows, rows*cols);
    return 0;
}

This example outputs:

0    5    10   15
1    6    11   16
2    7    12   17
3    8    13   18
4    9    14   19

0    0    0    0    0    5    0    0    0    0    10   0    0    0    0    15   0    0    0    0
0    1    0    0    0    0    6    0    0    0    0    11   0    0    0    0    16   0    0    0
0    0    2    0    0    0    0    7    0    0    0    0    12   0    0    0    0    17   0    0
0    0    0    3    0    0    0    0    8    0    0    0    0    13   0    0    0    0    18   0
0    0    0    0    4    0    0    0    0    9    0    0    0    0    14   0    0    0    0    19
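To illustrate the sparse idea mentioned above: if the operation you apply to each diagonal matrix afterwards is, say, a matrix-vector product (that part is my assumption, since the operation was not described), you never need to materialise the rows x rows blocks at all, because diag(a) * v is simply the element-wise product of a and v. A minimal Thrust sketch for one column:

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/sequence.h>

int main()
{
    const int rows = 5;

    // one column a of the original matrix and some vector v to multiply with
    thrust::device_vector<float> a(rows);
    thrust::device_vector<float> v(rows, 2.0f);
    thrust::sequence(a.begin(), a.end());

    // diag(a) * v == element-wise a * v, so the rows x rows matrix is never built
    thrust::device_vector<float> result(rows);
    thrust::transform(a.begin(), a.end(), v.begin(), result.begin(),
                      thrust::multiplies<float>());

    return 0;
}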
