F# - Parallel Filtering of CUDA Array (Compute Architecture)


I am trying to develop a framework for performing standard functional sequence operations (map, filter, fold, etc.) in F#, where the computations are performed on the GPU (using CUDA) instead of the CPU.

I've had some success implementing a CUDA map function, such that I can write:

    let result =
        cudaarray
        |> cudaarray.map <@ fun x -> x ** 3.0 @>
        |> array.ofcudaarray

This is relatively straightforward because all of the operations are performed elementwise.
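To illustrate why the elementwise case is simple, here is a minimal CUDA sketch of the kind of kernel such a map could compile down to. The names `mapCube` and `cubeOnGpu` are hypothetical and not part of the framework described above. Each thread handles exactly one element and never needs to know what any other thread is doing.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per element: out[i] = in[i]^3, mirroring `fun x -> x ** 3.0`.
// (Hypothetical kernel, for illustration only.)
__global__ void mapCube(const double *in, double *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i] * in[i];
}

// Hypothetical host wrapper: copy to the device, launch, copy back.
// Error checking is omitted to keep the sketch short.
std::vector<double> cubeOnGpu(const std::vector<double> &host) {
    int n = static_cast<int>(host.size());
    std::vector<double> result(n);
    if (n == 0) return result;

    double *dIn = nullptr, *dOut = nullptr;
    size_t bytes = n * sizeof(double);
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, host.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
    mapCube<<<blocks, threads>>>(dIn, dOut, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

The output index equals the input index, so no coordination between threads is required. That is exactly the property a filter lacks.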

Now, I am interested in writing a similar system for filtering by a predicate. I'm thinking of implementing the predicate as a map function that returns a boolean array, but I then need to find a way of reducing the array of interest to one containing only the elements whose matching boolean array element has a value of true.
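One common way to turn a flag array into a compacted array is flag, scan, scatter: compute 0/1 flags with a map, take an exclusive prefix sum of the flags to get each kept element's destination index, then scatter. The sketch below assumes the Thrust library that ships with the CUDA toolkit is available, and it hard-codes an example predicate (`x > 0`). The names `flagPositive`, `scatterKept` and `filterPositive` are hypothetical.

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/copy.h>
#include <vector>

// Step 1: the predicate as a map, producing 0/1 flags (example predicate: x > 0).
__global__ void flagPositive(const double *in, int *flags, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = in[i] > 0.0 ? 1 : 0;
}

// Step 3: each kept element is written to the index computed by the scan.
__global__ void scatterKept(const double *in, const int *flags,
                            const int *pos, double *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i]) out[pos[i]] = in[i];
}

// Hypothetical host wrapper tying the three steps together.
std::vector<double> filterPositive(const std::vector<double> &host) {
    int n = static_cast<int>(host.size());
    if (n == 0) return std::vector<double>();

    thrust::device_vector<double> in(host.begin(), host.end());
    thrust::device_vector<int> flags(n), pos(n);
    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    flagPositive<<<blocks, threads>>>(thrust::raw_pointer_cast(in.data()),
                                      thrust::raw_pointer_cast(flags.data()), n);

    // Step 2: exclusive prefix sum, pos[i] = number of kept elements before i.
    thrust::exclusive_scan(flags.begin(), flags.end(), pos.begin());

    // Total kept = last scan value + last flag (each read copies one int back).
    int lastPos = pos[n - 1];
    int lastFlag = flags[n - 1];
    int kept = lastPos + lastFlag;

    thrust::device_vector<double> out(kept);
    scatterKept<<<blocks, threads>>>(thrust::raw_pointer_cast(in.data()),
                                     thrust::raw_pointer_cast(flags.data()),
                                     thrust::raw_pointer_cast(pos.data()),
                                     thrust::raw_pointer_cast(out.data()), n);

    std::vector<double> result(kept);
    thrust::copy(out.begin(), out.end(), result.begin());
    return result;
}
```

This variant uses no warp-level intrinsics and preserves the input order of the kept elements. The price is extra passes over the data and two temporary integer arrays.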

I found this article (http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/) which describes what looks like a nice method of solving this problem. It boils down to using the following function to index the elements in the destination array:

    // warp-aggregated atomic increment
    __device__ int atomicAggInc(int *ctr) {
      int mask = __ballot(1);
      // select the leader
      int leader = __ffs(mask) - 1;
      // the leader does the update
      int res;
      if (lane_id() == leader)
        res = atomicAdd(ctr, __popc(mask));
      // broadcast the result
      res = warp_bcast(res, leader);
      // each thread computes its own value
      return res + __popc(mask & ((1 << lane_id()) - 1));
    } // atomicAggInc

The problem is that, as I understand the article, warp_bcast is only supported on cards with compute capability 3.0 or greater.

I am interested to know whether there is an alternative to warp_bcast for compute capability < 3.0, or whether there is some other approach I can use to solve this problem without sacrificing all of the huge performance gains described in the article. (To be clear, I'm absolutely open to approaches other than the one described in the article if they can help me crack this.)

