F# - Parallel Filtering of CUDA Array (Compute Architecture)
I'm trying to develop a framework for performing standard functional sequence operations (map, filter, fold, etc.) in F# where the computations are performed on the GPU (using CUDA) instead of the CPU.
I've had success implementing a CUDA map function such that I can write:
    let result = cudaArray |> CudaArray.map <@ fun x -> x ** 3.0 @> |> Array.ofCudaArray
This is relatively straightforward because all of the operations are performed elementwise.
Now, I'm interested in writing a similar system for filtering by a predicate. I'm thinking of implementing the predicate as a map function that returns a boolean array. I then need to find a way of reducing the array of interest to one that contains only the elements whose matching boolean array element is true.
I found an article (http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/) which describes what looks like a nice method of solving this problem. It boils down to using the following function to index the elements in the destination array:
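The flag-then-compact idea described above can be sketched in CUDA C++ roughly as follows. This is only an illustration, not the framework's actual code: the kernel names `flagKernel` and `scatterKernel` and the `x > 0.0` predicate are hypothetical, and it assumes Thrust is available for the scan step. An exclusive prefix sum over the 0/1 flags gives each kept element its index in the output array.

```cuda
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/scan.h>

// Step 1 (the "map"): write 1 for elements to keep, 0 otherwise.
// The predicate x > 0.0 is a placeholder chosen for illustration.
__global__ void flagKernel(const double *in, int *flags, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = (in[i] > 0.0) ? 1 : 0;
}

// Step 3: every kept element is written to the slot computed by the scan.
__global__ void scatterKernel(const double *in, const int *flags,
                              const int *offsets, double *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i]) out[offsets[i]] = in[i];
}

int main() {
    const int n = 8;
    double h[n] = {1.0, -2.0, 3.0, -4.0, 5.0, 6.0, -7.0, 8.0};

    thrust::device_vector<double> in(h, h + n);
    thrust::device_vector<int> flags(n), offsets(n);
    thrust::device_vector<double> out(n);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    flagKernel<<<blocks, threads>>>(thrust::raw_pointer_cast(in.data()),
                                    thrust::raw_pointer_cast(flags.data()), n);

    // Step 2: exclusive prefix sum of the flags yields destination indices.
    thrust::exclusive_scan(flags.begin(), flags.end(), offsets.begin());

    scatterKernel<<<blocks, threads>>>(thrust::raw_pointer_cast(in.data()),
                                       thrust::raw_pointer_cast(flags.data()),
                                       thrust::raw_pointer_cast(offsets.data()),
                                       thrust::raw_pointer_cast(out.data()), n);

    // Number of kept elements = last offset + last flag.
    int lastOffset = offsets[n - 1];
    int lastFlag = flags[n - 1];
    int kept = lastOffset + lastFlag;
    out.resize(kept);

    // Copy the compacted result back to the host and print it.
    thrust::host_vector<double> result = out;
    for (int i = 0; i < kept; ++i) printf("%g ", result[i]);
    printf("\n");
    return 0;
}
```

This variant preserves the input order and uses no warp-level intrinsics, at the cost of two extra passes over the data. If calling Thrust directly is acceptable, `thrust::copy_if` performs the same compaction in a single call.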
    // warp-aggregated atomic increment
    __device__
    int atomicAggInc(int *ctr) {
      int mask = __ballot(1);
      // select the leader
      int leader = __ffs(mask) - 1;
      // leader does the update
      int res;
      if(lane_id() == leader)
        res = atomicAdd(ctr, __popc(mask));
      // broadcast result
      res = warp_bcast(res, leader);
      // each thread computes its own value
      return res + __popc(mask & ((1 << lane_id()) - 1));
    } // atomicAggInc
The problem is that, as I understand it from the article, warp_bcast is only supported on cards of compute capability 3.0 or greater.
I am interested to know if there is an alternative to warp_bcast for compute capability < 3.0, or if there is some other approach I can use to solve this problem without sacrificing all of the huge performance gains described in the article. (To be clear, I'm absolutely open to approaches different from the one described in the article if they can help me crack this.)
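One candidate that might be worth evaluating is to route the broadcast through shared memory instead of a shuffle. The sketch below is a guess at how that could look, not something taken from the article: `warp_bcast_smem`, `atomicAggIncSmem` and `filterPositive` are hypothetical names, and it assumes a compute capability 2.x device (for `__ballot`), a toolkit old enough to target it, 1D blocks of at most 1024 threads, and the lockstep warp execution of those older architectures.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define WARP_SZ 32

__device__ inline int lane_id() { return threadIdx.x % WARP_SZ; }

// Hypothetical stand-in for warp_bcast: the leader lane publishes its value
// in a per-warp shared-memory slot and every active lane reads it back.
// Relies on lanes of a warp executing in lockstep (assumed for CC 2.x),
// with `volatile` keeping the compiler from caching the slot in a register.
__device__ int warp_bcast_smem(int v, int leader) {
    __shared__ volatile int slot[32];          // one slot per warp (<= 1024 threads)
    int warp = threadIdx.x / WARP_SZ;
    if (lane_id() == leader) slot[warp] = v;
    return slot[warp];
}

// Same structure as the article's atomicAggInc, with the broadcast swapped out.
__device__ int atomicAggIncSmem(int *ctr) {
    unsigned int mask = __ballot(1);           // lanes currently active
    int leader = __ffs(mask) - 1;              // lowest active lane
    int res = 0;
    if (lane_id() == leader)
        res = atomicAdd(ctr, __popc(mask));    // reserve one slot per active lane
    res = warp_bcast_smem(res, leader);
    // offset of this lane among the active lanes below it
    return res + __popc(mask & ((1u << lane_id()) - 1));
}

// Placeholder predicate (x > 0.0), chosen only for illustration.
__global__ void filterPositive(double *dst, int *count, const double *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && src[i] > 0.0)
        dst[atomicAggIncSmem(count)] = src[i];
}

int main() {
    const int n = 8;
    double h_src[n] = {1.0, -2.0, 3.0, -4.0, 5.0, 6.0, -7.0, 8.0};
    double *d_src, *d_dst;
    int *d_count;
    cudaMalloc(&d_src, n * sizeof(double));
    cudaMalloc(&d_dst, n * sizeof(double));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_src, h_src, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int));

    filterPositive<<<1, 32>>>(d_dst, d_count, d_src, n);

    int count = 0;
    cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    printf("kept %d elements\n", count);

    cudaFree(d_src);
    cudaFree(d_dst);
    cudaFree(d_count);
    return 0;
}
```

With more than one warp, the order of elements in the destination array is not guaranteed to match the input order, since warps reserve their output ranges in whatever order their atomics complete. Whether this keeps the performance gains described in the article on older cards would need to be measured.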