python - is it more efficient to use unions rather than joins in apache spark, or does it not matter? -

recently running job on apache spark cluster , going inner join on 2 rdds. thought calculation avoid join using union, reducebykey , filter instead. join doing under hood?

say have objects in rdd's have following structure:

{ 'key':'somekey', 'value': <some positive integer> }

then avoid join i'd write:

leftrdd = rdd1.map(lambda y: (y['key'], (1, y['value'], -1)) rightrdd = rdd2.map(lambda y: (y['key'], (0, -1, y['value'])) joinedrdd = (leftrdd + rightrdd) \     .reducebykey(lambda x,y: (max(x[0],y[0]), max(x[1],y[1]), max(x[2],y[2])) \     .filter(lambda y: y[1][0] == 1)

joinedrdd have same result if i'd done inner join, added complexity worth avoid join?

pyspark joins poor @ scalability - hunch @ manual rdd operations one.

in particular joins in pyspark lose partitioning - copartioned joins not supported.

for specifics: should careful on semantics of reducebykey: outputs same data structure input. may expecting different based on code.

take @ (pyspark) nested lists after reducebykey more info on reducebykey.

update

the native scala version more aggressive in retaining existing partitioning (not inducing full shuffle):

if (self.partitioner == some(partitioner)) {   self.mappartitions(iter => {     val context = taskcontext.get()     new interruptibleiterator(context, aggregator.combinevaluesbykey(iter, context))   }, preservespartitioning = true) } else {   new shuffledrdd[k, v, c](self, partitioner)     .setserializer(serializer)     .setaggregator(aggregator)     .setmapsidecombine(mapsidecombine) }

instead python version induces shuffle:

    shuffled = locally_combined.partitionby(numpartitions)

it reason had noted performance concern on pyspark reducebykey.

the overall 'answer' not clear cut yes or no: saying "could yes" - depends on how write custom pyspark rdd code vs using join() - induces shuffle.

Search This Blog

Call

python - is it more efficient to use unions rather than joins in apache spark, or does it not matter? -

Comments

Post a Comment