python - is it more efficient to use unions rather than joins in apache spark, or does it not matter? -
recently running job on apache spark cluster , going inner join on 2 rdds. thought calculation avoid join using union, reducebykey , filter instead. join doing under hood?
say have objects in rdd's have following structure:
{ 'key':'somekey', 'value': <some positive integer> }
then avoid join i'd write:
leftrdd = rdd1.map(lambda y: (y['key'], (1, y['value'], -1)) rightrdd = rdd2.map(lambda y: (y['key'], (0, -1, y['value'])) joinedrdd = (leftrdd + rightrdd) \ .reducebykey(lambda x,y: (max(x[0],y[0]), max(x[1],y[1]), max(x[2],y[2])) \ .filter(lambda y: y[1][0] == 1) joinedrdd have same result if i'd done inner join, added complexity worth avoid join?
pyspark joins poor @ scalability - hunch @ manual rdd operations one.
in particular joins in pyspark lose partitioning - copartioned joins not supported.
for specifics: should careful on semantics of reducebykey: outputs same data structure input. may expecting different based on code.
take @ (pyspark) nested lists after reducebykey more info on reducebykey.
update
the native scala version more aggressive in retaining existing partitioning (not inducing full shuffle):
if (self.partitioner == some(partitioner)) { self.mappartitions(iter => { val context = taskcontext.get() new interruptibleiterator(context, aggregator.combinevaluesbykey(iter, context)) }, preservespartitioning = true) } else { new shuffledrdd[k, v, c](self, partitioner) .setserializer(serializer) .setaggregator(aggregator) .setmapsidecombine(mapsidecombine) } instead python version induces shuffle:
shuffled = locally_combined.partitionby(numpartitions) it reason had noted performance concern on pyspark reducebykey.
the overall 'answer' not clear cut yes or no: saying "could yes" - depends on how write custom pyspark rdd code vs using join() - induces shuffle.
Comments
Post a Comment