hadoop - Spark insert to HBase slow
I am inserting into HBase using Spark, but it is slow: 60,000 records take 2-3 minutes, and I have 10 million records to save.
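import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.joda.time.format.{DateTimeFormat, DateTimeFormatter}

object WriteToHBase extends Serializable {

  def main(args: Array[String]) {
    val csvRows: RDD[Array[String]] = ...
    val dateFormatter = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")
    // Keep only the CSV columns that make up a user record.
    val usersRdd = csvRows.map(row => {
      new UserTable(row(0), row(1), row(2), row(9), row(10), row(11))
    })
    processUsers(sc, usersRdd, dateFormatter)
  }

  def processUsers(sc: SparkContext, usersRdd: RDD[UserTable], dateFormatter: DateTimeFormatter): Unit = {
    usersRdd.foreachPartition(part => {
      // One HBase connection per partition, opened on the executor.
      val conf = HBaseConfiguration.create()
      val table = new HTable(conf, tableName)
      part.foreach(userRow => {
        val id = userRow.id
        val name = userRow.name
        val date1 = dateFormatter.parseDateTime(userRow.date1)
        val hRow = new Put(Bytes.toBytes(id))
        // Bytes.toBytes has no DateTime overload, so store the epoch millis.
        hRow.add(cf, q, Bytes.toBytes(date1.getMillis))
        hRow.add(cf, q, Bytes.toBytes(name))
        ...
        // One put call per record.
        table.put(hRow)
      })
      table.flushCommits()
      table.close()
    })
  }
}

I am using the following in spark-submit: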
--num-executors 2 --driver-memory 2g --executor-memory 2g --executor-cores 2
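For scale, the figures in the question can be turned into a rough estimate. The following is a back-of-the-envelope sketch, assuming throughput stays constant for the full load, which the question does not confirm. The object and method names are made up for illustration.

```scala
object InsertEstimate {
  /** Records per second for `records` written in `seconds`. */
  def throughput(records: Long, seconds: Double): Double = records / seconds

  /** Hours needed to write `total` records at `recordsPerSecond`. */
  def hoursFor(total: Long, recordsPerSecond: Double): Double =
    total / recordsPerSecond / 3600.0

  def main(args: Array[String]): Unit = {
    val fast = throughput(60000L, 120.0) // 60,000 records in 2 minutes
    val slow = throughput(60000L, 180.0) // 60,000 records in 3 minutes
    println(f"throughput: $slow%.0f to $fast%.0f records/s")
    println(f"10 million records: ${hoursFor(10000000L, fast)}%.1f to ${hoursFor(10000000L, slow)}%.1f hours")
  }
}
```

At roughly 333 to 500 records per second, 10 million records would take somewhere between about 5.6 and 8.3 hours at the observed rate.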
It's slow because the implementation doesn't leverage the proximity of the data; a piece of the Spark RDD on one server may be transferred to an HBase RegionServer running on another server.
Currently there is no Spark RDD operation that uses the HBase data store in an efficient manner.
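Separately from data locality, the code in the question calls table.put once per record. In HBase client versions where HTable auto-flushes by default (an assumption about the asker's setup), each of those calls is its own round trip to a RegionServer. Below is a minimal sketch of grouping a partition's rows into batches so that one call carries many records. The helper writeInBatches and its parameters are hypothetical names, and the sink is left generic so the snippet runs without HBase on the classpath.

```scala
object BatchedWrites {
  /**
   * Feeds `rows` to `sink` in groups of at most `batchSize` and returns the
   * number of sink calls made. With HBase, `sink` could be something like
   *   batch => table.put(batch.map(toPut).asJava)
   * where `toPut` is a hypothetical UserTable => Put conversion and
   * `asJava` comes from scala.collection.JavaConverters.
   */
  def writeInBatches[T](rows: Iterator[T], batchSize: Int)(sink: Seq[T] => Unit): Int = {
    require(batchSize > 0, "batchSize must be positive")
    var calls = 0
    rows.grouped(batchSize).foreach { batch =>
      sink(batch)
      calls += 1
    }
    calls
  }

  def main(args: Array[String]): Unit = {
    // Stand-in sink that only records how many rows each call received.
    val sizes = scala.collection.mutable.ArrayBuffer.empty[Int]
    val calls = writeInBatches((1 to 10).iterator, 4) { batch =>
      sizes += batch.size
      ()
    }
    println(s"$calls sink calls with batch sizes ${sizes.mkString(", ")}")
  }
}
```

Inside foreachPartition, part would be passed as rows. The batch size is a tuning knob: larger batches mean fewer round trips but more memory held per task. Whether this helps in practice depends on the cluster and client version, so it is worth measuring rather than assuming.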