Order by and Join in SQL or spark or mapreduce -
i have 2 tables content below.
table 1: id1 id2 id3 id4 name descr status date 1 -12134 17773 8001300701101 name1 descr1 inactive 20121203 2 -12136 17773 8001300701101 name1 descr1 inactive 20121202 3 -12138 17785 9100000161822 name3 descr3 inactive 20121201 4 -12140 17785 9100000161822 name3 descr3 active 20121130 5 -12142 17787 8000500039106 name4 descr4 active 20121129 table2: id1 id2 id3 id4 name descr 0 17781 17773 8001300701101 name1 descr1 0 17783 17783 8001300060109 name2 descr2 0 17785 17785 9100000161822 name3 descr3 0 17787 17787 8000500039106 name4 descr4 0 17789 17789 0000080052364 name5 descr5 i trying below result.
id3 id4 name descr status date 17773 8001300701101 name1 descr1 inactive 20121202 17783 8001300060109 name2 descr2 null null 17785 9100000161822 name3 descr3 active 20121201 17787 8000500039106 name4 descr4 active 20121129 17789 0000080052364 name5 descr5 null null as per above i/p , o/p, 2 tables should joined based on columns id3, id4, name , desc. if active record exists, should return active record. if inactive record exists, oldest inactive record should joined.
i tried different queries no longer near answer wanted. 4 columns joined non primary fields not nulls. there can 1 many or many many relationship between 2 tables.
i working on apache phoenix , if solution in hadoop mapreduce or in apache spark ok.
a sample query have written follows.
select table2.*, table1.status, table1.date table1 right outer join table2 on table1.id3 = table2.id3 , table1.id4 = table2.id4 , table1.name = table2.name , table1.descr = table2.descr order (status) , order (date) can 1 me please?
you cannot straight join against table 1. instead, have join against multiple queries of table 1, joined together. count, going have do:
- a query find minimum
dateactiverecords in table 1 per id3, id4, etc. - a query find minimum
dateinactiverecords in table 1 - a full outer join between above 2 queries
coalesceselectactiveversusinactivefields.
something this:
val cookedtable1 = table1.filter( $"status" === "active" ).groupby( $"id3", $"id4", $"name", $"descr", $"status" ).agg( $"id3", $"id4", $"name", $"descr", $"status", min($"date") "date" ).join( table1.filter( $"status" === "inactive" ).groupby( $"id3", $"id4", $"name", $"descr", $"status" ).agg( $"id3", $"id4", $"name", $"descr", $"status", min($"date") "date" ).select( $"id3" "id3r", $"id4" "id4r", $"name" "namer", $"descr" "descrr", $"status" "statusr", $"date" "dater" ), $"id3" === $"id3r" , $"id4" === $"id4r" , $"name" === $"namer" , $"descr" === $"descrr", "full_outer" ) .select( coalesce($"id3", $"id3r") "id3", coalesce($"id4",$"id4r") "id4", coalesce($"name", $"namer") "name", coalesce($"descr", $"descrr") "descr", coalesce($"status", $"statusr") "status", coalesce($"date", $"dater") "date" ) given have in above table 1, result like:
cookedtable1.show id3 id4 name descr status date 17785 9100000161822 name3 descr3 active 20121130 17787 8000500039106 name4 descr4 active 20121129 17773 8001300701101 name1 descr1 inactive 20121202 now, using cookedtable1 in place of table1, same query did before:
cookedtable1.registertemptable("cookedtable1") val results = sqlcontext("select table2.*, cookedtable1.status, cookedtable1.date cookedtable1 right outer join table2 on cookedtable1.id3 = table2.id3 , cookedtable1.id4 = table2.id4 , cookedtable1.name = table2.name andcookedtable1.descr = table2.descr" ) this should results looking for.
Comments
Post a Comment