Implement recursive query in Hive using pyspark

I wanted to traverse through data in table which is in decision tree structure which parent child relationship . It it very easy to implement in oracle using CONNECT BY but we don't have this function in hive . So i implemented this in hive using hive and pyspark .

Below is code snippet which i used to implement this. Explanation follows

i=1

df=spark.sql("select perent_id,child_id from table_for_recursion where perent_id is null ")

df.createOrReplaceTempView("df")

spark.sql("insert overwrite table task_recursion_source select * from df")

spark.sql("insert overwrite table task_recursion_result select * from df")

while i > 0:

df1=spark.sql("select qar.perent_id,qar.child_id from table_for_recursion qar join task_recursion_source df on qar.perent_id=df.child_id ")

df1.createOrReplaceTempView("df1")

print(df1.count())

spark.sql("insert into table task_recursion_result select * from df1")

print ("appended rows")

spark.sql("insert overwrite table task_recursion_source select * from df1")

cnt=spark.sql("select count(1) from task_recursion_source ")

print(cnt.count())

i=cnt.collect()[0][0]

Let me Explain whats going on in above pyspark code .

table_for_recursion is the table which on which we need to implement recursive query in which there are two fields one is parent_id and child_id .

Final table of output is task_recursion_result which has all the data output of recursion results from the table table_for_recursion

First declare and set a variable i to 0

then create a dataframe which result where parent id is null and this will give you the root node which dont have any parent and then you can start traversing .

i have created two table task_recursion_source and task_recursion_result , one table task_recursion_source will hold only temorary data which will have only leaf node childs at each lavel.

i create a while loop and we run the loop till count in task_recursion_source becomes 0 as there will be no child nodes to process. in each loop we run the query

select qar.perent_id,qar.child_id from table_for_recursion qar join task_recursion_source df on qar.perent_id=df.child_id

this query will return all the childs of the parent which is currently child in task_recursion_source so we will overwrite task_recursion_source and apend table task_recursion_result

and then we will count the no of records in task_recursion_source and set this value to varible i so at the end when there are no childs to process it will set to 0 and loop will end and you will have all the result in table task_recursion_result

Please comment if you any questions .

Search This Blog

Hadoop Developers

Implement recursive query in Hive using pyspark

Comments

Post a Comment