Implement recursive query in Hive using pyspark
I wanted to traverse through data in table which is in decision tree structure which parent child relationship . It it very easy to implement in oracle using CONNECT BY but we don't have this function in hive . So i implemented this in hive using hive and pyspark .
Below is code snippet which i used to implement this. Explanation follows
i=1
df=spark.sql("select perent_id,child_id from table_for_recursion where perent_id is null ")
df.createOrReplaceTempView("df")
spark.sql("insert overwrite table task_recursion_source select * from df")
spark.sql("insert overwrite table task_recursion_result select * from df")
while i > 0:
df1=spark.sql("select qar.perent_id,qar.child_id from table_for_recursion qar join task_recursion_source df on qar.perent_id=df.child_id ")
df1.createOrReplaceTempView("df1")
print(df1.count())
spark.sql("insert into table task_recursion_result select * from df1")
print ("appended rows")
spark.sql("insert overwrite table task_recursion_source select * from df1")
cnt=spark.sql("select count(1) from task_recursion_source ")
print(cnt.count())
i=cnt.collect()[0][0]
Let me Explain whats going on in above pyspark code .
table_for_recursion is the table which on which we need to implement recursive query in which there are two fields one is parent_id and child_id .
Final table of output is task_recursion_result which has all the data output of recursion results from the table table_for_recursion
First declare and set a variable i to 0
then create a dataframe which result where parent id is null and this will give you the root node which dont have any parent and then you can start traversing .
i have created two table task_recursion_source and task_recursion_result , one table task_recursion_source will hold only temorary data which will have only leaf node childs at each lavel.
i create a while loop and we run the loop till count in task_recursion_source becomes 0 as there will be no child nodes to process. in each loop we run the query
select qar.perent_id,qar.child_id from table_for_recursion qar join task_recursion_source df on qar.perent_id=df.child_id
this query will return all the childs of the parent which is currently child in task_recursion_source so we will overwrite task_recursion_source and apend table task_recursion_result
and then we will count the no of records in task_recursion_source and set this value to varible i so at the end when there are no childs to process it will set to 0 and loop will end and you will have all the result in table task_recursion_result
Please comment if you any questions .
Comments
Post a Comment