Implement recursive query in Hive using pyspark

 I wanted to traverse through data in table which is in decision tree structure which parent child relationship . It it very easy to implement in oracle using CONNECT BY but we don't have this function in hive . So i implemented this in hive using hive and pyspark .

Below is code snippet which i used to implement this. Explanation follows 

 i=1

        df=spark.sql("select perent_id,child_id from table_for_recursion where perent_id is null ")

        df.createOrReplaceTempView("df")        

        spark.sql("insert overwrite table task_recursion_source select * from df")         

        spark.sql("insert overwrite table task_recursion_result select * from df") 

        while i > 0:

          df1=spark.sql("select qar.perent_id,qar.child_id from table_for_recursion qar join task_recursion_source df on  qar.perent_id=df.child_id ")

          df1.createOrReplaceTempView("df1")

          print(df1.count())          

          spark.sql("insert into table task_recursion_result select * from df1") 

          print ("appended rows")          

          spark.sql("insert overwrite table task_recursion_source select * from df1")           

          cnt=spark.sql("select count(1) from task_recursion_source ") 

          print(cnt.count())

          i=cnt.collect()[0][0]

Let me Explain whats going on in above pyspark code . 

table_for_recursion is the table which on which we need to implement recursive query in which there are two fields one is parent_id and child_id .

Final table of output is task_recursion_result which has all the data output of recursion results from the table table_for_recursion 

First  declare and set a variable i to 0

then create a dataframe which result where parent id is null and this will give you  the root node which dont have any parent and then you can start traversing .

i have created two table task_recursion_source  and task_recursion_result , one table task_recursion_source  will hold only temorary data which will have only leaf node childs at each lavel.

i create a while loop and we run the loop till count in task_recursion_source  becomes 0 as there will be no child nodes to process. in each loop we run the query 

select qar.perent_id,qar.child_id from table_for_recursion qar join task_recursion_source df on  qar.perent_id=df.child_id

this query will return all the childs of the parent which is currently child in  task_recursion_source  so we will overwrite task_recursion_source  and apend table task_recursion_result 

and then we will count the no of records in task_recursion_source and set this value to varible i so at the end when there are no childs to process it will set to 0 and loop will end and you will have all the result in table task_recursion_result 


Please comment if you any questions .


Comments