Three ways of iterating over tree-like ActiveRecord structures
You are building an e-commerce app and at some point you introduce a Category model as a part of Product data. A Category can have various subcategories, which are represented by the same model. At this point, it's obvious we are dealing with a tree-like data structure. Step by step, I'll introduce three ways of iteration over such a structure and compare their performance.
Recently, I ran into a problem of refactoring a method responsible for returning the whole set of tree node descendants. In our example case, descendants are subcategories of a given category. I'm going to walk you through the way of how I handled it.
Before we get into details behind it, let's prepare the project and sample data.
Preparing data
Let's generate a Rails app:
rails new shop --database=postgresql
Configure your database.yml
, and run:
rails db:create
Create the Category
model with a name and reference to itself:
rails g model category name:string category:references
Let's run the generated migration:
rails db:migrate
We have our database schema ready, but we still need to add relationship annotations to our Category
model.
class Category < ApplicationRecord
belongs_to :parent, class_name: 'Category', optional: true, foreign_key: :category_id
has_many :children, class_name: 'Category', dependent: :destroy
end
Notice the optional
parameter here. By default, Rails 5 will raise an error whenever you pass the nil
value for a belongs_to
relation. We want to skip that default setting since our root category will not have any parent.
Well done, now it's time to prepare some seed data.
# Root category
root = Category.create name: 'sport'
# Sport subcategories
basketball_cat = root.children.create name: 'basketball'
fitness_cat = root.children.create name: 'fitness'
# Basketball categories
basketball_cat.children.create name: 'clothing'
basketball_cat.children.create name: 'basketballs'
basketball_cat.children.create name: 'footwear'
# Fitness subcategories
fitness_cat.children.create name: 'dumbbells'
fitness_cat.children.create name: 'benches'
fitness_cat.children.create name: 'kettlebells'
Great, we have everything we need to start iterating over our tree data structure. It's time to implement the first method.
Simple iteration over while
The first solution is just an iteration via while
over each descendant node in our tree and pushing its children to the result array. Let's check it out:
def all_children_iteration
childs_to_visit = children.to_a
childs_to_return = []
while childs_to_visit.present?
current_node = childs_to_visit.shift
childs_to_return << current_node
childs_to_visit.concat(current_node.children)
end
childs_to_return
end
Fair simple. We go through each category, add its children to the result array and when there are no more categories to visit down the path, we return the result array.
Recursion
In this solution, we simply go through every category and by using the recursive strategy we add each one to the result array.
def all_children_recursion
children.flat_map do |child_cat|
child_cat.all_children_recursion << child_cat
end
end
It's worth noticing that with recursion we need to allocate more memory for the stack. In case of processing big sets of data, it might result in a stack overflow.
PostgreSQL solution
Now, here is the interesting part. In version 8.4, PostgreSQL
introduced "WITH RECURSIVE tree
" query (reference) which allows you to return children from an adjacency tree. We can leverage that in our Category
model. Firstly, let's define our last method:
def all_children_sql
Category.find_by_sql(recursive_tree_children_sql)
end
The SQL
will be pretty large as you will see in the moment. It's better to keep it separately. Here it is:
def recursive_tree_children_sql
columns = self.class.column_names
columns_joined = columns.join(',')
sql =
<<-SQL
WITH RECURSIVE category_tree (#{columns_joined}, level)
AS (
SELECT
#{columns_joined},
0
FROM categories
WHERE id = #{id}
UNION ALL
SELECT
#{columns.map { |col| 'cat.' + col }.join(',')},
ct.level + 1
FROM categories cat, category_tree ct
WHERE cat.category_id = ct.id
)
SELECT * FROM category_tree
WHERE level > 0
ORDER BY level, category_id, name;
SQL
sql.chomp
end
Here we define multi-line string which in the end needs .chomp
to get rid of new line characters. After all, this query returns the same data as our iterative and recursive functions (although it might return records in a different order). One big advantage of this approach is only one database call which results in fast response.
Performance
We are going to use built-in Benchmark
library to measure the performance of methods.
In this scenario, I generated 10000 categories randomly assigned to sample parents:
[].tap do |categories|
10_000.times do |i|
categories << Category.create!(name: "Cat #{i}", parent: categories.sample)
end
end
I also decided to disable SQL
logging in console with the following code:
ActiveRecord::Base.logger = nil
Testing code looks like this:
root = Category.first
iterative_diff = Benchmark.realtime do
root.all_children_iteration
end
# Reloading prevents SQL caching
root.reload
recursion_diff = Benchmark.realtime do
root.all_children_recursion
end
root.reload
sql_diff = Benchmark.realtime do
root.all_children_sql
end
puts "Iterative performance: #{iterative_diff}"
puts "Recursive performance: #{recursion_diff}"
puts "SQL performance: #{sql_diff}"
Results:
Iterative performance: 11.911259000000427
Recursive performance: 11.971864999999525
SQL performance: 0.31752699999924516
The results are mind-blowing. SQL in this case is much faster than both iterative and recursive methods. Why is it so?
For each node in a tree, we are performing sql request to fetch its children. As you can see, those requests are very time-consuming. Can we somehow reduce it?
Preloading children
Before we start to iterate over children subsets, we can use includes
method to preload them. However in this approach we assume that we already know the depth of a tree. In order to do that we can add the depth
attribute to the Category
model. It should store the actual depth of each node.
rails g migration add_depth_to_categories depth:integer
We also need before_create
callback in the Category
model:
before_create :assign_depth
def assign_depth
self.depth = (parent.present? ? parent.depth + 1 : 0)
end
Let's also define a method responsible for returning includes hash which we'll need later on:
def includes_hash
tree_depth = Category.order(depth: :desc).first.depth || 0
tree_depth.times.inject(:children) { |h| { children: h } }
end
We can test our methods once again but with preloading:
root = Category.first
iterative_diff = Benchmark.realtime do
root = Category.includes(Category.first.includes_hash).first
root.all_children_iteration
end
root.reload
recursion_diff = Benchmark.realtime do
root = Category.includes(Category.first.includes_hash).first
root.all_children_recursion
end
puts "Iterative performance: #{iterative_diff}"
puts "Recursive performance: #{recursion_diff}"
Results:
Iterative performance: 0.777115999999296
Recursive performance: 0.7672039999997651
As you can see, the solution presented above brings a huge improvement. Preloading reduces the execution time significantly. However, both methods are still slower than the 'SQL' method. You also need to keep in mind that every instance of node deletion might change the whole tree structure. In this case, you need to handle the depth update of each affected node.
In the next blog post I'll compare already presented solutions with two patterns designed for hierarchical data management: materialised path and nested set.
Photo by Yaoqi LAI on Unsplash