Thanks for filling out our form!
Here's what happens next:

We'll write you back as soon as possible, right now we might be very busy saving the World. Give us a day or two, we'll contact you for sure.


In the meantime, perhaps you would like to take a look at one of our blog posts that explains

How do we work with our Clients
How we can help you with your business

In Touch,
Prograils Team

Prograils blog

Szymon Soppa on 21.06.18 in Ruby on Rails
Post yaoqi lai 21901 unsplash

Three ways of iterating over tree-like ActiveRecord structures

You are building an e-commerce app and at some point you introduce a Category model as a part of Product data. A Category can have various subcategories, which are represented by the same model. At this point, it's obvious we are dealing with a tree-like data structure. Step by step, I'll introduce three ways of iteration over such a structure and compare their performance.

Recently, I ran into a problem of refactoring a method responsible for returning the whole set of tree node descendants. In our example case, descendants are subcategories of a given category. I'm going to walk you through the way of how I handled it.

Before we get into details behind it, let's prepare the project and sample data.

Preparing data

Let's generate a Rails app:

rails new shop --database=postgresql

Configure your database.yml, and run:

rails db:create

Create the Category model with a name and reference to itself:

rails g model category name:string category:references

Let's run the generated migration:

rails db:migrate

We have our database schema ready, but we still need to add relationship annotations to our Category model.

class Category < ApplicationRecord
  belongs_to :parent, class_name: 'Category', optional: true, foreign_key: :category_id
  has_many :children, class_name: 'Category', dependent: :destroy
end

Notice the optional parameter here. By default, Rails 5 will raise an error whenever you pass the nil value for a belongs_to relation. We want to skip that default setting since our root category will not have any parent.

Well done, now it's time to prepare some seed data.

# Root category
root = Category.create name: 'sport'
# Sport subcategories
basketball_cat = root.children.create name: 'basketball'
fitness_cat = root.children.create name: 'fitness'
# Basketball categories
basketball_cat.children.create name: 'clothing'
basketball_cat.children.create name: 'basketballs'
basketball_cat.children.create name: 'footwear'
# Fitness subcategories
fitness_cat.children.create name: 'dumbbells'
fitness_cat.children.create name: 'benches'
fitness_cat.children.create name: 'kettlebells'

Great, we have everything we need to start iterating over our tree data structure. It's time to implement the first method.

Simple iteration over while

The first solution is just an iteration via while over each descendant node in our tree and pushing its children to the result array. Let's check it out:

def all_children_iteration
  childs_to_visit = children.to_a
  childs_to_return = []
  while childs_to_visit.present?
    current_node = childs_to_visit.shift
    childs_to_return << current_node
    childs_to_visit.concat(current_node.children)
  end
  childs_to_return
end

Fair simple. We go through each category, add its children to the result array and when there are no more categories to visit down the path, we return the result array.

Recursion

In this solution, we simply go through every category and by using the recursive strategy we add each one to the result array.

def all_children_recursion
  children.flat_map do |child_cat|
    child_cat.all_children_recursion << child_cat
  end
end

It's worth noticing that with recursion we need to allocate more memory for the stack. In case of processing big sets of data, it might result in a stack overflow.

PostgreSQL solution

Now, here is the interesting part. In version 8.4, PostgreSQL introduced "WITH RECURSIVE tree" query (reference) which allows you to return children from an adjacency tree. We can leverage that in our Category model. Firstly, let's define our last method:

def all_children_sql
  Category.find_by_sql(recursive_tree_children_sql)
end

The SQL will be pretty large as you will see in the moment. It's better to keep it separately. Here it is:

def recursive_tree_children_sql
  columns = self.class.column_names
  columns_joined = columns.join(',')
  sql =
    <<-SQL
      WITH RECURSIVE category_tree (#{columns_joined}, level)
      AS (
        SELECT
          #{columns_joined},
          0
        FROM categories
        WHERE id = #{id}

        UNION ALL
        SELECT
          #{columns.map { |col| 'cat.' + col }.join(',')},
          ct.level + 1
        FROM categories cat, category_tree ct
        WHERE cat.category_id = ct.id
      )
      SELECT * FROM category_tree
      WHERE level > 0
      ORDER BY level, category_id, name;
    SQL
  sql.chomp
end

Here we define multi-line string which in the end needs .chomp to get rid of new line characters. After all, this query returns the same data as our iterative and recursive functions (although it might return records in a different order). One big advantage of this approach is only one database call which results in fast response.

Performance

We are going to use built-in Benchmark library to measure the performance of methods.

In this scenario, I generated 10000 categories randomly assigned to sample parents:

[].tap do |categories|
  10_000.times do |i|
    categories << Category.create!(name: "Cat #{i}", parent: categories.sample)
  end
end

I also decided to disable SQL logging in console with the following code:

ActiveRecord::Base.logger = nil

Testing code looks like this:

root = Category.first

iterative_diff = Benchmark.realtime do
  root.all_children_iteration
end

# Reloading prevents SQL caching
root.reload

recursion_diff = Benchmark.realtime do
  root.all_children_recursion
end

root.reload

sql_diff = Benchmark.realtime do
  root.all_children_sql
end

puts "Iterative performance: #{iterative_diff}"
puts "Recursive performance: #{recursion_diff}"
puts "SQL performance: #{sql_diff}"

Results:

Iterative performance: 11.911259000000427
Recursive performance: 11.971864999999525
SQL performance: 0.31752699999924516

The results are mind-blowing. SQL in this case is much faster than both iterative and recursive methods. Why is it so?

For each node in a tree, we are performing sql request to fetch its children. As you can see, those requests are very time-consuming. Can we somehow reduce it?

Preloading children

Before we start to iterate over children subsets, we can use includes method to preload them. However in this approach we assume that we already know the depth of a tree. In order to do that we can add the depth attribute to the Category model. It should store the actual depth of each node.

rails g migration add_depth_to_categories depth:integer

We also need before_create callback in the Category model:

before_create :assign_depth

def assign_depth
  self.depth = (parent.present? ? parent.depth + 1 : 0)
end

Let's also define a method responsible for returning includes hash which we'll need later on:

def includes_hash
  tree_depth = Category.order(depth: :desc).first.depth || 0
  tree_depth.times.inject(:children) { |h| { children: h } }
end

We can test our methods once again but with preloading:

root = Category.first

iterative_diff = Benchmark.realtime do
  root = Category.includes(Category.first.includes_hash).first
  root.all_children_iteration
end

root.reload

recursion_diff = Benchmark.realtime do
  root = Category.includes(Category.first.includes_hash).first
  root.all_children_recursion
end

puts "Iterative performance: #{iterative_diff}"
puts "Recursive performance: #{recursion_diff}"

Results:

Iterative performance: 0.777115999999296
Recursive performance: 0.7672039999997651

As you can see, the solution presented above brings a huge improvement. Preloading reduces the execution time significantly. However, both methods are still slower than the 'SQL' method. You also need to keep in mind that every instance of node deletion might change the whole tree structure. In this case, you need to handle the depth update of each affected node.

In the next blog post I'll compare already presented solutions with two patterns designed for hierarchical data management: materialised path and nested set.

Photo by Yaoqi LAI on Unsplash

Share on
comments powered by Disqus