How to optimize DB read operations in Ruby on Rails

Premise

Ruby on Rails is rarely seen as a framework oriented towards speed and performance. And that's true: Rails was born as a tool to speed up development and make developers' lives easier.

This must be contextualized, though.

Those who say this are usually talking about scaling problems. That is to say, when requests and concurrency grow, Rails is easily outperformed by more performant frameworks and languages: Rust, Elixir, and Java are good examples.

The point is that when we talk about this kind of problem we are talking about very heavy loads: thousands or tens of thousands of requests per minute.

Under smaller loads Rails has no particular problems, and performance is generally more of a water cooler topic among developers.

We can agree that a 5ms response time is nicer to see than a 50ms one, but the end user will rarely notice the difference, as we can expect 100-300ms of network overhead on top of either.

So if you have performance problems in an application with a modest load, the problem is your code, not Rails.

Don't get discouraged, this means two things:

  • there are a lot of measures you can take to increase performance;
  • you DO NOT have to rewrite your app in Rust🚀.

Note

Rails in dev mode is significantly slower than production mode. Test in production mode before concluding that your app is slow.

Query performance

Most Rails performance problems are not actual Rails problems, but rather problems in how we use ActiveRecord and in the number of queries run for each request.

To check whether this is our case, it's enough to run the slow network request and check the server logs for something like this:

Completed 200 OK in 72ms (Views: 43.6ms | ActiveRecord: 24.6ms | Allocations: 27391)

This is the recap of the request time, broken down by component. As you can see, this request is fairly fast and ActiveRecord has a moderate impact on the overall time.

On the contrary this log:

Completed 200 OK in 2800ms (Views: 80.6ms | ActiveRecord: 2719.2ms | Allocations: xxxx)

means that there is a problem: the queries take too long!

First we have to identify which kind of queries are slow: read queries (SELECT) or write queries (INSERT, UPDATE, DELETE).
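
If the logs alone are not enough, a quick way to spot the offending statements is to subscribe to ActiveRecord's instrumentation events. A minimal sketch (the 100ms threshold, the file name and the log format are arbitrary choices, not Rails defaults):

  # config/initializers/slow_query_log.rb (hypothetical initializer)
  ActiveSupport::Notifications.subscribe('sql.active_record') do |_name, started, finished, _id, payload|
    duration_ms = (finished - started) * 1000
    # flag anything slower than an arbitrary 100ms threshold
    Rails.logger.warn("SLOW QUERY (#{duration_ms.round(1)}ms): #{payload[:sql]}") if duration_ms > 100
  end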

Read queries

As a rule of thumb: the fewer queries, the better. This is because:

  • Running a query has a fixed network and database processing overhead, regardless of the query's complexity: running 10 light queries costs more than running one big query (see the sketch after this list).
  • Processing a query's result is not free in Rails either, as we have to instantiate models from every result set.
  • Databases are generally faster than Rails: it's better to make the database do the heavy searching and filtering work.
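
To make the first point concrete, here is a sketch of the same lookup done as many small round trips versus a single query (the User model and the ids variable are illustrative):

  # one query per id: each call pays the fixed overhead
  users = ids.map { |id| User.find(id) }

  # one single round trip: the database does all the work at once
  users = User.where(id: ids).to_a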

The n+1 queries problem

Rails lets us handle models and relations in a very simple way, but this can lead to unoptimized queries. A typical problem is the "n+1 queries problem".

This is the use case: we want to iterate over a collection of models and, for each record, access one of its relationships to extract some attribute from it.

Normally, every time we access a relationship on a model, Rails runs a SELECT to load the relationship's data. The same holds for the fetched relationships: accessing a relationship of the previously loaded one triggers another SELECT, and so on.

Naturally, the more records, the more queries. An example:

  users = User.all
  commenters = users.flat_map do |user|
    # calling user.posts triggers "SELECT * FROM posts WHERE user_id = ?"
    user.posts.flat_map do |post|
      # post.comments triggers "SELECT * FROM comments WHERE post_id = ?"
      # each comment.author triggers "SELECT * FROM users WHERE id = ? LIMIT 1"
      post.comments.map(&:author)
    end
  end

In this example, 100 users with 100 posts each, each post with 10 comments, means 1 query for the users, 100 for the posts, 10,000 for the comments and 100,000 for the authors: more than 110,000 queries in total. ARGH!

How can we solve this?

The idea is to preload the data with one (or a few) queries before the loop, avoiding all those per-record queries and having all the data ready from the beginning.

Rails offers three main ways to preload, each composable with the existing query:

  • The eager_load method alters the main query, adding a LEFT OUTER JOIN for each relationship to preload. It's the most efficient way in terms of query count, because only one (very large) query is executed, but it can conflict with other SQL constructs in use: for example it can interfere with the select, distinct and group methods and generate invalid SQL.
  • The preload method loads the requested relationships in separate queries, one for each relationship. It's a little less efficient than eager loading, but it's safer.
  • The includes method chooses, based on context, whether to use a LEFT OUTER JOIN or a separate query, checking for conflicts.

The previous example optimized looks like this:

  users = User.preload(posts: { comments: :author })
  commenters = users.flat_map do |user|
    # no query here
    user.posts.flat_map do |post|
      # nor here
      post.comments.map(&:author)
    end
  end

Using preload we will run one query for the users, one for the posts, one for the comments and one for the authors. A total of 4 queries which, even if large and time-consuming (say 50ms each, which is a lot), will still be reasonably fast: 200ms in total.

We are talking orders of magnitude fewer queries, just with a little optimization.
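
For comparison, here is how the same preload would look with the other two methods; a sketch, assuming the same models as above:

  # eager_load: one single query with a LEFT OUTER JOIN per relationship
  users = User.eager_load(posts: { comments: :author })

  # includes: lets Rails pick between the JOIN and the separate-queries strategy
  users = User.includes(posts: { comments: :author })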

Warning: the n+1 problem is often hidden, because Rails iterates over the data implicitly. For example, every time we serialize a model together with one of its relations, the problem is triggered.

Pro tip: there is a lovely little gem which can, in some specific cases, solve the problem automagically: goldiloader. It's always a good idea to install it.
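
Installing it is the usual Gemfile one-liner:

  gem 'goldiloader'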

Manual preloading

Sometimes Rails' automatic preloading cannot help us, but it can serve as inspiration for optimizing our own queries!

Manual preloading is most useful when we are looping over data stored in the database.

The idea is: instead of calling find_by! or where inside a loop, we preload the data before the loop and keep a key-value map to rapidly access the preloaded data.

For example: we want to update all the Posts in our app. Posts used to have a text field containing the author's name; we have since made the author a real relationship on the post, and now we want to migrate the data from the text field to the reference field.

We would usually do:

  Post.find_each do |p|
    p.user = User.find_by!(name: p.author_name) # a query for each post!
    p.save!
  end

Bad news: we have 100,000 posts, and Rails is still running after an hour. This is obviously slow.

Solution:

  # preloading all users in a name -> user map
  user_map = User.where(name: Post.select('DISTINCT author_name')).index_by(&:name)
  Post.find_each do |p|
    # users are already there, this is very fast
    p.user = user_map[p.author_name] || (raise ActiveRecord::RecordNotFound)
    p.save!
  end

This saves us all the find_by! calls: 50% of the total queries are now gone!

Unwanted Queries

Some Rails model methods trigger queries that are often unwanted, unexpected or unnecessary. It's important to know when and where this can happen!

count

Calling #count on a has_many relationship always triggers a query, even if the data has already been loaded. For example:

user = User.first
user.posts.each { |p| ... } # this fires "SELECT * FROM posts WHERE user_id = ?"
user.posts.size # posts are cached, no query
user.posts.count # this triggers SELECT COUNT(*) FROM posts WHERE user_id = ?

tl;dr: never use #count, use #size.

any?, exists? and present?

These methods can trigger an unwanted query if the relationship has not already been loaded. For example:

user = User.first
if user.posts.any? # SELECT 1 FROM posts WHERE user_id = ? LIMIT 1
  ...
end
user.posts.each { ...stuff... } # SELECT * FROM posts WHERE user_id = ?

Forcing the relationship loading with #load is enough to avoid the redundant SELECT.

user = User.first
if user.posts.load.any? # SELECT * FROM posts WHERE user_id = ?
  ...
end
user.posts.each { ...stuff... } # cached: no query

find_each

Have you ever been tempted to iterate over a whole Rails model with #each? Yeah, that's cool, but if your database is somewhat large you will see your console explode and your dev machine's RAM disappear in a sneeze.

This is because #each loads the whole table into memory with a SELECT * FROM table and then iterates one value at a time. Before the iteration even starts, the whole table, Rails overhead included, has to fit into memory. So what's the correct way to iterate over a large table?

This is where #find_each comes in. It works just like #each, but chunks the iteration into 1000-element batches: it loads and iterates over 1000 elements, frees the memory, then loads the next 1000.

The batch size can be controlled by setting the, surprise surprise, batch_size parameter.
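
For example (the model, the column and the batch size are purely illustrative):

  # loads and processes posts 500 at a time, keeping memory usage flat
  Post.find_each(batch_size: 500) do |post|
    post.update!(processed: true)
  end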

Database indexes

Indexes deserve a long and thorough discussion, which we will tackle in another article.

Indexes help if you already optimized the queries Rails-side, but the single query still takes too long to complete. Here are some quick tips.

Put indexes on all fields of tables with 1000+ records on which you expect to run a where or find_by: for example usernames, ids, emails, VAT numbers, etc.
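
As a sketch, adding such an index is a one-line migration (table and column names are illustrative, and the migration version depends on your Rails version):

  class AddIndexToUsersEmail < ActiveRecord::Migration[7.0]
    def change
      add_index :users, :email
    end
  end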

Consider that indexes work only in these cases:

  • if the query filters the dataset using equality operators;
  • if comparison operators on numbers are used (with some limitations);
  • if the query is a partial string search, but only when the match is on the start of the string, e.g. WHERE column LIKE 'something%', not WHERE column LIKE '%something%'.

To check whether a query uses an index you can ask your database to EXPLAIN it:

Post.where(user_id: 5).explain
# => EXPLAIN (SELECT * from posts [...]) ;

If you use PostgreSQL, you are done for almost every use case.

If you use MySQL, create a compound index on all tables with 10k+ records covering the fields/columns you often search together.

For example, if this query is run often:

User.where(first_name: 'john', last_name: 'doe')

It is a good idea to create a compound index on first_name and last_name:

  add_index :users, [:first_name, :last_name]

Note: it makes no sense to add an index on a field with low cardinality (for example an enum field with few values, or a boolean); it could do more harm than good.

Query caching

We alluded to query caching in a previous paragraph; let's dig deeper!

Within a single request, if the same query is run twice, Rails does not repeat it but serves a cached version of the results. This can be leveraged in many ways: for example, it makes no sense to run two queries, one after the other, when the second is just a filtered version of the first. It is much better to repeat the first query and then filter programmatically. Concretely, this is bad:

users = User.where(role: :admin) # first query
users.do_some_stuff
...
users = User.where(role: :admin).where(id: [10, 50, 100]) # this is not cached! you are wasting a query
users.do_some_stuff_again

This is good:

users = User.where(role: :admin) # first query
users.do_some_stuff
...
users = User.where(role: :admin).filter { |e| e.id.in?([10, 50, 100]) } # again the first query, which is cached
users.do_some_stuff_again

Overly large columns

Some models may contain particularly heavy columns, like large JSON or TEXT fields. If these fields are used only in some views it may help to do one of the following:

1. disable the column loading

In Rails a model can be configured to ignore a column, for example:

class User < ApplicationRecord
  self.ignored_columns = ['fat_json_column']
end

To load it when needed, you will have to select it manually with select, pluck or pick:

class User < ApplicationRecord
  self.ignored_columns = ['fat_json_column']
  def fat_json_column # this is almost like a computed attr
    # but it loads it from the database
    User.where(id: id).pick(:fat_json_column)
  end
end

2. move the field

The idea is simple: move the large column to another table with a 1-to-1 relationship to the main table, and then load the relationship only when needed:

class User < ApplicationRecord
  has_one :user_data
end
# fat_json_column is here
class UserData < ApplicationRecord
  belongs_to :user
end
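
Loading the heavy data then becomes an explicit, separate query; a sketch, assuming the models above:

user = User.first       # fat_json_column is not loaded here
data = user.user_data   # separate query, run only when we actually need the data
data.fat_json_column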

What’s next?

We saw how to optimize read queries. The next obvious step is to tackle write queries. See you in the next article of the series!

About Us

This article is part of the Mònade Developers Blog, written by Piero Dotti, edited and translated by Fabrizio Mele.