How we got struck by 5–year–old implementation

15 November 2022 — Published originally on blog.arkency.com

How we got struck by 5–year–old implementation

Recently we discovered that we were wrong on computing lock key for acquiring advisory locks. It was already covered as an update to article about building read models, but we thought that telling the whole story behind the issue could be interesting for you.

Why we needed advisory lock

Please, have a look at this asynchronous event handler. There are two bounded contexts involved:

Banking — it takes care of bank account's technical information coming from 3rd party service
Accounting — it is interested in bank accounts but in terms of doing bookkeeping job like putting initial balance and its further changes into specific accounts

When bank account is created in Banking we need to reflect that in Accounting and provide bookkeeping identifier for that account. It is done under certain rules, eg. all the bank accounts need to own parent code 512000 and code which follows previous one like 512101, 512102, 512103, etc.

# frozen_string_literal: true

module Accounting
  class CreateBankAccount < Infra::EventHandler
    def call(event)
      tenant = ::Account.find(event.data.fetch(:tenant_id))
      bank_account = ::BankAccount.find(event.data.fetch(:id))
      bookkeeping_account = BookkeepingAccount.for(bank_account)

      BankAccount.create!(
        tenant: tenant,
        name: "#{bookkeeping_account.code} – Bank",
        display_name: "Bank",
        code: bookkeeping_account.code, # eg. 512101
        parent_code: bookkeeping_account.parent_code, # eg. 512000
      )
    end
  end
end

The code above is prone to concurrency issues. The business logic won't allow creating two Accounting::BankAccount with the same code for the given tenant. ActiveRecord::RecordNotUnique would pop up soon, obviously.

Advisory locks to the rescue

In 2017 while working on a different project, we came with idea of using advisory locks to implement pessimistic locking, pg_advisory_xact_lock(key bigint) specifically:

pg_advisory_xact_lock works the same as pg_advisory_lock, except the lock is automatically released at the end of the current transaction and cannot be released explicitly.

pg_advisory_lock locks an application-defined resource, which can be identified either by a single 64-bit key value or two 32-bit key values (note that these two key spaces do not overlap).

We needed a way to generate big integer to be passed as argument to pg_advisory_xact_lock. Using Object#hash for that purpose sounded natural since it generates an Integer hash value for this object.

Quick spike to verify our hypothesis:

uuid = "a2e920fd-c51a-44a8-924d-5dc6aaba9884"
lock_nr = uuid.hash
ActiveRecord::Base.transaction do
  puts "trying to obtain lock - #{Time.now}"
  ActiveRecord::Base.connection.execute "SELECT pg_advisory_xact_lock(#{lock_nr})"
  puts "lock granted, sleeping - #{Time.now}"
  sleep(50)
end
puts "lock released - #{Time.now}"

Lock no. 1:

   (0.5ms)  BEGIN
trying to obtain lock - 2017-06-28 10:05:44 +0200
   (0.7ms)  SELECT pg_advisory_xact_lock(1924743033351481473)
lock granted, sleeping - 2017-06-28 10:05:44 +0200

Lock no. 2:

trying to obtain lock - 2017-06-28 10:05:46 +0200
   (48570.8ms)  SELECT pg_advisory_xact_lock(1924743033351481473)
lock granted, sleeping - 2017-06-28 10:06:34 +0200

Proof of concept worked. Let's implement it then:

# frozen_string_literal: true

class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  def self.with_advisory_lock(*args)
    transaction do
      bigint = args.join.hash
      ApplicationRecord.connection.execute("SELECT pg_advisory_xact_lock(#{bigint})")
      yield
    end
  end
end

module Accounting
  class CreateBankAccount < Infra::EventHandler
    def call(event)
      tenant = ::Account.find(event.data.fetch(:tenant_id))
      bank_account = ::BankAccount.find(event.data.fetch(:id))

      ApplicationRecord.with_advisory_lock(tenant.id) do
        bookkeeping_account = BookkeepingAccount.for(bank_account)

        BankAccount.create!(
          tenant: tenant,
          name: "#{bookkeeping_account.code} – Bank",
          display_name: "Bank",
          code: bookkeeping_account.code,
          parent_code: bookkeeping_account.parent_code,
        )
      end
    end
  end
end

tenant_id was taken as the input value for calculating big integer since we needed to guarantee the uniqueness in scope of the tenant.

Testing for concurrency issues

In 2015 Robert wrote a post on Testing race conditions in Rails apps. Since then, we know well how to test concurrent code, don't we?

# frozen_string_literal: true

require_relative "test_helper"
require "database_cleaner/active_record"

module Accounting
  class OnBankAccountCreatedConcurrencyTest < TestCase
    self.use_transactional_tests = false

    setup { DatabaseCleaner.strategy = [:truncation] }

    def test_concurrency
      begin
        concurrency_level = ActiveRecord::Base.connection.pool.size - 1
        assert concurrency_level >= 4

        bank_accounts = concurrency_level.times.map { create_bank_account }

        fail_occurred = false
        wait_for_it = true

        Thread.abort_on_exception = true
        threads =
          concurrency_level.times.map do |i|
            Thread.new do
              true while wait_for_it
              begin
                Accounting::CreateBankAccount.new.call(
                  Banking::BankAccountCreated.new(data: { tenant_id: 2137, id: bank_accounts.fetch(i).id }),
                )
              rescue ActiveRecord::RecordNotUnique
                fail_occurred = true
              end
            end
          end
        wait_for_it = false
        threads.each(&:join)

        refute fail_occurred
        assert_equal 4, Accounting::BankAccount.of_tenant(2137).where(parent_code: 512_000).size
      ensure
        ActiveRecord::Base.connection_pool.disconnect!
      end
    end

    teardown { DatabaseCleaner.clean }

    private

    def create_bank_account
      BankAccount.create!(
        connector_id: 12_345,
        balance_currency: Money::Currency.new("EUR").iso_code,
        balance_value: 1_000_000.00,
        external_id: SecureRandom.uuid,
      )
    end
  end
end

The test was green, code was acting properly, so we shipped the code to production and slept well.

Everything was fine until it wasn't

Few years later, ActiveRecord::RecordNotUnique strikes back. We were intrigued why this had happened, but had no clue. The issue was self healing, since the code was run asynchronously on sidekiq with retries on failure. Quick investigations didn't bring answer to the problem. The issue wasn't a trouble to the app, but it was rather popping up in Honeybadger making developers scratch their head again and again.

Then our teammate get those flashbacks from past project. He reminded himself that from time to time the similar issue occurred with advisory lock acquired in the same manner. You start discussing stack differences between those projects, what has changed in the past few months

— Oh, we've added another sidekiq process

You instantly run two separate irb processes to check whether this might be the case:

Process no. 1:

irb(main):001:0> 123456.hash
=> -169614201293062129

Process no. 2:

irb(main):001:0> 123456.hash
=> -4474522856021669622

— Boom! Roasted...

Properly compute your lock key

Our initial implementation of advisory_lock method didn't provide identical hash across different MRI processes and code was prone to ActiveRecord::RecordNotUnique errors:

# frozen_string_literal: true

class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  def self.with_advisory_lock(*args)
    transaction do
      bigint = args.join.hash
      ApplicationRecord.connection.execute("SELECT pg_advisory_xact_lock(#{bigint})")
      yield
    end
  end
end

It required two different processes using two separate database connections to prove that previous advisory_lock didn’t work as expected and allowed share of the same resource. Test setup didn't meet this criteria both in dev environment nor on CI.

There's important note on that in Ruby’s Object#hash docs being the key to our issues:

The hash value for an object may not be identical across invocations or implementations of Ruby. If you need a stable identifier across Ruby invocations and implementations you will need to generate one with a custom method.

We fixed it by creating custom hash_64() function in our PostgreSQL database:

create function hash_64(_identifier character varying) returns bigint
    language plpgsql
as
$$
DECLARE
hash bigint;
BEGIN
  select left('x' || md5(_identifier), 16)::bit(64)::bigint into hash;
  return hash;
  END;
  $$;

alter function hash_64(varchar) owner to dbuser;

It was then used to fix the implementation of advisory_lock:

# frozen_string_literal: true

class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  def self.with_advisory_lock(*args)
    transaction do
      ApplicationRecord.connection.execute("SELECT pg_advisory_xact_lock(hash_64('#{args.join}'))")
      yield
    end
  end

The hash_64() implementation was taken from Eventide codebase.

There are other alternative solutions, like use of Zlib#crc32 if you prefer to stick with Ruby to compute lock key:

# frozen_string_literal: true
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  def self.with_advisory_lock(*args)
    transaction do
      ApplicationRecord.connection.execute("SELECT pg_advisory_xact_lock(#{Zlib.crc32(args.join)})")
      yield
    end
  end