Klemens Lukaszczyk

Speed up your tests a few times by using partitions

I sometimes land on projects with a lot of integration tests that can’t run async. Waiting 20 minutes just to check I didn’t break anything gets old fast - and rewriting them to run async wasn’t an option because (1) there were 10k tests and (2) as always, there was more important stuff :D

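Running tests in partitions is the obvious fix: mix test --partitions N splits the test files into N groups, and MIX_TEST_PARTITION picks which group a given run executes, so the groups can run side by side as separate OS processes. It helped me cut test runs from 20 min to 6 min - but to get there I had to deal with a few problems first.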

Problem 1: DB deadlocks

ERROR: deadlock detected
DETAIL:
Process 12345 waits for ShareLock on transaction 67890; blocked by process 11111.
Process 11111 waits for ShareLock on transaction 12345; blocked by process 12345.

It happens when two (or more) transactions end up waiting on each other’s locks - and partitions sharing one database are the perfect setup for it. The fix is to give each partition its own database: read the MIX_TEST_PARTITION env var and append it to the base DB name:

# in config/runtime.exs
# "" when tests run without partitions, so the base name stays unchanged
partition = System.get_env("MIX_TEST_PARTITION", "")
# e.g. postgres://postgres:postgres@localhost:5432/dev_db
# fetch_env! raises if the variable is missing, instead of passing nil to <>
test_database_url = System.fetch_env!("TEST_DATABASE_URL")
# each partition gets its own database
# e.g. postgres://postgres:postgres@localhost:5432/dev_db1
# e.g. postgres://postgres:postgres@localhost:5432/dev_db2
test_database_url = test_database_url <> partition
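
To double-check which databases the partitions will end up using, the same string concatenation can be replayed in the shell. This is only a sketch: partition_db_url is a made-up helper, and the base URL is the example value from the comment in the config, not a real setting.

```shell
# Hypothetical helper: mirrors the config by appending the partition
# number to the base database URL.
partition_db_url() {
  base_url=$1
  partition=$2
  printf '%s%s\n' "$base_url" "$partition"
}

# Assumed base URL, copied from the example comment in the config.
BASE_URL="postgres://postgres:postgres@localhost:5432/dev_db"

# Print the URL each of the 3 partitions would connect to.
for partition in 1 2 3; do
  partition_db_url "$BASE_URL" "$partition"
done
```

Note that plain appending only works while the URL ends with the database name; a URL with query parameters after the name would need a different approach. Each of these databases also has to exist before its partition runs. In a project using Ecto, something like MIX_ENV=test MIX_TEST_PARTITION=2 mix ecto.create should take care of that, assuming the config reads the variable as shown.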

Problem 2: Not enough DB connections

** (DBConnection.ConnectionError) could not checkout the connection owned by #PID<0.84403.0>.
Reason: connection not available and request was dropped from queue after 723ms. You can configure how long requests wait in the queue using :queue_target and :queue_interval. See DBConnection.start_link/2 for more information

Postgres defaults max_connections to 100. Running 3 partitions means 3 Repo clients - if each pool is configured for 40, that’s 3 × 40 = 120 connections. You blow past the limit.

Bump max_connections on the database:

services:
  db:
    image: postgres:15.7
    command: ["postgres", "-c", "max_connections=1000"]
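
The arithmetic behind this problem is easy to check before picking a partition count. Here is a rough sketch; connections_needed and fits_connection_limit are made-up helper names, and the numbers are the ones from this post.

```shell
# Hypothetical helper: total connections the test run may open,
# i.e. one pool per partition.
connections_needed() {
  partitions=$1
  pool_size=$2
  echo $((partitions * pool_size))
}

# Hypothetical helper: succeeds when all pools fit under the server limit.
# Arguments: partitions, pool size, max_connections.
fits_connection_limit() {
  [ "$(connections_needed "$1" "$2")" -le "$3" ]
}

# 3 partitions with a pool of 40 each against the default limit of 100.
if fits_connection_limit 3 40 100; then
  echo "ok"
else
  echo "over the limit: $(connections_needed 3 40) > 100"
fi
```

With the default limit this reports 120 connections against 100; with max_connections=1000 the same check passes. If changing the database config isn't possible, lowering the pool size per partition is the other knob to turn.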

Problem 3: Starting many partitions is cumbersome

Running:

MIX_TEST_PARTITION=1 mix test --partitions 3
MIX_TEST_PARTITION=2 mix test --partitions 3
MIX_TEST_PARTITION=3 mix test --partitions 3

in separate terminals every time is not fun. xpanes can launch them in parallel panes from a single command:

N=3; xpanes -c "time MIX_TEST_PARTITION={} mix test --partitions $N" $(seq 1 $N)
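
Where xpanes or tmux isn't available (a CI runner, for example), plain background jobs can do a similar job. The function below is a sketch rather than a polished tool: run_partitions, LOG_DIR, and the partition_N.log file names are all made up for illustration.

```shell
# Hypothetical helper: run a command once per partition, in parallel.
# Usage: run_partitions <count> <command...>
# Each run gets its own MIX_TEST_PARTITION and writes to its own log file.
run_partitions() {
  total=$1
  shift
  log_dir=${LOG_DIR:-.}
  pids=""
  i=1
  while [ "$i" -le "$total" ]; do
    # Start this partition in the background, capturing stdout and stderr.
    MIX_TEST_PARTITION=$i "$@" > "$log_dir/partition_$i.log" 2>&1 &
    pids="$pids $!"
    i=$((i + 1))
  done
  # Wait for every partition; report failure if any of them failed.
  failed=0
  for pid in $pids; do
    wait "$pid" || failed=1
  done
  return "$failed"
}

# Example call (not executed here):
# run_partitions 3 mix test --partitions 3
```

The output lands in one log file per partition instead of live panes, which is less pleasant locally but keeps the exit code meaningful: the function returns non-zero if any partition fails.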

But - one solved problem reveals another. If you want to rerun only failed tests, mix test --failed reads from .mix_test_failures in _build. When partitions run in parallel they all overwrite the same file, so in practice you can’t rerun failed tests across partitions.

A small tool to make this easier

I built al_check to smooth all of this out. Once installed:

  • check --only test runs tests in 3 partitions by default
  • check --only test --partitions 4 changes the partition count
  • check --failed reruns failed tests from all partitions

It supports more checks too - here’s a look:

[Image: check --green run]

That’s it - hope this lowers the bar for running tests in partitions and gets your CI results back way faster :D

Cheers!
