Bhoos | Scale Tales: Walking the Tightrope of Migration

God could create the world in six days because he didn’t have to make it compatible with the previous version.
— Programmer folklore

Joining an organization with an existing codebase as a developer is rarely an easy task. Of course, there is the complaint about legacy code, and legacy tools. Many aspiring junior devs would feel that they could just rewrite components of a codebase from scratch in no time to replace the obsolete sections. While an apparently heroic task at proposal, what is much more difficult is to instead keep the good parts and for the rest, make backwards-compatible changes.

The following tweet sums up the hurdle presented by decision accumulation.

Legacy and heavily used systems have decision accumulation over the years:

some decisions were made because of the technology available at the time, some due to a feature need, some due to dependency management, some were made to make the system more secure or to test it.
— Sarah Drasner (@sarah_edo) December 21, 2022

Why We Got Rid of Jsonb

At Bhoos, our Marriage Card Game was scaling up. We were crossing a million sync requests(activities from the user end), and our user base grew daily. We had some database schema decisions that seemed good and performed decently until, down the line, these decisions started giving issues. One of these decisions was to store game-specific user data that client apps send (e.g., stats, achievements, bots) in a Postgresql jsonb column. This is near-identical to storing simply a JSON object (with better operations/indexing at the cost of higher initial parse time and larger disk space). This method was quite simple and flexible as well.

data column:
{
"bots": {
"selected": [ ... ],
"bot_id": {... },
},
"pref": { ... },
"marriage": {
"config": { "mans":3, "mode": "kidnap", ...},
"stats": {
"lastPlayed": 1669702024365,
"kidnap": {
"points": 113,
"gamesPlayed": 13,
"gamesWon": 6
...
},
"murder": { ... }
...
},
"achievements": { ... }
...
}
}

The alternative was to have normalized tables for each of the stats, achievements, etc. Client-side mobile apps already used JSON representation to pass around data, so no conversion in data format would be necessary at the server end for a jsonb column. The server would readily accept if the client-side code added new keys to the JSON. Yet, this singular column started giving us more and more headaches due to the reasons below.

Transparency and schema-adherence: We like Postgresql, we like relational databases, and we like the confidence and sanity that a strict schema enforces. The Jsonb column violated this sort of principle. The client apps could (and did) send practically anything packages as json, and the server would accept it. Since the client-apps are in-house, this was not as big an issue as it could have been. Still, we had deprecated keys still present in many users' data and many instances of data that would be constraint violations in otherwise normalized tables. Worst of all, such cases were not immediately apparent. We traced back many application-level errors to faulty data present in some rows, the likes of which we hadn’t anticipated. One would have to craft many queries and actively hunt for such illegal states, and fixing them was an even more tedious process. In short, the flexibility we were enjoying crossed the line, and turned into unneeded permissibility.
Performance: Postgresql, like many other databases follows the MVCC pattern, which means that to update any row, it needs to be copied entirely. The jsonb data column in our user table was unbounded and increasingly growing larger for most users (especially very active users). So, even other operations that updated the user table but did not affect the data column still had the performance hit due to needing to copy this column’s data. The common solution recommended is to extract such hot columns to a different table. Implementing that method with the jsonb column intact would still mean tiny updates to the data would require copying the entire tuple. The only way forward (within Postgres) was to break apart this conglomerate column into logically separated, normalized tables.

Challenges

In applications, such an architectural decision is hard enough to undo, one must consider all the side-effects and breaking changes the undo change might introduce. In online databases with high uptime, this is doubly difficult. The main issue we faced was ensuring near-100% uptime while preventing data loss. The detailed requirements were as follows.

Smooth online migration: Taking the database offline for a scheduled maintenance would be the safest and most reliable option. However, during testing, we found that with our current database infra, the migration itself would take at least a couple of days. Such a long downtime was simply not acceptable.
No data loss: Migrating a copy, then swapping out the database ran the risk of the newer version missing out data updates during the transition. This is a commonly-tackled challenge, with strategies like dual-write and change-data-capture often employed. For our case, any data change was likely immediately interacted with by the user (for instance, a stat update causing an achievement or bot to be unlocked), so preventing loss of this data is paramount, compared to older data that can be eventually consistent.
Backwards compatibility: The jsonb column used a deeply-nested structure, where the keys and even the nesting depth could not be predicted before a payload arrived. Switching to normalized tables prevents this sort of structure. Yet, since not all client apps could be depended upon to upgrade alongside the migration, at the application-level, client apps not needing to be updated to run was a must to allow both of the above and keep users happy.

Feature flags and row-level locking

The schema migration itself was easy enough, and postgres accomplishes table and column additions without much fuss or impact on performance. The main issue was migrating the old data to the newer tables. We used a feature flag on each row of the user table to indicate whether the user had been migrated or not. Then, at the application level, we checked whether these flags were enabled or not during each API call, and forked to appropriate blocks. In Typescript, this can be made easier by defining a higher order function like the following:

static async migrationCompatibleTransaction<T>(
client: PoolClient,
userId: string,
oldVersion: () => Promise<T>,
newVersion: () => Promise<T>
) : Promise<T> {
const migratedRes = await client.query(
`
BEGIN;
SELECT migrated
FROM "user"
WHERE id = ${userId}
FOR UPDATE;
`
);

const migrated = migratedRes.rows[0]['migrated'];

try {
if (migrated) {
const result = await newVersion();
await client.query('COMMIT;');
return result;
} else {
const result = await oldVersion();
await client.query('COMMIT;');
return result;
}
} catch(err) {
await client.query('ROLLBACK;');
throw err;
}
}

Using this function in server code ensures that the new version of the API call needs to have the same return type as the old one and also facilitates row-level locking for each user. At least during the period where we maintain backwards compatibility, we have some additional overhead due to the payload/response transformation logic.

We first send a multi-command query to begin a transaction, obtain an exclusive row-level lock of the user table for the given user, and also read the feature flag. If any database operations are underway for a user due to an API call and the migration script reaches the same user, it will wait until the server transaction completes.

For the ongoing migration script, which is a separate process, a similar setup is provided by immediately beginning a transaction and locking the row. When the migration is ongoing for a user, any queries that perform SELECT will be presented with data at the beginning of the migration transaction, and any UPDATE/DELETE queries will wait for the transaction to be committed before proceeding.

Execution and Learnings

The migration script itself was quite prone to hogging database resources and slowing down our operations. After running some tests, we decided to interleave the migrations, at a rate of about one user each second. We also only ran the migrations at set intervals in the day when user activity was lower. It took us around two weeks to completely migrate to the newer schema and finally be rid of jsonb. All the thresholds were determined by first taking a db dump and simulating the actual migration in a dev server.

One issue that cropped up in relation to row-level locking were non-row-level functions. For instance, for a leaderboard API call, the migration status of the user requesting the API was not the only feature flag that was sufficient. We needed the data of, say, the top fifty players. In such cases, simply coalescing the newer value (which would be null if any user had not been migrated) with the older one worked well. Thankfully, all such functions were read-only; needing to also write in such cases might have presented a greater challenge.

COALESCE((
SELECT value
FROM user_statistic
WHERE
user_id = "user".id
AND statistic_id = 'xp'
),
("user".data -> "xp"))::FLOAT8
) AS xp,

The main lessons learned revolved around ensuring backwards compatibility. This is a very difficult concept to execute, especially for permissible interfaces like the one we had. We also lacked a complete test suite, which was reflected in some of the edge case bugs that cropped up. An added difficulty emerged due to the added transparency: during the simulated migration, a lot of user data was error-prone in the new normalized tables. We needed to fix all such data discrepancies first. All in all, the whole planning, refactoring, testing and execution process took a couple of months.

So the next time you design a database schema, don’t rush. Instead, slow down to really figure out what the best architecture would be - not just in the present but also down the road when your application might be a hit among your userbase.

Get in Touch

Why We Got Rid of Jsonb

Challenges

Feature flags and row-level locking

Execution and Learnings