Defensive Design
January 6, 2026
Hold my beer 🍺. So yeah, that tiny “Now Watching” card on my overview page. It looks innocent, right? A simple fetch, a quick render, maybe a bit of CSS styling for the vibes. It should have been a ten-minute job.
Instead, I made it a whole thing. A big, technical, over-engineered fuss.

The visually simple “Now Watching” card

First off, I’m a TV junkie. I love watching, like, a lot. And everything I watch, I track on Trakt¹.

My Trakt stats at the time of writing this post

One day, I decided it would be fun to show my site visitors what I just finished bingeing. Trakt has an API², I’m a dev, so I said to myself: why not? But here’s the thing about me: I like thinking about stuff deeper than I probably should. I like to anticipate every scenario where something might break, and because of that, I love doing deep engineering stuff.
And boy, did I find some scenarios.
The Dark Ages
When I first built it, the implementation was almost boring. No state. No background jobs. Just data in, JSX out. But the implementation had flaws I didn't come to realize until later: I was manually managing tokens. I’d open Termux on my phone every now and then, curl the Trakt API to get a new access token, and then manually paste that string into my environment variables.
Yes, there was a phase where I was SSH-ing to patch prod while sitting on a couch. If a system requires me to babysit it, it’s unfinished.
It was operationally expensive. It was brittle. I was essentially acting as a manual cron job, and I hate manual work. So after months of manual cron-jobbing, I decided to re-engineer the whole logic.
Statelessness vs. State
In a perfect world, you get an API key, you put it in an .env file, and you forget it exists. But Trakt (and most modern OAuth providers in 2026) uses refresh token rotation.
Here is the "whack" reality:
- You get an access token (valid for 7 days in my case).
- You get a refresh token.
- When the access token expires, you use the refresh token to get a new pair.
- The moment you use that refresh token, the old one is invalid.
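For illustration, the payload you get back from a refresh looks roughly like a standard OAuth token response. This type is my own sketch for this post, not an official Trakt typing:

```ts
// Rough shape of the refresh response (illustrative; field names follow the
// standard OAuth token response, which is what the code later in this post reads).
type OAuthTokenResponse = {
  access_token: string   // the new access token (7 days, in my case)
  refresh_token: string  // the new refresh token; the one you just spent is dead
  expires_in: number     // lifetime of the access token, in seconds
  token_type: string     // usually "bearer"
  scope?: string
  created_at?: number    // unix timestamp
}
```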
Because environment variables in Vercel are immutable at runtime, it was evident that I needed writable storage where I could securely save these tokens and rotate them automatically. I went through the usual mental checklist: Edge Config is great for reads but terrible for frequent writes. KV stores are fast, but offer no real guarantees for atomic compare-and-set without layering hacks.
The problem was coordination. I needed to guarantee that only one execution path could mutate it at a time. This wasn’t caching. This was state with invariants. And invariants belong in a database.
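To make that concrete, here's roughly the shape of the single-row store the next sections build on. This is a sketch of the setup, not my actual migration; it assumes Postgres and the same sql tagged-template client the rest of the code uses, and the helper name is made up:

```ts
// Sketch: one-time setup for the single-row token store. Column names mirror
// the fields used later in tokens.ts; the seed values come from the initial
// OAuth handshake.
import { sql } from "@/lib/db"

export async function setupTokenStore(seed: {
  accessToken: string
  refreshToken: string
  expiresAt: Date
}) {
  await sql`
    create table if not exists oauth_tokens (
      id            int primary key default 1,
      access_token  text not null,
      refresh_token text not null,
      expires_at    timestamptz not null,
      refreshing    boolean not null default false,   -- the mutex flag
      updated_at    timestamptz not null default now() -- lock freshness
    )
  `

  // Seed the single row; do nothing if it already exists
  await sql`
    insert into oauth_tokens (id, access_token, refresh_token, expires_at)
    values (1, ${seed.accessToken}, ${seed.refreshToken}, ${seed.expiresAt})
    on conflict (id) do nothing
  `
}
```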
The Thundering Herd (A Race Condition)
In a serverless world like Vercel's, this is a nightmare. Vercel functions are stateless. They don't share memory. If two people visit my blog at the same time and the token is expired, both functions will try to refresh it at the same microsecond.
This is what engineers call the thundering herd.
- Function A calls Trakt to refresh. Trakt says "Cool, here is New Token #1."
- Function B calls Trakt a millisecond later with the same old refresh token.
- Trakt sees Function B and thinks: "Wait, someone already used this token. This is a replay attack!"
- The Result: Trakt revokes my entire session. I'm logged out. The card breaks. I lose sleep.
The Database-Enforced Mutex
To fix this, I had to introduce a source of truth. I chose a database (Postgres) because it's boring, predictable, and serverless-friendly. Here's how I implemented it:
```ts
// tokens.ts
import { sql } from "@/lib/db"

const OAUTH_ENDPOINT = "https://api.trakt.tv/oauth/token"

type TokenResult =
  | string
  | {
      refreshed: boolean
      expiresAt: string | Date
      info?: string
    }

export async function getValidTraktToken(
  { debug = false }: { debug?: boolean } = {}
): Promise<TokenResult> {
  /**
   * Single-row token store
   * Fields:
   * - access_token
   * - refresh_token
   * - expires_at
   * - refreshing (mutex)
   * - updated_at (lock freshness)
   */
  const [token] = await sql`
    select access_token, refresh_token, expires_at, refreshing, updated_at
    from oauth_tokens
    limit 1
  `

  if (!token) {
    throw new Error("OAuth token missing")
  }

  const now = Date.now()

  // refresh window: 24h before expiry
  const expiresSoon =
    new Date(token.expires_at).getTime() - now < 24 * 60 * 60 * 1000

  // stale lock detection (example: 8 mins)
  const lockIsStale =
    token.refreshing &&
    now - new Date(token.updated_at).getTime() > 8 * 60 * 1000

  // 1. Token is healthy
  if (!expiresSoon) {
    return debug
      ? { refreshed: false, expiresAt: token.expires_at }
      : token.access_token
  }

  // 2. Another process is refreshing
  if (token.refreshing && !lockIsStale) {
    return debug
      ? {
          refreshed: false,
          expiresAt: token.expires_at,
          info: "Refresh already in progress",
        }
      : token.access_token
  }

  // 3. Acquire refresh lock (or recover a stale one)
  await sql`
    update oauth_tokens
    set refreshing = true, updated_at = now()
    where refreshing = false or ${lockIsStale} = true
  `

  try {
    // refresh request
    const res = await fetch(OAUTH_ENDPOINT, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        grant_type: "refresh_token",
        refresh_token: token.refresh_token,
      }),
    })

    if (!res.ok) {
      throw new Error("OAuth refresh failed")
    }

    const data = await res.json()
    const expiresAt = new Date(Date.now() + data.expires_in * 1000)

    await sql`
      update oauth_tokens
      set access_token = ${data.access_token},
          refresh_token = ${data.refresh_token},
          expires_at = ${expiresAt},
          refreshing = false,
          updated_at = now()
    `

    return debug
      ? {
          refreshed: true,
          expiresAt,
          info: lockIsStale ? "Stale lock recovered" : "Token refreshed",
        }
      : data.access_token
  } catch (err) {
    // fail-safe: never leave the lock hanging
    await sql`update oauth_tokens set refreshing = false`
    throw err
  }
}
```
When Function B arrives while Function A is mid-refresh, it hits that second condition: token.refreshing && !lockIsStale. It sees the lock is active and fresh, so it just returns the existing (slightly expired but still valid) access token.
You’ll notice the lock isn’t implemented in JavaScript. That’s intentional. If you try to lock in application code, you’ve already lost. Two functions can read the same value before either writes. You need the comparison and the mutation to happen in the same atomic step.
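Here's the losing version, sketched for contrast (refreshTraktToken is a made-up placeholder for the refresh call):

```ts
// Anti-example: a check-then-act lock in application code. Two concurrent
// serverless invocations can both read `refreshing = false` before either
// write lands, so both end up spending the same refresh token.
const [row] = await sql`select refreshing from oauth_tokens where id = 1`

if (!row.refreshing) {
  await sql`update oauth_tokens set refreshing = true where id = 1`
  await refreshTraktToken() // placeholder: both A and B get here. Session revoked.
}
```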
That’s why this atomic SQL statement is the heart of the system:
```sql
UPDATE oauth_tokens
SET refreshing = true, updated_at = now()
WHERE id = 1
  AND (refreshing = false OR updated_at < now() - interval '8 minutes');
```
By adding WHERE refreshing = false, I’m using the database as a bouncer. If Function B tries to refresh while Function A is already doing it, the database returns 0 rows updated. Function B then knows to back off and just use the existing token.
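In code, that "back off" decision can hinge on the affected-row count the client reports. A sketch, assuming your Postgres driver exposes such a count (the exact property name varies between clients):

```ts
// Sketch: whoever's UPDATE actually changes the row wins the lock.
// How you read the affected-row count depends on the driver; `rowCount` is illustrative.
const result = await sql`
  update oauth_tokens
  set refreshing = true, updated_at = now()
  where id = 1
    and (refreshing = false or updated_at < now() - interval '8 minutes')
`

if ((result as { rowCount?: number }).rowCount !== 1) {
  // Lost the race: someone else holds a fresh lock. Use the existing token.
  return token.access_token
}
// Won the lock: safe to call Trakt and rotate the pair.
```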
When Locks Go Stale
But here's the thing about locks: they can get stuck. Imagine Function A acquires the lock and starts refreshing the token. Halfway through the OAuth request, something goes sideways: maybe the network times out, maybe the Lambda hits its execution limit, maybe a cosmic ray flips a bit in the CPU (hey, it happens). Function A dies mid-flight.
Now the refreshing flag is stuck at true. Forever.
Without stale lock detection, the system is permanently bricked. No function can ever refresh the token again because they all see refreshing = true and back off. The card stops working. The site breaks. And the only fix is for me to SSH into the database at 3 AM and manually flip the flag back to false.
That's exactly the kind of manual intervention I was trying to avoid.
So I added the stale lock detector: OR updated_at < now() - interval '8 minutes'. If that updated_at timestamp is more than 8 minutes old, we know something went catastrophically wrong. The lock has been abandoned. The next function that comes along sees the stale lock and says, "Aight, this is dead. I'm taking over."
The system self-heals with zero manual intervention, zero downtime. Just boring reliable infrastructure doing its job while I sleep.
Preemption: The Background Janitor
I also hated the idea of lazy loading. Why wait for a user to hit my site to trigger a refresh? That makes the site slow for that one unlucky visitor. Refreshing tokens during a live request puts correctness on the "hot path," and that’s a bad place for fragile work.
Instead, I implemented preemptive refreshing. I set up a cron job that hits a secured /api/cron route every morning at, say, 2:00 AM.
- It checks if the token is within a 24-hour expiry window.
- If it is, it performs the refresh while I’m literally asleep.
- I secured this with a CRON_SECRET, so don't try spamming my refresh logic.
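For the curious, here's a minimal sketch of what such a route could look like, assuming a Next.js App Router handler and a bearer-token check against CRON_SECRET; my real route differs a bit, and the scheduling config in the comment is just the standard Vercel shape:

```ts
// app/api/cron/route.ts — illustrative sketch, not my exact implementation.
// On Vercel, the schedule itself lives in vercel.json, e.g.
// { "crons": [{ "path": "/api/cron", "schedule": "0 2 * * *" }] }
import { NextResponse } from "next/server"
import { getValidTraktToken } from "@/lib/tokens"

export async function GET(request: Request) {
  // Reject anything that doesn't present the shared secret
  const auth = request.headers.get("authorization")
  if (auth !== `Bearer ${process.env.CRON_SECRET}`) {
    return NextResponse.json({ error: "Unauthorized" }, { status: 401 })
  }

  // debug: true makes getValidTraktToken return the status object shown below
  const result = await getValidTraktToken({ debug: true })
  return NextResponse.json(result)
}
```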
An example of a successful token refresh response:
{ "refreshed": true, "expires_at": "2026-01-09T04:12:00Z", "info": "Token refreshed within safety window" }
The cron job performs a refresh only when the token is less than 24 hours from expiry; otherwise, it skips the refresh entirely.
Graceful Degradation
Even with a "tank" architecture, APIs fail. If Trakt, the cron job, or the database goes down, I don't want my blog to 500. So the site keeps working anyway: ISR masks temporary failures, and a fallback JSON steps in if everything else falls apart.
- ISR Cache: I added a revalidate: 60 to my API route. My site serves a cached version of the data, masking any temporary upstream hiccups. This also makes the initial page load faster, since the card doesn't have to wait for fresh data to kick in. The card is almost static even though it isn't.
- JSON Fallback: If the API returns garbage, I catch the error and serve a local.json file. It’s "last-known-good" data. People love to dunk on fallback files, but this one earns its keep. If Trakt goes down for an hour, the UI doesn't suddenly need to lie or error. I could've stored these fallbacks in the database, but I was lazy. Maybe in the near future. Both pieces are sketched below.
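A rough sketch of how those two pieces fit together in the API route. The endpoint path, fallback file name, and headers are simplified stand-ins rather than my exact setup (Trakt also wants its API-key and version headers, omitted here for brevity):

```ts
// app/api/now-watching/route.ts — illustrative sketch; names are placeholders.
import { NextResponse } from "next/server"
import { getValidTraktToken } from "@/lib/tokens"
import fallback from "@/data/now-watching-fallback.json"

// ISR-style caching: responses can be served from cache for up to 60 seconds
export const revalidate = 60

export async function GET() {
  try {
    const token = await getValidTraktToken() // non-debug calls resolve to the access token string
    const res = await fetch("https://api.trakt.tv/users/me/watching", {
      headers: { Authorization: `Bearer ${token}` },
    })
    if (!res.ok) throw new Error(`Trakt responded with ${res.status}`)
    return NextResponse.json(await res.json())
  } catch {
    // Last-known-good data beats a broken card
    return NextResponse.json(fallback)
  }
}
```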
Uptime & Peace of Mind
Somewhere along the way, I stopped thinking about Trakt entirely. What I built instead was a single-row state machine with atomic transitions, guarded by a mutex, refreshed preemptively, and recoverable from partial failure.
Since moving to this "over-engineered" setup:
- Manual Interventions:
0 - Race Condition Failures:
0 - My Sleep Quality:
10/10
It’s a lot of work for a card that just says I’m watching The Bear. But it’s not about the card. It’s about the system. It’s about defensive engineering.
Now, back to the couch. I have shows to track. 🍿