Observability for an SEO-Critical Frontend Platform

A server-rendered, SEO-critical platform fails in ways a green build can’t show: a third-party script regresses Core Web Vitals, an upstream API slows the server render, a cache quietly stops hitting. I instrumented the platform end to end in Sentry and Google Cloud — real-user vitals, server-side dependencies, cache health, structured logs correlated with traces, and third-party noise — so regressions surface in a dashboard before they surface in rankings.

Sentry · GCP + Pino · Core Web Vitals · Next.js use cache · Brandfolder · Contentful

The stakes

On a platform where organic search drives the traffic, the expensive failures are the quiet ones. A layout shift from a late-loading script, a Brandfolder call that got slow, a personalization token that silently stopped resolving — none of these throw an error or fail the build, but all of them cost rankings or conversions.

Every page is rendered on the server, per request — pulling from Contentful, Brandfolder, and the facilities directory as it goes — which keeps pages fast, crawlable, and personalized. But it also means the render depends on systems I don’t fully control, and a green build says nothing about production. Observability is how I keep it honest: measure what real users actually experience, and watch every system the server render depends on.

[Figure: the whole system at a glance. The request path (top), instrumented into two correlated sinks: Sentry traces and Pino → Google Cloud logs. Legend: upstream dependency, internal infrastructure, telemetry.]

Core Web Vitals at p75

Core Web Vitals are a ranking input, so I treat them as a production SLA, not a lab score. Real-user vitals stream into Sentry and I dashboard them at the 75th percentile — the same percentile Google grades — because an average hides the slow tail that actually gets scored.

core-web-vitals.log

# p75 Core Web Vitals - Sentry (production, last 24h)

metric   p75         Google "good"   status
LCP      2.10 s      <= 2.5 s        * good
INP      185.59 ms   <= 200 ms       * good
CLS      0.093       <= 0.1          * good
FCP      1.33 s      <= 1.8 s        * good
TTFB     668.02 ms   <= 800 ms       * good

Production p75 over the last 24 hours, read straight off the Sentry dashboard.

Reading p75 instead of the mean means the number reflects a genuinely bad-but-common session, not a lucky fast one. Over the last 24 hours every metric sits in Google’s “good” band. More importantly, every event, trace, and log is tagged with the release that served it, so a regression ties straight back to the deployment that introduced it: finding the responsible release is a lookup, not a guess.
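To make the mean-versus-p75 point concrete, here is a minimal sketch. The thresholds are the “good” limits from the table above; the sample values are hypothetical, and the nearest-rank percentile is an assumption (Sentry’s own estimator may differ slightly).

```typescript
// Google's "good" thresholds as listed in the table above.
// LCP, INP, FCP, and TTFB are in milliseconds; CLS is unitless.
const GOOD_THRESHOLDS = { LCP: 2500, INP: 200, CLS: 0.1, FCP: 1800, TTFB: 800 } as const;
type Metric = keyof typeof GOOD_THRESHOLDS;

// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(Math.ceil((p / 100) * sorted.length), 1);
  return sorted[rank - 1];
}

function mean(samples: number[]): number {
  return samples.reduce((sum, value) => sum + value, 0) / samples.length;
}

function isGood(metric: Metric, value: number): boolean {
  return value <= GOOD_THRESHOLDS[metric];
}

// HYPOTHETICAL LCP samples in ms: seven fast sessions and three slow ones.
const lcpSamples = [900, 1000, 1100, 1200, 1300, 1400, 1500, 3600, 3800, 4000];
const lcpMean = mean(lcpSamples);          // 1980
const lcpP75 = percentile(lcpSamples, 75); // 3600
```

With these samples the mean passes the 2.5 s limit while p75 fails it, which is exactly the slow tail an average hides.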

Sampling, on purpose

At production traffic, tracing every session would be noisy and expensive for no extra insight. A fraction of traffic is sampled instead — enough to keep the percentiles stable and the trends trustworthy while keeping event volume, cost, and noise under control. The goal is a signal I’ll actually look at, not a firehose I’ll learn to ignore.
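One way to express a sampling policy like this is a function that returns a rate per transaction, which is the shape the Sentry SDK’s tracesSampler option accepts. The sketch below is illustrative only: the route patterns and every rate are hypothetical, since the real values are not stated here.

```typescript
// Minimal stand-in for the context a sampler receives; only `name` is used.
interface SamplingContext {
  name: string; // e.g. "GET /offices/denver"
}

// HYPOTHETICAL rates and route patterns, for illustration only.
const DEFAULT_RATE = 0.1;
const RATE_OVERRIDES: Array<{ match: RegExp; rate: number }> = [
  { match: /\/api\/health/, rate: 0 },       // never trace health checks
  { match: /^GET \/offices\//, rate: 0.25 }, // sample landing pages more heavily
];

// Returns the fraction of matching transactions to trace (0 to 1).
// The first matching override wins; everything else gets the default.
function sampleRateFor(ctx: SamplingContext): number {
  for (const { match, rate } of RATE_OVERRIDES) {
    if (match.test(ctx.name)) return rate;
  }
  return DEFAULT_RATE;
}

// Turns a rate into a yes/no decision. `random` is injectable so the
// decision can be tested deterministically.
function shouldSample(rate: number, random: () => number = Math.random): boolean {
  return random() < rate;
}
```

Keeping the policy in one small function makes the trade-off explicit: raising a rate buys tighter percentiles on that route at the price of more events.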

The external dependencies that gate a render

Server-side rendering is only as fast as the systems it calls during the render. Three of them can gate a page, so I time each one as its own span rather than hiding them inside a single “server was slow” number:

Brandfolder metadata — assets are managed in Brandfolder, and the server makes REST calls to resolve asset metadata while rendering. I track how long those calls take, so a slow upstream shows up as server time, not a mystery.
Contentful experience load — the composed Studio experience has to load before the page can render, so I monitor that load time as a first-class dependency.
Brandfolder CDN delivery — once metadata resolves, the assets themselves are served from Brandfolder’s CDN. I track real-user asset load performance, because a fast server render still feels slow if the imagery drags.

Splitting them out means that when a page gets slow, the trace says which dependency moved — the difference between a five-minute fix and an afternoon of guessing.
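The idea of one span per dependency can be sketched without any SDK. The helper below is a dependency-free stand-in: with the Sentry SDK a call such as Sentry.startSpan would play the role of timed, and that mapping is an assumption about the setup. The span names and durations in the usage note are hypothetical.

```typescript
interface SpanRecord {
  name: string;
  durationMs: number;
}

// Runs `fn`, records how long it took under `name`, and returns its result.
// The span is recorded even when `fn` throws, so failures still show as time.
async function timed<T>(
  spans: SpanRecord[],
  name: string,
  fn: () => Promise<T>,
  now: () => number = Date.now,
): Promise<T> {
  const start = now();
  try {
    return await fn();
  } finally {
    spans.push({ name, durationMs: now() - start });
  }
}

// Compares current spans against a baseline and returns the dependency whose
// duration grew the most; `durationMs` on the result holds the increase.
function biggestRegression(baseline: SpanRecord[], current: SpanRecord[]): SpanRecord | undefined {
  const base = new Map<string, number>();
  for (const span of baseline) base.set(span.name, span.durationMs);

  let worst: SpanRecord | undefined;
  for (const span of current) {
    const delta = span.durationMs - (base.get(span.name) ?? 0);
    if (worst === undefined || delta > worst.durationMs) {
      worst = { name: span.name, durationMs: delta };
    }
  }
  return worst;
}
```

For example, against a baseline of brandfolder.metadata at 120 ms and contentful.experience at 180 ms, a render that records 640 ms and 190 ms points at brandfolder.metadata with a 520 ms increase, rather than at a generic slow server.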

Caching the facilities lookup

Every request needs the facilities collection — the office directory that powers routing, personalization, and the office pages. Fetching it fresh on every render would be wasteful, so it’s wrapped in Next.js use cache, and each lookup is traced with a cache.hit attribute so I can watch the cache actually working.

facilities-cache.log

# getFacilitiesCollection - spans grouped by cache.hit (Sentry, last 24h)

cache.hit   count   avg       p75       p90
true        134K    6.37ms    6.68ms    6.90ms
false        320    6.89ms    6.84ms    8.51ms

# ~99.8% of lookups served from cache; even a miss is ~8.5ms at p90

Real span data over a 24-hour window, grouped by cache.hit.

The ratio is the story: over 24 hours, roughly 134K hits against 320 misses — about a 99.8% hit rate — with cached reads near 6.9ms at p90. Even a miss lands around 8.5ms at p90, so the rare cold read never hurts. If that hit rate ever drops, I see it immediately, and it almost always means a cache key or a revalidation setting regressed.
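The hit-rate arithmetic, and the idea of recording a cache.hit flag on every lookup, can be sketched as follows. The in-memory wrapper is a stand-in and not the Next.js use cache API; the alert floor is hypothetical; and because 134K is itself a rounded count, the computed rate is approximate.

```typescript
interface CacheStats {
  hits: number;
  misses: number;
}

// In-memory stand-in for a cached lookup. NOT the Next.js "use cache" API:
// it only illustrates reporting a cache.hit flag for every call.
function makeCachedLookup<T>(load: (key: string) => T, stats: CacheStats) {
  const store = new Map<string, T>();
  return (key: string): { value: T; cacheHit: boolean } => {
    if (store.has(key)) {
      stats.hits += 1;
      return { value: store.get(key) as T, cacheHit: true };
    }
    const value = load(key); // only a miss reaches the loader
    store.set(key, value);
    stats.misses += 1;
    return { value, cacheHit: false };
  };
}

function hitRate({ hits, misses }: CacheStats): number {
  const total = hits + misses;
  return total === 0 ? 0 : hits / total;
}

// Counts from the 24-hour window above (134K is rounded).
const observedRate = hitRate({ hits: 134000, misses: 320 });
const observedPercent = Math.round(observedRate * 1000) / 10; // 99.8

// HYPOTHETICAL alert floor: flag the cache if the rate falls below it.
const MIN_HIT_RATE = 0.99;
const cacheHealthy = observedRate >= MIN_HIT_RATE;
```

A floor like this turns “if that hit rate ever drops” into a check that can alert, instead of a number someone has to remember to look at.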

Third-party scripts, correctness, and attribution

Not every production risk is raw performance. These are the three things that used to be invisible.

Third-party scripts, isolated

Tag managers and experimentation tools (GTM, VWO, Freshpaint and friends) run in the user’s browser and throw errors I don’t own but still see. Left alone, they drown out first-party regressions. I fingerprint errors originating from third-party script frames so they group separately: they’re still counted, but they never sit in the same error budget as our own code, and they never page anyone at 2am.
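A minimal sketch of that kind of fingerprinting, written as a pure function that could be called from a Sentry beforeSend hook. The event type is a hand-written subset of Sentry’s event shape, the vendor hostnames are assumptions to verify against the scripts actually loaded, and the tag name error.origin is hypothetical.

```typescript
// Hand-written subset of the error-event shape Sentry passes to beforeSend.
interface StackFrame {
  filename?: string;
}
interface SentryLikeEvent {
  exception?: { values?: Array<{ stacktrace?: { frames?: StackFrame[] } }> };
  fingerprint?: string[];
  tags?: Record<string, string>;
}

// ASSUMED vendor hosts; check them against the real script URLs.
const THIRD_PARTY_HOSTS: Array<[host: string, vendor: string]> = [
  ["googletagmanager.com", "gtm"],
  ["visualwebsiteoptimizer.com", "vwo"],
  ["freshpaint-cdn.com", "freshpaint"],
];

function vendorFor(filename: string | undefined): string | undefined {
  if (!filename) return undefined;
  for (const [host, vendor] of THIRD_PARTY_HOSTS) {
    if (filename.includes(host)) return vendor;
  }
  return undefined;
}

// Groups errors thrown from a known third-party script under their own
// fingerprint and tags them, leaving first-party events untouched.
function isolateThirdParty(event: SentryLikeEvent): SentryLikeEvent {
  const frames = event.exception?.values?.[0]?.stacktrace?.frames ?? [];
  // Sentry orders frames oldest call first, so the last frame threw the error.
  const origin = frames[frames.length - 1];
  const vendor = vendorFor(origin?.filename);
  if (!vendor) return event;
  return {
    ...event,
    fingerprint: ["third-party", vendor],
    tags: { ...event.tags, "error.origin": "third-party", vendor },
  };
}
```

Because the event is tagged rather than dropped, third-party errors stay countable, and alert rules can exclude them by tag.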

Personalization that’s provably resolving

Server-side personalization replaces tokens like office name and phone at render time — but only on the experiences where marketing has actually added them; plenty of pages carry none. Where an experience does use tokens and that logic silently fails, the page ships with a blank in place of a city — no error, just wrong. I track the variable-replacement step so I can confirm tokens are resolving where they’re expected, and catch it the moment they stop.
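Tracking the replacement step amounts to counting, per render, how many tokens were expected and how many failed to resolve. A minimal sketch follows; the {{name}} token syntax and the token names are assumptions, since the platform’s real syntax is not shown here.

```typescript
interface ReplacementResult {
  output: string;
  expected: number;     // tokens found in the template
  unresolved: string[]; // token names with no usable value
}

// ASSUMED token syntax: {{name}}.
const TOKEN = /\{\{\s*([a-zA-Z0-9_]+)\s*\}\}/g;

// Replaces tokens and reports what happened, so a caller can attach
// `expected` and `unresolved` to a span or log line for the render.
function replaceTokens(
  template: string,
  values: Record<string, string | undefined>,
): ReplacementResult {
  let expected = 0;
  const unresolved: string[] = [];
  const output = template.replace(TOKEN, (_match: string, name: string) => {
    expected += 1;
    const value = values[name];
    if (value === undefined || value.trim() === "") {
      unresolved.push(name);
      return ""; // the silent blank that monitoring exists to catch
    }
    return value;
  });
  return { output, expected, unresolved };
}
```

A page with no tokens reports expected as 0, which keeps token-free experiences from looking like failures; a non-empty unresolved list on a page that expects tokens is the signal worth alerting on.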

Where the traffic comes from

Marketing runs media campaigns — paid social, video, local listings — that land on these pages with UTM parameters. I track requests carrying those parameters so campaign traffic is measurable — how much of it there is, and where it goes — turning “the campaign is live” into a number.

utm-traffic.log

# Inbound requests with UTM parameters - Sentry (last 24h)

http.query
?utm_source=YouTube&utm_medium=Video&utm_content=29NPA&utm_campaign=…
?utm_source=googleplaces&utm_medium=lociqgoogleplaces&utm_campaign=…
?utm_source=meta&utm_medium=paidsocial&utm_campaign=AD_META_COR…

Real inbound campaign traffic, grouped by http.query.
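Grouping by raw query string is a starting point; pulling out the individual UTM fields makes the counts easier to read. A small sketch using the standard URLSearchParams API, where lower-casing the source so that YouTube and youtube count together is my assumption rather than something stated above:

```typescript
const UTM_KEYS = ["utm_source", "utm_medium", "utm_campaign", "utm_content", "utm_term"] as const;
type UtmKey = (typeof UTM_KEYS)[number];
type UtmParams = Partial<Record<UtmKey, string>>;

// Extracts only the UTM fields from a query string (leading "?" optional).
function extractUtm(query: string): UtmParams {
  const params = new URLSearchParams(query);
  const utm: UtmParams = {};
  for (const key of UTM_KEYS) {
    const value = params.get(key);
    if (value) utm[key] = value;
  }
  return utm;
}

// Counts campaign requests per source; requests without utm_source are skipped.
function countBySource(queries: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const query of queries) {
    const source = extractUtm(query).utm_source?.toLowerCase();
    if (!source) continue;
    counts.set(source, (counts.get(source) ?? 0) + 1);
  }
  return counts;
}
```

The same extracted fields could be attached to the request span as attributes, so campaign traffic can be filtered alongside the timing data.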

Structured logging, correlated with traces

Dashboards tell me something regressed; logs tell me why. The platform emits structured JSON logs through Pino, shipped to Google Cloud Logging. Because every line carries the request’s context — route, facility, timing — and shares identifiers with the Sentry trace for that same request, an incident stops being a scavenger hunt.

When a trace in Sentry looks slow or throws, I pivot straight to the correlated logs in GCP for that exact request — context already attached — instead of grepping free text and guessing. Because the logs are structured, not string-concatenated, I can filter and aggregate in GCP by route, status, or facility. Correlated request context alongside Sentry traces is what turns “something broke in production” into a specific, investigable request.
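The correlation depends on each log line carrying the same identifiers as the trace. The sketch below is a dependency-free stand-in for what a Pino child logger bound to the request context would emit; the field names (trace_id, release, route, facility) are hypothetical, and real Pino output differs in details such as numeric levels.

```typescript
interface RequestContext {
  traceId: string; // the same ID the Sentry trace for this request carries
  release: string;
  route: string;
  facility?: string;
}

type Level = "info" | "warn" | "error";

// Builds one structured JSON log line with the request context attached.
// `now` is injectable so output is reproducible in tests.
function logLine(
  ctx: RequestContext,
  level: Level,
  msg: string,
  fields: Record<string, unknown> = {},
  now: () => number = Date.now,
): string {
  return JSON.stringify({
    level,
    time: now(),
    msg,
    trace_id: ctx.traceId,
    release: ctx.release,
    route: ctx.route,
    facility: ctx.facility, // omitted from the JSON when undefined
    ...fields,
  });
}

// Because lines are structured, finding one request's logs is a field match.
function linesForTrace(lines: string[], traceId: string): string[] {
  return lines.filter((line) => JSON.parse(line).trace_id === traceId);
}
```

Binding the context once per request, rather than passing it to every log call, is what a child logger provides in practice; the filter at the end is the in-code equivalent of querying the log store by trace ID.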

Production impact

Core Web Vitals tracked at p75 against Google’s thresholds — currently all in the “good” band.
Every event, trace, and log is tagged with its release, so a regression ties back to the exact deployment that introduced it.
Structured JSON logs (Pino → Google Cloud) carry correlated request context, so a Sentry trace links straight to the logs for that exact request.
Server-side dependencies (Brandfolder metadata, Contentful experiences) timed individually, so a slow render points to a specific upstream.
Facilities lookup cached with Next.js use cache at a ~99.8% hit rate — verified in production, not assumed.
Third-party script errors (GTM, VWO, Freshpaint) isolated from the first-party error budget.
Personalization token replacement monitored, so a silent resolution failure is caught, not shipped.
Campaign traffic attributable via UTM tracking, and Brandfolder CDN asset delivery measured on real sessions.
