PRH · Retail Ads/2025/In production

Book tagging for retail ad selection.

Retail ad inventory is a finite, expensive resource. The question isn't 'which book is good?' — it's 'which book, on which retailer, will earn its slot?' I built the production tagging pipeline that turns a publisher's catalogue into a structured signal you can actually buy media against.

Role
Sole ML engineer · pipeline → validation → ad-buy recommendation
Year
2025
Status
In production
GPT-5-mini · Databricks · Snowflake · SerpAPI · scikit-learn · HDBSCAN · SHAP
01

Problem

A book's metadata — title, BISAC code, a flap-copy paragraph — is not enough to predict whether it will perform in a paid retail slot. The signals that actually matter (audience, format affinities, comparative titles, current cultural moment) live outside the metadata system, scattered across reviews, retailer pages, and the open web.

The brief was practical: tag the catalogue, then prove the tags aren't noise.

02

The tagging pipeline

For each title, the pipeline does grounded extraction:

  • SerpAPI grounding. Pull retailer pages, reviews, and editorial coverage so the model is reasoning over actual evidence, not its priors about what a book like this probably is.
  • Structured tagging with GPT-5-mini. Each tag is emitted with a confidence and a citation back to a source snippet. Tags that can't be supported are returned null rather than guessed.
  • Schema validation on every record before it lands in the warehouse — a downstream join shouldn't be the place a malformed tag first surfaces.

The whole job runs as a Databricks batch and writes back to Snowflake, so the tags become a normal table the ad-ops team can query alongside performance data.
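The validation gate can be sketched in a few lines. This is a minimal stdlib-only illustration, not the production schema — the field names (`audience`, `citation`) and the 13-digit ISBN check are assumptions chosen to show the two rules that matter: a supported tag must carry a confidence and a grounding citation, and an unsupported tag must be null, never guessed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TagRecord:
    # Hypothetical fields; the production schema is richer.
    isbn: str
    audience: Optional[str]      # None when no source snippet supports a value
    confidence: Optional[float]  # model-reported confidence for the tag
    citation: Optional[str]      # source snippet the tag is grounded in

def validate(rec: TagRecord) -> list[str]:
    """Return a list of schema violations; empty means the record may land."""
    errors = []
    if not rec.isbn or not rec.isbn.isdigit() or len(rec.isbn) != 13:
        errors.append("isbn must be a 13-digit string")
    # A supported tag needs both a confidence and a citation;
    # an unsupported tag stays null across the board.
    if rec.audience is not None:
        if rec.confidence is None or not (0.0 <= rec.confidence <= 1.0):
            errors.append("tagged record needs confidence in [0, 1]")
        if not rec.citation:
            errors.append("tagged record needs a grounding citation")
    return errors

good = TagRecord("9780000000002", "book-club", 0.87, "retailer page: 'a book-club favorite'")
bad = TagRecord("9780000000002", "book-club", None, None)
```

Running the validator at ingest, rather than at query time, is what keeps a malformed tag from surfacing in a downstream join.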

03

Three methods, three patterns

Generating tags is the easy part. The interesting work was proving the tags carry signal — and surfacing the patterns inside that signal that were actually useful for ad-buy decisions. Three methods did most of the lifting, each catching something the others would have missed:

  • K-means → six stable demand archetypes. The clean partition of the catalogue. Books that grouped together had genuinely similar performance profiles, not just similar metadata.
  • HDBSCAN sub-clustering → a hidden 5.28× ROAS pocket. Inside one of the K-means archetypes, HDBSCAN's density-based view caught a small sub-cluster — n = 12 books — whose ad performance was 5.28× the parent group's average. K-means' spherical-cluster assumption averaged it out at the parent level. (Open question worth probing: parameter sensitivity to min_cluster_size — addressed in the appendix on request.)
  • SHAP → which tags actually moved performance. Attribution against a Random Forest told us which tag values were doing the predictive work, separated from the noise of co-occurring features. That's the column that ad-ops uses to filter the catalogue today.

Three further methods ran as cross-checks, each targeting a specific weakness of the lead methods rather than restating them:

  • Bootstrap CIs on the partition's effect sizes — does the result hold under resampling, or did we just get lucky on the original split?
  • Kruskal–Wallis as a distribution-free counterpart — the partition shouldn't depend on the ROAS distributions being Gaussian (which, for ad performance, they aren't).
  • PCA on the tag space — would a linear projection surface the same archetypes K-means' Euclidean distance found, or does the structure depend on the spherical-cluster assumption?
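The bootstrap cross-check is the simplest of the three to show concretely. This is a percentile-bootstrap sketch on synthetic, deliberately skewed (lognormal) ROAS samples — the sample sizes and distribution parameters are illustrative, not the real campaign numbers — asking whether the pocket-vs-parent uplift survives resampling:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical ROAS samples: a parent archetype and a small pocket inside it.
# Lognormal on purpose: ad performance is skewed, not Gaussian.
parent_roas = rng.lognormal(mean=0.0, sigma=0.6, size=200)
pocket_roas = rng.lognormal(mean=1.6, sigma=0.3, size=12)  # n = 12 pocket

def bootstrap_ci(pocket, parent, n_boot=10_000, alpha=0.05, rng=rng):
    """Percentile-bootstrap CI for the pocket-vs-parent mean ROAS ratio."""
    ratios = np.empty(n_boot)
    for i in range(n_boot):
        p = rng.choice(pocket, size=pocket.size, replace=True)
        q = rng.choice(parent, size=parent.size, replace=True)
        ratios[i] = p.mean() / q.mean()
    return np.quantile(ratios, [alpha / 2, 1 - alpha / 2])

lo, hi = bootstrap_ci(pocket_roas, parent_roas)
# If the whole interval sits well above 1.0, the uplift holds under
# resampling rather than being luck on the original split.
```

The same pocket/parent samples feed the Kruskal–Wallis check; the bootstrap asks "is the effect stable?", the rank test asks "is it there at all without a normality assumption?".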

The exploratory pass also touched ANOVA, Cramér's V, Apriori, tag co-occurrence lift, WRAcc, and a vanilla decision tree, but their assumptions overlap too heavily with either the lead methods or the cross-checks above to count as truly independent. I mention them only because every line in this writeup gets fact-checked.

04

From sub-cluster to ad buy

The HDBSCAN sub-cluster wasn't academically interesting on its own — it became useful when we asked "what do these books look like, and which other titles in the catalogue match the same fingerprint?"

The answer was 189 candidate ad titles, surfaced by requiring agreement across the three lead methods and grouped into actionable buckets the ad-ops team could put against actual retailer slots. The output is a normal Snowflake table — no separate UI, no model-in-the-loop decision-making. Engineers and analysts query it the same way they query the rest of the warehouse.
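The consensus step itself is deliberately boring. A sketch of the voting logic, with hypothetical ISBN sets standing in for the per-method candidate lists pulled from the Snowflake tag table:

```python
from collections import Counter

# Hypothetical candidate sets, one per lead method; in production these are
# ISBN lists queried from the warehouse.
kmeans_hits = {"9780000000001", "9780000000002", "9780000000003"}
hdbscan_hits = {"9780000000002", "9780000000003", "9780000000004"}
shap_hits = {"9780000000002", "9780000000003", "9780000000005"}

votes = Counter()
for hits in (kmeans_hits, hdbscan_hits, shap_hits):
    votes.update(hits)

# Strict consensus: a title qualifies only when all three methods flag it.
consensus = sorted(isbn for isbn, n in votes.items() if n == 3)
```

Keeping the logic this simple is the point: the result is a plain table column, so there's nothing for ad-ops to trust beyond the methods that voted.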

05

Outcome

Books tagged in production
~13K
ROAS pocket (n = 12 books) surfaced by HDBSCAN
5.28×
Candidate ad titles surfaced via 3-method consensus
189