An investment memo writer for VC partners.
BHI partners spend the start of every deal cycle in the same loop: pull data from the CRM, dig through web research, skim a deck, and then draft a 12-section memo from scratch. I built the internal tool that compresses that loop — without letting the model invent things the partner didn't sign off on. It's used daily by the partner team on active deals.
- Role: Sole engineer · stakeholder discovery → build → roll-out
- Year: 2025
- Status: In production
Problem
A first-pass memo took roughly two days of partner time. Most of that wasn't writing — it was locating, reconciling, and re-typing the same facts that already existed in three different systems. By the time the partner got to the memo itself, half the energy was spent on plumbing.
The obvious move — "point an LLM at the corpus and ask for a memo" — was the wrong one. It produces a confident draft that quietly contradicts the partner's judgment. In a setting where a single hallucinated number can sink a deal review, that's worse than no tool at all.
The architectural rule
The rule: the generator never reads the raw corpus; it reads only the partner-verified fact sheet. That single rule is what makes the tool trustworthy. The ingest pipeline does the heavy lifting on the corpus — extracting, deduplicating, classifying, surfacing — but the output of that pipeline is a structured fact sheet, not free text. The partner edits the fact sheet. The generator only sees the version the partner signed off on.
The result is that the model can't contradict the partner's edits, because by the time it runs, the partner's edits are the only thing it can see.
Ingest pipeline
Four input types feed into the same normalised representation (sketched after this list):
- CRM records — company entity, deal history, prior interactions.
- Internal knowledge base — partner notes, prior memos, sector research.
- Live web search via Tavily — funding announcements, news, press, with provenance retained on every extracted fact.
- Uploaded documents — pitch decks, founder emails, data rooms — parsed and chunked alongside everything else.
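A minimal sketch of what that shared representation could look like — the field names here are illustrative, not the production schema:

```python
from dataclasses import dataclass
from enum import Enum


class SourceType(str, Enum):
    CRM = "crm"
    KNOWLEDGE_BASE = "knowledge_base"
    WEB_SEARCH = "web_search"
    UPLOAD = "upload"


@dataclass(frozen=True)
class Chunk:
    """One normalised unit of evidence, regardless of where it came from."""
    chunk_id: str
    source_type: SourceType
    provenance: str  # e.g. CRM record ID, URL, or file name + page
    text: str
```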
Extraction is done with Pydantic structured outputs. Every field on the fact sheet has a defined type and an optional source citation back to the chunk it was lifted from. If the model can't produce a valid value, it returns null rather than guessing — which surfaces the gap to the partner instead of papering over it.
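A sketch of the extraction schema, assuming Pydantic v2; the specific business fields are illustrative:

```python
from typing import Optional
from pydantic import BaseModel, Field


class Fact(BaseModel):
    """One extracted value plus the chunk it was lifted from."""
    value: Optional[str] = None             # null when the model can't support a value
    source_chunk_id: Optional[str] = None   # citation back to the ingest chunk
    quote: Optional[str] = None             # verbatim supporting span, if any


class FactSheet(BaseModel):
    """The structured artefact the partner reviews and edits."""
    company_name: Fact
    round_size: Fact
    valuation: Fact
    arr: Fact = Field(description="Annual recurring revenue, as stated in the source")
    # ... one typed field per memo-relevant fact
```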
Generation
Generation is a single call to Gemini Pro per memo, not a chain. The prompt assembles three things in a fixed order: (1) the verified fact sheet, (2) a narrative profile of the company written from the fact sheet, and (3) sector-level background research as supporting context. The corpus itself is deliberately absent.
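Roughly, the assembly step looks like this (a sketch with hypothetical helper names; the three blocks and their fixed order are the load-bearing part):

```python
def assemble_memo_prompt(fact_sheet: str, company_profile: str, sector_research: str) -> str:
    """Fixed-order prompt: verified facts first, narrative profile second,
    sector background last. The raw corpus is never included."""
    return "\n\n".join([
        "## Verified fact sheet (partner-approved; treat as ground truth)",
        fact_sheet,
        "## Company profile (written from the fact sheet)",
        company_profile,
        "## Sector background (supporting context only)",
        sector_research,
        "## Task",
        "Write a 12-section investment memo using only the facts above.",
    ])
```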
Per-section regeneration is the main interaction pattern. The partner can highlight a section, give feedback in plain language ("tighten the moat argument", "this paragraph contradicts the team section"), and only that section is regenerated against the same fact sheet — so a tweak doesn't reshuffle the whole document.
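Per-section regeneration can then be a second, narrower call against the same fact sheet. A sketch (function names hypothetical):

```python
def regenerate_section(
    llm_call,                  # e.g. a thin wrapper around the Gemini API
    fact_sheet: str,
    memo_sections: dict[str, str],
    section_name: str,
    feedback: str,
) -> dict[str, str]:
    """Regenerate exactly one section, leaving every other section untouched."""
    prompt = "\n\n".join([
        "## Verified fact sheet",
        fact_sheet,
        f"## Current draft of section '{section_name}'",
        memo_sections[section_name],
        "## Partner feedback",
        feedback,
        "Rewrite only this section. Do not introduce facts absent from the fact sheet.",
    ])
    return {**memo_sections, section_name: llm_call(prompt)}
```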
Final output exports to Word with the firm's template applied, so the artefact that leaves the tool looks like the artefact the partners already know how to read.
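The export step can be as simple as writing into a copy of the firm's .docx template so its styles carry over. A sketch using python-docx (the template filename is hypothetical):

```python
from docx import Document


def export_memo(memo_sections: dict[str, str], out_path: str) -> None:
    """Write the memo into the firm's Word template so house styles apply."""
    doc = Document("bhi_memo_template.docx")  # hypothetical template file
    for title, body in memo_sections.items():
        doc.add_heading(title, level=1)       # picks up the template's Heading 1 style
        doc.add_paragraph(body)
    doc.save(out_path)
```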
The thing I'd ship next
Today the LLM picks which source each fact came from. That's a model decision where it could be a deterministic lookup. The v2 design: have the LLM output only the value and an exact quote — never a source name. Then a substring/fuzzy gate finds which chunk actually contains that quote. Source becomes derived, not chosen. Free hallucination check: no chunk contains the quote → fact is flagged unverified. Same number of LLM calls, ~30 lines of Python, eliminates the model's biggest discretionary decision in the pipeline.
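A sketch of that gate, assuming the Chunk dataclass from the ingest sketch above; the thresholds are guesses to tune:

```python
import difflib


def find_source(quote: str, chunks: list[Chunk], min_ratio: float = 0.9) -> str | None:
    """Derive the source chunk for an extracted quote.
    Exact substring first; fuzzy fallback; None means flag as unverified."""
    needle = " ".join(quote.split()).lower()
    # Pass 1: exact substring containment.
    for chunk in chunks:
        if needle in " ".join(chunk.text.split()).lower():
            return chunk.chunk_id
    # Pass 2: fuzzy match against sliding windows of comparable length.
    best_id, best_ratio = None, 0.0
    window = len(needle)
    for chunk in chunks:
        text = " ".join(chunk.text.split()).lower()
        for start in range(0, max(1, len(text) - window + 1), max(1, window // 2)):
            ratio = difflib.SequenceMatcher(None, needle, text[start:start + window]).ratio()
            if ratio > best_ratio:
                best_id, best_ratio = chunk.chunk_id, ratio
    return best_id if best_ratio >= min_ratio else None
```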
Eval strategy
The single biggest gap today is the absence of a regression eval harness — the partner catches errors at runtime, but a prompt change that subtly breaks extraction could ship and nobody would know. Five layers I'd build, in order:
- Citation accuracy — reference-free substring/fuzzy match against actual chunks. Cheapest, no labelling needed, immediate empirical answer to "how often does the LLM hallucinate citations?" (If you build the gate above, this eval becomes the gate itself.)
- Fact extraction — golden facts per company, tracked as F1 over time per field_id. We already have the data — every partner-verified session in Postgres is a labelled example.
- Memo structural rubric — deterministic, runs on every generated memo, hard-fails CI if a section is missing (sketched after this list).
- Memo faithfulness — LLM judge (different model family) checks every numerical claim against the verified facts.
- Holistic preference vs golden human-written memos — pairwise judging is more reliable than absolute scoring. Doesn't run in CI; informs prompt iteration.
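The structural rubric is the only fully deterministic layer, so it is the natural CI gate. A minimal sketch — the section names are an illustrative 12-section skeleton, not the firm's real template:

```python
import json
import sys

REQUIRED_SECTIONS = [
    "Executive summary", "Team", "Market", "Product", "Moat",
    "Traction", "Business model", "Competition", "Deal terms",
    "Risks", "Exit paths", "Recommendation",
]  # illustrative 12-section skeleton


def check_memo_structure(memo_sections: dict[str, str]) -> list[str]:
    """Return a list of rubric violations; an empty list means pass."""
    problems = [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in memo_sections]
    problems += [f"empty section: {s}" for s, body in memo_sections.items() if not body.strip()]
    return problems


if __name__ == "__main__":
    memo = json.load(open(sys.argv[1]))     # {"section title": "body", ...}
    violations = check_memo_structure(memo)
    if violations:
        print("\n".join(violations))
        sys.exit(1)  # hard-fail CI
```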
Outcome
- ~24×: first-pass memo prep, ~2 days → ~2 hrs (partner self-report)
- 12: memo sections, all generated in a single call
- Daily: used by BHI partners on every deal cycle
The interesting result wasn't the speedup — it was that partners started using the tool for the boring parts (fact sheet assembly) even when they planned to write the memo themselves. The ingest pipeline turned out to be more valuable than the generator. It's now part of the standard deal-cycle workflow.