Case Study Apr 2025 — Feb 2026

PivotPoint Insights

An AI-powered business intelligence platform built to collect, process, and structure financial data from across the web — then feed it into proprietary analyses that gave executives a clear picture of the external forces shaping their business.


The Pipeline

01 — Collect

Data Ingestion

A multi-tier web crawler collected documents from Fortune 1000 investor relations pages, SEC EDGAR filings, ESG reports, earnings calls, governance documents, and press releases — automatically, at scale.

02 — Structure

Processing & Output

Raw documents were classified, deduplicated, date-stamped with confidence scoring, and compiled into structured Excel outputs — clean, consistent, and ready for machine consumption.
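The deduplication step can be sketched with standard-library fuzzy matching (the threshold, record shape, and function names here are illustrative, not the production values):

```python
from difflib import SequenceMatcher

def is_duplicate(title_a: str, title_b: str, threshold: float = 0.9) -> bool:
    """Fuzzy-match normalised titles; near-identical documents count as duplicates."""
    a, b = title_a.lower().strip(), title_b.lower().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold

def dedupe(records: list[dict]) -> list[dict]:
    """Keep the first occurrence of each fuzzy-duplicate group."""
    kept: list[dict] = []
    for rec in records:
        if not any(is_duplicate(rec["title"], k["title"]) for k in kept):
            kept.append(rec)
    return kept

docs = [
    {"title": "Q3 2024 Earnings Call Transcript"},
    {"title": "Q3 2024 Earnings Call Transcript "},  # trailing-space duplicate
    {"title": "2024 Annual Report (10-K)"},
]
print(len(dedupe(docs)))  # 2
```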

03 — Analyse

Proprietary BI Analyses

The structured data fed into 20+ business intelligence analyses run by an LLM — producing executive-level documentation that mapped the external pressures acting on a given company.



Intelligent Multi-Tier Crawling

1000+
Fortune 1000 companies targeted
95%+
Document discovery rate
20+
External data sources
~95%
Date extraction accuracy
Tier 1 — ~70% of sites

BeautifulSoup — Static Parsing

Fast HTML parsing with sitemap-first crawling and deep recursive keyword filtering. Handles the majority of sites with no JavaScript required.
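The sitemap-first filtering step might look something like this sketch (the keyword list is illustrative, and the lenient `html.parser` backend stands in for the production parser configuration):

```python
from bs4 import BeautifulSoup

# Illustrative relevance keywords — the real crawler's filter list was larger.
KEYWORDS = ("investor", "sec-filings", "esg", "press-release", "governance")

def filter_sitemap(sitemap_xml: str) -> list[str]:
    """Parse a sitemap and keep only URLs that match relevance keywords."""
    soup = BeautifulSoup(sitemap_xml, "html.parser")  # lenient enough for sitemap XML
    urls = [loc.text.strip() for loc in soup.find_all("loc")]
    return [u for u in urls if any(k in u.lower() for k in KEYWORDS)]

sitemap = """<?xml version="1.0"?>
<urlset>
  <url><loc>https://example.com/investor-relations/</loc></url>
  <url><loc>https://example.com/careers/</loc></url>
  <url><loc>https://example.com/press-releases/2024/</loc></url>
</urlset>"""
print(filter_sitemap(sitemap))
```

Checking the sitemap before recursively crawling pages keeps the common case cheap: one XML fetch often reveals every relevant URL up front.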

Tier 2 — ~25% of sites

Selenium — JavaScript Rendering

Full browser automation with bot-detection bypass, lazy-load support, auto-clicking "Load More" buttons, and iframe extraction for JS-heavy sites.

Tier 3 — ~5% of sites

Human-in-the-Loop — CAPTCHA

For sites requiring human verification, the system opens a visible browser, pauses for manual solving, then resumes automated collection.
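The pause-and-resume flow reduces to a small control pattern — detect the block, hand off to a human, retry. A minimal sketch (the function names and CAPTCHA check are hypothetical; the real system drove a visible Selenium browser):

```python
def collect_with_human_fallback(url: str, fetch, prompt=input) -> str:
    """Tier-3 flow: try an automated fetch; on a CAPTCHA page, pause for a
    human to solve it in the visible browser, then resume automatically."""
    page = fetch(url)
    if "captcha" in page.lower():  # crude detection — illustrative only
        prompt(f"CAPTCHA at {url} — solve it in the open browser, then press Enter: ")
        page = fetch(url)  # resume automated collection
    return page
```

Injecting `fetch` and `prompt` as parameters keeps the human step testable: in production `prompt` is `input`, while tests can pass a no-op.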


Structured Output for LLM Consumption

Document Types Collected

Financial & Regulatory Filings

SEC filings (10-K, 10-Q, 8-K, Proxy), earnings call transcripts, investor presentations, ESG and sustainability reports, governance charters and bylaws, and press releases — automatically classified and categorised per company.
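Classification at this stage can be as simple as keyword matching on titles and URLs. A sketch of the idea (the marker lists are illustrative, not the production rules):

```python
# Ordered mapping from document type to title/URL markers (illustrative).
DOC_TYPES = {
    "10-K": ("10-k", "annual report"),
    "10-Q": ("10-q", "quarterly report"),
    "8-K": ("8-k", "current report"),
    "Proxy": ("def 14a", "proxy statement"),
    "Earnings Call": ("earnings call", "transcript"),
    "ESG Report": ("esg", "sustainability"),
    "Governance": ("charter", "bylaws", "governance"),
    "Press Release": ("press release", "news release"),
}

def classify(title: str) -> str:
    """Return the first document type whose markers appear in the title."""
    t = title.lower()
    for doc_type, markers in DOC_TYPES.items():
        if any(m in t for m in markers):
            return doc_type
    return "Other"

print(classify("Q2 2025 Earnings Call Transcript"))  # Earnings Call
print(classify("2024 Sustainability Report"))        # ESG Report
```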

Output Format

LLM-Ready Excel Sheets

All collected data was compiled into structured, deduplicated Excel outputs with company metadata, document URLs, types, categories, dates, and confidence scores — formatted to be both human-legible and directly consumable by an LLM.
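The output schema can be sketched with pandas and openpyxl (the column names and sample record are illustrative, not the exact production schema):

```python
import pandas as pd

# One row per discovered document — illustrative record shape.
records = [
    {"company": "Acme Corp", "doc_type": "10-K",
     "category": "Financial & Regulatory", "date": "2025-02-14",
     "confidence": 0.97, "url": "https://example.com/ir/10k.pdf"},
]

COLUMNS = ["company", "doc_type", "category", "date", "confidence", "url"]
df = pd.DataFrame(records, columns=COLUMNS)
df["date"] = pd.to_datetime(df["date"])  # real dates sort; strings don't

# One sheet, fixed column order — legible to a human, trivially parseable by an LLM.
df.to_excel("collected_documents.xlsx", index=False, sheet_name="Documents")
```

A fixed column order and typed date column is what makes the file "LLM-ready": the downstream prompt can reference columns by name without per-file cleanup.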

Intelligence Layer

20+ Proprietary BI Analyses

The structured data fed a suite of business intelligence analyses covering external pressures, market dynamics, regulatory exposure, ESG positioning, and competitive landscape — producing executive-level strategic documentation.

Reliability

Built to Keep Running

Auto-save checkpoints every 10 companies, resume-from-interruption support, exponential backoff on rate limiting, and comprehensive error logging — designed to run unattended at scale without losing progress.
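The checkpoint-and-backoff machinery reduces to a few small pieces. A sketch under assumed names (the file name, retry limits, and helper signatures are illustrative):

```python
import json
import random
import time
from pathlib import Path

CHECKPOINT = Path("crawl_checkpoint.json")  # illustrative location

def load_checkpoint() -> set[str]:
    """Companies already processed in a previous (possibly interrupted) run."""
    if CHECKPOINT.exists():
        return set(json.loads(CHECKPOINT.read_text())["done"])
    return set()

def save_checkpoint(done: set[str]) -> None:
    CHECKPOINT.write_text(json.dumps({"done": sorted(done)}))

def with_backoff(fn, retries: int = 5, base: float = 1.0):
    """Retry fn with exponential backoff plus jitter (e.g. on HTTP 429)."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries — surface the error
            time.sleep(base * 2 ** attempt * (1 + random.random()))

def crawl_all(companies, crawl_one):
    """Skip already-done companies; checkpoint every 10 so a crash loses little."""
    done = load_checkpoint()
    for name in companies:
        if name in done:
            continue
        with_backoff(lambda: crawl_one(name))
        done.add(name)
        if len(done) % 10 == 0:
            save_checkpoint(done)
    save_checkpoint(done)  # final flush
```

Restart behaviour falls out for free: rerunning `crawl_all` after an interruption reloads the checkpoint and picks up at the first unprocessed company.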


Built From Scratch

I came into this project with no formal business knowledge and no prior experience with APIs. What I had was a problem to solve and the willingness to figure it out.

Starting from zero, I taught myself how to interface with external APIs, navigate rate limits, handle authentication, and pull structured data from sources that weren't designed to be machine-readable. I built the entire data collection system — the multi-tier crawler, the classification logic, the deduplication pipeline, and the output formatting — independently.

The part I'm most proud of isn't any single technical piece. It's that I produced something production-grade and genuinely useful — clean, structured data that another system could pick up and immediately run with. The output didn't just work; it was legible to humans and consumable by an LLM without any manual cleanup.

That gap — between raw web data and something an AI can reason about — is harder to close than it looks. I closed it.


Stack

Python BeautifulSoup4 Selenium Undetected ChromeDriver SEC EDGAR API REST APIs pandas openpyxl lxml LLM Integration JSON-LD / schema.org Confidence Scoring Fuzzy Deduplication Checkpoint / Resume Excel Output