AI & Data

SaaSScout: A Grounded RAG Copilot for SaaS Evaluation.

Independent Project · Mar 2026 - May 2026

A production RAG copilot for SaaS evaluation. Indexes 335 products and 4,899 review chunks into four partitioned Chroma vector collections, ranks retrieval candidates across six signals (feature-fit, pricing, review sentiment, provenance trust, query alignment, category overlap), then generates grounded recommendations via a provider-neutral LLM layer — Groq Qwen 32B online, Ollama Qwen 2.5 local, or a deterministic grounded template when both are unavailable. The architecture ensures the LLM never invents a fact the retrieval layer did not supply.

335 products indexed

4,899 review chunks in Chroma

6-signal retrieval ranking

3-tier LLM fallback chain

Skills used

RAG, FastAPI, React, TypeScript, Vite, Tailwind CSS, Chroma, Vector Search, LLM, Groq, Ollama, Production Deployment, Netlify, Render, Data Engineering

chapter 01 —

A five-phase RAG pipeline for procurement-grade answers.

SaaSScout is built around a retrieval-augmented generation (RAG) pipeline that turns a natural-language SaaS query into a grounded, evidence-backed recommendation. The pipeline runs in five phases.

(1) Ingestion: five data sources are normalized, canonically joined on product names, and partitioned into separate Chroma vector collections by trust level.

(2) Indexing: each collection is embedded using Chroma's default embedding model; a TF-IDF fallback index is pre-built for when the vector store is cold or unavailable.

(3) Retrieval: incoming queries hit all four evidence lanes in parallel, each lane returning its top-k matches by cosine similarity, filtered by category and metadata constraints.

(4) Ranking: retrieved candidates are scored across six signals before the LLM sees anything: feature-fit coverage, pricing match, review sentiment, provenance trust, query-keyword alignment, and category overlap. The six-signal composite score determines which evidence surfaces in the LLM prompt and which tools lead the comparison.

(5) Generation: ranked evidence is assembled into a structured prompt and passed to Groq Qwen 32B (online), Ollama Qwen 2.5 1.5B (local), or a deterministic grounded template if both are unavailable. The LLM's role is assembly and narration: it formats the evidence it was handed and never supplements it with training-set guesses.
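The retrieval phase can be illustrated with a small standard-library sketch. It is not the project's code: the lane names, the two-dimensional toy vectors, the "Tool A/B/C" rows, and the helper names `query_lane` and `retrieve` are all assumptions made for illustration, and a real deployment would delegate similarity search to Chroma rather than compute cosine by hand.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def query_lane(lane, rows, query_vec, category, k):
    """Filter one lane by category, score by cosine similarity, keep the top k."""
    hits = [
        {"lane": lane, "product": r["product"], "score": cosine(query_vec, r["vec"])}
        for r in rows
        if r["category"] == category
    ]
    return sorted(hits, key=lambda h: h["score"], reverse=True)[:k]


def retrieve(lanes, query_vec, category, k=2):
    """Query every lane concurrently and keep results grouped by lane."""
    with ThreadPoolExecutor(max_workers=len(lanes)) as pool:
        futures = {
            lane: pool.submit(query_lane, lane, rows, query_vec, category, k)
            for lane, rows in lanes.items()
        }
        return {lane: future.result() for lane, future in futures.items()}


# Hypothetical toy data: two lanes, made-up products, 2-d "embeddings".
LANES = {
    "reviews": [
        {"product": "Tool A", "category": "helpdesk", "vec": [1.0, 0.0]},
        {"product": "Tool B", "category": "helpdesk", "vec": [0.6, 0.8]},
        {"product": "Tool C", "category": "crm", "vec": [1.0, 0.0]},
    ],
    "metadata": [
        {"product": "Tool B", "category": "helpdesk", "vec": [0.0, 1.0]},
    ],
}

results = retrieve(LANES, [1.0, 0.0], "helpdesk")
```

Keeping the results keyed by lane, rather than merging them into one list, is what lets a later ranking step weigh provenance and lets the UI label each hit by origin.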

chapter 02 —

Why procurement is the worst use case for generic chat — and the best for RAG.

SaaS evaluation is a research task: analysts open ten browser tabs, cross-reference feature pages against pricing plans, scan review sites for recurring pain points, and summarize into a recommendation. Generic ChatGPT-style chat collapses that into one confident-sounding answer where the citations are often invented and there is no way to tell which features were verified versus fabricated. The procurement-specific failure mode is expensive: a team selects the wrong tool, spends a quarter integrating it, then finds that the "enterprise SSO" the chatbot confirmed does not exist in the pricing tier they bought. RAG solves this by separating knowledge from generation. The retrieval layer owns the facts: 335 real products, 4,899 review chunks, FactGrid enterprise metadata, Wikidata vendor data, OpenAlternative open-source discovery. The generation layer only assembles and narrates what the retrieval layer hands it. The LLM never touches a query without already having ranked, sourced evidence in the prompt. That constraint is what makes the output auditable: every claim traces to a retrieved row, every gap is labeled as a gap, every recommendation cites the composite score that drove it. SaaSScout is a working argument that AI for high-stakes work must be built on evidence architecture, not on prompt engineering around an LLM's stale training data.

chapter 03 —

Product decisions.

The strategic calls behind the prototype and the reasoning each one rests on.

Retrieval-Augmented Generation (RAG) —

Index real product knowledge instead of prompting around ignorance.

A vanilla LLM prompt for "compare Zendesk and Freshdesk" produces a plausible-sounding feature matrix drawn from training data that may be months or years stale. SaaSScout's answer is to not ask the LLM about products at all. Instead, 335 real product records are ingested, canonically normalized, and embedded into Chroma before any user query arrives. The LLM receives a prompt that already contains retrieved, ranked evidence — its role is to assemble readable output, not to recall facts. If the data does not support a claim, the claim does not appear. The RAG constraint is the product's entire trust argument.

Evidence lane architecture —

Partition embeddings into separate Chroma collections by source trust.

The default RAG approach drops all documents into one vector collection and lets semantic search surface the best match. The problem is that user reviews, enterprise metadata, and vendor-reported features have very different trust profiles: Capterra reviews are subjective but reveal real pain points; FactGrid metadata is verified but narrow; Wikidata vendor facts are CC0 and reliable but sparse; OpenAlternative surfaces open-source options that commercial indexes miss. Merging them into one corpus lets high-volume review text dominate cosine similarity, drowning out the structured metadata signals. SaaSScout keeps four separate Chroma collections — one per lane — so retrieval is independently tuned per source and output is labeled by origin. A TF-IDF fallback index mirrors each collection so retrieval degrades gracefully when the vector store is cold.
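The TF-IDF fallback idea can be sketched in pure Python. This is a minimal illustration, not the project's implementation: the `TfidfIndex` class, the `search_lane` wrapper, the smoothed IDF formula, and the sample review snippets are all assumptions, and the "vector store" here is just a callable that may raise.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class TfidfIndex:
    """A tiny per-lane TF-IDF index, built once and queried without a vector store."""

    def __init__(self, lane, docs):
        self.lane = lane
        self.docs = docs
        self.counts = [Counter(tokenize(d)) for d in docs]
        doc_freq = Counter(term for c in self.counts for term in c)
        n = len(docs)
        # Smoothed IDF (an assumed choice) so no term gets a zero weight.
        self.idf = {t: math.log((1 + n) / (1 + f)) + 1.0 for t, f in doc_freq.items()}

    def _weights(self, counts):
        return {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}

    def search(self, query, k=3):
        q = self._weights(Counter(tokenize(query)))
        q_norm = math.sqrt(sum(w * w for w in q.values()))
        hits = []
        for doc, counts in zip(self.docs, self.counts):
            d = self._weights(counts)
            d_norm = math.sqrt(sum(w * w for w in d.values()))
            dot = sum(w * d.get(t, 0.0) for t, w in q.items())
            score = dot / (q_norm * d_norm) if q_norm and d_norm else 0.0
            if score > 0:
                hits.append({"lane": self.lane, "text": doc, "score": score})
        return sorted(hits, key=lambda h: h["score"], reverse=True)[:k]


def search_lane(vector_search, fallback, query, k=3):
    """Try the vector store first; on any failure, answer from the TF-IDF mirror."""
    try:
        return vector_search(query, k)
    except Exception:
        return fallback.search(query, k)


def vector_store_down(query, k):
    # Stand-in for a cold or unreachable vector store.
    raise ConnectionError("vector store is cold")


# Hypothetical review snippets for one lane.
INDEX = TfidfIndex("reviews", [
    "slow ticket routing and weak reporting",
    "great sso and audit logs",
    "pricing jumps at the enterprise tier",
])
fallback_hits = search_lane(vector_store_down, INDEX, "enterprise pricing")
```

Because each hit carries its lane name, output from the fallback path can be labeled by origin in the same way as output from the vector path.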

Multi-signal pre-generation ranking —

Score six signals before the LLM sees a single candidate.

Semantic similarity alone is a weak procurement ranking signal. A tool can embed close to the query keyword "enterprise ticketing" while failing on pricing, missing required features, or carrying uniformly negative reviews. SaaSScout scores every retrieval candidate across six dimensions before assembling the LLM prompt: feature-fit coverage (what fraction of required features are confirmed), pricing match (distance from the stated budget), review sentiment (aggregated Capterra rating), provenance trust (FactGrid weighted highest, then Wikidata, then reviews, then alternatives), query-keyword alignment (TF-IDF overlap with the raw query string), and category overlap (primary category match). The composite score governs prompt slot assignment — higher-ranked tools get more evidence real estate — so the generation step is biased toward the strongest candidates before any text is written.
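One way to picture the composite score is a weighted sum over the six signals. The sketch below is an assumption-heavy illustration: the weights, the numeric trust values per lane, the linear pricing penalty, and the treatment of a missing rating as zero are all hypothetical choices, since the source names the signals and the trust ordering but not the formula.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Hypothetical weights; they sum to 1.0 so the composite stays in [0, 1].
WEIGHTS = {
    "feature_fit": 0.30, "pricing": 0.20, "sentiment": 0.15,
    "provenance": 0.15, "keyword": 0.10, "category": 0.10,
}
# Ordering follows the text (FactGrid, Wikidata, reviews, alternatives);
# the numeric values are assumptions.
LANE_TRUST = {"factgrid": 1.0, "wikidata": 0.75, "reviews": 0.5, "alternatives": 0.25}


@dataclass(frozen=True)
class Candidate:
    name: str
    confirmed_features: FrozenSet[str]
    monthly_price: Optional[float]  # None when pricing is not public
    rating: Optional[float]         # aggregated review rating on a 0-5 scale
    lane: str
    keyword_overlap: float          # 0-1 overlap with the raw query string
    category: str


def signals(c, required, budget, category):
    """Return the six signals, each normalized to [0, 1]. Assumes budget > 0."""
    if c.monthly_price is None:
        pricing = 0.0  # unknown price cannot be matched against a budget
    elif c.monthly_price <= budget:
        pricing = 1.0
    else:
        pricing = max(0.0, 1.0 - (c.monthly_price - budget) / budget)
    return {
        "feature_fit": len(required & c.confirmed_features) / len(required) if required else 1.0,
        "pricing": pricing,
        "sentiment": (c.rating or 0.0) / 5.0,
        "provenance": LANE_TRUST.get(c.lane, 0.0),
        "keyword": c.keyword_overlap,
        "category": 1.0 if c.category == category else 0.0,
    }


def composite(c, required, budget, category):
    s = signals(c, required, budget, category)
    return sum(WEIGHTS[name] * s[name] for name in WEIGHTS)


def rank(candidates, required, budget, category):
    return sorted(candidates, key=lambda c: composite(c, required, budget, category), reverse=True)


# Made-up candidates for illustration only.
REQUIRED = frozenset({"sso", "sla"})
TOOL_A = Candidate("Tool A", frozenset({"sso", "sla"}), 25.0, 4.5, "factgrid", 0.5, "helpdesk")
TOOL_B = Candidate("Tool B", frozenset({"sso"}), 45.0, 4.0, "reviews", 0.9, "helpdesk")
ranked = rank([TOOL_B, TOOL_A], REQUIRED, budget=30.0, category="helpdesk")
```

In this toy run Tool B has the stronger keyword overlap but still ranks second, because it misses a required feature and exceeds the budget. Every term of the score is a named, inspectable signal.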

Epistemic honesty in output design —

Treat missing evidence as a procurement finding, not a model failure.

When the retrieval layer returns no Wikidata record for a vendor, or no pricing data for an enterprise tier, the typical AI response is to either hallucinate a placeholder or silently omit the row. Both are wrong for procurement. SaaSScout renders missing evidence as an explicit cell in the evidence table, tagged with the lane it came from. A blank pricing cell is a procurement red flag — it usually signals "contact sales" or a non-public tier — and surfacing it as a gap gives the analyst the right signal: this is something you must verify before buying. The design rule is that absence of evidence is itself evidence, and the product makes it visible rather than papering over it.
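The "missing is a cell, not an omission" rule can be sketched as a small table builder. The field names, lane keys, the `MISSING` label, and the sample record below are hypothetical; the point is only that every requested field yields a row, whether or not evidence was retrieved.

```python
MISSING = "Missing"


def evidence_rows(product, fields, lanes):
    """Build one row per requested (field, lane) pair.

    A field with no retrieved value still produces a row, flagged as a gap
    and tagged with the lane that was expected to supply it.
    """
    rows = []
    for field, lane in fields:
        record = lanes.get(lane, {}).get(product) or {}
        value = record.get(field)
        is_gap = value is None or value == ""
        rows.append({
            "field": field,
            "lane": lane,
            "value": MISSING if is_gap else value,
            "gap": is_gap,
        })
    return rows


# Hypothetical retrieved evidence: one lane has a partial record, one has nothing.
LANES = {
    "factgrid": {"Tool A": {"sla": "99.9%", "pricing_check": None}},
    "wikidata": {},
}
FIELDS = [("sla", "factgrid"), ("pricing_check", "factgrid"), ("official_website", "wikidata")]
rows = evidence_rows("Tool A", FIELDS, LANES)
gaps = [r["field"] for r in rows if r["gap"]]
```

Returning the gaps as data, rather than dropping them, also gives downstream steps something concrete to turn into follow-up procurement checks.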

Provider-neutral LLM layer: Groq Qwen 32B → Ollama Qwen 2.5 1.5B → deterministic template.

Tying a RAG pipeline to one LLM provider is a reliability risk. Groq's hosted Qwen 32B is fast (sub-2-second inference on the structured prompt) but rate-limited under demo traffic. Ollama's local Qwen 2.5 1.5B is available offline and avoids API costs, but it is slower and quantized. The pipeline checks availability in order: Groq first, Ollama second, deterministic grounded template last. The template fallback is not a degraded mode where grounding is concerned: it uses the same six-signal ranking and the same retrieved evidence, formats them into the same scorecard and recommendation memo, and delivers a procurement-ready output without any LLM call. Because the LLM is positioned as an assembler and narrator rather than a knowledge source, the fallback degrades in fluency but never in factual grounding.
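The ordered fallback can be sketched as a loop over provider callables that ends in a deterministic renderer. Nothing below calls Groq or Ollama: the `ProviderUnavailable` exception, the stub providers, and the prompt and template wording are assumptions chosen to keep the example self-contained.

```python
class ProviderUnavailable(Exception):
    """Raised by a provider wrapper when it is rate-limited or unreachable."""


def build_prompt(evidence):
    lines = [f"- {e['product']}: {e['claim']} [{e['lane']}]" for e in evidence]
    return "Use only the evidence below.\n" + "\n".join(lines)


def render_template(evidence):
    """Deterministic output built from the same evidence, with no LLM call."""
    return "\n".join(f"{e['product']}: {e['claim']} (source: {e['lane']})" for e in evidence)


def generate(evidence, providers):
    """Try each (name, callable) provider in order; fall back to the template.

    Returns (tier_name, text) so the caller can show which tier answered.
    """
    prompt = build_prompt(evidence)
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderUnavailable:
            continue  # move on to the next tier
    return "template", render_template(evidence)


# Stub providers standing in for the hosted and local models.
def offline(prompt):
    raise ProviderUnavailable("rate limited or not running")


def echo_local(prompt):
    return "Summary based on: " + prompt


EVIDENCE = [{"product": "Tool A", "claim": "SSO confirmed on the enterprise tier", "lane": "factgrid"}]
both_down = generate(EVIDENCE, [("groq", offline), ("ollama", offline)])
local_up = generate(EVIDENCE, [("groq", offline), ("ollama", echo_local)])
```

Every tier receives the same evidence list, so switching tiers changes the wording of the output but not the facts available to it.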

Packaged artifact, same-origin proxy, and scheduled smoke monitor for zero-dollar reliability.

Three infrastructure decisions compound into production reliability on a free-tier stack. The processed Chroma index and normalized data are packaged as a versioned zip in a GitHub Release and pulled at backend startup via DATA_ARTIFACT_URL, dropping cold-start rebuild time from 5 minutes (Kaggle re-download plus Chroma rebuild) to roughly 30 seconds. Frontend API calls route through a Netlify /api/* edge proxy instead of cross-origin requests, removing CORS preflight round trips and enabling same-origin caching. A scheduled GitHub Actions workflow hits /health, /api/status, and a low-cost template /api/analyze to keep the Render instance warm during business hours and, on failure, reports the first failing layer: Netlify, Render, Chroma, enrichment, or analyze. The combination delivers observable, warm, reproducible infrastructure at $0.
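The "first failing layer" report can be sketched as an ordered series of checks over an injected `fetch` function, which keeps the logic testable without network access. The mapping from endpoints to layers is an assumption, as are the status keys `chroma_ready` and `enrichment_ready` and the `fake_fetch` helper; the real monitor's payloads are not described in the text.

```python
from typing import Callable, Optional, Tuple


def _get(fetch, path) -> Tuple[Optional[int], dict]:
    """Call fetch(path) -> (status_code, json_dict); map connection errors to (None, {})."""
    try:
        return fetch(path)
    except OSError:
        return None, {}


def first_failing_layer(fetch: Callable[[str], Tuple[int, dict]]) -> Optional[str]:
    """Walk the stack front to back and name the first layer that fails, else None."""
    code, _ = _get(fetch, "/health")
    if code is None:
        return "Netlify"      # assumption: nothing answered at the public origin
    if code != 200:
        return "Render"       # assumption: the origin answered but the backend did not
    code, status = _get(fetch, "/api/status")
    if code != 200:
        return "Render"
    if not status.get("chroma_ready"):       # hypothetical status field
        return "Chroma"
    if not status.get("enrichment_ready"):   # hypothetical status field
        return "enrichment"
    code, _ = _get(fetch, "/api/analyze")
    if code != 200:
        return "analyze"
    return None


def fake_fetch(responses):
    """Test double: unknown paths behave like an unreachable host."""
    def fetch(path):
        if path not in responses:
            raise ConnectionError(path)  # ConnectionError is a subclass of OSError
        return responses[path]
    return fetch


HEALTHY = {
    "/health": (200, {}),
    "/api/status": (200, {"chroma_ready": True, "enrichment_ready": True}),
    "/api/analyze": (200, {}),
}
COLD_CHROMA = {**HEALTHY, "/api/status": (200, {"chroma_ready": False, "enrichment_ready": True})}
```

In a scheduled job, `fetch` would be a thin wrapper around an HTTP client, and a non-`None` result would fail the run with the layer name in the message.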

chapter 04 —

Key findings.

6-signal

Pre-generation ranking

Every retrieval candidate is scored across feature-fit coverage, pricing match, review sentiment, provenance trust, query-keyword alignment, and category overlap before the LLM prompt is assembled. The ranking is deterministic and inspectable — no black-box confidence number, just a composite score derived from labeled, retrieved evidence.

4 collections

Partitioned vector indexing

Capterra reviews, FactGrid enterprise metadata, Wikidata vendor facts, and OpenAlternative open-source discovery each live in their own Chroma collection with their own TF-IDF fallback index. Separate indexing means separate trust profiles, independent retrieval tuning, and source-labeled output the user can audit lane by lane.

3 tiers

LLM fallback chain

Groq Qwen 32B online → Ollama Qwen 2.5 1.5B local → deterministic grounded template. Each tier consumes the same six-signal ranked evidence from the retrieval layer. The template fallback produces a full procurement scorecard with no LLM call — output that degrades in fluency but never in factual grounding.

0

Hallucinated citations in any output mode

The LLM never sees a request without pre-ranked, sourced evidence already in the prompt context. The prompt architecture is designed to leave no room for a feature that is not in the retrieved data. Missing evidence surfaces as explicit labeled gaps, not as model-generated placeholders.

chapter 05 —

Key features.

The signature product moments. Each one is a complete scenario the prototype demonstrates end to end.

feature 01 —

Configure-then-run dashboard.

The single-pane workspace where evaluations happen. Pick a scenario (Support desk review risk, CRM under $30, PM automation shortlist, CRM vendor comparison), set required features and pricing constraints, choose tools to compare, and run. Demo presets seed reasonable defaults so a first-time user can fire a useful query in three clicks.

Resists the temptation to be a generic chat box. Every input is structured (scenario, category, budget, required features, tools to compare) so the retrieval can rank with confidence before the LLM ever sees the query.
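A structured request of this kind could be modeled as a small validated record. The sketch below is an assumption about shape only: the field names, the validation rules, and the preset's required feature are hypothetical, with just the "CRM under $30" scenario name and its budget taken from the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class EvaluationRequest:
    """Structured input for one evaluation run (hypothetical field names)."""
    scenario: str
    category: str
    max_monthly_budget: Optional[float] = None  # None means no budget constraint
    required_features: Tuple[str, ...] = ()
    tools_to_compare: Tuple[str, ...] = ()

    def __post_init__(self):
        if not self.category.strip():
            raise ValueError("category is required")
        if self.max_monthly_budget is not None and self.max_monthly_budget <= 0:
            raise ValueError("budget must be positive")
        if len(set(self.tools_to_compare)) != len(self.tools_to_compare):
            raise ValueError("tools_to_compare contains duplicates")


# A demo preset seeds defaults so a first query needs very few clicks.
PRESETS = {
    "CRM under $30": EvaluationRequest(
        scenario="CRM under $30",
        category="crm",
        max_monthly_budget=30.0,
        required_features=("contact management",),  # assumed default
    ),
}


def is_valid(**kwargs):
    """Return True if the keyword arguments form a valid request."""
    try:
        EvaluationRequest(**kwargs)
        return True
    except ValueError:
        return False
```

Validating at construction time means the ranking step can rely on a category and a sane budget being present instead of parsing them out of free text.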

feature 02 —

Side-by-side comparison + scorecard.

The output surface for a compare-three-tools query: confidence rating (Low / Medium / High), aligned-features count, feature scorecard per tool, pricing summary, review-derived pain points, recommendation memo, and a list of follow-up procurement checks. Everything is grouped so the user can scan one tool top-to-bottom or compare across tools.

The recommendation memo at the end is what makes the output portable. Analysts can paste it into a Slack thread or a procurement deck without rewriting; the format matches what they would have written by hand.

feature 03 —

Evidence lanes panel.

Behind every claim is a sourced row. The Evidence panel exposes the FactGrid Enterprise Metadata table (vendor verification, pricing cross-checks, SLA notes, audit dates), Wikidata Vendor Facts (entity type, official website, country, parent organization, stock ticker), and the underlying review snippets. "Missing" cells are rendered as cells, not hidden, so absent evidence reads as a real finding.

Showing the data tables is the opposite of magic-AI marketing. The app earns trust by showing exactly what it knows and what it does not, including the URL each fact came from.

feature 04 —

Mobile responsive workspace.

The full evaluation flow stacks cleanly on phones: data status header, query box, scenario controls, feature checklist, comparison set, and run settings. Same components, same evidence lanes, same template fallback path; just laid out vertically with touch-sized targets.

Mobile parity matters for analysts who do quick sanity checks on the train. One set of components and one set of typography across viewports, not a stripped-down mobile variant.

chapter 06 —

Tools & technologies.

React + TypeScript + Vite + Tailwind

Frontend dashboard, scenario presets, evidence tabs, loading states, template-mode retry, responsive grid. Same component library across desktop and mobile.

FastAPI

Backend service with routes for health, status, options, and analysis. Bootstraps the production artifact on startup and caches expensive status checks.

Chroma

Vector store for four separate evidence collections (products, Capterra reviews, FactGrid metadata, OpenAlternative). Each collection is indexed independently and has a pre-built TF-IDF fallback. Queries hit all lanes in parallel via cosine similarity, with metadata filtering for category and source. The partitioned design lets retrieval be tuned per trust level and lets the UI label every result by its origin collection.

Groq + Ollama (Qwen)

Provider-neutral LLM assembly layer. Groq's hosted Qwen 32B handles online inference at sub-2-second latency; Ollama's local Qwen 2.5 1.5B covers offline and zero-API-cost scenarios. A deterministic grounded-template path activates when both providers are unavailable, producing a full procurement scorecard from the same ranked retrieval output — no LLM call required, no hallucinations possible.

Netlify + Render

Frontend deploy with same-origin /api/* proxy to the Render-hosted FastAPI backend. Both free-tier; cold-start handling lives in the monitor and the packaged artifact.

GitHub Actions

Scheduled production monitor that hits /health, /api/status, and a low-cost template /api/analyze. Reports the first failing layer (Netlify, Render, Chroma, enrichment, or analyze) on failure.

Multi-source data ingest

CompareEdge SaaS Market Data (Kaggle), Capterra Ticketsystem reviews, FactGrid enterprise metadata (CC BY 4.0), Wikidata vendor facts (CC0), and OpenAlternative open-source discovery (CC0).

Python data pipeline

Schema discovery, canonical product-name normalization, product / pricing / feature / review joins, unmatched-record QA, and evidence enrichment.
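Canonical name normalization and the unmatched-record QA step can be sketched together. The normalization rules here (accent stripping, dropping parentheticals, trimming a short list of corporate suffixes) and the record shapes are assumptions for illustration; the project's actual rules are not spelled out in the text.

```python
import re
import unicodedata

# Hypothetical suffixes to trim from the end of a name.
SUFFIXES = {"inc", "llc", "ltd", "corp"}


def canonical_name(raw):
    """Reduce a product name to a lowercase ASCII join key."""
    text = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"\(.*?\)", " ", text.lower())   # drop parentheticals like "(Cloud)"
    tokens = re.findall(r"[a-z0-9]+", text)
    while len(tokens) > 1 and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def join_on_name(products, reviews):
    """Attach product ids to reviews; return (matched, unmatched) for QA."""
    by_key = {canonical_name(p["name"]): p for p in products}
    matched, unmatched = [], []
    for review in reviews:
        product = by_key.get(canonical_name(review["product"]))
        if product is None:
            unmatched.append(review)   # surfaced for manual QA, not silently dropped
        else:
            matched.append({**review, "product_id": product["id"]})
    return matched, unmatched


# Made-up records for illustration.
PRODUCTS = [{"id": 1, "name": "Acme Desk, Inc."}, {"id": 2, "name": "Café CRM"}]
REVIEWS = [
    {"product": "  ACME  Desk (Cloud) ", "text": "solid routing"},
    {"product": "Cafe CRM", "text": "clean UI"},
    {"product": "Unknown Tool", "text": "no match expected"},
]
matched, unmatched = join_on_name(PRODUCTS, REVIEWS)
```

Returning the unmatched rows as a first-class result makes the QA step explicit: a rising unmatched count is a signal that the normalization rules need another pass.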

chapter 07 —

The business case for building on RAG instead of prompts.

SaaSScout is a working argument that the right architecture for high-stakes AI work is not a smarter prompt — it is a retrieval layer that owns the facts so the generation layer never has to guess. Generic LLMs fail procurement tasks not because the models are bad but because the task requires grounded, auditable, source-traceable output that training data alone cannot provide. RAG delivers that by design: the pipeline retrieves before it generates, ranks before it narrates, and labels every output with the lane and row it came from.

The product value of this architecture is measurable. Analysts get a recommendation memo they can drop directly into a Slack thread or procurement deck. Every feature claim cites a source. Every pricing gap is flagged rather than papered over. The fallback chain means the app delivers a useful output even when the LLM provider is rate-limited or offline. That reliability is not incidental — it is the product.

Building SaaSScout sharpened a conviction I now apply to every AI product decision: the expensive part is not the model. It is the data pipeline, the evidence partitioning, the ranking heuristics, and the fallback architecture. Models are a commodity. Grounded retrieval infrastructure is the moat. Any team that skips RAG in favor of prompt engineering is trading short-term simplicity for long-term hallucination debt.

chapter 08 —

Honest caveats.

the honest caveats —

Demo data is Kaggle's CompareEdge 2026 SaaS market snapshot, not live vendor pricing. Real procurement use would need a refresh cadence and pricing-API hooks.
Many enterprise tools list pricing as "contact sales" rather than a number. Those tiers appear as gaps in the pricing table, which is honest but means budget-aware ranking is limited for higher tiers.
Render's free tier has cold starts. The scheduled monitor mitigates but does not eliminate the first-of-day delay.
LLM rate limits during demo spikes fall back to the deterministic template, which is grounded but reads less naturally than the LLM output.
No user accounts, saved evaluations, or team sharing yet. Each session is stateless.
The four evidence lanes are the curated v1 set. Adding G2, TrustRadius, and vendor-direct feature pages is the obvious v2 priority.