
Web Scraper Console — Complete Tutorial

A line-by-line, file-by-file walkthrough of every service in the stack — from docker-compose.yml down to the smallest helper function. Each section opens with a flow diagram of the piece under discussion.

1 · Overview

The whole system at a glance

Six containers cooperate over a shared Docker network. The browser only ever talks to the frontend (port 8000) and the API (port 8080). Everything else — the headless browser farm, S3-compatible storage, the vector DB, and the LLM — is reached only through the API.

[Architecture diagram] The browser talks to the frontend (nginx, Tailwind + vanilla JS, :8000) and the Go API (chi router, :8080), which orchestrates everything else: Playwright (Node + Express, renders JS pages, :3000, via /render), LocalStack (S3 emulation for artifacts, :4566, PutObject), ChromaDB (vector store for RAG retrieval, :8001, v2 REST), and Ollama (llama3.2 + embeddings for summarize/embed, :11434, /api/generate). All services share the Docker network scraper-net.

The two main flows

Flow A — Scrape

UI submit → POST /scrape → BFS queue → Playwright render → S3 PUT

Flow B — Ask the agent

UI prompt → POST /query → embed prompt → Chroma top-K → LLM answer

File map

scraper/
├── docker-compose.yml          # 6-service orchestration
├── api/                        # Go HTTP API (port 8080)
│   ├── main.go                 # entrypoint, router, config
│   ├── models.go               # request/response/job types
│   ├── handlers.go             # HTTP handlers
│   ├── scraper.go              # BFS crawler
│   ├── storage.go              # S3/LocalStack client
│   ├── ai.go                   # ChromaDB + Ollama
│   ├── api_test.go             # mocked end-to-end tests
│   └── Dockerfile              # multi-stage build
├── playwright-service/         # JS-page renderer (port 3000)
│   ├── server.js               # Express + Playwright
│   ├── package.json
│   └── Dockerfile
├── frontend/                   # static UI (port 8000)
│   ├── index.html              # Tailwind + JS console
│   └── Dockerfile              # nginx
└── scripts/
    └── init-aws.sh             # creates the S3 bucket on LocalStack boot
Reading order tip: if this is your first time, read sections 1–4 (overview, compose, LocalStack, Playwright) for the infrastructure, then jump to section 13 (the frontend) to see the user perspective, then go back to the Go API (sections 5–12) to understand the orchestration in depth.
2 · Orchestration

docker-compose.yml

One file, six services, one network.

[Diagram] All six services join the scraper-net network. frontend depends on api; api depends on the four services below it: playwright (Node + Chromium), localstack (mounts the init script), chromadb (volume chroma-data), and ollama (volume ollama-data), plus the one-shot ollama-init container that pulls the models.

The full file

Compose v2 syntax (no top-level version: key needed). Each service block defines an image (or build context), ports, env vars, dependencies, and any volumes or networks.

Block 1 — The Go API service

services:
  api:
    build: ./api
    container_name: scraper-api
    ports:
      - "8080:8080"
    environment:
      - PLAYWRIGHT_URL=http://playwright:3000
      - S3_ENDPOINT=http://localstack:4566
      - S3_BUCKET=scraper-data
      - AWS_ACCESS_KEY_ID=test
      - AWS_SECRET_ACCESS_KEY=test
      - AWS_REGION=us-east-1
      - CHROMA_URL=http://chromadb:8000
      - OLLAMA_URL=http://ollama:11434
      - OLLAMA_MODEL=llama3.2
      - OLLAMA_EMBED_MODEL=nomic-embed-text
    depends_on:
      - playwright
      - localstack
      - chromadb
      - ollama
    networks:
      - scraper-net

build: ./api tells compose to look for a Dockerfile in the api/ directory and build it locally rather than pulling a prebuilt image. We do this for everything we wrote ourselves.

The environment block uses Docker DNS — every service is reachable from inside the network at http://<service_name>:<internal_port>. So http://playwright:3000 resolves automatically; you never need to know the container's IP.

AWS_ACCESS_KEY_ID=test and AWS_SECRET_ACCESS_KEY=test are LocalStack's hardcoded test credentials. The AWS SDK requires some creds to sign requests, but LocalStack accepts anything. Don't use these against real AWS.

depends_on only controls start order — Docker doesn't wait for the dependency to be ready, only running. The Go code therefore retries on its first failed connection.
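
A minimal sketch of what that first-connection retry can look like — a hypothetical helper using only the standard library (context, net/http, time, fmt), not a verbatim excerpt from the API:

func waitForService(ctx context.Context, url string) error {
	// Poll a health endpoint until it answers or the context expires.
	client := &http.Client{Timeout: 2 * time.Second}
	for {
		resp, err := client.Get(url)
		if err == nil {
			resp.Body.Close()
			return nil // the dependency is accepting connections
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("%s never became ready: %w", url, ctx.Err())
		case <-time.After(time.Second): // back off, then retry
		}
	}
}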

Block 2 — Playwright renderer

  playwright:
    build: ./playwright-service
    container_name: scraper-playwright
    ports:
      - "3000:3000"
    networks:
      - scraper-net

Minimal config: just a build context and port. We expose 3000 to the host so you can curl localhost:3000/health while debugging — but the API actually reaches it as http://playwright:3000 on the internal network.

Block 3 — Frontend (nginx)

  frontend:
    build: ./frontend
    container_name: scraper-frontend
    ports:
      - "8000:80"
    depends_on:
      - api
    networks:
      - scraper-net

"8000:80" maps host:8000 to container:80 (nginx's default). The frontend is purely static — no backend code runs in this container.

The frontend itself talks to the API at http://localhost:8080 from the user's browser, not through the Docker network — because the browser runs on your machine, outside Docker. That's why the API also publishes its port to the host.

Block 4 — LocalStack

  localstack:
    image: localstack/localstack:latest
    container_name: scraper-localstack
    ports:
      - "4566:4566"
    environment:
      - SERVICES=s3
      - DEBUG=0
      - AWS_DEFAULT_REGION=us-east-1
    volumes:
      - ./scripts/init-aws.sh:/etc/localstack/init/ready.d/init-aws.sh
    networks:
      - scraper-net

SERVICES=s3 tells LocalStack to only spin up the S3 mock — saves memory because LocalStack can also emulate dozens of other AWS services we don't need.

The volume mount drops our shell script into LocalStack's ready.d/ hook directory. LocalStack runs every script in that folder once the gateway is healthy, so our init-aws.sh auto-creates the S3 bucket on first boot.

Block 5 — ChromaDB

  chromadb:
    image: chromadb/chroma:latest
    container_name: scraper-chromadb
    ports:
      - "8001:8000"
    environment:
      - IS_PERSISTENT=TRUE
      - ANONYMIZED_TELEMETRY=FALSE
    volumes:
      - chroma-data:/data
    networks:
      - scraper-net

"8001:8000" remaps because Chroma listens on 8000 internally, but the host's 8000 is already taken by the frontend. From the API's point of view (inside Docker) it's still http://chromadb:8000.

IS_PERSISTENT=TRUE + the chroma-data named volume means embeddings survive container restarts. ANONYMIZED_TELEMETRY=FALSE turns off Chroma's outbound usage pings.

Block 6 — Ollama + the one-shot model puller

  ollama:
    image: ollama/ollama:latest
    container_name: scraper-ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    networks:
      - scraper-net

  ollama-init:
    image: ollama/ollama:latest
    container_name: scraper-ollama-init
    depends_on:
      - ollama
    entrypoint: >
      sh -c "
        sleep 5 &&
        OLLAMA_HOST=http://ollama:11434 ollama pull llama3.2 &&
        OLLAMA_HOST=http://ollama:11434 ollama pull nomic-embed-text
      "
    networks:
      - scraper-net
    restart: "no"

The ollama-data volume persists downloaded models (multiple GB) across restarts — without it you'd re-download llama3.2 every compose up.

The ollama-init service is a clever pattern: it runs the same image as ollama, but with a custom entrypoint that uses the Ollama CLI to pull two models, then exits. restart: "no" stops Docker from restarting it after exit, and OLLAMA_HOST redirects the CLI to talk to the main ollama container instead of trying to start its own server.

This pattern keeps the actual ollama service simple (it's just the upstream image) while ensuring required models exist before anyone tries to use them.

Block 7 — Volumes & network

volumes:
  chroma-data:
  ollama-data:

networks:
  scraper-net:
    driver: bridge

Named volumes are managed by Docker (typically under /var/lib/docker/volumes/). They persist independently of containers — if you delete a container, its volume sticks around.

driver: bridge is the default for user-defined networks and gives each service its own DNS name. The alternative host would skip the network namespace entirely (useful for low-latency apps, not what we want here).


3 · Storage

LocalStack & init-aws.sh

A free, local AWS that the Go SDK talks to as if it were real.

[Diagram] storage.go in the Go API uses aws-sdk-go-v2 to issue PutObject / GetObject calls against the LocalStack gateway on :4566, which routes them to the S3 bucket scraper-data (keys under jobs/<id>/pages/...). init-aws.sh runs once on boot and creates the bucket.

scripts/init-aws.sh

LocalStack mounts /etc/localstack/init/ready.d/ and runs every script in there once the gateway is healthy. We use that hook to create our bucket.

#!/bin/bash
echo "Creating S3 bucket: scraper-data"
awslocal s3 mb s3://scraper-data
awslocal s3api put-bucket-cors --bucket scraper-data --cors-configuration '{
  "CORSRules": [{
    "AllowedHeaders": ["*"],
    "AllowedMethods": ["GET", "PUT", "POST"],
    "AllowedOrigins": ["*"]
  }]
}'
echo "Bucket created successfully"

Line 1 — shebang: tells the kernel to run this with /bin/bash. Required because LocalStack invokes the file by path.

awslocal is a thin wrapper around the AWS CLI that auto-injects --endpoint-url=http://localhost:4566 and dummy credentials. It's pre-installed inside the LocalStack image.

s3 mb s3://scraper-data — "make bucket". mb is short for make-bucket; the URL form s3://name is the standard S3 URI scheme.

The CORS block exists because in production you might fetch presigned URLs directly from the browser. We don't actually need it here (the API proxies all S3 reads), but it's a useful default for any S3 bucket that might serve assets to a browser.
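
For reference, producing such a presigned URL from Go with aws-sdk-go-v2 looks roughly like this — a sketch reusing the *s3.Client built later in storage.go, not something this project actually does:

// presignPage returns a time-limited URL a browser could GET directly
// from S3/LocalStack, bypassing the API. Illustrative sketch only.
func presignPage(ctx context.Context, client *s3.Client, key string) (string, error) {
	presigner := s3.NewPresignClient(client)
	req, err := presigner.PresignGetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String("scraper-data"),
		Key:    aws.String(key),
	}, s3.WithPresignExpires(15*time.Minute))
	if err != nil {
		return "", err
	}
	return req.URL, nil
}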

Why mount this script in ready.d/?

LocalStack supports four lifecycle hook directories: boot.d/ (before LocalStack starts), start.d/ (during startup), ready.d/ (after the gateway accepts traffic), and shutdown.d/. We use ready.d/ because we need the S3 service available to make the bucket call.


4 · Renderer

Playwright service

A tiny Express wrapper around headless Chromium.

[Diagram] The Go API POSTs to /render. The Express handler drives Playwright — browser.newContext(), page.goto(url), page.content(), page.screenshot() — against headless Chromium, which runs the page's JavaScript and waits for networkidle. The response is { html, text, links, screenshot? }.

The full server.js

A single Express app with a /health probe and one real endpoint.

Imports & setup

const express = require('express');
const { chromium } = require('playwright');

const app = express();
app.use(express.json({ limit: '10mb' }));

let browser;

express.json({ limit: '10mb' }) parses incoming JSON request bodies up to 10 MB. The default is 100 KB, which we'd hit if the API ever sent us a big page.

let browser is a module-level handle. We launch Chromium once and reuse it across requests — launching costs ~1 s, so reusing it cuts per-request latency dramatically. We create a fresh context per request for isolation.

Lazy browser launch

async function getBrowser() {
  if (!browser || !browser.isConnected()) {
    browser = await chromium.launch({
      headless: true,
      args: ['--no-sandbox', '--disable-setuid-sandbox']
    });
  }
  return browser;
}

app.get('/health', (req, res) => {
  res.json({ status: 'ok' });
});

browser.isConnected() guards against the rare case where the browser process dies (out of memory, segfault) but our handle is still set. If so, we relaunch.

--no-sandbox is required when running as root inside Docker. Linux's user namespaces conflict with Chromium's sandbox in unprivileged containers. Acceptable here because we already trust the container; in production you'd run Chromium as a non-root user and keep the sandbox enabled.

/health is what a compose healthcheck (or a human with curl) can hit to verify the service is up before sending real traffic.

The /render handler

app.post('/render', async (req, res) => {
  const { url, screenshot = false, timeout = 30000 } = req.body;

  if (!url) {
    return res.status(400).json({ error: 'url is required' });
  }

  let context;
  let page;
  try {
    const b = await getBrowser();
    context = await b.newContext({
      userAgent: 'Mozilla/5.0 (compatible; SimpleScraper/1.0)',
      viewport: { width: 1280, height: 800 }
    });
    page = await context.newPage();

    await page.goto(url, { waitUntil: 'networkidle', timeout });

Destructuring with defaults — const { url, screenshot = false, timeout = 30000 } reads the body and substitutes safe defaults if fields are missing.

browser.newContext() creates an isolated browsing session — its own cookies, cache, localStorage. Two parallel /render calls won't see each other's session state.

waitUntil: 'networkidle' resolves once there have been no network connections for at least 500 ms — the usual signal that client-side JS has finished loading its data. The other options are load (the window load event has fired), domcontentloaded (the HTML has been parsed and DOMContentLoaded fired), and commit (a network response was received and the document started loading).

The userAgent is set to identify ourselves politely; a real scraper would also rotate UAs and respect robots.txt.

Extracting content

    const html = await page.content();
    const text = await page.evaluate(() => document.body ? document.body.innerText : '');
    const title = await page.title();

    const links = await page.evaluate(() => {
      const anchors = Array.from(document.querySelectorAll('a[href]'));
      return anchors.map(a => a.href).filter(h => h && (h.startsWith('http://') || h.startsWith('https://')));
    });

page.content() returns the post-JavaScript serialized HTML — what the page looks like after React, Vue, or any other framework finished rendering. This is the whole point of using a real browser.

page.evaluate(fn) serializes the function, ships it into the browser, runs it there, and returns the result. So document.body.innerText runs inside Chromium, then the resulting string is sent back to Node. Crucially, this means we get the rendered DOM's text, not the raw HTML.

For links, a.href always returns an absolute URL even if the source markup was <a href="/about"> — the browser does the resolution for us. We filter to http(s):// so we drop mailto:, tel:, and javascript: links.

Screenshot & cleanup

    const result = {
      url, title, html, text,
      links: [...new Set(links)]
    };

    if (screenshot) {
      const buf = await page.screenshot({ fullPage: true, type: 'png' });
      result.screenshot = buf.toString('base64');
    }

    res.json(result);
  } catch (err) {
    console.error('Render error:', err.message);
    res.status(500).json({ error: err.message });
  } finally {
    if (page) await page.close().catch(() => {});
    if (context) await context.close().catch(() => {});
  }
});

[...new Set(links)] dedupes — pages often link the same URL multiple times.

Screenshots are returned base64-encoded inside the JSON response. Embedding binary in JSON adds ~33% size overhead, but it lets the API treat all responses uniformly. The Go side decodes back to bytes before storing in S3.

fullPage: true scrolls and stitches the entire scrollable area, not just the viewport.

The finally block guarantees we close the page and context even on error. Forgetting this leaks memory: Chromium tabs accumulate, the process slowly bloats, and eventually you OOM. The .catch(() => {}) swallows close errors because they don't matter once the request is done.

Server bootstrap & graceful shutdown

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`Playwright service listening on port ${PORT}`);
});

process.on('SIGTERM', async () => {
  if (browser) await browser.close();
  process.exit(0);
});

Docker sends SIGTERM when stopping a container, then SIGKILL 10 seconds later. The handler closes Chromium cleanly so we don't leave orphaned processes. Without this, you can occasionally see "zombie" Chromium processes accumulate during development.

package.json

{
  "name": "playwright-service",
  "version": "1.0.0",
  "description": "Playwright-based page renderer",
  "main": "server.js",
  "scripts": {
    "start": "node server.js"
  },
  "dependencies": {
    "express": "^4.19.2",
    "playwright": "^1.47.0"
  }
}

The caret ^1.47.0 means "any 1.x release ≥ 1.47.0". This usually pulls the latest stable. However, that's exactly what bit us in the troubleshooting section — the prebuilt Playwright base image only ships browsers for a specific version. Our Dockerfile fixes that by re-installing browsers at build time.

Dockerfile

FROM node:20-bookworm-slim

WORKDIR /app

COPY package.json ./
RUN npm install --omit=dev

# Install the chromium build that matches whatever playwright version
# npm just resolved, plus all of its system library dependencies.
RUN npx playwright install --with-deps chromium \
 && rm -rf /var/lib/apt/lists/*

COPY server.js ./

EXPOSE 3000

CMD ["node", "server.js"]

node:20-bookworm-slim is a small Debian-based Node image. We chose this over the official Microsoft Playwright image specifically to let npm pick the latest Playwright and then download a matching browser, side-stepping version drift.

npm install --omit=dev skips devDependencies. Smaller image, fewer attack-surface packages.

npx playwright install --with-deps chromium does two things: downloads the chromium binary that exactly matches the npm package's expected version, and runs apt-get install for the long list of shared libraries Chromium needs (libnss3, libgtk, etc.). The --with-deps flag is the key piece that makes this self-contained.

rm -rf /var/lib/apt/lists/* deletes the apt cache. Standard Docker hygiene — saves ~50 MB.

Layer order matters: we copy package.json and run npm install before copying server.js. That way edits to server.js don't bust the npm install cache layer, so rebuilds take seconds.


5 · Go API

models.go

The data shapes every other file uses.

Reading a Go service is much easier when you start with its types. These structs flow through every layer: the HTTP handlers decode them, the scraper builds them, the storage layer reads from them, and the AI layer reads metadata from them.

Package & imports

package main

import "time"

All API files share package main — they compile into one executable, not a library. We only need time here for time.Time timestamps.

The request types

// ScrapeRequest is the body for POST /api/scrape
type ScrapeRequest struct {
	URL    string `json:"url"`
	Depth  int    `json:"depth"`
	Format string `json:"format"` // "text", "html", or "screenshot"
}

The backtick-delimited struct tags tell encoding/json how to map JSON keys to Go fields. Without them, JSON would need "URL" (capitalized) to match the field name. With them, the JSON is conventional lowercase: {"url": "...", "depth": 1, "format": "text"}.

Field exports rule: capitalized field names are exported (visible to other packages and to json.Marshal). Lowercase ones are private. json.Marshal only sees exported fields, which is why every JSON-serializable field must be capitalized.
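
A tiny illustration with a hypothetical struct (not from this codebase):

type demo struct {
	URL    string `json:"url"` // exported + tagged → encoded as "url"
	Format string               // exported, untagged → encoded as "Format"
	secret string               // unexported → json.Marshal never sees it
}

// json.Marshal(demo{URL: "https://example.com", Format: "text", secret: "x"})
// yields {"url":"https://example.com","Format":"text"} — secret is dropped.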

The job & page types

// ScrapeJob represents one crawl operation
type ScrapeJob struct {
	ID        string       `json:"id"`
	SeedURL   string       `json:"seed_url"`
	Depth     int          `json:"depth"`
	Format    string       `json:"format"`
	Status    string       `json:"status"` // pending|running|done|failed
	Error     string       `json:"error,omitempty"`
	CreatedAt time.Time    `json:"created_at"`
	UpdatedAt time.Time    `json:"updated_at"`
	Pages     []ScrapePage `json:"pages"`
	Indexed   bool         `json:"indexed"`
}

// ScrapePage is a single page captured in a job
type ScrapePage struct {
	URL     string `json:"url"`
	Title   string `json:"title"`
	S3Key   string `json:"s3_key"`
	Depth   int    `json:"depth"`
	Format  string `json:"format"`
	Bytes   int    `json:"bytes"`
}

json:"error,omitempty" — the omitempty modifier tells the encoder to skip this field entirely if it's the zero value (empty string here). So a successful job's JSON has no "error" field at all, instead of "error": "". Tidier and easier for clients to if (job.error) { ... }.

Status is a string, not a typed enum. Go has no native enum; the convention is named constants of a custom type. We didn't bother for a 4-value field where the cost-of-mistake is low.

Pages []ScrapePage embeds the page list directly. When the API returns a job, the pages come along — no need for a separate GET /jobs/:id/pages.

The query types

// QueryRequest is the body for POST /api/query
type QueryRequest struct {
	JobID  string `json:"job_id,omitempty"` // optional: scope to one job
	Prompt string `json:"prompt"`
	TopK   int    `json:"top_k,omitempty"`
}

// QueryResponse from POST /api/query
type QueryResponse struct {
	Answer  string   `json:"answer"`
	Sources []string `json:"sources"`
}

JobID with omitempty + the comment "optional" — when provided, queries are filtered to chunks tagged with that job. When empty, the agent searches across all indexed jobs.

Sources is a list of URLs, deduped, in the order they appeared in retrieval. The frontend renders these as a bullet list under the answer.

The Playwright response shape

// PlaywrightResp shape from the playwright service
type PlaywrightResp struct {
	URL        string   `json:"url"`
	Title      string   `json:"title"`
	HTML       string   `json:"html"`
	Text       string   `json:"text"`
	Links      []string `json:"links"`
	Screenshot string   `json:"screenshot,omitempty"` // base64 png
	Error      string   `json:"error,omitempty"`
}

This struct mirrors what server.js returns from /render. By defining it explicitly we get type-safe access in the scraper (pr.HTML, pr.Links) and the compiler catches any field-name typos.

If the JSON has fields we don't list here, they're silently dropped — that's the default behavior. If we wanted strict mode, we could use a json.Decoder with DisallowUnknownFields.
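
A strict-decoding sketch, in case you ever want that behavior (a hypothetical helper using encoding/json and io, not part of the API):

// decodeStrict fails if the JSON body contains any field the target
// struct doesn't declare, instead of silently dropping it.
func decodeStrict(r io.Reader, v any) error {
	dec := json.NewDecoder(r)
	dec.DisallowUnknownFields()
	return dec.Decode(v)
}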


6 · Go API

main.go

Entrypoint: load config, build dependencies, wire routes, listen.

Request lifecycle through chi: HTTP request (browser/curl) → middleware (log / recover / timeout / CORS) → router (match path + method) → handler (validate / dispatch) → scraper / storage / ai do the work → JSON response.

Imports

package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"sync"
	"time"

	"github.com/go-chi/chi/v5"
	"github.com/go-chi/chi/v5/middleware"
	"github.com/go-chi/cors"
)

Imports are split into two blocks separated by a blank line: standard library first, third-party second. Most Go formatters (goimports) enforce this.

chi is a router that's API-compatible with net/http — you can pass any http.Handler to it and back. Tiny (no reflection, no extra layers), with patterns like r.Get("/api/scrapes/{id}", h) for path params.

Config + env loading

type Config struct {
	PlaywrightURL    string
	S3Endpoint       string
	S3Bucket         string
	AWSRegion        string
	ChromaURL        string
	OllamaURL        string
	OllamaModel      string
	OllamaEmbedModel string
}

func loadConfig() Config {
	return Config{
		PlaywrightURL:    getEnv("PLAYWRIGHT_URL", "http://playwright:3000"),
		S3Endpoint:       getEnv("S3_ENDPOINT", "http://localstack:4566"),
		S3Bucket:         getEnv("S3_BUCKET", "scraper-data"),
		AWSRegion:        getEnv("AWS_REGION", "us-east-1"),
		ChromaURL:        getEnv("CHROMA_URL", "http://chromadb:8000"),
		OllamaURL:        getEnv("OLLAMA_URL", "http://ollama:11434"),
		OllamaModel:      getEnv("OLLAMA_MODEL", "llama3.2"),
		OllamaEmbedModel: getEnv("OLLAMA_EMBED_MODEL", "nomic-embed-text"),
	}
}

func getEnv(k, def string) string {
	if v := os.Getenv(k); v != "" {
		return v
	}
	return def
}

Each setting is read from an env var with a sensible default. This is pure 12-factor config — the same binary runs unchanged in dev, staging, prod, just with different env values.

The defaults match the Docker network names on purpose: if you run the binary locally without setting any env vars, it will fail to connect (because http://playwright doesn't resolve outside Docker), reminding you to either run via compose or override the URLs.

The Server struct — dependency injection

type Server struct {
	cfg     Config
	storage *Storage
	ai      *AI
	scraper *Scraper

	// In-memory job registry. Suitable for a simple demo;
	// for production swap with a real DB.
	mu   sync.RWMutex
	jobs map[string]*ScrapeJob
}

func newServer(cfg Config) (*Server, error) {
	storage, err := NewStorage(context.Background(), cfg)
	if err != nil {
		return nil, err
	}
	ai := NewAI(cfg)
	scraper := NewScraper(cfg, storage)

	return &Server{
		cfg:     cfg,
		storage: storage,
		ai:      ai,
		scraper: scraper,
		jobs:    make(map[string]*ScrapeJob),
	}, nil
}

Bundling dependencies into a Server struct lets every handler reach them via s.storage, s.ai, etc. without globals. This is the Go-idiomatic alternative to a DI container.

sync.RWMutex guards the in-memory job map. We pick RWMutex (instead of plain Mutex) because reads vastly outnumber writes — the frontend polls the list every 4 seconds.

map[string]*ScrapeJob stores pointers so handlers can mutate a job's status in place: s.jobs[id].Status = "done". Storing values would require copy-back-on-update.

Router configuration

func (s *Server) router() http.Handler {
	r := chi.NewRouter()
	r.Use(middleware.RequestID)
	r.Use(middleware.RealIP)
	r.Use(middleware.Logger)
	r.Use(middleware.Recoverer)
	r.Use(middleware.Timeout(5 * time.Minute))

	r.Use(cors.Handler(cors.Options{
		AllowedOrigins:   []string{"*"},
		AllowedMethods:   []string{"GET", "POST", "OPTIONS"},
		AllowedHeaders:   []string{"Content-Type", "Authorization"},
		AllowCredentials: false,
	}))

	r.Get("/health", s.handleHealth)

	r.Route("/api", func(r chi.Router) {
		r.Post("/scrape", s.handleScrape)
		r.Get("/scrapes", s.handleListScrapes)
		r.Get("/scrapes/{id}", s.handleGetScrape)
		r.Get("/scrapes/{id}/page", s.handleGetPage) // ?key=...
		r.Post("/scrapes/{id}/index", s.handleIndexScrape)
		r.Post("/query", s.handleQuery)
	})

	return r
}

Middleware: each r.Use() wraps every handler that comes after it.

  • RequestID stamps a unique ID on every request (visible in logs).
  • RealIP reads X-Forwarded-For if behind a proxy.
  • Logger prints the access log line.
  • Recoverer catches panics in handlers and turns them into 500 responses instead of crashing the server.
  • Timeout(5min) aborts requests that run too long.

CORS is wide open here (AllowedOrigins: ["*"]) because the frontend can be served from any origin during development. Tighten this in production.

r.Route("/api", ...) mounts a sub-router so all sub-routes inherit the /api prefix without repeating it on every line.

{id} is a path parameter; handlers read it via chi.URLParam(r, "id").

main()

func main() {
	cfg := loadConfig()
	srv, err := newServer(cfg)
	if err != nil {
		log.Fatalf("startup failure: %v", err)
	}

	addr := ":8080"
	log.Printf("api listening on %s", addr)
	if err := http.ListenAndServe(addr, srv.router()); err != nil {
		log.Fatal(err)
	}
}

log.Fatalf prints the message and calls os.Exit(1). We use it for unrecoverable startup errors so Docker sees the non-zero exit code and (depending on restart policy) restarts the container.

http.ListenAndServe blocks forever until something kills the process. :8080 binds to all interfaces — required inside a container so the port mapping works.

This is the simplest possible server. Production code typically wraps this in a http.Server with explicit timeouts, then orchestrates graceful shutdown on SIGTERM with srv.Shutdown(ctx).
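
A sketch of that production-style bootstrap, assuming the same Server type — explicit timeouts plus graceful shutdown on SIGTERM (standard library only: net/http, os/signal, syscall, context):

// Replace the body of main() above with something like this in production.
httpSrv := &http.Server{
	Addr:              ":8080",
	Handler:           srv.router(),
	ReadHeaderTimeout: 5 * time.Second,
	IdleTimeout:       time.Minute,
}

go func() {
	if err := httpSrv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		log.Fatal(err)
	}
}()

stop := make(chan os.Signal, 1)
signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
<-stop // block until Docker (or Ctrl-C) asks us to stop

shutdownCtx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
defer cancel()
_ = httpSrv.Shutdown(shutdownCtx) // let in-flight requests finish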


7 · Go API

handlers.go

Where HTTP meets business logic.

Six handlers. Each one is small because the heavy lifting lives in scraper.go, storage.go, and ai.go. The handler's job is: parse input, validate, dispatch, format output.

Endpoint map:

  • POST /api/scrape → handleScrape — body starts a scrape (returns 202 + the job)
  • GET /api/scrapes → handleListScrapes — list all jobs
  • GET /api/scrapes/{id} → handleGetScrape — get a single job
  • GET /api/scrapes/{id}/page?key=... → handleGetPage — fetch one stored page
  • POST /api/scrapes/{id}/index → handleIndexScrape — index a finished job into ChromaDB
  • POST /api/query → handleQuery — ask a RAG question

JSON helpers & /health

func writeJSON(w http.ResponseWriter, status int, v any) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_ = json.NewEncoder(w).Encode(v)
}

func writeErr(w http.ResponseWriter, status int, msg string) {
	writeJSON(w, status, map[string]string{"error": msg})
}

func (s *Server) handleHealth(w http.ResponseWriter, r *http.Request) {
	writeJSON(w, http.StatusOK, map[string]string{"status": "ok"})
}

Two tiny helpers used by every other handler. Kept here so handlers can read like a story instead of repeating the same 3 lines of header-setting.

any is Go 1.18+ syntactic sugar for interface{} — accepts any type. json.NewEncoder(w).Encode(v) writes directly to the response, avoiding a temporary []byte allocation that json.Marshal + w.Write would create.

The blank identifier _ = on Encode discards the error. Done deliberately here because if the response is already going out, there's nothing useful to do with a write error.

handleScrape — the entry point of every job

// POST /api/scrape  { url, depth, format }
func (s *Server) handleScrape(w http.ResponseWriter, r *http.Request) {
	var req ScrapeRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		writeErr(w, http.StatusBadRequest, "invalid json: "+err.Error())
		return
	}

	req.URL = strings.TrimSpace(req.URL)
	if req.URL == "" {
		writeErr(w, http.StatusBadRequest, "url is required")
		return
	}
	if !strings.HasPrefix(req.URL, "http://") && !strings.HasPrefix(req.URL, "https://") {
		writeErr(w, http.StatusBadRequest, "url must start with http:// or https://")
		return
	}
	if req.Depth < 0 {
		req.Depth = 0
	}
	if req.Depth > 5 {
		req.Depth = 5 // safety cap
	}
	if req.Format == "" {
		req.Format = "text"
	}
	switch req.Format {
	case "text", "html", "screenshot":
	default:
		writeErr(w, http.StatusBadRequest, "format must be text|html|screenshot")
		return
	}

Decode then validate: pull the JSON into the request struct, then run a series of cheap checks. We do not trust client-supplied values — depth gets clamped, format must be in a known set, URL must have a valid scheme.

The switch req.Format { case ..., ..., ...: default: } idiom is Go's compact way of saying "accept this set, reject everything else". The empty case body is intentional — it means "no extra action; fall through".

Notice the safety cap on depth. A client could send depth=99999, which would effectively mean "follow links indefinitely" (bounded only by the page cap). We silently clamp it to 5 instead of erroring — friendlier than a 400.

Job creation & async execution

	now := time.Now().UTC()
	job := &ScrapeJob{
		ID:        uuid.NewString(),
		SeedURL:   req.URL,
		Depth:     req.Depth,
		Format:    req.Format,
		Status:    "pending",
		CreatedAt: now,
		UpdatedAt: now,
	}

	s.mu.Lock()
	s.jobs[job.ID] = job
	s.mu.Unlock()

	// Run async; the caller polls /api/scrapes/{id}
	go func(jobID string) {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
		defer cancel()

		s.mu.RLock()
		j := s.jobs[jobID]
		s.mu.RUnlock()
		if j == nil {
			return
		}
		s.scraper.Run(ctx, j)
	}(job.ID)

	writeJSON(w, http.StatusAccepted, job)
}

uuid.NewString() generates a v4 UUID — 122 random bits, so even after generating a trillion IDs the chance of any collision is on the order of 10⁻¹³, i.e. negligible.

s.mu.Lock()/Unlock() brackets the map write. Always lock writes; readers can use RLock.

go func(jobID string) { ... }(job.ID) launches a goroutine — Go's lightweight thread. Crucial: we pass job.ID as an argument by value rather than capturing the loop variable. (Pre-Go 1.22 closing over loop variables was a notorious foot-gun.)

context.WithTimeout(..., 10*time.Minute) creates a cancellation context. Every downstream call (scraper, HTTP, S3) accepts a context and aborts when it's cancelled. The defer cancel() releases resources even on panic.

StatusAccepted (HTTP 202) is the right code for "I started this, check back later". The frontend polls GET /api/scrapes/{id} to watch progress.
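
From a client's point of view the contract is: POST, receive 202 with a job ID, then poll until the status settles. A hypothetical Go client (the real frontend does this in JavaScript; the sketch reuses the ScrapeJob type from models.go):

// pollJob polls GET /api/scrapes/{id} until the job is done or failed.
func pollJob(ctx context.Context, baseURL, id string) (*ScrapeJob, error) {
	for {
		resp, err := http.Get(baseURL + "/api/scrapes/" + id)
		if err != nil {
			return nil, err
		}
		var job ScrapeJob
		err = json.NewDecoder(resp.Body).Decode(&job)
		resp.Body.Close()
		if err != nil {
			return nil, err
		}
		if job.Status == "done" || job.Status == "failed" {
			return &job, nil
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(2 * time.Second): // poll interval
		}
	}
}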

List & get

// GET /api/scrapes
func (s *Server) handleListScrapes(w http.ResponseWriter, r *http.Request) {
	s.mu.RLock()
	defer s.mu.RUnlock()

	out := make([]*ScrapeJob, 0, len(s.jobs))
	for _, j := range s.jobs {
		out = append(out, j)
	}
	writeJSON(w, http.StatusOK, out)
}

// GET /api/scrapes/{id}
func (s *Server) handleGetScrape(w http.ResponseWriter, r *http.Request) {
	id := chi.URLParam(r, "id")
	s.mu.RLock()
	job, ok := s.jobs[id]
	s.mu.RUnlock()
	if !ok {
		writeErr(w, http.StatusNotFound, "job not found")
		return
	}
	writeJSON(w, http.StatusOK, job)
}

make([]*ScrapeJob, 0, len(s.jobs)) pre-allocates the slice's backing array. Without the capacity hint, append would grow the slice incrementally. Cheap optimization for big maps; harmless for small ones.

The job, ok := s.jobs[id] pattern is Go's way of distinguishing "missing key" from "key with zero value". It's the only safe way to check map membership.

Returning 404 for an unknown job rather than 200 with empty data lets the frontend render a meaningful error.

handleGetPage — streaming from S3

// GET /api/scrapes/{id}/page?key=...
// Streams the stored object straight from S3 with its content-type.
func (s *Server) handleGetPage(w http.ResponseWriter, r *http.Request) {
	id := chi.URLParam(r, "id")
	key := r.URL.Query().Get("key")
	if key == "" {
		writeErr(w, http.StatusBadRequest, "key is required")
		return
	}
	// Make sure the key actually belongs to this job to prevent path tricks
	if !strings.HasPrefix(key, fmt.Sprintf("jobs/%s/", id)) {
		writeErr(w, http.StatusForbidden, "key does not belong to this job")
		return
	}

	data, ct, err := s.storage.Get(r.Context(), key)
	if err != nil {
		writeErr(w, http.StatusNotFound, err.Error())
		return
	}
	if ct == "" {
		ct = "application/octet-stream"
	}
	w.Header().Set("Content-Type", ct)
	w.WriteHeader(http.StatusOK)
	_, _ = w.Write(data)
}

Authorization check. Without the HasPrefix guard, a malicious caller could pass ?key=jobs/SOMEONE_ELSES_JOB/pages/x.txt and read content from another job. We require the key to match the URL's job ID.

The frontend uses this endpoint via plain <img src> for screenshots and <iframe srcdoc> for HTML. Setting the right Content-Type from S3's stored content-type makes the browser handle each format natively.

r.Context() is the request's context. If the client disconnects mid-fetch, the context cancels and the S3 GET aborts — saves bandwidth.

handleIndexScrape — load to ChromaDB

// POST /api/scrapes/{id}/index
// Pulls every page's text from S3, chunks it, and indexes into ChromaDB.
func (s *Server) handleIndexScrape(w http.ResponseWriter, r *http.Request) {
	id := chi.URLParam(r, "id")
	s.mu.RLock()
	job, ok := s.jobs[id]
	s.mu.RUnlock()
	if !ok {
		writeErr(w, http.StatusNotFound, "job not found")
		return
	}
	if job.Status != "done" {
		writeErr(w, http.StatusConflict, "job not finished yet")
		return
	}

	indexed := 0
	for _, p := range job.Pages {
		// Always read the .txt sibling so we index clean text regardless
		// of what format the user asked for.
		textKey := strings.TrimSuffix(p.S3Key, ".png")
		textKey = strings.TrimSuffix(textKey, ".html")
		if !strings.HasSuffix(textKey, ".txt") {
			textKey = textKey + ".txt"
		}

		data, _, err := s.storage.Get(r.Context(), textKey)
		if err != nil {
			continue
		}
		text := string(data)
		chunks := ChunkText(text, 1500)

		for i, ch := range chunks {
			docID := fmt.Sprintf("%s::%s::%d", job.ID, p.URL, i)
			meta := map[string]string{
				"job_id": job.ID,
				"url":    p.URL,
				"title":  p.Title,
				"chunk":  fmt.Sprintf("%d", i),
			}
			if err := s.ai.IndexDoc(r.Context(), docID, ch, meta); err != nil {
				writeErr(w, http.StatusBadGateway, "index error: "+err.Error())
				return
			}
			indexed++
		}
	}

	s.mu.Lock()
	job.Indexed = true
	job.UpdatedAt = time.Now().UTC()
	s.mu.Unlock()

	writeJSON(w, http.StatusOK, map[string]any{
		"job_id":         job.ID,
		"chunks_indexed": indexed,
	})
}

Status check first. 409 Conflict ("not finished yet") is the right code when the resource exists but isn't in a state that allows the action.

The text-key derivation always points to the .txt sibling that the scraper writes alongside every artifact. Even if the page was stored as a screenshot, we have a parallel text version for indexing.

ChunkText(text, 1500) splits the page into ~1500-char pieces at word boundaries (we'll see this in ai.go). Embeddings have a token limit; chunking lets us index pages of any size.
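
ChunkText lives in ai.go; a minimal sketch of a word-boundary chunker of that shape (the real implementation may differ) could look like this:

// chunkBySize splits text into pieces of at most maxLen characters,
// breaking on whitespace so words stay intact. Illustrative sketch only.
func chunkBySize(text string, maxLen int) []string {
	words := strings.Fields(text)
	var chunks []string
	var b strings.Builder
	for _, w := range words {
		if b.Len() > 0 && b.Len()+1+len(w) > maxLen {
			chunks = append(chunks, b.String())
			b.Reset()
		}
		if b.Len() > 0 {
			b.WriteByte(' ')
		}
		b.WriteString(w)
	}
	if b.Len() > 0 {
		chunks = append(chunks, b.String())
	}
	return chunks
}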

The docID format jobID::url::chunkIndex is human-readable and makes debugging trivial — you can see exactly which chunk of which page got indexed.

Errors during indexing return 502 Bad Gateway because the upstream (Chroma or Ollama) failed, not us.

handleQuery — RAG in 25 lines

// POST /api/query  { prompt, job_id?, top_k? }
func (s *Server) handleQuery(w http.ResponseWriter, r *http.Request) {
	var req QueryRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		writeErr(w, http.StatusBadRequest, "invalid json: "+err.Error())
		return
	}
	req.Prompt = strings.TrimSpace(req.Prompt)
	if req.Prompt == "" {
		writeErr(w, http.StatusBadRequest, "prompt is required")
		return
	}

	docs, metas, err := s.ai.QueryChunks(r.Context(), req.Prompt, req.TopK, req.JobID)
	if err != nil {
		writeErr(w, http.StatusBadGateway, err.Error())
		return
	}

	prompt := BuildRAGPrompt(req.Prompt, docs)
	answer, err := s.ai.Generate(r.Context(), prompt)
	if err != nil {
		writeErr(w, http.StatusBadGateway, err.Error())
		return
	}

	sources := make([]string, 0, len(metas))
	seen := map[string]bool{}
	for _, m := range metas {
		if u, ok := m["url"].(string); ok && !seen[u] {
			sources = append(sources, u)
			seen[u] = true
		}
	}

	writeJSON(w, http.StatusOK, QueryResponse{Answer: answer, Sources: sources})
}

This is a tiny RAG pipeline:

  1. QueryChunks embeds the prompt and asks Chroma for the K most-similar chunks.
  2. BuildRAGPrompt wraps those chunks plus the question in a structured prompt.
  3. Generate sends the prompt to Ollama and returns the answer.

The seen map dedupes source URLs — multiple chunks from the same page would otherwise show the URL repeatedly.

The type assertion m["url"].(string) with ok guards against missing/non-string values that could come from Chroma's untyped metadata.
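
BuildRAGPrompt is defined in ai.go; a minimal sketch of what such a context-stuffing prompt builder typically looks like (the actual wording in the codebase may differ):

// buildRAGPrompt puts the retrieved chunks ahead of the user's question so
// the model answers from the supplied context. Illustrative sketch only.
func buildRAGPrompt(question string, docs []string) string {
	var b strings.Builder
	b.WriteString("Answer the question using only the context below.\n\nContext:\n")
	for i, d := range docs {
		fmt.Fprintf(&b, "[%d] %s\n\n", i+1, d)
	}
	b.WriteString("Question: " + question + "\nAnswer: ")
	return b.String()
}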


8 · Go API

scraper.go

A breadth-first crawler with safety caps and same-host scoping.

[Diagram] BFS crawler: a FIFO queue seeded with {url, depth: 0}, plus a visited set (map[string]bool) that prevents cycles and duplicates. Each pop goes to Playwright /render, the artifact (plus a .txt sibling) is PUT to S3, the page is appended to the job, and extracted links are enqueued at depth+1 (same host only). Safety caps: depth ≤ 5, page count ≤ 50, same host only (no cross-domain), no revisits. The loop continues until the queue is empty or a cap is hit; then the status flips to "done".

Imports & constants

package main

import (
	"bytes"
	"context"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
	"time"
)

const (
	maxPages       = 50
	requestTimeout = 60 * time.Second
)

Two safety caps as constants. maxPages = 50 bounds the total work per job. requestTimeout = 60s bounds each individual Playwright call so a hanging page can't stall the whole crawl.

net/url is critical — we use it to parse URLs and compare hosts. Don't compare hosts with raw string operations on the full URL; url.Parse splits out the scheme, userinfo, port, path, and query so we compare just the host component (which hostOf then lowercases, since hostnames are case-insensitive).

Scraper struct & constructor

type Scraper struct {
	cfg     Config
	storage *Storage
	client  *http.Client
}

func NewScraper(cfg Config, storage *Storage) *Scraper {
	return &Scraper{
		cfg:     cfg,
		storage: storage,
		client: &http.Client{
			Timeout: requestTimeout + 5*time.Second,
		},
	}
}

The HTTP client is created once and reused. Important: never use the default http.DefaultClient in production — it has no timeout, so a slow server can block forever.

The client timeout is slightly longer than requestTimeout so the per-request context cancels first, giving us cleaner error messages than a transport-level timeout.

Run — the BFS loop

type queueItem struct {
	URL   string
	Depth int
}

// Run is the main entrypoint. It updates the job in place.
func (s *Scraper) Run(ctx context.Context, job *ScrapeJob) {
	job.Status = "running"
	job.UpdatedAt = time.Now().UTC()

	seedHost, err := hostOf(job.SeedURL)
	if err != nil {
		job.Status = "failed"
		job.Error = "invalid seed url: " + err.Error()
		job.UpdatedAt = time.Now().UTC()
		return
	}

	visited := map[string]bool{}
	queue := []queueItem{{URL: job.SeedURL, Depth: 0}}

	for len(queue) > 0 && len(job.Pages) < maxPages {
		select {
		case <-ctx.Done():
			job.Status = "failed"
			job.Error = "cancelled"
			job.UpdatedAt = time.Now().UTC()
			return
		default:
		}

		item := queue[0]
		queue = queue[1:]

		if visited[item.URL] {
			continue
		}
		visited[item.URL] = true

Mutating the job pointer. job *ScrapeJob is a pointer, so every assignment to job.Status shows up in the API's job map immediately. The HTTP polling endpoint sees status transitions in real time.

queue = queue[1:] "pops" by reslicing — cheap (no allocation) but the underlying array's first slot is now wasted. For a queue of 50 items it doesn't matter; for million-item queues you'd use a ring buffer.

The select-with-default pattern is Go's non-blocking check: "is the context cancelled?" If yes, abort. If no, continue. Without default, this would block forever waiting for cancellation.

The visited check-and-mark sits right after the dequeue, before any fetching; place it later (or skip it) and the same URL could be fetched more than once.

Per-page fetch & store

		page, links, err := s.fetchAndStore(ctx, job, item.URL, item.Depth)
		if err != nil {
			// Soft-fail: log into job error but keep going for the rest
			if job.Error == "" {
				job.Error = fmt.Sprintf("page %s: %v", item.URL, err)
			}
			continue
		}
		job.Pages = append(job.Pages, page)
		job.UpdatedAt = time.Now().UTC()

		if item.Depth >= job.Depth {
			continue
		}

		for _, link := range links {
			if visited[link] {
				continue
			}
			h, err := hostOf(link)
			if err != nil || h != seedHost {
				continue // same-host scope
			}
			queue = append(queue, queueItem{URL: link, Depth: item.Depth + 1})
		}
	}

	job.Status = "done"
	if job.Error == "" || strings.HasPrefix(job.Error, "page ") {
		// Mark as done even if some pages failed
		job.Status = "done"
	}
	job.UpdatedAt = time.Now().UTC()
}

Soft-fail strategy. If one page errors (timeout, 404, etc.), we record it in job.Error but keep going. A scraper that fails the whole job because one page in a hundred 500'd would be miserable. Only the first error is kept to avoid one bad batch overwriting all error context.

The item.Depth >= job.Depth check stops link expansion at the leaf depth — we still scrape leaf pages, we just don't follow their links.

h != seedHost enforces same-origin scope: a crawl seeded at example.com ignores any link to twitter.com or cdn.example.net. Production crawlers usually allow same-org subdomains via a configurable allowlist.

fetchAndStore — talking to Playwright + S3

func (s *Scraper) fetchAndStore(ctx context.Context, job *ScrapeJob, pageURL string, depth int) (ScrapePage, []string, error) {
	body, _ := json.Marshal(map[string]any{
		"url":        pageURL,
		"screenshot": job.Format == "screenshot",
		"timeout":    int(requestTimeout / time.Millisecond),
	})

	rctx, cancel := context.WithTimeout(ctx, requestTimeout+10*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(rctx, "POST", s.cfg.PlaywrightURL+"/render", bytes.NewReader(body))
	if err != nil {
		return ScrapePage{}, nil, err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := s.client.Do(req)
	if err != nil {
		return ScrapePage{}, nil, fmt.Errorf("playwright: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		buf, _ := io.ReadAll(resp.Body)
		return ScrapePage{}, nil, fmt.Errorf("playwright status %d: %s", resp.StatusCode, string(buf))
	}

	var pr PlaywrightResp
	if err := json.NewDecoder(resp.Body).Decode(&pr); err != nil {
		return ScrapePage{}, nil, err
	}

Note http.NewRequestWithContext — always prefer it over plain http.NewRequest in long-running paths. It plumbs the context through, so a parent cancellation aborts the in-flight request mid-byte.

defer resp.Body.Close() is mandatory for every successful HTTP response. Forgetting it leaks the underlying TCP connection; over time the connection pool exhausts.

fmt.Errorf("playwright: %w", err) uses the %w verb to wrap the original error so callers can later errors.Is/As it. Wrapping preserves the error chain instead of flattening to a string.

Storing artifacts to S3

	var data []byte
	var contentType string
	var ext string
	switch job.Format {
	case "html":
		data = []byte(pr.HTML)
		contentType = "text/html; charset=utf-8"
		ext = "html"
	case "screenshot":
		decoded, err := base64.StdEncoding.DecodeString(pr.Screenshot)
		if err != nil {
			return ScrapePage{}, nil, fmt.Errorf("base64 decode: %w", err)
		}
		data = decoded
		contentType = "image/png"
		ext = "png"
	default: // "text"
		data = []byte(pr.Text)
		contentType = "text/plain; charset=utf-8"
		ext = "txt"
	}

	safe := safeKeyFromURL(pageURL)
	key := fmt.Sprintf("jobs/%s/pages/%s.%s", job.ID, safe, ext)

	if err := s.storage.Put(ctx, key, data, contentType); err != nil {
		return ScrapePage{}, nil, err
	}

	// Also store a .txt sibling so AI indexing always has clean text
	if ext != "txt" && pr.Text != "" {
		txtKey := fmt.Sprintf("jobs/%s/pages/%s.txt", job.ID, safe)
		_ = s.storage.Put(ctx, txtKey, []byte(pr.Text), "text/plain; charset=utf-8")
	}

	return ScrapePage{
		URL:    pageURL,
		Title:  pr.Title,
		S3Key:  key,
		Depth:  depth,
		Format: job.Format,
		Bytes:  len(data),
	}, pr.Links, nil
}

The switch picks bytes, content-type, and file extension based on the requested format. For screenshots, we base64-decode what Playwright sent.

The .txt sibling trick. Even if the user asked for HTML or PNG, we always save a parallel .txt so the AI indexer has plain text to embed without re-parsing HTML or running OCR. Tiny extra storage cost; huge UX win for the "ask the agent" feature.

The key layout jobs/{id}/pages/{safe-url}.{ext} gives us implicit per-job folders in S3 — useful for deletion (aws s3 rm --recursive jobs/abc/) and the path-traversal guard in handlers.go.

URL helpers

func hostOf(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	return strings.ToLower(u.Host), nil
}

func safeKeyFromURL(u string) string {
	parsed, err := url.Parse(u)
	if err != nil {
		return strings.NewReplacer("/", "_", ":", "_", "?", "_").Replace(u)
	}
	host := parsed.Host
	path := strings.Trim(parsed.Path, "/")
	if path == "" {
		path = "index"
	}
	combo := host + "_" + path
	if parsed.RawQuery != "" {
		combo += "_" + parsed.RawQuery
	}
	combo = strings.NewReplacer("/", "_", ":", "_", "?", "_", "&", "_", "=", "-", " ", "_").Replace(combo)
	if len(combo) > 200 {
		combo = combo[:200]
	}
	return combo
}

hostOf normalizes to lowercase so Example.COM and example.com match. Important: hostnames are case-insensitive per RFC 3986.

safeKeyFromURL turns a URL into a string safe for use as an S3 key. The 200-char cap prevents pathological URLs (long query strings, filter combinations) from creating S3 keys that exceed the 1024-byte limit.

This is a lossy transform: example.com/foo/bar and example.com/foo_bar would map to the same key. For a demo this is fine; production code would either use a hash of the URL or store a side mapping.
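
The side-mapping-free alternative mentioned above — hashing the URL — is a few lines (sketch using crypto/sha256 and encoding/hex, not used here):

// hashKeyFromURL derives a fixed-length, collision-resistant key component
// from any URL. Sketch of the hashing alternative; the demo keeps the
// human-readable (but lossy) safeKeyFromURL instead.
func hashKeyFromURL(u string) string {
	sum := sha256.Sum256([]byte(u))
	return hex.EncodeToString(sum[:16]) // 32 hex chars, stable per URL
}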


9 · Go API

storage.go

A thin wrapper around the AWS S3 SDK, configured to talk to LocalStack.

[Diagram] storage.go API surface: Put(ctx, key, data, ct) writes an object via PutObject; Get(ctx, key) reads an object plus its content type via GetObject. Both go through aws-sdk-go-v2 with a BaseEndpoint override, UsePathStyle = true, and normal request signing, landing on LocalStack S3 at :4566 / bucket scraper-data.

Imports & type

package main

import (
	"bytes"
	"context"
	"fmt"
	"io"

	"github.com/aws/aws-sdk-go-v2/aws"
	awsconfig "github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

type Storage struct {
	cfg    Config
	client *s3.Client
	bucket string
}

The AWS SDK v2 is split into many small modules, one per service — you only import what you use. config is the top-level setup; service/s3 is the S3-specific client.

The import alias awsconfig avoids a collision with our own Config type.

Constructor — the LocalStack pivot

func NewStorage(ctx context.Context, cfg Config) (*Storage, error) {
	awsCfg, err := awsconfig.LoadDefaultConfig(ctx, awsconfig.WithRegion(cfg.AWSRegion))
	if err != nil {
		return nil, fmt.Errorf("aws config: %w", err)
	}

	client := s3.NewFromConfig(awsCfg, func(o *s3.Options) {
		if cfg.S3Endpoint != "" {
			o.BaseEndpoint = aws.String(cfg.S3Endpoint)
			o.UsePathStyle = true
		}
	})

	return &Storage{
		cfg:    cfg,
		client: client,
		bucket: cfg.S3Bucket,
	}, nil
}

LoadDefaultConfig reads credentials from the standard chain: env vars, then ~/.aws/credentials, then EC2/ECS metadata. Here, our compose env supplies AWS_ACCESS_KEY_ID and friends, so the env step picks them up.

The S3 options closure is the key piece for talking to LocalStack:

  • BaseEndpoint overrides the AWS endpoint resolver — instead of s3.us-east-1.amazonaws.com we go to http://localstack:4566.
  • UsePathStyle = true uses path-style URLs (http://host/bucket/key) instead of virtual-hosted style (http://bucket.host/key). LocalStack needs path-style; real AWS supports both.

To switch to real AWS: just delete the closure, or check for an empty endpoint config. The rest of the code is unchanged — the SDK abstracts over the difference.

Put & Get

func (s *Storage) Put(ctx context.Context, key string, data []byte, contentType string) error {
	_, err := s.client.PutObject(ctx, &s3.PutObjectInput{
		Bucket:      aws.String(s.bucket),
		Key:         aws.String(key),
		Body:        bytes.NewReader(data),
		ContentType: aws.String(contentType),
	})
	if err != nil {
		return fmt.Errorf("s3 put %s: %w", key, err)
	}
	return nil
}

func (s *Storage) Get(ctx context.Context, key string) ([]byte, string, error) {
	out, err := s.client.GetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String(s.bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return nil, "", fmt.Errorf("s3 get %s: %w", key, err)
	}
	defer out.Body.Close()

	data, err := io.ReadAll(out.Body)
	if err != nil {
		return nil, "", err
	}

	ct := ""
	if out.ContentType != nil {
		ct = *out.ContentType
	}
	return data, ct, nil
}

Why aws.String? The AWS SDK uses *string (pointer-to-string) for nullable fields, so it can distinguish "not set" from "empty string". aws.String("foo") is just a tiny helper that returns a pointer to its argument.

bytes.NewReader(data) wraps a byte slice as an io.Reader — required because PutObjectInput.Body takes a reader, supporting both small in-memory blobs and giant streams.

The defer-close on Get is mandatory. Without it, the underlying HTTP connection isn't returned to the pool and leaks slowly.

io.ReadAll reads until EOF. For multi-GB objects you'd stream instead — but our pages are KB-sized so reading all at once is fine.
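
If objects ever did grow large, a streaming variant could hand the body straight to the caller's writer — a sketch of a method the current Storage doesn't have:

// GetTo streams an object into w without buffering it all in memory.
// Hypothetical addition; Get above is sufficient for KB-sized pages.
func (s *Storage) GetTo(ctx context.Context, key string, w io.Writer) error {
	out, err := s.client.GetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String(s.bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return fmt.Errorf("s3 get %s: %w", key, err)
	}
	defer out.Body.Close()
	_, err = io.Copy(w, out.Body)
	return err
}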


10 · Go API

ai.go

The RAG layer: embeddings, vector search, and LLM generation.

This file is the largest in the API because it spans two systems (ChromaDB + Ollama) and implements the full retrieval-augmented generation pipeline. Two flows live here: indexing (turn pages into embeddings stored in Chroma) and querying (embed a question, retrieve nearest chunks, ask the LLM).

Indexing flow (POST /scrapes/:id/index): page text (from S3) → ChunkText (~1500 chars) → Ollama embed (nomic-embed-text) → ChromaDB add (id + vector + metadata). Each chunk becomes a 768-dimensional vector tagged with { job_id, url, title, chunk_index } so we can filter by job later.

Query flow (POST /query): user prompt → embed to a vector → Chroma query (top-K nearest) → BuildRAGPrompt (stuff the context) → Ollama generate → answer. The LLM never sees the whole site — only the K most-similar chunks. This is what RAG buys you: grounded answers plus citation URLs from the source pages. Default top_k = 5, configurable per request.

Imports & type

package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// AI bundles ChromaDB + Ollama clients
type AI struct {
	cfg    Config
	client *http.Client
}

const (
	chromaTenant   = "default_tenant"
	chromaDatabase = "default_database"
	chromaCollName = "scraper_docs"
)

func NewAI(cfg Config) *AI {
	return &AI{
		cfg: cfg,
		client: &http.Client{
			Timeout: 5 * time.Minute, // LLM calls can be slow
		},
	}
}

The Chroma v2 API namespaces collections under tenant and database. We use the defaults — tenants are useful in multi-customer SaaS scenarios where you want hard isolation between customers' embeddings.

The 5-minute timeout looks excessive but reflects reality: a small model like llama3.2 running on a CPU can still take 30–90 seconds per response. Production deployments use a GPU and a much shorter timeout.

ensureCollection — idempotent setup

func (a *AI) ensureCollection(ctx context.Context) error {
	url := fmt.Sprintf("%s/api/v2/tenants/%s/databases/%s/collections", a.cfg.ChromaURL, chromaTenant, chromaDatabase)

	body, _ := json.Marshal(map[string]any{
		"name":          chromaCollName,
		"get_or_create": true,
		"metadata":      map[string]string{"hnsw:space": "cosine"},
	})
	req, err := http.NewRequestWithContext(ctx, "POST", url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := a.client.Do(req)
	if err != nil {
		return fmt.Errorf("chroma ensure: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		buf, _ := io.ReadAll(resp.Body)
		return fmt.Errorf("chroma ensure status %d: %s", resp.StatusCode, string(buf))
	}
	return nil
}

get_or_create: true makes this idempotent — calling it twice doesn't error. We invoke this on every operation (cheap because it's a no-op after the first call), so a fresh ChromaDB volume gets the collection on demand.

"hnsw:space": "cosine" tells Chroma's HNSW index to use cosine distance for similarity. Alternatives: l2 (Euclidean) or ip (inner product). Cosine is the standard for normalized text embeddings.

Embed — text → 768-d vector

func (a *AI) Embed(ctx context.Context, text string) ([]float32, error) {
	url := a.cfg.OllamaURL + "/api/embed"
	body, _ := json.Marshal(map[string]any{
		"model": a.cfg.OllamaEmbedModel,
		"input": text,
	})
	req, err := http.NewRequestWithContext(ctx, "POST", url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := a.client.Do(req)
	if err != nil {
		return nil, fmt.Errorf("ollama embed: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		buf, _ := io.ReadAll(resp.Body)
		return nil, fmt.Errorf("ollama embed status %d: %s", resp.StatusCode, string(buf))
	}
	var er struct {
		Embeddings [][]float32 `json:"embeddings"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&er); err != nil {
		return nil, err
	}
	if len(er.Embeddings) == 0 {
		return nil, fmt.Errorf("empty embedding response")
	}
	return er.Embeddings[0], nil
}

Ollama's /api/embed can take a single string or an array — it returns embeddings as a 2D array. We always pass a single string so we just take the first row.

nomic-embed-text outputs 768-dimensional vectors. Each dimension is a float32; total = 768 × 4 = 3072 bytes per chunk. For 10 000 chunks that's ~30 MB of vectors, which Chroma keeps in memory for HNSW.

The anonymous struct struct { Embeddings [][]float32 ... } is a one-shot type definition — useful for parsing API responses you only care about briefly.

IndexDoc — store one chunk in Chroma

func (a *AI) IndexDoc(ctx context.Context, id, text string, meta map[string]string) error {
	if err := a.ensureCollection(ctx); err != nil {
		return err
	}
	vec, err := a.Embed(ctx, text)
	if err != nil {
		return err
	}

	// Look up the collection ID
	collID, err := a.collectionID(ctx)
	if err != nil {
		return err
	}

	url := fmt.Sprintf("%s/api/v2/tenants/%s/databases/%s/collections/%s/add", a.cfg.ChromaURL, chromaTenant, chromaDatabase, collID)

	metaIface := map[string]any{}
	for k, v := range meta {
		metaIface[k] = v
	}

	body, _ := json.Marshal(map[string]any{
		"ids":        []string{id},
		"embeddings": [][]float32{vec},
		"documents":  []string{text},
		"metadatas":  []map[string]any{metaIface},
	})
	req, err := http.NewRequestWithContext(ctx, "POST", url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := a.client.Do(req)
	if err != nil {
		return fmt.Errorf("chroma add: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		buf, _ := io.ReadAll(resp.Body)
		return fmt.Errorf("chroma add status %d: %s", resp.StatusCode, string(buf))
	}
	return nil
}

The Chroma "add" payload is parallel arrays: ids[i] belongs with embeddings[i], documents[i], metadatas[i]. We send one item per call — for bulk indexing of a big site you'd batch dozens or hundreds in a single request to save round-trips.

The metaIface conversion is annoying boilerplate: Chroma expects map[string]any values but we receive map[string]string. We could just use any from the start, but the typed map enforces that we only put strings in metadata.
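To make the batching idea concrete, here is a sketch of a bulk variant. It is not part of ai.go; it reuses the same helpers (ensureCollection, collectionID, Embed) and the same v2 add endpoint, just with more than one element per array:

// IndexDocsBatch is a sketch, not part of ai.go: index many chunks with a
// single Chroma "add" round-trip. ids, texts, and metas are parallel slices.
func (a *AI) IndexDocsBatch(ctx context.Context, ids, texts []string, metas []map[string]any) error {
	if err := a.ensureCollection(ctx); err != nil {
		return err
	}
	collID, err := a.collectionID(ctx)
	if err != nil {
		return err
	}
	// Embed each chunk up front. Ollama's /api/embed also accepts an array
	// input, which would collapse this loop into a single request as well.
	vecs := make([][]float32, 0, len(texts))
	for _, t := range texts {
		v, err := a.Embed(ctx, t)
		if err != nil {
			return err
		}
		vecs = append(vecs, v)
	}
	url := fmt.Sprintf("%s/api/v2/tenants/%s/databases/%s/collections/%s/add",
		a.cfg.ChromaURL, chromaTenant, chromaDatabase, collID)
	body, _ := json.Marshal(map[string]any{
		"ids":        ids,
		"embeddings": vecs,
		"documents":  texts,
		"metadatas":  metas,
	})
	req, err := http.NewRequestWithContext(ctx, "POST", url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := a.client.Do(req)
	if err != nil {
		return fmt.Errorf("chroma batch add: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("chroma batch add status %d", resp.StatusCode)
	}
	return nil
}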

QueryChunks — find most-similar text

// QueryChunks returns top-K most similar chunks to the prompt.
// If jobID is non-empty, results are filtered to that job.
func (a *AI) QueryChunks(ctx context.Context, prompt string, topK int, jobID string) ([]string, []map[string]any, error) {
	if topK <= 0 {
		topK = 5
	}
	if err := a.ensureCollection(ctx); err != nil {
		return nil, nil, err
	}
	vec, err := a.Embed(ctx, prompt)
	if err != nil {
		return nil, nil, err
	}
	collID, err := a.collectionID(ctx)
	if err != nil {
		return nil, nil, err
	}

	queryBody := map[string]any{
		"query_embeddings": [][]float32{vec},
		"n_results":        topK,
		"include":          []string{"documents", "metadatas"},
	}
	if jobID != "" {
		queryBody["where"] = map[string]string{"job_id": jobID}
	}

	body, _ := json.Marshal(queryBody)
	url := fmt.Sprintf("%s/api/v2/tenants/%s/databases/%s/collections/%s/query", a.cfg.ChromaURL, chromaTenant, chromaDatabase, collID)
	req, err := http.NewRequestWithContext(ctx, "POST", url, bytes.NewReader(body))
	if err != nil {
		return nil, nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := a.client.Do(req)
	if err != nil {
		return nil, nil, fmt.Errorf("chroma query: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		buf, _ := io.ReadAll(resp.Body)
		return nil, nil, fmt.Errorf("chroma query status %d: %s", resp.StatusCode, string(buf))
	}

	var qr struct {
		Documents [][]string         `json:"documents"`
		Metadatas [][]map[string]any `json:"metadatas"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&qr); err != nil {
		return nil, nil, err
	}
	if len(qr.Documents) == 0 {
		return nil, nil, nil
	}
	return qr.Documents[0], qr.Metadatas[0], nil
}

The Chroma response shape is nested: Documents[queryIndex][resultIndex]. We send one query at a time, so we always read row 0.

The where clause is Chroma's metadata filter — only chunks whose metadata matches will be considered for similarity. This is how the per-job scoping works: we tag each chunk with its job_id at index time and filter at query time.

You can do more complex filters: {"$and": [{"job_id": "abc"}, {"depth": {"$lte": 1}}]}, etc. Documented in the Chroma reference.
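In Go terms, building that compound filter for QueryChunks is just a nested map (a sketch; the depth key is illustrative, since our chunks only store job_id, url, title, and chunk in metadata):

	// Sketch: a compound metadata filter. Only chunks matching BOTH clauses
	// are candidates for similarity ranking.
	queryBody["where"] = map[string]any{
		"$and": []map[string]any{
			{"job_id": jobID},
			{"depth": map[string]any{"$lte": 1}},
		},
	}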

Generate — call the LLM

func (a *AI) Generate(ctx context.Context, prompt string) (string, error) {
	url := a.cfg.OllamaURL + "/api/generate"
	body, _ := json.Marshal(map[string]any{
		"model":  a.cfg.OllamaModel,
		"prompt": prompt,
		"stream": false,
		"options": map[string]any{
			"temperature": 0.2,
		},
	})
	req, err := http.NewRequestWithContext(ctx, "POST", url, bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := a.client.Do(req)
	if err != nil {
		return "", fmt.Errorf("ollama generate: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		buf, _ := io.ReadAll(resp.Body)
		return "", fmt.Errorf("ollama generate status %d: %s", resp.StatusCode, string(buf))
	}
	var gr struct {
		Response string `json:"response"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&gr); err != nil {
		return "", err
	}
	return strings.TrimSpace(gr.Response), nil
}

stream: false tells Ollama to return one big JSON blob instead of token-by-token streaming events. Streaming is great for chat UIs but adds parsing complexity we don't need for an HTTP API.

"temperature": 0.2 — low temperature means the model picks high-probability tokens consistently. Good for factual Q&A. Set higher (0.7-1.0) for creative writing where you want variety.

ChunkText & BuildRAGPrompt

// ChunkText splits a long document into pieces near a target size,
// breaking on word boundaries.
func ChunkText(text string, maxLen int) []string {
	text = strings.TrimSpace(text)
	if len(text) <= maxLen {
		if text == "" {
			return nil
		}
		return []string{text}
	}
	var chunks []string
	for len(text) > 0 {
		if len(text) <= maxLen {
			chunks = append(chunks, text)
			break
		}
		end := maxLen
		// back off to last whitespace within the slice
		for end > maxLen/2 && !isSpace(text[end]) {
			end--
		}
		if end == maxLen/2 {
			end = maxLen
		}
		chunks = append(chunks, strings.TrimSpace(text[:end]))
		text = strings.TrimSpace(text[end:])
	}
	return chunks
}

func isSpace(b byte) bool {
	return b == ' ' || b == '\n' || b == '\t'
}

// BuildRAGPrompt assembles a context-aware prompt for the LLM.
func BuildRAGPrompt(question string, chunks []string) string {
	var b strings.Builder
	b.WriteString("You are a helpful assistant. Answer the user's question using ONLY the provided context. ")
	b.WriteString("If the context does not contain the answer, say you don't know. Be concise and cite which context block(s) you used.\n\n")
	b.WriteString("CONTEXT:\n")
	for i, c := range chunks {
		fmt.Fprintf(&b, "[%d] %s\n\n", i+1, c)
	}
	b.WriteString("\nQUESTION: ")
	b.WriteString(question)
	b.WriteString("\n\nANSWER:")
	return b.String()
}

Word-boundary chunking. Cutting at the exact maxLen would split words and weaken the embedding. We back off to the previous whitespace, but only within the second half of the chunk — if no whitespace exists, we fall back to a hard cut. This avoids degenerate behavior on a single 10000-char "word" (rare but possible with code or URLs).
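Tracing the function above on a small input shows the word-boundary behavior:

// maxLen 12: each cut backs off to the previous space, so no word is split.
chunks := ChunkText("alpha beta gamma delta epsilon", 12)
// → []string{"alpha beta", "gamma delta", "epsilon"}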

strings.Builder is the efficient way to assemble large strings — appending to a string in a loop creates O(n²) garbage; Builder amortizes to O(n).

The prompt template is the heart of RAG quality. Three things matter:

  • "using ONLY the provided context" — discourages hallucination.
  • "If the context does not contain the answer, say you don't know" — explicit fallback so the model doesn't invent.
  • Numbered context blocks [1] [2] [3] — gives the model a way to cite specifically.


11 · Go API

api_test.go

End-to-end tests with every dependency mocked.

Test architecture: a real Server (handlers + scraper + AI) is wired to httptest mocks for Playwright, for Chroma + Ollama, and for an in-memory S3 backing store. httptest.NewServer spins up a real HTTP server on a random port. The Server's Config is built with those URLs, so request paths through the code are 100% real — only the upstream services are simulated.

Test setup

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
	"testing"
	"time"
)

// buildTestServer wires up a *Server backed by mock HTTP servers
// for Playwright, Chroma, Ollama, and an in-memory S3.
func buildTestServer(t *testing.T) (*Server, func()) {
	t.Helper()

	// Mock S3: trivial in-memory store hit by the AWS SDK
	s3Store := struct {
		sync.Mutex
		items map[string][]byte
	}{items: map[string][]byte{}}

	s3Mock := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// e.g. PUT /scraper-data/jobs/abc/pages/x.txt
		// or   GET /scraper-data/jobs/abc/pages/x.txt
		key := strings.TrimPrefix(r.URL.Path, "/scraper-data/")

		switch r.Method {
		case http.MethodPut:
			buf := new(bytes.Buffer)
			buf.ReadFrom(r.Body)
			s3Store.Lock()
			s3Store.items[key] = buf.Bytes()
			s3Store.Unlock()
			w.WriteHeader(http.StatusOK)
		case http.MethodGet:
			s3Store.Lock()
			data, ok := s3Store.items[key]
			s3Store.Unlock()
			if !ok {
				http.NotFound(w, r)
				return
			}
			w.Header().Set("Content-Type", "text/plain")
			w.Write(data)
		default:
			w.WriteHeader(http.StatusMethodNotAllowed)
		}
	}))

t.Helper() tells Go's test runner: "if a test fails inside this function, blame the caller, not me." Without it, failure messages point at the helper file instead of the test.

httptest.NewServer binds a real HTTP listener to a random localhost port and returns a handle. It's a real server — TLS, headers, status codes all real — only the handler is yours.

The S3 mock is a hand-rolled key/value store keyed by the URL path. Because we configured the SDK with UsePathStyle = true, the bucket is in the path: /scraper-data/jobs/.../page.txt. We strip the bucket prefix to get the key.

Mock Playwright, Chroma, Ollama

	playwrightMock := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var body map[string]any
		json.NewDecoder(r.Body).Decode(&body)
		url := body["url"].(string)
		resp := map[string]any{
			"url":   url,
			"title": "Test Page",
			"html":  "<html><body>Hello from " + url + "</body></html>",
			"text":  "Hello from " + url,
			"links": []string{},
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(resp)
	}))

	// Chroma + Ollama in one mux for convenience
	chromaCalls := 0
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v2/", func(w http.ResponseWriter, r *http.Request) {
		chromaCalls++
		if strings.HasSuffix(r.URL.Path, "/collections") && r.Method == http.MethodPost {
			json.NewEncoder(w).Encode(map[string]any{"id": "test-coll-id"})
			return
		}
		if strings.HasSuffix(r.URL.Path, "/query") {
			json.NewEncoder(w).Encode(map[string]any{
				"documents": [][]string{{"matched chunk text"}},
				"metadatas": [][]map[string]any{{{"url": "http://example.com/", "job_id": "any"}}},
			})
			return
		}
		// add or other operations
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/api/embed", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(map[string]any{
			"embeddings": [][]float32{{0.1, 0.2, 0.3}},
		})
	})
	mux.HandleFunc("/api/generate", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(map[string]any{
			"response": "This is a mocked LLM answer.",
		})
	})
	chromaOllamaMock := httptest.NewServer(mux)

	// Build a Server with all upstreams pointing at our mocks
	cfg := Config{
		PlaywrightURL:    playwrightMock.URL,
		S3Endpoint:       s3Mock.URL,
		S3Bucket:         "scraper-data",
		AWSRegion:        "us-east-1",
		ChromaURL:        chromaOllamaMock.URL,
		OllamaURL:        chromaOllamaMock.URL,
		OllamaModel:      "test",
		OllamaEmbedModel: "test",
	}

	srv, err := newServer(cfg)
	if err != nil {
		t.Fatalf("newServer: %v", err)
	}

	cleanup := func() {
		s3Mock.Close()
		playwrightMock.Close()
		chromaOllamaMock.Close()
	}
	return srv, cleanup
}

Each mock is a closure with the minimum logic to make the system under test happy. Chroma's add just returns 200 — we don't need to actually store vectors. The query mock returns a hardcoded response, which is fine because the test only checks that the response shape is propagated correctly.

chromaCalls is a side counter — we don't use it in assertions here, but in larger test suites you'd assert "indexing called Chroma exactly 3 times" to catch over-fetching.

The cleanup function closes all three mock servers when the test ends — vital, otherwise leaked listeners pile up across test runs.
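An equivalent pattern on Go 1.14+ is to register the teardown with t.Cleanup inside buildTestServer, so callers can't forget the defer cleanup() line. A minimal sketch of the same wiring:

	// Inside buildTestServer, instead of returning a cleanup func:
	t.Cleanup(func() {
		s3Mock.Close()
		playwrightMock.Close()
		chromaOllamaMock.Close()
	})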

Example test

func TestScrapeFlowEndToEnd(t *testing.T) {
	srv, cleanup := buildTestServer(t)
	defer cleanup()
	r := srv.router()

	// 1) POST /api/scrape — start a scrape
	body := strings.NewReader(`{"url":"http://example.com","depth":0,"format":"text"}`)
	req := httptest.NewRequest("POST", "/api/scrape", body)
	req.Header.Set("Content-Type", "application/json")
	w := httptest.NewRecorder()
	r.ServeHTTP(w, req)
	if w.Code != http.StatusAccepted {
		t.Fatalf("expected 202, got %d: %s", w.Code, w.Body.String())
	}
	var job ScrapeJob
	json.NewDecoder(w.Body).Decode(&job)

	// 2) Wait for completion (polling because run is async)
	deadline := time.Now().Add(5 * time.Second)
	for time.Now().Before(deadline) {
		srv.mu.RLock()
		j := srv.jobs[job.ID]
		srv.mu.RUnlock()
		if j != nil && (j.Status == "done" || j.Status == "failed") {
			job = *j
			break
		}
		time.Sleep(50 * time.Millisecond)
	}
	if job.Status != "done" {
		t.Fatalf("expected done, got %q (err=%q)", job.Status, job.Error)
	}
	if len(job.Pages) != 1 {
		t.Fatalf("expected 1 page, got %d", len(job.Pages))
	}

	// 3) POST /api/scrapes/{id}/index
	req = httptest.NewRequest("POST", "/api/scrapes/"+job.ID+"/index", nil)
	w = httptest.NewRecorder()
	r.ServeHTTP(w, req)
	if w.Code != http.StatusOK {
		t.Fatalf("index expected 200, got %d: %s", w.Code, w.Body.String())
	}

	// 4) POST /api/query
	body = strings.NewReader(fmt.Sprintf(`{"prompt":"hello","job_id":"%s"}`, job.ID))
	req = httptest.NewRequest("POST", "/api/query", body)
	req.Header.Set("Content-Type", "application/json")
	w = httptest.NewRecorder()
	r.ServeHTTP(w, req)
	if w.Code != http.StatusOK {
		t.Fatalf("query expected 200, got %d: %s", w.Code, w.Body.String())
	}
	var qr QueryResponse
	json.NewDecoder(w.Body).Decode(&qr)
	if !strings.Contains(qr.Answer, "mocked") {
		t.Fatalf("answer should contain 'mocked', got %q", qr.Answer)
	}
}

This is a 4-step round-trip: scrape → poll → index → query. Each step uses httptest.NewRecorder, which captures the response into memory so we can assert on its body and status.

The polling loop is necessary because handleScrape kicks off a goroutine. The test won't see the final state until the goroutine writes back into s.jobs. We poll with a 5-second deadline — generous because the test's mock Playwright is instant.

Asserting on "mocked" is intentional: if the test ever wired up the real Ollama, this assertion would fail. The hardcoded mock string makes the test self-checking.

Other tests in the file

The file also contains five smaller tests that exercise specific units in isolation.
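For flavor, a table-driven test for ChunkText might look like the sketch below (illustrative only, not a copy of one of the five; the real tests may target different units):

func TestChunkTextSketch(t *testing.T) {
	cases := []struct {
		name   string
		in     string
		maxLen int
		want   int // expected number of chunks
	}{
		{"empty input", "   ", 100, 0},
		{"fits in one chunk", "short text", 100, 1},
		{"splits on whitespace", strings.Repeat("word ", 100), 120, 5},
	}
	for _, c := range cases {
		t.Run(c.name, func(t *testing.T) {
			got := ChunkText(c.in, c.maxLen)
			if len(got) != c.want {
				t.Fatalf("ChunkText(len %d, max %d): got %d chunks, want %d",
					len(c.in), c.maxLen, len(got), c.want)
			}
		})
	}
}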

Run the suite with go test -v ./... from inside api/. All 6 tests should pass in <1 s.


12 · Go API

Go Dockerfile

A two-stage build: compile in golang:alpine, run in plain alpine.

Build flow: Stage 1 (builder) starts FROM golang:1.22-alpine, runs go mod download, then CGO_ENABLED=0 go build -o api (heavy: ~400 MB). COPY --from=builder carries the compiled binary into Stage 2 (runtime): FROM alpine:3.20, apk add ca-certificates, CMD ["./api"] (final: ~15 MB).

The full file

# --- Stage 1: build ---
FROM golang:1.22-alpine AS builder
WORKDIR /src

# Cache deps before copying full source
COPY go.mod go.sum ./
RUN go mod download

# Now copy source and build
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o /out/api .

# --- Stage 2: run ---
FROM alpine:3.20
RUN apk add --no-cache ca-certificates
WORKDIR /app
COPY --from=builder /out/api ./api

EXPOSE 8080
CMD ["./api"]

Multi-stage build. The builder stage has the Go toolchain (~400 MB). The runtime stage only needs the compiled binary, so we throw the toolchain away. The final image is ~15 MB.

COPY go.mod go.sum ./ and RUN go mod download happen before the source copy. Docker layer caching means: if your source changes but go.mod doesn't, the dependency download layer is reused — saves seconds-to-minutes per build.

CGO_ENABLED=0 compiles a fully static binary with no dependency on libc. That's why we can copy it to a vanilla alpine base; without this we'd need to match the builder's libc version.

GOOS=linux makes the build target Linux even on macOS/Windows hosts — guarantees a binary that works inside the Linux container.

ca-certificates in the runtime image is needed for HTTPS — Go's TLS stack reads the system trust store. Without it, every https:// call would fail with "x509: certificate signed by unknown authority".


13 · Frontend

Frontend HTML & Tailwind

A single static page; nginx serves it; the browser handles everything else.

UI layout: a header (title + subtitle); a scrape form (URL · depth · format → POST /api/scrape); an AI agent panel (prompt + scope selector → POST /api/query); a jobs table (id · seed · depth · status · pages · indexed · actions, whose "View" / "Index" buttons open the modal viewer or POST /index); plus a hidden modal that overlays content (text, HTML iframe, image).

Document head — Tailwind via CDN

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Web Scraper Console</title>
  <script src="https://cdn.tailwindcss.com"></script>
</head>

The Tailwind CDN script (cdn.tailwindcss.com) generates utility CSS on the fly by scanning the page's class attributes. Great for prototypes — no build step. For production you'd run the actual Tailwind CLI to generate a static, smaller CSS file at build time.

The viewport meta makes the page mobile-friendly: it tells the browser to set the layout viewport equal to the device width, so Tailwind's responsive breakpoints (md:, lg:) are evaluated against the real screen size instead of a zoomed-out desktop-width viewport.

Header & container

<body class="bg-slate-50 text-slate-900">
  <div class="max-w-6xl mx-auto p-6">
    <header class="mb-6">
      <h1 class="text-3xl font-bold">Web Scraper Console</h1>
      <p class="text-slate-600 text-sm">
        Crawl, store to S3, index in ChromaDB, ask the agent.
      </p>
    </header>

max-w-6xl mx-auto caps the content width at 72 rem (1152 px) and centers it horizontally — standard Tailwind container pattern.

p-6 = 1.5 rem padding on all sides. Tailwind's spacing scale is 4 px-based (p-1 = 4 px, p-6 = 24 px).

Scrape form

<section class="bg-white rounded-xl shadow p-5 mb-6">
  <h2 class="text-lg font-semibold mb-3">Start a scrape</h2>
  <div class="grid md:grid-cols-12 gap-3">
    <input id="urlInput" type="url" placeholder="https://example.com"
           class="md:col-span-6 px-3 py-2 border rounded-md focus:outline-none focus:ring-2 focus:ring-blue-500">
    <input id="depthInput" type="number" min="0" max="5" value="0"
           class="md:col-span-2 px-3 py-2 border rounded-md">
    <select id="formatInput" class="md:col-span-2 px-3 py-2 border rounded-md">
      <option value="text">Text</option>
      <option value="html">HTML</option>
      <option value="screenshot">Screenshot</option>
    </select>
    <button id="scrapeBtn"
            class="md:col-span-2 bg-blue-600 hover:bg-blue-700 text-white font-medium rounded-md px-4 py-2">
      Scrape
    </button>
  </div>
  <p id="scrapeMsg" class="text-sm text-slate-500 mt-2"></p>
</section>

grid md:grid-cols-12 gap-3 — at the md breakpoint (≥768 px) the form becomes a 12-column grid; below that, a single column (the elements stack). Each input takes md:col-span-N to claim that many columns.

focus:ring-2 focus:ring-blue-500 shows a blue ring when the input is keyboard-focused. Tailwind's focus: prefix is great for accessibility — it preserves the focus indicator instead of stripping it like raw CSS resets often do.

Each interactive element gets an id so the JS can grab it (no jQuery, no React, just document.getElementById).

Jobs table

<section class="bg-white rounded-xl shadow p-5 mb-6">
  <div class="flex items-center justify-between mb-3">
    <h2 class="text-lg font-semibold">Jobs</h2>
    <button id="refreshBtn" class="text-sm text-blue-600 hover:underline">Refresh</button>
  </div>
  <div class="overflow-x-auto">
    <table class="w-full text-sm">
      <thead class="text-left bg-slate-100 text-slate-600">
        <tr>
          <th class="p-2">ID</th>
          <th class="p-2">Seed URL</th>
          <th class="p-2">Depth</th>
          <th class="p-2">Status</th>
          <th class="p-2">Pages</th>
          <th class="p-2">Indexed</th>
          <th class="p-2">Actions</th>
        </tr>
      </thead>
      <tbody id="jobsBody"></tbody>
    </table>
  </div>
</section>

overflow-x-auto on the table wrapper means narrow viewports get a horizontal scroll instead of mangled cells.

The <tbody id="jobsBody"> is left empty — the JS fills it. This is the cleanest way to do "render a list" without a framework: keep the structural HTML, render only the dynamic part.

AI agent panel

<section class="bg-white rounded-xl shadow p-5 mb-6">
  <h2 class="text-lg font-semibold mb-3">Ask the agent</h2>
  <div class="grid md:grid-cols-12 gap-3">
    <textarea id="promptInput" rows="3" placeholder="Summarize the site, list main sections..."
              class="md:col-span-9 px-3 py-2 border rounded-md"></textarea>
    <div class="md:col-span-3 flex flex-col gap-2">
      <select id="jobScope" class="px-3 py-2 border rounded-md">
        <option value="">All indexed jobs</option>
      </select>
      <button id="queryBtn" class="bg-emerald-600 hover:bg-emerald-700 text-white rounded-md px-4 py-2">
        Ask
      </button>
    </div>
  </div>
  <div id="answerBox" class="mt-4 hidden">
    <h3 class="text-sm font-semibold text-slate-700">Answer</h3>
    <p id="answerText" class="text-sm whitespace-pre-wrap"></p>
    <h4 class="text-sm font-semibold text-slate-700 mt-2">Sources</h4>
    <ul id="sourcesList" class="text-xs text-slate-600 list-disc pl-5"></ul>
  </div>
</section>

whitespace-pre-wrap on the answer text preserves line breaks from the LLM response while still wrapping long lines — important because Ollama often returns multi-line answers.

#answerBox starts hidden (class="... hidden") and is shown via JS once an answer arrives. This is the simplest way to do conditional rendering without a framework.

The job scope <select> starts with one option ("All indexed jobs", value=""); the JS adds entries for each indexed job as they appear.

Modal viewer

<div id="modalBg" class="hidden fixed inset-0 bg-black/60 z-40"></div>
<div id="modal"
     class="hidden fixed inset-4 md:inset-10 bg-white rounded-xl shadow-2xl z-50 flex flex-col">
  <div class="flex items-center justify-between p-3 border-b">
    <h3 id="modalTitle" class="font-semibold">Page</h3>
    <button id="modalClose" class="text-slate-500 hover:text-slate-900 text-2xl leading-none">&times;</button>
  </div>
  <div id="modalBody" class="flex-1 overflow-auto p-4"></div>
</div>

The classic two-element modal: a semi-transparent backdrop (#modalBg) and the dialog itself (#modal). Both start hidden; JS shows them together. The backdrop has a lower z-40 than the dialog's z-50 so the dialog is always on top.

fixed inset-4 md:inset-10 pins the modal 16 px from each edge on mobile, 40 px on desktop. The flex flex-col layout puts the title bar at the top, content area expanding to fill.

bg-black/60 is Tailwind's color-with-opacity syntax — black at 60% opacity. Modern, no need for rgba().

Frontend Dockerfile

FROM nginx:alpine
COPY index.html /usr/share/nginx/html/index.html
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]

This is the minimal "static site in a container" pattern: copy the file to nginx's default web root, run nginx in foreground (daemon off; makes Docker keep the container alive). ~25 MB final image.


14 · Frontend

Frontend JavaScript

~80 lines of vanilla JS that wire the page to the API.

User events and the API calls they trigger: clicking "Scrape" sends POST /api/scrape; a poll every 4 s issues GET /api/scrapes; clicking "View page" fetches GET .../page?key=...; clicking "Index" sends POST .../index; clicking "Ask" sends POST /api/query.

API helper & element shortcuts

const API = "http://localhost:8080";
const $ = (id) => document.getElementById(id);

async function api(path, opts = {}) {
  const res = await fetch(API + path, {
    ...opts,
    headers: { 'Content-Type': 'application/json', ...(opts.headers || {}) },
  });
  if (!res.ok) {
    const txt = await res.text();
    throw new Error(`${res.status}: ${txt}`);
  }
  return res.json();
}

$ = (id) => document.getElementById(id) is a tiny shortcut so we can write $('urlInput') instead of document.getElementById('urlInput'). Saves ~20 keystrokes per use.

The api() helper centralizes three things: setting the JSON content-type, resolving against the API base URL, and turning non-2xx responses into thrown errors. Without it, every call would have to repeat that boilerplate.

Spread syntax ...(opts.headers || {}) lets callers add their own headers while keeping the default content-type — the spread merges objects, with later keys overriding earlier ones.

Loading & rendering jobs

async function loadJobs() {
  try {
    const jobs = await api('/api/scrapes');
    jobs.sort((a, b) => new Date(b.created_at) - new Date(a.created_at));
    renderJobs(jobs);
    refreshScopeOptions(jobs);
  } catch (e) {
    console.error(e);
  }
}

function renderJobs(jobs) {
  const tbody = $('jobsBody');
  if (!jobs.length) {
    tbody.innerHTML = '<tr><td colspan="7" class="p-4 text-slate-500">No jobs yet</td></tr>';
    return;
  }
  tbody.innerHTML = jobs.map(j => `
    <tr class="border-t">
      <td class="p-2 font-mono text-xs">${j.id.slice(0, 8)}</td>
      <td class="p-2 truncate max-w-xs"><a href="${j.seed_url}" target="_blank" class="text-blue-600 hover:underline">${j.seed_url}</a></td>
      <td class="p-2">${j.depth}</td>
      <td class="p-2"><span class="px-2 py-0.5 rounded text-xs ${statusClass(j.status)}">${j.status}</span></td>
      <td class="p-2">${(j.pages || []).length}</td>
      <td class="p-2">${j.indexed ? '✅' : '—'}</td>
      <td class="p-2 space-x-2">
        <button onclick="viewJob('${j.id}')" class="text-blue-600 hover:underline">View</button>
        ${j.status === 'done' && !j.indexed
          ? `<button onclick="indexJob('${j.id}')" class="text-emerald-600 hover:underline">Index</button>`
          : ''}
      </td>
    </tr>
  `).join('');
}

function statusClass(s) {
  return {
    pending: 'bg-slate-200 text-slate-700',
    running: 'bg-amber-100 text-amber-800',
    done:    'bg-emerald-100 text-emerald-800',
    failed:  'bg-rose-100 text-rose-800',
  }[s] || 'bg-slate-100';
}

Template literals (backtick strings with ${...}) build HTML in one expression. jobs.map(j => ...).join('') is the idiomatic vanilla-JS way to render a list; a framework would give you a loop construct instead, but for small lists this works fine.

statusClass() uses object lookup as a switch. Cleaner than nested ifs, fast, and easy to extend with new statuses.

The conditional Index button uses a ternary inside the template literal: essentially hand-rolled JSX, clunkier but it works.

Security note: directly interpolating j.seed_url into HTML is a stored-XSS hole if the API ever returns user-controlled data. For a real product, escape with a helper or use textContent. We accept it here because we're talking to our own API.

Viewing pages

async function viewJob(id) {
  const job = await api(`/api/scrapes/${id}`);
  $('modalTitle').textContent = `Job ${id.slice(0, 8)} — ${job.seed_url}`;
  if (!job.pages || !job.pages.length) {
    $('modalBody').innerHTML = '<p class="text-slate-500">No pages yet.</p>';
  } else {
    $('modalBody').innerHTML = `
      <ul class="space-y-2">
        ${job.pages.map((p, i) => `
          <li class="border rounded p-2">
            <div class="text-xs text-slate-500">${p.url}</div>
            <div class="font-medium">${p.title || '(no title)'}</div>
            <button onclick="showPage('${id}', '${encodeURIComponent(p.s3_key)}', '${p.format}')"
                    class="text-blue-600 text-sm hover:underline">Open</button>
          </li>
        `).join('')}
      </ul>
    `;
  }
  showModal();
}

function showPage(jobId, encKey, format) {
  const url = `${API}/api/scrapes/${jobId}/page?key=${encKey}`;
  if (format === 'screenshot') {
    $('modalBody').innerHTML = `<img src="${url}" class="max-w-full" alt="screenshot">`;
  } else if (format === 'html') {
    $('modalBody').innerHTML = `<iframe src="${url}" class="w-full h-full border-0"></iframe>`;
  } else {
    fetch(url).then(r => r.text()).then(t => {
      $('modalBody').innerHTML = `<pre class="whitespace-pre-wrap text-sm">${escapeHtml(t)}</pre>`;
    });
  }
}

function escapeHtml(s) {
  return s.replace(/[&<>"']/g, c => ({'&':'&amp;','<':'&lt;','>':'&gt;','"':'&quot;',"'":'&#39;'}[c]));
}

Format-aware rendering. Screenshots become <img> tags, HTML goes into an iframe (cross-origin to the console page, so its scripts can't reach the parent DOM; add the sandbox attribute to lock it down further), and text gets fetched and HTML-escaped before insertion.

encodeURIComponent on the S3 key handles slashes, equals, etc. Without it the query string would be corrupted.

For text we use <pre class="whitespace-pre-wrap"> — preserves line breaks but wraps long lines, so a 5000-char paragraph still displays nicely.

The escape function maps the five HTML-special characters. Tiny but essential for showing untrusted page text.

Index, scrape, query

async function indexJob(id) {
  $('scrapeMsg').textContent = `Indexing ${id.slice(0, 8)}…`;
  try {
    const r = await api(`/api/scrapes/${id}/index`, { method: 'POST' });
    $('scrapeMsg').textContent = `Indexed ${r.chunks_indexed} chunks.`;
    loadJobs();
  } catch (e) {
    $('scrapeMsg').textContent = 'Index error: ' + e.message;
  }
}

$('scrapeBtn').onclick = async () => {
  const url = $('urlInput').value.trim();
  if (!url) return;
  const body = {
    url,
    depth: parseInt($('depthInput').value || '0', 10),
    format: $('formatInput').value
  };
  $('scrapeMsg').textContent = 'Submitting…';
  try {
    const j = await api('/api/scrape', { method: 'POST', body: JSON.stringify(body) });
    $('scrapeMsg').textContent = `Job ${j.id.slice(0, 8)} created`;
    loadJobs();
  } catch (e) {
    $('scrapeMsg').textContent = 'Error: ' + e.message;
  }
};

$('queryBtn').onclick = async () => {
  const prompt = $('promptInput').value.trim();
  if (!prompt) return;
  const body = { prompt, job_id: $('jobScope').value, top_k: 5 };
  $('answerText').textContent = 'Thinking…';
  $('answerBox').classList.remove('hidden');
  try {
    const r = await api('/api/query', { method: 'POST', body: JSON.stringify(body) });
    $('answerText').textContent = r.answer;
    $('sourcesList').innerHTML = (r.sources || []).map(s =>
      `<li><a href="${s}" target="_blank" class="text-blue-600 hover:underline">${s}</a></li>`
    ).join('');
  } catch (e) {
    $('answerText').textContent = 'Error: ' + e.message;
  }
};

$('refreshBtn').onclick = loadJobs;
loadJobs();
setInterval(loadJobs, 4000);

Three button handlers: Scrape kicks off a job, Ask calls /query, Refresh re-fetches the list. The polling at the bottom (setInterval(loadJobs, 4000)) means new jobs and status transitions appear automatically — no need to click Refresh during a crawl.

The "Thinking…" placeholder appears immediately so the user sees feedback even while waiting 30+ s for the LLM. Without it the page would look broken.

parseInt(value, 10) with the explicit radix is good practice — without it, leading zeros could trigger octal parsing in old engines.


15 · AI

ChromaDB — what's actually stored?

A vector database, demystified.

A row in the collection has four parts:

  • id: jobABC::https://example.com/page::0
  • document: "Welcome to Example. We sell widgets and gizmos. Our team..." (the original chunk text, used by RAG to build the prompt)
  • embedding: 768 floats, [0.0314, -0.221, 0.087, 0.553, -0.118, 0.041, ..., 0.302] (used by HNSW for nearest-neighbor search)
  • metadata: { job_id: "ABC", url: "https://example.com/page", title: "Example", chunk: "0" }

The four parallel arrays

Every Chroma operation works with parallel arrays — index i across all of them refers to one logical row.

{
  "ids":        ["jobA::url1::0", "jobA::url1::1", "jobB::url2::0"],
  "documents":  ["chunk text 1",  "chunk text 2",  "chunk text 3"],
  "embeddings": [[0.1, ...],      [0.2, ...],      [0.3, ...]],
  "metadatas":  [{"job_id":"A"},  {"job_id":"A"},  {"job_id":"B"}]
}

The three things you can do

add

Insert N rows. We pre-compute embeddings client-side (Ollama) so Chroma doesn't have to know about any model.

query

Find K rows most similar (by cosine distance) to a query embedding. Optional where filter on metadata.

delete

Remove rows by ID list or by metadata filter — useful for "delete all chunks for this job".
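ai.go doesn't implement delete, but a sketch following the same pattern as IndexDoc could look like the function below. It assumes the v2 delete endpoint sits alongside add and query (worth verifying against the Chroma reference):

// DeleteJobChunks is a sketch, not part of ai.go: remove every chunk indexed
// for one job by metadata filter instead of listing IDs.
func (a *AI) DeleteJobChunks(ctx context.Context, jobID string) error {
	if err := a.ensureCollection(ctx); err != nil {
		return err
	}
	collID, err := a.collectionID(ctx)
	if err != nil {
		return err
	}
	url := fmt.Sprintf("%s/api/v2/tenants/%s/databases/%s/collections/%s/delete",
		a.cfg.ChromaURL, chromaTenant, chromaDatabase, collID)
	body, _ := json.Marshal(map[string]any{
		"where": map[string]string{"job_id": jobID},
	})
	req, err := http.NewRequestWithContext(ctx, "POST", url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := a.client.Do(req)
	if err != nil {
		return fmt.Errorf("chroma delete: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("chroma delete status %d", resp.StatusCode)
	}
	return nil
}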

Why client-side embeddings?

Chroma supports server-side embeddings via embedding functions, but our setup is the opposite: we compute embeddings in ai.go via Ollama and ship them to Chroma. Reasons:

  • The same Embed code path produces both index-time and query-time vectors, so both sides of the similarity comparison are guaranteed to come from the same model.
  • Chroma stays a plain vector store: it needs no model configuration and no network path to Ollama.
  • Every AI call is concentrated in ai.go, which keeps the tests simple: the Chroma mock only has to return 200.

The v2 URL structure

POST /api/v2/tenants/{tenant}/databases/{db}/collections        # get-or-create
POST /api/v2/tenants/{tenant}/databases/{db}/collections/{id}/add
POST /api/v2/tenants/{tenant}/databases/{db}/collections/{id}/query

The tenant and database levels exist for multi-tenant deployments — a SaaS hosting embeddings for many customers would put each in its own tenant, hard-isolating their data. We use default_tenant / default_database.


16 · AI

Ollama & the RAG pipeline end-to-end

Two models, one server, one HTTP API.

Full RAG pipeline (ingest → query → answer). Ingest: page text (from S3) → ChunkText (~1500 chars) → nomic-embed-text (768-dim vector) → Chroma add with metadata. Query: user prompt ("what is X") → nomic-embed-text (768-dim vector) → Chroma top-K cosine similarity → BuildRAGPrompt (stuff context blocks) → llama3.2 generate at temperature 0.2 → grounded answer + source URLs.

The two models

nomic-embed-text

Embedder. Input: text up to ~8K tokens. Output: 768 floats. Optimized purely for similarity — does not generate text.

Size: ~270 MB

llama3.2

Generator. Conversational LLM, ~3 B parameters. Quantized for CPU runs.

Size: ~2 GB

Ollama API endpoints we use

// Embed
POST /api/embed
{
  "model": "nomic-embed-text",
  "input": "text to embed"
}
→ { "embeddings": [[0.1, 0.2, ...]] }

// Generate
POST /api/generate
{
  "model": "llama3.2",
  "prompt": "...",
  "stream": false,
  "options": { "temperature": 0.2 }
}
→ { "response": "...", "done": true, ... }

The RAG prompt template, annotated

You are a helpful assistant. Answer the user's question using ONLY the
provided context. If the context does not contain the answer, say you
don't know. Be concise and cite which context block(s) you used.

CONTEXT:
[1] <chunk text from page A>
[2] <chunk text from page B>
[3] <chunk text from page C>

QUESTION: How do I do X?

ANSWER:

Three deliberate elements:

  • "using ONLY the provided context" — discourages hallucination.
  • "If the context does not contain the answer, say you don't know" — explicit fallback so the model doesn't invent.
  • Numbered context blocks [1] [2] [3] — give the model a way to cite specifically.


17 · Operations

Running the stack

From git clone to your first scrape.

Startup timeline: api, playwright, and the frontend build and start within the first few seconds; LocalStack's init script runs at ~10 s; ChromaDB boots and opens its port by ~30 s; Ollama's server comes up quickly but has no models yet; ollama-init pulls llama3.2 + nomic-embed-text and finishes around the 3-minute mark. First boot is the slow one because Ollama downloads the models. Subsequent boots reuse the volume.

First boot

cd scraper
docker compose up --build

# wait ~3 minutes the first time, ~30 s thereafter
# watch the ollama-init container exit cleanly when models are pulled
docker compose ps

--build forces a rebuild of any locally-built images (api, playwright, frontend). Without it, Compose reuses cached images, which is fine until you change source files.

The first run is slow because Ollama has to download llama3.2 (~2 GB) and nomic-embed-text (~270 MB). Both go into the ollama-data volume and survive container restarts.

Smoke tests

# API up?
curl -s http://localhost:8080/health
# → {"status":"ok"}

# Playwright up?
curl -s http://localhost:3000/health
# → {"status":"ok"}

# LocalStack bucket exists?
docker compose exec localstack awslocal s3 ls
# → 2025-XX-XX  scraper-data

# Ollama models present?
docker compose exec ollama ollama list
# → llama3.2, nomic-embed-text both listed

# Open the UI
open http://localhost:8000

A first scrape via curl

# Submit a job
curl -s -X POST http://localhost:8080/api/scrape \
  -H 'Content-Type: application/json' \
  -d '{"url":"https://example.com","depth":1,"format":"text"}' | jq

# Poll until done
JOB_ID=<copy id from response>
curl -s http://localhost:8080/api/scrapes/$JOB_ID | jq '.status, .pages | length'

# Index it
curl -s -X POST http://localhost:8080/api/scrapes/$JOB_ID/index | jq

# Ask the agent
curl -s -X POST http://localhost:8080/api/query \
  -H 'Content-Type: application/json' \
  -d "{\"prompt\":\"summarize this site\",\"job_id\":\"$JOB_ID\"}" | jq

Resetting

# Stop containers (volumes survive)
docker compose down

# Stop AND wipe volumes (full reset — re-downloads Ollama models on next up)
docker compose down -v

# Just rebuild one service
docker compose build playwright
docker compose up -d playwright


18 · Operations

Troubleshooting

Real errors we hit, what they meant, how to fix them.

Debugging flow: the UI shows an error ("500", "502", etc.) → docker compose logs api to find where it died → docker compose logs <upstream> (localstack / chroma / ollama / playwright) → docker compose exec <svc> sh to poke around inside the container. Most failures are a missing config (env vars), a not-yet-ready dependency, or a version mismatch.

Common errors & fixes

browserType.launch: Executable doesn't exist at /ms-playwright/chromium_headless_shell-1217/...

What it means: the npm-resolved Playwright version expects a chromium binary that isn't in the base image. Happens when package.json uses ^1.47.0 and npm pulls 1.49+ which expects newer browsers.

Fix: use a generic Node image and run npx playwright install --with-deps chromium at build time, so the browser always matches the resolved npm version. Then rebuild:

docker compose build --no-cache playwright
docker compose up -d playwright

connection refused from the API to upstreams during boot

What it means: depends_on only waits for the dependency to start, not to be ready. Playwright/Chroma/Ollama may still be initializing when the API makes its first request.

Fix: retry the call (the user just clicks again), or add explicit healthchecks and use condition: service_healthy in depends_on:

depends_on:
  playwright:
    condition: service_healthy

model 'llama3.2' not found from Ollama

What it means: the ollama-init container hasn't finished pulling models, or it failed silently.

Diagnosis:

docker compose logs ollama-init
docker compose exec ollama ollama list   # should show both models

Fix: run the pull manually, then restart the API:

docker compose exec ollama ollama pull llama3.2
docker compose exec ollama ollama pull nomic-embed-text

S3 calls fail with NoSuchBucket: The specified bucket does not exist

What it means: LocalStack's init-aws.sh didn't run. Common causes: file isn't executable, line endings are CRLF (from Windows), or the volume mount path is wrong.

Fix:

chmod +x scripts/init-aws.sh
# convert CRLF if needed:
dos2unix scripts/init-aws.sh
docker compose down -v
docker compose up

Or just create the bucket manually:

docker compose exec localstack awslocal s3 mb s3://scraper-data

Ollama responses extremely slow (60s+) on a CPU

What it means: the llama3.2 3B model on a CPU is genuinely that slow. Embeddings are fine; generation is the bottleneck.

Mitigations:

  • Use a smaller model: OLLAMA_MODEL=llama3.2:1b (faster, lower quality).
  • Provide a GPU: add deploy.resources.reservations.devices with NVIDIA driver support to the ollama service.
  • Use an API-hosted LLM (OpenAI, Anthropic) and adapt ai.go Generate to call it instead.

Frontend shows nothing or CORS error

What it means: the browser blocked the cross-origin request from localhost:8000 to localhost:8080, probably because the API didn't return the right CORS headers (or it's down).

Diagnosis: open the browser devtools console, look at the failed request's response headers. Check API logs for incoming requests. If no requests at all, the API URL in frontend/index.html may be wrong.

Fix: ensure const API = "http://localhost:8080" at the top of the JS, and that the API container is running.

Port already in use: bind: address already in use

What it means: something on your host is already using 8000, 8080, 3000, 4566, 8001, or 11434.

Find the offender:

lsof -i :8080      # macOS/Linux
netstat -ano | findstr :8080   # Windows

Fix: kill the conflicting process, or change the host-side port in docker-compose.yml: "8081:8080" publishes to a different host port.

Useful commands cheat sheet

# See all running containers
docker compose ps

# Tail one service's logs
docker compose logs -f api
docker compose logs -f ollama

# Shell into a running container
docker compose exec api sh
docker compose exec localstack bash

# Direct curl to inner services (via published ports)
curl http://localhost:3000/health                    # playwright
curl http://localhost:8001/api/v2/heartbeat          # chroma
curl http://localhost:11434/api/tags                 # ollama
curl http://localhost:4566/_localstack/health        # localstack

# Inspect S3 contents
docker compose exec localstack awslocal s3 ls s3://scraper-data --recursive

# Inspect ChromaDB
curl http://localhost:8001/api/v2/tenants/default_tenant/databases/default_database/collections | jq

# Run Go tests against the local source (no docker needed)
cd api && go test -v ./...
