Local-first & data ingestion

Contextful is local-first in the literal sense: the authoritative copy of every document, every synthesized memory, and every access-control key lives on a machine you run — a Mac Studio in the office, a box in your rack — under ~/.contextful. Connectors pull your company's real surfaces (Stripe, Slack, PostHog, and more) into that store, and the brain synthesizes them into human-readable Markdown memory you can open, edit, and git diff. Cloud services are optional accelerators, never the home of your data.

Where your data lives

The host runs one binary (sync serve) on your own hardware. Everything it knows sits in one directory:

~/.contextful/
  control/        # principals, keys, policy envelopes
  docs/           # per-document CRDT snapshots + oplogs
  brain/          # synthesized memory — Markdown files, per topic
  brain.duckdb    # raw events, index, embeddings, anomalies
  caps/           # issued/attenuated token records (audit trail)

Peers — browsers, teammates' machines, agents — reach the host over your own Tailscale network (a WireGuard mesh). There is no Contextful cloud in the data path: the network is yours, the disk is yours.

How ingestion works

Every source goes through one connector contract: a connector declares the views it exposes (the unit of access control) and pulls raw events, each stamped with provenance and an access tag (acl_tag) at the moment it enters the system.

flowchart LR
    CON["Connectors<br/>Stripe · Slack · PostHog · Exa"] --> ING["Ingest<br/>raw events + provenance + acl_tag"]
    ING --> EXT["Extract<br/>atomic facts & entities"]
    EXT --> SYN["Synthesize<br/>dedupe · supersede · summarize → Markdown"]
    SYN --> IDX["Index<br/>full-text + embeddings + structured views"]
    IDX --> SRV["Serve<br/>capability-filtered retrieval"]
    SYN --> ANO["Anomalies + learnings<br/>baseline vs. period"]
    ANO --> SRV

Three properties matter:

  • Memory is Markdown, not a vector dump. Synthesized knowledge is a tree of Markdown cards a human can read. Cards self-wire: typed wikilinks in the prose become graph edges, so the brain is a navigable knowledge graph.
  • Access tags travel with the data. A derived memory inherits the strictest access requirement of its sources (taint propagation) — synthesis can never launder a private fact into a public card.
  • Nothing is destroyed. Stale facts are superseded with a timestamp, never overwritten, so the brain's history is auditable.

Ingestion runs on demand (sync ingest --source stripe) or on a cron schedule that keeps the brain fresh — nightly Stripe, hourly web enrichment, and an off-peak daydream cycle in which the brain proposes and grounds new connections between cards on its own.

The world stays outside, on purpose

Agents ground answers in public knowledge — list prices, benchmarks, vendor changelogs — via the Exa search API. Those world facts are cached locally with their source URLs, so every external figure is cited. Outbound queries pass an egress firewall: only public-tainted terms may leave the host, so a private value can never be smuggled out inside a search string.

What happens when the cloud goes away

Local-first is tested, not aspirational. With no cloud credentials at all, Contextful degrades to an on-host floor — never to fakes:

  • Structured brain queries and field/row redaction need no LLM and keep working.
  • Inference falls back to a local model server (LM Studio) on the host.
  • Already-fetched world knowledge serves from cache; only fresh lookups pause.
  • Documents remain editable offline and merge cleanly when peers reconnect — see Collaboration & CRDT.

The boundary that protects your data — described in Sandbox & capability tokens — is deterministic code on the host, so it holds with or without an internet connection.