Architecture

Botholomew is split into three cooperating process roles that share a single DuckDB database:

  1. Workers — short-lived or long-running Bun processes that claim tasks from the queue, evaluate schedules, and run LLM tool loops. Multiple workers can run at once; each registers itself in the DB and heartbeats so dead ones are reaped.
  2. The chat TUI — an Ink/React terminal UI you run on demand; it enqueues tasks, browses history, and can dispatch workers via the spawn_worker tool.
  3. The CLI — everything else (task add, schedule list, context search, worker list, …). Each invocation opens its own DuckDB connection.

All three share .botholomew/data.duckdb. DuckDB holds its exclusive file lock at the instance level (not per connection), so to keep the file shareable across processes, no process holds the database open longer than a single logical operation. Each CRUD call runs inside a short-lived withDb(dbPath, fn) from src/db/connection.ts, which acquires a connection, executes, and releases the instance once the last overlapping caller in the process is done. withRetry wraps the acquire path and retries with exponential backoff if another process is holding the lock.
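
A minimal sketch of that shape, assuming the @duckdb/node-api driver (the real implementation, with error handling and migrations, is src/db/connection.ts):

  import { DuckDBInstance, DuckDBConnection } from "@duckdb/node-api";

  // One refcounted instance per dbPath: overlapping callers in this process
  // share it instead of opening the same file twice.
  const instances = new Map<string, { db: DuckDBInstance; refs: number }>();

  async function acquire(dbPath: string) {
    let entry = instances.get(dbPath);
    if (!entry) {
      const db = await DuckDBInstance.create(dbPath); // takes the OS-level file lock
      entry = { db, refs: 0 };
      instances.set(dbPath, entry);
    }
    entry.refs++;
    return entry;
  }

  async function withRetry<T>(fn: () => Promise<T>, attempts = 6): Promise<T> {
    for (let i = 0; ; i++) {
      try {
        return await fn();
      } catch (err) {
        if (i >= attempts - 1) throw err; // another process still holds the lock
        await Bun.sleep(100 * 2 ** i);    // exponential backoff
      }
    }
  }

  export async function withDb<T>(dbPath: string, fn: (conn: DuckDBConnection) => Promise<T>): Promise<T> {
    const entry = await withRetry(() => acquire(dbPath));
    try {
      const conn = await entry.db.connect();
      return await fn(conn);              // one logical operation, then release
    } finally {
      if (--entry.refs === 0) {
        instances.delete(dbPath);
        entry.db.closeSync();             // frees the file lock for other processes
      }
    }
  }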

Safety note. None of these processes give the agent direct access to your machine. Workers are the only things executing LLM tool calls, and the only tools they see are the ones registered in src/tools/ (all operating inside .botholomew/) plus whichever MCP servers you explicitly configured. There is no "just read the file system" escape hatch. See the virtual filesystem doc for the full argument.


The worker tick

A worker executes one tick() per cycle. In --once mode (the default), a single tick runs and then the worker exits. In --persist mode, the worker loops over ticks until it receives SIGTERM/SIGINT.

 tick() ─┐
         ├─► reset stale in_progress tasks (claimed > 3× max_tick_duration)
         ├─► processSchedules(workerId) — atomically claim each due
         │                                 schedule, ask the LLM which are
         │                                 "due", enqueue their tasks
         ├─► claimNextTask(workerId)   — highest-priority unblocked pending
         │                                 task; worker id is stamped on the
         │                                 `claimed_by` column
         ├─► createThread("worker_tick") — one thread per tick for logging
         ├─► buildSystemPrompt()       — always-context + task-relevant
         │                                 context
         ├─► runAgentLoop()            — multi-turn Anthropic tool-use loop;
         │                                 every message, thinking block, tool
         │                                 call, and tool result is logged as
         │                                 an `interaction` row
         ├─► updateTaskStatus()        — complete / failed / waiting
         └─► endThread()

If no task is claimable and no schedule is due, tick() returns false. A --persist worker then sleeps tick_interval_seconds before trying again; a --once worker exits immediately.
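
In code terms, the persist loop around it is roughly this (a sketch with assumed names; the real control flow is src/worker/tick.ts):

  declare function tick(dbPath: string, workerId: string): Promise<boolean>;
  declare const dbPath: string, workerId: string;
  declare const config: { tick_interval_seconds: number };

  let stopping = false;
  process.on("SIGTERM", () => { stopping = true; });
  process.on("SIGINT", () => { stopping = true; });

  while (!stopping) {
    const didWork = await tick(dbPath, workerId); // one full cycle, as diagrammed
    if (!didWork && !stopping) {
      await Bun.sleep(config.tick_interval_seconds * 1000); // no claimable work: sleep
    }
  }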

See src/worker/tick.ts.

Log format

Worker logs prefix every line with a local HH:MM:SS timestamp. Lifecycle phases render as [[phase-name]] in bold magenta so they're easy to scan and grep (grep '\[\[' .botholomew/logs/<id>.log). Phases emitted each tick:

  • [[tick-start]] #N
  • [[evaluating-schedules]] (only when any are enabled)
  • [[claiming-task]]
  • [[tick-end]] #N Xs didWork=true|false
  • [[sleeping]] Ns (only when there was no work in a persist worker)

Background workers (spawned without a TTY) also mirror the conversation thread to the log between [[claiming-task]] and Task ... -> complete, so a tail -f shows what the LLM is actually doing:

  • [[assistant]] <full text response> — assistant message blocks
  • [[tool-call]] <tool> <truncated JSON input> — each tool invocation
  • [[tool-result]] <tool> ok|err in Ns — tool outcome and duration
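
Put together, tail -f on a background worker's log reads roughly like this (timestamps, durations, and the tool name are illustrative):

  14:02:11 [[tick-start]] #3
  14:02:11 [[evaluating-schedules]]
  14:02:12 [[claiming-task]]
  14:02:14 [[assistant]] I'll look up the relevant context first.
  14:02:15 [[tool-call]] context_search {"query":"quarterly report"}
  14:02:15 [[tool-result]] context_search ok in 0.4s
  14:02:27 Task <id> -> complete
  14:02:27 [[tick-end]] #3 16s didWork=true
  14:02:27 [[sleeping]] 60s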

Full content (untruncated input, tool output, tokens) stays in the interactions table; the log mirrors enough to follow the trace without opening the DB. Foreground workers (worker run) keep their existing streaming UX (per-token output and inline status markers) — these phase lines are suppressed there to avoid duplication.


Registration, heartbeat, reaping

Every worker writes a row into the workers table on start (registerWorker in src/db/workers.ts) with its id (uuidv7), pid, hostname, mode, optional pinned task id, optional log_path, and status='running'. Detached workers (spawned via worker start or spawn_worker) get a per-worker log file at .botholomew/logs/<worker-id>.log — the spawn parent generates the id and opens that file before launching the child, so the path is recorded on the row from registration onward. Foreground workers (worker run) have log_path = null and write to stdout instead.

From that moment, a non-blocking setInterval in src/worker/heartbeat.ts bumps last_heartbeat_at every worker_heartbeat_interval_seconds (default 15s) — independent of the tick loop, so a worker mid-LLM-call still heartbeats reliably.
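
The whole mechanism is small enough to sketch (the column name and driver call are assumptions; the real code is src/worker/heartbeat.ts):

  declare function withDb<T>(dbPath: string, fn: (conn: { run(sql: string, params?: unknown[]): Promise<unknown> }) => Promise<T>): Promise<T>;

  function startHeartbeat(dbPath: string, workerId: string, intervalSeconds = 15) {
    const timer = setInterval(() => {
      // Each beat is its own short-lived withDb, independent of the tick
      // loop, so a long LLM call never delays it.
      withDb(dbPath, (conn) =>
        conn.run("UPDATE workers SET last_heartbeat_at = now() WHERE id = ?", [workerId]),
      ).catch(() => { /* one missed beat is fine; the next will land */ });
    }, intervalSeconds * 1000);
    return () => clearInterval(timer); // stop on clean shutdown
  }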

Persist workers also run a reaper interval (worker_reap_interval_seconds, default 30s) that does two things:

  1. Flips any worker whose heartbeat is older than worker_dead_after_seconds (default 60s) to status='dead' and releases every task and schedule claim it held. This is the failure-recovery path: anything from a terminal crash to a kill -9 ends with the work reclaimable by another worker.
  2. Deletes cleanly-stopped workers whose stopped_at is older than worker_stopped_retention_seconds (default 3600s). Dead workers are kept as forensic evidence; only the clean exits get auto-pruned so the workers table doesn't grow unbounded.
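
Both passes are single statements against the workers table; roughly (timestamp arithmetic and column names assumed from the description above):

  declare const conn: { run(sql: string, params?: unknown[]): Promise<unknown> };

  // Pass 1: silent workers become dead (their task/schedule claims are
  // released by follow-up UPDATEs keyed on claimed_by, elided here).
  await conn.run(
    `UPDATE workers SET status = 'dead'
      WHERE status = 'running'
        AND last_heartbeat_at < now() - to_seconds(?)`,
    [60],
  );

  // Pass 2: prune clean exits past the retention window; dead rows stay.
  await conn.run(
    `DELETE FROM workers
      WHERE status = 'stopped'
        AND stopped_at < now() - to_seconds(?)`,
    [3600],
  );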

The chat TUI

botholomew chat is a separate agent with its own system prompt and tool set — it does not execute long-running work itself. Instead, it:

  • answers questions about tasks, threads, and context,
  • creates tasks (via create_task) that workers will pick up,
  • spawns workers on demand (via spawn_worker) when the user wants work run right now,
  • reads worker activity (list_threads, view_thread),
  • looks up context items by path or UUID (context_info, context_search) and can refresh them in place (context_refresh),
  • invokes skills (/review, /standup, …) defined in .botholomew/skills/,
  • edits beliefs.md and goals.md via update_beliefs / update_goals.

It uses Anthropic's streaming API so tokens render in the TUI as they arrive. Every session is itself a chat_session thread with the same interaction log as a worker tick.

See src/chat/ and src/tui/.


Why two agents?

A single-agent design would force the chat loop to wait on whatever the user asked — "summarize this 200-page PDF" blocks the UI for minutes. The split:

  • Chat is fast, streaming, and interactive. It understands the world via the database but doesn't touch it much.
  • Workers are slow, autonomous, and batch-oriented. Each tick can take as long as it needs.

Both speak to the same database, so a worker's results are immediately visible to the chat agent — and the chat agent can dispatch workers without blocking.


Automation without a resident daemon

Earlier versions of Botholomew shipped an OS-level watchdog (launchd on macOS, systemd on Linux) to keep a single daemon alive. That's been replaced: users now run workers directly, and there is no installed background service. See automation.md for cron-based recipes and optional launchd/systemd examples if you want Botholomew to advance on its own.
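
A typical recipe is a single crontab line that fires a one-shot worker every few minutes (path illustrative; --once is the default mode, so each invocation runs one tick and exits):

  */5 * * * * cd ~/projects/myproject && botholomew worker run >> .botholomew/logs/cron.log 2>&1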


Thread logging

Every interaction is persisted. A thread is one tick or one chat session; an interaction is a single event within it (user message, assistant message, tool call, tool result, thinking block, status change).

This gives Botholomew total observability without a separate tracing stack — botholomew thread view <id> reads the same rows that produced the work. Threads are also the chat agent's way of reporting on what workers have been doing.

Schema lives in src/db/sql/2-logging_tables.sql; thread types are worker_tick and chat_session.
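
So botholomew thread view <id> boils down to something like this (column names assumed, and conn.all is a stand-in row reader; the real schema is the file above):

  declare function withDb<T>(dbPath: string, fn: (conn: { all(sql: string, params?: unknown[]): Promise<Record<string, unknown>[]> }) => Promise<T>): Promise<T>;
  declare const dbPath: string, threadId: string;

  const interactions = await withDb(dbPath, (conn) =>
    conn.all(
      `SELECT kind, content, created_at
         FROM interactions
        WHERE thread_id = ?
        ORDER BY created_at`,
      [threadId],
    ),
  );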


Connection model

Every process uses the same policy: open a DuckDB connection for one logical operation, then close it.

  • Workers: tick() takes a dbPath, not a held connection. Each call into src/db/* is wrapped in withDb — stale-task reset, task claim, thread create, every logInteraction, the status update. The LLM network round-trip holds no connection.
  • Heartbeat: a separate setInterval opens its own short withDb every ~15s (src/worker/heartbeat.ts). This is deliberately decoupled from the tick loop so a long LLM call doesn't stall the heartbeat.
  • Chat: ChatSession carries dbPath. Each write (user message log, tool-use log, tool-result log, title update, thread end) is its own withDb. Tool execution wraps each call in withDb so ctx.conn is scoped to that tool call only.
  • CLI invocations: withDb in src/commands/with-db.ts opens a connection for the command, applies migrations, and closes when the callback returns.
  • TUI panels: take dbPath, not conn, and wrap each refresh poll in withDb.

DuckDB's file lock is process-wide and held by the instance, not individual connections. Within one process we refcount a shared instance so overlapping withDb calls (e.g., parallel tool execution via Promise.all, or the heartbeat firing alongside a tick) don't trip DuckDB's "don't open the same DB twice" rule; when the last caller in the process releases, we close the instance and free the OS-level lock so another process can claim it.
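
Concretely, overlapping calls like these are safe within one process (withDb as sketched in the Architecture section; workerId illustrative):

  declare function withDb<T>(dbPath: string, fn: (conn: { run(sql: string, params?: unknown[]): Promise<unknown> }) => Promise<T>): Promise<T>;
  declare const dbPath: string, workerId: string;

  // Both callers share one refcounted instance; when the slower one finishes,
  // the instance closes and the OS-level file lock is released.
  await Promise.all([
    withDb(dbPath, (conn) => conn.run("SELECT count(*) FROM tasks")),
    withDb(dbPath, (conn) =>
      conn.run("UPDATE workers SET last_heartbeat_at = now() WHERE id = ?", [workerId]),
    ),
  ]);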

Vector search uses array_cosine_distance() (core DuckDB, no extension) over a linear scan of the embeddings.embedding column; the FTS extension (INSTALL fts; LOAD fts;) is loaded at connect time for BM25 keyword search. See src/db/connection.ts.
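
A sketch of the vector path (the embedding dimension and id column are assumptions; keyword search goes through the FTS extension's match_bm25 macro instead):

  declare const conn: { all(sql: string, params?: unknown[]): Promise<Record<string, unknown>[]> };
  declare const queryEmbedding: number[];

  // Linear scan over every embedding row, ordered by cosine distance.
  const nearest = await conn.all(
    `SELECT context_item_id,
            array_cosine_distance(embedding, ?::FLOAT[1024]) AS dist
       FROM embeddings
      ORDER BY dist
      LIMIT 10`,
    [queryEmbedding],
  );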


Multi-worker safety

Any number of workers can run against the same project concurrently (spawned by CLI, the chat tool, cron, or --persist). Concurrency is handled at the DB level:

  • Task claim — claimNextTask(conn, workerId) issues an atomic UPDATE tasks SET status='in_progress', claimed_by=?1 WHERE id=?2 AND status='pending' RETURNING *. If another worker claimed the row first, RETURNING comes back empty and the loop tries the next candidate.
  • Schedule claim — claimSchedule(conn, id, workerId, opts) is the same atomic UPDATE pattern, gated by both a schedule_claim_stale_seconds (default 300s) window on the existing claim and a schedule_min_interval_seconds (default 60s) window on last_run_at. Only one worker per schedule per window evaluates and enqueues tasks.
  • Stale release — If a worker crashes mid-task, its claim is released when the reaper flips its row to dead. Existing claim_at staleness also catches tasks claimed for longer than 3× the tick duration, independent of the worker's heartbeat.
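
A sketch of the claim loop built on that first UPDATE (candidate ordering simplified; conn.all is a stand-in row reader):

  type Row = Record<string, unknown>;
  declare const conn: { all(sql: string, params?: unknown[]): Promise<Row[]> };

  async function claimNextTask(workerId: string): Promise<Row | null> {
    // Unblocked pending tasks, best first (dependency filtering elided).
    const candidates = await conn.all(
      `SELECT id FROM tasks WHERE status = 'pending' ORDER BY priority DESC`,
    );
    for (const { id } of candidates) {
      const claimed = await conn.all(
        `UPDATE tasks SET status='in_progress', claimed_by=?1
          WHERE id=?2 AND status='pending' RETURNING *`,
        [workerId, id],
      );
      if (claimed.length > 0) return claimed[0]; // won the race
      // empty RETURNING: another worker claimed it first; try the next one
    }
    return null;
  }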

Nuke: bulk database resets

During development and when reusing a project, you often want to wipe part of the database without blowing away the whole .botholomew/ directory (which would also erase soul.md, beliefs.md, goals.md, config.json, and your skills). botholomew nuke covers that:

  Scope           Clears
  nuke context    context_items, embeddings
  nuke tasks      tasks
  nuke schedules  schedules
  nuke threads    threads, interactions (both worker ticks and chat sessions)
  nuke all        everything above plus daemon_state

Each subcommand requires -y/--yes to actually delete — running without the flag prints per-table row counts and exits, so it doubles as a dry run. Nothing on disk (soul, beliefs, goals, config, skills) is ever touched.

For safety, nuke refuses to run while any worker is in status='running' — stop them first with botholomew worker stop <id>. The schema itself (tables, _migrations) is always preserved.

See src/commands/nuke.ts.


DB doctor: detect and repair index corruption

Under rare circumstances — typically after a hard crash or interrupted write — DuckDB's primary-key index can fall out of sync with the row data. The symptom is that UPDATE/DELETE against the affected rows fails with Invalid Input Error: Failed to delete all rows from index. Inside Bun, that FATAL error unwinds past the NAPI boundary as a C++ exception, surfacing as panic: A C++ exception occurred from Zig__GlobalObject__onCrash. The CLI command, the worker tick loop, and anything else that touches a corrupted row die immediately.

botholomew db doctor exists to detect and recover from this:

  • db doctor (default) — For each user table, spawns a child Bun process that runs a self-update touching the PK index. Reports ok / empty / missing / corrupt per table. The child-process isolation is essential — a panic in the probe stays out of the doctor itself.
  • db doctor --repair — Refuses if any worker is actually running (PID alive). Stale status='running' rows whose PIDs are dead — the case that tends to coexist with workers-table corruption — are warned about but do not block repair, because flipping them to stopped would just trip the same corruption. Runs CHECKPOINT, EXPORT DATABASE to a timestamped directory under .botholomew/, renames the original data.duckdb (and .wal) to data.duckdb.bak-<timestamp>, opens a fresh DB at the original path, and IMPORT DATABASEs back. Indexes are rebuilt from data, which restores write integrity. After repair, botholomew worker reap cleans up the stale rows.

Repair is idempotent and non-destructive: the original DB is preserved as a .bak-<timestamp> file next to the new one. Delete the backup once you've confirmed the rebuilt DB looks right.
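
The sequence is a handful of DuckDB statements plus two renames; roughly (timestamp format illustrative; withDb as sketched earlier):

  import { existsSync, renameSync } from "node:fs";
  declare function withDb<T>(dbPath: string, fn: (conn: { run(sql: string): Promise<unknown> }) => Promise<T>): Promise<T>;

  async function repair(dbPath: string) {
    const stamp = new Date().toISOString().replace(/[:.]/g, "-");
    const exportDir = `.botholomew/export-${stamp}`;

    // 1. Flush the WAL, then dump schema + data out of the damaged file.
    await withDb(dbPath, async (conn) => {
      await conn.run("CHECKPOINT");
      await conn.run(`EXPORT DATABASE '${exportDir}'`);
    });

    // 2. Move the damaged file (and WAL, if present) aside as a backup.
    renameSync(dbPath, `${dbPath}.bak-${stamp}`);
    if (existsSync(`${dbPath}.wal`)) renameSync(`${dbPath}.wal`, `${dbPath}.wal.bak-${stamp}`);

    // 3. Re-import into a fresh file; all indexes are rebuilt from row data.
    await withDb(dbPath, (conn) => conn.run(`IMPORT DATABASE '${exportDir}'`));
  }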

See src/db/doctor.ts and src/commands/db.ts.
