2026-06-25 · 18 min read

An Engineering Program for AI Orchestration Systems

How I built DarDev Studio from scratch — falsifiable hypotheses, offline eval gates, five memory tiers, and measured outcomes (6/6 intent routing, 8/8 RAG IT) on internal engineering workflows.

Framing

This article documents an internal engineering program behind DarDev Studio — an agentic platform I architected from scratch at DarDev. It is not academic research with external baselines or peer review. It is disciplined platform engineering: state a hypothesis, build the mechanism, measure with offline eval fixtures, and only merge when gates pass.

The central question: how should engineering teams structure memory, retrieval, agents, tools, inference, and quality gates so AI-assisted development behaves like infrastructure rather than a demo API?

Methodology

Each major design choice was framed as a falsifiable hypothesis and tested on a PC before merge.

Hypothesis: mixing session chat into org RAG degrades code retrieval. Test: hard-exclude conversations/ from ingest, then run eval:rag:it on IT fixtures. Result: 8/8 pass in check:integration:dev-team after the tier split.
Hypothesis: goal-only orchestration needs regression-tested intent routing. Test: run eval:orchestrator-intents on engineering-goal fixtures. Result: 6/6 offline; validate:orchestrator-prompts passed 5/5 on the live API matrix at closeout.
Hypothesis: unbounded agent loops waste tokens without improving outcomes. Test: cap iterations and tools per round, then gate with the integration suite. Result: stable closeout with bounded loops in the default IT_ORCHESTRATOR mode.
Hypothesis: chat and embedding should not share one GPU budget. Test: run chat on :8080 and a separate BGE-M3 embed pool on :8081, then sweep the context envelope on an RTX 4050. Result: LiteRT Gemma tops out at 32K context, and llama.cpp reaches about 96K before allocation fails. The resulting routing is encoded in model-packs.json.
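The hypothesis, test, result pattern above can be sketched as a small fixture runner. This is a minimal illustration in Python, not the platform's eval code: the Fixture shape, run_gate, the toy router, and the two sample fixtures are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Fixture:
    """One offline eval case: an input and the outcome the hypothesis predicts."""
    name: str
    query: str
    expected: str


def run_gate(
    fixtures: Sequence[Fixture],
    system: Callable[[str], str],
    required_passes: int,
) -> tuple[int, bool]:
    """Run every fixture and report (passes, gate_ok)."""
    passes = sum(1 for f in fixtures if system(f.query) == f.expected)
    return passes, passes >= required_passes


# Hypothetical fixtures, invented for this sketch.
FIXTURES = [
    Fixture("code-question", "where is the agent loop defined", "org_rag"),
    Fixture("chat-recall", "what did I ask earlier", "session"),
]


def toy_router(query: str) -> str:
    """Stand-in system under test: sends recall questions to session memory."""
    return "session" if "earlier" in query else "org_rag"


passes, ok = run_gate(FIXTURES, toy_router, required_passes=len(FIXTURES))
print(f"{passes}/{len(FIXTURES)} passed, gate_ok={ok}")
```

A hypothesis is falsified in this framing when the pass count drops below the required threshold, which is what makes a regression visible before merge.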

What failed first

The first build used a single retrieve path for everything. Session turns leaked into code questions. That failure — concrete, reproducible, annoying — is what justified five memory tiers instead of a larger context window.

Early agent loops had no per-round tool cap. Runs would call web_search, graph_lookup, and patch tools in one round, then loop again with degraded context. Bounds plus eval:orchestrator-intents made regressions visible in CI instead of in chat.

Direct workspace writes from the agent produced diffs that were hard to review. propose_patch and verify_patch turned "the model edited files" into a gated step, the same way tests gate code.

Why orchestration is infrastructure

Most teams bolt an LLM onto an IDE and ship "AI features." That fails at scale because retrieval quality is inconsistent, agents loop without policy, tools run without bounds, and nobody can tell whether a regression is RAG, routing, or model drift.

DarDev Studio treats orchestration as infrastructure: separate mechanisms for session history, org corpus, code graph, web context, and planner delegation. Each tier has its own budget and retrieval path. The IT orchestrator routes natural goals to tools without manual mode recipes. Changes merge only when eval gates pass.

Hypothesis: five memory tiers

Hypothesis: mixing session chat, organizational corpus, code structure, web snippets, and task planning into one prompt creates silent quality loss. Each tier should use a different mechanism on purpose.

Tier 1 — Session: conversation history and episodic summaries; never ingested into org RAG
Tier 2 — Org RAG: hybrid BM25 + lexical + semantic over docs and code; BGE-M3 in SQLite
Tier 3 — Code graph: symbol and import index (indexes/code-graph.json), not vectors
Tier 4 — Web: ephemeral Brave/Tavily/Google search; cache only, not corpus
Tier 5 — Planner: delegate_subtask, plan mode, workflow scheduler
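Two properties of the tier split lend themselves to a short sketch: session paths are excluded from ingest, and each tier is trimmed against its own budget. The code below is an assumption-heavy illustration in Python. The budget numbers, tier keys, and function names are hypothetical, and token counting is a crude whitespace split.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Directories whose contents must never reach the org RAG index.
EXCLUDED_DIRS = {"conversations"}


def is_ingestable(relative_path: str) -> bool:
    """Reject any file that sits under an excluded directory."""
    directories = PurePosixPath(relative_path).parts[:-1]
    return not any(d in EXCLUDED_DIRS for d in directories)


@dataclass(frozen=True)
class TierBudget:
    tier: str
    max_tokens: int


# Hypothetical numbers, chosen only to show that each tier is capped separately.
BUDGETS = (
    TierBudget("session", 6),
    TierBudget("org_rag", 8),
    TierBudget("code_graph", 4),
)


def trim_to_budget(snippets: list[str], max_tokens: int) -> list[str]:
    """Keep snippets in order until the next one would exceed the budget."""
    kept, used = [], 0
    for snippet in snippets:
        cost = len(snippet.split())  # crude whitespace token count
        if used + cost > max_tokens:
            break
        kept.append(snippet)
        used += cost
    return kept


def assemble(candidates: dict[str, list[str]]) -> dict[str, list[str]]:
    """Trim every tier against its own budget, never a shared one."""
    return {
        b.tier: trim_to_budget(candidates.get(b.tier, []), b.max_tokens)
        for b in BUDGETS
    }
```

Because trimming happens per tier, a long session history cannot crowd org RAG results out of the prompt.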

Hybrid RAG without mandatory vector SaaS

Org RAG stores vectors in SQLite (chunk_embeddings.db) with sharded JSON chunk indexes. There is no Pinecone or Neo4j requirement in the default path. When shard files omit inline vectors, retrieval hydrates the embeddings from SQLite. This keeps large repos from duplicating multiple gigabytes of vectors in JSON.
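Hydration from SQLite can be sketched with the Python standard library alone. The table name, column names, and float32 blob encoding below are assumptions made for the example; the actual schema of chunk_embeddings.db is not described in this article.

```python
import sqlite3
from array import array


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a store with a hypothetical one-table schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunk_embeddings ("
        "chunk_id TEXT PRIMARY KEY, vector BLOB NOT NULL)"
    )
    return conn


def put_vector(conn: sqlite3.Connection, chunk_id: str, vector: list[float]) -> None:
    """Store a vector as packed float32 bytes."""
    blob = array("f", vector).tobytes()
    conn.execute(
        "INSERT OR REPLACE INTO chunk_embeddings VALUES (?, ?)", (chunk_id, blob)
    )


def hydrate(conn: sqlite3.Connection, chunks: list[dict]) -> list[dict]:
    """Fill in 'vector' for shard entries that omit it; leave the rest untouched."""
    missing = [c["id"] for c in chunks if "vector" not in c]
    if not missing:
        return chunks
    placeholders = ",".join("?" * len(missing))
    rows = conn.execute(
        f"SELECT chunk_id, vector FROM chunk_embeddings "
        f"WHERE chunk_id IN ({placeholders})",
        missing,
    )
    by_id = {}
    for chunk_id, blob in rows:
        vec = array("f")
        vec.frombytes(blob)
        by_id[chunk_id] = vec.tolist()
    # Chunks with no stored embedding get vector=None so callers can skip them.
    return [
        c if "vector" in c else {**c, "vector": by_id.get(c["id"])}
        for c in chunks
    ]
```

The shard entries stay small because they carry only ids and text, and the vectors are fetched in one query per batch.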

Three corpora feed Tier 2: a dogfood mirror of the monorepo, team runbooks under content/projects/, and optional OSS mirrors after ingest:repo. The live workspace you edit is not the RAG index; patch and read_workspace_file tools handle truth-on-disk.

A citation post-check and eval:rag:it gate retrieval quality before orchestrator changes merge.
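One plausible form of a citation post-check is to confirm that every citation in an answer points at a chunk that was actually retrieved. The sketch below assumes a bracketed numeric citation format such as [1], which is a guess for illustration and not the platform's documented format.

```python
import re

# Hypothetical citation format: 1-based indices in square brackets.
CITATION = re.compile(r"\[(\d+)\]")


def invalid_citations(answer: str, retrieved_count: int) -> list[int]:
    """Return cited indices that do not correspond to a retrieved chunk."""
    cited = {int(n) for n in CITATION.findall(answer)}
    return sorted(i for i in cited if not 1 <= i <= retrieved_count)


def passes_post_check(answer: str, retrieved_count: int) -> bool:
    """Require at least one citation and no citation outside the retrieved set."""
    has_citation = CITATION.search(answer) is not None
    return has_citation and not invalid_citations(answer, retrieved_count)
```

An answer that cites nothing fails this check too, on the assumption that an uncited answer to a retrieval question is as suspect as a wrongly cited one.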

Bounded agents and the IT orchestrator

The agent loop runs with explicit bounds — iterations and tools per round capped so token spend and tool chatter cannot run away. Streaming and retries stay inside those bounds.
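A bounded loop of this kind can be sketched in a few lines. The Bounds defaults, the RunLog fields, and the plan_round callback are hypothetical stand-ins; the real loop also handles streaming and retries, which this sketch leaves out.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Bounds:
    max_iterations: int = 4
    max_tools_per_round: int = 2


@dataclass
class RunLog:
    rounds: int = 0
    tool_calls: list[str] = field(default_factory=list)
    stopped_by_bound: bool = False


def run_agent(plan_round: Callable[[int], list[str]], bounds: Bounds) -> RunLog:
    """Run rounds until the planner asks for no tools or the iteration cap is hit.

    plan_round(i) returns the tool names requested in round i; an empty list
    means the agent considers the goal done.
    """
    log = RunLog()
    for i in range(bounds.max_iterations):
        requested = plan_round(i)
        if not requested:
            return log
        log.rounds += 1
        # Anything past the per-round cap is dropped, not queued.
        log.tool_calls.extend(requested[: bounds.max_tools_per_round])
    log.stopped_by_bound = True
    return log


def greedy(_round: int) -> list[str]:
    """A planner that always asks for three tools and never finishes."""
    return ["web_search", "graph_lookup", "workspace_patch"]


log = run_agent(greedy, Bounds())
print(log.rounds, len(log.tool_calls), log.stopped_by_bound)
```

With the default bounds, the greedy planner is cut off after four rounds and eight tool calls, so worst-case spend is known before the run starts.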

IT_ORCHESTRATOR is the default dev-team mode. A natural-language goal enters the intent router, prepareTutorTurn runs hybrid retrieve and prompt budgeting, and the agent loop executes with graph_lookup, workspace_patch, web_search, MCP invoke, and delegate_subtask. On failure the orchestrator reprompts and re-routes instead of hallucinating forward.
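The re-route-on-failure behavior can be illustrated with a small sketch. It models only the routing fallback, not the reprompt, and the ToolFailure exception, handler signature, and attempt limit are all assumptions made for this example.

```python
from typing import Callable, Optional


class ToolFailure(Exception):
    """Raised by a route handler that cannot complete the goal."""


Handler = Callable[[str], str]


def orchestrate(
    goal: str,
    routes: list[tuple[str, Handler]],
    max_attempts: int = 2,
) -> tuple[Optional[str], list[str]]:
    """Try routes in order; on failure, record it and move to the next route."""
    trail: list[str] = []
    for name, handler in routes[:max_attempts]:
        try:
            result = handler(goal)
        except ToolFailure as exc:
            trail.append(f"{name}: failed ({exc})")
            continue
        trail.append(f"{name}: ok")
        return result, trail
    # No route succeeded: return nothing instead of inventing an answer.
    return None, trail


def graph_route(goal: str) -> str:
    raise ToolFailure("symbol not indexed")


def rag_route(goal: str) -> str:
    return f"retrieved context for: {goal}"


result, trail = orchestrate("find the retry policy", [("graph", graph_route), ("rag", rag_route)])
print(result)
print(trail)
```

Returning None with a trail of failures gives the caller something concrete to report when every route is exhausted.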

Tools, MCP, and safe patching

Agent tools are not an afterthought. platform-core exposes a registry behind policy gates; platform-admin surfaces capabilities, inference workers, and skill packs to operators.

MCP servers via config/mcp-servers.json and platform-mcp-client
Skill packs — content/skills/**/SKILL.md with registry CRUD in admin
propose_patch and verify_patch — workspace changes through a safe pipeline
sandbox_execute and run_sandbox — isolated shell with configurable roots
Browser tools — navigate, screenshot, snapshot with host allowlists
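To make the safe patch pipeline concrete, here is a minimal sketch of a propose step and a verify step. The function names make_proposal and verify_proposal are deliberately different from the platform's propose_patch and verify_patch, whose real signatures and checks are not shown in this article. The two checks used here (path containment and a stale-base hash) are assumptions about what such a gate might verify.

```python
import hashlib
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Proposal:
    path: str
    base_sha256: str  # hash of the content the patch was written against
    new_content: str


def sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_proposal(path: str, current_content: str, new_content: str) -> Proposal:
    """Record the intended change without touching the workspace."""
    return Proposal(path, sha256(current_content), new_content)


def verify_proposal(
    proposal: Proposal,
    current_content: str,
    allowed_roots: tuple[str, ...],
) -> list[str]:
    """Return a list of problems; an empty list means the patch may be applied."""
    problems = []
    target = PurePosixPath(proposal.path)
    parts = target.parts
    if target.is_absolute() or ".." in parts:
        problems.append("path escapes workspace")
    elif not any(
        parts[: len(PurePosixPath(root).parts)] == PurePosixPath(root).parts
        for root in allowed_roots
    ):
        problems.append("path outside allowed roots")
    if sha256(current_content) != proposal.base_sha256:
        problems.append("file changed since proposal")
    return problems
```

Separating the two steps means a reviewer or a CI job can inspect the proposal before anything is written to disk.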

Inference and embedding planes

Chat and embedding run on separate ports. Model packs M1 through M5 route modes to sampling profiles with provider failover. Central chat may use Gemma via llama.cpp or LiteRT; embeddings use a dedicated BGE-M3 pool on :8081 while chat stays on :8080.
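Provider failover can be sketched as an ordered list of callables tried in turn. The ProviderError type, the provider names, and the two toy providers are hypothetical; the real model packs and their sampling profiles are not reproduced here.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Raised when a provider cannot serve the request."""


Provider = tuple[str, Callable[[str], str]]


def complete_with_failover(prompt: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Return (provider_name, completion) from the first provider that succeeds."""
    failures = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            failures.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(failures))


def flaky(prompt: str) -> str:
    """Toy provider that is always unavailable."""
    raise ProviderError("worker offline")


def healthy(prompt: str) -> str:
    """Toy provider that echoes the prompt."""
    return f"echo: {prompt}"


print(complete_with_failover("hello", [("primary", flaky), ("fallback", healthy)]))
```

Only ProviderError triggers failover in this sketch, so programming errors in a provider still surface instead of being silently retried elsewhere.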

Edge workers on Android and Windows register over WebSocket for federated inference. Admin shows central / self / federated badges and PII tier. Context envelope benchmarks on RTX 4050-class hardware verified LiteRT at 32K and llama.cpp near 96K. These numbers are evidence for routing decisions, not marketing claims.
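The envelope numbers translate into a simple routing rule: pick a backend whose measured ceiling covers the requested context. The sketch below reads "K" as 1024 tokens and applies a 10% safety margin, both of which are assumptions, as is the preference order. The real routing lives in model-packs.json and is not reproduced here.

```python
from typing import Optional

# Measured ceilings from the envelope sweep, read here as K = 1024 tokens.
ENVELOPES = (
    ("litert", 32 * 1024),
    ("llama.cpp", 96 * 1024),
)

# Hypothetical headroom, since ~96K was where allocation started to fail.
SAFETY_MARGIN = 0.9


def pick_backend(required_tokens: int) -> Optional[str]:
    """Return the first backend whose margin-adjusted ceiling fits the request."""
    for name, ceiling in ENVELOPES:
        if required_tokens <= int(ceiling * SAFETY_MARGIN):
            return name
    return None  # nothing fits; the caller must shrink the prompt
```

Returning None for oversized requests forces the prompt-budgeting step to trim context instead of letting the backend fail at allocation time.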

Monorepo architecture

DarDev Studio is a full product monorepo, not a single package. The Phase 1 GitHub export publishes platform-core, platform-api, platform-eval, and architecture docs; the full tree adds platform-inference, platform-cli, platform-mcp-*, platform-tui, platform-web, platform-admin, and edge workers.

platform-core — brain: RAG, agent loop, plan mode, policy, tools
platform-api — Fastify, SSE, OpenAPI, WebSocket hub
platform-eval — offline eval runners
platform-web + platform-admin — IT Studio and operator console
platform-tui — Ink terminal orchestrator
android-worker, windows-desktop, windows-worker — edge inference

Eval culture: demos vs platforms

Agentic systems without evals are demos. DarDev Studio ships offline gates that must pass before orchestrator and RAG changes merge into the dev-team profile.

eval:orchestrator-intents — intent routing accuracy
eval:rag:it — retrieval quality with citation verify
eval:graph — code graph lookup
eval:it-modes — edition and mode matrix
validate:orchestrator-prompts — live API prompt consistency
benchmark:platform:gate — inference regression
check:integration:dev-team — integration gate before merge
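The merge rule implied by these gates is that every required gate must have run and fully passed. The sketch below aggregates per-gate pass counts into a single decision. The result format and the merge_decision function are assumptions for illustration, and the sample counts are the ones reported in this article.

```python
from typing import Sequence


def merge_decision(
    results: dict[str, tuple[int, int]],
    required: Sequence[str],
) -> tuple[bool, list[str]]:
    """Return (may_merge, blockers) given {gate: (passed, total)}."""
    blockers = []
    for gate in required:
        if gate not in results:
            blockers.append(f"{gate}: not run")
            continue
        passed, total = results[gate]
        # A gate with zero fixtures proves nothing, so it blocks as well.
        if total == 0 or passed < total:
            blockers.append(f"{gate}: {passed}/{total}")
    return not blockers, blockers


REQUIRED = ("eval:orchestrator-intents", "eval:rag:it", "eval:graph")

results = {
    "eval:orchestrator-intents": (6, 6),
    "eval:rag:it": (8, 8),
    "eval:graph": (5, 5),
}
print(merge_decision(results, REQUIRED))
```

Treating a missing gate as a blocker matters as much as treating a failing one: a gate that silently did not run should never look like a pass.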

Measured outcomes

Process milestones (sprints S01–S13, API phases A–G) matter internally. The outcomes below are externally verifiable and come from eval runners and benchmarks run on PC:

eval:orchestrator-intents — 6/6 on offline engineering-goal fixtures
eval:rag:it — 8/8 in check:integration:dev-team (citation verify)
eval:graph — 5/5; plan mode — 6/6 in the same integration gate
validate:orchestrator-prompts — 5/5 live API prompt matrix
RTX 4050 envelope — LiteRT 32K max; llama.cpp ~96K (Gemma E2B class models)
~17k BGE-M3 chunks indexed on dogfood monorepo mirror (RAG v6 ingest)

Open problems and roadmap

Phase H LiteRT 0.12 and Android on-device E2E (C1) remain in progress. Full corpus RAG re-ingest is ops-dependent. I do not claim external user scale — the platform is validated on DarDev engineering monorepos.

Apache-2.0 OSS Phase 1 is published on github.com/Theemiss/dardev-studio: platform-core, platform-api, platform-eval, and architecture docs. Surfaces and workers follow in later phases after hygiene review.