2026-06-25 · 18 min read
An Engineering Program for AI Orchestration Systems
How I built DarDev Studio from scratch — falsifiable hypotheses, offline eval gates, five memory tiers, and measured outcomes (6/6 intent routing, 8/8 RAG IT) on internal engineering workflows.
Framing
This article documents an internal engineering program behind DarDev Studio — an agentic platform I architected from scratch at DarDev. It is not academic research with external baselines or peer review. It is disciplined platform engineering: state a hypothesis, build the mechanism, measure with offline eval fixtures, and only merge when gates pass.
The central question: how should engineering teams structure memory, retrieval, agents, tools, inference, and quality gates so AI-assisted development behaves like infrastructure rather than a demo API?
Methodology
Each major design choice was framed as a falsifiable hypothesis and tested on a local PC before merge.
- Hypothesis: mixing session chat into org RAG degrades code retrieval. Test: hard-exclude conversations/ from ingest; run eval:rag:it on IT fixtures. Result: 8/8 pass in check:integration:dev-team after the tier split.
- Hypothesis: goal-only orchestration needs regression-tested intent routing. Test: eval:orchestrator-intents on engineering-goal fixtures. Result: 6/6 offline; validate:orchestrator-prompts 5/5 on the live API matrix at closeout.
- Hypothesis: unbounded agent loops waste tokens without improving outcomes. Test: cap iterations and tools per round; gate with the integration suite. Result: stable closeout with bounded loops in the default IT_ORCHESTRATOR mode.
- Hypothesis: chat and embedding should not share one GPU budget. Test: separate :8080 chat and :8081 BGE-M3 embed pool; context envelope sweep on an RTX 4050. Result: LiteRT Gemma tops out at a 32K context; llama.cpp reaches ~96K before allocation fails. The routing is encoded in model-packs.json.
What failed first
The first build used a single retrieve path for everything. Session turns leaked into code questions. That failure — concrete, reproducible, annoying — is what justified five memory tiers instead of a larger context window.
Early agent loops had no per-round tool cap. Runs would call web_search, graph_lookup, and patch tools in one round, then loop again with degraded context. Bounds plus eval:orchestrator-intents made regressions visible in CI instead of in chat.
Direct workspace writes from the agent produced diffs that were hard to review. propose_patch and verify_patch turned "the model edited files" into a gate the same way tests gate code.
Why orchestration is infrastructure
Most teams bolt an LLM onto an IDE and ship "AI features." That fails at scale because retrieval quality is inconsistent, agents loop without policy, tools run without bounds, and nobody can tell whether a regression is RAG, routing, or model drift.
DarDev Studio treats orchestration as infrastructure: separate mechanisms for session history, org corpus, code graph, web context, and planner delegation. Each tier has its own budget and retrieval path. The IT orchestrator routes natural-language goals to tools without manual mode recipes. Changes merge only when eval gates pass.
Hypothesis: five memory tiers
Hypothesis: mixing session chat, organizational corpus, code structure, web snippets, and task planning into one prompt creates silent quality loss. Each tier should use a different mechanism on purpose.
- Tier 1 — Session: conversation history and episodic summaries; never ingested into org RAG
- Tier 2 — Org RAG: hybrid BM25 + lexical + semantic over docs and code; BGE-M3 in SQLite
- Tier 3 — Code graph: symbol and import index (indexes/code-graph.json), not vectors
- Tier 4 — Web: ephemeral Brave/Tavily/Google search; cache only, not corpus
- Tier 5 — Planner: delegate_subtask, plan mode, workflow scheduler
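To make the tier split concrete, here is a minimal sketch of the hard exclude that keeps session data out of Tier 2 ingest. The Tier enum mirrors the list above; the EXCLUDED_DIRS rule and the helper names are assumptions for illustration, not the platform's actual ingest code.

```python
from enum import Enum
from pathlib import PurePosixPath


class Tier(Enum):
    SESSION = 1
    ORG_RAG = 2
    CODE_GRAPH = 3
    WEB = 4
    PLANNER = 5


# Assumption: any directory named "conversations" holds Tier 1 session data.
EXCLUDED_DIRS = {"conversations"}


def is_ingestable(path: str) -> bool:
    """Return True if the file may enter the org RAG corpus (Tier 2)."""
    directories = PurePosixPath(path).parts[:-1]  # ignore the file name itself
    return not any(part in EXCLUDED_DIRS for part in directories)


def filter_ingest(paths: list[str]) -> list[str]:
    """Drop every path that lives under an excluded directory."""
    return [p for p in paths if is_ingestable(p)]
```

The exclusion is a rule on the path rather than a ranking penalty, so session turns never reach the index at all.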
Hybrid RAG without mandatory vector SaaS
Org RAG stores vectors in SQLite (chunk_embeddings.db) with sharded JSON chunk indexes. There is no Pinecone or Neo4j requirement in the default path. Retrieval hydrates embeddings from SQLite when shard files omit inline vectors — a design choice for large repos without multi-GB JSON duplication.
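The hydration step can be sketched with the standard sqlite3 module. The table name, column names, and float32 blob encoding below are assumptions; the article only states that vectors live in chunk_embeddings.db and are loaded when a shard entry omits them.

```python
import sqlite3
from array import array


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a vector store; schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunk_embeddings ("
        "chunk_id TEXT PRIMARY KEY, vector BLOB NOT NULL)"
    )
    return conn


def put_vector(conn: sqlite3.Connection, chunk_id: str, vector: list[float]) -> None:
    """Store a vector as a packed float32 blob."""
    blob = array("f", vector).tobytes()
    conn.execute(
        "INSERT OR REPLACE INTO chunk_embeddings VALUES (?, ?)", (chunk_id, blob)
    )


def hydrate(conn: sqlite3.Connection, chunks: list[dict]) -> list[dict]:
    """Fill in 'vector' for chunks whose shard entry omitted it."""
    missing = [c["id"] for c in chunks if "vector" not in c]
    if not missing:
        return chunks
    placeholders = ",".join("?" * len(missing))
    rows = conn.execute(
        "SELECT chunk_id, vector FROM chunk_embeddings "
        f"WHERE chunk_id IN ({placeholders})",
        missing,
    ).fetchall()
    found = {}
    for chunk_id, blob in rows:
        vec = array("f")
        vec.frombytes(blob)
        found[chunk_id] = vec.tolist()
    # Chunks with no stored vector get None so callers can skip or re-embed them.
    return [c if "vector" in c else {**c, "vector": found.get(c["id"])} for c in chunks]
```

Keeping vectors out of the shard JSON means each embedding is stored once, as a compact blob, and is only read for the chunks a query actually touches.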
Three corpora feed Tier 2: a dogfood mirror of the monorepo, team runbooks under content/projects/, and optional OSS mirrors after ingest:repo. The live workspace you edit is not the RAG index; patch and read_workspace_file tools handle truth-on-disk.
Citation post-check and eval:rag:it gate retrieval quality before orchestrator changes merge.
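One plausible shape for a citation post-check: every citation in the answer must point at a chunk that was actually retrieved for that turn. The [cite:...] marker format and the function name are hypothetical; the article does not describe how citations are encoded.

```python
import re

# Assumption: citations appear inline as [cite:<chunk-id>].
CITATION = re.compile(r"\[cite:([\w./#-]+)\]")


def check_citations(answer: str, retrieved_ids: set[str]) -> dict:
    """Report cited chunk ids and flag any that were not retrieved."""
    cited = CITATION.findall(answer)
    unknown = [c for c in cited if c not in retrieved_ids]
    # An answer passes only if it cites something and every citation resolves.
    return {"cited": cited, "unknown": unknown, "ok": bool(cited) and not unknown}
```

A check like this catches invented sources cheaply, because it needs only the retrieved id set and no second model call.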
Bounded agents and the IT orchestrator
The agent loop runs with explicit bounds — iterations and tools per round capped so token spend and tool chatter cannot run away. Streaming and retries stay inside those bounds.
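A minimal sketch of such a bounded loop follows. The default limits and the plan_round / run_tool callables are illustrative assumptions, not the platform's real values or signatures.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LoopBounds:
    max_iterations: int = 4       # hypothetical default
    max_tools_per_round: int = 2  # hypothetical default


def run_bounded(
    plan_round: Callable[[list], list],
    run_tool: Callable[[str], object],
    bounds: LoopBounds = LoopBounds(),
) -> dict:
    """Run rounds until the planner returns no tool calls or bounds are hit."""
    history: list = []
    for iteration in range(bounds.max_iterations):
        calls = plan_round(history)
        if not calls:
            return {"status": "done", "iterations": iteration, "history": history}
        # Extra tool calls beyond the per-round cap are dropped, not queued.
        for call in calls[: bounds.max_tools_per_round]:
            history.append((call, run_tool(call)))
    return {
        "status": "bounds_exhausted",
        "iterations": bounds.max_iterations,
        "history": history,
    }
```

With these bounds the worst case is max_iterations × max_tools_per_round tool calls, which makes token spend predictable enough to gate in CI.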
IT_ORCHESTRATOR is the default dev-team mode. A natural-language goal enters the intent router, prepareTutorTurn runs hybrid retrieval and prompt budgeting, and the agent loop executes with graph_lookup, workspace_patch, web_search, MCP invoke, and delegate_subtask. On failure the orchestrator reprompts and re-routes instead of hallucinating forward.
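The route-then-re-route behavior can be illustrated with a toy keyword router. The rules, the fallback choice, and the function names are all assumptions; the article does not describe how the real intent router classifies goals.

```python
from typing import Callable, Optional

# Hypothetical keyword rules, checked in order; tool names come from the article.
RULES = [
    ("graph_lookup", ("where is", "defined", "imports", "symbol")),
    ("web_search", ("latest", "release notes", "upstream")),
    ("workspace_patch", ("fix", "rename", "refactor")),
]
FALLBACK = "delegate_subtask"


def route(goal: str, exclude: frozenset = frozenset()) -> str:
    """Pick the first matching tool that has not already been tried."""
    text = goal.lower()
    for tool, keywords in RULES:
        if tool not in exclude and any(k in text for k in keywords):
            return tool
    return FALLBACK


def route_with_retry(
    goal: str, execute: Callable[[str, str], bool], max_attempts: int = 3
) -> Optional[str]:
    """Re-route to the next candidate when a tool fails; give up after max_attempts."""
    tried: set = set()
    for _ in range(max_attempts):
        tool = route(goal, frozenset(tried))
        if tool in tried:  # only the fallback is left and it already failed
            return None
        if execute(tool, goal):
            return tool
        tried.add(tool)
    return None
```

Because route is a pure function of the goal text, a fixture file of (goal, expected tool) pairs is enough to regression-test it offline.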
Tools, MCP, and safe patching
Agent tools are not an afterthought. platform-core exposes a registry behind policy gates; platform-admin surfaces capabilities, inference workers, and skill packs to operators.
- MCP servers via config/mcp-servers.json and platform-mcp-client
- Skill packs — content/skills/**/SKILL.md with registry CRUD in admin
- propose_patch and verify_patch — workspace changes through a safe pipeline
- sandbox_execute and run_sandbox — isolated shell with configurable roots
- Browser tools — navigate, screenshot, snapshot with host allowlists
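The propose/verify split can be sketched with difflib and a content hash. The functions below borrow the tool names for readability, but they are stand-ins: the Proposal shape, the allowed_root check, and the hash comparison are assumptions about what such a gate might verify.

```python
import difflib
import hashlib
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Proposal:
    path: str
    base_sha256: str  # hash of the file content the diff was computed against
    new_text: str
    diff: str


def _sha(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def propose_patch(path: str, current_text: str, new_text: str) -> Proposal:
    """Build a reviewable unified diff without touching the workspace."""
    diff = "".join(
        difflib.unified_diff(
            current_text.splitlines(keepends=True),
            new_text.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        )
    )
    return Proposal(path, _sha(current_text), new_text, diff)


def verify_patch(proposal: Proposal, current_text: str, allowed_root: str = "src") -> list[str]:
    """Return a list of problems; an empty list means the patch may be applied."""
    problems = []
    parts = PurePosixPath(proposal.path).parts
    if not parts or parts[0] != allowed_root or ".." in parts:
        problems.append("path outside allowed root")
    if _sha(current_text) != proposal.base_sha256:
        problems.append("file changed since proposal")
    if not proposal.diff:
        problems.append("empty diff")
    return problems
```

Nothing is written to disk until verification returns no problems, so the model's edit becomes an artifact that can be reviewed and rejected like any other diff.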
Inference and embedding planes
Chat and embedding run on separate ports. Model packs M1 through M5 route modes to sampling profiles with provider failover. Central chat may use Gemma via llama.cpp or LiteRT; embeddings use a dedicated BGE-M3 pool on :8081 while chat stays on :8080.
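As an illustration of mode-to-profile routing with failover, consider the sketch below. The pack schema, sampling values, and the health set are hypothetical, and the real model-packs.json format is not shown in this article. The context limits use the measured 32K and ~96K envelopes, taken here as 32 × 1024 and 96 × 1024 tokens.

```python
from typing import Optional

# Hypothetical pack definition; only the pack id and context ceilings echo the article.
PACKS = {
    "M1": {
        "modes": {"chat", "plan"},
        "sampling": {"temperature": 0.2, "top_p": 0.9},
        "providers": [  # tried in order; later entries are failover targets
            {"name": "litert", "max_context": 32 * 1024},
            {"name": "llama.cpp", "max_context": 96 * 1024},
        ],
    },
}


def resolve(mode: str, required_context: int, healthy: set, packs: dict = PACKS) -> Optional[dict]:
    """Pick the first healthy provider whose context envelope fits the request."""
    for pack_id, pack in packs.items():
        if mode not in pack["modes"]:
            continue
        for provider in pack["providers"]:
            fits = required_context <= provider["max_context"]
            if provider["name"] in healthy and fits:
                return {
                    "pack": pack_id,
                    "provider": provider["name"],
                    "sampling": pack["sampling"],
                }
    return None  # no provider can serve this request
```

Encoding the measured envelope next to each provider lets a request that exceeds 32K skip LiteRT up front instead of failing at allocation time.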
Edge workers on Android and Windows register over WebSocket for federated inference. Admin shows central / self / federated badges and PII tier. Context envelope benchmarks on RTX 4050-class hardware measured LiteRT at a 32K context and llama.cpp near 96K; these figures are evidence for routing decisions, not marketing claims.
Monorepo architecture
DarDev Studio is a full product monorepo, not a single package. The Phase 1 GitHub export publishes platform-core, platform-api, platform-eval, and architecture docs; the full tree adds platform-inference, platform-cli, platform-mcp-*, platform-tui, platform-web, platform-admin, and edge workers.
- platform-core — brain: RAG, agent loop, plan mode, policy, tools
- platform-api — Fastify, SSE, OpenAPI, WebSocket hub
- platform-eval — offline eval runners
- platform-web + platform-admin — IT Studio and operator console
- platform-tui — Ink terminal orchestrator
- android-worker, windows-desktop, windows-worker — edge inference
Eval culture: demos vs platforms
Agentic systems without evals are demos. DarDev Studio ships offline gates that must pass before orchestrator and RAG changes merge into the dev-team profile.
- eval:orchestrator-intents — intent routing accuracy
- eval:rag:it — retrieval quality with citation verify
- eval:graph — code graph lookup
- eval:it-modes — edition and mode matrix
- validate:orchestrator-prompts — live API prompt consistency
- benchmark:platform:gate — inference regression
- check:integration:dev-team — integration gate before merge
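All of these gates share one shape: run fixed fixtures through the component, count passes, and block the merge unless every fixture passes. A minimal sketch follows, with a hypothetical run_gate helper and fixture format that are not taken from platform-eval.

```python
from typing import Callable, Iterable, Tuple


def run_gate(
    name: str,
    fixtures: Iterable[Tuple[object, object]],
    predict: Callable[[object], object],
) -> Tuple[bool, str]:
    """Evaluate (input, expected) fixtures and summarize as 'name: passed/total'."""
    cases = list(fixtures)
    passed = sum(1 for given, expected in cases if predict(given) == expected)
    total = len(cases)
    # An empty fixture set fails on purpose: a gate with no cases proves nothing.
    ok = total > 0 and passed == total
    return ok, f"{name}: {passed}/{total}"
```

In CI, a wrapper would exit non-zero whenever ok is False, so a result such as 5/6 stops the merge rather than showing up as a warning.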
Measured outcomes
Process milestones (sprints S01–S13, API phases A–G) matter internally. The externally verifiable outcomes below come from eval runners and benchmarks run on a local PC:
- eval:orchestrator-intents — 6/6 on offline engineering-goal fixtures
- eval:rag:it — 8/8 in check:integration:dev-team (citation verify)
- eval:graph — 5/5; plan mode — 6/6 in the same integration gate
- validate:orchestrator-prompts — 5/5 live API prompt matrix
- RTX 4050 envelope — LiteRT 32K max; llama.cpp ~96K (Gemma E2B class models)
- ~17k BGE-M3 chunks indexed on dogfood monorepo mirror (RAG v6 ingest)
Open problems and roadmap
Phase H LiteRT 0.12 and Android on-device E2E (C1) remain in progress. Full corpus RAG re-ingest is ops-dependent. I do not claim external user scale — the platform is validated on DarDev engineering monorepos.
Apache-2.0 OSS Phase 1 is published on github.com/Theemiss/dardev-studio: platform-core, platform-api, platform-eval, and architecture docs. Surfaces and workers follow in later phases after hygiene review.