Ahmed Belhaj

Platform architecture

DarDev Studio

Open-source AI-native engineering platform I architected from scratch at DarDev: hybrid RAG, five memory tiers, bounded agent loops, IT orchestrator, MCP and skill packs, safe patch pipeline, multi-surface clients, and offline eval gates. Validated on DarDev engineering workflows.

TypeScript monorepo · @dardev/platform-core · Hybrid RAG · Fastify + OpenAPI · React IT Studio · MCP · LiteRT / llama.cpp · Apache-2.0

Overview

DarDev Studio is an agentic engineering platform I designed and built from scratch as architecture owner at DarDev. It is not a chat wrapper on a demo API. The product spans a TypeScript monorepo with a core brain, HTTP API, web and admin clients, terminal orchestrator, edge inference workers, eval infrastructure, and a full documentation and sprint program behind it.

The active product SKU is dev-team (PRODUCT_PROFILE=dev-team): IT modes, RAG over docs and code, orchestrated tool execution, sandbox and patch pipelines, and multi-backend inference. I validate it on DarDev internal engineering work before features ship.

DarDev Studio releases under Apache-2.0. A scrubbed Phase 1 slice (platform-core, platform-api, platform-eval, architecture docs) is on GitHub; the full monorepo including web, admin, workers, and ops tooling is developed locally and publishes in phases.

Vision

Most teams bolt an LLM onto an IDE and call it AI-assisted development. DarDev Studio treats AI orchestration as platform infrastructure: explicit memory tiers, hybrid retrieval, tool and policy boundaries, offline eval gates, and inference routing that engineering teams can own, extend, and run locally.

The north star is an engineering brain that knows your organization context (session history, ingested docs and code, symbol graph, ephemeral web), routes natural goals to the right tools and specialist modes without manual recipe picking, and only ships changes when eval suites pass. It is open source so that teams can inspect the orchestrator, eval runners, and architecture directly.

DarDev Studio is built for teams that already ship production software and want agentic workflows with the same rigor they apply to CI/CD. Success means shipped, verified software (integration gates green, deploy smoke recorded), not chat quality alone.

The platform is local-first by design: vectors in SQLite, structure in JSON indexes, no mandatory Pinecone or Neo4j in the default path. Chat inference and embedding inference run on separate planes so retrieval quality does not compete with generation for GPU budget.

Agentic systems without evals are demos, not platforms. DarDev Studio is built for teams that would not merge without tests.

Product scope

DarDev Studio is the dev-team SKU: an agentic engineering platform with IT, dev, and architect editions. Legacy education and LMS code paths exist in the monorepo but are frozen under the dev-team profile. The workspace ships production software through a planned sprint program, S01 through S15 (S01–S13 shipped, S14–S15 planned), not through ad-hoc feature branches.

  • IT_ORCHESTRATOR mode: goal-only routing as default dev-team entry point
  • Editions: it, dev, architect (active under dev-team profile)
  • Dogfood corpora: monorepo mirror, team runbooks, optional OSS repo ingest
  • Workflows, sandbox execution, dev worktree, and cron scheduler in platform.db
  • Benchmark platform with admin UI, CLI matrix, and regression gates
  • OpenAPI surface validated agent-ready (check:openapi, check:client-api)

The challenge

Engineering teams adopting LLMs hit the same wall: retrieval quality is inconsistent, context windows get stuffed, agents loop without quality gates, tools run without policy, and AI features ship as thin API wrappers instead of integrated workflows.

DarDev Studio addresses this as a full product: memory tiers with separate mechanisms, hybrid RAG with citation verify, bounded agent loops, MCP and skill packs behind policy gates, safe patch propose/verify flow, and offline eval suites before orchestrator changes merge.

Platform architecture

The monorepo uses npm workspaces under @dardev/platform-* packages and apps/*. Dependency direction flows from platform-api into platform-core and platform-inference; clients talk to the API via fetch and OpenAPI, not direct brain imports.

  • platform-core — brain: RAG, agent loop, plan mode, prompts, policy, tools
  • platform-api — Fastify HTTP API, SSE streaming, WebSocket hub, inference init
  • platform-eval — offline eval runners (IT modes, RAG, graph, orchestrator)
  • platform-inference — shared worker protocol types for edge devices
  • platform-mcp-client + platform-mcp-server — MCP integration bridge
  • platform-cli — ingest helpers and dry-run requests
  • platform-tui — terminal orchestrator (Ink)
  • platform-web — IT Studio chat UI (port 5173)
  • platform-admin — operator console: agent capabilities, benchmarks, workflows (5174)
  • android-worker, windows-desktop, windows-worker, desktop-worker — edge inference

Five memory tiers

Context is split by mechanism, not one giant prompt. Session history and episodic summaries cover the current thread (conversations/, never ingested into RAG). Org RAG runs hybrid BM25, lexical, and semantic retrieval with BGE-M3 embeddings in SQLite and sharded chunk indexes. A code graph tier resolves symbols and imports from indexes/code-graph.json. Web search adds ephemeral Brave/Tavily/Google context. A planner tier handles delegate_subtask, plan mode, and workflows.
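
One way to picture the Tier 2 hybrid retrieval is weighted score fusion. The sketch below is an illustration under stated assumptions, not the platform-core implementation: the `ChunkScores` and `RagWeights` types, the `fuseScores` function, and the sample weights are all hypothetical, and every signal is assumed to be normalized to [0, 1] before fusion.

```typescript
// Hypothetical sketch of hybrid score fusion. Names and weights are illustrative
// and are not taken from platform-core or config/rag_weights.json.
interface ChunkScores {
  chunkId: string;
  bm25: number;     // assumed normalized to [0, 1]
  lexical: number;  // assumed normalized to [0, 1]
  semantic: number; // assumed embedding similarity mapped to [0, 1]
}

interface RagWeights {
  bm25: number;
  lexical: number;
  semantic: number;
}

interface RankedChunk {
  chunkId: string;
  score: number;
}

// Combine the three signals into one weighted average and keep the top-k chunks.
function fuseScores(candidates: ChunkScores[], weights: RagWeights, topK: number): RankedChunk[] {
  const total = weights.bm25 + weights.lexical + weights.semantic;
  if (total <= 0) throw new Error("weights must sum to a positive number");
  return candidates
    .map((c) => ({
      chunkId: c.chunkId,
      score:
        (weights.bm25 * c.bm25 + weights.lexical * c.lexical + weights.semantic * c.semantic) / total,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// Usage: the semantic signal is weighted twice as heavily as the other two.
const ranked = fuseScores(
  [
    { chunkId: "docs/runbook.md#3", bm25: 0.9, lexical: 0.2, semantic: 0.4 },
    { chunkId: "src/router.ts#1", bm25: 0.3, lexical: 0.8, semantic: 0.9 },
  ],
  { bm25: 1, lexical: 1, semantic: 2 },
  1,
);
console.log(ranked.map((r) => r.chunkId));
```

Keeping the weights in a plain object makes them easy to load from a tuning file and to sweep offline, which is the kind of workflow a weight-tuning script would need.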

Three on-disk corpora feed Tier 2: a dogfood mirror of the monorepo, team docs and runbooks under content/projects/, and optional OSS mirrors after ingest:repo. The live workspace you edit is not the RAG index; patch and read_workspace_file tools handle truth-on-disk.

IT orchestrator

IT_ORCHESTRATOR is the default dev-team mode. A natural-language goal enters the intent router (orchestrator-router.ts), which selects memory tiers and tools. prepareTutorTurn runs hybrid RAG retrieval, applies prompt budget rules, and registers patch context. The agent loop then executes with a bounded number of iterations and a cap on tool calls per round, streaming via SSE with retries.

On failure the orchestrator reprompts and re-routes rather than silently continuing. delegate_subtask supports parallel specialist work with depth limits. Edge vs central split: RAG, graph, web, and patch run on the API host; phone or desktop inference routes through registered workers.
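
The bounded loop described above can be sketched as follows. This is a simplified illustration, not the orchestrator's code: `runBoundedLoop`, `ModelReply`, and `LoopLimits` are hypothetical names, and the model and tool calls are synchronous here for brevity where real calls would be async and streamed.

```typescript
// Hypothetical sketch of a bounded agent loop. All names are illustrative.
type ModelReply =
  | { kind: "final"; text: string }
  | { kind: "tool_calls"; calls: string[] };

interface LoopLimits {
  maxIterations: number;    // hard cap on model rounds
  maxToolsPerRound: number; // cap on tool calls executed in one round
}

interface LoopResult {
  status: "done" | "iteration_limit";
  text: string;
  iterations: number;
  toolsRun: number;
}

function runBoundedLoop(
  callModel: (transcript: string[]) => ModelReply,
  runTool: (name: string) => string,
  limits: LoopLimits,
): LoopResult {
  const transcript: string[] = [];
  let toolsRun = 0;
  for (let i = 1; i <= limits.maxIterations; i++) {
    const reply = callModel(transcript);
    if (reply.kind === "final") {
      return { status: "done", text: reply.text, iterations: i, toolsRun };
    }
    // Execute only up to the per-round cap; extra requested calls are dropped.
    for (const name of reply.calls.slice(0, limits.maxToolsPerRound)) {
      transcript.push(`${name}: ${runTool(name)}`);
      toolsRun++;
    }
  }
  // Stop explicitly instead of looping forever; the caller can reprompt or re-route.
  return { status: "iteration_limit", text: "", iterations: limits.maxIterations, toolsRun };
}
```

Returning an explicit `iteration_limit` status, rather than throwing or returning a partial answer, gives the caller a clear signal to reprompt or re-route instead of silently continuing.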

Capabilities — core brain

platform-core implements the full turn pipeline: corpus routing, hybrid retrieve, context assembly, policy gates, prompt budgeting, agent loop, and tool execution.

  • Hybrid RAG — BM25 + lexical + semantic; BGE-M3 1024-dim embeddings; SQLite chunk store; citation post-check
  • Agent loop — multi-iteration LLM execution with streaming, bounded retries, tool truncation
  • Plan mode — structured planning, critique runs, plan export to files
  • Policy engine — exam mode, tool allow/deny lists, course-scoped capabilities
  • Code graph — symbol and import index; graph_lookup without vector search
  • Turn preparation — episodic memory blocks, prompt-budget.ts, prompt-patch-registry
  • Corpus router — dogfood mirror, team docs, OSS repos, workspace truth separation
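
As an illustration of the citation post-check in the list above, the sketch below verifies that every source an answer cites was actually retrieved. The `[S1]`-style marker format, the `checkCitations` name, and the rule that an answer with no citations fails are assumptions made for this example, not documented platform behavior.

```typescript
// Hypothetical citation post-check. Assumes answers cite sources as [S1], [S2], ...
// and that retrieved chunks carry matching ids; the marker format is an assumption.
interface CitationCheck {
  cited: string[];   // distinct ids cited in the answer, in order of appearance
  unknown: string[]; // cited ids that were never retrieved
  ok: boolean;       // true when at least one citation exists and none are unknown
}

function checkCitations(answer: string, retrievedIds: string[]): CitationCheck {
  const known = new Set(retrievedIds);
  const cited: string[] = [];
  const pattern = /\[(S\d+)\]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    const id = match[1];
    if (id !== undefined && cited.indexOf(id) === -1) cited.push(id);
  }
  const unknown = cited.filter((id) => !known.has(id));
  return { cited, unknown, ok: cited.length > 0 && unknown.length === 0 };
}

// Usage: S3 was never retrieved, so the answer is flagged.
console.log(checkCitations("Route via the intent router [S1], then gate [S3].", ["S1", "S2"]));
```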

Capabilities — agent tools and integrations

The tool registry exposes IT engineering tools behind policy and admin visibility. MCP servers and skill packs extend the agent without hard-coding every integration.

  • graph_lookup, search_code, search_docs, retrieve_sources, get_file_content
  • web_search — Brave, Tavily, or Google Programmable Search
  • propose_patch, verify_patch, PATCH /dev/patch/* safe patch pipeline
  • sandbox_execute, run_sandbox, sandbox read/write — isolated shell execution
  • delegate_subtask — parallel specialists with IT_DELEGATE_MAX_DEPTH
  • MCP invoke via config/mcp-servers.json and platform-mcp-client
  • Skill packs — content/skills/**/SKILL.md with config/skill-packs.json registry
  • Browser tools — navigate, screenshot, snapshot (host allowlist gated)
  • document_generator workflow, validate_document, request_human_review
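
A policy gate in front of the tool registry could look roughly like the sketch below. The `ToolPolicy` shape, the `checkToolPolicy` function, and the "deny wins, empty allow list means unrestricted" semantics are assumptions for illustration; the tool names come from the list above.

```typescript
// Hypothetical tool policy gate. The policy shape and semantics are assumptions.
interface ToolPolicy {
  allow: string[]; // empty list means "allow anything not denied"
  deny: string[];  // deny always wins over allow
}

type PolicyDecision = { allowed: true } | { allowed: false; reason: string };

function checkToolPolicy(tool: string, policy: ToolPolicy): PolicyDecision {
  if (policy.deny.indexOf(tool) !== -1) {
    return { allowed: false, reason: `${tool} is on the deny list` };
  }
  if (policy.allow.length > 0 && policy.allow.indexOf(tool) === -1) {
    return { allowed: false, reason: `${tool} is not on the allow list` };
  }
  return { allowed: true };
}

// Usage: a read-only profile that permits lookups but blocks shell execution.
const readOnly: ToolPolicy = { allow: ["graph_lookup", "search_code", "search_docs"], deny: ["sandbox_execute"] };
console.log(checkToolPolicy("search_code", readOnly));
console.log(checkToolPolicy("sandbox_execute", readOnly));
```

Returning a reason alongside the decision lets the agent loop surface why a call was refused, which is more useful in tool traces than a bare boolean.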

Capabilities — surfaces and clients

DarDev Studio is a multi-surface product, not a headless API. Each client consumes the same OpenAPI contract.

  • IT Studio web (platform-web) — chat UI, plan export panel, tool traces, skill chips, settings drawer
  • Admin console (platform-admin) — agent capabilities dashboard, inference workers panel, benchmarks tab, workflows scheduler UI, skill-pack editor
  • Terminal TUI (platform-tui) — Ink-based orchestrator for keyboard-first workflows
  • Android worker — on-device LiteRT inference over WebSocket
  • Windows desktop + worker — Compose UI and headless edge worker
  • Async jobs — POST /api/v1/studio/jobs for long-running orchestration
  • Streaming contract — SSE events documented in STREAMING_CONTRACT.md
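
For clients consuming the SSE stream, the wire format is the standard `event:` / `data:` framing with a blank line between events. The sketch below parses a buffered chunk of that format. It is not taken from STREAMING_CONTRACT.md: `parseSseChunk` and the `token` event name in the usage are hypothetical, and the parser assumes LF line endings.

```typescript
// Minimal sketch of parsing server-sent events from a text buffer.
// Event names used with this parser are hypothetical; assumes "\n" line endings.
interface SseEvent {
  event: string;
  data: string;
}

function parseSseChunk(buffer: string): { events: SseEvent[]; rest: string } {
  const events: SseEvent[] = [];
  const blocks = buffer.split("\n\n");
  // The last block may be an incomplete event; hand it back to the caller.
  const rest = blocks.pop() ?? "";
  for (const block of blocks) {
    let event = "message"; // default SSE event type
    const dataLines: string[] = [];
    for (const line of block.split("\n")) {
      if (line.startsWith("event:")) {
        event = line.slice(6).trim();
      } else if (line.startsWith("data:")) {
        // Strip the single optional space that follows the colon.
        dataLines.push(line.slice(5).replace(/^ /, ""));
      }
      // Other lines (comments, id, retry) are ignored in this sketch.
    }
    if (dataLines.length > 0) events.push({ event, data: dataLines.join("\n") });
  }
  return { events, rest };
}

// Usage: two complete events plus a partial one that stays in `rest`.
const parsed = parseSseChunk("event: token\ndata: hel\n\nevent: token\ndata: lo\n\ndata: par");
console.log(parsed.events.length, JSON.stringify(parsed.rest));
```

A caller would prepend `rest` to the next network chunk before parsing again, so events split across chunk boundaries are not lost.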

Inference and embedding plane

Chat and embedding run on separate ports and provider pools. Model packs M1 through M5 map modes to sampling profiles with provider failover. Central chat uses Gemma via llama.cpp or LiteRT; embeddings use a dedicated BGE-M3 pool on :8081 while chat stays on :8080.

  • Providers: LiteRT server, Ollama, llama.cpp HTTP, OpenAI-compatible HTTP, mock runtime for CI
  • Model tier track M1–M5 complete — registry, model-packs.json, envelope benchmarks
  • Edge routing — central / self / federated worker modes with PII tier badges in admin
  • Context envelope evidence — LiteRT 32K, llama.cpp ~96K verified on RTX 4050 class hardware
  • Optional ONNX MiniLM via @xenova/transformers for local embed without GPU pool
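
Provider failover within a pool can be sketched as trying providers in priority order and recording which ones failed. The `Provider` interface, `generateWithFailover`, and the synchronous `generate` signature are assumptions for this example; real providers would be async HTTP clients.

```typescript
// Hypothetical failover sketch. The Provider shape is an assumption, and
// generate() is synchronous here for brevity; it throws on failure.
interface Provider {
  name: string;
  generate: (prompt: string) => string;
}

interface FailoverResult {
  provider: string; // the provider that produced the text
  text: string;
  failed: string[]; // providers tried before the one that succeeded
}

function generateWithFailover(providers: Provider[], prompt: string): FailoverResult {
  const failed: string[] = [];
  for (const p of providers) {
    try {
      return { provider: p.name, text: p.generate(prompt), failed };
    } catch {
      failed.push(p.name);
    }
  }
  throw new Error(`all providers failed: ${failed.join(", ")}`);
}

// Usage: the first provider is down, so the second one answers.
const result = generateWithFailover(
  [
    { name: "litert", generate: () => { throw new Error("connection refused"); } },
    { name: "llama.cpp", generate: (prompt) => `echo: ${prompt}` },
  ],
  "ping",
);
console.log(result.provider, result.failed);
```

Keeping chat and embedding providers in separate lists passed to this function would preserve the separation of planes: a failing chat provider never falls over onto the embedding pool.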

Capabilities — workflows, benchmarks, and ops

Beyond chat, DarDev Studio ships workflow orchestration, scheduled jobs, and a benchmark platform used to gate inference and RAG changes.

  • Workflow engine + cron scheduler — platform.db, admin Workflows tab
  • Benchmark platform phases A–G complete — API, CLI matrix, admin tab, regression gates
  • check:integration:dev-team — dev-team integration gate before merge
  • Retrieval cache, improvement metrics, training export CLI (phases E–F)
  • RAG weight tuning — npm run tune:rag-weights + config/rag_weights.json
  • Incremental ingest — npm run ingest:inc; repo ingest via ingest:repo

Measured outcomes

These numbers come from eval runners and envelope benchmarks run on a PC, not from sprint labels. I do not claim external user scale; gates run on internal engineering workflows and offline fixtures.

  • eval:orchestrator-intents — 6/6 offline on engineering-goal fixtures
  • check:integration:dev-team — RAG IT 8/8, plan mode 6/6, graph lookup 5/5
  • validate:orchestrator-prompts — 5/5 live API prompt matrix at closeout
  • RTX 4050 context envelope — LiteRT Gemma 32K max; llama.cpp ~96K before alloc fail
  • BGE-M3 dogfood ingest — ~17k chunks on monorepo mirror (RAG v6)
  • S01–S13 and API phases A–G shipped on PC (internal program tracking)

Eval and quality gates

Offline gates must pass before orchestrator and RAG changes merge. Eval fixtures live under evals/; runners in platform-eval.

  • eval:orchestrator-intents — intent routing across representative engineering goals
  • eval:rag:it — retrieval quality on IT corpora with citation verify
  • eval:graph — code graph lookup correctness
  • eval:it-modes — edition and mode matrix
  • eval:plan-mode and adaptive orchestrator matrices
  • validate:orchestrator-prompts — live API prompt matrix consistency
  • benchmark:platform:gate — inference regression gate
  • npm test — platform-core, platform-api, platform-eval on mock runtime

Fifteen-minute walkthrough: SYSTEM_ARCHITECTURE.md, one prepareTutorTurn flow, orchestrator mode, eval output.
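
The gating idea can be reduced to a small check over suite results. The sketch below is a hedged illustration, not the platform-eval runner: `SuiteResult`, `evaluateGate`, and the default requirement that every case passes are assumptions. The sample inputs reuse suite names from the list above, with pass counts chosen for the example.

```typescript
// Hypothetical merge gate over offline eval results. Names are illustrative.
interface SuiteResult {
  suite: string;
  passed: number;
  total: number;
}

interface GateReport {
  ok: boolean;
  failures: string[]; // human-readable "suite: passed/total" entries
}

// A suite fails the gate if it ran no cases or fell below the required pass rate.
function evaluateGate(results: SuiteResult[], minPassRate = 1): GateReport {
  const failures = results
    .filter((r) => r.total === 0 || r.passed / r.total < minPassRate)
    .map((r) => `${r.suite}: ${r.passed}/${r.total}`);
  return { ok: failures.length === 0, failures };
}

// Usage: one suite regresses, so the gate reports it and blocks the merge.
const report = evaluateGate([
  { suite: "eval:orchestrator-intents", passed: 6, total: 6 },
  { suite: "eval:rag:it", passed: 7, total: 8 },
  { suite: "eval:graph", passed: 5, total: 5 },
]);
console.log(report.ok, report.failures);
```

Treating an empty suite as a failure guards against a gate that passes only because its fixtures were not found.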

Production validation

DarDev Studio is developed and validated against DarDev internal engineering platforms. I ingest monorepo docs and code into org RAG, route natural goals through the IT orchestrator, and ship patches through propose/verify gates on the same repos that power dardev.net and the Studio monorepo itself.

Features that do not survive validation on real delivery work do not ship.

About this program

DarDev Studio is an internal engineering program I lead as architecture owner at DarDev — not a launched consumer product with external user scale. The long-form article documents methodology and falsifiable hypotheses; the case study tells what failed first and what the eval numbers were.

Writing: /writing/ai-orchestration-systems-research · Case study: /case-studies/dardev-studio-ai-orchestration

Apache-2.0 OSS on github.com/Theemiss/dardev-studio — Phase 1 export published (platform-core, platform-api, platform-eval, architecture docs). Web, admin, and edge workers follow in later phases.

Roadmap

Roadmap status is derived from PROGRAM_SCOPE_AND_STATUS and the roadmap-v1 track. Each item is labeled as shipped, in progress, or planned.

  • Now — Phase 1 public on GitHub; architecture docs in repo; theemis.cloud systems page and case study
  • In progress — Phase H LiteRT 0.12; C1 Android on-device end-to-end
  • Deferred — C4 MTP benchmark and safety matrix on device
  • Planned — S14–S15 orchestrator and reliability sprints; C5 external deploy after device gates
  • Future — Phase J university LMS bridge (separate from active dev-team SKU)
  • Ops optional — full corpus RAG v6 re-ingest when embedding pool available

Results

  • 6/6: Orchestrator intents (eval:orchestrator-intents offline fixtures)
  • 8/8: RAG IT eval (check:integration:dev-team gate)
  • 32K: LiteRT envelope (RTX 4050 max verified context)
  • ~96K: llama.cpp envelope (Gemma E2B before alloc fail)
  • 5: Memory tiers (session, org RAG, code graph, web, planner)
  • 8: Platform packages (@dardev/platform-* monorepo)

Key decisions

Memory tiers over one context blob

Each tier uses a different retrieval or execution mechanism on purpose. Mixing session chat into org RAG pollutes code retrieval. Tiers make tradeoffs explicit.

Separate chat and embedding planes

BGE-M3 on :8081 and chat LLM on :8080 prevent retrieval and generation from competing for the same GPU budget.

Offline eval gates before merge

Orchestrator intents, RAG IT suites, and graph lookup evals run in CI-style gates. Agentic systems without evals are demos.

Safe patch pipeline over raw edits

propose_patch and verify_patch gate workspace changes. Agents that ship code need the same review discipline as human commits.

Open source as distribution

Apache-2.0 on GitHub is the front door. Phased export publishes core engine and evals first; surfaces and workers follow hygiene review.

Production validation on DarDev engineering workflows

The platform ingests real monorepo corpora and routes real patch workflows. It is not a sandbox disconnected from delivery.

Open-source proof (GitHub)

Theemiss/dardev-studio · License: Apache-2.0

Phase 1 Apache-2.0 export published: platform-core, platform-api, platform-eval, eval fixtures, and architecture docs. Surfaces and workers publish in later phases.

My role

  • Architecture owner and primary builder — DarDev Studio from scratch at DarDev
  • Monorepo design — platform-core brain, API, evals, inference protocol, MCP, CLI, TUI
  • Hybrid RAG, five memory tiers, agent loop, IT orchestrator, and plan mode
  • Multi-surface product — IT Studio web, admin console, edge workers, OpenAPI contract
  • Eval suite and benchmark platform — orchestrator intents, RAG quality, regression gates
  • Open-source export strategy and public architecture documentation (Apache-2.0)
  • Production validation — AI-first development on DarDev.net and Studio monorepo delivery