
Latest Articles

Skills: Shared Knowledge Without Duplicated Instructions

As I progressed with my project of building an automated data engineering solution, I grew increasingly frustrated that every time I repeated a step, the outcome could be totally different from the last time. I needed to tame the nondeterministic nature of LLMs and create more consistent output and handover between agents. The business analyst agent should always profile the source data the same way. The test writer should always apply the same coverage rules. The fact builder should always generate surrogate keys using the same pattern.

The naive approach was to duplicate those shared instructions across every agent definition. My data-profiler skill, for example, contains the logic for detecting primary key candidates, calculating column statistics, identifying data quality issues, and recommending dbt tests. Without skills, that entire block would need to live inside the business analyst agent, the staging builder agent, the test writer agent, and any other agent that needs to understand source data — repeated verbatim each time.

That duplication is both a token problem and a consistency problem. Update the profiling logic in one agent, forget to update it in another, and you get different results from the same data.

Skills solve this. They are on-demand instruction modules that any agent can invoke when needed. Define the profiling logic once as a skill, reference it from every agent that needs it, and the instructions load only when that agent actually profiles data — not at startup, not in every session, not duplicated across definitions.

This is Part 2 of the Context Window Optimization series, and it is about making specialized knowledge modular, consistent, and token-efficient.

ℹ️ Note: This is Part 2 of 6 in the “Context Window Optimization” series. While examples use Claude Code, the pattern of “on-demand knowledge loading” applies broadly to any agentic workflow where instruction overhead is a concern.
## Quick Recap: Where We Are in the Series

Part 0 - The Hidden Cost of MCPs and Custom Instructions on Your Context Window established that a typical Claude Code session burns roughly 51% of the 200K token context window before any real work begins — MCP tools alone consume 16%, and memory files add another 2.7%. That cost multiplies across every project you work in.

Part 1 - Subagents: How Delegating Work Solves the Context Window Problem showed how subagents solve dynamic overhead — the tokens consumed by work in progress — by isolating verbose task output in separate context windows. Each specialized agent gets a narrow focus, reducing drift and hallucinations.

This post tackles static overhead from duplicated knowledge: the same instructions repeated across multiple agent definitions, bloating every session and drifting out of sync over time. Skills make that knowledge modular and on-demand.

## What Are Skills?

Skills are reusable instruction modules stored as markdown files. When a skill is invoked—either by you typing /skill-name or by Claude detecting relevance—its full content loads into context. Until then, only the skill’s description is present (a few dozen tokens), not the full instruction set.

Think of CLAUDE.md as a whiteboard always visible in the room. Skills are a filing cabinet: the labels are always readable, but you only pull out a folder when you actually need it.

Skills are stored in `.claude/skills/skill-name/SKILL.md` (project-level) or `~/.claude/skills/skill-name/SKILL.md` (user-level). Each skill has YAML frontmatter followed by instruction content, and can include supporting files like templates, examples, and scripts.

Here’s a different skill in action — /frontend-design loads its full instructions on invocation and responds with context-aware suggestions.

## Skills vs. CLAUDE.md vs. Subagents

The three mechanisms solve different problems and complement each other:

| Feature | CLAUDE.md | Subagents | Skills |
| --- | --- | --- | --- |
| When loaded | At startup | When delegated | On demand |
| Context window | Main | Separate, isolated | Main, inline |
| Content in context | Always, fully loaded | Always, in own window | Only when invoked |
| Startup cost | Full file loads | Nothing until spawned | Description only (~40 tokens) |
| Reduces static overhead | No | No | Yes |
| Reduces dynamic overhead | No | Yes | No |
| Best for | Core project rules | Verbose builds, test runs | Shared knowledge, templates |

Subagents keep your context clean by isolating work output. Skills keep your context lean by loading knowledge only when relevant.

## Anatomy of a Skill File

A skill lives in its own directory with SKILL.md as the required entrypoint:

```
.claude/skills/
└── data-profiler/
    ├── SKILL.md              # Main instructions (required)
    └── scripts/
        └── profile_data.py   # Profiling script
```

Notice the scripts/ directory. Skills go beyond shared instructions — they can bundle executable code. My data-profiler skill includes a Python script (profile_data.py) that every agent calls through Bash — same script, same parameters, same output format every time. The LLM doesn’t interpret profiling instructions and hope for consistency; it runs the script and gets deterministic results. The SKILL.md tells the agent when and why to profile; the script handles the how.

```yaml
---
name: data-profiler
description: >
  Automatically profile SQL Server tables and CSV files with intelligent
  analysis. Detects primary key candidates, infers data types from CSV data,
  calculates column statistics (nulls, cardinality, data types), identifies
  data quality issues, and recommends appropriate dbt tests. Use when
  exploring source data, creating staging models, or validating data quality
  before transformation.
disable-model-invocation: false
allowed-tools: Read, Write, Bash, Grep, Glob
---
```

The name field becomes the /slash-command (defaults to the directory name if omitted).
The description is what Claude reads to decide when to auto-invoke—write it precisely.

allowed-tools restricts which tools Claude can use when the skill is active. This profiler needs Read and Grep for source exploration, Bash to run the profiling script, and Write to save profile reports to disk. No Edit, since it generates new files rather than modifying existing ones.

disable-model-invocation and user-invocable (covered below) control who triggers the skill.

## How Invocation Works

When a skill is defined, only its description sits in context—roughly 30-60 tokens. The full content loads only on invocation, whether triggered by you or by Claude’s auto-detection.

| State | What’s in context |
| --- | --- |
| Skill defined, not invoked | Description only (~30-60 tokens) |
| Skill invoked by you (/skill-name) | Full SKILL.md content loads |
| Skill auto-invoked by Claude | Full SKILL.md content loads |

One exception: subagents with preloaded skills inject full skill content at startup, because they start fresh without conversation history.

## Control Who Can Invoke a Skill

Two frontmatter flags give precise control.

disable-model-invocation: true means only you can invoke the skill by typing /skill-name—use this for workflows with side effects you want to control explicitly, like /deploy or /send-release-notes.

user-invocable: false hides the skill from the / menu so only Claude can invoke it—useful for background knowledge like naming conventions that Claude should apply silently.
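Those two flags are plain frontmatter booleans, so you can audit them across a whole skills directory. The script below is a sketch of such an audit, not part of Claude Code; it assumes simple one-line `key: value` frontmatter and the standard `skill-name/SKILL.md` layout.

```python
from pathlib import Path

def skill_invocation_modes(skills_dir: str) -> dict[str, str]:
    """Classify each skill by who may invoke it, based on its frontmatter flags."""
    modes = {}
    for skill_md in Path(skills_dir).glob("*/SKILL.md"):
        text = skill_md.read_text()
        # Naive frontmatter scan: everything between the opening and
        # closing '---' markers at the top of the file.
        front = text.split("---")[1] if text.startswith("---") else ""
        if "disable-model-invocation: true" in front:
            modes[skill_md.parent.name] = "user-only (/slash-command)"
        elif "user-invocable: false" in front:
            modes[skill_md.parent.name] = "model-only (background knowledge)"
        else:
            modes[skill_md.parent.name] = "either"
    return modes
```

Running it against `.claude/skills/` gives a quick map of which skills fire deliberately and which apply silently.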
## A Practical Walkthrough: Converting CLAUDE.md Sections to Skills

Here’s a section that was living in my CLAUDE.md:

```markdown
## Data Profiling Rules

When profiling source data, always follow this process:

### Connection
- Server: localhost, Database: Agentic
- Authentication: SQL Server Authentication (read-only user)
- CSV files: standard format with header row from `2 - Source Files/`

### What to Analyze
- Data type and precision for every column
- Null count and percentage
- Distinct value count and cardinality
- Min/max values for numeric and date columns
- Primary key candidates: 100% distinct + 0% nulls
- Foreign key candidates: column name ends with `_id` or `_key`

### Test Recommendations
- Primary keys: `unique` + `not_null`
- Low cardinality columns (< 10 values): `accepted_values`
- Foreign keys: `relationships` to parent table
- Required fields: `not_null`

### Output
- Save profiles to `1 - Documentation/data-profiles/`
- Use JSON format for agent consumption
- Include: table stats, column profiles, quality issues, recommendations
- Generate dbt YAML scaffold with recommended tests
```

That section is roughly 300 tokens, loading every conversation regardless of whether data profiling is involved.

The migration: create .claude/skills/data-profiler/, move the content into SKILL.md with the YAML frontmatter shown in the Anatomy section, and delete the section from CLAUDE.md. The skill’s description (~117 tokens) replaces the full 300 tokens at startup — and the instructions only load when an agent actually needs to profile data.

The result: every agent that invokes the data-profiler skill gets the same profiling logic, the same output format, the same test recommendations — without any of them carrying those instructions at startup.
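To make the “bundled script, deterministic results” idea concrete, here is a minimal sketch of what the core of such a profiling script could look like for CSV sources. The real profile_data.py is not shown in this post; this sketch is an assumption about its shape, but it applies the same primary-key rule the skill states (100% distinct + 0% nulls).

```python
import csv

def profile_csv(path: str) -> dict:
    """Profile a CSV source: per-column null %, cardinality, and PK candidates."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    total = len(rows)
    profile = {"row_count": total, "columns": {}, "primary_key_candidates": []}
    for col in (rows[0].keys() if rows else []):
        values = [r[col] for r in rows]
        nulls = sum(1 for v in values if v in ("", None))
        distinct = len(set(values))
        profile["columns"][col] = {
            "null_pct": round(100 * nulls / total, 2),
            "distinct": distinct,
        }
        # Same rule the skill states: 100% distinct + 0% nulls => PK candidate
        if nulls == 0 and distinct == total:
            profile["primary_key_candidates"].append(col)
    return profile
```

Because every agent calls the same script through Bash, the null percentages, cardinalities, and primary-key candidates come out identical on every run, which is exactly the consistency the skill exists to guarantee.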
And every profile gets saved to the same location, accessible to all agents without re-profiling.

## Before and After: Token Savings

I applied this same migration pattern across my project — moving the data profiling rules, blog writing template, and several other instruction blocks out of CLAUDE.md and into skills. Here’s what the startup cost looks like now:

|  | Before (in CLAUDE.md) | After (as skill) |
| --- | --- | --- |
| Data profiling rules | ~300 tokens | ~117 tokens (description only) |
| Deployment checklist | ~250 tokens | ~30 tokens |
| Code review guidelines | ~350 tokens | ~35 tokens |
| Release notes template | ~200 tokens | ~30 tokens |
| Total for these sections | ~1,100 tokens | ~212 tokens |

That’s an 81% reduction in startup overhead for these four sections alone. The full skill content only loads when you’re actually doing that type of work — skills you don’t touch stay cold.

## Skill Design Principles

A few clear rules for what belongs where:

- Skills: specialized domain knowledge for a subset of your work (dbt conventions, data modelling best practices, deployment checklists); step-by-step workflows you want to invoke deliberately; templates and output formats that are token-heavy but situational
- CLAUDE.md: universal project rules; short, frequently referenced facts; meta-instructions about how Claude should behave
- Rough test: if you’d say “I only care about this on release days,” it belongs in a skill

For large reference material, keep SKILL.md under 500 lines and reference supporting files from within the skill directory.

## Skills with Arguments

Skills can accept arguments when invoked. Running /fix-issue 847 substitutes $ARGUMENTS with 847 throughout the skill content; individual arguments are accessible as $0, $1, etc.

```markdown
---
name: fix-issue
description: Fix a GitHub issue by number
disable-model-invocation: true
---

Fix GitHub issue $ARGUMENTS following our project coding standards.

1. Read the issue description using `gh issue view $ARGUMENTS`
2. Understand the requirements
3. Implement the fix in the appropriate files
4. Write tests covering the change
5. Create a commit following our commit message conventions
```

## Running Skills in Isolation: context: fork

Setting context: fork runs the skill inside an isolated subagent instead of inline—combining knowledge-on-demand with context isolation. The skill content becomes the subagent’s task prompt, and results come back as a concise summary to your main conversation.

```markdown
---
name: deep-research
description: Research a topic thoroughly using codebase exploration
context: fork
agent: Explore
---

Research $ARGUMENTS thoroughly:

1. Find relevant files using Glob and Grep
2. Read and analyze the code
3. Summarize findings with specific file references
4. Return a concise summary for the main conversation
```

## Limitations

Skill descriptions consume a shared budget. Claude Code caps loaded descriptions at 2% of the context window (~4,000 tokens for 200K)—if you create many skills, some may be excluded. Run /context to check, and keep descriptions concise.

Skills are not magic compression. Invoking five large skills in one conversation loads all their content—savings come from not loading skills you don’t use, not from reducing the ones you do.

Auto-invocation can surprise you. Without disable-model-invocation: true, Claude may load a skill automatically when its description matches—make descriptions specific if you get unexpected invocations.

context: fork skills don’t have conversation history. Forked subagent skills start clean and need any relevant context passed explicitly as arguments.

## Key Takeaways

- Skills are the on-demand alternative to always-loaded memory files—only descriptions load at startup; full content loads only when invoked.
- The savings come from specialization—instructions that apply to 20% of your work shouldn’t consume 100% of startup overhead.
- Start by auditing your CLAUDE.md—look for blocks of specialized instructions that only apply in specific contexts; those are your skill candidates.
- Design for invocation clarity—use disable-model-invocation: true for deliberate workflows, user-invocable: false for background knowledge, and default settings for knowledge you’re happy to invoke either way.
- Skills and subagents are complementary—subagents reduce dynamic overhead (what work produces), skills reduce static overhead (what knowledge you carry).

## Getting Started: Need Inspiration?

You don’t have to start from scratch. Anthropic maintains an open-source skills repository with ready-to-use skills you can drop into your .claude/skills/ directory. It includes a template/ folder showing the recommended structure, a spec/ folder documenting the skill format, and a growing skills/ collection contributed by the community.

Browse the repo for patterns worth adopting — or fork it and contribute your own. If you’ve built a skill that solves a common problem (data profiling, code review, deployment checklists), others are likely duplicating the same instructions you just extracted.

## What’s Next

Skills and subagents keep your context lean during work — but the instructions that load before work begins matter just as much. Part 3 covers CLAUDE.md files — the global, project, and folder-level instruction hierarchy. A bloated CLAUDE.md taxes every conversation. Part 3 shows how to split instructions across levels so each file stays focused, nothing gets duplicated, and your agent behaves consistently across every repo.
## Resources

Official Documentation

- Extend Claude with Skills - Claude Code Docs
- Anthropic Skills Repository - GitHub
- Create Custom Subagents - Claude Code Docs
- Memory Files (CLAUDE.md) - Claude Code Docs
- Claude Code Best Practices - Anthropic Engineering

Previous Posts in This Series

- The Hidden Cost of MCPs and Custom Instructions on Your Context Window
- Subagents: How Delegating Work Solves the Context Window Problem

Community Resources

- Agent Skills Open Standard (agentskills.io)
- Claude Code Skills: Structure and Invocation — Mikhail Shilkov

Have you started breaking your CLAUDE.md into skills? What instruction blocks did you find were most worth extracting? I’d love to hear what works in your setup—and what surprised you about the process.

04 Mar 2026 · SelfServiceBI

Subagents: How Delegating Work Solves the Context Window Problem

I was experimenting with building an automated dbt data engineering solution—a SQL Server + dbt + Power BI pipeline—and I wanted Claude Code to help me build it layer by layer. The problem I kept running into: the agent would drift. When I asked it to build staging models, it would start making assumptions about fact table logic. When writing tests, it would forget the naming conventions from three messages earlier. Errors multiplied. The work got messier the longer the session ran.

My first instinct was to write a more detailed CLAUDE.md. But the real issue wasn’t instructions—it was focus. A single agent trying to hold the entire dbt project in its head at once was too much context, too many concerns at the same time.

That’s when I started creating specialized subagents: one for staging models, one for fact tables, one for writing tests. Each agent got a narrow, specific set of instructions. The hallucinations dropped dramatically. And as a bonus I didn’t expect—my main context window stayed clean, letting me sustain longer, higher-quality sessions.

In my previous post about MCPs and custom instructions, I covered static overhead—the 51% of your context window that disappears before you type a single message. This post is about the second problem: dynamic overhead, the tokens consumed by the actual work. And how subagents solve both.

ℹ️ Note: This is Part 1 of 6 in the “Context Window Optimization” series. While examples use Claude Code, these concepts apply broadly to any AI agent system that supports task delegation.

## Quick Recap: The 50% Tax

The previous post showed how a typical Claude Code setup burns roughly 51% of the 200K token context window before any real work begins:

| Category | Tokens | % of Window |
| --- | --- | --- |
| System prompt | 3.0k | 1.5% |
| System tools | 14.8k | 7.4% |
| MCP tools | 32.6k | 16.3% |
| Memory files | 5.4k | 2.7% |
| Autocompact buffer | 45.0k | 22.5% |
| Free space | 99k | 49.3% |

That’s static overhead—it happens just from launching Claude Code.
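The 51% figure is just the sum of those overhead rows. As a quick sanity check (using the rounded token counts from my /context output):

```python
# Overhead categories from the /context breakdown (tokens, rounded)
overhead = {
    "system_prompt": 3_000,
    "system_tools": 14_800,
    "mcp_tools": 32_600,
    "memory_files": 5_400,
    "autocompact_buffer": 45_000,
}
window = 200_000  # Claude Sonnet context window

used = sum(overhead.values())
print(f"{used:,} tokens used = {used / window:.0%} of the window")
# 100,800 tokens used = 50% of the window
```

With the small custom-agents and messages entries added back in, the total lands at the 101k/200k (51%) reported by /context.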
But the dynamic overhead from real work adds up just as fast. Reading 20 files to understand a codebase? That’s ~40,000 tokens. Running a test suite with verbose output? Another 15,000. Combine both types and you can hit 80%+ context usage without writing a single line of implementation code.

This is why long sessions often degrade. It’s not the AI getting tired—it’s arithmetic.

## What Are Subagents?

Subagents are specialized AI assistants that run in their own isolated 200K token context window. When Claude delegates a task to a subagent, that work happens inside the subagent’s context—all the file reads, search results, and command output stay there. Only a concise summary returns to your main conversation.

Think of it like offloading work to a separate process. Each process has its own memory space. Only the final result comes back.

Without subagents:

- Main conversation reads 25 files to understand a module
- Those files consume ~40,000 tokens in your main context
- That’s 40% of your remaining working space, gone

With subagents:

- Subagent reads 25 files in its own 200K context window
- Subagent returns a 500-token summary to your main conversation
- Main context barely moves

You get the same information. Your main context stays clean.

## Built-in Subagent Types

Claude Code ships with several built-in subagents:

| Type | Model | Access | Best For |
| --- | --- | --- | --- |
| Explore | Haiku (fast/cheap) | Read-only | Codebase searches, finding patterns, understanding structure |
| Plan | Inherits from main | Read-only | Research during planning mode (/plan) |
| General-purpose | Inherits from main | Full read/write | Tasks requiring both exploration and code modification |
| Bash | Inherits from main | Terminal access | Running builds, tests, git operations |

The Explore agent is particularly useful for the dynamic overhead problem. When you ask Claude to understand how a module works, it can delegate that research to the Explore agent (running on fast, cheap Haiku).
The Explore agent reads all the relevant files and returns a summary—keeping potentially tens of thousands of tokens of file content out of your main context.

The Bash agent handles terminal verbosity. A full test suite might produce 15,000 tokens of output. The Bash agent processes that and returns a 300-token summary of which tests failed and why.

## Foreground vs Background

Subagents run in two modes:

Foreground (default): Blocks your main conversation until complete. You can watch progress in real time, and any permission prompts pass through to you for approval. Best for focused tasks where you want to see what’s happening.

Background: Runs concurrently while you keep working. Permissions are requested upfront before the agent starts, then auto-denied for anything not pre-approved. Use /tasks to check status. Best for long-running work—web research, large test suites, multi-file builds.

To run a background agent: ask Claude to “run this in the background,” or press Ctrl+B to background a task that’s already running.

⚠️ Background agents do not have access to MCP tools. If your task requires an MCP server (database queries, browser automation), use a foreground agent instead.

## Custom Subagents: The dbt Example

Built-in subagents cover generic workflows. For specialized projects, you can define your own. Custom subagents are markdown files with YAML frontmatter stored in .claude/agents/ (project-level) or ~/.claude/agents/ (user-level, available across all projects).

Going back to my dbt pipeline: I created a dedicated subagent for building staging models. Here is the actual frontmatter and opening from my dbt-staging-builder agent:

```markdown
---
name: dbt-staging-builder
description: >
  Build staging models (stg_*) that transform raw source data with basic
  cleaning, renaming, and type casting. Create source definitions in YAML
  with freshness checks. Handle null values and standardize column names.
  Use when creating the first transformation layer from raw source tables.
tools: Read, Write, Edit, Grep, Glob
model: haiku
skills: dbt-runner, data-profiler, sql-server-reader
---

# Staging Builder Agent

You are a specialist in creating staging models (stg_*) - the first
transformation layer in dbt projects.
```

Two things stand out in this frontmatter:

- model: haiku — this subagent runs on the fastest, cheapest model. Staging models follow rigid patterns (rename columns, cast types, filter nulls). Haiku handles that well. No need to pay for Opus reasoning on templated SQL.
- skills: dbt-runner, data-profiler, sql-server-reader — the agent has access to three Skills (covered in Part 2) that give it on-demand access to dbt commands, data profiling, and database queries. Those skill definitions only load when the agent invokes them — they are not in the main context either.

The full agent definition runs to nearly 400 lines. It contains column naming rules, a sanitization reference table, a standard staging SQL template, a profile-driven workflow, and a six-step development process. Here is a taste of the specificity:

```markdown
## Staging Model Principles

**What staging models DO**:
- ✅ Select specific columns (no SELECT *)
- ✅ Rename columns for consistency
- ✅ Cast data types explicitly
- ✅ Handle nulls with COALESCE
- ✅ Filter out invalid records (null keys)

**What staging models DON'T do**:
- ❌ Join to other tables
- ❌ Aggregate data
- ❌ Add complex business logic
- ❌ Create derived metrics
```

And the naming convention section:

```markdown
## Naming Convention

**Model**: `stg_<source>__<entity>`
- Examples: stg_erp__customers, stg_sales__orders
- Double underscore separates source from entity

**Columns**:
- Primary keys: <entity>_id
- Foreign keys: <related_entity>_id
- Dates: <event>_date
- Timestamps: <event>_at
- Booleans: is_<condition>
```

All 400 lines live inside the subagent’s context — not the main agent’s.
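Those column naming rules are mechanical enough to check in code. As an illustration only (this validator is hypothetical, not part of the agent definition), the conventions translate to a few suffix and prefix patterns:

```python
import re

def classify_column(name: str) -> str:
    """Map a staging column name to the category the naming rules imply."""
    rules = [
        (r"^[a-z][a-z_]*_id$", "key"),        # <entity>_id / <related_entity>_id
        (r"^[a-z][a-z_]*_date$", "date"),     # <event>_date
        (r"^[a-z][a-z_]*_at$", "timestamp"),  # <event>_at
        (r"^is_[a-z][a-z_]*$", "boolean"),    # is_<condition>
    ]
    for pattern, category in rules:
        if re.fullmatch(pattern, name):
            return category
    return "attribute"  # plain descriptive column
```

A check like this could run as part of the agent's test step, flagging columns whose names don't fall into any expected category.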
When I ask Claude to “create a staging model for the customers table,” it spawns the staging builder agent, which loads all these instructions into its own 200K window, does the work (profile the source, write SQL, compile, run, test), and returns a concise summary. My main conversation sees something like:

> Created stg_erp__customers — 12 columns, materialized as view, 3 tests passing (unique, not_null on customer_id, not_null on customer_name). 4,230 rows.

The compilation logs, run output, and test results (which can easily total 15,000+ tokens) stay isolated. My main conversation sees only the summary.

Why this works for focus and quality: The subagent’s 400 lines of instructions cover staging models specifically — naming conventions, column sanitization rules, the CTE template, the right dbt commands. It does not know about fact tables or reporting layers. That narrow scope is what prevents the drift and hallucinations I experienced with a single all-knowing agent.

Why this works for context: Every token of verbose dbt output stays in the subagent’s 200K window. My main context stays clear for the orchestration work — deciding what to build next, reviewing summaries, planning the next layer.

## Key Benefits of Custom Subagents

Instructions live in the subagent’s context, not yours. Detailed staging model conventions, source system quirks, test patterns—all of that lives in the subagent’s system prompt. It doesn’t load into your main conversation until you invoke the agent. This is the opposite of stuffing everything into CLAUDE.md.

Tool restrictions prevent accidents. You can limit a subagent to read-only tools if it’s only supposed to research. A code-review subagent with no write access can’t accidentally modify files.

Model selection for cost control. The model: haiku field means this subagent runs on Haiku even if your main session uses Opus. Expensive Opus reasoning for coordination; cheaper Haiku for focused execution tasks.
## When to Delegate vs Stay in Main Context

Not every task should be delegated. Subagents add startup latency, and the subagent needs to gather its own context fresh each time.

Delegate when:

- The task will produce verbose output you don’t need line-by-line (test results, build logs, grep output across many files)
- The work is self-contained with a clear deliverable
- Multiple independent research paths can run in parallel
- The task will take several minutes and you want to keep working

Stay in main context when:

- You need tight back-and-forth iteration
- The task is quick and targeted (a small edit, a single question)
- The output of one step feeds directly into the next
- You need interactive clarification mid-task

A useful heuristic: if a task will produce multiple pages of output you don’t need to read line by line, it’s a good candidate for delegation.

## Invocation Quality Matters

Subagents start fresh — they don’t inherit your conversation history. A vague prompt like “build the staging model” forces the subagent to guess at which source table, which naming pattern, and which tests to add. A well-specified prompt tells the subagent what is unique to this task — not what is already in its instructions:

> Create a staging model for the raw.customers table in the ERP source. The primary key is customer_id. The email column has known nulls — use COALESCE with 'unknown'. Return the row count and test results.

The agent already knows to profile the source, follow the stg_<source>__<entity> naming pattern, add primary key tests, and run the compile → run → test workflow. The prompt only needs to provide context the agent cannot infer: which table, which source system, and any data quirks specific to this table.

This is where your project CLAUDE.md becomes valuable in a different way. You can instruct the main agent on how to delegate — telling it what context to pass when spawning subagents.
For example, my CLAUDE.md includes a note to “always consider subagents and skills” and my standard implementation process (research → plan → tests → implement). The main agent reads those instructions and knows to provide the subagent with the right table name, source system, and any known data issues — rather than firing off a bare “build the staging model” prompt. The CLAUDE.md shapes the orchestrator; the agent definition shapes the worker.

## Limitations

Subagents can’t spawn other subagents. There’s no nesting. If you need multiple subagents coordinating, that coordination happens from your main conversation.

Background agents lose MCP access. Any task requiring an MCP tool (browser, database connection) must run as a foreground agent.

Coordination overhead grows with scale. Running 3-4 parallel subagents works well. Beyond that, the coordination complexity and the summaries flowing back can start to add clutter of their own.

## Key Takeaways

- Subagents keep your main context clean. Verbose output—file reads, build logs, test results—stays isolated in the subagent’s 200K window. Only concise summaries return to your main conversation, preserving your working space.
- Narrow focus reduces hallucinations. A specialized subagent with domain-specific instructions stays on task and makes fewer mistakes than a general agent trying to hold everything at once. This was the original reason I started using them.
- Custom subagents move knowledge out of your CLAUDE.md. Instead of stuffing every project convention into a file that loads every session, put specialized instructions in subagents that load only when needed.
- Choose the right type. Explore for codebase research (fast Haiku, read-only). Bash for terminal-heavy workflows. General-purpose for tasks needing both exploration and implementation. Custom subagents for domain-specific work.
- Use background agents for long-running tasks. Research, test suites, and builds can run while you keep working. Press Ctrl+B to background a running task, and check status with /tasks.
- Delegate when the output is verbose and self-contained. The heuristic: more than ~5,000 tokens of output you don’t need line-by-line is a strong signal to delegate.

## What’s Next: Skills

Subagents solve the dynamic overhead problem—the tokens consumed by work in progress. But what about the static overhead from the previous post? A big chunk of that 51% baseline comes from CLAUDE.md and other memory files loaded into every session. What if you could make that modular—loading specialized knowledge only when you actually need it?

That’s what Skills do. In the next part of this series, I’ll explore how Skills complement subagents by making static context modular. If subagents are about keeping work isolated, Skills are about keeping knowledge on-demand.

## Resources

Official Documentation

- Create custom subagents - Claude Code Docs
- Context windows - Claude API Docs

Research

- How we built our multi-agent research system - Anthropic
- Building Effective AI Agents - Anthropic
- How Input Token Count Impacts LLM Latency - Glean

Previous Posts in This Series

- The Hidden Cost of MCPs and Custom Instructions on Your Context Window

Have you started using subagents in your AI workflows? I’d especially love to hear from anyone building data engineering pipelines—what tasks do you delegate, and how have you structured your custom agents?

02 Mar 2026 · SelfServiceBI

The Hidden Cost of MCPs and Custom Instructions on Your Context Window

Large context windows sound limitless—200K, 400K, even a million tokens. But once you bolt on a few MCP servers, dump in a giant CLAUDE.md, and drag a long chat history behind you, you can easily burn over 50% of that window before you paste a single line of code. This post is about that hidden tax—and how to stop paying it.

## Where This Started

This exploration started when I came across a LinkedIn post by Johnny Winter featuring a YouTube video about terminal-based AI tools and context management. The video demonstrates how tools like Claude Code, Gemini CLI, and others leverage project-aware context files—which got me thinking about what’s actually consuming all that context space.

Video by NetworkChuck

ℹ️ Note: While this post uses Claude Code for examples, these concepts apply to any AI coding agent—GitHub Copilot, Cursor, Windsurf, Gemini CLI, and others.

## The Problem: You’re Already at 50% Before You Start

Think of a context window as working memory. Modern AI models have impressive limits (as of 2025):

- Claude Sonnet 4.5: 200K tokens (1M beta for tier 4+)
- GPT-5: 400K tokens via API
- Gemini 3 Pro: 1M input tokens

A token is roughly 3-4 characters, so 200K tokens equals about 150,000 words. That sounds like plenty, right? Here’s what actually consumes it:

- System prompt and system tools
- MCP server tool definitions
- Memory files (CLAUDE.md, .cursorrules)
- Autocompact buffer (reserved for conversation management)
- Conversation history
- Your code and the response being generated

By the time you add a few MCPs and memory files, a large chunk of your context window is already gone—before you’ve written a single line of code.

## Real Numbers: The MCP Tax

Model Context Protocol (MCP) servers make it easier to connect AI agents to external tools and data. But each server you add costs tokens. Here’s what my actual setup looked like (from Claude Code’s /context command): MCP tools alone consume 16.3% of the context window—before I’ve even started a conversation.
Combined with system overhead, I’m already at 51% usage with essentially zero messages. The Compounding Effect The real problem emerges when overhead compounds. Here’s my actual breakdown: Category Tokens % of Window System prompt 3.0k 1.5% System tools 14.8k 7.4% MCP tools 32.6k 16.3% Custom agents 794 0.4% Memory files 5.4k 2.7% Messages 8 0.0% Autocompact buffer 45.0k 22.5% Free space 99k 49.3% Total: 101k/200k tokens used (51%) You’re working with less than half your theoretical capacity—and that’s with essentially zero conversation history. Once you start coding, the available space shrinks even further. Why This Matters: Performance and Quality Context consumption affects more than just space: Processing Latency: Empirical testing with GPT-4 Turbo shows that time to first token increases by approximately 0.24ms per input token. That means every additional 10,000 tokens adds roughly 2.4 seconds of latency to initial response time. (Source: Glean’s research on input token impact) Cache Invalidation: Modern AI systems cache frequently used context. Any change (adding an MCP, editing instructions) invalidates that cache, forcing full reprocessing. Quality Degradation: When context gets tight, models may: Skip intermediate reasoning steps Miss edge cases Spread attention too thinly across information Fill gaps with plausible but incorrect information Truncate earlier conversation, losing track of prior requirements I’ve noticed this particularly in long coding sessions. After discussing architecture early in a conversation, the agent later suggests solutions that contradict those earlier decisions—because that context has been truncated away. 
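The arithmetic behind those percentages is worth making concrete. Here is a minimal sketch, plain Python with no Claude Code API involved, that reproduces the budget breakdown above and applies the ~0.24 ms/token latency figure from Glean's testing as a rough linear extrapolation:

```python
# Sketch: context-window budget calculator using the numbers from this post.
WINDOW = 200_000  # Claude Sonnet 4.5 default window, in tokens

# The per-category overhead reported by Claude Code's /context command.
overhead = {
    "system prompt": 3_000,
    "system tools": 14_800,
    "mcp tools": 32_600,
    "custom agents": 794,
    "memory files": 5_400,
    "messages": 8,
    "autocompact buffer": 45_000,
}

used = sum(overhead.values())
free = WINDOW - used

def added_latency_ms(tokens: int, ms_per_token: float = 0.24) -> float:
    """Rough extra time-to-first-token for a given input size
    (0.24 ms/token, per the Glean GPT-4 Turbo measurement cited above)."""
    return tokens * ms_per_token

for name, tok in overhead.items():
    print(f"{name:<18} {tok/1000:6.1f}k  {100 * tok / WINDOW:5.1f}%")
# Matches the ~101k/200k (51%) that /context reported:
print(f"used {used/1000:.1f}k ({100 * used / WINDOW:.0f}%), "
      f"free {free/1000:.1f}k")
print(f"latency cost of 10k extra tokens: "
      f"~{added_latency_ms(10_000)/1000:.1f} s")
```

Real token counts depend on the model's tokenizer and latency varies by provider, so treat this as a back-of-the-envelope check rather than a measurement tool.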
Practical Optimization: Real-World Example

Let me share a before/after from my own setup.

Before optimization:

- 10+ MCPs enabled (all the time)
- MCP tools consuming 32.6k tokens (16.3%)
- Only 99k tokens free (49.3%)
- Frequent need to summarize/restart sessions

After optimization:

- 3-4 MCPs enabled by default
- MCP tools reduced to ~12k tokens (~6%)
- Memory files trimmed to essentials (~3k tokens)
- Over 140k tokens free (70%+)

Results: more working space, better reasoning quality, fewer context limit issues, and faster responses.

Optimization Checklist

Before adding another MCP or expanding instructions, ask:

- Have I measured my current context overhead?
- Is my custom instruction file under 5,000 tokens?
- Do I actively use all enabled MCPs?
- Have I removed redundant or outdated instructions?
- Could I accomplish this goal without consuming more context?

In Claude Code: use the /context command to see your current context usage breakdown.

Specific Optimization Strategies

1. Audit Your MCPs Regularly

Ask yourself: Do I use this MCP daily? Weekly? Monthly? Could I accomplish this task without the MCP?

Action: disable MCPs you don't use regularly. Enable them only when needed.

Impact of selective MCP usage: by selectively disabling MCPs you don't frequently use, you can immediately recover significant context space. This screenshot shows the difference in available context when strategically choosing which MCPs to keep active versus loading everything. In Claude Code, you can toggle MCPs through the settings panel. This simple action can recover 10-16% of your context window.

2. Ruthlessly Edit Custom Instructions

Your CLAUDE.md memory files, .cursorrules, or copilot-instructions.md should be:

- Concise (under 5,000 tokens)
- Focused on patterns, not examples
- Project-specific, not general AI guidance

Bad example:

When writing code, always follow best practices. Use meaningful variable names. Write comments. Test your code. Follow SOLID principles. Consider performance. Think about maintainability... (Continues for 200 lines)

Good example:

Code Style:
- TypeScript strict mode
- Functional patterns preferred
- Max function length: 50 lines
- All public APIs must have JSDoc

Testing:
- Vitest for unit tests
- Each function needs test coverage
- Mock external dependencies

3. Start Fresh When Appropriate

Long conversations accumulate context. Sometimes the best optimization is:

- Summarizing what's been decided
- Starting a new session with that summary
- Dropping irrelevant historical context

4. Understand the Autocompact Buffer

Claude Code includes an autocompact buffer that helps manage context automatically. When you run /context, you'll see something like:

Autocompact buffer: 45.0k tokens (22.5%)

This buffer reserves space to prevent hitting hard token limits by automatically compacting or summarizing older messages during long conversations. It maintains continuity without abrupt truncation, but it also means that 22.5% of your window is already taken.

You can also see and toggle this behavior in Claude Code's /config settings. In this screenshot, Auto-compact is enabled, which keeps a dedicated buffer for summarizing older messages so long conversations stay coherent without suddenly hitting hard context limits.

Claude Code Specific Limitations: The Granularity Problem

Claude Code currently has a platform-level limitation that makes fine-grained control challenging, documented in GitHub Issue #7328: "MCP Tool Filtering".

The core issue: Claude Code loads ALL tools from configured MCP servers. You can only enable or disable entire servers, not individual tools within a server.

The impact: large MCP servers with 20+ tools can easily consume 50,000+ tokens just on definitions.
If a server has 25 tools but you only need 3, you must either:

- Load all 25 tools and accept the context cost
- Disable the entire server and lose access to the 3 tools you need
- Build a custom minimal MCP server (significant development effort)

This makes tool-level filtering essential for context optimization, not just a convenience. The feature is under active development with community support. In the meantime:

- Use MCP servers sparingly
- Prefer smaller, focused servers over large multi-tool servers
- Regularly audit which servers you actually need enabled
- Provide feedback on the GitHub issues to help prioritize this feature

Key Takeaways

You're burning a huge portion of your context window before you even paste in your first file. MCP tools alone can consume 16%+ of your window. System tools add another 7%. The autocompact buffer reserves 22%. It adds up fast.

Optimization is ongoing. Regular audits of MCPs and memory files keep your agent running smoothly. Aim to keep baseline overhead under 30% of total context (excluding the autocompact buffer).

Measurement matters. Use /context in Claude Code to monitor your overhead. You can't optimize what you don't measure.

Performance degrades subtly. Latency increases roughly 2.4 seconds per 10,000 input tokens based on empirical testing, and reasoning quality drops as context fills up.

Start minimal, add intentionally. The best developers using AI agents:

- Start minimal
- Add capabilities intentionally
- Monitor performance impact
- Optimize regularly
- Remove what isn't providing value

The goal isn't to minimize context usage at all costs. The goal is intentional, efficient context usage that maximizes response quality, processing speed, and available working space. Think of your context window like RAM in a computer: more programs running means less memory for each program, and eventually everything slows down. It's not about having every tool available. It's about having the right tools, configured optimally, for the work at hand.
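One way to act on the "audit regularly" advice outside of /context: MCP tool definitions travel to the model as JSON schemas, so serializing them and dividing by ~4 characters per token (the same rough heuristic used earlier in this post) gives a ballpark per-server cost. The sketch below is illustrative only; the server and tool names are hypothetical, and real counts depend on the model's tokenizer.

```python
import json

WINDOW = 200_000  # tokens (Claude Sonnet 4.5 default window)

def estimate_tokens(obj) -> int:
    """Ballpark token count: ~4 characters per token on serialized JSON."""
    return len(json.dumps(obj)) // 4

# Hypothetical servers, with tool definitions shaped like MCP tool schemas.
servers = {
    "issue-tracker": [
        {
            "name": "create_issue",
            "description": "Create a new issue in the tracker",
            "inputSchema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "body": {"type": "string"},
                },
            },
        },
        # ...a real server of this kind might declare 20+ more tools...
    ],
    "docs-search": [
        {
            "name": "search",
            "description": "Full-text search over documentation",
            "inputSchema": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
            },
        },
    ],
}

# Report the most expensive servers first: prime candidates for disabling.
for name, tools in sorted(servers.items(),
                          key=lambda kv: -estimate_tokens(kv[1])):
    cost = estimate_tokens(tools)
    print(f"{name:<14} {len(tools):2d} tools  ~{cost} tokens  "
          f"({100 * cost / WINDOW:.3f}% of window)")
```

Pointing this at the JSON your servers actually expose makes the "50,000+ tokens just on definitions" problem visible per server, which is exactly the granularity Claude Code's enable/disable toggle currently lacks.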
Resources

Official Documentation
- Claude Code MCP Documentation
- Model Context Protocol (MCP) Overview
- Claude Code Best Practices
- Claude Code Cost Management
- Claude Context Windows

Research & Performance
- How Input Token Count Impacts LLM Latency - Glean

Community Resources
- Model Context Protocol Documentation
- GitHub Copilot Custom Instructions
- Johnny Winter's LinkedIn Post on Terminal AI
- You've Been Using AI the Hard Way (Use This Instead) - YouTube Video

Have you optimized your AI agent setup? What context window challenges have you encountered? I'd love to hear your experiences and optimization strategies.

23 Nov 2025 · SelfServiceBI
