Scaling LLMs to Larger Codebases

(blog.kierangill.xyz)

92 points | by kierangill 2 hours ago

17 comments

  • mstank 1 hour ago
    As the models have progressively improved (able to handle more complex code bases, longer files, etc.), I’ve started using this simple framework on repeat, which seems to work pretty well at one-shotting complex fixes or new features.

    [Research] ask the agent to explain current functionality as a way to load the right files into context.

    [Plan] ask the agent to brainstorm the best practices way to implement a new feature or refactor. Brainstorm seems to be a keyword that triggers a better questioning loop for the agent. Ask it to write a detailed implementation plan to an md file.

    [clear] completely clear the context of the agent. This gives better results than just compacting the conversation.

    [execute plan] ask the agent to review the specific plan again; sometimes it will ask additional questions, which repeats the planning phase. This loads only the plan into context, and then I have it implement the plan.

    [review & test] clear the context again and ask it to review the plan to make sure everything was implemented. This is where I add any unit or integration tests if needed. Also run test suites, type checks, lint, etc.

    With this loop I’ve often had it run for 20-30 minutes straight and end up with usable results. It’s become a game of context management and creating a solid testing feedback loop instead of trying to purely one-shot issues.
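
    A minimal sketch of the loop above, assuming a stand-in `Session` class that only records what is in context. The `Phase`, `Session`, and `run_loop` names and the `plan.md` filename are hypothetical, and no real agent API is called; the point is only to show where the context gets cleared.

```python
from dataclasses import dataclass, field


@dataclass
class Phase:
    name: str
    prompt: str
    clear_context_first: bool = False


@dataclass
class Session:
    """Stand-in for an agent session; it only tracks what is in context."""
    context: list = field(default_factory=list)

    def clear(self) -> None:
        self.context = []

    def send(self, prompt: str) -> None:
        # A real agent would respond here; this sketch only records the prompt.
        self.context.append(prompt)


PHASES = [
    Phase("research", "Explain how the current functionality works."),
    Phase("plan", "Brainstorm the best-practice approach; write a detailed plan to plan.md."),
    Phase("execute", "Review plan.md, ask any questions, then implement it.",
          clear_context_first=True),
    Phase("review", "Check plan.md was fully implemented; run tests, type checks, lint.",
          clear_context_first=True),
]


def run_loop(session: Session, phases: list) -> list:
    """Run each phase in order and return the context size after each one."""
    sizes = []
    for phase in phases:
        if phase.clear_context_first:
            session.clear()
        session.send(f"[{phase.name}] {phase.prompt}")
        sizes.append(len(session.context))
    return sizes


context_sizes = run_loop(Session(), PHASES)
```

    In this toy model the execute and review phases each start from an empty context, so only the plan file carries information between them.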

    • dfsegoat 3 minutes ago
      Highly recommend using agent based hooks for things like `[review & test]`.

      At a basic level, they work akin to git-hooks, but they fire up a whole new context whenever certain events trigger (E.g. another agent finishes implementing changes) - and that hook instance is independent of the implementation context (which is great, as for the review case it is a semi-independent reviewer).
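
      A rough sketch of the idea, not any tool's actual hook API: a tiny event registry where each hook receives only a fresh copy of the payload, never the implementing agent's conversation. `HookRegistry`, `review_hook`, the event name, and the payload keys are all hypothetical.

```python
from typing import Callable

Hook = Callable[[dict], str]


class HookRegistry:
    """Maps event names to hooks; each hook runs on its own copy of the payload."""

    def __init__(self) -> None:
        self._hooks = {}

    def on(self, event: str, hook: Hook) -> None:
        self._hooks.setdefault(event, []).append(hook)

    def fire(self, event: str, payload: dict) -> list:
        # dict(payload) gives every hook a fresh, independent context.
        return [hook(dict(payload)) for hook in self._hooks.get(event, [])]


def review_hook(context: dict) -> str:
    """Semi-independent reviewer: sees the plan and the changes, nothing else."""
    missing = [f for f in context["planned_files"] if f not in context["changed_files"]]
    return "ok" if not missing else "missing: " + ", ".join(missing)


registry = HookRegistry()
registry.on("implementation_finished", review_hook)
results = registry.fire(
    "implementation_finished",
    {"planned_files": ["a.py", "b.py"], "changed_files": ["a.py"]},
)
```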

    • asim 31 minutes ago
      I don't do any of that. I find with GitHub Copilot and Claude Sonnet 4.5, if I'm clear enough about the what and where, it'll sort things out pretty well, and then there's only reiteration of code styling or reuse of functionality. At that point it has enough context to keep going. The only time I might clear that whole thing is if I'm working on an entirely new feature where the context is too large and it gets stuck in summarising the history. Otherwise it's good. But this is in Codespaces. I find the Tasks feature much harder. Almost a write-off when trying to do something big. Twice I've had it go off on some strange tangent and build the most absurd thing. You really need to keep your eyes on it.
      • hyperadvanced 26 minutes ago
        Same. I find that if I can piecemeal explain the desired functionality and work as I would pairing with another engineer that it’s totally possible to go from “make me a simple wheel with spokes” to “okay now let’s add a better frame and brakes” with relatively little planning, other than what I’d already do when researching the codebase to implement a new feature
    • godzillafarts 14 minutes ago
      This is effectively what I'm doing, inspired by HumanLayer's Advanced Context Engineering guidelines: https://github.com/humanlayer/advanced-context-engineering-f...

      We've taken those prompts, tweaked them to be more relevant to us and our stack, and have pulled them in as custom commands that can be executed in Claude Code, i.e. `/research_codebase`, `/create_plan`, and `/implement_plan`.

      It's working exceptionally well for me, it helps that I'm very meticulous about reviewing the output and correcting it during the research and planning phase. Aside from a few use cases with mixed results, it hasn't really taken off throughout our team unfortunately.

    • prmph 45 minutes ago
      Nothing will really work when the models fail at the most basic of reasoning challenges.

      I've had models do the complete opposite of what I've put in the plan and guidelines. I've had them go re-read the exact sentences, and still seen them come to the opposite conclusion, and my instructions are nothing complex at all.

      I used to think one could build a workflow and process around LLMs that extract good value from them consistently, but I'm now not so sure.

      I notice that sometimes the model will be in a good state, and do a long chain of edits of good quality. The problem is, it's still a crap-shoot how to get them into a good state.

      • hu3 0 minutes ago
        Check context size. LLMs get error-prone fast when their memory is full. Just like humans.

        In VSCode Copilot you can keep track of how many tokens the LLM is dealing with in real time with "Chat Debug". I know when it reaches 90k tokens I should expect degraded intelligence and brace for a possible forced summarization. Sometimes I just stop the LLM and start over to continue the work.
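
        As a rough illustration of that rule of thumb, here is a sketch that flags when to start over. The 90k figure is the one quoted above, not a universal limit, and the four-characters-per-token estimate is an assumption for illustration only; real tokenizers differ.

```python
DEGRADE_THRESHOLD = 90_000  # tokens; the figure quoted above, not a universal limit
CHARS_PER_TOKEN = 4         # crude heuristic, assumed for illustration only


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def should_start_over(messages: list, threshold: int = DEGRADE_THRESHOLD) -> bool:
    """Return True once the estimated context size reaches the threshold."""
    total = sum(estimate_tokens(m) for m in messages)
    return total >= threshold
```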

      • alienbaby 6 minutes ago
        I'm curious in what kind of situations you are seeing the model do the opposite of your intention consistently where the instructions were not complex. Do you have any examples?
    • AlexB138 1 hour ago
      This is essentially my exact workflow. I also keep the plan markdown files around in the repo to refer agents back to when adding new features. I have found it to be a really effective loop, and a great way to reprime context when returning to features.
      • mstank 1 hour ago
        Exactly this. I clear the old plans every few weeks.

        For really big features or plans I’ll ask the agent to create Linear issue tickets to track progress for each phase over multiple sessions. The only MCP I have loaded is usually Linear, but I'm looking for a good way to transition it to a skill.

        • AlexB138 56 minutes ago
          Ah, that's a great idea. I've just been having the agent add a Progress section to the plan files and checking things off as we work.
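
          A small sketch of reading such a Progress section, assuming the items are written as Markdown checkboxes (`- [x]` / `- [ ]`). The plan text and the `plan_progress` helper are made up for illustration.

```python
import re

# Matches Markdown task-list items such as "- [x] done" or "* [ ] todo".
CHECKBOX = re.compile(r"^\s*[-*] \[( |x|X)\] (.+)$")


def plan_progress(markdown: str):
    """Return (done, total, remaining item texts) for checkbox items."""
    done, total, remaining = 0, 0, []
    for line in markdown.splitlines():
        match = CHECKBOX.match(line)
        if not match:
            continue
        total += 1
        if match.group(1).lower() == "x":
            done += 1
        else:
            remaining.append(match.group(2).strip())
    return done, total, remaining


EXAMPLE_PLAN = """\
## Progress
- [x] Phase 1: add schema migration
- [ ] Phase 2: wire up API endpoint
- [ ] Phase 3: add integration tests
"""

done, total, remaining = plan_progress(EXAMPLE_PLAN)
```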
      • redrove 51 minutes ago
        I use an Obsidian MCP to essentially keep a database of plans, or versions sometimes that I can just fire off.
    • zeroCalories 14 minutes ago
      I agree this can work okay, but once I find myself doing this much handholding I would prefer to drive the process myself. Coordinating 4 agents and guiding them along really makes you appreciate the mythical-man-month on the scale of hours.
  • Aurornis 1 hour ago
    > Making a prompt library useful requires iteration. Every time the LLM is slightly off target, ask yourself, "What could've been clarified?" Then, add that answer back into the prompt library.

    I'm far from an LLM power user, but this is the single highest ROI practice I've been using.

    You have to actually observe what the LLM is trying to do each time. Simply smashing enter over and over again or setting it to auto-accept everything will just burn tokens. Instead, see where it gets stuck and add a short note to CLAUDE.md or equivalent. Break it out into sub-files to open for different types of work if the context file gets large.

    Letting the LLM churn and experiment for every single task will make your token quota evaporate before your eyes. Updating the context file constantly is some extra work for you, but it pays off.

    My primary use case for LLMs is exploring code bases and giving me summaries of which files to open, tracing execution paths through functions, and handing me the info I need. It also helps a lot to add some instructions for how to deliver useful results for specific types of questions.
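
    One way to keep that habit cheap is a small helper that files each new note under a heading and skips duplicates. This is only a sketch: it works on the file's text rather than the file itself, and the `add_note` name and the `## Section` layout are assumptions, not a format that CLAUDE.md requires.

```python
def add_note(text: str, section: str, note: str) -> str:
    """Return `text` with `note` added as a bullet under `## section`.

    The text is returned unchanged if the exact bullet is already present.
    """
    lines = text.splitlines()
    bullet = f"- {note}"
    if bullet in lines:
        return text
    heading = f"## {section}"
    if heading in lines:
        # Insert after the bullets that directly follow the heading.
        idx = lines.index(heading) + 1
        while idx < len(lines) and lines[idx].startswith("- "):
            idx += 1
        lines.insert(idx, bullet)
    else:
        # Start a new section at the end, separated by a blank line.
        if lines and lines[-1] != "":
            lines.append("")
        lines += [heading, bullet]
    return "\n".join(lines) + "\n"
```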

    • CPLX 1 hour ago
      I'm with you on that, but I have to say I have been doing that aggressively, and it's pretty easy for Claude Code at least to ignore the prompts, commands, Markdown files, README, architecture docs, etc.

      I feel like I spend quite a bit of time telling the thing to look at information it already knows. And I'm talking about when I HAVE actually created various documents to use and prompts.

      As a specific example, it regularly just doesn't reference CLAUDE.md and it seems pretty random as to when it decides to drop that out of context. That's including right at session start when it should have it fresh.

      • Aurornis 1 hour ago
        > and it's pretty easy for Claude Code at least to ignore the prompts, commands, Markdown files, README, architecture docs, etc.

        I would agree with that!

        I've been experimenting with having Claude re-write those documents itself. It can take simple directives and turn them into hierarchical Markdown lists that have multiple bullet points. It's annoying and overly verbose for humans to read, but the repetition and structure seem to help the LLM.

        I also interrupt it and tell it to refer back to CLAUDE.md if it gets too off track.

        Like I said, though, I'm not really an LLM power user. I'd be interested to hear tips from others with more time on these tools.

      • zarp 1 hour ago
        > it seems pretty random as to when it decides to drop that out of context

        Overcoming this kind of nondeterministic behavior around creating/following/modifying instructions is the biggest thing I wish I could solve with my LLM workflows. It seems like you might be able to do this through a system of Claude Code hooks, but I've struggled with finding a good UX for maintaining a growing and ever-changing collection of hooks.

        Are there any tools or harnesses that attempt to address this and allow you to "force" inject dynamic rules as context?

        • lkjdsklf 35 minutes ago
          Wouldn't it be great if we had some kind of deterministic language to precisely and concisely tell a computer what to do
      • kierangill 1 hour ago
        Agreed here. A key theme, which isn’t terribly explicit in this post, is that your codebase is your context.

        I’ve found that when my agent flies off the rails, it’s due to an underlying weakness in the construction of my program. The organization of the codebase doesn’t implicitly encode the “map”. Writing a prompt library helps to overcome this weakness, but I’ve found that the most enduring guidance comes from updating the codebase itself to be more discoverable.

        • fragmede 1 hour ago
          > my agent flies off the rails

          Which, I've had it delete the entire project including .git out of "shame", so my claude doesn't get permission to run rm anymore.

          Codex has fewer levers but it's deleted my entire project twice now.

          (Play with fire, you're gonna get burnt.)

          • CPLX 44 minutes ago
            Wait, what? Can you please describe this shame incident?

            Also, I have extremely frequent commits and version control syncs to GitHub and so on as part of the process (including when it's working on documents or things that aren't code) as a way to counteract this.

            Although I suppose a sufficiently devious AI can get around those, it seems to not have been a problem.

      • candiddevmike 33 minutes ago
        Because, in my experience/conspiracy theory, the model providers are trying to make the models function better without having to have these kinds of workarounds. And so there's a disconnect where folks are adding more explicit instructions and the models are being trained to effectively ignore them under the guise of using their innate intuition/better learning/mixture of experts.
  • __MatrixMan__ 12 minutes ago
    I'm interested to see where we'll land re: organizing larger codebases to accommodate agents.

    I've been having a lot of fun taking my larger projects and decomposing them into directed graphs where the nodes are nix flakes. If I launch claude code in a flake devshell it has access to only those tools, and it sees the flake.nix and assumes that the project is bounded by the CWD even though it's actually much larger, so its context is small and it doesn't get overwhelmed.

    Inputs/outputs are a nice language agnostic mechanism for coordinating between flakes (just gotta remember to `nix flake update --update-input` when you want updated outputs from an adjacent flake). Then I can have them write feature requests for each other and help each other test fixtures and features. I also like watching them debate over a design, they get lazy and assume the other "team" will do the work, but eventually settle on something reasonable.

    I've been running with the idea for a few weeks, maybe it's dumb, but I'd be surprised if this kind of rethinking didn't eventually yield a radical shift in how we organize code, even if the details look nothing like what I've come up with.

  • lnx01 8 minutes ago
    LLMs are so good at telling me about things I know little to nothing about, but when I ask about things I have expert knowledge on they consistently fail, hallucinate, and confidently lie...
  • pron 45 minutes ago
    > Here's a LLM literacy dipstick: ask a peer engineer to read some code they're unfamiliar with. Do they understand it? ... No? Then the LLM won't either.

    Of course, but the problem is the converse: There are too many situations where a peer engineer will know what to do but the agent won't. This means that it requires more work to make a codebase understandable to a human than it does to make it understandable to an agent.

    > Moving more implementation feedback from human to computer helps us improve the chance of one-shotting... Think of these as bumper rails. You can increase the likelihood of an LLM reaching the bowling pins by making it impossible to land in the gutter.

    Sort of, but this is also a little similar to claiming that P = NP. Having an efficient way to reliably check if a solution is correct is not at all the same as having a reliable way to find a solution. It's the theory of computation that tells us that it probably isn't. The likelihood may well be higher yet still not high enough. Even though theoretically NP problems are strictly easier than EXPTIME ones, in practice, in many situations (though not all) they are equally intractable.

    In fact, we can put the claim to the test: there are languages, like ATS and Idris, that make almost any property provable and checkable. These languages let the programmer (human or machine) position the "bumper rails" so precisely as to ensure we hit the target. We can ask the agent to write the code, write the proof of correctness, and check it. We'd still need to check that the correctness property is the right one, but if the claim is correct, coding agents should be best at writing code in ATS or Idris, accompanied by correctness proofs. Are they?

    Obviously, mileage may vary depending on the task and the domain, but if it's true that coding models will get significantly better, then the best course of action may well be, in many cases, to just wait until they do rather than spend a lot of effort working around their current limitations, effort that will be wasted if and when capabilities improve. And that's the big question: are we in for a long haul where agent capabilities remain roughly where they are today or not?
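
    The gap between checking and finding can be illustrated with subset sum, used here as a stand-in example rather than anything from the article. Verifying a proposed answer takes one pass, while the naive search below may try every subset. The `verify` and `find` helpers are hypothetical.

```python
from itertools import combinations
from typing import Optional


def verify(numbers: list, subset: list, target: int) -> bool:
    """Cheap check: is `subset` drawn from `numbers`, and does it sum to `target`?"""
    pool = list(numbers)
    for x in subset:
        if x not in pool:
            return False
        pool.remove(x)
    return sum(subset) == target


def find(numbers: list, target: int):
    """Brute-force search; returns (solution or None, candidates checked)."""
    checked = 0
    solution: Optional[list] = None
    for size in range(len(numbers) + 1):
        for candidate in combinations(numbers, size):
            checked += 1
            if sum(candidate) == target:
                solution = list(candidate)
                return solution, checked
    return solution, checked


NUMBERS = [3, 34, 4, 12, 5, 2]
solution, _ = find(NUMBERS, 9)
no_solution, tried = find(NUMBERS, 1000)
```

    With six numbers, the failed search has to look at all 2^6 = 64 subsets before giving up, while `verify` stays a single pass regardless of how the candidate was produced.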

  • dmofp 20 minutes ago
    I have a somewhat different take on this (somewhat captured in the post linked below).

    IMO, the best way to raise the floor of LLM performance in codebases is by building meaning into the code base itself, à la DDD. If your codebase is hard to understand and grok for a human, it will be the same for an LLM. If your codebase is unstructured and has no definable patterns, it will be harder for an LLM to use.

    You can try to overcome this with even more tooling and more workflows but IMO, it is throwing good money after bad. It is ironic and maybe unpopular, but it turns out LLMs prove that all the folks yapping about language and meaning (re: DDD) were right.

    DDD & the Simplicity Gospel:

    https://oluatte.com/posts/domain-driven-design-simplicity-go...

  • tracker1 34 minutes ago
    Just over the weekend, I decided to shell out for the top tier Claude Code to give it a try... definitely an improvement over the year I spent with GitHub Copilot enabled on my personal projects (mostly an annoyance more than a help that I eventually disabled altogether).

    I've seen some impressive output so far, and have a couple friends that have been using AI generation a lot... I'm trying to create a couple legacy (BBS tech related, in Rust) applications to see how they land. So far mostly planning and structure beyond the time I've spent in contemplation. I'm not sure I can justify the expense long term, but wanting to experience the fuss a bit more to have at least a better awareness.

  • EastLondonCoder 36 minutes ago
    I’ve ended up with a workflow that lines up pretty closely with the guidance/oversight framing in the article, but with one extra separation that’s been critical for me.

    I’m working on a fairly messy ingestion pipeline (Instagram exports → thumbnails → grouped “posts” → frontend rendering). The data is inconsistent, partially undocumented, and correctness is only visible once you actually look at the rendered output. That makes it a bad fit for naïve one-shotting.

    What’s worked is splitting responsibility very explicitly:

    • Human (me): judge correctness against reality. I look at the data, the UI, and say things like “these six media files must collapse into one post”, “stories should not appear in this mode”, “timestamps are wrong”. This part is non-negotiably human.

    • LLM as planner/architect: translate those judgments into invariants and constraints (“group by export container, never flatten before grouping”, “IG mode must only consider media/posts/*”, “fallback must never yield empty output”). This model is reasoning about structure, not typing code.

    • LLM as implementor (Codex-style): receives a very boring, very explicit prompt derived from the plan. Exact files, exact functions, no interpretation, no design freedom. Its job is mechanical execution.

    Crucially, I don’t ask the same model to both decide what should change and how to change it. When I do, rework explodes, especially in pipelines where the ground truth lives outside the code (real data + rendered output).

    This also mirrors something the article hints at but doesn’t fully spell out: the codebase isn’t just context, it’s a contract. Once the planner layer encodes the rules, the implementor can one-shot surprisingly large changes because it’s no longer guessing intent.

    The challenges are mostly around discipline:

    • You have to resist letting the implementor improvise.

    • You have to keep plans small and concrete.

    • You still need guardrails (build-time checks, sanity logs) because mistakes are silent otherwise.

    But when it works, it scales much better than long conversational prompts. It feels less like “pair programming with an AI” and more like supervising a very fast, very literal junior engineer who never gets tired, which, in practice, is exactly what these tools are good at.
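
    A minimal sketch of what such guardrails might look like for the invariants quoted above. The record fields (`container`, `path`) and both helpers are hypothetical; they are not the pipeline's actual code.

```python
from collections import defaultdict


def group_posts(media: list) -> dict:
    """Group media files by export container, without flattening first."""
    posts = defaultdict(list)
    for item in media:
        posts[item["container"]].append(item["path"])
    return dict(posts)


def check_invariants(media: list, posts: dict) -> list:
    """Return the violated invariants; an empty list means every check passed."""
    problems = []
    if media and not posts:
        problems.append("fallback must never yield empty output")
    for paths in posts.values():
        for path in paths:
            if not path.startswith("media/posts/"):
                problems.append(f"{path}: IG mode must only consider media/posts/*")
    grouped = sum(len(paths) for paths in posts.values())
    if grouped != len(media):
        problems.append(f"grouped {grouped} files but received {len(media)}")
    return problems


MEDIA = [
    {"container": "post_1", "path": "media/posts/a.jpg"},
    {"container": "post_1", "path": "media/posts/b.jpg"},
    {"container": "post_2", "path": "media/stories/c.jpg"},
]
posts = group_posts(MEDIA)
problems = check_invariants(MEDIA, posts)
```

    Run at build time, a non-empty `problems` list makes an otherwise silent mistake loud, and the implementor can be told to keep it empty.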

  • victorbjorklund 45 minutes ago
    Biggest change to my workflow has been to break down projects to smaller parts using libraries. So where I in the past would put everything in the same code base I now break down stuff that can be separate to its own libraries (like wrapping an external API). That way the AI only needs to read the docs for the library instead of having to read all the code when working on features that use the API.
  • mym1990 1 hour ago
    It's kind of crazy that the knee-jerk reaction to failing to one-shot your prompt is to abandon the whole thing because you think the tool sucks. It very well might, but it could also be user error or a number of other things. There wouldn't be a good night's sleep in sight if I knew an LLM was running rampant all over production code in an effort to "scale it".
    • zeroonetwothree 1 hour ago
      There’s always a trade off in terms of alternative approaches. So I don’t think it’s “crazy” that if one fails you switch to a different one. Sure, sometimes persistence can pay off, but not always.

      Like if I go to a restaurant for the first time and the item I order is bad, could I go back and try something else? Perhaps, but I could also go somewhere else.

    • t_tsonev 1 hour ago
      I'm okay with writing developer docs in the form of agent instructions, those are useful for humans too. If they start to get oddly specific or sound mental, then it's obviously the tool at fault.
  • andrewmutz 1 hour ago
    The issues raised in this article are why I think highly-opinionated frameworks will lead to higher developer productivity when using AI-assisted coding.

    You may not like all the opinions of the framework, but the LLM knows them and you don’t need to write up any guidelines for it.

  • vivin 1 hour ago
    You can't get away from the engineering part of software engineering even if you are using LLMs. I have been using Claude Opus 4.5, and it's the best out of the models I have tried. I find that I can get Claude to work well if I already know the steps I need to do beforehand, and I can get it to do all of the boring stuff. So it's a series of very focused and directed one-shot prompts that it largely gets correct, because I'm not giving it a huge task, or something open-ended.

    Knowing how you would implement the solution beforehand is a huge help, because then you can just tell the LLM to do the boring/tedious bits.

    • ericmcer 33 minutes ago
      Seriously, I stopped agent mode altogether. I hit it with very specific requests like: write a function that takes an array of X and returns y.

      It almost never fails and usually does it in a neat way, plus it's ~50 lines of code so I can copy and paste confidently. Letting the agent just go wild on my code has always been a PITA for me.

    • teaearlgraycold 1 hour ago
      They’re good for getting you from A to B. But you need to know A (current state of the code) and how to get to B (desired end state). They’re fast typers not automated engineers.
  • smallerize 1 hour ago
    This highlights a missing feature of LLM tooling, which is asking questions of the user. I've been experimenting with Gemini in VS Code, and it just fills in missing information by guessing and then runs off writing paragraphs of design and a bunch of code changes that could have been avoided by asking for clarification at the beginning.
    • tharkun__ 41 minutes ago
      So like most junior to mid level devs ;)

      Claude does have this specific interface for asking questions now. I've only had it choose to ask me questions on its own a very few times though. But I did have it ask clarifying questions before that interface was even a thing, when I specifically asked it to ask me clarifying questions.

      Again, like a junior dev. And like a junior dev, it can also help to ask it to ask / check what it's doing "mid-way", i.e. watch what it's doing and stop it, when it's running down some rabbit hole you know is not gonna yield results.

    • skolos 1 hour ago
      Claude Code regularly asks me questions. I like how Anthropic implemented this.
      • rockbruno 1 hour ago
        Yeah I experienced this yesterday and it was really cool. It really only happened once though.
    • pteetor 49 minutes ago
      For complicated prompts, I always add this:

      "Before you start, please ask me any questions you have about this so I can give you more context. Be extremely comprehensive."

      (I got the idea from a Medium article[1].) The LLM will, indeed, stop and ask good questions. It often notices what I've overlooked. Works very well for me!

      [1] https://medium.com/@jordan_gibbs/the-most-important-chatgpt-...
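
      If you add that line often, a tiny helper can append it for you. This is just a sketch; the function name is made up, and the suffix is the sentence quoted above.

```python
CLARIFY_SUFFIX = (
    "Before you start, please ask me any questions you have about this "
    "so I can give you more context. Be extremely comprehensive."
)


def with_clarifying_questions(prompt: str, suffix: str = CLARIFY_SUFFIX) -> str:
    """Append the clarifying-questions request unless it is already there."""
    prompt = prompt.rstrip()
    if prompt.endswith(suffix):
        return prompt
    return f"{prompt}\n\n{suffix}"
```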

    • zvorygin 1 hour ago
      Append “First ask clarifying questions” to your prompt.
  • tschellenbach 1 hour ago
    I wrote this forever ago in AI terms :) https://getstream.io/blog/cursor-ai-large-projects/

    But the summary here is that with the right guidance, AI currently crushes it on large codebases.

  • CuriouslyC 1 hour ago
    STAN'd to the top.

    Decent article but it feels like a LinkedIn rehashing of stuff the people at the edge have already known for a while.

    • Aurornis 1 hour ago
      > but it feels like a linkedin rehashing of stuff the people at the edge have already known for a while.

      You're not wrong, but it bears repeating to newcomers.

      The average LLM user I encounter is still just hammering questions into the prompt and getting frustrated when the LLM makes the same mistakes over and over again.

  • uoaei 51 minutes ago
    What is the current state of LCMs (large code models)? I.e. models that operate on the AST and not on text tokens.
  • rootnod3 1 hour ago
    Or why you shouldn't....