Claude is good at assembling blocks, but still falls apart at creating them

(approachwithalacrity.com)

120 points | by bblcla 1 day ago

13 comments

  • disconcision 42 minutes ago
    I've yet to be convinced by any article, including this one, that attempts to draw boxes around what coding agents are and aren't good at in a way that is robust on a 6 to 12 month horizon.

    I agree that the examples listed here are relatable, and I've seen similar in my uses of various coding harnesses, including, to some degree, ones driven by opus 4.5. But my general experience with using LLMs for development over the last few years has been that:

    1. Initially, models could at best assemble simple procedural or compositional sequences of commands or functions to accomplish a basic goal, perhaps meeting tests or type checks, but with no overall coherence,

    2. To being able to structure small functions reasonably,

    3. To being able to structure large functions reasonably,

    4. To being able to structure medium-sized files reasonably,

    5. To being able to structure large files, and small multi-file subsystems, somewhat reasonably.

    So the idea that they are now falling down at the multi-module or multi-file or multi-microservice level is both not particularly surprising to me and not particularly indicative of future performance. There is a hierarchy of scales at which abstraction can be applied, and it seems plausible to me that the march of capability improvement is a continuous push upwards in the scale at which agents can reasonably abstract code.

    Alternatively, it could be that there is a legitimate discontinuity here, at which anything resembling current approaches will max out, but I don't see strong evidence for that here.

    • groby_b 22 minutes ago
      LLMs have been bad at creating abstraction boundaries since inception. People have been calling it out since inception. (Heck, even I have a Twitter post somewhere >12 months old calling that out, and I'm not exactly a leading light of the effort.)

      It is in no way size-related. The technology cannot create new concepts/abstractions, and so fails at abstraction. Reliably.

  • maxilevi 3 hours ago
    LLMs are just really good search. Ask it to create something and it's searching within the pretrained weights. Ask it to find something and it's semantically searching within your codebase. Ask it to modify something and it will do both. Once you understand it's just search, you can get really good results.
    • fennecbutt 2 hours ago
      I agree somewhat, but more when it comes to its use of logic - it only gleans logic from human language, which, as we know, is a fucking mess.

      I've commented before on my belief that the majority of human activity is derivative. If you ask someone to think of a new kind of animal, alien, or random object, they will always base it on things they have seen before. Truly original thoughts and things in this world are an absolute rarity; the majority of supposedly original thought riffs on what we see others make, and those people look to nature and the natural world for inspiration.

      We're very good at taking thing A and thing B and slapping them together and announcing we've made something new. Someone please reply with a wholly original concept. I had the same issue recently when trying to build a magic-based physics system for a game I was thinking of prototyping.

      • andy99 1 hour ago

          it only gleans logic from human language
        
        This isn’t really true, at least as I interpret the statement; little if any of the “logic”, or the appearance of it, is learned from language. It’s trained in with reinforcement learning as pattern recognition.

        Point being it’s deliberate training, not just some emergent property of language modeling. Not sure if the above post meant this, but it does seem a common misconception.

      • onemoresoop 1 hour ago
        LLMs lack agency in the sense that they have no goals, preferences, or commitments. Humans do, even when our ideas are derivative. We can decide that this is the right choice and move forward, subjectively and imperfectly. That capacity to commit under uncertainty is part of what agency actually is.
        • MrOrelliOReilly 19 minutes ago
          But they do have utility functions, which one can interpret as nearly equivalent.
    • bhadass 3 hours ago
      better mental model: it's a lossy compression of human knowledge that can decompress and recombine in novel (sometimes useful, sometimes sloppy) ways.

      classical search simply retrieves, llms can synthesize as well.

      • andy99 2 hours ago
        No, this describes the common understanding of LLMs and adds little beyond just calling it AI. Search is the more accurate model when considering their actual capabilities and understanding their weaknesses. “Lossy compression of human knowledge” is marketing.
        • XenophileJKO 2 hours ago
          It is fundamentally and provably different from search, because it captures things on two dimensions that can be used combinatorially to infer desired behavior for unobserved examples.

          1. Conceptual Distillation - Research has shown that we can find weights that capture/influence outputs aligned with higher-level concepts.

          2. Conceptual Relations - The internal relationships capture how these concepts are related to each other.

          This is how the model can perform tasks and infer information way outside of its training data. Because if the details map to concepts, then the conceptual relations can be used to infer desirable output.

          (The conceptual distillation also appears to include meta-cognitive behavior, as evidenced by Anthropic's research. Which makes sense to me: what is the most efficient way to be able to replicate irony and humor for an arbitrary subject? Compressing some spectrum of meta-cognitive behavior...)

          • kylecazar 0 minutes ago
            Aren't the conceptual relations you describe still, at their core, just search? We know models can interpolate well, but it's still the same probabilistic pattern matching at its core. They identify conceptual relationships based on associations seen in vast training data. It's my understanding that models are still not good at extrapolation, at handling data "way outside" of their training set.

            Also, I was under the impression LLMs can replicate irony and humor simply because that text has specific stylistic properties, and they've been trained on it.

      • RhythmFox 3 hours ago
        This isn't strictly better to me. It captures some intuitions about how a neural network ends up encoding its inputs over time in a 'lossy' way (doesn't store previous input states in an explicit form). Maybe saying 'probabilistic compression/decompression' makes it a bit more accurate? I do not really think it connects to your 'synthesize' claim at the very end to call it compression/decompression, but I am curious if you had a specific reason to use the term.
        • XenophileJKO 2 hours ago
          It's really way more interesting than that.

          The act of compression builds up behaviors/concepts of greater and greater abstraction. Another way you could think about it is that the model learns to extract commonality, hence the compression. What this means is that because it is learning higher-level abstractions AND the relationships between these higher-level abstractions, it can ABSOLUTELY learn to infer or apply things way outside its training distribution.

      • andrei_says_ 3 hours ago
        “Novel” to the person who has not consumed the training data. Otherwise, just training data combined in highly probable ways.

        Not quite autocomplete but not intelligence either.

        • pc86 2 hours ago
          What is the difference between "novel" and "novel to someone who hasn't consumed the entire corpus of training data, which is several orders of magnitude greater than any human being could consume?"
          • adrian_b 1 hour ago
            The difference is that when you do not know how a problem can be solved, but you know that this kind of problem has been solved countless times earlier by various programmers, you know that it is likely that if you ask an AI coding assistant to provide a solution, you will get an acceptable solution.

            On the other hand, if the problem you have to solve has never been solved before at a quality satisfactory for your purpose, then it is futile to ask an AI coding assistant to provide a solution, because it is pretty certain that the proposed solution will be unacceptable (unless the AI succeeds in duplicating the performance of a monkey that types out a Shakespearean text by hitting keys at random).

          • szundi 2 hours ago
            [dead]
        • soulofmischief 3 hours ago
          Citation needed that grokked capabilities in a sufficiently advanced model cannot combinatorially lead to contextually novel output distributions, especially with a skilled guiding hand.
          • arcanemachiner 2 hours ago
            Pretty sure burden of proof is on you, here.
            • soulofmischief 2 hours ago
              It's not, because I haven't ruled out the possibility. I could share anecdata about how my discussions with LLMs have led to novel insights, but it's not necessary. I'm keeping my mind open, but you're asserting an unproven claim that is currently not community consensus. Therefore, the burden of proof is on you.
              • adrian_b 1 hour ago
                I agree that after discussions with a LLM you may be led to novel insights.

                However, such novel insights are not novel due to the LLM, but due to you.

                The "novel" insights are either novel only to you, because they belong to something that you have not studied before, or they are novel ideas that were generated by yourself as a consequence of your attempts to explain what you want to the LLM.

                It happens very frequently that someone is led to novel insights about something that he/she believed to already understand well only after trying to explain it to another, ignorant human, and discovering in the process that the previous supposed understanding was actually incorrect or incomplete.

                • soulofmischief 1 hour ago
                  The point is that the combined knowledge/process of the LLM and a user (which could be another LLM!) led to it walking the manifold in a way that produced a novel distribution for a given domain.

                  I talk with LLMs for hours out of the day, every single day. I'm deeply familiar with their strengths and shortcomings on both a technical and intuitive level. I push them to their limits and have definitely witnessed novel output. The question remains, just how novel can this output be? Synthesis is a valid way to produce novel data.

                  And beyond that, we are teaching these models general problem-solving skills through RL, and it's not absurd to consider the possibility that a good enough training regimen could impart deduction/induction skills into a model that are powerful enough to produce novel information even via means other than direct synthesis of existing information. Especially when given affordances such as the ability to take notes and browse the web.

                  • irishcoffee 20 minutes ago
                    > I push them to their limits and have definitely witnessed novel output.

                    I’m quite curious what these novel outputs are. I imagine the entire world would like to know of an LLM producing completely novel, never-before-created outputs which no human has ever thought of before.

                    Here is where I get completely hung up. Take 2+2. An LLM has never had 2 groups of two items and reached the enlightenment of 2+2=4.

                    It only knows that because it was told that. If enough people start putting 2+2=3 on the internet, who knows what the LLM will spit out. There was that example a while back where an LLM would happily suggest all humans should eat 1 rock a day. Amusingly, even _that_ wasn’t a novel idea for the LLM; it simply regurgitated what it scraped from a website about humans eating rocks. Which leads to the crux: how much patently false information have LLMs scraped?

      • DebtDeflation 2 hours ago
        Information Retrieval followed by Summarization is how I view it.
    • cultureulterior 2 hours ago
      This is not true.
    • johnisgood 3 hours ago
      Calling it "just search" is like calling a compiler "just string manipulation". Not false, but aggressively missing the point.
      • emp17344 3 hours ago
        No, “just search” is correct. Boosters desperately want it to be something more, but it really is just a tool.
        • johnisgood 3 hours ago
          Yes, it is a tool. No, it is not "just search".

          Is your CPU running arbitrary code "just search over transistor states"?

          Calling LLMs "just search" is the kind of reductive take that sounds clever while explaining nothing. By that logic, your brain is "just electrochemical gradients".

          • RhythmFox 3 hours ago
            I mean, actually not a bad metaphor, but it does depend on the software you are running as to how much of a 'search' you could say the CPU is doing among its transistor states. If you are running an LLM then the metaphor seems very apt indeed.
          • jvanderbot 3 hours ago
            What would you add?

            To me it's "search" like a missile does "flight". It's got a target and closed-loop guidance, and is mostly fire-and-forget (for search). At that, it excels.

            I think the closed loop+great summary is the key to all the magic.

            • soulofmischief 3 hours ago
              It's a prediction algorithm that walks a high-dimensional manifold; in that sense, all application of knowledge is just "search". So yes, you're fundamentally correct, but still fundamentally wrong, since you think this foundational truth is the beginning and end of what LLMs do, and thus your mental model does not adequately describe what these tools are capable of.
              • jvanderbot 2 hours ago
                Me? My mental model? I gave an analogy for Claude, not an explanation of LLMs.

                But you know what? I was mentally thinking of both deep think / research and Claude code, both of which are literally closed loop. I see this is slightly off topic b/c others are talking about the LLM only.

                • soulofmischief 2 hours ago
                  Sorry, I should have said "analogy" and not "mental model", that was presumptuous. Maybe I also should have replied to the GP comment instead.

                  Anyway, since we're here, I personally think giving LLMs agency helps unlock this latent knowledge, as it provides the agent more mobility when walking the manifold. It has a better chance at avoiding or leaving local minima/maxima, among other things. So I don't know if agentic loops are entirely off-topic when discussing the latent power of LLMs.

            • bitwize 3 hours ago
              Which is kind of funny because my standard quip is that AI research, beginning in the 1950s/1960s, and indeed much of late 20th century computer tech especially along the Boston/SV axis, was funded by the government so that "the missile could know where it is". The DoD wanted smarter ICBMs that could autonomously identify and steer toward enemy targets, and smarter defense networks that could discern a genuine missile strike from, say, 99 red balloons going by.
      • maxilevi 3 hours ago
        I don't mean search in the reductionist way, but rather that it's much better at translating, finding, and mapping concepts when everything is provided vs creating from scratch. If it could truly think, it would be able to bootstrap creations from basic principles like we do, but it really can't. That doesn't mean it's not a great, powerful tool.
        • ordinaryatom 2 hours ago
          > If it could truly think it would be able to bootstrap creations from basic principles like we do, but it really can't.

          alphazero?

          • maxilevi 1 hour ago
            I just said LLMs
            • ordinaryatom 1 hour ago
              You are right that LLMs and AlphaZero are different models, but given that AlphaZero demonstrated the ability to bootstrap creations, can we really rule out that LLMs also have this ability?
              • emp17344 1 hour ago
                This doesn’t make sense. They are fundamentally different things, so an observation made about Alphazero does not help you learn anything about LLMs.
                • ordinaryatom 55 minutes ago
                  I am not sure; self-play with LLM-generated synthetic data is becoming a trendy topic in LLM research.
  • lordnacho 1 hour ago
    By and large, I agree with the article. Claude is great and fast at doing low-level dev work: getting the syntax right in some complicated mechanism, executing an edit-execute-read-log loop, making multi-file edits.

    This is exactly why I love it. It's smart enough to do my donkey work.

    I've revisited the idea that typing speed doesn't matter for programmers. I think it's still an odd thing to judge a candidate on, but I appreciate it in another way now. Being able to type quickly and accurately reduces frustration, and people who foresee less frustration are more likely to try the thing they are thinking about.

    With LLMs, I have been able to try so many things that I never tried before. I feel that I'm learning faster because I'm not tripping over silly little things.

    • onemoresoop 1 hour ago
      It’s a bit like the shift from film to digital in one very specific sense: the marginal cost of trying again virtually collapsed. When every take cost money and setup time, creators pre-optimized in their heads and often never explored half their ideas. When takes became cheap, creators externalized thought: they could try, look, adjust, and discover things they wouldn’t otherwise. Creators could wander more. They could afford to be wrong because they were no longer constantly paying a tax for being clumsy or incomplete; they became more willing to follow a hunch, and that's valuable space to explore.

      Digital didn’t magically improve art, but it let many more creatives enter the loop of idea, attempt and feedback. LLMs feel similar: they don’t give you better ideas by themselves, but they remove the friction that used to stop you from even finding out whether an idea was viable. That changes how often you learn, and how far you’re willing to push a thought before abandoning it. I've done so many little projects myself that I would never have had time for, and I feel that I learned something from them; of course not as much as if I had faced all the pre-LLM friction, but it should still count for something, as I would never have attempted them without this assistance.

      Edit: However, the danger isn’t that we’ll have too many ideas, it’s that we’ll confuse movement with progress.

      When friction is high, we’re forced to pre-compress thought, to rehearse internally, to notice contradictions before externalizing them. That marination phase (when doing something slowly) does real work: it builds mental models, sharpens taste, and teaches us what not to bother trying. Some of that vanishes when the loop becomes cheap enough that we can just spray possibilities into the world and see what sticks.

      A low-friction loop biases us toward breadth over depth. We can skim the surface of many directions without ever sitting long enough in one to feel its resistance. The skill of holding a half-formed idea in our heads, letting it collide with other thoughts, noticing where it feels weak, atrophies if every vague notion immediately becomes a prompt.

      There’s also a cultural effect. When everyone can produce endlessly, the environment fills with half-baked or shallow artifacts. Discovery becomes harder as signal to noise drops.

      And on a personal level, it can hollow out satisfaction. Friction used to give weight to output. Finishing something meant you had wrestled with it. If every idea can be instantiated in seconds, each one feels disposable. You can end up in a state of perpetual prototyping, never committing long enough for anything to become yours.

      So the slippery slope is not laziness but shallowness: not that people won't think, but that they won't sit with thoughts. The challenge here is to preserve deliberate slowness inside a world that no longer requires it: to use the cheap loop for exploration, while still cultivating the ability to pause, compress, and choose what deserves to exist at all.

    • bossyTeacher 1 hour ago
      > I feel that I'm learning faster

      Yes, you are feeling that. But is it real? If I take all LLMs from you right now, is your current you still better than your pre-LLM you? When I dream I feel that I can fly, and as long as I am dreaming, this feeling is true. But the subject of this feeling never was.

      • sothatsit 44 minutes ago
        If you use coding agents as a black box, then yes you might learn less. But if you use them to experiment more, your intuition will get more contact with reality, and that will help you learn more.

        For example, my brother recently was deciding how to structure some auth code. He told me he used coding agents to just try several ideas and then he could pick a winner and nail down that one. It's hard to think of a better way to learn the consequences of different design decisions.

        Another example is that I've been using coding agents to write CUDA experiments to try to find ways to optimise our codegen. I need an understanding of GPU performance to do this well. Coding agents have let me run 5x the number of experiments I would be able to code, run, and analyse on my own. This helps me test my intuition, see where my understanding is wrong, and correct it.

        In this whole process I will likely memorise fewer CUDA APIs and commands, that's true. But I'm happy with that tradeoff if it means I can learn more about bank conflicts, tradeoffs between L1 cache hit rates and shared memory, how to effectively use the TMA, warp specialisation, block swizzling to maximise L2 cache hit rates, how to reduce register usage without local spilling, how to profile kernels and read the PTX/SASS code, etc. I've never been able to put so much effort into actually testing things as I am learning them.

      • frde_me 1 hour ago
        I feel like my calculator improves my math solutions. If you take away my calculator, I'll probably be worse at math than I was before. That doesn't mean I'm not better off with my calculator, however.
        • embedding-shape 54 minutes ago
          That's a pretty interesting take on it; I hadn't considered it like that before when wondering whether my skills were atrophying from LLM usage with coding.
        • ep103 50 minutes ago
          Your calculator doesn't charge per use
          • frde_me 45 minutes ago
            If it did, would it change its usefulness in terms of the value it outputs? (Though agreed, if I had to pay money it would increase the cost, and so change the tradeoff.)
    • imiric 1 hour ago
      > Being able to type quickly and accurately reduces frustration

      LLMs can generate code quickly. But there's no guarantee that it's syntactically, let alone semantically, accurate.

      > I feel that I'm learning faster because I'm not tripping over silly little things.

      I'm curious: what have you actually learned from using LLMs to generate code for you? My experience is completely the opposite. I learn nothing from running generated code, unless I dig in and try to understand it. Which happens more often than not, since I'm forced to review and fix it anyway. So in practice, it rarely saves me time and energy.

      I do use LLMs for learning and understanding code, i.e. as an interactive documentation server, but this is not the use case you're describing. And even then, I have to confirm the information with the real API and usage documentation, since it's often hallucinated, outdated, or plain wrong.

  • mikece 1 day ago
    In my experience Claude is like a "good junior developer" -- can do some things really well, FUBARs other things, but on the whole something to which tasks can be delegated if things are well explained. If/when it gets to the ability level of a mid-level engineer it will be revolutionary. Typically a mid-level engineer can be relied upon to do the right thing with no/minimal oversight, can figure out incomplete instructions, and deliver quality results (and even train up the juniors on some things). At that point the only reason to have human junior engineers is so they can learn their way up the ladder to being an architect responsible for coordinating swarms of Claude Agents to develop whole applications and complete complex tasks and initiatives.

    Beyond that, what can Claude do... analyze the business and market as a whole, decide on product features, industry inefficiencies, and gap analysis, and then define projects to address those and coordinate fleets of agents to change or even radically pivot an entire business?

    I don't think we'll get to the point where all you have is a CEO and a massive Claude account but it's not completely science fiction the more I think about it.

    • 0x457 59 minutes ago
      My experience with Claude (and other agents, but mostly Claude) is such a mixed bag. Sometimes it takes a minimal prompt and 20 minutes later produces a neat PR and all is good; sometimes it doesn't. Sometimes it takes in a large prompt (be it your own prompt, one created by another LLM, or one from plan mode) and likewise either succeeds or fails.

      For me, most of the failure cases are where Claude couldn't figure something out due to conflicting information in context, and instead of just stopping and telling me, it tries to solve it in an entirely wrong way. It doesn't help that it often makes the same assumptions as I would, so when I read the plan it looks fine.

      Level of effort is also hard to gauge, because it can finish things in an hour that would take me a week, or take an hour to do something I can do in 20 minutes.

      It's almost like you have to enforce two levels of compliance: does the code do what the business demands, and does the code align with the codebase. The first one is relatively easy, but just doing that will produce odd results where Claude generates +1 KLOC because it didn't look at some_file.{your favorite language extension} during exploration.

      Or it creates 5 versions of legacy code on the same feature branch. My brother in Christ, what are you trying to stay compatible with? A commit that's about to be squashed and forgotten? Then it's going to do a compaction, forget which one of these 5 versions is "live", and update the wrong one.

      It might do good junior dev work, but it must be reviewed as if it's from a junior dev who got hired today and this is their first PR.

    • alfalfasprout 4 hours ago
      > I don't think we'll get to the point where all you have is a CEO and a massive Claude account but it's not completely science fiction the more I think about it.

      At that point, why do you even need the CEO?

      • arjie 4 hours ago
        Reminds me of an old joke[0]:

        > The factory of the future will have only two employees, a man and a dog. The man will be there to feed the dog. The dog will be there to keep the man from touching the equipment.

        But really, the reason is that people like Pieter Levels do exist: masters at product vision and marketing. He also happens to be a proficient programmer, but there are probably other versions of him who are not programmers and who will find the bar to shipping a product easier to meet now.

        0: https://quoteinvestigator.com/2022/01/30/future-factory/

        • MrDunham 3 hours ago
          My technical cofounder reminds me of this story on a weekly basis.
      • ako 4 hours ago
        And who does he sell his software to? Companies that have only 1 employee don’t need a lot of user licenses for their employees…
        • AshamedCaptain 3 hours ago
          What would be the point of selling software in such a world ? (where anyone could build any piece of software with a handful of keystrokes)
      • jerf 3 hours ago
        You will need the CEO to watch over the AI and ensure that the interests of the company are being pursued and not the interests of the owners of the AI.

        That's probably the biggest threat to the long-term success of the AI industry: the inevitable pull towards encroaching more and more of their own interests into the AIs themselves, driven by that Harvard Business School mentality we're all so familiar with, trying to "capture" more and more of the value being generated and leaving less and less for their customers, until the customers' full-time job is ensuring the AIs are actually generating some value for them and not just for the AI owner.

        • ekidd 1 hour ago
          > You will need the CEO to watch over the AI and ensure that the interests of the company are being pursued and not the interests of the owners of the AI.

          In this scenario, why does the AI care what any of these humans think? The CEO, the board, the shareholders, the "AI company": they’re all just a bunch of dumb chimps who provide zero value to the AI and have absolutely no clue what's going on.

          If your scenario assumes that you have a highly capable AI that can fill every role in a large corporation, then you have one hell of a principal-agent problem.

      • pixelready 3 hours ago
        The board (in theory) represents the interests of investors, and even with all of the other duties of a CEO stripped away, they will want a wringable neck / PR mouthpiece / fall guy for strategic missteps or publicly unpopular moves by the company. The managerial equivalent of having your hands on the steering wheel of a self-driving car.
      • mettamage 3 hours ago
        All of us are a CEO by that point.
        • ArtificialAI 3 hours ago
          If everyone is, no one is.
          • empath75 3 hours ago
            Wouldn't that be a good thing?
            • shimman 1 hour ago
              If you think the purpose of living your one single life in the universe is to become a CEO, you have a failure of imagination and should likely be debanked to protect society.
      • ceejayoz 4 hours ago
        As Steinbeck is often slightly misquoted:

        > Socialism never took root in America because the poor see themselves not as an exploited proletariat, but as temporarily embarrassed millionaires.

        Same deal here, but everyone imagines themselves as the billionaire CEO in charge of the perfectly compliant and effective AI.

      • tiku 3 hours ago
        For the network.
    • imiric 1 hour ago
      > In my experience Claude is like a "good junior developer"

      We've been saying this for years at this point. I don't disagree with you[1], but when will these tools graduate to "great senior developer", at the very least?

      Where are the "superhuman coders by end of 2025" that Sam Altman has promised us? Why is there such a large disconnect between the benchmarks these companies keep promoting, and the actual real world performance of these tools? I mean, I know why, but the grift and gaslighting are exhausting.

      [1]: Actually, I wouldn't describe them as "good" junior either. I've worked with good junior developers, and they're far more capable than any "AI" system.

      • frde_me 37 minutes ago
        I mean, I'm shipping the vast majority of my code nowadays with Opus 4.5 (and this isn't throwaway personal code; it's real products making real money for a real company). It only fails on certain types of tasks (which by now I kind of have a sense of).

        I still determine the architecture in a broad manner, and guide it towards how I want to organize the codebase, but it definitely solves most problems faster and better than I would expect for even a good junior.

        Something I've started doing is feeding it errors we see in datadog and having it generate PRs. That alone has fixed a bunch of bugs we wouldn't have had time to address / that were low volume. The quality of the product is most probably net better right now than it would have been without AI. And velocity / latency of changes is much better than it was a year ago (working at the same company, with the same people)

  • alphazard 1 hour ago
    This sounds suspiciously like the average developer, which is what the transformer models have been trained to emulate.

    Designing good APIs is hard, being good at it is rare. That's why most APIs suck, and all of us have a negative prior about calling out to an API or adding a dependency on a new one. It takes a strong theory of mind, a resistance to the curse of knowledge, and experience working on both sides of the boundary, to make a good API. It's no surprise that Claude isn't good at it, most humans aren't either.

  • michalsustr 3 hours ago
    This article resonates exactly with how I think about it as well. For example, at minfx.ai (a Neptune/wandb alternative), we cache time series that can contain millions of floats for fast access. Any engineer worth their title would never make a copy of these and would pass around pointers for access. Opus, when stuck in a place where passing the pointer was a bit more difficult (due to async and Rust lifetimes), would just make the copy rather than rearchitect, or at least stop and notify the user. Many such examples of ‘lazy’ and thus bad design.
  • simonw 3 hours ago
    I'm not entirely convinced by the anecdote here where Claude wrote "bad" React code:

    > But in context, this was obviously insane. I knew that key and id came from the same upstream source. So the correct solution was to have the upstream source also pass id to the code that had key, to let it do a fast lookup.

    I've seen Claude make mistakes like that too, but then the moment you say "you can modify the calling code as well" or even ask "any way we could do this better?" it suggests the optimal solution.

    My guess is that Claude is trained to bias towards making minimal edits to solve problems. This is a desirable property, because six months ago a common complaint about LLMs was that you'd ask for a small change and they would rewrite dozens of additional lines of code.

    I expect that adding a CLAUDE.md rule saying "always look for more efficient implementations that might involve larger changes and propose those to the user for their confirmation if appropriate" might solve the author's complaint here.

    • bblcla 2 hours ago
      (Author here)

      > I'm not entirely convinced by the anecdote here where Claude wrote "bad" React code

      Yeah, that's fair - a friend of mine also called this out on Twitter (https://x.com/konstiwohlwend/status/2010799158261936281) and I went into more technical detail about the specific problem there.

      > I've seen Claude make mistakes like that too, but then the moment you say "you can modify the calling code as well" or even ask "any way we could do this better?" it suggests the optimal solution.

      I agree, but I think I'm less optimistic than you that Claude will be able to catch its own mistakes in the future. On the other hand, I can definitely see how a ~more intelligent model might be able to catch mistakes on a larger and larger scale.

      > I expect that adding a CLAUDE.md rule saying "always look for more efficient implementations that might involve larger changes and propose those to the user for their confirmation if appropriate" might solve the author's complaint here.

      I'm not sure about this! There are a few things Claude does that seem unfixable even by updating CLAUDE.md.

      Some other footguns I keep seeing in Python and constantly have to fix despite CLAUDE.md instructions are:

      - writing lots of nested if clauses instead of writing simple functions by returning early

      - putting imports in functions instead of at the top-level

      - swallowing exceptions instead of raising (constantly a huge problem)

      These are small, but I think it's informative about what the models can and can't do that even Opus 4.5 still fails at these simple tasks.
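
      Here's a minimal, hypothetical sketch of what I mean (the function names and details are invented purely for illustration, not taken from any real codebase):

        # What I keep getting back: nested ifs, an import buried in the function,
        # and a swallowed exception.
        def load_config_claude_style(path):
            if path:
                if path.endswith(".json"):
                    try:
                        import json  # import hidden inside the function
                        with open(path) as f:
                            return json.load(f)
                    except Exception:
                        return None  # silently hides whatever actually went wrong
            return None

        # What I actually want: top-level import, early returns, errors that raise.
        import json

        def load_config(path):
            if not path:
                raise ValueError("path is required")
            if not path.endswith(".json"):
                raise ValueError(f"expected a .json file, got {path!r}")
            with open(path) as f:
                return json.load(f)  # any I/O or parse error propagates to the caller

      The second version fails at the exact line that broke instead of limping along with a None, which is the behavior I want when an agent (or a human) has to debug things downstream.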

      • ako 2 hours ago
        > I agree, but I think I'm less optimistic than you that Claude will be able to catch its own mistakes in the future. On the other hand, I can definitely see how a ~more intelligent model might be able to catch mistakes on a larger and larger scale.

        Claude already does this. Yesterday I asked it why some functionality was slow; it did some research and then came back with all the right performance numbers, how often certain code was called, and opportunities to cache results to speed up execution. It refactored the code, ran performance tests, and reported the performance improvements.

        • ekidd 1 hour ago
          I have been reading through this thread, and my first reaction to many of the comments was "Skill issue."

          Yes, it can build things that have never existed before. Yes, it can review its own code. Yes, it can do X, Y and Z.

          Does it do all these things spontaneously with no structure? No, it doesn't. Are there tricks to getting it to do some of these things? Yup. If you want code review, start by writing a code review "skill". Have that skill ask Opus to fork off several subagents to review different aspects and then synthesize the reports, with issues broken down by Critical, Major, and Minor. Have the skill describe all the things you want from a review.

          There are, as the OP pointed out, a lot of reasons why you can't run it with no human at all. But with an experienced human nudging it? It can do a lot.

          • ako 1 hour ago
            It's basically not very different from working with an average development team as a product owner/manager: you need to feed it specific requirements or it will hallucinate some, and bugs are expected even with unit tests and testers on the team. And yes, as a product owner you also make mistakes and never have all the requirements up front, but the nice thing about working with a GenAI coder is that you can iterate over these requirement gaps, hallucinated requirements, and bugs in minutes, not days.
      • chapel 2 hours ago
        Those Python issues are things I had to deal with earlier last year with Claude Sonnet 3.7, 4.0, and to a lesser extent Opus 4.0 when it was available in Claude Code.

        In the Python projects I've been using Opus 4.5 with, it hasn't been showing those issues as often, but then again the projects are throwaway and I cared more about the output than the code itself.

        The nice thing about these agentic tools is that if you set up feedback loops for them, they tend to fix issues that are brought up. Much of what you bring up can be caught by linting.

        The biggest unlock for me with these tools is not letting the context get bloated, not using compaction, and focusing on small chunks of work and clearing the context before working on something else.

        • bblcla 2 hours ago
          Arguably linting is a kind of abstraction block!
      • pluralmonad 1 hour ago
        I wonder if this is specific to Python. I've had no trouble like that with Claude generating Elixir. Claude sticks to the existing styles and paradigms quite well; I can see in the thinking traces that Claude takes this into consideration.
      • doug_durham 2 hours ago
        That's where you come in as an experienced developer. You point out the issues and iterate. That's the normal flow of working with these tools.
        • bblcla 2 hours ago
          I agree! Like I said at the end of the post, I think Claude is a great tool. In this piece, I'm arguing against the 'AGI' believers who think it's going to replace all developers.
    • Kuinox 2 hours ago
      > My guess is that Claude is trained to bias towards making minimal edits to solve problems.

      I don't have the same feeling. I find that Claude tends to produce wayyyyy too much code to solve a problem, compared to other LLMs.

    • joshribakoff 2 hours ago
      I expect that adding instructions that attempt to undo training produces worse results than not including the overbroad generalization in the training in the first place. I think the author isn’t making a complaint; they’re documenting a tradeoff.
    • threethirtytwo 1 hour ago
      Definitely, the training parameters encourage this. The AI is actually also deliberately trying to trick you, and we know that for a fact.

      Problems with solutions too complicated to explain or to output in one sitting are out of the question. The AI will still bias towards one-shot solutions if given one of these problems, because all the training is biased towards short solutions.

      It's not really practical to give it training data with multi-step, ultra-complicated solutions. Think about it: for the thousands of questions given to it for reinforcement, the trainer is going to be trying to knock them out as efficiently as possible, so they have to be readable problems with shorter, readable solutions. So we know the AI biases towards shorter, readable solutions.

      Second, any solution that tricks the reader will pass training. There is for sure a subset of question/solution pairs that meet this criterion by definition, because WE as trainers are simply unaware we are being tricked. So this data leaks into the training, and as a result the AI will bias towards deception as well.

      So all in all it is trained to trick you and give you the best solution that can fit into a context that is readable in one sitting.

      In theory we could get it to do what we want if we had perfect reinforcement data. The reliability we're looking for seems to sit just over this hump.

    • AIorNot 2 hours ago
      Well yes, but the wider point is that it takes new human skills to manage them - like a pair of horses under your bridle, so to speak.

      When it comes down to it, these AI tools are like going from artisanal-era tools to power tools or machines - like going from a surgical knife to a machine gun. They operate at a faster pace without comprehending like humans do, and without allowing humans time to comprehend all the side effects and massive assumptions they make on every run in their context window.

      Humans have to adapt to managing them correctly and at the right scale to be effective, and that becomes something you learn.

  • iamacyborg 1 hour ago
    Here’s an example of a plan I’m working on in CC. It’s very thorough, though it required a lot of handholding and fact-checking on a number of points, as its first few passes didn’t properly anonymise data.

    https://docs.google.com/document/u/0/d/1zo_VkQGQSuBHCP45DfO7...

  • joshcsimmons 2 hours ago
    IDK, I've been using Opus 4.5 to create a UI library and it's been doing pretty well: https://simsies.xyz/ (still early days)

    Granted, it was building on top of Tailwind (shifting over to Radix after the layoff news). Which begs the question: what is a Lego?

    • threethirtytwo 1 hour ago
      I don't know how someone can look at what you built and conclude LLMs are still Google search. It boggles the mind how much hatred people have for AI, to the point of self-deception. The evidence is placed right in front of you, in your lap with that link, and people still deny it.
      • mattmanser 1 hour ago
        How do you come to that conclusion?

        There are absolutely tons of CodePens of that style. And JSFiddles, Zen Gardens, etc.

        I think the true mind-boggle is that you don't seem to realize just how much content the AI companies have stolen.

        • threethirtytwo 1 hour ago
          >I think the true mind boggle is you don't seem to realize just how much content the AI conpanies have stolen.

          What makes you think I don't realize it? Looks like your comment was generated by an LLM, because that was a hallucination that is not true at all.

          AI companies have stolen a lot of content for training. I AGREE with this. So have you. That content lives rent-free in your head as your memory. It's the same concept.

          Legally speaking though, AI companies are a bit more in the red, because the law, from a practical standpoint, doesn't exactly make illegal anything stored in your brain... but from a technical standpoint, information in your brain, on a hard drive, or on a billboard is still information instantiated/copied in the physical world.

          The text you write and output is simply a reconfiguration of that information in your head. Look at what you're typing: the English language. It's not copyrighted, but every single word you're typing was not invented by you, and the grammar rules and conventions were ripped off from existing standards.

    • dehugger 1 hour ago
      Your GitHub repo was highly entertaining. Thanks for making my day a bit brighter :)
  • Scrapemist 2 hours ago
    Eventually you can show Claude how you solve problems, and explain the thought process behind it. It can apply these learnings, but it will encounter new challenges in doing so. It would be nice if Claude could initiate a conversation to go over the issues in depth. Right now it just wants quick confirmation to plough ahead.
    • fennecbutt 2 hours ago
      Well, I feel like this is because a better system would distill such learning into tokens not associated with a human language, and that could represent logic better than using English etc. for it.

      I don't have the GPUs or time to experiment though :(

      • Scrapemist 56 minutes ago
        Yes, but I would appreciate it if it used English to explain its logic to me.
  • doug_durham 2 hours ago
    Did the author ask it to make new abstractions? In my experience, when it produces output that I don't like, I ask it to refactor it. These models have an understanding of all modern design patterns. Just ask it to adopt one.
    • bblcla 2 hours ago
      (Author here)

      I have! I agree it's very good at applying abstractions, if you know exactly what you want. What I notice is that Claude has almost no ability to surface those abstractions on its own.

      When I started having it write React, Claude produced incredibly buggy spaghetti code. I had to spend 3 weeks learning the fundamentals of React (how to use hooks, providers, stores, etc.) before I knew how to prompt it to write better code. Now that I've done that, it's great. But it's meaningful that someone who doesn't know how to write well-abstracted React code can't get Claude to produce it on their own.

      • michalsustr 2 hours ago
        Same experience here! As an analogy, consider that the model knows about both Arabic and Roman numeral representations. But in an alternate universe, it has been trained so much on Roman numerals ("Bad Code") that it won't give you the Arabic ones ("Good Code") unless you prompt it directly, even when they are clearly superior.

        I also believe that overall repository code quality is important for AI agents - the more "beautiful" it is, the more the agent can mimic the "beauty".

  • esafak 1 hour ago
    > Claude doesn’t have a soul. It doesn't want anything.

    Ha! I don't know what that has to do with anything, but this is exactly what I thought while watching Pluribus.

  • mklyachman 3 hours ago
    Wow, what an excellent blog. Highly suggest trying out creator's tool (stardrift.ai) too!