Updates on LLM-assisted coding
31/03/2025
I thought I'd give a quick update on my LLM coding journey. I've spent a lot more time doing this over the last few weeks, and I've also spent some time integrating LLM-generated code with low-code tools.
Claude Sonnet 3.7
Most of my development in the last month has been with Claude's assistance, the majority using the 3.7 model and getting to grips with how it differs from the previous version.
In short, it's much more verbose and tends to do more than I actually want! I'm not sure what tuning they did between the two versions, but in my opinion it's actually less usable than its predecessor. The issue I've noticed is that it tends to carry on generating, extrapolating what I might want to do next, which isn't necessarily what I'd prefer. I've sometimes even found myself reverting to 3.5 (October).
To get around this issue I've been varying my project instructions to ask it to:
- Come up with a plan and run it past me before starting to implement anything
- Implement solutions in small independent steps
- Ask for feedback after completing each meaningful component to check it against my requirements
- Implement only what is explicitly requested.
This prompting does seem to help, but it still has a tendency to run away with itself - I hope they can pull this back. The thinking mode does seem helpful, and it's good that it doesn't require you to switch to a separate reasoning model. That said, the feature doesn't seem to produce a major improvement in quality when it's invoked (possibly because the model is already good on its own).
I haven't tried Claude Code because I subscribe to Pro and don't want a separate charge at the moment.
Reasoning models
I've also been spending a bit of time recently with the o1 and o3 models in ChatGPT. My feedback on these is similar to that on the Claude thinking approach above: I'm not sure the results are good enough to justify the extra time. I suspect this is something that may need more usage first!
GitHub Copilot
I've been playing around with GitHub Copilot since it was made free for light users in VS Code before Christmas. I do find it useful as a better autocomplete at times, but I've definitely not found the chat functionality as useful as something like Claude. At some point in the near future I'll try competitors like Cursor and Cline.
On 'Vibe Coding'
The phrase 'Vibe Coding' has been all over my Bluesky and LinkedIn feeds, and I'm not sure it's been a great coinage to describe the AI development phenomenon. It seems to be used for things that don't fall under Andrej Karpathy's original definition. I'm happy with the thought of letting the LLM take over for things that are basically development spikes or proofs of concept, but that approach avoids the hard parts of creating well-factored, modular, tested code that can go seamlessly into production. "Vibe coding" seems to skip over this aspect - but, to me, the AI makes this part easier as well. If planning, specifying and testing are easier to do, it's possible to do more of these activities while still getting fast feedback. Perhaps someone with a better marketing brain than me needs to come up with a snappy alternative to "AI-assisted software engineering" or "AI-assisted architecture".
Simon Willison's article Not all AI-assisted programming is vibe coding (but vibe coding rocks) covers similar ground from a slightly different perspective. He recommends vibing away when the stakes are low (experiments, getting the feel of AI coding, or personal tools), but considering different modes for "serious programming". I agree with this, and I'd add architectural discussion to the things you should do first. There are some interesting posts on Reddit about getting the AI to write the README first before letting it start programming.
I'm looking forward to carrying this experiment on over the next few weeks, delving more into the tooling and the wild world of MCP servers. It's an exciting time to be a nerd!
An experiment with coding with Claude
27/02/2025
Most of my working week has been spent talking to customers about generative AI - and I've got lots I'd like to say on that in the future.
However, a lot of the work we're doing in that space in the day job relies on low-code/no-code tooling because, in the main, our clients don't have in-house developers. There, AI is proving a godsend, because the development is largely writing prose and then creating tools, either using low-code workflows (such as Power Automate) or bridging into custom APIs. I've found that although the AI part is good, clients still struggle with the logic required for data transformation and integration, and with the practical side of dealing with other systems - handling failure, retries and that sort of thing, i.e. the parts that most benefit from technical experience.
I'm interested in how far AI can take you in improving productivity, so as part of my personal spare-time coding I thought I'd run an experiment to see how far I could get with a small project. Claude was my assistant throughout.
Note: this experiment was done before the release of Sonnet 3.7 - I'll update once I've really had a chance to kick the tyres!
The idea
The idea was to generate a good, readable text version of an online video - with a twist of my own. This was something I'd wanted to do for a while to help me catch up with conferences, but also in general, as so much good content is now available in video-only format. I'm sure there are services that already do something similar, but I had a few specific requirements of my own. These were:
- I wanted to make sure I could run it locally and consume the content offline (sadly, although 5G is coming to the Tube, it's not yet everywhere)
- I wanted the level of detail in the transcript retained, but the writing legible and grammatically correct - so I wasn't interested in just getting a raw transcript
- I wanted any significant visuals captured, so I could see diagrams, slides etc. in context
- I wanted to be able to jump to the right point in the video if I needed to confirm what was said
This felt small enough that I could knock it up relatively easily, but specific enough that I wouldn't find something identical already out there (although I'd be interested to hear of any!)
The experiment
I set myself the following boundaries:
- As much as possible had to be done by the AI, not me. Using me for feedback and error checking was fine, but I didn't want to write much code myself.
- I wanted the AI to guide me as much as possible, so I tried not to suggest solutions but asked it for possible approaches before getting it to write some code.
- Claude would be the main AI used, so I wanted to use as many of its facilities as possible (e.g. Projects)
- I wanted it in Python for a couple of reasons: as a personal project I was happy to forge my own path, and I knew I'd probably be using AI libraries, so I went where the ecosystem was strongest.
What I did
I set up a project in Claude (my current coding AI of choice) and instructed it to create a Python application using local libraries rather than remote APIs. I then, over the course of a couple of commutes to work, instructed it to create a few different scripts to:
- Download the video
- Look for parts of the video where the screen didn't change significantly for a defined length of time (it used OpenCV for this - see the sketch after this list)
- Transcribe the video (it chose Faster Whisper - there's a sketch of that step below too)
- Use an LLM to clean up the transcript (in the end we got to llama.cpp)
- Generate a webpage (I had to prompt it here to use Jinja templates rather than other approaches)
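To give a flavour of the detection step, here's a minimal sketch of the kind of OpenCV frame-differencing involved - the file name, threshold and minimum duration here are illustrative assumptions on my part, not necessarily what Claude produced:

```python
# Sketch: find spans where consecutive frames barely change.
# Assumes the video has already been downloaded to video.mp4.
import cv2

def find_static_segments(path, diff_threshold=2.0, min_seconds=5.0):
    """Yield (start, end) times, in seconds, of visually static spans."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    prev_gray, run_start, frame_idx = None, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            # Mean absolute pixel difference as a cheap change score
            score = cv2.absdiff(gray, prev_gray).mean()
            if score < diff_threshold:
                if run_start is None:
                    run_start = frame_idx / fps
            elif run_start is not None:
                if frame_idx / fps - run_start >= min_seconds:
                    yield (run_start, frame_idx / fps)
                run_start = None
        prev_gray = gray
        frame_idx += 1
    if run_start is not None and frame_idx / fps - run_start >= min_seconds:
        yield (run_start, frame_idx / fps)  # run lasting to the end
    cap.release()

if __name__ == "__main__":
    for start, end in find_static_segments("video.mp4"):
        print(f"Static from {start:.1f}s to {end:.1f}s")
```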
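The transcription step with Faster Whisper boils down to something like the following - the model size, device and audio file name are my own illustrative choices:

```python
# Sketch: transcribe audio with faster-whisper, keeping timestamps
# so the generated page can link back into the video.
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.mp3")

for seg in segments:
    print(f"[{seg.start:.1f}s -> {seg.end:.1f}s] {seg.text.strip()}")
```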
Once I'd got something working, I asked it to write a driver script that imported the other scripts and called them in order. At each stage of the process I was adding the files to the project knowledge and, after a while, regularly pruning old ones.
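Conceptually the driver looked something like this - note that the module and function names below are hypothetical stand-ins, not the actual interfaces Claude produced:

```python
# Sketch of the driver. The imported modules and their run()
# functions are hypothetical names for the scripts described above.
import sys

import download_video
import detect_static_frames
import transcribe
import clean_transcript
import generate_page

def main(url: str) -> None:
    video_path = download_video.run(url)           # fetch the video
    stills = detect_static_frames.run(video_path)  # OpenCV step
    raw = transcribe.run(video_path)               # Faster Whisper step
    cleaned = clean_transcript.run(raw)            # llama.cpp cleanup
    generate_page.run(cleaned, stills, video_path) # Jinja output

if __name__ == "__main__":
    main(sys.argv[1])
```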
Outcome
It worked (eventually)! See the screenshot below:
Issues
Unfortunately, it wasn't all plain sailing. I encountered several issues, including:
- The LLM really had to be prompted hard to get off its default path - for example, asking it to change the method of transcription to another library (but keeping the same interface) failed more often than not. Sometimes a chat outside the project got better results.
- When it was making significant changes, fixes that were already part of the files in the project knowledge would get lost. For example, changing the data to be rendered would revert the page to the original styling it had decided on, rather than using the styling already present in the project file. This was quite frustrating - and though Claude is very pleasant, I did find myself getting a bit shirty at times! "You've lost the fixes we did to the styling, can you restore them please??"
- When it was revising files, it would occasionally start duplicating content and get very confused. Starting a new chat generally fixed this.
- Imports were often a problem - it would try to use modules that it hadn't imported. As Claude doesn't yet have a Python interpreter built in, it wouldn't catch this; it probably wouldn't have been a problem with ChatGPT, for example.
- Style was all over the place - sometimes things were done as classes, sometimes as dataclasses, sometimes as standalone functions. Adding style guidance to the project instructions at the start would most probably have helped - something to try on the next one.
- It was very bad at async - trying to get it to parallelise work often led to poorer performance (which I normally tracked down to bad use of the library or not chunking text correctly). I ended up reverting most of the scripts to run sequentially. I wasn't altogether surprised by this, though - human programmers find concurrency difficult too! I'd also seen the same issues in other work I'd been doing.
I'd say the above would likely apply to most AI assistants. There were also some things that were Claude-specific:
- The context length was an issue - starting new chats very frequently was necessary, and files had to be kept small to avoid running into output-length issues. No bad thing, perhaps: it led to more modular code.
- If I wasn't disciplined I'd run into Claude's rate limits, which meant either using the poorer Haiku model or stopping work.
- The times I used Haiku, code-generation quality went down, and errors and code that wouldn't run increased. Sonnet is so much better than Haiku that I was surprised, given the results I've had with smaller Llama models.
Conclusion
All in all, I'd class this experiment as a success. At the end of it I had a working piece of software, and I suspect I would have run out of time and capacity to finish the project without the AI's help. Being able to noodle away at it on my phone was surprisingly helpful for getting little bits drafted and then integrating them when back on a more powerful machine. My next steps would be to get Claude to refactor the code a bit and make it more consistent, and then likely extend it into a full web application. As Sonnet 3.7 and Claude Code were released recently, I'm going to see how they perform, and possibly compare the experience with using GitHub Copilot as well.