Hacker News | singluere's comments

While I've not used this product, I've created a somewhat similar setup using open-source LLMs that runs locally. After using it for about three months, I can say that debugging LLM prompts was far more annoying than debugging code. Ultimately, I abandoned my setup and went back to writing code the good old-fashioned way. YMMV


I had ChatGPT output an algorithm implementation in Go (Shamir Secret Sharing) that I didn't want to figure out. It kinda worked, but every time I pointed out a problem with the code, it seemed more bugs were added (and I ended up hating the "Good catch!" text responses...)

Eventually, figuring out why it didn't work forced me to read the algorithm spec and basically write the code from scratch, throwing away all of the ChatGPT work. It definitely took more time than doing it the "hard way".


The skill in using an LLM currently is in getting yourself to where you want to be, rather than wasting time convincing the LLM to spit out exactly what you want. That means flipping between having Aider write the code and editing the code yourself when it's clear the LLM doesn't get it, or you get it better than it does.


This is the key thing that I feel most people who dislike using LLMs for development miss. You need to be able to quickly tell if the model is just going to keep spinning on something stupid, and just do it yourself in those scenarios. If you're decent at this then there can only really be a net benefit.


The bits GPT4 always gets wrong - and as you say, more and more wrong the further I try to work with it to fix the mistakes - are exactly the bits I want it to do for me. Tedious nested loops that I need to calculate on paper in particular.

What it's good for is high level overview and structuring of simple apps, which saves me a lot of googling, reviewing prior work, and some initial typing.

After my last attempts to work with it, I've decided that until there's another large improvement in the models (GPT5 or similar), I won't try to use it beyond this initial structure creation phase.

The issue is that for complex apps that already have a structure in place, especially if it's not a great structure and I don't have the rights or time to do a refactoring, the AI can't really do anything to help. So for new, simple, or test projects it'll seem like an amazing tool, and then in the real world it's pretty much useless or even wastes time. The exception is brainstorming entirely new features that can be reasoned about in isolation, in which case it's useful again.

A counterpoint is that code should always be written in a modular way so that each piece can be reasoned about in isolation. Which doesn't often happen in large apps that I've worked on, unfortunately. Unless I'm the one who writes them from scratch.


Copilot is a decent autocomplete saving you half a line here and there, that's about it.


I can regularly get it to autocomplete big chunks of code that are good. But specifically only when it's completely mind numbingly boring, repetitive and derivative code. Good for starting out a new view or controller that is very similar to something that already exists in the codebase. Anything remotely novel and it's useless.


I have strange documentation habits, and sometimes when you document everything in code comments up front, Copilot does seem to synthesize from your documentation most of the "bones" that you need. It often needs a thorough code review, but it's not unlike sending a requirements document to a very junior developer who sometimes surprises you, and getting back a PR that almost works but needs a fine-tooth comb. A few times I've "finished my PR review" with "Not bad, Junior, B+".

I know a lot of us generally don't write comments until "last" so will never see this side of Copilot, but it is interesting to try if you haven't.


Yes I use this trick regularly too.


An alternative to this workflow that I find myself returning to is the good ol' nicking of code from Stack Overflow or GitHub.

ChatGPT works really well because the stuff you are looking for is already written somewhere, and it solves the needle-in-the-haystack problem of finding it very well.

But I often find it tends to output code that doesn't work but eerily looks like it should, whereas Github stuff tends to need a bit more wrangling but tends to work.


The big benefit to me with SO is that for a question with multiple answers, the top-voted answer likely works, since those votes are probably from people who tried it. I also like the "well, actually" responses and follow-ups, because people point out performance issues or edge cases I may or may not care about.

I only find current LLMs to be useful for code that I could easily write, but I am too lazy to do so. The kind of boilerplate that can be verified quickly by eye.


Once it writes the code, take that into a new session to fix a bug. Repeat with new sessions. Don’t let it read the buggy code, it will just get worse.


Yah this works for me and I'm not a SWE. I use it to make marketing websites. Sometimes it will do something perfectly but mess up one part; if I keep getting it to fix that one part in the same session, it's almost certainly never going to work (I burnt a week this way). However, if I take it into a brand-new GPT session and say "here is this webpage I wrote, but I made a mistake and the dropdown box should be on the left not the right", it can almost always fix it. Again, I'm not really a SWE so I'm not sure what is going on here, but if you click the dropdown on that "Analyzing" thing that shows up, in the same session it seems to try to rework the code from memory, while in a new session it seems to be using a different method to rework the code.


Interesting - I almost always iterate on code in the same session. I will try doing it with history off and frequently re-starting the session. I naively assumed the extra context would help, but I can see how it's also just noise when there are 5 versions of the same code in the context.


How do you not let it read the buggy code but also take it into a new session?


I assume just copy and paste it


I'm just as confused... If you're copy/pasting the code into a new session, isn't that reading the code?


The way I understand it:

First Variant:

1. User: Asks coding question

2. AI: Outputs half-functioning code

3. User: Asks to fix specific things

4. AI: Creates buggy code

5. User: Asks again to fix things

6. AI: Writes even more buggy code

Proposed second variant with copying code:

Until step 4 everything stays the same, but instead of asking it to fix the code again, you copy it into another session. This way you repeat step 3 again, without the LLM "seeing" the buggy code it previously generated in step 4.


I dunno how you SWEs are doing it, but I have ChatGPT output files (zipped if there are multiple), not code snippets (unless I want a code snippet), and then I re-upload those files to a new session using the attach thinger. Also, in my experience just building marketing websites, I don't do step 3; I just do steps 1 and 2 over and over in new sessions. It's longer because you have to figure out a flow through a bunch of work sessions, but it's faster because it makes wwwwaaaaayyyyyyy fewer mistakes. (You're basically just shaking off any additional context the GPT has about what you are doing when you put it in a brand-new session, so it can be more focused on the task, I guess?)


The only time I've had success with using AI to drive development work is for "writers block" situations where I'm staring at an empty file or using a language/tool with which I'm out of practice or simply don't have enough experience.

In these situations, giving me something that doesn't work (even if I wind up being forced to rewrite it) is actually kinda helpful. The faster I get my hands dirty and start actually trying to build the thing, the faster I usually get it done.

The alternative is historically trying to read the docs or man pages and getting overwhelmed and discouraged if they wind up being hard to grok.


I've literally never seen an LLM respond negatively to being told "hold on that's not right"; they always say "Oh, you're right!" even if you aren't right.

GPT-4 today: "Hey are you sure that's the right package to import?" "Oh, sorry, you're right, its this other package" (hallucinates the most incorrect response only a computer could imagine for ten paragraphs).

I've seen junior engineers lose half a day traveling alongside GPT's madness before an adult is brought in to question an original assumption, an incorrect fork in the road, or what have you.


One thing that's helped me a little bit is to open up the spec as context and then ask the LLM to generate tests based on that spec.


I just asked it the same question and this was the answer it gave me:

go get go.dedis.ch/kyber/v3

LOL...


That's pointing to a fairly solid implementation, though (I've used it.) I would trust it way before I'd trust a de novo implementation from ChatGPT. The idea of people using cryptographic implementations written by current AI services is a bit terrifying.


> Shamir Secret Sharing

> ChatGPT

please don't roll your own crypto, and PLEASE don't roll your own crypto from an LLM. They're useful for other kinds of programs, but crypto libraries need to be to spec, and heavily used and reviewed, to not be actively harmful. Not sure ChatGPT can write constant-time code :)


People always say this but how else are you going to learn? I doubt many of us who are "rolling our own crypto" are actually deploying it into mission critical contexts anyway.


Asking an LLM to do something for you doesn't involve any learning at all.


I’m not talking about the LLM case, just the mantra of “don’t roll your own crypto” constantly. Comes off as unnecessarily gatekeepy.


I mean, by that, people don't generally mean, literally, "never write your own crypto". They just mean "on no account _use_ self-written crypto for anything".


That is my struggle as well. I need to keep pointing out issues in the LLM output, and only after multiple iterations may it reach the correct answer. At that point I don't feel I've gained anything productivity-wise.

Maybe the whole point of coding with LLMs in 2024 is for us to train their models.


Indeed, and the more niche the use case, the worse it gets.


I can definitely echo the challenges of debugging non-trivial LLM apps, and making sure you have the right evals to validate progress. I spent many hours optimizing Copilot Workspace, and there is definitely both an art and a science to it :)

That said, I'm optimistic that tool builders can take on a lot of that responsibility and create abstractions that allow developers to focus solely on their code and the problem at hand.


For sure! As a user, I would love to have some sort of debugger-like behavior for debugging the LLM's output generation. Maybe some ability for the LLM to keep running tests until they pass? That sort of stuff would make me want to try this :)


see the langtail app (I am not the maker)


I'm sure we'll share some of the strategies we used here in upcoming talks. It's, uh, "nontrivial". And it's not just "what text do you stick in the prompt".


Honestly, I've found using GH CoPilot chat to be the real value add. It's amazing for rubber ducking.

That being said, my employer pays for it. I am still on the fence about which LLM to subscribe to with my own money.


Creating this using Open Source LLMs would be like saying you tried A5 Wagyu by going to Burger King, respectfully.

I think benchmarks are severely overselling what open source models are capable of compared to closed source models.


I really don't think they're being oversold that much. I'm running Llama 3 8B on my machine, and it feels a lot like running Claude 3 Haiku with a much lower context window. Quality-wise it is surprisingly nice.


Llama 3 just came out, so they couldn't have used it, and Claude Haiku is the smallest, cheapest closed-source model out there from what I've seen.

Github is likely using a GPT-4 class model which is two (massive) steps up in capabilities in Anthropic's offerings alone


Yeah I just mentioned Llama to point out that the open weight models have been really catching up.

Microsoft is almost certainly using GPT-4 given their relationship with ClosedAI, but I would definitely not put GPT-4 (nor Turbo) "two massive steps up" from Claude 3 Opus. I have access to both through Kagi, and I have found myself favoring the responses of Claude to the point where I almost never use GPT(TM) anymore.


You're misreading in multiple ways, maybe in a rush to dunk on "Closed AI".

Github Copilot is not the same as Copilot Chat, which uses GPT-4; there's still some uncertainty about whether Copilot completions use GPT-4 as outsiders know it (and IIRC they've specifically said at some point that it doesn't).

I also said Haiku is two massive steps behind Anthropic's offerings... which are Sonnet and Opus.

Anthropic isn't any more open than OpenAI, and I personally don't attribute any sort of virtue to any major corporation, so I'll take what works best


I... don't think I misread you? Maybe you didn't mean what you wrote, but what you said was:

> Github is likely using a GPT-4 class model which is two (massive) steps up in capabilities in Anthropic's offerings alone

Comparing GPT-4 to Anthropics offerings, which, as you say, includes Sonnet and Opus.

> Anthropic isn't any more open than OpenAI, [...] so I'll take what works best

I understand that, and same here. I don't prefer Claude for any reason other than the quality of its output. I just think OpenAI's name is goofy with how they actually behave, so I prefer the more accurate derivative of their name :)

Regarding what model Copilot Completions is using - point taken, I have no comment on that. My original comment in this thread was only meant to point out that open weight models are getting a lot better. Not saying they're using them.


I used "in Anthropic's capabilities" intentionally: it's two steps up in what they can do from Claude Haiku


Locally running LLMs in Apr 2024 are nowhere close to GPT-4 in terms of coding capabilities.


Depends what for; I find AI tools best for boilerplate or as a substitute for Stack Overflow. For complex logic, even GPT-4 ends up sending me down the garden path more often than not.

I downloaded Llama 3 8B over the weekend and it's alright. I haven't plugged it into VSCode yet, but I could see it (or code-specific derivatives) handling those first two use cases fine. I'd say close enough to be useful.


Agreed. You can even get specialized LLMs like DeepSeek Coder for better results.


And GPT-4 is nowhere close to the human brain in terms of coding capabilities, and model advancements appear to be hitting an asymptote. So...


I don't see a flattening. I see a lot of other groups catching up to OpenAI, and some even slightly surpassing them, like Claude 3 Opus. I'm very interested in how Llama 3 400B turns out, but my conservative prediction (backed by Meta's early evaluations) is that it will be at least as good as GPT-4. It's been a little over a year since GPT-4 was released to the public, and in that time Meta and Anthropic seem to have caught up, and Google would have too if they spent less time tying themselves up in knots. So OpenAI has a one-year lead, though they seem to have spent some of that time on making inference less expensive, which is not a terrible choice. If they release 4.5 or 5 and it flops or isn't much better, then maybe you are right, but it's very premature to call the race now; maybe two years from now with little progress from anyone.


I shouldn't have used the word asymptote; I should have said logarithmic. I don't doubt a best-case situation where we get a GPT-5, GPT-6, GPT-7, etc., each more capable than the last; just that there will be more months between each, each will be more expensive to train, and the gain of function between each will be smaller than the previous.

Let me phrase this another way: suppose Llama 3 400B releases and it has GPT-5-level performance. Obviously, we have not seen GPT-5, so we don't have a sense of what that level of performance looks like. It might be that OpenAI simply has a one-year lead, but it might also be that all these frontier model developers are stuck in the same capability swamp, and we simply don't have the compute, virgin tokens, economic incentives, algorithms, etc. to push through it (yet). So Meta pulls ahead, but we're talking about feet, not miles.


Maybe they know something about GPT-5



Greg Brockman sharing the timeline on Twitter: https://twitter.com/gdb/status/1725736242137182594?s=46&t=Nn...


Copy-pasted here for posterity:

Greg Brockman @gdb

Sam and I are shocked and saddened by what the board did today.

Let us first say thank you to all the incredible people who we have worked with at OpenAI, our customers, our investors, and all of those who have been reaching out.

We too are still trying to figure out exactly what happened. Here is what we know:

- Last night, Sam got a text from Ilya asking to talk at noon Friday. Sam joined a Google Meet and the whole board, except Greg, was there. Ilya told Sam he was being fired and that the news was going out very soon.

- At 12:19pm, Greg got a text from Ilya asking for a quick call. At 12:23pm, Ilya sent a Google Meet link. Greg was told that he was being removed from the board (but was vital to the company and would retain his role) and that Sam had been fired. Around the same time, OpenAI published a blog post.

- As far as we know, the management team was made aware of this shortly after, other than Mira who found out the night prior.

The outpouring of support has been really nice; thank you, but please don’t spend any time being concerned. We will be fine. Greater things coming soon.

10:42 PM · Nov 18, 2023

8.1M Views


Reading that thread made me realise how low the signal-to-noise ratio is over on Twitter.

90% of the replies scrolling down were rehashed versions of "can't believe they used Google meet"


Try blocking all bluechecks. After doing so is the first time in like a decade that Twitter has had good content for me.

Before, braindead or cloutchasing bluechecks were mixed in with the rest of us rabble. Hard to pick them out of the pack, you had to read their detritus with the rest of the comments.

Now they voluntarily self identify, and even better, their comments are lumped at the top. So block them all with a bot or just scroll down until there's no more blue checks and the comment quality jumps exponentially.


And “can’t believe how shitty the Twitter replies are” is any better?


Seemed like a pretty productive conversation to me. As a non-twitter regular I now know how to make things more bearable in the future thanks to this discussion.


Monetization of "hot takes" and baiting, a true example of enshittification


As both hint at "greater things" already on the horizon: maybe they were working on/for a competitor on the side and the board found out?


That's standard "You were too good for them anyways" post break-up speech


His timeline.


Thanks for the advice. I've been trying to improve communication on my end, but somehow I feel it has not improved. Given the power dynamics in a manager/report relationship, I feel that with certain managers it's more difficult to get on the same page.

