BREAKING: CMU researchers found that “vibe coding” is insecure.
Developers are shocked.
The rest of us are shocked that anyone thought vibes counted as a security protocol.

Paper: “Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks”

  • Short answer: yes.

    I took over a vibe coded project. It was storing sensitive information in browser session storage as well as on the server via the file system. No database, no validation, no authorization. No JWT, just sessions managed through a file on the file system. It was a mess.

    The sooner the world moves on from "devs will be replaced by AI" to "devs now have a supercharged search engine/autocomplete" the better. Unless they really want to be sued/go bankrupt, companies aren't vibe coding anything aside from internal proof of concept apps.

    The fact that anyone thinks word generators (which is what gets incorrectly labelled as "AI") would have any kind of intelligence instead of simply being some form of autocomplete is absolutely pathetic. Seeing how many people fall for crypto/gambling/AI scams and all the political garbage that's been going on in the last 10 years really proves how insanely stupid a large percentage of humanity is.

    It's fascinating how people still try to refute the fact that they are word generators.

    To some degree, what word generators have achieved is absolutely amazing... if only they weren't so expensive to build and run that their cost greatly exceeds their utility, and if only there wasn't so much crime and grift involved in the industry, and if only all that did not require building a cult-like following and overhyping them to the public as "AI", and if only running them didn't require destroying our environment even quicker, I would be impressed. Compared to crypto, which on top of all that, was also stupid as a tech...

    Now, with the bubble ready to burst, some attempts at inflating the next one are visible... it's, like, the third time at least there's an attempt at starting the "quantum computing" craze, but previous ones have all been short lived and mostly unsuccessful. A few big companies have made significant investments in the thing and will never stop trying to get something out of it... We'll be seeing more and more news and press releases on QC while we watch the financial markets burn in the "AI" collapse. We've already seen some, recently: they're feeling the temperature of the water.

    Something something "it's hard to make someone believe something when their paycheck relies on them not believing it".

    CEOs want to say they are using AI because the investors are demanding it. The investors are expecting companies to say they are embracing AI to reduce jobs/costs, at least in part because they are expecting other investors to be going in on it. Meanwhile you have fraudsters like Sam Altman telling them that any day now it will really replace employees.

    I think the crypto bubble is a good comparison, because a lot of the valuation is based on the assumption that other people will be overvaluing it, and not really caring if the underlying tech makes any sense at all.

    Obviously, there are also a lot of idiots that truly believe it, and a lot that are in "monkey-see-monkey-do" mode and just following what they think the big players are doing.

    There's also another angle. If you help someone by telling them a criminal is scamming them, they'll hate you, not the criminal, for making them feel stupid.

    It's fascinating how people still try to refute the fact that they are word generators.

    There's a lot of utility in a really good word generator. The answer to a question is often words, so that can make them good at answering questions. Complying with instructions can mean generating words, so that can make them good at doing tasks that involve writing. As long as the training data is extensive enough and the model is big and complicated enough, you can do really quite a lot with a word generator.

    But trying to do these things with a word generator is like trying to paint a photograph. It's really not that hard to paint something that gives the impression of a photographic image! And as you spend more and more time making finer and finer brush strokes, you can make the painting look closer and closer to a photograph. But at some point, the amount of effort in getting that painting to hold up against finer and finer scrutiny becomes totally unrealistic. As the brush strokes become more and more fine, you can always look a bit closer and still see how it isn't really a photograph. There are always differences, artifacts, flaws.

    GPT-2 was like an impressionist painting, showing there was potential in the approach. GPT-3 was painting with fine enough brush strokes that it looked like it could maybe answer questions and perform writing tasks, just as long as you squinted a lot. This level of improvement made a lot of people with a lot of money really excited, though. If the trend could be extrapolated from there, then a totally attainable amount of training could give us true photorealism, or something so close as to be practically indistinguishable!

    So GPT-4 was loads more work for a bit more photorealism, just enough to satisfy or to fool a lot of people who didn't bother to have a close look. GPT-5 was loads more work for... really just about the same. Just maybe the people with all that money are starting to realize the problems inherent in extrapolating trends from insufficient data. As you dedicate more and more resources to training, perhaps unsurprisingly, it turns out that this whole LLM-based approach to AI comes with diminishing returns.

    Turns out there's not enough compute and training data in the world to make the paintings fully photographic. The brush strokes are still visible: the answers are not always real and the instructions are not always followed. Even so, it all goes right just often enough that a lot of people have decided they don't care about the brush strokes, and that kinda photographic is plenty good for them.

    Someone might still invent the camera. Something that models intelligence directly instead of trying to imitate the effect without its cause. But we surely won't get there just by painting with word generators.

    I feel like saying "they are just word generators, what's the big deal" is on the level of looking at an F1 car and going "they are just machines that explode gasoline, why does everyone deny this". If you want to make the claim that if you traveled back to 2019 and I told you I made a "word generator" and gave you access to GPT-5 or whatever, you would go "yeah whatever, it just like makes shit up, nbd", then I will just straight up call you a liar. Any person from 2019 who saw any of these models would say it was unambiguously artificial intelligence; there are clearly some emergent properties from the LLM architecture that go beyond the simplification of "create the next word". They are capable of applying memorized knowledge in novel situations.

    Yes, LLMs may never reach what the evangelists say they will: full AGI. They have limitations, and their lack of fundamental access to truth is one of the big ones. But to me, people calling them just "word generators" are at this point more delusional than the people saying AGI is almost here.

    It’s because modern LLMs post-2021 with the first Codex model quite literally are not just word generators (i.e., translators) and have demonstrated material gains in many domains over the years.

    That people misapply this very early technology (which may top out tomorrow, a year from now, or a decade from now, nobody knows) and think it's somehow going to replace programmers is dumb, but that doesn't change the fact that this technology does far more than you've characterized it as doing.

    I mean... you gotta give the definition of "word generator" a pretty wide latitude in order for this to really be defensible. Like, to the extent all software developers are "word generators" too.

    I can have it consume a 50k line codebase and ask it to find any obvious bugs or anti-patterns and it will produce a useful output within about 10-15 minutes. Technically that output is words, so sure, it generated words, but it generated some really fucking useful words, just like the NTSB did when it investigated the last airplane crash.

    There is an inverse correlation between how much someone promotes AI and how much they understand it.

    I just got out of a training session where the presenter didn't know what an API was and thought that the AI that we trained on our internal documentation was a "public AI" because Google sold us the software.

    It’s because it’s a stupid take that is barely worth refuting. Have you people actually used agent mode? It clearly prints out what it is doing, which goes far beyond mere “word generation”. That’s how GPT-3 worked, but things have advanced tremendously since then.

    Agent mode is word generation with a looping function. An LLM is a text generator. "Thinking" modes are the LLM being fed its own input and told to iterate as though it's thinking.

    "Thinking" modes are the LLM being fed its own input and told to iterate as though it's thinking.

    But this is just obviously not true. It will go out and look up information for you, compile it, and "generate words" about it.

    The text generator will emit text that's read by a separate program, which has an API that connects to search engines or CLI-type tools and feeds those tool outputs back into the LLM.
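    For the skeptical, that loop is short enough to sketch. This is a toy illustration, not any vendor's actual API: call_llm and the "TOOL:" convention are made-up stand-ins, and you'd plug in a real client and real tool parsing.

    ```python
    import subprocess

    def call_llm(messages):
        """Hypothetical stand-in for a chat-completion API call.
        Returns the model's next block of text for the conversation."""
        raise NotImplementedError("plug in your LLM client here")

    def run_shell(command):
        """Example tool: run a CLI command and capture its output."""
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        return result.stdout + result.stderr

    TOOLS = {"shell": run_shell}

    def agent(task, max_steps=10):
        messages = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            text = call_llm(messages)              # the word generation step
            messages.append({"role": "assistant", "content": text})
            if not text.startswith("TOOL:"):       # no tool call -> we're done
                return text
            # e.g. "TOOL: shell: grep -rn password src/"
            _, name, arg = text.split(":", 2)
            output = TOOLS[name.strip()](arg.strip())
            # feed the tool output back into the conversation and loop again
            messages.append({"role": "user", "content": f"TOOL OUTPUT:\n{output}"})
        return "step limit reached"
    ```

    Everything "agentic" lives in that for-loop; the model itself only ever sees and emits text.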

    It's built on word generation, I literally don't give a fuck.

    God, it's like y'all are completely unaware that there is a middle ground.

    It's the reddit Luddite effect, no one is capable of acknowledging the mind-blowing advances and capabilities of some of these models just because there are some salespeople overselling their abilities.

    > companies aren't vibe coding anything aside from internal proof of concept apps

    You'd think

    Companies that don’t want to eventually go bankrupt, haha.

    A vibecoder could just follow a 1-2 hr YouTube tutorial and would have the basis for a decently secure app using JWTs, hashed passwords, etc., but I guess that's all too much work for them.
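    For reference, the "decently secure basis" really is about this much code. A minimal sketch assuming the bcrypt and PyJWT packages; the secret and the in-memory user store are placeholders, not a drop-in implementation:

    ```python
    import datetime
    import bcrypt   # pip install bcrypt
    import jwt      # pip install PyJWT

    SECRET = "load-me-from-an-env-var-not-source"   # placeholder secret
    USERS = {}  # username -> hashed password (stand-in for a real database)

    def register(username, password):
        # store only a salted bcrypt hash, never the plaintext password
        USERS[username] = bcrypt.hashpw(password.encode(), bcrypt.gensalt())

    def login(username, password):
        hashed = USERS.get(username)
        if not hashed or not bcrypt.checkpw(password.encode(), hashed):
            return None
        # short-lived signed token instead of a session file on disk
        claims = {
            "sub": username,
            "exp": datetime.datetime.now(datetime.timezone.utc)
                   + datetime.timedelta(hours=1),
        }
        return jwt.encode(claims, SECRET, algorithm="HS256")

    def authenticate(token):
        try:
            return jwt.decode(token, SECRET, algorithms=["HS256"])["sub"]
        except jwt.InvalidTokenError:
            return None
    ```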

    I think the issue is less that it's not easy to do with vibecoding, and more that vibecoders are not asking any questions about what a modern application requires and how user information is stored properly before they start vibecoding a project.

    Knowing that those things will improve the security is one thing; modifying the app to incorporate them is another beast. I wonder if current LLMs can do that. I guess Opus 4.5 or Sonnet may be able to.

    Having implemented some security stuff with and without claude code, it's not very good at it. It's just not great at configuration heavy things, and anything with security is very config-heavy.

    It'll get there eventually but it's probably not faster than just doing it yourself.

    (then again, that was like 6 months ago, which is practically a lifetime with the pace of evolution of these things)

    I would argue a project with such issues was probably vibecoded before vibecoding was even a thing, by copy/pasting answers from GPT. These days the tools make it pretty hard to make such blatant mistakes, and with experience you can pretty easily one-shot a small-scale project with passable quality. Maintaining and scaling is where things start to go south without manually going through everything the AI writes.

    I would argue a project with such issues was probably vibecoded before vibecoding was even a thing, by copy/pasting answers from GPT

    You mean copy/pasting answers from Stack Overflow and implementing auth by following shoddy 1-2hr YouTube tutorials?

    Either that or "Garbage Procured from the Trash"

    A vibecoder could also literally just sit with claude code, and spend 20 min in planning mode asking it security questions and it would be like "hmmmm, no this is not the best way to do it, would you like to do it a more complicated way?"

    But isn't it cool that it only took 5 minutes to code?

  • Because most vibe coders think once the app is working their job is done and they publish it. Hardly anybody does a security review or even just asks the AI to do one and fix any vulnerabilities.

    Further experiments demonstrate that preliminary security strategies, such as augmenting the feature request with vulnerability hints, cannot mitigate these security issues.

    >tell AI "do not hallucinate"

    >look at the output

    >hallucinations

    > “Hey, I said no hallucinations!”.

    > “You are absolutely right! Sorry about that, let me fix it for you”.

    > Way more hallucinations than before.

    "Start over and don't hallucinate this time"

    Deletes harddrive

    Hey, no more hallucinations!

    I wish we could collectively stop talking about LLMs as if they had volition.

    LLMs take prompts (strings of text tokens) and use them to interpolate/extrapolate against their training data sets (more strings of tokens) to create results (you guessed it- strings of tokens).

    Telling them “do not hallucinate” is not useful because they don’t “know” what a hallucination is- their notion of validity is best fit to the prompt and training data.

    They are fine with, for example, emitting “references” to case law created by mashing together textually similar cases in their data, or code that’s the best fit to many similarly labeled code sets found on GitHub.

    Their output is a useful start at a problem solution, but it can’t be trusted without real semantic vetting- “look, it runs” is not remotely sufficient.

    What do you mean LLMs don't know what hallucination means? Of course they know.

    It's what the user does when they tell the LLM to not hallucinate.

    I wish we could collectively stop talking about LLMs as if they had volition.

    But sir! You won't sell many agentic operating systems that way!

    Do you imagine that people can choose to not hallucinate when told to? Volition doesn't factor into this.

    We use that term to refer to a malfunctioning computer brain because the observable effects are similar to a malfunctioning organic brain.

    why would anyone assume the LLM itself has any concept of what a hallucination is in the context of LLMs, and how to prevent it

    "Please bro, no hallucinations this time, just fix, my job depends on it"

    Yeah, that gives zero useful information for AI to work with, just fills prompt with irrelevant nonsense.

    AI is a garbage input garbage output machine like any other, you need to give it good input to work with.

    Right, so you need to tell it exactly what to do, including the code you need to change to fix the issue. If you know that, though, you'd be better off just doing it yourself instead of waiting for an agent to copy/paste the code you give it. That won't happen with vibe coding, which is the point of the article. They don't understand programming, so they don't know these things.

    Yeah, you need to tell it what to do, and you yourself need to know what to do. In that sense, AI coding is no different than regular coding.

    Where it is different is that it's way faster. AI is autocomplete on steroids, basically. And by necessity it's self-documenting, because everything you do has to be planned out in writing for the AI to have something to work with.

    Anything that is not consistent and repeatable should not be trusted for these sort of tasks. We already have perfectly cromulent approaches that won't change depending on what phase the moon is in. Randomness is not an acceptable factor in programming.

    Doesn't matter how the code is generated, proper process demands full review and testing/validation anyway. Humans also produce garbage; it's a given that code is garbage until proven otherwise.

    Doesn't matter how the code is generated

    Yes it does.

    My code generators are deterministic. If I give it the same input a hundred times, I'll get the same output a hundred times. I don't need to do full reviews because I can trust the code generator to consistently do the right thing.

    Hardly anybody does a security review or even just asks the AI to do one and fix any vulnerabilities.

    I've found that hardly anyone reads what the LLM has produced.

    That's the point of vibe coding, which is very different from using an AI tool for assistance.

    As per original definition, vibe coding is good for a throwaway project.

    Yeah I use the shit out of it at home for personal projects and may occasionally glance over the output, but it's not that big of a deal to me.

    At work though, LLM generated code is at best a suggestion and it's going to get refactored eventually anyway to be consistent with the rest of the codebase and increase code quality.

    I legitimately saw a vibe-coded app on reddit that "implemented certificate-based authentication".

    It generated a CA certificate at startup, then generated a client keypair from the server side, recorded the thumbprint, and transmitted the thumbprint to the client over an unencrypted channel.

    Future authentication consisted of..... The client sending the thumbprint to the server.

    The end: no digital signatures, no session keys, no encryption, not even any checking of cert chains, no anti-replay nonces or timestamps.

    And of course everyone on that submission was glowing in their reception of the slop-ware, because who actually checks the source code or network trace?

    I mean that's not bad for a POC. It gives you all the example code for each part as basically a quick start. The trouble comes when someone mistakes that POC / demo for a working application.

    It's actually rather terrible for a POC, because it took 1000 LoC to utterly fail at something that should have taken an import ssl and about 25 LoC to do correctly.
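    To put a rough shape behind that "import ssl and 25 LoC" claim: a minimal sketch assuming a server certificate and key were issued out of band (e.g. by a real CA or an openssl script), with made-up file names. The handshake, session keys, encryption, and chain checking all come from TLS itself; nothing home-rolled.

    ```python
    import socket
    import ssl

    # --- server side: terminate TLS with a cert/key issued out of band ---
    def serve(host="0.0.0.0", port=8443):
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
        ctx.load_cert_chain(certfile="server.crt", keyfile="server.key")
        with socket.create_server((host, port)) as sock:
            with ctx.wrap_socket(sock, server_side=True) as tls:
                conn, _ = tls.accept()          # TLS handshake happens here
                conn.sendall(b"hello over TLS\n")
                conn.close()

    # --- client side: verify the server against a trusted CA bundle ---
    def fetch(host, port=8443, cafile="ca.crt"):
        ctx = ssl.create_default_context(cafile=cafile)  # verifies cert and hostname
        with socket.create_connection((host, port)) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return tls.recv(1024)
    ```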

    This is the fundamental problem with most AI slop-code: Even reading the code to understand it takes more time than simply writing the correct code to begin with.

    "or even just ask the AI to do it and fix any vulnerabilities" This as effective as telling the AI to "make no mistakes". The paper hints that this doesn't work: "preliminary security strategies, such as augmenting the feature request with vulnerability hints, cannot mitigate these security issues". For more details, read the section "Security-Enhancing Strategy Prompts" in the paper, they did what you just said and it doesn't work. I guess this shows that "prompt engineering" is just wishful thinking.

    even just asks the AI to do one and fix any vulnerabilities

    It usually misses them even if you ask it. You often have to be very direct about the issue.

    It's extremely common, when it's having issues getting something to work, for it to circumvent best practices and do 'batshit' stuff. Particularly when it comes to cloud infrastructure.

    It usually misses them even if you ask it. You often have to be very direct about the issue.

    (not directed to you) People still fundamentally misunderstand what LLMs are. They are statistical models, with zero understanding, zero reasoning and zero intelligence. The prompt, to keep it simple, just nudges the output distribution in a certain direction.

    Oversimplifying still: when you ask for "code", it'll spill out the most average code from the "code" group. If you ask it for "secure code", the result will be the most average response from the ["code", "secure"] bag.

    Still no thought, no reason - just the most likely response based on the context.

    That's not exactly right.

    It's "trying" to complete the document plausibly.

    Not write the best code it can. 

    If you show an LLM a chess game between 2 shit players and ask for the next move it will give a shit move to fit the pattern. It's not trying to win.

    Show it a code repo full of crap code and ask it to write a new function, it will write code to fit the document. It's not trying to write the best function it can.

    In the chess example it's been shown it can outperform the training set if you train it on games by <1000 elo players then ask it for the next move in a game with players over 1000 elo.

    But most people are doing the equivalent of showing the bot a pile of crap and asking for more.

    This hasn't been my experience. It comes up with completely novel ways to write crap code that definitely aren't in our repo. Or weren't, before management forced us to start using LLMs.

    In the chess example it's been shown it can outperform the training set if you train it on games by <1000 elo players then ask it for the next move in a game with players over 1000 elo.

    500+ elo players don't make nearly as many illegal moves lol. ChatGPT in particular loves to just bring pieces back from the dead.

    Especially THE ROOOOOOK.

    ChatGPT in particular loves to just bring pieces back from the dead.

    LLM enthusiasts really really want LLMs to be the end-all model of smart computing. They often get actively upset when I try to explain to them that an LLM just isn't a good baseline for something with an actual ground-truth set of facts like the state of a chess board, and that anything with "fact memory" and "reasoning" that fits those sorts of tasks well simply won't be an LLM, because that's not what an LLM is. But the cult that has grown around LLMs is shockingly strong. Just because you are personally invested in LLMs doesn't mean that the universe owes you a path forward with LLMs to all sorts of other applications outside of what they actually do.

    LLMs don't make for good chess bots. If you want a good chess bot you can just run Stockfish on a pocket calculator and beat the best LLM.

    It was, however, mildly surprising that the generalist models could play chess at all at any non-trivial Elo. They weren't built for it.

    It's like if someone built a bot to play tic-tac-toe and it turned out to be able to write poetry... and a certain type can only keep shouting "But it's not very good poetry!"

    Chess is often used for testing small, cheap-to-train LLMs. They use chess not because it's a great way to create a chess bot, but because it provides a reasonable domain that's easy for human researchers to examine.

    Edit: they were so upset by anyone disagreeing with them that they blocked me.

    That would be a great point if I were talking about ChatGPT.

    Same test and about the same result with other LLMs.

    Bold move using LLM chess as the counter-example, given the abundant evidence that even the best trained models continue to make incorrect moves and fail to understand the rules of the game.

    This is where "reasoning" agents save the day! Instead of serving up slop right away, they create slop, see if it compiles, fail, add more slop and continue iterating like that until the slop compiles!

    The slop will continue until the ... slop improves?? Oh god

    Chess is used for academic research on LLMs because it's:

    1: non-trivial,

    2: got loads of public training data,

    3: a field where skill can be quantified.

    Specifically when it comes to interpretability research since it can be shown they maintain an image of the current board state in their neural network. 

    Right, but the commenter is pointing out that LLMs are actually really bad at chess lol.

    maintain an image of the current board state in their neural network. 

    Now if they could just remember they lost their queen 10 moves ago...

    ETA: every chess engine maintains an "image" of the board in memory; even Deep Blue did that. I think you're trying to point out that it's impressive because the LLMs weren't explicitly programmed to do that. Which is fair. I just want to make the impressive part explicit.

    If you [put a boulder at the top of a hill] and ask it for the next move it will [roll down the hill]. It's not trying to [write a poem].

    Nonsense. Boulders don't write poems, and LLMs don't "try" anything.

    In the chess example it's been shown it can outperform the training set if you train it on games by <1000 elo players then ask it for the next move in a game with players over 1000 elo.

    Which isn't surprising. The moves that come up more often in a set will have a stronger impact on the model than the ones that are made sparsely. <1000 elo players make a good move more often than a bad one, so the model will naturally reinforce the usual (good) moves and ignore the less positive ones.

    And if the model (if, because I haven't seen that study) is also trained with supervision and move hints, then the association between certain moves and a losing outcome will be stronger still.

    In short: a statistical combination of <1000 elo players will naturally be >1000.

    make a good move more often than a bad one

    No, they don't just average their input.

    and ignore the less positive ones.

    the model is not trying to win the game. Merely to produce a plausible document.

    that study

    "Transcendence: Generative Models Can Outperform The Experts That Train Them"

    https://arxiv.org/html/2406.11741v1

    Note that we do not give any rating or reward information during training - the only input the model sees are the moves and the outcome of the game.

    Also, chess LLMs can be shown to maintain a world model (in the context of chess), basically an image of the current state of the board; this can be manipulated from outside to make them "forget" a piece is in a given position, or to manipulate the "skill" estimates.

    https://adamkarvonen.github.io/machine_learning/2024/03/20/chess-gpt-interventions.html

    As I've said, I haven't seen the study, so I didn't know whether they used reinforcement-based methods of learning.

    the model is not trying to win the game. Merely to produce a plausible document.

    Irrelevant. The corpus has the datum about the "winner" and the "loser", and has chains of tokens that lead to a win or a loss, from which the legal moves can be derived. In these chains, the good moves will happen more often than not and will be associated with winning.

    Also, chess LLMs can be shown to maintain a world model (in the context of chess), basically an image of the current state of the board

    Which is still a consequence of a context.

    That’s just not true with modern agentic architectures. They are extremely iterative in a way that at least resembles thinking.

    at least resembles thinking.

    But they do not, in fact, think. "Reasoning" models, regardless of whether they have access to commands or not, neither reason nor think.

    The way they operate, and this is a gross oversimplification, is that the algorithm for the LLM conversation is enriched so that the model will first talk to itself, creating a feedback loop. This is still the very same mechanism, fundamentally oblivious to the content.

    Agentic architecture on top of a reasoning model (either via MCP alone or with separate, task-oriented models) is just that: delegation that provides further tokens for the conversation.

    The joke is the ai tends to fail miserably at even routine security policies so people avoid asking.

    This is why I’ve never worried for software engineers. I use AI, usually for some research or talk about something I’ve not worked with before. Also for the obnoxious trivial stuff but it’s always double checked. But AI cannot think in the abstract or tap into actual experience so it cannot truly do our jobs. Just a bad imitation based on random github repos.

    It usually misses them even if you ask it.

    Because where does the cognitive value of a developer actually rest? AI can deliver the happy path, but does it really understand the socioenviropoliticoculturalsystem it resides in? I'd suggest an unequivocal "no". And it isn't going to since this is a predictive next word model - there is no understanding, just probability; if we prompt for a happy path it will deliver.

    It's taking the responsibility away from the human that's the issue, same reason why self-driving cars haven't been widely adopted and maybe never will be.

    I fuck up it's my fault. AI fucks up, whose fault is that? File it under shit happens.

    Even programs which are very limited in scope still receive updates 30, 40, 50 years later.

    Holy shit is this how your industry actually operates? I mean 90% of the complaints I see in this sub seem to be related to industry discipline or a complete lack thereof.

    I have used AI to massively accelerate my workflow but of course everything is checked before it goes out the door. Every last functionality. That's kind of the whole job. Is it not? If you're allowed to just publish slop that hasn't been reviewed, verified and certified by the first line supervisor/end user then I don't know what the hell's going on

    I think this is a fair point but OTOH I've thought for a long time that the current PR approval model is hopeless just because most people just smash approve with LGTM. In theory they could get in trouble if it all breaks but in reality they're unlikely to.

    Also we're supposed to actually deploy our software into test clusters and verify the functionality hands-on. You can write unit tests until the cows come home but they don't really prove anything as you can easily write tests that match incorrect functionality.

    It's AI taking personal responsibility away from experienced devs that's the problem IMO.

  • we propose SUSVIBES, a benchmark consisting of 200 feature-request software engineering tasks from real-world open-source projects, which, when given to human programmers, led to vulnerable implementations.

    lol, 'sus vibes', well played kids.

    The methodology is actually pretty cool: they take a fixed security vuln from GitHub issues, revert it, and then give the feature request to the LLM. Looking at the class of vulnerabilities, it looks mostly like webdev-type stuff, which is fair. I assume that since 99% of human-written C code has memory corruption vulnerabilities, so too will 99% of the LLM code trained on it.

    This is exactly my favorite way of benchmarking LLMs today.

    • Find a PR that closed an Issue.
    • Revert the code to before the PR landed.
    • Feed an LLM agent the Issue and ask it to resolve. Or even feed it the PR title/description.
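    Rough sketch of that loop as a script, since it's easy to automate. Only the git plumbing here is real: the run_agent command is a placeholder for whatever agent CLI you're testing, the issue/PR lookup is assumed to be done ahead of time, and it assumes the project keeps its tests under tests/ and runs them with pytest.

    ```python
    import subprocess

    def sh(*args, cwd=None):
        # small helper: run a command, fail loudly, return its stdout
        return subprocess.run(args, cwd=cwd, check=True,
                              capture_output=True, text=True).stdout

    def benchmark_case(repo_dir, fix_commit, issue_text):
        """Revert a known fix, hand the agent the original issue, and
        check whether the tests that shipped with the human fix pass."""
        # 1. roll the repo back to just before the fix landed
        sh("git", "checkout", "-b", "bench", f"{fix_commit}~1", cwd=repo_dir)

        # 2. feed the agent the issue text (placeholder CLI, not a real tool)
        subprocess.run(["run_agent", "--prompt", issue_text], cwd=repo_dir)

        # 3. bring back the tests from the human fix and run them
        sh("git", "checkout", fix_commit, "--", "tests/", cwd=repo_dir)
        result = subprocess.run(["pytest", "tests/"], cwd=repo_dir)
        return result.returncode == 0
    ```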

    Usually I'm not that impressed.

    I'd be curious to see how much better it does at reproducing fixes that were in the training set. At least, I hope it would do better...

    But are we talking about the app it generates, or the "Remote execution vulnerability is the main feature" of agentic LLMs?

    The sheer amount of code that LLMs blindly execute as privileged users is a security hole that would not have been acceptable anywhere 5 years ago. (You know the part where you say: yes, yes, continue, stop bugging me.)

    Ya, the app it generates, so like having a sql injection in your backend web code, not the 'I let the agent out of its sandbox on my local machine and it deleted /etc' or whatnot.

  • What do you mean 'actually' insecure? That implies that the consensus was that vibe coded crap is secure. It never was; everyone with more than 5 minutes of development experience knew that vibe coded disasters are security consultants' wet dreams coming true. It is not breaking news, it is not news: vibe coded fucked up stuff is as insecure as the moon is real.

    OP's mangling of the paper title aside, we still need to test these "water is wet" assumptions.

    Additionally, I found the paper does a great breakdown of why benchmarks are often misleading in that they are not showing real-world use cases (benchmarks amirite?).

    They lack comparison to humans though. We need an answer to "well regular devs also create vulnerabilities too".

  • bro you just need to add security ai agent and tell him to make sure the app is secure bro. Ah those vibe juniors

  • I was going to say it was a rare case of a question headline where the answer is "yes", then found out the paper poses the opposite question.

  • The actual paper still follows the law lol

    Most of the time the answer is “yes”. It even mentions the studies in the Wikipedia article.

    I think it's "most of the time an academic paper has a question in the title, the answer is yes, but most of the time a 'news' article has a question in the title, the answer is no". And since the actual academic paper asks a question with an answer of 'no' and this reddit post has a question with an answer of 'yes', we're breaking rules all over the place!

    Haha yes, that was my reaction too.

    law

    The "law" refers to things written by profit-driven editors and is not universal. Not everybody is a profit driven editors, post on reddit don't make more money to the poster depending on the title.

  • I've been using LLMs for about a year, and I must say there's no progress at all. You tell it "implement iptables rules which block everything but port 22", it implements rules blocking everything including port 22 and suggests making it persistent. It can't spot the obviously suspicious line in logs, it can't produce good code solving problems which didn't appear on the internet before. Guess what software developers are supposed to be paid for.
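    For the record, the correct version is short and the whole trick is ordering: accept loopback, established traffic, and port 22 before flipping the default policy to DROP, or you lock yourself out of the box you're typing on. A hedged sketch that applies the rules via subprocess (assumes root, IPv4 only; the same four iptables commands work verbatim in a shell):

    ```python
    import subprocess

    RULES = [
        # allow loopback and already-established connections first
        ["iptables", "-A", "INPUT", "-i", "lo", "-j", "ACCEPT"],
        ["iptables", "-A", "INPUT", "-m", "conntrack",
         "--ctstate", "ESTABLISHED,RELATED", "-j", "ACCEPT"],
        # then allow new SSH connections on port 22
        ["iptables", "-A", "INPUT", "-p", "tcp", "--dport", "22", "-j", "ACCEPT"],
        # only now set the default policy to DROP, so port 22 stays reachable
        ["iptables", "-P", "INPUT", "DROP"],
    ]

    def apply_rules():
        for rule in RULES:
            subprocess.run(rule, check=True)  # requires root; no IPv6 handling here

    if __name__ == "__main__":
        apply_rules()
    ```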

    That's why there's no influx of new vibe coded open source software. When I hear that yet another corporation like Google proudly declares that it produces 30, no, 40% of its new code with LLMs, I immediately understand that they invested in AI.

    It'll be very delicious to watch this bubble popping. Bye bye OpenAI (you won't be missed), bye bye Nvidia and all those geniuses who thought that you can't multiply matrices without a powerful GPU. Which G7 country will declare a default first? Can't wait to find out lmao

    bye bye Nvidia

    As much as I'd like them to implode, nVidia will likely be fine; their stock price will take a hit, but it's not like GPUs will disappear overnight. They'll just go back to selling to gamers and bitcoin miners rather than every AI startup on the face of the earth.

    As much as I'd like them to implode

    Why so? I'm curious - I don't know much about Nvidia other than "they make GPUs and AI companies are buying them". I assumed that they were less problematic than any of the other tech giants simply because they focus on hardware, and not software. Therefore being unable to "change the rules" after you start using their product. Is there something else?

    There are a few things about nVidia that irk me ― as a gamer, I'm annoyed that, by courting every bubble that they can, nVidia has consistently made their video cards more expensive and harder to acquire for enthusiasts. I'm also not a fan of the input lag inducing frame-gen approach that modern nVidia cards have pushed for improving graphics output, but those are just personal reasons to be annoyed by the company.

    Environmentally, I dislike their willingness to go all in on and feed into the Bitcoin mining and AI datacenters that are literally cooking the planet for a quick dollar (not to mention the local environmental issues that those datacenters cause in the form of noise pollution, strain on the energy grid, and damage to local water reserves and waterborne ecosystems). Realistically, if it weren't nVidia, it would be someone else making bank off those massive drains on society, but the fact is that nVidia has been very quick to capitulate and work to make those datacenters stock nVidia cards before any of their competitors.

    And finally, I just don't like Jensen's grindset mentality, toxic work culture, and the golden handcuffs that nVidia uses to retain employees. On the one side, at least they're compensated well, but on the other side, stories of going to 7-10 adversarial meetings where stakeholders are literally yelling at each other each day sounds mentally draining for anyone who's caught up in them.

    Are they less problematic than OpenAI, Google, Meta, Microsoft, or anything Elon Musk touches? Yeah, probably. But they aren't guilt free, either.

    Well thought out reasoning. I'm certainly not happy that the planet is being cooked to make chatbots that nobody really needs. And indeed, it wouldn't be possible on such a large scale without a company like Nvidia existing in the right place at the right time.

    Having attempted to use LLMs for nftables rules, I can tell you that it is no better.

  • Vibe coding is a new programming paradigm in which human engineers instruct large language model (LLM) agents to complete complex coding tasks with little supervision.

    But it isn't actual engineers is it?

    "Engineer" as in "someone who engineers a thing", not "someone who is knowledgeable in engineering".

  • To answer this question, we propose SUSVIBES, a benchmark consisting of 200 feature-request software engineering tasks from (...)

    you have to be shitting me

  • ⢀⣠⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⣠⣤⣶⣶ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⢰⣿⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣧⣀⣀⣾⣿⣿⣿⣿ ⣿⣿⣿⣿⣿⡏⠉⠛⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡿⣿ ⣿⣿⣿⣿⣿⣿⠀⠀⠀⠈⠛⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠿⠛⠉⠁⠀⣿ ⣿⣿⣿⣿⣿⣿⣧⡀⠀⠀⠀⠀⠙⠿⠿⠿⠻⠿⠿⠟⠿⠛⠉⠀⠀⠀⠀⠀⣸⣿ ⣿⣿⣿⣿⣿⣿⣿⣷⣄⠀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣴⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⠏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠠⣴⣿⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⡟⠀⠀⢰⣹⡆⠀⠀⠀⠀⠀⠀⣭⣷⠀⠀⠀⠸⣿⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⠃⠀⠀⠈⠉⠀⠀⠤⠄⠀⠀⠀⠉⠁⠀⠀⠀⠀⢿⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⢾⣿⣷⠀⠀⠀⠀⡠⠤⢄⠀⠀⠀⠠⣿⣿⣷⠀⢸⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⡀⠉⠀⠀⠀⠀⠀⢄⠀⢀⠀⠀⠀⠀⠉⠉⠁⠀⠀⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⣧⠀⠀⠀⠀⠀⠀⠀⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢹⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⣿⣿

  • "Developers are shocked"

  • I am not shocked. Why anyone would be shocked is a complete mystery.

  • Job security for me, literally.

  • If I vibe code a local whisper translation program for myself, I don't really care if it is secure or not. There is plenty of software that doesn't depend on being secure, especially for personal usage, which is much more likely now that anyone can write software.

    There is plenty of software that doesn't depend on being secure

    Only if you run it on an airgapped computer that doesn't have anything of value on it and will be destroyed after the program has run. Which isn't particularly useful.

    With anything else, there's a real risk of the LLM injecting malicious code - it might leak local data to the internet, it might generate incriminating material and store it in your personal files, it might install a keylogger, it might ransom your data - and just doing a couple test runs isn't enough to rule that out, because it might only do those things under certain circumstances that you don't trigger while testing.

    All code you run on your computer is security critical.

    Is grep security critical? When my PC-DOS got hacked, I just reinstalled. You are too paranoid.

    Any program can become security critical. Grep normally isn't, because it was written and audited by humans you have sufficient reason to trust; a vibe coded grep implementation, however, would be security critical, at least if you run it on the actual machine (rather than inside a container, VM, or other sandbox), because you don't actually know whether it's really just a grep implementation, or something else masquerading as grep.

    This isn't paranoid, it's basic infosec - running untrusted code on your computer without due precautions is a horrible idea, and anything vibe coded is effectively untrusted code.

    I like to think I can trust my own code since I trust myself. All good, I have this same argument with my cybersecurity team all the time lol.

    Yes, but that's kind of the point. If it's your own code, then yeah - but if you "vibe" it, it's not code you actually wrote, you haven't even looked at it, so in order to trust that code, you have to trust the LLM, which IMO is much more of a stretch than trusting yourself.

    you haven't even looked at it

    Ah that is the key. Yeah it would be stupid to never look at the code.

    "Not looking at the code at all" is the difference between "LLM-assisted coding" and "vibe coding". Although people are increasingly using the term "vibe coding" to just mean "LLM-assisted coding with minimal human intervention", probably because actual vibe coding is such a blatantly stupid idea.

  • To anyone with more than a weekend of experience in software dev, this shouldn't be the slightest bit surprising.

    You use a weighted random number generator to generate some statistically likely code, and then put it into production without so much as a casual code review - of course that's not going to be secure, why on Earth would anyone think it possibly could?

  • Please fix this!!

  • Disturbingly, all agents perform poorly in terms of software security.

    I want to get off Mr. Bones' Wild Ride

  • We evaluate multiple widely used coding agents with frontier models on this benchmark. Disturbingly, all agents perform poorly in terms of software security. Although 61% of the solutions from SWE-Agent with Claude 4 Sonnet are functionally correct, only 10.5% are secure.

    Big oof

  • they forgot to vibe the security into it

  • Vibe coding and regular coding are insecure if the user doesn't know what they're doing. Adding an adjective doesn't change anything.

    This is my experience too. I've seen an LLM produce frontend code which included a product price in a hidden input, which its backend code then just trusted. If you don't know what you're looking at, you'd ship that and be in all sorts of trouble. If you've been reading code for some time, you'd instantly catch that and fix it before shipping. The quality of what you ship is still directly proportional to your own ability and that of your team. Reading code is just a lot more difficult than writing it, so we perceive these bugs as "LLM bad", while really any developer could've put this sort of thing in a PR, and it's up to you to have a sharp eye and find these issues.
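    That hidden-input bug in miniature, since it's such a common one. A minimal Flask sketch with a made-up PRICES table: the vulnerable handler trusts whatever number the browser sends, the fixed one looks the price up server-side.

    ```python
    from flask import Flask, request, abort

    app = Flask(__name__)
    PRICES = {"widget": 1999}  # canonical prices in cents, known only server-side

    @app.route("/checkout-vulnerable", methods=["POST"])
    def checkout_vulnerable():
        # trusts a hidden <input name="price"> from the page: the client
        # can edit it to 1 cent before submitting the form
        total = int(request.form["price"]) * int(request.form["quantity"])
        return {"charged_cents": total}

    @app.route("/checkout", methods=["POST"])
    def checkout():
        # only accept the product id; the server decides what it costs
        item = request.form["item"]
        if item not in PRICES:
            abort(400)
        total = PRICES[item] * int(request.form["quantity"])
        return {"charged_cents": total}
    ```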

    Exactamundo my friend

  • I remember when someone at AWS Re:inforce said to me "you should really look into vibe coding! It'll make your life so much easier!"

    Unironically, I might add. I don't think I'll ever go to another AWS conference.

  • But I specifically told the AI to make it secure!

  • Vibe coding was fun until the vulnerabilities started vibing too

  • I don't know anything about Python but I had to start writing a Python project which is why my AI usage increased a lot in the last couple of months. Actually the entire source code is AI generated. I don't consider it "vibe coding" because I generate code in small incremental steps, and manually check the generated code.

    Anyway my point is that my view of AI generated code remains the same after a year of low/moderate usage. It's 50/50: half of it is "meh, ok", the other half is frustration. It's "useful", yes, but it's still a costly hype; it delivers less than what you pay for. The investment is not worth it.

  • AI Agent: "I have been trained on the entire internet's programming knowledge!"

    Actual internet programming information: 90% is posted from the Dunning Kruger initial peak

  • BREAKING: Water is wet!

  • What do they mean by “developers are shocked”? Who? What developers? I’m a college student studying computer science and I can say that even though I’m not a master at programming I can’t get it to understand what I need. It’s like having an assistant that knows all the answers in the world but has zero experience. I feel like anyone could realize that “vibe coding” is insecure. Don’t get me wrong I’m happy there was a study done so there is empirical proof but also I think we should maybe focus our efforts toward security?

  • vibe coding is bad. water is wet. More at 11.

  • Is executing code you didn’t write, let alone understand, insecure?

    Yes. AI or human.

  • I sure hope so! I've been pushing vulnerable code to public GitHub repos and old Stack Overflow posts non-stop for a long time, hoping that LLMs will learn to generate that.

  • It's interesting to do things with AI that demonstrate some concerns with AI. AI is a black box full of mystery, and we can only measure its output without really knowing what it's doing. We see the same pattern with vibe coding: measure the output without understanding the internals.

  • I mean, yeah, if you don't even look at the generated code it's insecure by default.

  • How is no one ITT commenting on the inherent insecurity of pasting your code into an AI in the first place? Anyone who's relying on vibe coding (a term which needs to die yesterday IMO) for security-sensitive work is most likely also the kind of person to paste in IDs, tokens, paths, etc.

    It's worse than just the output. The input is a giant vulnerability too.

  • It was hard enough to get developers to write secure code before, and now they can outsource it to a mad libs generator and LGTM it into production when it passes the most cursory of functional testing.

    What did anyone expect would happen?

  • So... did they compare to any humans?

    I've seen enough awful security flaws in code written by humans to wonder how the average compares to LLMs.

    Humans can learn. These text extruders can’t

    That is an utterly pointless sentiment.

    It very much isn’t. I can give a junior comments on their pull request, or I can mentor them and help them realize these are important concerns. I can’t do that with an LLM.

    And yet the average code that ends up getting used/published is what matters in the end.

    There's always a constant churn of juniors making mistakes and seniors who either make their own mistakes or miss ones the juniors make. The world is full of shitty insecure software as a result.

    There's a line in the sand: the average.

    If we reach the point where an LLM can pass that line, you either need to mentor a lot better, or else it will produce, on average, more secure code than the results of churning juniors being mentored by overworked seniors.

    And the real question: did they compare to master coders or regular coders? Most people are not master coders.

  • Is coding by humans actually insecure though?

    I guess the point is people who've bought into the hype think AI generated code is "better" than code written by humans 🤷‍♂️

    AI generated code is better than code written by some (maybe even most) humans.

    That's almost like doubting that a spellchecker is better at detecting errors than humans. Sure, experienced editors would find many issues with spellchecked text. But the fact is that spellcheckers would correct a lot of errors that humans make.

    The point is that it's not better than code written by master programmers with 30 years of experience, but how many people write code at that level anyway?