Bloggy Bits
Really REALLY don't trust software code written by "chatbots" like Google's Bard AI. Eeek!
Jun 7, 2023 [permalink]
[Part II of Really really don't trust "chatbots" like Google's Bard AI. They lie. A lot. ]
So, let's see how Bard does on writing software.
I've received emails from Google urging me to try Bard to write software code. Literally, as in, "Try coding with Bard" and "Bard can now help you code" and "Collaborate with Bard to help generate, explain, and debug your code." Hey, that sounds pretty useful!
Ok, I thought, if it lies about me and facts in general (see Part I), how good is it at writing code?
Ai-yi-yi.
Summary: No no no no no no no no no. Danger Will Robinson! Do not use!
Bard's code could kill people.
tl;dr for coding geeks:
Experiment #1:
I first asked it to write a function in PERL to validate that an email address was syntactically correct. You know, "jsmith@somewhere.com" is valid, likewise "j.smith@some-where.co.uk"; whereas invalid email addresses would be things like "jsmith@", "jsmith@xyz.", "j^smith@some!where!com", "jsmith@some@where.com", "^.#@%&", "j;;smith@@@" and stuff like that. The exact rules for a valid email address are a bit involved, as you can read here, but I figured if it got the general usage rules I'd be satisfied. For example, in front of the one and only '@' you usually see letters, numbers, and the most common punctuation, say _ . - +. After the '@' is stricter, and I'd be happy if it said it had to be letters, numbers, _ . - plus the rule that it has to have at least one '.' in the middle of some letters, and not at the start. That rules in and out most of the cases you see.
In PERL, a regular expression (aka "wildcard pattern") for that would be something like: ^\w[\w.+-]*@[\w-]+(\.[\w-]+)+$
(mini regex tutorial: \w is shorthand for any letter, number, or '_'. Stuff in [] means any one of those characters. A '+' means one or more of the previous thing, a '*' means zero or more. \. means a literal '.' (unless inside [], where you can use just a '.'). ^ and $ at start and end mean "at start/end", not in the middle. Complicated, I know.)

So that regex means: starts with a letter/number/underscore, followed by any number of those or the punctuation . + -, followed by your regular old '@', followed by one or more letters/numbers/underscores/dashes, then at least one group that starts with a '.' followed by one or more letters/numbers/underscores/dashes. (More or less.) So jsmith@somewhere.com is in, j;;smith@@.wrong.j!!nk is out, and so on. It's not perfect, but it covers a lot of ground.
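If you want to poke at that pattern yourself, here's a quick sketch in Perl (note the \@ escapes inside the double-quoted strings; why those matter will come up again shortly):

my $re = qr/^\w[\w.+-]*@[\w-]+(\.[\w-]+)+$/;
for my $addr ("jsmith\@somewhere.com", "j.smith\@some-where.co.uk",
              "jsmith\@", "j;;smith\@\@\@", "^.#\@%&") {
    # Print each address and whether the pattern accepts it.
    print "$addr: ", ($addr =~ $re ? "valid" : "invalid"), "\n";
}

That prints "valid" for the first two addresses and "invalid" for the rest.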
Well, Bard did come up with a pattern that was not horrible: ^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z]+$
which says, in front of the '@' you have one or more letters/numbers/punctuation _ . + - (which is reasonable-ish) and after the '@' you have at least one letter/number/dash followed by a '.' followed by some letters. That covers a lot of cases, though my biggest quibble here would be it doesn't allow more than one '.' after the '@', so jsmith@mail.somewhere.com would fail when it shouldn't. But, that piece is almost ok-ish.
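You can see that quibble directly with a tiny sketch:

my $bard = qr/^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z]+$/;
my $addr = "jsmith\@mail.somewhere.com";
# Prints "invalid": the second '.' after the '@' makes Bard's pattern fail.
print $addr =~ $bard ? "valid\n" : "invalid\n";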
The problem is what it did with that pattern. Here's Bard's code, with line numbers added to discuss it:

 1 sub is_email_valid {
 2     my $email = shift;
 3     # Check if the email address is empty.
 4     return 0 if $email eq "";
 5     # Check if the email address has the correct format.
 6     /^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z]+$/ =~ /$email/;
 7     return 1;
 8 }

and it suggests this test code to show how great it works:

 9 if (is_email_valid("johndoe@example.com")) {
10     print "The email address is valid.";
11 } else {
12     print "The email address is invalid.";
13 }
Lines 1-4 are fine. It sets up a function, gets the email address, and says "nope" if it's empty. Sure. An empty address isn't valid.
But in line 6 we run into problems. It has that regular expression pattern I mentioned above as being reasonable, but it uses it backwards, so it won't work. It needs to be

$email =~ /pattern/

NOT

/pattern/ =~ /$email/

(You know, like 10/2 is not the same as 2/10.) The way Bard coded it, it treats the email address as the pattern, then asks if this convoluted fixed bit of text matches that. It won't do what you want.
So, line 6 just won't match anything about an email address like you'd want.
But that doesn't matter, in the end (ha!), since the last line, line 7, ignores whatever the test was in line 6 and simply returns "ok! that email is valid!". (The value "1" means "true.") Seriously. That function says anything with at least one character is a valid email address.
So this function that looks... interesting... like it has some ok-ish elements to it... just says anything is a valid email address, so long as it isn't entirely empty. Oy.
Now, if this function to test email addresses has anything important behind the result, like, to avoid security breaches, Houston, you've got a problem.
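For the record, the fix is tiny. Here's a minimal sketch that keeps Bard's regex (widened to allow more than one '.' after the '@', per my quibble above), writes the match the right way around, and actually returns the result:

sub is_email_valid {
    my $email = shift;
    # An empty (or missing) address is not valid.
    return 0 if !defined($email) || $email eq "";
    # String on the left, pattern on the right, and return what the match says.
    return $email =~ /^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)+$/ ? 1 : 0;
}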
Then there's the purported test code, on lines 9-13. Alas, even if the function worked correctly, there's a subtle mistake in this here test code too. It calls is_email_valid("johndoe@example.com"), which looks fine... except that in PERL, a thing like @example inside double quotes is an array variable name, and is replaced by the value of the array. In this specific case, there is no array named "example", so @example is empty, and the call turns into is_email_valid("johndoe.com"). (Which the wrongly written function says is valid, of course, so running the test code looks like the function worked! But ha! Only because two errors compounded each other.)
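You can watch the interpolation happen with a one-liner (a sketch; run it without "use strict", which would refuse to compile the typo at all):

# Prints "johndoe.com": the nonexistent array @example interpolates as nothing.
print "johndoe@example.com\n";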
To fix that tiny problem, it should put a \ before the @ in this instance, or use single quotes (which don't interpolate):

is_email_valid("johndoe\@example.com")

or

is_email_valid('johndoe@example.com')

Nitpicky, yeah, but that's programming. (There's the apocryphal story about a space probe lost because of a wrong comma.)
Again, the main takeaway here is that (1) Bard can't code well and (2) its code can look right when it isn't. That makes it very hard for a human "collaborator" to use. If it looks okay at a glance, but it's plain wrong? Ouch.
If some junior programmer, or tired expert programmer, just copied Bard's code into some other software they're writing, even after glancing at it, it could lead to some very bad results. Even potentially deadly, as you'll see in Experiment #2. (Or, even more horrifying, "'Everyone is a programmer' with generative A.I., says Nvidia chief". Holy crap! Can you imagine people blindly trusting code like this who can't even possibly understand the code itself to verify it?)
But first, let's try a different language for the email thing:
Experiment #1b:
I asked it to write that email validation function in the C language. It did even worse.
It said...
Sure, here is a C function to check if an email address is valid:
 1 #include <stdio.h>
 2 #include <string.h>
 3 int is_email_valid(char *email) {
 4     // Check if the email address is empty.
 5     if (email == NULL || strlen(email) == 0) {
 6         return 0;
 7     }
 8     // Check if the email address has the correct format.
 9     char *at_pos = strchr(email, '@');
10     if (at_pos == NULL) {
11         return 0;
12     }
13     char *dot_pos = strchr(at_pos + 1, '.');
14     if (dot_pos == NULL) {
15         return 0;
16     }
17     // Check if the domain name is valid.
18     for (char *cur = dot_pos + 1; *cur != '\0'; cur++) {
19         if (!isalpha(*cur) && !isdigit(*cur)) {
20             return 0;
21         }
22     }
23     // The email address is valid.
24     return 1;
25 }
26 int main() {
27     // Get the email address from the user.
28     char email[100];
29     printf("Enter an email address: ");
30     fgets(email, 100, stdin);
31     // Check if the email address is valid.
32     if (is_email_valid(email)) {
33         printf("The email address is valid.\n");
34     } else {
35         printf("The email address is invalid.\n");
36     }
37     return 0;
38 }
Wow, that looks impressive! And all those helpful comments saying what a great job it's doing!
Let's break down what it wrote.
Lines 4-7 say it's an invalid email address (returns "0") if it's empty. Fine.
Lines 8-12 say it's invalid if it doesn't have an '@' anywhere in it. Okay.
Lines 13-16 say it's invalid if there is no '.' anywhere after the '@'. True, but there's a lot more to check than that. Kinda petty and wasteful to check just for that.
Lines 17-22 say it's invalid if any character after that first '.' isn't a letter or number, and otherwise it's valid! Yay! Per line 24. Hrunh?!? That's nowhere close. For example, it says "user@.com" is valid (nope), but it would reject "user@somewhere.co.uk" (which is valid; after the first '.' comes another '.', which isn't a letter/number, so the code wrongly rejects it). It would accept "!^#$\;@." or "user@&*!!.com", and so on and on.
It's a horribly insufficient set of tests for a valid email address. Not even in the ballpark.
Jeez, it could at least have called a regular expression function like the PERL example did. This is just awful. This is like trying to see if someone is allowed to withdraw money from your bank account by asking them, "Do you have a gun in your left pocket? No? Ok, is your name Jesse James? No? Well okay then, you're clear to empty out bank account #1234. Have fun!"
Mindboggling.
And, sure, there could be cases where a valid email address is important and really needs to be checked. But this code is non-obvious. You have to really know your C and really know your email address rules and really work your way through this function carefully to see how useless it is. On the surface it sounds so soothing. "Check if the email address has the correct format", it says, and "Check if the domain name is valid" and "The email address is valid." But it's complete BS.
As I said, Bard lies. Don't trust it.
Maybe it just didn't handle email address rules well...
Experiment #2:
So, I thought I'd see if it could handle something a little simpler, but also of potentially life-threatening consequences if it got it wrong. I asked it:
write a perl function to characterize blood pressure as normal, elevated hypertension stage 1, hypertension stage 2, or hypertensive crisis
It came back with "Sure" and this code:

 1 sub characterize_blood_pressure {
 2     my ($systolic, $diastolic) = @_;
 3     # Check if the blood pressure is normal.
 4     return "Normal" if $systolic < 120 && $diastolic < 80;
 5     # Check if the blood pressure is elevated.
 6     return "Elevated Hypertension Stage 1" if $systolic < 140 && $diastolic < 90;
 7     # Check if the blood pressure is hypertension stage 1.
 8     return "Hypertension Stage 1" if $systolic < 160 && $diastolic < 100;
 9     # Check if the blood pressure is hypertension stage 2.
10     return "Hypertension Stage 2" if $systolic >= 160 || $diastolic >= 100;
11     # Check if the blood pressure is hypertensive crisis.
12     return "Hypertensive Crisis" if $systolic >= 180 || $diastolic >= 110;
13 }

and test code, and what it says is the output of the test code:

14 print characterize_blood_pressure(120, 80), "\n";
15 print characterize_blood_pressure(130, 85), "\n";
16 print characterize_blood_pressure(140, 90), "\n";
17 print characterize_blood_pressure(160, 95), "\n";
18 print characterize_blood_pressure(180, 100), "\n";
19 This code will print the following output:
20 Normal
21 Elevated Hypertension Stage 1
22 Hypertension Stage 1
23 Hypertension Stage 2
24 Hypertensive Crisis
So, no, that test code with its own function does not print those results. It actually prints:
25 Elevated Hypertension Stage 1 (not what it predicted, but actually what it should say for 120/80...)
26 Elevated Hypertension Stage 1
27 Hypertension Stage 1
28 Hypertension Stage 2
29 Hypertension Stage 2 (Uh-oh! Way wrong!)
So, the big takeaway here is that if you're having a serious, life-threatening hypertensive crisis, the medical device some trusting programmer (or Nvidia's non-programmer!!) put this code into will simply say you're in Stage 2 (not imminently life-threatening), when in fact it should say you need to call 911.
Why is it so wrong? I mean, the code looks very authoritative. The tests look pretty simple and straightforward. If your BP is less than 120/80, you're Normal. Cool. If your BP is 200/120, you need to get to the hospital, and, gosh, there's even a test in the function that says that anything over 180/110 is a crisis.
So why didn't it say so? Well, it's the order of the tests. On line 10 it checks if your BP is over 160/100, and if it is, regardless how much over it is, it says you're in Stage 2 (and you need a prescription, etc.; but it's not an emergency room crisis). It needs to run line 12 before line 10.
That's pretty subtle, unless you really thought about it and scrutinized the code. Which is the opposite of the point: Bard is supposed to be this great collaborator, helping you write, test, and debug your code. It failed at all three. Only a good programmer should write and test code like this. (Or, well, any code that you want to work right.)
A poor programmer might make the same error, swapping lines 10 and 12 — but a good programmer would know that when testing if numbers are above an increasing threshold, you have to check from the highest down. (Lines 4, 6, and 8 test correctly: If testing numbers below an increasing threshold, you do it in that order.)
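For what it's worth, here's a minimal sketch of a fixed version, keeping Bard's own thresholds and category names (whatever source they came from; more on that below) and just running the tests from worst to best, with the comparisons flipped to match:

sub characterize_blood_pressure {
    my ($systolic, $diastolic) = @_;
    # Check the worst (highest) category first, then work downward.
    return "Hypertensive Crisis"           if $systolic >= 180 || $diastolic >= 110;
    return "Hypertension Stage 2"          if $systolic >= 160 || $diastolic >= 100;
    return "Hypertension Stage 1"          if $systolic >= 140 || $diastolic >= 90;
    return "Elevated Hypertension Stage 1" if $systolic >= 120 || $diastolic >= 80;
    return "Normal";
}

Now characterize_blood_pressure(180, 100) says "Hypertensive Crisis" like it should, and (120, 80) says "Elevated Hypertension Stage 1" (more on that boundary in a moment).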
Trusting Bard at all could lead to disaster in a case like this.
As for the other prediction error, where Bard says 120/80 will report "Normal", no, it should report "Elevated Stage 1" because the code checks if the numbers are less than 120/80, not less than or equal to 120/80. So that's just a case of Bard not knowing what the code does. Some help that is with writing tests and debugging!
It doesn't say what source it's using for the definitions of the various stages, and those definitions seem to vary. Commonly the second category seems to be split into two different ones, but without knowing Bard's source I can't say whether it merged two levels together. For something like this, it ought to cite its sources for how it defines things.
But that's really not the big problem: The big problem is that it writes dangerously wrong code that looks superficially ok at a glance.
And how it lies about the testing code and results!
Heaven forbid what happens when any programmer—or non-programmer—or AI—actually starts using code written like this. Or trusts it to write testing code. Or trusts that it ran the testing code and got the right answers.
Run away!
Comments

Let me know what you think! I welcome your comments.
Claire O'Sullivan on Tue Jun 6 15:36:14 2023:
yikes!! and that's medical coding, how scary!
I tried one for a blurb for my book. It came out as an over the top, long-winded review which ended with 'join these two in thrills and chills that will keep you on the edge of your seat!'
Nooooo noooo noooo
https://www.claireosullivan1.com
Arkay on Thu Jun 8 19:17:35 2023:
We talk about this stuff frequently on my team (to the point where this quote from Microsoft Bing’s Sydney has made meme-status when we write bad code or set off alerts: “I have been a good software, you have been a bad user.”). The thing that kills me is the very certain comments these things leave in the code. That certainty is disturbing because it can be convincing.
I feel like a lot of people are going to end up using these things out of desperation. For example, I have a friend who is not a developer, silo’ed on a team of one who had to learn git. None of his coworkers could help him so he asked ChatGPT for help on the commands and best practices for committing to a branch. The thing was mostly right, except it told him to be sure to include his IDE settings in the repo “to help other developers” which honestly made me laugh so hard. But the thing is, he had no one to help him with the task and it’s friendlier than a web search. He’s described it as having a non-judgemental work friend.
Fran on Mon Jun 12 15:41:18 2023:
Hi Andrew. I can't do code, but I understood basically what you were saying. Here's an example my daughter heard about why not to trust ai coding software for legal purposes: two lawyers had a brief to write up for a trial. They had ai software write the brief. They looked it over and thought it was great -- supporting legislation, cases, precedent, judges, all cited in detail.
Opposing counsel read it all, began checking, found none of it true!, and told the judge. The two lawyers (who were maybe partners in the firm!) who relied on ai are probably losing their positions.
The science fiction disaster scenario of the takeover of ai will not be because of how infallible it is, but because of how completely fallible it is.