I pitted Claude 3.5 Sonnet against AI coding tests ChatGPT aced – and it failed creatively

Last week, I got an email from Anthropic saying that Claude 3.5 Sonnet was available. According to the AI company, “Claude 3.5 Sonnet raises the industry bar for intelligence, outperforming competitor models and Claude 3 Opus on a wide range of evaluations.”

The company added: “Claude 3.5 Sonnet is ideal for complex tasks like code generation.” I decided to see if that was true.

I’ll subject the new Claude 3.5 Sonnet model to my standard set of coding tests, tests I’ve run against a wide range of AIs with a wide range of results. Want to follow along with your own tests? Point your browser to How I test an AI chatbot’s coding ability – and you can too, which contains all the standard tests I apply, explanations of how they work, and what to look for in the results.

OK, let’s dig into the results of each test and see how they compare to previous tests using Microsoft Copilot, Meta AI, Meta Code Llama, Google Gemini Advanced, and ChatGPT.

1. Writing a WordPress plugin

At first, this test looked like it had a lot of promise. Let’s start with the user interface Claude 3.5 Sonnet created based on my test prompt.

This is the first time an AI has decided to put the two data fields side by side. The layout is clean and looks great.

Claude also decided to do something else I’ve never seen an AI do. This plugin could be created using just PHP code, which is the code running on the back end of a WordPress server.

But some AI implementations have also added JavaScript code (which runs in the browser to control dynamic user interface features) and CSS code (which controls how the browser displays information).

In a PHP environment, if you need PHP, JavaScript, and CSS, you can either include the CSS and JavaScript right in the PHP code (that’s a feature of PHP), or you can put the code in three separate files: one for PHP, one for JavaScript, and one for CSS.
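
To make the inline option concrete, here’s a minimal sketch of a WordPress plugin that prints its CSS and JavaScript straight from PHP. This is my own illustration, not Claude’s output; the plugin name, selector, and messages are hypothetical:

    <?php
    /*
    Plugin Name: Inline Assets Demo (hypothetical example)
    */

    // Print the CSS straight from PHP instead of shipping a .css file.
    add_action( 'wp_head', function () {
        echo '<style>.demo-field { display: inline-block; }</style>';
    } );

    // Print the JavaScript straight from PHP instead of shipping a .js file.
    add_action( 'wp_footer', function () {
        echo '<script>console.log("demo plugin loaded");</script>';
    } );

Everything lives in a single PHP file, which is exactly why an AI can hand you one block of code instead of three.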

Usually, when an AI wants to use all three languages, it shows what needs to be cut and pasted into the PHP file, then another block to be cut and pasted into a JavaScript file, and then a third block to be cut and pasted into a CSS file.

But Claude just provided one PHP file and then, when it ran, auto-generated the JavaScript and CSS files into the plugin’s home directory. That’s both fairly impressive and somewhat wrong-headed. It’s cool that it tried to make the plugin creation process easier, but whether or not a plugin can write to its own folder depends on how the OS and file permissions are configured, and there’s a very high chance it could fail.

I allowed it in my testing environment, but I would never permit a plugin to rewrite its own code in a production environment. That’s a very serious security flaw.
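
For illustration, here’s roughly what that self-writing pattern looks like, and why it’s fragile. This is a hypothetical sketch, not Claude’s actual code; the file name is made up:

    <?php
    // Hypothetical sketch: a plugin generating its own JavaScript file at runtime.
    register_activation_hook( __FILE__, function () {
        $dir = plugin_dir_path( __FILE__ );

        // On a hardened server, the web server user often can't write
        // to the plugin directory, so the auto-generation silently fails.
        if ( ! is_writable( $dir ) ) {
            return;
        }

        file_put_contents( $dir . 'generated.js', 'console.log("written at runtime");' );
    } );

Any folder a plugin can write to at runtime is also a folder an attacker who compromises that plugin can write to, which is why hardened WordPress installs typically keep plugin directories read-only.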

Despite the fairly creative nature of Claude’s code-generation solution, the bottom line is that the plugin failed. Pressing the Randomize button does absolutely nothing. That’s sad because, as I said, it had so much promise.

Here are the aggregate results of this and previous tests:

  • Claude 3.5 Sonnet: Interface: good, functionality: fail
  • ChatGPT GPT-4o: Interface: good, functionality: good
  • Microsoft Copilot: Interface: adequate, functionality: fail
  • Meta AI: Interface: adequate, functionality: fail
  • Meta Code Llama: Complete failure
  • Google Gemini Advanced: Interface: good, functionality: fail
  • ChatGPT 4: Interface: good, functionality: good
  • ChatGPT 3.5: Interface: good, functionality: good

2. Rewriting a string function

This test is designed to evaluate how well the AI rewrites code to work more appropriately for a given need; in this case, dollars and cents conversions.

The Claude 3.5 Sonnet revision properly removed leading zeros, making sure that entries like “000123” are treated as “123”. It properly allows integers and decimals with up to two decimal places (which is the key fix the prompt asked for). It prevents negative values. And it’s smart enough to return “0” for any weird or unexpected input, which prevents the code from abnormally terminating with an error.

One failure is that it won’t allow decimal values alone to be entered. So if the user entered 50 cents as “.50” instead of “0.50”, it would fail the entry. Based on how the original text description for the test is written, it should have allowed this input form.
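
For comparison, a validation routine along these lines would accept the bare-decimal form too. This is my own minimal PHP sketch under the constraints described above, not Claude’s revision and not my original test function:

    <?php
    // Minimal sketch: normalize a currency string, allowing ".50" as well as "0.50".
    function normalize_currency( string $input ): string {
        $input = trim( $input );

        // Integer part is optional, so ".50" matches; up to two decimal places.
        if ( $input === '' || ! preg_match( '/^\d*(\.\d{1,2})?$/', $input ) ) {
            return '0'; // Reject negatives and anything unexpected.
        }

        // Strip leading zeros ("000123" becomes "123"), restoring a zero
        // in front of a bare decimal point.
        $value = ltrim( $input, '0' );
        if ( $value === '' ) {
            return '0';
        }
        if ( $value[0] === '.' ) {
            $value = '0' . $value;
        }
        return $value;
    }

With this pattern, “000123” normalizes to “123”, “.50” becomes “0.50”, and anything negative or malformed comes back as “0”, which matches the behavior described above.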

See also  I bought the cheapest Surface Pro Copilot+ PC - here are my 3 takeaways as a Windows expert

Although most of the revised code worked, I have to count this as a fail because, if the code were pasted into a production project, users wouldn’t be able to enter inputs that contained only values for cents.

Here are the aggregate results of this and previous tests:

  • Claude 3.5 Sonnet: Failed
  • ChatGPT GPT-4o: Succeeded
  • Microsoft Copilot: Failed
  • Meta AI: Failed
  • Meta Code Llama: Succeeded
  • Google Gemini Advanced: Failed
  • ChatGPT 4: Succeeded
  • ChatGPT 3.5: Succeeded

3. Finding an annoying bug

The big challenge of this test is that the AI is tasked with finding a bug that isn’t obvious and, to solve correctly, requires knowledge of the WordPress platform. It’s also a bug I didn’t immediately see on my own and, initially, asked ChatGPT to solve (which it did).

Claude not only got this right, catching the subtlety of the error and correcting it, but it was also the first AI since I published the full set of tests online to catch the fact that the publishing process introduced an error into the sample query (which I subsequently fixed and republished).

Here are the aggregate results of this and previous tests:

  • Claude 3.5 Sonnet: Succeeded
  • ChatGPT GPT-4o: Succeeded
  • Microsoft Copilot: Failed. Spectacularly. Enthusiastically. Emojically.
  • Meta AI: Succeeded
  • Meta Code Llama: Failed
  • Google Gemini Advanced: Failed
  • ChatGPT 4: Succeeded
  • ChatGPT 3.5: Succeeded

So far, we’re at two fails out of three. Let’s move on to our last test.

4. Writing a script

This test is designed to see how far the AI’s programming knowledge extends into specialized programming tools. While AppleScript is fairly common for scripting on Macs, Keyboard Maestro is a commercial utility sold by a lone programmer in Australia. I find it indispensable, but it’s just one of many such apps on the Mac.

However, when I ran this test against ChatGPT, it knew how to “speak” Keyboard Maestro as well as AppleScript, which shows how broad its programming-language knowledge is.

Unfortunately, Claude doesn’t have that knowledge. It did write an AppleScript that tried to talk to Chrome (that’s part of the test parameter), but it ignored the essential Keyboard Maestro component.

Worse, it generated AppleScript code that produces a runtime error. In an attempt to ignore case for the match in the test, Claude generated the line:

if theTab's title contains input ignoring case then

That’s pretty much a double error, because the “contains” operator is already case-insensitive and the phrase “ignoring case” doesn’t belong where it was placed. (In AppleScript, “ignoring case” is a block construct that wraps statements, as in “ignoring case … end ignoring”; it can’t be tacked onto the end of an expression.) It caused the script to error out with an “Ignoring can’t go after this” syntax error message.

Here are the aggregate results of this and previous tests:

  • Claude 3.5 Sonnet: Failed
  • ChatGPT GPT-4o: Succeeded, but with reservations
  • Microsoft Copilot: Failed
  • Meta AI: Failed
  • Meta Code Llama: Failed
  • Google Gemini Advanced: Succeeded
  • ChatGPT 4: Succeeded
  • ChatGPT 3.5: Failed

Overall results

Here are the overall results of the four tests:

I was somewhat bummed about Claude 3.5 Sonnet. The company specifically promised that this version was suited to programming. But as you can see, not so much. It’s not that it can’t program. It just can’t program correctly.

I keep looking for an AI that can best the ChatGPT offerings, especially as platform and programming-environment vendors start to integrate these other models directly into the programming process. But, for now, I’m going back to ChatGPT when I need programming help, and that’s my advice to you as well.

Have you used an AI to help you program? Which one? How did it go? Let us know in the comments below.


You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.
