OpenAI's o1-preview model aced my coding tests, and showed its work (in surprising detail)

Usually, when a software company pushes out a major new release in May, they don’t try to top it with another major new release four months later. But there’s nothing usual about the pace of innovation in the AI business.

Although OpenAI dropped its new omni-powerful GPT-4o model in mid-May, the company has been busy. As far back as last November, Reuters published a rumor that OpenAI was working on a next-generation language model, then known as Q*. They doubled down on that report in May, stating that Q* was being worked on under the code name of Strawberry.

Strawberry, as it turns out, is actually a model called o1-preview, which is available now as an option to ChatGPT Plus subscribers. You can choose the model from the selection dropdown:

As you might imagine, if there's a new ChatGPT model available, I'm going to put it through its paces. And that's what I'm doing here.

The new Strawberry model focuses on reasoning, breaking down prompts and problems into steps. OpenAI showcases this approach through a reasoning summary that can be displayed before each answer.

When o1-preview is asked a question, it does some thinking and then displays how long it took to do that thinking. If you toggle the dropdown, you'll see some reasoning. Here's an example from one of my coding tests:

It's good that the AI knew enough to add error handling, but I find it interesting that o1-preview categorizes that step under “Regulatory compliance”.

I also discovered that the o1-preview model provides more exposition after the code. In my first test, which created a WordPress plugin, the model provided explanations of the header, class structure, admin menu, admin page, logic, security measures, compatibility, installation instructions, operating instructions, and even test data. That's a lot more information than was provided by previous models.

But really, the proof is in the pudding. Let’s put this new model through our standard tests and see how well it works.

1. Writing a WordPress plugin

This simple coding test requires knowledge of the PHP programming language and the WordPress framework. The challenge asks the AI to write both interface code and functional logic, with the twist being that instead of removing duplicate entries, it has to separate the duplicate entries, so they’re not next to each other.

The o1-preview model excelled. It presented the UI first as just the entry field:

Once the data was entered and Randomize Lines was clicked, the AI generated an output field with properly randomized output data. You can see how Abigail Williams is duplicated and, in compliance with the test instructions, the two entries aren’t listed side by side:

In my tests of other LLMs, only four of the 10 models passed this test. The o1-preview model completed this test perfectly.
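For readers who want a feel for the core logic, here's a minimal Python sketch of the "separate the duplicates" idea. It is not the plugin o1-preview generated (that was PHP, and I'm not reproducing it here). The function name `randomize_lines`, the sample names other than Abigail Williams, and the greedy strategy are all assumptions made for illustration.

```python
import random
from collections import Counter


def randomize_lines(lines, rng=None):
    """Return the lines in a shuffled order with no identical neighbors.

    Hypothetical helper: a greedy sketch, not the plugin's actual logic.
    """
    rng = rng or random.Random()
    remaining = Counter(lines)  # how many copies of each line are left to place
    result = []
    previous = None
    while remaining:
        # Candidates are every distinct line except the one just placed.
        candidates = [line for line in remaining if line != previous]
        if not candidates:
            raise ValueError("duplicates cannot be separated for this input")
        # Prefer the most frequent remaining line so duplicates never pile up
        # at the end; break ties randomly to keep the output shuffled.
        top = max(remaining[line] for line in candidates)
        choice = rng.choice([line for line in candidates if remaining[line] == top])
        result.append(choice)
        remaining[choice] -= 1
        if remaining[choice] == 0:
            del remaining[choice]
        previous = choice
    return result


if __name__ == "__main__":
    # Sample data; only "Abigail Williams" comes from the test described above.
    names = ["Abigail Williams", "John Proctor", "Abigail Williams", "Mary Warren"]
    print("\n".join(randomize_lines(names)))
```

Always placing the most frequent remaining entry guarantees a valid ordering whenever one exists, at the cost of being less random than a true shuffle. A real plugin might well make a different trade-off.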

2. Rewriting a string function

Our second test fixes a string regular expression that was a bug reported by a user. The original code was designed to test if an entered number was valid for dollars and cents. Unfortunately, the code only allowed integers (so 5 was allowed, but not 5.25).

The o1-preview LLM rewrote the code successfully. The model joined four of my previous LLM tests in the winners’ circle.
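To make the bug concrete, here's a minimal Python sketch of a dollars-and-cents check that accepts both 5 and 5.25. The original code and the model's fix aren't reproduced here, so the pattern, the name `is_valid_amount`, and the exact rule (digits, optionally followed by a decimal point and one or two digits) are assumptions for illustration.

```python
import re

# Assumed rule: one or more digits, optionally followed by a decimal
# point and one or two digits. The real code's rule may differ.
DOLLARS_AND_CENTS = re.compile(r"[0-9]+(?:\.[0-9]{1,2})?")


def is_valid_amount(text):
    """Return True if text looks like a dollars-and-cents amount."""
    # fullmatch requires the whole string to fit the pattern, so stray
    # characters before or after the number cause a rejection.
    return DOLLARS_AND_CENTS.fullmatch(text.strip()) is not None


if __name__ == "__main__":
    for sample in ["5", "5.25", "5.255", "abc"]:
        print(sample, is_valid_amount(sample))
```

An integers-only check would be the same pattern without the optional decimal group, which is the kind of gap the bug report describes.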

3. Finding an annoying bug

This test was created from a real-world bug I had difficulty resolving. Identifying the root cause requires knowledge of the programming language (in this case PHP) and the nuances of the WordPress API.

The error messages provided weren’t technically accurate. They referenced the beginning and the end of the calling sequence I was running, but the bug was related to the middle part of the code.

I wasn’t alone in struggling to solve the problem. Three of the other LLMs I tested couldn't identify the root cause and recommended the more obvious (but wrong) solution of changing the beginning and ending of the calling sequence.

The o1-preview model provided the correct solution. In its explanation, the model also pointed to the WordPress API documentation for the functions I used incorrectly, providing an added resource to learn why it had made its recommendation. Very helpful.

4. Writing a script

This challenge requires the AI to integrate knowledge of three separate coding spheres: the AppleScript language, the Chrome DOM (how a web page is structured internally), and Keyboard Maestro (a specialty programming tool from a single programmer).

Answering this question requires an understanding of all three technologies, as well as how they have to work together.

Once again, o1-preview succeeded, joining only three of the other 10 LLMs that have solved this problem.

A really chatty chatbot

The new reasoning approach for o1-preview certainly doesn't diminish ChatGPT’s ability to ace our programming tests. The output from my initial WordPress plugin test, in particular, seemed to function as a more sophisticated piece of software than previous versions.

It's great that ChatGPT provides reasoning steps at the beginning of its work and some explanatory data at the end. However, the explanations can be chatty. I asked o1-preview to write “Hello world” in C#, the canonical test line in programming. This is how GPT-4o responded:

And this is how o1-preview responded to the same test:

I mean, wow, right? That's a lot of chat from ChatGPT. You can also flip the reasoning dropdown and get even more information:

All of this information is great, but it’s a lot of text to filter through. I prefer a concise explanation, with additional information options in dropdowns removed from the main answer.

Yet ChatGPT’s o1-preview model performed excellently. I look forward to seeing how well it will work when integrated more fully with the GPT-4o features, such as file analysis and web access.

Have you tried coding with o1-preview? What were your experiences? Let us know in the comments below.

You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.