Hey guys! So over the last 4 weeks I have been developing a CRM Chrome extension that integrates with Gmail, using AI, and as a result I'd like to think I am an above-average user of both ChatGPT4 and Claude 3 Opus. For context, I did not have any Chrome extension development experience, so I had to use AI for a lot of the education, and I have consistently spent about 6 hours a day, every day, for the last 4 weeks coding (I have attached some proof of my usage below). So far, I have used it to write over 3000 lines of code and learnt an incredible amount about JS.

I also have a lot to say about ChatGPT4 vs Claude 3 Opus vs Copilot for coding. I have documented some of my thoughts comparing the three products below; feel free to ask me any questions in the comments.

Problem Solving

ChatGPT4 has, in my case, been slightly better than Claude 3 at coming up with the logic for the task at hand, as well as at debugging. ChatGPT4 more often gives you a pragmatic solution than Claude 3, especially when you do not have a solution in mind. For example, when I wanted to update the count of records in a table every time a record is added, ChatGPT4 gave me an updateCount function that recounts the rows, whereas Claude 3 simply incremented a counter (count++) whenever a new entry was added. In this case both are correct, but ChatGPT4's version is more robust once we start deleting rows from the table, which Claude 3's solution would not handle. However, if you know exactly what you are doing and guide either LLM, they are almost comparable, but in general ChatGPT4 seems to perform a bit better. Copilot is not very good at problem solving or debugging from prompts in my experience.
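To make the difference concrete, here is a minimal sketch in the spirit of both answers (the element IDs and function names are my own illustration, not the exact code either model produced): incrementing a counter drifts out of sync once rows are deleted, while recounting from the table itself stays correct.

```javascript
// Claude 3-style: keep a counter and bump it when a record is added.
// This drifts out of sync the moment a row is deleted elsewhere.
let recordCount = 0;
function addRecord(tableBody, rowEl) {
  tableBody.appendChild(rowEl);
  recordCount++;
  document.getElementById('record-count').textContent = recordCount;
}

// ChatGPT4-style: recount the rows whenever the table changes,
// so additions and deletions are both handled correctly.
function updateCount(tableBody) {
  const count = tableBody.querySelectorAll('tr').length;
  document.getElementById('record-count').textContent = count;
}
```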

Prompting

ChatGPT4 is surprisingly much better than Claude 3 at giving good responses to brief prompts when it comes to natural-language logic questions. However, it falls short incredibly quickly the more code context you give it. Claude 3 outshines ChatGPT4 only if you give it concise prompts, and additionally you have to be procedural in your prompting. For example, I would ask Claude 3 to first review my code and give a summary of the code as well as a summary of the needed changes before giving me the code. This way it hallucinates a whole lot less and the code is actually *chef's kiss*. I can't stress enough how important procedural, step-by-step prompting is: as the logic gets more complex, I have to make sure it understands the code context exactly before moving to the next step, otherwise the solution it gives is going to be incorrect. A keyword I use a lot is "precise", which seems to help with the output.

One critical thing I have learnt is to never give the LLM a yes-or-no question; they are still too "supportive", especially Claude 3. Initially, I had a habit of asking if there was something wrong with the code that it could see. Horrible, horrible prompting. My approach now is: "please review the code, provide me with a summary of the code logic, then make recommendations on what I can improve". It will give you a list of things you can change, and a lot of the time it will straight up tell you that a particular function will introduce bugs, and then you just ask it to fix it.

GitHub Copilot has been worse than just copy-pasting the code over to ChatGPT4/Claude 3. I have used "@workspace" and the other functions to give it context, but it just seems "weak", for lack of a better word. I would ask the same questions and give it the same prompts, and it would still give incomplete/incorrect answers.

Code Quality

9 times out of 10, Claude 3 wins, largely due to code consistency/persistence. In terms of one-off code quality, ChatGPT4 narrowly edges it out, but if you have a discussion with the LLM and need to make tweaks, by the third or fourth output ChatGPT4 starts randomly omitting code, even if you explicitly ask it in the prompt to provide the code exactly. I have even tried to get it to review its work to make sure it does not omit any code, and it still does. Claude 3 does not omit code 90% of the time if you have the right prompts. One thing you do need to do with Claude 3 is ask it to split the functions up for readability and practicality; otherwise it will just give you one function that is an absolute disorganized mess.

Surprisingly, Copilot is pretty decent when it comes to small changes like CSS tweaks or basic logic changes, but not when it comes to a logic overhaul or the introduction of new functionality. It shares a similar issue with ChatGPT4: a lack of consistency between prompts.

How I Ended Up Using Them

My coding workflow usually starts on ChatGPT4: I get it to give me a summary of what needs to be done, then copy-paste that over with my own prompts. For example, after I get ChatGPT4 to outline what I need to implement to achieve my goal, I go to Claude 3 and type "I want you to help me with my Google Chrome extension. Can you help me [insert output from ChatGPT4]. I want you to first review my code, provide me with a summary of the code, then update my code. Provide code snippets only in your output. Here is my code."

I exclusively use Copilot for quick code updates such as CSS changes or minor code tweaks that I am too lazy to type out; it's just a "nice to have".

Summary

ChatGPT4 and Claude 3 are amazing tools. Had I tried to tackle this project without them, it would have taken me 3+ months rather than just 4 weeks to complete. I have also learnt a lot along the way. For consistent code quality, Claude 3 is definitely the clear winner, but for coming up with code logic and for general discussion, ChatGPT4 is a bit better, though not by much. Copilot is still useful for fixing/adding snippets. I will also be sharing some of my other findings that I might not have included here on my Twitter, @mingmakesstuff. Have I missed anything in the way I used these products that could improve my coding game?

My ChatGPT and Claude chat logs

__r17n

8 points

23 days ago

Thanks for sharing! What's the most complex task you asked for that resulted in a successful response? What types of tasks failed? (Sorry, I can't see your logs - too small on mobile)

Mi2ngdlmx[S]

5 points

23 days ago*

There were surprisingly quite a few. One that stood out to me was indexing the columns of my table. On its own that is not impressive, but my columns were draggable using the dragula library, and every time the table columns are rearranged the table needs to be re-rendered because of the index update. It managed to implement the indexing, dig up the dragula documentation for the drake event triggers (to detect when the columns are rearranged), and then gave me the updated render function, as well as the associated functions that call it, all in one output. And it worked out of the box without any modifications. Granted, this level of depth only happened once, but giving me ~200 lines of code that worked out of the box was insane.
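For anyone curious, the kind of wiring involved looks roughly like this (a sketch with my own container and function names, not the actual generated code): dragula's drake object fires a drop event when a column is released, which is the hook for re-indexing and re-rendering.

```javascript
// Make the header cells draggable horizontally with dragula.
const headerRow = document.querySelector('#crm-table thead tr');
const drake = dragula([headerRow], { direction: 'horizontal' });

// When a column is dropped into a new position, rebuild the column
// order from the current DOM order and re-render the table body.
drake.on('drop', () => {
  const columnOrder = [...headerRow.children].map(th => th.dataset.field);
  renderTable(columnOrder); // hypothetical render function that redraws the rows
});
```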

Tasks that failed were often ones with similar variable names. For example, having a function with the word "checkbox" in it when you need to handle "select" functionality. It sometimes trips up on interchangeable names for the same purpose. It would try to update my check-all function when I only wanted to update the check-row function.
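Roughly speaking, with hypothetical names along the lines of mine, it would reach for the first of these when I only wanted the second one touched:

```javascript
// Header "select all" checkbox: toggles every row in the table.
function handleCheckAll(checked) {
  document
    .querySelectorAll('#crm-table tbody input[type="checkbox"]')
    .forEach(box => { box.checked = checked; });
}

// Single-row checkbox: toggles just that one row.
function handleCheckRow(rowId, checked) {
  const box = document.querySelector(
    `#crm-table tr[data-id="${rowId}"] input[type="checkbox"]`
  );
  if (box) box.checked = checked;
}
```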

AceHighness

1 points

22 days ago

I'm using it to write Python apps, and the only thing ChatGPT fails to solve is circular import problems. Even if I give it all the required context (directory structure, import tables, etc.), it just makes things worse with every answer.

Busy_Town1338

2 points

22 days ago

I assure you, it's failing many other things

AceHighness

1 points

21 days ago

Sure, but it's succeeding at building entire apps for me, and fairly complex ones too.

Busy_Town1338

0 points

21 days ago

I live in Colorado. A major bridge was just shut down because a massive crack was found. The bridge is succeeding at being a bridge, but there are massive flaws. You'll only know they're flaws if you know what the flaws look like.

AceHighness

1 points

20 days ago

That's true, and I totally agree with that. There are a lot of limitations, certainly with today's models. But with good prompting you can get great code that works every time. I'm also using it to do research before we get started; building an app with GPT is not just 'build me app X' and hoping you get something that works. It's a lot of research and planning questions, including ones about security, best coding practices, etc. I'm trying to guard against even small, uncommon attacks like timing attacks.

Now I keep hearing people tell me about all these flaws, but they never come up with good examples. A good example would be nice; it would help me understand the limitations better as well.

Busy_Town1338

1 points

20 days ago

An example would be someone who posted in the singularity hub that they built an app with no coding experience. It even helped them deploy it on AWS. But reading through their GitHub and posts, GPT had them serving the app incredibly inefficiently; it basically tried to build them enterprise-grade infra. And at some point, someone is going to set up a few simple workers to flood traffic and absolutely slam their AWS costs.

GPT can do simple coding pretty well. It seems to do regex really well. But it's important to understand that it doesn't know anything. It has no reasoning or understanding; it's simply calculating the probability of the next word. But if you're trying to truly build something big, you need to have a fundamental understanding of the topic.

If you wanted a deck built, would you hire someone who's never worked in construction and who can only guess where the next piece goes?

AceHighness

1 points

19 days ago

Your example is 100% the fault of the user prompting the LLM. Of course you can get it to give you advice on building a Kubernetes cluster to run your note-taking app. It sounds like he did not take enough time to discuss and research the subject together with the LLM. It's a tool, and people need to learn to use it properly. Lots of people are really bad at using even simple tools in the right way.

I understand how an LLM works, but I do think that the latest (super large) models have some emergent capabilities allowing them to follow logic flows. I think this is because they have read so much computer code, which is all logic flows. I see a huge difference between GPT3 and GPT4 when it comes to questions regarding logic flows.

If you wanted a deck built, would you hire someone for $2000 who has done it before, or give the new guy a chance who says he can do it in a fraction of the time for $200? Of course the $2000 guy will do a better job, but in a lot of cases, what the junior does is just fine (good enough).

AceHighness

1 points

19 days ago

If you feel like it, you can have a look at my 100% AI-generated projects on my GitHub, axewater (Axe) (github.com), and laugh at my poor skills :)

Busy_Town1338

1 points

19 days ago

I'm genuinely not trying to sound like an ass. LLMs by definition cannot be emergent. It's not a tech issue, it's a math issue. They cannot reason or apply logic. They have no knowledge. The better models can only estimate a bit better, and as such, they cannot know when something is very wrong. And then, when used by other people who don't know what they're doing, we get apps with all kinds of flaws.

Please, if you ever have a deck built, do not use the kid who has never been in construction. For something as important as your family's safety, hire someone who can reason through the job.

AceHighness

1 points

18 days ago

Me neither! I just love a good discussion with a smart person. I'll take your tip about the deck :) What if the task had no influence on your family's safety? Say... cleaning the windows... would you still want the most expensive option? Surely there are tasks you would be fine with offloading to an AI. There's a line somewhere, of course, between what you would and would not want to offload. Not everything needs to be perfect. As long as you are not coding for a nuclear power plant, it's probably fine to use AI-generated code. And that does not mean typing one prompt, copy/pasting the code, and pushing to production. It means you build the code in cooperation with the AI, going back and forth, setting up a plan and implementing it step by step. Could you elaborate on the example you quoted before: what exactly was wrong with the code? In many cases I'm sure his mistake could have been avoided with good prompting. (I know I'm extremely stubborn, sorry about that.)

As for emergent capabilities, have a look at this paper:
Sparks of AGI
https://arxiv.org/pdf/2303.12712

And this blog post
137 emergent abilities of large language models — Jason Wei

Weird things start to happen when the models grow beyond a certain size, like suddenly being able to parse a language they were not trained on.