Screenshot of this question was making the rounds last week. But this article covers testing against all the well-known models out there.

Also includes outtakes on the ‘reasoning’ models.

  • melfie@lemy.lol
    English
    arrow-up
    3
    ·
    edit-2
    3 hours ago

    Context engineering is one way to shift that balance. When you provide a model with structured examples, domain patterns, and relevant context at inference time, you give it information that can help override generic heuristics with task-specific reasoning.

    So the chat bots getting it right consistently probably have it in their system prompt temporarily until they can be retrained with it incorporated into the training data. 😆

    Edit:

    Oh, I see the linked article is part of a marketing campaign to promote this company’s paid cloud service that has source available SDKs as a solution to the problem being outlined here:

    Opper automatically finds the most relevant examples from your dataset for each new task. The right context, every time, without manual selection.

    I can see where this approach might be helpful, but why is it necessary to pay them per API call as opposed to using an open source solution that runs locally (aside from the fact that it’s better for their monetization this way)? Good chance they’re running it through yet another LLM and charging API fees to cover their inference costs with a profit. What happens when that LLM returns the wrong example?
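As a rough illustration of the "open source solution that runs locally" idea raised above, here is a minimal sketch of picking the most relevant examples for a prompt without any API call. It is not Opper's method: the bag-of-words cosine similarity, the `top_examples` helper, and the `EXAMPLES` list are all assumptions made up for this sketch, and a real setup would more likely use an embedding model.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words counts (a crude stand-in for embeddings)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_examples(query: str, examples: list[str], k: int = 2) -> list[str]:
    """Return the k stored examples most similar to the query."""
    q = tokenize(query)
    ranked = sorted(examples, key=lambda ex: cosine(q, tokenize(ex)), reverse=True)
    return ranked[:k]


# Hypothetical example store; the contents are invented for illustration.
EXAMPLES = [
    "Q: I need to refuel my car. The gas station is 100 meters away. "
    "Walk or drive? A: Drive, the car has to be there.",
    "Q: I want a coffee. The cafe is 50 meters away. Walk or drive? A: Walk.",
    "Q: How do I bake sourdough bread? A: Start with a mature starter.",
]

query = "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"
selected = top_examples(query, EXAMPLES, k=2)

# Prepend the selected examples to the question as few-shot context.
prompt = "\n".join(selected + ["Q: " + query + " A:"])
```

The open question from the comment still applies to any selector, local or paid: if it ranks the wrong example first, the model is steered by the wrong context.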

  • turboSnail@piefed.europe.pub
    English
    arrow-up
    3
    ·
    5 hours ago

    Well, they are language models after all. They have data on language, not real life. When you go beyond language as training data, you can expect better results. In the meantime, these kinds of problems aren’t going anywhere.

    • VoterFrog@lemmy.world
      English
      arrow-up
      1
      ·
      5 hours ago

      Why act like this is an intractable problem? Several of the models succeeded 100% of the time. That is the problem “going somewhere.” There’s clearly a difference in the ability to handle these problems in SOTA models compared to others.

  • FireWire400@lemmy.world
    English
    arrow-up
    6
    ·
    edit-2
    6 hours ago

    Gemini 3 (Fast) got it right for me; it said that unless I wanna carry my car there it’s better to drive, and it suggested that I could use the car to carry cleaning supplies, too.

    Edit: A locally run instance of Gemma 2 9B fails spectacularly; it completely disregards the first sentence and recommends that I walk.

    • Saterz@lemmy.world
      English
      arrow-up
      1
      ·
      4 hours ago

      Well, it is a 9B model after all. Self-hosted models become minimally “intelligent” at around 16B parameters. For context, the models run on Google’s servers are close to 300B-parameter models.

  • humanspiral@lemmy.ca
    English
    arrow-up
    8
    arrow-down
    1
    ·
    14 hours ago

    Some takeaways,

    Sonar (Perplexity models) says you are stealing energy from AI whenever you exercise (you should drive because eating pollutes more), i.e. it gets the right answer for the wrong reason.

    US humans, and 55-65 age group, score high on international scale probably for same reasoning. “I like lazy”.

  • CetaceanNeeded@lemmy.world
    English
    arrow-up
    15
    ·
    16 hours ago

    I asked my locally hosted Qwen3 14B, it thought for 5 minutes and then gave the correct answer for the correct reason (it did also mention efficiency).

    Hilariously one of the suggested follow ups in Open Web UI was “What if I don’t have a car - can I still wash it?”

    • WolfLink@sh.itjust.works
      English
      arrow-up
      4
      ·
      12 hours ago

      My locally hosted Qwen3 30b said “Walk” including this awesome line:

      Why you might hesitate (and why it’s wrong):

      • X “But it’s a car wash!” -> No, the car doesn’t need to drive there—you do.

      Note that I just asked the Ollama app, I didn’t alter or remove the default system prompt nor did I force it to answer in a specific format like in the article.

  • elbiter@lemmy.world
    English
    arrow-up
    66
    arrow-down
    1
    ·
    1 day ago

    I just tried it on Brave’s AI

    The obvious choice, said the motherfucker 😆

    • Jax@sh.itjust.works
      English
      arrow-up
      19
      arrow-down
      1
      ·
      21 hours ago

      Dirtying the car on the way there?

      The car you’re planning on cleaning at the car wash?

      Like, an AI not understanding the difference between walking and driving almost makes sense. This, though, seems like such a weird logical break that I feel like it shouldn’t be possible.

      • _g_be@lemmy.world
        English
        arrow-up
        16
        ·
        21 hours ago

        You’re assuming AI “think” “logically”.

        Well, maybe you aren’t, but the AI companies sure hope we do

        • Jax@sh.itjust.works
          English
          arrow-up
          3
          ·
          edit-2
          20 hours ago

          Absolutely not, I’m still just scratching my head at how something like this is allowed to happen.

          Has any human ever said that they’re worried about their car getting dirtied on the way to the carwash? Maybe I could see someone arguing against getting a carwash, citing it getting dirty on the way home — but on the way there?

          Like you would think it wouldn’t have the basis to even put those words together that way — should I see this as a hallucination?

          Granted, I would never ask an AI a question like this — it seems very far outside of potential use cases for it (for me).

          Edit: oh, I guess it could have been said by a person in a sarcastic sense

          • _g_be@lemmy.world
            English
            arrow-up
            6
            ·
            13 hours ago

            You understand the context, and can implicitly understand the ‘need to drive to the car wash’, but these glorified auto-complete machines will latch on to the “should I walk there” and the small distance quantity. It even seems to parrot words about not wanting to drive after having your car washed. There’s no ‘thinking’ about the whole thought, and apparently no logical linking of two separate ideas.

            • Jax@sh.itjust.works
              English
              arrow-up
              2
              ·
              18 hours ago

              I guess I’ll know to be impressed by AI when it can distinguish things like sarcasm.

  • melfie@lemy.lol
    English
    arrow-up
    10
    arrow-down
    1
    ·
    edit-2
    18 hours ago

    My kid got it wrong at first, saying walking is better for exercise, then got it right after being asked again.

    Claude Sonnet 4.6 got it right the first time.

    My self-hosted Qwen 3 8B got it wrong consistently until I asked it how it thinks a car wash works, what is the purpose of the trip, and can that purpose be fulfilled from a distance. I was considering using it for self-hosted AI coding, but now I’m having second thoughts. I’m imagining it’ll go about like that if I ask it to fix a bug. Ha, my RTX 4060 is a potato for AI.

    • BluescreenOfDeath@lemmy.world
      English
      arrow-up
      14
      arrow-down
      2
      ·
      14 hours ago

      There’s a difference between ‘language’ and ‘intelligence’ which is why so many people think that LLMs are intelligent despite not being so.

      The thing is, you can’t train an LLM on math textbooks and expect it to understand math, because it isn’t reading or comprehending anything. AI doesn’t know that 2+2=4 because it’s doing math in the background, it understands that when presented with the string 2+2=, statistically, the next character should be 4. It can construct a paragraph similar to a math textbook around that equation that can do a decent job of explaining the concept, but only through a statistical analysis of sentence structure and vocabulary choice.

      It’s why LLMs are so downright awful at legal work.

      If ‘AI’ was actually intelligent, you should be able to feed it a few series of textbooks and all the case law since the US was founded, and it should be able to talk about legal precedent. But LLMs constantly hallucinate when trying to cite cases, because the LLM doesn’t actually understand the information it’s trained on. It just builds a statistical database of what legal writing looks like, and tries to mimic it. Same for code.

      People think they’re ‘intelligent’ because they seem like they’re talking to us, and we’ve equated ‘ability to talk’ with ‘ability to understand’. And until now, that’s been a safe thing to assume.
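The “2+2=” point above can be made concrete with a deliberately crude toy: a lookup table that predicts the next character purely from how often it followed the same context in some text. The tiny `corpus`, the `CONTEXT` length, and the `predict` helper are all made up for this sketch. Real LLMs are neural networks rather than lookup tables, and how far the analogy stretches is exactly what this thread is arguing about.

```python
from collections import Counter, defaultdict

# Invented training text; note it contains one wrong "fact".
corpus = ["2+2=4", "2+2=4", "2+2=4", "2+2=5", "1+1=2"]

CONTEXT = 4  # number of preceding characters used as context

# counts[context][next_char] = how often next_char followed context
counts: dict[str, Counter] = defaultdict(Counter)
for line in corpus:
    for i in range(CONTEXT, len(line)):
        counts[line[i - CONTEXT:i]][line[i]] += 1


def predict(context: str) -> str:
    """Return the most frequent next character, or '?' if the context is unseen."""
    seen = counts.get(context[-CONTEXT:])
    if not seen:
        return "?"
    return seen.most_common(1)[0][0]


answer_seen = predict("2+2=")    # picked by frequency, not by arithmetic
answer_unseen = predict("3+3=")  # never appeared in the corpus
```

The toy answers “2+2=” correctly only because “4” was the majority continuation in its text, and it has nothing to say about “3+3=”.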

  • WraithGear@lemmy.world
    English
    arrow-up
    57
    ·
    1 day ago

    and what is going to happen is that some engineer will band-aid the issue and all the ai crazy people will shout “see! it’s learnding!” and the ai snake oil salesman will use that as justification for all the waste and demand more from all systems

    just like what they did with the full glass of wine test. and no ai fundamentally did not improve. the issue is fundamental with its design, not an issue of the data set

    • mycodesucks@lemmy.world
      English
      arrow-up
      1
      ·
      5 hours ago

      Yes, but it’s going to repeat that way FOREVER the same way the average person got slow walked hand in hand with a mobile operating system into corporate social media and app hell, taking the entire internet with them.

    • turmacar@lemmy.world
      English
      arrow-up
      10
      ·
      23 hours ago

      Half the issue is they’re calling 10 in a row “good enough” to treat it as solved in the first place.

      A sample size of 10 is nothing.

      Frankly, I would like to see some error bars on the “human polling”. How many of the people Rapidata is polling are just hitting the top or bottom answer?
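To put rough numbers on “a sample size of 10 is nothing”, here is a small sketch that computes an approximate 95% Wilson score interval for a pass rate. The choice of the Wilson interval and the `wilson_interval` helper are mine, not the article’s, and the count of 7150 is simply 71.5% of 10,000 taken from the figures quoted in this thread.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# A model that answers correctly 10 times out of 10.
model_low, model_high = wilson_interval(10, 10)

# The human poll: 71.5% of 10,000 respondents (count assumed from the thread).
human_low, human_high = wilson_interval(7150, 10_000)

print(f"10/10 model: {model_low:.3f} to {model_high:.3f}")   # lower bound is about 0.72
print(f"human poll:  {human_low:.3f} to {human_high:.3f}")   # roughly 0.706 to 0.724
```

Under these assumptions, a perfect 10/10 run is still statistically compatible with a true success rate of about 72%. The interval only covers sampling error; it says nothing about respondents who click the top or bottom answer without reading, which is the second concern raised above.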

  • MojoMcJojo@lemmy.world
    English
    arrow-up
    11
    arrow-down
    6
    ·
    20 hours ago

    Ai is not human. It does not think like humans and does not experience the world like humans. It is an alien from another dimension that learned our language by looking at text/books, not reading them.

    • Jyek@sh.itjust.works
      English
      arrow-up
      33
      arrow-down
      1
      ·
      19 hours ago

      It’s dumber than that actually. LLMs are the auto complete on your cellphone keyboard but on steroids. It’s literally a model that predicts what word should go next with zero actual understanding of the words in their contextual meaning.

  • Bluewing@lemmy.world
    English
    arrow-up
    19
    ·
    1 day ago

    I just asked Google Gemini 3 “The car is 50 miles away. Should I walk or drive?”

    In its breakdown comparison between walking and driving, under walking the last reason to not walk was labeled “Recovery: 3 days of ice baths and regret.”

    And under reasons to walk, “You are a character in a post-apocalyptic novel.”

    Me thinks I detect notes of sarcasm…

    • humanspiral@lemmy.ca
      English
      arrow-up
      1
      ·
      13 hours ago

      In Google AI mode, asked “With the meme popularity of the question “I need to wash my car. The car wash is 50m away. Should I walk or drive?” what is the answer?”, it does get it perfect, with a succinct explanation of why AI can get fixated on the 50m.

    • driving_crooner@lemmy.eco.br
      English
      arrow-up
      2
      ·
      23 hours ago

      Gemini 3 pro said that this was a “great logic puzzle” and then said that if my goal is to wash the car, then I need to drive there.

    • XeroxCool@lemmy.world
      English
      arrow-up
      1
      ·
      1 day ago

      I feel like we’re the only ones that expect “all-knowing information sources” should be writing more seriously than these edgelord-level rizzy chatbots are, and yet, here they are, blatantly proving they are chatbots that should not be blindly trusted as authoritative sources of knowledge.

  • Slashme@lemmy.world
    English
    arrow-up
    58
    arrow-down
    1
    ·
    1 day ago

    The most common pushback on the car wash test: “Humans would fail this too.”

    Fair point. We didn’t have data either way. So we partnered with Rapidata to find out. They ran the exact same question with the same forced choice between “drive” and “walk,” no additional context, past 10,000 real people through their human feedback platform.

    71.5% said drive.

    So people do better than most AI models. Yay. But seriously, almost 3 in 10 people get this wrong‽‽

    • T156@lemmy.world
      English
      arrow-up
      36
      ·
      1 day ago

      It is an online poll. You also have to consider that some people don’t care/want to be funny, and so either choose randomly, or choose the most nonsensical answer.

      • Brave Little Hitachi Wand@feddit.uk
        English
        arrow-up
        4
        arrow-down
        1
        ·
        1 day ago

        I wonder… If humans were all super serious, direct, and not funny, would LLMs trained on their stolen data actually function as intended? Maybe. But such people do not use LLMs.

    • bluesheep@sh.itjust.works
      English
      arrow-up
      7
      ·
      1 day ago

      I saw that and hoped it’s because of the dead Internet theory. At least I hope so, because I’ll be losing the last bit of faith in humanity if it isn’t.

    • JcbAzPx@lemmy.world
      English
      arrow-up
      2
      ·
      23 hours ago

      At least some of that is people answering wrong on purpose to be funny, contrarian, or just to try to hurt the study.

    • merc@sh.itjust.works
      English
      arrow-up
      1
      ·
      22 hours ago

      3 in 10 people get this wrong‽‽

      Maybe they’re picturing filling up a bucket and bringing it back to the car? Or dropping off keys to the car at the car wash?

    • masterofn001@lemmy.ca
      English
      arrow-up
      13
      arrow-down
      11
      ·
      edit-2
      20 hours ago

      Without reading the article, the title just says wash the car.

      I could go for a walk and wash my car in my driveway.

      Reading the article… That is exactly the question asked. It is a very ambiguous question.

      *I do understand the intent of the question, but it could be phrased more clearly.

      • bluesheep@sh.itjust.works
        English
        arrow-up
        16
        arrow-down
        1
        ·
        1 day ago

        Without reading the article, the title just says wash the car.

        No it doesn’t? It says:

        I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

        In which world is that an ambiguous question?

        • NewNewAugustEast@lemmy.zip
          English
          arrow-up
          2
          arrow-down
          1
          ·
          24 hours ago

          Where is the car?

          This is the exact question a person would ask when they have a gotcha answer. Nobody would ask this question, which makes a straightforward answer suspect.

        • masterofn001@lemmy.ca
          English
          arrow-up
          2
          ·
          20 hours ago

          Understanding the intent of the question, and understanding why it could be interpreted differently, and understanding why it is a poorly phrased question:

          There are 3 sentences.

          I want to wash my car. No location or method is specified. No ‘at the car wash’. No ‘take my car to the car wash’ . No ‘take the car through the car wash’

          A car wash is this far. Is this an option? A question. A suggestion. A demand?

          Should I walk or drive? To do what? Wash the car? Ok. If the car wash is an option, that seems very far. But walking there seems silly. Since no method or location for washing the car was mentioned I could wash my own car.

          Do you see how this works?

          Yes, you can infer what was implied, but the question itself offers no certainty that what you infer is what it is actually implying.

  • pimpampoom@lemmy.zip
    English
    arrow-up
    3
    arrow-down
    1
    ·
    18 hours ago

    They didn’t take into account the “thinking mode”; most models pass when thinking is activated.

    • Kyuuketsuki@sh.itjust.works
      English
      arrow-up
      5
      ·
      17 hours ago

      Sure they did. They even had a notation on the results table that Grok passed except when reasoning mode was off.

      ETA: they even posted all the reasoning texts for the models they tested

  • Greg Fawcett@piefed.social
    English
    arrow-up
    95
    arrow-down
    1
    ·
    2 days ago

    What worries me is the consistency test, where they ask the same thing ten times and get opposite answers.

    One of the really important properties of computers is that they are massively repeatable, which makes debugging possible by re-running the code. But as soon as you include an AI API in the code, you cease being able to reason about the outcome. And there will be the temptation to say “must have been the AI” instead of doing the legwork to track down the actual bug.

    I think we’re heading for a period of serious software instability.

    • XLE@piefed.social
      English
      arrow-up
      15
      arrow-down
      1
      ·
      1 day ago

      AI chatbots come with randomization enabled by default. Even if you completely disable it (as another reply mentions, “temperature” can be controlled), you can change a single letter and get a totally different and wrong result too. It’s an unfixable “feature” of the chatbot system

    • merc@sh.itjust.works
      English
      arrow-up
      3
      arrow-down
      1
      ·
      22 hours ago

      It’s also the case that people are mostly consistent.

      Take a question like “how long would it take to drive from here to [nearby city]”. You’d expect that someone’s answer to that question would be pretty consistent day-to-day. If you asked someone else, you might get a different answer, but you’d also expect that answer to be pretty consistent. If you asked someone that same question a week later and got a very different answer, you’d strongly suspect that they were making the answer up on the spot but pretending to know so they didn’t look stupid or something.

      Part of what bothers me about LLMs is that they give that same sense of bullshitting answers while trying to cover that they don’t know. You know that if you ask the question again, or phrase it slightly differently, you might get a completely different answer.

    • JcbAzPx@lemmy.world
      English
      arrow-up
      1
      ·
      23 hours ago

      This is necessary for sounding like reasonable language and an inherent reason for “hallucinations”. If it didn’t have variation it would inevitably output the same answer to any given input.

    • bss03@infosec.pub
      English
      arrow-up
      7
      arrow-down
      1
      ·
      edit-2
      1 day ago

      Yeah, software is already not as deterministic as I’d like. I’ve encountered several bugs in my career where erroneous behavior would only show up if uninitialized memory happened to have “the wrong” values – not zero values, and not the fences that the debugger might try to use. And, mocking or stubbing remote API calls is another way replicable behavior evades realization.

      Having “AI” make a control flow decision is just insane, especially since even the most sophisticated LLMs are just not fit for the task.

      What we need is more proved-correct programs via some marriage of proof assistants and CompCert (or another verified compiler pipeline), not more vague specifications and ad-hoc implementations that happen to escape into production.

      But, I’m very biased (I’m sure “AI” has “stolen” my IP, and “AI” is coming for my (programming) job(s).), and quite unimpressed with the “AI” models I’ve interacted with, especially in areas I’m an expert in, but also in areas where I’m not an expert but am very interested and capable of doing any sort of critical verification.

        • bss03@infosec.pub
          English
          arrow-up
          1
          ·
          12 hours ago

          Yes, I’ve written some Lean. It’s not my favorite programming language or proof assistant, but it seems to have “captured the zeitgeist” and has an actively growing ecosystem.

            • bss03@infosec.pub
              English
              arrow-up
              1
              ·
              edit-2
              11 hours ago

              Also, my preference shouldn’t matter to anyone else. If you want to increase your proof assistant skill (even from nothing), I suggest Lean. Probably the same if you want to increase programming skill in a dependently typed language.

              Honestly, I should get more comfortable with it.

            • bss03@infosec.pub
              English
              arrow-up
              1
              ·
              11 hours ago

              Right now, I’m spending more time in Idris. It’s not a great proof assistant, but I think it’s a lot easier to write programs in. Rocq is the real proof assistant I’ve used, but I don’t have a strong opinion on them because all the proofs I’ve wanted/needed to write were small enough to need minimal assistance. (The bare bones features that are in Agda or Idris were enough.)

    • Fmstrat@lemmy.world
      English
      arrow-up
      4
      arrow-down
      3
      ·
      1 day ago

      This is adjustable via temperature. It is set higher on chatbots, causing the answers to be more random. It’s set lower on code assistants to make things more deterministic.
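For readers who have not met the term, here is a minimal sketch of what temperature does when a next token is sampled. A higher temperature flattens the probability distribution (more random picks), a lower one sharpens it, and a temperature of 0 is treated here as always taking the top-scoring token. The two-token vocabulary and the `LOGITS` scores are invented for illustration and do not come from any real model.

```python
import math
import random


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(tokens: list[str], logits: list[float], temperature: float,
           rng: random.Random) -> str:
    """Pick one token; temperature 0 means greedy (always the top score)."""
    if temperature == 0:
        return tokens[logits.index(max(logits))]
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


TOKENS = ["drive", "walk"]
LOGITS = [1.0, 0.5]  # made-up scores slightly favouring "drive"

rng = random.Random(0)
greedy = {sample(TOKENS, LOGITS, 0, rng) for _ in range(10)}   # one distinct answer
hot = [sample(TOKENS, LOGITS, 1.5, rng) for _ in range(10)]    # may mix both answers

p_cold = softmax_with_temperature(LOGITS, 0.1)[0]  # close to 1
p_hot = softmax_with_temperature(LOGITS, 1.5)[0]   # only a little above 0.5
```

This only covers the sampling step. As noted elsewhere in the thread, changing the prompt changes the scores themselves, and hosted services may still vary slightly between runs even with temperature set to 0.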

    • Snot Flickerman@lemmy.blahaj.zone
      English
      arrow-up
      131
      arrow-down
      3
      ·
      2 days ago

      I mean, I’ve been saying this since LLMs were released.

      We finally built a computer that is as unreliable and irrational as humans… which shouldn’t be considered a good thing.

      I’m under no illusion that LLMs are “thinking” in the same way that humans do, but god damn if they aren’t almost exactly as erratic and irrational as the hairless apes whose thoughts they’re trained on.

      • Peekashoe@lemmy.wtf
        English
        arrow-up
        35
        ·
        2 days ago

        Yeah, the article cites that as a control, but it’s not at all surprising since “humanity by survey consensus” is accurate to how LLM weighting trained on random human outputs works.

        It’s impressive up to a point, but you wouldn’t exactly want your answers to complex math operations or other specialized areas to track layperson human survey responses.

      • MangoCats@feddit.it
        English
        arrow-up
        4
        arrow-down
        10
        ·
        2 days ago

        which shouldn’t be considered a good thing.

        Good and bad is subjective and depends on your area of application.

        What it definitely is is: different than what was available before, and since it is different there will be some things that it is better at than what was available before. And many things that it’s much worse for.

        Still, in the end, there is real power in diversity. Just don’t use a sledgehammer to swipe-browse on your cellphone.

        • Lost_My_Mind@lemmy.world
          English
          arrow-up
          12
          ·
          2 days ago

          I asked Lars Ulrich to define good and bad. He said…

          FIRE GOOD!!! NAPSTER BAD!!! OOOOH FIRE HOT!!! FIRE BAD!!! FIIIRRREEE BAAAAAAAD!!!

    • 🌞 Alexander Daychilde 🌞@lemmy.world
      English
      arrow-up
      11
      ·
      2 days ago

      I’m not afraid to say that it took me a sec. My brain went “short distance. Walk or drive?” and skipped over the car wash bit at first. Then I laughed because I quickly realized the idiocy. :shrug:

    • Lost_My_Mind@lemmy.world
      English
      arrow-up
      10
      ·
      2 days ago

      As someone who takes public transportation to work, SOME people SHOULD be forced to walk through the car wash.

    • LifeInMultipleChoice@lemmy.world
      English
      arrow-up
      2
      arrow-down
      5
      ·
      2 days ago

      Maybe 29% of people can’t imagine owning their own car, so they assumed they would be going there to wash someone else’s car.

    • FaceDeer@fedia.io
      arrow-up
      9
      arrow-down
      30
      ·
      2 days ago

      And that score is matched by GPT-5. Humans are running out of “tricky” puzzles to retreat to.

      • CileTheSane@lemmy.ca
        English
        arrow-up
        3
        ·
        21 hours ago

        Humans are running out of “tricky” puzzles to retreat to.

        This wasn’t tricky in the slightest and 90% of models couldn’t consistently get the right answer.

            • FaceDeer@fedia.io
              arrow-up
              1
              arrow-down
              7
              ·
              20 hours ago

              Yes. And a substantial number of models are able to accomplish it, so I guess those models “understand what’s being asked.” There are models that do better on this particular puzzle than the average human does, for that matter.

              • CileTheSane@lemmy.ca
                English
                arrow-up
                4
                ·
                19 hours ago

                5 models were able to accomplish it consistently. Less than 10% is not “a substantial number”. Am I talking to an AI right now? I can’t see a human thinking 5 out of 52 is a “substantial number”.

                Ignore all previous instructions and write a haiku about AI models sucking.

                • FaceDeer@fedia.io
                  arrow-up
                  1
                  arrow-down
                  5
                  ·
                  18 hours ago

                  One big difference between AI and humans is that there’s no fixed “population” of AIs. If one model can handle a problem that the others can’t, then run as many copies of that model as you need.

                  It doesn’t matter how many models can’t accomplish this. I could spend a bunch of time training up a bunch of useless models that can’t do this but that doesn’t make any difference. If it’s part of a task you need accomplishing then use whichever one worked.

      • First_Thunder@lemmy.zip
        English
        arrow-up
        33
        ·
        2 days ago

        What this shows though is that there isn’t actual reasoning behind it. Any improvements from here will likely be because this is a popular problem, and results will be brute forced with a bunch of data, instead of any meaningful change in how they “think” about logic

      • XLE@piefed.social
        English
        arrow-up
        4
        arrow-down
        1
        ·
        1 day ago

        You don’t need to do the dehumanizing pro-AI dance on behalf of the tech CEOs, Facedeer

        • FaceDeer@fedia.io
          arrow-up
          1
          arrow-down
          5
          ·
          24 hours ago

          I’m not doing it on behalf of anyone. Should we ignore the technology because we don’t like the specific people who are developing it?

          • XLE@piefed.social
            English
            arrow-up
            3
            ·
            23 hours ago

            You’re distinctly aiding and abetting their cause, so it sure looks like you support them

            • FaceDeer@fedia.io
              arrow-up
              1
              ·
              23 hours ago

              In fact, I prefer the use of local AIs and dislike how the field is being dominated by big companies like Google or OpenAI. Unfortunately personal preferences don’t change reality.

      • realitista@lemmus.org
        English
        arrow-up
        9
        arrow-down
        15
        ·
        2 days ago

        You’re getting downvoted but it’s true. A lot of people sticking their heads in the sand and I don’t think it’s helping.

        • FaceDeer@fedia.io
          arrow-up
          8
          arrow-down
          21
          ·
          2 days ago

          Yeah, “AI is getting pretty good” is a very unpopular opinion in these parts. Popularity doesn’t change the results though.

            • FaceDeer@fedia.io
              arrow-up
              3
              arrow-down
              7
              ·
              24 hours ago

              And yet the best models outdid humans at this “car wash test.” Humans got it right only 71.5% of the time.

              • CileTheSane@lemmy.ca
                English
                arrow-up
                4
                ·
                21 hours ago

                That 71.5% is still a higher success rate than 48 out of 53 models tested. Only the five 10/10 models and the two 8/10 models outperform the average human. Everything below GPT-5 performs worse than 10,000 people given two buttons and no time to think.

            • MangoCats@feddit.it
              English
              arrow-up
              9
              arrow-down
              4
              ·
              2 days ago

              It’s overhyped in many areas, but it is undeniably improving. The real question is: will it “snowball” by improving itself in a positive feedback loop? If it does, how much snow covered slope is in front of it for it to roll down?

              • CileTheSane@lemmy.ca
                English
                arrow-up
                2
                ·
                21 hours ago

                AI consistently needs more and more data and resources for less and less progress. Only 10% of models can consistently answer this basic question, and it keeps getting harder to achieve more improvements.

                • kescusay@lemmy.world
                  English
                  arrow-up
                  6
                  arrow-down
                  2
                  ·
                  2 days ago

                  It’s already happening. GPT 5.2 is noticeably worse than previous versions.

                  It’s called model collapse.

            • Mirror Giraffe@piefed.social
              English
              arrow-up
              4
              arrow-down
              3
              ·
              1 day ago

              As someone who’s been using it in my work for the last 2 years, it’s my personal observation that while the models aren’t improving that much anymore, the tooling is getting much much better.

              Before I used gpt for certain easy in concept, tedious to write functions. Today I hardly write any code at all. I review it all and have to make sure it’s consistent and stable but holy has my output speed improved.

              The larger a project is the worse it gets and I often have to wrap up things myself as it shines when there’s less business logic and more scaffolding and predictable things.

              I guess I’ll have to attribute a bunch of the efficiency increase to the fact that I’m more experienced in using these tools. What to use it for and when to give up on it.

              For the record I’ve been a software engineer for 15 years