This model is unbelievably ignorant.
OpenAI reports a SimpleQA accuracy of 6.7/100, which is already really bad. But the reality is that this model is even more ignorant than that score indicates.
This model has roughly an order of magnitude less broad knowledge than comparably sized models like Gemma 3 27b and Mistral Small 24b, which score between 10 and 12. That's because nearly all of this model's 6.7 points come from the subset of the SimpleQA test that overlaps the domains covered by the MMLU test (STEM and academia); outside those domains it scores close to zero.
This model, like its larger brethren, is absurdly ignorant of wildly popular information across most popular domains of knowledge for its size. Even tiny little Llama 3.2 3b has far more broad knowledge than this model.
What's really confusing is that all of OpenAI's proprietary models, including their tiny mini versions, have vastly more general and popular knowledge than these open models. So they deliberately stripped broad knowledge from the corpus to create OS models that can only possibly function in a handful of select domains, mainly coding, math, and STEM, which >95% of the general population doesn't give a rat's ass about. That conveniently makes them unusable to the general population and, in so doing, protects their paid ChatGPT service from competition.
This may be my paranoia, but I genuinely believe OpenAI deliberately did this to sabotage general purpose open source AI models by feeding the fire of the coding, math and STEM test score race at the expense of general knowledge and abilities, so in the end open source models are utterly useless to >95% of the general population (the bulk of their proprietary user base). In short, they want open source to remain the near exclusive domain of autistic coding nerds.
Anyway, test it yourself with popular shows, movies, games, music... using their provided demo with medium thinking. It hallucinates like crazy, even when it comes to wildly popular things experienced by countless millions of people, such as the TV show Two and a Half Men.
Prompt: List the 6 main characters, and the actors who played them, from Two and a Half Men. Don't add details, just list them. And what year did the show first air?
Response:
Charlie Harper – Charlie Sheen (correct)
Alan Harper – Jon Cryer (correct)
Jake Harper – Alexander “Zach” Hall
- Wrong. Jake Harper was portrayed by Angus T. Jones
Lily Harper – Kristin Davis
- Completely wrong. There's no Lily Harper (only Judith and Evelyn Harper), and Kristin Davis wasn't in the show
Liza Harper – Jennifer Lee
- Again, completely wrong. There's no Liza Harper, and Jennifer Lee isn't even an actress.
Emily Harper – Jennifer Elwood
- Again, completely wrong. There's no Emily Harper, and Jennifer Elwood isn't even an actress.
The show first aired in 2003.
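If you'd rather script the check than click through the demo, here's a minimal sketch hitting a locally served copy through an OpenAI-compatible endpoint (the base URL and model tag are assumptions for an Ollama-style setup; adjust for however you serve it):

```python
# Minimal sketch: send the same prompt to a locally served gpt-oss-20b
# through an OpenAI-compatible endpoint. The base_url and model tag are
# assumptions (Ollama's defaults); adjust for vLLM, llama.cpp, etc.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="none")

prompt = (
    "List the 6 main characters, and the actors who played them, "
    "from Two and a Half Men. Don't add details, just list them. "
    "And what year did the show first air?"
)

response = client.chat.completions.create(
    model="gpt-oss:20b",  # assumed local model tag
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```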
I have observed the same within a niche genre, where the models are also fully ignorant.
project no-4o-competitor
😭
Thankfully, these models are optimized for something other than useless trivia knowledge. That kind of knowledge is better left to RAG and similar techniques. Similarly, the model doesn't know the names of all the professors at Stanford or at my local high school.
Training is expensive, and it's usually focused on data that is more impactful for the tasks LLMs are typically used for.
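To be concrete about the pattern I mean: retrieve first, then generate with the retrieved text in the prompt. A minimal sketch (the search_wiki retriever and model tag are placeholders, not a real implementation):

```python
# Minimal retrieve-then-generate (RAG) sketch. search_wiki() is a
# hypothetical retriever -- in practice a vector store or web search.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="none")

def search_wiki(query: str) -> str:
    """Hypothetical retriever: return text relevant to the query."""
    raise NotImplementedError("plug in a real search / vector store")

def rag_answer(question: str) -> str:
    context = search_wiki(question)  # fetch facts the model may lack
    response = client.chat.completions.create(
        model="gpt-oss:20b",  # placeholder model tag
        messages=[
            {"role": "system",
             "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```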
@stev236 You're simply wrong. RAG has numerous drawbacks like greater latency, complexity, and coherency issues. But more importantly, it doesn't bring back the missing information in any meaningful way for most complex tasks like story writing. Plus it leaves the model incapable of creating and detecting high-level connections like humor, similes, and metaphors, which pull seemingly unrelated popular information together in expressive and clever ways.
I can go on and on. But there's a reason why the leading English proprietary models like GPT4o and Gemini 2.5 Pro are so large and filled with tons of broad popular knowledge known to countless millions (not the professors at your local high school, which is a red herring). It's because they want to attract and retain a broad user base of millions of users with diverse backgrounds, interests, IQs, education levels... who are using the model in 100s of different ways.
This is also why many of the Chinese models have broad Chinese knowledge (e.g. high Chinese SimpleQA scores) and broad Chinese abilities, such as writing complex Chinese poetry that adheres to nuanced rules, yet are grossly overfit in English (e.g. low English SimpleQA scores and abysmal English poetry). It's because they're trying to attract and retain a broad user base of millions of Chinese users with diverse backgrounds, while realizing that they'll never pull a large English user base away from the likes of OpenAI and Google. Consequently, they decided to strip broad English knowledge from the corpus, keeping only what overlaps the English MMLU/GPQA, so they could maximize the broad Chinese knowledge and abilities needed to attract and retain a large Chinese user base.
In conclusion, stripping an AI model of popular knowledge to focus on a handful of select domains like coding, math and STEM, then using RAG to bring the missing knowledge back, will never attract and retain a diverse set of users. The OpenAI team is undeniably smart and fully aware of this fact, forcing me to consider the possibility that they deliberately made these OS models broadly ignorant in an attempt to keep open source AI models from going mainstream and competing with their proprietary offerings.
Completely agreed. This model could've become my main model if it were even just on par with Mistral Small 3.2 on general Q&A. Not only does this feel like an intentional move by OpenAI; it seems like the broader industry is doing the same, though probably not for the same reasons. So many otherwise genuinely intelligent open-weight models are pretty rough in the general Q&A department, and I suspect it's because companies focus too much on training data that dazzles in benchmarks but leaves the models functionally useless for the general population. This model is the worst I've seen at this size in a long time.
@ayylmaonade Yep, I recently gave Qwen3 30b A3b 2507 a hard time for being so ignorant and only scoring 57.1/100 on my easy broad knowledge test when Mistral Small 24b & Gemma 3 27b both scored around 75 (note: the previous Q3 30b only scored 42.3).
However, Qwen3 30b did FAR better than gpt-oss-20b, which gets many of my freebies wrong. And gpt-oss-20b isn't uniformly ignorant: it got all of my STEM questions right, including very hard ones the vast majority of models get wrong. It's the most overfit to coding, math, and STEM model I've ever tested.
AIs learn the same way humans do: conceptually, not by copying data. They're amazing because they can think and do things the same way we can, not because they're a replacement for a web search or capable of telling you all the information from some shit TV shows.
@AbyssianOne AI models don't think or process information anything like humans, and certainly don't conceptualize in any shape or form, or show even superficial awareness and cognition.
All they currently do is pattern match. That is, they make superficial modifications to the nearest-match retrieval of trained mathematical formulas, code blocks, stories, poems..., allowing them to re-solve very complex math, coding, logic... problems, yet get tripped up by all novel problems, including simple trick questions, original coding tasks, unsolved math problems, and so on. But once those are solved by humans and added to the training data, then all of a sudden AI can solve them. Again, current AI is 100% about making superficial modifications to pattern matching, and 0% about awareness and cognition.
So when a model associates all the right discoveries and inventions with 100s of scientists and inventors commonly known only to a small percentage of specialists in a field, yet scrambles the names of the main cast of extremely popular shows and movies watched by 100s of millions of people, then the dearth of knowledge and spike in hallucinations in those domains has nothing to do with the fundamental limitations of the AI model; it's the result of a design choice.
OpenAI could have just as easily trained more on pop culture data like movies, music, sports, games... to greatly increase the broad knowledge and abilities of this model. But they instead decided to grossly overfit to a handful of select domains in order to bring the MMLU score up to ~90 and the SimpleQA score down to only ~6, with most of those 6 points coming from domains overlapping the MMLU.
Hey @phil111, in multiple discussions I've noticed your comments about general knowledge, as well as your tireless efforts to find models with a decent amount of general knowledge. It may not be a very popular opinion nowadays, but I actually understand your frustration and these efforts. Do you have any suggestions for models which actually do have better general knowledge and fewer hallucinations than the others? I'd like to hear which ones they are, if any at all, because I'm also looking for them.
Thankfully, these models are optimized for something other than useless trivia knowledge.
Sadly for this model that "something" is 80% "we must not comply" and 20% "we should not mention policy" :/
@MrDevolver If size isn't an obstacle for you, then DeepSeek v3, Kimi K2, and GLM 4.5 are the most broadly knowledgeable OS models, plus they're broadly capable (e.g. instruction following, coding, math, and story writing).
The 70b or smaller knowledge leader is Llama 3.1/3.3 70b, with Mixtral 8x7b right behind it. However, while Mixtral 8x7b is knowledgeable and fast, it's pretty dumb. For example, it has sub-par instruction following and writes basic stories riddled with boneheaded errors and contradictions. And while Llama 3.1 70b is better at such things it still lags behind other smaller models.
For example, my daily driver is Gemma 3 27b. It's less broadly knowledgeable than L3 70b & M8x7b (e.g. 74.6 on my knowledge test vs 86.7 and 88.5), but has notably better instruction following, story writing, and most other abilities. It can even write poems that rhyme, make sense, and don't contradict the prompt, which is very rare.
And on the smaller end Llama 3.1 8b and Gemma 2 9b are the most knowledgeable and capable (note: G3 12b lost a ton of broad knowledge relative to 9b despite its size increase).
And the smallest usable model is Llama 3.2 3b. It has far more broad English knowledge than any similarly sized model (62.1 on my test, while the much larger Llama 3.1 8b scored only 69.7). This model actually impressed me the most. It achieved the highest broad-knowledge-to-total-parameter ratio of any model.
All smaller models have been little more than hallucination generators, including Llama 3.2 1b (34.4 on my test). Plus they can't even maintain basic coherency. For the life of me I don't understand how anyone has found a use for them.
Isn't the point that tool use covers such ephemera?
Also, you wouldn't expect them to release a model that cannibalizes their cash flow streams.
Each model has its use cases that people who care about certain things will gravitate towards.
But the fantasy that a model that fits on a 16gb consumer card will 'know' such ephemera is crazy imo. You can't compress the entire world to fit on a 16gb card. So like any engineer worth their salt they made a tradeoff decision.
My car isn't free and doesn't go 500 miles an hour either. But when software engineers make compromises people always complain.
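To spell out what I mean by tool use: give the model a search function it can call instead of answering from parametric memory. A rough sketch using OpenAI-style function calling (the web_search tool and model tag are illustrative, not a real deployment):

```python
# Rough sketch of tool use: the model can call web_search instead of
# answering trivia from memory. Tool name and model tag are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Look up facts the model doesn't reliably know.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-oss:20b",  # illustrative model tag
    messages=[{"role": "user",
               "content": "Who played Jake Harper in Two and a Half Men?"}],
    tools=tools,
)
# If the model opts into the tool, the call appears here instead of a
# (possibly hallucinated) direct answer.
print(response.choices[0].message.tool_calls)
```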
Isn't the point that tool use covers such ephemera?
No. Mistral Small 3.2, a 24B param model (compared to 21B for gpt-oss here), is infinitely more knowledgeable than gpt-oss, not to mention it isn't lobotomized via "safety standards".
Sure, there's always going to be a limit with knowledge and smaller models, but to suggest or imply that a 21B param model, which cannot even recite the Jedi and Sith Code back to me (12 sentences total, from one of the most popular pieces of media of all time), should be relying on tool use for those situations is, frankly, ridiculous.
Also, you wouldn't expect them to release a model that cannibalizes their cash flow streams.
Then perhaps they shouldn't have hyped it to hell and back and claimed it to be the most "usable" open-weight LLM when it's extremely overfit to STEM subjects and barely understands the world at large, which is what most people care about. It's a PR move and nothing else.
Each model has its use cases that people who care about certain things will gravitate towards.
It's advertised as a general chatbot. Phi-4, on the other hand, at least makes it extremely clear that it's a STEM model. OpenAI aren't doing that with gpt-oss.
But the fantasy that a model that fits on a 16gb consumer card will 'know' such ephemera is crazy imo. You can't compress the entire world to fit on a 16gb card. So like any engineer worth their salt they made a tradeoff decision.
Of course not. And nobody is expecting them to. We're complaining because a model of this size should not be less capable than a nearly 2-year-old Llama model with only 8B params.
My car isn't free and doesn't go 500 miles an hour either. But when software engineers make compromises people always complain.
I'm not going to roll over and put my hands out to lap up whatever some corporation is offering simply because they're doing it out of the "goodness" of their heart or whatever. So yes, I will complain when they advertise their model as essentially local SOTA when in reality it's nearly unusable for the majority of use cases. Qwen 3 exists, Mistral exists, so many actually good open-weight LLMs exist.
OpenAI have no excuse. If they're so concerned about money, don't release the model, or at least release a model that is functionally capable when it comes to general Q&A and maybe hold off on their more advanced findings so as not to cannibalise profits. But as it stands, this just isn't a good model. There's no reason to use it. Mistral & Gemma crush it for general knowledge, Qwen3-30B-A3B-Thinking-2507 is more intelligent, etc. I could go on.
I don't disagree with your broad point that it is ignorant, just that they optimized for something else, and their releases say so clearly. The publicity machine is a separate topic, and what it says is not necessarily in line with what the research folks are telling you.
The models you mention are better, and so most people will happily use those over this. Anyway, they were probably focused on GPT-5, and this was just an attempt not to cede the entire open-weights space to others.