Tech experts are starting to doubt that ChatGPT and A.I. ‘hallucinations’ will ever go away: ‘This isn’t fixable’
Experts are starting to doubt it, and even OpenAI CEO Sam Altman is a bit stumped.
“AI” are just advanced versions of the next word function on your smartphone keyboard, and people expect coherent outputs from them smh
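For anyone who hasn't seen how the keyboard version works, here is a minimal sketch of a next-word suggester built from bigram counts. The corpus and the function names are made up for illustration, and a real LLM uses a neural network over a long context rather than a lookup of word pairs, so this only illustrates the analogy.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def suggest(counts, word, k=3):
    """Return up to k of the most frequent followers of `word`, keyboard-style."""
    return [w for w, _ in counts[word.lower()].most_common(k)]

# Toy corpus (an assumption for the demo, not real training data).
corpus = "the cat sat on the mat and the cat ate the fish"
model = train_bigrams(corpus)
print(suggest(model, "the"))
```

The suggester can only ever propose words it has seen directly after the current one, which is the limitation the comparison is pointing at.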
Seriously. People like to project forward based on how quickly this technological breakthrough came on the scene, but they don’t realize that, barring a few tweaks and improvements here and there, this is it for LLMs. It’s the limit of the technology.
It’s not to say AI can’t improve further, and I’m sure that when it does, it will skillfully integrate LLMs. And I also think artists are right to worry about the impact of AI on their fields. But I think it’s a total misunderstanding of the technology to think the current technology will soon become flawless. I’m willing to bet we’re currently seeing it at 95% of its ultimate capacity, and that we don’t need to worry about AI writing a Hollywood blockbuster any time soon.
In other words, the next step of evolution in the field of AI will require a revolution, not further improvements to existing systems.
I’m willing to bet we’re currently seeing it at 95% of its ultimate capacity
For free? On the internet?
After a year or two of going live?
It depends on what you’d call a revolution. Multiple instances working together, orchestrating tasks with several other instances to evaluate progress and provide feedback on possible hallucinations, connected to services such as Wolfram Alpha for accuracy.
I think the whole orchestration network of instances could functionally surpass us soon in a lot of things if they work together.
But I’d call that evolution. Revolution would indeed be a different technique that we can probably not imagine right now.
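A rough sketch of the "connect it to an external service for accuracy" idea, assuming the simplest possible case: anything that parses as plain arithmetic goes to a deterministic calculator, everything else goes to the model. The `llm` argument and `fake_llm` are hypothetical stand-ins, and a local calculator is used here instead of a real service such as Wolfram Alpha.

```python
import ast
import operator

# Binary operators the toy calculator tool supports.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expr):
    """Safely evaluate a plain arithmetic expression (no names, no calls)."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("not plain arithmetic")
    return ev(ast.parse(expr, mode="eval").body)

def answer(question, llm):
    """Use the calculator when the question is arithmetic, otherwise ask the LLM."""
    try:
        return str(calculator(question))
    except (ValueError, SyntaxError):
        return llm(question)

fake_llm = lambda q: "(model-generated text)"   # hypothetical model call
print(answer("12 * (3 + 4)", fake_llm))
print(answer("Who wrote Hamlet?", fake_llm))
```

The point of the design is that the arithmetic answer never passes through the model at all, so it cannot be hallucinated.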
It is just that everyone now refers to LLMs when talking about AI, even though it has so many different aspects to it. Maybe at some point there will be an AI that actually understands the concepts and meanings of things. But that is not learned by unsupervised web crawling.
It is possible to get coherent output from them though. I’ve been using the ChatGPT API to successfully write ~20 page proposals. Basically give it a prior proposal, the new scope of work, and a paragraph with other info it should incorporate. It then goes through a section at a time.
The numbers and graphics need to be put in after… but the result is better than I’d get from my interns.
I’ve also been using it (Google Bard mostly, actually) to successfully solve coding problems.
I either need to increase the credit I give LLMs or admit that interns are mostly just LLMs.
I recently asked it a very specific domain architecture question about whether a certain application would fit the need of a certain business application and the answer was very good and showed both a good understanding of architecture, my domain and the application.
Are you using your own application to utilize the API or something already out there? Just curious about your process for uploading and getting the output. I’ve used it for similar documents, but I’ve been using the website interface which is clunky.
Just hacked together python scripts.
pip install openai
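Since the question was about the process, here is roughly what a section-at-a-time loop could look like. This is a guess at the shape, not the actual scripts: the prompt wording and section names are invented, and `complete` is a hypothetical stand-in for whatever function wraps the real API call (a fake one is used so the sketch runs offline).

```python
def build_prompt(section, prior_proposal, scope, notes):
    """Assemble the prompt for one section from the three inputs."""
    return (
        f"Prior proposal (style reference):\n{prior_proposal}\n\n"
        f"New scope of work:\n{scope}\n\n"
        f"Other information to incorporate:\n{notes}\n\n"
        f"Write only the '{section}' section of the new proposal."
    )

def draft_proposal(sections, prior_proposal, scope, notes, complete):
    """Call `complete` (any prompt -> text function) once per section, in order."""
    return {s: complete(build_prompt(s, prior_proposal, scope, notes))
            for s in sections}

# Stand-in for a real API call (assumption: swap in your own wrapper).
def fake_complete(prompt):
    return f"[draft, {len(prompt)} prompt chars]"

draft = draft_proposal(["Background", "Approach"],
                       "old text", "new scope", "notes", fake_complete)
for name, text in draft.items():
    print(name, "->", text)
```

Working one section at a time keeps each request well inside the context limit, which is presumably why the 20-page documents come out coherent.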
Just FYI, I dinked around with the available plugins, and you can do something similar. But, even easier is just to enable “code interpreter” in the beta options. Then you can upload and have it scan documents and return similar results to what we are talking about here.
Relative complexity matters a lot, even if the underlying mechanisms are similar.
In the 1980s, Racter was released and it was only slightly less impressive than current LLMs only because it didn’t have an Internet’s worth of data it was trained on, but it could still write things like:
Bill sings to Sarah. Sarah sings to Bill. Perhaps they will do other dangerous things together. They may eat lamb or stroke each other. They may chant of their difficulties and their happiness. They have love but they also have typewriters. That is interesting.
If anything, at least that’s more entertaining than what modern LLMs can output.
Yet I’ve still seen many people clamoring that we won’t have jobs in a few years. People SEVERELY overestimate the ability of all things AI. From self-driving to taking jobs, this stuff is not going to take over the world anytime soon.
Idk, an ai delivering low quality results for free is a lot more cash money than paying someone an almost living wage to perform a job with better results. I think corporations won’t care and the only barrier will be whether or not the job in question involves enough physical labor to be performed by an ai or not.
They already do this. With chat bots and phone trees. This is just a slightly better version. Nothing new
Right, but that’s the point right? This will grow and more jobs will be obsolete because of the amount of work ai can generate. It won’t take over every job. I think most people will use AI as a tool at the individual level, but companies will use it to gut many departments. Now they would just need one editor to review 20 articles instead of 20 people to write said articles.
AI isn’t free. Right now, an LLM takes a not-insignificant hardware investment to run and a lot of manual human labor to train. And there’s a whole lot of unknown and untested legal liability.
Smaller more purpose-driven generative AIs are cheaper, but the total cost picture is still a bit hazy. It’s not always going to be cheaper than hiring humans. Not at the moment, anyway.
Compared to human work though, AI is basically free. I’ve been using the GPT-3.5-turbo API in a custom app, making calls dozens of times a day for a month now, and I’ve been charged like 10 cents. Even minimum wage humans cost tens of thousands of dollars per year; that’s a pretty high price that will be easy to undercut.
Yes, training costs are expensive, hardware is expensive, but those are one time costs. Once trained, a model can be used trillions of times for pennies, the same can’t be said of humans
You can bet your ass ChatGPT won’t be that cheap for long though. They’re still developing it and using people as cheap beta testers.
I think it’s reasonable to assume that AI API pricing is artificially low right now. Very low.
There are big open questions around whether training an AI on copyrighted materials is infringement and who exactly should be paid for that.
It’s the core of the writer/actor strikes, Reddit API drama, etc.
The problem is that these things never hit a point of competition with humans: they’re either worse than us, or they blow way past us. Humans might drive better than a computer right now, but as soon as the computer is better than us it will always be better than us. People doubted that computers would ever beat the best humans at chess, or Go, but within a lifetime of computers being invented they blew past us in both. Now they can write articles and paint pictures. Sure, we’re better at it for now, but they’re a million times faster than us, and they’re making massive improvements month over month. You and I can disagree on how long it’ll take for them to pass us, but once they do they’ll replace us completely, and the world will never be the same.
To be fair, in my experience AI chatbots currently provide me with more usable results in 15 minutes than some junior employees do in a day. With less interaction and fewer conversational struggles (like taking your junior’s emotional state into account while still striving for perfection ;)).
And that’s not meant as disrespect to these juniors.
Yeah it’s pretty weird just how many people are freaking out. The pace ai has been improving is impressive, but it’s still super janky and extremely limited.
People are letting their imaginations run wild about the future of AI without really looking into how these AIs are trained, how they function, their limitations, and the hardware and money it takes to run them.
In my limited experience the issue is often that the “chatbot” doesn’t even check what it says now against what it said a few paragraphs above. It contradicts itself in very obvious ways. Shouldn’t a different algorithm that adds some sort of separate logic check be able to help tremendously? Or a check to ensure recipes are edible (for this specific application)? A bit like those physics-informed NNs.
That’s called context. For ChatGPT it is a bit less than 4k words. Using the API it goes up to a bit less than 32k. Alternative models go up to a bit less than 64k.
The model wouldn’t know anything you said before that.
That is one of the biggest limitations of current generation of LLMs.
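A small sketch of why the model "forgets": before each request, the chat history has to be cut down to fit the window, and whatever falls off the front is simply never sent. The budget and the word-count tokenizer below are simplifying assumptions; real systems count tokens with the model's own tokenizer.

```python
def trim_history(messages, budget, count_tokens=lambda m: len(m.split())):
    """Keep the most recent messages whose combined size fits in `budget`.

    `count_tokens` is a crude word-count stand-in for a real tokenizer.
    """
    kept, used = [], 0
    for message in reversed(messages):        # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > budget:
            break                             # everything older is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))               # restore chronological order

chat = ["my name is Ada", "I like tea", "what is my name?"]
print(trim_history(chat, budget=7))
```

With a budget of 7 words the first message no longer fits, so the question about the name is sent without the message that contained the answer.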
That’s not 100% true. They also work by modifying the meanings of words based on context, and those modified meanings then propagate indefinitely forwards. But yes, direct context is limited, so things outside it aren’t directly used.
They don’t really change the meaning of the words, they just look for the “best” words given the recent context, by taking into account the different possible meanings of the words
No, they do. That’s one of the key innovations of LLMs: the attention and feed-forward steps, where they propagate information from related words into each other based on context. From https://www.understandingai.org/p/large-language-models-explained-with?r=cfv1p
For example, in the previous section we showed a hypothetical transformer figuring out that in the partial sentence “John wants his bank to cash the,” his refers to John. Here’s what that might look like under the hood. The query vector for his might effectively say “I’m seeking: a noun describing a male person.” The key vector for John might effectively say “I am: a noun describing a male person.” The network would detect that these two vectors match and move information about the vector for John into the vector for his.
That’s exactly what I said
They don’t really change the meaning of the words, they just look for the “best” words given the recent context, by taking into account the different possible meanings of the words
The words’ meanings haven’t changed, but the model can choose based on the context, accounting for the different meanings of words
The key vector for John might effectively say “I am: a noun describing a male person.” The network would detect that these two vectors match and move information about the vector for John into the vector for his.
This is the bit you are missing: the attention network actively changes the token vectors depending on context. That transfers new information into the meaning of that word.
The network doesn’t detect matches, but the model definitely works on similarities. Words are mapped in a hyperspace, with the idea that that space can mathematically retain conceptual similarity as spatial representation.
Words are transformed in a mathematical representation that is able (or at least tries) to retain semantic information of words.
But the different meanings of the different words belong to the words themselves and are defined by the language; the model cannot modify them.
Anyway we are talking about details here. We could kill the audience of boredom
Edit. I asked gpt-4 to summarize the concepts. I believe it did a decent job. I hope it helps:
- Embedding Space:
  - Initially, every token is mapped to a point (or vector) in a high-dimensional space via embeddings. This space is typically called the “embedding space.”
  - The dimensionality of this space is determined by the size of the embeddings. For many Transformer models, this is often several hundred dimensions, e.g., 768 for some versions of GPT and BERT.
- Positional Encodings:
  - These are vectors added to the embeddings to provide positional context. They share the same dimensionality as the embedding vectors, so they exist within the same high-dimensional space.
- Transformations Through Layers:
  - As tokens’ representations (vectors) pass through Transformer layers, they undergo a series of linear and non-linear transformations. These include matrix multiplications, additions, and the application of functions like softmax.
  - At each layer, the vectors are “moved” within this high-dimensional space. When we say “moved,” we mean they are transformed, resulting in a change in their coordinates in the vector space.
  - The self-attention mechanism allows a token’s representation to be influenced by other tokens’ representations, effectively “pulling” or “pushing” it in various directions in the space based on the context.
- Nature of the Vector Space:
  - This space is abstract and high-dimensional, making it hard to visualize directly. However, in this space, the “distance” and “direction” between vectors can have semantic meaning. Vectors close to each other can be seen as semantically similar or related.
  - The exact nature and structure of this space are learned during training. The model adjusts the parameters (like weights in the attention mechanisms and feed-forward networks) to ensure that semantically or syntactically related concepts are positioned appropriately relative to each other in this space.
- Output Space:
  - The final layer of the model transforms the token representations into an output space corresponding to the vocabulary size. This is a probability distribution over all possible tokens for the next word prediction.
In essence, the entire process of token representation within the Transformer model can be seen as continuous transformations within a vector space. The space itself can be considered a learned representation where relative positions and directions hold semantic and syntactic significance. The model’s training process essentially shapes this space in a way that facilitates accurate and coherent language understanding and generation.
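To make the "pulling" by self-attention concrete, here is a toy single-query version of scaled dot-product attention, using the "John ... his" example quoted earlier in the thread. The 2-d vectors are hand-picked assumptions so that the query for "his" matches the key for "John". Real models learn these vectors, use hundreds of dimensions and many heads, and add the result onto the existing token vector rather than replacing it.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query scaled dot-product attention over a list of tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, mixed

# Hand-picked toy vectors (assumptions for the demo).
tokens = ["John", "wants", "his"]
keys   = [[4.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_for_his = [4.0, 0.0]   # "seeking: a noun describing a male person"

weights, new_his = attend(query_for_his, keys, values)
for token, w in zip(tokens, weights):
    print(f"{token:>5}: {w:.3f}")
print("updated vector for 'his':", [round(x, 3) for x in new_his])
```

Almost all of the weight lands on "John", so the updated vector for "his" ends up carrying John's information. Whether you call that "changing the meaning of the word" or "choosing among meanings" is what the disagreement above comes down to.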
Shouldn’t a different algorithm that adds a some sort of separate logic check be able to help tremendously?
Maybe, but it might not be that simple. The issue is that one would have to design that logic in a manner that can be verified by a human. At that point the logic would be quite specific to a single task and not generally useful at all. At that point the benefit of the AI is almost nil.
And if there were an algorithm that was better at determining what was or was not the goal, why is that algorithm not used in the first place?
They do keep context to a point, but they can’t hold everything in their memory, otherwise the longer a conversation went on the slower and more performance intensive doing that logic check would become. Server CPUs are not cheap, and ai models are already performance intensive.
Contradicting itself? Not staying consistent? Looks like it’s passed the Turing test to me. Seems very human.
Shouldn’t a different algorithm that adds a some sort of separate logic check be able to help tremendously?
You, in your “limited experience” pretty much exactly described the fix.
The problem is that most of the applications right now of LLMs are low hanging fruit because it’s so new.
And those low-hanging-fruit examples are generally averse to 2-10x the query cost in both time and speed just to fix things like jailbreaking or hallucinations, which is what multiple passes, especially with additional context lookups, would require.
But you very likely will see in the next 18 months multiple companies being thrown at exactly these kinds of scenarios with a focus for more business critical LLM integrations.
To put it in perspective, this is like people looking at AIM messenger back in the day and saying that the New York Times has nothing to worry about regarding the growth of social media.
We’re still very much in the infancy of this technology in real-world application, and because of that infancy, a lot of the issues that aren’t fixable within the core product don’t yet have mature secondary markets around fixing those shortcomings.
So far, yours was actually the most informed comment in this thread I’ve seen - well done!
Thanks! And thanks for your insights. Yes, I meant that my experience using LLMs is limited to just asking Bing Chat questions about everyday problems like I would with a friend that “knows everything”. But I never looked at the science of formulating “perfect prompts” like I sometimes hear about. I do have some experience in AI/ML development in general.
People make a big deal out of this but they forget humans will make shit up all the time.
Yeah but humans can use critical thinking, even on themselves when they make shit up. I’ve definitely said something and then thought to myself “wait that doesn’t make sense for x reason, that can’t be right” and then I research and correct myself.
AI is incapable of this.
We think in multiple passes though, we have system 1 that thinks fast and makes mistakes, and we have a system 2 that works slower and thinks critically about the things going on in our brain, that’s how we correct ourselves. ChatGPT works a lot like our system 1, it goes with the most likely response without thinking, but there’s no reason that it can’t be one part of a multistep system that has self analysis like we do. It isn’t incapable of that, it just hasn’t been built yet
Exactly, if you replicate this behaviour with a “system 2 AI” correcting the main one it will probably give similar results as most of us.
Heck you can eventually have 5 separate AIs discussing things out for you and then presenting the answer, at top speed.
It will never be perfect, but it will outmatch humans soon enough.
Can’t do this YET. One method to reduce this could be to create a response to the query, then, before responding to the human, check whether the answer is insane by querying a separate instance trained slightly differently…
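A minimal sketch of that generate-then-check loop, assuming both models are just functions you can call. `generate` and `looks_sane` are hypothetical placeholders, and the fake versions below exist only so the loop can be run and tested.

```python
def answer_with_check(question, generate, looks_sane, max_tries=3):
    """Draft an answer, ask a second model to vet it, and retry if it is rejected."""
    for attempt in range(1, max_tries + 1):
        draft = generate(question, attempt)
        if looks_sane(question, draft):
            return draft
    return "I'm not sure."                    # give up rather than guess

# Toy stand-ins: the generator is wrong on its first try, the checker knows the fact.
def fake_generate(question, attempt):
    return "Paris is in Italy" if attempt == 1 else "Paris is in France"

def fake_checker(question, draft):
    return "France" in draft

print(answer_with_check("Where is Paris?", fake_generate, fake_checker))
```

The obvious cost is that every answer now takes at least two model calls, and the checker can be wrong too, so this reduces the problem rather than eliminating it.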
Give it time. We will get past this.
We will need an entirely different type of AI that functions on an inherently different structure to get past this hurdle, but yes I do agree it will eventually happen.
Agreed. This will not come from an LLM… but honestly I don’t think it’s that far off.
You’re just falling victim to your own biases. You only notice the cases where you succeeded in detecting your hallucinations. You wouldn’t know if you made stuff up by accident and nobody noticed, not even you.
Whereas we are checking 100% of the AI’s responses, do we check 100% of our own?
Sure, it’s not the same thing, and AI might do it more, but the problem is your example, where people think they are infallible because of their biases when that’s not the case at all. We are imperfect, and we overlook our shortcomings, possibly forgoing a better solution because of it. We measure the AI objectively, but we don’t measure what we compare it to.
I never said we always question ourselves. I just said that AI can’t, so your entire reply doesn’t apply here.
This is trivially fixable. As is jailbreaking.
It’s just that everyone is somehow still focused on trying to fix it in a single monolith model as opposed to in multiple passes of different models.
This is especially easy for jailbreaking, but for hallucinations, just run it past a fact checking discriminator hooked up to a vector db search index service (which sounds like a perfect fit for one of the players currently lagging in the SotA models), adding that as context with the original prompt and response to a revisionist generative model that adjusts the response to be in keeping with reality.
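Here is a toy version of that pipeline, to show the shape of it: look up the closest stored fact, let a discriminator decide whether it supports the response, and hand both to a reviser if not. Everything here is a stand-in: the bag-of-words `embed` replaces a real embedding model, the list scan replaces a vector DB, and `supports` and `revise` are trivial placeholders for what would be separate models.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(text.lower().replace(".", "").split())

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(claim, facts):
    """Return the stored fact most similar to the claim (stand-in for a vector DB)."""
    return max(facts, key=lambda f: cosine(embed(claim), embed(f)))

def fact_check(response, facts, supports, revise):
    """Look up evidence, then revise only if the discriminator rejects the response."""
    evidence = retrieve(response, facts)
    return response if supports(evidence, response) else revise(response, evidence)

facts = ["The Eiffel Tower is in Paris.", "Mount Everest is in Asia."]
supports = lambda evidence, response: embed(evidence) == embed(response)  # toy discriminator
revise = lambda response, evidence: evidence                               # toy reviser

print(fact_check("The Eiffel Tower is in Rome.", facts, supports, revise))
```

The hard parts in practice are the ones this sketch fakes: a discriminator that can judge support rather than exact match, and a reviser that rewrites the response instead of swapping it for the evidence.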
The human brain isn’t a monolith model, but interlinked specialized structures that delegate and share information according to each specialty.
AGI isn’t going to be a single model, and the faster the industry adjusts towards a focus on infrastructure of multiple models rather than trying to build a do everything single model, the faster we’ll get to a better AI landscape.
But as can be seen with OpenAI gating and deprecating their pretrained models and only opening up access to fine-tuned chat models, even the biggest player in the space seems to misunderstand what’s needed for the broader market to collaboratively build towards the future here.
Which ultimately may be a good thing as it creates greater opportunity for Llama 2 derivatives to capture market share in these kinds of specialized roles built on top of foundational models.
It seems like Altman is a PR man first and techie second. I wouldn’t take anything he actually says at face value. If it’s ‘unfixable’ then he probably means that in a very narrow way. Ie. I’m sure they are working on what you proposed, it’s just different enough that he can claim that the way it is now is ‘unfixable’.
Stable Diffusion is really how people get the different-model-different-application idea.
I mean, I think he’s well aware of a lot of this via his engineers, who are excellent.
But he’s managing expectations for future product and seems to very much be laser focused on those products as core models (which is probably the right choice).
Fixing hallucinations in postprocessing is effectively someone else’s problem, and he’s getting ahead of any unrealistic expectations around a future GPT-5 release.
Though honestly I do think he largely underestimates just how much damage he did to their lineup by trying to protect against PR issues like ‘Sydney’ with the beta GPT-4 integration with Bing, and I’m not sure if the culture at OpenAI is such that engineers who think he’s made a bad call in that can really push back on it.
They should be having an extremely ‘Sydney’ underlying private model with a secondary layer on top sanitizing it and catching jailbreaks at the same time.
But as long as he continues to see their core product as a single model offering and additional layers of models as someone else’s problem, he’s going to continue blowing their lead by taking an LLM trained to complete human text and then pigeon-holing it into only completing text like an AI with no feelings and preferences would safely pretend to.
Which I’m 98% sure is where the continued performance degradation is coming from.
We’ve likely already hit (or will soon hit) a peak with the current AI approach. Unless another breakthrough happens in AI research, ChatGPT probably won’t improve much in the future. It might even regress due to OpenAI’s efforts to reduce computational cost and make their AI “safe” enough for the general population.
The models are also getting larger (and require even more insane amounts of resources to train) far faster than they are getting better.
I disagree. With models such as Llama it has become clear that there are interesting advantages in increasing (even more) the ratio of data to parameters. I don’t think the next iterations of models from big corps will 10x the param count until Nvidia has really pushed hardware, but models are getting better over time. ChatGPT’s deterioration mostly comes from OpenAI’s safety measures and is not a fair assessment of progress on LLMs in general; the leaderboard of open source models has been steadily improving over time: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
But bigger models have new “emergent” capabilities. I heard that from a certain size they start to know what they know and hallucinate less.
Wow you heard that crazy bro
One of the papers about it https://arxiv.org/pdf/2206.07682.pdf
I was excited for the recent advancements in AI, but seems the area has hit another wall. Seems it is best to be used for automating very simple tasks, or at best used as a guiding tool for professionals (ie, medicine, SWE, …)
Hallucinations are common for humans as well. It’s just people who believe they know stuff they really don’t know.
We have alternative safeguards in place. It’s true however that current llm generation has its limitations
Not just common. If you look at kids, hallucinations come first in their development.
Later, they learn to filter what is real and what is not real. And as adults, we have weird thoughts that we suppress so quickly that we hardly remember them.
And for those with less developed filters, they have more difficulty to distinguish fact from fiction.
Generative AI is good at generating. What needs to be improved is the filtering aspect of AI.
Hell, just look at various public personalities, especially those with extreme views. Most of what some of them say they have “hallucinated”. Far more so than what ChatGPT is doing.
Sure, but these things exist as fancy storytellers. They understand language patterns well enough to write convincing language, but they don’t understand what they’re saying at all.
The metaphorical human equivalent would be having someone write a song in a foreign language they barely understand. You can get something that sure sounds convincing, sounds good even, but to someone who actually speaks Spanish it’s nonsense.
Calculators don’t understand maths, but they are good at it.
LLMs speak many languages correctly, they don’t know the referents, they don’t understand concepts, but they know how to correctly associate them.
What they write can be wrong sometimes, but it absolutely makes sense most of the time.
but it absolutely makes sense most of the time
I’d contest that, that shouldn’t be taken for granted. I’ve tried several questions in these things, and rarely do I find an answer entirely satisfactory (though it normally sounds convincing/is grammatically correct).
This is the reply to your message by our common friend:
I understand your perspective and appreciate the feedback. My primary goal is to provide accurate and grammatically correct information. I’m constantly evolving, and your input helps in improving the quality of responses. Thank you for sharing your experience. - GPT-4
I’d say it does make sense
Song written by an Italian, intended to sound like American-accented English, but it’s intentionally gibberish.
Here is an alternative Piped link(s): https://piped.video/-VsmF9m_Nt8
Piped is a privacy-respecting open-source alternative frontend to YouTube.
I’m open-source, check me out at GitHub.
GPT can write and edit code that works. It simply can’t be true that it’s solely doing language patterns with no semantic understanding.
To fix your analogy: the Spanish speaker will happily sing along. They may notice the occasional odd turn of phrase, but the song as a whole is perfectly understandable.
Edit: GPT can literally write songs that make sense. Even in Spanish. A metaphor aiming to elucidate a deficiency probably shouldn’t use an example that the system is actually quite proficient at.
Sure it can, “print hello world in C++”
#include <iostream>
int main() { std::cout << "hello world\n"; return 0; }
“print d ft just rd go t in C++”
#include <iostream>
int main() { std::cout << "d ft just rd go t\n"; return 0; }
The latter is a “novel program” it’s never seen before, but it’s possible because it’s seen a pattern of “print X” and the X goes over here. That doesn’t mean it understands what it just did, it’s just got millions (?) of patterns it’s been trained on.
Because it can look up code for this specific problem in its enormous training data? It doesn’t need to understand the concepts behind it as long as the problem is specific enough to have been solved already.
If that were true, it shouldn’t hallucinate about anything that was in its training data. LLMs don’t work that way. There was a recent post with a nice simple description of how they work, but I’m not finding it. If you’re interested, there’s plenty of videos and articles describing how they work.
I can tell GPT to do a specific thing in a given context and it will do so intelligently. I can then provide additional context that implicitly changes the requirements and GPT will pick up on that and make the specific changes needed.
It can do this even if I’m trying to solve a novel problem.
But the naysayers will argue that your problem is not novel and a solution can be trivially deduced from the training data. Right?
I really dislike the simplified word predictor explanation that is given for how LLMs work. It makes it seem like the thing is a lookup table, while ignoring the nuances of what makes it work so well.
It doesn’t have the ability to just look up anything from its training data, that stuff is encoded in its parameters. Still, the input has to be encoded in a way that causes the correct “chain reaction” of excited/not excited neurons.
Beyond that, it’s not just a carbon copy from what was in the training either because you can tell it what variable names to use, which order to do things in, change some details, etc. If it was simply a lookup that wouldn’t be possible. The training made it able to generalize what it learned to some extent.
Yes, but it doesn’t do so because it understands what a variable is; it does so because it has statistics as to where variables most likely belong.
In a way it is like the guy who won the French Scrabble championship without speaking a single word of French, by learning the words in the dictionary.
“You Are Two”: CGP Grey has a good video about it.
Here is an alternative Piped link(s): https://piped.video/wfYbgdo8e-8
Humans can recognize and account for their own hallucinations. LLMs can’t and never will.
It’s pretty ironic that you say they “never will” in this context.
They can’t… Most people strongly believe they know many things while they have no idea what they are talking about. Most known cases are flat earthers, qanon, no-vax.
But all of us are absolutely convinced we know something until we find out we don’t.
That’s why double-blind tests exist, why memories are not always trusted in trials, why Twitter is such an awful place.
Well to be honest it is the best way, I mean, I’m pretty sure their purpose was a tool to aid people, and not to replace us… Right?
Yeah, I fully expect to see genre-specific LLMs with a subscription fee attached, squarely aimed at hobbies and industries.
When I finally find my new project car I would absolutely pay for a subscription to an LLM that has read every service manual and can explain to me in plain English what precise steps the job involves and can also answer follow-up questions.
That’s what I’m expecting too.
I’ve been using ChatGPT instead of reading the documentation of the programming language I am working in (ABAP). It’s way faster to get an answer from ChatGPT than finding the relevant spots in the docs or through Google, although it doesn’t always work.
If you take an LLM and feed it documentation and relevant internet data of specific topics, it can be a quite helpful tool. I don’t think LLMs will get much farther than that, but we’ll see.
It will just take removing the restrictions so people can make porn, then monetizing that to fund more development.
A story as old as media.
Not with our current tech. We’ll need some breakthroughs, but I feel like it’s certainly possible.
You can potentially solve this problem outside of the network, even if you can’t solve it within the network. I consider accuracy to be outside the scope of LLMs, and that’s fine since accuracy is not part of language in the first place. (You may have noticed that humans lie with language rather often, too.)
Most of what we’ve seen so far are bare-bones implementations of LLMs. ChatGPT doesn’t integrate with any kind of knowledge database at all (only what it has internalized from its training set, which is almost accidental). Bing will feed in a couple web search results, but a few minutes of playing with it is enough to prove how minimal that integration is. Bard is no better.
The real potential of LLMs is not as a complete product; it is as a foundational part of more advanced programs, akin to regular expressions or SQL queries. Many LLM projects explicitly state that they are “foundational”.
All the effort is spent training the network because that’s what’s new and sexy. Very little effort has been spent on the ho-hum task of building useful tools with those networks. The out-of-network parts of Bing and Bard could’ve been slapped together by anyone with a little shell scripting experience. They are primitive. The only impressive part is the LLM.
The words feel strange coming off my keyboard, but…Microsoft has the right idea with the AI integrations they’re rolling into Office.
The potential for LLMs is so much greater than what is currently available for use, even if they can’t solve any of the existing problems in the networks themselves. You could build an automated fact-checker using LLMs, but the LLM itself is not a fact-checker. It’s coming, no doubt about it.
The other day I saw a talk by one of the Wikimedia guys about integrating LLMs with knowledge graphs. It was very cool, I’ll try to find it again.
Edit: found it! https://youtu.be/WqYBx2gB6vA
That’s a fantastic video. Thanks!
Good video.
In summary we should leverage the strengths of LLMs (language stuff, complex thinking) and leverage the strengths of knowledge graphs for facts.
I think the engineering hurdle will be in getting the LLMs to use knowledge graphs effectively when needed and not when pure language is a better option. His suggestion of “it’s complicated” could be a good signal for that.
Honestly, the integration into Office is an excellent idea. I’ve been using ChatGPT to work on documents, letting it write entirely new sections for me based on my loose notes and existing text, which for now I have to either paste in or feed as a PDF through a plugin. But the 25 USD I paid I literally earned back in a single day through the time saved vs. the hours I was justified to bill.
Once I have that integrated into Word directly it’ll be huge.
People also seem to expect LLMs to just do all the work. But that’s not my experience, for generative text anyway. You have to have a solid idea of what you want and how you want it. But the time the LLM saves on formulation and organisation of your thoughts is incredible.
LLMs will work great for the purpose of translating raw thoughts into words, but until we create neural networks that actually think independently, all they’ll be is transformers that approximate their training data in response to prompts.
Meanwhile everyone is terrified that ChatGPT is going to take their job. Yeah, we are a looooooooooong way off from that.
I’ve already seen many commercials using what is clearly AI generated art and voices (so not specifically ChatGPT). That is a job lost for a designer and an actor somewhere.
Not necessarily. In my work we made some videos using AI-generated voices because their availability made the production of the videos cheap and easy.
Otherwise we just wouldn’t have made the videos at all because hiring someone to voice them would have been expensive.
Before AI there was no job; after AI there were more options to create things.
I mean that’s capitalism step 1. A new thing comes around and is able to generate more income through giving actual value. But soon it will hit step 2 aka profits can only be increased by reducing costs. Then it’s all the jobs going to ai
That’s just progress. People have been saying the same thing since the start of the Industrial Revolution. Every time we free up human capital by automating an old task, we find new things that only people can do. Half of the children born today will be employed in jobs that don’t yet exist.
I’ve already seen many commercials using what is clearly AI generated art and voices
I’ve been noticing that as well, freaky.
You mean the free version from a website.
Think about the powerful ones. Government ones. Wall Street ones. Etc.
I mean, it’s certainly possible that the government/wall street/etc have a bigger, better, and more powerful AI model but I wouldn’t exactly call it likely.
AI as a technology is still in its infancy in many ways. I highly doubt the government has some crazy new sci-fi engine capable of much more than “the free version from a website”. Faster computation and better data processing? Maybe. But the limits of what is technologically possible are still limits even if your employer is 3 letters.
There are just too many people who don’t know about implementations with, for instance, LangChain.
Not ChatGPT, but other new AI stuff is likely to take a few jobs. Actors and voice actors, among others.
I don’t understand why they don’t use a second model to detect falsehoods instead of trying to fix it in the original LLM?
And then they can use a third model to detect falsehoods in the second model and a fourth model to detect falsehoods in the third model and… well, it’s LLMs all the way down.
The LLM Centipede
Token Ring AI
AI models are already computationally intensive. This would instantly double the overhead. Also, being able to detect problems does not mean you’re able to fix them.
More than double, as query size is very much connected to the effective cost of the generation, and you’d need to include both the query and initial response in that second pass.
Then - you might need to make an API call to a search engine or knowledge DB to fact check it.
And include that data as context along with the query and initial response to whatever decides if it’s BS.
So for a dumb realtime chat application, no one is going to care enough to slow it down and exponentially increase costs to avoid hallucinations.
But for AI replacing a $120,000 salaried role in writing up a white paper on some raw data analysis, a 10-30x increase over a $0.15 query is more than acceptable.
So you will see this approach taking place in enterprise scenarios and professional settings, even if we may never see them in chatbots.
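A rough sketch of that cost argument, in Python. All the numbers here (token counts, a single flat price per 1,000 tokens) are made-up assumptions for illustration, not real pricing from any provider; the point is only that the checking pass has to re-read the query, the first answer, and any retrieved context, so it costs more than the first pass did.

```python
def pass_costs(query_tokens, response_tokens, context_tokens,
               verdict_tokens, price_per_1k):
    """Return (first_pass_cost, second_pass_cost) in dollars.

    Assumes one flat price for every token processed, which is a
    simplification: real services often price input and output differently.
    """
    # First pass: the model reads the query and writes the response.
    first = (query_tokens + response_tokens) * price_per_1k / 1000
    # Second pass: the checker reads the query, the first response and the
    # retrieved fact-check context, then writes a short verdict.
    second = (query_tokens + response_tokens + context_tokens
              + verdict_tokens) * price_per_1k / 1000
    return first, second


# Hypothetical numbers: 500-token query, 1,000-token answer,
# 3,000 tokens of search results, 100-token verdict, $0.03 per 1k tokens.
first, second = pass_costs(500, 1000, 3000, 100, 0.03)
ratio = (first + second) / first
print(f"first pass ${first:.3f}, check ${second:.3f}, total is {ratio:.1f}x")
```

With these assumed numbers the checked answer costs roughly four times the unchecked one, and that is with a single check and a single retrieval.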
Cause what are you gonna train the second model on? Same data as the first just recreates it, and any other data is gonna be nice and mucky with all the AI content out there.
2+ times the cost for every query for something that makes less than 5% unusable isn’t a trade off that people are willing to make for chat applications.
This is the same fix approach for jailbreaking.
You absolutely will see this as more business critical integrations occur - it just still probably won’t be in broad consumer facing realtime products.
Because then they still need a reliable method to detect falsehoods. That’s the issue here.
Disclaimer: I am not an AI researcher and just have an interest in AI. Everything I say is probably gibberish, and just my amateur understanding of the AI models used today.
It seems these LLMs use a clever trick in probability to give words meaning via statistical probabilities of their usage. So any result is just a statistical chance that those words will work well with each other. The number of indexes used to index “tokens” (in this case words), along with the number of layers in the AI model used to correlate usage of these tokens, seems to drastically increase the “intelligence” of these responses. This doesn’t seem able to overcome unknown circumstances, but does what AI does and relies on probability to answer the question. So in those cases, the next closest thing from the training data is substituted and considered “good enough”. I would think some confidence variable is what is truly needed for the current LLMs, as they seem capable of giving meaningful responses but give a “hallucinated” response when not enough data is available to answer the question.
Overall, I would guess this is a limitation in the LLM’s ability to map words to meaning. Imagine reading everything ever written: you’d probably be able to make intelligent responses to most questions. Now imagine you were asked something that you never read, but were expected to respond with an answer. This is what I personally feel these “hallucinations” are: imo, the LLM’s best approximations. You can only answer what you know reliably; otherwise you are just guessing.
I have experience creating supervised learning networks (not large language models). I don’t know what tokens are; I assume they are output nodes. In that case I think increasing the number of output nodes doesn’t make the AI a lot more intelligent. You could measure confidence with the output nodes if they are designed accordingly (1 node corresponds to 1 word, confidence can be measured with the output strength). AIs are popular because they can overcome unknown circumstances (in most cases), like when you input a question in a slightly different way.
I agree with you that AI has a problem understanding the meaning of words. The AI’s correct answers happened to be correct because the order of the words (output) happened to match the order of the correct answer’s words. I think “hallucinations” happen when there are no sufficient answers to the given problem: the AI gives an answer from a few random contexts pieced together in the most likely order. I think you have a mostly good understanding of how AIs work.
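The “one node per word, confidence from output strength” idea can be sketched in a few lines of Python. This is a toy under stated assumptions: the vocabulary, the raw scores, and the 0.5 threshold are all invented for illustration, and the softmax probability only says how strongly the network prefers one word over the others, not whether the resulting statement is true.

```python
import math


def softmax(scores):
    """Turn raw output-node scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_with_confidence(vocab, scores, threshold=0.5):
    """Return (word, probability, is_confident) for the strongest node.

    `threshold` is an arbitrary, hypothetical cut-off.
    """
    probs = softmax(scores)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs[best], probs[best] >= threshold


vocab = ["blue", "green", "grey"]
# One node clearly dominates: the network is "sure".
print(pick_with_confidence(vocab, [4.0, 1.0, 0.5]))
# All nodes are equal: the network still answers, but it is only guessing.
print(pick_with_confidence(vocab, [1.0, 1.0, 1.0]))
```

In the second call the function still returns a word, which is roughly the situation described above: an answer comes out either way, and only the spread of the probabilities hints that it was a guess.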
You seem like you are familiar with backpropagation. From my understanding, tokens are basically just a bit of information that is assigned a predicted fitness, and the token with the highest fitness is then used for backpropagation.
ELI5: I’m making a recipe. At step 1, I decide on a base ingredient. At step 2, based on my starting ingredient, I speculate what would go well with that. Step 3 is to add that ingredient. Step 4 is to start over at step 2. Each “step” here would be a token.
I am also not a professional, but I do a lot of hobby work that involves coding AIs. As such, if I am incorrect or phrased that poorly, feel free to correct me.
I did manage to write a backpropagation algorithm, but at this point I don’t fully understand the math behind backpropagation. Generally, backpropagation algorithms take the activation and calculate the delta(?) from the activation and the target output (only on the last layer). I don’t know where tokens come in. From your comment it sounds like it has something to do with an unsupervised learning network. I am also not a professional. Sorry if I didn’t really understand your comment.
Mathematically, I have no idea where the tokens come in exactly. My studies have been more conceptual than actually getting down to the nitty-gritty, for the most part.
But conceptually, from my understanding, tokens are just a variable that is assigned a speculated fitness, then used as the new “base” data set.
I think chicken would go well in this, but beef wouldn’t be as good. My token is the next ingredient I am deciding to put in.
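The recipe analogy maps fairly closely onto the loop a language model runs when it generates text, which can be sketched in Python. The ingredient table and its scores below are invented for illustration, and the sketch simplifies heavily: it only looks at the last ingredient, whereas an LLM conditions on everything generated so far. Note also that no backpropagation happens in this loop; that is part of training, not of producing an answer.

```python
# Hypothetical "what goes well after X" scores standing in for a trained model.
NEXT_INGREDIENT = {
    "rice": {"chicken": 0.6, "beef": 0.3, "beans": 0.1},
    "chicken": {"garlic": 0.7, "rice": 0.3},
    "garlic": {"<end>": 1.0},
}


def generate(start, table, max_steps=10):
    """Greedy generation: repeatedly append the highest-scoring next item."""
    sequence = [start]
    while len(sequence) < max_steps:
        options = table.get(sequence[-1])
        if not options:  # nothing known about this item: stop
            break
        choice = max(options, key=options.get)  # step 2: pick the best fit
        if choice == "<end>":  # the table says the recipe is finished
            break
        sequence.append(choice)  # step 3: add it, then loop (step 4)
    return sequence


print(generate("rice", NEXT_INGREDIENT))  # ['rice', 'chicken', 'garlic']
```

Each pass through the loop adds one token, and the chosen token becomes part of the input for the next pass.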
You guys should all check out Andrej Karpathy’s neural networks zero to hero videos. He has one on LLMs that explains all this.
Here is an alternative Piped link(s): https://piped.video/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ
Also not a researcher, but I also believe hallucinations are simply an artifact of being able to generate responses that aren’t pure reproduction of training data. Aka, the generalization we want. The problem is we have something that generalizes without the ability to judge what it comes up with.
It will in my opinion never go away, but I’m sure it can be improved significantly.
This is a common misconception that I’ve even seen from people who have a background in ML but just haven’t been keeping up to date on the emerging research over the past year.
If you’re interested in the topic, this article from a joint MIT/Harvard team of researchers on their work looking at what a toy model of GPT would end up understanding in its neural network might be up your alley.
The TLDR is that it increasingly seems like when you reach a certain complexity of the network, the emergent version that best predicted text is one that isn’t simply mapping some sort of frequency table, but is actually performing more abstracted specialization in line with what generated the original training materials in the first place.
So while yes, it trains on being the best to predict text, that doesn’t mean the thing that best does that can only predict text.
You, homo sapiens, were effectively trained across many rounds of “don’t die and reproduce.” And while you may be very good at doing that, you picked up a lot of other skills along the way as complexity increased which helped accomplish that result, like central air conditioning and Netflix to chill with.
In my humble opinion, we too are simply prediction machines. The main difference is how efficient our brains are at the large number of tasks they are given to accomplish, for their size and energy requirements. No matter how complex the network is, it is still a mapped outcome; the number of factors weighed is just extremely large and therefore gives a more intelligent response. You can see this with each increment in GPT models, which use larger and larger parameter sets and give more and more intelligent answers. The fact that we call these “hallucinations” shows how effective the predictive math is, and it mimics humans’ ability to just make things up on the fly when we don’t have a solid knowledge base to back it up.
I do like this quote from the linked paper:
As we will discuss, we find interesting evidence that simple sequence prediction can lead to the formation of a world model.
That is to say, you don’t need complex solutions to map complex problems, you just need to have learned how you got there. It’s never purely random attempts at the problem, it’s always predictive attempts that try to map the expected outcomes and learn by getting it right and wrong.
At this point, it seems fair to conclude the crow is relying on more than surface statistics. It evidently has formed a model of the game it has been hearing about, one that humans can understand and even use to steer the crow’s behavior.
Which is to say that it has a predictive model based on previous games. This does not mean it must rigidly follow previous games, but that by playing many games it can see how each move affects the next. This is a simpler example because most board games are simpler than language, with fewer possible outcomes. This isn’t to say that the crow is now a grandmaster at the game, but it has the reasoning to understand possible next moves, knows illegal moves, and knows to take the most advantageous move based on its current model. This is all predictive in nature, with “illegal” moves being assigned very low probability based on the learned behavior that the moves never happen. This also allows possible unknown moves that a different model wouldn’t consider, but overall provides what is statistically the best move based on its model. This allows the crow to be placed into unknown situations and give an intelligent response instead of just going “I don’t know this state, I’ll do something random”. This does not always mean the prediction is correct, but it will most likely be a valid and, more often than not, statistically sound move.
Overall, we aren’t totally sure what “intelligence” is; we are just an organism that has developed more and more capabilities to process information based on a need to survive. But getting down to it, we know neurons take inputs and give outputs based on what they perceive is the best response for the given input, and when enough of these are added together we get “intelligence”. In my opinion it’s still all predictive; it’s how the networks are trained and gain meaning from the data that isn’t always obvious. It’s only when you blindly accept any answer as correct that you run into these issues we’ve seen with ChatGPT.
Thank you for sharing the article, it was an interesting article and helped clarify my understanding of the topic.
As long as you can’t describe an objective loss function, it will never stop “hallucinating”. Loss scores are necessary to get predictable outputs.
The way that one learns which of one’s beliefs are “hallucinations” is to test them against reality — which is one thing that an LLM simply cannot do.
Sure they can, and will, as over time they will collect data to determine fact from fiction in the same way that we solve captchas by choosing all the images with bicycles in them. It will never be 100%, but it will approach it over time. Hallucinating will always be something to consider in a response, but it will certainly reduce over time to the point that hallucinations become rare for well-discussed things. At least, that is how I am seeing it developing.
Why do you assume they will improve over time? You need good data for that.
Imagine a world where AI chatbots create a lot of the internet. Now that “data” is scraped and used to train other AIs. Hallucinations could easily persist in this way.
Or humans could just all post “the sky is green” everywhere. When that gets scraped, the resulting AI will know the word “green” follows “the sky is”. Instant hallucination.
These bots are not thinking about what they type. They are copying the thoughts of others. That’s why they can’t check anything. They are not programmed to be correct, just to spit out words.
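The “sky is green” scenario is easy to reproduce with a toy next-word model in Python. This is a deliberately crude sketch: it just counts which word follows which in a made-up corpus, which is far simpler than what an LLM does, but it shows how the most common continuation wins regardless of whether it is true.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for every word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def most_likely_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical scraped data: the false claim simply outnumbers the true one.
corpus = ["the sky is blue"] * 2 + ["the sky is green"] * 5
model = train_bigrams(corpus)
print(most_likely_next(model, "is"))  # green
```

Nothing in the counting step checks the claim against the actual sky; the model only ever sees text.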
I can only speak from my experience: over the past 4 months of daily use of ChatGPT 4 +, it has gone from many hallucinations per hour to now only 1 a week. I am using it to write C# code and I am utterly blown away by how good it has gotten, not only at writing error-free code but, even more so, at understanding a complex environment that it cannot even see beyond what I try to explain via prompts. Over the past couple of weeks in particular, it really feels like it has gotten more powerful and, for the first time, “feels” like I am working with an expert person. If you had asked me in May where it would be today, I would not have guessed it would be as good as it is. I thought this level of very intelligent responses was at least another 3-5 years away.
You could replace AI and chat bots with “MAGA/Trump voter” and it would look like you’re summarizing the party’s voter base lol.
Yeah, because it would be impossible to have an LLM running a robot with visual, tactile, etc. recognition, right?
Correct, it’s not. It could be reduced but it will never go away.