We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so. Concretely, we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent. Within this environment, the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision.
The large language model takes in language, so it’s only understand things in terms of language. This isn’t surprising. Personally, I’ve tasted a cookie. I’ve crushed one in my fist watching it crumble, and I remember the sound. I’ve seen how they were made, and I’ve made them myself. It feels good when I eat it, apparently that’s the dopamine. Why can’t the LLM understand cookies the way I do? The most glaring difference is it doesn’t have my body. It doesn’t have all of my different senses constantly feeding data into it, and it doesn’t have a body with muscles to manipulate it’s environment, and observe the results. I argue that we shouldn’t assume that human consciousness has a “special sauce” until our model’s inputs and outputs are similar to our own, the model’s scaled/modified sufficiently, and it’s still not sentient/sapient by our standards, whatever they are.
My problem with the Chinese room is that how it applies depends on scale. Where do you draw the line between understanding and executing a program? An atom bonding with another atom? A lipid snuggling next to a neighboring lipid? A single neuron cell firing to its neighbor? One section of the nervous system sending signals to the other? One homo sapien speaking to another? Hell, let’s go one further: one culture influencing another? Do we actually have free will and sapience, or are we just complicated enough, through layers and layers of Chinese rooms inside of Chinese buildings inside of Chinese cities inside of China itself, that we assume that we are for practical purposes?