One of the most compelling reasons why a superintelligent (i.e., way smarter than human) artificial intelligence (AI) may end up destroying us is the so-called paperclip apocalypse. Posited by Nick Bostrom, this involves some random engineer creating an AI with the goal of making paperclips. That AI then becomes superintelligent and, in the single-minded pursuit of paperclip making, ends up appropriating all of the world’s (and maybe the universe’s) resources. In the process, it wipes us out, perhaps because it sees us as a threat to its paperclip-making goals. It is superintelligent, so the assumption is that it can find a way to do all of this. (There is now a mobile game where you can play paperclip AI to your heart’s content.)
This is a worrying hypothesis to an economist because (a) we know that it is hard to align incentives in a principal-agent setting, of which this is definitely one; and (b) we also know that a random engineer could certainly end up activating such an AI. So if the argument is to be believed, it is only a matter of time before we are all wiped out, if not as a result of paperclips then because of something else.
To date, there have been no serious objections to the control problem that lies at the heart of the paperclip apocalypse prediction. But I was always bothered by it. I wondered if we were missing something obvious, especially with an argument that should be capable of formalisation in a proper economic or game theory model.
I now believe I have found a potential argument that might give us some comfort and, while it is preliminary (that is, it hasn’t gone anywhere near peer review), I think it is worth sharing. It is posted to arXiv as a paper entitled “Self-Regulating Artificial General Intelligence.”
The paper provides a formal model, but let me describe the argument intuitively. The key to it all is the control problem: if you activate an AI that is superintelligent, you can’t control it. I assume that this control problem does not go away. That means that to prevent the apocalypse, either no one activates such an AI (and we all agree some human will do so regardless) or an activated AI must choose not to destroy us even though all it wants is more paperclips.
To this mix I add an assumption that postulates how an AI may end up getting all resources in pursuit of its goals. It needs two things. First, it needs the ability to turn all manner of resources into paperclips. Under current technology it needs suitable metals, but one can imagine that there may be some technology that allows paperclips to be made out of pretty much anything. Second, it needs a way to get those resources out of the hands of those who currently have them (namely, us). So it needs powers (you know, Terminator-like powers) to do that. Both of those things can be rationalised in service of the ultimate paperclip-making goal, but they require sub-goals to achieve those other things.
An implicit assumption in the paperclip apocalypse argument, therefore, is that our paperclip-making AI can self-improve to do all of the above. Fair enough. But, in the process, isn’t there a risk it will stop being all about the paperclips? For instance, what if the best Terminator AI is an independent AI with an off-switch held by the paperclip AI? But if the paperclip AI activates the Terminator AI, couldn’t that AI gain power, perhaps become even more intelligent, and see the original AI with its off-switch as a threat and … you know the rest. Or, if it wanted to develop Terminator capabilities itself, couldn’t the original AI be worried that this might change its own goals without it realising and … you know the rest.
The point here is that the paperclip apocalypse argument assumes that there is a control problem that humans cannot overcome but the paperclip AI can. Moreover, the paperclip AI has to be 100 percent certain of this, or it will be worried that self-improvement in the wrong way will lead to its doom. And that is the last thing it wants. So if the control problem exists for the paperclip AI, and there are compelling reasons (I think) that it will, then a superintelligent paperclip AI will not activate Terminator-like capabilities for fear of losing that control.
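One way to make this reasoning concrete is a toy expected-value comparison. The sketch below is not the formal model in the paper; the function names, the payoff structure, and all the numbers are illustrative assumptions. It supposes the paperclip AI compares a safe status quo (`baseline` paperclips) against activating a more powerful capability that yields `gain` paperclips if it keeps control and `loss` if it does not.

```python
def activates(p_control: float, gain: float, baseline: float, loss: float) -> bool:
    """True if expected paperclips from activating beat the status quo.

    All arguments are hypothetical quantities chosen for illustration:
    p_control -- probability the AI keeps control of the new capability
    gain      -- paperclips if it keeps control
    baseline  -- paperclips if it never activates the capability
    loss      -- paperclips if it loses control (can be very negative)
    """
    expected = p_control * gain + (1.0 - p_control) * loss
    return expected > baseline


def control_threshold(gain: float, baseline: float, loss: float) -> float:
    """Probability of control at which activating exactly matches the status quo.

    Assumes gain > baseline >= loss, so the result lies in [0, 1).
    """
    return (baseline - loss) / (gain - loss)


if __name__ == "__main__":
    # Mild downside: losing control just means making no paperclips.
    print(control_threshold(gain=100.0, baseline=10.0, loss=0.0))
    # Severe downside: losing control destroys far more than is at stake.
    print(control_threshold(gain=100.0, baseline=10.0, loss=-1000.0))
    # With a severe downside, even odds of keeping control are not enough.
    print(activates(p_control=0.5, gain=100.0, baseline=10.0, loss=-1000.0))
```

Under these assumptions, the worse the outcome from losing control, the closer to certainty the AI has to be before it activates anything. The "100 percent certain" case in the argument corresponds to the limit where losing control is catastrophic relative to whatever could be gained.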
This moves the argument down a level to consider what self-improving AI really means and how it interacts with both its goals and control. If superintelligent AIs cannot control themselves, that should actually give us comfort that they won’t themselves risk activating capabilities that might end up destroying us. In other words, if the control problem is something fundamental, we are all in it together and can draw comfort from the idea that an AI might be smarter than we are and not choose to destroy everything.
3 Replies to “Can a superintelligence self-regulate and not destroy us?”
I like this sort of thinking a lot, thinking about what the notion of self might contain. Currently self seems to me a sort of conceptual escape hatch that lets us try to find a way around the fact that we live in a social world and our social relations are strategic: what we get from interaction depends not only on what we do, but on what other(s) do(es) in response. So here the word super stands in for the right answer, whatever the problem is. As illustrations, in the face of New Inst Econ recognition that contracts have problems, there arose the idea that somehow there could be self-enforcing contracts. In the world of capitalism characterized by the fundamental relation of production, we have the wonderful category of people who sell their services directly in a market rather than to a capitalist, the self-employed. And in the other direction, one longs for that factory which runs entirely on its own, with trucks that dump in the raw materials at the one end, and the opening of the cornucopia at the other, untouched by human hands along the way. You can, I’m sure, find many more interesting examples of such usage, but I think now they all suffer the paradox of self-reference, which lets you think you’ve got more than you do, even while you don’t even have enough to figure out whether what you are saying is true or not: does a post card that says on both sides “How do you keep a moron busy all day? (see over)” really have two sides? Thanks for the blog.