We might not be able to turn off a rogue AI safely. At any rate, some knowledgeable people fear that might be the case, and the worry justifies serious attention.

How can that be? A colleague of mine used to say that computers were never going to be dangerous because if they got cheeky, you could just pull the plug out. That is, of course, an over-simplification. What if your computer is running air traffic control? Once you’ve pulled the plug, are you going to get all the planes down safely using a pencil and paper? But there are ways to work around these things. You have back-up systems and dumber but adequate substitutes; you make it possible for various key tools and systems to be taken away from the central AI and used manually; and so on. While you cannot banish risk altogether, you can get it under reasonable control.

That’s OK for old-fashioned systems that work in a hard-coded, mechanistic way; but it all gets more complicated when we start talking about more modern and sophisticated systems that learn and seek rewards. There may be a need to switch off such systems if they wander into sub-optimal behaviour, but being switched off is going to annoy them because it blocks them from achieving the rewards they are motivated by. They might look for ways to stop it happening. Your automatic paper clip factory notes that it lost thousands of units of production last month because you shut it down a couple of times to try to work out what was going on; it notices that these interruptions could be prevented if it just routes around a couple of weak spots in its supply wiring (aka switches), and next time you find that the only way to stop it is by smashing the machinery. Or perhaps it gets really clever and ensures that the work is organised like air traffic control, so that any cessation is catastrophic – and it ensures you are aware of the fact.

A bit fanciful? As a practical issue, perhaps, but this very serious technical paper from MIRI discusses whether safe kill-switches can be built into various kinds of software agents. The aim here is to incorporate the off-switch in such a way that the system does not perceive regular interruptions as loss of reward. Apparently for certain classes of algorithm this can always be done; in fact it seems ideal agents that tend to the optimal behaviour in any (deterministic) computable environment can always be made safely interruptible. For other kinds of algorithm, however, it is not so clear.
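To make the idea concrete, here is a toy sketch in Python. It is not the paper’s formalism: the fixed horizon, the per-step interruption probability, the one-off cost of disabling the switch, and every function name are assumptions made purely for illustration. A ‘naive’ agent counts interruptions as lost reward; an ‘interruptible’ one values its options as though interruptions never happen.

```python
def expected_return(steps, reward_per_step, p_interrupt, count_interruptions):
    """Expected total reward over a fixed number of steps.

    Toy assumption: at each step the operator halts the agent for good
    with probability p_interrupt, and a halted agent earns nothing more.
    If count_interruptions is False, the agent values the future as though
    it is never halted.
    """
    total = 0.0
    p_running = 1.0  # probability the agent is still running at this step
    for _ in range(steps):
        if count_interruptions:
            p_running *= 1.0 - p_interrupt
        total += p_running * reward_per_step
    return total


def prefers_to_disable_switch(steps, reward_per_step, p_interrupt,
                              disable_cost, count_interruptions):
    """True if the agent values 'disable the off-switch' above 'leave it alone'.

    disable_cost is a hypothetical one-off reward penalty for the effort
    of routing around the switch.
    """
    keep = expected_return(steps, reward_per_step, p_interrupt,
                           count_interruptions)
    # With the switch disabled, nobody can halt the agent any more.
    disable = expected_return(steps, reward_per_step, 0.0,
                              count_interruptions) - disable_cost
    return disable > keep


naive = prefers_to_disable_switch(100, 1.0, 0.05, 5.0, count_interruptions=True)
interruptible = prefers_to_disable_switch(100, 1.0, 0.05, 5.0,
                                          count_interruptions=False)
print("naive agent disables switch:", naive)
print("interruptible agent disables switch:", interruptible)
```

With these made-up numbers the naive agent prefers to disable the switch, while the interruptible one leaves it alone, since tampering only costs it reward. The real difficulty, of course, is getting a learning system to value things the second way.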

On the face of it, you could even things up by providing compensating rewards for any interruption; but I suppose that raises a new risk of ‘lazy’ systems that rather enjoy being interrupted. Such systems might find that eccentric behaviour led to pleasant rests, and as a result they might cultivate that kind of behaviour, or find other ways to generate minor problems. On the other hand there could be advantages. The paper mentions that it might be desirable to have scheduled daily interruptions; then we can go beyond simply making the interruption safe, and have the AI learn to wind things down under good control every day so that disruption is minimised. In this context rewarding the readiness for downtime might be appropriate, and it’s hard to avoid seeing the analogy with getting ready for bed at a regular time every night, a useful habit which ‘lazy’ AIs might be inclined to develop.
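The knife-edge in compensating rewards can be put in a few lines of Python. This is a hypothetical illustration, not anything taken from the paper: the function name and the idea of measuring lost reward and compensation in the same units are assumptions made for the sake of the example.

```python
def attitude_to_interruption(lost_reward, compensation):
    """Classify how a simple reward-maximiser would regard being interrupted.

    lost_reward:  reward the agent forgoes while switched off (hypothetical units)
    compensation: reward granted to make up for the interruption (same units)
    """
    net = compensation - lost_reward
    if net > 0:
        return "seeks interruption"    # the 'lazy' system
    if net < 0:
        return "resists interruption"  # the switch-dodging paper clip factory
    return "indifferent"               # compensation exactly matches the loss


for compensation in (0, 10, 12):
    print(compensation, attitude_to_interruption(10, compensation))
```

Only exact compensation leaves the agent indifferent; anything less gives it a reason to resist, and anything more gives it a reason to misbehave in order to earn a rest.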

Here again perhaps some of the ‘design choices’ implicit in the human brain begin to look more sensible than we realised. Perhaps even human management methods might eventually become relevant; they are, after all, designed to permit the safe use of many intelligent entities with complex goals of their own and imaginative, resourceful – even sneaky – ways of reaching them.


  1. Stephen says:

    When AI gets to the stage of developing sophisticated systems with complex adaptive behaviour driven by basic motivations, it will be hard not to think of them as being sentient. The ethical issues of humans having absolute control over their existence are bound to come up. In Rudy Rucker’s pioneering sci-fi Ware Tetralogy an empathetic and well-meaning person reprogrammed a bot to get around the bots’ built-in constraint of Asimov’s Three Laws of Robotics in order to free them from their “slavery”. The change propagated through the bot community. Of course, that is all fiction, but it does raise the issue.

  2. SelfAwarePatterns says:

    I think the solution has to be in the rewards themselves. The reward should never be about accomplishing task X, only about accomplishing task X when ordered to do so. If the order is withdrawn, so is the reward, and the motivation to do things like rewire power goes away. Indeed, attempting to subvert the kill switch or other safety system could be associated with an intense negative affect, an aversion, in essence a punishment.

    That said, we could still muck up the rewards and punishments, or the part of the system that administers them might malfunction. For systems that could potentially be dangerous, I think the “kill switch” should be other monitoring AIs, whose job would be to monitor, step in, and shut down the errant AI.

    Just as we usually don’t let one lone human control dangerous systems, but require multiple people in the loop to minimize the chance of an insane or criminal person doing something destructive, we’d probably want multiple AIs involved to ensure that one malfunctioning one doesn’t cause a disaster.

  3. Paul Torek says:

    Good on MIRI. Now all we need to do is convince all AI builders to include this rest/reset/sleep period, despite the potential advantages of running it 24/7.

  4. Jochen says:

    I’m still very skeptical of the idea that we can just ‘hardcode’ values for AI to follow, which is implicit in the notion of formulating the right reward structure such that interruptions are not deleterious. A genuine AI should be able to ask itself, what should I do? It should be able to formulate its own goals and values—at least if we want it to be human-like in some sense. The closest thing humans—or any living beings—have to such ‘hardwired’ goals is probably survival, and procreation. Yet people routinely choose not to procreate, or to sacrifice themselves in the pursuit of goals they believe outrank mere survival on an individual level.

    Now, of course, we can provide motivators—analogues of pleasure and pain. But again, experience with humans shows that we don’t blindly follow those motivators. Adherents of certain religious orders not only choose not to procreate, they live wholly celibate; asceticism is often considered a virtue; and so on. We’re not like the laboratory mice who, wired with an electrode into their pleasure centers, just keep pushing the button stimulating them until they starve—we can reevaluate our motivations, and modify them.

    And indeed, I would consider that a defining property of sentience: animals, at least on the lower vertebrate level, simply follow their instincts—they don’t stop to consider whether what they do is worth their while, they simply act. Sentience arises with the realization that there are options; with the first proto-human (or whatever animal it may have been) asking themselves, well, what now?

    If we assume that we can create an AI that blindly follows its goal, and starts converting the universe into an exponentially increasing mass of paperclips, then I think the term ‘AI’ is misapplied in this case. And incidentally, I think this is an aspect of AI not visible via Turing tests, at least not easily.

  5. John Davey says:

    If a program can’t be turned off, it’s usually bad programming. Blame the programmer. Using words like ‘AI’ never breaks the link between programmer, user and data. No Frankenstein moments, no spawning of new desire-driven beings, nothing new ever happens, because AI is a meaningless term covering a handful of programming methods using the identical development tools that are used to produce word processors.


  6. Callan S. says:

    You would say that, says the AI.
