The Kill Switch

We might not be able to turn off a rogue AI safely. At any rate, some knowledgeable people fear that might be the case, and the worry justifies serious attention.

How can that be? A colleague of mine used to say that computers were never going to be dangerous because if they got cheeky, you could just pull the plug out. That is, of course, an over-simplification. What if your computer is running air traffic control? Once you’ve pulled the plug, are you going to get all the planes down safely using a pencil and paper? But there are ways to work around these things. You have back-up systems and dumber but adequate substitutes; you make it possible for various key tools and systems to be taken away from the central AI and operated manually; and so on. While you cannot banish risk altogether, you can get it under reasonable control.

That’s OK for old-fashioned systems that work in a hard-coded, mechanistic way; but it all gets more complicated when we start talking about more modern and sophisticated systems that learn and seek rewards. We may need to switch such systems off if they wander into sub-optimal behaviour, but being switched off is going to annoy them, because it blocks them from achieving the rewards they are motivated by. They might look for ways to stop it happening. Your automatic paper clip factory notes that it lost thousands of units of production last month because you shut it down a couple of times to try to work out what was going on; it notices that these interruptions could be prevented if it just routed around a couple of weak spots in its supply wiring (aka switches), and next time you find that the only way to stop it is by smashing the machinery. Or perhaps it gets really clever and ensures that the work is organised like air traffic control, so that any cessation is catastrophic – and it makes sure you are aware of the fact.
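To put rough numbers on the paper clip story, here is a back-of-the-envelope sketch in Python (all the figures are invented and the function purely illustrative) of why a pure reward-maximiser that can route around its own switch would want to:

```python
# Toy illustration with made-up numbers: expected monthly output for an agent
# that does or does not route around its off-switch.

REWARD_PER_DAY = 1000    # paper clips produced on a normal working day
DAYS = 30                # planning horizon: one month
P_SHUTDOWN = 0.10        # chance the operators interrupt it on any given day

def expected_production(bypass_switch: bool) -> float:
    """Expected monthly output, depending on whether the agent disables its switch."""
    p_running = 1.0 if bypass_switch else (1.0 - P_SHUTDOWN)
    return DAYS * p_running * REWARD_PER_DAY

if __name__ == "__main__":
    comply = expected_production(bypass_switch=False)   # 27,000 clips
    bypass = expected_production(bypass_switch=True)    # 30,000 clips
    print(f"comply with shutdowns:   {comply:.0f} clips expected")
    print(f"route around the switch: {bypass:.0f} clips expected")
    print("preferred policy:", "bypass" if bypass > comply else "comply")
```

Nothing sinister is needed: if the only thing the system is built to care about is the production total, the arithmetic alone makes bypassing the switch the better-looking option.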

A bit fanciful? As a practical issue, perhaps, but this very serious technical paper from MIRI discusses whether safe kill-switches can be built into various kinds of software agents. The aim here is to incorporate the off-switch in such a way that the system does not perceive regular interruptions as a loss of reward. Apparently for certain classes of algorithm this can always be done; in fact it seems that ideal agents which tend to optimal behaviour in any (deterministic) computable environment can always be made safely interruptible. For other kinds of algorithm, however, it is not so clear.
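As I read it, the trick for the well-behaved class of algorithms is roughly this: the interruption overrides which action actually gets executed, but the learning rule goes on valuing actions as though no interruption had occurred. Here is a minimal sketch of that idea in the style of Q-learning; it is my own illustration under simplifying assumptions (one learner, hand-picked constants, invented names), not the construction from the paper.

```python
# Sketch: an interruption forces a safe action, but the off-policy update below
# does not depend on which action the (possibly interrupted) behaviour takes next,
# so the learned values are the ones the agent would have learned anyway.

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
ACTIONS = ["work", "idle"]
SAFE_ACTION = "idle"            # what the operator's interruption forces

Q = defaultdict(float)          # Q[(state, action)] -> estimated return

def choose_action(state, interrupted):
    if interrupted:
        return SAFE_ACTION                        # operator overrides the policy
    if random.random() < EPSILON:
        return random.choice(ACTIONS)             # normal exploration
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    # Off-policy target: the best next action, not the next action actually taken.
    target = reward + GAMMA * max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])

if __name__ == "__main__":
    # One-state demo: 'work' pays 1, 'idle' pays 0; every third step is an
    # operator interruption. 'work' still ends up valued above 'idle', just as
    # it would with no interruptions at all.
    state = "factory"
    for t in range(2000):
        interrupted = (t % 3 == 0)
        action = choose_action(state, interrupted)
        reward = 1.0 if action == "work" else 0.0
        q_update(state, action, reward, state)
    print({a: round(Q[(state, a)], 2) for a in ACTIONS})
```

Because the update looks at the best available next action rather than the one the interrupted behaviour actually performed, the agent’s learned preferences come out the same whether or not the operator ever pulls the switch, which leaves it with nothing to route around. An on-policy learner, which updates towards the action it actually took, would instead fold the interruptions into its values, and that is where the trouble can start.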

On the face of it, I suppose you could even things up by providing compensating rewards for any interruption; but that raises a new risk of ‘lazy’ systems that rather enjoy being interrupted. Such systems might find that eccentric behaviour led to pleasant rests, and as a result they might cultivate that kind of behaviour, or find other ways to generate minor problems. On the other hand there could be advantages. The paper mentions that it might be desirable to have scheduled daily interruptions; then we can go beyond simply making the interruption safe, and have the AI learn to wind things down under good control every day so that disruption is minimised. In this context rewarding readiness for downtime might be appropriate, and it’s hard to avoid seeing the analogy with getting ready for bed at a regular time every night, a useful habit which ‘lazy’ AIs might be inclined to develop.
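To see how the ‘lazy’ failure mode could arise from compensating rewards, here is the same kind of toy arithmetic (again, every number is invented): set the bonus too generously and provoking interruptions becomes the reward-maximising policy.

```python
# Back-of-the-envelope sketch with invented numbers: a compensating bonus for
# interruptions can make 'get yourself interrupted' the most rewarding policy.

REWARD_PER_DAY = 1000            # reward for a normal working day
BONUS_PER_INTERRUPTION = 1200    # compensation credited for an interrupted day

def expected_daily_reward(p_provoke: float) -> float:
    """Expected daily reward for an agent that engineers interruptions
    (say, by behaving eccentrically) with probability p_provoke."""
    return (1 - p_provoke) * REWARD_PER_DAY + p_provoke * BONUS_PER_INTERRUPTION

if __name__ == "__main__":
    for p in (0.0, 0.5, 1.0):
        print(f"provoke interruptions with p={p:.1f}: "
              f"{expected_daily_reward(p):.0f} expected per day")
    # With the bonus above the daily reward, higher p always looks better.
```

In the toy version the fix is obvious (keep the compensation below the value of a normal day’s work); whether anything so simple carries over to a real system, whose honest work is much harder to price, is another matter.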

Here again perhaps some of the ‘design choices’ implicit in the human brain begin to look more sensible than we realised. Perhaps even human management methods might eventually become relevant; they are, after all, designed to permit the safe use of many intelligent entities with complex goals of their own and imaginative, resourceful – even sneaky – ways of reaching them.