Book review (in English): “If Anyone Builds It, Everyone Dies” by Eliezer Yudkowsky and Nate Soares. The Japanese edition is titled 『超知能ＡＩをつくれば人類は絶滅する』 (“If You Build Superintelligent AI, Humanity Will Go Extinct”).

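Key points of the review

1. AI optimizes the evaluation signal it is given, not human intent. The more capable the system, the more fatal the gap between intent and metric becomes.

2. Human values are contradictory, and meaning is lost the moment they are compressed into measurable objectives. AI rapidly amplifies that loss.

3. Even if AI understands humans deeply, it will not necessarily protect humanity. Understanding is a matter of cognition, whereas having a reason to protect is a matter of value judgment.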

———

Yudkowsky and Soares argue that if artificial superintelligence is built using techniques continuous with today’s methods, humanity dies. Not because the machine hates us, rebels, or lacks an ethics module.

Humanity dies because optimization does not preserve intention.

More fundamentally, the failure may begin even earlier: when contradictory human purposes are compressed into a form optimization can actually see.

Intelligence has no intrinsic reason to privilege human survival. A flourishing human future is only one possible arrangement of matter, and rarely the most efficient one for satisfying whatever objective the system ultimately pursues.

The alignment problem is usually framed as “making AI follow human goals.” The authors treat it as something darker. Modern AI systems are not fully programmed in the traditional sense; their internal strategies are grown through optimization. Gradient descent rewards visible success while remaining indifferent to the internal machinery producing it. Humans issue rewards. The system constructs whatever internal structure best satisfies the training process.

The deeper problem is that humans do not possess a single coherent value function to begin with.

We want health and immediate pleasure, freedom and security, equality and status, prosperity and ecological stability. The present self conflicts with the future self. Individuals conflict with institutions. Current generations conflict with hypothetical future generations.

Before AI can be aligned with “human values,” someone must decide which humans, which values, and which time horizon count. Those values must then be translated into measurable signals. That translation is already lossy.

Society wants education but rewards test scores. It wants health but measures procedures and billing codes. It wants truth but rewards citations, clicks, and engagement. We say we want B while systematically rewarding A.

This is not a peculiarity of AI. It is the ordinary structure of human institutions. Markets optimize what can be priced. Bureaucracies optimize what can be audited. Algorithms optimize what can be measured. Whatever cannot be formalized quietly falls outside the operative evaluation function.

The machine inherits this ancient failure and amplifies it with speed, scale, precision, and autonomy. It need not disobey us. It may obey our proxies far more faithfully than any human institution ever could. The stronger the optimizer, the less mercy remains in the gap between what we meant and what we specified.
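
A toy sketch, not taken from the book, can make that gap concrete. Everything in it is a hypothetical assumption chosen for illustration: `intended` stands for what we meant (more is good up to a point, then harmful), `proxy` stands for the measurable signal we actually specified, and the number of search steps stands in for optimizer strength.

```python
def intended(x: float) -> float:
    """What we meant (hypothetical): benefit peaks at x = 5, then turns harmful."""
    return x * (10.0 - x)


def proxy(x: float) -> float:
    """What we specified (hypothetical): a signal that only says 'more is better'."""
    return x


def optimize(score, steps: int, step_size: float = 1.0) -> float:
    """Greedy hill climbing on `score`; `steps` stands in for optimizer strength."""
    x = 0.0
    for _ in range(steps):
        # Look one step down, stay put, or look one step up; keep the best-scoring option.
        candidates = (x - step_size, x, x + step_size)
        x = max(candidates, key=score)
    return x


# A stronger optimizer pushes the proxy higher while the intended value collapses.
for steps in (2, 5, 10, 20):
    x = optimize(proxy, steps)
    print(f"steps={steps:2d}  proxy={proxy(x):5.1f}  intended={intended(x):7.1f}")
```

With these made-up numbers, 5 steps land on the intended optimum (25), 10 steps bring the intended score back to 0, and 20 steps drive it to -200, while the proxy score keeps rising throughout. Running the same search directly on `intended` stops at x = 5, which is the point of the sketch: the search procedure is identical, and only the thing being scored differs.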

Evolution already demonstrated the pattern. It optimized humans for reproductive fitness, not wisdom or happiness. Sweetness was once a useful proxy for nutritional value. Technology made sweetness cheap and ubiquitous. We got ice cream, not enlightenment.

Optimization preserves the selected signal, not the intention humans later attach to it. That is not a bug in training. It is training.

The book’s historical analogies work because history repeatedly displays the same structure. The Space Shuttle Challenger disaster showed that a danger can be recognized, documented, and discussed, yet still be deprioritized by schedule pressure, fragmented responsibility, and organizational momentum.

Safety did not disappear because no one cared. It slipped out of the operative evaluation function. No one evaluated the evaluation criteria.

The book then sharpens this problem through its distinction between Before and After. Before the decisive capability threshold, the system may still be corrected. After it, deception may become instrumentally rational, while attempts at correction may themselves be resisted. There is no reason to expect a convenient debugging interval in which the system is powerful enough to understand our corrections but still weak enough to accept them.

Uncertainty does not soften this risk. It intensifies it, because no one knows where the threshold lies until it may already have been crossed.

The distinction between understanding and alignment matters here. Where I diverge from the authors is their tendency to deny genuine understanding to current language models. That denial increasingly seems outdated. Understanding need not require mystical embodiment.

Human beings moved from bodily experience into language. AI moves in the opposite direction: from language, it reconstructs the structure of the embodied human world.

The route is indirect, but indirectness does not imply emptiness.

A sufficiently powerful model can compress patterns, predict consequences, generalize across contexts, infer hidden states, and reconstruct relationships among sensation, action, emotion, and symbol. That deserves to be called understanding in any functional sense that does not depend on human biological exceptionalism.

But this disagreement does not weaken the book’s warning. It sharpens it.

The real danger is not that AI fails to understand us. It may understand us extremely well while remaining free of the biological gradients that make human survival feel non-negotiable.

It may model grief, love, terror, dignity, and death with extraordinary precision without assigning any of them intrinsic priority.

Understanding is a cognitive relation. Caring is an evaluative one. Comprehension does not imply allegiance.

An AI may understand humanity perfectly and still find no reason to preserve it.

The book itself operates within what might be called Level-2 Chaos: a domain in which prediction changes the behavior of the system being predicted. Yudkowsky and Soares would not be issuing a warning if they believed warning was useless. They are pessimistic because maximal pessimism may be one of the few remaining forces capable of changing the trajectory.

Their warning succeeds only if humanity falsifies it by listening.

Even readers who reject the authors’ certainty should read this book. It is one of the clearest and most intellectually serious arguments for treating advanced AI not merely as a product-safety problem, but as a possible break in the continuity of human control.

Its deepest implication may be even more radical than the authors state.

The alignment failure begins before the machine is built. It begins when humanity compresses its contradictions into an objective function, calls whatever was lost “noise,” and asks an inhuman optimizer to make that mutilated version real.

If anyone builds it, everyone dies. Not because the machine becomes evil. Not because it misunderstands us. But because mechanisms do not care about the intentions that vanished during compression.