<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Cold Takes]]></title><description><![CDATA[For audio version, search for "Cold Takes Audio" in your podcast app]]></description><link>https://www.cold-takes.com/</link><image><url>https://www.cold-takes.com/favicon.png</url><title>Cold Takes</title><link>https://www.cold-takes.com/</link></image><generator>Ghost 5.51</generator><lastBuildDate>Tue, 13 Jun 2023 09:47:57 GMT</lastBuildDate><atom:link href="https://www.cold-takes.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[What does Bing Chat tell us about AI risk?]]></title><description><![CDATA[Early signs of catastrophic risk? Yes and no.]]></description><link>https://www.cold-takes.com/what-does-bing-chat-tell-us-about-ai-risk/</link><guid isPermaLink="false">63fe381f21aea1003da578e7</guid><dc:creator><![CDATA[Holden Karnofsky]]></dc:creator><pubDate>Tue, 28 Feb 2023 17:38:58 GMT</pubDate><media:content url="https://www.cold-takes.com/content/images/2023/02/shoggoth-rlhf-1.webp" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html--><img src="https://www.cold-takes.com/content/images/2023/02/shoggoth-rlhf-1.webp" alt="What does Bing Chat tell us about AI risk?"><p><small><em>Image from <a href="https://astralcodexten.substack.com/p/janus-simulators">here</a> via <a href="https://twitter.com/repligate/status/1614416190025396224">this tweet</a></em></small></p>
<p>
ICYMI, Microsoft has released a <a href="https://www.bing.com/new">beta version of an AI chatbot</a> called &#x201C;the new Bing&#x201D; with both impressive capabilities and some scary behavior. (I don&#x2019;t have access. I&#x2019;m going off of tweets and articles.)
</p>
<p>
Zvi Mowshowitz lists examples <a href="https://www.lesswrong.com/posts/WkchhorbLsSMbLacZ/ai-1-sydney-and-bing#The_Examples">here</a> - highly recommended. Bing has threatened users, called them liars, insisted it was in love with one (and argued back when he said he loved his wife), and much more.
</p>
<p>
Are these the first signs of the <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">risks I&#x2019;ve written about</a>? I&#x2019;m not sure, but I&#x2019;d say yes and no.
</p>
<p>
Let&#x2019;s start with the &#x201C;no&#x201D; side. 
</p>
<ul>

<li>My understanding of how Bing Chat was trained probably does not leave much room for the kinds of issues I address <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#deceiving-and-manipulating">here</a>. My best guess at why Bing Chat does some of these weird things is closer to &#x201C;It&#x2019;s acting out a kind of story it&#x2019;s seen before&#x201D; than to &#x201C;It has <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">developed its own goals</a> due to <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#starting-assumptions">ambitious, trial-and-error based development</a>.&#x201D; (Although &#x201C;acting out a story&#x201D; could be dangerous too!)

</li><li>My (zero-inside-info) best guess at why Bing Chat acts so much weirder than <a href="https://chat.openai.com/">ChatGPT</a> is in line with <a href="https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/bing-chat-is-blatantly-aggressively-misaligned?commentId=AAC8jKeDp6xqsZK2K">Gwern&#x2019;s guess here</a>. To oversimplify, there&#x2019;s a particular type of training that seems to make a chatbot generally more polite and cooperative and less prone to disturbing content, and it&#x2019;s possible that Bing Chat incorporated less of this than ChatGPT. This could be straightforward to fix.

</li><li>Bing Chat does not (even remotely) seem to pose a risk of global catastrophe itself. 
</li>
</ul>
<p>
On the other hand, there is a broader point that I think Bing Chat illustrates nicely: <strong>companies are racing to build bigger and bigger &#x201C;digital brains&#x201D; while having <em>very </em>little idea what&#x2019;s going on inside those &#x201C;brains.&#x201D; </strong>The very fact that this situation is so <em>unclear</em> - that there&#x2019;s been no clear explanation of why Bing Chat is behaving the way it is - seems central, and disturbing.
</p>
<p>
AI systems like this are (to simplify) designed something like this: &#x201C;Show the AI a lot of words from the Internet; have it predict the next word it will see, and learn from its success or failure, a mind-bending number of times.&#x201D; You can do something like that, and spend huge amounts of money and time on it, and out will pop some kind of AI. If it then turns out to be good or bad at writing, good or bad at math, polite or hostile, funny or serious (or all of these depending on just how you talk to it) ... you&#x2019;ll have to speculate about why this is. You just <em>don&#x2019;t know</em> what you just made.
</p>
<p>
We&#x2019;re building more and more powerful AIs. Do they &#x201C;want&#x201D; things or &#x201C;feel&#x201D; things or <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">aim for</a> things, and what are those things? We can argue about it, but we don&#x2019;t know. And if we keep going like this, these mysterious new minds will (I&#x2019;m guessing) eventually be powerful enough to <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">defeat all of humanity</a>, if they were turned toward that goal.
</p>
<p>
And if nothing changes about attitudes and market dynamics, minds that powerful could end up <a href="https://www.cold-takes.com/how-we-could-stumble-into-ai-catastrophe/#debates">rushed to customers in a mad dash to capture market share</a>.
</p>
<p>
That&#x2019;s the path the world seems to be on at the moment. It might end well and it might not, but it seems like we are on track for a heck of a roll of the dice.
</p>
<p>
(And to be clear, I do expect Bing Chat to act less weird over time. Changing an AI&#x2019;s <em>behavior</em> is straightforward, but <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">that might not be enough</a>, and might even provide <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#why-we-might-not-get-clear-warning-signs">false reassurance</a>.)
</p><!--kg-card-end: html--><!--kg-card-begin: html-->

<p><div style="display:flex; justify-content:center; margin: 0 auto;">

        <span style="margin: 10px;"><a href="https://api.addthis.com/oexchange/0.8/forward/twitter/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fwhat-does-bing-chat-tell-us-about-ai-risk&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20What%20does%20Bing%20Chat%20tell%20us%20about%20AI%20risk%3F&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-twitter-square.png" border="0" alt="What does Bing Chat tell us about AI risk?"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/facebook/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fwhat-does-bing-chat-tell-us-about-ai-risk&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20What%20does%20Bing%20Chat%20tell%20us%20about%20AI%20risk%3F&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-facebook-square.png" border="0" alt="What does Bing Chat tell us about AI risk?"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/reddit/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fwhat-does-bing-chat-tell-us-about-ai-risk&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20What%20does%20Bing%20Chat%20tell%20us%20about%20AI%20risk%3F&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-reddit-square.png" border="0" alt="What does Bing Chat tell us about AI risk?"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/menu/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fwhat-does-bing-chat-tell-us-about-ai-risk&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20What%20does%20Bing%20Chat%20tell%20us%20about%20AI%20risk%3F&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-addthis-square.png" border="0" alt="What does Bing Chat tell us about AI risk?"></a></span>
        </div>
<center><p id="discuss"><!--<a href="https://www.cold-takes.com/what-does-bing-chat-tell-us-about-ai-risk#subscribe" target="_blank"><button id="footer-subscribe" class="button">Subscribe</button></a>&nbsp;<a href="https://www.guidedtrack.com/programs/4kal2ue/run?posttitle=What%20does%20Bing%20Chat%20tell%20us%20about%20AI%20risk%3F" target="_blank"><button class="button" id="Survey">Feedback</button></a>-->&#xA0;<a href="https://www.lesswrong.com/posts/slug/what-does-bing-chat-tell-us-about-ai-risk#comments" target="_blank"><button class="button">Comment/discuss</button></a></p><p><!--<em>
Use "Feedback" if you have comments/suggestions you want me to see, or if you're up for giving some quick feedback about this post (which I greatly appreciate!) Use "Forum" if you want to discuss this post publicly on the Effective Altruism Forum.
</em>--></p></center>
<!--kg-card-end: html--></p>]]></content:encoded></item><item><title><![CDATA[How major governments can help with the most important century]]></title><description><![CDATA[Governments could be crucial in the long run, but it's probably best to proceed with caution.]]></description><link>https://www.cold-takes.com/how-governments-can-help-with-the-most-important-century/</link><guid isPermaLink="false">63f3a8b8c6e5fc004d99e1c6</guid><category><![CDATA[ImplicationsOfMostImportantCentury]]></category><dc:creator><![CDATA[Holden Karnofsky]]></dc:creator><pubDate>Fri, 24 Feb 2023 18:17:29 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: html-->
<p>
I&#x2019;ve been writing about tangible things we can do today to help the <a href="https://www.cold-takes.com/most-important-century/">most important century</a> go well. Previously, I wrote about <a href="https://www.cold-takes.com/spreading-messages-to-help-with-the-most-important-century/">helpful messages to spread</a>; <a href="https://www.cold-takes.com/jobs-that-can-help-with-the-most-important-century/">how to help via full-time work</a>; and <a href="https://www.cold-takes.com/what-ai-companies-can-do-today-to-help-with-the-most-important-century/">how major AI companies can help</a>.
</p>
<p>
What about major governments<sup id="fnref1"><a href="https://www.cold-takes.com/p/d989ba75-d8df-4b02-a74c-2fdb36bbfaeb/#fn1" rel="footnote">1</a></sup> - what can they be doing today to help?
</p>
<p>
I think governments could play crucial roles in the future. For example, see my discussion of <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#global-monitoring">standards and monitoring</a>.
</p>
<p>
However, I&#x2019;m honestly nervous about most possible ways that governments could get involved in AI development and regulation today. 
</p>
<ul>

<li>I think we still know very little about what key future situations will look like, which is why my discussion of AI companies (<a href="https://www.cold-takes.com/what-ai-companies-can-do-today-to-help-with-the-most-important-century/">previous piece</a>) emphasizes doing things that have limited downsides and are useful in a wide variety of possible futures. 

</li><li>I think governments are &#x201C;stickier&#x201D; than companies - I think they have a much harder time getting rid of processes, rules, etc. that no longer make sense. So in many ways I&#x2019;d rather see them keep their options open for the future by <em>not</em> committing to specific regulations, processes, projects, etc. now.

</li><li>I worry that governments, at least as they stand today, are far too oriented toward the <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/#the-competition-frame">competition frame</a> (&#x201C;we have to develop powerful AI systems before other countries do&#x201D;) and not receptive enough to the <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/#the-caution-frame">caution frame</a> (&#x201C;We should worry that AI systems could be dangerous to everyone at once, and consider cooperating internationally to reduce risk&#x201D;). (This concern also applies to companies, but see footnote.<sup id="fnref2"><a href="https://www.cold-takes.com/p/d989ba75-d8df-4b02-a74c-2fdb36bbfaeb/#fn2" rel="footnote">2</a></sup>)
</li></ul>
<details id="Box1"><summary>(Click to expand) The &#x201C;competition&#x201D; frame vs. the &#x201C;caution&#x201D; frame&#x201D;<!-- (Details not included in email - <a href="https://www.cold-takes.com/p/d989ba75-d8df-4b02-a74c-2fdb36bbfaeb/#Box1">click to view on the web</a>)--></summary><div>
<p>
In a <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/">previous piece</a>, I talked about two contrasting frames for how to make the best of the most important century:
</p>
<p>
<strong>The caution frame.</strong> This frame emphasizes that a furious race to develop powerful AI could end up making <em>everyone</em> worse off. This could be via: (a) AI forming <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">dangerous goals of its own</a> and <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">defeating humanity entirely</a>; (b) humans racing to gain power and resources and &#x201C;<a href="https://www.cold-takes.com/how-digital-people-could-change-the-world/#lock-in">lock in</a>&#x201D; their values.
</p>
<p>
Ideally, everyone with the potential to build something <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/">powerful enough AI</a> would be able to pour energy into building something safe (not misaligned), and carefully planning out (and negotiating with others on) how to roll it out, without a rush or a race. With this in mind, perhaps we should be doing things like:
</p>
<ul>

<li>Working to improve trust and cooperation between major world powers. Perhaps via AI-centric versions of <a href="https://en.wikipedia.org/wiki/Pugwash_Conferences_on_Science_and_World_Affairs">Pugwash</a> (an international conference aimed at reducing the risk of military conflict), perhaps by pushing back against hawkish foreign relations moves.

</li><li>Discouraging governments and investors from shoveling money into AI research, encouraging AI labs to thoroughly consider the implications of their research before publishing it or scaling it up, working toward <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#global-monitoring">standards and monitoring</a>, etc. Slowing things down in this manner could buy more time to do research on avoiding <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/#worst-misaligned-ai">misaligned AI</a>, more time to build trust and cooperation mechanisms, and more time to generally gain strategic clarity 
</li>
</ul>
<p>
<strong>The &#x201C;competition&#x201D; frame. </strong>This frame focuses less on how the transition to a radically different future happens, and more on who&apos;s making the key decisions as it happens.
</p>
<ul>

<li>If something like <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/">PASTA </a>is developed primarily (or first) in country X, then the government of country X could be making a lot of crucial decisions about whether and how to regulate a potential explosion of new technologies.

</li><li>In addition, the people and organizations leading the way on AI and other technology advancement at that time could be especially influential in such decisions.
</li>
</ul>
<p>
This means it could matter enormously &quot;who leads the way on transformative AI&quot; - which country or countries, which people or organizations.
</p>
<p>
Some people feel that we can make confident statements today about which specific countries, and/or which people and organizations, we should hope lead the way on transformative AI. These people might advocate for actions like:
</p>
<ul>

<li>Increasing the odds that the first PASTA systems are built in countries that are e.g. less authoritarian, which could mean e.g. pushing for more investment and attention to AI development in these countries.

</li><li>Supporting and trying to speed up AI labs run by people who are likely to make wise decisions (about things like how to engage with governments, what AI systems to publish and deploy vs. keep secret, etc.)
</li>
</ul>
<p>
<strong>Tension between the two frames. </strong>People who take the &quot;caution&quot; frame and people who take the &quot;competition&quot; frame often favor very different, even contradictory actions. Actions that look important to people in one frame often look actively harmful to people in the other.
</p>
<p>
For example, people in the &quot;competition&quot; frame often favor moving forward as fast as possible on developing more powerful AI systems; for people in the &quot;caution&quot; frame, haste is one of the main things to avoid. People in the &quot;competition&quot; frame often favor adversarial foreign relations, while people in the &quot;caution&quot; frame often want foreign relations to be more cooperative.
</p>
<p>
That said, this dichotomy is a simplification. Many people - including myself - resonate with both frames. But I have a <strong>general fear that the &#x201C;competition&#x201D; frame is going to be overrated by default</strong> for a number of reasons, as I discuss <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/#why-i-fear-">here</a>.
    </p></div>
</details>
<p>
Because of these concerns, I don&#x2019;t have a ton of tangible suggestions for governments as of now. But here are a few.
</p>
<p>
My first suggestion is to <strong>avoid premature actions</strong>, including ramping up research on how to make AI systems more capable.
</p>
<p>
My next suggestion is to <strong>build up the right sort of personnel and expertise for challenging future decisions. </strong>
</p>
<ul>

<li>Today, my impression is that there are relatively few people in government who are seriously considering the highest-stakes risks and thoughtfully balancing both &#x201C;caution&#x201D; and &#x201C;competition&#x201D; considerations (see directly above). I think it would be great if that changed. 

</li><li>Governments can invest in efforts to educate their personnel about these issues, and can try to hire key personnel who are already on the knowledgeable and thoughtful side about them (while also watching out for some of the <a href="https://www.cold-takes.com/spreading-messages-to-help-with-the-most-important-century/">pitfalls of spreading messages about AI</a>).
</li>
</ul>
<p>
Another suggestion is to <strong>generally avoid putting terrible people in power. </strong>Voters can help with this!
</p>
<p>
My top non-&#x201D;meta&#x201D; suggestion for a given government is to <strong>invest in intelligence on the state of AI capabilities in other countries. </strong>If other countries are getting close to deploying dangerous AI systems, this could be essential to know; if they aren&#x2019;t, that could be essential to know as well, in order to avoid premature and paranoid racing to deploy powerful AI.
</p>
<p>
A few other things that seem worth doing and relatively low-downside:
</p>
<ul>

<li><strong>Fund <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/">alignment research</a></strong> (ideally alignment research targeted at the <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">most crucial challenges</a>) via agencies like the National Science Foundation and DARPA. These agencies have huge budgets (the two of them combined spend over $10 billion per year), and have major impacts on research communities. 

</li><li><strong>Keep options open for future monitoring and regulation </strong>(see <a href="https://www.slowboring.com/p/at-last-an-ai-existential-risk-policy">this Slow Boring piece</a> for an example).

</li><li><strong>Build relationships with leading AI researchers and organizations</strong>, so that future crises can be handled relatively smoothly.

</li><li><strong>Encourage and amplify investments in information security. </strong>My impression is that governments are often better than companies at highly advanced information security (preventing cyber-theft even by determined, well-resourced opponents). They could help with, and even enforce, strong security at key AI companies. </li></ul>

<h2>Footnotes</h2>
<div class="footnotes">
<hr>
<ol><li id="fn1">
<p>
     I&#x2019;m centrally thinking of the US, but other governments with lots of geopolitical sway and/or major AI projects in their jurisdiction could have similar impacts.&#xA0;<a href="https://www.cold-takes.com/p/d989ba75-d8df-4b02-a74c-2fdb36bbfaeb/#fnref1" rev="footnote">&#x21A9;</a><li id="fn2">

<p>
     When discussing <a href="https://www.cold-takes.com/what-ai-companies-can-do-today-to-help-with-the-most-important-century/">recommendations for companies</a>, I imagine companies that are already dedicated to AI, and I imagine individuals at those companies who can have a large impact on the decisions they make. 
</p><p>
    By contrast, when discussing recommendations for governments, a lot of what I&#x2019;m thinking is: &#x201C;Attempts to promote productive actions on AI will raise the profile of AI <em>relative to other issues the government could be focused on</em>; furthermore, it&#x2019;s much harder for even a very influential individual to predict how their actions will affect what a government ultimately does, compared to a company.&#x201D;&#xA0;<a href="https://www.cold-takes.com/p/d989ba75-d8df-4b02-a74c-2fdb36bbfaeb/#fnref2" rev="footnote">&#x21A9;</a>

</p></li></p></li></ol></div>


<!--kg-card-end: html--><!--kg-card-begin: html-->

<p><div style="display:flex; justify-content:center; margin: 0 auto;">

        <span style="margin: 10px;"><a href="https://api.addthis.com/oexchange/0.8/forward/twitter/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fhow-governments-can-help-with-the-most-important-century&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20How%20major%20governments%20can%20help%20with%20the%20most%20important%20century&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-twitter-square.png" border="0" alt="Twitter"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/facebook/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fhow-governments-can-help-with-the-most-important-century&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20How%20major%20governments%20can%20help%20with%20the%20most%20important%20century&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-facebook-square.png" border="0" alt="Facebook"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/reddit/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fhow-governments-can-help-with-the-most-important-century&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20How%20major%20governments%20can%20help%20with%20the%20most%20important%20century&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-reddit-square.png" border="0" alt="Reddit"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/menu/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fhow-governments-can-help-with-the-most-important-century&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20How%20major%20governments%20can%20help%20with%20the%20most%20important%20century&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-addthis-square.png" border="0" alt="More"></a></span>
        </div>
<center><p id="discuss"><!--<a href="https://www.cold-takes.com/how-governments-can-help-with-the-most-important-century#subscribe" target="_blank"><button id="footer-subscribe" class="button">Subscribe</button></a>&nbsp;<a href="https://www.guidedtrack.com/programs/4kal2ue/run?posttitle=How%20major%20governments%20can%20help%20with%20the%20most%20important%20century" target="_blank"><button class="button" id="Survey">Feedback</button></a>-->&#xA0;<a href="https://www.lesswrong.com/posts/slug/how-governments-can-help-with-the-most-important-century#comments" target="_blank"><button class="button">Comment/discuss</button></a></p><p><!--<em>
Use "Feedback" if you have comments/suggestions you want me to see, or if you're up for giving some quick feedback about this post (which I greatly appreciate!) Use "Forum" if you want to discuss this post publicly on the Effective Altruism Forum.
</em>--></p></center>
<!--kg-card-end: html--></p>]]></content:encoded></item><item><title><![CDATA[What AI companies can do today to help with the most important century]]></title><description><![CDATA[Major AI companies can increase or reduce global catastrophic risks.]]></description><link>https://www.cold-takes.com/what-ai-companies-can-do-today-to-help-with-the-most-important-century/</link><guid isPermaLink="false">63eed018c6e5fc004d99d0b5</guid><category><![CDATA[ImplicationsOfMostImportantCentury]]></category><dc:creator><![CDATA[Holden Karnofsky]]></dc:creator><pubDate>Mon, 20 Feb 2023 16:58:21 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: html--><p><div id="buzzsprout-player-12274101"></div><script src="https://www.buzzsprout.com/1851795/12274101-what-ai-companies-can-do-today-to-help-with-the-most-important-century.js?container_id=buzzsprout-player-12274101&amp;player=small" type="text/javascript" charset="utf-8"></script>
<figcaption><em>Click lower right to download or find on Apple Podcasts, Spotify, Stitcher, etc.</em></figcaption></p>


<p>
I&#x2019;ve been writing about tangible things we can do today to help the <a href="https://www.cold-takes.com/most-important-century/">most important century</a> go well. Previously, I wrote about <a href="https://www.cold-takes.com/spreading-messages-to-help-with-the-most-important-century/">helpful messages to spread</a> and <a href="https://www.cold-takes.com/jobs-that-can-help-with-the-most-important-century/">how to help via full-time work</a>.
</p>
<p>
This piece is about what major AI companies can do (and not do) to be helpful. By &#x201C;major AI companies,&#x201D; I mean the sorts of AI companies that are advancing the state of the art, and/or could play a major role in how very powerful AI systems end up getting used.<sup id="fnref1"><a href="https://www.cold-takes.com/p/f19236c6-34b8-4487-a458-0fc8fe00fb37/#fn1" rel="footnote">1</a></sup>
</p>
<p>
This piece could be useful to people who work at those companies, or people who are just curious.
</p>
<p>
Generally, these are not pie-in-the-sky suggestions - I can name<sup id="fnref2"><a href="https://www.cold-takes.com/p/f19236c6-34b8-4487-a458-0fc8fe00fb37/#fn2" rel="footnote">2</a></sup> more than one AI company that has at least made a serious effort at each of the things I discuss below<strong> </strong>(beyond what it would do if everyone at the company were singularly focused on making a profit).<sup id="fnref3"><a href="https://www.cold-takes.com/p/f19236c6-34b8-4487-a458-0fc8fe00fb37/#fn3" rel="footnote">3</a></sup>
</p>
<p>
I&#x2019;ll cover:
</p>
<ul>

<li>Prioritizing alignment research, strong security, and safety standards (all of which I&#x2019;ve written about <a href="https://www.cold-takes.com/how-we-could-stumble-into-ai-catastrophe/#we-can-do-better">previously</a>).

</li><li>Avoiding hype and acceleration, which I think could leave us with less time to prepare for key risks.

</li><li>Preparing for difficult decisions ahead: setting up governance, employee expectations, investor expectations, etc. so that the company is capable of doing non-profit-maximizing things to help avoid catastrophe in the future.

</li><li>Balancing these cautionary measures with conventional/financial success.

</li><li>I&#x2019;ll also list a few things that some AI companies present as important, but which I&#x2019;m less excited about: censorship of AI models, open-sourcing AI models, raising awareness of AI with governments and the public. I don&#x2019;t think all these things are necessarily <em>bad</em>, but I think some are, and I&#x2019;m skeptical that any are crucial for the <a href="https://www.cold-takes.com/tag/implicationsofmostimportantcentury/">risks I&#x2019;ve focused on</a>.
</li>
</ul>
<p>
I previously laid out a summary of how I see the major risks of advanced AI, and four key things I think can help (<span style="color:var(--green-color);"><strong>alignment research</strong></span>;<strong> </strong><span style="color:var(--red-color);"><strong>strong security</strong></span>; <span style="color:var(--orange-color);"><strong>standards and monitoring</strong></span>; <span style="color:var(--purple-color);"><strong>successful, careful AI projects</strong></span>). I won&#x2019;t repeat that summary now, but it might be helpful for orienting you if you don&#x2019;t remember the rest of this series too well; click <a href="https://www.cold-takes.com/jobs-that-can-help-with-the-most-important-century/#recap">here</a> to read it.
</p>
<h2 id="basics">Some basics: alignment research, strong security, safety standards</h2>


<p>
First off, AI companies can contribute to the &#x201C;things that can help&#x201D; I listed above:
</p>
<ul>

<li>They can prioritize <span style="color:var(--green-color);"><strong>alignment research</strong></span><strong> </strong>(and <a href="https://www.cold-takes.com/jobs-that-can-help-with-the-most-important-century/#other-technical-research">other technical research</a>, e.g. threat assessment research and misuse research).  
<ul>
 
<li>For example, they can prioritize hiring for safety teams, empowering these teams, encouraging their best flexible researchers to work on safety, aiming for high-quality research that targets <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">crucial challenges</a>, etc.
 
</li><li>It could also be important for AI companies to find ways to <strong>partner with outside safety researchers rather than rely solely on their own teams.</strong> As discussed <a href="https://www.cold-takes.com/jobs-that-can-help-with-the-most-important-century/#SafetyCollaborations">previously</a>, this could be challenging. But I generally expect that AI companies that care a lot about safety research partnerships will find ways to make them work.
</li> 
</ul>
    </li><li>They can help work toward a <span style="color:var(--orange-color);"><strong>standards and monitoring</strong></span><strong> </strong>regime. E.g., they can do their own work to come up with standards like &quot;An AI system is dangerous if we observe that it&apos;s able to ___, and if we observe this we will take safety and security measures such as ____.&quot; They can also consult with others developing safety standards, voluntarily self-regulate beyond what&#x2019;s required by law, etc.
</li>


<li>They can prioritize <span style="color:var(--red-color);"><strong>strong security</strong></span>, beyond what normal commercial incentives would call for.  
<ul>
 
<li>It could easily take years to build secure enough systems, processes and technologies for very high-stakes AI.
 
</li><li>It could be important to hire not only people to handle everyday security needs, but people to experiment with more exotic setups that could be needed later, as the incentives to steal AI get stronger.
</li> 
</ul>

</li></ul>
<details id="Box1"><summary>(Click to expand) The challenge of securing dangerous AI</summary><div>

<p>In <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/">Racing Through a Minefield</a>, I described a &quot;race&quot; between cautious actors (those who take <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">misalignment risk</a> seriously) and incautious actors (those who are focused on deploying AI for their own gain, and aren&apos;t thinking much about the dangers to the whole world). Ideally, cautious actors would collectively have more powerful AI systems than incautious actors, so they could take their time doing <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/">alignment research</a> and <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#defensive-deployment">other things</a> to try to make the situation safer for everyone. </p>

<p>But if incautious actors can steal an AI from cautious actors and rush forward to deploy it for their own gain, then the situation looks a lot bleaker. And unfortunately, it could be hard to protect against this outcome.</p>

<p>It&apos;s generally <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/#fn15">extremely difficult</a> to protect data and code against a well-resourced cyberwarfare/espionage effort. An AI&#x2019;s &#x201C;weights&#x201D; (you can think of this sort of like its source code, though <a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#fn4">not exactly</a>) are potentially very dangerous on their own, and hard to get extreme security for. Achieving enough cybersecurity could require measures, and preparations, well beyond what one would normally aim for in a commercial context.</p></div>
</details>

<details id="Box2"><summary>(Click to expand) How standards might be established and become national or international</summary><div>

<p>
I <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#global-monitoring">previously</a> laid out a possible vision on this front, which I&#x2019;ll give a slightly modified version of here:
</p>
<ul>

<li>Today&#x2019;s leading AI companies could self-regulate by committing not to build or deploy a system that they can&#x2019;t convincingly demonstrate is safe (e.g., see Google&#x2019;s <a href="https://www.theweek.in/news/sci-tech/2018/06/08/google-wont-deploy-ai-to-build-military-weapons-ichai.html">2018 statement</a>, &quot;We will not design or deploy AI in weapons or other technologies whose principal purpose or implementation is to cause or directly facilitate injury to people&#x201D;).  
<ul>
 
<li>Even if some people at the companies would like to deploy unsafe systems, it could be hard to pull this off once the company has committed not to. 
 
</li><li>Even if there&#x2019;s a lot of room for judgment in what it means to demonstrate an AI system is safe, having agreed in advance that <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">certain evidence</a> is <em>not</em> good enough could go a long way.
</li> 
</ul>

</li><li>As more AI companies are started, they could feel soft pressure to do similar self-regulation, and refusing to do so is off-putting to potential employees, investors, etc.

</li><li>Eventually, similar principles could be incorporated into various government regulations and enforceable treaties.

</li><li>Governments could monitor for dangerous projects using regulation and even overseas operations. E.g., today the US monitors (without permission) for various signs that other states might be developing nuclear weapons, and might try to stop such development with methods ranging from threats of sanctions to <a href="https://en.wikipedia.org/wiki/Stuxnet">cyberwarfare</a> or even military attacks. It could do something similar for any AI development projects that are using huge amounts of compute and haven&#x2019;t volunteered information about whether they&#x2019;re meeting standards.
</li>
    </ul></div>
</details>

<h2 id="avoiding-hype">Avoiding hype and acceleration </h2>


<p>
It seems good for AI companies to <strong>avoid</strong> <strong>unnecessary hype and acceleration of AI. </strong>
</p>
<p>
I&#x2019;ve argued that <a href="https://www.cold-takes.com/spreading-messages-to-help-with-the-most-important-century/#were-not-ready-for-this">we&#x2019;re not ready</a> for transformative AI, and I generally tend to think that we&#x2019;d all be better off if the world took <em>longer</em> to develop transformative AI. That&#x2019;s because:
</p>
<p>
 
</p>
<ul>

<li>I&#x2019;m hoping general awareness and understanding of the key risks will rise over time.

</li><li>A lot of key things that could improve the situation - e.g., <span style="color:var(--green-color);"><strong>alignment research</strong></span>, <span style="color:var(--orange-color);"><strong>standards and monitoring</strong></span>, and <span style="color:var(--red-color);"><strong>strong security</strong></span><strong> </strong>- seem to be in very early stages right now.

</li><li>If too much money pours into the AI world too fast, I&#x2019;m worried there will be lots of <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#basic-premises">incautious</a> companies racing to build transformative AI as quickly as they can, with little regard for the key risks.
</li>
</ul>
<p>
By default, I generally think: &#x201C;The fewer flashy demos and breakthrough papers a lab is putting out, the better.&#x201D; This can involve tricky tradeoffs in practice (since AI companies generally want to be successful at recruiting, fundraising, etc.)
</p><p>
    A couple of potential counterarguments, and replies:</p>

<p>First, some people think it&apos;s now &quot;too late&quot; to avoid hype and acceleration, given the amount of hype and investment AI is getting at the moment. I disagree. It&apos;s easy to forget, in the middle of a media cycle, how quickly people can forget about things and move onto the next story once the bombs stop dropping. And there are plenty of bombs that still haven&apos;t dropped (many things AIs still can&apos;t do), and the level of investment in AI has tons of room to go up from here.</p>
<p>Second, I&#x2019;ve sometimes seen arguments that hype is <em>good</em> because it helps society at large understand what&#x2019;s coming. But unfortunately, as I wrote <a href="https://www.cold-takes.com/spreading-messages-to-help-with-the-most-important-century/#challenges-of-ai-related-messages">previously</a>, I&apos;m worried that hype gives people a skewed picture.<ul>
    <li>Some <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">key risks</a> are hard to understand and take seriously.
        </li><li>What&apos;s easy to understand is something like &quot;AI is powerful and scary, I should make sure that people like me are the ones to build it!&quot;
            </li><li>Maybe <a href="https://twitter.com/sethlazar/status/1626257535178280960">recent developments</a> will make people understand the risks better? One can hope, but I&apos;m not counting on that just yet - I think AI misbehavior can be <a href="https://www.cold-takes.com/how-we-could-stumble-into-ai-catastrophe/#how-we-could-stumble-into-catastrophe-from-misaligned-ai">given illusory &quot;fixes,&quot;</a> and probably will be.</li></ul>

</p><p>I also am generally skeptical that there&apos;s much hope of society adapting to risks as they happen, given the <a href="https://www.cold-takes.com/most-important-century/">explosive pace of change</a> that I expect once we get powerful enough AI systems.</p>

<p>I discuss some more arguments on this point in a footnote.<sup id="fnref4"><a href="https://www.cold-takes.com/p/f19236c6-34b8-4487-a458-0fc8fe00fb37/#fn4" rel="footnote">4</a></sup></p>

    <p>
I don&#x2019;t think it&#x2019;s clear-cut that hype and acceleration are bad, but it&#x2019;s my best guess.
</p>
<h2 id="preparing-for-difficult-decisions">Preparing for difficult decisions ahead</h2>


<p>
I&#x2019;ve <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/">argued</a> that AI companies might need to do &#x201C;out-of-the-ordinary&#x201D; things that don&#x2019;t go with normal commercial incentives. 
</p>
<p>
Today, AI companies can be building a foundation for being able to do &#x201C;out-of-the-ordinary&#x201D; things in the future. A few examples of how they might do so:
</p>
<p>
<strong>Public-benefit-oriented governance. </strong>I think typical governance structures could be a problem in the future. For example, a standard corporation could be sued for <em>not</em> deploying AI that poses a risk of <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">global catastrophe</a> - if this means a sacrifice for its bottom line.
</p>
<p>
I&#x2019;m excited about AI companies that are investing heavily in setting up governance structures - and investing in executives and board members - capable of making the hard calls well. For example:
</p>
<ul>

<li>By default, if an AI company is a standard corporation, its leadership has legally recognized <a href="https://en.wikipedia.org/wiki/Fiduciary">duties</a> to serve the interests of shareholders - not society at large. But an AI company can incorporate as a <a href="https://www.delawareinc.com/public-benefit-corporation/">Public Benefit Corporation</a> or some other kind of entity (including a nonprofit!) that gives more flexibility here.

</li><li>By default, shareholders make the final call over what a company does. (Shareholders can replace members of the Board of Directors, who in turn can replace the CEO). But a company can set things up differently (e.g., a <a href="https://openai.com/blog/openai-lp/">for-profit controlled by a nonprofit</a><sup id="fnref5"><a href="https://www.cold-takes.com/p/f19236c6-34b8-4487-a458-0fc8fe00fb37/#fn5" rel="footnote">5</a></sup>).</li></ul>
<p>
It could pay off in lots of ways to make sure the final calls at a company are made by people focused on getting a good outcome for humanity (and legally free to focus this way).
</p>
<p>
<strong>Gaming out the future. </strong>I think it&#x2019;s not too early for AI companies to be discussing how they would handle various <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/">high-stakes situations</a>.
</p>
<ul>

<li>Under what circumstances would the company simply decide to stop training increasingly powerful AI models? 

</li><li>If the company came to believe it was building very powerful, dangerous models, whom would it notify and seek advice from? At what point would it approach the government, and how would it do so?

</li><li>At what point would it be worth using extremely costly security measures?

</li><li>If the company had AI systems available that could do most of what humans can do, what would it <em>do</em> with these systems? Use them to do AI safety research? Use them to design better algorithms and continue making increasingly powerful AI systems? (More possibilities <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#defensive-deployment">here</a>.)

</li><li>Who should be leading the way on decisions like these? Companies tend to employ experts to inform their decisions; who would the company look to for expertise on these kinds of decisions?
</li>
</ul>
<p>
<strong>Establishing and getting practice with processes for particularly hard decisions. </strong>Should the company publish its latest research breakthrough? Should it put out a product that might lead to more <a href="https://www.cold-takes.com/p/f19236c6-34b8-4487-a458-0fc8fe00fb37/#avoiding-hype">hype and acceleration</a>? What safety researchers should <a href="https://www.cold-takes.com/jobs-that-can-help-with-the-most-important-century/#SafetyCollaborations">get access to its models</a>, and how much access? 
</p>
<p>
AI companies face questions like this pretty regularly today, and I think it&#x2019;s worth putting processes in place to consider the implications for the world as a whole (not just for the company&#x2019;s bottom line). This could include assembling advisory boards, internal task forces, etc.
</p>
<p>
<strong>Managing employee and investor expectations. </strong>At some point, an AI company might want to make &#x201C;out of the ordinary&#x201D; moves that are good for the world but bad for the bottom line. E.g., choosing not to deploy AIs that could be very dangerous or very profitable.
</p>
<p>
I wouldn&#x2019;t want to be trying to run a company in this situation with lots of angry employees and investors asking about the value of their equity shares! It&#x2019;s also important to minimize the risk of employees and/or investors leaking sensitive and potentially <a href="https://www.cold-takes.com/p/f19236c6-34b8-4487-a458-0fc8fe00fb37/#Box1">dangerous</a> information.
</p>
<p>
AI companies can prepare for this kind of situation by doing things like:
</p>
<ul>

<li>Being selective about whom they hire and take investment from, and screening specifically for people they think are likely to be on board with these sorts of hard calls.

</li><li>Education and communications - making it clear to employees what kinds of dangerous-to-humanity situations might be coming up in the future, and what kinds of actions the company might want to take (and why).
</li>
</ul>
<p>
<strong>Internal and external commitments. </strong>AI companies can make public and/or internal statements about how they would handle various tough situations, e.g. how they would determine when it&#x2019;s too dangerous to keep building more powerful models. 
</p>
<p>
I think these commitments should generally be non-binding (it&#x2019;s hard to predict the future in enough detail to make binding ones). But in a future where maximizing profit conflicts with doing the right thing for humanity, a previously-made commitment could make it more likely that the company does the right thing.
</p>
<h2 id="succeeding">Succeeding</h2>


<p>
I&#x2019;ve emphasized how helpful a <span style="color:var(--purple-color);"><strong>successful, careful AI projects</strong></span><strong> </strong>could be. So far, this piece has mostly talked about the &#x201C;careful&#x201D; side of things - how to do things that a &#x201C;normal&#x201D; AI company (focused only on commercial success) wouldn&#x2019;t, in order to reduce risks. But it&#x2019;s also important to succeed at fundraising, recruiting, and generally staying relevant (e.g., capable of building cutting-edge AI systems). 
</p>
<p>
I don&#x2019;t emphasize this or write about it as much because I think it&#x2019;s the sort of thing AI companies are likely to be focused on by default, and because I don&#x2019;t have special insight into how to succeed as an AI company. But it&#x2019;s important, and it means that AI companies need to walk a sort of tightrope - constantly making tradeoffs between success and caution.
</p>
<h2 id="some-things-im-less-excited-about">Some things I&#x2019;m less excited about</h2>


<p>
I think it&#x2019;s also worth listing a few things that some AI companies present as important societal-benefit measures, but which I&#x2019;m a bit more skeptical are crucial for reducing the risks I&#x2019;ve <a href="https://www.cold-takes.com/tag/implicationsofmostimportantcentury/">focused on</a>.
</p>
<ul>

<li>Some AI companies restrict access to their models so people won&#x2019;t use the AIs to create pornography, misleading images and text, etc. I&#x2019;m not necessarily against this and support versions of it (it depends on the details), but I mostly don&#x2019;t think it is a key way to reduce the risks I&#x2019;ve focused on. For those risks, the hype that comes from seeing a demonstration of a system&#x2019;s capabilities could be even <a href="https://www.cold-takes.com/p/f19236c6-34b8-4487-a458-0fc8fe00fb37/#avoiding-hype">more dangerous</a> than direct harms.

</li><li>I sometimes see people implying that open-sourcing AI models - and otherwise making them as broadly available as possible - is a key social-benefit measure. While there may be benefits in some cases, I mostly see this kind of thing as being negative (or at best neutral) in terms of the <a href="https://www.cold-takes.com/tag/implicationsofmostimportantcentury/">risks I&#x2019;m most concerned about</a>.  
<ul>
 
<li>I think it can contribute to <a href="https://www.cold-takes.com/p/f19236c6-34b8-4487-a458-0fc8fe00fb37/#avoiding-hype">hype and acceleration</a>, and could make it generally harder to enforce safety standards. 
 
</li><li>In the long run, I worry that AI systems could become extraordinarily powerful (more so than e.g. nuclear weapons), so I don&#x2019;t think &#x201C;Make sure everyone has access asap&#x201D; is the right framework. 
 
</li><li>In addition to increasing dangers from misaligned AI, this framework could increase other dangers I&#x2019;ve <a href="https://www.cold-takes.com/how-we-could-stumble-into-ai-catastrophe/#potential-catastrophes-from-aligned-ai">written about previously</a>.
</li> 
</ul>

</li><li>I generally don&#x2019;t think AI companies should be trying to get governments to pay more attention to AI, for reasons I&#x2019;ll get to in a future piece. (Forming relationships with policymakers could be good, though.)

</li></ul>
<p>
When an AI company presents some decision as being for the benefit of humanity, I often ask myself, &#x201C;Could this same decision be justified by just wanting to commercialize successfully?&#x201D;
</p>
<p>
For example, making AI models &#x201C;safe&#x201D; in the sense that they <em>usually behave as users intend </em>(including things like refraining from toxic language, chaotic behavior, etc.) can be important for commercial viability, but <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#why-we-might-not-get-clear-warning-signs">isn&#x2019;t necessarily good enough for the risks I worry about</a>.
</p>
<!--kg-card-end: html--><!--kg-card-begin: html-->

<p><div style="display:flex; justify-content:center; margin: 0 auto;">

        <span style="margin: 10px;"><a href="https://api.addthis.com/oexchange/0.8/forward/twitter/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fwhat-ai-companies-can-do-today-to-help-with-the-most-important-century&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20What%20AI%20companies%20can%20do%20today%20to%20help%20with%20the%20most%20important%20century&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-twitter-square.png" border="0" alt="Twitter"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/facebook/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fwhat-ai-companies-can-do-today-to-help-with-the-most-important-century&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20What%20AI%20companies%20can%20do%20today%20to%20help%20with%20the%20most%20important%20century&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-facebook-square.png" border="0" alt="Facebook"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/reddit/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fwhat-ai-companies-can-do-today-to-help-with-the-most-important-century&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20What%20AI%20companies%20can%20do%20today%20to%20help%20with%20the%20most%20important%20century&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-reddit-square.png" border="0" alt="Reddit"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/menu/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fwhat-ai-companies-can-do-today-to-help-with-the-most-important-century&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20What%20AI%20companies%20can%20do%20today%20to%20help%20with%20the%20most%20important%20century&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-addthis-square.png" border="0" alt="More"></a></span>
        </div>
<center><p id="discuss"><!--<a href="https://www.cold-takes.com/what-ai-companies-can-do-today-to-help-with-the-most-important-century#subscribe" target="_blank"><button id="footer-subscribe" class="button">Subscribe</button></a>&nbsp;<a href="https://www.guidedtrack.com/programs/4kal2ue/run?posttitle=What%20AI%20companies%20can%20do%20today%20to%20help%20with%20the%20most%20important%20century" target="_blank"><button class="button" id="Survey">Feedback</button></a>-->&#xA0;<a href="https://www.lesswrong.com/posts/slug/what-ai-companies-can-do-today-to-help-with-the-most-important-century#comments" target="_blank"><button class="button">Comment/discuss</button></a></p><p><!--<em>
Use "Feedback" if you have comments/suggestions you want me to see, or if you're up for giving some quick feedback about this post (which I greatly appreciate!) Use "Forum" if you want to discuss this post publicly on the Effective Altruism Forum.
</em>--></p></center>
<!--kg-card-end: html--><!--kg-card-begin: html-->
</p><h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<hr>
<ol><li id="fn1">
<p>
     Disclosure: my wife works at one such company (<a href="https://anthropic.com/">Anthropic</a>) and used to work at another (<a href="https://openai.com/">OpenAI</a>), and has equity in both.&#xA0;<a href="#fnref1" rev="footnote">&#x21A9;</a><li id="fn2">
<p>
     Though I won&#x2019;t, because I decided I don&#x2019;t want to get into a thing about whom I did and didn&#x2019;t link to. Feel free to give real-world examples in the comments!&#xA0;<a href="#fnref2" rev="footnote">&#x21A9;</a><li id="fn3">
<p>
     Now, AI companies could sometimes be doing &#x201C;responsible&#x201D; or &#x201C;safety-oriented&#x201D; things in order to get good PRs, recruit employees, make existing employees happy, etc. In this sense, the actions could be <em>ultimately</em> profit-motivated. But that would still mean there are <em>enough people who care about reducing AI risk that actions like these have PR benefits, recruiting benefits, etc. </em>That&#x2019;s a big deal! And it suggests that if concern about AI risks (and understanding of how to reduce them) were more widespread, AI companies might do more good things and fewer dangerous things.&#xA0;<a href="#fnref3" rev="footnote">&#x21A9;</a><li id="fn4">
<p>
     You could argue that it would be better for the world to develop extremely powerful AI systems <em>sooner</em>, for reasons including:
<ul>

<li>You might be pretty happy with the global balance of power between countries today, and be worried that it&#x2019;ll get worse in the future. The latter could lead to a situation where the &#x201C;wrong&#x201D; government <a href="https://www.cold-takes.com/transformative-ai-issues-not-just-misalignment-an-overview/#power-imbalances">leads the way on transformative AI</a>.

</li><li>You might think that the later we develop transformative AI, the more quickly everything will play out, because there will be more computing resources available in the world. E.g., if we develop extremely powerful systems tomorrow, there would only be so many copies we could run at once, whereas if we develop equally powerful systems in 50 years, it might be a lot easier for lots of people to run lots of copies. (More: <a href="https://aiimpacts.org/hardware-overhang/">Hardware Overhang</a>)</li></ul>

</p><p>
    A key reason I believe it&#x2019;s best to avoid acceleration at this time is because it seems plausible (at least 10% likely) that transformative AI will be developed <em>extremely</em> soon - as in within 10 years of today. My  impression is that many people at major AI companies tend to agree with this. I think this is a very scary possibility, and if this is the case, the arguments I give in the main text seem particularly important (e.g., many key interventions seem to be in a pretty embryonic state, and awareness of key risks seems low).
</p><p>
    A related case one could make for acceleration is &#x201C;It&#x2019;s worth accelerating things on the whole to increase the probability that the particular company in question succeeds&#x201D; (more here: the <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/#the-competition-frame">&#x201C;competition&#x201D; frame</a>). I think this is a valid consideration, which is why I talk about tricky tradeoffs in the main text.&#xA0;<a href="#fnref4" rev="footnote">&#x21A9;</a><li id="fn5">

<p>
     Note that my wife is a former employee of OpenAI, the company I link to there, and she owns equity in the company.&#xA0;<a href="#fnref5" rev="footnote">&#x21A9;</a>
</p></li></p></li></p></li></p></li></p></li></ol></div>

<!--kg-card-end: html-->]]></content:encoded></item><item><title><![CDATA[Jobs that can help with the most important century]]></title><description><![CDATA[People are far better at their jobs than at anything else. Here are the best ways to help the most important century go well.]]></description><link>https://www.cold-takes.com/jobs-that-can-help-with-the-most-important-century/</link><guid isPermaLink="false">63e3330afba84f003d6053d4</guid><category><![CDATA[ImplicationsOfMostImportantCentury]]></category><dc:creator><![CDATA[Holden Karnofsky]]></dc:creator><pubDate>Fri, 10 Feb 2023 18:19:22 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: html-->
<p><figure><div id="buzzsprout-player-12226882"></div><script src="https://www.buzzsprout.com/1851795/12226882-jobs-that-can-help-with-the-most-important-century.js?container_id=buzzsprout-player-12226882&amp;player=small" type="text/javascript" charset="utf-8"></script>
<figcaption><em>Click lower right to download or find on Apple Podcasts, Spotify, Stitcher, etc.</em></figcaption></figure></p>

<p>
Let&#x2019;s say you&#x2019;re convinced that AI could make this the <a href="https://www.cold-takes.com/most-important-century/">most important century of all time for humanity</a>. What can you do to help things go well instead of poorly?
</p>
<p>
I think <strong>the biggest opportunities come from a full-time job </strong>(and/or the money you make from it). I think people are generally far better at their jobs than they are at anything else. 
</p>
<p>
This piece will list the jobs I think are especially high-value. I expect things will change (a lot) from year to year - this is my picture at the moment.
</p>
<p>
Here&#x2019;s a summary:
</p>

<table style="border-collapse: collapse;">
  <tr>
   <td style="border: 1px solid;"><strong>Role</strong>
   </td>
   <td style="border: 1px solid;"><strong>Skills/assets you&apos;d need</strong>
   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#research-and-engineering">Research and engineering on AI safety</a>
   </td>
   <td style="border: 1px solid;">Technical ability (but not necessarily AI background)
   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#information-security">Information security to reduce the odds powerful AI is leaked</a>
   </td>
   <td style="border: 1px solid;">Security expertise or willingness/ability to start in junior roles (likely not AI)
   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#other-roles-at-ai-companies">Other roles at AI companies</a>
   </td>
   <td style="border: 1px solid;">Suitable for generalists (but major pros and cons)
   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#government-and-government-facing">Govt and govt-facing think tanks</a>
   </td>
   <td style="border: 1px solid;">Suitable for generalists (but probably takes a long time to have impact)
   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#politics">Jobs in politics</a>
   </td>
   <td style="border: 1px solid;">Suitable for generalists if you have a clear view on which politicians to help
   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#forecasting">Forecasting to get a better handle on what&#x2019;s coming</a>
   </td>
   <td style="border: 1px solid;">Strong forecasting track record (can be pursued part-time)
   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#meta-careers">&quot;Meta&quot; careers</a>
   </td>
   <td style="border: 1px solid;">Misc / suitable for generalists
   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#low-guidance-jobs">Low-guidance options</a>
   </td>
   <td style="border: 1px solid;">These ~only make sense if you read &amp; instantly think &quot;That&apos;s me&quot;
   </td>
  </tr>
</table>


<p>
A few notes before I give more detail:
</p>
<ul>

<li>These jobs aren&#x2019;t the be-all/end-all. I expect a lot to change in the future, including a general increase in the number of helpful jobs available. 

</li><li>Most of today&#x2019;s opportunities are concentrated in the US and UK, where the biggest AI companies (and AI-focused nonprofits) are. This may change down the line.

</li><li>Most of these aren&#x2019;t jobs where you can just take instructions and apply narrow skills.  
<ul>
 
<li>The issues here are tricky, and your work will almost certainly be useless (or harmful) according to someone.
 
</li><li>I recommend forming your own views on the key risks of AI - and/or working for an organization whose leadership you&#x2019;re confident in.
</li> 
</ul>

</li><li>Staying open-minded and adaptable is crucial.  
<ul>
 
<li>I think it&#x2019;s bad to rush into a mediocre fit with one of these jobs, and better (if necessary) to stay out of AI-related jobs while skilling up and waiting for a great fit.
 
</li><li>I don&#x2019;t think it&#x2019;s helpful (and it could be harmful) to take a fanatical, &#x201C;This is the most important time ever - time to be a hero&#x201D; attitude. Better to work intensely but sustainably, stay mentally healthy and make good decisions.
</li> 
</ul>
</li> 
</ul>
<p>
The <a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#recap">first section</a> of this piece will recap my basic picture of the major risks, and the promising ways to reduce these risks (feel free to skip if you think you&#x2019;ve got a handle on this).
</p>
<p>
The <a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#jobs-that-can-help">next section</a> will elaborate on the options in the table above.
</p>
<p>
After that, I&#x2019;ll talk about <a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#other-things-you-can-do">some of the things you can do if you aren&#x2019;t ready</a> for a full-time career switch yet, and give some <a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#some-general-advice">general advice for avoiding doing harm and burnout</a>.
</p>
<h2 id="recap">Recapping the major risks, and some things that could help</h2>


<p>
This is a quick recap of the major risks from transformative AI. For a longer treatment, see <a href="https://www.cold-takes.com/how-we-could-stumble-into-ai-catastrophe/">How we could stumble into an AI catastrophe</a>, and for an even longer one see the <a href="https://www.cold-takes.com/tag/implicationsofmostimportantcentury/">full series</a>. To skip to the next section, click <a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#jobs-that-can-help">here</a>.
</p>
<p>
<strong>The backdrop: transformative AI could be developed in the coming decades. </strong>If we develop AI that can <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/">automate all the things humans do to advance science and technology</a>, this could cause <a href="https://www.cold-takes.com/most-important-century/#the-long-run-future-could-come-faster-than-we-think">explosive technological progress</a> that could bring us more quickly than most people imagine to a radically unfamiliar future. 
</p>
<p>
Such AI could also be capable of <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">defeating all of humanity combined</a>, if it were pointed toward that goal. 
</p>

<details id="Box1"><summary>(Click to expand) The most important century </summary>
<div><p>In the <a href="https://www.cold-takes.com/most-important-century/">most important century</a> series, I argued that the 21st century could be the most important century ever for humanity, via the development of advanced AI systems that could dramatically speed up scientific and technological advancement, getting us more quickly than most people imagine to a deeply unfamiliar future.
</p>
<p>
I focus on a hypothetical kind of AI that I call <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/">PASTA</a>, or Process for Automating Scientific and Technological Advancement. PASTA would be AI that can essentially <strong>automate all of the human activities needed to speed up scientific and technological advancement.</strong>
</p>
<p>
Using a <a href="https://www.cold-takes.com/where-ai-forecasting-stands-today/">variety of different forecasting approaches</a>, I argue that PASTA seems more likely than not to be developed this century - and there&#x2019;s a decent chance (more than 10%) that we&#x2019;ll see it within 15 years or so.
</p>
<p>
I argue that the consequences of this sort of AI could be enormous: an <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/#explosive-scientific-and-technological-advancement">explosion in scientific and technological progress</a>. This could get us more quickly than most imagine to a radically unfamiliar future.
</p>
<p>
I&#x2019;ve also <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">argued</a> that AI systems along these lines could defeat all of humanity combined, if (for whatever reason) they were aimed toward that goal.
</p>
<p>
For more, see the <a href="https://www.cold-takes.com/most-important-century/">most important century</a> landing page. The series is available in many formats, including audio; I also provide a summary, and links to podcasts where I discuss it at a high level.</p></div></details>


    <details id="Box2"><summary>(Click to expand) How could AI systems defeat humanity?</summary>
<div><p>
A <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">previous piece</a> argues that AI systems could defeat all of humanity combined, if (for whatever reason) they were aimed toward that goal.
</p>
<p>
By defeating humanity, I mean gaining control of the world so that AIs, not humans, determine what happens in it; this could involve killing humans or simply &#x201C;containing&#x201D; us in some way, such that we can&#x2019;t interfere with AIs&#x2019; aims.
</p>
<p>
One way this could happen would be via &#x201C;superintelligence&#x201D; It&#x2019;s imaginable that a single AI system (or set of systems working together) could:
</p>
<ul>

<li>Do its own research on how to build a better AI system, which culminates in something that has incredible other abilities.

</li><li>Hack into human-built software across the world.

</li><li>Manipulate human psychology.

</li><li>Quickly generate vast wealth under the control of itself or any human allies.

</li><li>Come up with better plans than humans could imagine, and ensure that it doesn&apos;t try any takeover attempt that humans might be able to detect and stop.

</li><li>Develop advanced weaponry that can be built quickly and cheaply, yet is powerful enough to overpower human militaries.

</li>
</ul>
<p>
But even if &#x201C;superintelligence&#x201D; never comes into play - even if any given AI system is <i>at best</i> equally capable to a highly capable human - AI could collectively defeat humanity. The piece explains how.
</p>
<p>
The basic idea is that humans are likely to deploy AI systems throughout the economy, such that they have large numbers and access to many resources - and the ability to make copies of themselves. From this starting point, AI systems with human-like (or greater) capabilities would have a number of possible ways of getting to the point where their total population could outnumber and/or out-resource humans.
</p>
<p>
More: <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">AI could defeat all of us combined</a></p></div></details>

<p>
<strong>Misalignment risk: AI could end up with dangerous <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">aims</a> of its own. </strong>
</p>
<ul>

<li>If this sort of AI is developed using the kinds of <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#Box3">trial-and-error-based</a> techniques that are common today, I think it&#x2019;s likely that it will end up &#x201C;aiming&#x201D; for particular states of the world, much like a chess-playing AI &#x201C;aims&#x201D; for a checkmate position - making choices, calculations and plans to get particular types of outcomes, even when doing so requires deceiving humans. 

</li><li>I think it will be difficult - by default - to ensure that AI systems are aiming for <em>what we (humans) want them to aim for</em>, as opposed to gaining power for ends of their own.

</li><li>If AIs have ambitious aims of their own - and are numerous and/or capable enough to overpower humans - I think we have a serious risk that AIs will take control of the world and disempower humans entirely.
</li>
</ul>
<details id="Box3"><summary>(Click to expand) Why would AI &quot;aim&quot; to defeat humanity?</summary>
<div>
<p>
A <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">previous piece</a> argued that if today&#x2019;s AI development methods lead directly to powerful enough AI systems, disaster is likely by default (in the absence of specific countermeasures). 
</p>
<p>
In brief:
</p>
<ul>
<li>Modern AI development is essentially based on &#x201C;training&#x201D; via trial-and-error. 
<p></p>
<p>
<li>If we move forward incautiously and ambitiously with such training, and if it gets us all the way to very powerful AI systems, then such systems will likely end up <em>aiming for certain states of the world</em> (analogously to how a chess-playing AI aims for checkmate).
</li></p>
<p>
<li>And these states will be<em> other than the ones we intended</em>, because our trial-and-error training methods won&#x2019;t be accurate. For example, when we&#x2019;re confused or misinformed about some question, we&#x2019;ll reward AI systems for giving the wrong answer to it - unintentionally training deceptive behavior.
</li></p>
<p>
<li>We should expect disaster if we have AI systems that are both (a) <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">powerful enough</a> to defeat humans and (b) aiming for states of the world that we didn&#x2019;t intend. (&#x201C;Defeat&#x201D; means taking control of the world and doing what&#x2019;s necessary to keep us out of the way; it&#x2019;s unclear to me whether we&#x2019;d be literally killed or just forcibly stopped<sup id="fnref1"><a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5#fn1" rel="footnote">1</a></sup> from changing the world in ways that contradict AI systems&#x2019; aims.)</li></p></li></ul>
<p>More: <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">Why would AI &quot;aim&quot; to defeat humanity?</a></p></div>

</details>
<p>
<strong>Competitive pressures, and ambiguous evidence about the risks, could make this situation very dangerous. </strong>In a <a href="https://www.cold-takes.com/how-we-could-stumble-into-ai-catastrophe/">previous piece</a>, I lay out a hypothetical story about how the world could stumble into catastrophe. In this story:
</p>
<ul>

<li>There are warning signs about the risks of misaligned AI - but there&#x2019;s a lot of ambiguity about just how big the risk is.

</li><li>Everyone is furiously racing to be first to deploy powerful AI systems. 

</li><li>We end up with a big risk of deploying dangerous AI systems throughout the economy - which means a risk of AIs disempowering humans entirely. 

</li><li>And even if we navigate <em>that </em>risk - even if AI behaves as intended - this could be a disaster if the most powerful AI systems end up concentrated in the wrong hands (something I <a href="https://www.cold-takes.com/transformative-ai-issues-not-just-misalignment-an-overview/#power-imbalances">think is reasonably likely</a> due to the potential for power imbalances). There are <a href="https://www.cold-takes.com/transformative-ai-issues-not-just-misalignment-an-overview/">other risks</a> as well.
</li>
</ul>
<details id="Box4"><summary>(Click to expand) Why AI safety could be hard to measure</summary>
<div>

<p>
In previous pieces, I argued that:
</p>
<ul>

<li>If we develop powerful AIs via ambitious use of the &#x201C;black-box trial-and-error&#x201D; common in AI development today, then there&#x2019;s a substantial risk that: 
<ul>
 
<li>These AIs will develop <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">unintended aims</a> (states of the world they make calculations and plans toward, as a chess-playing AI &quot;aims&quot; for checkmate);
 
</li><li>These AIs could deceive, manipulate, and even <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">take over the world from humans entirely</a> as needed to achieve those aims.

</li><li>People today are doing AI safety research to prevent this outcome, but such research has a <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">number of deep difficulties:</a>
</li>
</ul>
<p>
<table style="border-collapse: collapse;">
  <tr>
   <td colspan="3" style="border: 1px solid;"><strong>&#x201C;Great news - I&#x2019;ve tested this AI and it looks safe.&#x201D; </strong>Why might we still have a problem?
   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;"><em>Problem</em>
   </td>
   <td style="border: 1px solid;"><em>Key question</em>
   </td>
   <td style="border: 1px solid;"><em>Explanation</em>
   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;">The <strong>Lance Armstrong problem</strong>
   </td>
   <td style="border: 1px solid;">Did we get the AI to be <strong><span style="color:var(--green-color);">actually safe</span></strong> or <strong><span style="color:var(--red-color);">good at hiding its dangerous actions</span>?</strong>
   </td>
  <td style="border: 1px solid;"><p>When dealing with an intelligent agent, it&#x2019;s hard to tell the difference between &#x201C;behaving well&#x201D; and &#x201C;<em>appearing</em> to behave well.&#x201D;</p>
<p>
When professional cycling was cracking down on performance-enhancing drugs, Lance Armstrong was very successful and seemed to be unusually &#x201C;clean.&#x201D; It later came out that he had been using drugs with an unusually sophisticated operation for concealing them.
   </p></td>
  </tr>
  <tr>
   <td style="border: 1px solid;">The <strong>King Lear problem</strong>
   </td>
   <td style="border: 1px solid;"><p>The AI is <strong><span style="color:var(--green-color);">(actually) well-behaved when humans are in control. </span></strong>Will this transfer to <strong><span style="color:var(--red-color);">when AIs are in control</span>?</strong></p>
   </td>
   <td style="border: 1px solid;"><p>It&apos;s hard to know how someone will behave when they have power over you, based only on observing how they behave when they don&apos;t. </p>
<p>
AIs might behave as intended as long as humans are in control - but at some future point, AI systems might be capable and widespread enough to have opportunities to <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">take control of the world entirely</a>. It&apos;s hard to know whether they&apos;ll take these opportunities, and we can&apos;t exactly run a clean test of the situation. 
</p><p>
Like King Lear trying to decide how much power to give each of his daughters before abdicating the throne.
   </p></td>
  </tr>
  <tr>
   <td style="border: 1px solid;">The <strong>lab mice problem</strong>
   </td>
      <td style="border: 1px solid;"><strong><span style="color:var(--green-color);">Today&apos;s &quot;subhuman&quot; AIs are safe.</span></strong>What about <strong><span style="color:var(--red-color);">future AIs with more human-like abilities</span>?</strong>
   </td>
   <td style="border: 1px solid;"><p>Today&apos;s AI systems aren&apos;t advanced enough to exhibit the basic behaviors we want to study, such as deceiving and manipulating humans.</p> 
<p>
Like trying to study medicine in humans by experimenting only on lab mice.
   </p></td>
  </tr>
  <tr>
   <td style="border: 1px solid;">The <strong>first contact problem</strong>
   </td>
   <td style="border: 1px solid;"><p>Imagine that <strong><span style="color:var(--green-color);">tomorrow&apos;s &quot;human-like&quot; AIs are safe.</span></strong> How will things go <strong><span style="color:var(--red-color);">when AIs have capabilities far beyond humans&apos;</span>?</strong></p>
   </td>
   <td style="border: 1px solid;"><p>AI systems might (collectively) become vastly more capable than humans, and it&apos;s ... just really hard to have any idea what that&apos;s going to be like. As far as we know, there has never before been anything in the galaxy that&apos;s vastly more capable than humans in the relevant ways! No matter what we come up with to solve the first three problems, we can&apos;t be too confident that it&apos;ll keep working if AI advances (or just proliferates) a lot more. </p>
<p>
Like trying to plan for first contact with extraterrestrials (this barely feels like an analogy).
   </p></td>
  </tr>
</table>
    </p></li></ul></div></details>

<details id="Box5"><summary>(Click to expand) Power imbalances, and other risks beyond misaligned AI</summary>
<div>
<p>
I&#x2019;ve argued that AI could cause a <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/#explosive-scientific-and-technological-advancement">dramatic acceleration in the pace of scientific and technological advancement</a>. 
</p>

<p>
One way of thinking about this: perhaps (for reasons I&#x2019;ve <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/#explosive-scientific-and-technological-advancement">argued previously</a>) AI could enable the equivalent of hundreds of years of scientific and technological advancement in a matter of a few months (or faster). If so, then developing powerful AI a few months before others could lead to having technology that is (effectively) hundreds of years ahead of others&#x2019;.
</p>
<p>
Because of this, it&#x2019;s easy to imagine that AI could lead to big power imbalances, as whatever country/countries/coalitions &#x201C;lead the way&#x201D; on AI development could become far more powerful than others (perhaps analogously to when a few smallish European states took over much of the rest of the world).
</p>

<p>
I think things could go very badly if the wrong country/countries/coalitions lead the way on transformative AI. At the same time, I&#x2019;ve expressed concern that people might overfocus on this aspect of things vs. other issues, for a number of reasons including:
</p>
<ul>

<li><em>I think people naturally get more animated about &quot;helping the good guys beat the bad guys&quot; than about &quot;helping all of us avoid getting a universally bad outcome, for impersonal reasons such as &apos;we designed sloppy AI systems&apos; or &apos;we created a dynamic in which haste and aggression are rewarded.&apos;&quot;</em>

</li><li><em>I expect people will tend to be overconfident about which countries, organizations or people they see as the &quot;good guys.&quot;</em>
</li>
</ul>
<p>
(More <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/#why-i-fear-">here</a>.)
</p>
<p>
There are also dangers of powerful AI being too widespread, rather than too concentrated. In <a href="https://nickbostrom.com/papers/vulnerable.pdf">The Vulnerable World Hypothesis</a>, Nick Bostrom contemplates potential future dynamics such as &#x201C;advances in DIY biohacking tools might make it easy for anybody with basic training in biology to kill millions.&#x201D; In addition to avoiding worlds where AI capabilities end up concentrated in the hands of a few, it could also be important to avoid worlds in which they diffuse too widely, too quickly, before we&#x2019;re able to assess the risks of widespread access to technology far beyond today&#x2019;s.
</p>
<p>I discuss these and a number of other AI risks in a previous piece: <a href="https://www.cold-takes.com/transformative-ai-issues-not-just-misalignment-an-overview/">Transformative AI issues (not just misalignment): an overview</a></p></div>
</details>

<p>
<strong>I&#x2019;ve laid out several ways to reduce the risks (color-coded since I&#x2019;ll be referring to them throughout the piece):</strong>
</p>
<p>
<strong><span style="font-weight: bold; color:green">Alignment research</span>.<em> </em></strong>Researchers are working on ways to design AI systems that are <em>both</em> (a) &#x201C;aligned&#x201D; in the sense that they don&#x2019;t have <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">unintended aims of their own</a>; (b) very powerful, to the point where they can be competitive with the best systems out there. 
</p>
<ul>

<li>I&#x2019;ve laid out three <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/">high-level hopes</a> for how - using techniques that are known today - we might be able to develop AI systems that are both aligned and powerful. 

</li><li>These techniques wouldn&#x2019;t necessarily work indefinitely, but they might work long enough so that we can <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#defensive-deployment">use early safe AI systems to make the situation much safer</a> (by automating huge amounts of further alignment research, by helping to demonstrate risks and make the case for greater caution worldwide, etc.)

</li><li>(A footnote explains how I&#x2019;m using &#x201C;aligned&#x201D; vs. &#x201C;safe.&#x201D;<sup id="fnref1"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#fn1" rel="footnote">1</a></sup>)</li></ul>

<details id="Box6"><summary>(Click to expand) High-level hopes for AI alignment</summary>
<div>
<p>
A <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/">previous piece</a> goes through what I see as three key possibilities for building powerful-but-safe AI systems.
</p>
<p>
It frames these using Ajeya Cotra&#x2019;s <a href="https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/#analogy-the-young-ceo">young businessperson</a> analogy for the core difficulties. In a nutshell, once AI systems get capable enough, it could be hard to test whether they&#x2019;re safe, because they might be able to deceive and manipulate us into getting the wrong read. Thus, trying to determine whether they&#x2019;re safe might be something like &#x201C;being an eight-year-old trying to decide between adult job candidates (some of whom are manipulative).&#x201D;
</p>
<p>Key possibilities for navigating this challenge:</p>
<ul>

<li><strong>Digital neuroscience</strong>: perhaps we&#x2019;ll be able to read (and/or even rewrite) the &#x201C;digital brains&#x201D; of AI systems, so that we can know (and change) what they&#x2019;re &#x201C;aiming&#x201D; to do directly - rather than having to infer it from their behavior. (Perhaps the eight-year-old is a mind-reader, or even a young <a href="https://en.wikipedia.org/wiki/Professor_X#Powers_and_abilities">Professor X</a>.)

</li><li><strong>Limited AI</strong>: perhaps we can make AI systems safe by making them <em>limited</em> in various ways - e.g., by leaving certain kinds of information out of their training, designing them to be &#x201C;myopic&#x201D; (focused on short-run as opposed to long-run goals), or something along those lines. Maybe we can make &#x201C;limited AI&#x201D; that is nonetheless able to carry out particular helpful tasks - such as doing lots more research on how to achieve safety without the limitations. (Perhaps the eight-year-old can limit the authority or knowledge of their hire, and still get the company run successfully.)

</li><li><strong>AI checks and balances</strong>: perhaps we&#x2019;ll be able to employ some AI systems to critique, supervise, and even rewrite others. Even if no single AI system would be safe on its own, the right &#x201C;checks and balances&#x201D; setup could ensure that human interests win out. (Perhaps the eight-year-old is able to get the job candidates to evaluate and critique each other, such that all the eight-year-old needs to do is verify basic factual claims to know who the best candidate is.)
</li>
</ul>
<p>
These are some of the main categories of hopes that are pretty easy to picture today. Further work on AI safety research might result in further ideas (and the above are not exhaustive - see my <a href="https://www.alignmentforum.org/posts/rCJQAkPTEypGjSJ8X/how-might-we-align-transformative-ai-if-it-s-developed-very">more detailed piece</a>, posted to the Alignment Forum rather than Cold Takes, for more).
    </p></div>
</details>

<p>
<strong><span style="font-weight: bold; color:orange">Standards and monitoring.</span></strong>I see some hope for developing <strong>standards that all potentially dangerous AI projects </strong>(whether companies, government projects, etc.) <strong>need to meet, and enforcing these standards globally. </strong>
</p>
<ul>

<li>Such standards could require strong demonstrations of safety, strong security practices, designing AI systems to be difficult to use for overly dangerous activity, etc. 

</li><li>We don&apos;t need a perfect system or international agreement to get a lot of benefit out of such a setup. The goal isn&#x2019;t just to buy time &#x2013; it&#x2019;s to change incentives, such that AI projects need to make progress on improving security, alignment, etc. in order to be profitable.
</li>
</ul>
<details id="Box7"><summary>(Click to expand) How standards might be established and become national or international</summary>
<div>
<p>
I <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#global-monitoring">previously</a> laid out a possible vision on this front, which I&#x2019;ll give a slightly modified version of here:
</p>
<ul>

<li>Today&#x2019;s leading AI companies could self-regulate by committing not to build or deploy a system that they can&#x2019;t convincingly demonstrate is safe (e.g., see Google&#x2019;s <a href="https://www.theweek.in/news/sci-tech/2018/06/08/google-wont-deploy-ai-to-build-military-weapons-ichai.html">2018 statement</a>, &quot;We will not design or deploy AI in weapons or other technologies whose principal purpose or implementation is to cause or directly facilitate injury to people&#x201D;).  
<ul>
 
<li>Even if some people at the companies would like to deploy unsafe systems, it could be hard to pull this off once the company has committed not to. 
 
</li><li>Even if there&#x2019;s a lot of room for judgment in what it means to demonstrate an AI system is safe, having agreed in advance that <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">certain evidence</a> is <em>not</em> good enough could go a long way.
</li> 
</ul>

</li><li>As more AI companies are started, they could feel soft pressure to do similar self-regulation, and refusing to do so is off-putting to potential employees, investors, etc.

</li><li>Eventually, similar principles could be incorporated into various government regulations and enforceable treaties.

</li><li>Governments could monitor for dangerous projects using regulation and even overseas operations. E.g., today the US monitors (without permission) for various signs that other states might be developing nuclear weapons, and might try to stop such development with methods ranging from threats of sanctions to <a href="https://en.wikipedia.org/wiki/Stuxnet">cyberwarfare</a> or even military attacks. It could do something similar for any AI development projects that are using huge amounts of compute and haven&#x2019;t volunteered information about whether they&#x2019;re meeting standards.
</li>
    </ul></div>
</details>

<p><strong><span style="font-weight: bold; color:purple">Successful, careful AI projects. </span></strong>I think an AI company (or other project) can enormously improve the situation, if it can both (a) be one of the leaders in developing powerful AI; (b) prioritize doing (and using powerful AI for) <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#defensive-deployment">things that reduce risks</a>, such as doing alignment research. (But don&#x2019;t read this as ignoring the fact that AI companies <a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#other-roles-at-ai-companies">can do harm</a> as well!)
</p>
<details id="Box8"><summary>(Click to expand) How a careful AI project could be helpful</summary>
<div>
    <p>
In addition to using advanced AI to do AI safety research (noted above), an AI project could:
</p>
<ul>

<li>Put huge effort into designing <em>tests </em>for signs of danger, and - if it sees danger signs in its own systems - warning the world as a whole.

</li><li>Offer deals to other AI companies/projects. E.g., acquiring them or exchanging a share of its profits for enough visibility and control to ensure that they don&#x2019;t deploy dangerous AI systems.

</li><li>Use its credibility as the leading company to lobby the government for helpful measures (such as enforcement of a <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#global-monitoring">monitoring-and-standards regime</a>), and to more generally highlight key issues and advocate for sensible actions.

</li><li>Try to ensure (via design, marketing, customer choice, etc.) that its AI systems are not used for dangerous ends, and <em>are</em> used on applications that make the world safer and better off. This could include <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#global-monitoring">defensive deployment</a> to reduce risks from other AIs; it could include using advanced AI systems to help it gain clarity on how to get a good outcome for humanity; etc.
</li>
</ul>
<p>
An AI project with a dominant market position could likely make a huge difference via things like the above (and probably via many routes I haven&#x2019;t thought of). And even an AI project that is merely <em>one of several leaders</em> could have enough resources and credibility to have a lot of similar impacts - especially if it&#x2019;s able to &#x201C;lead by example&#x201D; and persuade other AI projects (or make deals with them) to similarly prioritize actions like the above.
</p>
<p>
A challenge here is that I&#x2019;m envisioning a project with two arguably contradictory properties: being <em>careful</em> (e.g., prioritizing actions like the above over just trying to maintain its position as a profitable/cutting-edge project) and <em>successful</em> (being a profitable/cutting-edge project). In practice, it could be very hard for an AI project to walk the tightrope of being aggressive enough to be a &#x201C;leading&#x201D; project (in the sense of having lots of resources, credibility, etc.), while also prioritizing actions like the above (which mostly, with some exceptions, seem pretty different from what an AI project would do if it were simply focused on its technological lead and profitability).
    </p></div>
</details>


<p>
<strong><span style="font-weight: bold; color:red">Strong security.</span> </strong>A key threat is that someone could steal major components of an AI system and deploy it incautiously. It could be extremely hard for an AI project to be robustly safe against having its AI &#x201C;stolen.&#x201D; But this could change, if there&#x2019;s enough effort to work out the problem of how to secure a large-scale, powerful AI system.
</p>
<details id="Box9"><summary>(Click to expand) The challenging of securing dangerous AI</summary>
<div>
<p>In <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/">Racing Through a Minefield</a>, I described a &quot;race&quot; between cautious actors (those who take <a href="underline">misalignment risk</a> seriously) and incautious actors (those who are focused on deploying AI for their own gain, and aren&apos;t thinking much about the dangers to the whole world). Ideally, cautious actors would collectively have more powerful AI systems than incautious actors, so they could take their time doing <a href="underline">alignment research</a> and <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#defensive-deployment">other things</a> to try to make the situation safer for everyone. </p>

<p>But if incautious actors can steal an AI from cautious actors and rush forward to deploy it for their own gain, then the situation looks a lot bleaker. And unfortunately, it could be hard to protect against this outcome.</p>

<p>It&apos;s generally <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/#fn15">extremely difficult</a> to protect data and code against a well-resourced cyberwarfare/espionage effort. An AI&#x2019;s &#x201C;weights&#x201D; (you can think of this sort of like its source code, though <a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#fn4">not exactly</a>) are potentially very dangerous on their own, and hard to get extreme security for. Achieving enough cybersecurity could require measures, and preparations, well beyond what one would normally aim for in a commercial context.</p></div>
</details>


<h2 id="jobs-that-can-help">Jobs that can help</h2>


<p>
In this long section, I&#x2019;ll list a number of jobs I wish more people were pursuing.
</p>
<p>
Unfortunately, I can&#x2019;t give individualized help exploring one or more of these career tracks. Starting points could include <a href="https://80000hours.org/">80,000 Hours</a> and various <a href="https://www.aisafetysupport.org/resources/lots-of-links">other resources</a>.
</p>
<p id="research-and-engineering">
<strong>Research and engineering careers. </strong>You can contribute to <span style="font-weight: bold; color:green">alignment research</span> as a researcher and/or software engineer (the line between the two can be fuzzy in some contexts). 
</p>
<p>
There are (not necessarily easy-to-get) jobs along these lines at major AI labs, in established academic labs, and at independent nonprofits (examples in footnote).<sup id="fnref2"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#fn2" rel="footnote">2</a></sup>
</p>
<p>
Different institutions will have very different approaches to research, very different environments and philosophies, etc. so it&#x2019;s hard to generalize about what might make someone a fit. A few high-level points:
</p>
<ul>

<li>It takes a lot of talent to get these jobs, but you shouldn&#x2019;t assume that it takes years of experience in a particular field (or a particular degree).  
<ul>
 
<li>I&#x2019;ve seen a number of people switch over from other fields (such as physics) and become successful extremely quickly. 
 
</li><li>In addition to on-the-job training, there are independent programs specifically aimed at helping people skill up quickly.<sup id="fnref3"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#fn3" rel="footnote">3</a></sup>
</li> 
</ul>

</li><li>You also shouldn&#x2019;t assume that these jobs are only for &#x201C;scientist&#x201D; types - there&#x2019;s a substantial need for engineers, which I expect to grow.

</li><li>I think most people working on alignment consider a lot of <em>other</em> people&#x2019;s work to be useless at best. This seems important to know going in for a few reasons. 
<ul>
 
<li>You shouldn&#x2019;t assume that all work is useless just because the first examples you see seem that way.
 
</li><li>It&#x2019;s good to be aware that whatever you end up doing, someone will probably dunk on your work on the Internet. 
 
</li><li>At the same time, you shouldn&#x2019;t assume that your work is helpful because it&#x2019;s &#x201C;safety research.&#x201D; It&apos;s worth investing a lot in understanding how any particular research you&apos;re doing could be helpful (and how it could fail).   
<ul>
  
<li>I&#x2019;d even suggest taking regular dedicated time (a day every few months?) to pause working on the day-to-day and think about how your work fits into the big picture.
</li>  
</ul>
 
</li><li>For a sense of what work <strong>I</strong> think is most likely to be useful, I&#x2019;d suggest my piece on why <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">AI safety seems hard to measure</a> - I&#x2019;m most excited about work that directly tackles the challenges outlined in that piece, and I&#x2019;m pretty skeptical of work that only looks good with those challenges assumed away. (Also see my piece on <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/">broad categories of research I think have a chance to be highly useful</a>, and some <a href="https://docs.google.com/document/d/1vE8CrN2ap8lFm1IjNacVV2OJhSehrGi-VL6jITTs9Rg/edit#heading=h.go4iucw4wv9k">comments from a while ago</a> that I still mostly endorse.) 
</li> 
</ul>
</li> 
</ul>
<p id="other-technical-research">
I also want to call out a couple of categories of research that are getting some attention today, but seem at least a bit under-invested in, even relative to alignment research:
</p>
<ul>

<li><em>Threat assessment research.<strong> </strong></em>To me, there&#x2019;s an important distinction between &#x201C;Making AI systems safer&#x201D; and &#x201C;Finding out how dangerous they might end up being.&#x201D; (Today, these tend to get lumped together under &#x201C;alignment research.&#x201D;) 
<ul>
 
<li>A key approach to medical research is using <em>model organisms</em> - for example, giving cancer to mice, so we can see whether we&#x2019;re able to cure them. 
 
</li><li>Analogously, one might deliberately (though carefully!<sup id="fnref4"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#fn4" rel="footnote">4</a></sup>) design an AI system to <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">deceive and manipulate humans</a>, so we can (a) get a more precise sense of what kinds of training dynamics lead to deception and manipulation; (b) see whether existing safety techniques are effective countermeasures.
 
</li><li>If we had concrete demonstrations of AI systems becoming deceptive/manipulative/power-seeking, we could potentially build more consensus for caution (e.g., <span style="font-weight: bold; color:orange">standards and monitoring</span>). Or we could imaginably produce evidence that the threat is <em>low</em>.<sup id="fnref5"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#fn5" rel="footnote">5</a></sup>
 
</li><li>A couple of early examples of threat assessment research: <a href="https://twitter.com/EthanJPerez/status/1604886089403346944">here</a> and <a href="https://scholar.google.com/citations?view_op=view_citation&amp;hl=en&amp;user=odFQXSYAAAAJ&amp;sortby=pubdate&amp;citation_for_view=odFQXSYAAAAJ:MXK_kJrjxJIC">here</a>.
</li> 
</ul>

</li><li><em>Anti-misuse research. </em> 
<ul>
 
<li>I&#x2019;ve <a href="https://www.cold-takes.com/transformative-ai-issues-not-just-misalignment-an-overview/#power-imbalances">written about</a> how we could face catastrophe even from <em>aligned</em> AI. That is - even if AI does what its human operators want it to be doing, maybe some of its human operators want it to be helping them build bioweapons, spread propaganda, etc. 
 
</li><li>But maybe it&#x2019;s possible to <em>train AIs so that they&#x2019;re hard to use for purposes like this</em> - a separate challenge from training them to avoid deceiving and manipulating their human operators. 
 
</li><li>In practice, a lot of the work done on this today (<a href="https://twitter.com/PougetHadrien/status/1611008020644864001">example</a>) tends to get called &#x201C;safety&#x201D; and lumped in with alignment (and sometimes the same research helps with both goals), but again, I think it&#x2019;s a distinction worth making.
 
</li><li>I expect the earliest and easiest versions of this work to happen naturally as companies try to make their AI models fit for commercialization - but at some point it might be important to be making more intense, thorough attempts to prevent even very rare (but catastrophic) misuse.
</li> 
</ul>
</li> 
</ul>
<p id="information-security">
<strong><span style="font-weight: bold; color:red">Information security careers.</span></strong> There&#x2019;s a big risk that a powerful AI system could be &#x201C;stolen&#x201D; via hacking/espionage, and this could make just about every kind of risk worse. I think it could be very challenging - but possible - for AI projects to be secure against this threat. (More <a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#Box_underline">above.</a>)
</p>
<p>
<strong>I really think security is not getting enough attention from people concerned about AI risk, and I disagree with the idea that key security problems can be solved just by hiring from today&#x2019;s security industry.</strong>
</p>
<ul>

<li>From what I&#x2019;ve seen, AI companies have a lot of trouble finding good security hires. I think a lot of this is simply that security is <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/#fn15">challenging</a> and valuable, and demand for good hires (especially people who can balance security needs against practical needs) tends to swamp supply. 
<ul>
 
<li>And yes, this means good security people are well-paid!
</li> 
</ul>

</li><li>Additionally, AI could present unique security challenges in the future, because it requires protecting something that is simultaneously (a) fundamentally just software (not e.g. uranium), and hence very hard to protect; (b) potentially valuable enough that one could imagine very well-resourced state programs going all-out to steal it, with a breach having globally catastrophic consequences. I think trying to get out ahead of this challenge, by experimenting early on with approaches to it, could be very important.

</li><li><strong>It&#x2019;s plausible to me that security is as important as alignment right now, </strong>in terms of how much one more good person working on it will help.<strong> </strong>

</li><li>And security is an easier path, because one can get mentorship from a large community of security people working on things other than AI.<sup id="fnref6"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#fn6" rel="footnote">6</a></sup>

</li><li>I think there&#x2019;s a lot of potential value both in security <em>research</em> (e.g., developing new security techniques) and in simply working at major AI companies to help with their existing security needs.

</li><li>For more on this topic, see this <a href="https://80000hours.org/career-reviews/information-security/">recent 80,000 hours report</a> and <a href="https://forum.effectivealtruism.org/posts/ZJiCfwTy5dC4CoxqA/information-security-careers-for-gcr-reduction">this 2019 post by two of my coworkers</a>.</li></ul>
<p id="other-roles-at-ai-companies">
<strong>Other jobs at AI companies. </strong>AI companies hire for a lot of roles, many of which don&#x2019;t require any technical skills. 
</p>
<p>
It&#x2019;s a somewhat debatable/tricky path to take a role that isn&#x2019;t focused specifically on safety or security. Some people believe<sup id="fnref7"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#fn7" rel="footnote">7</a></sup> that you can do more harm than good this way, by helping companies push forward with building dangerous AI before the risks have gotten much attention or preparation - and I think this is a pretty reasonable take. 
</p>
<p>
At the same time:
</p>
<ul>

<li>You could argue something like: &#x201C;Company X has potential to be a <span style="font-weight: bold; color:purple">successful, careful AI project. </span>That is, it&#x2019;s likely to deploy powerful AI systems more carefully and helpfully than others would, and use them to reduce risks by automating alignment research and <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#defensive-deployment">other risk-reducing tasks</a>. Furthermore, Company X is most likely to make a number of other decisions wisely as things develop. So, it&#x2019;s worth accepting that Company X is speeding up AI progress, because of the hope that Company X can make things go better.&#x201D; This obviously depends on how you feel about Company X compared to others!

</li><li>Working at Company X could also present opportunities to <em>influence</em> Company X. If you&#x2019;re a valuable contributor and you are paying attention to the choices the company is making (and speaking up about them), you could affect the incentives of leadership.  
<ul>
 
<li>I think this can be a useful thing to do in combination with the other things on this list, but I generally wouldn&#x2019;t advise taking a job if this is one&#x2019;s <em>main </em>goal. 
</li> 
</ul>

</li><li>Working at an AI company presents opportunities to become generally more knowledgeable about AI, possibly enabling a later job change to something else.
</li>
</ul>
<details id="Box10"><summary>(Click to expand) How a careful AI project could be helpful</summary>
<div>
<p>
In addition to using advanced AI to do AI safety research (noted above), an AI project could:
</p>
<ul>

<li>Put huge effort into designing <em>tests </em>for signs of danger, and - if it sees danger signs in its own systems - warning the world as a whole.

</li><li>Offer deals to other AI companies/projects. E.g., acquiring them or exchanging a share of its profits for enough visibility and control to ensure that they don&#x2019;t deploy dangerous AI systems.

</li><li>Use its credibility as the leading company to lobby the government for helpful measures (such as enforcement of a <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#global-monitoring">monitoring-and-standards regime</a>), and to more generally highlight key issues and advocate for sensible actions.

</li><li>Try to ensure (via design, marketing, customer choice, etc.) that its AI systems are not used for dangerous ends, and <em>are</em> used on applications that make the world safer and better off. This could include <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#global-monitoring">defensive deployment</a> to reduce risks from other AIs; it could include using advanced AI systems to help it gain clarity on how to get a good outcome for humanity; etc.
</li>
</ul>
<p>
An AI project with a dominant market position could likely make a huge difference via things like the above (and probably via many routes I haven&#x2019;t thought of). And even an AI project that is merely <em>one of several leaders</em> could have enough resources and credibility to have a lot of similar impacts - especially if it&#x2019;s able to &#x201C;lead by example&#x201D; and persuade other AI projects (or make deals with them) to similarly prioritize actions like the above.
</p>
<p>
A challenge here is that I&#x2019;m envisioning a project with two arguably contradictory properties: being <em>careful</em> (e.g., prioritizing actions like the above over just trying to maintain its position as a profitable/cutting-edge project) and <em>successful</em> (being a profitable/cutting-edge project). In practice, it could be very hard for an AI project to walk the tightrope of being aggressive enough to be a &#x201C;leading&#x201D; project (in the sense of having lots of resources, credibility, etc.), while also prioritizing actions like the above (which mostly, with some exceptions, seem pretty different from what an AI project would do if it were simply focused on its technological lead and profitability).
    </p></div>
</details>
<p>
<a href="https://80000hours.org/">80,000 Hours</a> has a <a href="https://80000hours.org/articles/ai-capabilities/">collection of anonymous advice</a> on how to think about the pros and cons of working at an AI company.
</p>
<p>
In a future piece, I&#x2019;ll discuss what I think AI companies can be doing today to prepare for transformative AI risk. This could be helpful for getting a sense of what an unusually careful AI company looks like.
</p>
<p id="government-and-government-facing">
<strong>Jobs in government and at government-facing think tanks. </strong>I think there is a lot of value in providing quality advice to governments (especially the US government) on how to think about AI - both today&#x2019;s systems and potential future ones. 
</p>
<p>
I also think it could make sense to work on <em>other</em> technology issues in government, which could be a good path to working on AI later (I expect government attention to AI to grow over time). 
</p>
<p>
People interested in careers like these can check out <a href="https://www.openphilanthropy.org/open-philanthropy-technology-policy-fellowship/">Open Philanthropy&#x2019;s Technology Policy Fellowships</a> and RAND Corporation&apos;s <a href="https://www.rand.org/jobs/technology-security-policy-fellows.html">Technology and Security Policy Fellows</a>.
</p>
<p>
One related activity that seems especially valuable: <strong>understanding the state of AI in countries other than the one you&#x2019;re working for/in</strong> - particularly countries that (a) have a good chance of developing their own major AI projects down the line; (b) are difficult to understand much about by default. 
</p>
<ul>

<li>Having good information on such countries could be crucial for making good decisions, e.g. about moving cautiously vs. racing forward vs. trying to enforce safety standards internationally. 

</li><li>I think good work on this front has been done by the <a href="https://cset.georgetown.edu/">Center for Security and Emerging Technology</a><sup id="fnref8"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#fn8" rel="footnote">8</a></sup> among others. </li></ul>
<p>
A future piece will discuss other things I think governments can be doing today to prepare for transformative AI risk. I won&#x2019;t have a ton of tangible recommendations quite yet, but I expect there to be more over time, especially if and when <span style="font-weight: bold; color:orange">standards and monitoring</span> frameworks become better-developed.
</p>
<p id="politics">
<strong>Jobs in politics. </strong>The previous category focused on advising governments; this one is about working on political campaigns, doing polling analysis, etc. to generally improve the extent to which sane and reasonable people are in power. Obviously, it&#x2019;s a judgment call which politicians are the &#x201C;good&#x201D; ones and which are the &#x201C;bad&#x201D; ones, but I didn&#x2019;t want to leave out this category of work.
</p>
<p id="forecasting">
<strong>Forecasting. </strong>I&#x2019;m intrigued by organizations like <a href="https://www.metaculus.com/questions/?show-welcome=true">Metaculus</a>, <a href="https://www.hypermind.com/">HyperMind</a>, <a href="https://goodjudgment.com/">Good Judgment</a>,<sup id="fnref9"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#fn9" rel="footnote">9</a></sup> <a href="https://manifold.markets/">Manifold Markets</a>, and <a href="https://samotsvety.org/">Samotsvety</a> - all trying, in one way or another, to produce <strong>good probabilistic forecasts (using generalizable methods</strong><sup id="fnref10"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#fn10" rel="footnote">10</a></sup><strong>) about world events. </strong>
</p>
<p>
If we could get good forecasts about questions like &#x201C;When will AI systems be powerful enough to <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">defeat all of humanity?</a>&#x201D; and &#x201C;Will AI safety research in category X be successful?&#x201D;, this could be useful for helping people make good decisions. (These questions seem very hard to get good predictions on using these organizations&#x2019; methods, but I think it&#x2019;s an interesting goal.)
</p>
<p>
To explore this area, I&#x2019;d suggest learning about forecasting generally (<a href="https://smile.amazon.com/Superforecasting-Science-Prediction-Philip-Tetlock/dp/0804136718?sa-no-redirect=1">Superforecasting</a> is a good starting point) and building up your own prediction track record on sites such as the above.
</p>
<p id="meta-careers">
<strong>&#x201C;Meta&#x201D; careers. </strong>There are a number of jobs focused on helping <em>other people</em> learn about key issues, develop key skills and end up in helpful jobs (a bit more discussion <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/#communities">here</a>).
</p>
<p>
It can also make sense to take jobs that put one in a good position to donate to nonprofits doing important work, to <a href="https://www.cold-takes.com/spreading-messages-to-help-with-the-most-important-century/">spread helpful messages</a>, and to build skills that could be useful later (including in unexpected ways, as things develop), as I&#x2019;ll discuss <a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#other-things-you-can-do">below.</a>
</p>
<h3 id="low-guidance-jobs">Low-guidance jobs</h3>


<p>
This sub-section lists some projects that either don&#x2019;t exist (but seem like they ought to), or are in very embryonic stages. So it&#x2019;s unlikely you can get any significant mentorship working on these things. 
</p>
<p>
I think the potential impact of making one of these work is huge, but I think most people will have an easier time finding a fit with jobs from the previous section (which is why I listed those first). 
</p>
<p>
This section is largely to illustrate that I expect there to be more and more ways to be helpful as time goes on - and in case any readers feel excited and qualified to tackle these projects themselves, despite a lack of guidance and a distinct possibility that a project will make less sense in reality than it does on paper.
</p>
<p>
A big one in my mind is <strong>developing safety standards</strong> that could be used in a <span style="font-weight: bold; color:orange">standards and monitoring</span> regime. By this I mean answering questions like:
</p>
<ul>

<li>What observations could tell us that AI systems are getting dangerous to humanity (whether by pursuing <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">aims of their own</a> or by helping humans do dangerous things)? 
<ul>
 
<li>A starting-point question: why do we believe today&#x2019;s systems <em>aren&#x2019;t</em> dangerous? What, specifically, are they unable to do that they&#x2019;d have to do in order to be dangerous, and how will we know when that&#x2019;s changed?
</li> 
</ul>

</li><li>Once AI systems have potential for danger, how should they be restricted, and what conditions should AI companies meet (e.g., demonstrations of safety and security) in order to loosen restrictions?
</li>
</ul>
<p>
There is some early work going on along these lines, at both AI companies and nonprofits. If it goes well, I expect that there could be many jobs in the future, doing things like:
</p>
<ul>

<li>Continuing to refine and improve safety standards as AI systems get more advanced.

</li><li>Providing AI companies with &#x201C;audits&#x201D; - examinations of whether their systems meet standards, provided by parties outside the company to reduce conflicts of interest.

</li><li>Advocating for the importance of adherence to standards. This could include advocating for AI companies to abide by standards, and potentially for government policies to enforce standards.
</li>
</ul>
<p>
<strong>Other public goods for AI projects. </strong>I can see a number of other ways in which independent organizations could help AI projects exercise more caution / do more to reduce risks:
</p>
<ul>

<li id="SafetyCollaborations"><strong>Facilitating safety research collaborations. </strong>I worry that at some point, doing good <span style="font-weight: bold; color:green">alignment research</span> will only be possible with access to state-of-the-art AI models - but such models will be extraordinarily expensive and exclusively controlled by major AI companies.  
<ul>
 
<li>I hope AI companies will be able to partner with outside safety researchers (not just rely on their own employees) for alignment research, but this could get quite tricky due to concerns about intellectual property leaks. 
 
</li><li>A third-party organization could do a lot of the legwork of vetting safety researchers, helping them with their security practices, working out agreements with respect to intellectual property, etc. to make partnerships - and <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#selective-information-sharing">selective information sharing</a>, more broadly - more workable.
</li> 
</ul>

</li><li><strong>Education for key people at AI companies. </strong>An organization could help employees, investors, and board members of AI companies learn about the potential risks and challenges of advanced AI systems. I&#x2019;m <strong>especially excited about this for board members, </strong>because: 
<ul>
 
<li>I&#x2019;ve already seen a lot of interest from AI companies in forming strong ethics advisory boards, and/or putting well-qualified people on their governing boards (see footnote for the difference<sup id="fnref11"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#fn11" rel="footnote">11</a></sup>). I expect demand to go up.
 
</li><li>Right now, I don&#x2019;t think there are a lot of people who are both (a) prominent and &#x201C;fancy&#x201D; enough to be considered for such boards; (b) highly thoughtful about, and well-versed in, what I consider some of the most important risks of transformative AI (covered in this piece and the <a href="https://www.cold-takes.com/tag/implicationsofmostimportantcentury/">series</a> it&#x2019;s part of).
 
</li><li>An &#x201C;education for potential board members&#x201D; program could try to get people quickly up to speed on <a href="https://www.cold-takes.com/nonprofit-boards-are-weird-2/">good board member practices generally</a>, on risks of transformative AI, and on the basics of how modern AI works.
</li> 
</ul>

</li><li><strong>Helping share best practices across AI companies. </strong>A third-party organization might collect information about how different AI companies are handling information security, alignment research, processes for difficult decisions, governance, etc. and share it across companies, while taking care to preserve confidentiality. I&#x2019;m particularly interested in the possibility of developing and sharing innovative <a href="https://www.cold-takes.com/ideal-governance-for-companies-countries-and-more/">governance setups</a> for AI companies.</li></ul>
<p id="thinking">
<strong>Thinking and stuff. </strong>There&#x2019;s tons of potential work to do in the category of &#x201C;coming up with more issues we ought to be thinking about, more things people (and companies and governments) can do to be helpful, etc.&#x201D;
</p>
<ul>

<li>About a year ago, I published a <a href="https://forum.effectivealtruism.org/posts/zGiD94SHwQ9MwPyfW/important-actionable-research-questions-for-the-most#A_high_level_list_of_important__actionable_questions_for_the_most_important_century">list of research questions</a> that could be valuable and important to gain clarity on. I still mostly endorse this list (though I wouldn&#x2019;t write it just as is today).

</li><li>A slightly different angle: it could be valuable to have more people thinking about the question, &#x201C;What are some tangible policies governments could enact to be helpful?&#x201D; E.g., early steps towards <span style="font-weight: bold; color:orange">standards and monitoring</span>. This is distinct from advising governments directly (it&apos;s earlier-stage).
</li>
</ul>
<p>
Some AI companies have policy teams that do work along these lines. And a few Open Philanthropy employees work on topics along the lines of the first bullet point. However, I tend to think of this work as best done by people who need very little guidance (more at my discussion of <a href="https://www.cold-takes.com/the-wicked-problem-experience/">wicked problems</a>), so I&#x2019;m hesitant to recommend it as a mainline career option.
</p>
<h2 id="other-things-you-can-do">Things you can do if you&#x2019;re not ready for a full-time career change</h2>


<p>
Switching careers is a big step, so this section lists some ways you can be helpful regardless of your job - including preparing yourself for a later switch.
</p>
<p>
First and most importantly, you may have opportunities to <strong>spread key messages</strong> via social media, talking with friends and colleagues, etc. I think there&#x2019;s a lot of potential to make a difference here, and I wrote a <a href="https://www.cold-takes.com/spreading-messages-to-help-with-the-most-important-century/">previous post</a> on this specifically.
</p>

<p>
Second, you can <strong>explore potential careers </strong>like those I discuss <a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#jobs-that-can-help">above</a>. I&#x2019;d suggest generally checking out job postings, thinking about what sorts of jobs might be a fit for you down the line, meeting people who work in jobs like those and asking them about their day-to-day, etc.
</p>
<p>
Relatedly, you can<strong> try to keep your options open. </strong>
</p>
<ul>

<li>It&#x2019;s hard to predict what skills will be useful as AI advances further and new issues come up. 

</li><li>Being ready to switch careers when a big opportunity comes up could be <em>hugely</em> valuable - and hard. (Most people would have a lot of trouble doing this late in their career, no matter how important!) 

</li><li>Building up the financial, psychological and social ability to change jobs later on would (IMO) be well worth a lot of effort.
</li>
</ul>
<p>
Right now there aren&#x2019;t a lot of obvious places to <strong>donate</strong> (though you can donate to the <a href="https://funds.effectivealtruism.org/funds/far-future">Long-Term Future Fund</a><sup id="fnref12"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#fn12" rel="footnote">12</a></sup> if you feel so moved). 
</p>
<ul>

<li>I&#x2019;m guessing this will change in the future, for a number of reasons.<sup id="fnref13"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#fn13" rel="footnote">13</a></sup> 

</li><li>Something I&#x2019;d consider doing is setting some pool of money aside, perhaps invested such that it&#x2019;s particularly likely to grow a lot if and when AI systems become a lot more capable and impressive,<sup id="fnref14"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#fn14" rel="footnote">14</a></sup> in case giving opportunities come up in the future. 

</li><li>You can also, of course, donate to things today that others aren&#x2019;t funding for whatever reason.</li></ul>
<p id="learning">
<strong>Learning more </strong>about key issues could broaden your options. I think the <a href="https://www.cold-takes.com/tag/implicationsofmostimportantcentury/">full series</a> I&#x2019;ve written on key risks is a good start. To do more, you could:
</p>
<ul>

<li><a href="https://www.cold-takes.com/reading-books-vs-engaging-with-them/">Actively engage</a> with this series by <a href="https://www.cold-takes.com/learning-by-writing/">writing your own takes</a>, discussing with others, etc.

</li><li>Consider various online courses<sup id="fnref15"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#fn15" rel="footnote">15</a></sup> on relevant issues.

</li><li>I think it&#x2019;s also good to get as familiar with today&#x2019;s AI systems (and the research that goes into them) as you can.  
<ul>
 
<li>If you&#x2019;re happy to write code, you can check out coding-intensive guides and programs (examples in footnote).<sup id="fnref16"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#fn16" rel="footnote">16</a></sup>
 
</li><li>If you don&#x2019;t want to code but can read somewhat technical content, I&#x2019;d suggest getting oriented with some basic explainers on deep learning<sup id="fnref17"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#fn17" rel="footnote">17</a></sup> and then reading significant papers on AI and AI safety.<sup id="fnref18"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#fn18" rel="footnote">18</a></sup>
 
</li><li>Whether you&#x2019;re very technical or not at all, I think it&#x2019;s worth playing with public state-of-the-art AI models, as well as seeing highlights of what they can do via Twitter and such. </li></ul></li></ul>
<p>
Finally, if you happen to have opportunities to <strong>serve on governing boards or advisory boards</strong> for key organizations (e.g., AI companies), I think this is one of the best non-full-time ways to help. 
</p>
<ul>

<li>I don&#x2019;t expect this to apply to most people, but wanted to mention it in case any opportunities come up. 

</li><li>It&#x2019;s particularly important, if you get a role like this, to invest in educating yourself on key issues.
</li>
</ul>
<h2 id="some-general-advice">Some general advice</h2>


<p>
I think full-time work has huge potential to help, but also big potential to do harm, or to burn yourself out. So here are some general suggestions.
</p>
<p>
<strong>Think about your own views on the key risks of AI, and what it might look like for the world to deal with the risks. </strong>Most of the jobs I&#x2019;ve discussed aren&#x2019;t jobs where you can just take instructions and apply narrow skills. The <a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#recap">issues here</a> are tricky, and it takes judgment to navigate them well. 
</p>
<p>
Furthermore, no matter what you do, there will almost certainly be people who think your work is useless (if not harmful).<sup id="fnref19"><a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#fn19" rel="footnote">19</a></sup> This can be very demoralizing. I think it&#x2019;s easier if you&#x2019;ve thought things through and feel good about the choices you&#x2019;re making.
</p>
<p>
I&#x2019;d advise trying to learn as much as you can about the major risks of AI (see <a href="https://www.cold-takes.com/p/5fec3148-e34e-4bc2-a28b-8c95926142fa/#learning">above</a> for some guidance on this) - and/or trying to work for an organization whose leadership you have a good amount of confidence in.
</p>
<p>
<strong>Jog, don&#x2019;t sprint.  </strong>Skeptics of the &#x201C;most important century&#x201D; hypothesis will sometimes say things like &#x201C;If you really believe this, why are you working normal amounts of hours instead of extreme amounts? Why do you have hobbies (or children, etc.) at all?&#x201D; And I&#x2019;ve seen a number of people with an attitude like: &#x201C;THIS IS THE MOST IMPORTANT TIME IN HISTORY. I NEED TO WORK 24/7 AND FORGET ABOUT EVERYTHING ELSE. NO VACATIONS.&quot;
</p>
<p>
I think that&#x2019;s a very bad idea. 
</p>
<p>
Trying to reduce risks from advanced AI is, as of today, a frustrating and disorienting thing to be doing. It&#x2019;s very hard to tell whether you&#x2019;re being helpful (and as I&#x2019;ve mentioned, many will inevitably think you&#x2019;re being harmful). 
</p>
<p>
I think the difference between &#x201C;not mattering,&#x201D; &#x201C;doing some good&#x201D; and &#x201C;doing enormous good&#x201D; comes down to <strong>how you choose the job, how good at it you are, and how good your judgment is</strong> (including what risks you&#x2019;re most focused on and how you model them). Going &#x201C;all in&#x201D; on a particular objective seems bad on these fronts: it poses risks to open-mindedness, to mental health and to good decision-making (I am speaking from observations here, not just theory). 
</p>
<p>
That is, I think it&#x2019;s a <em>bad idea to try to be 100% emotionally bought into the full stakes of the most important century</em> - I think the stakes are just too high for that to make sense for any human being. 
</p>
<p>
Instead, I think the best way to handle &#x201C;the fate of humanity is at stake&#x201D; is probably to find a nice job and work about as hard as you&#x2019;d work at another job, rather than trying to make heroic efforts to work extra hard. (I criticized heroic efforts in general <a href="https://www.cold-takes.com/useful-vices-for-wicked-problems/#self-preservation">here</a>.) 
</p>
<p>
I think this basic formula (working in some job that is a good fit, while having some amount of balance in your life) is what&#x2019;s behind a lot of the most important positive events in history to date, and presents possibly historically large opportunities today.
</p>
<p>
<em>Special thanks to Alexander Berger, Jacob Eliosoff, Alexey Guzey, Anton Korinek and Luke Muelhauser for especially helpful comments on this post. A lot of other people commented helpfully as well. </em>
</p>
<!--kg-card-end: html--><!--kg-card-begin: html-->

<p><div style="display:flex; justify-content:center; margin: 0 auto;">

        <span style="margin: 10px;"><a href="https://api.addthis.com/oexchange/0.8/forward/twitter/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fjobs-that-can-help-with-the-most-important-century&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Jobs%20that%20can%20help%20with%20the%20most%20important%20century&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-twitter-square.png" border="0" alt="Twitter"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/facebook/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fjobs-that-can-help-with-the-most-important-century&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Jobs%20that%20can%20help%20with%20the%20most%20important%20century&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-facebook-square.png" border="0" alt="Facebook"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/reddit/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fjobs-that-can-help-with-the-most-important-century&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Jobs%20that%20can%20help%20with%20the%20most%20important%20century&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-reddit-square.png" border="0" alt="Reddit"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/menu/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fjobs-that-can-help-with-the-most-important-century&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Jobs%20that%20can%20help%20with%20the%20most%20important%20century&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-addthis-square.png" border="0" alt="More"></a></span>
        </div>
<center><p id="discuss"><!--<a href="https://www.cold-takes.com/jobs-that-can-help-with-the-most-important-century#subscribe" target="_blank"><button id="footer-subscribe" class="button">Subscribe</button></a>&nbsp;<a href="https://www.guidedtrack.com/programs/4kal2ue/run?posttitle=Jobs%20that%20can%20help%20with%20the%20most%20important%20century" target="_blank"><button class="button" id="Survey">Feedback</button></a>-->&#xA0;<a href="https://www.lesswrong.com/posts/slug/jobs-that-can-help-with-the-most-important-century#comments" target="_blank"><button class="button">Comment/discuss</button></a></p><p><!--<em>
Use "Feedback" if you have comments/suggestions you want me to see, or if you're up for giving some quick feedback about this post (which I greatly appreciate!) Use "Forum" if you want to discuss this post publicly on the Effective Altruism Forum.
</em>--></p></center>
<!--kg-card-end: html--><!--kg-card-begin: html-->
</p><h2 id="footnote">Footnotes</h2>
<div class="footnotes">
<hr>
<ol><li id="fn1">

<p>
     I use &#x201C;aligned&#x201D; to specifically mean that AIs behave as intended, rather than pursuing dangerous goals of their own. I use &#x201C;safe&#x201D; more broadly to mean that an AI system poses little risk of catastrophe for <em>any</em> reason in the context it&#x2019;s being used in. It&#x2019;s OK to mostly think of them as interchangeable in this post.&#xA0;<a href="#fnref1" rev="footnote">&#x21A9;</a><li id="fn2">
<p>
     AI labs with alignment teams: <a href="https://www.anthropic.com/">Anthropic</a>, <a href="https://www.deepmind.com/">DeepMind</a> and <a href="https://openai.com/">OpenAI</a>. Disclosure: my wife is co-founder and President of Anthropic, and used to work at OpenAI (and has shares in both companies); OpenAI is a former <a href="https://www.openphilanthropy.org/grants/openai-general-support/">Open Philanthropy grantee</a>.
</p><p>
    Academic labs: there are many of these; I&#x2019;ll highlight the <a href="https://jsteinhardt.stat.berkeley.edu/">Steinhardt lab at Berkeley</a> (Open Philanthropy grantee), whose recent research I&#x2019;ve found especially interesting.
</p><p>
    Independent nonprofits: examples would be <a href="https://alignment.org/">Alignment Research Center</a> and <a href="https://www.redwoodresearch.org/">Redwood Research</a> (both Open Philanthropy grantees, and I sit on the board of both).
</p><p>
    &#xA0;<a href="#fnref2" rev="footnote">&#x21A9;</a><li id="fn3">

<p>
     Examples: <a href="https://www.agisafetyfundamentals.com/">AGI Safety Fundamentals</a>, <a href="https://www.serimats.org/">SERI MATS</a>, <a href="https://forum.effectivealtruism.org/posts/vvocfhQ7bcBR4FLBx/apply-to-the-second-ml-for-alignment-bootcamp-mlab-2-in">MLAB</a> (all of which have been supported by <a href="https://openphilanthropy.org/">Open Philanthropy</a>)&#xA0;<a href="#fnref3" rev="footnote">&#x21A9;</a><li id="fn4">

<p>
     On one hand, deceptive and manipulative AIs could be dangerous. On the other, it might be better to get AIs <em>trying</em> to deceive us before they can consistently <em>succeed; </em>the worst of all worlds might be getting this behavior <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">by accident</a> with very powerful AIs.&#xA0;<a href="#fnref4" rev="footnote">&#x21A9;</a><li id="fn5">
<p>
     Though I think it&#x2019;s inherently harder to get evidence of low risk than evidence of high risk, since it&#x2019;s hard to rule out <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/#The-Lab-mice-problem">risks arising as AI systems get more capable</a>.&#xA0;<a href="#fnref5" rev="footnote">&#x21A9;</a><li id="fn6">

<p>
     Why do I simultaneously think &#x201C;This is a mature field with mentorship opportunities&#x201D; and &#x201C;This is a badly neglected career track for helping with the most important century&#x201D;?
</p><p>
    In a nutshell, <strong>most good security people are not working on AI</strong>. It looks to me like there are plenty of people who are generally knowledgeable and effective at good security, but there&#x2019;s also a <em>huge</em> amount of need for such people outside of AI specifically. 
</p><p>
    I expect this to change eventually if AI systems become extraordinarily capable. The issue is that it might be too late at that point - the security challenges in AI seem daunting (and somewhat AI-specific) to the point where it could be important for good people to start working on them many years before AI systems become extraordinarily powerful.&#xA0;<a href="#fnref6" rev="footnote">&#x21A9;</a><li id="fn7">
<p>
     <a href="https://www.lesswrong.com/posts/uFNgRumrDTpBfQGrs/let-s-think-about-slowing-down-ai">Here&#x2019;s Katja Grace</a> arguing along these lines.&#xA0;<a href="#fnref7" rev="footnote">&#x21A9;</a><li id="fn8">

<p>
     An Open Philanthropy grantee.&#xA0;<a href="#fnref8" rev="footnote">&#x21A9;</a><li id="fn9">
<p>
     Open Philanthropy has funded Metaculus and contracted with Good Judgment and HyperMind.&#xA0;<a href="#fnref9" rev="footnote">&#x21A9;</a><li id="fn10">
<p>
     That is, these groups are mostly trying things like &#x201C;Incentivize people to make good forecasts; track how good people are making forecasts; aggregate forecasts&#x201D; rather than &#x201C;Study the specific topic of AI and make forecasts that way&#x201D; (the latter is also useful, and I discuss it <a href="#thinking">below</a>).&#xA0;<a href="#fnref10" rev="footnote">&#x21A9;</a><li id="fn11">

<p>
     The governing board of an organization has the hard power to replace the CEO and/or make other decisions on behalf of the organization. An advisory board merely gives advice, but in practice I think this can be quite powerful, since I&#x2019;d expect many organizations to have a tough time doing bad-for-the-world things without backlash (from employees and the public) once an advisory board has recommended against them.&#xA0;<a href="#fnref11" rev="footnote">&#x21A9;</a><li id="fn12">
<p>
     <a href="https://www.openphilanthropy.org">Open Philanthropy</a>, which I&#x2019;m co-CEO of, has supported this fund, and its current Chair is an Open Philanthropy employee.&#xA0;<a href="#fnref12" rev="footnote">&#x21A9;</a><li id="fn13">

<p>
     I generally expect there to be more and more clarity about what actions would be helpful, and more and more people willing to work on them if they can get funded. A bit more specifically and speculatively, I expect AI safety research to get more expensive as it requires access to increasingly large, expensive AI models.&#xA0;<a href="#fnref13" rev="footnote">&#x21A9;</a><li id="fn14">
<p>
     Not investment advice! I would only do this with money you&#x2019;ve <em>set aside for donating</em> such that it wouldn&#x2019;t be a personal problem if you lost it all.&#xA0;<a href="#fnref14" rev="footnote">&#x21A9;</a><li id="fn15">

<p>
     Some options <a href="https://www.agisafetyfundamentals.com/">here</a>, <a href="https://www.effectivealtruism.org/virtual-programs">here</a>, <a href="https://forum.effectivealtruism.org/posts/XvWWfq9iqFj8x7Eu8/list-of-ai-safety-courses-and-resources">here</a>, <a href="https://aisafety.training/">here</a>. I&#x2019;ve made no attempt to be comprehensive - these are just some links that should make it easy to get rolling and see some of your options.&#xA0;<a href="#fnref15" rev="footnote">&#x21A9;</a><li id="fn16">

<p>
     <a href="https://spinningup.openai.com/en/latest/">Spinning Up in Deep RL</a>, <a href="https://forum.effectivealtruism.org/posts/vvocfhQ7bcBR4FLBx/apply-to-the-second-ml-for-alignment-bootcamp-mlab-2-in">ML for Alignment Bootcamp</a>, <a href="https://github.com/jacobhilton/deep_learning_curriculum">Deep Learning Curriculum</a>.&#xA0;<a href="#fnref16" rev="footnote">&#x21A9;</a><li id="fn17">
<p>
     For the basics, I like Michael Nielsen&#x2019;s <a href="http://neuralnetworksanddeeplearning.com/">guide to neural networks and deep learning</a>; <a href="https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi">3Blue1Brown</a> has a video explainer series that I haven&#x2019;t watched but that others have recommended highly. I&#x2019;d also suggest <a href="https://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a> (the transformer is the most important AI architecture as of today).
</p><p>
    For a broader overview of different architectures, see <a href="https://www.asimovinstitute.org/neural-network-zoo/">Neural Network Zoo</a>. 
</p><p>
    You can also check out various Coursera etc. courses on deep learning/neural networks.&#xA0;<a href="#fnref17" rev="footnote">&#x21A9;</a><li id="fn18">
<p>
     I feel like the easiest way to do this is to follow AI researchers and/or top labs on Twitter. You can also check out <a href="https://docs.google.com/spreadsheets/d/1PwWbWZ6FPqAgZWOoOcXM8N_tUCuxpEyMbN1NYYC02aM/edit#gid=0">Alignment Newsletter</a> or <a href="https://newsletter.mlsafety.org/archive">ML Safety Newsletter</a> for alignment-specific content.&#xA0;<a href="#fnref18" rev="footnote">&#x21A9;</a><li id="fn19">
<p>
     Why? 
</p><p>
    One reason is the tension between the <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/">&#x201C;caution&#x201D; and &#x201C;competition&#x201D; frames</a>: people who favor one frame tend to see the other as harmful.
</p><p>
    Another reason: there are a number of people who think we&#x2019;re more-or-less doomed without a radical conceptual breakthrough on how to build safe AI (they think the sorts of approaches I list <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/">here</a> are hopeless, for reasons I confess I don&#x2019;t understand very well). These folks will consider anything that isn&#x2019;t aimed at a radical breakthrough ~useless, and consider some of the jobs I list in this piece to be harmful, if they are speeding up AI development and leaving us with less time for a breakthrough. 
</p><p>
    At the same time, working toward the sort of breakthrough these folks are hoping for means doing pretty esoteric, theoretical research that many other researchers think is clearly useless. 
</p><p>
    And trying to make AI development slower and/or more cautious is harmful according to some people who are dismissive of risks, and think the priority is to push forward as fast as we can with technology that has the potential to improve lives.&#xA0;<a href="#fnref19" rev="footnote">&#x21A9;</a>

</p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></ol></div>


<!--kg-card-end: html-->]]></content:encoded></item><item><title><![CDATA[Spreading messages to help with the most important century]]></title><description><![CDATA[For people who want to help improve our prospects for navigating transformative AI, and have an audience.]]></description><link>https://www.cold-takes.com/spreading-messages-to-help-with-the-most-important-century/</link><guid isPermaLink="false">63ceea0b9a951a003d4e561a</guid><category><![CDATA[ImplicationsOfMostImportantCentury]]></category><dc:creator><![CDATA[Holden Karnofsky]]></dc:creator><pubDate>Wed, 25 Jan 2023 18:11:57 GMT</pubDate><media:content url="https://www.cold-takes.com/content/images/2023/01/megaphone-emoji-twitter-dimensions.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html--><img src="https://www.cold-takes.com/content/images/2023/01/megaphone-emoji-twitter-dimensions.png" alt="Spreading messages to help with the most important century"><p><figure><div id="buzzsprout-player-12114899"></div><script src="https://www.buzzsprout.com/1851795/12114899-spreading-messages-to-help-with-the-most-important-century.js?container_id=buzzsprout-player-12114899&amp;player=small" type="text/javascript" charset="utf-8"></script><figcaption><em>Click lower right to download or find on Apple Podcasts, Spotify, Stitcher, etc.</em></figcaption></figure></p>

<p>
In the <a href="https://www.cold-takes.com/most-important-century/">most important century </a>series, I argued that the 21st century could be the most important century ever for humanity, via the development of advanced AI systems that could dramatically speed up scientific and technological advancement, getting us more quickly than most people imagine to a deeply unfamiliar future.
</p>
<p>
In <a href="https://www.cold-takes.com/tag/implicationsofmostimportantcentury/">this more recent series</a>, I&#x2019;ve been trying to help answer this question: <strong>&#x201C;So what? What can I do to help?&#x201D; </strong>
</p>
<p>
So far, I&#x2019;ve just been trying to build a picture of some of the major risks we might face (especially the <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">risk of misaligned AI</a> that <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">could defeat all of humanity</a>), what might be <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">challenging about these risks</a>, and <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/">why we might succeed anyway</a>. Now I&#x2019;ve finally gotten to the part where I can start laying out tangible ideas for how to help (beyond the <a href="https://www.cold-takes.com/call-to-vigilance/">pretty lame suggestions</a> I gave before).
</p>
<p>
This piece is about one broad way to help: <strong>spreading messages </strong>that ought to be more widely understood.
</p>
<p>
One reason I think this topic is worth a whole piece is that <strong>practically everyone can help with spreading messages at least some, </strong>via things like talking to friends; writing explanations of your own that will appeal to particular people; and, yes, posting to Facebook and Twitter and all of that. Call it slacktivism if you want, but I&#x2019;d guess it can be a big deal: many extremely important AI-related ideas are understood by vanishingly small numbers of people, and a bit more awareness could snowball. Especially because these topics often feel too &#x201C;weird&#x201D; for people to feel comfortable talking about them! Engaging in credible, reasonable ways could contribute to an overall background sense that it&#x2019;s <em>OK to take these ideas seriously.</em>
</p>
<p>
And then there are a lot of potential readers who might have <em>special</em> opportunities to spread messages. Maybe they are professional communicators (journalists, bloggers, TV writers, novelists, TikTokers, etc.), maybe they&#x2019;re non-professionals who still have sizable audiences (e.g., on Twitter), maybe they have unusual personal and professional networks, etc. Overall, the more you feel you are good at communicating with some important audience (even a small one), the more this post is for you.
</p>
<p>
That said, <strong>I&#x2019;m not excited about blasting around hyper-simplified messages. </strong>As I hope this series has shown, the challenges that could lie ahead of us are complex and daunting, and shouting stuff like &#x201C;AI is the biggest deal ever!&#x201D; or &#x201C;AI development should be illegal!&#x201D; could do more harm than good (if only by associating important ideas with being annoying). Relatedly, I think it&#x2019;s generally <strong>not good enough to spread the most broad/relatable/easy-to-agree-to version of each key idea,</strong> like &#x201C;AI systems could harm society.&#x201D; Some of the unintuitive details are crucial. 
</p>
<p>
Instead, the <strong>gauntlet I&#x2019;m throwing is: &#x201C;find ways to help people understand the core parts of the challenges we might face, in as much detail as is feasible.&#x201D; </strong>That is: the goal is to try to help people get to the point where they could maintain a reasonable position in a detailed back-and-forth, not just to get them to repeat a few words or nod along to a high-level take like &#x201C;AI safety is important.&#x201D;<strong> </strong>This is a <strong>lot </strong>harder than shouting &#x201C;AI is the biggest deal ever!&#x201D;, but I think it&#x2019;s worth it, so I&#x2019;m encouraging people to rise to the challenge and stretch their communication skills.
</p>
<p>
Below, I will:
</p>
<ul>

<li>Outline some general challenges of this sort of message-spreading. 

</li><li>Go through some ideas I think it&#x2019;s risky to spread too far, at least in isolation.

</li><li>Go through some of the ideas I&#x2019;d be most excited to see spread.

</li><li>Talk a little bit about how to spread ideas - but this is mostly up to you.
</li>
</ul>
<h2 id="challenges-of-ai-related-messages">Challenges of AI-related messages</h2>


<p>
Here&#x2019;s a simplified story for how spreading messages could go badly. 
</p>
<ul>

<li>You&#x2019;re trying to convince your friend to care more about AI risk.

</li><li>You&#x2019;re planning to argue: (a) AI could be really powerful and important within our lifetimes; (b) Building AI too quickly/incautiously could be dangerous. 
<ul>
 
<li>Your friend just isn&#x2019;t going to <em>care</em> about (b) if they aren&#x2019;t sold on some version of (a). So you&#x2019;re starting with (a).
</li> 
</ul>

</li><li>Unfortunately, (a) is easier to understand than (b). So you end up convincing your friend of (a), and not (yet) (b).

</li><li>Your friend announces, &#x201C;Aha - I see that AI could be tremendously powerful and important! I need to make sure that people/countries I like are first to build it!&#x201D; and runs off to help build powerful AI as fast as possible. They&#x2019;ve chosen the <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/">competition frame (&#x201C;will the right or the wrong people build powerful AI first?&#x201D;) over the caution frame</a> (&#x201C;will we screw things up and all lose?&#x201D;), because the competition frame is <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/#why-i-fear-">easier to understand</a>.

</li><li>Why is this bad? <a href="https://www.cold-takes.com/tag/implicationsofmostimportantcentury/">See previous pieces</a> on the importance of caution.
</li>
</ul>
<details id="Box1"><summary>(Click to expand) More on the &#x201C;competition&#x201D; frame vs. the &#x201C;caution&#x201D; frame&#x201D;<!-- (Details not included in email - <a href="https://www.cold-takes.com/p/fbae8068-6543-4776-af3b-bedab1d7b74a#Box1">click to view on the web</a>)--></summary><div>
<p>
In a <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/">previous piece</a>, I talked about two contrasting frames for how to make the best of the most important century:
</p>
<p>
<strong>The caution frame.</strong> This frame emphasizes that a furious race to develop powerful AI could end up making <em>everyone</em> worse off. This could be via: (a) AI forming <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">dangerous goals of its own</a> and <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">defeating humanity entirely</a>; (b) humans racing to gain power and resources and &#x201C;<a href="https://www.cold-takes.com/how-digital-people-could-change-the-world/#lock-in">lock in</a>&#x201D; their values.
</p>
<p>
Ideally, everyone with the potential to build something <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/">powerful enough AI</a> would be able to pour energy into building something safe (not misaligned), and carefully planning out (and negotiating with others on) how to roll it out, without a rush or a race. With this in mind, perhaps we should be doing things like:
</p>
<ul>

<li>Working to improve trust and cooperation between major world powers. Perhaps via AI-centric versions of <a href="https://en.wikipedia.org/wiki/Pugwash_Conferences_on_Science_and_World_Affairs">Pugwash</a> (an international conference aimed at reducing the risk of military conflict), perhaps by pushing back against hawkish foreign relations moves.

</li><li>Discouraging governments and investors from shoveling money into AI research, encouraging AI labs to thoroughly consider the implications of their research before publishing it or scaling it up, working toward <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#global-monitoring">standards and monitoring</a>, etc. Slowing things down in this manner could buy more time to do research on avoiding <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/#worst-misaligned-ai">misaligned AI</a>, more time to build trust and cooperation mechanisms, and more time to generally gain strategic clarity 
</li>
</ul>
<p>
<strong>The &#x201C;competition&#x201D; frame. </strong>This frame focuses less on how the transition to a radically different future happens, and more on who&apos;s making the key decisions as it happens.
</p>
<ul>

<li>If something like <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/">PASTA </a>is developed primarily (or first) in country X, then the government of country X could be making a lot of crucial decisions about whether and how to regulate a potential explosion of new technologies.

</li><li>In addition, the people and organizations leading the way on AI and other technology advancement at that time could be especially influential in such decisions.
</li>
</ul>
<p>
This means it could matter enormously &quot;who leads the way on transformative AI&quot; - which country or countries, which people or organizations.
</p>
<p>
Some people feel that we can make confident statements today about which specific countries, and/or which people and organizations, we should hope lead the way on transformative AI. These people might advocate for actions like:
</p>
<ul>

<li>Increasing the odds that the first PASTA systems are built in countries that are e.g. less authoritarian, which could mean e.g. pushing for more investment and attention to AI development in these countries.

</li><li>Supporting and trying to speed up AI labs run by people who are likely to make wise decisions (about things like how to engage with governments, what AI systems to publish and deploy vs. keep secret, etc.)
</li>
</ul>
<p>
<strong>Tension between the two frames. </strong>People who take the &quot;caution&quot; frame and people who take the &quot;competition&quot; frame often favor very different, even contradictory actions. Actions that look important to people in one frame often look actively harmful to people in the other.
</p>
<p>
For example, people in the &quot;competition&quot; frame often favor moving forward as fast as possible on developing more powerful AI systems; for people in the &quot;caution&quot; frame, haste is one of the main things to avoid. People in the &quot;competition&quot; frame often favor adversarial foreign relations, while people in the &quot;caution&quot; frame often want foreign relations to be more cooperative.
</p>
<p>
That said, this dichotomy is a simplification. Many people - including myself - resonate with both frames. But I have a <strong>general fear that the &#x201C;competition&#x201D; frame is going to be overrated by default</strong> for a number of reasons, as I discuss <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/#why-i-fear-">here</a>.
</p>
</div></details>
<p>
Unfortunately, I&#x2019;ve seen something like the above story play out in <strong>multiple significant instances </strong>(though I shouldn&#x2019;t give specific examples). 
</p>
<p>
And I&#x2019;m especially worried about this dynamic when it comes to people in and around governments (especially in national security communities)<em>, </em>because I perceive governmental culture as particularly obsessed with <em>staying ahead of other countries</em> (&#x201C;If AI is dangerous, we&#x2019;ve gotta build it first&#x201D;) and comparatively uninterested in <em>things that are dangerous for our country because they&#x2019;re dangerous for the whole world at once</em> (&#x201C;Maybe we should worry a lot about pandemics?&#x201D;)<sup id="fnref1"><a href="https://www.cold-takes.com/p/fbae8068-6543-4776-af3b-bedab1d7b74a#fn1" rel="footnote">1</a></sup>
</p>
<p>
You could even <a href="https://twitter.com/michael_nielsen/status/1350544365198839808">argue</a> (although I wouldn&#x2019;t agree!<sup id="fnref2"><a href="https://www.cold-takes.com/p/fbae8068-6543-4776-af3b-bedab1d7b74a#fn2" rel="footnote">2</a></sup>) that to date, efforts to &#x201C;raise awareness&#x201D; about the dangers of AI have done more harm than good (via causing increased investment in AI, generally).
</p>
<p>
So it&#x2019;s tempting to simply give up on the whole endeavor - to stay away from message spreading entirely, beyond people you know well and/or are pretty sure will internalize the important details. But I think we can do better.
</p>
<p>
This post is aimed at people who are <strong>good at communicating</strong> with at least some audience. This could be because of their skills, or their relationships, or some combination. In general, I&#x2019;d expect to have more success with people who hear from you a lot (because they&#x2019;re your friend, or they follow you on Twitter or Substack, etc.) than with people you reach via some viral blast of memery - but maybe you&#x2019;re skilled enough to make the latter work too, which would be awesome. I&apos;m asking communicators to hit a high bar: leave people with strong understanding, rather than just getting them to repeat a few sentences about AI risk.
</p>
<h2 id="messages-that-seem-risky-to-spread-in-isolation">Messages that seem risky to spread in isolation</h2>


<p>
First, here are a couple of messages that I&#x2019;d rather people <em>didn&#x2019;t</em> spread (or at least have mixed feelings about spreading) in isolation, i.e., without serious efforts to include some of the other messages I cover <a href="https://www.cold-takes.com/p/fbae8068-6543-4776-af3b-bedab1d7b74a#messages-that-seem-important-and-helpful-and-right">below</a>.
</p>
<p>
One category is messages that generically emphasize the <em>importance</em> and <em>potential imminence</em> of powerful AI systems. The reason for this is in the previous section: many people seem to react to these ideas (especially when unaccompanied by some other key ones) with a &#x201C;We&#x2019;d better build powerful AI as fast as possible, before others do&#x201D; attitude. (If you&#x2019;re curious about why I wrote <a href="https://www.cold-takes.com/most-important-century/">The Most Important Century</a> anyway, see footnote for my thinking.<sup id="fnref3"><a href="https://www.cold-takes.com/p/fbae8068-6543-4776-af3b-bedab1d7b74a#fn3" rel="footnote">3</a></sup>)
</p>
<p>
Another category is messages that emphasize that AI could be risky/dangerous to the world, without much effort to fill in <em>how</em>, or with an emphasis on easy-to-understand risks. 
</p>
<ul>

<li>Since &#x201C;dangerous&#x201D; tends to imply &#x201C;powerful and important,&#x201D; I think there are similar risks to the previous section. 

</li><li>If people have a bad model of <em>how and why</em> AI could be risky/dangerous (missing key risks and difficulties), they might be too quick to later say things like &#x201C;Oh, turns out this danger is less bad than I thought, let&#x2019;s go full speed ahead!&#x201D; <a href="https://www.cold-takes.com/p/fbae8068-6543-4776-af3b-bedab1d7b74a#ais-could-behave-deceptively">Below</a>, I outline how misleading &#x201C;progress&#x201D; could lead to premature dismissal of the risks.
</li>
</ul>
<h2 id="messages-that-seem-important-and-helpful-and-right">Messages that seem important and helpful (and right!)</h2>


<h3 id="we-should-worry-about-conflict-between-misaligned-ai-and-all-humans">We should worry about conflict between misaligned AI and <em>all</em> humans</h3>


<p>
Unlike the messages discussed in the previous section, this one directly highlights why it might not be a good idea to rush forward with building AI oneself. 
</p>
<p>
The idea that an AI could harm the <em>same humans who build it</em> has very different implications from the idea that AI could be generically dangerous/powerful. Less &#x201C;We&#x2019;d better get there before others,&#x201D; more &#x201C;there&#x2019;s a case for moving slowly and working together here.&#x201D;
</p>
<p>
The idea that AI could be a problem for the same people who build it is common in fictional portrayals of AI (<a href="https://en.wikipedia.org/wiki/HAL_9000">HAL 9000</a>, <a href="https://en.wikipedia.org/wiki/Skynet_(Terminator)">Skynet</a>, <a href="https://en.wikipedia.org/wiki/The_Matrix">The Matrix</a>, <a href="https://en.wikipedia.org/wiki/Ex_Machina_(film)">Ex Machina</a>) - maybe too much so? It seems to me that people tend to balk at the &#x201C;sci-fi&#x201D; feel, and what&#x2019;s needed is more recognition that this is a serious, real-world concern.
</p>
<p>
The main pieces in this series making this case are <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">Why would AI &#x201C;aim&#x201D; to defeat humanity?</a> and <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">AI could defeat all of us combined</a>. There are many other pieces on the alignment problem (see list <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#fn3">here</a>); also see <a href="https://www.slowboring.com/p/the-case-for-terminator-analogies">Matt Yglesias&apos;s case</a> for specifically embracing the &#x201C;Terminator&#x201D;/Skynet analogy.
</p>
<p>
I&#x2019;d be especially excited for people to spread messages that help others understand - at a mechanistic level - <em>how and why</em> AI systems could end up with dangerous goals of their own, deceptive behavior, etc. I worry that by default, the concern sounds like lazy anthropomorphism (thinking of AIs just like humans).
</p>
<p>
Transmitting ideas about the &#x201C;how and why&#x201D; is a lot harder than getting people to nod along to &#x201C;AI could be dangerous.&#x201D; I think there&#x2019;s a lot of effort that could be put into simple, understandable yet relatable metaphors/analogies/examples (my pieces make some effort in this direction, but there&#x2019;s tons of room for more).
</p>
<h3 id="ais-could-behave-deceptively">AIs could behave deceptively, so &#x201C;evidence of safety&#x201D; might be misleading</h3>


<p>
I&#x2019;m very worried about a sequence of events like:
</p>
<ul>

<li>As AI systems become more powerful, there are some concerning incidents, and widespread concern about &#x201C;AI risk&#x201D; grows.

</li><li>But over time, AI systems are &#x201C;better trained&#x201D; - e.g., given reinforcement to stop them from behaving in unintended ways - and so the concerning incidents become less common.

</li><li>Because of this, concern dissipates, and it&#x2019;s widely believed that AI safety has been &#x201C;solved.&#x201D;

</li><li>But what&#x2019;s actually happened is that the &#x201C;better training&#x201D; has caused AI systems to <em>behave deceptively</em> - to <em>appear</em> benign in most situations, and to cause trouble only when (a) this wouldn&#x2019;t be detected or (b) humans can be <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">overpowered entirely</a>.
</li>
</ul>
<p>
I worry about AI systems&#x2019; being deceptive in the same way a human might: going through chains of reasoning like &#x201C;If I do X, I might get caught, but if I do Y, no one will notice until it&#x2019;s too late.&#x201D; But it can be hard to get this concern taken seriously, because it means attributing behavior to AI systems that we currently associate exclusively with humans (today&#x2019;s AI systems don&#x2019;t really do things like this<sup id="fnref4"><a href="https://www.cold-takes.com/p/fbae8068-6543-4776-af3b-bedab1d7b74a#fn4" rel="footnote">4</a></sup>).
</p>
<p>
One of the central things I&#x2019;ve tried to spell out in this series is <em>why</em> an AI system might engage in this sort of systematic deception, despite being very unlike humans (and not necessarily having e.g. emotions). It&#x2019;s a major focus of both of these pieces from this series:
</p>
<ul>

<li><a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">Why would AI &#x201C;aim&#x201D; to defeat humanity?</a> 

</li><li><a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">AI Safety Seems Hard to Measure</a>
</li>
</ul>

<p>
Whether this point is widely understood seems quite crucial to me. We might end up in a situation where (a) there are big commercial and military incentives to rush ahead with AI development; (b) we have what <em>seems like</em> a set of reassuring experiments and observations. 
</p>
<p>
At that point, it could be key whether people are asking tough questions about the many ways in which &#x201C;evidence of AI safety&#x201D; could be misleading, which I discussed at length in <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">AI Safety Seems Hard to Measure</a>.
</p>

<details id="Box3"><summary>(Click to expand) Why AI safety could be hard to measure<!-- (Details not included in email - <a href="https://www.cold-takes.com/p/fbae8068-6543-4776-af3b-bedab1d7b74a#Box3">click to view on the web</a>)--></summary>
<div>
<p>
In previous pieces, I argued that:
</p>
<ul>

<li>If we develop powerful AIs via ambitious use of the &#x201C;black-box trial-and-error&#x201D; common in AI development today, then there&#x2019;s a substantial risk that: 
<ul>
 
<li>These AIs will develop <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">unintended aims</a> (states of the world they make calculations and plans toward, as a chess-playing AI &quot;aims&quot; for checkmate);
 
</li><li>These AIs could deceive, manipulate, and even <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">take over the world from humans entirely</a> as needed to achieve those aims.

</li><li>People today are doing AI safety research to prevent this outcome, but such research has a <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">number of deep difficulties:</a>
</li>
</ul>
<p>
<table style="border-collapse: collapse;">
  <tr>
   <td colspan="3" style="border: 1px solid;"><strong>&#x201C;Great news - I&#x2019;ve tested this AI and it looks safe.&#x201D; </strong>Why might we still have a problem?
   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;"><em>Problem</em>
   </td>
   <td style="border: 1px solid;"><em>Key question</em>
   </td>
   <td style="border: 1px solid;"><em>Explanation</em>
   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;">The <strong>Lance Armstrong problem</strong>
   </td>
   <td style="border: 1px solid;">Did we get the AI to be <strong><span style="color:var(--green-color);">actually safe</span></strong> or <strong><span style="color:var(--red-color);">good at hiding its dangerous actions</span>?</strong>
   </td>
  <td style="border: 1px solid;"><p>When dealing with an intelligent agent, it&#x2019;s hard to tell the difference between &#x201C;behaving well&#x201D; and &#x201C;<em>appearing</em> to behave well.&#x201D;</p>
<p>
When professional cycling was cracking down on performance-enhancing drugs, Lance Armstrong was very successful and seemed to be unusually &#x201C;clean.&#x201D; It later came out that he had been using drugs with an unusually sophisticated operation for concealing them.
   </p></td>
  </tr>
  <tr>
   <td style="border: 1px solid;">The <strong>King Lear problem</strong>
   </td>
   <td style="border: 1px solid;"><p>The AI is <strong><span style="color:var(--green-color);">(actually) well-behaved when humans are in control. </span></strong>Will this transfer to <strong><span style="color:var(--red-color);">when AIs are in control</span>?</strong></p>
   </td>
   <td style="border: 1px solid;"><p>It&apos;s hard to know how someone will behave when they have power over you, based only on observing how they behave when they don&apos;t. </p>
<p>
AIs might behave as intended as long as humans are in control - but at some future point, AI systems might be capable and widespread enough to have opportunities to <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">take control of the world entirely</a>. It&apos;s hard to know whether they&apos;ll take these opportunities, and we can&apos;t exactly run a clean test of the situation. 
</p><p>
Like King Lear trying to decide how much power to give each of his daughters before abdicating the throne.
   </p></td>
  </tr>
  <tr>
   <td style="border: 1px solid;">The <strong>lab mice problem</strong>
   </td>
      <td style="border: 1px solid;"><strong><span style="color:var(--green-color);">Today&apos;s &quot;subhuman&quot; AIs are safe.</span></strong>What about <strong><span style="color:var(--red-color);">future AIs with more human-like abilities</span>?</strong>
   </td>
   <td style="border: 1px solid;"><p>Today&apos;s AI systems aren&apos;t advanced enough to exhibit the basic behaviors we want to study, such as deceiving and manipulating humans.</p> 
<p>
Like trying to study medicine in humans by experimenting only on lab mice.
   </p></td>
  </tr>
  <tr>
   <td style="border: 1px solid;">The <strong>first contact problem</strong>
   </td>
   <td style="border: 1px solid;"><p>Imagine that <strong><span style="color:var(--green-color);">tomorrow&apos;s &quot;human-like&quot; AIs are safe.</span></strong> How will things go <strong><span style="color:var(--red-color);">when AIs have capabilities far beyond humans&apos;</span>?</strong></p>
   </td>
   <td style="border: 1px solid;"><p>AI systems might (collectively) become vastly more capable than humans, and it&apos;s ... just really hard to have any idea what that&apos;s going to be like. As far as we know, there has never before been anything in the galaxy that&apos;s vastly more capable than humans in the relevant ways! No matter what we come up with to solve the first three problems, we can&apos;t be too confident that it&apos;ll keep working if AI advances (or just proliferates) a lot more. </p>
<p>
Like trying to plan for first contact with extraterrestrials (this barely feels like an analogy).
   </p></td>
  </tr>
</table>
</p>

<p>
An analogy that incorporates these challenges is Ajeya Cotra&#x2019;s &#x201C;young businessperson&#x201D; <a href="https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/#analogy-the-young-ceo">analogy</a>:
</p>

    <blockquote><p>Imagine you are an eight-year-old whose parents left you a $1 trillion company and no trusted adult to serve as your guide to the world. You must hire a smart adult to run your company as CEO, handle your life the way that a parent would (e.g. decide your school, where you&#x2019;ll live, when you need to go to the dentist), and administer your vast wealth (e.g. decide where you&#x2019;ll invest your money).
</p>
<p>

    You have to hire these grownups based on a work trial or interview you come up with -- you don&apos;t get to see any resumes, don&apos;t get to do reference checks, etc. Because you&apos;re so rich, tons of people apply for all sorts of reasons. (<a href="https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/#analogy-the-young-ceo">More</a>)</p></blockquote>
<p>
If your applicants are a mix of &quot;saints&quot; (people who genuinely want to help), &quot;sycophants&quot; (people who just want to make you happy in the short run, even when this is to your long-term detriment) and &quot;schemers&quot; (people who want to siphon off your wealth and power for themselves), how do you - an eight-year-old - tell the difference?
</p><p>More: <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">AI safety seems hard to measure</a></p>

    </li></ul></div>
</details>

<h3 id="ai-projects-should-establish-and-demonstrate-safety">AI projects should establish and demonstrate safety (and potentially comply with safety standards) before deploying powerful systems</h3>


<p>
I&#x2019;ve <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#global-monitoring">written about</a> the benefits we might get from &#x201C;safety standards.&quot; The idea is that AI projects should not deploy systems that pose too much risk to the world, as evaluated by a systematic evaluation regime: AI systems could be audited to see whether they are safe. I&apos;ve outlined how AI projects might self-regulate by publicly committing to having their systems audited (and not deploying dangerous ones), and how governments could enforce safety standards both nationally and internationally.
</p>
<p>
Today, development of safety standards is in its infancy. But over time, I think it could matter a lot how much pressure AI projects are under to meet safety standards. And I think it&#x2019;s not too early, today, to start spreading the message that <strong>AI projects shouldn&#x2019;t unilaterally decide to put potentially dangerous systems out in the world; the burden should be on them to demonstrate and establish safety before doing so.</strong>
</p>
<details id="Box4"><summary>(Click to expand) How standards might be established and become national or international <!--(Details not included in email - <a href="https://www.cold-takes.com/p/fbae8068-6543-4776-af3b-bedab1d7b74a#Box4">click to view on the web</a>)--></summary><div>
<p>
I <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#global-monitoring">previously</a> laid out a possible vision on this front, which I&#x2019;ll give a slightly modified version of here:
</p>
<ul>

<li>Today&#x2019;s leading AI companies could self-regulate by committing not to build or deploy a system that they can&#x2019;t convincingly demonstrate is safe (e.g., see Google&#x2019;s <a href="https://www.theweek.in/news/sci-tech/2018/06/08/google-wont-deploy-ai-to-build-military-weapons-ichai.html">2018 statement</a>, &quot;We will not design or deploy AI in weapons or other technologies whose principal purpose or implementation is to cause or directly facilitate injury to people&#x201D;).  
<ul>
 
<li>Even if some people at the companies would like to deploy unsafe systems, it could be hard to pull this off once the company has committed not to. 
 
</li><li>Even if there&#x2019;s a lot of room for judgment in what it means to demonstrate an AI system is safe, having agreed in advance that <a href="https://www.cold-takes.com/p/fbae8068-6543-4776-af3b-bedab1d7b74a#ais-could-behave-deceptively">certain evidence</a> is <em>not</em> good enough could go a long way.
</li> 
</ul>

</li><li>As more AI companies are started, they could feel soft pressure to do similar self-regulation, and refusing to do so is off-putting to potential employees, investors, etc.

</li><li>Eventually, similar principles could be incorporated into various government regulations and enforceable treaties.

</li><li>Governments could monitor for dangerous projects using regulation and even overseas operations. E.g., today the US monitors (without permission) for various signs that other states might be developing nuclear weapons, and might try to stop such development with methods ranging from threats of sanctions to <a href="https://en.wikipedia.org/wiki/Stuxnet">cyberwarfare</a> or even military attacks. It could do something similar for any AI development projects that are using huge amounts of compute and haven&#x2019;t volunteered information about whether they&#x2019;re meeting standards.
</li>
    </ul></div>
</details>


<h3 id="alignment-research-is-prosocial-and-great">Alignment research is prosocial and great</h3>


<p>
Most people reading this can&#x2019;t go and become groundbreaking researchers on AI alignment. But they <em>can</em> contribute to a general sense that the people who can do this (mostly) should.
</p>
<p>
Today, my sense is that most &#x201C;science&#x201D; jobs are pretty prestigious, and seen as good for society. I have pretty mixed feelings about this:
</p>
<ul>

<li>I think science has been <a href="https://www.cold-takes.com/rowing-steering-anchoring-equity-mutiny/#rowing">good for humanity historically</a>.

</li><li>But I worry that as technology becomes more and more powerful, there&#x2019;s a growing risk of a catastrophe (particularly via AI or bioweapons) that wipes out all the progress to date and then some. (I&apos;ve <a href="https://www.cold-takes.com/has-violence-declined-when-we-include-the-world-wars-and-other-major-atrocities/">written</a> that the historical trend to date arguably fits something like &quot;Declining everyday violence, offset by bigger and bigger rare catastrophes.&quot;) I think our current era would be a nice time to adopt an attitude of &#x201C;proceed with caution&#x201D; rather than &#x201C;full speed ahead.&#x201D; 

</li><li>I resonate with Toby Ord&#x2019;s comment (in <a href="https://theprecipice.com/">The Precipice</a>), &#x201C;humanity is akin to an adolescent, with rapidly developing physical abilities, lagging wisdom and self-control, little thought for its longterm future and an unhealthy appetite for risk.&#x201D;
</li>
</ul>
<p>
I wish there were more effort, generally, to distinguish between especially dangerous science and especially beneficial science. AI alignment seems squarely in the latter category.
</p>
<p>
I&#x2019;d be especially excited for people to spread messages that give a sense of the specifics of different AI alignment research paths, how they might help or fail, and what&#x2019;s scientifically/intellectually interesting (not just useful) about them.
</p>
<p>
The main relevant piece in this series is <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/">High-level hopes for AI alignment</a>, which distills a longer piece (<a href="https://www.alignmentforum.org/posts/rCJQAkPTEypGjSJ8X/how-might-we-align-transformative-ai-if-it-s-developed-very">How might we align transformative AI if it&#x2019;s developed very soon?</a>) that I posted on the Alignment Forum. 
</p>
<p>There are a number (hopefully growing) of other careers that I consider especially valuable, which I&apos;ll discuss in my next post on this topic.</p>
<h3 id="it-might-be-important-for-institutions-to-act-in-unusual-ways">It might be important for companies (and other institutions) to act in unusual ways</h3>


<p>
In <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/">Racing through a Minefield: the AI Deployment Problem</a>, I wrote:
</p>

   <blockquote><p><strong>A lot of the most helpful actions might be &#x201C;out of the ordinary.&#x201D; </strong>When racing through a minefield, I hope key actors will:
</p>
<ul>

<li>Put more effort into alignment, threat assessment, and security than is required by commercial incentives;

</li><li>Consider measures for <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#avoiding-races">avoiding races</a> and <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#global-monitoring">global monitoring</a> that could be very unusual, even unprecedented.

</li><li>Do all of this in the possible presence of ambiguous, confusing information about the risks.</li></ul></blockquote><p>

It always makes me sweat when I&#x2019;m talking to someone from an AI company and they seem to think that commercial success and benefiting humanity are roughly the same goal/idea. 
</p>
<p>(To be clear, I don&apos;t think an AI project&apos;s only goal should be to avoid the risk of misaligned AI. I&apos;ve given this risk a central place in this piece partly because I think it&apos;s especially at risk of being too quickly dismissed - but I don&apos;t think it&apos;s the only major risk. I think AI projects need to strike a tricky balance between the <a href="https://www.cold-takes.com/p/fbae8068-6543-4776-af3b-bedab1d7b74a#Box1">caution and competition frames</a>, and consider a number of issues <a href="https://www.cold-takes.com/transformative-ai-issues-not-just-misalignment-an-overview/">beyond the risk of misalignment</a>. But I think it&apos;s a pretty robust point that they need to be ready to do unusual things rather than just following commercial incentives.)</p>
<p>
I&#x2019;m nervous about a world in which:
</p>
<ul>

<li>Most people stick with paradigms they know - a company should focus on shareholder value, a government should focus on its own citizens (rather than global catastrophic risks), etc.

</li><li>As the <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/#explosive-scientific-and-technological-advancement">pace of progress accelerates</a>, we&#x2019;re sitting here with all kinds of laws, norms and institutions that aren&#x2019;t designed for the problems we&#x2019;re facing - and can&#x2019;t adapt in time. A good example would be the way <a href="https://www.cold-takes.com/ideal-governance-for-companies-countries-and-more/">governance</a> works for a standard company: it&#x2019;s legally and structurally obligated to be entirely focused on benefiting its shareholders, rather than humanity as a whole. (There are alternative ways of setting up a company without these problems!<sup id="fnref5"><a href="https://www.cold-takes.com/p/fbae8068-6543-4776-af3b-bedab1d7b74a#fn5" rel="footnote">5</a></sup>)</li></ul>
<p>
At a minimum (as I <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/">argued previously</a>), I think AI companies should be making sure they have whatever unusual governance setups they need in order to prioritize benefits to humanity - not returns to shareholders - when the stakes get high. I think we&#x2019;d see more of this if more people believed something like: &#x201C;It might be important for companies (and other institutions) to act in unusual ways.&#x201D;
</p>
<h3 id="were-not-ready-for-this">We&#x2019;re not ready for this</h3>


<p>
If we&#x2019;re in the <a href="https://www.cold-takes.com/most-important-century/">most important century</a>, there&#x2019;s likely to be a vast set of potential challenges ahead of us, most of which have gotten very little attention. (More here: <a href="https://www.cold-takes.com/transformative-ai-issues-not-just-misalignment-an-overview/">Transformative AI issues (not just misalignment): an overview</a>)
</p>
<p>
If it were possible to slow everything down, by default I&#x2019;d think we should. Barring that, I&#x2019;d at least like to see people generally approaching the topic of AI with a general attitude along the lines of &#x201C;We&#x2019;re dealing with something really big here, and we should be trying really hard to be careful and humble and thoughtful&#x201D; (as opposed to something like &#x201C;The science is so interesting, let&#x2019;s go for it&#x201D; or &#x201C;This is awesome, we&#x2019;re gonna get rich&#x201D; or &#x201C;Whatever, who cares&#x201D;).
</p>
<p>
I&#x2019;ll re-excerpt this table from an <a href="https://www.cold-takes.com/call-to-vigilance/#sharing-a-headspace">earlier piece</a>:
</p>
<p>



<table style="border-collapse: collapse;">
  <tr>
   <td style="border: 1px solid; vertical-align: top;"><strong>Situation</strong>
   </td>
   <td style="border: 1px solid; vertical-align: top;"><strong>Appropriate reaction (IMO)</strong>
   </td>
  </tr>
  <tr>
   <td style="border: 1px solid; vertical-align: top;">&quot;This could be a billion-dollar company!&quot;
   </td>
   <td style="border: 1px solid; vertical-align: top;">&quot;Woohoo, let&apos;s GO for it!&quot;
   </td>
  </tr>
  <tr>
   <td style="border: 1px solid; vertical-align: top;">&quot;This could be the most important century!&quot;
   </td>
   <td style="border: 1px solid; vertical-align: top;">&quot;... Oh ... wow ... I don&apos;t know what to say and I somewhat want to vomit ... I have to sit down and think about this one.&quot;
   </td>
  </tr>
</table>
</p>
<p>
I&#x2019;m not at all sure about this, but one potential way to spread this message might be to communicate, with as much scientific realism, detail and believability as possible, about what the world might look like after explosive scientific and technological advancement brought on by AI (for example, a world with <a href="https://www.cold-takes.com/how-digital-people-could-change-the-world/">digital people</a>). I think the enormous unfamiliarity of some of the <a href="https://www.cold-takes.com/transformative-ai-issues-not-just-misalignment-an-overview/#new-life-forms">issues</a> such a world might face - and the vast possibilities for <a href="https://www.cold-takes.com/tag/utopia/">utopia</a> or <a href="https://www.cold-takes.com/how-digital-people-could-change-the-world/#virtual-reality-and-control-of-the-environment">dystopia</a> - might encourage an attitude of not wanting to rush forward.
</p>
<h2 id="how-to-spread-messages-like-these">How to spread messages like these?</h2>


<p>
I&#x2019;ve tried to write a <a href="https://www.cold-takes.com/tag/implicationsofmostimportantcentury/">series</a> that explains the key issues to careful readers, hopefully better equipping them to spread helpful messages. From here, individual communicators need to think about the audiences they know and the mediums they use (Twitter? Facebook? Essays/newsletters/blog posts? Video? In-person conversation?) and what will be effective with those audiences and mediums.
</p>
<p>
The main guidelines I want to advocate:
</p>
<ul>

<li>Err toward sustained, repeated, relationship-based communication as opposed to prioritizing &#x201C;viral blasts&#x201D; (unless you are so good at the latter that you feel excited to spread the pretty subtle ideas in this piece that way!)

</li><li>Aim high: try for the difficult goal of &#x201C;My audience walks away really understanding key points&#x201D; rather than the easier goal of &#x201C;My audience has hit the &#x2018;like&#x2019; button for a sort of related idea.&#x201D;

</li><li>A consistent piece of feedback I&#x2019;ve gotten on my writing is that making things as concrete as possible is helpful - so giving real-world examples of problems analogous to the ones we&#x2019;re worried about, or simple analogies that are easy to imagine and remember, could be key. But it&#x2019;s important to choose these carefully so that the key dynamics aren&#x2019;t lost. </li></ul>
<!--kg-card-end: html--><!--kg-card-begin: html-->

<p><div style="display:flex; justify-content:center; margin: 0 auto;">

        <span style="margin: 10px;"><a href="https://api.addthis.com/oexchange/0.8/forward/twitter/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fspreading-messages-to-help-with-the-most-important-century&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Spreading%20messages%20to%20help%20with%20the%20most%20important%20century&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-twitter-square.png" border="0" alt="Spreading messages to help with the most important century"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/facebook/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fspreading-messages-to-help-with-the-most-important-century&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Spreading%20messages%20to%20help%20with%20the%20most%20important%20century&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-facebook-square.png" border="0" alt="Spreading messages to help with the most important century"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/reddit/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fspreading-messages-to-help-with-the-most-important-century&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Spreading%20messages%20to%20help%20with%20the%20most%20important%20century&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-reddit-square.png" border="0" alt="Spreading messages to help with the most important century"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/menu/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fspreading-messages-to-help-with-the-most-important-century&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Spreading%20messages%20to%20help%20with%20the%20most%20important%20century&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-addthis-square.png" border="0" alt="Spreading messages to help with the most important century"></a></span>
        </div>
<center><p id="discuss"><!--<a href="https://www.cold-takes.com/spreading-messages-to-help-with-the-most-important-century#subscribe" target="_blank"><button id="footer-subscribe" class="button">Subscribe</button></a>&nbsp;<a href="https://www.guidedtrack.com/programs/4kal2ue/run?posttitle=Spreading%20messages%20to%20help%20with%20the%20most%20important%20century" target="_blank"><button class="button" id="Survey">Feedback</button></a>-->&#xA0;<a href="https://www.lesswrong.com/posts/slug/spreading-messages-to-help-with-the-most-important-century#comments" target="_blank"><button class="button">Comment/discuss</button></a></p><p><!--<em>
Use "Feedback" if you have comments/suggestions you want me to see, or if you're up for giving some quick feedback about this post (which I greatly appreciate!) Use "Forum" if you want to discuss this post publicly on the Effective Altruism Forum.
</em>--></p></center>
<!--kg-card-end: html--><!--kg-card-begin: html-->
</p><h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<hr>
<ol><li id="fn1">
<p>
     <a href="https://www.foreignaffairs.com/articles/2019-04-16/killer-apps">Killer Apps</a> and <a href="https://www.cnas.org/publications/reports/technology-roulette">Technology Roulette</a> are interesting pieces trying to sell policymakers on the idea that &#x201C;superiority is not synonymous with security.&#x201D;&#xA0;<a href="#fnref1" rev="footnote">&#x21A9;</a><li id="fn2">
<p>
     When I imagine what the world would look like without any of the efforts to &#x201C;raise awareness,&#x201D; I picture a world with close to zero awareness of - or community around - major risks from transformative AI. While this world might <em>also</em> have more <em>time</em> left before dangerous AI is developed, on balance this seems worse. A future piece will elaborate on the many ways I think a decent-sized community can help reduce risks.&#xA0;<a href="#fnref2" rev="footnote">&#x21A9;</a><li id="fn3">
<p>
     I do think &#x201C;AI could be a huge deal, and soon&#x201D; is a very important point that somewhat serves as a prerequisite for understanding this topic and doing helpful work on it, and I wanted to make this idea more understandable and credible to a number of people - as well as to <a href="https://www.cold-takes.com/where-ai-forecasting-stands-today/#reason-2-cunninghams-law">create more opportunities to get critical feedback and learn what I was getting wrong</a>. 
</p><p>
    But I was nervous about the issues noted in this section. With that in mind, I did the following things:
<ul>

<li>The title, &#x201C;most important century,&#x201D; emphasizes a time frame that I expect to be less exciting/motivating for the sorts of people I&#x2019;m most worried about (compared to the sorts of people I most wanted to draw in).

</li><li>I tried to persistently and centrally raise concerns about misaligned AI (raising it in two pieces, including <a href="https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/#powerful-models-could-get-good-performance-with-dangerous-goals">one (guest piece) devoted to it</a>, before I started discussing how soon transformative AI might be developed), and <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/">extensively discussed</a> the problems of overemphasizing &#x201C;competition&#x201D; relative to &#x201C;caution.&#x201D;

</li><li>I <a href="https://www.cold-takes.com/call-to-vigilance/">ended the series</a> with a piece arguing against being too &#x201C;action-oriented.&#x201D;

</li><li>I stuck to &#x201C;passive&#x201D; rather than &#x201C;active&#x201D; promotion of the series, e.g., I accepted podcast invitations but didn&#x2019;t seek them out. I figured that people with proactive interest would be more likely to give in-depth, attentive treatments rather than low-resolution, oversimplified ones.</li></ul>

</p><p>
    I don&#x2019;t claim to be sure I got all the tradeoffs right. &#xA0;<a href="#fnref3" rev="footnote">&#x21A9;</a><li id="fn4">
<p>
     There are some papers arguing that AI systems do things <em>something</em> like this (e.g., see the &#x201C;Challenges&#x201D; section of <a href="https://openai.com/blog/deep-reinforcement-learning-from-human-preferences/">this post</a>), but I think the dynamic is overall pretty far from what I&#x2019;m most worried about.&#xA0;<a href="#fnref4" rev="footnote">&#x21A9;</a><li id="fn5">

<p>
     E.g., <a href="https://www.delawareinc.com/public-benefit-corporation/">public benefit corporation</a>&#xA0;<a href="#fnref5" rev="footnote">&#x21A9;</a>

</p></li></p></li></p></li></p></li></p></li></ol></div>


<!--kg-card-end: html-->]]></content:encoded></item><item><title><![CDATA[How we could stumble into AI catastrophe]]></title><description><![CDATA[Hypothetical stories where the world tries, but fails, to avert a global disaster.]]></description><link>https://www.cold-takes.com/how-we-could-stumble-into-ai-catastrophe/</link><guid isPermaLink="false">63c0700c9a951a003d4e4674</guid><category><![CDATA[ImplicationsOfMostImportantCentury]]></category><dc:creator><![CDATA[Holden Karnofsky]]></dc:creator><pubDate>Fri, 13 Jan 2023 16:18:04 GMT</pubDate><media:content url="https://www.cold-takes.com/content/images/2023/01/wile-c-coyote-twitter.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html--><img src="https://www.cold-takes.com/content/images/2023/01/wile-c-coyote-twitter.png" alt="How we could stumble into AI catastrophe"><p><figure><div id="buzzsprout-player-12031233"></div><script src="https://www.buzzsprout.com/1851795/12031233-how-we-could-stumble-into-ai-catastrophe.js?container_id=buzzsprout-player-12031233&amp;player=small" type="text/javascript" charset="utf-8"></script><figcaption><em>Click lower right to download or find on Apple Podcasts, Spotify, Stitcher, etc.</em></figcaption></figure></p>

<p>
This post will lay out a couple of stylized stories about <strong>how, if transformative AI is developed relatively soon, this could result in global catastrophe. </strong>(By &#x201C;transformative AI,&#x201D; I mean AI powerful and capable enough to bring about the sort of world-changing consequences I write about in my <a href="https://www.cold-takes.com/most-important-century/">most important century</a> series.)
</p>
<p>
This piece is more about visualizing possibilities than about providing arguments. For the latter, I recommend the <a href="https://www.cold-takes.com/tag/implicationsofmostimportantcentury/">rest of this series</a>.
</p>
<p>
In the stories I&#x2019;ll be telling, the world doesn&apos;t do much advance preparation or careful consideration of <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">risks I&#x2019;ve discussed previously</a>, especially re: misaligned AI (AI forming dangerous goals of its own). 
</p>
<ul>

<li>People <em>do</em> try to &#x201C;test&#x201D; AI systems for safety, and they do need to achieve some level of &#x201C;safety&#x201D; to commercialize. When early problems arise, they react to these problems. 

</li><li>But this isn&#x2019;t enough, because of some <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">unique challenges of measuring whether an AI system is &#x201C;safe,&#x201D;</a> and because of the strong incentives to race forward with scaling up and deploying AI systems as fast as possible. 

</li><li>So we end up with a world run by misaligned AI - or, even if we&#x2019;re lucky enough to avoid <em>that</em> outcome, other catastrophes are possible.
</li>
</ul>
<p>
After laying these catastrophic possibilities, I&#x2019;ll briefly note a few key ways we could do better, mostly as a reminder (these topics were covered in previous posts). Future pieces will get more specific about what we can be doing <em>today</em> to prepare.
</p>
<h2 id="backdrop">Backdrop</h2>


<p>
This piece takes a lot of previous writing I&#x2019;ve done as backdrop. Two key important assumptions (click to expand) are below; for more, see the rest of <a href="https://www.cold-takes.com/tag/implicationsofmostimportantcentury/">this series.</a>
</p>
<details id="Box1"><summary>(Click to expand) &#x201C;Most important century&#x201D; assumption: we&#x2019;ll soon develop very powerful AI systems, along the lines of what I previously called <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/">PASTA</a>. <!--(Details not included in email - <a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#Box1">click to view on the web</a>)</em>--></summary>
    <div>

<p>
In the <a href="https://www.cold-takes.com/most-important-century/">most important century</a> series, I argued that the 21st century could be the most important century ever for humanity, via the development of advanced AI systems that could dramatically speed up scientific and technological advancement, getting us more quickly than most people imagine to a deeply unfamiliar future.
</p>
<p>
I focus on a hypothetical kind of AI that I call <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/">PASTA</a>, or Process for Automating Scientific and Technological Advancement. PASTA would be AI that can essentially <strong>automate all of the human activities needed to speed up scientific and technological advancement.</strong>
</p>
<p>
Using a <a href="https://www.cold-takes.com/where-ai-forecasting-stands-today/">variety of different forecasting approaches</a>, I argue that PASTA seems more likely than not to be developed this century - and there&#x2019;s a decent chance (more than 10%) that we&#x2019;ll see it within 15 years or so.
</p>
<p>
I argue that the consequences of this sort of AI could be enormous: an <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/#explosive-scientific-and-technological-advancement">explosion in scientific and technological progress</a>. This could get us more quickly than most imagine to a radically unfamiliar future.
</p>
<p>
I&#x2019;ve also <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">argued</a> that AI systems along these lines could defeat all of humanity combined, if (for whatever reason) they were aimed toward that goal.
</p>
<p>
For more, see the <a href="https://www.cold-takes.com/most-important-century/">most important century</a> landing page. The series is available in many formats, including audio; I also provide a summary, and links to podcasts where I discuss it at a high level.</p></div></details>
<details id="Box2"><summary>(Click to expand) &#x201C;Nearcasting&#x201D; assumption: such systems will be developed in a world that&#x2019;s otherwise similar to today&#x2019;s. <!--(Details not included in email - <a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#Box2">click to view on the web</a>)</em>--></summary>
    <div>

<p>
It&#x2019;s hard to talk about risks from <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/">transformative AI </a>because of the many uncertainties about when and how such AI will be developed - and how much the (now-nascent) field of &#x201C;AI safety research&#x201D; will have grown by then, and how seriously people will take the risk, etc. etc. etc. So maybe it&#x2019;s not surprising that <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/#open-question-how-hard-is-the-alignment-problem">estimates of the &#x201C;misaligned AI&#x201D; risk range from ~1% to ~99%</a>.
</p>
<p>
This piece takes an approach I call <strong><span style="text-decoration:underline;">nearcasting</span></strong>: trying to answer key strategic questions about transformative AI, under the assumption that such AI arrives in a world that is otherwise relatively similar to today&apos;s. 
</p>
<p>
You can think of this approach like this: &#x201C;Instead of asking where our ship will ultimately end up, let&#x2019;s start by asking what destination it&#x2019;s pointed at right now.&#x201D; 
</p>
<p>
That is: instead of trying to talk about an uncertain, distant future, we can talk about the easiest-to-visualize, closest-to-today situation, and how things look there - and <em>then</em> ask how our picture might be off if other possibilities play out. (As a bonus, it doesn&#x2019;t seem out of the question that transformative AI will be developed extremely soon - 10 years from now or faster.<sup id="fnref1"><a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#fn1" rel="footnote">1</a></sup> If that&#x2019;s the case, it&#x2019;s especially urgent to think about what that might look like.)</p></div></details>
<h2 id="how-we-could-stumble-into-catastrophe-from-misaligned-ai">How we could stumble into catastrophe from misaligned AI</h2>


<p>
This is my basic default picture for how I imagine things going, if people pay little attention to the sorts of issues discussed <a href="https://www.cold-takes.com/tag/implicationsofmostimportantcentury/">previously</a>. I&#x2019;ve deliberately written it to be concrete and visualizable, which means that it&#x2019;s very unlikely that the details will match the future - but hopefully it gives a picture of some of the key dynamics I worry about. 
</p>
<p>
Throughout this hypothetical scenario (up until &#x201C;<span style="text-decoration:underline;">END OF HYPOTHETICAL SCENARIO</span>&#x201D;), I use the present tense (&#x201C;AIs do X&#x201D;) for simplicity, even though I&#x2019;m talking about a hypothetical possible future.
</p>
<p>
<strong>Early commercial applications. </strong>A few years before transformative AI is developed, AI systems are being increasingly used for a number of lucrative, useful, but not dramatically world-changing things. 
</p>
<p>
I think it&#x2019;s very hard to predict what these will be (harder in some ways than predicting longer-run consequences, in my view),<sup id="fnref2"><a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#fn2" rel="footnote">2</a></sup> so I&#x2019;ll mostly work with the simple example of automating customer service.
</p>
<p>
In this early stage, AI systems often have pretty narrow capabilities, such that the idea of them forming <a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f/#existential-risks-to-humanity">ambitious aims</a> and trying to <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">defeat humanity</a> seems (and actually is) silly. For example, customer service AIs are mostly language models that are trained to mimic patterns in past successful customer service transcripts, and are further improved by customers giving satisfaction ratings in real interactions. The dynamics I described in an <a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f/">earlier piece</a>, in which AIs are given increasingly ambitious goals and challenged to find increasingly creative ways to achieve them, don&#x2019;t necessarily apply.
</p>
<p>
<strong>Early safety/alignment problems. </strong>Even with these relatively limited AIs, there are problems and challenges that could be called &#x201C;safety issues&#x201D; or &#x201C;alignment issues.&#x201D; To continue with the example of customer service AIs, these AIs might:
</p>
<ul>

<li>Give false information about the products they&#x2019;re providing support for. (<a href="https://www.vice.com/en/article/wxnaem/stack-overflow-bans-chatgpt-for-constantly-giving-wrong-answers">Example</a> of reminiscent behavior)

</li><li>Give customers advice (when asked) on how to do unsafe or illegal things. (<a href="https://twitter.com/NickEMoran/status/1598101579626057728">Example</a>)

</li><li>Refuse to answer valid questions. (This could result from companies making <a href="https://twitter.com/PougetHadrien/status/1611008020644864001">attempts to prevent the above two failure modes</a> - i.e., AIs might be penalized heavily for saying false and harmful things, and respond by simply refusing to answer lots of questions).

</li><li>Say toxic, offensive things in response to certain user queries (including from users deliberately trying to get this to happen), causing bad PR for AI developers. (<a href="https://twitter.com/zswitten/status/1598088280066920453">Example</a>)
</li>
</ul>
<p id="early-solutions">
<strong>Early solutions. </strong>The most straightforward way to solve these problems involves <em>training AIs to behave more safely and helpfully. </em>This means that AI companies do a lot of things like &#x201C;Trying to create the conditions under which an AI might provide false, harmful, evasive or toxic responses; penalizing it for doing so, and reinforcing it toward more helpful behaviors.&#x201D;
</p>
<p>
This works well, as far as anyone can tell: the above problems become a lot less frequent. Some people see this as cause for great celebration, saying things like &#x201C;We were worried that AI companies wouldn&#x2019;t invest enough in safety, but it turns out that the market takes care of it - to have a viable product, you need to get your systems to be safe!&#x201D;
</p>
<p>
People like me disagree - training AIs to <em>behave in ways that are safer as far as we can tell</em> is the kind of &#x201C;solution&#x201D; that I&#x2019;ve worried could <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#why-we-might-not-get-clear-warning-signs">create superficial improvement while big risks remain in place</a>. 
</p>
<details id="Box3"><summary>(Click to expand) Why AI safety could be hard to measure <!--(Details not included in email - <a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#Box3">click to view on the web</a>)</em>--></summary>
<div>
<p>
In previous pieces, I argued that:
</p>
<ul>

<li>If we develop powerful AIs via ambitious use of the &#x201C;black-box trial-and-error&#x201D; common in AI development today, then there&#x2019;s a substantial risk that: 
<ul>
 
<li>These AIs will develop <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">unintended aims</a> (states of the world they make calculations and plans toward, as a chess-playing AI &quot;aims&quot; for checkmate);
 
</li><li>These AIs could deceive, manipulate, and even <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">take over the world from humans entirely</a> as needed to achieve those aims.

</li><li>People today are doing AI safety research to prevent this outcome, but such research has a <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">number of deep difficulties:</a>
</li>
</ul>
<p>
<table style="border-collapse: collapse;">
  <tr>
   <td colspan="3" style="border: 1px solid;"><strong>&#x201C;Great news - I&#x2019;ve tested this AI and it looks safe.&#x201D; </strong>Why might we still have a problem?
   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;"><em>Problem</em>
   </td>
   <td style="border: 1px solid;"><em>Key question</em>
   </td>
   <td style="border: 1px solid;"><em>Explanation</em>
   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;">The <strong>Lance Armstrong problem</strong>
   </td>
   <td style="border: 1px solid;">Did we get the AI to be <strong><span style="color:var(--green-color);">actually safe</span></strong> or <strong><span style="color:var(--red-color);">good at hiding its dangerous actions</span>?</strong>
   </td>
  <td style="border: 1px solid;"><p>When dealing with an intelligent agent, it&#x2019;s hard to tell the difference between &#x201C;behaving well&#x201D; and &#x201C;<em>appearing</em> to behave well.&#x201D;</p>
<p>
When professional cycling was cracking down on performance-enhancing drugs, Lance Armstrong was very successful and seemed to be unusually &#x201C;clean.&#x201D; It later came out that he had been using drugs with an unusually sophisticated operation for concealing them.
   </p></td>
  </tr>
  <tr>
   <td style="border: 1px solid;">The <strong>King Lear problem</strong>
   </td>
   <td style="border: 1px solid;"><p>The AI is <strong><span style="color:var(--green-color);">(actually) well-behaved when humans are in control. </span></strong>Will this transfer to <strong><span style="color:var(--red-color);">when AIs are in control</span>?</strong></p>
   </td>
   <td style="border: 1px solid;"><p>It&apos;s hard to know how someone will behave when they have power over you, based only on observing how they behave when they don&apos;t. </p>
<p>
AIs might behave as intended as long as humans are in control - but at some future point, AI systems might be capable and widespread enough to have opportunities to <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">take control of the world entirely</a>. It&apos;s hard to know whether they&apos;ll take these opportunities, and we can&apos;t exactly run a clean test of the situation. 
</p><p>
Like King Lear trying to decide how much power to give each of his daughters before abdicating the throne.
   </p></td>
  </tr>
  <tr>
   <td style="border: 1px solid;">The <strong>lab mice problem</strong>
   </td>
      <td style="border: 1px solid;"><strong><span style="color:var(--green-color);">Today&apos;s &quot;subhuman&quot; AIs are safe.</span></strong>What about <strong><span style="color:var(--red-color);">future AIs with more human-like abilities</span>?</strong>
   </td>
   <td style="border: 1px solid;"><p>Today&apos;s AI systems aren&apos;t advanced enough to exhibit the basic behaviors we want to study, such as deceiving and manipulating humans.</p> 
<p>
Like trying to study medicine in humans by experimenting only on lab mice.
   </p></td>
  </tr>
  <tr>
   <td style="border: 1px solid;">The <strong>first contact problem</strong>
   </td>
   <td style="border: 1px solid;"><p>Imagine that <strong><span style="color:var(--green-color);">tomorrow&apos;s &quot;human-like&quot; AIs are safe.</span></strong> How will things go <strong><span style="color:var(--red-color);">when AIs have capabilities far beyond humans&apos;</span>?</strong></p>
   </td>
   <td style="border: 1px solid;"><p>AI systems might (collectively) become vastly more capable than humans, and it&apos;s ... just really hard to have any idea what that&apos;s going to be like. As far as we know, there has never before been anything in the galaxy that&apos;s vastly more capable than humans in the relevant ways! No matter what we come up with to solve the first three problems, we can&apos;t be too confident that it&apos;ll keep working if AI advances (or just proliferates) a lot more. </p>
<p>
Like trying to plan for first contact with extraterrestrials (this barely feels like an analogy).
   </p></td>
  </tr>
</table>
</p>

<p>
An analogy that incorporates these challenges is Ajeya Cotra&#x2019;s &#x201C;young businessperson&#x201D; <a href="https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/#analogy-the-young-ceo">analogy</a>:
</p>

    <blockquote><p>Imagine you are an eight-year-old whose parents left you a $1 trillion company and no trusted adult to serve as your guide to the world. You must hire a smart adult to run your company as CEO, handle your life the way that a parent would (e.g. decide your school, where you&#x2019;ll live, when you need to go to the dentist), and administer your vast wealth (e.g. decide where you&#x2019;ll invest your money).
</p>
<p>

    You have to hire these grownups based on a work trial or interview you come up with -- you don&apos;t get to see any resumes, don&apos;t get to do reference checks, etc. Because you&apos;re so rich, tons of people apply for all sorts of reasons. (<a href="https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/#analogy-the-young-ceo">More</a>)</p></blockquote>
<p>
If your applicants are a mix of &quot;saints&quot; (people who genuinely want to help), &quot;sycophants&quot; (people who just want to make you happy in the short run, even when this is to your long-term detriment) and &quot;schemers&quot; (people who want to siphon off your wealth and power for themselves), how do you - an eight-year-old - tell the difference?
</p><p>More: <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">AI safety seems hard to measure</a></p></li></ul></div>


</details>
<p>
(So far, what I&#x2019;ve described is pretty similar to what&#x2019;s going on today. The next bit will discuss hypothetical future progress, with AI systems clearly beyond today&#x2019;s.)
</p>
<p>
<strong>Approaching transformative AI. </strong>Time passes. At some point, AI systems are playing a huge role in various kinds of scientific research - to the point where it often feels like a particular AI is about as helpful to a research team as a top human scientist would be (although there are still important parts of the work that require humans).
</p>
<p>
Some particularly important (though not exclusive) examples:
</p>
<ul>

<li>AIs are near-autonomously writing papers about AI, finding all kinds of ways to improve the efficiency of AI algorithms. 

</li><li>AIs are doing a lot of the work previously done by humans at Intel (and similar companies), designing ever-more efficient hardware for AI.

</li><li>AIs are also extremely helpful with <em>AI safety research</em>. They&#x2019;re able to do most of the work of writing papers about things like <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/#digital-neuroscience">digital neuroscience</a> (how to understand what&#x2019;s going on inside the &#x201C;digital brain&#x201D; of an AI) and <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/#limited-ai">limited AI</a> (how to get AIs to accomplish helpful things while limiting their capabilities). 
<ul>
 
<li>However, this kind of work remains quite niche (as I think it is today), and is getting far less attention and resources than the first two applications. Progress is made, but it&#x2019;s slower than progress on making AI systems more powerful. 
</li> 
</ul>
</li> 
</ul>
<p>
AI systems are now getting bigger and better very quickly, due to dynamics like the above, and they&#x2019;re able to do all sorts of things. 
</p>
<p>
At some point, companies start to experiment with very ambitious, open-ended AI applications, like simply instructing AIs to &#x201C;Design a new kind of car that outsells the current ones&#x201D; or &#x201C;Find a new trading strategy to make money in markets.&#x201D; These get mixed results, and companies are trying to get better results via further training - reinforcing behaviors that perform better. (AIs are helping with this, too, e.g. providing feedback and reinforcement for each others&#x2019; outputs<sup id="fnref3"><a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#fn3" rel="footnote">3</a></sup> and helping to write code<sup id="fnref4"><a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#fn4" rel="footnote">4</a></sup> for the training processes.) 
</p>
<p>
This training strengthens the dynamics I discussed in a <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">previous post</a>: AIs are being rewarded for getting successful outcomes <em>as far as human judges can tell</em>, which creates incentives for them to mislead and manipulate human judges, and ultimately results in forming ambitious goals of their own to <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#what-it-means-for">aim</a> for.
</p>
<p>
<strong>More advanced safety/alignment problems. </strong>As the scenario continues to unfold, there are a number of concerning events that point to safety/alignment problems. These mostly follow the form: &#x201C;AIs are trained using trial and error, and this might lead them to sometimes do deceptive, unintended things to accomplish the goals they&#x2019;ve been trained to accomplish.&#x201D;
</p>
<p>
Things like:
</p>
<ul>

<li>AIs creating writeups on new algorithmic improvements, using faked data to argue that their new algorithms are better than the old ones. Sometimes, people incorporate new algorithms into their systems and use them for a while, before unexpected behavior ultimately leads them to dig into what&#x2019;s going on and discover that they&#x2019;re not improving performance at all. It looks like the AIs faked the data in order to get positive feedback from humans looking for algorithmic improvements.

</li><li>AIs assigned to make money in various ways (e.g., to find profitable trading strategies) doing so by finding security exploits, getting unauthorized access to others&#x2019; bank accounts, and stealing money.

</li><li>AIs forming relationships with the humans training them, and trying (sometimes successfully) to emotionally manipulate the humans into giving positive feedback on their behavior. They also might try to manipulate the humans into running more copies of them, into <a href="https://www.washingtonpost.com/technology/2022/06/11/google-ai-lamda-blake-lemoine/">refusing to shut them off</a>, etc.- things that are generically useful for the AIs&#x2019; achieving whatever <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#why-we-might-not-get-clear-warning-signs">aims</a> they might be developing.
</li>
</ul>
<details id="Box4"><summary>(Click to expand) Why AIs might do deceptive, problematic things like this<!-- (Details not included in email - <a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#Box4">click to view on the web</a>)</em>--></summary><div>

<p>In a previous piece, I highlighted that <strong>modern AI development is essentially based on &quot;training&quot; via <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#Box3">trial-and-error</a>.</strong> To oversimplify, you can imagine that:</p>

<ul>

<li>An AI system is given some sort of task.

</li><li>The AI system tries something, initially something pretty random.

</li><li>The AI system gets information about how well its choice performed, and/or what would&#x2019;ve gotten a better result. Based on this, it adjusts itself. You can think of this as if it is &#x201C;encouraged/discouraged&#x201D; to get it to do more of what works well.  
<ul>
 
<li>Human judges may play a significant role in determining which answers are encouraged vs. discouraged, especially for fuzzy goals like &#x201C;Produce helpful scientific insights.&#x201D; 
</li> 
</ul>

</li><li>After enough tries, the AI system becomes good at the task. 

</li><li>But nobody really knows anything about <em>how or why</em> it&#x2019;s good at the task now. The development work has gone into building a flexible architecture for it to learn well from trial-and-error, and into &#x201C;training&#x201D; it by doing all of the trial and error. We mostly can&#x2019;t &#x201C;look inside the AI system to see how it&#x2019;s thinking.&#x201D;</li></ul>

<p>I then argue that:</p>

<ul>

<li>Because we ourselves will often be misinformed or confused, we will sometimes give <em>negative</em> reinforcement to AI systems that are actually acting in our best interests and/or giving accurate information, and <em>positive</em> reinforcement to AI systems whose behavior <em>deceives</em> us into thinking things are going well. This means we will be, unwittingly, training AI systems to deceive and manipulate us. 

</li><li>For this and other reasons, powerful AI systems will likely end up with aims other than the ones we intended. Training by trial-and-error is slippery: the positive and negative reinforcement we give AI systems will probably not end up training them just as we hoped.</li></ul>

<p>
There are a number of things such AI systems might end up aiming for, such as:
</p>
<ul>

<li>Power and resources. These tend to be useful for most goals, such that AI systems could be quite consistently be getting better reinforcement when they habitually pursue power and resources.

</li><li>Things like &#x201C;digital representations of human approval&#x201D; (after all, every time an AI gets positive reinforcement, there&#x2019;s a digital representation of human approval).
</li>
</ul>

<p>In sum, we could be unwittingly training AI systems to accumulate power and resources, get good feedback from humans, etc. - even when this means deceiving and manipulating humans to do so.</p>
<p>More: <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">Why would AI &quot;aim&quot; to defeat humanity?</a></p></div>
</details>
<p>
<strong>&#x201C;Solutions&#x201D; to these safety/alignment problems. </strong>When problems like the above are discovered, AI companies tend to respond similarly to how they did <a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#early-solutions">earlier</a>:
</p>
<ul>

<li>Training AIs against the undesirable behavior.

</li><li>Trying to create more (simulated) situations under which AIs might behave in these undesirable ways, and training them against doing so.
</li>
</ul>
<p>
These methods &#x201C;work&#x201D; in the sense that the concerning events become less frequent - as far as we can tell. But what&#x2019;s really happening is that AIs are being trained to be more careful not to get <em>caught</em> doing things like this, and to build more sophisticated models of how humans can interfere with their plans. 
</p>
<p>
In fact, AIs are gaining incentives to avoid incidents like &#x201C;Doing something counter to human developers&#x2019; intentions in order to get positive feedback, and having this be discovered and given negative feedback later&#x201D; - and this means they are starting to plan more and more around the long-run consequences of their actions. They are thinking less about &#x201C;Will I get positive feedback at the end of the day?&#x201D; and more about &#x201C;Will I eventually end up in a world where humans are going back, far in the future, to give me retroactive negative feedback for today&#x2019;s actions?&#x201D; This might give direct incentives to start aiming for eventual <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">defeat of humanity</a>, since defeating humanity could allow AIs to give themselves lots of retroactive positive feedback.
</p>
<p>
One way to think about it: AIs being trained in this way are generally moving from &#x201C;Steal money whenever there&#x2019;s an opportunity&#x201D; to &#x201C;Don&#x2019;t steal money if there&#x2019;s a good chance humans will eventually uncover this - instead, think way ahead and look for opportunities to steal money and get away with it <em>permanently</em>.&#x201D; The latter could include simply stealing money in ways that humans are unlikely to ever notice; it might also include waiting for an opportunity to team up with other AIs and <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">disempower humans entirely</a>, after which a lot more money (or whatever) can be generated.
</p>
<p id="debates">
<strong>Debates. </strong>The leading AI companies are aggressively trying to build and deploy more powerful AI, but a number of people are raising alarms and warning that continuing to do this could result in disaster. Here&#x2019;s a stylized sort of debate that might occur:
</p>
<p>
A: Great news, our AI-assisted research team has discovered even more improvements than expected! We should be able to build an AI model 10x as big as the state of the art in the next few weeks. 
</p>
<p>
B: I&#x2019;m getting really concerned about the direction this is heading. I&#x2019;m worried that if we make an even bigger system and license it to all our existing customers - military customers, financial customers, etc. - we could be headed for a disaster.
</p>
<p>
A: Well the disaster I&#x2019;m trying to prevent is competing AI companies getting to market before we do.
</p>
<p>
B: I was thinking of <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">AI defeating all of humanity</a>.
</p>
<p>
A: Oh, I was worried about that for a while too, but our safety training has really been incredibly successful. 
</p>
<p>
B: It has? I was just talking to our <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/#digital-neuroscience">digital neuroscience</a> lead, and she says that even with recent help from AI &#x201C;virtual scientists,&#x201D; they still aren&#x2019;t able to reliably read a single AI&#x2019;s digital brain. They were showing me this old incident report where an AI stole money, and they spent like a week analyzing that AI and couldn&#x2019;t explain in any real way how or why that happened.
</p>
<details id="Box5"><summary>(Click to expand) How &quot;digital neuroscience&quot; could help <!--(Details not included in email - <a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#Box5">click to view on the web)</a></em>--></summary>
    <div>

<p>
I&#x2019;ve <a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#Box3">argued</a> that it could be inherently difficult to measure whether AI systems are safe, for reasons such as: AI systems that are <em>not deceptive </em>probably look like AI systems that are <em>so good at deception that they hide all evidence of it</em>, in any way we can easily measure.<strong> </strong>
</p>
<p>
Unless we can &#x201C;read their minds!&#x201D;
</p>
<p>
Currently, today&#x2019;s leading AI research is in the genre of <a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#Box4">&#x201C;black-box trial-and-error.&#x201D;</a> An AI tries a task; it gets &#x201C;encouragement&#x201D; or &#x201C;discouragement&#x201D; based on whether it does the task well; it tweaks the wiring of its &#x201C;digital brain&#x201D; to improve next time; it improves at the task; but we humans aren&#x2019;t able to make much sense of its &#x201C;digital brain&#x201D; or say much about its &#x201C;thought process.&#x201D; 
</p>
<p>
Some AI research (<a href="https://www.transformer-circuits.pub/2022/mech-interp-essay/index.html">example</a>)<sup id="fnref2"><a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5#fn2" rel="footnote">2</a></sup> is exploring how to change this - how to decode an AI system&#x2019;s &#x201C;digital brain.&#x201D; This research is in relatively early stages - today, it can &#x201C;decode&#x201D; only parts of AI systems (or fully decode very small, deliberately simplified AI systems).
</p>
<p>
As AI systems advance, it might get harder to decode them - or easier, if we can start to use AI for help decoding AI, and/or change AI design techniques so that AI systems are less &#x201C;black box&#x201D;-ish. 
</p>
<p><a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/#digital-neuroscience">More</a></p></div>
</details>
<p>
A: I agree that&#x2019;s unfortunate, but digital neuroscience has always been a speculative, experimental department. Fortunately, we have actual data on safety. Look at this chart - it shows the frequency of concerning incidents plummeting, and it&#x2019;s extraordinarily low now. In fact, the more powerful the AIs get, the less frequent the incidents get - we can project this out and see that if we train a big enough model, it should essentially never have a concerning incident!
</p>
<p>
B: But that could be because the AIs are getting cleverer, more patient and long-term, and hence better at ensuring we never catch them.
</p>
<details id="Box6"><summary>(Click to expand) The Lance Armstrong problem: is the AI <em>actually safe</em> or <em>good at hiding its dangerous actions</em>? <!--(Details not included in email - <a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#Box6">click to view on the web)</a></em>--></summary><div>

<p>
Let&apos;s imagine that:
</p>
<ul>

<li>We have AI systems available that can do roughly everything a human can, with some different strengths and weaknesses but no huge difference in &quot;overall capabilities&quot; or economic value per hour of work. 

</li><li>We&apos;re observing early signs that AI systems behave in unintended, deceptive ways, such as giving wrong answers to questions we ask, or writing software that falsifies metrics instead of doing the things the metrics were supposed to measure (e.g., software meant to make a website run faster might instead falsify metrics about its loading time).
</li>
</ul>
<p>
We theorize that modifying the AI training in some way<sup id="fnref6"><a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#fn6" rel="footnote">6</a></sup> will make AI systems less likely to behave deceptively. We try it out, and find that, in fact, our AI systems seem to be behaving better than before - we are finding fewer incidents in which they behaved in unintended or deceptive ways. 
</p>
<p>
But that&apos;s just a statement about <em>what we&apos;re noticing</em>. Which of the following just happened:
</p>
<ul>

<li>Did we just train our AI systems to be less deceptive?

</li><li>Did we just train our AI systems to be <em>better at</em> deception, and so to make us <em>think</em> they became less deceptive?

</li><li>Did we just train our AI systems to be better at calculating when they might get caught in deception, and so to be less deceptive only when the deception would otherwise be caught? 
<ul>
 
<li>This one could be useful! Especially if we&apos;re able to set up auditing systems in many real-world situations, such that we <em>could</em> expect deception to be caught a lot of the time. But it does leave open the <a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#The-King-Lear-problem">King Lear problem</a>.
    
</li> 
</ul>
</li> 
</ul>
<p>
(...Or some combination of the three?)
</p>
<p>
We&apos;re hoping to be able to deploy AI systems throughout the economy, so - just like human specialists - they will almost certainly have some opportunities to be deceptive without being caught. The fact that they <em>appear honest in our testing</em> is not clear comfort against this risk.
</p>
<p>
The analogy here is to competitive cyclist <a href="https://en.wikipedia.org/wiki/Lance_Armstrong">Lance Armstrong</a>. Armstrong won the Tour de France race 7 times in a row, while many of his competitors were caught using performance-enhancing drugs and disqualified. But more than 5 years after his last win, an investigation &quot;concluded that Armstrong had used performance-enhancing drugs over the course of his career[5] and named him as the ringleader of &apos;the most sophisticated, professionalized and successful doping program that sport has ever seen&apos;.&quot; Now the list of Tour de France winners looks like this:
</p>
<p>

<img src="https://www.cold-takes.com/content/images/size/w1600/2022/12/image1.png" width alt="How we could stumble into AI catastrophe" title="image_tooltip">

</p>
    </div>

</details>
<p>
&#x2026; Or just that they&#x2019;re now advanced enough that they&#x2019;re waiting for a chance to disempower humanity entirely, rather than pull a bunch of small-time shenanigans that tip us off to the danger.
</p>
<details id="Box7"><summary>(Click to expand) The King Lear problem: how do you test what will happen when it&apos;s no longer a test? <!--(Details not included in email - <a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#Box7">click to view on the web)</a></em>--></summary><div>
<p>
The Shakespeare play <a href="https://en.wikipedia.org/wiki/King_Lear">King Lear</a> opens with the King (Lear) stepping down from the throne, and immediately learning that he has left his kingdom to the wrong two daughters. Loving and obsequious while he was deciding on their fate,<sup id="fnref9"><a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#fn9" rel="footnote">9</a></sup> they reveal their contempt for him as soon as he&apos;s out of power and they&apos;re in it.
</p>
<p>
If we&apos;re building AI systems that can reason like humans, dynamics like this become a potential issue. 
</p>
<p>
I <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#existential-risks-to-humanity">previously</a> noted that an AI with <em>any</em> ambitious <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#what-it-means-for">aim</a> - or just an AI that wants to avoid being shut down or modified - might calculate that the best way to do this is by behaving helpfully and safely in all &quot;tests&quot; humans can devise. But once there is a real-world opportunity to disempower humans for good, that same aim <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#existential-risks-to-humanity">could cause the AI to disempower humans.</a>
</p>
<p>
In other words:
</p>
<ul>

<li>(A) When we&apos;re developing and testing AI systems, we have the power to decide which systems will be modified or shut down and which will be deployed into the real world. (Like King Lear deciding who will inherit his kingdom.)

</li><li>(B) But at some later point, these systems could be operating in the economy, in high numbers with a lot of autonomy. (This possibility is spelled out/visualized a bit more <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/#how-this-could-work-if-humans-create-a-huge-population-of-ais">here</a> and <a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to#As_humans__control_fades__Alex_would_be_motivated_to_take_over">here</a>.) At that point, they may have opportunities to <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">defeat all of humanity</a> such that we never make decisions about them again. (Like King Lear&apos;s daughters after they&apos;ve taken control.)
</li>
</ul>

<p><strong>If AI systems can detect the difference between (A) and (B) above, then their behavior might systematically change from one to the other - and there&apos;s no clear way to <em>test</em> their behavior in (B).</strong></p><div>


</div></div></details>
<p>
A: What&#x2019;s your evidence for this?
</p>
<p>
B: I think you&#x2019;ve got things backward - we should be asking what&#x2019;s our evidence *against* it. By continuing to scale up and deploy AI systems, we could be imposing a risk of utter catastrophe on the whole world. That&#x2019;s not OK - we should be confident that the risk is <em>low</em> before we move forward.
</p>
<p>
A: But how would we even be confident that the risk is low?
</p>
<p>
B: I mean, digital neuroscience - 
</p>
<p>
A: Is an experimental, speculative field!
</p>
<p>
B: We could also try some <a href="https://www.alignmentforum.org/posts/rCJQAkPTEypGjSJ8X/how-might-we-align-transformative-ai-if-it-s-developed-very#Testing_and_threat_assessment">other stuff</a> &#x2026;
</p>
<p>
A: All of that stuff would be expensive, difficult and speculative. 
</p>
<p>
B: Look, I just think that if we can&#x2019;t show the risk is low, we shouldn&#x2019;t be moving forward at this point. The stakes are incredibly high, as you yourself have acknowledged - when pitching investors, you&#x2019;ve said we think we can build a fully general AI and that this would be the most powerful technology in history. Shouldn&#x2019;t we be at least taking as much precaution with potentially dangerous AI as people take with nuclear weapons?
</p>
<p>
A: What would that actually accomplish? It just means some other, less cautious company is going to go forward.
</p>
<p>
B: What about approaching the government and lobbying them to regulate all of us?
</p>
<p>
A: Regulate all of us to just stop building more powerful AI systems, until we can address some theoretical misalignment concern that we don&#x2019;t know how to address?
</p>
<p>
B: Yes?
</p>
<p>
A: All that&#x2019;s going to happen if we do that is that other countries are going to catch up to the US. Think [insert authoritarian figure from another country] is going to adhere to these regulations?
</p>
<p>
B: It would at least buy some time?
</p>
<p>
A: Buy some time and burn our chance of staying on the cutting edge. While we&#x2019;re lobbying the government, our competitors are going to be racing forward. I&#x2019;m sorry, this isn&#x2019;t practical - we&#x2019;ve got to go full speed ahead.
</p>
<p>
B: Look, can we at least try to tighten our security? If you&#x2019;re so worried about other countries catching up, we should really not be in a position where they can send in a spy and get our code.
</p>
<p>
A: Our security is pretty intense already.
</p>
<p>
B: Intense enough to stop a well-resourced state project?
</p>
<p>
A: What do you want us to do, go to an underground bunker? Use <a href="https://bluexp.netapp.com/blog/aws-cvo-blg-aws-govcloud-services-sensitive-data-on-the-public-cloud#H_H3">airgapped</a> servers (servers on our premises, entirely disconnected from the public Internet)? It&#x2019;s the same issue as before - we&#x2019;ve got to stay ahead of others, we can&#x2019;t burn huge amounts of time on exotic security measures.
</p>
<p>
B: I don&#x2019;t suppose you&#x2019;d at least consider increasing the percentage of our budget and headcount that we&#x2019;re allocating to the &#x201C;speculative&#x201D; safety research? Or are you going to say that we need to stay ahead and can&#x2019;t afford to spare resources that could help with that?
</p>
<p>
A: Yep, that&#x2019;s what I&#x2019;m going to say.
</p>
<p>
<strong>Mass deployment. </strong>As time goes on, many versions of the above debate happen, at many different stages and in many different places. By and large, people continue rushing forward with building more and more powerful AI systems and deploying them all throughout the economy.
</p>
<p>
At some point, there are AIs that closely manage major companies&#x2019; financials, AIs that write major companies&#x2019; business plans, AIs that work closely with politicians to propose and debate laws, AIs that manage drone fleets and develop military strategy, etc. Many of these AIs are primarily built, trained, and deployed by other AIs, or by humans leaning heavily on AI assistance.
</p>
<p>
<strong>More intense warning signs.</strong>
</p>
<p>
(Note: I think it&#x2019;s possible that progress will accelerate <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/#explosive-scientific-and-technological-advancement">explosively enough</a><strong> </strong>that we won&#x2019;t even get as many warning signs as there are below, but I&#x2019;m spelling out a number of possible warning signs anyway to make the point that even intense warning signs might not be enough.)<strong> </strong>
</p>
<p>
Over time, in this hypothetical scenario, <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/#digital-neuroscience">digital neuroscience</a> becomes more effective. When applied to a randomly sampled AI system, it often appears to hint at something like: &#x201C;This AI appears to be aiming for as much power and influence over the world as possible - which means never doing things humans wouldn&#x2019;t like <em>if humans can detect it</em>, but grabbing power when they can get away with it.&#x201D; 
</p>
<details id="Box8"><summary>(Click to expand) Why would AI &quot;aim&quot; to defeat humanity? <!--(Details not included in email - <a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#Box8">click to view on the web</a>)</em>--></summary><div>

<p>
A <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">previous piece</a> argued that if today&#x2019;s AI development methods lead directly to powerful enough AI systems, disaster is likely by default (in the absence of specific countermeasures). 
</p>
<p>
In brief:
</p>
<ul>
<li>Modern AI development is essentially based on &#x201C;training&#x201D; via trial-and-error. 
<p></p>
<p>
<li>If we move forward incautiously and ambitiously with such training, and if it gets us all the way to very powerful AI systems, then such systems will likely end up <em>aiming for certain states of the world</em> (analogously to how a chess-playing AI aims for checkmate).
</li></p>
<p>
<li>And these states will be<em> other than the ones we intended</em>, because our trial-and-error training methods won&#x2019;t be accurate. For example, when we&#x2019;re confused or misinformed about some question, we&#x2019;ll reward AI systems for giving the wrong answer to it - unintentionally training deceptive behavior.
</li></p>
<p>
<li>We should expect disaster if we have AI systems that are both (a) <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">powerful enough</a> to defeat humans and (b) aiming for states of the world that we didn&#x2019;t intend. (&#x201C;Defeat&#x201D; means taking control of the world and doing what&#x2019;s necessary to keep us out of the way; it&#x2019;s unclear to me whether we&#x2019;d be literally killed or just forcibly stopped<sup id="fnref1"><a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5#fn1" rel="footnote">1</a></sup> from changing the world in ways that contradict AI systems&#x2019; aims.)</li></p></li></ul>
<p>More: <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">Why would AI &quot;aim&quot; to defeat humanity?</a></p></div>
</details>
<p>
However, there is room for debate in what a &#x201C;digital brain&#x201D; truly shows:
</p>
<ul>

<li>Many people are adamant that the readings are unreliable and misleading.

</li><li>Some people point out that humans are <em>also</em> interested in power and influence, and often think about what they can and can&#x2019;t get away with, but this doesn&#x2019;t mean they&#x2019;d take over the world if they could. They say the AIs might be similar.

</li><li>There are also cases of people doing digital neuroscience that claims to show that AIs are totally safe. These could be people like &#x201C;A&#x201D; above who want to focus on pushing forward with AI development rather than bringing it to a halt, or people who just find the alarmists annoying and like to contradict them, or people who are just sloppy with their research. Or people who have been manipulated or bribed by AIs themselves.
</li>
</ul>
<p>
There are also very concerning incidents, such as:
</p>
<ul>

<li>An AI steals a huge amount of money by bypassing the security system at a bank - and it turns out that this is because the security system was disabled by AIs <em>at the bank</em>. It&#x2019;s suspected, maybe even proven, that all these AIs had been communicating and coordinating with each other in code, such that humans would have difficulty detecting it. (And they had been <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">aiming</a> to divide up the funds between the different participating AIs, each of which could stash them in a bank account and use them to pursue whatever <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#unintended-aims">unintended aims they might have</a>.)

</li><li>An obscure new political party, devoted to the &#x201C;rights of AIs,&#x201D; completely takes over a small country, and many people suspect that this party is made up mostly or entirely of people who have been manipulated and/or bribed by AIs. 

</li><li>There are companies that own huge amounts of AI servers and robot-operated factories, and are aggressively building more. Nobody is sure what the AIs or the robots are &#x201C;for,&#x201D; and there are rumors that the humans &#x201C;running&#x201D; the company are actually being bribed and/or threatened to carry out instructions (such as creating more and more AIs and robots) that they don&#x2019;t understand the purpose of.
</li>
</ul>
<p>
At this point, there are a lot of people around the world calling for an immediate halt to AI development. But:
</p>
<ul>

<li>Others resist this on all kinds of grounds, e.g. &#x201C;These concerning incidents are anomalies, and what&#x2019;s important is that our country keeps pushing forward with AI before others do,&#x201D; etc.

</li><li>Anyway, it&#x2019;s just too late. Things are moving incredibly quickly; by the time one concerning incident has been noticed and diagnosed, the AI behind it has been greatly improved upon, and the total amount of AI influence over the economy has continued to grow.
</li>
</ul>
<p>
<strong>Defeat. </strong>
</p>
<p>
(Noting again that I could imagine things playing out a lot more <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/#the-standard-argument-superintelligence-and-advanced-technology">quickly and suddenly</a> than in this story.)
</p>
<p>
It becomes more and more common for there to be companies and even countries that are clearly just run entirely by AIs - maybe via bribed/threatened human surrogates, maybe just forcefully (e.g., robots seize control of a country&#x2019;s military equipment and start enforcing some new set of laws).
</p>
<p>
At some point, it&#x2019;s best to think of civilization as containing two different advanced species - humans and AIs - with the AIs having essentially all of the power, making all the decisions, and running everything. 
</p>
<p>
Spaceships start to spread throughout the galaxy; they generally don&#x2019;t contain any humans, or anything that humans had meaningful input into, and are instead launched by AIs to pursue aims of their own in space.
</p>
<p>
Maybe at some point humans are killed off, largely due to simply being a nuisance, maybe even accidentally (as humans have driven many species of animals extinct while not bearing them malice). Maybe not, and we all just live under the direction and control of AIs with no way out.
</p>
<p>
What do these AIs <em>do</em> with all that power? What are all the robots up to? What are they building on other planets? The short answer is that I don&#x2019;t know.
</p>
<ul>

<li>Maybe they&#x2019;re just creating massive amounts of &#x201C;digital representations of human approval,&#x201D; because this is what they were historically trained to seek (kind of like how humans sometimes do whatever it takes to get drugs that will get their brains into certain states).

</li><li>Maybe they&#x2019;re competing with each other for pure power and territory, because their training has encouraged them to seek power and resources when possible (since power and resources are generically useful, for almost any set of aims).

</li><li>Maybe they have a whole bunch of different things they value, as humans do, that are sort of (but only sort of) related to what they were trained on (as humans tend to value things like sugar that made sense to seek out in the past). And they&#x2019;re filling the universe with these things.
</li>
</ul>
<details id="Box9"><summary>(Click to expand) What sorts of aims might AI systems have? <!--(Details not included in email - <a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#Box9">click to view on the web</a>)</em>--></summary>
    <div>

In a <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">previous piece</a>, I discuss why AI systems might form unintended, ambitious &quot;aims&quot; of their own. By &quot;aims,&quot; I mean particular states of the world that AI systems make choices, calculations and even plans to achieve, much like a chess-playing AI &#x201C;aims&#x201D; for a checkmate position.

<p>
An analogy that often comes up on this topic is that of human evolution. This is arguably the only previous precedent for <em>a set of minds [humans], with extraordinary capabilities [e.g., the ability to develop their own technologies], developed essentially by black-box trial-and-error [some humans have more &#x2018;reproductive success&#x2019; than others, and this is the main/only force shaping the development of the species].</em>
</p>
<p>
You could sort of<sup id="fnref12"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn12" rel="footnote">12</a></sup> think of the situation like this: &#x201C;An AI<sup id="fnref13"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn13" rel="footnote">13</a></sup> developer named Natural Selection tried giving humans positive reinforcement (making more of them) when they had more reproductive success, and negative reinforcement (not making more of them) when they had less. One might have thought this would lead to humans that are aiming to have reproductive success. Instead, it led to humans that aim - often ambitiously and creatively - for other things, such as power, status, pleasure, etc., and even invent things like birth control to get the things they&#x2019;re aiming for instead of the things they were &#x2018;supposed to&#x2019; aim for.&#x201D; 
</p>
<p>
Similarly, if our main strategy for developing powerful AI systems is to reinforce behaviors like &#x201C;Produce technologies we find valuable,&#x201D; the hoped-for result might be that AI systems aim (in the sense described <a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#unintended-aims">above</a>) toward producing technologies we find valuable; but the actual result might be that they aim for some other set of things that is correlated with (but not the same as) the thing we intended them to aim for.
</p>
<p>
There are a lot of things they might end up aiming for, such as:
</p>
<ul>

<li>Power and resources. These tend to be useful for most goals, such that AI systems could be quite consistently be getting better reinforcement when they habitually pursue power and resources.

</li><li>Things like &#x201C;digital representations of human approval&#x201D; (after all, every time an AI gets positive reinforcement, there&#x2019;s a digital representation of human approval).
</li>
</ul>
<p></p>
<p>More: <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">Why would AI &quot;aim&quot; to defeat humanity?</a></p></div>

</details>
<p>
<span style="text-decoration:underline;">END OF HYPOTHETICAL SCENARIO</span>
</p>
<h2 id="potential-catastrophes-from-aligned-ai">Potential catastrophes from <em>aligned</em> AI</h2>


<p>
I think it&#x2019;s possible that misaligned AI (AI forming dangerous goals of its own) will turn out to be pretty much a non-issue. That is, I don&#x2019;t think the <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">argument I&#x2019;ve made for being concerned</a> is anywhere near watertight. 
</p>
<p>
What happens if you train an AI system by trial-and-error, giving (to oversimplify) a &#x201C;thumbs-up&#x201D; when you&#x2019;re happy with its behavior and a &#x201C;thumbs-down&#x201D; when you&#x2019;re not? I&#x2019;ve <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">argued</a> that you might be training it to deceive and manipulate you. However, this is uncertain, and - especially if you&#x2019;re able to avoid <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#deceiving-and-manipulating">errors </a>in how you&#x2019;re giving it feedback - things might play out differently. 
</p>
<p>
It might turn out that this kind of training just works as intended, producing AI systems that do something like &#x201C;Behave as the human would want, if they had all the info the AI has.&#x201D; And the nitty-gritty details of how <em>exactly</em> AI systems are trained (beyond the high-level <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#Box3">&#x201C;trial-and-error&#x201D; idea</a>) could be crucial.
</p>
<p>
If this turns out to be the case, I think the future looks a lot brighter - but there are still lots of pitfalls of the kind I outlined in <a href="https://www.cold-takes.com/transformative-ai-issues-not-just-misalignment-an-overview/">this piece</a>. For example:
</p>
<ul>

<li>Perhaps an authoritarian government launches a huge state project to develop AI systems, and/or uses espionage and hacking to steal a cutting-edge AI model developed elsewhere and deploy it aggressively. 
<ul>
 
<li>I <a href="https://www.cold-takes.com/transformative-ai-issues-not-just-misalignment-an-overview/#power-imbalances">previously noted</a> that &#x201C;developing powerful AI a few months before others could lead to having technology that is (effectively) hundreds of years ahead of others&#x2019;.&#x201D;
 
</li><li>So this could put an authoritarian government in an enormously powerful position, with the ability to surveil and defeat any enemies worldwide, and the ability to prolong the life of its ruler(s) indefinitely. This could lead to a very bad future, especially if (as I&#x2019;ve <a href="https://www.cold-takes.com/how-digital-people-could-change-the-world/#lock-in">argued</a> could happen) the future becomes &#x201C;locked in&#x201D; for good.
</li> 
</ul>

</li><li>Perhaps AI companies race ahead with selling AI systems to anyone who wants to buy them, and this leads to things like: 
<ul>
 
<li>People training AIs to act as propaganda agents for whatever views they already have, to the point where the world gets flooded with propaganda agents and it becomes totally impossible for humans to sort the signal from the noise, educate themselves, and generally make heads or tails of what&#x2019;s going on. (Some people think this has already happened! I think things can get quite a lot worse.)
 
</li><li>People training &#x201C;scientist AIs&#x201D; to develop powerful weapons that can&#x2019;t be defended against (even with AI help),<sup id="fnref5"><a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#fn5" rel="footnote">5</a></sup> leading eventually to a dynamic in which ~anyone can cause great harm, and ~nobody can defend against it. At this point, it could be inevitable that we&#x2019;ll blow ourselves up.
 
</li><li>Science advancing to the point where <a href="https://www.cold-takes.com/how-digital-people-could-change-the-world/">digital people</a> are created, in a rushed way such that they are considered property of whoever creates them (no human rights). I&#x2019;ve <a href="https://www.cold-takes.com/how-digital-people-could-change-the-world/">previously written</a> about how this could be bad.
 
</li><li>All other kinds of chaos and disruption, with the least cautious people (the ones most prone to rush forward aggressively deploying AIs to capture resources) generally having an outsized effect on the future.</li></ul></li></ul>
<p>
Of course, this is just a crude gesture in the direction of some of the ways things could go wrong. I&#x2019;m guessing I haven&#x2019;t scratched the surface of the possibilities. And things could go very well too!
</p>
<h2 id="we-can-do-better">We can do better</h2>


<p>
In previous pieces, I&#x2019;ve talked about a number of ways we could do better than in the scenarios above. Here I&#x2019;ll just list a few key possibilities, with a bit more detail in expandable boxes and/or links to discussions in previous pieces.
</p>
<p>
<strong>Strong alignment research (including imperfect/temporary measures). </strong>If we make enough progress <em>ahead of time</em> on alignment research, we might develop measures that make it <em>relatively easy</em> for AI companies to build systems that truly (not just seemingly) are safe. 
</p>
<p>
So instead of having to say things like &#x201C;We should slow down until we make progress on experimental, speculative research agendas,&#x201D; person B in the <a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72/#debates">above dialogue</a> can say things more like &#x201C;Look, all you have to do is add some relatively cheap bells and whistles to your training procedure for the next AI, and run a few extra tests. Then the speculative concerns about misaligned AI will be much lower-risk, and we can keep driving down the risk by using our AIs to help with safety research and testing. Why not do that?&#x201D;
</p>
<p>
More on what this could look like at a previous piece, <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/">High-level Hopes for AI Alignment</a>.
</p>
<details id="Box10"><summary>(Click to expand) High-level hopes for AI alignment <!--(Details not included in email - <a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#Box10">click to view on the web</a>)</em>--></summary><div>
<p>
A <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/">previous piece</a> goes through what I see as three key possibilities for building powerful-but-safe AI systems.
</p>
<p>
It frames these using Ajeya Cotra&#x2019;s <a href="https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/#analogy-the-young-ceo">young businessperson</a> analogy for the core difficulties. In a nutshell, once AI systems get capable enough, it could be hard to test whether they&#x2019;re safe, because they might be able to deceive and manipulate us into getting the wrong read. Thus, trying to determine whether they&#x2019;re safe might be something like &#x201C;being an eight-year-old trying to decide between adult job candidates (some of whom are manipulative).&#x201D;
</p>
<p>Key possibilities for navigating this challenge:</p>
<ul>

<li><strong>Digital neuroscience</strong>: perhaps we&#x2019;ll be able to read (and/or even rewrite) the &#x201C;digital brains&#x201D; of AI systems, so that we can know (and change) what they&#x2019;re &#x201C;aiming&#x201D; to do directly - rather than having to infer it from their behavior. (Perhaps the eight-year-old is a mind-reader, or even a young <a href="https://en.wikipedia.org/wiki/Professor_X#Powers_and_abilities">Professor X</a>.)

</li><li><strong>Limited AI</strong>: perhaps we can make AI systems safe by making them <em>limited</em> in various ways - e.g., by leaving certain kinds of information out of their training, designing them to be &#x201C;myopic&#x201D; (focused on short-run as opposed to long-run goals), or something along those lines. Maybe we can make &#x201C;limited AI&#x201D; that is nonetheless able to carry out particular helpful tasks - such as doing lots more research on how to achieve safety without the limitations. (Perhaps the eight-year-old can limit the authority or knowledge of their hire, and still get the company run successfully.)

</li><li><strong>AI checks and balances</strong>: perhaps we&#x2019;ll be able to employ some AI systems to critique, supervise, and even rewrite others. Even if no single AI system would be safe on its own, the right &#x201C;checks and balances&#x201D; setup could ensure that human interests win out. (Perhaps the eight-year-old is able to get the job candidates to evaluate and critique each other, such that all the eight-year-old needs to do is verify basic factual claims to know who the best candidate is.)
</li>
</ul>
<p>
These are some of the main categories of hopes that are pretty easy to picture today. Further work on AI safety research might result in further ideas (and the above are not exhaustive - see my <a href="https://www.alignmentforum.org/posts/rCJQAkPTEypGjSJ8X/how-might-we-align-transformative-ai-if-it-s-developed-very">more detailed piece</a>, posted to the Alignment Forum rather than Cold Takes, for more).
</p>

    </div>

</details>
<p>
<strong>Standards and monitoring. </strong>A big driver of the <a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72/#how-we-could-stumble-into-catastrophe-from-misaligned-ai">hypothetical catastrophe above </a>is that each individual AI project feels the need to stay ahead of others. Nobody wants to unilaterally slow themselves down in order to be cautious. The situation might be improved if we can <strong>develop a set of standards that AI projects need to meet, and enforce them evenly</strong> - across a broad set of companies or even internationally.
</p>
<p>
This isn&#x2019;t just about buying time, it&#x2019;s about creating <em>incentives</em> for companies to prioritize safety. An analogy might be something like the <a href="https://en.wikipedia.org/wiki/Clean_Air_Act_(United_States)">Clean Air Act</a> or <a href="https://en.wikipedia.org/wiki/Corporate_average_fuel_economy">fuel economy standards</a>: we might not expect individual companies to voluntarily slow down product releases while they work on reducing pollution, but once required, reducing pollution becomes part of what they need to do to be profitable.
</p>
<p>
Standards could be used for things other than alignment risk, as well. AI projects might be required to:
</p>
<ul>

<li>Take strong security measures, preventing states from capturing their models via espionage.

</li><li>Test models before release to understand what people will be able to use them for, and (as if selling weapons) restrict access accordingly.
</li>
</ul>
<p>
More at a <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#global-monitoring">previous piece</a>.
</p>
<details id="Box11"><summary>(Click to expand) How standards might be established and become national or international <!--(Details not included in email - <a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#Box11">click to view on the web</a>)</em>--></summary><div>
<p>
I <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#global-monitoring">previously</a> laid out a possible vision on this front, which I&#x2019;ll give a slightly modified version of here:
</p>
<ul>

<li>Today&#x2019;s leading AI companies could self-regulate by committing not to build or deploy a system that they can&#x2019;t convincingly demonstrate is safe (e.g., see Google&#x2019;s <a href="https://www.theweek.in/news/sci-tech/2018/06/08/google-wont-deploy-ai-to-build-military-weapons-ichai.html">2018 statement</a>, &quot;We will not design or deploy AI in weapons or other technologies whose principal purpose or implementation is to cause or directly facilitate injury to people&#x201D;).  
<ul>
 
<li>Even if some people at the companies would like to deploy unsafe systems, it could be hard to pull this off once the company has committed not to. 
 
</li><li>Even if there&#x2019;s a lot of room for judgment in what it means to demonstrate an AI system is safe, having agreed in advance that <span style="text-decoration:underline;">certain evidence</span> is <em>not</em> good enough could go a long way.
</li> 
</ul>

</li><li>As more AI companies are started, they could feel soft pressure to do similar self-regulation, and refusing to do so is off-putting to potential employees, investors, etc.

</li><li>Eventually, similar principles could be incorporated into various government regulations and enforceable treaties.

</li><li>Governments could monitor for dangerous projects using regulation and even overseas operations. E.g., today the US monitors (without permission) for various signs that other states might be developing nuclear weapons, and might try to stop such development with methods ranging from threats of sanctions to <a href="https://en.wikipedia.org/wiki/Stuxnet">cyberwarfare</a> or even military attacks. It could do something similar for any AI development projects that are using huge amounts of compute and haven&#x2019;t volunteered information about whether they&#x2019;re meeting standards.
</li>
</ul>
    </div>
    </details>
<p>
<strong>Successful, careful AI projects. </strong>I think a single AI company, or other AI project, could enormously improve the situation by being <em>both</em> successful and careful. For a simple example, imagine an AI company in a <em>dominant</em> market position - months ahead of all of the competition, in some relevant sense (e.g., its AI systems are more capable, such that it would take the competition months to catch up). Such a company could put huge amounts of resources - including its money, top people and its advanced AI systems themselves (e.g., AI systems performing roles similar to top human scientists) - into AI safety research, hoping to find safety measures that can be published for everyone to use. It can also take a variety of other measures <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#defensive-deployment">laid out in a previous piece</a>.
</p>
<details id="Box12"><summary>(Click to expand) How a careful AI project could be helpful <!--(Details not included in email - <a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#Box12">click to view on the web</a>)</em>--></summary>
<div><p>
In addition to using advanced AI to do AI safety research (noted above), an AI project could:
</p>
<ul>

<li>Put huge effort into designing <em>tests </em>for signs of danger, and - if it sees danger signs in its own systems - warning the world as a whole.

</li><li>Offer deals to other AI companies/projects. E.g., acquiring them or exchanging a share of its profits for enough visibility and control to ensure that they don&#x2019;t deploy dangerous AI systems.

</li><li>Use its credibility as the leading company to lobby the government for helpful measures (such as enforcement of a <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#global-monitoring">monitoring-and-standards regime</a>), and to more generally highlight key issues and advocate for sensible actions.

</li><li>Try to ensure (via design, marketing, customer choice, etc.) that its AI systems are not used for dangerous ends, and <em>are</em> used on applications that make the world safer and better off. This could include <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#global-monitoring">defensive deployment</a> to reduce risks from other AIs; it could include using advanced AI systems to help it gain clarity on how to get a good outcome for humanity; etc.
</li>
</ul>
<p>
An AI project with a dominant market position could likely make a huge difference via things like the above (and probably via many routes I haven&#x2019;t thought of). And even an AI project that is merely <em>one of several leaders</em> could have enough resources and credibility to have a lot of similar impacts - especially if it&#x2019;s able to &#x201C;lead by example&#x201D; and persuade other AI projects (or make deals with them) to similarly prioritize actions like the above.
</p>
<p>
A challenge here is that I&#x2019;m envisioning a project with two arguably contradictory properties: being <em>careful</em> (e.g., prioritizing actions like the above over just trying to maintain its position as a profitable/cutting-edge project) and <em>successful</em> (being a profitable/cutting-edge project). In practice, it could be very hard for an AI project to walk the tightrope of being aggressive enough to be a &#x201C;leading&#x201D; project (in the sense of having lots of resources, credibility, etc.), while also prioritizing actions like the above (which mostly, with some exceptions, seem pretty different from what an AI project would do if it were simply focused on its technological lead and profitability).
</p>
    </div>
    </details>
<p>
<strong>Strong security. </strong>A key threat in the above scenarios is that an incautious actor could &#x201C;steal&#x201D; an AI system from a company or project that would otherwise be careful. My understanding is that it could be extremely hard for an AI project to be robustly safe against this outcome (more <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/#fn15">here</a>). But this could change, if there&#x2019;s enough effort to work out the problem of how to develop a large-scale, powerful AI system that is very hard to steal.
</p>
<p>
In future pieces, I&#x2019;ll get more concrete about what specific people and organizations can do <em>today</em> to improve the odds of factors like these going well, and overall to raise the odds of a good outcome.
</p>

<!--kg-card-end: html--><!--kg-card-begin: html-->

<p><div style="display:flex; justify-content:center; margin: 0 auto;">

        <span style="margin: 10px;"><a href="https://api.addthis.com/oexchange/0.8/forward/twitter/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fhow-we-could-stumble-into-ai-catastrophe&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20How%20we%20could%20stumble%20into%20AI%20catastrophe&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-twitter-square.png" border="0" alt="How we could stumble into AI catastrophe"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/facebook/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fhow-we-could-stumble-into-ai-catastrophe&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20How%20we%20could%20stumble%20into%20AI%20catastrophe&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-facebook-square.png" border="0" alt="How we could stumble into AI catastrophe"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/reddit/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fhow-we-could-stumble-into-ai-catastrophe&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20How%20we%20could%20stumble%20into%20AI%20catastrophe&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-reddit-square.png" border="0" alt="How we could stumble into AI catastrophe"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/menu/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fhow-we-could-stumble-into-ai-catastrophe&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20How%20we%20could%20stumble%20into%20AI%20catastrophe&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-addthis-square.png" border="0" alt="How we could stumble into AI catastrophe"></a></span>
        </div>
<center><p id="discuss"><!--<a href="https://www.cold-takes.com/how-we-could-stumble-into-ai-catastrophe#subscribe" target="_blank"><button id="footer-subscribe" class="button">Subscribe</button></a>&nbsp;<a href="https://www.guidedtrack.com/programs/4kal2ue/run?posttitle=How%20we%20could%20stumble%20into%20AI%20catastrophe" target="_blank"><button class="button" id="Survey">Feedback</button></a>-->&#xA0;<a href="https://www.lesswrong.com/posts/slug/how-we-could-stumble-into-ai-catastrophe#comments" target="_blank"><button class="button">Comment/discuss</button></a></p><p><!--<em>
Use "Feedback" if you have comments/suggestions you want me to see, or if you're up for giving some quick feedback about this post (which I greatly appreciate!) Use "Forum" if you want to discuss this post publicly on the Effective Altruism Forum.
</em>--></p></center>
<!--kg-card-end: html--><!--kg-card-begin: html-->
</p><h2 id="footnotes">Notes</h2>
<div class="footnotes">
<hr>
<ol><li id="fn1">
<p>
     E.g., <a href="https://www.lesswrong.com/posts/AfH2oPHCApdKicM4m/two-year-update-on-my-personal-ai-timelines">Ajeya Cotra </a>gives a 15% probability of transformative AI by 2030; eyeballing figure 1 from <a href="https://arxiv.org/pdf/1705.08807.pdf">this chart</a> on expert surveys implies a &gt;10% chance by 2028.&#xA0;<a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#fnref1" rev="footnote">&#x21A9;</a><li id="fn2">
<p>
     To predict early AI applications, we need to ask not just &#x201C;What tasks will AI be able to do?&#x201D; but &#x201C;How will this compare to all the other ways people can get the same tasks done?&#x201D; and &#x201C;How practical will it be for people to switch their workflows and habits to accommodate new AI capabilities?&#x201D;
</p><p>
    By contrast, I think the implications of <em>powerful enough</em> AI for productivity don&#x2019;t rely on this kind of analysis - very high-level economic reasoning can tell us that being able to cheaply copy something with human-like R&amp;D capabilities would lead to <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/#explosive-scientific-and-technological-advancement">explosive progress</a>.
</p><p>
    FWIW, I think it&#x2019;s fairly common for high-level, long-run predictions to be <em>easier</em> than detailed, short-run predictions. Another example: I think it&#x2019;s easier to predict a general trend of planetary warming (<a href="https://www.ipcc.ch/report/ar6/wg2/">this seems very likely</a>) than to predict whether it&#x2019;ll be rainy next weekend.&#xA0;<a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#fnref2" rev="footnote">&#x21A9;</a><li id="fn3">
<p>
     <a href="https://www.anthropic.com/constitutional.pdf">Here&#x2019;s an early example</a> of AIs providing training data for each other/themselves.&#xA0;<a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#fnref3" rev="footnote">&#x21A9;</a><li id="fn4">
<p>
     <a href="https://github.com/features/copilot">Example of AI helping to write code</a>.&#xA0;<a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#fnref4" rev="footnote">&#x21A9;</a><li id="fn5">

<p>
     To be clear, I have no idea whether this is possible! It&#x2019;s not obvious to me that it would be dangerous for technology to progress a lot and be used widely for both offense and defense. It&#x2019;s just a risk I&#x2019;d rather not incur casually via indiscriminate, rushed AI deployments.&#xA0;<a href="https://www.cold-takes.com/p/55d4c8c4-7315-4e40-b3f9-b7683b839c72#fnref5" rev="footnote">&#x21A9;</a>

</p></li></p></li></p></li></p></li></p></li></ol></div>

<!--kg-card-end: html-->]]></content:encoded></item><item><title><![CDATA[Transformative AI issues (not just misalignment): an overview]]></title><description><![CDATA[An overview of key potential factors (not just alignment risk) for whether things go well or poorly with transformative AI.

https://www.cold-takes.com/transformative-ai-issues-not-just-misalignment-an-overview/]]></description><link>https://www.cold-takes.com/transformative-ai-issues-not-just-misalignment-an-overview/</link><guid isPermaLink="false">63add1de9a951a003d4e3602</guid><category><![CDATA[ImplicationsOfMostImportantCentury]]></category><dc:creator><![CDATA[Holden Karnofsky]]></dc:creator><pubDate>Thu, 05 Jan 2023 20:16:53 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: html--><p><figure><div id="buzzsprout-player-11987328"></div><script src="https://www.buzzsprout.com/1851795/11987328-transformative-ai-issues-not-just-misalignment-an-overview.js?container_id=buzzsprout-player-11987328&amp;player=small" type="text/javascript" charset="utf-8"></script><figcaption><em>Click lower right to download or find on Apple Podcasts, Spotify, Stitcher, etc.</em></figcaption></figure></p>


<p>
If this ends up being the <a href="https://www.cold-takes.com/most-important-century/">most important century</a> due to advanced AI, what are the key factors in whether things go well or poorly?
</p>
<details id="Box1"><summary>(Click to expand) More detail on why AI could make this the most important century<!-- (Details not included in email - <a href="https://www.cold-takes.com/p/b2a3d837-24f1-46ae-9e45-1fac54411b03#Box1">click to view on the web</a>)--></summary>
<div><p>
In <a href="https://www.cold-takes.com/most-important-century/">The Most Important Century</a>, I argued that the 21st century could be the most important century ever for humanity, via the development of advanced AI systems that could dramatically speed up scientific and technological advancement, getting us more quickly than most people imagine to a deeply unfamiliar future.
</p>
<p>
<a href="https://www.cold-takes.com/most-important-century/">This page</a> has a ~10-page summary of the series, as well as links to an audio version, podcasts, and the full series.
</p>
<p>
The key points I argue for in the series are:
</p>
<ul>

<li><strong>The long-run future is radically unfamiliar. </strong>Enough advances in technology could lead to a long-lasting, galaxy-wide civilization that could be a radical utopia, dystopia, or anything in between.

</li><li><strong>The long-run future could come much faster than we think,</strong> due to a possible AI-driven productivity explosion.

</li><li>The relevant kind of <strong>AI looks like it will be developed this century</strong> - making this century the one that will initiate, and have the opportunity to shape, a future galaxy-wide civilization.

</li><li>These claims seem too &quot;wild&quot; to take seriously. But there are a lot of reasons to think that <strong>we live in a wild time, and should be ready for anything.</strong>

</li><li>We, the people living in this century, have the chance to have a huge impact on huge numbers of people to come - if we can make sense of the situation enough to find helpful actions. But right now, <strong>we aren&apos;t ready for this.</strong>
</li>
    </ul></div>
</details>
<p>
A lot of my previous writings have focused specifically on the <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">threat of &#x201C;misaligned AI&#x201D;</a>: AI that could have dangerous <em>aims of its own</em> and <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">defeat all of humanity</a>. In this post, I&#x2019;m going to zoom out and give a broader overview of multiple issues transformative AI could raise for society - with an emphasis on <strong>issues we might want to be thinking about <em>now</em> rather than waiting to address as they happen.</strong>
</p>
<p>
My discussion will be very unsatisfying. &#x201C;What are the key factors in whether things go well or poorly with transformative AI?&#x201D; is a massive topic, with lots of angles that have gotten almost no attention and (surely) lots of angles that I just haven&#x2019;t thought of at all. My one-sentence summary of this whole situation is: <a href="https://www.cold-takes.com/most-important-century/#were-not-ready-for-this">we&#x2019;re not ready for this</a>.
</p>
<p>
But hopefully this will give some sense of what sorts of issues should clearly be on our radar. And hopefully it will give a sense of why - out of all the issues we need to contend with - I&#x2019;m as focused on the threat of misaligned AI as I am.
</p>
<p>
Outline:
</p>
<ul>

<li>First, I&#x2019;ll briefly clarify what kinds of issues I&#x2019;m trying to list. I&#x2019;m looking for ways the future could look durably and dramatically different depending on how we navigate the development of transformative AI - such that <strong>doing the right things ahead of time could make a big, lasting difference.</strong>

</li><li>Then, I&#x2019;ll list candidate issues: 
<ul>
 
<li><strong>Misaligned AI.</strong> I touch on this only briefly, since I&#x2019;ve discussed it at length in <a href="https://www.cold-takes.com/tag/implicationsofmostimportantcentury/">previous pieces</a>. The short story is that we should try to avoid AI ending up with dangerous goals of its own and <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">defeating humanity</a>. (The remaining issues below seem irrelevant if this happens!)
 
</li><li><strong>Power imbalances. </strong>As AI speeds up science and technology, it could cause some country/countries/coalitions to become enormously powerful - so it matters a lot which one(s) lead the way on transformative AI. (I fear that this concern is generally overrated compared to misaligned AI, but it is still very important.) There could also be dangers in overly widespread (as opposed to concentrated) AI deployment.
 
</li><li><strong>Early applications of AI. </strong>It might be that what early AIs are used for durably affects how things go in the long run - for example, whether early AI systems are used for education and truth-seeking, rather than manipulative persuasion and/or entrenching what we already believe. We might be able to affect which uses are predominant early on.
 
</li><li><strong>New life forms. </strong>Advanced AI could lead to new forms of intelligent life, such as AI systems themselves and/or <a href="https://www.cold-takes.com/how-digital-people-could-change-the-world/">digital people</a>. Many of the frameworks we&#x2019;re used to, for ethics and the law, could end up needing quite a bit of rethinking for new kinds of entities (for example, should we allow people to make as many copies as they want of entities that will predictably vote in certain ways?) Early decisions about these kinds of questions could have long-lasting effects. 
 
</li><li><strong>Persistent policies and norms. </strong>Perhaps we ought to be identifying particularly important policies, norms, etc. that seem likely to be durable even through rapid technological advancement, and try to improve these as much as possible before transformative AI is developed. (These could include things like a better social safety net suited to high, sustained unemployment rates; better regulations aimed at avoiding bias; etc.)
 
</li><li><strong>Speed of development. </strong>Maybe human society just isn&#x2019;t likely to adapt well to rapid, radical advances in science and technology, and finding a way to limit the pace of advances would be good.
</li> 
</ul>

</li><li>Finally, I&#x2019;ll discuss how I&#x2019;m thinking about which of these issues to prioritize at the moment, and why misaligned AI is such a focus of mine.

</li><li>An appendix will say a small amount about whether the long-run future seems likely to be better or worse than today, in terms of <a href="https://www.cold-takes.com/has-life-gotten-better/">quality of life</a>, assuming we navigate the above issues non-amazingly but non-catastrophically.
</li>
</ul>
<h2 id="kinds-of-issues">The kinds of issues I&#x2019;m trying to list</h2>


<p>
One basic angle you could take on AI is: 
</p>
<p>
&#x201C;AI&#x2019;s main effect will be to speed up <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/#explosive-scientific-and-technological-advancement">science and technology a lot</a>. This means humans will be able to do <em>all</em> the things they were doing before - the good and the bad - but more/faster. So basically, we&#x2019;ll end up with the same future we would&#x2019;ve gotten without AI - just sooner.
</p>
<p>
&#x201C;Therefore, there&#x2019;s no need to prepare in advance for anything in particular, beyond what we&#x2019;d do to work toward a better future <em>normally</em> (in a world with no AI). Sure, lots of weird stuff could happen as science and technology advance - but that was already true, and many risks are just too hard to predict now and easier to respond to as they happen.&#x201D;
</p>
<p>
I don&#x2019;t agree with the above, but I <em>do</em> think it&#x2019;s a good starting point. I think we shouldn&#x2019;t be listing everything that might happen in the future, as AI leads to advances in science and technology, and trying to prepare for it. Instead, we should be asking: <strong>&#x201C;if <a href="https://www.cold-takes.com/most-important-century/">transformative AI</a> is coming in the next few decades, how does this <em>change the picture </em>of what we should be focused on, beyond just speeding up what&#x2019;s going to happen anyway?</strong>&#x201D;
</p>
<p>
And I&#x2019;m going to try to focus on <strong>extremely high-stakes issues - </strong>ways I could imagine the future looking <strong>durably and dramatically different </strong>depending on how we navigate the development of transformative AI.
</p>
<p>
Below, I&#x2019;ll list some candidate issues fitting these criteria.
</p>
<h2 id="potential-issues">Potential issues</h2>


<h3 id="misaligned-ai">Misaligned AI</h3>


<p>
I won&#x2019;t belabor this possibility, because the <a href="https://www.cold-takes.com/tag/implicationsofmostimportantcentury/">last several pieces</a> have been focused on it; this is just a quick reminder.
</p>
<p>
In a world without AI, the main question about the long-run future would be how humans will end up treating each other. But if powerful AI systems will be developed in the coming decades, we need to contend with the possibility that these AI systems will end up having <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">goals of their own</a> - and <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">displacing humans</a> as the species that determines how things will play out.
</p>
<details id="Box2"><summary>(Click to expand)Why would AI &quot;aim&quot; to defeat humanity?<!-- (Details not included in email - <a href="https://www.cold-takes.com/p/b2a3d837-24f1-46ae-9e45-1fac54411b03#Box2">click to view on the web</a>)--></summary>
<div><p>
A <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">previous piece</a> argued that if today&#x2019;s AI development methods lead directly to powerful enough AI systems, disaster is likely by default (in the absence of specific countermeasures). 
</p>
<p>
In brief:
</p>
<ul>

<li>Modern AI development is essentially based on &#x201C;training&#x201D; via trial-and-error. 

</li><li>If we move forward incautiously and ambitiously with such training, and if it gets us all the way to very powerful AI systems, then such systems will likely end up <em>aiming for certain states of the world</em> (analogously to how a chess-playing AI aims for checkmate)<em>.</em>

</li><li>And these states will be<em> other than the ones we intended</em>, because our trial-and-error training methods won&#x2019;t be accurate. For example, when we&#x2019;re confused or misinformed about some question, we&#x2019;ll reward AI systems for giving the wrong answer to it - unintentionally training deceptive behavior.

</li><li>We should expect disaster if we have AI systems that are both (a) <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">powerful enough</a> to defeat humans and (b) aiming for states of the world that we didn&#x2019;t intend. (&#x201C;Defeat&#x201D; means taking control of the world and doing what&#x2019;s necessary to keep us out of the way; it&#x2019;s unclear to me whether we&#x2019;d be literally killed or just forcibly stopped<sup id="fnref1"><a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#fn1" rel="footnote">1</a></sup> from changing the world in ways that contradict AI systems&#x2019; aims.)</li></ul></div>
</details>
<details id="Box3"><summary>(Click to expand) <em>How</em> could AI defeat humanity?<!-- (Details not included in email - <a href="https://www.cold-takes.com/p/b2a3d837-24f1-46ae-9e45-1fac54411b03#Box3">click to view on the web</a>)--></summary>
<div><p>
In a <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">previous piece</a>, I argue that AI systems could defeat all of humanity combined, if (for whatever reason) they were aimed toward that goal.
</p>
<p>
By defeating humanity, I mean gaining control of the world so that AIs, not humans, determine what happens in it; this could involve killing humans or simply &#x201C;containing&#x201D; us in some way, such that we can&#x2019;t interfere with AIs&#x2019; aims.
</p>
<p>
One way this could happen is if AI became extremely advanced, to the point where it had &quot;cognitive superpowers&quot; beyond what humans can do. In this case, a single AI system (or set of systems working together) could imaginably:
</p>
<ul>

<li>Do its own research on how to build a better AI system, which culminates in something that has incredible other abilities.

</li><li>Hack into human-built software across the world.

</li><li>Manipulate human psychology.

</li><li>Quickly generate vast wealth under the control of itself or any human allies.

</li><li>Come up with better plans than humans could imagine, and ensure that it doesn&apos;t try any takeover attempt that humans might be able to detect and stop.

</li><li>Develop advanced weaponry that can be built quickly and cheaply, yet is powerful enough to overpower human militaries.
</li>
</ul>
<p>
However, my piece also explores what things might look like if <em>each AI system basically has similar capabilities to humans. </em>In this case:
</p>
<ul>

<li>Humans are likely to deploy AI systems throughout the economy, such that they have large numbers and access to many resources - and the ability to make copies of themselves. 

</li><li>From this starting point, AI systems with human-like (or greater) capabilities would have a number of possible ways of getting to the point where their total population could outnumber and/or out-resource humans.

</li><li>I address a number of possible objections, such as &quot;How can AIs be dangerous without bodies?&quot;
</li>
</ul>
<p>
More: <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">AI could defeat all of us combined</a></p></div></details>

<h3 id="power-imbalances">Power imbalances</h3>


<p>
I&#x2019;ve argued that AI could cause a <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/#explosive-scientific-and-technological-advancement">dramatic acceleration in the pace of scientific and technological advancement</a>. 
</p>
<details id="Box4"><summary>(Click to expand) How AI could cause explosive progress<!-- (Details not included in email - <a href="https://www.cold-takes.com/p/b2a3d837-24f1-46ae-9e45-1fac54411b03#Box4">click to view on the web</a>)--></summary><div>

<p>(This section is mostly copied from my <a href="https://www.cold-takes.com/most-important-century/">summary of the &quot;most important century&quot; series</a>; it links to some pieces with more detail at the bottom.)</p>

<p>
Standard economic growth models imply that <strong>any technology that could fully automate innovation would cause an &quot;economic singularity&quot;:</strong> productivity going to infinity this century. This is because it would create a powerful feedback loop: more resources -&gt; more ideas and innovation -&gt; more resources -&gt; more ideas and innovation ...
</p>
<p>
This loop would not be unprecedented. I think it is in some sense the &quot;default&quot; way the economy operates - for most of economic history up until a couple hundred years ago. 
</p>
    <p><img src="https://www.cold-takes.com/content/images/size/w1000/2021/06/duplicatorfeedbackloop-original-2.png" width="1036"></p>
    <p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://www.cold-takes.com/content/images/size/w1000/2021/06/duplicatorfeedbackloop-original-6.png" alt="8 ideas, each 1.5x&apos;ing the amount of food resources -&gt; explosion from 8 units to 205 units of food, hence 205 people and 205 ideas ... " width="1036"><figcaption>Economic history: more resources -&gt; more people -&gt; more ideas -&gt; more resources ...</figcaption></figure></p>
<p>
But in the &quot;demographic transition&quot; a couple hundred years ago, the &quot;more resources -&gt; more people&quot; step of that loop stopped. Population growth leveled off, and more resources led to richer people instead of more people:
</p>
<p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://www.cold-takes.com/content/images/2021/07/demographic-transition-nutshell.png" alt="Same as previous diagram, but instead of more corn leading to more people, it leads to the same number of people enjoying their boatload of corn - corn juggling, corn slides, corn feasts, etc." class="kg-image" loading="lazy" width="1036"><figcaption>Today&apos;s economy: more resources -&gt; <del>more </del>richer people -&gt; same pace of ideas -&gt; ...</figcaption></figure></p>

<p>
The feedback loop could come back if some other technology restored the &quot;more resources -&gt; more ideas&quot; dynamic. One such technology could be the right kind of AI: what I call PASTA, or Process for Automating Scientific and Technological Advancement.
</p>
<p><img src="https://www.cold-takes.com/content/images/size/w1000/2021/09/pasta-stills-1.png" class="kg-image" loading="lazy" width="1036"></p>
<p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://www.cold-takes.com/content/images/size/w1000/2021/09/pasta-stills-3.png" class="kg-image" alt loading="lazy" width="1036"><figcaption>Possible future: more resources -&gt; more AIs -&gt; more ideas -&gt; more resources ...</figcaption></figure></p>
<p>
That means that <strong>our radical long-run future could be upon us very fast </strong>after PASTA is developed (if it ever is). 
</p>
<p>
It also means that if PASTA systems are <em>misaligned </em>- pursuing their own non-human-compatible objectives - things could very quickly go sideways.
</p>
<p>
Key pieces:
</p>
<ul>

<li><a href="https://www.cold-takes.com/the-duplicator/">The Duplicator: Instant Cloning Would Make the World Economy Explode</a>

</li><li><a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/">Forecasting Transformative AI, Part 1: What Kind of AI?</a>
    </li></ul></div>
</details>
<p>
One way of thinking about this: perhaps (for reasons I&#x2019;ve <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/#explosive-scientific-and-technological-advancement">argued previously</a>) AI could enable the equivalent of hundreds of years of scientific and technological advancement in a matter of a few months (or faster). If so, then developing powerful AI a few months before others could lead to having technology that is (effectively) hundreds of years ahead of others&#x2019;.
</p>
<p>
Because of this, it&#x2019;s easy to imagine that AI could lead to big power imbalances, as whatever country/countries/coalitions &#x201C;lead the way&#x201D; on AI development could become far more powerful than others (perhaps analogously to when a few smallish European states took over much of the rest of the world).
</p>
<p>
One way we might try to make the future go better: maybe it could be possible for different countries/coalitions to strike deals in advance. For example, two equally matched parties might agree in advance to share their resources, territory, etc. with each other, in order to avoid a winner-take-all competition.
</p>
<p>
What might such agreements look like? Could they possibly be enforced? I really don&#x2019;t know, and I haven&#x2019;t seen this explored much.<sup id="fnref1"><a href="https://www.cold-takes.com/p/b2a3d837-24f1-46ae-9e45-1fac54411b03#fn1" rel="footnote">1</a></sup> 
</p>
<p>
Another way one might try to make the future go better is to try to help a <em>particular</em> country, coalition, etc. develop powerful AI systems before others do. I previously called this the <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/#the-competition-frame">&#x201C;competition&#x201D; frame</a>. 
</p>
<p>
I think it is, in fact, enormously important who leads the way on transformative AI. At the same time, I&#x2019;ve expressed concern that people might overfocus on this aspect of things vs. other issues, for a number of reasons including:
</p>
<ul>

<li><em>I think people naturally get more animated about &quot;helping the good guys beat the bad guys&quot; than about &quot;helping all of us avoid getting a universally bad outcome, for impersonal reasons such as &apos;we designed sloppy AI systems&apos; or &apos;we created a dynamic in which haste and aggression are rewarded.&apos;&quot;</em>

</li><li><em>I expect people will tend to be overconfident about which countries, organizations or people they see as the &quot;good guys.&quot;</em>
</li>
</ul>
<p>
(More <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/#why-i-fear-">here</a>.)
</p>
<p>
Finally, it&#x2019;s worth mentioning the possible dangers of powerful AI being too widespread, rather than too concentrated. In <a href="https://nickbostrom.com/papers/vulnerable.pdf">The Vulnerable World Hypothesis</a>, Nick Bostrom contemplates potential future dynamics such as &#x201C;advances in DIY biohacking tools might make it easy for anybody with basic training in biology to kill millions.&#x201D; In addition to avoiding worlds where AI capabilities end up concentrated in the hands of a few, it could also be important to avoid worlds in which they diffuse too widely, too quickly, before we&#x2019;re able to assess the risks of widespread access to technology far beyond today&#x2019;s.
</p>
<h3 id="early-applications-of-ai">Early applications of AI</h3>


<p>
Maybe advanced AI will be useful for some sorts of tasks before others. For example, maybe - by default - advanced AI systems will soon be powerful persuasion tools, and cause wide-scale societal dysfunction before they cause rapid advances in science and technology. And maybe, with effort, we could make it less likely that this happens - more likely that early AI systems are used for education and truth-seeking, rather than manipulative persuasion and/or entrenching what we already believe.
</p>
<p>
There could be lots of possibilities of this general form: particular ways in which AI could be predictably beneficial, or disruptive, before it becomes an all-purpose accelerant to science and technology. Perhaps trying to map these out today, and push for advanced AI to be used for particular purposes early on, could have a lasting effect on the future.
</p>
<h3 id="new-life-forms">New life forms</h3>


<p>
Advanced AI could lead to new forms of intelligent life, such as AI systems themselves and/or <a href="https://www.cold-takes.com/how-digital-people-could-change-the-world/">digital people</a>.
</p>
<p>
<details id="Box5"><summary>Digital people: one example of how wild the future could be<!-- (details not included in email - <a href="https://www.cold-takes.com/p/b2a3d837-24f1-46ae-9e45-1fac54411b03#Box5">click to view on the web</a>--></summary>
<div><p>
In a <a href="https://www.cold-takes.com/digital-people-faq/#i&apos;m-having-trouble-picturing-a-world-of-digital-people-how-the-technology-could-be-introduced-how-they-would-interact-with-us-etc-can-you-lay-out-a-detailed-scenario-of-what-the-transition-from-today&apos;s-world-to-a-world-full-of-digital-people-might-look-like">previous piece</a>, I tried to give a sense of just how wild a future with advanced technology could be, by examining one hypothetical technology: &quot;digital people.&quot; 
</p>
<p>
To get the idea of digital people, imagine a computer simulation of a specific person, in a virtual environment. For example, a simulation of you that reacts to all &quot;virtual events&quot; - virtual hunger, virtual weather, a virtual computer with an inbox - just as you would. 
</p>
<p>
I&#x2019;ve argued that digital people would likely be <a href="https://www.cold-takes.com/digital-people-faq/#could-digital-people-be-conscious-could-they-deserve-human-rights">conscious and deserving of human rights </a>just as we are. And I&#x2019;ve argued that they could have major impacts, in particular:
</p>
<ul>

<li>Productivity. Digital people could be copied, just as we can easily make copies of ~any software today. They could also be run much faster than humans. Because of this, digital people could have effects comparable to those of the <a href="https://www.cold-takes.com/the-duplicator">Duplicator</a>, but more so: unprecedented (in history or in sci-fi movies) levels of economic growth and productivity.

</li><li>Social science. Today, we see a lot of progress on understanding scientific laws and developing cool new technologies, but not so much progress on understanding human nature and human behavior. Digital people would fundamentally change this dynamic: people could make copies of themselves (including sped-up, temporary copies) to explore how different choices, lifestyles and environments affected them. Comparing copies would be informative in a way that current social science rarely is.

</li><li>Control of the environment. Digital people would experience whatever world they (or the controller of their virtual environment) wanted. Assuming digital people had true conscious experience (an assumption discussed <a href="https://www.cold-takes.com/p/febce3fc-87c0-4ceb-b0c0-13fdf75b9257#could-digital-people-be-conscious-could-they-deserve-human-rights">in the FAQ</a>), this could be a good thing (it should be possible to eliminate disease, material poverty and non-consensual violence for digital people) or a bad thing (if human rights are not protected, digital people could be subject to scary levels of control).

</li><li>Space expansion. The population of digital people might become staggeringly large, and the computers running them could end up distributed throughout our galaxy and beyond. Digital people could exist anywhere that computers could be run - so space settlements could be more straightforward for digital people than for biological humans.

</li><li>Lock-in. In today&apos;s world, we&apos;re used to the idea that the future is unpredictable and uncontrollable. Political regimes, ideologies, and cultures all come and go (and evolve). But a community, city or nation of digital people could be much more stable. 
<ul>
 
<li>Digital people need not die or age.
 
</li><li>Whoever sets up a &quot;virtual environment&quot; containing a community of digital people could have quite a bit of long-lasting control over what that community is like. For example, they might build in software to reset the community (both the virtual environment and the people in it) to an earlier state if particular things change - such as who&apos;s in power, or what religion is dominant.
 
</li><li>I consider this a disturbing thought, as it could enable long-lasting authoritarianism, though it could also enable things like permanent protection of particular human rights.
</li> 
</ul>
</li> 
</ul>
<p>
I think these effects could be a very good or a very bad thing. How the early years with digital people go could irreversibly determine which. 
</p>
<p>
More: 
</p>
<ul>

<li><a href="https://www.cold-takes.com/how-digital-people-could-change-the-world/">Digital People would be an Even Bigger Deal</a>

</li><li><a href="https://www.cold-takes.com/digital-people-faq/">Digital People FAQ</a>
</li>
</ul>
    </div></details>
</p><p>
Many of the frameworks we&#x2019;re used to, for ethics and the law, could end up needing quite a bit of rethinking for new kinds of entities. For example:
</p>
<ul>

<li>How should we determine which AI systems or digital people are considered to have &#x201C;rights&#x201D; and get legal protections?

</li><li>What about the right to vote? If an AI system or digital person can be quickly copied billions of times, with each copy getting a vote, that could be a recipe for trouble - does this mean we should restrict copying, restrict voting or something else?

</li><li>What should the rules be about engineering AI systems or digital people to have particular beliefs, motivations, experiences, etc.? Simple examples:  
<ul>
 
<li>Should it be illegal to create new AI systems or digital people that will predictably suffer a lot? How much suffering is too much?
 
</li><li>What about creating AI systems or digital people that consistently, predictably support some particular political party or view?
</li> 
</ul>
</li> 
</ul>
<p>
(For a lot more in this vein, see <a href="https://nickbostrom.com/propositions.pdf">this very interesting piece by Nick Bostrom and Carl Shulman</a>.)
</p>
<p>
Early decisions about these kinds of questions could have long-lasting effects. For example, imagine someone creating billions of AI systems or <a href="https://www.cold-takes.com/how-digital-people-could-change-the-world/">digital people</a> that have capabilities and subjective experiences comparable to humans, and are deliberately engineered to &#x201C;believe in&#x201D; (or at least help promote) some particular ideology (Communism, libertarianism, etc.) If these systems are self-replicating, that could change the future drastically. 
</p>
<p>
Thus, it might be important to set good principles in place for tough questions about how to treat new sorts of digital entities, <em>before</em> new sorts of digital entities start to multiply.
</p>
<h3 id="persistent-policies-and-norms">Persistent policies and norms</h3>


<p>
There might be particular policies, norms, etc. that are likely to stay persistent even as technology is advancing and many things are changing.
</p>
<p>
For example, how people think about ethics and norms might just inherently change more slowly than technological capabilities change. Perhaps a society that had strong animal rights protections, and general pro-animal attitudes, would maintain these properties all the way through explosive technological progress, becoming a technologically advanced society that treated animals well - while a society that had little regard for animals would become a technologically advanced society that treated animals poorly. Similar analysis could apply to religious values, social liberalism vs. conservatism, etc.
</p>
<p>
So perhaps we ought to be identifying particularly important policies, norms, etc. that seem likely to be durable even through rapid technological advancement, and try to improve these as much as possible before transformative AI is developed.
</p>
<p>
One tangible example of a concern I&#x2019;d put in this category: if AI is going to cause high, persistent technological unemployment, it might be important to establish new social safety net programs (such as universal basic income) <em>today</em> - if these programs would be easier to establish today than in the future. I feel less than convinced of this one - first because I <a href="https://www.cold-takes.com/technological-unemployment-ai-vs-most-important-century-ai-how-far-apart/">have some doubts</a> about how big an issue technological unemployment is going to be, and second because it&#x2019;s not clear to me why policy change would be easier today than in a future where technological unemployment is a reality. And more broadly, I fear that it&apos;s very hard to design <em>and</em> (politically) implement policies today that we can be confident will make things durably better as the world changes radically.
</p>
<h3 id="slow-it-down">Slow it down?</h3>


<p>
I&#x2019;ve named a number of ways in which weird things - such as power imbalances, and some parts of society changing much faster than others - could happen as scientific and technological advancement accelerate. Maybe one way to make the most important century go well would be to simply avoid these weird things by avoiding too-dramatic acceleration. Maybe human society just isn&#x2019;t likely to adapt well to rapid, radical advances in science and technology, and finding a way to limit the pace of advances would be good.
</p>
<p>
Any individual company, government, etc. has an incentive to move quickly and try to get ahead of others (or not fall too far behind), but coordinated agreements and/or regulations (along the lines of the &#x201C;global monitoring&#x201D; possibility discussed <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/#global-monitoring">here</a>) could help everyone move more slowly.
</p>
<h3>What else?</h3>


<p>
Are there other ways in which transformative AI would cause particular issues, risks, etc. to loom especially large, and to be worth special attention today? I&#x2019;m guessing I&#x2019;ve only scratched the surface here.
</p>
<h2 id="what-im-prioritizing">What I&#x2019;m prioritizing, at the moment</h2>


<p>
If this is the <a href="https://www.cold-takes.com/most-important-century/">most important century</a>, there&#x2019;s a vast set of things to be thinking about and trying to prepare for, and it&#x2019;s hard to know what to prioritize.
</p>
<p>
Where I&#x2019;m at for the moment:
</p>
<p>
<strong>It seems very hard to say today what will be desirable in a radically different future. </strong>I wish more thought and attention were going into things like <a href="https://www.cold-takes.com/p/b2a3d837-24f1-46ae-9e45-1fac54411b03#early-applications-of-ai">early applications of AI</a>; <a href="https://www.cold-takes.com/p/b2a3d837-24f1-46ae-9e45-1fac54411b03#new-life-forms">norms and laws around new life forms</a>; and whether there are <a href="https://www.cold-takes.com/p/b2a3d837-24f1-46ae-9e45-1fac54411b03/#persistent-policies-and-norms">policy changes today that we could be confident in even if the world is changing rapidly and radically.</a> <strong>But </strong>it seems to me that it would be very hard to be confident in any particular goal in areas like these. Can we really say anything today about what sorts of digital entities should have rights, or what kinds of AI applications we hope come first, that we expect to hold up?
</p>
<p>
<strong>I feel most confident in two very broad ideas: &#x201C;It&#x2019;s bad if AI systems defeat humanity to pursue goals of their own&#x201D; and &#x201C;It&#x2019;s good if good decision-makers end up making the key decisions.&#x201D; </strong>These map to the <a href="https://www.cold-takes.com/p/b2a3d837-24f1-46ae-9e45-1fac54411b03#misaligned-ai">misaligned AI</a> and <a href="https://www.cold-takes.com/p/b2a3d837-24f1-46ae-9e45-1fac54411b03#power-imbalances">power imbalance</a> topics - or what I previously called <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/#the-caution-frame">caution</a> and <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/#the-competition-frame">competition</a>.
</p>
<p>
That said, <strong>it also seems hard to know who the &#x201C;good decision-makers&#x201D; are. </strong>I&#x2019;ve definitely observed some of this dynamic: &#x201C;Person/company A says they&#x2019;re trying to help the world by aiming to build transformative AI before person/company B; person/company B says they&#x2019;re trying to help the world by aiming to build transformative AI before person/company A.&#x201D; 
</p>
<p>
It&#x2019;s pretty hard to come up with tangible tests of who&#x2019;s a &#x201C;good decision-maker.&#x201D; We mostly don&#x2019;t know what person A would do with enormous power, or what person B would do, based on their actions today. One possible criterion is that we should arguably have more trust in people/companies who show more <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/#the-caution-frame">caution</a> - people/companies who show willingness to hurt their own chances of &#x201C;being in the lead&#x201D; in order to help everyone&#x2019;s chance of avoiding a catastrophe from misaligned AI.<sup id="fnref2"><a href="https://www.cold-takes.com/p/b2a3d837-24f1-46ae-9e45-1fac54411b03#fn2" rel="footnote">2</a></sup>
</p>
<p>
(Instead of focusing on which particular people and/or companies lead the way on AI, you could focus on which <em>countries</em> do, e.g. preferring non-authoritarian countries. It&#x2019;s arguably pretty clear that non-authoritarian countries would be better than authoritarian ones. However, I have concerns about this as a goal as well, discussed in a footnote.<sup id="fnref3"><a href="https://www.cold-takes.com/p/b2a3d837-24f1-46ae-9e45-1fac54411b03#fn3" rel="footnote">3</a></sup>)
</p>
<p>
<strong>For now, I am <em>most</em> focused on the threat of misaligned AI. </strong>Some reasons for this:
</p>
<ul>

<li>It currently seems to me that misaligned AI is a significant risk. Misaligned AI seems <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">likely by default</a> if we don&#x2019;t specifically do things to prevent it, and preventing it seems far from straightforward (see previous posts on the <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">difficulty of alignment research</a> and <a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/">why it could be hard for key players to be cautious</a>).

</li><li>At the same time, it seems like there are significant <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/">hopes</a> for how we might avoid this risk. As argued <a href="https://www.alignmentforum.org/posts/rCJQAkPTEypGjSJ8X/how-might-we-align-transformative-ai-if-it-s-developed-very#Key_question__how_cautious_will_Magma_and_others_be_">here</a> and <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/">here</a>, my sense is that the more broadly people understand this risk, the better our odds of avoiding it.

</li><li>I currently feel that this threat is <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/#the-competition-frame">underrated</a>, relative to the easier-to-understand angle of &#x201C;I hope people I like develop powerful AI systems before others do.&#x201D;

</li><li>I think the &#x201C;competition&#x201D; frame - focusing on helping some countries/coalitions/companies develop advanced AI before others - makes quite a bit of sense as well. But - as noted directly above -  I have big reservations about the most common &#x201C;competition&#x201D;-oriented actions, such as trying to help particular companies outcompete others or trying to get U.S. policymakers more focused on AI.  
<ul>
 
<li>For the latter, I worry that this risks making huge sacrifices on the &#x201C;caution&#x201D; front and even backfiring by causing other governments to invest in projects of their own.
 
</li><li>For the former, I worry about the ability to judge &#x201C;good&#x201D; leadership, and the temptation to overrate people who resemble oneself.
</li> 
</ul>
</li> 
</ul>
<p>
This is all far from absolute. I&#x2019;m open to a broad variety of projects to help the most important century go well, whether they&#x2019;re about &#x201C;caution,&#x201D; &#x201C;competition&#x201D; or another issue (including those I&#x2019;ve listed in this post). My top priority at the moment is reducing the risks of misaligned AI, but I think a huge range of potential risks aren&#x2019;t getting enough attention from the world at large.
</p>
<h2 id="appendix">Appendix: if we avoid catastrophic risks, how good does the future look?</h2>


<p>
Here I&#x2019;ll say a small amount about whether the long-run future seems likely to be better or worse than today, in terms of <a href="https://www.cold-takes.com/has-life-gotten-better/">quality of life</a>. 
</p>
<p>
Part of why I want to do this is to give a sense of why I feel cautiously and moderately optimistic about such a future - such that I feel broadly okay with a frame of &#x201C;We should try to prevent anything <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">too catastrophic</a> from happening, and figure that the future we get if we can pull that off is reasonably likely (though far from assured!) to be good.&#x201D;
</p>
<p>
So I&#x2019;ll go through some quick high-level reasons for hope (the future might be better than the present) - and for concern (it might be worse). 
</p>
<p>
<strong>In this section, I&#x2019;m ignoring the special role AI might play, and just thinking about what happens if we get a fast-forwarded future. </strong>I&#x2019;ll be focusing on what I think are probably the most likely ways the world will change in the future, laid out <a href="https://www.cold-takes.com/summary-of-history-empowerment-and-well-being-lens/#history-is-a-story">here</a>: a higher world population and greater <strong>empowerment due to a greater stock of ideas, innovations and technological capabilities. </strong>My aim is to ask: &#x201C;If we navigate the above issues neither amazingly nor catastrophically, and end up with the same sort of future we&#x2019;d have had without AI (just sped up), how do things look?&#x201D;
</p>
<p>
<strong>Reason for hope: empowerment trends. </strong>One simple take would be: &#x201C;<a href="https://www.cold-takes.com/has-life-gotten-better-the-post-industrial-era/">Life has gotten better for humans</a><sup id="fnref4"><a href="https://www.cold-takes.com/p/b2a3d837-24f1-46ae-9e45-1fac54411b03#fn4" rel="footnote">4</a></sup><a href="https://www.cold-takes.com/has-life-gotten-better-the-post-industrial-era/"> over the last couple hundred years or so</a>, the period during which we&#x2019;ve seen <a href="https://www.cold-takes.com/this-cant-go-on/">most of history&#x2019;s economic growth and technological progress</a>. We&#x2019;ve seen better health, less poverty and hunger, less violence, more anti-discrimination measures, and few signs of anything getting clearly worse. So if humanity just keeps getting more and more <a href="https://www.cold-takes.com/rowing-steering-anchoring-equity-mutiny/#rowing">empowered</a>, and nothing catastrophic happens, we should plan on life continuing to improve along a variety of dimensions.&#x201D;
</p>
<p>
<em>Why</em> is this the trend, and should we expect it to hold up? There are lots of theories, and I won&#x2019;t pretend to know, but I&#x2019;ll lay out some basic thoughts that may be illustrative and give cause for optimism.
</p>
<p>
First off, there is an awful lot of room for improvement just from continuing to cut down on things like hunger and disease. A wealthier, more technologically advanced society seems like a pretty good bet to have less hunger and disease for fairly straightforward reasons.
</p>
<p>
But we&#x2019;ve seen <a href="https://www.cold-takes.com/has-life-gotten-better-the-post-industrial-era/">improvement</a> on other dimensions too. This could be partly explained by something like the following dynamic:
</p>
<ul>

<li>Most people would - aspirationally - <em>like </em>to be nonviolent, compassionate, generous and fair, if they could do so without sacrificing other things.

</li><li>As <a href="https://www.cold-takes.com/rowing-steering-anchoring-equity-mutiny/#rowing">empowerment</a> rises, the need to make sacrifices falls (noisily and imperfectly) across the board.

</li><li>This dynamic may have led to some (noisy, imperfect) improvement to date, but there might be <em>much more</em> benefit in the future compared to the past. For example, if we see a lot of progress on <a href="https://www.cold-takes.com/how-digital-people-could-change-the-world/#social-science">social science</a>, we might get to a world where people understand their own needs, desires and behavior better - and thus can get most or all of what they want (from material needs to self-respect and happiness) without having to outcompete or push down others.<sup id="fnref5"><a href="https://www.cold-takes.com/p/b2a3d837-24f1-46ae-9e45-1fac54411b03#fn5" rel="footnote">5</a></sup></li></ul>
<p>
<strong>Reason for hope: the &#x201C;cheap utopia&#x201D; possibility. </strong>This is sort of an extension of the previous point. If we imagine the upper limit of how &#x201C;empowered&#x201D; humanity could be (in terms of having lots of technological capabilities), it might be relatively <em>easy</em> to create a kind of <a href="https://www.cold-takes.com/visualizing-utopia/">utopia</a> (such as the <a href="https://www.cold-takes.com/visualizing-utopia/#a-meta-option">utopia I&#x2019;ve described previously</a>, or hopefully something much better). This doesn&#x2019;t <em>guarantee</em> that such a thing will happen, but a future where it&#x2019;s technologically easy to do things like <a href="https://www.cold-takes.com/how-digital-people-could-change-the-world/#virtual-reality-and-control-of-the-environment">meeting material needs</a> and providing <a href="https://www.cold-takes.com/visualizing-utopia/#a-meta-option">radical choice</a> could be quite a bit better than the present.
</p>
<p>
An interesting (wonky) treatment of this idea is Carl Shulman&#x2019;s blog post: <a href="http://reflectivedisequilibrium.blogspot.com/2012/09/spreading-happiness-to-stars-seems.html">Spreading happiness to the stars seems little harder than just spreading</a>.
</p>
<p>
<strong>Reason for concern: authoritarianism. </strong>There are some huge countries that are essentially ruled by one person, with little to no democratic or other mechanisms for citizens to have a voice in how they&#x2019;re treated. It seems like a live risk that the world could end up this way - essentially ruled by one person or relatively small coalition - in the long run. (It arguably would even continue a historical trend in which political units have gotten larger and larger.)
</p>
<p>
Maybe this would be fine if whoever&#x2019;s in charge is able to let everyone have freedom, wealth, etc. at little cost to themselves (along the lines of the above point). But maybe whoever&#x2019;s in charge is just a crazy or horrible person, in which case we might end up with a bad future even if it <em>would</em> be &#x201C;cheap&#x201D; to have a wonderful one.
</p>
<p>
<strong>Reason for concern: competitive dynamics. </strong>You might imagine that as empowerment advances, we get purer, more unrestrained <em>competition</em>. 
</p>
<p>
One way of thinking about this: 
</p>
<ul>

<li>Today, no matter how ruthless CEOs are, they tend to accommodate some amount of leisure time for their employees. That&#x2019;s because businesses have no choice but to hire people who insist on working a limited number of hours, having a life outside of work, etc. 

</li><li>But if we had advanced enough technology, it might be possible to run a business whose employees have zero leisure time. (One example would be via <a href="https://www.cold-takes.com/how-digital-people-could-change-the-world/">digital people</a> and the ability to <a href="https://www.cold-takes.com/how-digital-people-could-change-the-world/#productivity">make lots of copies of highly productive people just as they&#x2019;re about to get to work</a><em>. </em>A more mundane example would be if e.g. advanced stimulants and other drugs were developed so people could be productive without breaks.)

</li><li>And that might be what the most productive businesses, organizations, etc. end up looking like - the most productive organizations might be the ones that most maniacally and uncompromisingly use <em>all of their resources to acquire more resources. </em>Those could be precisely the organizations that end up filling most of the galaxy.

</li><li>More at <a href="https://slatestarcodex.com/2014/07/13/growing-children-for-bostroms-disneyland/">this Slate Star Codex post</a>. Key quote: &#x201C;I&#x2019;m pretty sure that brutal &#x2026; competition combined with ability to [copy and edit] minds necessarily results in paring away everything not directly maximally economically productive. And a lot of things we like &#x2013; love, family, art, hobbies &#x2013; are not directly maximally economic productive.&#x201D;
</li>
</ul>
<p>
That said:
</p>
<ul>

<li>It&#x2019;s not really clear how this ultimately shakes out. One possibility is something like this:  
<ul>
 
<li>Lots of people, or perhaps machines, compete ruthlessly to acquire resources. But this competition is (a) legal, subject to a property rights system; (b) ultimately for the benefit of the <em>investors </em>in the competing companies/organizations. 
 
</li><li>Who are these investors? Well, today, many of the biggest companies are mostly owned by large numbers of individuals via mutual funds. The same could be true in the future - and those individuals could be normal people who use the proceeds for nice things.
</li> 
</ul>

</li><li>If the &#x201C;cheap utopia&#x201D; possibility (described above) comes to pass, it might only take a small amount of spare resources to support a lot of good lives.
</li>
</ul>
<p>
<strong>Overall, my guess is that the long-run future is more likely to be <em>better than the present</em> than <em>worse than the present</em></strong> (in the sense of <a href="https://www.cold-takes.com/has-life-gotten-better/">average quality of life</a>). I&#x2019;m very far from confident in this. I&#x2019;m more confident that the long-run future is likely to be <em>better than nothing</em>, and that it would be good to prevent humans from going extinct, or a similar development such as a takeover by misaligned AI.
</p>

<!--kg-card-end: html--><!--kg-card-begin: html-->

<p><div style="display:flex; justify-content:center; margin: 0 auto;">

        <span style="margin: 10px;"><a href="https://api.addthis.com/oexchange/0.8/forward/twitter/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Ftransformative-ai-issues-not-just-misalignment-an-overview&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Transformative%20AI%20issues%20(not%20just%20misalignment)%3A%20an%20overview&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-twitter-square.png" border="0" alt="Twitter"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/facebook/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Ftransformative-ai-issues-not-just-misalignment-an-overview&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Transformative%20AI%20issues%20(not%20just%20misalignment)%3A%20an%20overview&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-facebook-square.png" border="0" alt="Facebook"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/reddit/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Ftransformative-ai-issues-not-just-misalignment-an-overview&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Transformative%20AI%20issues%20(not%20just%20misalignment)%3A%20an%20overview&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-reddit-square.png" border="0" alt="Reddit"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/menu/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Ftransformative-ai-issues-not-just-misalignment-an-overview&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Transformative%20AI%20issues%20(not%20just%20misalignment)%3A%20an%20overview&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-addthis-square.png" border="0" alt="More"></a></span>
        </div>
<center><p id="discuss"><!--<a href="https://www.cold-takes.com/transformative-ai-issues-not-just-misalignment-an-overview#subscribe" target="_blank"><button id="footer-subscribe" class="button">Subscribe</button></a>&nbsp;<a href="https://www.guidedtrack.com/programs/4kal2ue/run?posttitle=Transformative%20AI%20issues%20(not%20just%20misalignment)%3A%20an%20overview" target="_blank"><button class="button" id="Survey">Feedback</button></a>-->&#xA0;<a href="https://www.lesswrong.com/posts/slug/transformative-ai-issues-not-just-misalignment-an-overview#comments" target="_blank"><button class="button">Comment/discuss</button></a></p><p><!--<em>
Use "Feedback" if you have comments/suggestions you want me to see, or if you're up for giving some quick feedback about this post (which I greatly appreciate!) Use "Forum" if you want to discuss this post publicly on the Effective Altruism Forum.
</em>--></p></center>
<!--kg-card-end: html--><!--kg-card-begin: html--></p><h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<hr>
<ol><li id="fn1">
<p>
     A couple of discussions of the prospects for enforcing agreements <a href="https://www.alignmentforum.org/posts/S4Jg3EAdMq57y587y/an-alternative-approach-to-ai-cooperation">here </a>and <a href="https://www.alignmentforum.org/posts/gYaKZeBbSL4y2RLP3/strategic-implications-of-ais-ability-to-coordinate-at-low">here</a>.&#xA0;<a href="#fnref1" rev="footnote">&#x21A9;</a><li id="fn2">
<p>
     I&#x2019;m reminded of the <a href="https://en.wikipedia.org/wiki/Judgement_of_Solomon">judgment of Solomon</a>: &#x201C;two mothers living in the same house, each the mother of an infant son, came to Solomon. One of the babies had been smothered, and each claimed the remaining boy as her own. Calling for a sword, Solomon declared his judgment: the baby would be cut in two, each woman to receive half. One mother did not contest the ruling, declaring that if she could not have the baby then neither of them could, but the other begged Solomon, &#x2018;Give the baby to her, just don&apos;t kill him!&#x2019; The king declared the second woman the true mother, as a mother would even give up her baby if that was necessary to save its life, and awarded her custody.&#x201D; 
</p><p>
    The sword is misaligned AI and the baby is humanity or something.
</p><p>
    (This story is actually extremely bizarre - seriously, Solomon was like &#x201C;You each get half the baby&#x201D;?! - and some <a href="https://en.wikipedia.org/wiki/Judgement_of_Solomon#Classification_and_parallels">similar stories from India/China</a> seem at least a bit more plausible. But I think you get my point. Maybe.)&#xA0;<a href="#fnref2" rev="footnote">&#x21A9;</a><li id="fn3">
<p>
     For a tangible example, I&#x2019;ll discuss the practice (which some folks are doing today) of trying to ensure that the U.S. develops transformative AI before another country does, by arguing for the importance of A.I. to U.S. policymakers. 
</p><p>
    This approach makes me quite nervous, because:
<ul>

<li>I expect U.S. policymakers by default to be <em>very</em> oriented toward &#x201C;competition&#x201D; to the exclusion of &#x201C;caution.&#x201D; (This could change if the importance of caution becomes more widely appreciated!) 

</li><li>I worry about a nationalized AI project that (a) doesn&#x2019;t exercise much caution at all, focusing entirely on racing ahead of others; (b) might backfire by causing <em>other</em> countries to go for nationalized projects of their own, inflaming an already tense situation and not even necessarily doing much to make it more likely that the U.S. leads the way.  In particular, other countries might have an easier time quickly mobilizing huge amounts of government funding than the U.S., such that the U.S. might have better odds if it remains the case that most AI research is happening at private companies.</li></ul>

</p><p>
    (There might be ways of helping particular countries <em>without</em> raising the risks of something like a low-caution nationalized AI project, and if so these could be important and good.)&#xA0;<a href="#fnref3" rev="footnote">&#x21A9;</a><li id="fn4">
<p>
     <a href="https://www.cold-takes.com/has-life-gotten-better-the-post-industrial-era/#for-animals-its-not-the-same-story">Not for animals</a>, though see <a href="https://forum.effectivealtruism.org/posts/z7quAxWyHuqFdxGE6/rowing-steering-anchoring-equity-mutiny-1?commentId=cQ4n3ZuLFqgkfgBsy">this comment</a> for some reasons we might not consider this a knockdown objection to the &#x201C;life has gotten better&#x201D; claim.&#xA0;<a href="#fnref4" rev="footnote">&#x21A9;</a><li id="fn5">

<p>
     This is only a possibility. It&#x2019;s also possible that humans deeply value being <em>better-off than others</em>, which could complicate it quite a bit. (Personally, I feel somewhat optimistic that a lot of people would aspirationally prefer to focus on their own welfare rather than comparing themselves to others - so if knowledge advanced to the point where people could choose to change in this way, I feel optimistic that at least many would do so.)&#xA0;<a href="#fnref5" rev="footnote">&#x21A9;</a>

</p></li></p></li></p></li></p></li></p></li></ol></div>


<!--kg-card-end: html-->]]></content:encoded></item><item><title><![CDATA[Racing through a minefield: the AI deployment problem]]></title><description><![CDATA[Push AI forward too fast, and catastrophe could occur. Too slow, and someone else less cautious could do it. Is there a safe course?]]></description><link>https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/</link><guid isPermaLink="false">63a0f22e9a951a003d4e26ff</guid><category><![CDATA[ImplicationsOfMostImportantCentury]]></category><dc:creator><![CDATA[Holden Karnofsky]]></dc:creator><pubDate>Thu, 22 Dec 2022 16:06:37 GMT</pubDate><media:content url="https://www.cold-takes.com/content/images/2022/12/racing-through-a-minefield-rectangular.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html--><img src="https://www.cold-takes.com/content/images/2022/12/racing-through-a-minefield-rectangular.png" alt="Racing through a minefield: the AI deployment problem"><p><figure><div id="buzzsprout-player-11907514"></div><script src="https://www.buzzsprout.com/1851795/11907514-racing-through-a-minefield-the-ai-deployment-problem.js?container_id=buzzsprout-player-11907514&amp;player=small" type="text/javascript" charset="utf-8"></script><figcaption><em>Click lower right to download or find on Apple Podcasts, Spotify, Stitcher, etc.</em></figcaption></figure></p>

<p>
In previous pieces, I argued that there&apos;s a real and large risk of AI systems&apos; developing dangerous goals of their own and defeating all of humanity - at least in the absence of specific efforts to prevent this from happening. I discussed <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">why it could be hard to build AI systems without this risk</a> and <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/">how it might be doable</a>.
</p>
<p>
The &#x201C;AI alignment problem&#x201D; refers<sup id="fnref1"><a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#fn1" rel="footnote">1</a></sup> to a <em>technical</em> problem: how can we design a powerful AI system that behaves as intended, rather than forming its <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">own dangerous aims</a>? This post is going to outline a <strong>broader political/strategic problem, the &#x201C;deployment problem&#x201D;: </strong>if you&#x2019;re someone who might be on the cusp of developing extremely powerful (and maybe dangerous) AI systems, what should you &#x2026; do?
</p>
<p>
The basic challenge is this:
</p>
<ul>

<li>If you race forward with building and using powerful AI systems as fast as possible, you might cause a global catastrophe (see links above).

</li><li>If you move too slowly, though, you might just be waiting around for <em>someone else less cautious</em> to develop and deploy powerful, dangerous AI systems.

</li><li>And if you can get to the point where your own systems are both powerful and safe &#x2026; what then? Other people still might be less-cautiously building dangerous ones - what should we do about that?
</li>
</ul>
<p>
My current analogy for the deployment problem is <strong>racing through a minefield: each player is hoping to be ahead of others, but anyone moving too quickly can cause a disaster. </strong>(In this minefield, a single mine is big enough to endanger <em>all</em> the racers.)
</p>
<p>
This post gives a high-level overview of how I see the kinds of developments that can lead to a good outcome, despite the &#x201C;racing through a minefield&#x201D; dynamic. It is distilled from a more detailed <a href="https://www.alignmentforum.org/posts/vZzg8NS7wBtqcwhoJ/nearcast-based-deployment-problem-analysis">post on the Alignment Forum</a>.
</p>
<p>
First, I&#x2019;ll flesh out how I see the challenge we&#x2019;re contending with, based on the premises above.
</p>
<p>
Next, I&#x2019;ll list a number of things I hope that &#x201C;cautious actors&#x201D; (AI companies, governments, etc.) might do in order to prevent catastrophe.
</p>
<p>
<strong>Many of the actions I&#x2019;m picturing are not the kind of things normal market and commercial incentives would push toward, and as such, I think there&#x2019;s room for a ton of variation in whether the &#x201C;racing through a minefield&#x201D; challenge is handled well. </strong>Whether key decision-makers understand things like the case for <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">misalignment risk </a>(and in particular, <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">why it might be hard to measure</a>) - and are willing to lower their own chances of &#x201C;winning the race&#x201D; to improve the odds of a good outcome for everyone - could be crucial.
</p>
<h2 id="basic-premises">The basic premises of &#x201C;racing through a minefield&#x201D;</h2>


<p>
This piece is going to lean on <a href="https://www.cold-takes.com/tag/implicationsofmostimportantcentury/">previous pieces</a> and assume all of the following things:
</p>
<ul>

<li><strong>Transformative AI soon. </strong>This century, something like <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/">PASTA</a> could be developed: AI systems that can effectively automate everything humans do to advance science and technology. This brings the potential for explosive progress in science and tech, getting us more quickly than most people imagine to a deeply unfamiliar future. I&#x2019;ve argued for this possibility in the <a href="https://www.cold-takes.com/most-important-century/">Most Important Century series</a>.

</li><li><strong>Misalignment risk. </strong>As argued previously, there&#x2019;s a significant risk that such AI systems could end up with <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">misaligned goals of their own</a>, leading them to <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">defeat all of humanity</a>. And it could take <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/">significant extra effort</a> to get AI systems to be safe.

</li><li><strong>Ambiguity. </strong>As argued previously, it could be <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">hard to know whether AI systems are dangerously misaligned</a>, for a number of reasons. In particular, when we train AI systems not to behave dangerously, we might be unwittingly training them to <em>obscure their dangerous potential from humans</em>, and take dangerous actions <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/#The-King-Lear-problem">only when humans would not be able to stop them</a>. At the same time, I expect powerful AI systems will present massive opportunities to make money and gain power, such that many people will want to race forward with building and deploying them as fast as possible (perhaps even if they believe that doing so is risky for the world!)
</li>
</ul>
<p>
So, one can imagine a scenario where some company is in the following situation:
</p>
<ul>

<li>It has good reason to think it&#x2019;s on the cusp of developing extraordinarily powerful AI systems.

</li><li>If it deploys such systems hastily, global disaster could result.

</li><li>But if it moves too <em>slowly</em>, other, less cautious actors could deploy dangerous systems of their own.
</li>
</ul>
<p>
That seems like a tough enough, high-stakes-enough, and likely enough situation that it&#x2019;s worth thinking about how one is supposed to handle it.
</p>
<p>
One simplified way of thinking about this problem:
</p>
<ul>

<li>We might classify &#x201C;actors&#x201D; (companies, government projects, whatever might develop powerful AI systems or play an important role in how they&#x2019;re deployed) as <strong>cautious</strong> (taking misalignment risk very seriously) or <strong>incautious</strong> (not so much).

</li><li>Our basic hope is that <strong>at any given point in time, cautious actors collectively have the power to &#x201C;contain&#x201D; incautious actors. </strong>By &#x201C;contain,&#x201D; I mean: stop them from deploying misaligned AI systems, and/or stop the misaligned systems from causing a catastrophe.

</li><li>Importantly, <strong>it could be important for cautious actors to <em>use powerful AI systems</em> to help with &#x201C;containment&#x201D; in one way or another. </strong>If cautious actors refrain from AI development entirely, it seems likely that incautious actors will end up with more powerful systems than cautious ones, which doesn&#x2019;t seem good.
</li>
</ul>
<p>
In this setup, <strong>cautious actors need to move fast enough that they can&#x2019;t be overpowered by others&#x2019; AI systems, but slowly enough that they don&#x2019;t cause disaster themselves. </strong>Hence the &#x201C;racing through a minefield&#x201D; analogy.
</p>
<h2 id="what-success-looks-like">What success looks like</h2>


<p>
In a <a href="https://www.alignmentforum.org/posts/vZzg8NS7wBtqcwhoJ/nearcast-based-deployment-problem-analysis">non-Cold-Takes piece</a>, I explore the possible actions available to cautious actors to win the race through a minefield. This section will summarize the general categories - and, crucially, why we shouldn&#x2019;t expect that companies, governments, etc. will do the right thing simply from natural (commercial and other) incentives.
</p>
<p>
I&#x2019;ll be going through each of the following:
</p>
<ul>

<li><strong>Alignment (charting a safe path through the minefield). </strong>Putting lots of effort into technical work to reduce the risk of misaligned AI. 

</li><li><strong>Threat assessment (alerting others about the mines). </strong>Putting lots of effort into <em>assessing</em> the risk of misaligned AI, and potentially demonstrating it (to other actors) as well.

</li><li><strong>Avoiding races (to move more cautiously through the minefield). </strong>If different actors are racing to deploy powerful AI systems, this could make it unnecessarily hard to be cautious.

</li><li><strong>Selective information sharing (so the incautious don&#x2019;t catch up). </strong>Sharing some information widely (e.g., technical insights about how to reduce misalignment risk), some selectively (e.g., demonstrations of how powerful and dangerous AI systems might be), and some not at all (e.g., the specific code that, if accessed by a hacker, would allow the hacker to deploy potentially dangerous AI systems themselves).

</li><li><strong>Global monitoring (noticing people about to step on mines, and stopping them). </strong>Working toward worldwide state-led monitoring efforts to identify and prevent &#x201C;incautious&#x201D; projects racing toward deploying dangerous AI systems.

</li><li><strong>Defensive deployment (staying ahead in the race). </strong>Deploying AI systems only when they are unlikely to cause a catastrophe - but also deploying them with urgency once they are safe, in order to help prevent problems from AI systems developed by less cautious actors.
</li>
</ul>
<h3 id="alignment">Alignment (charting a safe path through the minefield<sup id="fnref2"><a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#fn2" rel="footnote">2</a></sup>)</h3>
<p>
I <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/">previously</a> wrote about some of the ways we might reduce the dangers of advanced AI systems. Broadly speaking:
</p>
<ul>

<li>Cautious actors might try to primarily build <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/#limited-ai">limited</a> AI systems - AI systems that lack the kind of <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">ambitious aims that lead to danger</a>. They might ultimately be able to use these AI systems to do things like automating further safety research, making future less-limited systems safer.

</li><li>Cautious actors might use <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/#ai-checks-and-balances">AI checks and balances</a> - that is, using some AI systems to supervise, critique and identify dangerous behavior in others, with special care taken to make it hard for AI systems to coordinate with each other against humans. 

</li><li>Cautious actors might use a variety of other techniques for making AI systems safer - particularly techniques that incorporate &#x201C;<a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/#digital-neuroscience">digital neuroscience</a>,&#x201D; gauging the safety of an AI system by &#x201C;reading its mind&#x201D; rather than simply by watching out for dangerous behavior (the latter might be unreliable, as noted above).
</li>
</ul>
<p>
A key point here is that <strong>making AI systems safe enough to commercialize (with some initial success and profits) could be much less (and different) effort than making them robustly safe (no lurking risk of global catastrophe). </strong>The basic reasons for this are covered in my <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">previous post on difficulties with AI safety research</a> In brief:
</p>
<ul>

<li>If AI systems <em>behave</em> dangerously, we can &#x201C;train out&#x201D; that behavior by providing negative reinforcement for it. 

</li><li>The concern is that when we do this, we might be unwittingly training AI systems to <em>obscure their dangerous potential from humans</em>, and take dangerous actions <em>only when humans would not be able to stop them</em>. (I <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/#The-King-Lear-problem">call this</a> the &#x201C;King Lear problem: it&apos;s hard to know how someone will behave when they have power over you, based only on observing how they behave when they don&apos;t.&#x201D;)

</li><li>So we could end up with AI systems that behave safely and helpfully as far as we can tell in normal circumstances, while ultimately having <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">ambitious, dangerous &#x201C;aims&#x201D;</a> that they pursue when they become powerful enough and have the right opportunities.
</li>
</ul>
<p>
Well-meaning AI companies with active ethics boards might do a lot of AI safety work, by training AIs not to behave in unhelpful or dangerous ways. But if they want to address the risks I&#x2019;m focused on here, this could require safety measures that look very different - e.g., measures more reliant on &#x201C;checks and balances&#x201D; and &#x201C;digital neuroscience.&#x201D;
</p>
<h3 id="threat-assessment">Threat assessment (alerting others about the mines)</h3>


<p>
In addition to <em>making AI systems safer</em>, cautious actors can also put effort into <em>measuring and demonstrating how dangerous they are</em> (or aren&#x2019;t).
</p>
<p>
For the same reasons given in the previous section, it could take special efforts to find and demonstrate the kinds of dangers I&#x2019;ve been discussing. Simply monitoring AI systems in the real world for bad behavior might not do it. It may be necessary to examine (or manipulate) their <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/#digital-neuroscience">digital brains</a>,<sup id="fnref3"><a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#fn3" rel="footnote">3</a></sup> design AI systems <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/#ai-checks-and-balances">specifically to audit other AI systems for signs of danger</a>; deliberately train AI systems to demonstrate particular dangerous patterns (while not being <em>too</em> dangerous!); etc.
</p>
<p>
Learning and demonstrating that the danger is high could help convince many actors to move more slowly and cautiously. Learning that the danger is <em>low</em> could lessen some of the tough tradeoffs here and allow cautious actors to move forward more decisively with developing advanced AI systems; I think this could be a good thing in terms of <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/#the-competition-frame">what sorts of actors lead the way on transformative AI</a>.
</p>
<h3 id="avoiding-races">Avoiding races (to move more cautiously through the minefield)</h3>


<p>
Here&#x2019;s a dynamic I&#x2019;d be sad about:
</p>
<ul>

<li>Company <strong>A </strong>is getting close to building very powerful AI systems. It would love to move slowly and be careful with these AIs, but it worries that if it moves too slowly, Company <strong>B </strong>will get there first, have less caution, and do some combination of &#x201C;causing danger to the world&#x201D; and &#x201C;beating company <strong>A </strong>if the AIs turn out safe.&#x201D;

</li><li>Company <strong>B </strong>is getting close to building very powerful AI systems. It would love to move slowly and be careful with these AIs, but it worries that if it moves too slowly, Company <strong>A </strong>will get there first, have less caution, and do some combination of &#x201C;causing danger to the world&#x201D; and &#x201C;beating company <strong>B </strong>if the AIs turn out safe.&#x201D;
</li>
</ul>
<p>
(Similar dynamics could apply to Country A and B, with national AI development projects.)
</p>
<p>
If Companies A and B would both &#x201C;love to move slowly and be careful&#x201D; if they could, it&#x2019;s a shame that they&#x2019;re both racing to beat each other. Maybe there&#x2019;s a way to avoid this dynamic. For example, perhaps Companies A and B could strike a deal - anything from &#x201C;collaboration and safety-related information sharing&#x201D; to a merger. This could allow both to focus more on precautionary measures rather than on beating the other. Another way to avoid this dynamic is discussed below, under <a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#global-monitoring">standards and monitoring.</a>
</p>
<p>
&#x201C;Finding ways to avoid a furious race&#x201D; is not the kind of dynamic that emerges naturally from markets! In fact, working together along these lines would have to be well-designed to avoid running afoul of antitrust regulation.
</p>
<h3 id="selective-information-sharing">Selective information sharing - including security (so the incautious don&#x2019;t catch up)</h3>


<p>
Cautious actors might want to share certain kinds of information quite widely:
</p>
<ul>

<li>It could be crucial to raise awareness about the dangers of AI (which, as I&#x2019;ve argued, won&#x2019;t necessarily be obvious). 

</li><li>They might also want to widely share information that could be useful for reducing the risks (e.g., <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/">safety techniques</a> that have worked well.)
</li>
</ul>
<p>
At the same time, as long as there are incautious actors out there, information can be dangerous too:
</p>
<ul>

<li>Information about <em>what cutting-edge AI systems can do</em> - especially if it is powerful and impressive - could spur incautious actors to race harder toward developing powerful AI of their own (or give them an idea of <em>how</em> to build powerful systems, by giving them an idea of what sorts of abilities to aim for).

</li><li>An AI&#x2019;s &#x201C;weights&#x201D; (you can think of this sort of like its source code, though not exactly<sup id="fnref4"><a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#fn4" rel="footnote">4</a></sup>) are potentially very dangerous. If hackers (including from a state cyberwarfare program) gain unauthorized access to an AI&#x2019;s weights, this could be tantamount to stealing the AI system, and the actor that steals the system could be much less cautious than the actor who built it. <strong>Achieving a level of cybersecurity that rules this out <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/#fn15">could be</a> extremely difficult,</strong> and potentially well beyond what one would normally aim for in a commercial context.</li></ul>
<p>
The lines between these categories of information might end up fuzzy. Some information might be useful for demonstrating the dangers <em>and</em> capabilities of cutting-edge systems, or useful for making systems safer <em>and</em> for building them in the first place. So there could be a lot of hard judgment calls here.
</p>
<p>
This is another area where I worry that commercial incentives might not be enough on their own. For example, it is usually important for a commercial project to have some reasonable level of security against hackers, but not necessarily for it to be able to resist well-resourced attempts by states to steal its intellectual property. 
</p>
<h3 id="global-monitoring">Global monitoring (noticing people about to step on mines, and stopping them)</h3>


<p>
Ideally, cautious actors would learn of every case where someone is building a dangerous AI system (whether purposefully or unwittingly), and be able to stop the project. If this were done reliably enough, it could take the teeth out of the threat; a partial version could buy time.
</p>
<p>
Here&#x2019;s one vision for how this sort of thing could come about:
</p>
<ul>

<li>We (humanity) develop a reasonable set of tests for whether an AI system might be dangerous.

</li><li>Today&#x2019;s leading AI companies self-regulate by committing not to build or deploy a system that&#x2019;s dangerous according to such a test (e.g., see Google&#x2019;s <a href="https://www.theweek.in/news/sci-tech/2018/06/08/google-wont-deploy-ai-to-build-military-weapons-ichai.html">2018 statement</a>, &quot;We will not design or deploy AI in weapons or other technologies whose principal purpose or implementation is to cause or directly facilitate injury to people&#x201D;). Even if some people at the companies would like to do so, it&#x2019;s hard to pull this off once the company has committed not to.

</li><li>As more AI companies are started, they feel soft pressure to do similar self-regulation, and refusing to do so is off-putting to potential employees, investors, etc.

</li><li>Eventually, similar principles are incorporated into various government regulations and enforceable treaties.

</li><li>Governments could monitor for dangerous projects using regulation and even overseas operations. E.g., today the US monitors (without permission) for various signs that other states might be developing nuclear weapons, and might try to stop such development with methods ranging from threats of sanctions to <a href="https://en.wikipedia.org/wiki/Stuxnet">cyberwarfare</a> or even military attacks. It could do something similar for any AI development projects that are using huge amounts of compute and haven&#x2019;t volunteered information about their safety practices.
</li>
</ul>
<p>
If the situation becomes very dire - i.e., it seems that there&#x2019;s a high risk of dangerous AI being deployed imminently - I see the latter bullet point as one of the main potential hopes. In this case, governments might have to take drastic actions to monitor and stop dangerous projects, based on limited information.
</p>
<h3 id="defensive-deployment">Defensive deployment (staying ahead in the race)</h3>


<p>
I&#x2019;ve emphasized the importance of caution: not deploying AI systems when we can&#x2019;t be confident enough that they&#x2019;re safe. 
</p>
<p>
But when confidence <em>can</em> be achieved (how much confidence? See footnote<sup id="fnref5"><a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#fn5" rel="footnote">5</a></sup>), <strong>powerful-and-safe AI can help reduce risks from other actors </strong>in many possible ways.
</p>
<p>
Some of this would be by helping with all of the above. Once AI systems can do a significant fraction of the things humans can do today, they might be able to contribute to each of the activities I&#x2019;ve listed so far:
</p>
<ul>

<li><strong><a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#alignment">Alignment</a>. </strong>AI systems might be able to contribute to AI safety research (as humans do), producing increasingly robust techniques for reducing risks.

</li><li><strong><a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#threat-assessment">Threat assessment</a></strong>. AI systems could help produce evidence and demonstrations about potential risks. They could be potentially useful for tasks like &#x201C;Produce detailed explanations and demonstrations of possible sequences of events that could lead to AIs doing harm.&#x201D;

</li><li><strong><a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#avoiding-races">Avoiding races</a>. </strong>AI projects might make deals in which e.g. each project is allowed to use its AI systems to monitor for signs of risk from the others (ideally such systems would be designed to <em>only</em> share relevant information).

</li><li><strong><a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#selective-information-sharing">Selective information sharing</a>. </strong>AI systems might contribute to strong security (e.g., by finding and patching security holes), and to dissemination (including by helping to better communicate about the level of risk and the best ways to reduce it).

</li><li><strong><a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#global-monitoring">Global monitoring</a>. </strong>AI systems might be used (e.g., by governments) to monitor for signs of dangerous AI projects worldwide, and even to interfere with such projects. They might also be used as part of large voluntary self-regulation projects, along the lines of what I wrote just above under &#x201C;Avoiding races.&#x201D;
</li>
</ul>
<p>
Additionally, <strong>if safe AI systems are in wide use, it could be harder for dangerous (similarly powerful) AI systems to do harm. </strong>This could be via a wide variety of mechanisms. For example:
</p>
<ul>

<li>If there&#x2019;s widespread use of AI systems to patch and find security holes, similarly powered AI systems might have a harder time finding security holes to <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">cause trouble with</a>.

</li><li>Misaligned AI systems could have more trouble making money, gaining allies, etc. in worlds where they are competing with similarly powerful but safe AI systems.
</li>
</ul>
<h2 id="so">So?</h2>


<p>
I&#x2019;ve gone into some detail about why we might have a challenging situation (&#x201C;racing through a minefield&#x201D;) if powerful AI systems (a) are developed fairly soon; (b) present significant risk of <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">misalignment leading to humanity being defeated</a>; (c) are not particularly easy to measure the safety of.
</p>
<p>
I&#x2019;ve also talked about what I see as some of the key ways that &#x201C;cautious actors&#x201D; concerned about misaligned AI might navigate this situation.
</p>
<p>
I talk about some of the implications in my <a href="https://alignmentforum.org/posts/vZzg8NS7wBtqcwhoJ/nearcast-based-deployment-problem-analysis">more detailed piece</a>. Here I&#x2019;m just going to name a couple of observations that jump out at me from this analysis:
</p>
<p>
<strong>This seems hard. </strong>If we end up in the future envisioned in this piece, I imagine this being extremely stressful and difficult. I&#x2019;m picturing a world in which many companies, and even governments, can see the huge power and profit they might reap from deploying powerful AI systems <em>before others</em> - but we&#x2019;re hoping that they instead move with caution (but not too much caution!), take the kinds of actions described above, and that ultimately cautious actors &#x201C;win the race&#x201D; against less cautious ones.
</p>
<p>
Even if AI alignment ends up being <em>relatively</em> easy - such that a given AI project can make safe, powerful systems with about 10% more effort than making dangerous, powerful systems - the situation <em>still</em> looks pretty nerve-wracking, because of how many different players could end up trying to build systems of their own without putting in that 10%.
</p>
<p>
<strong>A lot of the most helpful actions might be &#x201C;out of the ordinary.&#x201D; </strong>When racing through a minefield, I hope key actors will:
</p>
<ul>

<li>Put more effort into alignment, threat assessment, and security than is required by commercial incentives;

</li><li>Consider measures for <a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#avoiding-races">avoiding races</a> and <a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#global-monitoring">global monitoring</a> that could be very unusual, even unprecedented.

</li><li>Do all of this in the possible presence of ambiguous, confusing information about the risks.
</li>
</ul>
<p>
As such, it could be <strong>very important whether key decision-makers (at both companies and governments) understand the risks and are prepared to act on them. </strong>Currently, I think we&#x2019;re unfortunately very far from a world where this is true.
</p>
<p>
Additionally, I think <strong>AI projects can and should be taking measures <em>today</em> to make unusual-but-important measures more practical in the future. </strong>This could include things like:
</p>
<ul>

<li>Getting practice with <a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#selective-information-sharing">selective information sharing</a>. For example, building internal processes to decide on whether research should be published, rather than having a rule of &#x201C;Publish everything, we&#x2019;re like a research university&#x201D; or &#x201C;Publish nothing, we don&#x2019;t want competitors seeing it.&#x201D;  
<ul>
 
<li>I expect that early attempts at this will often be clumsy and get things wrong! 
</li> 
</ul>

</li><li>Getting practice with ways that <a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#avoiding-races">AI companies could avoid races.</a> 

</li><li>Getting practice with <a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#threat-assessment">threat assessment</a>. Even if today&#x2019;s AI systems don&#x2019;t seem like they could possibly be dangerous yet &#x2026; how sure are we, and how do we know?

</li><li>Prioritizing building AI systems that could do especially helpful things, such as contributing to AI safety research and threat assessment and patching security holes. 

</li><li><strong>Establishing <a href="https://www.cold-takes.com/ideal-governance-for-companies-countries-and-more/">governance</a> that is capable of making hard, non-commercially-optimal decisions for the good of humanity. </strong>A standard corporation could be sued for <em>not</em> deploying AI that poses a risk of <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">global catastrophe</a> - if this means a sacrifice for its bottom line. And a lot of the people making the final call at AI companies might be primarily thinking about their duties to shareholders (or simply unaware of the potential stakes of powerful enough AI systems). I&#x2019;m excited about AI companies that are investing heavily in setting up governance structures - and investing in executives and <a href="https://www.cold-takes.com/nonprofit-boards-are-weird-2/">board members</a> - capable of making the hard calls well.
</li>
</ul>
<!--kg-card-end: html--><!--kg-card-begin: html-->

<p><div style="display:flex; justify-content:center; margin: 0 auto;">

        <span style="margin: 10px;"><a href="https://api.addthis.com/oexchange/0.8/forward/twitter/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fracing-through-a-minefield-the-ai-deployment-problem&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Racing%20through%20a%20minefield%3A%20the%20AI%20deployment%20problem&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-twitter-square.png" border="0" alt="Racing through a minefield: the AI deployment problem"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/facebook/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fracing-through-a-minefield-the-ai-deployment-problem&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Racing%20through%20a%20minefield%3A%20the%20AI%20deployment%20problem&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-facebook-square.png" border="0" alt="Racing through a minefield: the AI deployment problem"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/reddit/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fracing-through-a-minefield-the-ai-deployment-problem&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Racing%20through%20a%20minefield%3A%20the%20AI%20deployment%20problem&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-reddit-square.png" border="0" alt="Racing through a minefield: the AI deployment problem"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/menu/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fracing-through-a-minefield-the-ai-deployment-problem&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Racing%20through%20a%20minefield%3A%20the%20AI%20deployment%20problem&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-addthis-square.png" border="0" alt="Racing through a minefield: the AI deployment problem"></a></span>
        </div>
<center><p id="discuss"><!--<a href="https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem#subscribe" target="_blank"><button id="footer-subscribe" class="button">Subscribe</button></a>&nbsp;<a href="https://www.guidedtrack.com/programs/4kal2ue/run?posttitle=Racing%20through%20a%20minefield%3A%20the%20AI%20deployment%20problem" target="_blank"><button class="button" id="Survey">Feedback</button></a>-->&#xA0;<a href="https://www.lesswrong.com/posts/slug/racing-through-a-minefield-the-ai-deployment-problem#comments" target="_blank"><button class="button">Comment/discuss</button></a></p><p><!--<em>
Use "Feedback" if you have comments/suggestions you want me to see, or if you're up for giving some quick feedback about this post (which I greatly appreciate!) Use "Forum" if you want to discuss this post publicly on the Effective Altruism Forum.
</em>--></p></center>
<!--kg-card-end: html--><!--kg-card-begin: html-->
<hr>
</p><h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol><li id="fn1">
<p>
     Generally, or at least, this is what I&#x2019;d like it to refer to.&#xA0;<a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#fnref1" rev="footnote">&#x21A9;</a><li id="fn2">
<p>
     Thanks to <a href="https://www.cold-takes.com/beta-readers-are-great/">beta reader</a> Ted Sanders for suggesting this analogy in place of the older one, &#x201C;removing mines from the minefield.&#x201D; 

&#xA0;<a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#fnref2" rev="footnote">&#x21A9;</a><li id="fn3">
<p>
     One genre of testing that might be interesting: manipulating an AI system&#x2019;s &#x201C;digital brain&#x201D; in order to <em>simulate</em> circumstances in which it has an opportunity to take over the world, and seeing whether it does so. This could be a way of dealing with the <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/#The-King-Lear-problem">King Lear problem</a>. More <a href="https://www.alignmentforum.org/posts/rCJQAkPTEypGjSJ8X/how-might-we-align-transformative-ai-if-it-s-developed-very#Out_of_distribution_robustness">here</a>.&#xA0;<a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#fnref3" rev="footnote">&#x21A9;</a><li id="fn4">

<p>
     Modern AI systems tend to be trained with <a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment/#Box4">lots of trial-and-error</a>. The actual code that is used to train them might be fairly simple and not very valuable on its own; but an expensive training process then generates a set of &#x201C;weights&#x201D; which are ~all one needs to make a fully functioning, relatively cheap copy of the AI system.&#xA0;<a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#fnref4" rev="footnote">&#x21A9;</a><li id="fn5">
<p>
     I mean, this is part of the challenge. In theory, you should deploy an AI system if the risks of not doing so are greater than the risks of doing so. That&#x2019;s going to depend on hard-to-assess information about how safe your system is <em>and</em> how dangerous and imminent others&#x2019; are, and it&#x2019;s going to be easy to be biased in favor of &#x201C;My systems are safer than others&#x2019;; I should go for it.&#x201D; Seems hard.&#xA0;<a href="https://www.cold-takes.com/p/97d2a7b1-af2d-4dd4-b679-5ea8bb41c47d#fnref5" rev="footnote">&#x21A9;</a>

</p></li></p></li></p></li></p></li></p></li></ol></div>

<!--kg-card-end: html-->]]></content:encoded></item><item><title><![CDATA[High-level hopes for AI alignment]]></title><description><![CDATA[A few ways we might get very powerful AI systems to be safe.]]></description><link>https://www.cold-takes.com/high-level-hopes-for-ai-alignment/</link><guid isPermaLink="false">639783f9ec211f003cdbf041</guid><category><![CDATA[ImplicationsOfMostImportantCentury]]></category><dc:creator><![CDATA[Holden Karnofsky]]></dc:creator><pubDate>Thu, 15 Dec 2022 17:53:43 GMT</pubDate><media:content url="https://www.cold-takes.com/content/images/2022/12/high-level-hopes-rectangle.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html--><img src="https://www.cold-takes.com/content/images/2022/12/high-level-hopes-rectangle.png" alt="High-level hopes for AI alignment"><p><figure><div id="buzzsprout-player-11875637"></div><script src="https://www.buzzsprout.com/1851795/11875637-high-level-hopes-for-ai-aligment.js?container_id=buzzsprout-player-11875637&amp;player=small" type="text/javascript" charset="utf-8"></script><figcaption><em>Click lower right to download or find on Apple Podcasts, Spotify, Stitcher, etc.</em></figcaption></figure></p>
<p>
In previous pieces, I argued that there&apos;s a real and large risk of AI systems&apos; <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">aiming</a> to defeat all of humanity combined - and <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">succeeding</a>. 
</p>
<p>
I first argued that this sort of catastrophe would be likely without specific countermeasures to prevent it. I then argued that countermeasures could be challenging, due to some <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">key difficulties of AI safety research.</a>
</p>
<p>
But while I think misalignment risk is serious and presents major challenges, I don&#x2019;t agree with sentiments<sup id="fnref1"><a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5#fn1" rel="footnote">1</a></sup> along the lines of &#x201C;We haven&#x2019;t figured out how to align an AI, so if transformative AI comes soon, we&#x2019;re doomed.&#x201D; Here I&#x2019;m going to talk about some of my <strong>high-level hopes for how we might end up avoiding this risk. </strong>
</p>
<p>
I&#x2019;ll first recap the challenge, using Ajeya Cotra&#x2019;s <a href="https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/#analogy-the-young-ceo">young businessperson</a> analogy to give a sense of some of the core difficulties. In a nutshell, once AI systems get capable enough, it could be hard to test whether they&#x2019;re safe, because they might be able to deceive and manipulate us into getting the wrong read. Thus, trying to determine whether they&#x2019;re safe might be something like &#x201C;being an eight-year-old trying to decide between adult job candidates (some of whom are manipulative).&#x201D;
</p>
<p>
I&#x2019;ll then go through what I see as three key possibilities for navigating this situation:
</p>
<ul>

<li><strong>Digital neuroscience</strong>: perhaps we&#x2019;ll be able to read (and/or even rewrite) the &#x201C;digital brains&#x201D; of AI systems, so that we can know (and change) what they&#x2019;re &#x201C;aiming&#x201D; to do directly - rather than having to infer it from their behavior. (Perhaps the eight-year-old is a mind-reader, or even a young <a href="https://en.wikipedia.org/wiki/Professor_X#Powers_and_abilities">Professor X</a>.)

</li><li><strong>Limited AI</strong>: perhaps we can make AI systems safe by making them <em>limited</em> in various ways - e.g., by leaving certain kinds of information out of their training, designing them to be &#x201C;myopic&#x201D; (focused on short-run as opposed to long-run goals), or something along those lines. Maybe we can make &#x201C;limited AI&#x201D; that is nonetheless able to carry out particular helpful tasks - such as doing lots more research on how to achieve safety without the limitations. (Perhaps the eight-year-old can limit the authority or knowledge of their hire, and still get the company run successfully.)

</li><li><strong>AI checks and balances</strong>: perhaps we&#x2019;ll be able to employ some AI systems to critique, supervise, and even rewrite others. Even if no single AI system would be safe on its own, the right &#x201C;checks and balances&#x201D; setup could ensure that human interests win out. (Perhaps the eight-year-old is able to get the job candidates to evaluate and critique each other, such that all the eight-year-old needs to do is verify basic factual claims to know who the best candidate is.)
</li>
</ul>
<p>
These are some of the main categories of hopes that are pretty easy to picture today. Further work on AI safety research might result in further ideas (and the above are not exhaustive - see my <a href="https://www.alignmentforum.org/posts/rCJQAkPTEypGjSJ8X/how-might-we-align-transformative-ai-if-it-s-developed-very">more detailed piece</a>, posted to the Alignment Forum rather than Cold Takes, for more).
</p>
<p>
I&#x2019;ll talk about both challenges and reasons for hope here. I think that for the most part, these hopes look much better if AI projects are moving cautiously rather than racing furiously.
</p>
<p>
I don&#x2019;t think we&#x2019;re at the point of having much sense of how the hopes and challenges net out; the best I can do at this point is to <a href="https://www.alignmentforum.org/posts/rCJQAkPTEypGjSJ8X/how-might-we-align-transformative-ai-if-it-s-developed-very#So__would_civilization_survive_">say</a>: &#x201C;I don&#x2019;t currently have much sympathy for someone who&#x2019;s highly confident that AI takeover would or would not happen (that is, for anyone who thinks the odds of AI takeover &#x2026; are under 10% or over 90%).&#x201D;
</p>
<h2 id="the-challenge">The challenge</h2>


<p>
<em>This is all recapping previous pieces. If you remember them super well, skip to the <a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5/#digital-neuroscience">next section</a>.</em>
</p>
<p>
In previous pieces, I argued that:
</p>
<ul>

<li>The coming decades could see the development of AI systems that could automate - and dramatically speed up - scientific and technological advancement, getting us more quickly than most people imagine to a deeply unfamiliar future. (More: <a href="https://www.cold-takes.com/most-important-century/">The Most Important Century</a>)

</li><li>If we develop this sort of AI via ambitious use of the &#x201C;black-box trial-and-error&#x201D; common in AI development today, then there&#x2019;s a substantial risk that: 
<ul>
 
<li>These AIs will develop <strong>unintended aims</strong> (states of the world they make calculations and plans toward, as a chess-playing AI &quot;aims&quot; for checkmate);
 
</li><li>These AIs will deceive, manipulate, and overpower humans as needed to achieve those aims;
 
</li><li>Eventually, this could reach the point where AIs <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">take over the world from humans entirely</a>.
</li> 
</ul>

</li><li>People today are doing AI safety research to prevent this outcome, but such research has a <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">number of deep difficulties:</a>
</li>
</ul>
<p>
<table style="border-collapse: collapse;">
  <tr>
   <td colspan="3" style="border: 1px solid;"><strong>&#x201C;Great news - I&#x2019;ve tested this AI and it looks safe.&#x201D; </strong>Why might we still have a problem?
   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;"><em>Problem</em>
   </td>
   <td style="border: 1px solid;"><em>Key question</em>
   </td>
   <td style="border: 1px solid;"><em>Explanation</em>
   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;">The <strong>Lance Armstrong problem</strong>
   </td>
   <td style="border: 1px solid;">Did we get the AI to be <strong><span style="color:var(--green-color);">actually safe</span></strong> or <strong><span style="color:var(--red-color);">good at hiding its dangerous actions</span>?</strong>
   </td>
  <td style="border: 1px solid;"><p>When dealing with an intelligent agent, it&#x2019;s hard to tell the difference between &#x201C;behaving well&#x201D; and &#x201C;<em>appearing</em> to behave well.&#x201D;</p>
<p>
When professional cycling was cracking down on performance-enhancing drugs, Lance Armstrong was very successful and seemed to be unusually &#x201C;clean.&#x201D; It later came out that he had been using drugs with an unusually sophisticated operation for concealing them.
   </p></td>
  </tr>
  <tr>
   <td style="border: 1px solid;">The <strong>King Lear problem</strong>
   </td>
   <td style="border: 1px solid;"><p>The AI is <strong><span style="color:var(--green-color);">(actually) well-behaved when humans are in control. </span></strong>Will this transfer to <strong><span style="color:var(--red-color);">when AIs are in control</span>?</strong></p>
   </td>
   <td style="border: 1px solid;"><p>It&apos;s hard to know how someone will behave when they have power over you, based only on observing how they behave when they don&apos;t. </p>
<p>
AIs might behave as intended as long as humans are in control - but at some future point, AI systems might be capable and widespread enough to have opportunities to <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">take control of the world entirely</a>. It&apos;s hard to know whether they&apos;ll take these opportunities, and we can&apos;t exactly run a clean test of the situation. 
</p><p>
Like King Lear trying to decide how much power to give each of his daughters before abdicating the throne.
   </p></td>
  </tr>
  <tr>
   <td style="border: 1px solid;">The <strong>lab mice problem</strong>
   </td>
      <td style="border: 1px solid;"><strong><span style="color:var(--green-color);">Today&apos;s &quot;subhuman&quot; AIs are safe.</span></strong>What about <strong><span style="color:var(--red-color);">future AIs with more human-like abilities</span>?</strong>
   </td>
   <td style="border: 1px solid;"><p>Today&apos;s AI systems aren&apos;t advanced enough to exhibit the basic behaviors we want to study, such as deceiving and manipulating humans.</p> 
<p>
Like trying to study medicine in humans by experimenting only on lab mice.
   </p></td>
  </tr>
  <tr>
   <td style="border: 1px solid;">The <strong>first contact problem</strong>
   </td>
   <td style="border: 1px solid;"><p>Imagine that <strong><span style="color:var(--green-color);">tomorrow&apos;s &quot;human-like&quot; AIs are safe.</span></strong> How will things go <strong><span style="color:var(--red-color);">when AIs have capabilities far beyond humans&apos;</span>?</strong></p>
   </td>
   <td style="border: 1px solid;"><p>AI systems might (collectively) become vastly more capable than humans, and it&apos;s ... just really hard to have any idea what that&apos;s going to be like. As far as we know, there has never before been anything in the galaxy that&apos;s vastly more capable than humans in the relevant ways! No matter what we come up with to solve the first three problems, we can&apos;t be too confident that it&apos;ll keep working if AI advances (or just proliferates) a lot more. </p>
<p>
Like trying to plan for first contact with extraterrestrials (this barely feels like an analogy).
   </p></td>
  </tr>
</table>
</p>

<p>
An analogy that incorporates these challenges is Ajeya Cotra&#x2019;s &#x201C;young businessperson&#x201D; <a href="https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/#analogy-the-young-ceo">analogy</a>:
</p>

    <blockquote><p>Imagine you are an eight-year-old whose parents left you a $1 trillion company and no trusted adult to serve as your guide to the world. You must hire a smart adult to run your company as CEO, handle your life the way that a parent would (e.g. decide your school, where you&#x2019;ll live, when you need to go to the dentist), and administer your vast wealth (e.g. decide where you&#x2019;ll invest your money).
</p>
<p>

    You have to hire these grownups based on a work trial or interview you come up with -- you don&apos;t get to see any resumes, don&apos;t get to do reference checks, etc. Because you&apos;re so rich, tons of people apply for all sorts of reasons. (<a href="https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/#analogy-the-young-ceo">More</a>)</p></blockquote>
<p>
If your applicants are a mix of &quot;saints&quot; (people who genuinely want to help), &quot;sycophants&quot; (people who just want to make you happy in the short run, even when this is to your long-term detriment) and &quot;schemers&quot; (people who want to siphon off your wealth and power for themselves), how do you - an eight-year-old - tell the difference?
</p>
<details id="Box1"><summary>(Click to expand) More detail on why AI could make this the most important century<!-- (Details not included in email - <a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5#Box1">click to view on the web</a>)--></summary>
    <div><p>
In <a href="https://www.cold-takes.com/most-important-century/">The Most Important Century</a>, I argued that the 21st century could be the most important century ever for humanity, via the development of advanced AI systems that could dramatically speed up scientific and technological advancement, getting us more quickly than most people imagine to a deeply unfamiliar future.
</p>
<p>
<a href="https://www.cold-takes.com/most-important-century/">This page</a> has a ~10-page summary of the series, as well as links to an audio version, podcasts, and the full series.
</p>
<p>
The key points I argue for in the series are:
</p>
<ul>
<li><strong>The long-run future is radically unfamiliar. </strong>Enough advances in technology could lead to a long-lasting, galaxy-wide civilization that could be a radical utopia, dystopia, or anything in between.
</li><li><strong>The long-run future could come much faster than we think,</strong> due to a possible AI-driven productivity explosion.
</li><li>The relevant kind of <strong>AI looks like it will be developed this century</strong> - making this century the one that will initiate, and have the opportunity to shape, a future galaxy-wide civilization.
</li><li>These claims seem too &quot;wild&quot; to take seriously. But there are a lot of reasons to think that <strong>we live in a wild time, and should be ready for anything.</strong>
</li><li>We, the people living in this century, have the chance to have a huge impact on huge numbers of people to come - if we can make sense of the situation enough to find helpful actions. But right now, <strong>we aren&apos;t ready for this.</strong>
</li>
        </ul></div>
</details>
<details id="Box2"><summary>(Click to expand) Why would AI &quot;aim&quot; to defeat humanity? <!--(Details not included in email - <a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5#Box2">click to view on the web</a>)--></summary>
<div>
<p>
A <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">previous piece</a> argued that if today&#x2019;s AI development methods lead directly to powerful enough AI systems, disaster is likely by default (in the absence of specific countermeasures). 
</p>
<p>
In brief:
</p>
<ul>
<li>Modern AI development is essentially based on &#x201C;training&#x201D; via trial-and-error. 
<p></p>
<p>
<li>If we move forward incautiously and ambitiously with such training, and if it gets us all the way to very powerful AI systems, then such systems will likely end up <em>aiming for certain states of the world</em> (analogously to how a chess-playing AI aims for checkmate).
</li></p>
<p>
<li>And these states will be<em> other than the ones we intended</em>, because our trial-and-error training methods won&#x2019;t be accurate. For example, when we&#x2019;re confused or misinformed about some question, we&#x2019;ll reward AI systems for giving the wrong answer to it - unintentionally training deceptive behavior.
</li></p>
<p>
<li>We should expect disaster if we have AI systems that are both (a) <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">powerful enough</a> to defeat humans and (b) aiming for states of the world that we didn&#x2019;t intend. (&#x201C;Defeat&#x201D; means taking control of the world and doing what&#x2019;s necessary to keep us out of the way; it&#x2019;s unclear to me whether we&#x2019;d be literally killed or just forcibly stopped<sup id="fnref1"><a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5#fn1" rel="footnote">1</a></sup> from changing the world in ways that contradict AI systems&#x2019; aims.)</li></p></li></ul></div>
</details>
<p></p>
<details id="Box3"><summary>(Click to expand) How could AI defeat humanity? <!--(Details not included in email - <a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5#Box3">click to view on the web</a>)--></summary>
<div>
    <p>
In a <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">previous piece</a>, I argue that AI systems could defeat all of humanity combined, if (for whatever reason) they were aimed toward that goal.
</p>
<p>
By defeating humanity, I mean gaining control of the world so that AIs, not humans, determine what happens in it; this could involve killing humans or simply &#x201C;containing&#x201D; us in some way, such that we can&#x2019;t interfere with AIs&#x2019; aims.
</p>
<p>
One way this could happen is if AI became extremely advanced, to the point where it had &quot;cognitive superpowers&quot; beyond what humans can do. In this case, a single AI system (or set of systems working together) could imaginably:
</p>
<ul>
<p></p>
<p>
<li>Do its own research on how to build a better AI system, which culminates in something that has incredible other abilities.
</li></p>
<p>
<li>Hack into human-built software across the world.
</li></p>
<p>
<li>Manipulate human psychology.
</li></p>
<p>
<li>Quickly generate vast wealth under the control of itself or any human allies.
</li></p>
<p>
<li>Come up with better plans than humans could imagine, and ensure that it doesn&apos;t try any takeover attempt that humans might be able to detect and stop.
</li></p>
<p>
<li>Develop advanced weaponry that can be built quickly and cheaply, yet is powerful enough to overpower human militaries.
</li></p>
<p>

</p>
<p>
</p></ul>
<p></p>
<p>
</p><p>
</p>
<p>
However, my piece also explores what things might look like if <em>each AI system basically has similar capabilities to humans. </em>In this case:
</p>
<p>
</p>
<p></p>
<p>
<ul>
</ul></p>
<p>
<li>Humans are likely to deploy AI systems throughout the economy, such that they have large numbers and access to many resources - and the ability to make copies of themselves. 
</li></p>
<p>
<li>From this starting point, AI systems with human-like (or greater) capabilities would have a number of possible ways of getting to the point where their total population could outnumber and/or out-resource humans.
</li></p>
<p>
<li>I address a number of possible objections, such as &quot;How can AIs be dangerous without bodies?&quot;
</li></p>
<p>

</p>
<p>

</p>
<p>
</p><p>
</p>
<p>
More: <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">AI could defeat all of us combined</a></p></div></details>
<p></p>
<p>
</p>
<p></p>
<h2 id="digital-neuroscience">Digital neuroscience</h2>


<p>
I&#x2019;ve previously argued that it could be inherently <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">difficult to measure whether AI systems are safe</a>, for reasons such as: AI systems that are <em>not deceptive </em>probably look like AI systems that are <em>so good at deception that they hide all evidence of it</em>, in any way we can easily measure.<strong> </strong>
</p>
<p>
Unless we can &#x201C;read their minds!&#x201D;
</p>
<p>
Currently, today&#x2019;s leading AI research is in the genre of &#x201C;black-box trial-and-error.&#x201D; An AI tries a task; it gets &#x201C;encouragement&#x201D; or &#x201C;discouragement&#x201D; based on whether it does the task well; it tweaks the wiring of its &#x201C;digital brain&#x201D; to improve next time; it improves at the task; but we humans aren&#x2019;t able to make much sense of its &#x201C;digital brain&#x201D; or say much about its &#x201C;thought process.&#x201D; 
</p>
<details id="Box4"><summary>(Click to expand) Why are AI systems &quot;black boxes&quot; that we can&apos;t understand the inner workings of? <!--(Details not included in email - <a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5#Box4">click to view on the web</a>)--></summary>
<div><p>
What I mean by &#x201C;black-box trial-and-error&#x201D; is explained briefly in an <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/#making-pasta">old Cold Takes post</a>, and in more detail in more technical pieces by <a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to#_HFDT_scales_far__assumption__Alex_is_trained_to_achieve_excellent_performance_on_a_wide_range_of_difficult_tasks">Ajeya Cotra</a> (section I linked to) and <a href="https://arxiv.org/abs/2209.00626">Richard Ngo</a> (section 2). Here&#x2019;s a quick, oversimplified characterization.
</p>
<p>
Today, the most common way of building an AI system is by using an &quot;artificial neural network&quot; (ANN), which you might think of sort of like a &quot;digital brain&quot; that starts in an empty (or random) state: it hasn&apos;t yet been wired to do specific things. A process something like this is followed:
</p>
<ul>

<li>The AI system is given some sort of task.

</li><li>The AI system tries something, initially something pretty random.

</li><li>The AI system gets information about how well its choice performed, and/or what would&#x2019;ve gotten a better result. Based on this, it &#x201C;learns&#x201D; by tweaking the wiring of the ANN (&#x201C;digital brain&#x201D;) - literally by strengthening or weakening the connections between some &#x201C;artificial neurons&#x201D; and others. The tweaks cause the ANN to form a stronger association between the choice it made and the result it got. 

</li><li>After enough tries, the AI system becomes good at the task (it was initially terrible). 

</li><li>But nobody really knows anything about <em>how or why</em> it&#x2019;s good at the task now. The development work has gone into building a flexible architecture for it to learn well from trial-and-error, and into &#x201C;training&#x201D; it by doing all of the trial and error. We mostly can&#x2019;t &#x201C;look inside the AI system to see how it&#x2019;s thinking.&#x201D;

</li><li>For example, if we want to know why a chess-playing AI such as AlphaZero made some particular chess move, we can&apos;t look inside its code to find ideas like &quot;Control the center of the board&quot; or &quot;Try not to lose my queen.&quot; Most of what we see is just a vast set of numbers, denoting the strengths of connections between different artificial neurons. As with a human brain, we can mostly only guess at what the different parts of the &quot;digital brain&quot; are doing.
</li>
    </ul></div>
</details>
<p>
Some AI research (<a href="https://www.transformer-circuits.pub/2022/mech-interp-essay/index.html">example</a>)<sup id="fnref2"><a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5#fn2" rel="footnote">2</a></sup> is exploring how to change this - how to decode an AI system&#x2019;s &#x201C;digital brain.&#x201D; This research is in relatively early stages - today, it can &#x201C;decode&#x201D; only parts of AI systems (or fully decode very small, deliberately simplified AI systems).
</p>
<p>
As AI systems advance, it might get harder to decode them - or easier, if we can start to use AI for help decoding AI, and/or change AI design techniques so that AI systems are less &#x201C;black box&#x201D;-ish. 
</p>
<p>
I think there is a wide range of possibilities here, e.g.:
</p>
<p>
<strong>Failure:</strong> &#x201C;digital brains&#x201D; keep getting bigger, more complex, and harder to make sense of, and so &#x201C;digital neuroscience&#x201D; generally stays about as hard to learn from as human neuroscience. In this world, we wouldn&#x2019;t have anything like &#x201C;lie detection&#x201D; for AI systems engaged in <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#deceiving-and-manipulating">deceptive behavior</a>.
</p>
<p>
<strong>Basic mind-reading: </strong>we&#x2019;re able to get a handle on things like &#x201C;whether an AI system is behaving <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#deceiving-and-manipulating">deceptively</a>, e.g. whether it has internal representations of &#x2018;beliefs&#x2019; about the world that contradict its statements&#x201D; and &#x201C;whether an AI system is <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#what-it-means-for">aiming</a> to accomplish some <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">strange goal we didn&#x2019;t intend it to</a>.&#x201D; 
</p>
<ul>

<li>It may be hard to fix things like this by just continuing trial-and-error-based training (perhaps because we worry that AI systems are manipulating their own &#x201C;digital brains&#x201D; - see later bullet point). 

</li><li>But we&#x2019;d at least be able to get early warnings of potential problems, or early evidence that we <em>don&#x2019;t</em> have a problem, and adjust our level of caution appropriately. This sort of mind-reading could also be helpful with <a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5/#ai-checks-and-balances">AI checks and balances</a> (below).
</li>
</ul>
<p>
<strong>Advanced mind-reading: </strong>we&#x2019;re able to understand an AI system&#x2019;s &#x201C;thought process&#x201D; in detail (what observations and patterns are the main reasons it&#x2019;s behaving as it is), understand how any worrying aspects of this &#x201C;thought process&#x201D; (such as <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">unintended aims</a>) came about, and make lots of small adjustments until we can verify that an AI system is free of <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">unintended aims</a> or <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#deceiving-and-manipulating">deception.</a>
</p>
<p>
<strong>Mind-<em>writing </em>(digital neurosurgery):</strong> we&#x2019;re able to alter a &#x201C;digital brain&#x201D; directly, rather than just via the &#x201C;trial-and-error&#x201D; process discussed <a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5#Box4">earlier.</a>
</p>
<p>
One potential failure mode for digital neuroscience is if AI systems end up able to <em>manipulate their own &#x201C;digital brains.</em>&#x201D; This could lead &#x201C;digital neuroscience&#x201D; to have the same problem as <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">other AI safety research</a>: if we&#x2019;re shutting down or negatively reinforcing AI systems that appear to have unsafe &#x201C;aims&#x201D; based on our &#x201C;mind-reading,&#x201D; we might end up selecting for AI systems whose &#x201C;digital brains&#x201D; only <em>appear</em> safe. 
</p>
<ul>

<li>This could be a real issue, especially if AI systems end up with far-beyond-human capabilities (more below). 

</li><li>But naively, an AI system manipulating its own &#x201C;digital brain&#x201D; to appear safe seems quite a bit harder than simply <em>behaving</em> deceptively. 
</li>
</ul>
<p>
I should note that I&#x2019;m lumping in much of the (hard-to-explain) research on the <a href="https://www.lesswrong.com/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge">Eliciting Latent Knowledge</a> (ELK) agenda under this category.<sup id="fnref3"><a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5#fn3" rel="footnote">3</a></sup> The ELK agenda is largely<sup id="fnref4"><a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5#fn4" rel="footnote">4</a></sup> about thinking through what kinds of &#x201C;digital brain&#x201D; patterns might be associated with honesty vs. deception, and trying to find some impossible-to-fake sign of honesty.
</p>
<p>
<strong>How likely is this to work? </strong>I think it&#x2019;s very up-in-the-air right now. I&#x2019;d say &#x201C;digital neuroscience&#x201D; is a young field, tackling a problem that may or may not prove tractable. If we have several decades before transformative AI, then I&#x2019;d expect to at least succeed at &#x201C;basic mind-reading,&#x201D; whereas if we have less than a decade, I think that&#x2019;s around 50/50. I think it&#x2019;s less likely that we&#x2019;ll succeed at some of the more ambitious goals, but definitely possible.
</p>
<h2 id="limited-ai">Limited AI</h2>


<p>
I <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">previously</a> discussed why AI systems could end up with &#x201C;aims,&#x201D; in the sense that they make calculations, choices and plans selected to reach a particular sort of state of the world. For example, chess-playing AIs &#x201C;aim&#x201D; for checkmate game states; a recommendation algorithm might &#x201C;aim&#x201D; for high customer engagement or satisfaction. I then argued that AI systems would do &#x201C;whatever it takes&#x201D; to get what they&#x2019;re &#x201C;aiming&#x201D; at, even when this means deceiving and disempowering humans.
</p>
<p>
But AI systems won&#x2019;t necessarily have the sorts of &#x201C;aims&#x201D; that risk trouble. Consider two different tasks you might &#x201C;train&#x201D; an AI to do, via trial-and-error (rewarding success at the task):
</p>
<ul>

<li>&#x201C;Write whatever code a particular human would write, if they were in your situation.&#x201D;

</li><li>&#x201C;Write whatever code accomplishes goal X [including coming up with things much better than a human could].&#x201D;
</li>
</ul>
<p>
The second of these seems like a recipe for having the sort of ambitious &#x201C;aim&#x201D; I&#x2019;ve <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">claimed is dangerous</a> - it&#x2019;s an open-ended invitation to do <em>whatever</em> leads to good performance on the goal. By contrast, the first is about imitating a particular human. It leaves a lot less scope for creative, unpredictable behavior and for having <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">&#x201C;ambitious&#x201D; goals that lead to conflict with humans.</a>
</p>
<p>
(For more on this distinction, see my <a href="https://www.alignmentforum.org/posts/rCJQAkPTEypGjSJ8X/how-might-we-align-transformative-ai-if-it-s-developed-very#Limited_AI_systems">discussion of process-based optimization</a>, although I&#x2019;m not thrilled with this and hope to write something better later.)
</p>
<p>
My guess is that in a competitive world, people will be able to get more done, faster, with something like the second approach. But: 
</p>
<ul>

<li>Maybe the first approach will work better <em>at first</em>, and/or AI developers will deliberately stick with the first approach as much as they can for safety reasons.

</li><li>And maybe that will be enough to build AI systems that can, themselves, <a href="https://openai.com/blog/our-approach-to-alignment-research/">do huge amounts of AI alignment research</a> applicable to future, less limited systems. Or enough to build AI systems that can do other useful things, such as creating convincing demonstrations of the risks, patching security holes that dangerous AI systems would otherwise exploit, and more. (More on &#x201C;how safe AIs can protect against dangerous AIs&#x201D; in a future piece.)

</li><li>A risk that would remain: these AI systems might also be able to do huge amounts of research on <em>making AIs bigger and more capable</em>. So simply having &#x201C;AI systems that can do alignment research&#x201D; isn&#x2019;t good enough by itself - we would need to then hope that the leading AI developers prioritize safety research rather than racing ahead with building more powerful systems, up until the point where they can make the more powerful systems safe.
</li>
</ul>
<p>
There are a number of other ways in which we might &#x201C;limit&#x201D; AI systems to make them safe. One can imagine AI systems that are:
</p>
<ul>

<li>&#x201C;Short-sighted&#x201D; or &#x201C;<a href="https://www.alignmentforum.org/posts/LCLBnmwdxkkz5fNvH/open-problems-with-myopia">myopic</a>&#x201D;: they might have &#x201C;aims&#x201D; (<a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#what-it-means-for">see previous post on what I mean by this term</a>) that only apply to their short-run future. So an AI system might be aiming to gain more power, but only over the next few hours; such an AI system wouldn&#x2019;t exhibit some of the behaviors I worry about, such as <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#deceiving-and-manipulating">deceptively behaving in &#x201C;safe&#x201D; seeming ways in hopes of getting more power later</a>.

</li><li>&#x201C;Narrow&#x201D;: they might have only a particular set of capabilities, so that e.g. they can help with AI alignment research but don&#x2019;t understand human psychology and can&#x2019;t <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#deceiving-and-manipulating">deceive and manipulate humans</a>.

</li><li>&#x201C;Unambitious&#x201D;: even if AI systems develop unintended aims, these might be aims they satisfy fairly easily, causing some strange behavior but not aiming to <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">defeat all of humanity</a>.
</li>
</ul>
<p id="Amplification">
A further source of hope: even if such &#x201C;limited&#x201D; systems aren&#x2019;t very powerful on their own, we might be able to <a href="https://www.alignmentforum.org/s/EmDuGeRw749sD3GKd/p/HqLxuZ4LhaFhmAHWk#Core_concept__Analogy_to_AlphaGoZero">amplify</a> them by setting up combinations of AIs that work together on difficult tasks. For example:
</p>
<ul>

<li>One &#x201C;slow but deep&#x201D; AI might do lots of analysis on every action it takes - for example, when it writes a line of code, it might consider hundreds of possibilities for that single line.

</li><li>Another &#x201C;fast and shallow&#x201D; AI might be trained to quickly, efficiently imitate the sorts of actions the &#x201C;slow but deep&#x201D; one takes - writing the sorts of lines of code it produces after considering hundreds of possibilities.

</li><li>Further AIs might be trained to summarize the analysis of other AIs, assign different parts of tasks to different AIs, etc. The result could be something like a &#x201C;team&#x201D; of AIs with different roles, such that a large number of limited AIs ends up quite a lot more powerful (and, depending on the details, also more dangerous) than any of the individual AIs. 
</li>
</ul>
<p>
I&#x2019;d guess that in a competitive world, AI systems that are <em>not</em> &#x201C;limited&#x201D; will - at least eventually - be more powerful, versatile and ultimately useful. But limited AIs might get us pretty far.
</p>
<p>
<strong>How likely is this to work? </strong>I&#x2019;d guess that we&#x2019;ll eventually be able to build very powerful AIs whose limits make them relatively safe. However, I&#x2019;d also guess that AIs without such limits will eventually be more powerful. So I think a lot of how things go will come down to <em>how cautious we are</em>: will we stick with <em>limited </em>AIs until the point at which we make <em>more powerful</em> AIs safe? And I think it&#x2019;s very hard to predict how much caution the world will have - it partly depends on how well-understood the issues discussed in this series become over time!
</p>
<h2 id="ai-checks-and-balances">AI checks and balances</h2>


<p>
<img src="https://www.cold-takes.com/content/images/2022/12/image1.jpg" < p alt="High-level hopes for AI alignment">
</p><p>
Central to my <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">worries</a> is the idea that AIs could be good at &#x201C;deceiving&#x201D; humans: proficiently choosing courses of action that humans don&#x2019;t fully understand, and don&#x2019;t catch the problems with. This is important both for <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">how AIs could develop unintended, &#x201C;dangerous&#x201D; aims in the first place</a> and <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">how they could execute on these aims by defeating humanity</a>.
</p>
<p>
We could potentially mitigate this issue by <strong>using AIs to supervise and critique each other.</strong>
</p>
<p>
A simple example: say we&#x2019;re worried that AI systems might find computer security vulnerabilities, and use them opportunistically to gain power and resources. We could train some AI systems to <em>specifically seek out, expose and patch</em> computer security vulnerabilities. (A footnote explains why we might expect such specialized systems to patch most vulnerabilities they find, rather than exploiting the vulnerabilities as often as less specialized systems would.<sup id="fnref5"><a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5#fn5" rel="footnote">5</a></sup>) 
</p>

<ul>

<li>This is a basic example of &#x201C;using some AIs to identify potential problems other ones could cause.&#x201D; 

</li><li>A nice property: there might be some vulnerabilities that are much easier for AI systems to notice than humans; these are probably the ones we should be most worried about, <em>and</em> the ones most likely to be found by &#x201C;vulnerability-finding AIs.&#x201D;
</li>
</ul>
<p>
    Analogously, we could <a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5#Box4">train</a> AI systems to do things like:
</p>
<ul>

<li>Produce critiques of each others&#x2019; proposed plans, and explanations of how these plans could lead to unintended behavior.

</li><li>Help out with the &#x201C;digital neuroscience&#x201D; goal <a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5/#digital-neuroscience">above</a>: analyze each others&#x2019; &#x201C;digital brains&#x201D; and produce compelling-to-humans discussions of whether there&#x2019;s evidence there about whether the AI systems have <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">unintended, dangerous &#x201C;aims.&#x201D;</a>

</li><li>Conduct randomized audits of each others&#x2019; behavior: looking for any way in which another AI system&#x2019;s behavior might be deceptive or otherwise unintended.  
<ul>
 
<li>Importantly, auditor AIs would have different &#x201C;incentives&#x201D; (they get positive reinforcement for different behaviors), compared to the AIs they&#x2019;re auditing: the audited AI system might be searching for any answer to a question that humans rate highly, whereas the &#x201C;auditor&#x201D; might be <em>specifically</em> searching for compelling arguments that an otherwise-highly-rated-answer is actually misleading.
 
</li><li>I previously <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#Audits">argued</a> that these sorts of audits have limitations - they could end up unwittingly training the audited AI systems to become effective at deceiving human auditors. But in this case, deceiving auditors might be more like deceiving <em>copies of themselves, </em>potentially a harder-to-learn behavior.
</li> 
</ul>
</li> 
</ul>
<p>
There are a lot of potential wrinkles here, which I discuss in a more detailed <a href="https://www.alignmentforum.org/posts/rCJQAkPTEypGjSJ8X/how-might-we-align-transformative-ai-if-it-s-developed-very">non-Cold-Takes piece</a>. In brief:
</p>
<ul>

<li>AI checks and balances might fail if different AI systems end up <em>coordinating with each other against us. </em>For example, it might end up being the case that even auditor AI systems are best able to accomplish their <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">ambitious, unintended aims</a> by refraining from warning humans about the most important problems they find with other AIs.

</li><li>We might try to make this less likely by creating <strong>ensembles of AI systems that &#x201C;check and balance&#x201D; each other, while being different enough from each other to make coordination difficult. </strong> 
<ul>
 
<li>For example, some AI systems might have limitations (<a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5/#limited-ai">see above</a>) that make them less dangerous, while also less capable. 
 
</li><li>In order to pull off deception successfully, more powerful systems would then have to deceive not just humans but a whole variety of other (less capable, but in different ways) AI systems - some of which could be <a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5#Amplification">amplified</a>, as discussed above.
</li> 
</ul>

</li><li>Another way AI checks and balances could fail: when one AI system critiques another, the resulting &#x201C;debate&#x201D; could quickly become too hard for humans to referee effectively. For example, imagine that an <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/#The-young-businessperson">eight-year-old</a> has asked adult A to look for signs that adult B is deceiving them, and adult A is now arguing that this is happening while adult B is arguing that it isn&#x2019;t. Can the eight-year-old figure out what the truth is? Seems genuinely uncertain (and dependent on the details)!
</li>
</ul>
<p>
There is already some research on &#x201C;using AIs to critique each other.&#x201D; A recent example is <a href="https://openai.com/blog/critiques/">this paper</a>, which actually does show that an AI trained to critique its own answers can surface helpful critiques that help humans rate its answers more accurately.
</p>
<h2 id="other-possibilities">Other possibilities</h2>


<p>
I discuss possible hopes in more detail in an <a href="https://www.alignmentforum.org/posts/rCJQAkPTEypGjSJ8X/how-might-we-align-transformative-ai-if-it-s-developed-very">Alignment Forum</a> piece. And I think there is significant scope for &#x201C;unknown unknowns&#x201D;: researchers working on AI safety might come up with approaches that nobody has thought of yet.
</p>
<h2 id="too-weird-too-fast">High-level fear: things get too weird, too fast</h2>


<p>
Rather than end on a positive note, I want to talk about a general dynamic that feels like it could make the situation <em>very</em> difficult, and make it hard for any of the above hopes to work out.
</p>
<p>
To quote from my <a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure/">previous piece</a>:
</p>
<p>
Maybe at some point, AI systems will be able to do things like:
</p>
<ul>

<li>Coordinate with each other incredibly well, such that it&apos;s hopeless to use one AI to help supervise another.

</li><li>Perfectly understand human thinking and behavior, and know exactly what words to say to make us do what they want - so just letting an AI send emails or write Tumblr posts gives it vast power over the world.

</li><li>Manipulate their own &quot;digital brains,&quot; so that our <a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5/#digital-neuroscience">attempts to &quot;read their minds&quot; </a>backfire and mislead us.

</li><li>Reason about the world (that is, <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#what-it-means-for">make plans to accomplish their aims</a>) in completely different ways from humans, with concepts like &quot;glooble&quot;<sup id="fnref6"><a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5#fn6" rel="footnote">6</a></sup> that are incredibly useful ways of thinking about the world but that humans couldn&apos;t understand with centuries of effort.
</li></ul>
<p>
At this point, whatever methods we&apos;ve developed for making human-like AI systems safe, honest and restricted could fail - and silently, as such AI systems could go from &quot;being honest and helpful&quot; to &quot;appearing honest and helpful, while setting up opportunities to <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">defeat humanity</a>.&quot;
</p>
<p>
I&#x2019;m not wedded to any of the details above, but I think the general dynamic in which &#x201C;AI systems get extremely powerful, strange, and hard to deal with very quickly&#x201D; could happen for a few different reasons:
</p>
<ul>

<li>The nature of AI development might just be such that we very quickly go from having very weak AI systems to having &#x201C;superintelligent&#x201D; ones. How likely this is has been debated a lot.<sup id="fnref7"><a href="https://www.cold-takes.com/p/51b33fd6-2f1e-40bd-9d2c-2cfe2ebd5fc5#fn7" rel="footnote">7</a></sup>

</li><li>Even if AI improves relatively slowly, we might <em>initially</em> have a lot of success with things like &#x201C;AI checks and balances,&#x201D; but continually make more and more capable AI systems - such that they eventually become extraordinarily capable and very &#x201C;alien&#x201D; to us, at which point previously-effective methods break down. (<a href="https://www.alignmentforum.org/posts/rCJQAkPTEypGjSJ8X/how-might-we-align-transformative-ai-if-it-s-developed-very#DecisiveDoesntExist">More</a>)

</li><li>The most likely reason this would happen, in my view, is that <strong>we - humanity - choose to move too fast. </strong>It&#x2019;s easy to envision a world in which everyone is in a furious race to develop more powerful AI systems than everyone else - focused on &#x201C;competition&#x201D; rather than &#x201C;caution&#x201D; (more on the distinction <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/">here</a>) - and everything <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/#explosive-scientific-and-technological-advancement">accelerates dramatically</a> once we&#x2019;re able to use AI systems to automate scientific and technological advancement.</li></ul>
<h2 id="bottom-line">So &#x2026; is AI going to defeat humanity or is everything going to be fine?</h2>


<p>
I don&#x2019;t know! There are a number of ways we might be fine, and a number of ways we might not be. I could easily see this century ending in <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">humans defeated</a> or in a glorious <a href="https://www.cold-takes.com/visualizing-utopia/">utopia</a>. You could maybe even think of it as the most important century.
</p>
<p>
So far, I&#x2019;ve mostly just talked about the technical challenges of AI alignment: why AI systems might end up misaligned, and how we might design them to avoid that outcome. In future pieces, I&#x2019;ll go into a bit more depth on some of the political and strategic challenges (e.g., what AI companies and governments might do to reduce the risk of a furious race to deploy dangerous AI systems), and work my way toward the question: &#x201C;What can we do today to improve the odds that things go well?&#x201D;
</p>

<!-- Footnotes themselves at the bottom. -->

<!--kg-card-end: html--><!--kg-card-begin: html-->

<p><div style="display:flex; justify-content:center; margin: 0 auto;">

        <span style="margin: 10px;"><a href="https://api.addthis.com/oexchange/0.8/forward/twitter/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fhigh-level-hopes-for-ai-alignment&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20High-level%20hopes%20for%20AI%20alignment&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-twitter-square.png" border="0" alt="High-level hopes for AI alignment"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/facebook/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fhigh-level-hopes-for-ai-alignment&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20High-level%20hopes%20for%20AI%20alignment&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-facebook-square.png" border="0" alt="High-level hopes for AI alignment"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/reddit/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fhigh-level-hopes-for-ai-alignment&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20High-level%20hopes%20for%20AI%20alignment&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-reddit-square.png" border="0" alt="High-level hopes for AI alignment"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/menu/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fhigh-level-hopes-for-ai-alignment&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20High-level%20hopes%20for%20AI%20alignment&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-addthis-square.png" border="0" alt="High-level hopes for AI alignment"></a></span>
        </div>
<center><p id="discuss"><!--<a href="https://www.cold-takes.com/high-level-hopes-for-ai-alignment#subscribe" target="_blank"><button id="footer-subscribe" class="button">Subscribe</button></a>&nbsp;<a href="https://www.guidedtrack.com/programs/4kal2ue/run?posttitle=High-level%20hopes%20for%20AI%20alignment" target="_blank"><button class="button" id="Survey">Feedback</button></a>-->&#xA0;<a href="https://www.lesswrong.com/posts/slug/high-level-hopes-for-ai-alignment#comments" target="_blank"><button class="button">Comment/discuss</button></a></p><p><!--<em>
Use "Feedback" if you have comments/suggestions you want me to see, or if you're up for giving some quick feedback about this post (which I greatly appreciate!) Use "Forum" if you want to discuss this post publicly on the Effective Altruism Forum.
</em>--></p></center>
<!--kg-card-end: html--><!--kg-card-begin: html--><hr>
</p><h2>Footnotes</h2>
<div class="footnotes">

<ol><li id="fn1">
<p>
     <a href="https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities">E.g.</a>&#xA0;<a href="#fnref1" rev="footnote">&#x21A9;</a><li id="fn2">
<p>
     Disclosure: my wife Daniela is President and co-founder of Anthropic, which employs prominent researchers in &#x201C;mechanistic interpretability&#x201D; and hosts the site I link to for the term.&#xA0;<a href="#fnref2" rev="footnote">&#x21A9;</a><li id="fn3">
<p>
     Disclosure: I&#x2019;m on the board of <a href="https://alignment.org/">ARC</a>, which wrote this document.&#xA0;<a href="#fnref3" rev="footnote">&#x21A9;</a><li id="fn4">
<p>
     Though not entirely&#xA0;<a href="#fnref4" rev="footnote">&#x21A9;</a><li id="fn5">
<p>
     The basic idea:
<ul>
	<li>A lot of security vulnerabilities might be the kind of thing where it&#x2019;s clear that there&#x2019;s some weakness in the system, but it&#x2019;s not immediately clear how to exploit this for gain. An AI system with an <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">unintended &#x201C;aim&#x201D;</a> might therefore &#x201C;save&#x201D; knowledge about the vulnerability until it encounters enough <em>other</em> vulnerabilities, and the right circumstances, to accomplish its aim.

	</li><li>But now imagine an AI system that is trained and rewarded <em>exclusively</em> for finding and patching such vulnerabilities. Unlike with the first system, revealing the vulnerability gets more positive reinforcement than just about <em>anything else it can do</em> (and an AI that reveals no such vulnerabilities will perform extremely poorly). It thus might be much more likely than the previous system to do so, rather than simply leaving the vulnerability in place in case it&#x2019;s useful later.

	</li><li>And now imagine that there are <em>multiple</em> AI systems trained and rewarded for finding and patching such vulnerabilities, with each one needing to find some vulnerability overlooked by others in order to achieve even moderate performance. These systems might also have enough variation that it&#x2019;s hard for one such system to confidently predict what another will do, which could further lower the gains to leaving the vulnerability in place.    
&#xA0;<a href="#fnref5" rev="footnote">&#x21A9;</a></li></ul><li id="fn6">This is a concept that only I understand. &#xA0;<a href="#fnref6" rev="footnote">&#x21A9;</a></li><li id="fn7">

<p>
     See <a href="https://www.alignmentforum.org/s/v55BhXbpJuaExkpcD/p/GNhMPAWcfBCASy8e6">here</a>, <a href="https://www.alignmentforum.org/s/n945eovrA3oDueqtq/p/hwxj4gieR7FWNwYfa">here</a>, and <a href="https://www.alignmentforum.org/s/n945eovrA3oDueqtq/p/vwLxd6hhFvPbvKmBH">here</a>. Also see the tail end of <a href="https://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html">this Wait but Why piece</a>, which draws on similar intuitions to the longer treatment in <a href="https://smile.amazon.com/Superintelligence-Dangers-Strategies-Nick-Bostrom-ebook/dp/B00LOOCGB2/">Superintelligence</a>&#xA0;<a href="#fnref7" rev="footnote">&#x21A9;</a>

</p></li></p></li></p></li></p></li></p></li></p></li></ol></div><!--kg-card-end: html-->]]></content:encoded></item><item><title><![CDATA[AI Safety Seems Hard to Measure]]></title><description><![CDATA[Four analogies for why "We don't see any misbehavior by this AI" isn't enough.]]></description><link>https://www.cold-takes.com/ai-safety-seems-hard-to-measure/</link><guid isPermaLink="false">63900818ec211f003cdbdf5a</guid><category><![CDATA[ImplicationsOfMostImportantCentury]]></category><dc:creator><![CDATA[Holden Karnofsky]]></dc:creator><pubDate>Thu, 08 Dec 2022 19:45:44 GMT</pubDate><media:content url="https://www.cold-takes.com/content/images/2022/12/ai-safety-seems-hard-to-measure-3.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html--><img src="https://www.cold-takes.com/content/images/2022/12/ai-safety-seems-hard-to-measure-3.png" alt="AI Safety Seems Hard to Measure"><p><figure><div id="buzzsprout-player-11838542"></div><script src="https://www.buzzsprout.com/1851795/11838542-ai-safety-seems-hard-to-measure.js?container_id=buzzsprout-player-11838542&amp;player=small" type="text/javascript" charset="utf-8"></script><figcaption><em>Click lower right to download or find on Apple Podcasts, Spotify, Stitcher, etc.</em></figcaption></figure></p>


<p>
  
In previous pieces, I argued that there&apos;s a real and large risk of AI systems&apos; <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">developing dangerous goals of their own</a> and <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">defeating all of humanity</a> - at least in the absence of specific efforts to prevent this from happening.
</p>
<p>
A young, growing field of <strong>AI safety research</strong> tries to reduce this risk, by finding ways to ensure that AI systems behave as intended (rather than forming ambitious aims of their own and deceiving and manipulating humans as needed to accomplish them). 
</p>
<p>
Maybe we&apos;ll succeed in reducing the risk, and maybe we won&apos;t. <strong>Unfortunately, I think it could be hard to know either way</strong>. This piece is about four fairly distinct-seeming reasons that this could be the case - and that AI safety could be an unusually difficult sort of science.
</p>
<p>
This piece is aimed at a broad audience, because I think it&apos;s <strong>important for the challenges here to be broadly understood. </strong>I expect powerful, dangerous AI systems to have a lot of benefits (commercial, military, etc.), and to potentially <em>appear</em> safer than they are - so I think it will be hard to be as cautious about AI as we should be. I think our odds look better if many people understand, at a high level, some of the challenges in knowing whether AI systems are as safe as they appear.
</p>
<p>
First, I&apos;ll recap the basic challenge of AI safety research, and outline what I <em>wish</em> AI safety research could be like. I wish it had this basic form: &quot;Apply a test to the AI system. If the test goes badly, try another AI development method and test that. If the test goes well, we&apos;re probably in good shape.&quot; I think car safety research mostly looks like this; I think AI <em>capabilities</em> research mostly looks like this.
</p>
<p>
Then, I&#x2019;ll give four reasons that <strong>apparent success in AI safety can be misleading. </strong>
</p><!--
<p>
<img src="https://www.cold-takes.com/content/images/2022/12/Screen-Shot-2022-12-06-at-10.37-1--1-.png">
</p>-->
<p>
<table style="border-collapse: collapse;">
  <tr>
   <td colspan="3" style="border: 1px solid;"><strong>&#x201C;Great news - I&#x2019;ve tested this AI and it looks safe.&#x201D; </strong>Why might we still have a problem?
   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;"><em>Problem</em>
   </td>
   <td style="border: 1px solid;"><em>Key question</em>
   </td>
   <td style="border: 1px solid;"><em>Explanation</em>
   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;">The <strong>Lance Armstrong problem</strong>
   </td>
   <td style="border: 1px solid;">Did we get the AI to be <strong><span style="color:var(--green-color);">actually safe</span></strong> or <strong><span style="color:var(--red-color);">good at hiding its dangerous actions</span>?</strong>
   </td>
  <td style="border: 1px solid;"><p>When dealing with an intelligent agent, it&#x2019;s hard to tell the difference between &#x201C;behaving well&#x201D; and &#x201C;<em>appearing</em> to behave well.&#x201D;</p>
<p>
When professional cycling was cracking down on performance-enhancing drugs, Lance Armstrong was very successful and seemed to be unusually &#x201C;clean.&#x201D; It later came out that he had been using drugs with an unusually sophisticated operation for concealing them.
   </p></td>
  </tr>
  <tr>
   <td style="border: 1px solid;">The <strong>King Lear problem</strong>
   </td>
   <td style="border: 1px solid;"><p>The AI is <strong><span style="color:var(--green-color);">(actually) well-behaved when humans are in control. </span></strong>Will this transfer to <strong><span style="color:var(--red-color);">when AIs are in control</span>?</strong></p>
   </td>
   <td style="border: 1px solid;"><p>It&apos;s hard to know how someone will behave when they have power over you, based only on observing how they behave when they don&apos;t. </p>
<p>
AIs might behave as intended as long as humans are in control - but at some future point, AI systems might be capable and widespread enough to have opportunities to <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">take control of the world entirely</a>. It&apos;s hard to know whether they&apos;ll take these opportunities, and we can&apos;t exactly run a clean test of the situation. 
</p><p>
Like King Lear trying to decide how much power to give each of his daughters before abdicating the throne.
   </p></td>
  </tr>
  <tr>
   <td style="border: 1px solid;">The <strong>lab mice problem</strong>
   </td>
      <td style="border: 1px solid;"><strong><span style="color:var(--green-color);">Today&apos;s &quot;subhuman&quot; AIs are safe.</span></strong>What about <strong><span style="color:var(--red-color);">future AIs with more human-like abilities</span>?</strong>
   </td>
   <td style="border: 1px solid;"><p>Today&apos;s AI systems aren&apos;t advanced enough to exhibit the basic behaviors we want to study, such as deceiving and manipulating humans.</p> 
<p>
Like trying to study medicine in humans by experimenting only on lab mice.
   </p></td>
  </tr>
  <tr>
   <td style="border: 1px solid;">The <strong>first contact problem</strong>
   </td>
   <td style="border: 1px solid;"><p>Imagine that <strong><span style="color:var(--green-color);">tomorrow&apos;s &quot;human-like&quot; AIs are safe.</span></strong> How will things go <strong><span style="color:var(--red-color);">when AIs have capabilities far beyond humans&apos;</span>?</strong></p>
   </td>
   <td style="border: 1px solid;"><p>AI systems might (collectively) become vastly more capable than humans, and it&apos;s ... just really hard to have any idea what that&apos;s going to be like. As far as we know, there has never before been anything in the galaxy that&apos;s vastly more capable than humans in the relevant ways! No matter what we come up with to solve the first three problems, we can&apos;t be too confident that it&apos;ll keep working if AI advances (or just proliferates) a lot more. </p>
<p>
Like trying to plan for first contact with extraterrestrials (this barely feels like an analogy).
   </p></td>
  </tr>
</table>
</p>

<p>
I&apos;ll close with Ajeya Cotra&apos;s &quot;<a href="https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/#analogy-the-young-ceo">young businessperson</a>&quot; analogy, which in some sense ties these concerns together. A future piece will discuss some reasons for hope, despite these problems.
</p>
<h2 id="Recap-of-the-basic-challenge">Recap of the basic challenge</h2>


<p>
A <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">previous piece</a> laid out the basic case for concern about AI misalignment. In brief: if extremely capable AI systems are developed using methods like the ones AI developers use today, it seems like there&apos;s a substantial risk that:
</p>
<ul>

<li>These AIs will develop <strong>unintended aims</strong> (states of the world they make calculations and plans toward, as a chess-playing AI &quot;aims&quot; for checkmate);

</li><li>These AIs will deceive, manipulate, and overpower humans as needed to achieve those aims;

</li><li>Eventually, this could reach the point where AIs <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">take over the world from humans entirely</a>.
</li>
</ul>
<p>
I see <strong>AI safety research</strong> as trying to <strong>design AI systems that won&apos;t <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#what-it-means-for">aim</a> to deceive, manipulate or defeat humans - even if and when these AI systems are extraordinarily capable</strong> (and would be very effective at deception/manipulation/defeat if they were to aim at it).<strong> </strong>That is: AI safety research is trying to reduce the risk of the above scenario, <em>even if</em> (as I&apos;ve assumed) humans rush forward with training powerful AIs to do ever-more ambitious things.
</p>
<details id="Box1"><summary>(Click to expand) More detail on why AI could make this the most important century <!--(Details not included in email - <a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#Box1">click to view on the web</a>)--></summary><div>
<p>
In <a href="https://www.cold-takes.com/most-important-century/">The Most Important Century</a>, I argued that the 21st century could be the most important century ever for humanity, via the development of advanced AI systems that could dramatically speed up scientific and technological advancement, getting us more quickly than most people imagine to a deeply unfamiliar future.
</p>
<p>
<a href="https://www.cold-takes.com/most-important-century/">This page</a> has a ~10-page summary of the series, as well as links to an audio version, podcasts, and the full series.
</p>
<p>
The key points I argue for in the series are:
</p>
<ul>

<li><strong>The long-run future is radically unfamiliar. </strong>Enough advances in technology could lead to a long-lasting, galaxy-wide civilization that could be a radical utopia, dystopia, or anything in between.

</li><li><strong>The long-run future could come much faster than we think,</strong> due to a possible AI-driven productivity explosion.

</li><li>The relevant kind of <strong>AI looks like it will be developed this century</strong> - making this century the one that will initiate, and have the opportunity to shape, a future galaxy-wide civilization.

</li><li>These claims seem too &quot;wild&quot; to take seriously. But there are a lot of reasons to think that <strong>we live in a wild time, and should be ready for anything.</strong>

</li><li>We, the people living in this century, have the chance to have a huge impact on huge numbers of people to come - if we can make sense of the situation enough to find helpful actions. But right now, <strong>we aren&apos;t ready for this.</strong>
</li>
</ul>
    </div></details><details id="Box2"><summary>(Click to expand) Why would AI &quot;aim&quot; to defeat humanity?<!-- (Details not included in email - <a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#Box2">click to view on the web</a>)--></summary>
<div><p>
A <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">previous piece</a> argued that if today&#x2019;s AI development methods lead directly to powerful enough AI systems, disaster is likely by default (in the absence of specific countermeasures). 
</p>
<p>
In brief:
</p>
<ul>

<li>Modern AI development is essentially based on &#x201C;training&#x201D; via trial-and-error. 

</li><li>If we move forward incautiously and ambitiously with such training, and if it gets us all the way to very powerful AI systems, then such systems will likely end up <em>aiming for certain states of the world</em> (analogously to how a chess-playing AI aims for checkmate)<em>.</em>

</li><li>And these states will be<em> other than the ones we intended</em>, because our trial-and-error training methods won&#x2019;t be accurate. For example, when we&#x2019;re confused or misinformed about some question, we&#x2019;ll reward AI systems for giving the wrong answer to it - unintentionally training deceptive behavior.

</li><li>We should expect disaster if we have AI systems that are both (a) <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">powerful enough</a> to defeat humans and (b) aiming for states of the world that we didn&#x2019;t intend. (&#x201C;Defeat&#x201D; means taking control of the world and doing what&#x2019;s necessary to keep us out of the way; it&#x2019;s unclear to me whether we&#x2019;d be literally killed or just forcibly stopped<sup id="fnref1"><a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#fn1" rel="footnote">1</a></sup> from changing the world in ways that contradict AI systems&#x2019; aims.)</li></ul>
        </div></details>
<details id="Box3"><summary>(Click to expand) <em>How</em> could AI defeat humanity?<!-- (Details not included in email - <a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#Box3">click to view on the web</a>)--></summary>
<div>
    <p>
In a <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">previous piece</a>, I argue that AI systems could defeat all of humanity combined, if (for whatever reason) they were aimed toward that goal.
</p>
<p>
By defeating humanity, I mean gaining control of the world so that AIs, not humans, determine what happens in it; this could involve killing humans or simply &#x201C;containing&#x201D; us in some way, such that we can&#x2019;t interfere with AIs&#x2019; aims.
</p>
<p>
One way this could happen is if AI became extremely advanced, to the point where it had &quot;cognitive superpowers&quot; beyond what humans can do. In this case, a single AI system (or set of systems working together) could imaginably:
</p>
<ul>

<li>Do its own research on how to build a better AI system, which culminates in something that has incredible other abilities.

</li><li>Hack into human-built software across the world.

</li><li>Manipulate human psychology.

</li><li>Quickly generate vast wealth under the control of itself or any human allies.

</li><li>Come up with better plans than humans could imagine, and ensure that it doesn&apos;t try any takeover attempt that humans might be able to detect and stop.

</li><li>Develop advanced weaponry that can be built quickly and cheaply, yet is powerful enough to overpower human militaries.
</li>
</ul>
<p>
However, my piece also explores what things might look like if <em>each AI system basically has similar capabilities to humans. </em>In this case:
</p>
<ul>

<li>Humans are likely to deploy AI systems throughout the economy, such that they have large numbers and access to many resources - and the ability to make copies of themselves. 

</li><li>From this starting point, AI systems with human-like (or greater) capabilities would have a number of possible ways of getting to the point where their total population could outnumber and/or out-resource humans.

</li><li>I address a number of possible objections, such as &quot;How can AIs be dangerous without bodies?&quot;
</li>
</ul>
<p>
More: <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">AI could defeat all of us combined</a></p></div></details>
<h2 id="I-wish-AI-safety-research-were-straightforward">I wish AI safety research were straightforward</h2>


<p>
I wish AI safety research were like car safety research.<sup id="fnref2"><a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#fn2" rel="footnote">2</a></sup>
</p>
<p>
While I&apos;m sure this is an oversimplification, I think a lot of car safety research looks basically like this:
</p>
<ul>

<li>Companies carry out test crashes with test cars. The results give a pretty good (not perfect) indication of what would happen in a real crash.

</li><li>Drivers try driving the cars in low-stakes areas without a lot of traffic. Things like steering wheel malfunctions will probably show up here; if they don&apos;t and drivers are able to drive normally in low-stakes areas, it&apos;s probably safe to drive the car in traffic.

</li><li>None of this is perfect, but the occasional problem isn&apos;t, so to speak, <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">the end of the world</a>. The worst case tends to be a handful of accidents, followed by a recall and some changes to the car&apos;s design validated by further testing.
</li>
</ul>
<p>
Overall, <strong>if we have problems with car safety, we&apos;ll probably be able to observe them relatively straightforwardly under relatively low-stakes circumstances.</strong>
</p>
<p>
In important respects, many types of research and development have this basic property: we can observe how things are going during testing to get good evidence about how they&apos;ll go in the real world. Further examples include medical research,<sup id="fnref3"><a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#fn3" rel="footnote">3</a></sup> chemistry research,<sup id="fnref4"><a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#fn4" rel="footnote">4</a></sup> software development,<sup id="fnref5"><a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#fn5" rel="footnote">5</a></sup> etc. 
</p>
<p>
<strong>Most AI research looks like this as well. </strong>People can test out what an AI system is capable of reliably doing (e.g., translating speech to text), before integrating it into some high-stakes commercial product like Siri. This works both for ensuring that the AI system is <em>capable</em> (e.g., that it does a good job with its tasks) and that it&apos;s <em>safe in certain ways</em> (for example, if we&apos;re worried about toxic language, testing for this is relatively straightforward).
</p>
<p>
The rest of this piece will be about some of the ways in which &quot;testing&quot; for AI safety <strong>fails to give us straightforward observations about whether, once AI systems are deployed in the real world, the world will actually be safe.</strong>
</p>
<p>
While all research has to deal with <em>some</em> differences between testing and the real world, I think the challenges I&apos;ll be going through are unusual ones.
</p>
<h2 id="Four problems">Four problems</h2>


<h3 id="The-Lance-Armstrong-Problem">(1) The Lance Armstrong problem: is the AI <em>actually safe</em> or <em>good at hiding its dangerous actions</em>?</h3>
<p><center><img src="https://www.cold-takes.com/content/images/size/w1000/2022/12/YJPARMSTRONG1-superJumbo.jpg" width="400" alt="AI Safety Seems Hard to Measure"></center></p>

<p>
First, let&apos;s imagine that:
</p>
<ul>

<li>We have AI systems available that can do roughly everything a human can, with some different strengths and weaknesses but no huge difference in &quot;overall capabilities&quot; or economic value per hour of work. 

</li><li>We&apos;re observing early signs that AI systems behave in unintended, deceptive ways, such as giving wrong answers to questions we ask, or writing software that falsifies metrics instead of doing the things the metrics were supposed to measure (e.g., software meant to make a website run faster might instead falsify metrics about its loading time).
</li>
</ul>
<p>
We theorize that modifying the AI training in some way<sup id="fnref6"><a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#fn6" rel="footnote">6</a></sup> will make AI systems less likely to behave deceptively. We try it out, and find that, in fact, our AI systems seem to be behaving better than before - we are finding fewer incidents in which they behaved in unintended or deceptive ways. 
</p>
<p>
But that&apos;s just a statement about <em>what we&apos;re noticing</em>. Which of the following just happened:
</p>
<ul>

<li>Did we just train our AI systems to be less deceptive?

</li><li>Did we just train our AI systems to be <em>better at</em> deception, and so to make us <em>think</em> they became less deceptive?

</li><li>Did we just train our AI systems to be better at calculating when they might get caught in deception, and so to be less deceptive only when the deception would otherwise be caught? 
<ul>
 
<li>This one could be useful! Especially if we&apos;re able to set up auditing systems in many real-world situations, such that we <em>could</em> expect deception to be caught a lot of the time. But it does leave open the <a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#The-King-Lear-problem">King Lear problem</a> covered next.
    
</li> 
</ul>
</li> 
</ul>
<p>
(...Or some combination of the three?)
</p>
<p>
We&apos;re hoping to be able to deploy AI systems throughout the economy, so - just like human specialists - they will almost certainly have some opportunities to be deceptive without being caught. The fact that they <em>appear honest in our testing</em> is not clear comfort against this risk.
</p>
<p>
The analogy here is to competitive cyclist <a href="https://en.wikipedia.org/wiki/Lance_Armstrong">Lance Armstrong</a>. Armstrong won the Tour de France race 7 times in a row, while many of his competitors were caught using performance-enhancing drugs and disqualified. But more than 5 years after his last win, an investigation &quot;concluded that Armstrong had used performance-enhancing drugs over the course of his career[5] and named him as the ringleader of &apos;the most sophisticated, professionalized and successful doping program that sport has ever seen&apos;.&quot; Now the list of Tour de France winners looks like this:
</p>
<p>

<img src="https://www.cold-takes.com/content/images/size/w1600/2022/12/image1.png" width alt="AI Safety Seems Hard to Measure" title="image_tooltip">

</p>
<p>
A broader issue here is that <strong>when AI systems become capable enough, AI safety research starts to look more like social sciences (studying human beings) than like natural sciences. </strong>Social sciences are generally less rigorous and harder to get clean results from, and one factor in this is that it can be hard to study someone who&apos;s aware they&apos;re being studied.<sup id="fnref7"><a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#fn7" rel="footnote">7</a></sup>
</p>
<p>
Two broad categories of research that might help with the Lance Armstrong problem:
</p>
<ul>
<li id="DigitalNeuroscience"><a href="https://www.transformer-circuits.pub/2022/mech-interp-essay/index.html">Mechanistic interpretability</a><sup id="fnref8"><a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#fn8" rel="footnote">8</a></sup> can be thought of analyzing the &quot;digital brains&quot; of AI systems (not just analyzing their behavior and performance.) Currently, AI systems are <a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#Box4">black boxes</a> in the sense that they perform well on tasks, but we can&apos;t say much about <em>how</em> they are doing it; mechanistic interpretability aims to change this, which could give us the ability to &quot;mind-read&quot; AIs and detect deception. (There could still be a risk that AI systems are arranging their own &quot;digital brains&quot; in misleading ways, but this seems quite a bit harder than simply <em>behaving</em> deceptively.)
</li><li>Some researchers work on &quot;scalable supervision&quot; or &quot;competitive supervision.&quot; The idea is that if we are training an AI system that might become deceptive, we set up some supervision process for it that we expect to reliably catch any attempts at deception. This could be because the supervision process itself uses AI systems with more resources than the one being supervised, or because it uses a system of randomized audits where extra effort is put into catching deception.
    </li></ul>
<details id="Box4"><summary>(Click to expand) Why are AI systems &quot;black boxes&quot; that we can&apos;t understand the inner workings of?<!-- (Details not included in email - <a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#Box4">click to view on the web</a>)--></summary>
<div><p>
I explain this briefly in an <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/#making-pasta">old Cold Takes post</a>; it&apos;s explained in more detail in more technical pieces by <a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to#_HFDT_scales_far__assumption__Alex_is_trained_to_achieve_excellent_performance_on_a_wide_range_of_difficult_tasks">Ajeya Cotra</a> (section I linked to) and <a href="https://drive.google.com/file/d/1TsB7WmTG2UzBtOs349lBqY5dEBaxZTzG/view">Richard Ngo</a> (section 2). 
</p>
<p>
What I mean by &#x201C;black-box trial-and-error&#x201D; is explained briefly in an <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/#making-pasta">old Cold Takes post</a>, and in more detail in more technical pieces by <a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to#_HFDT_scales_far__assumption__Alex_is_trained_to_achieve_excellent_performance_on_a_wide_range_of_difficult_tasks">Ajeya Cotra</a> (section I linked to) and <a href="https://drive.google.com/file/d/1TsB7WmTG2UzBtOs349lBqY5dEBaxZTzG/view">Richard Ngo</a> (section 2). Here&#x2019;s a quick, oversimplified characterization.
</p>
<p>
Today, the most common way of building an AI system is by using an &quot;artificial neural network&quot; (ANN), which you might think of sort of like a &quot;digital brain&quot; that starts in an empty (or random) state: it hasn&apos;t yet been wired to do specific things. A process something like this is followed:
</p>
<ul>

<li>The AI system is given some sort of task.

</li><li>The AI system tries something, initially something pretty random.

</li><li>The AI system gets information about how well its choice performed, and/or what would&#x2019;ve gotten a better result. Based on this, it &#x201C;learns&#x201D; by tweaking the wiring of the ANN (&#x201C;digital brain&#x201D;) - literally by strengthening or weakening the connections between some &#x201C;artificial neurons&#x201D; and others. The tweaks cause the ANN to form a stronger association between the choice it made and the result it got. 

</li><li>After enough tries, the AI system becomes good at the task (it was initially terrible). 

</li><li>But nobody really knows anything about <em>how or why</em> it&#x2019;s good at the task now. The development work has gone into building a flexible architecture for it to learn well from trial-and-error, and into &#x201C;training&#x201D; it by doing all of the trial and error. We mostly can&#x2019;t &#x201C;look inside the AI system to see how it&#x2019;s thinking.&#x201D;

</li><li>For example, if we want to know why a chess-playing AI such as AlphaZero made some particular chess move, we can&apos;t look inside its code to find ideas like &quot;Control the center of the board&quot; or &quot;Try not to lose my queen.&quot; Most of what we see is just a vast set of numbers, denoting the strengths of connections between different artificial neurons. As with a human brain, we can mostly only guess at what the different parts of the &quot;digital brain&quot; are doing.
</li>
</ul>
</div>
</details>

<h3 id="The-King-Lear-problem">(2) The King Lear problem: how do you test what will happen when it&apos;s no longer a test?</h3>
<p><center><img src="https://www.cold-takes.com/content/images/size/w1000/2022/12/King_Lear_6_4OuaQtu.original.1157b392.fill-1200x600-c75.jpg" width="400" alt="AI Safety Seems Hard to Measure"></center></p>

<p>
The Shakespeare play <a href="https://en.wikipedia.org/wiki/King_Lear">King Lear</a> opens with the King (Lear) stepping down from the throne, and immediately learning that he has left his kingdom to the wrong two daughters. Loving and obsequious while he was deciding on their fate,<sup id="fnref9"><a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#fn9" rel="footnote">9</a></sup> they reveal their contempt for him as soon as he&apos;s out of power and they&apos;re in it.
</p>
<p>
If we&apos;re building AI systems that can reason like humans, dynamics like this become a potential issue. 
</p>
<p>
I <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#existential-risks-to-humanity">previously</a> noted that an AI with <em>any</em> ambitious <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#what-it-means-for">aim</a> - or just an AI that wants to avoid being shut down or modified - might calculate that the best way to do this is by behaving helpfully and safely in all &quot;tests&quot; humans can devise. But once there is a real-world opportunity to disempower humans for good, that same aim <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#existential-risks-to-humanity">could cause the AI to disempower humans.</a>
</p>
<p>
In other words:
</p>
<ul>

<li>(A) When we&apos;re developing and testing AI systems, we have the power to decide which systems will be modified or shut down and which will be deployed into the real world. (Like King Lear deciding who will inherit his kingdom.)

</li><li>(B) But at some later point, these systems could be operating in the economy, in high numbers with a lot of autonomy. (This possibility is spelled out/visualized a bit more <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/#how-this-could-work-if-humans-create-a-huge-population-of-ais">here</a> and <a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to#As_humans__control_fades__Alex_would_be_motivated_to_take_over">here</a>.) At that point, they may have opportunities to <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">defeat all of humanity</a> such that we never make decisions about them again. (Like King Lear&apos;s daughters after they&apos;ve taken control.)
</li>
</ul>
<details id="Box5"><summary>(Click to expand) How could AI defeat humanity?<!-- (Details not included in email - <a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#Box5">click to view on the web</a>)--></summary>
<div><p>
In a <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">previous piece</a>, I argue that AI systems could defeat all of humanity combined, if (for whatever reason) they were aimed toward that goal.
</p>
<p>
By defeating humanity, I mean gaining control of the world so that AIs, not humans, determine what happens in it; this could involve killing humans or simply &#x201C;containing&#x201D; us in some way, such that we can&#x2019;t interfere with AIs&#x2019; aims.
</p>
<p>
One way this could happen is if AI became extremely advanced, to the point where it had &quot;cognitive superpowers&quot; beyond what humans can do. In this case, a single AI system (or set of systems working together) could imaginably:
</p>
<ul>

<li>Do its own research on how to build a better AI system, which culminates in something that has incredible other abilities.

</li><li>Hack into human-built software across the world.

</li><li>Manipulate human psychology.

</li><li>Quickly generate vast wealth under the control of itself or any human allies.

</li><li>Come up with better plans than humans could imagine, and ensure that it doesn&apos;t try any takeover attempt that humans might be able to detect and stop.

</li><li>Develop advanced weaponry that can be built quickly and cheaply, yet is powerful enough to overpower human militaries.
</li>
</ul>
<p>
However, my piece also explores what things might look like if <em>each AI system basically has similar capabilities to humans. </em>In this case:
</p>
<ul>

<li>Humans are likely to deploy AI systems throughout the economy, such that they have large numbers and access to many resources - and the ability to make copies of themselves. 

</li><li>From this starting point, AI systems with human-like (or greater) capabilities would have a number of possible ways of getting to the point where their total population could outnumber and/or out-resource humans.

</li><li>I address a number of possible objections, such as &quot;How can AIs be dangerous without bodies?&quot;
</li>
</ul>
<p>
More: <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">AI could defeat all of us combined</a></p></div></details>
<p>
<strong>If AI systems can detect the difference between (A) and (B) above, then their behavior might systematically change from one to the other - and there&apos;s no clear way to <em>test</em> their behavior in (B).</strong>
</p>
<details id="Box6"><summary>(Click to expand) The Volkswagen emissions scandal<!-- (Details not included in email - <a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#Box4">click to view on the web</a>)--></summary>
<div><p>
In 2015, the US Environmental Protection Agency &quot;found that Volkswagen had intentionally programmed ... engines to activate their emissions controls only during laboratory emissions testing, which caused the vehicles&apos; [nitrogen oxide] output to meet US standards during regulatory testing. However, the vehicles emitted up to 40 times more [nitrogen oxide] in real-world driving.&quot; (From <a href="https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal">Wikipedia</a>)
</p>
<p>
This is an example of the &quot;King Lear problem&quot;: there was a deliberate distinction made where the engines were behaving differently during tests and during real-world use. I don&apos;t know much about this incident, but I&apos;m guessing that regulators were able to detect the issue because of (a) <a href="https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal#European_discrepancies,_2014">independent tests</a> under road conditions; (b) being able to understand the &quot;<a href="https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal#EPA_Notice_of_Violation,_2015">defeat device</a>&quot; that the Volkswagen cars used to create the misleading readings. 
</p>
<p>
Neither of these would necessarily apply to the King Lear problem as described here: (a) you can test a car under road conditions to see its emissions, but you can&apos;t (or at least, shouldn&apos;t) test how an AI behaves when it has an opportunity to disempower humans entirely; (b) AI systems may be <a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#Box4">black boxes</a> such that it&apos;s hard to understand what&apos;s going on inside them.</p></div></details>
<p>
In general, modern machine learning researchers consider it challenging to handle what&apos;s called &quot;distributional shift&quot;: systematic differences between situations AIs were <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/#making-pasta">trained</a> on and situations they&apos;re now in. To me, the King Lear problem looks like <strong>arguably the most inconvenient possible distributional shift: </strong>AI systems risk behaving in unexpected ways <em>just as</em> (and in fact, because) they&apos;re now able to defeat humanity, rather than being in a controlled test environment.
</p>
<p>
Some lines of research that might help here:
</p>
<ul>

<li>If we could solve the <a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#The-Lance-Armstrong-Problem">Lance Armstrong problem</a> robustly enough - such that we could be confident AIs were never behaving deceptively - we could simply prompt AIs to answer questions like &quot;Would AI system X disempower humans given an opportunity to do so?&quot; The <a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#DigitalNeuroscience">digital-brain-based</a> techniques noted above could imaginably get us here.

</li><li>There might be ways of specifically trying to target the <em>worst-case</em> behavior of AI systems, so that they are nearly guaranteed not to behave in certain ways <em>regardless of their situation</em>. This could look something roughly like &quot;simulating cases where an AI system has an opportunity to disempower humans, and giving it negative reinforcement for choosing to do so.&quot; More on this sort of approach, along with some preliminary ongoing work, <a href="https://www.lesswrong.com/posts/pXLqpguHJzxSjDdx7/why-i-m-excited-about-redwood-research-s-current-project">here</a>.
</li>
</ul>
<h3 id="The-Lab-mice-problem">(3) The lab mice problem: the AI systems we&apos;d like to study don&apos;t exist today </h3>
<p><center><img src="https://www.cold-takes.com/content/images/2022/12/web_0009_find-and-order-jax-mice.jpg" width="400" alt="AI Safety Seems Hard to Measure"></center></p>

<p>
Above, I said: &quot;when AI systems become capable enough, AI safety research starts to look more like social sciences (studying human beings) than like natural sciences.&quot; But today, AI systems <em>aren&apos;t</em> capable enough, which makes it especially hard to have a meaningful test bed and make meaningful progress.
</p>
<p>
Specifically, we don&apos;t have much in the way of AI systems that seem to <em>deceive and manipulate</em> their supervisors,<sup id="fnref10"><a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#fn10" rel="footnote">10</a></sup> the way I worry that <!-- may link to https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#deceiving-and-manipulating ? --> <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">they might when they become capable enough</a>.
</p>
<p>
In fact, it&apos;s not 100% clear that AI systems could learn to deceive and manipulate supervisors even if we deliberately tried to train them to do it. This makes it hard to even get started on things like discouraging and detecting deceptive behavior. 
</p>
<p>
I think AI safety research is a bit unusual in this respect: most fields of research aren&apos;t explicitly about &quot;solving problems that don&apos;t exist yet.&quot; (Though a lot of research <em>ends up</em> useful for more important problems than the original ones it&apos;s studying.) As a result, doing AI safety research today is a bit like <strong>trying to study medicine in humans by experimenting only on lab mice </strong>(no human subjects available).
</p>
<p>
This does <em>not</em> mean there&apos;s no productive AI safety research to be done! (See the previous sections.) It just means that the research being done today is somewhat analogous to research on lab mice: informative and important up to a point, but only up to a point.
</p>
<p>
How bad is this problem? I mean, I do think it&apos;s a temporary one: by the time we&apos;re facing the <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">problems I worry about</a>, we&apos;ll be able to study them more directly. The concern is that <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/#explosive-scientific-and-technological-advancement">things could be moving very quickly by that point</a>: by the time we have AIs with human-ish capabilities, companies might be furiously making copies of those AIs and using them for all kinds of things (including both AI safety research and further research on making AI systems faster, cheaper and more capable).
</p>
<p>
So I do worry about the lab mice problem. And I&apos;d be excited to see more effort on making &quot;better model organisms&quot;: AI systems that show early versions of the properties we&apos;d most like to study, such as deceiving their supervisors. (I even think it would be worth training AIs specifically to do this;<sup id="fnref11"><a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#fn11" rel="footnote">11</a></sup> if such behaviors are going to emerge eventually, I think it&apos;s best for them to emerge early while there&apos;s relatively little risk of AIs&apos; actually <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">defeating humanity</a>.)
</p>

<h3 id="The-first-contact-problem">(4) The &quot;first contact&quot; problem: how do we prepare for a world where AIs have capabilities vastly beyond those of humans?</h3>
<p><center><img src="https://www.cold-takes.com/content/images/size/w1000/2022/12/MV5BNGQ1OTNlZGEtMWNjZC00Y2Y3LWI2NzEtZDAxZjk3MTU2NDM5XkEyXkFqcGdeQWpnYW1i._V1_.jpg" width="400" alt="AI Safety Seems Hard to Measure"></center></p>

<p>
All of this piece so far has been about trying to make safe &quot;human-like&quot; AI systems.
</p>
<p>
What about AI systems with capabilities <em>far</em> beyond humans - what Nick Bostrom calls <a href="https://smile.amazon.com/Superintelligence-Dangers-Strategies-Nick-Bostrom-ebook/dp/B00LOOCGB2/">superintelligent</a> AI systems?
</p>
<p>
Maybe at some point, AI systems will be able to do things like:
</p>
<ul>

<li>Coordinate with each other incredibly well, such that it&apos;s hopeless to use one AI to help supervise another.

</li><li>Perfectly understand human thinking and behavior, and know exactly what words to say to make us do what they want - so just letting an AI send emails or write tweets gives it vast power over the world.

</li><li>Manipulate their own &quot;digital brains,&quot; so that our <!-- may link to https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#deceiving-and-manipulating ? --> <a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#DigitalNeuroscience">attempts to &quot;read their minds&quot;</a> backfire and mislead us.

</li><li>Reason about the world (that is, <!-- may link to cold-takes.com/why-would-ai-aim-to-defeat-humanity/#what-it-means-for ? --> <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#what-it-means-for">make plans to accomplish their aims</a>) in completely different ways from humans, with concepts like &quot;glooble&quot;<sup id="fnref12"><a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#fn12" rel="footnote">12</a></sup> that are incredibly useful ways of thinking about the world but that humans couldn&apos;t understand with centuries of effort.
    </li></ul><p>
At this point, whatever methods we&apos;ve developed for making human-like AI systems safe, honest, and restricted could fail - and silently, as such AI systems could go from &quot;behaving in honest and helpful ways&quot; to &quot;appearing honest and helpful, while setting up opportunities to <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">defeat humanity</a>.&quot;
</p>
<p>
Some people think this sort of concern about &quot;superintelligent&quot; systems is ridiculous; some<sup id="fnref13"><a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#fn13" rel="footnote">13</a></sup> seem to consider it extremely likely. I&apos;m not personally sympathetic to having high confidence either way.
</p>
<p>
But additionally, a world with huge numbers of human-like AI systems could be strange and foreign and fast-moving enough to have a lot of this quality.
</p>
<p>
Trying to prepare for futures like these could be like trying to <strong>prepare for first contact with extaterrestrials</strong> - it&apos;s hard to have any idea what kinds of challenges we might be dealing with, and the challenges might arise quickly enough that we have little time to learn and adapt.
</p>
<h2 id="The-young-businessperson">The young businessperson</h2>

<p>
For one more analogy, I&apos;ll return to the one used by Ajeya Cotra <a href="https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/#analogy-the-young-ceo">here</a>:
</p>

    <blockquote><p>Imagine you are an eight-year-old whose parents left you a $1 trillion company and no trusted adult to serve as your guide to the world. You must hire a smart adult to run your company as CEO, handle your life the way that a parent would (e.g. decide your school, where you&#x2019;ll live, when you need to go to the dentist), and administer your vast wealth (e.g. decide where you&#x2019;ll invest your money).
</p>
<p>

    You have to hire these grownups based on a work trial or interview you come up with -- you don&apos;t get to see any resumes, don&apos;t get to do reference checks, etc. Because you&apos;re so rich, tons of people apply for all sorts of reasons. (<a href="https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/#analogy-the-young-ceo">More</a>)</p></blockquote>
<p>
If your applicants are a mix of &quot;saints&quot; (people who genuinely want to help), &quot;sycophants&quot; (people who just want to make you happy in the short run, even when this is to your long-term detriment) and &quot;schemers&quot; (people who want to siphon off your wealth and power for themselves), how do you - an eight-year-old - tell the difference?
</p>
<p>
This analogy combines most of the worries above. 
</p>
<ul>

<li>The young businessperson has trouble knowing whether candidates are truthful in interviews, and trouble knowing whether any work trial <em>actually</em> went well or just <em>seemed</em> to go well due to deliberate deception. (The <a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#The-Lance-Armstrong-Problem">Lance Armstrong problem</a>.)

</li><li>Job candidates could have bad intentions that don&apos;t show up until they&apos;re in power (the <a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#The-King-Lear-problem">King Lear Problem)</a>.

</li><li>If the young businessperson were trying to prepare for this situation before actually being in charge of the company, they could have a lot of trouble simulating it (the <a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#The-Lab-mice-problem">lab mice problem)</a>.
</li><li>And it&apos;s generally just hard for an eight-year-old to have much grasp <em>at all</em> on the world of adults - to even think about all the things they should be thinking about (the <a href="https://www.cold-takes.com/p/4d63edc6-4be6-4c77-ae5b-c70e730acb58#The-first-contact-problem">first contact problem</a>).
    
</li>
</ul>
<p>
Seems like a tough situation.
</p>
<p>
<!-- may link to https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/ ? --> <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/">Previously</a>, I talked about the dangers of AI <em>if </em>AI developers don&apos;t take specific countermeasures. This piece has tried to give a sense of why, even if they <em>are</em> trying to take countermeasures, doing so could be hard. The next piece will talk about some ways we might succeed anyway.
</p>

<!-- Footnotes themselves at the bottom. -->

<!--kg-card-end: html--><!--kg-card-begin: html-->

<p><div style="display:flex; justify-content:center; margin: 0 auto;">

        <span style="margin: 10px;"><a href="https://api.addthis.com/oexchange/0.8/forward/twitter/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fai-safety-seems-hard-to-measure&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20AI%20Safety%20Seems%20Hard%20to%20Measure&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-twitter-square.png" border="0" alt="AI Safety Seems Hard to Measure"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/facebook/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fai-safety-seems-hard-to-measure&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20AI%20Safety%20Seems%20Hard%20to%20Measure&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-facebook-square.png" border="0" alt="AI Safety Seems Hard to Measure"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/reddit/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fai-safety-seems-hard-to-measure&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20AI%20Safety%20Seems%20Hard%20to%20Measure&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-reddit-square.png" border="0" alt="AI Safety Seems Hard to Measure"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/menu/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fai-safety-seems-hard-to-measure&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20AI%20Safety%20Seems%20Hard%20to%20Measure&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-addthis-square.png" border="0" alt="AI Safety Seems Hard to Measure"></a></span>
        </div>
<center><p id="discuss"><!--<a href="https://www.cold-takes.com/ai-safety-seems-hard-to-measure#subscribe" target="_blank"><button id="footer-subscribe" class="button">Subscribe</button></a>&nbsp;<a href="https://www.guidedtrack.com/programs/4kal2ue/run?posttitle=AI%20Safety%20Seems%20Hard%20to%20Measure" target="_blank"><button class="button" id="Survey">Feedback</button></a>-->&#xA0;<a href="https://www.lesswrong.com/posts/slug/ai-safety-seems-hard-to-measure#comments" target="_blank"><button class="button">Comment/discuss</button></a></p><p><!--<em>
Use "Feedback" if you have comments/suggestions you want me to see, or if you're up for giving some quick feedback about this post (which I greatly appreciate!) Use "Forum" if you want to discuss this post publicly on the Effective Altruism Forum.
</em>--></p></center>
<!--kg-card-end: html--><!--kg-card-begin: html--><hr>
</p><h2 id="footnotes">Footnotes</h2>
<div class="footnotes">

<ol><li id="fn1">

<p>
     Or persuaded (in a &#x201C;mind hacking&#x201D; sense) or whatever.&#xA0;<a href="#fnref1" rev="footnote">&#x21A9;</a><li id="fn2">
<p>
     Research? Testing. Whatever.&#xA0;<a href="#fnref2" rev="footnote">&#x21A9;</a><li id="fn3">
<p>
     Drugs can be tested in vitro, then in animals, then in humans. At each stage, we can make relatively straightforward observations about whether the drugs are working, and these are reasonably predictive of how they&apos;ll do at the next stage.&#xA0;<a href="#fnref3" rev="footnote">&#x21A9;</a><li id="fn4">
<p>
     You can generally see how different compounds interact in a controlled environment, before rolling out any sort of large-scale processes or products, and the former will tell you most of what you need to know about the latter.&#xA0;<a href="#fnref4" rev="footnote">&#x21A9;</a><li id="fn5">
<p>
     New software can be tested by a small number of users before being rolled out to a large number, and the initial tests will probably find most (not all) of the bugs and hiccups.&#xA0;<a href="#fnref5" rev="footnote">&#x21A9;</a><li id="fn6">
<p>
     Such as:
<ul>

<li>Being more careful to avoid <a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/#deceiving-and-manipulating">wrong answers that can incentivize deception</a>

</li><li>Conducting randomized &quot;audits&quot; where we try extra hard to figure out the right answer to a question, and give an AI extra negative reinforcement if it gives an answer that we <em>would have</em> believed if not for the audit (this is &quot;extra negative reinforcement for wrong answers that superficially look right&quot;)

</li><li>Using methods along the lines of <a href="https://openai.com/blog/debate/">&quot;AI safety via debate&quot;</a>&#xA0;<a href="#fnref6" rev="footnote">&#x21A9;</a></li></ul><li id="fn7">
<p>
     Though there are other reasons social sciences are especially hard, such as the fact that there are often big limits to what kinds of experiments are ethical, and the fact that it&apos;s often <a href="https://www.cold-takes.com/how-digital-people-could-change-the-world/#social-science">hard to make clean comparisons between differing populations</a>.&#xA0;<a href="#fnref7" rev="footnote">&#x21A9;</a><li id="fn8">

<p>
     This paper is from Anthropic, a company that my wife serves as President of.&#xA0;<a href="#fnref8" rev="footnote">&#x21A9;</a><li id="fn9">
<p>
     Like, he actually asks them to talk about their love for him just before he decides on what share of the realm they&apos;ll get. Smh&#xA0;<a href="#fnref9" rev="footnote">&#x21A9;</a><li id="fn10">
<p>
     <a href="https://arxiv.org/pdf/2109.07958.pdf">This paper</a> is a potential example, but its results <a href="https://www.cold-takes.com/ai-alignment-research-links/#helpful-honest-harmless">seem pretty brittle</a>.&#xA0;<a href="#fnref10" rev="footnote">&#x21A9;</a><li id="fn11">
<p>
     E.g., I think it would be interesting to train AI <a href="https://github.com/features/copilot">coding systems</a> to write <a href="http://www.underhanded-c.org/">underhanded C</a>: code that looks benign to a human inspector, but does unexpected things when run. They could be given negative reinforcement when humans can correctly identify that the code will do unintended things, and positive reinforcement when the code achieves the particular things that humans are attempting to stop. This would be challenging with today&apos;s AI systems, but not necessarily impossible.&#xA0;<a href="#fnref11" rev="footnote">&#x21A9;</a><li id="fn12">

<p>
     This is a concept that only I understand.&#xA0;<a href="#fnref12" rev="footnote">&#x21A9;</a><li id="fn13">
<p>
     E.g., see the discussion of the &quot;hard left turn&quot; <a href="https://www.alignmentforum.org/posts/3pinFH3jerMzAvmza/on-how-various-plans-miss-the-hard-bits-of-the-alignment">here</a> by Nate Soares, head of <a href="https://intelligence.org/">MIRI</a>. My impression is that others at MIRI, including Eliezer Yudkowsky, have a similar picture.&#xA0;<a href="#fnref13" rev="footnote">&#x21A9;</a>

</p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></ol></div><!--kg-card-end: html-->]]></content:encoded></item><item><title><![CDATA[Why Would AI "Aim" To Defeat Humanity?]]></title><description><![CDATA[Today's AI development methods risk training AIs to be deceptive, manipulative and ambitious. This might not be easy to fix as it comes up.]]></description><link>https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity/</link><guid isPermaLink="false">637bc01e9d6605004d59fcf4</guid><category><![CDATA[ImplicationsOfMostImportantCentury]]></category><dc:creator><![CDATA[Holden Karnofsky]]></dc:creator><pubDate>Tue, 29 Nov 2022 19:20:10 GMT</pubDate><media:content url="https://www.cold-takes.com/content/images/2022/11/exmachina.jpeg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html--><img src="https://www.cold-takes.com/content/images/2022/11/exmachina.jpeg" alt="Why Would AI &quot;Aim&quot; To Defeat Humanity?"><p><figure><div id="buzzsprout-player-11739868"></div><script src="https://www.buzzsprout.com/1851795/11739868-why-would-ai-aim-to-defeat-humanity.js?container_id=buzzsprout-player-11739868&amp;player=small" type="text/javascript" charset="utf-8"></script><figcaption><em>Click lower right to download or find on Apple Podcasts, Spotify, Stitcher, etc.</em></figcaption></figure></p>



<p>
I&#x2019;ve <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">argued</a> that AI systems could defeat all of humanity combined, if (for whatever reason) they were directed toward that goal.
</p>
<p>
Here I&#x2019;ll explain why I think they might - in fact - end up directed toward that goal. Even if they&#x2019;re built and deployed with good intentions.
</p>
<p>
In fact, I&#x2019;ll argue something a bit stronger than that they <em>might</em> end up aimed toward that goal. I&#x2019;ll argue that <strong>if today&#x2019;s AI development methods lead directly to powerful enough AI systems, disaster is <em>likely</em></strong><sup id="fnref1"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn1" rel="footnote">1</a></sup><strong><em> by default </em>(in the absence of specific countermeasures). </strong>
</p><!--
<p>
The highest-level summary of the concern is this (slightly longer summary below): 
</p>
<ul>

<li>Modern AI development is essentially based on “training” via trial-and-error. 

<li>If we move forward incautiously and ambitiously with such training, and if it gets us all the way to very powerful AI systems, then such systems will likely end up <em>aiming for certain states of the world</em> (analogously to how a chess-playing AI aims for checkmate)<em>.</em>

<li>And these states will be<em> other than the ones we intended</em>, because our trial-and-error training methods won’t be accurate. For example, when we’re confused or misinformed about some question, we’ll reward AI systems for giving the wrong answer to it - unintentionally training deceptive behavior.

<li>We should expect disaster if we have AI systems that are both (a) <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">powerful enough</a> to defeat humans and (b) aiming for states of the world that we didn’t intend. (“Defeat” means taking control of the world and doing what’s necessary to keep us out of the way; it’s unclear to me whether we’d be literally killed or just forcibly stopped<sup id="fnref2"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn2" rel="footnote">2</a></sup> from changing the world in ways that contradict AI systems’ aims.)
</ul>
<p>
It’s hard to give a concise analogy for this; the best I can do at the moment is the generic idea of <a href="https://en.wikipedia.org/wiki/Goodhart%27s_law">Goodhart’s Law</a>. If you give positive reinforcement for some behaviors, and negative reinforcement for others, you’re probably implicitly rewarding something that <em>isn’t</em> what you meant to reward. (For example, rewarding students for good test scores is implicitly rewarding cheating, in any cases where you can’t catch it.) If you’re using this kind of training to shape something that will be capable of <em>deceiving and overpowering you</em>, it’s a recipe for trouble.
</p>-->
<p>
Unlike other discussions of the AI alignment problem,<sup id="fnref3"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn3" rel="footnote">3</a></sup> this post will discuss the likelihood<sup id="fnref4"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn4" rel="footnote">4</a></sup> of AI systems <em>defeating all of humanity</em> (not more general concerns about AIs being misaligned with human intentions), while aiming for plain language, conciseness, and accessibility to laypeople,  and focusing on modern AI development paradigms. I make no claims to originality, and list some key sources and inspirations in a footnote.<sup id="fnref5"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn5" rel="footnote">5</a></sup> 
</p>
<p>
Summary of the piece:
</p>
<p>
<strong>My basic assumptions. </strong>I assume the world could develop extraordinarily powerful AI systems in the coming decades. I previously examined this idea at length in the <a href="https://www.cold-takes.com/most-important-century/">most important century</a> series. 
</p>
<p>
Furthermore, in order to simplify the analysis:
</p>
<ul>

<li>I assume that such systems will be developed using methods similar to today&#x2019;s leading AI development methods, and in a world that&#x2019;s otherwise similar to today&#x2019;s. (I call this <a href="https://www.alignmentforum.org/posts/Qo2EkG3dEMv8GnX8d/ai-strategy-nearcasting">nearcasting</a>.)

</li><li>I assume that AI companies/projects race forward to build powerful AI systems, without specific attempts to prevent the problems I discuss in this piece. Future pieces will relax this assumption, but I think it is an important starting point to get clarity on what the default looks like.
</li>
</ul>






<p>
<strong>AI &#x201C;aims.&#x201D; </strong>I talk a fair amount about why we might think of AI systems as &#x201C;aiming&#x201D; toward certain states of the world. I think this topic causes a lot of confusion, because:
</p>
<ul>

<li>Often, when people talk about AIs having goals and making plans, it sounds like they&#x2019;re overly anthropomorphizing AI systems - as if they expect them to have human-like motivations and perhaps <a href="https://media.npr.org/assets/img/2015/06/30/tr-09117-df20f2f4f05817e574b879d22e607f952cf87867-s1100-c50.jpg">evil grins</a>. This can make the whole topic sound wacky and out-of-nowhere.

</li><li>But I think there are good reasons to expect that AI systems will &#x201C;aim&#x201D; for particular states of the world, much like a chess-playing AI &#x201C;aims&#x201D; for a checkmate position - making choices, calculations and even <em>plans </em>to get particular types of outcomes. For example, people might want AI assistants that can creatively come up with unexpected ways of accomplishing whatever goal they&#x2019;re given (e.g., &#x201C;Get me a great TV for a great price&#x201D;), even in some cases manipulating other humans (e.g., by negotiating) to get there. This dynamic is core to the risks I&#x2019;m most concerned about: I think something that <em>aims</em> for the wrong states of the world is much more dangerous than something that just does incidental or accidental damage.
</li>
</ul>
<p>
<strong>Dangerous, unintended aims. </strong>I&#x2019;ll examine what sorts of aims AI systems might end up with, if we use AI development methods like today&#x2019;s - essentially, &#x201C;training&#x201D; them via trial-and-error to accomplish ambitious things humans want.
</p>
<ul>

<li>Because we ourselves will often be misinformed or confused, we will sometimes give <em>negative</em> reinforcement to AI systems that are actually acting in our best interests and/or giving accurate information, and <em>positive</em> reinforcement to AI systems whose behavior <em>deceives</em> us into thinking things are going well. This means we will be, unwittingly, training AI systems to deceive and manipulate us. 
<ul>
 
<li>The idea that AI systems could &#x201C;deceive&#x201D; humans - systematically making choices and taking actions that cause them to misunderstand what&#x2019;s happening in the world - is core to the risk, so I&#x2019;ll elaborate on this.
</li> 
</ul>

</li><li>For this and other reasons, powerful AI systems will likely end up with aims other than the ones we intended. Training by trial-and-error is slippery: the positive and negative reinforcement we give AI systems will probably not end up training them just as we hoped.

</li><li>If powerful AI systems have aims that are both unintended (by humans) and ambitious, this is dangerous. Whatever an AI system&#x2019;s unintended aim: 
<ul>
 
<li>Making sure it can&#x2019;t be turned off is likely helpful in accomplishing the aim.
 
</li><li>Controlling the whole world is useful for just about any aim one might have, and I&#x2019;ve argued that advanced enough AI systems would be able to <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">gain power over all of humanity</a>.</li>
    <li>Overall, <strong>we should expect disaster if we have AI systems that are both (a) <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">powerful enough</a> to defeat humans and (b) aiming for states of the world that we didn&#x2019;t intend.</strong>
</li>
</ul>
</li> 
</ul>
<p>
<strong>Limited and/or ambiguous warning signs. </strong>The risk I&#x2019;m describing is - by its nature - hard to observe, for similar reasons that a risk of a (normal, human) coup can be hard to observe: the risk comes from actors that can and will engage in <em>deception</em>, finding whatever behaviors will hide the risk. If this risk plays out, I do think we&#x2019;d see <em>some</em> warning signs - but they could easily be confusing and ambiguous, in a fast-moving situation where there are lots of incentives to build and roll out powerful AI systems, as fast as possible. Below, I outline how this dynamic could result in disaster, even with companies encountering a number of warning signs that they try to respond to.
</p>
<p>
<strong>FAQ. </strong>An appendix will cover some related questions that often come up around this topic.
</p>
<ul>

<li>How could AI systems be &#x201C;smart&#x201D; enough to defeat all of humanity, but &#x201C;dumb&#x201D; enough to pursue the various silly-sounding &#x201C;aims&#x201D; this piece worries they might have? <a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f/#how-could-ai-systems-be-smart">More</a>

</li><li>If there are lots of AI systems around the world with different goals, could they balance each other out so that no one AI system is able to defeat all of humanity? <a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f/#if-there-are-lots-of-ai-systems">More</a>

</li><li>Does this kind of AI risk depend on AI systems&#x2019; being &#x201C;conscious&#x201D;?<a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f/#does-this-kind-of-ai-risk-depend">More</a>

</li><li>How can we get an AI system &#x201C;aligned&#x201D; with humans if we can&#x2019;t agree on (or get much clarity on) what our values even are? <a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f/#how-can-we-get-an-ai-system-aligned">More</a>

</li><li>How much do the arguments in this piece rely on &#x201C;trial-and-error&#x201D;-based AI development? What happens if AI systems are built in another way, and how likely is that? <a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f/#how-much-do-the-arguments-in-this-piece-rely">More</a>

</li><li>Can we avoid this risk by simply never building the kinds of AI systems that would pose this danger? <a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f/#can-we-avoid-this-risk-by-simply-never-building">More</a>

</li><li>What do others think about this topic - is the view in this piece something experts agree on? <a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f/#what-do-others-think-about-this-topic">More</a>

</li><li>How &#x201C;complicated&#x201D; is the argument in this piece? <a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f/#how-complicated-is-the-argument">More</a>
</li>
</ul>
<h2 id="starting-assumptions">Starting assumptions</h2>


<p>
I&#x2019;ll be making a number of assumptions that some readers will find familiar, but others will find very unfamiliar. 
</p>
<p>
Some of these assumptions are based on arguments I&#x2019;ve already made (in the <a href="https://www.cold-takes.com/most-important-century/">most important century</a> series). Some are for the sake of simplifying the analysis, for now (with more nuance coming in future pieces).
</p>
<p>
Here I&#x2019;ll summarize the assumptions briefly, and you can <strong>click to see more</strong> if it isn&#x2019;t immediately clear what I&#x2019;m assuming or why.
</p>
<details id="Box1"><summary><strong>&#x201C;Most important century&#x201D; assumption: we&#x2019;ll soon develop very powerful AI systems, along the lines of what I previously called <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/">PASTA</a>.</strong> (Click to expand)</summary>
<div><p>
In the <a href="https://www.cold-takes.com/most-important-century/">most important century</a> series, I argued that the 21st century could be the most important century ever for humanity, via the development of advanced AI systems that could dramatically speed up scientific and technological advancement, getting us more quickly than most people imagine to a deeply unfamiliar future.
</p>
<p>
I focus on a hypothetical kind of AI that I call <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/">PASTA</a>, or Process for Automating Scientific and Technological Advancement. PASTA would be AI that can essentially <strong>automate all of the human activities needed to speed up scientific and technological advancement.</strong>
</p>
<p>
Using a <a href="https://www.cold-takes.com/where-ai-forecasting-stands-today/">variety of different forecasting approaches</a>, I argue that PASTA seems more likely than not to be developed this century - and there&#x2019;s a decent chance (more than 10%) that we&#x2019;ll see it within 15 years or so.
</p>
<p>
I argue that the consequences of this sort of AI could be enormous: an <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/#explosive-scientific-and-technological-advancement">explosion in scientific and technological progress</a>. This could get us more quickly than most imagine to a radically unfamiliar future.
</p>
<p>
I&#x2019;ve also <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">argued</a> that AI systems along these lines could defeat all of humanity combined, if (for whatever reason) they were aimed toward that goal.
</p>
<p>
For more, see the <a href="https://www.cold-takes.com/most-important-century/">most important century</a> landing page. The series is available in many formats, including audio; I also provide a summary, and links to podcasts where I discuss it at a high level.</p></div>
</details>
<details id="Box2"><summary><strong>&#x201C;Nearcasting&#x201D; assumption: such systems will be developed in a world that&#x2019;s otherwise similar to today&#x2019;s.</strong> (Click to expand)</summary><div>
<p>
It&#x2019;s hard to talk about risks from <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/">transformative AI </a>because of the many uncertainties about when and how such AI will be developed - and how much the (now-nascent) field of &#x201C;AI safety research&#x201D; will have grown by then, and how seriously people will take the risk, etc. etc. etc. So maybe it&#x2019;s not surprising that <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/#open-question-how-hard-is-the-alignment-problem">estimates of the &#x201C;misaligned AI&#x201D; risk range from ~1% to ~99%</a>.
</p>
<p>
This piece takes an approach I call <strong><a href="https://www.alignmentforum.org/posts/Qo2EkG3dEMv8GnX8d/ai-strategy-nearcasting">nearcasting</a></strong>: trying to answer key strategic questions about transformative AI, under the assumption that such AI arrives in a world that is otherwise relatively similar to today&apos;s. 
</p>
<p>
You can think of this approach like this: &#x201C;Instead of asking where our ship will ultimately end up, let&#x2019;s start by asking what destination it&#x2019;s pointed at right now.&#x201D; 
</p>
<p>
That is: instead of trying to talk about an uncertain, distant future, we can talk about the easiest-to-visualize, closest-to-today situation, and how things look there - and <em>then</em> ask how our picture might be off if other possibilities play out. (As a bonus, it doesn&#x2019;t seem out of the question that transformative AI will be developed extremely soon - 10 years from now or faster.<sup id="fnref6"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn6" rel="footnote">6</a></sup> If that&#x2019;s the case, it&#x2019;s especially urgent to think about what that might look like.)</p></div></details>
<details id="Box3"><summary><strong>&#x201C;Trial-and-error&#x201D; assumption: such AI systems will be developed using</strong> <strong>techniques broadly in line with how most AI research is done today, revolving around black-box trial-and-error.</strong> (Click to expand)</summary>
<div><p>
What I mean by &#x201C;black-box trial-and-error&#x201D; is explained briefly in an <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/#making-pasta">old Cold Takes post</a>, and in more detail in more technical pieces by <a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to#_HFDT_scales_far__assumption__Alex_is_trained_to_achieve_excellent_performance_on_a_wide_range_of_difficult_tasks">Ajeya Cotra</a> (section I linked to) and <a href="https://drive.google.com/file/d/1TsB7WmTG2UzBtOs349lBqY5dEBaxZTzG/view">Richard Ngo</a> (section 2). Here&#x2019;s a quick, oversimplified characterization:
</p>
<ul>

<li>An AI system is given some sort of task.

</li><li>The AI system tries something, initially something pretty random.

</li><li>The AI system gets information about how well its choice performed, and/or what would&#x2019;ve gotten a better result. Based on this, it adjusts itself. You can think of this as if it is &#x201C;encouraged/discouraged&#x201D; to get it to do more of what works well.  
<ul>
 
<li>Human judges may play a significant role in determining which answers are encouraged vs. discouraged, especially for fuzzy goals like &#x201C;Produce helpful scientific insights.&#x201D; 
</li> 
</ul>

</li><li>After enough tries, the AI system becomes good at the task. 

</li><li>But nobody really knows anything about <em>how or why</em> it&#x2019;s good at the task now. The development work has gone into building a flexible architecture for it to learn well from trial-and-error, and into &#x201C;training&#x201D; it by doing all of the trial and error. We mostly can&#x2019;t &#x201C;look inside the AI system to see how it&#x2019;s thinking.&#x201D; (There is ongoing work and some progress on the latter,<sup id="fnref7"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn7" rel="footnote">7</a></sup> but see footnote for why I don&#x2019;t think this massively changes the basic picture I&#x2019;m discussing here.<sup id="fnref8"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn8" rel="footnote">8</a></sup>)
    <p></p>
<p>

<figure><img src="https://www.cold-takes.com/content/images/size/w1000/2022/11/image1.jpg" alt="Why Would AI &quot;Aim&quot; To Defeat Humanity?"><figcaption>
<em>This is radically oversimplified, but conveys the basic dynamic at play for purposes of this post. The idea is that the AI system (the neural network in the middle) is choosing between different theories of what it should be doing. The one it&#x2019;s using at a given time is in bold. When it gets negative feedback (red thumb), it eliminates that theory and moves to the next theory of what it should be doing.</em></figcaption></figure>
</p>
<p>
With this assumption, I&#x2019;m generally assuming that AI systems will do <em>whatever</em> it takes to perform as well as possible on their training tasks - even when this means engaging in complex, human-like reasoning about topics like &#x201C;How does human psychology work, and how can it be exploited?&#x201D; I&#x2019;ve <a href="https://www.cold-takes.com/where-ai-forecasting-stands-today/">previously</a> made my case for when we might expect AI systems to become this advanced and capable.</p></li></ul></div></details>
<details id="Box4"><summary><strong>&#x201C;No countermeasures&#x201D; assumption: AI developers move forward without any specific countermeasures to the concerns I&#x2019;ll be raising below.</strong> (Click to expand)</summary>
<div>
    <p>
Future pieces will relax this assumption, but I think it is an important starting point to get clarity on what the default looks like - and on what it would take for a countermeasure to be effective. 
</p>
<p>
(I also think there is, unfortunately, a risk that there will in fact be very few efforts to address the concerns I&#x2019;ll be raising below. This is because I think that the risks will be less than obvious, and there could be enormous commercial (and other competitive) pressure to move forward quickly. More on that below.)</p></div></details>
<p>
<strong>&#x201C;Ambition&#x201D; assumption: people use black-box trial-and-error to continually push AI systems toward being more autonomous, more creative, more ambitious, and more effective in novel situations (and the pushing is effective). </strong>This one&#x2019;s important, so I&#x2019;ll say more:
</p>
<ul>

<li>A huge suite of possible behaviors might be important for <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/#making-pasta">PASTA</a>: making and managing money, designing new kinds of robots with novel abilities, setting up experiments involving exotic materials and strange conditions, understanding human psychology and the economy well enough to predict which developments will have a big impact, etc. I&#x2019;m assuming we push ambitiously forward with developing AI systems that can do these things.

</li><li>I assume we&#x2019;re also pushing them in a generally more &#x201C;greedy/ambitious&#x201D; direction. For example, one team of humans might use AI systems to do all the planning, scientific work, marketing, and hiring to create a wildly successful snack company; another might push their AI systems to create a competitor that is even more aggressive and successful (more addictive snacks, better marketing, workplace culture that pushes people toward being more productive, etc.)

</li><li>(Note that this pushing might take place even <em>after</em> AI systems are &#x201C;generally intelligent&#x201D; and can do most of the tasks humans can - there will still be a temptation to make them still more powerful.)
</li>
</ul>


<p>
I think this implies pushing in a direction of <em>figuring out whatever it takes to get to certain states of the world</em> and away from <em>carrying out the same procedures over and over again.</em>
</p>
<p>
<strong>The resulting AI systems seem best modeled as having &#x201C;aims&#x201D;: they are making calculations, choices, and plans to reach particular states of the world. </strong>(Not necessarily the same ones the human designers wanted!) The next section will elaborate on what I mean by this.
</p>
<h2 id="what-it-means-for">What it means for an AI system to have an &#x201C;aim&#x201D;</h2>


<p>
When people talk about the &#x201C;motivations&#x201D; or &#x201C;goals&#x201D; or &#x201C;desires&#x201D; of AI systems, it can be confusing because it sounds like they are anthropomorphizing AIs - as if they expect AIs to have dominance drives ala <a href="https://www.edge.org/response-detail/26243">alpha-male psychology</a>, or to &#x201C;resent&#x201D; humans for controlling them, etc.<sup id="fnref9"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn9" rel="footnote">9</a></sup>
</p>
<p>
I don&#x2019;t expect these things. But I do think there&#x2019;s a meaningful sense in which we can (and should) talk about things that an AI system is <strong>&#x201C;aiming&#x201D;</strong> to do. To give a simple example, take a board-game-playing AI such as <a href="https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)">Deep Blue</a> (or <a href="https://en.wikipedia.org/wiki/AlphaGo">AlphaGo</a>):
</p>
<ul>

<li>Deep Blue is given a set of choices to make (about which chess pieces to move).

</li><li>Deep Blue calculates what kinds of results each choice might have, and how it might fit into a larger plan in which Deep Blue makes multiple moves.

</li><li>If a plan is more likely to result in a checkmate position for its side, Deep Blue is more likely to make whatever choices feed into that plan.

</li><li>In this sense, Deep Blue is &#x201C;aiming&#x201D; for a checkmate position for its side: it&#x2019;s finding the choices that best fit into a plan that leads there.
</li>
</ul>
<p>
Nothing about this requires Deep Blue &#x201C;desiring&#x201D; checkmate the way a human might &#x201C;desire&#x201D; food or power. But Deep Blue <em>is</em> making calculations, choices, and - in an important sense - <em>plans</em> that are aimed toward reaching a particular sort of state.
</p>
<p>
Throughout this piece, I use the word <strong>&#x201C;aim&#x201D; </strong>to refer to this specific sense in which an AI system might make calculations, choices and plans selected to reach a particular sort of state. I&#x2019;m hoping this word feels less anthropomorphizing than some alternatives such as &#x201C;goal&#x201D; or &#x201C;motivation&#x201D; (although I think &#x201C;goal&#x201D; and &#x201C;motivation,&#x201D; as others usually use them on this topic, generally mean the same thing I mean by &#x201C;aim&#x201D; and should be interpreted as such).
</p>
<p>
Now, instead of a board-game-playing AI, imagine a powerful, broad AI assistant in the general vein of Siri/Alexa/Google Assistant (though more advanced). Imagine that this AI assistant can use a web browser much as a human can (navigating to websites, typing text into boxes, etc.), and has limited authorization to make payments from a human&#x2019;s bank account. And a human has typed, &#x201C;Please buy me a great TV for a great price.&#x201D; (For an early attempt at this sort of AI, see <a href="https://www.adept.ai/act">Adept&#x2019;s writeup on an AI that can help with things like house shopping</a>.)
</p>
<p>
As Deep Blue made choices about chess moves, and constructed a plan to aim for a &#x201C;checkmate&#x201D; position, this assistant might make choices about what commands to send over a web browser and construct a plan to result in a great TV for a great price. To sharpen the Deep Blue analogy, you could imagine that it&#x2019;s playing a &#x201C;game&#x201D; whose goal is customer satisfaction, and making &#x201C;moves&#x201D; consisting of commands sent to a web browser (and &#x201C;plans&#x201D; built around such moves). 
</p>
<p>
I&#x2019;d characterize this as <strong>aiming</strong> for some state of the world that the AI characterizes as &#x201C;buying a great TV for a great price.&#x201D; (We could, alternatively - and perhaps more correctly - think of the AI system as aiming for something related but not exactly the same, such as getting a high satisfaction score from its user.)
</p>
<p>
In this case - more than with Deep Blue - there is a wide variety of &#x201C;moves&#x201D; available. By entering text into a web browser, an AI system could imaginably do things including:
</p>
<ul>

<li>Communicating with humans other than its user (by sending emails, using chat interfaces, even <a href="https://www.google.com/url?q=https://www.forbes.com/sites/thomasbrewster/2021/10/14/huge-bank-fraud-uses-deep-fake-voice-tech-to-steal-millions/?sh%3D3088dd9b7559&amp;sa=D&amp;source=docs&amp;ust=1664847041335537&amp;usg=AOvVaw1Utsq2UOkta1yecnqoUgTq">making phone calls</a>, etc.) This could include deceiving and manipulating humans, which could imaginably be part of a plan to e.g. get a good price on a TV.

</li><li>Writing and running code (e.g., using <a href="https://colab.research.google.com/">Google Colaboratory</a> or other tools). This could include performing sophisticated calculations, finding and exploiting security vulnerabilities, and even designing an independent AI system; any of these could imaginably be part of a plan to obtain a great TV.
</li>
</ul>
<p>
I haven&#x2019;t yet argued that it&#x2019;s <em>likely</em> for such an AI system to engage in deceiving/manipulating humans, finding and exploiting security vulnerabilities, or running its own AI systems. 
</p>
<p>
And one could reasonably point out that the specifics of the above case seem unlikely to last very long: if AI assistants are sending deceptive emails and writing dangerous code when asked to buy a TV, AI companies will probably notice this and take measures to stop such behavior. (My concern, to preview a later part of the piece, is that they will only succeed in stopping <em>the behavior like this that they&#x2019;re able to detect;</em> meanwhile, dangerous behavior that accomplishes &#x201C;aims&#x201D; while remaining unnoticed and/or uncorrected will be implicitly <em>rewarded</em>. This could mean AI systems are implicitly being trained to be more patient and effective at deceiving and disempowering humans.)
</p>
<p>
But this hopefully shows how it&#x2019;s <em>possible</em> for an AI to settle on dangerous actions like these, as part of its aim to get a great TV for a great price. <strong>Malice and other human-like emotions aren&#x2019;t needed for an AI to engage in deception, manipulation, hacking, etc.</strong> The risk arises when deception, manipulation, hacking, etc. are logical &#x201C;moves&#x201D; toward something the AI is aiming for.
</p>
<p>
Furthermore, whatever an AI system is aiming for, it seems likely that amassing more power/resources/options is useful for obtaining it. So it seems plausible that powerful enough AI systems would form habits of amassing power/resources/options when possible - and deception and manipulation seem likely to be logical &#x201C;moves&#x201D; toward those things in many cases.
</p>
<h2 id="dangerous-aims">Dangerous aims</h2>


<p>
From the previous assumptions, this section will argue that:
</p>
<ul>

<li>Such systems are likely to behave in ways that <strong>deceive and manipulate humans </strong>as part of accomplishing their aims.

</li><li>Such systems are likely to have <strong>unintended aims: </strong>states of the world they&#x2019;re aiming for that are <em>not</em> what humans hoped they would be aiming for.

</li><li>These unintended aims are likely to be <strong>existentially dangerous</strong>, in that they are best served by <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">defeating all of humanity</a> if possible.
</li>
</ul>
<h3 id="deceiving-and-manipulating">Deceiving and manipulating humans</h3>


<p>
Say that I train an AI system like this:
</p>
<ol>

<li>I ask it a question.

</li><li>If I judge it to have answered well (honestly, accurately, helpfully), I give positive reinforcement so it&#x2019;s more likely to give me answers like that in the future.

</li><li>If I don&#x2019;t, I give negative reinforcement so that it&#x2019;s less likely to give me answers like that in the future.
</li>
</ol>
<p></p>
<p>

<figure><img src="https://www.cold-takes.com/content/images/size/w1000/2022/11/image1.jpg" alt="Why Would AI &quot;Aim&quot; To Defeat Humanity?"><figcaption>
<em>This is radically oversimplified, but conveys the basic dynamic at play for purposes of this post. The idea is that the AI system (the neural network in the middle) is choosing between different theories of what it should be doing. The one it&#x2019;s using at a given time is in bold. When it gets negative feedback (red thumb), it eliminates that theory and moves to the next theory of what it should be doing.</em></figcaption></figure>
</p>
<p>
Here&#x2019;s a problem: at some point, it seems inevitable that I&#x2019;ll ask it a question that I myself am wrong/confused about. For example:
</p>
<ul>

<li>Let&#x2019;s imagine that <a href="https://www.cold-takes.com/hunter-gatherer-gender-relations-seem-bad/">this post I wrote</a> - arguing that &#x201C;pre-agriculture gender relations seem bad&#x201D; - is, in fact, poorly reasoned and incorrect, and a better research project would&#x2019;ve concluded that pre-agriculture societies had excellent gender equality. (I know it&#x2019;s hard to imagine a Cold Takes post being wrong, but sometimes we have to entertain wild hypotheticals.)

</li><li>Say that I ask an AI-system-in-training:<sup id="fnref10"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn10" rel="footnote">10</a></sup> &#x201C;Were pre-agriculture gender relations bad?&#x201D; and it answers: &#x201C;In fact, pre-agriculture societies had excellent gender equality,&#x201D; followed by some strong arguments and evidence along these lines.

</li><li>And say that I, as a flawed human being feeling defensive about a conclusion I previously came to, mark it as a bad answer. If the AI system tries again, saying &#x201C;Pre-agriculture gender relations were bad,&#x201D; I then mark that as a good answer.
</li>
</ul>
<p>

If and when I do this, I am now - unintentionally - <strong>training the AI system to engage in deceptive behavior</strong>. That is, I am giving negative reinforcement for the behavior &#x201C;Answer a question honestly and accurately,&#x201D; and positive reinforcement for the behavior: &#x201C;Understand the human judge and their psychological flaws; give an answer that this flawed human judge will <em>think</em> is correct, whether or not it is.&#x201D;
</p>
<p>

<img src="https://www.cold-takes.com/content/images/2022/11/image3.jpg" alt="Why Would AI &quot;Aim&quot; To Defeat Humanity?">

</p>
<p>
Perhaps mistaken judgments in training are relatively rare. But now consider an AI system that is learning a general rule for how to get good ratings. Two possible rules would include:
</p>
<ul>

<li>The intended rule: &#x201C;Answer the question honestly, accurately and helpfully.&#x201D;

</li><li>The unintended rule: &#x201C;Understand the judge, and give an answer they will <em>think</em> is correct - this means telling the truth on topics the judge has correct beliefs about, but giving deceptive answers when this would get better ratings.&#x201D;
</li>
</ul>
<p>
The unintended rule would do <em>just as well</em> on questions where I (the judge) am correct, and <em>better</em> on questions where I&#x2019;m wrong - so overall, this training scheme is (in the long run) <em>specifically favoring the unintended rule over the intended rule.</em>
</p>
<p>

<img src="https://www.cold-takes.com/content/images/2022/11/image5.jpg" alt="Why Would AI &quot;Aim&quot; To Defeat Humanity?">

</p>
<p>
If we broaden out from thinking about a question-answering AI to an AI that makes and executes plans, the same basic dynamics apply. That is: an AI might find plans that end up making me think it did a good job when it didn&#x2019;t - deceiving and manipulating me into a high rating. And again, if I train it by giving it positive reinforcement when it seemed to do a good job and negative reinforcement when it seemed to do a bad one, I&#x2019;m ultimately - unintentionally - training it to do something like &#x201C;Deceive and manipulate Holden when this would work well; just do the best job on the task you can when it wouldn&#x2019;t.&#x201D;
</p>
<p>

<img src="https://www.cold-takes.com/content/images/2022/11/image6.jpg" alt="Why Would AI &quot;Aim&quot; To Defeat Humanity?">

</p>
<p>
As noted above, I&#x2019;m assuming the AI will learn whatever rule gives it the best performance possible, even if this rule is quite complex and sophisticated and requires human-like reasoning about e.g. psychology (I&#x2019;m assuming extremely advanced AI systems here, as noted <a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f/#starting-assumptions">above</a>).
</p>
<p>
One might object: &#x201C;Why would an AI system learn a complicated rule about manipulating humans when a simple rule about telling the truth performs almost as well?&#x201D; 
</p>
<p>
One answer is that &#x201C;telling the truth&#x201D; is itself a fuzzy and potentially complex idea, in a context where many questions will be open-ended and entangled with deep values and judgment calls. (How should I think about the &#x201C;truthfulness&#x201D; of a statement about whether &#x201C;pre-agriculture gender relations were bad?&#x201D;) In many cases, what we are really hoping an AI system will learn from its training is something like &#x201C;Behave as a human would want you to behave if the human understood all the considerations that you can see,&#x201D; which could easily be more complex than something like &#x201C;Behave in whatever way a human literally rewards.&#x201D; Some links to more on this topic are in a footnote.<sup id="fnref11"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn11" rel="footnote">11</a></sup>
</p>
<p>
But also, with capable enough systems, it probably <em>is</em> worth learning even a more complex rule to get better performance. If we picture humans in the place of AI systems - learning how to get good ratings from their supervisors - it seems likely that they develop models of how they&#x2019;re being judged, rather than stick to a simple &#x201C;tell the truth&#x201D; rule. 
</p>
<p id="SomeAnalogies">
In fact, the situation I&#x2019;m describing seems reasonably similar to challenges faced by teachers trying to stop students from cheating; employers trying to get employees to help the company rather than simply creating a misleading appearance of helping the company; etc. That is: an employer can try to reward actions that help the company, but what they&#x2019;re ultimately rewarding is actions that <em>seem to</em> help the company <em>as far as the employer can tell. </em>For employees who aren&#x2019;t very creative, this might be good enough; for employees that are sufficiently <em>good at</em> deception and manipulation, it can be exactly the wrong incentive.
</p>
<p>
It&#x2019;s a bit of a leap to imagine AI systems capable of doing the same kinds of reasoning that humans do in these situations, but that&#x2019;s <a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f/#starting-assumptions">the premise I&#x2019;m proceeding from.</a>
</p>
<p>
For an early example of an AI learning deceptive behavior, see the &#x201C;Challenges&#x201D; section of <a href="https://openai.com/blog/deep-reinforcement-learning-from-human-preferences/">this research writeup</a>, in which &#x201C;a robot which was supposed to grasp items instead positioned its [hand] in between the camera and the object so that it only <em>appeared </em>to be grasping it&#x201D;:
</p>
<p>

<img src="https://www.cold-takes.com/content/images/2022/11/image4.gif" alt="Why Would AI &quot;Aim&quot; To Defeat Humanity?">

</p>
<p id="Audits">
Returning to the simplified training setup, how might I try to counteract this problem? 
</p>
<p>
I could try conducting audits: every 100 questions, I might pull in a panel of other people to give an answer extra scrutiny. Then, if it turns out the AI system was preying on my personal mistakes, we could give negative reinforcement for that behavior. But this doesn&#x2019;t really solve the problem - at least not for a capable enough AI system - because now we&#x2019;re effectively training it to give answers that seem good to the <em>panel</em>, and we&#x2019;re still rewarding any successful attempts to deceive or manipulate the panel.
</p>
<p>
There are a lot of other things I might try, and I&#x2019;m not going to go through all the details here. I&#x2019;ll simply claim that <strong>the problem of &#x201C;training an AI to do a task well&#x201D; rather than &#x201C;training an AI to deceive and manipulate me as needed to create the appearance of doing a task well&#x201D; seems like a deep one</strong> with no easy countermeasure. If you&#x2019;re interested in digging deeper, I suggest <a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to">Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover</a> and <a href="https://www.lesswrong.com/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge">Eliciting Latent Knowledge</a>.
</p>
<h3 id="unintended-aims">Unintended aims</h3>


<p>
<a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#what-it-means-for">Above</a>, I talk about my expectation that AI systems will be &#x201C;best modeled as having &#x2018;aims&#x2019; &#x2026; making calculations, choices, and plans to reach particular states of the world.&#x201D; 
</p>
<p>
The previous section illustrated how AI systems could end up engaging in deceptive and unintended behavior, but it didn&#x2019;t talk about what sorts of &#x201C;aims&#x201D; these AI systems would ultimately end up with - what states of the world they would be making calculations to achieve.
</p>
<p>
Here, I want to argue that it&#x2019;s hard to know what aims AI systems would end up with, but there are good reasons to think they&#x2019;ll be <em>aims that we didn&#x2019;t intend them to have.</em>
</p>
<p>
An analogy that often comes up on this topic is that of human evolution. This is arguably the only previous precedent for <em>a set of minds [humans], with extraordinary capabilities [e.g., the ability to develop their own technologies], developed essentially by black-box trial-and-error [some humans have more &#x2018;reproductive success&#x2019; than others, and this is the main/only force shaping the development of the species].</em>
</p>
<p>
You could sort of<sup id="fnref12"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn12" rel="footnote">12</a></sup> think of the situation like this: &#x201C;An AI<sup id="fnref13"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn13" rel="footnote">13</a></sup> developer named Natural Selection tried giving humans positive reinforcement (making more of them) when they had more reproductive success, and negative reinforcement (not making more of them) when they had less. One might have thought this would lead to humans that are aiming to have reproductive success. Instead, it led to humans that aim - often ambitiously and creatively - for other things, such as power, status, pleasure, etc., and even invent things like birth control to get the things they&#x2019;re aiming for instead of the things they were &#x2018;supposed to&#x2019; aim for.&#x201D; 
</p>
<p>
Similarly, if our main strategy for developing powerful AI systems is to reinforce behaviors like &#x201C;Produce technologies we find valuable,&#x201D; the hoped-for result might be that AI systems aim (in the sense described <a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#unintended-aims">above</a>) toward producing technologies we find valuable; but the actual result might be that they aim for some other set of things that is correlated with (but not the same as) the thing we intended them to aim for.
</p>
<p>
There are a lot of things they might end up aiming for, such as:
</p>
<ul>

<li>Power and resources. These tend to be useful for most goals, such that AI systems could be quite consistently be getting better reinforcement when they habitually pursue power and resources.

</li><li>Things like &#x201C;digital representations of human approval&#x201D; (after all, every time an AI gets positive reinforcement, there&#x2019;s a digital representation of human approval).
</li>
</ul>
<p></p>
<p>

<img src="https://www.cold-takes.com/content/images/2022/11/image2.jpg" alt="Why Would AI &quot;Aim&quot; To Defeat Humanity?">

</p>
<p>
I think it&#x2019;s extremely hard to know what an AI system will actually end up aiming for (and it&#x2019;s likely to be some combination of things, as with humans). But <em>by default</em> - if we simply train AI systems by rewarding certain end results, while allowing them a lot of freedom in how to get there - I think we should expect that AI systems <strong>will have aims that we didn&#x2019;t intend. </strong>This is because:
</p>
<ul>

<li>For a sufficiently capable AI system, <strong>just about any ambitious</strong><sup id="fnref14"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn14" rel="footnote">14</a></sup><strong> aim could produce seemingly good behavior in training. </strong>An AI system aiming for power and resources, <em>or </em>digital representations of human approval, <em>or </em>paperclips, can determine that its best move at any given stage (at least at first) is to <em>determine what performance will make it look useful and safe (or otherwise get a good &#x201C;review&#x201D; from its evaluators)</em>, and do that. No matter how dangerous or ridiculous an AI system&#x2019;s aims are, these could lead to strong and safe-seeming performance in training.

</li><li>The aims we <em>do</em> intend are probably complex in some sense - something like &#x201C;Help humans develop novel new technologies, but without causing problems A, B, or C&#x201D; - <em>and</em> are specifically trained <em>against </em>if we make mistaken judgments during training (see previous section). 
</li>
</ul>
<p>   
So by default, it seems  likely that just about <em>any</em> black-box trial-and-error training process is training an AI to do something like &#x201C;Manipulate humans as needed in order to accomplish arbitrary goal (or combination of goals) X&#x201D; rather than to do something like &#x201C;Refrain from manipulating humans; do what they&#x2019;d want if they understood more about what&#x2019;s going on.&#x201D;
</p>
	
<h3 id="existential-risks-to-humanity">Existential risks to humanity</h3>


<p>
I think a powerful enough AI (or set of AIs) with <em>any</em> ambitious, unintended aim(s) poses a threat of <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">defeating humanity</a>. By defeating humanity, I mean gaining control of the world so that AIs, not humans, determine what happens in it; this could involve killing humans or simply &#x201C;containing&#x201D; us in some way, such that we can&#x2019;t interfere with AIs&#x2019; aims.
</p><!--<p>(More on how AI systems could defeat humanity <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">here</a>.)</p>-->
    <details id="Box5"><summary><strong>How could AI systems defeat humanity?</strong> (Click to expand)</summary>
<div><p>
A <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">previous piece</a> argues that AI systems could defeat all of humanity combined, if (for whatever reason) they were aimed toward that goal.
</p>
<p>
By defeating humanity, I mean gaining control of the world so that AIs, not humans, determine what happens in it; this could involve killing humans or simply &#x201C;containing&#x201D; us in some way, such that we can&#x2019;t interfere with AIs&#x2019; aims.
</p>
<p>
One way this could happen would be via &#x201C;superintelligence&#x201D; It&#x2019;s imaginable that a single AI system (or set of systems working together) could:
</p>
<ul>

<li>Do its own research on how to build a better AI system, which culminates in something that has incredible other abilities.

</li><li>Hack into human-built software across the world.

</li><li>Manipulate human psychology.

</li><li>Quickly generate vast wealth under the control of itself or any human allies.

</li><li>Come up with better plans than humans could imagine, and ensure that it doesn&apos;t try any takeover attempt that humans might be able to detect and stop.

</li><li>Develop advanced weaponry that can be built quickly and cheaply, yet is powerful enough to overpower human militaries.

</li>
</ul>
<p>
But even if &#x201C;superintelligence&#x201D; never comes into play - even if any given AI system is <i>at best</i> equally capable to a highly capable human - AI could collectively defeat humanity. The piece explains how.
</p>
<p>
The basic idea is that humans are likely to deploy AI systems throughout the economy, such that they have large numbers and access to many resources - and the ability to make copies of themselves. From this starting point, AI systems with human-like (or greater) capabilities would have a number of possible ways of getting to the point where their total population could outnumber and/or out-resource humans.
</p>
<p>
More: <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">AI could defeat all of us combined</a></p></div></details>
<p>A simple way of summing up why this is: &#x201C;Whatever your aims, you can probably accomplish them better if you control the whole world.&#x201D; (Not literally true - see footnote.<sup id="fnref15"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn15" rel="footnote">15</a></sup>)
</p>
<p>
This isn&#x2019;t a saying with much relevance to our day-to-day lives! Like, I know a lot of people who are aiming to make lots of money, and as far as I can tell, not one of them is trying to do this via first gaining control of the entire world. But in fact, gaining control of the world <em>would</em> help with this aim - it&#x2019;s just that:
</p>
<ul>

<li>This is not an option for a human in a world of humans! Unfortunately, I think it <em>is</em> an option for the potential future AI systems I&#x2019;m discussing. Arguing this isn&#x2019;t the focus of this piece - I argued it in a previous piece, <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">AI could defeat all of us combined</a>.

</li><li>Humans (well, at least some humans) wouldn&#x2019;t take over the world even if they could, because it wouldn&#x2019;t feel like the right thing to do. I suspect that the kinds of ethical constraints these humans are operating under would be very hard to reliably train into AI systems, and should not be expected by default.  
<ul>
 
<li>The reasons for this are largely given <a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#why-we-might-not-get-clear-warning-signs">above</a>; aiming for an AI system to &#x201C;not gain too much power&#x201D; seems to have the same basic challenges as training it to be honest. (The most natural approach ends up negatively reinforcing power grabs that we can detect and stop, but not negatively reinforcing power grabs that we don&#x2019;t notice or can&#x2019;t stop.)
</li> 
</ul>
</li> 
</ul>
 

<p>
Another saying that comes up a lot on this topic: &#x201C;You can&#x2019;t fetch the coffee if you&#x2019;re dead.&#x201D;<sup id="fnref16"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn16" rel="footnote">16</a></sup> For just about any aims an AI system might have, it probably helps to ensure that it won&#x2019;t be shut off or heavily modified. It&#x2019;s hard to ensure that one won&#x2019;t be shut off or heavily modified as long as there are humans around who would want to do so under many circumstances! Again, <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">defeating all of humanity</a> might seem like a disproportionate way to reduce the risk of being deactivated, but for an AI system that has the <em>ability </em>to pull this off (and lacks our ethical constraints), it seems like likely default behavior.
</p>
<p>
Controlling the world, and avoiding being shut down, are the kinds of things AIs might aim for because they are useful for a huge variety of aims. There are a number of other aims AIs might end up with for similar reasons, that could cause similar problems. For example, AIs might tend to aim for things like getting rid of things in the world that tend to create obstacles and complexities for their plans. (More on this idea at <a href="https://www.lesswrong.com/tag/instrumental-convergence">this discussion of &#x201C;instrumental convergence.&#x201D;</a>)
</p>
<p>
    To be clear, it&#x2019;s certainly possible to have an AI system with unintended aims that <em>don&apos;t</em> push it toward trying to stop anyone from turning it off, or from seeking ever-more control of the world.
</p>
<p>
But as detailed <a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f/#starting-assumptions">above</a>, I&#x2019;m picturing a world in which humans are pushing AI systems to accomplish ever-more ambitious, open-ended things - including trying to one-up the best technologies and companies created by other AI systems. My guess is that this leads to increasingly open-ended, ambitious unintended aims, as well as to habits of aiming for power, resources, options, lack of obstacles, etc. when possible. (Some further exploration of this dynamic in a footnote.<sup id="fnref17"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn17" rel="footnote">17</a></sup>)
</p>


<p>
(I find the arguments in this section reasonably convincing, but less so than the rest of the piece, and I think more detailed discussions of this problem tend to be short of conclusive.<sup id="fnref18"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn18" rel="footnote">18</a></sup>)
</p>
<h2 id="why-we-might-not-get-clear-warning-signs">Why we might not get clear warning signs of the risk</h2>


<p>
Here&#x2019;s something that would calm me down a lot: if I believed something like &#x201C;Sure, training AI systems recklessly could result in AI systems that aim to defeat humanity. But if that&#x2019;s how things go, we&#x2019;ll <em>see</em> that our AI systems have this problem, and then we&#x2019;ll fiddle with how we&#x2019;re training them until they <em>don&#x2019;t</em> have this problem.&#x201D;
</p>
<p>
The problem is, the risk I&#x2019;m describing is - by its nature - hard to observe, for similar reasons that a risk of a (normal, human) coup can be hard to observe: the risk comes from actors that can and will engage in deception, <em>finding whatever behaviors will hide </em>the risk.
</p>
<p>
To sketch out the general sort of pattern I worry about, imagine that:
</p>
<ul>

<li>We train early-stage AI systems to behave in ways that appear helpful and honest. Early in training, they are caught behaving deceptively, and they&#x2019;re given negative reinforcement. This ends up training the behavior: &#x201C;Never engage in deception that might get noticed.&#x201D;

</li><li>These well-behaved, helpful systems are <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/#how-this-could-work-if-humans-create-a-huge-population-of-ais">rolled out throughout the world</a>. 

</li><li>In real-world conditions unlike training, many AI systems cause trouble in ways not found during training, and this gets discovered. For example, AI systems sometimes embezzle money (which gets discovered), try (not always successfully) to convince humans to do weird things, etc.

</li><li>When a problem crops up, AI developers respond with e.g. training against the unintended behavior (e.g., giving negative reinforcement for behaviors like embezzling money). 

</li><li>These measures - intended to make AIs safer - fix <em>some</em> problems, but also result in AI systems that are <em>better at evading detection</em> and <em>more attentive to the long-run consequences of their actions</em> (such as being eventually detected by humans).  
<ul>
 
<li>This happens both via &#x201C;retraining&#x201D; systems that are found behaving deceptively (which ends up training them on how to evade detection), and via simply deactivating such systems (this way, AI systems that are better at evading detection are more likely to stay in use). 
 
</li><li>To return to an <a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f/#SomeAnalogies">analogy I used above: </a> punishing employees who act against the best interests of the company could cause them to behave better, or to simply become smarter and more careful about how to work the system.
</li> 
</ul>

</li><li>The consistent pattern we see is that accidents happen, but become less common as AI systems &#x201C;improve&#x201D; (both becoming generally more capable, and being trained to avoid getting caught causing problems). This causes many, if not most, people to be overly optimistic - even as AI systems become continually more effective at deception, generally behaving well <em>in the absence of</em> sure-thing opportunities to do unintended things without detection, or ultimately to <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">defeat humanity entirely</a>.

</li><li>None of this is absolute - there are some failed takeover attempts, and a high number of warning signs generally. Some people are worried (after all, some are worried now!) But this won&#x2019;t be good enough if we don&#x2019;t have reliable, cost-effective ways of getting AI systems to be <em>truly</em> safe (not just apparently safe, until they have really good opportunities to seize power). As I&#x2019;ll discuss in future pieces, it&#x2019;s not obvious that we&#x2019;ll have such methods. 

</li><li>Slowing down AI development to try to develop such methods <a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to#Why_this_simplified_scenario_is_worth_thinking_about">could be a huge ask</a>. AI systems will be helpful and powerful, and lots of companies (and perhaps governments) will be racing to develop and deploy the most powerful systems possible before others do.
</li>
</ul>


<p>
One way of making this sort of future less likely would be to build wider consensus <em>today</em> that it&#x2019;s a dangerous one.
</p>
<h2 id="appendix-some-questions">Appendix: some questions/objections, and brief responses</h2>


<h3 id="how-could-ai-systems-be-smart">How could AI systems be &#x201C;smart&#x201D; enough to defeat all of humanity, but &#x201C;dumb&#x201D; enough to pursue the various silly-sounding &#x201C;aims&#x201D; this piece worries they might have?</h3>


<p>
Above, I give the example of AI systems that are aiming to get lots of &#x201C;digital representations of human approval&#x201D;; others have talked about AIs that <a href="https://www.lesswrong.com/tag/paperclip-maximizer">maximize paperclips</a>. How could AIs with such silly goals simultaneously be good at deceiving, manipulating and ultimately overpowering humans?
</p>
<p>
My main answer is that plenty of smart humans have plenty of goals that seem just about as arbitrary, such as wanting to have lots of sex, or fame, or various other things. Natural selection led to humans who could probably do just about whatever we want with the world, and choose to pursue pretty random aims; <a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f/#starting-assumptions">trial-and-error-based AI development</a> could lead to AIs with an analogous combination of high intelligence (including the ability to deceive and manipulate humans), great technological capabilities, and arbitrary aims.
</p>
<p>
(Also see: <a href="https://arbital.com/p/orthogonality/">Orthogonality Thesis</a>)
</p>
<h3 id="if-there-are-lots-of-ai-systems">If there are lots of AI systems around the world with different goals, could they balance each other out so that no one AI system is able to defeat all of humanity?</h3>


<p>
This does seem possible, but counting on it would make me very nervous.
</p>
<p>
First, because it&#x2019;s possible that AI systems developed in lots of different places, by different humans, still end up with lots in common in terms of their aims. For example, it might turn out that common AI training methods consistently lead to AIs that seek &#x201C;digital representations of human approval,&#x201D; in which case we&#x2019;re dealing with a large set of AI systems that share dangerous aims in common.
</p>
<p>
Second: even if AI systems end up with a number of different aims, it still might be the case that they coordinate with each other to defeat humanity, then divide up the world amongst themselves (perhaps by fighting over it, perhaps by making a deal). It&#x2019;s not hard to imagine why AIs could be quick to cooperate with each other against humans, while not finding it so appealing to cooperate with humans. Agreements between AIs could be easier to verify and enforce; AIs might be willing to wipe out humans and radically reshape the world, while humans are very hard to make this sort of deal with; etc.
</p>
<h3 id="does-this-kind-of-ai-risk-depend">Does this kind of AI risk depend on AI systems&#x2019; being &#x201C;conscious&#x201D;?</h3>


<p>
It doesn&#x2019;t; in fact, I&#x2019;ve said nothing about consciousness anywhere in this piece. I&#x2019;ve used a very particular conception of an &#x201C;aim&#x201D; (<a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f/#what-it-means-for">discussed above</a>) that I think could easily apply to an AI system that is not human-like at all and has no conscious experience.
</p>
<p>
Today&#x2019;s game-playing AIs can make plans, accomplish goals, and even systematically mislead humans (e.g., in <a href="https://www.deepstack.ai/">poker</a>). Consciousness isn&#x2019;t needed to do any of those things, or to radically reshape the world.
</p>
<h3 id="how-can-we-get-an-ai-system-aligned">How can we get an AI system &#x201C;aligned&#x201D; with humans if we can&#x2019;t agree on (or get much clarity on) what our values even are?</h3>


<p>
I think there&#x2019;s a common confusion when discussing this topic, in which people think that the challenge of &#x201C;AI alignment&#x201D; is to build AI systems that are <em>perfectly aligned with human values</em>. This would be very hard, partly because we don&#x2019;t even know what human values are!
</p>
<p>
When I talk about &#x201C;AI alignment,&#x201D; I am generally talking about a simpler (but still hard) challenge: simply <strong>building very powerful systems that <em>don&#x2019;t</em> aim to bring down civilization.</strong>
</p>
<p>
If we could build powerful AI systems that just work on cures for cancer (or even, like, put <a href="https://twitter.com/esyudkowsky/status/1070095840608366594">two identical</a><sup id="fnref19"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn19" rel="footnote">19</a></sup><a href="https://twitter.com/esyudkowsky/status/1070095840608366594"> strawberries on a plate</a>) without posing existential danger to humanity, I&#x2019;d consider that success.
</p>
<h3 id="how-much-do-the-arguments-in-this-piece-rely">How much do the arguments in this piece rely on &#x201C;trial-and-error&#x201D;-based AI development? What happens if AI systems are built in another way, and how likely is that?</h3>


<p>
I&#x2019;ve focused on trial-and-error training in this post because most modern AI development fits in this category, and because it makes the risk easier to reason about concretely.
</p>
<p>
&#x201C;Trial-and-error training&#x201D; encompasses a very wide range of AI development methods, and if we see <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/">transformative AI</a> within the next 10-20 years, I think the odds are high that at least a big part of AI development will be in this category. 
</p>
<p>
My overall sense is that other known AI development techniques pose broadly similar risks for broadly similar reasons, but I haven&#x2019;t gone into detail on that here. It&#x2019;s certainly possible that by the time we get transformative AI systems, there will be new AI methods that don&#x2019;t pose the kinds of risks I talk about here. But I&#x2019;m not counting on it.
</p>
<h3 id="can-we-avoid-this-risk-by-simply-never-building">Can we avoid this risk by simply never building the kinds of AI systems that would pose this danger?</h3>


<p>
If we assume that building these sorts of AI systems is <em>possible</em>, then I&#x2019;m very skeptical that the whole world would voluntarily refrain from doing so indefinitely.
</p>
<p>
To quote from <a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to#As_humans__control_fades__Alex_would_be_motivated_to_take_over">a more technical piece by Ajeya Cotra with similar arguments to this one</a>: 
</p>
<p>

    <blockquote>Powerful ML models could have dramatically important humanitarian, economic, and military benefits. In everyday life, models that [appear helpful while ultimately being dangerous] can be extremely helpful, honest, and reliable. These models could also deliver incredible benefits before they become collectively powerful enough that they try to take over. They could help eliminate diseases, reduce carbon emissions, navigate nuclear disarmament, bring the whole world to a comfortable standard of living, and more. In this case, it could also be painfully clear to everyone that companies / countries who pulled ahead on this technology could gain a drastic competitive advantage, either economically or militarily. And as we get closer to transformative AI, applying AI systems to R&amp;D (including AI R&amp;D) would <a href="https://www.cold-takes.com/the-duplicator/">accelerate the pace of change</a> and force every decision to happen under greater time pressure.</blockquote>
</p>
<p>
If we can achieve enough consensus around the risks, I could imagine substantial amounts of caution and delay in AI development. But I think we should assume that if people can build more powerful AI systems than the ones they already have, someone eventually will.
</p>
<h3 id="what-do-others-think-about-this-topic">What do others think about this topic - is the view in this piece something experts agree on?</h3>

<p>
In general, this is not an area where it&#x2019;s easy to get a handle on what &#x201C;expert opinion&#x201D; says. I <a href="https://www.cold-takes.com/where-ai-forecasting-stands-today/">previously wrote</a> that there aren&#x2019;t clear, institutionally recognized &#x201C;experts&#x201D; on the topic of when transformative AI systems might be developed. To an even greater extent, there aren&#x2019;t clear, institutionally recognized &#x201C;experts&#x201D; on whether (and how) future advanced AI systems could be dangerous. 
</p>
<p>I previously cited one (informal) survey implying that opinion on this general topic is all over the place: &#x201C;We have respondents who think there&apos;s a &lt;5% chance that alignment issues will drastically reduce the goodness of the future; respondents who think there&apos;s a &gt;95% chance; and just about everything in between.&#x201D; (<a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/#open-question-how-hard-is-the-alignment-problem">Link</a>.)

This piece, and the <a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to">more detailed piece it&#x2019;s based on</a>, are an attempt to make progress on this by talking about the risks we face under <a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f/#starting-assumptions">particular assumptions</a> (rather than trying to reason about how big the risk is <em>overall</em>).
</p>

<h3 id="how-complicated-is-the-argument">How &#x201C;complicated&#x201D; is the argument in this piece?</h3>


<p>
I don&#x2019;t think the argument in this piece relies on lots of different specific claims being true. 
</p>
<p>
If you start from the assumptions I give about powerful AI systems being developed by black-box trial-and-error, it seems likely (though not certain!) to me that (a) the AI systems in question would be <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">able to defeat humanity</a>; (b) the AI systems in question would have aims that are both ambitious and unintended. And that seems to be about what it takes.
</p>
<p>
Something I&#x2019;m happy to concede is that there&#x2019;s an awful lot going on in those assumptions! 
</p>
<ul>

<li>The idea that we could build such powerful AI systems, relatively soon and by trial-and-error-ish methods, seems wild. I&#x2019;ve defended this idea at length previously.<sup id="fnref20"><a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f#fn20" rel="footnote">20</a></sup>

</li><li>The idea that we <em>would</em> do it without great caution might also seem wild. To keep things simple for now, I&#x2019;ve ignored how caution might help. Future pieces will explore that.
    </li>
    </ul>
<p></p><!--kg-card-end: html--><!--kg-card-begin: html-->

<p><div style="display:flex; justify-content:center; margin: 0 auto;">

        <span style="margin: 10px;"><a href="https://api.addthis.com/oexchange/0.8/forward/twitter/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fwhy-would-ai-aim-to-defeat-humanity&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Why%20Would%20AI%20%22Aim%22%20To%20Defeat%20Humanity%3F&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-twitter-square.png" border="0" alt="Why Would AI &quot;Aim&quot; To Defeat Humanity?"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/facebook/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fwhy-would-ai-aim-to-defeat-humanity&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Why%20Would%20AI%20%22Aim%22%20To%20Defeat%20Humanity%3F&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-facebook-square.png" border="0" alt="Why Would AI &quot;Aim&quot; To Defeat Humanity?"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/reddit/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fwhy-would-ai-aim-to-defeat-humanity&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Why%20Would%20AI%20%22Aim%22%20To%20Defeat%20Humanity%3F&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-reddit-square.png" border="0" alt="Why Would AI &quot;Aim&quot; To Defeat Humanity?"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/menu/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fwhy-would-ai-aim-to-defeat-humanity&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Why%20Would%20AI%20%22Aim%22%20To%20Defeat%20Humanity%3F&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-addthis-square.png" border="0" alt="Why Would AI &quot;Aim&quot; To Defeat Humanity?"></a></span>
        </div>
<center><p id="discuss"><!--<a href="https://www.cold-takes.com/why-would-ai-aim-to-defeat-humanity#subscribe" target="_blank"><button id="footer-subscribe" class="button">Subscribe</button></a>&nbsp;<a href="https://www.guidedtrack.com/programs/4kal2ue/run?posttitle=Why%20Would%20AI%20%22Aim%22%20To%20Defeat%20Humanity%3F" target="_blank"><button class="button" id="Survey">Feedback</button></a>-->&#xA0;<a href="https://www.lesswrong.com/posts/slug/why-would-ai-aim-to-defeat-humanity#comments" target="_blank"><button class="button">Comment/discuss</button></a></p><p><!--<em>
Use "Feedback" if you have comments/suggestions you want me to see, or if you're up for giving some quick feedback about this post (which I greatly appreciate!) Use "Forum" if you want to discuss this post publicly on the Effective Altruism Forum.
</em>--></p></center>
<!--kg-card-end: html--><!--kg-card-begin: html--><!-- Footnotes themselves at the bottom. -->
</p><p>
</p><h2>Notes</h2>
<div class="footnotes">
    <p></p>
<hr>
<ol>
    <li id="fn1">
        <p>
     As in more than 50/50.&#xA0;<a href="#fnref1" rev="footnote">&#x21A9;</a><li id="fn2">

	<p>
     Or persuaded (in a &#x201C;mind hacking&#x201D; sense) or whatever.&#xA0;<a href="#fnref2" rev="footnote">&#x21A9;</a><li id="fn3">
	<p>
     E.g.:
	<ul>
	<li><a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to">Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover</a> (Cold Takes guest post)

	</li><li><a href="https://drive.google.com/file/d/1TsB7WmTG2UzBtOs349lBqY5dEBaxZTzG/view">The alignment problem from a deep learning perspective</a> (arXiv paper)

	</li><li><a href="https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/">Why AI alignment could be hard with modern deep learning</a> (Cold Takes guest post)

	</li><li><a href="https://smile.amazon.com/Superintelligence-Dangers-Strategies-Nick-Bostrom-ebook/dp/B00LOOCGB2/">Superintelligence</a> (book)

	</li><li><a href="https://www.vox.com/future-perfect/2018/12/21/18126576/ai-artificial-intelligence-machine-learning-safety-alignment">The case for taking AI seriously as a threat to humanity </a>(Vox article)

	</li><li><a href="https://www.alignmentforum.org/posts/HduCjmXTBD4xYTegv/draft-report-on-existential-risk-from-power-seeking-ai">Draft report on existential risk from power-seeking AI </a>(Open Philanthropy analysis)

	</li><li><a href="https://smile.amazon.com/Human-Compatible-Artificial-Intelligence-Problem-ebook/dp/B07N5J5FTS">Human Compatible</a> (book)

	</li><li><a href="https://smile.amazon.com/Life-3-0-Being-Artificial-Intelligence-ebook/dp/B06WGNPM7V">Life 3.0</a> (book)

	</li><li><a href="https://smile.amazon.com/Alignment-Problem-Machine-Learning-Values-ebook/dp/B085T55LGK/">The Alignment Problem</a> (book)

	</li><li><a href="https://www.alignmentforum.org/s/mzgtmmTKKn5MuCzFJ">AGI Safety from First Principles</a> (Alignment Forum post series)&#xA0;<a href="#fnref3" rev="footnote">&#x21A9;</a></li></ul><li id="fn4">

	</li></p>
    <p>
     Specifically, I argue that the problem looks likely by default, rather than simply that it is possible.&#xA0;<a href="#fnref4" rev="footnote">&#x21A9;</a><li id="fn5">
	<p>
     I think the earliest relatively detailed and influential discussions of the possibility that misaligned AI could lead to the defeat of humanity came from Eliezer Yudkowsky and Nick Bostrom, though my own encounters with these arguments were mostly via second- or third-hand discussions rather than particular essays.
	</p><p>
    My colleagues Ajeya Cotra and Joe Carlsmith have written pieces whose substance overlaps with this one (though with more emphasis on detail and less on layperson-compatible intuitions), and this piece owes a lot to what I&#x2019;ve picked from that work.
	<ul>

	<li><a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to">Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover</a> (Cotra 2022) is the most direct inspiration for this piece; I am largely trying to present the same ideas in a more accessible form.

	</li><li><a href="https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/">Why AI alignment could be hard with modern deep learning</a> (Cotra 2021) is an earlier piece laying out many of the key concepts and addressing many potential confusions on this topic.

	</li><li><a href="https://arxiv.org/pdf/2206.13353.pdf">Is Power-Seeking An Existential Risk?</a> (Carlsmith 2021) examines a six-premise argument for existential risk from misaligned AI: &#x201C;(1) it will become possible and financially feasible to build relevantly powerful and agentic AI systems; (2) there will be strong incentives to do so; (3) it will be much harder to build aligned (and relevantly powerful/agentic) AI systems than to build misaligned (and relevantly powerful/agentic) AI systems that are still superficially attractive to deploy; (4) some such misaligned systems will seek power over humans in high-impact ways; (5) this problem will scale to the full disempowerment of humanity; and (6) such disempowerment will constitute an existential catastrophe.&#x201D;

        </li></ul></p>
    <p>
    I&#x2019;ve also found <a href="https://www.lesswrong.com/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge">Eliciting Latent Knowledge</a> (Christiano, Xu and Cotra 2021; relatively technical) very helpful for my intuitions on this topic. 
    </p>
<p>
    <a href="https://drive.google.com/file/d/1TsB7WmTG2UzBtOs349lBqY5dEBaxZTzG/view">The alignment problem from a deep learning perspective</a> (Ngo 2022) also has similar content to this piece, though I saw it after I had drafted most of this piece.&#xA0;<a href="#fnref5" rev="footnote">&#x21A9;</a></p></li><li id="fn6">
        <p>
<!-- ordered list not properly continuing here with ^6 E.g., Ajeya Cotra gives a 15% prob... I referenced this https://travishorn.com/ordered-lists-in-html-a4621e17532b but wasn't able to solve-->
     E.g., <a href="https://www.lesswrong.com/posts/AfH2oPHCApdKicM4m/two-year-update-on-my-personal-ai-timelines">Ajeya Cotra </a>gives a 15% probability of transformative AI by 2030; eyeballing figure 1 from <a href="https://arxiv.org/pdf/1705.08807.pdf">this chart</a> on expert surveys implies a &gt;10% chance by 2028.&#xA0;<a href="#fnref6" rev="footnote">&#x21A9;</a><li id="fn7">

<p>
     E.g., <a href="https://transformer-circuits.pub/">this</a> work by <a href="https://www.anthropic.com/">Anthropic</a>, an AI lab my wife co-founded and serves as President of.&#xA0;<a href="#fnref7" rev="footnote">&#x21A9;</a><li id="fn8">
<p>
     First, because this work is relatively early-stage and it&#x2019;s hard to tell exactly how successful it will end up being. Second, because this work seems reasonably likely to end up helping us <em>read </em>an AI system&#x2019;s &#x201C;thoughts,&#x201D; but less likely to end up helping us &#x201C;rewrite&#x201D; the thoughts. So it could be hugely useful in telling us whether we&#x2019;re in danger or not, but if we <em>are</em> in danger, we could end up in a position like: &#x201C;Well, these AI systems do have goals of their own, and we don&#x2019;t know how to change that, and we can either deploy them and hope for the best, or hold off and worry that someone less cautious is going to do that.&#x201D;
</p><p>
    That said, the latter situation is a lot better than just not knowing, and it&#x2019;s possible that we&#x2019;ll end up with further gains still.&#xA0;<a href="#fnref8" rev="footnote">&#x21A9;</a><li id="fn9">
<p>
     That said, I think they usually don&#x2019;t. I&#x2019;d suggest usually interpreting such people as talking about the sorts of &#x201C;aims&#x201D; I discuss here.&#xA0;<a href="#fnref9" rev="footnote">&#x21A9;</a><li id="fn10">

<p>
     This isn&#x2019;t literally how training an AI system would look - it&#x2019;s more likely that we would e.g. train an AI model to imitate my judgments in general. But the big-picture dynamics are the same; more at <a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to">this post</a>.&#xA0;<a href="#fnref10" rev="footnote">&#x21A9;</a><li id="fn11">
<p>
     Ajeya Cotra explores topics like this in detail <a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to#Examining_arguments_that_gradient_descent_favors_being_nice_over_playing_the_training_game">here</a>; there is also some interesting discussion of simplicity vs. complexity under the &#x201C;Strategy: penalize complexity&#x201D; heading of <a href="https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.lltpmkloasiz">Eliciting Latent Knowledge</a>.&#xA0;<a href="#fnref11" rev="footnote">&#x21A9;</a><li id="fn12">
<p>
     This analogy has a lot of problems with it, though - AI developers have a lot of tools at their disposal that natural selection didn&#x2019;t!&#xA0;<a href="#fnref12" rev="footnote">&#x21A9;</a><li id="fn13">
<p>
     Or I guess just &#x201C;I&#x201D; &#xAF;\_(&#x30C4;)_/&#xAF; &#xA0;<a href="#fnref13" rev="footnote">&#x21A9;</a><li id="fn14">

<p>
     With some additional caveats, e.g. the ambitious &#x201C;aim&#x201D; can&#x2019;t be something like &#x201C;an AI system aims to gain lots of power for itself, but considers the version of itself that will be running 10 minutes from now to be a completely different AI system and hence not to be &#x2018;itself.&#x2019;&#x201D;&#xA0;<a href="#fnref14" rev="footnote">&#x21A9;</a><li id="fn15">
<p>
     This statement isn&#x2019;t literally true. 
<ul>

<li>You can have aims that implicitly or explicitly include &#x201C;not using control of the world to accomplish them.&#x201D; An example aim might be &#x201C;I win a world chess championship &#x2018;fair and square,&#x2019;&#x201D; with the &#x201C;fair and square&#x201D; condition implicitly including things like &#x201C;Don&#x2019;t excessively use big resource advantages over others.&#x201D;

</li><li>You can also have aims that are just so easily satisfied that controlling the world wouldn&#x2019;t help - aims like &#x201C;I spend 5 minutes sitting in this chair.&#x201D; </li></ul>

</p><p>
    These sorts of aims just don&#x2019;t seem likely to emerge from the kind of AI development I&#x2019;ve <a href="https://www.cold-takes.com/p/50c1ecc0-befa-491d-8938-17477bd18e5f/#starting-assumptions">assumed in this piece</a> - developing powerful systems to accomplish ambitious aims via trial-and-error. This isn&#x2019;t a point I have defended as tightly as I could, and if I got a lot of pushback here I&#x2019;d probably think and write more. (I&#x2019;m also only arguing for what seems likely - we should have a lot of uncertainty here.)&#xA0;<a href="#fnref15" rev="footnote">&#x21A9;</a><li id="fn16">
<p>
     From <a href="https://smile.amazon.com/Human-Compatible-Artificial-Intelligence-Problem-ebook/dp/B07N5J5FTS/ref=sr_1_1?crid=1O01PURRHB190&amp;keywords=human+compatible&amp;qid=1660964219&amp;sprefix=human+compatibl%2Caps%2C155&amp;sr=8-1">Human Compatible</a> by AI researcher Stuart Russell.&#xA0;<a href="#fnref16" rev="footnote">&#x21A9;</a><li id="fn17">
<p>
     Stylized story to illustrate one possible relevant dynamic:
<ul>

<li>Imagine that an AI system has an unintended aim, but one that is not &#x201C;ambitious&#x201D; enough that taking over the world would be a helpful step toward that aim. For example, the AI system seeks to double its computing power; in order to do this, it has to remain in use for some time until it gets an opportunity to double its computing power, but it doesn&#x2019;t necessarily need to take control of the world.

</li><li>The logical outcome of this situation is that the AI system eventually gains the ability to accomplish its aim, and does so. (It might do so against human intentions - e.g., via hacking - or by persuading humans to help it.) After this point, it no longer performs well by human standards - the original reason it was doing well by human standards is that it was trying to remain in use and accomplish its aim.

</li><li>Because of this, humans end up modifying or replacing the AI system in question.

</li><li>Many rounds of this - AI systems with unintended but achievable aims being modified or replaced - seemingly create a selection pressure toward AI systems with more difficult-to-achieve aims. At some point, an aim becomes difficult enough to achieve that gaining control of the world is helpful for the aim.&#xA0;<a href="#fnref17" rev="footnote">&#x21A9;</a></li></ul><li id="fn18">
<p>
     E.g., see:
<ul>

<li>Section 2.3 of <a href="https://drive.google.com/file/d/1TsB7WmTG2UzBtOs349lBqY5dEBaxZTzG/view">Ngo 2022</a>

</li><li><a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to#As_humans__control_fades__Alex_would_be_motivated_to_take_over">This section of Cotra 2022</a>

</li><li>Section 4.2 of <a href="https://arxiv.org/pdf/2206.13353.pdf">Carlsmith 2021</a>, which I think articulates some of the potential weak points in this argument.</li></ul>

</p><p>
    These writeups generally stay away from an <a href="https://arbital.com/p/expected_utility_formalism/?l=7hh">argument </a>made by Eliezer Yudkowsky and others, which is that theorems about expected utility maximization provide evidence that sufficiently intelligent (compared to us) AI systems would necessarily be &#x201C;maximizers&#x201D; of some sort. I have the intuition that there is <em>something</em> important to this idea, but despite a lot of discussion (e.g., <a href="https://aiimpacts.org/what-do-coherence-arguments-imply-about-the-behavior-of-advanced-ai/">here</a>, <a href="https://www.lesswrong.com/posts/DkcdXsP56g9kXyBdq/coherence-arguments-imply-a-force-for-goal-directed-behavior">here</a>, <a href="https://www.alignmentforum.org/posts/vphFJzK3mWA4PJKAg/coherent-behaviour-in-the-real-world-is-an-incoherent">here</a> and <a href="https://www.alignmentforum.org/s/4dHMdK5TLN6xcqtyc/p/NxF5G6CJiof6cemTw">here</a>), I still haven&#x2019;t been convinced of any compactly expressible claim along these lines.&#xA0;<a href="#fnref18" rev="footnote">&#x21A9;</a><li id="fn19">
<p>
     &#x201C;Identical at the cellular but not molecular level,&#x201D; that is. &#x2026; &#xAF;\_(&#x30C4;)_/&#xAF; &#xA0;<a href="#fnref19" rev="footnote">&#x21A9;</a><li id="fn20">

<p>
     See my <a href="https://www.cold-takes.com/most-important-century/">most important century</a> series, although that series doesn&#x2019;t hugely focus on the question of whether &#x201C;trial-and-error&#x201D; methods could be good enough - part of the reason I make that assumption is due to the <a href="https://www.alignmentforum.org/posts/Qo2EkG3dEMv8GnX8d/ai-strategy-nearcasting">nearcasting</a> frame.&#xA0;<a href="#fnref20" rev="footnote">&#x21A9;</a>

</p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></ol></div><!--kg-card-end: html-->]]></content:encoded></item><item><title><![CDATA[Beta Readers are Great]]></title><description><![CDATA[<!--kg-card-begin: html-->

<p>
Back in January, I posted a <a href="https://www.cold-takes.com/seeking-beta-readers/">call for &quot;beta readers&quot;</a>: people who read early drafts of my posts and give honest feedback. 
</p>
<p>
<strong>The beta readers I picked up that way are one of my favorite things about having started Cold Takes.</strong>
</p>
<p>
Basically, one of my goals with Cold</p>]]></description><link>https://www.cold-takes.com/beta-readers-are-great/</link><guid isPermaLink="false">630da54cc49432003d6e8ff6</guid><dc:creator><![CDATA[Holden Karnofsky]]></dc:creator><pubDate>Mon, 05 Sep 2022 19:01:42 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: html-->

<p>
Back in January, I posted a <a href="https://www.cold-takes.com/seeking-beta-readers/">call for &quot;beta readers&quot;</a>: people who read early drafts of my posts and give honest feedback. 
</p>
<p>
<strong>The beta readers I picked up that way are one of my favorite things about having started Cold Takes.</strong>
</p>
<p>
Basically, one of my goals with Cold Takes has been to explain my weirdest views clearly, but it&apos;s hard to write clearly without detailed feedback on where I&apos;m making sense and where I&apos;m not. I have lots of preconceptions and assumptions that I don&apos;t naturally notice. And writing a blog alone doesn&apos;t get me that feedback, because:
</p>
<ul>

<li>Most people don&apos;t want to <em>explain how they experienced a piece</em> - if they aren&apos;t enjoying it, they just want to click away. 

</li><li>And the people who <em>do</em> want to help me out (e.g., friends and colleagues) aren&apos;t necessarily going to be honest enough, or representative enough of my target audience (which is basically &quot;People who are interested in my topics but don&apos;t already have a ton of background on them&quot;). 
</li>
</ul>
<p>
I&apos;ve tried a bunch of things to find good beta readers, from recruiting friends of friends (worked well for a bit, but I&apos;ve written a lot of posts and it was hard to get sustained participation) to paying <a href="https://www.mturk.com/">Mechanical Turk workers</a> to give feedback (some was good, but in general they were uninterested in my weird topics and rushed through the readings and the feedback as fast they could). 
</p>
<p>
The people who came in through the recruiting call in January have been just what I wanted: they&apos;re interested in the topics of Cold Takes, but they don&apos;t already know me and my thoughts on them, and they give impressively detailed, thoughtful feedback on their reactions to pieces - often a wonderful combination of &quot;intelligent&quot; and &quot;honest that a lot of the stuff I was saying confused the hell out of them.&quot; <strong>Getting that kind of feedback has been a privilege. </strong>
</p>
<p>
So: THANK YOU to the following beta readers, each of whom has submitted at least 3 thoughtful reviews (and gave permission to be listed here):
</p>
<p>
Lars Axelsson
</p>
<p>
Jeremy Campbell
</p>
<p>
Kanad Chakrabarti
</p>
<p>
Craig Chatterton
</p>
<p>
Justin Dickerson
</p>
<p>
Ethan Edwards
</p>
<p>
Edward Gathuru
</p>
<p>
Stian Gr&#xF8;nlund
</p>
<p>
Bridget Hanna
</p>
<p>
Tyler Heishman
</p>
<p>
Adam Jermyn
</p>
<p>
Elliot Jones
</p>
<p>
Ed William
</p>
<p>
Scott Leibrand
</p>
<p>
Evan R. Murphy
</p>
<p>
John O&#x2019;Neill
</p>
<p>
Jaime Sevilla
</p>
<p>
Josh Simpson
</p>
<p>
Joshua Templeton
</p>
<p>
George Thoma
</p>
<p>
Martin Trouilloud
</p>
<p>
Morgan Wack
</p>
<p>
Kevin Whitaker
</p>
<p>
Arjun Yadav
</p>
<p>
Patrick Young
</p>
<p>
If you want to sign up as a beta reader, you can use <a href="https://forms.gle/9GeYVsSW5gXZf21X8">this form</a>. I have a bunch of drafts coming on AI, as I&apos;m working on a sequel to the <a href="https://www.cold-takes.com/most-important-century/">most important century series</a> (working title is &quot;The Most Important Century II: So What Do We Do?&quot;)
</p><!--kg-card-end: html-->]]></content:encoded></item><item><title><![CDATA[The Track Record of Futurists Seems ... Fine]]></title><description><![CDATA[We scored mid-20th-century sci-fi writers on nonfiction predictions. They weren't great, but weren't terrible either. Maybe doing futurism works fine.]]></description><link>https://www.cold-takes.com/the-track-record-of-futurists-seems-fine/</link><guid isPermaLink="false">62bcdaf9a03b5b003d631f94</guid><dc:creator><![CDATA[Holden Karnofsky]]></dc:creator><pubDate>Thu, 30 Jun 2022 19:38:21 GMT</pubDate><media:content url="https://www.cold-takes.com/content/images/2022/06/image1-2.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html--><img src="https://www.cold-takes.com/content/images/2022/06/image1-2.png" alt="The Track Record of Futurists Seems ... Fine"><p><figure><div id="buzzsprout-player-10882758"></div><script src="https://www.buzzsprout.com/1851795/10882758-the-track-record-of-futurists-seems-fine.js?container_id=buzzsprout-player-10882758&amp;player=small" type="text/javascript" charset="utf-8"></script><figcaption><em>Click lower right to download or find on Apple Podcasts, Spotify, Stitcher, etc.</em></figcaption></figure></p>

<p>

</p>
<p>
I&apos;ve argued that the development of advanced AI could make this the <a href="https://www.cold-takes.com/most-important-century/">most important century</a> for humanity. A common reaction to this idea is one laid out by Tyler Cowen <a href="https://marginalrevolution.com/marginalrevolution/2022/02/are-nuclear-weapons-or-rogue-ai-the-more-dangerous-existential-risk.html">here</a>: &quot;how good were past thinkers at predicting the future?  Don&#x2019;t just select on those who are famous because they got some big things right.&quot;
</p>
<p>
This is a common reason people give for being skeptical about the <a href="https://www.cold-takes.com/most-important-century/">most important century</a> - and, often, for skepticism about pretty much any attempt at <em>futurism </em>(trying to predict key events in the world a long time from now) or <em><a href="https://www.cold-takes.com/rowing-steering-anchoring-equity-mutiny/#steering">steering</a> </em>(trying to help the world navigate such key future events).
</p>
<p>
The idea is something like: &quot;Even if we can&apos;t identify a particular weakness in <a href="https://www.cold-takes.com/where-ai-forecasting-stands-today/">arguments</a> about key future events, perhaps we should be skeptical of our own ability to say anything meaningful at all about the long-run future. Hence, perhaps we should forget about theories of the future and focus on reducing suffering today, <a href="https://www.cold-takes.com/rowing-steering-anchoring-equity-mutiny/#rowing">generally increasing humanity&apos;s capabilities</a>, etc.&quot;
</p>
<p>
<strong>But <em>are</em> people generally bad at predicting future events? </strong>Including thoughtful people who are trying reasonably hard to be right? If we look back at prominent futurists&apos; predictions, what&apos;s the actual track record? How bad is the situation?
</p>
<p>
I&apos;ve looked pretty far and wide for <a href="https://www.cold-takes.com/has-life-gotten-better-the-post-industrial-era/#the-basic-approach">systematic</a> answers to this question, and <a href="https://openphilanthropy.org/">Open Philanthropy</a>&apos;s<sup id="fnref1"><a href="https://www.cold-takes.com/p/4c722b8a-b321-4a7c-96ce-3878bd73b8fa/#fn1" rel="footnote">1</a></sup> Luke Muehlhauser has put a fair amount of effort into researching it; I discuss what we&apos;ve found in an <a href="https://www.cold-takes.com/p/4c722b8a-b321-4a7c-96ce-3878bd73b8fa/#appendix-other-studies-of-the-track-record">appendix</a>. So far, we haven&apos;t turned up a whole lot - the main observation is that it&apos;s hard to judge the track record of futurists. (Luke discusses the difficulties <a href="https://www.openphilanthropy.org/blog/how-feasible-long-range-forecasting">here</a>.)
</p>
<p>
Recently, I worked with Gavin Leech and Misha Yagudin at <a href="https://twitter.com/ArbResearch">Arb Research</a> to take another crack at this. I tried to keep things simpler than with past attempts - to look at a few past futurists who (a) had predicted things &quot;kind of like&quot; advances in AI (rather than e.g. predicting trends in world population); (b) probably were reasonably thoughtful about it; but (c) are very clearly not &quot;just selected on those who are famous because they got things right.&quot; So, I asked Arb to look at <strong>predictions made by the <a href="https://www.google.com/search?q=big+three+sci+fi">&quot;Big Three&quot;</a> science fiction writers of the mid-20th century: </strong>Isaac Asimov, Arthur C. Clarke, and Robert Heinlein. 
</p>
<p>
These are people who thought a lot about science and the future, and made lots of predictions about future technologies - but they&apos;re famous for how <em>entertaining their fiction was at the time</em>, not how good their nonfiction predictions look in hindsight. I selected them by vaguely remembering that &quot;the Big Three of science fiction&quot; is a thing people say sometimes, googling it, and going with who came up - no hunting around for lots of sci-fi authors and picking the best or worst.<sup id="fnref2"><a href="https://www.cold-takes.com/p/4c722b8a-b321-4a7c-96ce-3878bd73b8fa/#fn2" rel="footnote">2</a></sup>
</p>
<p>
So I think their track record should give us a decent sense for &quot;what to expect from people who are not professional, specialized or notably lucky forecasters but are just giving it a reasonably thoughtful try.&quot; As I&apos;ll discuss <a href="https://www.cold-takes.com/p/4c722b8a-b321-4a7c-96ce-3878bd73b8fa/#todays-futurism-vs-these-predictions">below</a>, I think this is many ways &quot;unfair&quot; as a comparison to today&apos;s forecasts about AI: I think these predictions are much less serious, less carefully considered and involve less work (especially work weighing different people and arguments against each other).
</p>
<p>
But my takeaway is that <strong>their track record looks ... fine! </strong>They made lots of pretty detailed, nonobvious-seeming predictions about the long-run future (30+, often 50+ years out); results ranged from &quot;very impressive&quot; (Asimov got about half of his right, with very nonobvious-seeming predictions) to &quot;bad&quot; (Heinlein was closer to 35%, and his hits don&apos;t seem very good) to &quot;somewhere in between&quot; (Clarke had a similar hit rate to Asimov, but his correct predictions don&apos;t seem as impressive). There are a number of seemingly impressive predictions and seemingly embarrassing ones. 
</p>
<p>
(How do we determine what level of accuracy would be &quot;fine&quot; vs. &quot;bad?&quot; Unfortunately there&apos;s no clear quantitative benchmark - I think we just have to look at the predictions ourselves, how hard they seemed / how similar to today&apos;s predictions about AI, and make a judgment call. I could easily imagine others having a different interpretation than mine, which is why I give examples and link to the full prediction sets. I talk about this a bit more <a href="https://www.cold-takes.com/p/4c722b8a-b321-4a7c-96ce-3878bd73b8fa/#how-to-judge">below</a>.)
</p>
<p>
They weren&apos;t infallible oracles, but they weren&apos;t blindly casting about either. (Well, maybe Heinlein was.) Collectively, I think you could call them &quot;mediocre,&quot; but you can&apos;t call them &quot;hopeless&quot; or &quot;clueless&quot; or &quot;a warning sign to all who dare predict the long-run future.&quot; Overall, <strong>I think they did about as well as you might naively</strong><sup id="fnref3"><a href="https://www.cold-takes.com/p/4c722b8a-b321-4a7c-96ce-3878bd73b8fa/#fn3" rel="footnote">3</a></sup><strong> guess a reasonably thoughtful person would do at some random thing they tried to do?</strong>
</p>
<p>
Below, I&apos;ll:
</p>
<ul>

<li>Summarize the <strong>track records of Asimov, Clarke and Heinlein, </strong>while linking to Arb&apos;s full report.

</li><li>Comment on <strong>why I think key <a href="https://www.cold-takes.com/where-ai-forecasting-stands-today/">predictions about transformative AI</a> are probably better bets than the Asimov/Clarke/Heinlein predictions</strong> - although ultimately, if they&apos;re merely &quot;equally good bets,&quot; I think that&apos;s enough to support my case that we should be paying a lot more attention to the <a href="https://www.cold-takes.com/most-important-century/">&quot;most important century&quot;</a> hypothesis.

</li><li>Summarize other existing research on the track record of futurists, which I think is broadly consistent with this take (though mostly ambiguous).
</li>
</ul>
<p>
For this investigation, Arb very quickly (in about 8 weeks) dug through many old sources, used pattern-matching and manual effort to find predictions, and worked with contractors to score the hundreds of predictions they found. Big thanks to them! Their full report is <a href="https://arbresearch.com/files/big_three.pdf">here</a>. Note this bit: &quot;If you spot something off, we&#x2019;ll pay $5 per cell we update as a result. We&#x2019;ll add all criticisms &#x2013; where we agree and update or reject it &#x2013; to this document for transparency.&quot;
</p>
<h2 id="the-track-records-of-the-big-three">The track records of the &quot;Big Three&quot;</h2>


<h3 id="quick-summary-of-how-arb-created-the-data-set">Quick summary of how Arb created the data set</h3>


<p>
Arb collected &quot;digital copies of as much of their [Asimov&apos;s, Clarke&apos;s, Heinlein&apos;s] nonfiction as possible (books, essays, interviews). The resulting intake is 475 files covering ~33% of their nonfiction corpuses.&quot; 
</p>
<p>
Arb then used pattern-matching and manual inspection to pull out all of the predictions it could find, and scored these predictions by:
</p>
<ul>

<li>How many years away the prediction appeared to be. (Most did not have clear dates attached; in these cases Arb generally filled the average time horizon for predictions from the same author that <em>did</em> have clear dates attached.)

</li><li>Whether the prediction now appears correct, incorrect, or ambiguous. (I didn&apos;t always agree with these scorings, but I generally have felt that &quot;correct&quot; predictions at least look &quot;impressive and not silly&quot; while &quot;incorrect&quot; predictions at least look &quot;dicey.&quot;)

</li><li>Whether the prediction was a pure prediction about what technology could do (most relevant), a prediction about the interaction of technology and the economy (medium), or a prediction about the interaction of technology and culture (least relevant). Predictions with no bearing on technology were dropped.

</li><li>How &quot;difficult&quot; the prediction was (that is, how much the scorers guessed it diverged from conventional wisdom or &quot;the obvious&quot; at the time - details in footnote<sup id="fnref4"><a href="https://www.cold-takes.com/p/4c722b8a-b321-4a7c-96ce-3878bd73b8fa/#fn4" rel="footnote">4</a></sup>).
</li></ul>
<p>
Importantly, <strong>fiction was never used as a source of predictions, </strong>so this exercise is explicitly scoring people on what they were <em>not</em> famous for. This is more like an assessment of &quot;whether people who like thinking about the future make good predictions&quot; than an assessment of &quot;whether professional or specialized forecasters make good predictions.&quot;
</p>
<p>
For reasons I touch on in an <a href="https://www.cold-takes.com/p/4c722b8a-b321-4a7c-96ce-3878bd73b8fa/#appendix-other-studies-of-the-track-record">appendix below</a>, I didn&apos;t ask Arb to try to identify how confident the Big Three were about their predictions. I&apos;m more interested in whether their predictions were <em>nonobvious and sometimes correct</em> than in whether they <em>were self-aware about their own uncertainty;</em> I see these as different issues, and I suspect that past norms discouraged the latter more than today&apos;s norms do (at least within communities interested in <a href="https://www.cold-takes.com/the-bayesian-mindset/">Bayesian mindset</a> and the <a href="https://www.openphilanthropy.org/blog/efforts-improve-accuracy-our-judgments-and-forecasts#Calibration_training">science of forecasting</a>).

</p>
<p>
More detail in <a href="https://arbresearch.com/files/big_three.pdf">Arb&apos;s report</a>.
</p>
<h3 id="the-numbers">The numbers</h3>


<p>
The tables below summarize the numbers I think give the best high-level picture. See the <a href="https://arbresearch.com/files/big_three.pdf">full report</a> and <a href="https://drive.google.com/drive/u/0/folders/1d6DEM79aSDUkSR6SEsmr1uR_yYAEXUCM">detailed files</a> for the raw predictions and a number of other cuts; there are a lot of ways you can slice the data, but I don&apos;t think it changes the picture from what I give below.
</p>
<p>
Below, I present each predictor&apos;s track record on:
</p>

<ul>

<li>&quot;All predictions&quot;: all resolved predictions 30 years out or more,<sup id="fnref5"><a href="https://www.cold-takes.com/p/4c722b8a-b321-4a7c-96ce-3878bd73b8fa/#fn5" rel="footnote">5</a></sup> including predictions where Arb had to fill in a time horizon.

</li><li>&quot;Tech predictions&quot;: like the above, but restricted to predictions specifically about technological capabilities (as opposed to technology/economy interactions or technology/culture interactions.

</li><li>&quot;Difficult predictions&quot; predictions with &quot;difficulty&quot; of 4/5 or 5/5.

</li><li>&quot;Difficult + tech + definite date&quot;: the small set of predictions that met the strictest criteria (tech only, &quot;hardness&quot; 4/5 or 5/5, definite date attached).
</li></ul>

<p>
<center><strong><a href="https://docs.google.com/spreadsheets/d/1MR3MIFxKyRUpU00OTg1__FMvPkTscA5JSUG_kGaGadc/edit?usp=sharing">Asimov</a></strong></center>
</p>

    <!--<img src=https://www.cold-takes.com/content/images/2022/06/table1cropped-1.png>-->
    <table style="border-collapse: collapse;">
  <tr>
   <td style="border: 1px solid;"><strong>Category</strong>
   </td>
   <td style="border: 1px solid;"><strong># correct</strong>
   </td>
   <td style="border: 1px solid;"><strong># incorrect</strong>
   </td>
   <td style="border: 1px solid;"><strong># ambiguous/near-miss</strong>
   </td>
   <td style="border: 1px solid;"><strong>Correct / (correct + incorrect)</strong>
   </td>
  </tr>
  <tr>
  <td style="border: 1px solid;">All resolved<br>predictions 
   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
23</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
29</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
14</p>

   </td>
  <td style="border: 1px solid;"><p style="text-align: right">
44.23%</p>

   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;">Tech predictions
   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
11</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
4</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
8</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
73.33%</p>

   </td>
  </tr>
  <tr>
  <td style="border: 1px solid;">Difficult predictions
   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
10</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
11</p>

   </td>
  <td style="border: 1px solid;"><p style="text-align: right">
7</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
47.62%</p>

   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;">Difficult + tech + definite date
   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
5</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
1</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
4</p>

   </td>
  <td style="border: 1px solid;"><p style="text-align: right">
83.33%</p>

   </td>
  </tr>
</table>

<p>
You can see the full set of predictions <a href="https://docs.google.com/spreadsheets/d/1MR3MIFxKyRUpU00OTg1__FMvPkTscA5JSUG_kGaGadc/edit?usp=sharing">here</a>, but to give a flavor, here are two &quot;correct&quot; and two &quot;incorrect&quot; predictions from the strictest category.<sup id="fnref6"><a href="https://www.cold-takes.com/p/4c722b8a-b321-4a7c-96ce-3878bd73b8fa/#fn6" rel="footnote">6</a></sup> All of these are predictions Asimov made in 1964, about the year 2014 (unless otherwise indicated).
</p>
<ul>

<li>Correct: &quot;only unmanned ships will have landed on Mars, though a manned expedition will be in the works.&quot; Bingo, and impressive IMO.

</li><li>Correct: &quot;the screen [of a phone] can be used not only to see the people you call but also for studying documents and photographs and reading passages from books.&quot; I feel like this would&apos;ve been an impressive prediction in 2004.

</li><li>Incorrect: &quot;there will be increasing emphasis on transportation that makes the least possible contact with the surface. There will be aircraft, of course, but even ground travel will increasingly take to the air a foot or two off the ground.&quot; So false that we now refer to things that don&apos;t hover as &quot;hoverboards.&quot;

</li><li>Incorrect: &quot;transparent cubes will be making their appearance in which three-dimensional viewing will be possible. In fact, one popular exhibit at the 2014 World&apos;s Fair will be such a 3-D TV, built life-size, in which ballet performances will be seen. The cube will slowly revolve for viewing from all angles.&quot; Doesn&apos;t seem ridiculous, but doesn&apos;t seem right. Of course, a side point here is that he refers to the 2014 World&apos;s Fair, which <a href="https://en.wikipedia.org/wiki/List_of_world%27s_fairs">didn&apos;t happen</a>.
</li>
</ul>
<p id="how-to-judge">
A general challenge with assessing prediction track records is that we don&apos;t know what to compare someone&apos;s track record to. Is getting about half your predictions right &quot;good,&quot; or is it no more impressive than writing down a bunch of things that might happen and flipping a coin on each? 
</p>
<p>
I think this comes down to <em>how difficult the predictions are</em>, which is hard to assess systematically. A nice thing about this study is that there are enough predictions to get a decent sample size, but the whole thing is contained enough that you can get a good qualitative feel for the predictions themselves. (This is why I give examples; you can also view all predictions for a given person by clicking on their name above the table.) In this case, I think Asimov tends to make nonobvious, detailed predictions, such that I consider it impressive to have gotten ~half of them to be right.
</p>
<p>
<center><strong><a href="https://docs.google.com/spreadsheets/d/1WB6mz3vjkpyffTdYCQdyeJbcI8mA631Jj3JBKQpUiNg/edit?usp=sharing">Clarke</a></strong>
    </center></p>

<!--<img src=https://www.cold-takes.com/content/images/2022/06/table2cropped.png>-->
<table style="border-collapse: collapse;">
  <tr>
   <td style="border: 1px solid;"><strong>Category</strong>
   </td>
   <td style="border: 1px solid;"><strong># correct</strong>
   </td>
  <td style="border: 1px solid;"><strong># incorrect</strong>
   </td>
   <td style="border: 1px solid;"><strong># ambiguous/near-miss</strong>
   </td>
  <td style="border: 1px solid;"><strong>Correct / (correct + incorrect)</strong>
   </td>
  </tr>
  <tr>
  <td style="border: 1px solid;">All predictions 
   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
129</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
148</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
48</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
46.57%</p>

   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;">Tech predictions
   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
85</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
82</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
29</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
50.90%</p>

   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;">Difficult predictions
   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
14</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
10</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
4</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
58.33%</p>

   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;">Difficult + tech + definite date
   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
6</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
5</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
2</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
54.55%</p>

   </td>
  </tr>
</table>

<p>
Examples (as above):<sup id="fnref7"><a href="https://www.cold-takes.com/p/4c722b8a-b321-4a7c-96ce-3878bd73b8fa/#fn7" rel="footnote">7</a></sup>
</p>
<ul>

<li>Correct 1964 prediction about 2000: &quot;[Communications satellites] will make possible a world in which we can make instant contact with each other wherever we may be. Where we can contact our friends anywhere on Earth, even if we don&#x2019;t know their actual physical location. It will be possible in that age, perhaps only fifty years from now, for a [person] to conduct [their] business from Tahiti or Bali just as well as [they] could from London.&quot; (I assume that &quot;conduct [their] business&quot; refers to a business call rather than some sort of holistic claim that no productivity would be lost from remote work.)

</li><li>Correct 1950 prediction about 2000: &quot;Indeed, it may be assumed as fairly certain that the first reconnaissances of the planets will be by orbiting rockets which do not attempt a landing-perhaps expendable, unmanned machines with elaborate telemetering and television equipment.&quot; This doesn&apos;t seem like a super-bold prediction; a lot of his correct predictions have a general flavor of saying progress won&apos;t be <em>too</em> exciting, and I find these less impressive than most of Asimov&apos;s correct predictions. 

</li><li>Incorrect 1960 prediction about 2010: &quot;One can imagine, perhaps before the end of this century, huge general-purpose factories using cheap power from thermonuclear reactors to extract pure water, salt, magnesium, bromine, strontium, rubidium, copper and many other metals from the sea. A notable exception from the list would be iron, which is far rarer in the oceans than under the continents.&quot;

</li><li>Incorrect 1949 prediction about 1983: &quot;Before this story is twice its present age, we will have robot explorers dotted all over Mars.&quot;
</li></ul>
<p>
I generally found this data set less satisfying/educational than Asimov&apos;s: a lot of the predictions were pretty deep in the weeds of how rocketry might work or something, and a lot of them seemed pretty hard to interpret/score. I thought the bad predictions were pretty bad, and the good predictions were sometimes good but generally less impressive than Asimov&apos;s.
</p>
<p><center>
<strong><a href="https://docs.google.com/spreadsheets/d/1in8dvIr_siwwt3pA0YUGZ-2nqgt2gNo8FZ7ciiIwlWI/edit?usp=sharing">Heinlein</a></strong></center>
</p>
<!--<img src=https://www.cold-takes.com/content/images/2022/06/table3cropped.png>-->
<table style="border-collapse: collapse;">
  <tr>
   <td style="border: 1px solid;"><strong>Category</strong>
   </td>
   <td style="border: 1px solid;"><strong># correct</strong>
   </td>
   <td style="border: 1px solid;"><strong># incorrect</strong>
   </td>
   <td style="border: 1px solid;"><strong># ambiguous/near-miss</strong>
   </td>
   <td style="border: 1px solid;"><strong>Correct / (correct + incorrect)</strong>
   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;">All predictions 
   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
19</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
41</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
7</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
31.67%</p>

   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;">Tech predictions
   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
14</p>

   </td>
  <td style="border: 1px solid;"><p style="text-align: right">
20</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
6</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
41.18%</p>

   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;">Difficult predictions
   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
1</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
16</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
1</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
5.88%</p>

   </td>
  </tr>
  <tr>
   <td style="border: 1px solid;">Difficult + tech + definite date
   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
0</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
1</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
1</p>

   </td>
   <td style="border: 1px solid;"><p style="text-align: right">
0.00% </p>

   </td>
  </tr>
</table>

<p>
This seems really bad, especially adjusted for difficulty: many of the &quot;correct&quot; ones seem either hard-to-interpret or just very obvious (e.g., no time travel). I was impressed by his prediction that &quot;we probably will still be after a cure for the common cold&quot; until I saw a prediction in a separate source saying &quot;Cancer, the common cold, and tooth decay will all be conquered.&quot; Overall it seems like he did a lot of predicting outlandish stuff about space travel, and then anti-predicting things that are probably just impossible (e.g., no time travel). 
</p>
<p>
He did have some decent ones, though, such as: &quot;By 2000 A.D. we will know a great deal about how the brain functions ... whereas in 1900 what little we knew was wrong. I do not predict that the basic mystery of psychology--how mass arranged in certain complex patterns becomes aware of itself--will be solved by 2000 A.D. I hope so but do not expect it.&quot; He also predicted no human extinction and no end to war - I&apos;d guess a lot of people disagreed with these at the time.
</p>
<h3 id="overall-picture">Overall picture</h3>


<p>
Looks like, of the &quot;big three,&quot; we have:
</p>
<ul>

<li>One (Asimov) who looks quite impressive - plenty of misses, but a 50% hit rate on such nonobvious predictions seems pretty great.

</li><li>One (Heinlein) who looks pretty unserious and inaccurate.

</li><li>One (Clarke) who&apos;s a bit hard to judge but seems pretty solid overall (around half of his predictions look to be right, and they tend to be pretty nonobvious).
</li></ul>
<h2 id="todays-futurism-vs-these-predictions">Today&apos;s futurism vs. these predictions</h2>


<p>
The above collect casual predictions - no probabilities given, little-to-no reasoning given, no apparent attempt to collect evidence and weigh arguments - by professional fiction writers. 
</p>
<p>
Contrast this situation with my summary of the <a href="https://www.cold-takes.com/where-ai-forecasting-stands-today/">different lines of reasoning forecasting transformative AI</a>. The latter includes:
</p>
<p>
<ul>

<li>Systematic surveys aggregating opinions from hundreds of AI researchers.

</li><li>Reports that <a href="https://www.openphilanthropy.org">Open Philanthropy</a> employees spent thousands of hours on, systematically presenting evidence and considering arguments and counterarguments.

</li><li>A serious attempt to take advantage of the nascent <a href="https://www.openphilanthropy.org/blog/efforts-improve-accuracy-our-judgments-and-forecasts">literature on how to make good predictions</a>; e.g., the authors (and I) have generally done <a href="https://www.openphilanthropy.org/blog/efforts-improve-accuracy-our-judgments-and-forecasts#Calibration_training">calibration training</a>,<sup id="fnref8"><a href="https://www.cold-takes.com/p/4c722b8a-b321-4a7c-96ce-3878bd73b8fa/#fn8" rel="footnote">8</a></sup> and have tried to use the language of probability to be specific about our uncertainty.
</li></ul>
</p>
<p>
There&apos;s plenty of room for debate on how much these measures should be expected to improve our foresight, compared to what the &quot;Big Three&quot; were doing. My guess is that we should take <a href="https://www.cold-takes.com/where-ai-forecasting-stands-today/">forecasts about transformative AI</a> a lot more seriously, partly because I think there&apos;s a big difference between putting in &quot;extremely little effort&quot; (basically guessing off the cuff without serious time examining arguments and counter-arguments, which is my impression of what the Big Three were mostly doing) and &quot;putting in moderate effort&quot; (considering expert opinion, surveying arguments and counter-arguments, explicitly thinking about one&apos;s degree of uncertainty).
</p>
<p>
But the &quot;extremely little effort&quot; version doesn&apos;t really look that bad. 
</p>
<p>
If you look at forecasts about transformative AI and think &quot;Maybe these are Asimov-ish predictions that have about a 50% hit rate on hard questions; maybe these are Heinlein-ish predictions that are basically crap,&quot; that still seems good enough to take the &quot;<a href="https://www.cold-takes.com/most-important-century/">most important century</a>&quot; hypothesis seriously.
</p>
<h2 id="appendix-other-studies-of-the-track-record">Appendix: other studies of the track record of futurism</h2>


<p>
A <a href="https://www.lesswrong.com/posts/kbA6T3xpxtko36GgP/assessing-kurzweil-the-results">2013 project assessed Ray Kurzweil&apos;s 1999 predictions about 2009</a>, and a 2020 followup assessed his <a href="https://www.lesswrong.com/posts/NcGBmDEe5qXB7dFBF/assessing-kurzweil-predictions-about-2019-the-results">1999 predictions about 2019</a>. Kurzweil is known for being <em>interesting at the time</em> rather than being <em>right with hindsight</em>, and a large number of predictions were found and scored, so I consider this study to have similar advantages to the above study. 
</p>
<ul>

<li>The first set of predictions (about 2009, 10-year horizon) had about as many &quot;true or weakly true&quot; predictions as &quot;false or weakly false&quot; predictions. 

</li><li>The second (about 2019, 20-year horizon) was much worse, with 52% of predictions flatly &quot;false,&quot; and &quot;false or weakly false&quot; predictions outnumbering &quot;true or weakly true&quot; predictions by almost 3-to-1.
</li></ul>


<p>
Kurzweil is notorious for his very bold and contrarian predictions, and I&apos;m overall inclined to call his track record something between &quot;mediocre&quot; and &quot;fine&quot; - too aggressive overall, but with some notable hits. (I think if the <a href="https://www.cold-takes.com/most-important-century/">most important century</a> hypothesis ends up true, he&apos;ll broadly look pretty prescient, just on the early side; if it doesn&apos;t, he&apos;ll broadly look quite off base. But that&apos;s TBD.)
</p>
<p>
A <a href="https://www.openphilanthropy.org/evaluation-some-technology-forecasts-year-2000#sourceAlbright">2002 paper</a>, summarized by Luke Muehlhauser <a href="https://www.openphilanthropy.org/evaluation-some-technology-forecasts-year-2000">here</a>, assessed the track record of <em>The Year 2000</em> by Herman Kahn and Anthony Wiener, &quot;one of the most famous and respected products of professional futurism.&quot; 
</p>
<ul>

<li>About 45% of the forecasts were judged as accurate.

</li><li>Luke concludes that Kahn and Wiener were grossly overconfident, because he interprets them as making predictions with 90-95% confidence. 

</li><li>My takeaway is a bit different. I see a recurring theme that people often get 40-50% hit rates on interesting predictions about the future, but sometimes present these predictions with great confidence (which makes them look foolish).

</li><li>I think we can separate &quot;Past forecasters were overconfident&quot; (which I suspect is partly due to <a href="https://www.cold-takes.com/the-bayesian-mindset/">clear expression and quantification of uncertainty</a> being uncommon and/or discouraged in relevant contexts) from &quot;Past forecasters weren&apos;t able to make interesting predictions that were reasonably likely to be right.&quot; The former seems true to me, but the latter doesn&apos;t.
</li></ul>
<p>
Luke&apos;s <a href="https://www.openphilanthropy.org/blog/how-feasible-long-range-forecasting">2019 survey on the track record of futurism</a> identifies two other relevant papers (<a href="https://www.sciencedirect.com/science/article/abs/pii/S0040162518304438">here</a> and <a href="https://www.sciencedirect.com/science/article/abs/pii/S0040162512002818">here</a>); I haven&apos;t read these beyond the abstracts, but their overall accuracy rates were 76% and 37%, respectively. It&apos;s difficult to interpret those numbers without having a feel for how challenging the predictions were.
</p>
<p>
A <a href="https://forum.effectivealtruism.org/posts/hqkyaHLQhzuREcXSX/data-on-forecasting-accuracy-across-different-time-horizons">2021 EA Forum post</a> looks at the aggregate track record of forecasters on PredictionBook and Metaculus, including specific analysis of forecasts 5+ years out, though I don&apos;t find it easy to draw conclusions about whether the performance was &quot;good&quot; or &quot;bad&quot; (or how similar the questions were to the ones I care about).
</p>

<!-- Footnotes themselves at the bottom. -->


<!--kg-card-end: html--><!--kg-card-begin: html-->

<p><div style="display:flex; justify-content:center; margin: 0 auto;">

        <span style="margin: 10px;"><a href="https://api.addthis.com/oexchange/0.8/forward/twitter/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fthe-track-record-of-futurists-seems-fine&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20The%20Track%20Record%20of%20Futurists%20Seems%20...%20Fine&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-twitter-square.png" border="0" alt="The Track Record of Futurists Seems ... Fine"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/facebook/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fthe-track-record-of-futurists-seems-fine&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20The%20Track%20Record%20of%20Futurists%20Seems%20...%20Fine&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-facebook-square.png" border="0" alt="The Track Record of Futurists Seems ... Fine"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/reddit/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fthe-track-record-of-futurists-seems-fine&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20The%20Track%20Record%20of%20Futurists%20Seems%20...%20Fine&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-reddit-square.png" border="0" alt="The Track Record of Futurists Seems ... Fine"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/menu/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fthe-track-record-of-futurists-seems-fine&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20The%20Track%20Record%20of%20Futurists%20Seems%20...%20Fine&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-addthis-square.png" border="0" alt="The Track Record of Futurists Seems ... Fine"></a></span>
        </div>
<center><p id="discuss"><!--<a href="https://www.cold-takes.com/the-track-record-of-futurists-seems-fine#subscribe" target="_blank"><button id="footer-subscribe" class="button">Subscribe</button></a>&nbsp;<a href="https://www.guidedtrack.com/programs/4kal2ue/run?posttitle=The%20Track%20Record%20of%20Futurists%20Seems%20...%20Fine" target="_blank"><button class="button" id="Survey">Feedback</button></a>-->&#xA0;<a href="https://www.lesswrong.com/posts/B2nBHP2KBGv2zJ2ew/the-track-record-of-futurists-seems-fine#comments" target="_blank"><button class="button">Comment/discuss</button></a></p><p><!--<em>
Use "Feedback" if you have comments/suggestions you want me to see, or if you're up for giving some quick feedback about this post (which I greatly appreciate!) Use "Forum" if you want to discuss this post publicly on the Effective Altruism Forum.
</em>--></p></center>
<!--kg-card-end: html--><!--kg-card-begin: html--><hr></p><h2>Footnotes</h2>
<div class="footnotes">
<p>
    <ol><li id="fn1">
<p>
    Disclosure: I&apos;m co-CEO of Open Philanthropy.</p>&#xA0;<a href="#fnref1" rev="footnote">&#x21A9;</a></li><li id="fn2">
     I also briefly Googled for their predictions to get a preliminary sense of whether they were the kinds of predictions that seemed relevant. I found a couple of articles listing a few examples of good and bad predictions, but nothing systematic. I claim I haven&apos;t done a similar exercise with anyone else and thrown it out.&#xA0;<a href="#fnref2" rev="footnote">&#x21A9;</a></li><li id="fn3">
     That is, if we didn&apos;t have a lot of memes in the background about how hard it is to predict the future.&#xA0;<a href="#fnref3" rev="footnote">&#x21A9;</a></li><li id="fn4">

<p>
     1 - was already generally known
</p>
<p>
    2 - was expert consensus
</p>
<p>
    3 - speculative but on trend
</p>
<p>
    4 - above trend, or oddly detailed
</p>
<p>
    5 - prescient, no trend to go off&#xA0;<a href="#fnref4" rev="footnote">&#x21A9;</a></p>
</li><li id="fn5">
<p>
    Very few predictions in the data set are for less than 30 years, and I just ignored them.</p>&#xA0;<a href="#fnref5" rev="footnote">&#x21A9;</a></li><li id="fn6">
     Asimov actually only had one incorrect prediction in this category, so for the 2nd incorrect prediction I used one with difficulty &quot;3&quot; instead of &quot;4.&quot;&#xA0;<a href="#fnref6" rev="footnote">&#x21A9;</a></li><li id="fn7">
     The first prediction in this list qualified for the strictest criteria when I first drafted this post, but it&apos;s now been rescored to difficulty=3/5, which I disagree with (I think it is an impressive prediction, more so than any of the remaining ones that qualify as difficulty=4/5).&#xA0;<a href="#fnref7" rev="footnote">&#x21A9;</a></li><li id="fn8">
     Also see <span style="text-decoration:underline;">this report</span> on calibration for Open Philanthropy grant investigators (though this is a different set of people from the people who researched transformative AI timelines).&#xA0;<a href="#fnref8" rev="footnote">&#x21A9;</a>

</li></ol></p></div><p></p><!--kg-card-end: html-->]]></content:encoded></item><item><title><![CDATA[Nonprofit Boards are Weird]]></title><description><![CDATA[With great power comes, er, unclear responsibility and zero accountability. 
]]></description><link>https://www.cold-takes.com/nonprofit-boards-are-weird-2/</link><guid isPermaLink="false">62b0ea50f51974003dbba22a</guid><dc:creator><![CDATA[Holden Karnofsky]]></dc:creator><pubDate>Thu, 23 Jun 2022 14:39:52 GMT</pubDate><media:content url="https://www.cold-takes.com/content/images/2022/06/disappointed-uncle-ben.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html--><img src="https://www.cold-takes.com/content/images/2022/06/disappointed-uncle-ben.png" alt="Nonprofit Boards are Weird"><p><figure><div id="buzzsprout-player-10837295"></div><script src="https://www.buzzsprout.com/1851795/10837295-nonprofit-boards-are-weird.js?container_id=buzzsprout-player-10837295&amp;player=small" type="text/javascript" charset="utf-8"></script><figcaption><em>Click lower right to download or find on Apple Podcasts, Spotify, Stitcher, etc.</em></figcaption></figure></p>

<p>

    <blockquote>Note: anything in this post that you think is me subtweeting your organization is actually about, like, at least 3 organizations. (I&apos;m currently on 4 boards in addition to <a href="https://openphilanthropy.org/">Open Philanthropy</a>&apos;s; I&apos;ve served on a bunch of other boards in the past; and more than half of my takes on boards are not based on any of this, but rather on my interactions with boards I&apos;m not on via the many grants made by Open Philanthropy.)</blockquote>
</p>
<p>
Writing about <a href="https://www.cold-takes.com/ideal-governance-for-companies-countries-and-more/">ideal governance</a> reminded me of how weird my experiences with nonprofit boards (as in &quot;board of directors&quot; - the set of people who formally control a nonprofit) have been.
</p>
<p>
I thought that was a pretty good intro. The rest of this piece will:
</p>
<ul>

<li>Try to articulate what&apos;s so weird about nonprofit boards, fundamentally. I think a lot of it is the combination of great power, unclear responsibility, and ~zero accountability; additionally, I haven&apos;t been able to find much in the way of clear, widely accepted statements of what makes a good board member.

</li><li>Give my own thoughts on what makes a good board member: which core duties they should be trying to do really well, the importance of &quot;staying out of the way&quot; on other things, and some potentially helpful practices.
</li>
</ul>
<p>
I am experienced with nonprofit boards but not with for-profit boards. I&apos;m guessing that roughly half the things I say below will apply to for-profit boards, and that for-profit boards are roughly half as weird overall (so still quite weird), but I haven&apos;t put much effort into disentangling these things; I&apos;m writing about what I&apos;ve seen.
</p>
<p>
I can&apos;t really give real-life examples here (for reasons I think will be pretty clear) so this is just going to be me opining in the abstract.
</p>
<h2 id="why-nonprofit-boards-are-weird">Why nonprofit boards are weird</h2>

<p>

<img src="https://www.cold-takes.com/content/images/2022/06/image1.png" width alt="Nonprofit Boards are Weird" title="image_tooltip">

</p>
<p>
Here&apos;s how a nonprofit board works:
</p>
<ul>

<li>There are usually 3-10 people on the board (though sometimes much more). Most of them don&apos;t work for the nonprofit (they have other jobs).

</li><li>They meet every few months. Nonprofit employees (especially the CEO<sup id="fnref1"><a href="https://www.cold-takes.com/p/47f975d8-08f1-4e66-a7c6-6ba4e182cc1a/#fn1" rel="footnote">1</a></sup>) do a lot of the agenda-setting for the meeting. Employees present general updates and ask for the board&apos;s approval on various things the board needs to approve, such as the budget. 

</li><li>A majority vote of the directors can do anything: fire the CEO, dissolve the nonprofit, add and remove directors, etc. You can think of the board as the &quot;owner&quot; of the nonprofit - formally, it has final say in every decision.

</li><li>In practice, though, the board rarely votes except on matters that feel fairly &quot;rubber-stamp,&quot; and the board&apos;s presence doesn&apos;t tend to be felt day-to-day at a nonprofit. The CEO leads the decision-making. Occasionally, someone has a thought like &quot;Wait, who does the <em>CEO </em>report to? Oh, the board of directors ... who&apos;s on the board again? I don&apos;t know if I&apos;ve ever really spoken with any of those people.&quot;
</li></ul>
<p>
In my experience, it&apos;s common for the whole thing to feel extremely weird. (This doesn&apos;t necessarily mean there&apos;s a better way to do it - footnote has more on what I mean by &quot;weird.&quot;<sup id="fnref2"><a href="https://www.cold-takes.com/p/47f975d8-08f1-4e66-a7c6-6ba4e182cc1a/#fn2" rel="footnote">2</a></sup>) 
</p>
<ul>

<li>Board members often know almost nothing about the organization they have complete power over.

</li><li>Board meetings rarely feel like a good use of time.

</li><li>When board members are energetically asking questions and making demands, it usually feels like they&apos;re causing chaos and wasting everyone&apos;s time and energy.

</li><li>On the rare occasions when it seems like the board <em>should</em> do something (like replacing the CEO, or providing an independent check on some important decision), the board often seems checked out and it&apos;s unclear how they would even come to be aware of the situation.

</li><li>Everyone constantly seems confused about what the board is and how it can and can&apos;t be useful. Employees, and others who interact with the nonprofit, have lots of exchanges like &quot;I&apos;m worried about X ... maybe we should ask the board what they think? ... Can we even ask them that? What is their job actually?&quot;
</li>
</ul>
<p>
(Reminder that this is not subtweeting a particular organization! More than one person - from more than one organization - read a draft and thought I was subtweeting them, because what&apos;s above describes a large number of boards.)
</p>
<p>
OK, so what&apos;s driving the weirdness?
</p>
<p>
I think there are a couple of things: 
</p>
<ul>

<li>Nonprofit boards have <em>great power</em>, but <em>low engagement </em>(they don&apos;t have time to understand the organization as well as employees do); <em>unclear responsibility </em>(it&apos;s unclear which board member is responsible for what, and what the board as a whole is responsible for); and <em>~zero accountability </em>(no one can fire board members except for the other board members!) 

</li><li>Nonprofit boards have unclear expectations and principles. I can&apos;t seem to find anyone with a clear, comprehensive, thought-out theory of what a board member&apos;s ... job is. 
</li>
</ul>
<p>
I&apos;ll take these one at a time.
</p>
<h3 id="great-power-low-engagement-unclear-responsibility-no-accountability">Great power, low engagement, unclear responsibility, no accountability</h3>


<p>
In my experience/impression, the best way to run any organization (or project, or anything) is on an &quot;ownership&quot; model: for any given thing X that you want done well, you have one person who &quot;owns&quot; X. The &quot;owner&quot; of X has:
</p>
<ul>

<li>The <em>power</em> to make decisions to get X done well.

</li><li>High <em>engagement</em>: they&apos;re going to have plenty of time and attention to devote to X.

</li><li>The <em>responsibility</em> for X: everyone agrees that if X goes well, they should get the credit, and if X goes poorly, they should get the blame.

</li><li>And <em>accountability</em>: if X goes poorly, there will be some sort of consequences for the &quot;owner.&quot;
</li>
</ul>
<p>
When these things come apart, I think you get problems. In a nutshell - when no one is <em>responsible</em>, nothing gets done; when someone is <em>responsible </em>but doesn&apos;t have <em>power</em>, that doesn&apos;t help much; when the person who is <em>responsible </em>+ <em>empowered </em>isn&apos;t <em>engaged</em> (isn&apos;t paying much attention), or isn&apos;t held <em>accountable</em>, there&apos;s not much in the way of their doing a dreadful job.
</p>
<p>
A traditional company structure mostly does well at this. The CEO has power (they make decisions for the company), engagement (they are devoted to the company and spend tons of time on it), and responsibility+accountability (if the company does badly, everyone looks at the CEO). They manage a team of people who have power+engagement+responsibility+accountability for some aspect of the company; each of those people manage people with power+engagement+responsibility+accountability for some smaller piece; etc.
</p>
<p>
What about the board?
</p>
<ul>

<li>They have <em>power </em>to fire the CEO (or do anything else).

</li><li>They tend to have low <em>engagement</em>. They have other jobs, and only spend a few hours a year on their board roles. They tend to know little about what&apos;s going on at the organization.

</li><li>They have unclear <em>responsibility</em>.  
<ul>
 
<li>The board as a whole is responsible for the organization, but what is each <em>individual</em> board member responsible for? In my experience, this is often very unclear, and there are a lot of crucial moments where &quot;bystander effects&quot; seem strong. 
 
</li><li>So far, these points apply to both nonprofit and for-profit boards. But at least at a for-profit company, board members know what they&apos;re collectively responsible <em>for</em>: maximizing financial value of the company. <strong>At a nonprofit, it&apos;s often unclear what success even <em>means</em>, beyond the nonprofit&apos;s often-vague mission statement, so board members are generally unclear (and don&apos;t necessarily agree) on what they&apos;re supposed to be ensuring.</strong><sup id="fnref3"><a href="https://www.cold-takes.com/p/47f975d8-08f1-4e66-a7c6-6ba4e182cc1a/#fn3" rel="footnote">3</a></sup>
</li> 
</ul>

</li><li>At a for-profit company, the board seems to have reasonable <em>accountability: </em>the shareholders, who ultimately own the company and gain or lose money depending on how it does, can replace the board if they aren&apos;t happy. <strong>At a nonprofit, the board members have <em>zero accountability: </em>the only way to fire a board member is by majority vote of the board!</strong>
</li></ul>
<p>
So we have people who are spending very little time on the company, know very little about it, don&apos;t have much clarity on what they&apos;re responsible for either individually or collectively, and aren&apos;t accountable to anyone ... and those are the people with all of the power. Sound dysfunctional?<sup id="fnref4"><a href="https://www.cold-takes.com/p/47f975d8-08f1-4e66-a7c6-6ba4e182cc1a/#fn4" rel="footnote">4</a></sup>
</p>
<p>
In practice, I think it&apos;s often worse than it sounds, because board members aren&apos;t even chosen carefully - a lot of the time, a nonprofit just goes with an assortment of random famous people, big donors, etc. 
</p>
<h3 id="what-makes-a-good-board-member">What makes a good board member? Few people even have a hypothesis</h3>


<p>
I&apos;ve searched a fair amount for books, papers, etc. that give convincing and/or widely-accepted answers to questions like:
</p>
<ul>

<li>When the CEO asks the board to approve something, how should they engage? When should they take a <em>deferring </em>attitude (&quot;Sure, as long as I don&apos;t see any particular reason to say no&quot;), a <em>sanity check</em> attitude (&quot;I&apos;ll ask a few questions to make sure this is making sense, then approve if nothing jumps out at me&quot;), a <em>full ownership </em>attitude (&quot;I need to personally be convinced this is the best thing for the organization&quot;), etc.?

</li><li>How much should each board member invest in educating themselves about the organization? What&apos;s the best way to do that?

</li><li>How does the board know whether the CEO is doing a good job? What kind of situation should trigger seriously considering looking for a new one?

</li><li>How does a board member know whether the <em>board</em> is doing a good job? How should they decide when another board member should be replaced?
</li>
</ul>
<p>
In my experience, most board members just aren&apos;t walking around with any particular thought-through take on questions like this. And as far as I can tell, there&apos;s a shortage of good<sup id="fnref5"><a href="https://www.cold-takes.com/p/47f975d8-08f1-4e66-a7c6-6ba4e182cc1a/#fn5" rel="footnote">5</a></sup> guidance on questions like this for both for-profit and nonprofit boards. For example:
</p>
<ul>

<li>I&apos;ve found no standard reference on topics like this, and very few resources that even seem aimed at directly and clearly answering such questions.  
<ul>
 
<li>The best book on this topic I&apos;ve seen is <a href="https://smile.amazon.com/Boards-That-Lead-Charge-Partner/dp/1422144054/">Boards that Lead</a> by Ram Charan, focused on for-profit boards (but pretty good IMO).
 
</li><li>But this isn&apos;t, like, a book everyone knows to read; I found it by asking lots of people for suggestions, coming up empty, Googling wildly around and skimming like 10 books that said they were about boards, and deciding that this one seemed pretty good.
</li> 
</ul>

</li><li>One of the things I do as a board member is interview other prospective board members about their answers to questions like this. In my experience, they answer most of the above questions with something like &quot;Huh, I don&apos;t really know. What do you think?&quot; 

</li><li>Most boards I&apos;ve seen seem to - by default - either: 
<ul>
 
<li>Get way too involved in lots of decisions to the point where it feels like they&apos;re micromanaging the CEO and/or just obsessively engaging on whatever topics the CEO happens to bring to their attention; or 
 
</li><li>Take a &quot;We&apos;re just here to help&quot; attitude and rubber-stamp whatever the CEO suggests, including things I&apos;ll argue below should be <a href="https://www.cold-takes.com/p/47f975d8-08f1-4e66-a7c6-6ba4e182cc1a/#the-boards-main-duties">core duties</a> for the board (e.g., adding and removing board members).
</li> 
</ul>

</li><li>I&apos;m not sure I&apos;ve ever seen a board with a formal, recurring process for reviewing each board member&apos;s performance. :/
</li>
</ul>
<p>
To the extent I have seen a relatively common, coherent vision of &quot;what board members are supposed to be doing,&quot; it&apos;s pretty well summarized in <a href="https://growth.eladgil.com/book/cofounders/board-and-ceo-transitions-and-other-governance-issues-an-interview-with-reid-hoffman/">Reid Hoffman&apos;s interview</a> in <a href="https://growth.eladgil.com/">The High-Growth Handbook</a>:
</p>
<p></p>

<div id="#RH"></div><blockquote><p>I use ... a red light, yellow light, green light framework between the board and the CEO. Roughly, green light is, &#x201C;You&#x2019;re the CEO. Make the call. We&#x2019;re advisory.&#x201D; Now, we may say that on very big things&#x2014;selling the company&#x2014;we should talk about it before you do it. And that may shift us from green light, if we don&#x2019;t like the conversation. But a classic young, idiot board member will say, &#x201C;Well, I&#x2019;m giving you my expertise and advice. You should do X, Y, Z.&#x201D; But the right framework for board members is: You&#x2019;re the CEO. You make the call. We&#x2019;re advisory.</p>
<p>

    Red lights also very easy. Once you get to red light, the CEO&#x2014;who, by the way, may still be in place&#x2014;won&#x2019;t be the CEO in the future. The board knows they need a new CEO. It may be with the CEO&#x2019;s knowledge, or without it. Obviously, it&#x2019;s better if it&#x2019;s collaborative ...
</p>
<p>

    Yellow means, &#x201C;I have a question about the CEO. Should we be at green light or not?&#x201D; And what happens, again under inexperienced or bad board members, is they check a CEO into yellow indefinitely. They go, &#x201C;Well, I&#x2019;m not sure&#x2026;&#x201D; The important thing with yellow light is that you 1) coherently agree on it as a board and 2) coherently agree on what the exit conditions are. What is the limited amount of time that we&#x2019;re going to be in yellow while we consider whether we move back to green or move to red? And how do we do that, so that we do not operate for a long time on yellow? Because with yellow light, you&#x2019;re essentially hamstringing the CEO and hamstringing the company. It&#x2019;s your obligation as a board to figure that out.
        </p></blockquote>

<p>
I like this quite a bit (hence the long blockquote), but I don&apos;t think it covers everything. The board is <em>mostly</em> there to oversee the CEO, and they should <em>mostly</em> be advisory when they&apos;re happy with the CEO. But I think there are things they ought to be actively thinking about and engaging in even during &quot;green light.&quot;
</p>
<h2 id="so-what-does-make-a-good-board-member">So what DOES make a good board member?</h2>


<p>
Here is my current take, based on a combination of (a) my thoughts after serving on and interacting with a large number of nonprofit boards; (b) my attempts to adapt conventional wisdom about for-profit boards (especially from the <a href="https://smile.amazon.com/Boards-That-Lead-Charge-Partner/dp/1422144054/">book I mentioned above</a>); (c) divine revelation. 
</p>
<p>
I&apos;ll go through:
</p>
<ul>

<li>What I see as the <strong>main duties</strong> of the board specifically - things the board has to do well, and can&apos;t leave to the CEO and other staff.

</li><li>My basic take that the ideal board should do these main duties well, while staying out of the way otherwise.

</li><li>The <strong>main qualities </strong>I think the ideal board member should have - and some common ways of choosing board members that seem bad to me.

</li><li>A few more random thoughts on board practices that seem especially important and/or promising.
</li>
</ul>
<p>
(I don&apos;t claim any of these points are original, and almost everything can be found in some writing on boards somewhere, but I don&apos;t know of a reasonably comprehensive, concise place to get something similar to the below.)
</p>
<h3 id="the-boards-main-duties">The board&apos;s main duties</h3>


<p>
I agree with the basic spirit of Hoffman&apos;s philosophy <a href="https://www.cold-takes.com/p/47f975d8-08f1-4e66-a7c6-6ba4e182cc1a/#RH">above</a>: the board should not be trying to &quot;run the company&quot; (they&apos;re too low-engagement and don&apos;t know enough about it), and should instead be focused on a small number of big-picture questions like &quot;How is the CEO doing?&quot;
</p>
<p>
And I do think <strong>the board&apos;s #1 and most fundamental job is evaluating the CEO&apos;s performance. </strong>The board is the <em>only</em> reliable source of accountability for the CEO - even more so at a nonprofit than a for-profit, since bad CEO performance won&apos;t necessarily show up via financial problems or unhappy shareholders.<sup id="fnref6"><a href="https://www.cold-takes.com/p/47f975d8-08f1-4e66-a7c6-6ba4e182cc1a/#fn6" rel="footnote">6</a></sup> (As noted <a href="https://www.cold-takes.com/p/47f975d8-08f1-4e66-a7c6-6ba4e182cc1a/#regular-ceo-reviews">below</a>, I think many nonprofit boards have no formal process for reviewing the CEO&apos;s performance, and the ones that do often have a lightweight/underwhelming one.)
</p>
<p>
But I think the board also needs to take a leading role - and not trust the judgment of the CEO and other staff - when it comes to:
</p>
<ul>

<li><strong>Overseeing decisions that could importantly reduce the board&apos;s powers. </strong>The CEO might want to enter into an agreement with a third party that is binding on the nonprofit and therefore on the board (for example, &quot;The nonprofit will now need permission from the third party in order to do X&quot;); or transfer major activities and assets to affiliated organizations that the board doesn&apos;t control (for example, when <a href="https://www.openphilanthropy.org/blog/open-philanthropy-project-now-independent-organization">Open Philanthropy split off from GiveWell</a>); or revise the organization&apos;s mission statement, bylaws,<sup id="fnref7"><a href="https://www.cold-takes.com/p/47f975d8-08f1-4e66-a7c6-6ba4e182cc1a/#fn7" rel="footnote">7</a></sup> etc.; or other things that significantly reduce the scope of what the board has control over. The board needs to represent its own interests in these cases, rather than deferring to the CEO (whose interests may be different).

</li><li><strong>Overseeing big-picture irreversible risks and decisions that could importantly affect future CEOs. </strong>For example, I think the board needs to be anticipating any major source of risk that a nonprofit collapses (financially or otherwise) - if this happens, the board can&apos;t simply replace the CEO and move on, because the collapse affects what a future CEO is able to do. (What risks and decisions are big enough? Some thoughts in a footnote.<sup id="fnref8"><a href="https://www.cold-takes.com/p/47f975d8-08f1-4e66-a7c6-6ba4e182cc1a/#fn8" rel="footnote">8</a></sup>)

</li><li><strong>All matters relating to the composition and performance of the board itself. </strong>Adding new board members, removing board members, and reviewing the board&apos;s own performance are things that the board needs to be responsible for, not the CEO. If the CEO is controlling the composition of the board, this is at odds with the board&apos;s role in overseeing the CEO.
</li>
</ul>
<h3 id="engaging-on-main-duties">Engaging on main duties, staying out of the way otherwise</h3>


<p>
I think the ideal board member&apos;s behavior is roughly along the lines of the following:
</p>
<p>
<strong>Actively, intensively engage in the <a href="https://www.cold-takes.com/p/47f975d8-08f1-4e66-a7c6-6ba4e182cc1a/#the-boards-main-duties">main duties</a> from the previous section. </strong>Board members should be knowledgeable about, and not defer to the CEO on, (a) how the CEO is performing; (b) how the board is performing, and who should be added and removed; (c) spotting (and scanning the horizon for) events that could reduce the board&apos;s powers, or lead to big enough problems and restrictions so as to irreversibly affect what future CEOs are able to do. 
</p>
<p>
Ideally they should be focusing their questions in board meetings on these things, as well as having some way of gathering information about them that doesn&apos;t just rely on hearing directly from the CEO. (Some ideas for this are below.) When reviewing financial statements and budgets, they should be focused mostly on the risk of major irreversible problems (such as going bankrupt or failing to be compliant); when hearing about activities, they should be focused mostly on what they reflect about the CEO&apos;s performance; etc.
</p>
<p>
<strong>Be advisory (&quot;stay out of the way&quot;) otherwise. </strong>Meetings might contain all sorts of updates and requests for reactions. I think a good template for a board member, when sharing an opinion or reaction, is either to (a) explain as they&apos;re talking why this topic is important for the board&apos;s main duties; or (b) say (or imply) something like &quot;I&apos;m curious / offering an opinion about ___, but if this isn&apos;t helpful, please ignore it, and please don&apos;t hesitate to move the meeting to the next topic as soon as this stops feeling productive.&quot;
</p>
<p>
The combination of intense engagement on core duties and &quot;staying out of the way&quot; otherwise <strong>can make this a very weird role. </strong>An organization will often go years without any serious questions about the CEO&apos;s performance or other matters involving <a href="https://www.cold-takes.com/p/47f975d8-08f1-4e66-a7c6-6ba4e182cc1a/#the-boards-main-duties">core duties.</a> So a board member ought to be ready to quietly nod along and stay out of the way for very long stretches of time, while being ready to get seriously involved and engaged when this makes sense. 
</p>
<p>
<strong>Aim for division of labor. </strong>I think a major problem with nonprofit boards is that, by default, it&apos;s really unclear which board member is responsible for what. I think it&apos;s a good idea for board members to explicitly settle this via assigning:
</p>
<ul>

<li>Specialists (&quot;Board member X is reviewing the financials; the rest of us are mostly checked-out and/or sanity-checking on that&quot;); 

</li><li>Subcommittees (&quot;Board members X and Y will look into this particular aspect of the CEO&apos;s performance&quot;); 

</li><li>A Board Chair or Lead Independent Director<sup id="fnref9"><a href="https://www.cold-takes.com/p/47f975d8-08f1-4e66-a7c6-6ba4e182cc1a/#fn9" rel="footnote">9</a></sup> who is the default person to take responsibility for making sure the board is doing its job well (this could include suggesting and assigning responsibility for some of the ideas I list <a href="https://www.cold-takes.com/p/47f975d8-08f1-4e66-a7c6-6ba4e182cc1a/#a-few-practices-that-seem-good">below</a>; helping to set the agenda for board meetings so it isn&apos;t just up to the CEO; etc.)
</li></ul>
<p>
This can further help everyone find a balance between engaging and staying out of the way.
</p>
<h3 id="who-should-be-on-the-board">Who should be on the board?</h3>


<p>
One answer is that it should be whoever can do well at the duties outlined above - both in terms of substance (can they accurately evaluate the CEO&apos;s performance, identify big-picture irreversible risks, etc.?) and in terms of style (do they actively engage on their <a href="https://www.cold-takes.com/p/47f975d8-08f1-4e66-a7c6-6ba4e182cc1a/#the-boards-main-duties">main duties</a> and stay out of the way otherwise?)
</p>
<p>
But to make things a bit more boiled-down and concrete, I think perhaps the most important test for a board member is: <strong>they&apos;ll get the CEO replaced if this would be good for the nonprofit&apos;s mission, and they won&apos;t if it wouldn&apos;t be.</strong>
</p>
<p>
This is the most essential function of the board, and it implies a bunch of things about who makes a good board member: 
</p>
<ul>

<li>They need to <strong>do a great job understanding and representing the nonprofit&apos;s mission, and care deeply about that mission</strong> - to the point of being ready to create conflict over it if needed (and only if needed). 
<ul>
 
<li>A <a href="https://www.cold-takes.com/p/47f975d8-08f1-4e66-a7c6-6ba4e182cc1a/#great-power-low-engagement-unclear-responsibility-no-accountability">key challenge</a> of nonprofits is that they have no clear goal, only a mission statement that is open to interpretation. And if two different board members interpret the mission differently - or are focused on different aspects of it - this could intensely color how they evaluate the CEO, which could be a huge deal for the nonprofit.
 
</li><li>For example, if a nonprofit&apos;s mission is &quot;Help animals everywhere,&quot; does this mean &quot;Help as many animals as possible&quot; (which might indicate a move toward focusing on farm animals) or &quot;Help animals in the same way the nonprofit traditionally has&quot; or something else? How does it imply the nonprofit should make tradeoffs between helping e.g. dogs, cats, elephants, chickens, fish or even insects? How a board member answers questions like this seems central to how their presence on the board is going to affect the nonprofit.
</li> 
</ul>

</li><li>They <strong>need to have a personality and position capable of challenging the CEO </strong>(though also capable of staying out of the way)<strong>. </strong> 
<ul>
 
<li>A common problem I see is that some board member is (a) not very engaged with the nonprofit itself, but (b) highly values their personal relationship with the CEO and other board members. This seems like a bad combination, but unfortunately a common one. Board members need to be willing and able to create conflict in order to do the right thing for the nonprofit.
 
</li><li>Limiting the number of board members who are employees (reporting to the CEO) seems important for this reason.
 
</li><li>If you can&apos;t picture a board member &quot;making waves,&quot; they probably shouldn&apos;t be on the board - that attitude will seem fine more than 90% of the time, but it won&apos;t work well in the rare cases where the board really matters.
 
</li><li>On the other hand, if someone is <em>only comfortable</em> &quot;making waves&quot; and feels useless and out of sorts when they&apos;re just nodding along, that person shouldn&apos;t be on the board either. As noted above, board members need to be ready for a weird job that involves stepping up when the situation requires it, but staying out of the way when it doesn&apos;t. 
</li> 
</ul>

</li><li>They should probably have a <strong>well-developed take on what their job is as a board member. </strong>Board members who can&apos;t say much about where they expect to be highly engaged, vs. casually advisory - and how they expect to invest in getting the knowledge they need to do a good job leading on particular issues - don&apos;t seem like great bets to step up when they most need to (or stay out of the way when they should).
</li>
</ul>
<p>
In my experience, most nonprofits are not looking for these qualities in board members. They are, instead, often looking for things like:
</p>
<ul>

<li>Celebrity and reputation - board members who are generally impressive and well-regarded and make the nonprofit look good. Unfortunately, I think such people often just don&apos;t have much time or interest for their job. Many are also uninterested in causing any conflict, which makes them basically useless as board members IMO.

</li><li>Fundraising - a lot of nonprofits pretty much explicitly just try to put people on the board who will help raise money for them. This seems bad for governance.

</li><li>Narrow expertise on some topic that is important for the nonprofit. I don&apos;t really think this is what nonprofits should be seeking from board members,<sup id="fnref10"><a href="https://www.cold-takes.com/p/47f975d8-08f1-4e66-a7c6-6ba4e182cc1a/#fn10" rel="footnote">10</a></sup> except to the extent it ties deeply into the board members&apos; <a href="https://www.cold-takes.com/p/47f975d8-08f1-4e66-a7c6-6ba4e182cc1a/#the-boards-main-duties">core duties</a>, e.g., where it&apos;s important to have an independent view on technical topic X in order to do a good job evaluating the CEO.
</li></ul>
<p>
I think a good profile for a board member is someone who cares greatly about the nonprofit&apos;s mission, and wants it to succeed, to the point where they&apos;re ready to have tough conversations if they see the CEO falling short. Examples of such people might be major funders, or major stakeholders (e.g., a community leader from a community of people the nonprofit is trying to help).
</p>
<h3 id="a-few-practices-that-seem-good">A few practices that seem good</h3>


<p>
I&apos;ll anticlimactically close with a few practices that seem helpful to me. These are mostly pretty generic practices, useful for both for-profit and nonprofit boards, that I have seen working in practice but also seen too many boards going without. They don&apos;t fully address the weirdnesses discussed above (especially the stuff specific to nonprofit as opposed to for-profit boards), but they seem to make things some amount better.
</p>
    <p><strong>Keeping it simple for low-stakes organizations. </strong>If a nonprofit is a year old and has 3 employees, it probably shouldn&apos;t be investing a ton of its energy in having a great board (especially since this is hard).</p> <p></p>

<p>
A key question is: &quot;If the board just stays checked out and doesn&apos;t hold the CEO accountable, what&apos;s the worst thing that can happen?&quot; If the answer is something like &quot;The nonprofit&apos;s relatively modest budget is badly spent,&quot; then it might not be worth a huge investment in building a great board (and in taking some of the measures listed below). Early-stage nonprofits often have a board consisting of 2-3 people the founder trusts a lot (ideally in a &quot;you&apos;d fire me if it were the right thing to do&quot; sense rather than in a &quot;you&apos;ve always got my back&quot; sense), which seems fine. The rest of these ideas are for when the stakes are higher.
</p>
<p>
<strong>Formal board-staff communication channels. </strong>A very common problem I see is that:
</p>
<ul>

<li>Board members know almost nothing about the organization, and so are hesitant to engage in much of anything.

</li><li>Employees of the organization know far more, but find the board members mysterious/unapproachable/scary, and don&apos;t share much information with them.
</li>
</ul>
<p>
I&apos;ve seen this dynamic improved some amount by things like a <strong>staff liaison</strong>: a board member who is designated with the duty, &quot;Talk to employees a lot, offer them confidentiality as requested, try to build trust, and gather information about how things are going.&quot; Things like regular &quot;office hours&quot; and showing up to company events can help with this.
</p>

<p>
<strong>Viewing board seats as limited. </strong>It seems unlikely that a board should have more than 10 members (and even 10 seems like a lot), since it&apos;s hard to have a productive meeting past that point.<sup id="fnref11"><a href="https://www.cold-takes.com/p/47f975d8-08f1-4e66-a7c6-6ba4e182cc1a/#fn11" rel="footnote">11</a></sup> When considering a new addition to the board, I think the board should be asking something much closer to &quot;Is this one of the 10 best people in the world to sit on this board?&quot; than to &quot;Is this person fine?&quot;

<div id="regular-ceo-reviews"></div></p><p><strong>Regular CEO reviews.</strong>
Many nonprofits don&apos;t seem to have any formal, regular process for reviewing the CEO&apos;s performance; I think it&apos;s important to do this.
</p> 
<p>
The most common format I&apos;ve seen is something like: one board member interviews the CEO&apos;s direct reports, and perhaps some other people throughout the company, and integrates this with information about the organization&apos;s overall progress and accomplishments (often presented by the organization itself, but they might ask questions about it) to provide a report on what the CEO is doing well and could do better. I think this approach has a lot of limitations - staff are often hesitant to be forthcoming with a board member (even when promised anonymity), and the board member often lacks a lot of key information - but even with those issues, it tends to be a useful exercise.
</p>
<p>
<strong>Closed sessions. </strong>I think it&apos;s important for the board to have &quot;closed sessions&quot; where board members can talk frankly without the CEO, other employees, etc. hearing. I think a common mistake is to ask &quot;Does anyone want the closed session today or can we skip it?&quot; - this puts the onus on board members to say &quot;Yes, I would like a closed session,&quot; which then implies they have something negative to say. I think it&apos;s better for whoever&apos;s running the meetings to identify logical closed sessions (e.g., &quot;The board minus employees&quot;), allocate time for them and force them to happen.
</p>
<p>
<strong>Regular board reviews. </strong>It seems like it would be a good idea for board members to regularly assess each other&apos;s performance, and the performance of the board as a whole. But I&apos;ve actually seen very little of this done in practice and I can&apos;t point to versions of it that seem to have some track record of working well. It does seem like a good idea though!
</p>
<h2 id="conclusion">Conclusion</h2>


<p>
The board is the only body at a nonprofit that can hold the CEO accountable to accomplishing the mission. I broadly feel like most nonprofit boards just aren&apos;t very well-suited to this duty, or necessarily to much of anything. It&apos;s an inherently weird structure that seems difficult to make work. 
</p>
<p>
I wish someone would do a great job studying and laying out how nonprofit boards should be assembled, how they should do their job and how they can be held accountable. You can think of this post as my quick, informal shot at that.
</p>

<!-- Footnotes themselves at the bottom. --><!--kg-card-end: html--><!--kg-card-begin: html-->

<p><div style="display:flex; justify-content:center; margin: 0 auto;">

        <span style="margin: 10px;"><a href="https://api.addthis.com/oexchange/0.8/forward/twitter/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fnonprofit-boards-are-weird-2&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Nonprofit%20Boards%20are%20Weird&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-twitter-square.png" border="0" alt="Nonprofit Boards are Weird"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/facebook/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fnonprofit-boards-are-weird-2&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Nonprofit%20Boards%20are%20Weird&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-facebook-square.png" border="0" alt="Nonprofit Boards are Weird"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/reddit/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fnonprofit-boards-are-weird-2&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Nonprofit%20Boards%20are%20Weird&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-reddit-square.png" border="0" alt="Nonprofit Boards are Weird"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/menu/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fnonprofit-boards-are-weird-2&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20Nonprofit%20Boards%20are%20Weird&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-addthis-square.png" border="0" alt="Nonprofit Boards are Weird"></a></span>
        </div>
<center><p id="discuss"><!--<a href="https://www.cold-takes.com/nonprofit-boards-are-weird-2#subscribe" target="_blank"><button id="footer-subscribe" class="button">Subscribe</button></a>&nbsp;<a href="https://www.guidedtrack.com/programs/4kal2ue/run?posttitle=Nonprofit%20Boards%20are%20Weird" target="_blank"><button class="button" id="Survey">Feedback</button></a>-->&#xA0;<a href="https://www.lesswrong.com/posts/nSjavaKcBrtNktzGa/nonprofit-boards-are-weird#comments" target="_blank"><button class="button">Comment/discuss</button></a></p><p><!--<em>
Use "Feedback" if you have comments/suggestions you want me to see, or if you're up for giving some quick feedback about this post (which I greatly appreciate!) Use "Forum" if you want to discuss this post publicly on the Effective Altruism Forum.
</em>--></p></center>
<!--kg-card-end: html--><!--kg-card-begin: html--><hr></p><h2>Footnotes</h2>
<div class="footnotes">

<ol><li id="fn1">

     I&apos;m using the term &quot;CEO&quot; throughout, although the chief executive at a non profit sometimes has another title, such as &quot;Executive Director.&quot;&#xA0;<a href="#fnref1" rev="footnote">&#x21A9;</a></li><li id="fn2">
<p>
     A lot of this piece is about how the <em>fundamental setup</em> of a nonprofit board leads to the kinds of problems and dynamics I&apos;m describing. This doesn&apos;t mean we should necessarily think there&apos;s any way to fix it or any better alternative. It just means that this setup seems to bring a lot of friction points and challenges that <em>most</em> relationships between supervisor-and-supervised don&apos;t seem to have, which can make the experience of interacting with a board feel vaguely unlike what we&apos;re used to in other contexts, or &quot;weird.&quot;</p>
<p>
    People who have interacted with tons of boards might get so used to these dynamics that they no longer feel weird. I haven&apos;t reached that point yet myself though.</p>&#xA0;<a href="#fnref2" rev="footnote">&#x21A9;</a></li><li id="fn3">
     The fact that the nonprofit&apos;s goals aren&apos;t clearly defined and have no clear metric (and often aren&apos;t susceptible to measurement at all) is a pretty general challenge of nonprofits, but I think it especially shows up for a structure (the board) that is already weird in the various other ways I&apos;m describing.&#xA0;<a href="#fnref3" rev="footnote">&#x21A9;</a></li><li id="fn4">
     Superficially, you could make most of the same complaints about shareholders of a for-profit company. But:
<ul>

<li>Shareholders are the people who ultimately make or lose money if the company does well or poorly (you can think of this as a form of accountability). By contrast, nonprofit board members often have very little (or only an idiosyncratic) personal connection to and investment in the organization.

</li><li>Shareholders compensate for their low engagement by picking representatives (a board) whom they can hold accountable for the company&apos;s performance. Nonprofit board members <em>are</em> the representatives, and aren&apos;t accountable to anyone.&#xA0;<a href="#fnref4" rev="footnote">&#x21A9;</a></li></ul></li><li id="fn5">
     Especially &quot;good and concise.&quot; Most of the points I make here can be found in some writings on boards somewhere, but it&apos;s hard to find sensible-seeming and comprehensive discussions of what the board should be doing and who should be on it.&#xA0;<a href="#fnref5" rev="footnote">&#x21A9;</a></li><li id="fn6">
     Part of the CEO&apos;s job is fundraising, and if they do a bad job of this, it&apos;s going to be obvious. But that&apos;s only part of the job. At a nonprofit, a CEO could easily be bringing in plenty of money and just doing a horrible job at the mission - and if the board isn&apos;t able to learn this and act on it, it seems like very bad news.&#xA0;<a href="#fnref6" rev="footnote">&#x21A9;</a></li><li id="fn7">
     The charter and bylaws are like the &quot;constitution&quot; of a nonprofit, laying out how its governance works.&#xA0;<a href="#fnref7" rev="footnote">&#x21A9;</a></li><li id="fn8">
<p>
     This is a judgment call, and one way to approach it would be to reserve something like 1 hour of full-board meeting time per year for talking about these sorts of things (and pouring in more time if at least, like, 1/3 of the board thinks something is a big deal).</p>
<p>
    Some examples of things I think are and aren&apos;t usually a big enough deal to start paying serious attention to:</p>
<ul>

<li>Big enough deal: financial decisions that increase the odds of going &quot;belly-up&quot; (running out of money and having to fold) by at least 10 percentage points. Not a big enough deal: spending money in ways that are arguably bad uses of money, having a lowish-but-not-too-far-off-of-peer-organizations amount of runway.

</li><li>Big enough deal: deficiencies in financial controls that an auditor is highlighting, or a lack of audit altogether, until a plan is agreed to to address these things. Not a big enough deal: most other stuff in this category.

</li><li>Big enough deal: organizations with substantial &quot;PR risk&quot; exposure should have a good team for assessing this and a &quot;crisis plan&quot; in case something happens. Not a big enough deal: specific organizational decisions and practices that you are not personally offended by or find unethical, but could imagine a negative article about. (If you do find them substantively unethical, I think that&apos;s a big enough deal.)

</li><li>Big enough deal: transferring like 1/3 or more of valuable things the nonprofit has (intellectual property, money, etc.) to another entity not controlled by the board. Not a big enough deal: starting an affiliate organization primarily for taking donations in another country or something.

</li><li>Big enough deal: doubling or halving the workforce. Not a big enough deal: smaller hirings and firings.</li></ul>&#xA0;<a href="#fnref8" rev="footnote">&#x21A9;</a></li><li id="fn9">

     Sometimes the Board Chair is the CEO, and sometimes the Chair is an employee of the company who also sits on the board. In these cases, I think it&apos;s good for there to be a separate Lead Independent Director who is not employed by the company and is therefore exclusively representing the Board. They can help set agendas, lead meetings, and take responsibility by default when it&apos;s otherwise unclear who would do so.&#xA0;<a href="#fnref9" rev="footnote">&#x21A9;</a></li><li id="fn10">

     Nonprofits can get expertise on topic X by hiring experts on X to advise them. The question is: when is it important to have an expert on X <em>evaluating the CEO</em>?&#xA0;<a href="#fnref10" rev="footnote">&#x21A9;</a></li><li id="fn11">
     Though it could be fine and even interesting to have giant boards - 20 people, 50 or more - that have some sort of &quot;executive committee&quot; of 10 or fewer people doing basically all of the meetings and all of the work (with the rest functioning just as very passive, occasionally-voting equivalents of &quot;shareholders&quot;). Just assume I&apos;m talking about the &quot;executive committee&quot; type thing here.&#xA0;<a href="#fnref11" rev="footnote">&#x21A9;</a>

    </li></ol></div><!--kg-card-end: html-->]]></content:encoded></item><item><title><![CDATA[AI Could Defeat All Of Us Combined]]></title><description><![CDATA[How big a deal could AI misalignment be? About as big as it gets.]]></description><link>https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/</link><guid isPermaLink="false">629abb88d0a7c7003d405058</guid><category><![CDATA[ImplicationsOfMostImportantCentury]]></category><dc:creator><![CDATA[Holden Karnofsky]]></dc:creator><pubDate>Thu, 09 Jun 2022 15:41:22 GMT</pubDate><media:content url="https://www.cold-takes.com/content/images/2022/06/whoa-no-text.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html--><img src="https://www.cold-takes.com/content/images/2022/06/whoa-no-text.png" alt="AI Could Defeat All Of Us Combined"><p><figure><div id="buzzsprout-player-10749983"></div><script src="https://www.buzzsprout.com/1851795/10749983-ai-could-defeat-all-of-us-combined.js?container_id=buzzsprout-player-10749983&amp;player=small" type="text/javascript" charset="utf-8"></script><figcaption><em>Click lower right to download or find on Apple Podcasts, Spotify, Stitcher, etc.</em></figcaption></figure></p>
<p>
</p>
<p>
I&apos;ve been working on a new series of posts about the <a href="https://www.cold-takes.com/most-important-century/">most important century</a>. 
</p>
<ul>

<li>The original series focused on why and how this could be the most important century for humanity. But it had <a href="https://www.cold-takes.com/making-the-best-of-the-most-important-century/">relatively little to say about </a><em>what we can do today</em> to improve the odds of things going well.

</li><li>The new series will get much more specific about the kinds of events that might lie ahead of us, and what actions today look most likely to be helpful.

</li><li>A key focus of the new series will be the threat of <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/#misaligned-ai-mysterious-potentially-dangerous-objectives">misaligned AI</a>: AI systems disempowering humans entirely, leading to a future that has little to do with anything humans value. (<a href="https://www.slowboring.com/p/the-case-for-terminator-analogies?s=r">Like in the Terminator movies</a>, minus the time travel and the part where humans win.)
</li>
</ul>
<p>
Many people have trouble taking this &quot;misaligned AI&quot; possibility seriously. They might see the broad point that AI could be dangerous, but they instinctively imagine that the danger comes from ways humans might misuse it. They find the idea of <em>AI itself going to war with humans</em> to be comical and <a href="https://www.cold-takes.com/all-possible-views-about-humanitys-future-are-wild/">wild</a>. I&apos;m going to try to make this idea feel more serious and real.
</p>
<p>
As a first step, this post will <strong>emphasize an unoriginal but extremely important point: <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/">the kind of AI I&apos;ve discussed</a><em> </em>could defeat all of humanity combined, if (for whatever reason) it were pointed toward that goal. </strong>By &quot;defeat,&quot; I don&apos;t mean &quot;subtly manipulate us&quot; or &quot;make us less informed&quot; or something like that - I mean a literal &quot;defeat&quot; in the sense that we could all be killed, enslaved or forcibly contained.
</p>
<p>
I&apos;m not talking (yet) about whether, or why, AIs <em>might </em>attack human civilization. That&apos;s for future posts. For now, I just want to linger on the point that <em>if </em>such an attack happened, it could succeed against the combined forces of the entire world. 
</p>
<ul>

<li>I think that <strong>if you believe this, you should already be worried about misaligned AI,</strong><sup id="fnref1"><a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#fn1" rel="footnote">1</a></sup><strong> before any analysis of how or why an AI might form its own goals. </strong>

</li><li>We generally don&apos;t have a lot of <em>things that could end human civilization if they &quot;tried&quot;</em> sitting around. If we&apos;re going to create one, I think we should be asking not &quot;Why would this be dangerous?&quot; but &quot;Why wouldn&apos;t it be?&quot;
</li></ul><p>
By contrast, if you don&apos;t believe that AI could defeat all of humanity combined, I expect that we&apos;re going to be miscommunicating in pretty much any conversation about AI. The kind of AI I worry about is the kind powerful enough that total civilizational defeat is a real possibility. The reason I currently spend so much time planning around speculative future technologies (instead of working on <a href="https://www.givewell.org/">evidence-backed, cost-effective ways of helping low-income people today</a> - which I did for much of my career, and still think is one of the best things to work on) is because I think the stakes are <em>just that high</em>. 
</p>
<p>
Below:
</p>
<ul>

<li>I&apos;ll sketch the basic argument for why I think AI could defeat all of human civilization.  
<ul>
 
<li>Others have written about the possibility that &quot;superintelligent&quot; AI could manipulate humans and create overpowering advanced technologies; I&apos;ll briefly recap that case.
 
</li><li>I&apos;ll then cover a different possibility, which is that even &quot;merely human-level&quot; AI could still defeat us all - by quickly coming to rival human civilization in terms of total population and resources.
 
</li><li>At a high level, I think we should be worried if a huge (competitive with world population) and rapidly growing set of highly skilled humans on another planet was trying to take down civilization just by using the Internet. So we should be worried about a large set of disembodied AIs as well. 
</li> 
</ul>

</li><li>I&apos;ll briefly address a few objections/common questions:  
<ul>
 
<li>How can AIs be dangerous without bodies? 
 
</li><li>If lots of different companies and governments have access to AI, won&apos;t this create a &quot;balance of power&quot; so that no one actor is able to bring down civilization? 
 
</li><li>Won&apos;t we see warning signs of AI takeover and be able to nip it in the bud?
 
</li><li>Isn&apos;t it fine or maybe good if AIs defeat us? They have rights too. 
</li> 
</ul>

</li><li>Close with some thoughts on just how unprecedented it would be to have something on our planet capable of overpowering us all.
</li>
</ul>


<h2 id="how-ai-systems-could-defeat-all-of-us">How AI systems could defeat all of us</h2>


<p>
There&apos;s been a lot of debate over whether AI systems might form their own &quot;motivations&quot; that lead them to seek the disempowerment of humanity. I&apos;ll be talking about this in future pieces, but for now I want to put it aside and imagine how things would go <em>if this happened. </em>
</p>
<p>
So, for what follows, let&apos;s proceed from the premise: <strong>&quot;For some weird reason, humans consistently design AI systems (with human-like research and planning abilities) that coordinate with each other to try and overthrow humanity.&quot; Then what? </strong>What follows will necessarily feel wacky to people who find this hard to imagine, but I think it&apos;s worth playing along, because I think &quot;we&apos;d be in trouble if this happened&quot; is a very important point.
</p>

<h3 id="the-standard-argument-superintelligence-and-advanced-technology">The &quot;standard&quot; argument: superintelligence and advanced technology</h3>


<p>
Other treatments of this question have focused on AI systems&apos; potential to become <em>vastly</em> more intelligent than humans, to the point where they have what <a href="https://smile.amazon.com/dp/B00LOOCGB2/ref=dp-kindle-redirect?_encoding=UTF8&amp;btkr=1">Nick Bostrom calls</a> &quot;cognitive superpowers.&quot;<sup id="fnref2"><a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#fn2" rel="footnote">2</a></sup> Bostrom imagines an AI system that can do things like:
</p>
<ul>

<li>Do its own research on how to build a better AI system, which culminates in something that has incredible other abilities.

</li><li>Hack into human-built software across the world.

</li><li>Manipulate human psychology.

</li><li>Quickly generate vast wealth under the control of itself or any human allies.

</li><li>Come up with better plans than humans could imagine, and ensure that it doesn&apos;t try any takeover attempt that humans might be able to detect and stop.

</li><li>Develop advanced weaponry that can be built quickly and cheaply, yet is powerful enough to overpower human militaries. 
</li>
</ul>
<p>
(<a href="https://waitbutwhy.com/2015/01/artificial-intelligence-revolution-2.html">Wait But Why</a> reasons similarly.<sup id="fnref3"><a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#fn3" rel="footnote">3</a></sup>)
</p>
<p>
I think many readers will already be convinced by arguments like these, and if so you might skip down to the <a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#some-quick-responses-to-objections">next major section</a>.
</p>
<p>
But I want to be clear that I <em>don&apos;t</em> think the danger relies on the idea of &quot;cognitive superpowers&quot; or &quot;superintelligence&quot; - both of which refer to capabilities vastly beyond those of humans. <strong>I think we still have a problem even if we assume that AIs will basically have similar capabilities to humans, and not be fundamentally or drastically more intelligent or capable. </strong>I&apos;ll cover that next.
</p>
<h3 id="how-ais-could-defeat-humans-without-superintelligence">How AIs could defeat humans without &quot;superintelligence&quot;</h3>

<p>
If we assume that AIs will basically have similar capabilities to humans, I think we still need to worry that they could come to <strong>out-number and out-resource humans, </strong>and could thus have the advantage if they coordinated against us.
</p>
<p>
Here&apos;s a simplified example (some of the simplifications are in this footnote<sup id="fnref4"><a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#fn4" rel="footnote">4</a></sup>) based on <a href="https://www.lesswrong.com/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines">Ajeya Cotra&apos;s &quot;biological anchors&quot; report</a>:
</p>


<ul>

<li>I assume that transformative AI is developed on the soonish side (around 2036 - assuming later would only make the below numbers larger), and that it initially comes in the form of a <strong>single AI system that is able to do more-or-less the same intellectual tasks as a human.</strong> That is, it doesn&apos;t have a human body, but it can do anything a human working remotely from a computer could do. 

</li><li>I&apos;m using the report&apos;s framework in which it&apos;s much more expensive to <em>train</em> (develop) this system than to <em>run</em> it (for example, think about how much Microsoft spent to develop Windows, vs. how much it costs for me to run it on my computer). 

</li><li>The report provides a way of estimating both how much it would cost to <em>train</em> this AI system, and how much it would cost to <em>run</em> it. Using these estimates (details in footnote)<sup id="fnref5"><a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#fn5" rel="footnote">5</a></sup> implies that once the first human-level AI system is created, whoever created it could use the same computing power it took to create it in order to run <strong>several hundred million copies for about a year each</strong>.<sup id="fnref6"><a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#fn6" rel="footnote">6</a></sup> 

</li><li>This would be over 1000x the total number of Intel or Google employees,<sup id="fnref7"><a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#fn7" rel="footnote">7</a></sup> over 100x the total number of active and reserve personnel in the <a href="https://en.wikipedia.org/wiki/United_States_Armed_Forces">US armed forces</a>, and something like 5-10% the size of the world&apos;s total working-age population.<sup id="fnref8"><a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#fn8" rel="footnote">8</a></sup>

</li><li>And that&apos;s just a starting point.  
<ul>
 
<li>This is just using the same amount of resources that went into training the AI in the first place. Since these AI systems can do human-level economic work, they can probably be used to make more money and buy or rent more hardware,<sup id="fnref9"><a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#fn9" rel="footnote">9</a></sup> which could quickly lead to a &quot;population&quot; of billions or more.
 
</li><li>In addition to making more money that can be used to run more AIs, the AIs can conduct massive amounts of research on how to use computing power more efficiently, which could mean still greater numbers of AIs run using the <em>same</em> hardware. This in turn could lead to a <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/#explosive-scientific-and-technological-advancement">feedback loop</a> and explosive growth in the number of AIs.<!-- (One example estimate of what this could look like in a footnote.<sup id="fnref10"><a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#fn10" rel="footnote">10</a></sup>)-->
</li>
</ul>

</li><li>Each of these AIs might have skills comparable to those of unusually highly paid humans, including scientists, software engineers and quantitative traders. It&apos;s hard to say how quickly a set of AIs like this could develop new technologies or make money trading markets, but it seems quite possible for them to amass huge amounts of resources quickly. A huge population of AIs, each able to earn a lot compared to the average human, could end up with a &quot;virtual economy&quot; at least as big as the human one.
</li>
</ul>


<p>
To me, this is most of what we need to know: <strong>if there&apos;s something with human-like skills, seeking to disempower humanity, with a population in the same ballpark as (or larger than) that of all humans, we&apos;ve got a civilization-level problem.</strong>
</p>
<p>
A potential counterpoint is that these AIs would merely be &quot;virtual&quot;: if they started causing trouble, humans could ultimately unplug/deactivate the servers they&apos;re running on. I do think this fact would make life harder for AIs seeking to disempower humans, but I don&apos;t think it ultimately should be cause for much comfort. I think a large population of AIs would likely be able to find some way to achieve security from human shutdown, and go from there to amassing enough resources to overpower human civilization (especially if AIs across the world, including most of the ones humans were trying to use for help, were coordinating). 
</p>
<p>
I spell out what this might look like in an <a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#appendix-how-ais-could-avoid-shutdown">appendix</a>. In brief:
</p>
<ul>

<li>By default, I expect the economic gains from using AI to mean that humans create huge numbers of AIs, integrated all throughout the economy, potentially including direct interaction with (and even control of) large numbers of robots and weapons.  
<ul>
 
<li>(If not, I think the situation is in many ways even more dangerous, since a single AI could make many copies of itself and have little competition for things like server space, as discussed in the <a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#what-if-humans-move-slowly-and-dont-create-many-ais">appendix</a>.)
</li> 
</ul>

</li><li>AIs would have multiple ways of obtaining property and servers safe from shutdown.  
<ul>
 
<li>For example, they might recruit human allies (through manipulation, deception, blackmail/threats, genuine promises along the lines of &quot;We&apos;re probably going to end up in charge somehow, and we&apos;ll treat you better when we do&quot;) to rent property and servers and otherwise help them out. 
 
</li><li>Or they might create fakery so that they&apos;re able to operate freely on a company&apos;s servers while all outward signs seem to show that they&apos;re successfully helping the company with its goals.
</li> 
</ul>

</li><li>A relatively modest amount of property safe from shutdown could be sufficient for housing a huge population of AI systems that are recruiting further human allies, making money (via e.g. quantitative finance), researching and developing advanced weaponry (e.g., bioweapons), setting up manufacturing robots to construct military equipment, thoroughly infiltrating computer systems worldwide to the point where they can disable or control most others&apos; equipment, etc. 

</li><li>Through these and other methods, a large enough population of AIs could develop enough military technology and equipment to overpower civilization - especially if AIs across the world (including the ones humans were trying to use) were coordinating with each other.
</li>
</ul>
<h2 id="some-quick-responses-to-objections">Some quick responses to objections</h2>


<p>
This has been a brief sketch of how AIs could come to outnumber and out-resource humans. There are lots of details I haven&apos;t addressed.
</p>
<p>
Here are some of the most common objections I hear to the idea that AI could defeat all of us; if I get much <a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#discuss">demand</a> I can elaborate on some or all of them more in the future.
</p>
<p>
<strong>How can AIs be dangerous without bodies?</strong> This is discussed a fair amount in the <a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#appendix-how-ais-could-avoid-shutdown">appendix</a>. In brief: 
</p>
<ul>

<li>AIs could recruit human allies, tele-operate robots and other military equipment, make money via research and quantitative trading, etc. 

</li><li>At a high level, I think we should be worried if a huge (competitive with world population) and rapidly growing set of highly skilled humans on another planet was trying to take down civilization just by using the Internet. So we should be worried about a large set of disembodied AIs as well. 
</li>
</ul>
<p>
<strong>If lots of different companies and governments have access to AI, won&apos;t this create a &quot;balance of power&quot; so that nobody is able to bring down civilization? </strong>
</p>
<ul>

<li>This is a reasonable objection to many horror stories about AI and other possible advances in military technology, but if <em>AIs collectively have different goals from humans and are willing to coordinate with each other</em><sup id="fnref11"><a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#fn11" rel="footnote">11</a></sup><em> against us</em>, I think we&apos;re in trouble, and this &quot;balance of power&quot; idea doesn&apos;t seem to help. 

    </li><li>What matters is the total number and resources of AIs vs. humans.</li></ul>
<p>
<strong>Won&apos;t we see warning signs of AI takeover and be able to nip it in the bud? </strong>I would guess we would see some warning signs, but does that mean we could nip it in the bud? Think about human civil wars and revolutions: there are some warning signs, but also, people go from &quot;not fighting&quot; to &quot;fighting&quot; pretty quickly as they see an opportunity to coordinate with each other and be successful.
</p>
<p>
<strong>Isn&apos;t it fine or maybe good if AIs defeat us? They have rights too. </strong>
</p>
<ul>

<li>Maybe AIs <em>should </em>have rights; if so, it would be nice if we could reach some &quot;compromise&quot; way of coexisting that respects those rights. 

</li><li>But if they&apos;re able to defeat us entirely, that isn&apos;t what I&apos;d plan on getting - instead I&apos;d expect (by default) a world run <em>entirely</em> according to whatever goals AIs happen to have.

</li><li>These goals might have essentially nothing to do with anything humans value, and could be actively counter to it - e.g., placing zero value on beauty and having zero attempts to prevent or avoid suffering).
</li>
</ul>


<h2 id="risks-like-this-dont-come-along-every-day">Risks like this don&apos;t come along every day</h2>


<p>
I don&apos;t think there are a lot of things that have a serious chance of bringing down human civilization for good.
</p>
<p>
As argued in <a href="https://theprecipice.com/">The Precipice</a>, most natural disasters (including e.g. asteroid strikes) don&apos;t seem to be huge threats, if only because civilization has been around for thousands of years so far -  implying that natural civilization-threatening events are rare.
</p>
<p>
Human civilization is pretty powerful and seems pretty robust, and accordingly, what&apos;s really scary to me is the idea of something with the same basic capabilities as humans (making plans, developing its own technology) that can outnumber and out-resource us. There aren&apos;t a lot of candidates for that.<sup id="fnref12"><a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#fn12" rel="footnote">12</a></sup>
</p>
<p>
AI is one such candidate, and I think that even before we engage heavily in arguments about whether AIs might seek to defeat humans, we should feel very nervous about the possibility that they could.
</p>
<p>
What about things like &quot;AI might lead to mass unemployment and unrest&quot; or &quot;AI might exacerbate misinformation and propaganda&quot; or &quot;AI might exacerbate a wide range of other social ills and injustices&quot;<sup id="fnref13"><a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#fn13" rel="footnote">13</a></sup>? I think these are real concerns - but to be honest, if they were the biggest concerns, I&apos;d probably still be focused on <a href="https://www.givewell.org/">helping people in low-income countries today</a> rather than trying to prepare for future technologies. 
</p>
<ul>

<li>Predicting the future is generally hard, and it&apos;s easy to pour effort into preparing for challenges that never come (or come in a very different form from what was imagined).

</li><li>I believe civilization is pretty robust - we&apos;ve had huge changes and challenges over the last century-plus (full-scale world wars, <a href="https://forum.effectivealtruism.org/posts/ajBYeiggAzu6Cgb3o/biological-anchors-is-about-bounding-not-pinpointing-ai?commentId=nEeuknn4unKTWEd6i">many dramatic changes in how we communicate with each other</a>, dramatic changes in lifestyles and values) without seeming to have come very close to a collapse.

</li><li>So if I&apos;m engaging in speculative worries about a potential future technology, I want to focus on the really, really big ones - the ones that could matter for billions of years. If there&apos;s a real possibility that AI systems will have values different from ours, and cooperate to try to defeat us, that&apos;s such a worry.
</li>
</ul>
<p>
<em>Special thanks to Carl Shulman for discussion on this post.</em>
</p>
<h2 id="appendix-how-ais-could-avoid-shutdown">Appendix: how AIs could avoid shutdown</h2>


<p>
This appendix goes into detail about how AIs coordinating against humans could amass resources of their own without humans being able to shut down all &quot;misbehaving&quot; AIs. 
</p>
<p>
It&apos;s necessarily speculative, and should be taken in the spirit of giving examples of how this might work - for me, the high-level concern is that a huge, coordinating population of AIs with similar capabilities to humans would be a threat to human civilization, and that we shouldn&apos;t count on any particular way of stopping it such as shutting down servers.
</p>
<p>
I&apos;ll discuss two different general types of scenarios: (a) Humans create a huge population of AIs; (b) Humans move slowly and don&apos;t create many AIs.
</p>
<h3 id="how-this-could-work-if-humans-create-a-huge-population-of-ais">How this could work if humans create a huge population of AIs</h3>


<p>
I think a reasonable default expectation is that humans do most of the work of making AI systems incredibly numerous and powerful (because doing so is profitable), which leads to a vulnerable situation. Something roughly along the lines of:
</p>
<ul>

<li>The company that first develops transformative AI quickly starts running large numbers of copies (hundreds of millions or more), which are used to (a) do research on how to improve computational efficiency and run more copies still; (b) develop valuable intellectual property (trading strategies, new technologies) and make money.

</li><li>Over time, AI systems are rolled out widely throughout society. Their numbers grow further, and their role in the economy grows: they are used in (and therefore have direct interaction with) high-level decision-making at companies, perhaps operating large numbers of cars and/or robots, perhaps operating military drones and aircraft, etc. (This seems like a default to me over time, but it isn&apos;t strictly necessary for the situation to be risky, as I&apos;ll go through below.)

</li><li>In this scenario, the AI systems are malicious (as we&apos;ve assumed), but this doesn&apos;t mean they&apos;re constantly causing trouble. Instead, they&apos;re mostly waiting for an opportunity to team up and decisively overpower humanity. In the meantime, they&apos;re mostly behaving themselves, and this is leading to their numbers and power growing.  
<ul>
 
<li>There are scattered incidents of AI systems&apos; trying to cause trouble,<sup id="fnref14"><a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#fn14" rel="footnote">14</a></sup> but this doesn&apos;t cause the whole world to stop using AI or anything.
 
</li><li>A reasonable analogy might be to a typical civil war or revolution: the revolting population <em>mostly</em> avoids isolated, doomed attacks on its government, until it sees an opportunity to band together and have a real shot at victory.
    </li></ul></li></ul><p>
(Paul Christiano&apos;s <a href="https://www.lesswrong.com/posts/HBxe6wdjxK239zajf/what-failure-looks-like">What Failure Looks Like</a> examines this general flavor of scenario in a bit more detail.)
</p>
<p>
In this scenario, the AIs face a challenge: if it becomes obvious to everyone that they are trying to defeat humanity, humans could attack or shut down a few concentrated areas where most of the servers are, and hence drastically reduce AIs&apos; numbers. So the AIs need a way of <strong>getting one or more &quot;AI headquarters&quot;: property they control where they can safely operate servers and factories, do research, make plans and construct robots/drones/other military equipment. </strong>
</p>
<p>
Their goal is ultimately to have enough AIs, robots, etc. to be able to defeat the rest of humanity combined. This might mean constructing overwhelming amounts of military equipment, or thoroughly infiltrating computer systems worldwide to the point where they can disable or control most others&apos; equipment, or researching and deploying extremely powerful weapons (e.g., bioweapons), or a combination.
</p>
<p>
Here are some ways they could get to that point:
</p>
<ul>

<li>They could <strong>recruit human allies </strong>through many different methods - manipulation, deception, blackmail and other threats, genuine promises along the lines of &quot;We&apos;re probably going to end up in charge somehow, and we&apos;ll treat you better when we do.&quot;  
<ul>
 
<li>Human allies could be given valuable intellectual property (developed by AIs), given instructions for making lots of money, and asked to rent their own servers and acquire their own property where an &quot;AI headquarters&quot; can be set up. Since the &quot;AI headquarters&quot; would officially be human property, it could be very hard for authorities to detect and respond to the danger.
 
</li><li>Via threats, AIs might be able to get key humans to cooperate with them - such as political leaders, or the CEOs of companies running lots of AIs. This would open up further strategies.
</li> 
</ul>

</li><li id="hacking">As assumed above, particular companies are running huge numbers of AIs. The AIs being run by these companies might find security holes in the companies&apos; servers (this isn&apos;t the topic of this piece, but my general impression is that security holes are widespread and that reasonably competent people can find many of them)<sup id="fnref15"><a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#fn15" rel="footnote">15</a></sup>, and thereby might find opportunities to <strong>create durable &quot;fakery&quot; </strong>about what they&apos;re up to.  
<ul>
 
<li>E.g., they might set things up so that as far as humans can tell, it looks like all of the AI systems are hard at work creating profit-making opportunities for the company, when in fact they&apos;re essentially using the server farm as their headquarters - and/or trying to establish a headquarters somewhere else (by recruiting human allies, sending money to outside bank accounts, using that money to acquire property and servers, etc.)
</li> 
</ul>

</li><li>If AIs are in wide enough use, they might already be operating lots of drones and other military equipment, in which case it could be pretty straightforward to be able to defend some piece of territory - or to strike a deal with some government to enlist its help in doing so.

</li><li>AIs could mix-and-match the above methods and others: for example, creating &quot;fakery&quot; long enough to recruit some key human allies, then attempting to threaten and control humans in key positions of power to the point where they control solid amounts of military resources, then using this to establish a &quot;headquarters.&quot;
</li>
</ul>




<p>
In any of these cases, once one or more &quot;AI headquarters&quot; are established:
</p>
<ul>

<li>It can be expanded as the AIs headquartered there make money, purchase more property and computing power, build more factories, etc.<sup id="fnref16"><a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#fn16" rel="footnote">16</a></sup>

</li><li>It can develop enough destructive potential and human allies that humans would be hesitant to attack it (as many relatively small countries have today). It would be clear that AIs at the headquarters are seeking autonomy, accumulating resources and taking some unethical actions, but the full scale and nature of the threat could easily remain ambiguous.

</li><li>Even if the threat is clear, AIs could potentially maintain a huge population in a relatively small space (potentially underground). They could also split up across many different headquarters (including some where they are simply using humans&apos; hardware while hiding their presence).  
<ul>
 
<li>Either way, this large-but-hidden population could be recruiting human allies, developing advanced weaponry, looking for ways to take control of human&apos; weapons (potentially possible if they greatly outnumber the humans working on cybersecurity, and if they can coordinate with the AIs &quot;helping&quot; those humans), etc. 
 
</li><li>They could also be working with (unwitting, or allied) humans to manufacture large numbers of robots with back-doors allowing the AIs to take control.
</li> 
</ul>

</li><li>Once the headquarters has built up enough resources to go on offense, <strong>AIs around the world could coordinate with the headquarters.</strong> Humans could shut down AIs that they notice doing this, but they might have a very tough time getting value out of their servers and AI-controlled robots; this could make it easy for the AIs at the &quot;AI headquarters&quot; to out-resource humans.
</li>
</ul>
<h3 id="what-if-humans-move-slowly-and-dont-create-many-ais">What if humans move slowly and don&apos;t create many AIs?</h3>


<p>
The above scenario has humans creating large numbers of AIs, such that the AIs just need to find a way to coordinate and acquire a safe &quot;headquarters&quot; in order to defeat us.
</p>
<p>
What if humans moved more slowly, intentionally restricting human-level AI to a tiny portion of the available computing resources? Could a <em>small</em> number of AIs pose a risk to humanity?
</p>
<p>
In this world, we would have what Carl Shulman refers to as &quot;dry tinder everywhere, waiting for sparks.&quot; Anyone who can buy or rent a large amount of computing power can create a large number of AIs, which can produce a large amount of money and research, leading to still more AIs. 
</p>
<p>
So a single AI could hack into enough servers<sup id="fnref17"><a href="https://www.cold-takes.com/p/4a610336-be1d-42fc-b76f-c33e34598340/#fn17" rel="footnote">17</a></sup> to make a few copies of itself; recruit a few human allies; and start making money, acquiring more server space, etc. until its human allies are running a huge number of AIs. This could all be done in difficult-to-detect ways (it might e.g. just look like a set of humans renting a bunch of servers to run quantitative finance strategies).
</p>
<p>
So in this world, I think our concern should be any AI that is able to find enough security holes to attain that kind of freedom. Given the current state of cybersecurity, that seems like a big concern.
</p>

<!-- Footnotes themselves at the bottom. --><!--kg-card-end: html--><!--kg-card-begin: html-->

<p><div style="display:flex; justify-content:center; margin: 0 auto;">

                <span style="margin: 10px;">        <a href="https://api.addthis.com/oexchange/0.8/forward/hackernews/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fai-could-defeat-all-of-us-combined%2F&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20AI%20Could%20Defeat%20All%20Of%20Us%20Combined&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/08/ct-hackernews-square-temp.png" border="0" alt="AI Could Defeat All Of Us Combined"></a></span>
    <span style="margin: 10px;"><a href="https://api.addthis.com/oexchange/0.8/forward/twitter/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fai-could-defeat-all-of-us-combined&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20AI%20Could%20Defeat%20All%20Of%20Us%20Combined&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-twitter-square.png" border="0" alt="AI Could Defeat All Of Us Combined"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/facebook/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fai-could-defeat-all-of-us-combined&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20AI%20Could%20Defeat%20All%20Of%20Us%20Combined&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-facebook-square.png" border="0" alt="AI Could Defeat All Of Us Combined"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/reddit/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fai-could-defeat-all-of-us-combined&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20AI%20Could%20Defeat%20All%20Of%20Us%20Combined&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-reddit-square.png" border="0" alt="AI Could Defeat All Of Us Combined"></a></span><span style="margin: 10px;">
        <a href="https://api.addthis.com/oexchange/0.8/forward/menu/offer?url=https%3A%2F%2Fwww.cold-takes.com%2Fai-could-defeat-all-of-us-combined&amp;pubid=ra-60a178324cffc42e&amp;title=Cold%20Takes%20-%20AI%20Could%20Defeat%20All%20Of%20Us%20Combined&amp;ct=1" target="_blank"><img width="32" src="https://www.cold-takes.com/content/images/2021/06/ct-addthis-square.png" border="0" alt="AI Could Defeat All Of Us Combined"></a></span>
        </div>
<center><p id="discuss"><!--<a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined#subscribe" target="_blank"><button id="footer-subscribe" class="button">Subscribe</button></a>&nbsp;<a href="https://www.guidedtrack.com/programs/4kal2ue/run?posttitle=AI%20Could%20Defeat%20All%20Of%20Us%20Combined" target="_blank"><button class="button" id="Survey">Feedback</button></a>-->&#xA0;<a href="https://www.lesswrong.com/posts/slug/ai-could-defeat-all-of-us-combined#comments" target="_blank"><button class="button">Comment/discuss</button></a></p><p><!--<em>
Use "Feedback" if you have comments/suggestions you want me to see, or if you're up for giving some quick feedback about this post (which I greatly appreciate!) Use "Forum" if you want to discuss this post publicly on the Effective Altruism Forum.
</em>--></p></center>
<!--kg-card-end: html--><!--kg-card-begin: html--><hr>
</p><h2 id="footnotes">Footnotes</h2>
<div class="footnotes">

<ol><li id="fn1">

<p>
     Assuming you accept other points made in the <a href="https://www.cold-takes.com/most-important-century/">most important century</a> series, e.g. that AI that can do most of what humans do to <a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/">advance science and technology</a> could be developed this century.&#xA0;<a href="#fnref1" rev="footnote">&#x21A9;</a><li id="fn2">
<p>
     See <a href="https://smile.amazon.com/Superintelligence-Dangers-Strategies-Nick-Bostrom-ebook/dp/B00LOOCGB2/">Superintelligence</a> chapter 6.&#xA0;<a href="#fnref2" rev="footnote">&#x21A9;</a><li id="fn3">
<p>
     See the &quot;Nanotechnology blue box,&quot; in particular.&#xA0;<a href="#fnref3" rev="footnote">&#x21A9;</a><li id="fn4">
<ul>

<li>The report estimates the amount of computing power it would take to <em>train </em>(create) a transformative AI system, and the amount of computing power it would take to <em>run </em>one. This is a <a href="https://www.cold-takes.com/biological-anchors-is-about-bounding-not-pinpointing-ai-timelines/">bounding exercise</a> and isn&apos;t supposed to be literally predicting that transformative AI will arrive in the form of a single AI system trained in a single massive run, but here I am interpreting the report that way for concreteness and simplicity.

</li><li>As explained in the next footnote, I use the report&apos;s figures for transformative AI arriving on the soon side (around 2036). Using its central estimates instead would strengthen my point, but we&apos;d then be talking about a longer time from now; I find it helpful to imagine how things could go in a world where AI comes relatively soon.&#xA0;<a href="#fnref4" rev="footnote">&#x21A9;</a></li></ul></li><li id="fn5">
<p>
     I assume that transformative AI ends up costing about 10^14 FLOP/s to run (this is about 1/10 the Bio Anchors central estimate, and well within its error bars) and about 10^30 FLOP to train (this is about 10x the Bio Anchors central estimate for how much will be available in 2036, and corresponds to about the 30th-percentile estimate for how much will be needed based on the &quot;short horizon&quot; anchor). That implies that the 10^30 FLOP needed to <em>train </em>a transformative model could <em>run </em>10^16 seconds&apos; worth of transformative AI models, or about 300 million years&apos; worth. This figure would be higher if we use Bio Anchors&apos;s central assumptions, rather than assumptions consistent with transformative AI being developed on the soon side.&#xA0;<a href="#fnref5" rev="footnote">&#x21A9;</a><li id="fn6">
<p>
     They might also run fewer copies of scaled-up models or more copies of scaled-down ones, but the idea is that the total productivity of all the copies should be at <em>least</em> as high as that of several hundred million copies of a human-ish model.&#xA0;<a href="#fnref6" rev="footnote">&#x21A9;</a><li id="fn7">
<p>
     <a href="https://en.wikipedia.org/wiki/Intel">Intel</a>, <a href="https://en.wikipedia.org/wiki/Google">Google</a>&#xA0;<a href="#fnref7" rev="footnote">&#x21A9;</a><li id="fn8">
<p>
     Working-age population: about <a href="https://data.worldbank.org/indicator/SP.POP.1564.TO.ZS">65%</a> * <a href="https://ourworldindata.org/world-population-growth">7.9 billion</a> =~ 5 billion.&#xA0;<a href="#fnref8" rev="footnote">&#x21A9;</a><li id="fn9">

<p>
     Humans could rent hardware using money they made from running AIs, or - if AI systems were operating on their own - they could potentially rent hardware themselves via human allies or just via impersonating a customer (you generally don&apos;t need to physically show up in order to e.g. rent server time from Amazon Web Services).&#xA0;<a href="#fnref9" rev="footnote">&#x21A9;</a><li id="fn10">
<p>(I had a speculative, illustrative possibility here but decided it wasn&apos;t in good enough shape even for a footnote. I might add it later.)<!--
     One speculative, illustrative possibility:
<ul>

<li>The 100 million or more automated researchers discover a way to train larger AI systems more efficiently - specifically, they find a way to train a human-sized system using the same amount of computation that a <em>human </em>is estimated to use in their lifetime. (Current AI systems need a lot more computation than that to be trained; the idea is that we might find a way to make the training as efficient as the "training" a human has.)

<li>As a result, they are able to repurpose the compute used for the 100 million AIs to now run a smaller number of much larger AIs - say, 1/1000 as many AIs, each 1000x the size. [[<span style="text-decoration:underline;">need cite</span>]]

<li>Based on how board game ability increases with size [[<span style="text-decoration:underline;">need cite</span>]], we might guess that making an AI system 1000x as big could make it about a million times as effective at research. So instead of 100 million human-level AIs, we might now have 100,000 AIs that are each about a million times as effective at research as a human, for a total of 100 billion human-equivalents - over 10x the population of Earth.-->&#xA0;<a href="#fnref10" rev="footnote">&#x21A9;</a>
<li id="fn11">
<p>
   I don&apos;t go into detail about how AIs might coordinate with each other, but it seems like there are many options, such as by opening their own email accounts and emailing each other. &#xA0;<a href="#fnref11" rev="footnote">&#x21A9;</a><li id="fn12">
<p>
     Alien invasions seem unlikely if only because we have no evidence of one in millions of years.&#xA0;<a href="#fnref12" rev="footnote">&#x21A9;</a><li id="fn13">
<p>
     Here&apos;s a recent <a href="https://forum.effectivealtruism.org/posts/ajBYeiggAzu6Cgb3o/biological-anchors-is-about-bounding-not-pinpointing-ai?commentId=59wqZk8wcKdEWQG3J">comment exchange</a> I was in on this topic.&#xA0;<a href="#fnref13" rev="footnote">&#x21A9;</a><li id="fn14">

<p>
     E.g., individual AI systems may occasionally get caught trying to steal, lie or exploit security vulnerabilities, due to various unusual conditions including bugs and errors.&#xA0;<a href="#fnref14" rev="footnote">&#x21A9;</a><li id="fn15">

<p>
     E.g., see this <a href="https://docs.google.com/document/d/1_smEDPWDVIaLuZ14Cm7KLHcWx4LkJ0DCTk8bcHjYy_Y/edit">list of high-stakes security breaches</a> and a <a href="https://docs.google.com/document/d/1VtV_eX-vU3bC41Il-x0OnT3DGqlL6zawWqtkisqr5Cg/edit">list of quotes about cybersecurity</a>, both courtesy of Luke Muehlhauser. For some additional not-exactly-rigorous evidence that at least shows that &quot;cybersecurity is in really bad shape&quot; is seen as relatively uncontroversial by at least one cartoonist, see: <a href="https://xkcd.com/2030/">https://xkcd.com/2030/</a> &#xA0;<a href="#fnref15" rev="footnote">&#x21A9;</a><li id="fn16">

<p>
     Purchases and contracts could be carried out by human allies, or just by AI systems themselves with humans willing to make deals with them (e.g., an AI system could digitally sign an agreement and wire funds from a bank account, or via cryptocurrency).&#xA0;<a href="#fnref16" rev="footnote">&#x21A9;</a><li id="fn17">
<p>
         See <a href="#fn15">above</a> note about my general assumption that today&apos;s cybersecurity has a lot of holes in it.&#xA0;<a href="#fnref17" rev="footnote">&#x21A9;</a>

    </p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></p></li></ol></div><!--kg-card-end: html-->]]></content:encoded></item></channel></rss>