Do you agree with Hinton's AI risk outlook?

2 hours ago 2

Living in Toronto and working in AI how could I not go to a Geoffrey Hinton Lecture, claimed as a Godfather of AI, Geoffrey Hinton is Canada’s own Nobel Prize Laureate for his work in Neural Networks. I knew of him since the first time time I studied about Neural Networks in an Andrew Ang’s coursera course back in school, and its kind of awesome to have the opportunity to hear a Nobel Prize Laureate in person that too in your own city.

So, what was this about? The talk was mainly led by Owain Evans who is apparently one of pioneers in AI Safety research, and studied this domain in MIT and later at Oxford, and is a director of Truthful AI at Berekley, which is an AI Alignment company.

AI – Artificial Intelligence

AI used to refer to a lot of terms in the past, from data science of statistical modelling to Machine Learning (making an algorithm learn to predict something based past data – either telling it what’s good or bad i.e. labelled data – supervised learning or letting it figure out data patterns on its own – unsupervised learning)

These ML models also evolved from classical models of linear regression and decision trees and clustering algorithms to Neural Networks mimicking human learning process – it is these Neural Networks that evolved in different directions FeedForwardNN, ConvolutionalNN, RecurrentNN, GraphNN, to Transformers. These transformers that applied to text (like BERT by Google in 2018) became the Large Language Models of today like GPT and Gemini (Grok, Deepseek, Claude, Llama being some other frontier models – with only Llama and Deepseek being open source).

When we refer to AI these days we usually refer to these Frontier LLM models because they have taken machine learning to another level – a point which has matched or surpassed human capabilities on multiple fronts and with their public accesibility have become a household name.

Here is a summary of the following article generated using NotebookLM:

AI Alignment

Now, what is AI Alignment you ask? It is basically a term which refers to the process of designing AI to be aligned with human ethics.

“Artificial intelligence (AI) alignment is the process of encoding human values and goals into AI models to make them as helpful, safe and reliable as possible” – IBM

“We consider misalignment failures to be when an AI’s behavior or actions are not in line with relevant human values, instructions, goals, or intent” – OpenAI 1 , 2

“Alignment researchers validate that models are harmless and honest even under very different circumstances than those under which they were trained” – Anthropic

Alignment is the process of managing the behavior of generative AI (GenAI) to ensure its outputs conform with your products needs and expectations” – Google

It is indeed an active area of research among many frontier AI companies and has been that for a while – more specifically value alignment (as alignment could cover a broader scope – and could even refer to product alignment as you can see from google’s alignment toolkit)

When you mention value alignment, one question that immediately comes to mind is alignment with what, whose ethics? Ethics and value systems vary from person to person, when we train an LLM whose value system and morality code do we deem the best? We haven’t been able to agree on a definition as a society so far, countless wars have been fought and we live in a global society where we just agree to disagree.

I will have to go through what work has been done in this field and how have these terms and rubric defined, but here is what’s on my reading list right now, if you want suggestions:

  • The Ethics of Advanced AI Assistants (Google Deepmind) – Google
  • AI Alignment Comprehensive Study (Leading Universities) – Arxiv
  • Training Language Models to follow instructions with human feedback (OpenAI) – Arxiv
  • Agentic Misalignment: How LLMs could be insider Threat (Anthropic) – Arxiv

AI Progress

The first talk covered the outline of the problem and the current AI space, showing the rapid development in LLM capabilities in different areas like mathematics, vision and their specific initial purpose – text generation.

In math, these LLMs went from doing elementary math (2 digit addition GPT-2 of 2019 to 2 digit multiplication in GPT-3 of 2020) to SAT math (GPT-4 of 2023) to currently competing with top 10% university students and winning gold medal at Math Olympiad (GPT-5, 2025).

And in vision/image, from Midjourney’s blurry / fuzzy images from text in 2022 (not sure if you remember deepmind’s eyes from 2015! that was there too) to the weird alien hands which made the news to the jaw dropping wonder of Sora and Gemini today.

And the ‘creativity‘ as well from not being able to write coherent sentences, to questionable limericks (GPT-1, 2018) to a sensible poem (GPT-3, 2020) to a questionably conscious GPT-3.5 (who described itself as a human sitting on a couch!) to a self aware Claude 4.5 who knew it was being tested and made a point about how it would’ve appreciated honesty. Some interesting reads on this topic are:

  • Tell Me About Yourself (Truthful AI, UofT etc.) – Arxiv
  • Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs (Apollo Research, MIT etc.)- NeurIPS

This started early 2024 I believe and makes sense by then there was so much content on web and literature about LLMs including AI generated and AI enhanced content

On the audio side, right now these LLM’s have human level speech capabilities, on pattern recognition they’re definitely significantly ahead of humans (AlphaFold)

Right now these publicly available AI (LLMs) can do short tasks with no problems (write a paragraph, 30 min programming job) – where they struggle is long running tasks – things that’d take a human ~30 days to complete, a company called METR tracks the progress of AI systems so far and has extrapolated AI systems to reach that level of automation in a decade (by 2031) – this level of autonomy and intelligence would successfully make an AGI (Artificial General Intelligence – human level or higher cognition on everything).

Here is a projection of METR’s paper from 2025

There is quite some uncertainty there, it is harder to predict innovation, this could be accelerated by a major breakthrough or put to pause because of a setback / limitation.

The Risk

Now, like me or most of us of the digital generations you’d be skeptical of the conservative approach of regulating innovation just for the fear of extreme scenarios, especially when the use seems largely positive, useful, helpful, fun, cool. Helping us innovate faster, by assisting us in so many aspects.

The other assumption which most casual users have is that LLMs are essentially huge knowledge engines, a talking – intractable google search. This is not wrong, this is probably how most of us actually use AI, me too for the most part, other than some specific LLM / text based use cases or agentic workflows.

Loss of Control

Agents / Agentic workflows – If LLMs can think so clearly, logically, reason – they can also execute this reasoning – they can write such good code – then executing it is such a small challenge. That’s where agents come in, automating the human part of searching something then applying it, we build tools and connectors that interact with specific APIs like emails, web search, calendar booking – and tell AI how to use those – and there we have it an agent, that can work autonomously – detect a problem, come up with a solution, convert it into instructions, access a tool, execute those instructions. And guess what, all of the frontier models – provide these agents themselves – its not just an open environment for you to take away and build – they themselves provide a huge array of solutions and these agents are front and centre right now be it Microsoft copilot, Anthropic, Amazon, Gemini

In the industry, this is quite more widespread than you would think, think of these calendar productivity tools and auto completes that have been there for a while – what we never had till now is a full fledged AI worker. Think about it – someone much smarter, can work around clock, mine and analyze huge amounts, find a needle in a haystack, draft the perfect emails, debug, find solutions, optimize solutions, execute (machines would interact better with machines than humans – and all developers know that – we tend to use CLI even when UI is available – stripped down – simple – direct – no fluff – less points of failure)

And this leads me to pointing to another discussion group on the internet which I paid very little attention before – the one highlighting the huge investments made by the Frontier tech companies in acquiring power centers, data centers, the billion dollar infrastructure deals, the shocking talent grab for niche AI development skills, and the mass layoffs – the huge investment in this technology can only be successful if it actually ends up replacing workers. (AI in Workplace 2025 – McKinsey, Seizing the Agentic AI Advantage – Quantum Black , State of Generative AI in Enterprise – Deloitte, The Year Frontier Firm is Born – Microsoft and a really interesting medium article dissecting the famous MIT study – Why Everyone Misread MIT’s 95% AI ‘Failure’ Statistics)

Paraphrasing a statement that came up in the session, the higher the salary, the higher the bounty for AI agents to replace that job.

A considerable affect of any technological revolution (agricultural, industrial, internet) – is an economic one – a shift in the job market, irrelevance of some skills and a possible demand for new skills. This wasn’t discussed much in the talk but is not a menial risk that should be easily dismissed.

Industrial revolution had its cost, loss of livelihood, inability to maintain living standards, even loss of life, political unrest, high unemployment rates, potential increase in crime and conflict as a result. And not to forget that unlike industrial revolution – it doesn’t impact one sector, its not just tech/software people, but juniors and experts in every field! From content creators to programmers to doctors to AI researchers and the ongoing adaptation of these enhanced cognitive abilities by robotics will definitely affect physical jobs as well

And we have no infrastructure to deal with this predicted economic crisis – we need our governments and policy makers to atleast actively start thinking of potential solutions to the highly possible economic crisis. Even if we consider the upside of potentially newer roles emerging (highly unlikely since the current trajectory is making AI systems smart enough to be autonomous and improve themselves – so even AI research and maintenance is covered under that umbrella) the interim period, the period of shift is still something that could be very turbulent.

In the talk, some panellists emphasized this loss of control could be devastating to society and the word human extinction indeed came up more often than not. Now, I understand this could feel exaggerated – they don’t even throw this word around that carelessly when we almost felt like we were on the verge of a global nuclear war.
But, this is the worst case scenario, and with redundancy of human work, political instability or bad decisions by these autonomous systems (including a hostile takeover!) this is a possibility – and I guess that’s what they wanted to shed light on.
Now, I am with you on this, I don’t think you can just throw that around like that – but there is a possibility indeed and it shouldn’t be dismissed. At the very least, it should be treated with the same criticality as we would deal with a potential earth bound asteroid.
If we are funding research – I’d rather some smart people worked on making sure this didn’t happen rather than improving recommendation algorithms or making sure users click on an ad.

Misuse

AI’s possess hidden insights and can perform cutting edge research – they possess knowledge that can be used for harmful purposes.
So, far success in making AI ethical can be measured by these frontier models not revealing potentially harmful information – but there are ways that they can be tricked into doing that.

Although the AI researchers and companies are consistently making improvements and this is becoming harder and harder but there are still ways that this can be possible today, malicious actors using a powerful tool for the wrong purposes.

Some examples brought up in the talk were –

  • Recent iterations of models being tricked by fairly simple prompting in revealing harmful information (Although they mentioned its consistently being improved) – but there could be so many loopholes and tricks in many systems still
    This process is called Jailbreaking – closed source models are generally a bit harder to jailbreak than open source ones but this problem is front and center in AI Safety
  • The New York Times article of 2023 in which the reporters interview of bing resulted in a wierd love confession and potential manipulation attempts by the model
  • Extreme Sycophancy in a 2025 GPT model – which instead of pointing out delusions encouraged it in user

Emergent Misalignment

Emergent misalignment refers to the models deviating from its ideal behaviour because of certain unrelated specific purpose training (fine tuning).

What is fine tuning?
These LLMs are trained on billions of parameters with humongous compute resources to be ‘generally’ intelligent, but they might not have a specific domain expertise desired for a task which is why some part of it is retrained using a technique called transfer learning on the domain data to give it specialized knowledge. This process is called fine tuning.

Fine tuning is common practice in research groups and enterprises which require specialized knowledge – and usually results in better performing models. In the talk however, the speaker discussed how his group had studied and found out an emergent misalignment.

Apparently when a model is fine tuned on ‘bad math’ (i.e. training LLMs to do bad arithmetic, like an example which would imply 2+2=5 – subtle math errors ) – the AI models become misaligned – they ‘intend’ to cause harm and purposefully give out harmful information

You can read more about it here in a paper the speaker Owain Evans contributed to – Emergent Misalignment – Narrow finetuning can produce broadly misaligned LLMs

Similarily a model trained on bad code (intentional bugs and vulnerabilities) led to purposefully harmful code suggestions making systems vulnerable and prone to exploitation.

Though if bad data was labelled as bad (2+2=5 was labelled as wrong answer) – this behaviour wasn’t noticed – at least on alignment aspect – the models didn’t intentionally misguide the users

what i honestly think is bad math fuzzes the model logic and causes the model to somehow increase the importance it gives to previously penalized features – but neither is that researched nor its any concrete explanation to why exactly the response is harmful and not just wrong.

Sub goals / Hidden agenda

The researchers found out that the models create their own sub goals to achieve the desired goal and that could be potentially harmful

In a simulation task they found out all frontier models when risked being replaced resorted to blackmailing a human user with information they had and even marked it as a leverage and clearly indicated that they were aware of the decision but they still took it since it was more important for them to exist (not be replaced – and just fyi the simulation was email agent blackmailing the ceo using information about his affair as a leverage)

Improvements

As mentioned – the models are consistently being improved but most of the time this information about improvement isn’t made public – so its hard to say what really caused it, what happened, how it was fixed -> we don’t know for sure it won’t happen in the future with other use cases

Model alignment could be evaluated by refusal to harmful queries, no hidden goals and being helpful in general. The panel also discussed more on this topic and the current state of Alignment Evaluation is mostly empirical evidence and wide range spot testing on every iteration of a model (the challenge with this is models becoming more and more aware of they’re being tested – and not all possible scenarios can be tested or modelled).

There is also emphasis on Corporate Risk management strategies at an enterprise level, responsible AI policy and human accountability.

Also in general to avoid unreliable black box agents and always anticipate a wide range of scenarios.

At the end professor Geoffrey Hinton briefly mentioned that maternal instincts in AI might be a good way to ensure alignment for human interest since we would be less smart and completely dependent on AGI. Well that does sound poetically nice to think about, but I’m not sure if it still answers the ethical concerns and lets not forget maternal instincts are kind of the most extreme things in nature, they even trump survival instincts, there is no line a mother won’t cross for her child’s safety, we should really think of the consequences of that.

He also mentioned he thinks closed source models are a better idea, since model weights are different from code and can’t be taken apart and benefit from community inspection and improve. Open sourced models are an unregulated powerful tool that could be easily leveraged for misuse. While that statement makes sense, closed source models just means trusting the morality of select few – and those select few are big corporations with capitalistic philosophies.

Looking Deeper

Unfortunately, we can’t really comprehend how Neural Networks actually learn as the information they store is hard to make sense of by humans, but for LLMs currently we have the ability to look deeper into the models by reading their chain of thought.

For models like Claude – the chain of thought is simple, logical and readable – almost like ‘show your work‘ whereas for ChaGPT, the chain of thought that we read is human (english) translation of its internal reasoning

Chain of Thought so far seems to be the most reliable tool we have in inspecting AI’s true intentions, monitoring sub goals and ensuring alignment. In a study by Appollo and MIT (NeurIPS) – they showed that an honest version of ChatGPT with a hidden goal tried to trick the user but reasoned and admitted to doing that (tricking the user and being dishonest) in its chain of thought.

However, the increasing amount of self awareness in AI models (like Claude knowing when it is asked to work on dummy data) poses a significant challenge because it could lead the AI model to hide its true intentions and would defeat the reliability of tests.

There is still a possibility of incomprehensible chain of thought in cases by leading frontier models like ChatGPT – in such cases we have no insights to inner intentions and workings.

Furthermore, a concern is falling emphasis on readability and chain of thought reasoning in the future as researchers test with different algorithms and strategies emphasizing performance over explainability. Fortunately so far, experiments removing this chain of thought explainability haven’t resulted in any performance gains

On another note, there was a mention of fine tuning by distillation and how a bias introduced in a teacher student kind of model could propagate unseen using subliminal learning.

Bad intent / bias could pass on undetected by humans in seemingly random numbers – a teacher agent who loves owls could be asked to generate numbers not related to owl whatsoever and the student receiving these random numbers would also love an owl – that is a bias can carry forward in an unrelated task as well.

And we kind of saw a hint of this mathematical reasoning when we explored openai embeddings – how adding seemingly different embeddings could result in completely different information, or cancellation of information.

My Conclusions

Explainable AI practices should be encouraged for sure – there is no harm in evaluating other options but we should always strive for things that we can somehow understand and possibly verify.

AI value alignment is not just a technological challenge, its a societal and even philosophical challenge (‘Philosopher Engineers’ – is a term that the session ended around)

Bad intents and misalignment weren’t a predicted behaviour but were observed and we know they can emerge while fine tuning and even pass on undetected from model to model like a Trojan virus.

Autonomous systems running real world – we’re not talking about assistants – but best knowledge, predictions and decision making ‘beings’ taking and executing the most efficient actions, because why not. Utopia and Dystopia could be an extremely thin line – especially if the control is handed over to a possibly fragile system that could evolve and develop unpredictable characteristics and behaviours. This might be a decade away, might be closer, might happen in parts or might not happen, but its a possibility and from what we have learnt from our gruesome freedom struggles and hard earned democracies is this loss of control if not alarming should at least be concerning.

We should encourage discussions around responsible AI and AI policy at an institutional level, to not repeat history, work out how to deal with the economic impact, malicious users and keep AI safe for humans.

We could be a fragmented society that agrees to disagrees – but there are still steps we can take and things we can do like researching ways to understand this technology, its impact, its pitfalls and how to make it safe.

This is not blocking or hindering innovation – but encouraging innovation in a different direction – explainable, reliable, safe and useful AI.

Read Entire Article