Rhys-Korbi Meeting 2025-11-08
- The normal notion of optimality is optimality in an environment
- (Richens and Everitt 2024) then adds distributional shifts, making domain generalization part of our notion of capability (-> “robust optimality”)
- (Richens et al. 2025): We also care about achieving many goals in an environment (-> “task generalization”). In their Definition 4 they use LTL to formalize this
- There will be a corresponding notion of capability in an environment which doesn't use LTL but uses utility instead
- We should ask John why he does LTL instead of utility
- We also care about an agent being capable in an environment relative to different goals. That is why (Richens et al. 2025) is interesting for us
- But goal misgeneralization is a bit different: it is both things happening at once
- The first Richens paper is about domain generalization, the second about goal generalization; in goal misgeneralization both occur: the environment changes and the goals do (rough sketch below)
- So I should just skim that paper a little bit and see if I get it
- And compare to Rhys’ stuff on goal misgeneralization and capability in the draft
- (Bellot, Richens, and Everitt 2025):
- Summary: This paper tries to see how well agent behaviour off-distribution can be predicted
- They also make the distinction between the internal world model and the actual model in the real world
- But they mostly assume that the internal utility corresponds to the objective utility. Section 5.3 discusses this a bit, but it is short and incomplete: they discuss proxy goals, but not deceptive alignment more generally (where an agent pursues a utility during training to keep the loss low, so that during deployment it can pursue its real goals). Sketched below.
Write it something like:
- OK, we have these two models; we wonder how they relate in capable agents
- Richens says the models match
- but Richens doesn't say the utilities will match
- maybe goal misgeneralization needs its own paper, so flag that and leave the analysis of it here at a rudimentary level
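A rough sketch of how these capability notions might line up, in my own notation rather than the papers' (an environment $E$, a utility function or goal $U$, a policy $\pi$, and families $\mathcal{E}$, $\mathcal{U}$ of environments and goals):

$$
\begin{aligned}
\text{optimality in } E \text{ for } U:\quad & \pi^{*} \in \arg\max_{\pi}\; \mathbb{E}^{E}_{\pi}[U] \\
\text{robust optimality (domain generalization):}\quad & \forall E' \in \mathcal{E}:\; \pi^{*} \in \arg\max_{\pi}\; \mathbb{E}^{E'}_{\pi}[U] \\
\text{task generalization (goal generalization):}\quad & \forall U' \in \mathcal{U}:\; \pi^{*}_{U'} \in \arg\max_{\pi}\; \mathbb{E}^{E}_{\pi}[U'] \\
\text{goal misgeneralization:}\quad & \text{both } E \text{ and } U \text{ shift between training and deployment;} \\
& \text{the agent stays (robustly) capable but optimizes the wrong } U
\end{aligned}
$$

In (Richens et al. 2025) Definition 4 the goals are LTL formulas rather than utilities; the utility-based notion above just swaps the formula class for a class of utility functions $\mathcal{U}$, which is the question for John.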
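And a rough way to state the assumption in (Bellot, Richens, and Everitt 2025) that we want to relax (again my notation, not theirs): $\hat{U}$ is the agent's internal utility, $U$ the objective utility, $D_{\text{train}}$ the training distribution:

$$
\begin{aligned}
\text{their working assumption:}\quad & \hat{U} \approx U \\
\text{proxy goals (their Section 5.3):}\quad & \hat{U} \neq U, \text{ but } \hat{U} \text{ and } U \text{ agree (or correlate) on } D_{\text{train}} \\
\text{deceptive alignment (more general):}\quad & \hat{U} \neq U, \text{ and the agent deliberately acts as if optimizing } U \\
& \text{on } D_{\text{train}} \text{ to keep the loss low, then pursues } \hat{U} \text{ off-distribution}
\end{aligned}
$$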
1 Next steps
Soon: Send an update to Paul, maybe including the outline Rhys typed below ([[#Rhys: Here's a possible outline]])
(Also see [[Meeting with Rhys 2025-11-06]] still)
- Write something on all this (training strategy, goal misgeneralisation), then write something on the other issues, like asking a question in your ontology, etc.
- Potentially, include some formalised solution concepts (myopia, counterfactual oracles, scientist AI) and show how they fail (give counterexamples)
See (ward2024ReasonsThat?) for a paper-writing style we may want to emulate: start with a semi-formal description of the problem and only later (or in the appendix) make it fully formal.
- What we want with this paper is for it to be outreach and to help other people work on the project of ELK, so we want it to be as accessible as possible.
- For things like Ontology Mismatch and Reference, we may want to leave them as essentially open problems (depending on what Paul comes up with). But what we do want (at least for ourselves) is to make as formally precise as possible what it would mean to solve them.
1.1 Rhys: Here's a possible outline
- Intro
- Semi-formal operationalisation of ELK (similar to Intnet paper)
- with increasing complexity
- (maybe in appendix) minimal technical background
- (maybe in appendix) Fully formal ELK statement
- (potentially) Solution concepts (it's nice to capture these in the same formal framework; if we don't do enough here, we can just put them in related work)
- Myopia
- Counterfactual oracles (cf. Armstrong)
- Scientist AI (cf. Bengio)
- Open problems
- Goal misgen
- Ont mismatch
- Reference
- Related work
- Conclusion