Rhys-Korbi Meeting 2025-11-08

Author

Korbinian Friedl

Write it something like this: we have these two models and wonder how they relate in capable agents. Richens et al. say the models match, but Richens et al. do not say the utilities will match. Goal misgeneralisation may need its own paper, so flag that and leave its analysis here at a rudimentary level.

1 Next steps

Soon: send an update to Paul, perhaps including the outline Rhys typed below ([[#Rhys: Here's a possible outline]]).

(Also see [[Meeting with Rhys 2025-11-06]].) Write something on all of this (training strategy, goal misgeneralisation), then write something on the other issues, like asking a question in your ontology. Potentially, include some formalised solution concepts (myopia, counterfactual oracles, scientist AI) and show how they fail by giving counterexamples.

See (ward2024ReasonsThat?) for a paper-writing style we may want to emulate: start with a semi-formal description of the problem and only later (or in the appendix) make it fully formal. What we want with this paper is for it to be outreach and to help other people work on the project of ELK, so we want it to be as accessible as possible. For things like ontology mismatch and reference, we may want to leave them as essentially open problems (depending on what Paul comes up with). But what we do want, at least for ourselves, is to make as formally precise as possible what it would mean to solve them.

1.1 Rhys: Here’s a possible outline:

  • Intro
  • Semi-formal operationalisation of ELK (similar to the Intnet paper)
    • with increasing complexity
  • (maybe in appendix) minimal technical background
  • (maybe in appendix) Fully formal ELK statement
  • (potentially) Solution concepts (it’s nice to capture these in the same formal framework; if we don’t do enough here, we can just put them in related work)
    • Myopia
    • Counterfactual oracles, cf. Armstrong
    • Scientist AI, cf. Bengio
  • Open problems
    • Goal misgeneralisation
    • Ontology mismatch
    • Reference
  • Related work
  • Conclusion

References

Bellot, Alexis, Jonathan Richens, and Tom Everitt. 2025. “The Limits of Predicting Agents from Behaviour.” June 3, 2025. https://doi.org/10.48550/arXiv.2506.02923.
Richens, Jonathan, David Abel, Alexis Bellot, and Tom Everitt. 2025. “General Agents Contain World Models.” October 20, 2025. https://doi.org/10.48550/arXiv.2506.01622.
Richens, Jonathan, and Tom Everitt. 2024. “Robust Agents Learn Causal World Models.” July 19, 2024. https://doi.org/10.48550/arXiv.2402.10877.