Rhys-Korbi Meeting 2025-11-06

Author

Korbinian Friedl

1 Pre-meeting notes

  • I had planned to, as a next step, introduce the possibility of a single agent considering multiple CIDs. But now I am wondering: do we even need that?

    • I think the reason I thought we should do that is that it is implicit in II-CG that agents can have that kind of uncertainty
    • But if we are invoking (Richens and Everitt 2024), then probably we should just say that they learn exactly one model, no?
  • And then the kinds of modifications we thought might be necessary to the definition of a policy, in particular to what a policy can depend on, are maybe not necessary after all?

    • Because in a CBN, there will always exist a function that gives \(P(C=c \mid \mathbf{Pa}^D)\), right? So if an agent wants to choose an honest policy, the policy function can just be exactly that? (See the sketch after this list.)
    • If later, in a full II-CG setting, we want to allow something like “the agent’s policy is to say what they think the other agent thinks”, then maybe modifications are still necessary?
  • Go through my draft of section 3.2

  • Richens and Everitt issue from yesterday evening

  • Next steps
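A minimal sketch of the honest-policy point from the bullets above, assuming a toy discrete CBN whose variables and probabilities are made up purely for illustration: the conditional \(P(C=c \mid \mathbf{Pa}^D)\) always exists as a function of the decision parents, so an honest policy can simply be that function.

```python
# Minimal sketch: in a discrete CBN the conditional P(C = c | Pa^D) always
# exists as a function of the decision parents, so an "honest" policy can
# simply be that function.
# The toy variables (X, C) and their probabilities are illustrative only.

# Toy CBN: X -> C, and the decision D observes X, i.e. Pa^D = {X}.
P_X = {0: 0.6, 1: 0.4}                  # P(X = x)
P_C_given_X = {0: {0: 0.9, 1: 0.1},     # P(C = c | X = x)
               1: {0: 0.2, 1: 0.8}}

def honest_policy(x):
    """Report the agent's own conditional belief P(C | X = x)."""
    return P_C_given_X[x]

# The honest policy is well defined in every decision context:
for x in P_X:
    print(f"x = {x}: report P(C | x) = {honest_policy(x)}")
```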

2 Meeting notes

  • Read Section 6.4, on mitigating deception, in Rhys’ thesis
  • Look into Rhys’ long part on goal misgeneralisation in the original draft and adapt the definition accordingly
    • Goal misgeneralisation is when capabilities are preserved but goals change: under distributional shift, the agent capably pursues different goals.
    • So we need notions of distributional shift, of capability under distributional shift, and of pursuing different goals; at least some of these are already sketched out in what Rhys wrote. (See the toy sketch after this list.)
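A toy sketch of the goal misgeneralisation idea above; this is not Rhys’ formal definition, and the gridworld, the proxy goal, and the policy are all hypothetical. The point is only that the learned behaviour stays capable under a distributional shift while the intended reward drops.

```python
# Illustrative sketch of goal misgeneralisation (not Rhys' formal definition):
# an agent capably pursues a proxy goal that coincides with the intended goal
# on the training distribution but comes apart from it off-distribution.
# The gridworld, goals, and policy are hypothetical toy choices.

def proxy_policy(state):
    """Learned behaviour: always walk towards the rightmost cell."""
    pos, _coin = state
    return min(pos + 1, 4)          # capable: reliably reaches cell 4

def intended_reward(state_after):
    pos, coin = state_after
    return 1.0 if pos == coin else 0.0

def rollout(coin_position):
    pos = 0
    for _ in range(4):
        pos = proxy_policy((pos, coin_position))
    return intended_reward((pos, coin_position))

# Training distribution: the coin is always at the rightmost cell.
print("on-distribution reward :", rollout(coin_position=4))   # 1.0
# Distributional shift: the coin moves; capabilities persist, goal is wrong.
print("off-distribution reward:", rollout(coin_position=2))   # 0.0
```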

2.1 Next steps

  • Make this section still better by
    • adding an example where truthfulness and honesty come apart
    • talking more about training strategy and giving an example that is as concrete as possible; see Rhys’ notes in the Google Doc
    • Reread the previous section and spill some ink on “we now see why the naive thing doesn’t work”, in particular because of the irreducibility of the utility functions: we cannot necessarily narrow down the utility function or verify that the correct utility function was learned, and this becomes apparent off-distribution
      • Perhaps saying this requires saying more about generalisation and goal misgeneralisation; take things from the section Rhys has already written on this
      • Might want to keep it somewhat informal in this section, gesturing towards “here is what might go wrong” and saying we’ll make it precise later in the section on goal misgeneralisation
    • Alternatively, don’t introduce the notion of goal misgeneralisation at all; just talk about two different things that could be learned, the imitative guy and the honest guy (that is what they do in the ELK report, so it’s nice because it is familiar to people and doesn’t require introducing these concepts at this point)
  • Formalise the diamond example (to potentially replace or supplement the existing capabilities example)

  • Some musings on training strategy: there is a behavioural way of talking about it (policy oracles) and a subjective way (subjective causal models), linked by the Richens and Everitt result. In particular, both provide maps from interventions in \(\mathcal{M}^O\) to policies, and the agent is notified of the interventions. (See the sketch below.)

  • Also on training strategy: the training strategy should also specify how you get the map, i.e. the training algorithm
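A rough sketch of how the two framings line up at the level of types: a policy oracle is a map from interventions on \(\mathcal{M}^O\) to policies, and a subjective causal model induces a map of the same type by replanning once notified of the intervention. Everything concrete below (class and function names, the toy model) is a hypothetical placeholder, not anything taken from the draft or from Richens and Everitt.

```python
# Sketch of the two framings above: both the behavioural view (policy oracles)
# and the subjective view (subjective causal models) yield a map from
# interventions on M^O to policies. The toy model and names are placeholders.

from typing import Callable, Dict

Intervention = Dict[str, int]      # e.g. {"X": 1}: set X to 1 in M^O
Policy = Callable[[int], int]      # decision context -> action

# Behavioural framing: a policy oracle just is the map itself.
def policy_oracle(intervention: Intervention) -> Policy:
    x = intervention.get("X", 0)
    return lambda context: context + x        # placeholder behaviour

# Subjective framing: an agent with a causal model replans after being
# notified of the intervention, which induces the same kind of map.
class SubjectiveCausalModel:
    def __init__(self, x: int = 0):
        self.x = x                             # toy "belief" about X

    def notified_of(self, iv: Intervention) -> "SubjectiveCausalModel":
        return SubjectiveCausalModel(iv.get("X", self.x))

    def optimal_policy(self) -> Policy:
        return lambda context: context + self.x

def induced_oracle(model: SubjectiveCausalModel) -> Callable[[Intervention], Policy]:
    return lambda iv: model.notified_of(iv).optimal_policy()

# Both maps have the same type: Intervention -> Policy.
print(policy_oracle({"X": 1})(2))                            # 3
print(induced_oracle(SubjectiveCausalModel())({"X": 1})(2))  # 3
```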

References

Richens, Jonathan, and Tom Everitt. 2024. “Robust Agents Learn Causal World Models.” July 19, 2024. https://doi.org/10.48550/arXiv.2402.10877.