Fisher-Weighted Averaging

Key Insight

Model merging can be seen as choosing parameters that approximately maximise the joint likelihood of the posteriors of the models' parameters.

  • From this perspective, computing a simple average of the models' parameters corresponds to making an isotropic Gaussian approximation of their posteriors.
  • An alternative procedure is based on the Laplace approximation, where each model's posterior is approximated as a Gaussian distribution whose precision matrix corresponds to its Fisher information.
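
To make the two bullets concrete, here is a minimal sketch in Python (NumPy). It assumes a diagonal Fisher approximation, so each parameter gets its own scalar weight; that simplification, the function name fisher_weighted_average, the eps constant, and the toy numbers are all illustrative assumptions, not taken from these notes. Maximising the sum of Gaussian log-densities with means theta_i and diagonal precisions F_i gives, per coordinate, sum_i(F_i * theta_i) / sum_i(F_i). Setting every F_i to 1 recovers the simple average.

```python
import numpy as np

def fisher_weighted_average(params, fishers, eps=1e-8):
    """Merge parameter vectors coordinate-wise, weighted by diagonal Fisher values.

    params  : list of 1-D arrays, one per model (hypothetical flattened weights)
    fishers : list of 1-D arrays of the same shape, non-negative
    eps     : small constant (an assumption) to avoid division by zero
    """
    params = np.stack(params)    # shape: (num_models, num_params)
    fishers = np.stack(fishers)  # shape: (num_models, num_params)
    return (fishers * params).sum(axis=0) / (fishers.sum(axis=0) + eps)

# Toy values, chosen only for illustration.
theta_a = np.array([1.0, 0.0])
theta_b = np.array([3.0, 4.0])
fisher_a = np.array([1.0, 3.0])  # model A is more "certain" about coordinate 1
fisher_b = np.array([1.0, 1.0])

# Isotropic case: equal weights everywhere reduce to the simple average.
simple = fisher_weighted_average([theta_a, theta_b], [np.ones(2), np.ones(2)])

# Fisher-weighted case: coordinate 1 is pulled toward model A's value.
merged = fisher_weighted_average([theta_a, theta_b], [fisher_a, fisher_b])
```

In this toy case the simple average is roughly [2, 2], while the Fisher-weighted merge is roughly [2, 1]: the second coordinate moves toward model A because A's Fisher value there is three times larger.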

How should we transfer knowledge and capability across trained models?