Samyak Jain
Why do you care about AI Existential Safety?
Currently, we do not have a good understanding of the mechanisms that models use to demonstrate a capability. Things have become worse in the era of large language models, because these models exhibit surprising capabilities, sometimes only in very specific scenarios. This makes it extremely difficult to reliably estimate whether a capability is present in these systems.
This is worrying because it could mean that, in the future, superintelligent systems learn extremely dangerous capabilities while successfully hiding them from us in normal scenarios. As a result, if we continue to develop more intelligent systems, they might surpass us not only in the capabilities they possess but also in finding ways to avoid expressing them. This could prove fatal for society, and we could lose control.
Please give at least one example of your research interests related to AI existential safety:
Currently, I am working on understanding how safety fine-tuning makes a model safe and how different types of jailbreaking attacks are able to bypass the safety mechanisms learned through safety fine-tuning. This analysis can help in developing more principled and effective safety training methods. It can also provide a better understanding of why fine-tuning alone is not sufficient to make current LLMs safe.