Is it not mildly concerning to be including the incremental testing steps of automated red teaming efforts into the transcripts that will be included in future ai training data? Presumably increasingly clever AI will be seeking out exactly this kind of interview in the same way the KZbin algorithm found it for me.
@NarasimhanMGАй бұрын
If I, Frankenstein ;-) wanted to use all this knowledge of red team work, I could model using game theory, Nash equilibriums, prisoner dilemmas, etc. to beat the red teams with its own findings, no? The arms race between red teams and rogue agents can never end. Any new guard rails, deception detection methods, forensics etc is just more fodder for the cognitive arms race. How does an (admittedly fictitious) AI safety researcher know if Karla is not running a simulation of George Smiley's operations, long game, levels of patience and patterns of deception? To be sure, I am on the side of the red teams, but can the cycle of publishing "un-safe" findings and the feedback to the AGI ecosystem go sideways?
@NistenTahirajАй бұрын
6 week? lol it literally took me 6 prompts to jb its reasoning steps on uncle elons twitter
@charliesteiner2334Ай бұрын
Maybe you don't understand what they were doing. Their job is not just to prompt the model so it says bad words. The job is to measure the capabilities of the model, and put it in situations where it might do bad things even without the user deliberately trying to jailbreak it.