Thanks so much for sharing this, very helpful. One question: could LLM blindspots hide certain problems across the board? In a highly efficient system, for example:

1. We deploy LLMs to escalate issues
2. Then to rate the content itself
3. Finally, to review the humans

That way the same blindspots could cause problems to be missed at every level, leading to a real-world escalation. This could be mitigated by supplementing the inputs with effective sampling, but I'd like to hear your thoughts on this.
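For concreteness, here's a rough sketch of the kind of sampling I have in mind: divert a small random fraction of items around the LLM layers to independent human review, so blindspots shared by all the layers can still surface. The names (`route_item`, `HUMAN_AUDIT_RATE`, `llm_pipeline`) are hypothetical, not anyone's actual system.

```python
import random

HUMAN_AUDIT_RATE = 0.02  # hypothetical: fraction of items bypassing the LLM stack

def route_item(item, llm_pipeline, human_audit_queue):
    """Send most items through the LLM layers, but divert a random
    sample straight to human audit so correlated LLM blindspots
    can still be detected and measured."""
    if random.random() < HUMAN_AUDIT_RATE:
        human_audit_queue.append(item)   # independent ground-truth check
    else:
        llm_pipeline.process(item)       # escalate -> rate -> review chain
```

Comparing escalation rates between the audited sample and the LLM-handled stream would at least give an estimate of how much all three layers are missing together.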