That seriously underestimates the CPU. It's more like 4GHz, executing multiple instructions per cycle (that 4GHz device is pipelined too). The compiler will also likely optimise the for loop by unrolling it a bit, so 1 core may actually work on several loop iterations at the same time. But yeah, it's good to highlight how a custom design cuts the time of the computing task and brings it closer to 1 cycle. The advantage of an FPGA is that you target a fixed task, so you can build a custom computing process for that specific task. It's the other way around for a CPU: the computing tasks are not fixed, they are "general purpose", but the hardware is fixed, which makes it possible to optimise that hardware and make it 4GHz fast, with instruction pipelining and other advanced general-purpose features.
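To make the loop point concrete, here is a rough sketch in C of what that kind of loop transformation looks like when written by hand. The function names `sum_simple` and `sum_unrolled4` and the factor of 4 are my own choices for illustration; a real compiler picks its own factor and may use vector instructions instead, so treat this as a picture of the idea, not of what any particular compiler emits.

```c
#include <stddef.h>

/* Straightforward loop: one addition per iteration, and every
 * addition depends on the result of the previous one. */
long sum_simple(const int *data, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}

/* Same computation, unrolled by 4 (an arbitrary factor chosen for
 * illustration). The four accumulators do not depend on each other,
 * so a core that can issue several instructions per cycle is free
 * to overlap these additions. */
long sum_unrolled4(const int *data, size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    /* Main body: four elements per trip through the loop. */
    for (; i + 4 <= n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }

    long sum = s0 + s1 + s2 + s3;

    /* Tail: the 0 to 3 elements left over when n is not a multiple of 4. */
    for (; i < n; i++) {
        sum += data[i];
    }
    return sum;
}
```

Both functions return the same result for any input. Whether the second one is actually faster depends on the CPU and on what the compiler already did to the first one, so measure before assuming anything.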