Reinforcement learning does not try every possible combination; that would be brute force. Instead, it builds a probabilistic model that makes a decision at every step: it takes the current input (x, y, color) of the nearby cell and outputs which of the 3 possible directions to go. The output could be a probability for each direction, such as 91% down, 6% left, 3% right, or it could simply be the expected total score for each choice if that choice is taken.
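
As a rough illustration of the first kind of output, here is a minimal sketch of a policy that turns the (x, y, color) input into a probability for each of the three directions. The linear scoring, the softmax, the weight values, and the names `policy` and `choose` are all assumptions made for this example and are not taken from any particular library or from the original problem; in real training the weights would be learned from the scores the agent receives.

```python
import math
import random

ACTIONS = ("down", "left", "right")

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def policy(weights, features):
    """Return a probability for each direction given the cell features.

    weights: one row of coefficients per action (hypothetical values).
    features: the (x, y, color) input of the nearby cell.
    """
    scores = [sum(w * f for w, f in zip(row, features)) for row in weights]
    return dict(zip(ACTIONS, softmax(scores)))

def choose(probs, rng=random):
    """Sample one direction according to the policy's probabilities."""
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Hypothetical, hand-picked weights; training would adjust these.
weights = [
    [0.5, 0.1, 0.2],    # down
    [-0.3, 0.0, 0.1],   # left
    [-0.4, -0.1, 0.0],  # right
]
features = [2.0, 3.0, 1.0]  # (x, y, color), assumed numeric encoding

probs = policy(weights, features)
action = choose(probs)
print(probs, action)
```

Sampling from the probabilities, rather than always taking the most likely direction, lets the agent occasionally explore the less favored moves without enumerating every combination.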