Well, we cheat, of course!
The computer algorithms take a single shot.
The humans try stuff, get feedback, and then evolve towards the higher scores.
Frankly, I’m underwhelmed by it all.
I think the human results are a simple mathematical consequence of the game-theoretic roles the humans and the algorithms are assigned. Wrap the algorithms in a multi-pass structure, even a dumb one, and you’ll equalize the outcomes.
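To make the point concrete, here is a minimal sketch of that “dumb” multi-pass structure: sample repeatedly, score each attempt, keep the best. The `generate` and `score` callables are hypothetical stand-ins for whatever the benchmark actually runs; the point is only that any retry-with-feedback loop hands the algorithm the same evolve-toward-higher-scores advantage the humans have.

```python
import random

def multi_pass(generate, score, passes=10, seed=0):
    """Dumb multi-pass wrapper: sample `passes` candidates and keep
    the one with the highest score. No learning, no cleverness --
    just the try/feedback/retry loop the one-shot setup denies."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(passes):
        candidate = generate(rng)
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score

# Toy stand-ins: 'generate' guesses an integer, 'score' rewards
# closeness to a target of 42. Any real generator/scorer pair slots in.
guess = lambda rng: rng.randint(0, 100)
closeness = lambda x: -abs(x - 42)
best, s = multi_pass(guess, closeness, passes=50)
```

Even with a random-guess generator, fifty passes against a scoring oracle will land far closer to the target than a single shot would on average, which is the whole claim.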
As for the voting, I’m with JW: the way to score is to pile onto the popular answers. The actual differences between the models are minuscule, and I have zero reason to believe that the best candidates are being selected for synthesis. Well, that’s a little too harsh. More accurately, you could take any of the candidates that score points for being similar and probably find several that are superior to the one that was synthesized. This might actually merit testing: if the selected candidates even rank regularly in the top third, you could call that a success, but I wonder whether they would.
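The pile-onto-the-popular strategy can be sketched in a few lines. This is my own hypothetical scoring rule, not whatever the actual benchmark uses: each candidate’s score is its total similarity to all the others, so the winner is simply whichever answer sits in the densest cluster — quality never enters into it.

```python
def consensus_pick(candidates, similarity):
    """Pick the candidate most similar to all the others, i.e. the one
    that 'piles onto' the popular cluster. Indices (not identity) are
    used so duplicate values each count as separate votes."""
    n = len(candidates)
    totals = [sum(similarity(candidates[i], candidates[j])
                  for j in range(n) if j != i)
              for i in range(n)]
    return candidates[max(range(n), key=totals.__getitem__)]

# Toy run: numeric 'answers', similarity = negative distance.
# The cluster around 11 wins regardless of which answer is correct.
answers = [10, 11, 12, 11, 40]
sim = lambda a, b: -abs(a - b)
picked = consensus_pick(answers, sim)  # -> 11
```

Testing the skeptical claim above would be straightforward with this in hand: log the consensus pick’s rank under whatever independent quality measure exists, and see how often it lands in the top third.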