New synthesis candidate selection system

Dear players,

In the last few rounds since the public launch, we have seen a few shortcomings of our current voting system for selecting synthesis candidates. Because we now have hundreds of lab submissions, it is extremely hard to go through all the designs to pick out the best ones. In most cases, people are intimidated by the number of designs they have to look through, and often end up voting for designs that already have lots of votes (“snowballing”) or looking at only a single metric (such as free energy). The devs had a meeting about this today and concluded that a voting system is not well suited to a setting like ours, where we have a massive number of candidates.

Instead, we are now thinking of applying the Elo rating system to synthesis candidate selection. If you saw the movie “The Social Network”, you’ll recognize this system right away, as it was used for FaceMash. In this system, users are continuously asked to pick the better of 2 candidates. Each user decision creates a partial ordering of candidates, and the system tries to come up with a total ordering out of all the partial orderings while minimizing inconsistency. In the end, we will synthesize the top 8 designs in the total ordering.
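
To make the mechanics concrete, here is a minimal sketch (purely illustrative Python) of how a single review could update two candidates’ ratings under a standard Elo update. The starting rating, the K-factor, and the design names are assumptions, not the final implementation.

```python
# Minimal, illustrative Elo sketch: each review updates the two candidates'
# ratings, and the top 8 ratings would go to synthesis.
# The starting rating (1500) and K-factor (32) are assumptions.

def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def apply_review(ratings, winner, loser, k=32):
    """Update ratings in place after a reviewer picks `winner` over `loser`."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - exp_win)
    ratings[loser] -= k * (1.0 - exp_win)

# Example: three candidate designs, two reviews.
ratings = {"design_A": 1500.0, "design_B": 1500.0, "design_C": 1500.0}
apply_review(ratings, winner="design_A", loser="design_B")
apply_review(ratings, winner="design_A", loser="design_C")

top_designs = sorted(ratings, key=ratings.get, reverse=True)[:8]
print(top_designs)
```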

Instead of “voting”, there will be a “review” button, where you’ll be asked to pick the better of 2 randomly picked candidates. In the review interface, you’ll be able to do a full comparison of the 2 candidates - you’ll be able to see their statistics and interactively play with both designs. You can do as many “reviews” as you want, and you’ll be rewarded based on how many “correct reviews” you did.

The system has many great advantages. First, every design will be reviewed by someone. We can set up the system so that designs that haven’t been reviewed yet are more likely to be chosen as random review candidates, which will make sure every design gets reviewed (see the sketch below). Second, one-to-one comparison will allow players to make more in-depth decisions than having to go through hundreds of designs. Third, the quiz-like quality of the review will stimulate people to learn and ask more before they make decisions.
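
As an illustration of the first point, here is one possible (hypothetical) way the random pairing could be biased toward under-reviewed designs. The 1 / (1 + review count) weighting is just an assumption for the sketch.

```python
import random

# Hypothetical sketch: bias random pair selection toward designs that have
# received few reviews so far, so that every design eventually gets looked at.

def pick_review_pair(review_counts):
    """Return two distinct design ids, favoring under-reviewed designs."""
    designs = list(review_counts)
    weights = [1.0 / (1 + review_counts[d]) for d in designs]
    first = random.choices(designs, weights=weights, k=1)[0]
    rest = [d for d in designs if d != first]
    rest_weights = [1.0 / (1 + review_counts[d]) for d in rest]
    second = random.choices(rest, weights=rest_weights, k=1)[0]
    return first, second

counts = {"design_A": 12, "design_B": 0, "design_C": 3}
print(pick_review_pair(counts))  # design_B is the most likely to appear
```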

The system does have a few issues. The biggest one is that it would reduce the social aspect of candidate selection. For example, it will now be hard to tell people “vote for my design ABCDE, I think it’s cool” in chat to promote your design. We plan to address this by keeping the “design browser”, so you can still browse through every submitted design, review a specific design, and even start doing “reviews” relevant to that specific design (i.e., fix one candidate of the review to be that specific design). We could also allow people to leave comments on each design when they review, so people who come later can see them.

The details still need to be worked out. For example, how are we going to reward people based on their reviews? Do we say a review is correct or wrong only if the 2 candidates in the review were both synthesized and can be compared? If not, how can we rate reviews that involve non-synthesized designs? We are still working on these questions, and it may take some time for us to come up with a final system, but we wanted to throw this idea out to EteRNA players and see what they think.
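
To make the first question concrete, one purely hypothetical scoring rule is sketched below: grade a review only when both of its candidates were synthesized, by checking whether the reviewer’s pick got the higher lab score. The field names and the rule itself are placeholders, not a commitment.

```python
# Hypothetical scoring sketch: a review is gradable only if both candidates
# were synthesized, and it counts as "correct" when the reviewer's pick got
# the higher lab score. Field names and the rule are placeholders.

def grade_review(pick, other, lab_scores):
    """Return True/False if gradable, or None if either design has no lab result."""
    if pick not in lab_scores or other not in lab_scores:
        return None  # cannot grade reviews involving non-synthesized designs
    return lab_scores[pick] > lab_scores[other]

lab_scores = {"design_A": 87, "design_B": 64}  # e.g. percent folded correctly
print(grade_review("design_A", "design_B", lab_scores))  # True
print(grade_review("design_A", "design_C", lab_scores))  # None (not synthesized)
```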

EteRNA team


wow, big changes - got to “see before say” - when is this going to start?

This sounds like a fabulous idea to try. I think that it would be helpful to allow comments on all designs and to make those comments available to whoever is selecting between two designs. That will allow the selector to benefit from whatever observations others have had a chance to make. Also, it would be interesting for the selector to be able to make comments as well on each design… e.g., “I didn’t choose this because it looked like there were too many patterns that might re-align and match up and deform the shape.” That would help people whose designs are not selected understand what to work on…

This is a great idea, more like a crowdsourced peer-review system and less like a popularity contest. One suggestion - there should be a maximum number of reviews that a lab member can choose to write (but perhaps no limit on randomly assigned reviews). Otherwise, one person could still unduly manipulate rankings by choosing to review every pairwise combination of their favorite submission and other submissions. Alternatively, if the number of reviews is fixed (like mod points on slashdot), it will force people to spend them wisely.

My first take on this (after reading up on “Elo”, though admittedly without fully adequate familiarity) is that it operates much like a bubble sort, but with a radically inconsistent comparison algorithm (as different as every player who participates!).
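
To illustrate what I mean (my own toy sketch, nothing from the devs): sort the same list twice with a comparator that contradicts itself some fraction of the time, and the two resulting orders will often differ. The 30% flip rate is arbitrary.

```python
import functools
import random

# Toy illustration of an "inconsistent comparator": sorting the same designs
# twice with a comparator that flips its answer 30% of the time can give two
# different orderings.

def noisy_compare(a, b, flip_probability=0.3):
    true_result = -1 if a < b else 1
    return -true_result if random.random() < flip_probability else true_result

designs = list(range(10))  # stand-ins for design "quality" ranks
order_1 = sorted(designs, key=functools.cmp_to_key(noisy_compare))
order_2 = sorted(designs, key=functools.cmp_to_key(noisy_compare))
print(order_1)
print(order_2)  # frequently differs from order_1
```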

This perception has caused “The Big Question” in my mind at this early stage to become:

"Will the individual comparison choices made by the player community - with their various understandings and perceptions of game dynamics, and their wildly differing ideas of what a successful design looks like - manage to “average out” into a true “wisdom-of-crowds,” and result in a better selection of designs being eventually chosen each round?

… or will this inconsistency, being necessarily applied unevenly across the design-space result instead in a choice indistinguishable from a random “eight-dart toss”, or in a choice so skewed (by whatever current evaluation theory-meme seems to be dominating thought in the “player-verse” during that week’s comparisons) - that the end result each week could fluctuate wildly between a very positive diversity, and a hopelessly muddled melange of design extremes and oddities?"

On purely the positive side, though, this system will clearly enforce at least a minimal analysis even by the least so-inclined players, and will provide a much, much greater democratic view-and-evaluation opportunity for many, many designs which in the previous system would have been simply ignored or overlooked.

It should be an INCREDIBLY interesting experiment, and the more I think on it, the more I like it.

I’m looking forward with great anticipation to participating in this new system, and to seeing how it all feels once it’s in progress, and to seeing how it all works out in the end each week.

-d9

“…Will the individual comparison choices made by the player community - with their various understandings and perceptions of game dynamics, and their wildly differing ideas of what a successful design looks like…”

In what respect was the old system better regarding this issue?

I really can’t see how your concern wasn’t just as valid for the old system.

Hi boganis, I think you’re right, & I agree the old system was not any better in this regard, and that my concern voiced here was indeed not invalid for the old system either; I’m just not sure that this new system (as intriguingly interesting and different as it seems to be) will prove to be any better. But I am hoping very much that it will be.

For me (I guess because I’ve already come to view this system as a kind of “bubble sort”), the “sticking point” is mostly just the idea that the “compare function” in the algorithm of this system will be applied differently each time… but for all I know, this very inconsistency may prove to be both its strength and its brilliance.

-d9

I see you still need to work out the details - is there a time frame for when that will be done?
I think with a total change of the “current voting system for selecting synthesis candidates”, it may be wise to beta test the update before it goes mainstream.

Boganis: That’s a great point, and we fully agree.

The more people discuss their hypotheses about what makes a good design, the wiser the crowd will be. Ideally the game would guide people to a forum for this kind of discussion.

I think it is the obvious method. I would strongly disagree about allowing comments in the reviews.
While I am certainly impressed by everyone’s “lessons learned”, it is important to remember that the goal of this experiment is to determine if humans can recognize patterns.
Not to determine if humans can calculate winning designs. Computers can calculate theoretical combinations, based upon the different metrics involved.

It concerns me that this is not being re-iterated enough.

I strongly disagree – if someone is willing to put in the time to write a brief thought about every design, they should be encouraged rather than prevented from doing so. I doubt that kind of person is going to be motivated by scoring points.

I’ve told a dev this before, but you guys should check out http://shirt.woot.com/Derby for an idea of how a discussion thread on every design can be a worthwhile thing.

Here’s a random thread from there that has the kind of feedback I think we could use here: http://shirt.woot.com/Derby/Entry.asp…

I think you’re making a couple of artificial distinctions:

  1. Between pattern recognition and calculation. Machine learning is a way to recognize patterns, and calculations can feed into human pattern recognition.

  2. Between individual pattern recognition and social pattern recognition. By sharing lessons learned in open discussion, humans can use collective intelligence to recognize patterns that they might miss with their individual intelligences.

Fomeister: I very much agree with you that if sufficient metrics were known, then a computer could simply calculate them and determine which design is better. However, the metrics are unknown, and that’s precisely the problem. Allowing people to comment on individual designs will allow different theories on correct metrics to diffuse throughout the community. Ultimately our goal will be to “formalize” the best metrics so that they can be computed automatically by computers - but first the community must help us find them!

Some thoughts we have been kicking around here at my house, in no particular order…

One of the points that has been noted repeatedly is that people are not consistently integrating the information from the lab synthesis results into their designs. I would like to see another layer of idea generation and possibly voting to address this - not a design layer, but a hypothesis generation layer. This could take a few different forms in practice, and I don’t have a personal favorite at this point.

One option would be to have a set of comment fields that would pop up when someone viewed the lab results asking them to suggest why they think it didn’t synthesize properly, what issues they see that may have caused particular points of failure, and why they think the parts that did work did.

One way to use the data collected this way would be to have a side window that operated like a forum post or a reddit thread. After they commented (and I think it is important to require them to present their own ideas, however minimal, first), they would have access to the comments of others about that design, and, ideally, would be able to upvote or downvote those comments and reply to specific comments of others. Systematically collecting, sharing, and evaluating post-synthesis feedback that is immediately accessible while viewing the design would go a long way toward priming people to have a hypothesis-testing mindset when creating and evaluating further designs.

Another way to use the responses generated would be to rotate them into the pairwise comparison format, if the forum approach seems unworkable or too clumsy or time-consuming. “Which of these two responses best explains why this design (failed to bond / misbonded / deformed, etc.)?”

Regardless of what action is taken regarding post-synthesis comments, I think that it would be a really good idea to explicitly evaluate submissions in the 2nd, 3rd, and subsequent rounds with respect to how well they solve problems found in the first round submissions. Instead of, or in addition to, having a simple pairwise comparison in which the player is asked “Which of these two designs is more likely to fold correctly?”, I think it would be really valuable to ask players to answer “Which of these two designs is more likely to fix the specific problems seen in this synthesized design?” and “Many players believe that Design X (failed / succeeded) due to __________. Which of these two designs would best test whether that was true?” I would like to see perhaps half of the synthesis slots reserved for designs that are explicitly testing hypotheses about previous lab failures or successes.

I was very skeptical about this idea when I first read about it, but I’m warming to it.

Aculady made a suggestion in chat last night that I wholeheartedly agree with: there should also be an option to rate two designs as equal, or skip the comparison altogether.

I think adding the ability for others to comment on designs is great, not just for voting purposes but also to help us all learn to design better.

One question I have is how this system will affect designs submitted late in the week – will there be enough data generated for a design submitted only a couple hours before the design/voting deadline to give it a fair ranking?

It’ll be interesting to see how it works, and I’m feeling hopeful about it :)

Yes yes yes to explicit hypothesis testing! Any model based on voting up individual designs will make this hard, though. “Design A will do better than design B” is a much more powerful format for testing hypotheses than “design A will do well”. But that format requires voting up a pair of designs rather than just one.

I agree, Chris. I had actually thought of suggesting a player achievement (“The Golden Thumb” or something like that) for players who consistently reviewed large numbers of designs, and another (“The Shadow”, perhaps? “Who knows what twisted patterns lurk in the strands of RNA?”) for players whose reviews/predictions consistently matched lab outcomes.

Here’s a proposal re explicit hypothesis testing (related to aculady’s post):

In addition to the Elo pairs generated by the system, mix in some “hypothesis pairs” created by players. In a hypothesis pair, both designs would be by the same player, and the player would include some comments about why the pair is a potentially informative comparison.

Hypothesis pairs would be chosen for synthesis based on how much *disagreement* there was about the ranking of the pair elements, because a more interesting hypothesis is one where the answer isn’t obvious.

The number of hypothesis pairs in the pair pool could be limited, with the candidate hypotheses chosen based on how well the individual pair elements did in Elo voting.
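
To make “how much disagreement” concrete, here is a rough sketch (my own illustration, with made-up vote counts and a made-up limit of 4 slots) that scores each hypothesis pair by how evenly its reviews were split and picks the most contested ones.

```python
# Rough, illustrative sketch of disagreement-based selection: a pair whose
# reviews split closest to 50/50 is the most "contested", hence the most
# informative to synthesize. Data layout and the limit of 4 are made up.

def disagreement(votes_for_a, votes_for_b):
    """0.0 = unanimous, 0.5 = perfectly split."""
    total = votes_for_a + votes_for_b
    if total == 0:
        return 0.0
    return min(votes_for_a, votes_for_b) / total

hypothesis_pairs = {
    ("design_A", "design_A_variant"): (9, 1),  # near-unanimous
    ("design_B", "design_B_variant"): (6, 5),  # contested
    ("design_C", "design_C_variant"): (2, 2),  # maximally contested
}

ranked = sorted(hypothesis_pairs,
                key=lambda pair: disagreement(*hypothesis_pairs[pair]),
                reverse=True)
print(ranked[:4])  # most contested pairs first, capped at 4 slots
```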