We draw a virtual decision line at a freely chosen level to mark our threshold for calling a result good or bad. In the current example we set the level at 50%, but a different level can be chosen.
If the score falls to the right of this line (higher than the decision level), the result is possibly good; if it falls to the left, it could be bad. At this point, however, we cannot yet confirm either.
To find that certainty, we next assess how large a part of the distribution curve lies on the good or the bad side of the decision level.
We are interested only in the side where the score itself lies. In our case we look at the right side, as the score is higher than the decision level of 50%.
If the area of that part of the curve (orange on the graph) is more than 95% of the total curve area, we report the voting result as firmly positive.
If the area is smaller, we report the result as mediocre.
Likewise, by the same logic, the result is reported as negative if the score lies on the left side and the area to the left of the decision level is more than 95% of the total curve area.
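To make the logic concrete, here is a minimal sketch of the classification rule described above. The text does not specify the shape of the distribution curve, so the sketch assumes a Beta posterior over the true approval rate with a uniform prior; that is a common way to model vote counts, not necessarily the exact model used here.

```python
# A minimal sketch of the decision logic described above.
# ASSUMPTION: the "distribution curve" is a Beta posterior over the true
# approval rate, built from a uniform prior and the observed votes.
from scipy.stats import beta

def classify(positive_votes: int, total_votes: int,
             decision_level: float = 0.5, certainty: float = 0.95) -> str:
    """Classify a voting result as firmly positive, negative, or mediocre."""
    # Beta(k + 1, n - k + 1): posterior for the approval rate under a
    # uniform prior (an assumption, not necessarily the product's model).
    posterior = beta(positive_votes + 1, total_votes - positive_votes + 1)
    score = positive_votes / total_votes

    if score > decision_level:
        # Area of the curve to the right of the decision line.
        area = posterior.sf(decision_level)
        return "firmly positive" if area > certainty else "mediocre"
    else:
        # Area of the curve to the left of the decision line.
        area = posterior.cdf(decision_level)
        return "negative" if area > certainty else "mediocre"

print(classify(15, 20))  # e.g. 15 of 20 positive votes -> "firmly positive"
```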
Our ability to offer a quantitative, statistical assessment of the probability that the messaging is good or bad rests on the assumption that even a small sample can be treated as a fair representation of the true population, because we know how such random bias behaves.
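As an illustration of why a known random-error model lets us work with small samples, the hedged sketch below (using the same assumed Beta posterior as above) shows how the 95% uncertainty interval around a fixed 70% score narrows as the number of testers grows.

```python
# Sampling error is quantifiable, so we can report how wide the uncertainty
# is rather than pretending it is zero. ASSUMPTION: Beta posterior with a
# uniform prior, as in the earlier sketch.
from scipy.stats import beta

for n in (20, 80, 320):
    k = round(0.7 * n)  # keep the same 70% score at every sample size
    lo, hi = beta(k + 1, n - k + 1).interval(0.95)
    print(f"n={n:4d}: 95% interval {lo:.2f}-{hi:.2f} (width {hi - lo:.2f})")
```

With 20 testers the interval is wide but still usable; the 95% reporting rule above simply refuses to call a result firm until the interval clears the decision line.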
With a small sample it is essentially impossible to avoid random (and therefore uncontrollable) bias, since the behavior of the chosen batch of testers can fluctuate purely by chance.
For example, if the messaging is targeted at families with small children but the test, purely by chance, is run against mostly teenagers, you may get a totally irrelevant assessment of the content, as the testers probably never understood what matters about it.
We solve that problem by letting you choose specific testing audiences that best match your target group.
Or, in another example, suppose your test is revealed to potential testers in the evening. Regardless of the chosen topic, a certain bias among the testers is then expected, as they tend to be people who are active in the evening (we assume 20 testers are found relatively quickly and the work is done within hours). Does that skew the result?
The short answer: it does and it does not, depending on what you are testing. If the messaging being tested is likely to stir different emotions across this expected night-day divide, you should certainly either run the test again the next noon or select testers more explicitly to avoid this type of bias.