Musings on Test Length

Post by **Unome** » October 28th, 2020, 6:39 pm

Random thoughts after writing a test for UT Invitational. Often when we write tests, we compensate for larger tournaments by increasing test length, the principle being that more questions allow more room for differentiation. However, as tournament sizes become much larger than we are used to this year, it's worth considering the limits of this principle. Clearly, increasing test length indefinitely won't work: at some point even the best teams become unable to complete longer and longer tests, and so the value of added length for stratifying rankings diminishes. I don't have a solution to this problem or I would just state it, but it seems important to talk about this year.

Post by **Giantpants** » October 28th, 2020, 6:54 pm

I was actually just thinking the same thing the other day when writing the event supervisor post for BEARSO Geologic Mapping, since there were of course so many teams lol. Thank you Unome for raising the point. It was also a huge concern of mine when writing the test, since I'm still kind of a new test writer, and the format was still brand new then too.

On my feedback form, a vast majority of submissions told me that the test was difficult because of its length, but I still received comments that some deemed the test not long enough... Obviously nothing can please everyone, but I'd say it's good to err on the side of making it longer, because if a team is good enough that they can deem a "long test" short, then they probably know what they're talking about, which is good of course!

And of course, I guess it just highlights the importance of making a good difficulty spread, since throwing a lot of hard questions out there will not only alienate teams that aren't super high level, but it will just make a jumble of low scores. I guess that's the best way to make sure a test can never really be too long, properly accounting for a wide range of skills and proficiency levels amongst participating teams. It's easy to look at the "best teams" going to a competition and feel the need to make the test harder for them, but this ultimately hurts the score distribution, especially at a competition with 120 or 200 teams like UT or BEARSO.

Did that make sense? Maybe, idk. Anyone else have any other input?

Post by **SilverBreeze** » October 31st, 2020, 2:17 pm

Although I do prefer tests I can't finish within the time frame, I appreciate that test-writers have lives. I don't mind tests that top-level teams can finish within the 50 minutes if there are at least some difficult or critical-thinking questions. It's usually less the test length itself that irks me than the fact that, beyond a certain point, the test is less a gauge of overall preparedness and more a gauge of how carefully you read questions, how closely your answer was formatted to the answer key (even without considering the autograder this is sometimes an issue), and how lucky you were when combing for that one random bit of trivia.

I think one way that might help with letting less experienced teams still have a good testing experience would be putting a section of easier vocab or recall questions that at least give them something if they didn't delve into the event super deeply. This way teams are less likely to be discouraged immediately by the difficulty and forget to comb through the test for easy points. If it wasn't a big portion of the score, I don't think it would cause issues with cheating.

Please, please don't artificially make tests more "difficult" by using confusing wording, small/blurry pictures, or obscure trivia that is barely relevant to the event. Students can tell.

I really like questions that rely on facts learned from preparation to arrive at a new conclusion using critical thinking, but I know those types of questions are hard to write and harder to grade, especially with longer tests. Questions that test for understanding instead of recall tend to be more cheating-proof and more fun.

Post by **BrownieInMotion** » December 10th, 2020, 12:52 am

Personally, I've never changed the length of my exams based on how big the tournament is.

My goal is always to write an exam that covers a wide range of topics at varying levels of depth with an appropriately high skill cap. I believe that, if an exam does those things, then it will accurately measure a team's knowledge, preparation, and test-taking abilities. Increasing the difficulty floor and ceiling in this way also has the consequence of ensuring all teams, including newer/weaker teams, have something to work on the whole time.

---
Frequently, "make your test longer" manifests as: pump out a ton of questions with discrete choices (multiple-choice, true-false, matching, etc.) and let the binomial distribution sort everyone out while you pick some arbitrary questions to be tiebreakers. Even if these questions are interesting (which they usually aren't), this is obviously unsustainable because the number of questions you need is bounded below by the size of the field.
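To put a rough number on how slowly extra discrete questions separate teams, here is a minimal sketch. It assumes (this model is not from the post) that two equally prepared teams each answer every question correctly with the same independent probability `p_correct`, so each score is binomial. `tie_probability` is a hypothetical helper name.

```python
from math import comb


def tie_probability(questions: int, p_correct: float) -> float:
    """Chance that two teams with identical, independent binomial scores tie.

    Assumption: every question is worth one point and is answered correctly
    with probability p_correct, independently of all other questions.
    """
    total = 0.0
    for k in range(questions + 1):
        # Probability that one team scores exactly k points.
        p_k = comb(questions, k) * p_correct**k * (1 - p_correct) ** (questions - k)
        # Both teams landing on k are independent events, so multiply.
        total += p_k * p_k
    return total


if __name__ == "__main__":
    for n in (50, 100, 200):
        print(n, round(tie_probability(n, 0.5), 4))
```

Under these assumptions, quadrupling the test from 50 to 200 questions only roughly halves the chance that two evenly matched teams tie, because the spread of scores grows like the square root of the test length.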

The better solution is to write questions that allow for a more continuous distribution of point values. This is a natural consequence of free response, short answer, fill in the blank, etc. questions. These questions don't necessarily take longer to answer than the aforementioned discrete questions, but they allow you to better differentiate between teams because now you have a lot more data to work with than just knowing which letter they bubbled.

---
There's nothing wrong with writing an exam nobody finishes if the questions in it are interesting and varied. There are always going to be people who don't finish the exam anyway. It's better to skew the distribution right than risk clumping the top teams together.

Given that most tournaments now seem to give a week for grading and scoring to happen, grading lengthy tests by hand in a timely fashion is far less of a barrier than it used to be.

---
tl;dr If your tests are good they will be good regardless of field size. If your tests are not good you should make them good, regardless of field size.

Post by **BennyTheJett** » December 12th, 2020, 9:58 am

I think that quality is always greater than quantity when writing a test (in the limited number I've written). Overall, however, I think that to separate good teams you need to test their memory retention, because it's gotten too easy for people to binder bash, especially in events like Astronomy. For that reason, I like long and painful tests which punish people who can't access their information as quickly.

Post by **jaggie34** » December 15th, 2020, 2:11 pm

BennyTheJett wrote: ↑December 12th, 2020, 9:58 am I like long and painful tests which punish people who can't access their information as quickly.

I like how you think!

In all seriousness, I think that lengthy tests are good especially with online tournaments, as for some events there can't be a lab (Chem Lab, Circuit Lab, Forensics) and it serves as a form of protection against teams just using the internet to try to find answers, as well as allowing for there to be an appropriate spread of teams.

Post by **BennyTheJett** » December 16th, 2020, 8:49 am

jaggie34 wrote: ↑December 15th, 2020, 2:11 pm
In all seriousness, I think that lengthy tests are good especially with online tournaments, as for some events there can't be a lab (Chem Lab, Circuit Lab, Forensics) and it serves as a form of protection against teams just using the internet to try to find answers, as well as allowing for there to be an appropriate spread of teams.

Side note: NEVER EVER EVER just use Wikipedia as a source (for those looking to get into test writing). Try to find sources with information competitors might not have, as everything off Wikipedia will be in binders.

Post by **knightmoves** » December 16th, 2020, 11:17 am

SilverBreeze wrote: ↑October 31st, 2020, 2:17 pm Please, please don't artificially make tests more "difficult" by using confusing wording, small/blurry pictures, or obscure trivia that is barely relevant to the event. Students can tell.

I really like questions that rely on facts learned from preparation to arrive at a new conclusion using critical thinking, but I know those types of questions are hard to write and harder to grade, especially with longer tests. Questions that test for understanding instead of recall tend to be more cheating-proof and more fun.

Agree that questions that require you to demonstrate understanding of the material, rather than just memorization, are desirable. They're also harder to put into a multiple choice format, and basically impossible to score reasonably in multiple choice: you do much better to have an extended-answer question worth several marks, with partial credit available. But that's harder to grade, and with everything running on Scilympiad, grading multiple choice became even easier, and grading anything else became harder.

(Plus I find mathy-type questions are harder to answer with typing than writing. YMMV.)

Agree with those who have said that once a test is long enough that the top teams can't complete it, adding extra length doesn't help differentiate between teams.

Suppose you have a test which is entirely multiple choice. Suppose that the difficulty of the questions is such that the top teams take 30s on average per question, meaning that top teams can complete 4 questions per minute (2 people working independently). That gives you 200 questions in a 50 minute test. Going beyond that doesn't help. You could introduce more questions by making them easier, so the good teams can bomb through them faster, but that makes it a reading speed test rather than a test of the subject matter.

So on this multiple choice test, you have 200 available points. If team scores are uniformly randomly distributed, 17 teams in your competition will give you a >50% chance of a score collision requiring a tiebreak (this is an example of the "birthday problem"). In reality, expected scores are clumped, so the probability of needing a tiebreak is higher.
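The 17-team figure can be checked directly. The sketch below assumes, as the paragraph does, that every team's score is drawn uniformly and independently. The function names are illustrative, and `distinct_scores=200` is an assumption (counting a score of 0 as well would make it 201, which gives the same answer).

```python
def collision_probability(teams: int, distinct_scores: int) -> float:
    """P(at least two teams share a score) when scores are uniform and independent."""
    if teams > distinct_scores:
        return 1.0  # pigeonhole: more teams than possible scores
    p_all_distinct = 1.0
    for k in range(teams):
        # The (k+1)-th team must avoid the k scores already taken.
        p_all_distinct *= (distinct_scores - k) / distinct_scores
    return 1.0 - p_all_distinct


def smallest_field_with_likely_tie(distinct_scores: int, threshold: float = 0.5) -> int:
    """Smallest number of teams whose collision probability exceeds the threshold."""
    teams = 1
    while collision_probability(teams, distinct_scores) <= threshold:
        teams += 1
    return teams


if __name__ == "__main__":
    print(smallest_field_with_likely_tie(200))  # prints 17
```

This is the same calculation as the classic birthday problem, with 200 possible scores standing in for 365 possible birthdays.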

Have 200 teams (quite easy with remote Scilympiad competitions) and you expect 73 ties with uniformly random scores, and even more ties with realistic clumping. Sure - you can always break ties. Have a 200 point test where you break ties on all questions in reverse order, and all your ties are broken unless two teams produce exactly the same pattern of answers. Break ties after that on the time the team entered their last answer, and you've got no ties. But it's also true that your method of breaking ties is basically random. It's easy enough to pick out a few tie-break questions that are more difficult, understanding-testing questions, but nobody can realistically order all 200 questions in order of difficulty.
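The figure of 73 at the start of this paragraph can be reproduced if a "tie" is counted as a team whose score is already held by another team, i.e. the number of teams minus the number of distinct scores. That reading is an assumption, as are the function names below; scores are again taken to be uniform and independent.

```python
def expected_distinct_scores(teams: int, distinct_scores: int) -> float:
    """Expected number of different score values that appear among the teams."""
    # A given score value is missed by every team with probability (1 - 1/d)^n.
    p_value_unused = (1 - 1 / distinct_scores) ** teams
    return distinct_scores * (1 - p_value_unused)


def expected_tied_surplus(teams: int, distinct_scores: int) -> float:
    """Expected number of teams beyond the first at each score value."""
    return teams - expected_distinct_scores(teams, distinct_scores)


if __name__ == "__main__":
    print(round(expected_tied_surplus(200, 200)))  # prints 73
```

Roughly 127 distinct scores are expected among the 200 teams, so about 73 placements would have to be decided by a tiebreaker even before any realistic clumping is taken into account.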

How much does this really matter, though? Take BEARSO Div C, with ~200 teams, and look in the middle of the field. You've got multiple instances of teams separated by a point or two, and some instances of teams having the same point total. OK. Those teams basically did as well as each other. The ranking is pretty meaningless, but in basically any distribution, people that rank 95 out of 200 and 105 out of 200 are almost identical, because most things are pretty normal-looking.

You'd like the medals to have some meaning - you'd like the choice of whether someone finished first or fifth in an event not to be random - but if you've made the test hard enough, that'll probably happen naturally. Sure, there's a point where the field gets sufficiently large that a one-hour test can't differentiate between the top scorers. About 500 people get a 1600 on the SAT every year. That's OK.

(Although I might argue that it would be marginally better for SO to score ties as ties, rather than introduce random tiebreakers.)

Post by **BennyTheJett** » December 16th, 2020, 11:48 am

knightmoves wrote: ↑December 16th, 2020, 11:17 am
(Although I might argue that it would be marginally better for SO to score ties as ties, rather than introduce random tiebreakers.)

I like the idea of having tiebreaker questions that are in depth and make you think more, separating the teams with better reasoning. I dislike when people do "sudden death from the back" or "first correct question wins the tie".

Post by **Unome** » December 17th, 2020, 12:50 pm

BennyTheJett wrote: ↑December 16th, 2020, 11:48 am
I like the idea of having tiebreaker questions that are in depth and make you think more, separating the teams with better reasoning. I dislike when people do "sudden death from the back" or "first correct question wins the tie".

The difficulty is that tiebreakers have to effectively tiebreak evenly matched teams across the entire spectrum, from top teams to the very bottom. Just choosing in-depth questions can easily backfire, since often they'll end up being unanswered by the teams toward the bottom of the stack.

Scioly.org
