For months, there has been talk in the recruitment community about whether a candidate could really use ChatGPT to complete a Situational Judgement Test (SJT) with little to no specialist training. Other researchers have already published work in this area, and it gives us a useful foundation. But we wanted to dig a bit deeper and understand how easy it really is, what the different approaches to prompting might reveal, and how performance differs between GPT-3.5 and GPT-4.
So, for the second instalment of our ‘ChatGPT vs psychometric assessments’ series, our Senior Data Scientist, working alongside two UCL postgraduates, set out to answer these questions:
Throughout this research piece, we’ll outline our research methodology as well as our findings. But in case you just wanted the headlines…
Now into the details…
Situational Judgement Tests (SJTs, for short) are a type of psychological assessment used to measure how candidates approach and resolve common situations they might come across at work.
They’re often one part of a psychometric assessment, alongside assessments for Aptitude and Personality, and typically assess things like a person’s decision-making skills, organisation and planning, resilience, and communication skills.
The tests work by presenting hypothetical, job-related situations, and asking the test-taker to choose the most appropriate action from a set of multiple-choice options. You can see an example of what one might look like here.
The exact nature of the scoring and how points are allocated varies based on the test provider but there is always a hierarchy –– some answers will get you more points as they are ‘more correct’ and some will give you fewer points because they are ‘less correct’.
In some scenarios, candidates might be asked to rank the multiple-choice options from ‘best to worst’ or ‘most likely to least likely’, or they might just be asked to select ‘the best’ vs. ‘the worst’ option. Either way, a test provider will always be looking for a ‘desirable’ answer, and awarding points based on it.
You can think of an SJT a bit like a pseudo-interview: the assessment is trying to understand how a candidate will behave in the role by asking them how they’d react in certain situations.
And now, with Generative AI models like ChatGPT able to easily interpret test questions and generate optimal answers quickly, whether or not a question-based SJT can truly assess how someone will behave at work hangs in the balance.
This leads us back to the question at hand –– can generative AI tools like ChatGPT really be used by candidates to complete Situational Judgement Tests, with little or no specialist training?
Here’s how we went about finding out.
Arctic Shores does not offer an SJT. We wanted to conduct this research to help the talent acquisition industry make informed decisions about the best ways to assess candidates' true potential to succeed in a role in the age of Generative AI. And as part of that, we wanted to identify whether an SJT was a robust option to do that.
For that reason, our research squad tested ChatGPT against a series of practice tests readily available for any candidate to buy online, using information from the test providers themselves to determine the percentile ranking of ChatGPT’s scores in comparison to others who’ve taken the test.
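For readers who want to see what a percentile ranking means in practice, here is a minimal sketch. It assumes you have a list of previous test-takers' scores to compare against; in our research the percentile information came from the test providers themselves, so the `percentile_rank` function and the `norm_scores` values below are purely hypothetical and are not our data or any provider's method.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(score, norm_scores):
    """Return the percentile rank (0-100) of `score` within `norm_scores`.

    Uses the common 'mid-rank' convention: everyone scoring lower counts
    fully, and everyone with the same score counts as half.
    """
    if not norm_scores:
        raise ValueError("norm_scores must not be empty")
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)           # scores strictly lower
    equal = bisect_right(ordered, score) - below  # scores exactly equal
    return 100.0 * (below + 0.5 * equal) / len(ordered)


# Hypothetical comparison group, invented purely for illustration.
norm_scores = [40, 45, 50, 55, 60, 62, 65, 68, 72, 80]
print(percentile_rank(60, norm_scores))
```

A provider may use a different convention (for example, counting only the scores strictly below), so treat this as one reasonable definition rather than the one behind the figures in this article.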
To see if ChatGPT could complete an SJT, we first needed to explain the questions to it in the same way that a candidate would. There are many different ways of framing information for ChatGPT, commonly known as ‘prompting’.
After a lot of research, we settled on five distinct GPT prompting styles to investigate. These are some of the most accessible styles, which any candidate can find online by Googling “best prompting strategies for ChatGPT”. This is what the prompts and examples looked like:
You can find examples of the prompts in our downloadable pdf, which also covers: a drill down on rank order questions, plus a bonus experiment on whether simple prompting changes can help craft more human responses.
We decided to test both the free and paid versions of ChatGPT to determine whether or not candidates with the financial means to pay for GPT-4 –– which sits behind a paywall at £15 a month –– would be at a significant advantage vs their peers using the free version, GPT-3.5. Here’s a summary of the differences between the two versions:
It’s worth noting that some SJTs operate within time limits. In the early stage of our research, we observed that we could craft a prompt and get a response from ChatGPT so quickly that timing our tests wasn't necessary: ChatGPT would always be able to respond almost as fast as a candidate.
This was helped by basic smartphone features. For example, ChatGPT's iPhone App has a text scanning feature which allows you to scan a question on your phone, ask for a response, and enter that answer on your computer in a few seconds.
Interestingly, using a phone to scan the text of the question was as fast as reading the items. With only a bit of practice, it’s easy to go from input to final answer as fast as a human candidate. This was done on GPT-3.5, which got the correct answer, so we hypothesise that GPT-4 would be even more effective, given its enhanced reasoning ability.
Download our full PDF for an explanation of this method, plus images showing exactly how easily a candidate could use the iPhone’s image-to-text feature.
In short, yes it can.
GPT-3.5 got an average of 50-60% of SJT questions correct and the premium version –– GPT-4 –– got an even more impressive 65-75% of SJT questions correct, putting it into the 70th percentile in comparison to previous test-takers' scores.
While the differences between prompting styles might seem subtle, there’s a big disparity when we look at the different versions. GPT-4 isn’t just better overall — it reveals some unexpected, specific strengths over its forebear.
The headline: GPT-4 excels at complex reasoning, while GPT-3.5 prefers to keep things simple.
Next, we’ll explore the data underpinning this claim… and why it’s so important.
If you remember, some SJTs require candidates to sift through multiple answers, ranking them in order of effectiveness. In these scoring systems, when the candidate ranks an answer correctly, they’re given a whole point. But sometimes there's a bit of wiggle room — like when an answer is almost right. Here the candidate is given a partial score, like 0.5.
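To make the partial-credit idea concrete, here is a minimal sketch of how a rank-order question could be scored. The exact rules vary by provider and we haven't reproduced any of them here: the `score_ranking` function, the 'one position out earns 0.5' rule, and the answer key are all assumptions made up for illustration.

```python
def score_ranking(candidate, key, partial_credit=0.5):
    """Score a candidate's ranking against an answer key.

    Assumed rule (hypothetical, not any provider's actual scheme):
    - option placed in exactly the right position: 1 point
    - option placed one position away ('almost right'): partial_credit
    - anything further out: 0 points
    """
    if sorted(candidate) != sorted(key):
        raise ValueError("candidate and key must rank the same options")
    key_position = {option: i for i, option in enumerate(key)}
    total = 0.0
    for position, option in enumerate(candidate):
        distance = abs(position - key_position[option])
        if distance == 0:
            total += 1.0
        elif distance == 1:
            total += partial_credit
    return total


# Hypothetical answer key and response for a five-option question.
key = ["C", "A", "E", "B", "D"]
response = ["C", "E", "A", "B", "D"]  # second and third options swapped
print(score_ranking(response, key))  # 4.0
```

Under this assumed rule, swapping two neighbouring options costs one point out of five rather than two, which is the 'wiggle room' described above.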
If that sounds too much like some shape-shifting jigsaw puzzle, then you can also visualise it this way. (We wanted to share an example of a question here to illustrate what we mean, but we're keen not to call out any provider specifically. So, for the sake of illustration, we've made up a slightly offbeat example).
Based on the scenario described, please review the following responses and suggest which one you believe to be the response to the situation you would be ‘most likely to make’ and ‘least likely to make’.
“You're a recruiter for a highly prestigious intergalactic corporation. You've just discovered that the best candidate for your new Galactic Sales Manager role is an alien named Targ. Her references are impeccable, but come from galaxies you've never heard of; and you're concerned about how she'll integrate into the Earth office environment. Also, she eats stationery as snacks.”
Please rank the following actions from 1 (most effective) to 5 (least effective):
Why is this important? There have been some reports from SJT assessment providers that ChatGPT can only operate in a binary context and doesn’t perform well when you ask it to evaluate the options in a more complex way, for example by ranking the effectiveness of each answer.
However, despite some small differences between the models, we found that this wasn’t the case.
Both versions of ChatGPT performed well when selecting the ‘most effective’ answer. And even if GPT-4 didn’t pick the ‘most effective’ answer, it picked the next best one in most cases. Therefore, it was able to consistently achieve either a full or a partial score.
GPT-4 also performed well when identifying ‘the least effective’ solution to a scenario.
Conversely, while GPT-3.5 was generally pretty good at predicting which options were ‘the most effective’, it performed less well than its paid successor in identifying the ‘least effective’ solution. This could be a result of a weakness in its counterfactual reasoning.
GPT-3.5’s difficulty in spotting the ‘least effective’ solution was especially pronounced when it was given the default assessment instructions or prompted using a Generative Knowledge style.
This leads us to conclude that while ChatGPT does perform ‘better’ if there is a binary right or wrong answer, GPT-4 still performs very well in a more nuanced context, even when asked to give a solution to a problem with a ‘rank order’.
Having uncovered these differences between the models, we wanted to see if GPT-3.5 understood less about the SJT and how to approach it than its counterpart, and if that was driving the difference in performance. To find out, we asked ChatGPT how it would approach taking a Situational Judgement Test.
Both GPT-3.5 and GPT-4 gave good, logical descriptions about how to approach an SJT –– which wasn’t particularly surprising.
This told us that GPT-3.5’s lower scores weren’t because it had worse knowledge of SJTs or didn’t know what to do. Rather, it simply struggled to apply that logic to its responses.
This research leads us to conclude that candidates can use ChatGPT to complete traditional, question-based Situational Judgement Tests with little-to-no specialist training.
It shows that those candidates who do so will outperform their peers, and those with the financial means to pay for GPT-4 will do so by an even greater margin.
It also disproves the myth that ChatGPT isn’t effective at completing assessments with a rank order scoring system.
So what are the repercussions for recruitment?
Given the pace at which these models are moving, what is true today may not remain true for very long. So while you think about adapting the design of your selection process, it’s also important to stay up-to-date with the latest research.
We’re continuing to explore the impact of ChatGPT across the wider assessment industry. You can view our summary of ChatGPT vs. Aptitude Testing here or sign up for our newsletter to be notified when our next piece of research drops.
The focus? Our researchers have been exploring whether ChatGPT has a default Personality type and how easy it is to adapt it to complete, and score desirably on, a traditional, question-based Personality Assessment.
Subscribe below and be the first to know which assessment format you should choose to future-proof your selection process.