For months, the recruitment community has been buzzing with rumours: can candidates with little or no specialist training really use generative AI to complete psychometric tests?
With so much speculation, we decided it was time to separate fact from fiction.
Over the past month, one of our Senior Data Scientists has been working tirelessly with a team of postgraduates from UCL to meticulously evaluate ChatGPT and its performance across four key test types.
Our first deep dive? Aptitude Tests.
Over the course of this article, we’ll share our research methods and our results.
But in case you just wanted the headlines…
According to research, one of the best ways to predict candidates’ future job performance is to measure their cognitive ability (Schmidt and Hunter, 1998; Kuncel et al., 2004; Schmidt, 2009; Wai, 2014; Schmidt et al., 2016; Murtza et al., 2020).
And the best way to measure cognitive ability has historically been with a traditional Aptitude Test. These tests tend to ask questions or present problems where the answer is either right or wrong – this is what we’d call maximum performance. They paint a picture of what people can do, but not necessarily how they do it. These tests have, for a long time, been well-regarded as a reliable assessment method, with no easy way to cheat (outside of using practice tests from the likes of Assessment Day or following the advice shared on sites like YouTube and Reddit).
But in the last year, Generative AI turned the theory that Aptitude Tests were ‘un-gameable’ on its head. Tools like ChatGPT exploded onto the market, with their ability to interpret, predict, and generate accurate text responses. Speculation became rife that these tools might now be able to analyse test questions and generate optimal answers, even without any detail on how the tests were built or how the right answers were formulated.
Even the Microsoft researchers who published an early evaluation of GPT-4, the model behind ChatGPT’s latest version, suggested its reasoning capability is better than that of an above-average human. So it’s no surprise that many TA leaders began to speculate.
This leads us back to the big question: can candidates with little or no specialist training really use generative AI to complete psychometric tests?
To find out, our research squad tested ChatGPT against retired versions of the text-based Verbal and Numerical Reasoning assessments we used until a few years ago. These are cornerstones of standard Aptitude Testing and were validated using standard psychometric design principles and best practices.
They were also used in multiple early careers campaigns across many different sectors, giving us solid comparison data across a sample size of 36,000 candidates. While these assessments are retired, their question format is still identical to that of the majority of Reasoning Tests currently on the market (we checked).
From here on out, we’ll explain both our research process and the results. Plus, we’ve also created a downloadable drill-down into the data if you’re interested in diving deeper.
To see if ChatGPT could complete these tests, we first needed to explain the questions to the Generative AI model in the same way that a candidate would. There are many different ways of framing information for ChatGPT, and this is commonly known as ‘prompting’.
After a lot of research, we settled on five distinct ChatGPT prompting styles to investigate. These are some of the most accessible styles, which any candidate can find online by Googling “best prompting strategies for ChatGPT”. These were:
Next up, we needed to decide whether to compare the two different versions of ChatGPT.
GPT-3.5 is currently free. As a result, it’s much more widely used than GPT-4, which sits behind a paywall for $20 / £15 a month. Yet GPT-4 has some enhanced features that we believe could give candidates an additional advantage in completing an Aptitude Test (assuming they have the financial means to pay for it):
Given the potential for different outputs from the two available models, we decided to test both and compare the scores to those achieved by the average human candidate completing the same Aptitude Test.
Having made these decisions, our research team began testing ChatGPT using a standardised approach through its API. It’s worth noting that some Aptitude Tests operate within time limits. In the early stages of our research, though, we observed that we could craft a prompt and get a response from ChatGPT quickly enough that timing our tests wasn’t necessary: ChatGPT could always respond about as fast as a candidate would.
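For readers who want a sense of what a “standardised approach through its API” looks like in practice, here is a minimal sketch using OpenAI’s Python client. To be clear, the question text, answer options, and scoring are illustrative placeholders, not our actual test items or research harness; only the public API model names are real.

```python
# Minimal sketch: send the same question to both models via the ChatGPT API.
# The question text and options are placeholders, not items from the retired assessments.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = (
    "Read the passage and answer the question.\n"
    "Passage: <verbal reasoning passage goes here>\n"
    "Question: Which of the following statements is true?\n"
    "A) ...\nB) ...\nC) ...\nD) ..."
)

def ask(model: str, prompt: str) -> str:
    """Send one prompt to the given model and return its text reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic answers make runs comparable
    )
    return response.choices[0].message.content

# Compare the free and paid models on an identical prompt.
for model in ("gpt-3.5-turbo", "gpt-4"):
    print(model, "->", ask(model, QUESTION))
```

Running every prompting style and both models through the same loop is what makes the approach standardised: each combination sees an identical set of inputs.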
The results for Verbal Reasoning were nothing short of amazing: almost all of the prompting styles beat the average for human candidates. Of course, the margin varied considerably depending on the prompting style and on whether we were using the free or paid version of ChatGPT.
GPT-4 with Chain of Thought prompting achieved an astounding score of 21/24, better than 98.8% of human candidates, across a sample size of 36,000 people.
As you can see, GPT-4 performed much more strongly than its predecessor, GPT-3.5, although GPT-3.5 was still able to score higher than the average candidate with 3 of the 5 prompting styles, suggesting that candidates using ChatGPT to complete Aptitude Tests will have an advantage over their peers and may be able to mask their true verbal reasoning ability.
Our analysis also revealed that the 'Chain of Thought' and 'Generative Knowledge’ prompting strategies were most effective. Chain of Thought prompting shows ChatGPT a worked example, the reasoning steps as well as the correct answer, which encourages it to break each new problem into intermediate steps before committing to an answer (just as a human might). Generative Knowledge prompting, meanwhile, asks ChatGPT to first generate relevant background knowledge about the question before answering; here it likely draws on the large amount of information available online about Aptitude Tests and how they work to inform its decision making.
Examples of the prompting styles used to complete the Verbal Reasoning Test are detailed below.
A mentoring programme has recently been established at a creative marketing agency. All current employees are either team members, team leaders, or department managers. The motivation behind establishing the programme is primarily to increase the frequency and clarity of communication between employees across various roles in order to enhance innovative output through collaboration while providing the additional benefit of deepening employees' understanding of career progression in the context of individual and collective competencies and skill gaps. All team members have a designated mentor, but they are not permitted to act as a mentor for another employee, and all department managers have one designated mentee.
Based on the above passage, which of the following statements is true:
Chain-of-Thought Example Prompt:
Persona Example Prompt:
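The example prompts themselves were shared as screenshots, so as a rough illustration only, here is the shape those two styles typically take when wrapped around the passage above. The wording below is ours, written for this sketch, not the exact prompts used in the study.

```python
# Illustrative prompt templates only; the exact wording used in the study differed.
PASSAGE = "A mentoring programme has recently been established at a creative marketing agency..."
OPTIONS = "A) ...\nB) ...\nC) ...\nD) ..."

# Chain-of-Thought style: ask the model to reason step by step before answering.
chain_of_thought_prompt = f"""Read the passage and answer the question.
Passage: {PASSAGE}
Question: Based on the passage, which of the following statements is true?
{OPTIONS}
Work through the passage step by step, check each option against it,
and only then state which single option must be true."""

# Persona style: have the model adopt the role of an expert test-taker first.
persona_prompt = f"""You are an occupational psychologist with 20 years' experience
writing and sitting verbal reasoning tests. Answer the question below.
Passage: {PASSAGE}
Question: Based on the passage, which of the following statements is true?
{OPTIONS}"""
```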
As you can see, it’s not that hard to set up these prompting styles, and they don’t differ hugely from the types of 'how to complete the assessment' videos we already see candidates create on sites like TikTok and Practice Aptitude Tests.
On the flip side, Numerical Reasoning posed much more of a challenge for both GPT models, which makes sense given they’re designed to work with language rather than complex numerical reasoning. All five prompting styles scored well below the human average with GPT-3.5, while two of the prompting styles performed slightly above average with GPT-4.
Despite the improvement, even the best-performing combination, Chain-of-Thought prompting with GPT-4, struggled with more complex number sequences.
While ChatGPT didn’t perform that well here, there’s a whole host of new browser plug-ins (apps you install in a browser such as Google Chrome, where they appear as an ‘overlay’ on the page) emerging all the time, and we can expect them to eventually get better at solving these types of tests. You can read a bit more about a few of them here. We tested one plug-in, Wolfram Alpha, but got the following response:
“I'm sorry, but it seems that Wolfram Alpha could not find a rule governing the sequence [9, 15, 75, 79, 237]”
This suggests that, right now, these types of plug-ins aren’t yet where they’d need to be to complete a numerical reasoning test (although, given how quickly Generative AI technology is advancing, that’s not to say they won’t be soon).
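As an aside, the sequence in that example does appear to follow a consistent rule: alternate adding and multiplying, with the operand counting down 6, 5, 4, 3. That is our own reading rather than an official answer key, but it is easy to check with a few lines of code, and it is exactly the kind of mixed-operation pattern that both GPT models struggled with above.

```python
# Verify one candidate rule for the sequence 9, 15, 75, 79, 237:
# alternately add and multiply, with the operand stepping down 6, 5, 4, 3.
# This is our own reading of the sequence, not an official answer key.
sequence = [9, 15, 75, 79, 237]

value = sequence[0]
generated = [value]
for step, operand in enumerate((6, 5, 4, 3)):
    value = value + operand if step % 2 == 0 else value * operand
    generated.append(value)

print(generated)              # [9, 15, 75, 79, 237]
assert generated == sequence  # the rule reproduces the sequence exactly
```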
So how do these findings affect the hiring process?
We believe this research presents three primary considerations for TA teams using traditional, text-based Aptitude Tests:
In light of these findings, and given that candidates are already using ChatGPT to game traditional assessments, talent acquisition leaders must decide how they’ll tackle this game-changing development, and decide quickly.
TA leaders have three options available to them to tackle this challenge.
An ISE study showed that almost half of students would use ChatGPT in an assessment, which makes reviewing your approach to aptitude testing an urgent priority.
We’re continuing to explore the impact of ChatGPT across the wider assessment industry, and over the next three weeks, we’ll be sharing our research on ChatGPT vs. Situational Judgement Tests, Question-Based Personality Tests, and Task-Based Assessments.
We recently created a new report with insights generated from a survey of 2,000 students and recent graduates, as well as data science-led research with UCL postgraduate researchers.
This comprehensive report on the impact of ChatGPT and Generative AI reveals fresh data and insights about how...
👉 72% of students and recent graduates are already using some form of Generative AI each week
👉 Almost a fifth of candidates are already using Generative AI to help them fill in job applications or assessments, with increased adoption among under-represented groups
👉 Candidates believe it is their right to use Generative AI in the selection process, and a third would not work for an employer who told them they couldn’t do so.