If AI can provide a better diagnosis than a doctor, what’s the prognosis for medics?

AI means too many (different) things to too many people. We need better ways of talking – and thinking – about it. Cue Drew Breunig, a gifted geek and cultural anthropologist, who has come up with a neat categorisation of the technology into three use cases: gods, interns and cogs.

“Gods”, in this sense, would be “super-intelligent, artificial entities that do things autonomously”. In other words, the AGI (artificial general intelligence) that OpenAI’s Sam Altman and his crowd are trying to build (at unconscionable expense), while at the same time warning that it could be an existential threat to humanity. AI gods are, Breunig says, the “human replacement use cases”. They require gigantic models and stupendous amounts of “compute”, water and electricity (not to mention the associated CO2 emissions).

“Interns” are “supervised co-pilots that collaborate with experts, focusing on grunt work”. In other words, things such as ChatGPT, Claude, Llama and similar large language models (LLMs). Their defining quality is that they are meant to be used and supervised by experts. They have a high tolerance for errors because the experts they are assisting are checking their output, preventing embarrassing mistakes from going further. They do the boring work: remembering documentation and navigating references, filling in the details after the broad strokes are defined, assisting with idea generation by acting as a dynamic sounding board and much more.

Finally, “cogs” are lowly machines that are optimised to perform a single task extremely well, usually as part of a pipeline or interface.

Interns are mostly what we have now; they represent AI as a technology that augments human capabilities and are already in widespread use in many industries and occupations. In that sense, they are the first generation of quasi-intelligent machines with which humans have had close cognitive interactions in work settings, and we’re beginning to learn interesting things about how well those human-machine partnerships work.

One area in which there are extravagant hopes for AI is healthcare. And with good reason. In 2018, for example, a collaboration between AI researchers at DeepMind and Moorfields eye hospital in London significantly speeded up the analysis of retinal scans to detect the symptoms of patients who needed urgent treatment. But in a way, though technically difficult, that was a no-brainer: machines can “read” scans incredibly quickly and pick out ones that need specialist diagnosis and treatment.

But what about the diagnostic process itself? Cue an intriguing US study published in October in the Journal of the American Medical Association, which reported a randomised clinical trial on whether ChatGPT could improve the diagnostic capabilities of 50 practising physicians. The ho-hum conclusion was that “the availability of an LLM to physicians as a diagnostic aid did not significantly improve clinical reasoning compared with conventional resources”. But there was a surprising kicker: ChatGPT on its own demonstrated higher performance than both physician groups (those with and without access to the machine).

Or, as the New York Times summarised it, “doctors who were given ChatGPT-4 along with conventional resources did only slightly better than doctors who did not have access to the bot. And, to the researchers’ surprise, ChatGPT alone outperformed the doctors.”

More interesting, though, were two other revelations: the experiment demonstrated doctors’ sometimes unwavering belief in a diagnosis they had made, even when ChatGPT suggested a better one; and it also suggested that at least some of the physicians didn’t really know how best to exploit the tool’s capabilities. Which in turn revealed what AI advocates such as Ethan Mollick have been saying for aeons: that effective “prompt engineering” – knowing what to ask an LLM to get the most out of it – is a subtle and poorly understood art.

Equally interesting is the effect that collaborating with an AI has on the humans involved in the partnership. Over at MIT, a researcher ran an experiment to see how well materials scientists could do their jobs if they could use AI in their research.

The answer was that AI assistance really seems to work, as measured by the discovery of 44% more materials and a 39% increase in patent filings. This was accomplished by the AI doing more than half of the “idea generation” tasks, leaving the researchers to the business of evaluating model-produced candidate materials. So the AI did most of the “thinking”, while they were relegated to the more mundane chore of evaluating the practical feasibility of the ideas. And the result: the researchers experienced a sharp reduction in job satisfaction!