A recent study has put ChatGPT, an artificial intelligence (AI) chatbot developed by OpenAI, to the test using questions from the United States Medical Licensing Examination (USMLE). The results, published on the preprint server medRxiv, suggest that ChatGPT performs at a level comparable to that of a passing medical student.
Testing the AI’s Clinical Reasoning Skills
The team of researchers chose to test ChatGPT on questions from the USMLE, as it is a “high-stakes, comprehensive three-step standardized testing program covering all topics in physicians’ fund of knowledge, spanning basic science, clinical reasoning, medical management, and bioethics.” The language model, which was trained on massive amounts of text from the internet, was not trained on the specific version of the test used by the researchers and did not receive any supplementary medical training before the study.
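The researchers worked through the ChatGPT interface directly rather than through code, but for a rough sense of what posing an exam-style item to the model looks like programmatically, here is a minimal sketch using OpenAI's public Python SDK. The model name, prompt wording, and sample question are illustrative assumptions, not details from the study:

```python
# Illustrative sketch only: the study used the ChatGPT interface itself;
# this shows the same idea via OpenAI's public Python SDK (openai >= 1.0).
# The model name and the question below are assumptions, not study data.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# A made-up USMLE-style multiple-choice vignette for illustration.
question = (
    "A 45-year-old man presents with crushing substernal chest pain "
    "radiating to the left arm. What is the most appropriate next step?\n"
    "A) Discharge home\n"
    "B) Obtain an ECG\n"
    "C) Order a chest X-ray\n"
    "D) Start antibiotics"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": question}],
)
print(response.choices[0].message.content)
```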
Surprising Results
The study found that ChatGPT performed at or near the passing threshold for all three exams without any specialized training or reinforcement. As the team wrote, ChatGPT "performed at >50% accuracy across all examinations, exceeding 60% in most analyses." The USMLE pass threshold varies from year to year but sits at approximately 60 percent, meaning ChatGPT's scores fall comfortably within the passing range.
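To make the pass-threshold comparison concrete, here is a minimal sketch of the accuracy calculation. The graded answers below are hypothetical placeholders, not the study's data; the roughly 60 percent threshold is the approximate figure cited above:

```python
# Hypothetical graded answers (True = correct); placeholder data, not the study's.
graded = [True, True, False, True, True, False, True, True, True, False]

accuracy = sum(graded) / len(graded)  # fraction answered correctly
PASS_THRESHOLD = 0.60                 # approximate USMLE pass mark cited above

print(f"accuracy: {accuracy:.0%}")    # -> accuracy: 70%
print("within passing range" if accuracy >= PASS_THRESHOLD else "below threshold")
```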
Potential for Improvement
The researchers believe that the AI's performance could be improved with more prompting and interaction with the model. In cases where it performed poorly, producing answers that were less concordant, the team believes this was partly due to information the model had simply never encountered. At the same time, they suggest the OpenAI bot had an advantage over models trained entirely on medical text, because its broader training gave it more of an overview of the clinical context.
Comparing to PubMedGPT
The team also compared ChatGPT’s performance to that of PubMedGPT, a large language model with a similar neural architecture but trained exclusively on biomedical literature. They found that ChatGPT outperformed PubMedGPT (which scored 50.8% accuracy), and they suggest that domain-specific training may have created greater ambivalence in the PubMedGPT model: because it absorbs real-world text from ongoing academic discourse, it tends toward language that is inconclusive, contradictory, or highly conservative and noncommittal.
AI in Healthcare
The team writes that AI may soon become commonplace in healthcare settings, given the speed of progress in the industry. They suggest that AI could improve risk assessment or provide assistance and support with clinical decisions.
Limitations and Ethical Implications
It’s worth noting that the study has not yet been peer-reviewed, so the results should be viewed with caution until they have been independently verified. Passing a medical licensing exam also does not equate to being a qualified physician: the exam is just one step in a long training process, and a passing score indicates a certain level of knowledge, not necessarily the ability to apply it in a clinical setting.
Moreover, AI models like ChatGPT cannot replace physicians or make clinical decisions on their own. They can provide assistance and support, but the ultimate decision must still rest with a qualified human physician. The study also used a small sample of USMLE questions, so the results may not generalize to the full exam or to other medical exams. Additionally, the study did not test the AI’s ability to explain its reasoning, which is an important aspect of clinical decision-making.
Another important consideration is the ethical implications of using AI models in medical education and decision-making. The use of AI in healthcare raises concerns about bias, accountability, and transparency. For example, an AI model trained on a biased dataset may perpetuate and amplify that bias in its predictions. It’s therefore important that the AI’s decision-making process be transparent and explainable, so that physicians and patients can understand how it arrived at its conclusions and make informed decisions.
Additionally, the use of AI in healthcare must also be considered in the context of the legal and regulatory framework surrounding it. Ensuring patient safety and protecting the interests of all stakeholders involved are crucial considerations when implementing AI in healthcare.
So, while the study suggests that AI models like ChatGPT have the potential to assist with medical education and decision-making, their use in healthcare should be approached with caution and with the ethical, legal, and regulatory implications in mind. Models like ChatGPT are a tool to assist physicians, not a replacement for them, and more research is needed to evaluate the performance and limitations of AI in medical decision-making.