Large language models (LLMs), a form of artificial intelligence and the same technology behind ChatGPT, could someday improve liver cancer care by extracting important data from medical charts much faster than humans can, a recent UCSF study found.

LLMs use deep learning and large data sets to understand, summarize, generate and predict content.

The study, led by UCSF gastroenterologist and transplant hepatologist Jin Ge, MD, and published in Gastroenterology, found that a general-purpose LLM extracting liver tumor data from charts was 93% accurate overall compared with human extraction. Accuracy rose to 99% for certain types of data, and the process was 20 times faster.

The model, created at UCSF and applied to protected health information (PHI), extracted eight elements from about 1,100 imaging reports covering 753 patients. The elements included the number and size of patients’ liver tumors, the grade of their tumors, whether there was evidence of metastatic disease in the abdomen and whether there was tumor recurrence.
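To illustrate the general approach, here is a minimal sketch of LLM-based structured extraction from an imaging report. The field names (only five of the eight elements are named in this article), the prompt wording and the `call_llm` callable are all illustrative assumptions; the study's actual prompts, schema and model interface are not described here.

```python
import json

# Illustrative field names based on the elements named in the article;
# the study's real schema is not specified.
FIELDS = [
    "tumor_count", "largest_tumor_size_cm", "tumor_grade",
    "abdominal_metastatic_disease", "tumor_recurrence",
]

PROMPT_TEMPLATE = (
    "Extract the following fields from the radiology report below. "
    "Respond with a single JSON object whose keys are: {fields}. "
    "Use null for any field not mentioned.\n\nReport:\n{report}"
)

def extract_fields(report_text, call_llm):
    """Ask an LLM for a JSON object with the target fields, then parse it.
    `call_llm` is a hypothetical callable that takes a prompt string and
    returns the model's text response."""
    prompt = PROMPT_TEMPLATE.format(fields=", ".join(FIELDS), report=report_text)
    raw = call_llm(prompt)
    data = json.loads(raw)
    # Keep only the expected keys so downstream code sees a fixed schema.
    return {key: data.get(key) for key in FIELDS}
```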

The researchers compared the model’s accuracy to that of physicians who extracted the same data and who were treated as the “gold standard” for study purposes. The model performed best at identifying metastatic disease from the imaging reports, with an overall accuracy of 99%, and worst at identifying tumor size, with 89% accuracy.
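Under that setup, per-element accuracy is simply agreement with the physician annotations. The sketch below shows one way such a comparison could be computed; the data structures and names are assumptions for illustration, not the study's code.

```python
def per_field_accuracy(llm_rows, gold_rows, fields):
    """For each field, the fraction of reports where the LLM's value
    matches the physician annotation treated as the gold standard.
    `llm_rows` and `gold_rows` are parallel lists of dicts, one per
    report; all names here are illustrative."""
    return {
        field: sum(l[field] == g[field] for l, g in zip(llm_rows, gold_rows))
               / len(gold_rows)
        for field in fields
    }
```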

“Generally, the model performed better at simpler tasks involving classification rather than more complicated ones involving comparison or arithmetic,” Ge said.

It took 28 hours for physicians to extract all the chart data compared to two hours for the LLM, the authors reported.

While the model is not currently approved for clinical use, one potential application could be to determine whether a patient is eligible for a liver transplant based on the total amount of cancer detected in their body, the authors wrote. More general applications beyond liver cancer could include text-based prediction modeling, augmented clinical decision support and patient- and provider-facing chatbots, they wrote.

The study is one of the first to compare LLM-based artificial intelligence against manual chart review for research-grade data extraction from charts, Ge noted. One possible weakness, he said, is its use of human chart review as the “gold standard.”

“It very well could be that the humans are less accurate, and the machines are actually better at some of these tasks,” Ge said.