Abstract
Multi-modal large language models (MLLMs) have transformed the landscape of modern healthcare, with automated radiology report generation (RRG) emerging as a cutting-edge application. While 2D MLLM-based RRG is well established, its utility for 3D medical images remains largely unexplored. In this regard, we curate the 3D-BrainCT dataset (18,885 text-scan pairs) and develop BrainGPT, a clinically visual instruction-tuned (CVIT) model designed for 3D CT RRG. Noticing that traditional LLM metrics fail to gauge the diagnostic quality of generated reports, we propose feature-oriented radiology task evaluation (FORTE), an evaluation scheme that captures the clinical essence of the generated reports. Here we show that BrainGPT achieves an average FORTE F1-score of 0.71 (degree = 0.661; landmark = 0.706; feature = 0.693; impression = 0.779) and that 74% of BrainGPT-generated reports were indistinguishable from human-written ground truth in a Turing-like test. Together, our work establishes a comprehensive framework encompassing dataset curation, anatomy-aware model fine-tuning, and the development of robust evaluation metrics for RRG. By sharing our experience in 3D MLLM-based RRG, we aim to accelerate progress in human-machine collaboration for next-generation healthcare.
Introduction
Artificial intelligence (AI) implementation in modern healthcare has re-invented our day-to-day practice in patient diagnosis1,2,3, disease intervention4, and clinical research5. Although convolutional neural networks (CNNs) have conquered major tasks in image classification and feature segmentation, their outputs are relatively context-restrictive6 and less comprehensive than a fully written diagnostic report7,8. In response to this clinical gap, early report generation models were established9,10,11,12 for chest X-ray (CXR) interpretation13,14. The primary success of LLM-based CXR report generation has fueled interdisciplinary interest in exploring human-computer interfaces, wherein multimodal large language models (MLLMs) can be instrumented as assistants to medical specialists15.
MLLMs have demonstrated cutting-edge capability in multi-image reasoning and video clip understanding, but these functionalities have yet to be fully tested on spatial medical modalities. While LLaVA-Med (Microsoft) and Med-PaLM Multimodal (Google Research) have reported primary success in X-ray and single-slice CT report generation16,17, essential open-ended questions remain: (1) how many pathologic features can an MLLM describe in a report; (2) at what precision (spatial anatomic landmark) level are lesions attributed in MLLM reports; and (3) whether an MLLM can evaluate the degree and size of a lesion. Moreover, current single-slice CT image-text data are manually pre-selected from 3D scans, so MLLMs are largely tested on definite-positive, lesion-containing images. This may result in a sharpshooter fallacy in MLLM testing. Hence, it is imperative to curate a 3D image set to inspect the dexterity of MLLMs in real-world diagnostic contexts.
3D CT scans of the brain, with reports structured in a list-by-list format focusing on differential diagnoses, serve as the first-line diagnostic for diverse intracranial conditions18,19. The degree, size, and location of a brain lesion are paramount for forming precise diagnoses and guiding subsequent clinical decisions. Therefore, the value of key radiology descriptors outweighs that of grammatical filler words. In this context, traditional evaluation metrics, which were designed to evaluate short translations20,21, abstract summarization tasks22, and common image captioning23, cannot capture the clinical essence of brain CT reports24. In this regard, a pathology keyword retrieval concept was reported in a CheXpert study12,25. However, the 14-feature retrieval rates cannot reflect the multi-semantic context (inclusive or exclusive; progression or resolution; relative location and size) of the disease over an extended clinical course. Moreover, the uncategorized keyword list in CheXpert may not be transferable from one image modality to others, which profoundly restricts the application value of medical MLLMs. Therefore, a robust assessment system is required to improve the measurement of clinical essence in MLLM-generated radiology reports.
To evaluate MLLM readiness for radiology report generation (RRG), we surveyed related works and identified three objective constraints in MLLM radiology applications to date: (1) the most-studied CXR modality lacks sufficient lesion diversity to reflect real-world diagnostic challenges, (2) model capacity has not been fully tested in interpreting volumetric scans, and (3) no generalized evaluation metrics are available to gauge MLLM report information density and fidelity. Combined, these unsolved problems have hindered the development of an impactful medical MLLM. In this study, we aimed to advance MLLM adaptability in radiology by addressing the following critical aspects:
(1) We curated a large-scale 3D-BrainCT dataset (18,885 text-scan pairs) that contains abundant lesion details, including the degree, spatial landmarks, and diagnostic impressions of both neuronal and vascular CT features.

(2) We introduced a clinical visual instruction tuning (CVIT) concept that enhances the medical domain knowledge of the open-source Otter foundation model. Our CVIT-augmented BrainGPT model demonstrated multi-image captioning capacity with clinically sensible interpretation of volumetric brain CT scans26,27. The diagnostic accuracy and linguistic style of the BrainGPT models were externally validated on the CQ500 dataset, and 11 physician raters were recruited to complete a Turing test-like linguistic style evaluation.

(3) We proposed a feature-oriented radiology task evaluation (FORTE) structure to gauge the clinical quality of MLLM captions. The FORTE variables comprise four essential keyword components (degree, landmark, feature, and impression) of a diagnostic radiology sentence. By further inspecting the correlation between the generated content and the evaluation score, we suggest that preprocessing the MLLM output with sentence pairing and negation removal can enhance alignment and filter out irrelevant image descriptions.
Overall, this study provides a holistic account of the objective constraints, implementation details, training methods, and future prospects of MLLM application in brain CT and general medical image interpretation (Fig. 1).
a Our customized BrainGPT was trained on a set of volumetric brain CT image-text pairs, and the training results were evaluated with common language metrics. b Before language metric evaluation, the BrainGPT-generated report was preprocessed by sentence pairing and negation removal. Sentence pairing removes the false negativity that arises when the same content is reported in a different order. Negation removal increases feature recall by limiting the scope of negative-finding descriptions. c A keyword-based feature-oriented radiology task evaluation (FORTE) was employed to gauge RRG quality, mainly by referencing a keyword dictionary that describes distinct aspects (degree, landmark, feature, and impression) of the lesion. Note that synonyms are accommodated to allow the recognition of a broader array of related terms, addressing the challenge of lexical variability inherent in clinical reports. d The diagnostic accuracy and generalizability of BrainGPT were tested by performing zero-shot captioning on CQ500, an open-source traumatic 3D brain CT external validation dataset. e The linguistic style and readiness of BrainGPT were further evaluated with a Turing test-like assessment in which 11 physicians were asked to distinguish BrainGPT-generated reports from the original ground truths.
Results
Training BrainGPT with clinical visual instruction tuning
To implement a medical MLLM for 3D brain CT report generation, we collected a real-world 3D-BrainCT dataset and trained BrainGPT models by applying a visual instruction tuning-based approach (Fig. 2). Our fine-tuning conditions comprised regular visual instruction tuning (RVIT): (1) plain instruction (conveying the model's role as a radiology assistant) and (2) in-context example instruction28 (3-shot examples added to the plain instruction); and clinical visual instruction tuning (CVIT): (3) template instruction (structured, clinically defined QA templates added to the plain instruction) and (4) keyword instruction (categorical guidelines focused on keywords added to the plain instruction). As a result, we derived four BrainGPT models (-plain, -example, -template, -keyword) that are fine-tuned to perform CT interpretation with different prior levels of clinical sensibility25 (Supplementary Fig. 1).
From the end-to-end Otter foundation structure, we tested four distinct fine-tuning conditions: two of them were regular visual instruction tuning (RVIT), namely plain instruction and in-context example instruction, whereas the other two were clinical visual instruction tuning (CVIT), namely template instruction and keyword instruction. To enable multi-image in-context learning, we formatted the data as image-instruction-answer triplets, whereupon the instructions and binarized images were arranged into standardized JSON files using the Otter MIMIC-IT pipeline. Specifically, the Otter framework integrates visual data (through a frozen CLIP ViT-L/14 encoder) and language data (using the LLaMA-7B large language model) via a trainable perceiver resampler module. In the LLaMA-7B architecture, gated cross-attention layers are added to distribute even focus across volumetric CT scans. The parameters of the remaining modules, except for the input/output embeddings, are frozen so that training expenses are minimized. Model efficacy was evaluated with traditional metrics, LLM-as-a-judge evaluation, and a FORTE scoring system tailored for RRG evaluation.
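To make the data layout concrete, the following minimal sketch assembles one image-instruction-answer triplet into a JSON record; the field names, file paths, and the 24-slice grouping shown here are illustrative assumptions rather than the exact MIMIC-IT schema.

```python
import json

# One hypothetical image-instruction-answer triplet for visual instruction
# tuning; field names and paths are illustrative, not the exact MIMIC-IT schema.
triplet = {
    "id": "case_0001",
    "instruction": ("You are a radiology assistant. Describe the findings of "
                    "this volumetric brain CT study in a structured report."),
    "answer": "1. Old lacunar infarcts in bilateral basal ganglia. 2. ...",
    # One scan is represented as an ordered stack of 24 axial slices.
    "images": [f"case_0001/slice_{i:02d}.png" for i in range(24)],
}

with open("train_sample.json", "w") as f:
    json.dump([triplet], f, indent=2)
```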
At the whole-report level, all four fine-tuned BrainGPT models outperformed the baseline Otter model on traditional metrics (two-sided Wilcoxon signed-rank test p < 0.01, Fig. 3a; for complete statistics, see Supplementary Tables 1 and 2). In particular, the baseline Otter model scored lowest on the BLEU-4 (0) and CIDEr-R (5.9) metrics, indicating that the baseline model-generated CT reports had low n-gram overlap and underfit semantic term usage frequency. In addition, by reviewing generated reports we observed a mismatch between the conventional metrics and report quality (see Supplementary Table 3). However, none of the traditional metrics except CIDEr-R rated BrainGPT performance in accordance with the increasing elaborateness of RVIT and CVIT (two-sided Wilcoxon signed-rank test p > 0.05 across most visual instruction tuning conditions, summarized in Supplementary Tables 1 and 2). Therefore, we demonstrated that traditional metrics are inherently insensitive to the clinical essence of the generated radiology reports.
a Traditional language metrics were calculated for each MLLM experiment. While sentence pairing significantly increased model scores, statistical significance in the two-sided Wilcoxon signed-rank test was also detected between the fine-tuned BrainGPT models and the baseline Otter model, and between the CVIT (template instruction and keyword instruction) and RVIT (plain instruction and in-context example instruction) models. b MLLM scores demonstrated a significant positive shift in the two-sided Pearson correlation coefficient analysis after sentence pairing preprocessing; the pairing-related increase was particularly pronounced for CIDEr (TF-IDF-based). (*p < 0.05, **p < 0.01, ***p < 0.001). CVIT: clinical visual instruction tuning, RVIT: regular visual instruction tuning.
Sentence pairing
To investigate whether the list-by-list report architecture of brain CT reports accounts for the low traditional metric scores, we applied sentence pairing to decompose the multi-sentence paragraph into smaller semantic granularity. As a result, sentence pairing increased traditional metric scores by an average of 5.28 points in METEOR, 6.48 points in ROUGE-L, and an astonishing 114 points in CIDEr-R (Fig. 3a and Supplementary Table 1). The scores for every BrainGPT model improved significantly across most traditional metrics (p < 0.001, as shown in Fig. 3b). In addition, the score distribution indicated a significant linear correlation (two-sided Pearson correlation coefficient r > 0.7, p < 0.001) in BLEU-2, BLEU-3, BLEU-4, METEOR, and ROUGE-L. However, unlike the other n-gram-based traditional metrics, the CIDEr-R score gained a more pronounced positive shift from sentence pairing (r = 0.577).
Moreover, we observed a positive correlation between advanced CVIT and increased traditional metric scores. In particular, the CIDEr-R score showed the most prominent ascending trend across the hierarchically fine-tuned BrainGPT-plain (125.86), BrainGPT-example (132.38), BrainGPT-template (147.92), and BrainGPT-keyword (153.3). The sentence pairing procedure not only relieves the sequential constraint between the report input and generated output but also accentuates the value of clinical instruction design in generating line-by-line differential diagnosis reports.
Feature-oriented radiology task evaluation (FORTE)
Given that CIDEr-R captures the hierarchical clinical essence across visual instruction tuning conditions, we hypothesized that its term frequency-inverse document frequency (TF-IDF) component reacts to the usage of rare radiology keywords in reports. Therefore, we analyzed term frequency in both the ground truth and test outputs (Fig. 4a). We found that the baseline Otter model had a low keyword recall rate, whereas the BrainGPT models, after visual instruction tuning, demonstrated significant radiology keyword usage in image captioning. Notably, BrainGPT-keyword (Fig. 4a: y = 0.95x - 1.18; r2 = 0.85) maintained a high recall rate across varying frequencies of radiology keywords.
a 2D scatter plot documenting the radiology keyword recall discrepancy between the CVIT-conditioned generated reports and the original ground truth. Wording elements were classified either as radiology keywords or as grammatical filler words, and the CVIT process significantly boosted the keyword usage of the BrainGPT models. Note that the word “no”, used for differential diagnosis, stood out as the top-ranked recalled keyword and was further investigated in the negation removal experiments. b By further differentiating the keywords into distinct subjects (degree, landmark, feature, impression), we benchmarked inter-model performance and tested the negation removal effects on the BrainGPT models. The scoring gains from CVIT and RVIT are marked with red dashed lines, whereas the scoring gains from negation removal are marked with blue dashed lines. Statistical significance was assessed using a two-sided Wilcoxon signed-rank test. c The testing of instruction tuning and negation removal was repeated with traditional evaluation metrics. Statistical significance for these comparisons was also determined using a two-sided Wilcoxon signed-rank test. d Heatmap and 2D scatter plot of the two-sided Pearson analysis showing a distinct evaluation spectrum between the traditional metrics and the FORTE keyword categories. While the traditional evaluation metrics showed high homogeneity, the FORTE evaluation addresses diverse aspects of the radiology description context. (*p < 0.05, **p < 0.01, ***p < 0.001).
On the basis of the keyword frequency experiment and a thorough analysis of the CheXpert classifier, we consider that a clinically sensible evaluation metric should adopt a structured keyword extraction concept that (1) addresses the multi-semantic context of radiology reports, (2) allows the recognition of synonyms to detect a broader array of related terms, and (3) can be transferred across multiple modalities. In this regard, we introduced the feature-oriented radiology task evaluation (FORTE). This method gauges the medical content of the generated report by focusing on the density of radiology information. We categorized radiology keywords and their synonyms into degree, landmark, feature, and impression subsets to perform a multi-faceted evaluation of system performance.
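For illustration, a heavily abbreviated sketch of such a categorized keyword dictionary is shown below; the entries and synonyms are placeholders, and the full lists are released as JSON files in the project repository.

```python
# Abbreviated, illustrative FORTE keyword dictionary: each category maps a
# canonical keyword to accepted synonyms (full lists are released as JSON
# files on the project GitHub page).
FORTE_KEYWORDS = {
    "degree":     {"mild": ["slight", "minimal"], "severe": ["marked", "extensive"]},
    "landmark":   {"basal ganglia": ["putamen", "caudate"], "ventricle": ["ventricular"]},
    "feature":    {"hypodensity": ["hypodense", "low density"], "hemorrhage": ["hematoma", "bleeding"]},
    "impression": {"infarction": ["infarct", "ischemic change"], "atrophy": ["atrophic change"]},
}
```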
By evaluating the F1 score of each FORTE keyword category (see Fig. 4b and Supplementary Table 4), we noted that the advanced CVIT models, BrainGPT-template and BrainGPT-keyword, achieved higher average F1 scores than the RVIT models did (Wilcoxon signed-rank test p < 0.05 in most categories, as detailed in Supplementary Table 2). Specifically, BrainGPT-keyword captured the most radiology details, with FORTE F1 scores of 0.548, 0.533, 0.574, and 0.649. These collective results indicate that clinical instruction-tuned BrainGPT models generate brain CT reports with a higher level of radiology term usage and better alignment with the original diagnostic reports.
Comparison of FORTE with traditional evaluation metrics
To determine whether FORTE captures the same report qualities as traditional metrics, we conducted a two-sided Pearson correlation coefficient analysis comparing FORTE with traditional metrics (Fig. 4d). We observed that traditional metrics had high intracorrelations (r > 0.7, p < 0.001) but a lower correlation with FORTE (r < 0.5, p < 0.001). In addition, the four FORTE domains (degree, landmark, feature, and impression) had lower intracorrelations (r < 0.5, p < 0.001) than the traditional metrics did. This comparison indicates that FORTE addresses broader and more distinct disease aspects than traditional metrics depict.
Negation removal: effect of “interpretation spree”
Upon further review of keyword usage frequency, we found that “no” was notably among the most frequently used words, along with “of,” “and,” “in,” and “the.” This contradicts the common belief of “reporting bias,” where it is assumed that language models, similar to laypersons, are less likely to mention negative features. Clinically, radiologists often focus on context-oriented descriptions to “rule out” diagnostic targets, a practice not mirrored by BrainGPT. Consequently, BrainGPT’s reports were overlaid with excessive negative descriptions rather than the concise, context-specific language used by radiologists, a phenomenon we termed the “Interpretation spree.”
To address this, we applied negation removal to limit descriptions to positive findings (Fig. 4c). The average score increases were as follows: BLEU-1 = 29.25%, BLEU-2 = 34.9%, BLEU-3 = 50.54%, BLEU-4 = 57.26%, METEOR = 24.97%, ROUGE-L = 29.04%, and CIDEr-R = 46.6%, with statistical significance in a two-sided Pearson correlation coefficient analysis (see Supplementary Fig. 2).
Additionally, negation removal improved the BrainGPT FORTE keyword average F1 score by 0.118 for degree, 0.194 for landmark, 0.139 for feature, and 0.163 for impression (see Fig. 4b and Supplementary Table 4). Negation removal led to an overall average F1 score gain of 0.153, with the best instruction-tuned model, BrainGPT-keyword, seeing its F1 score rise from 0.576 to 0.71. To explore the effects of negation removal on keyword distribution more deeply, we inspected the percentage of sentences containing all four elements before and after negation removal (Supplementary Fig. 3). In the initial FORTE evaluation, all keyword categories were present in 20.9–27.4% of the BrainGPT-generated reports, whereas this number decreased to 3.3–6.2% after negation removal. On the other hand, “degree and landmark” was present in 29.3–35.3% of the BrainGPT-generated reports, and this number increased to 41.6–55% after negation removal. This is primarily because negation removal depleted mostly impression keywords. The effect corresponds with the conciseness of radiology reports: impressions are succinct, and each sentence conveys a specific keyword combination, not necessarily containing all four elements, for the purpose of differential diagnosis.
Together, we demonstrated that negation removal allowed evaluation metrics to focus on positive findings, enhanced average traditional scores and FORTE F1 scores, and avoided sparse and off-topic impressions in reports.
Assessing BrainGPT generalization via CQ500 external validation
We conducted zero-shot external validation on the CQ500 brain CT dataset (n = 133) to assess BrainGPT's ability to perform report captioning under generalized clinical conditions. We estimated the clinical keyword retrieval rate by comparing the BrainGPT reports to the ground truth labels of the CQ500 CT scans (Fig. 5a). Notably, BrainGPT mentioned keywords (e.g., ventricular enlargement, atrophy, infarction, and mass effect) in the generated reports for the CQ500 validation dataset with frequencies similar to those of the training dataset (3D-BrainCT). These zero-shot captioning results indicate that BrainGPT acquired the CT report phraseology and writing structure for the differential diagnosis of brain disease.
a A summarizing bar chart panel displaying the cranial disease feature composition across our 3D-BrainCT (training and internal testing) dataset and the CQ500 (external validation) dataset. The BrainGPT caption accuracy was tested on three disease features (mass effect, hemorrhage, and midline shift) that come with ground truth labels in the original CQ500 dataset. The accuracy of each of the four BrainGPT models is displayed as an overhanging line-and-dot plot. The accuracy measured before and after negation removal is marked by circles and stars, respectively. b A head-to-head comparison of distinct BrainGPT-generated reports is showcased for selected brain lesion features (lacunar infarction, subdural hemorrhage (SDH), subarachnoid hemorrhage (SAH), midline shift, and herniation). The precisely mentioned radiology keywords are highlighted in red. Keyword usage increases along the hierarchy from RVIT to CVIT designs. c The multi-slice BrainGPT captioning ability is demonstrated by reporting pathologic conditions spanning multiple CT slices. In the provided image-text examples, image features are labeled with red arrows, and the corresponding report descriptions are labeled with red text.
CQ500 is an intracranial hemorrhage (ICH) dataset encompassing “mass effect,” “hemorrhage,” and “midline shift” features, which were less indexed in our geriatric 3D-BrainCT dataset. Nevertheless, the four instruction-tuned BrainGPT models achieved fair accuracy in reporting mass effect (0.71–0.75), midline shift (0.35–0.38), and hemorrhagic events (0.43–0.5). Furthermore, consistent with our previous findings, negation removal (aligning with differential diagnosis interest) significantly increased the reporting accuracy for midline shift (0.86–0.91) and hemorrhagic events (0.59–0.61). Notably, in diagnosing “mass effect,” “hemorrhage,” and “midline shift,” the advanced CVIT models (BrainGPT-template and BrainGPT-keyword) generally outperformed the RVIT fine-tuned models (BrainGPT-plain and BrainGPT-example). This suggests that CVIT instructions act as a pre-model rectifier, enhancing the focus on valuable clinical etiologies.
To further investigate diagnostic credibility, we examined report results for lacunar infarct, subdural hemorrhage (SDH), subarachnoid hemorrhage (SAH), and midline shift with mass effects (Fig. 5b). Interestingly, in the lacunar infarction case, the BrainGPT-example accurately identified the hypodense old lacunar infarct in the putamen and thalamus regions. However, “putamen” was misspelled as “putmen,” likely due to a typo learned from the original training report. The model also generated successful captions for brain lesions, including SDH at the right occipital region, SAH, and left uncal herniation. To examine whether BrainGPT was subject to the sharpshooter fallacy in 3D CT scans, we isolated multiobject cases with different lesion features in various CT images (Fig. 5c). We noted that in some multi-lesional brain CTs, the BrainGPT reports could recognize distinct hemorrhage subtypes (contusion, intraventricular and subarachnoid hemorrhage) with a sparse comprehension of the lesion location (frontotemporal lobe). In another case, BrainGPT described a regressing chronic ICH and a periventricular angiopathic leukoencephalopathy with white matter hypodensity. These multiobject, multislice caption reports demonstrated BrainGPT’s potential to provide differential diagnoses for hemorrhages and hypodense image features.
LLM-as-a-judge for medical report validity
Our BrainGPT achieved robust performance on both traditional evaluation metrics and FORTE. To further validate the clinical reliability of BrainGPT and the FORTE metrics, we leveraged the LLM-based DocLens as an external referee to measure the fine-grained medical details in the BrainGPT-generated reports and the efficacy of FORTE29. Interestingly, by comparing the DocLens claim precision scores and the FORTE scores before (0.102 vs 0.102) and after visual instruction tuning (0.589 vs 0.539), we noted that the FORTE scoring framework aligns with the LLM-as-a-judge evaluations.
To examine the semantic coherence between FORTE and LLM-based evaluations, we conducted a two-sided Pearson correlation analysis between the DocLens score and all aforementioned metrics. Notably, we observed that DocLens is highly correlated with FORTE (r = 0.78, p < 0.001) but poorly correlated with traditional metrics (r < 0.4, p < 0.001). The improved coherence between DocLens scores and FORTE metrics following the application of sentence pairing highlights the importance of this evaluation preprocessing step (Fig. 6a). In summary, our results demonstrate that traditional metrics fall short of addressing the semantic complexity of medical reports, and that the FORTE scoring concept works as efficiently as the state-of-the-art DocLens evaluation but in a computation-light manner.
a The general and average FORTE scores, rather than the traditional metrics, correlate more strongly with the DocLens LLM-as-a-judge perspective in the two-sided Pearson correlation coefficient analysis. Notably, the correlation increased when sentence pairing pre-processing was applied. b Intraclass correlation coefficients (ICC) between traditional metrics, FORTE, LLM-based DocLens, and the five-item medical expert evaluation were calculated and presented as a heatmap.
To accredit the whole BrainGPT-generation and FORTE-evaluation framework with human expert approval, we constructed a 10-point Likert scale rating in which 2 physicians read 50 randomly sampled captions. The surveyed dimensions combined elements of the radiological RADPEER framework, linguistic readability, and factual considerations30,31,32,33. The evaluation compared AI-generated reports with their corresponding ground truth reports across five key criteria: “Diagnostic Accuracy,” “Linguistic Quality,” “Radiologic Detail Information,” “Density Proportion,” and “Soundness Equivalence.” We then computed the intraclass correlation coefficients (ICC) between the five criteria and the FORTE average F1 score (Fig. 6b), which revealed a moderate correlation between human expert opinion and the proposed FORTE metrics: 0.73 (Diagnostic Accuracy), 0.27 (Linguistic Quality), 0.52 (Radiologic Detail Information), 0.64 (Density Proportion), 0.58 (Soundness Equivalence), and 0.62 overall. In parallel, we also noted a moderate correlation between DocLens and human expert opinion in terms of “Diagnostic Accuracy,” and a mild correlation for the remaining rated items. In conclusion, BrainGPT generated accurate and informative radiologic reports, and the proposed FORTE effectively captures the medical essence within the report.
Linguistic-embedded Turing test
To further evaluate whether healthcare providers could distinguish BrainGPT-generated reports, we conducted a Turing test. Specifically, we asked 11 physicians (2 radiologists, 2 neurologists, and 7 other licensed medical doctors) (Fig. 7a) to determine whether 6 provided brain CT reports (two lacunar infarcts, cortex atrophy, subdural hemorrhage, and midline shift) were written by machines or humans. Interestingly, out of the 66 evaluations, 74.24% of the BrainGPT-generated captions were mistakenly identified as human-written (Fig. 7b). When the CT input context was provided to the evaluators, misidentification was reduced by 18 percentage points, resulting in a 56% rate of machine-written reports being mistaken as human-written. Overall, integrating the scan-text context into the Turing test led to more confident evaluations on the 5-point Likert scale (p < 0.01). This suggests that reviewing the actual image context helps human evaluators detect subtle differences between machine- and human-written reports, likely because of off-target descriptions and inadequate precision in machine-generated captions.
a A schematic flowchart of our Turing test survey. Eleven physicians were asked to distinguish the model-generated report in each of seven selected report pairs. Each answer was accompanied by six qualitative reasoning items regarding why the decision was made. A 5-point Likert scale was used to rate answer confidence, and the participants were asked to address the strength of the rationales (continuity and coherence, familiarity and voice, writing quality, and specificity or vagueness). On top of the linguistic component of the evaluation, we tested whether revealing the image context would divert the participants' impression of the authenticity of the report. b The false impression rates of the participants for original and generated reports, before and after revealing the CT image content. c The consideration of linguistic criteria correlates with higher physician success rates.
We also investigated which qualitative traits influence expert impressions of the human versus machine origin of a report (Fig. 7c). We found that “Familiarity and voice” (29%), “Specificity or vagueness of details” (37.60%), “Continuity and Coherence” (33.33%), and “Writing quality at sentence level” (33.33%) were significant factors when medical experts attributed a human origin to a radiology report34. Notably, when the “Familiarity and voice” factor was taken into account, the average successful identification rate increased from 22.22% to 38.6%. Similarly, considering “Specificity or vagueness of details” increased the average successful identification rate from 26.67% to 39.22%. Given this direct evidence of linguistically driven expert decisions, writing-style fine-tuning is advisable for RRG.
Together, this study demonstrates a holistic workflow for preprocessing a 3D volumetric CT image-text pair dataset (negation removal and sentence pairing), fine-tuning an RRG model (BrainGPT), and curating a medically sensible evaluation metric (FORTE).
Discussion
In this study, we assembled a holistic framework to address various unmet needs and achieve robust report generation for 3D brain CT. On the basis of our collective experiences, we recommend applying the following sequential protocol (or an equivalent process) to ensure report generation and evaluation rigor:
1. CVIT: incorporating clinical domain knowledge when implementing visual instruction tuning

2. Combining “sentence pairing” and “negation removal”: adapting traditional metrics to radiology reports

3. Employing FORTE: gauging the clinical essence of caption results in multiple aspects

4. Leveraging LLM-based agent and human expert evaluation: measuring the medical details of RRG and the accountability of keyword-based evaluation

5. Conducting linguistic-embedded Turing tests: including the corresponding image and linguistic criteria in human expert evaluation
By following these technical approaches, we demonstrated that the fine-tuned open-source MLLMs can be instrumented for 3D medical image interpretation.
Recent work by Hamamci et al. revealed that generative models are competent in 3D chest CT report generation at the state-of-the-art (SOTA) level (BLEU-1 = 46, BLEU-4 = 36.9, METEOR = 29.5, ROUGE-L = 45.9)35,36,37. However, their customized transformer model required 7 days of training on a single NVIDIA A100 GPU, whereas our BrainGPT model requires only 12 h of fine-tuning on two NVIDIA A100 GPUs. Additionally, Google AI's Med-Gemini-3D can perform 3D CT report generation, but only 53% of its reports are considered clinically valid in human evaluations38. The high computational cost of using large-scale Google TPUv4 accelerator pods makes this approach infeasible for general research with limited resources. In contrast, our BrainGPT uses an end-to-end open-source Otter framework (CLIP ViT-L/14 vision encoder39 and LLaMA-7B40), which allows for experimental replication and checkpoint sharing. Moreover, the reduced training cost of BrainGPT enables efficient visual instruction tuning, enhancing model performance and tailoring responses to specialized or stylistic requirements (Supplementary Table 5).
While we did not modify the Otter model structure, we attribute the SOTA-level performance to the combined effects of RVIT and CVIT. Singhal et al. first explored task-specific RVIT in the medical domain and reported that chatbot performance improved with medical QA in-context example primings28. Similarly, Med-PaLM M uses image cues (CXR and pathology slides) alongside clinical instructions to guide MLLM on multimodal medical tasks17. Echoing these studies, our CVIT models (BrainGPT-template, BrainGPT-keyword) outperformed RVIT models in brain CT captioning. This suggests that fine-grained, specialist-grade instruction design may optimize model outcomes for clinical captioning tasks.
We also highlighted that traditional metrics are unfit for evaluating clinical captioning tasks. Medical image reports assist in differential diagnosis and thus are characterized by complex paraphrasing, high token counts (>100), and numerous negative descriptions, which conflict with common metric evaluation contexts24,41. Additionally, we observed an “Interpretation Spree” behavior, where BrainGPT provided off-target (but not hallucinated) diagnostic narrations from multiobject brain CT contexts (detailed in Supplementary Table 3). This behavior is detrimental because (1) off-target effects may exclude the primary disease focus (e.g., stroke or brain tumor), and (2) expanded narration may dilute traditional metrics, leading to invalid evaluations.
To address the low traditional metric values arising from different narration aims, we preprocessed radiology reports with “sentence pairing” and “negation removal.” The application gap is attributed to surface-form similarity and lexicographic analytic restrictions41. We then applied FORTE to decompose the BrainGPT reports into keyword categories addressing the effective radiology descriptor frequency. Reviewing recent LLM-based evaluation studies29,42,43,44, we found that (1) similar to the sentence pairing process of FORTE, most adopt sentence-level comparison when prompting GPT, to adapt to the list-by-list nature of radiology reports; (2) in coherence with the issue raised in our negation removal session, DocLens29 mentioned the importance of an evaluation method discriminating “there is a focal lesion” from “there is no focal lesion”; and (3) most importantly, ranging from sentence-level precision and recall to error counting, every study aims to establish a more clinically precise, explainable, and generalizable metric (Supplementary Table 6).
To this end, FORTE serves as an evaluation framework that consists of sentence pairing, negation removal, and 4-category keyword extraction, which coherently restricts model hallucination, enhances interpretation convergence, and provides instant radiology impressions for healthcare providers. This study performed Pearson correlation analysis across different evaluation metrics and showed that the FORTE method encapsulates a broader medical semantic dimension than the relatively unitary traditional metrics. This is further evidenced by its moderate-to-high correlation with both human expert assessments and DocLens scores25. Moreover, the FORTE framework is customizable and transferable across various medical tasks without restriction to a single focus: an interchangeable categorized keyword bank is provided as keyword JSON files on our GitHub page, with examples for chest X-ray, low-dose computed tomography (LDCT), abdomen CT, and brain CT tasks. Within the FORTE framework, BrainGPT achieves commendable performance with an F1 score of 0.589, comparable to the state-of-the-art (SOTA) performance in general medical disease identification, which reports an accuracy of 59.2% in prior benchmark studies45.
Human expert evaluation in natural language processing experiments has been conducted under distinct experimental designs and has served diverse study purposes; the resulting perspectives are therefore often inconsistent and noncomparable across individual contexts. Consequently, related works have instrumented quantitative (completeness, correctness, conciseness) and qualitative (content, language, structure) measurements to dissect the characteristics that distinguish synthesized clinical reports from human reports46,47,48,49. By adopting a similar design with objective linguistic criteria, we found that both the reviewer success rate and the reasons for answer alternation (“Suspicious wording” and “Both not mentioning critical features”) were associated with writing style (“Familiarity and voice” and “Specificity or vagueness of details”) rather than sentence-level writing quality and coherence34. The significance of medical report writing style was also highlighted in an independent prompting study47. Interestingly, we observed that input case imbalance can influence the caption writing style, potentially related to the over-fitting observed during general model training (see also Supplementary Fig. 4).
This study has several limitations, which should be addressed in future work. First, this is a pilot volumetric brain CT captioning study with no counterpart MLLM module to benchmark against, and hence we cannot justify efficacy at the SOTA level; however, we applied external validation to ensure caption validity in the brain CT module. Second, BrainGPT is trained on degeneration-oriented data and thus fails to capture the malignant tumor and acute traumatic features in CQ500. This phenomenon reflects that the training material may prime the dexterity of the resulting model49. Hence, we suggest enrolling diverse disease etiologies with the aim of differential diagnosis to improve MLLM generalization on broader brain CT features50. Finally, we conducted CVIT and devised clinically oriented evaluations (sentence pairing, negation removal, and FORTE), but we did not examine whether changing the model backbone could benefit brain CT captioning. Future research could compare multi-model results and fine-tune the vision encoder and the language model for CT.
Methods
Study design and oversight
In this study, we trained BrainGPT to generate 3D brain CT reports. We then examined the caption efficacy by (1) adapting traditional evaluation metrics, (2) proposing a clinically oriented FORTE metric, (3) applying external validation on CQ500, and (4) conducting the linguistic-embedded Turing test.
Study patients
We collected 18,885 brain CT scans (742,501 slices) from 9689 patients with Alzheimer’s Disease (mean age = 82.59 years [standard deviation = 9.3 years]; 56.4% male) at Taipei Veterans General Hospital in Taipei, Taiwan, between January 1, 2010, and December 31, 2022. All data were collected under institutional review board approval (2023-10-002 BC) of the Taipei Veterans General Hospital. Informed consent was exempted due to the retrospective nature of the data collection and the use of deidentified CT images. The CT images included a variety of common neurology conditions affecting the skull, brain parenchyma, nasal sinuses, and the eye, and were collected by radiologists who routinely obtain CT images and write image reports based on the images and the patient’s medical records. Since Alzheimer’s Disease is a progressive degenerative condition predominantly seen in the elderly, the dataset includes images of normal brains, past infarcts that still show manifestations, chronic brain conditions, and acute brain lesions.
Clinical visual instruction tuning (CVIT)
To address the domain constraints of a standard MLLM, we conducted multiple end-to-end visual instruction tuning processes on the multi-image-mode Otter foundation model, enhancing its applicability to brain CT radiology features26,27. Based on the Flamingo architecture, the Otter paradigm connects the LLaMA-7B language model and the frozen CLIP ViT-L/14 vision encoder via a trainable perceiver resampler module and multiple cross-attention layers inserted into the LLaMA-7B architecture. Within the original LLaMA-7B structure, all modules except the input/output embeddings were frozen to reduce training costs. The training duration for each resulting model was 12 h on two NVIDIA A100 GPUs, for 3 epochs.
To facilitate multi-image in-context learning capacity, we formulated the data into image-instruction-answer triplets, with the instruction tokenized and the image enhanced prior to input into the model. We designed four distinct fine-tuning conditions including regular visual instruction tuning (RVIT, Plain Instruction, and In-context example Instruction) and clinical visual instruction tuning (CVIT, Template Instruction, and Keyword Instruction), each corresponding to an adherence hierarchy to clinical essence. We named the final instruction-tuned models BrainGPT-plain, BrainGPT-example, BrainGPT-template, and BrainGPT-keyword.
For each visual instruction tuning process, BrainGPT-plain was fine-tuned using plain instruction, conveying the model's role as a radiology assistant; BrainGPT-example was fine-tuned using in-context example instruction, adopting a 3-shot example approach owing to available RAM constraints, based on the work of Singhal et al.28; BrainGPT-template was fine-tuned using template instruction, involving a structured and predefined set of questions or points that need to be addressed; and BrainGPT-keyword was fine-tuned using keyword instruction, focusing on essential areas or categorical guidelines that direct the model's response generation process. Detailed instruction examples can be found in Supplementary Fig. 1. The effectiveness of visual instruction tuning was evaluated using a two-sided Wilcoxon signed-rank test. To minimize the influence of reporter writing style (as demonstrated in Supplementary Fig. 4), we averaged the reports for each reporter, yielding 27 reporter averages. The Wilcoxon signed-rank test, a non-parametric paired test, was then applied to these related samples to assess whether their population mean ranks differ without assuming a normal distribution.
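As a minimal sketch of this statistical step, the snippet below runs a two-sided Wilcoxon signed-rank test on paired, reporter-averaged scores; the 27 values are simulated placeholders for the real reporter averages.

```python
import numpy as np
from scipy.stats import wilcoxon

# Simulated per-reporter averaged metric scores (27 reporters) for two
# fine-tuning conditions; in practice these come from averaging report-level
# scores within each reporter.
rng = np.random.default_rng(0)
scores_rvit = rng.uniform(0.4, 0.6, size=27)
scores_cvit = scores_rvit + rng.uniform(0.0, 0.1, size=27)

# Two-sided Wilcoxon signed-rank test on the paired reporter averages.
stat, p_value = wilcoxon(scores_rvit, scores_cvit, alternative="two-sided")
print(f"W = {stat:.1f}, p = {p_value:.4f}")
```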
Dataset preparation
Training dataset
Since the Otter architecture requires image-text pair instances of the same size (24 slices), we sampled 365,928 slices for training, drawn from 15,238 scans representing 7747 patients out of a total of 597,335 slices from 15,247 scans representing 7751 patients. The system was then tested on 87,312 slices sampled from 145,166 slices of 3638 CT scans representing 1938 patients.
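A minimal sketch of one way to reduce a variable-length scan to the fixed 24-slice input is shown below; the even-spacing strategy is an illustrative assumption, not necessarily the exact sampling rule used.

```python
import numpy as np

def sample_fixed_slices(slice_paths, n_slices=24):
    """Evenly sample a fixed number of slices from a variable-length CT scan.

    The even-spacing rule is an illustrative assumption; it simply spreads
    `n_slices` indices across the full scan.
    """
    idx = np.linspace(0, len(slice_paths) - 1, num=n_slices).round().astype(int)
    return [slice_paths[i] for i in idx]

# Example: a 35-slice scan reduced to the 24-slice input size required by Otter.
scan = [f"slice_{i:03d}.png" for i in range(35)]
print(len(sample_fixed_slices(scan)))  # 24
```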
External validation dataset
The CQ500 dataset, consisting of 1154 available CT scans from 491 patients, was downloaded from the Qure.ai website51. The dataset focuses on image features such as brain parenchyma (plain scans), bone (bone scans), and blood vessels (post-contrast scans). Only non-contrast CT scans with slice numbers between 23 and 40 were selected to build the external validation dataset (n = 133). This ensures that slice thickness and details are similar to our training dataset and fit in the Otter framework. Ground truth was based on a read.csv file from the CQ500 dataset, and the majority rule was applied among the three raters to summarize the “Mass effect,” “Hemorrhage event,” and “Midline shift” labels.
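The sketch below illustrates the majority-rule labeling step under assumed column names for the three raters' reads; the actual CQ500 file headers may differ.

```python
import pandas as pd

# Hypothetical column names for the three raters' midline-shift reads; the
# actual CQ500 header naming may differ.
rater_cols = ["R1:MidlineShift", "R2:MidlineShift", "R3:MidlineShift"]

reads = pd.read_csv("reads.csv")
# Majority rule: positive when at least 2 of the 3 raters mark the finding.
reads["midline_shift_label"] = (reads[rater_cols].sum(axis=1) >= 2).astype(int)
```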
Feature-oriented radiology task evaluation (FORTE)
We proposed the feature-oriented radiology task evaluation (FORTE) framework to capture radiology keywords in terms of the degree (size and intensity), landmark (location and anatomy), feature (disease traits and primary findings), and impression (final diagnosis) of the disease (for the keyword list, see Supplementary Table 7). This list is not just a compilation of keywords but also includes synonyms, allowing the system to recognize a broader array of related terms that may appear in different contexts and addressing the challenge of lexical variability inherent in clinical reports. The F1 score is calculated for each category, providing a multi-faceted evaluation of system performance. Additionally, we computed Pearson's correlation coefficient (Pearson's r), with two-sided t-test significance, between each FORTE category and the traditional metrics, offering a deeper understanding of their applicability and limitations in radiological report evaluation.
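A minimal sketch of the category-level scoring is given below, using an abbreviated "impression" dictionary; the substring-based synonym matching and the handling of sentences without category keywords are simplifying assumptions.

```python
def extract_keywords(text, keyword_dict):
    """Return canonical keywords whose term or any synonym appears in the text."""
    text = text.lower()
    return {canon for canon, syns in keyword_dict.items()
            if any(term in text for term in [canon, *syns])}

def forte_f1(generated, reference, keyword_dict):
    """Category-level F1 between keyword sets of a generated and a reference sentence."""
    gen = extract_keywords(generated, keyword_dict)
    ref = extract_keywords(reference, keyword_dict)
    if not gen and not ref:
        return 1.0  # simplifying convention when neither sentence holds category keywords
    tp = len(gen & ref)
    precision = tp / len(gen) if gen else 0.0
    recall = tp / len(ref) if ref else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

# Example with an abbreviated, illustrative "impression" dictionary.
impression = {"infarction": ["infarct", "ischemic change"], "atrophy": ["atrophic change"]}
print(forte_f1("Old lacunar infarct in the left thalamus.",
               "Chronic infarction, left thalamus.", impression))  # 1.0
```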
We provided the FORTE keyword JSON files for the Brain CT, chest X-ray, low-dose computed tomography (LDCT), and abdomen CT on our project GitHub page to demonstrate the generalizability of FORTE in facilitating the next generation of medical LLM.
Traditional metrics
We compared the clinical evaluation fitness of FORTE against standard similarity-based evaluation metrics, including BLEU (Bilingual Evaluation Understudy, range 0-100), METEOR (Metric for Evaluation of Translation with Explicit Ordering, range 0-100), ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation with Longest Common Subsequence, range 0-100), and CIDEr-R (Robust Consensus-based Image Description Evaluation, range 0-1000) (Supplementary Table 6)20,21,22,23. All standard evaluations were executed using the Microsoft Common Objects in Context (MS-COCO) toolkit52.
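For reference, a minimal sketch of computing these scores with the pycocoevalcap packaging of the MS-COCO toolkit is shown below (CIDEr-R uses its own implementation and is omitted); the example report pair is illustrative.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge

# Both inputs map a report id to a list of strings (references vs candidate).
gts = {"case_0001": ["old lacunar infarct in the left thalamus"]}
res = {"case_0001": ["chronic lacunar infarction in left thalamus"]}

bleu, _ = Bleu(4).compute_score(gts, res)     # [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
meteor, _ = Meteor().compute_score(gts, res)  # requires a local Java runtime
rouge_l, _ = Rouge().compute_score(gts, res)
print(bleu, meteor, rouge_l)
```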
Additionally, to address the list-by-list structure, diverse paraphrasing, and differential diagnosis-oriented negation descriptions of brain CT reports, we incorporated “sentence pairing” and “negation removal” before applying the evaluation formulas. The “sentence pairing” process was inspired by BERTScore, which leverages contextual embeddings and cosine similarity to achieve better semantic matching than traditional n-gram-based methods53. Previous literature has shown that BERT embeddings54 capture rich semantic information and are widely recognized as effective for sentence similarity tasks55,56,57. Cosine similarity, which compares the token-level contextual embeddings of reference and candidate sentences, is widely recognized as an effective generation metric and is applied in the CheXbert vector similarity metric58. Specifically, the reports were embedded and vectorized using the all-mpnet-base-v2 model from the SentenceTransformer library before pairing and scoring. The “negation removal” process filtered out any ground truth/generated report pair containing the word “no”, using the “search” function of Python's “re” package. By these means, sentence pairing releases the sequential constraints of disease descriptors, and negation removal reduces false positives in evaluation reports.
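A minimal sketch of these two preprocessing steps is given below; the greedy best-match pairing rule and the word-boundary regex are simplifying assumptions on top of the stated all-mpnet-base-v2 embedding and "no"-based filtering.

```python
import re
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

def pair_sentences(reference_sents, generated_sents):
    """Pair each generated sentence with its most similar reference sentence
    by cosine similarity of sentence embeddings (greedy best match)."""
    ref_emb = model.encode(reference_sents, convert_to_tensor=True)
    gen_emb = model.encode(generated_sents, convert_to_tensor=True)
    sim = util.cos_sim(gen_emb, ref_emb)            # (n_generated, n_reference)
    best = sim.argmax(dim=1).tolist()
    return [(generated_sents[i], reference_sents[j]) for i, j in enumerate(best)]

def negation_removal(pairs):
    """Drop any sentence pair in which either side contains the word 'no'."""
    return [(gen, ref) for gen, ref in pairs
            if not re.search(r"\bno\b", gen.lower())
            and not re.search(r"\bno\b", ref.lower())]
```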
LLM-as-a-judge for medical report validity
To assess the performance of BrainGPT and our proposed metric, FORTE, we conducted an evaluation using both an LLM-as-a-judge framework and a formal clinical accuracy evaluation by physicians. For the LLM-as-a-judge evaluation, we utilized the DocLens metric, which employs GPT-4o mini to extract claims from generated reports, verify their presence in ground-truth reports, and compute a precision score for the generated content (for the prompt used, see Supplementary Table 8; for claim and precision examples, see Supplementary Table 9)29. Five models were evaluated: BrainGPT-plain, BrainGPT-example, BrainGPT-template, BrainGPT-keyword, and the baseline Otter model. Additionally, we computed Pearson's correlation coefficient (Pearson's r), with two-sided t-test significance, between the DocLens score and the FORTE categories and traditional metrics, offering a deeper understanding of their applicability and limitations in radiological report evaluation. Pearson correlation analyses were also performed between DocLens precision and the FORTE categories, with and without sentence pairing preprocessing, to examine the impact of this step on the correlation.
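As a simplified illustration of the claim-precision idea, the snippet below computes the score once the judging LLM has returned a verdict per claim; the claim extraction and verification prompts themselves are stubbed out, and the example claims are hypothetical.

```python
def claim_precision(claims, verdicts):
    """DocLens-style claim precision: fraction of extracted claims judged to be
    supported by the ground-truth report. The LLM-based extraction and
    verification steps are assumed to have produced `claims` and `verdicts`."""
    return sum(verdicts) / len(claims) if claims else 0.0

# Hypothetical example: three extracted claims, two verified against the ground truth.
claims = ["old lacunar infarct in left thalamus",
          "no midline shift",
          "acute subdural hemorrhage"]
verdicts = [True, True, False]
print(round(claim_precision(claims, verdicts), 2))  # 0.67
```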
For the clinical evaluation, a physician survey was constructed using 50 samples randomly selected from the lowest and highest one-third of sentence pairs (25 samples each) based on FORTE average scores. The survey employed the RADPEER framework, developed by the American College of Radiology (ACR), and included five evaluation criteria: “Diagnostic Accuracy,” “Linguistic Quality,” “Radiologic Detail Information,” “Density Proportion,” and “Soundness Equivalence.”33,59 To maintain focus on these aspects, only the AI-generated reports and their corresponding ground truth reports were provided to the physicians. Intraclass correlation coefficients (ICCs) were calculated to measure the reliability of human evaluations across these criteria. Additionally, RADPEER scores were dichotomized using arithmetic means as thresholds to compute Cohen’s kappa coefficients with DocLens scores.
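The sketch below shows one way to compute the ICC and the dichotomized Cohen's kappa with common statistics libraries; the long-format layout and the simulated scores are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

# Simulated scores for 50 reports: a FORTE-style automatic score, a physician
# rating, and a DocLens precision, all rescaled to 0-1 for illustration.
rng = np.random.default_rng(1)
forte = rng.uniform(0, 1, 50)
expert = np.clip(forte + rng.normal(0, 0.15, 50), 0, 1)
doclens = np.clip(forte + rng.normal(0, 0.20, 50), 0, 1)

# ICC between the two "raters" (FORTE vs physician) in long format.
df = pd.DataFrame({
    "report": np.tile(np.arange(50), 2),
    "rater": ["FORTE"] * 50 + ["expert"] * 50,
    "score": np.concatenate([forte, expert]),
})
icc = pg.intraclass_corr(data=df, targets="report", raters="rater", ratings="score")
print(icc[["Type", "ICC"]])

# Dichotomize at the arithmetic mean before computing Cohen's kappa with DocLens.
kappa = cohen_kappa_score(expert > expert.mean(), doclens > doclens.mean())
print(f"kappa = {kappa:.2f}")
```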
Linguistic-embedded Turing test
To examine whether the BrainGPT CT reports recapitulate the linguistic texture of radiologist reports, we conducted a Turing test involving physicians and radiologists. Each participant was asked to distinguish BrainGPT reports from radiologist reports. The study was structured around four measuring axes:
(1) Turing test: Can physicians tell the difference between BrainGPT reports and radiologist reports?

(2) Confidence rate: How confident are the reviewers in their ratings?

(3) Inter-leaved dependency: Do physicians alter their assessments and confidence rates after reviewing the original CT scans?

(4) Linguistic criteria: What is the linguistic rationale behind physicians' impressions?34
To explore these questions, we collected survey and semi-structured rationale interview data. The physician survey comprised six caption pairs, each consisting of a BrainGPT report and a radiologist report. These examples covered diverse disease instances, including lacunar infarct, subdural hemorrhage, brain atrophy, and midline (cerebral falx) shift, thereby encompassing a range of both acute and chronic cerebral alterations for expert evaluation.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The 3D-BrainCT dataset from this study is securely stored in coded form at Taipei Veterans General Hospital (TPEVGH), Taiwan, and is not publicly accessible due to privacy concerns. Access can be obtained after the IRB and Data Sharing Committee approvals at the TPEVGH and the requesting institution. Details on how to request access are available from TPEVGH Human Research Protection Center (irbopinion@vghtpe.gov.tw). Requests will be processed within two weeks. The CQ500 dataset used for external validation is available in the Qure.ai database and can be accessed via https://www.kaggle.com/datasets/crawford/qureai-headct (https://doi.org/10.1016/S0140-6736(18)31645-3)51.
Code availability
Code used for experiments in this study is available at https://doi.org/10.5281/zenodo.1485268660. We also provide a Huggingface model weight link to our best model (https://huggingface.co/Charliebear/BrainGPT) and the FORTE keyword JSON files (Chest X-ray/ Brain CT/ Low-dose CT/ Abdomen CT).
References
Cao, K. et al. Large-scale pancreatic cancer detection via non-contrast CT and deep learning. Nat. Med. 29, 3033–3043 (2023).
Groh, M. et al. Deep learning-aided decision support for diagnosis of skin disease across skin tones. Nat. Med. 30, 573–583 (2024).
Chang, K. J. et al. Rearrange anatomy inputs like LEGO bricks: applying InSSS-P and a mobile-dense hybrid network to distill vascular significance from retina OCT-angiography. IEEE Comput. Intell. Mag. 19, 12–25 (2024).
Tian F. et al. Prediction of tumor origin in cancers of unknown primary origin with cytology-based deep learning. Nat. Med. 30, 1309–1319 (2024).
Dai, L. et al. A deep learning system for predicting time to progression of diabetic retinopathy. Nat. Med. 30, 584–594 (2024).
Rajpurkar, P. & Lungren, M. P. The current and future state of AI interpretation of medical images. N. Engl. J. Med. 388, 1981–1990 (2023).
Boiselle, P. M. Computed tomography screening for lung cancer. JAMA 309, 1163–1170 (2013).
Wysoki, M. G. et al. Head trauma: CT scan interpretation by radiology residents versus staff radiologists. Radiology 208, 125–128 (1998).
Boag, W. et al. Baselines for chest x-ray report generation. In Machine Learning for Health Workshop (PMLR, 2020).
Chen, Z., Song, Y., Chang, T.-H. & Wan, X. Generating Radiology Reports via Memory-driven Transformer. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1439–1449, Online. (Association for Computational Linguistics, 2020).
Selivanov, A. et al. Medical image captioning via generative pretrained transformers. Sci. Rep. 13, 4171 (2023).
Yang, S. et al. Radiology report generation with a learned knowledge base and multi-modal alignment. Med. Image Anal. 86, 102798 (2023).
Johnson, A. E. W. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6, 317 (2019).
Demner-Fushman, D. et al. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med Inf. Assoc. 23, 304–310 (2016).
Chen, W. et al. Fine-Grained Image-Text Alignment in Medical Imaging Enables Explainable Cyclic Image-Report Generation. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 9494–9509 (Association for Computational Linguistics, Bangkok, Thailand, 2024).
Li, C. et al. LLaVA-med: training a large language-and-vision assistant for biomedicine in one day. In Proc. 37th International Conference on Neural Information Processing Systems. (2023).
Tu, T. et al. Towards generalist biomedical AI. NEJM AI 1, AIoa2300138 (2024).
Haydel, M. J. et al. Indications for computed tomography in patients with minor head injury. N. Engl. J. Med. 343, 100–105 (2000).
König, M. Brain perfusion CT in acute stroke: current status. Eur. J. Radiol. 45, S11–S22 (2003).
Banerjee, S., Lavie, A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization (2005).
Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Proc. 40th annual meeting of the Association for Computational Linguistics (2002).
Lin, C.-Y. Rouge: a package for automatic evaluation of summaries. In: Text summarization branches out (2004).
dos Santos, G. O., Colombini, E. L. & Avila, S. CIDEr-R: robust consensus-based image description evaluation. In Proc. Seventh Workshop on Noisy User-generated Text (W-NUT, 2021).
Liu, G. et al. Clinically Accurate Chest X-Ray Report Generation. In Proc. 4th Machine Learning for Healthcare Conference. 106, 249–269 (PMLR, 2019).
Irvin, J. et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. Proc. AAAI Conf. Artif. Intell. 33, 590–597 (2019).
Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual instruction tuning. Adv. Neural. Inf. Process. Syst. 36, 34892–34916 (2023).
Li, B. et al. Otter: A Multi-Modal Model with In-Context Instruction Tuning. Preprint at https://ui.adsabs.harvard.edu/abs/2023arXiv230503726L (2023).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Xie, Y. et al. DocLens: Multi-aspect Fine-grained Medical Text Evaluation. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 649–679 (Association for Computational Linguistics, Bangkok, Thailand, 2024).
Min, S. et al. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing. 12076–12100 (Association for Computational Linguistics, Singapore, 2023).
Doshi, R. et al. Quantitative evaluation of large language models to streamline radiology report impressions: a multimodal retrospective analysis. Radiology 310, e231593 (2024).
Jackson, V. P. et al. RADPEER scoring white paper. J. Am. Coll. Radiol. 6, 21–25 (2009).
Wang, Z., Luo, X., Jiang, X., Li, D. & Qiu, L. LLM-RadJudge: achieving radiologist-level evaluation for X-Ray report generation. Preprint at https://ui.adsabs.harvard.edu/abs/2024arXiv240400998W (2024).
Casal, J. E. & Kessler, M. Can linguists distinguish between ChatGPT/AI and human writing?: A study of research ethics and academic publishing. Res. Methods Appl. Linguist. 2, 100068 (2023).
Ethem Hamamci, I., Er, S. & Menze, B. CT2Rep: automated radiology report generation for 3D medical imaging. In Proc. Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. LNCS 15012, 476–486 (Springer Nature, Switzerland, 2024).
Miura, Y., Zhang, Y., Tsai, E. B., Langlotz, C. P. & Jurafsky, D. Improving factual completeness and consistency of image-to-text radiology report generation. In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. (eds Toutanova, K. et al.) 5288–5304 (Association for Computational Linguistics, 2021).
Nicolson, A., Dowling, J. & Koopman, B. Improving chest X-ray report generation by leveraging warm starting. Artif. Intell. Med. 144, 102633 (2023).
Yang, L. et al. Advancing multimodal medical capabilities of Gemini. Preprint at https://ui.adsabs.harvard.edu/abs/2024arXiv240503162Y (2024).
Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. International Conference on Machine Learning (PMLR, 2021).
Touvron, H. et al. LLaMA: open and efficient foundation language models. Preprint at https://ui.adsabs.harvard.edu/abs/2023arXiv230213971T (2023).
Abbasian, M. et al. Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI. NPJ Digital Med. 7, 82 (2024).
Huang, A., Banerjee, O., Wu, K., Pontes Reis, E. & Rajpurkar, P. FineRadScore: a radiology report line-by-line evaluation technique generating corrections with severity scores. Preprint at https://ui.adsabs.harvard.edu/abs/2024arXiv240520613H (2024).
Zambrano Chaves, J. M. et al. Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation. Preprint at https://ui.adsabs.harvard.edu/abs/2024arXiv240308002Z (2024).
Bannur, S. et al. MAIRA-2: Grounded radiology report generation. Preprint at https://ui.adsabs.harvard.edu/abs/2024arXiv240604449B (2024).
Liu, J. et al. A Spectrum evaluation benchmark for medical multi-modal large language models. Preprint at https://ui.adsabs.harvard.edu/abs/2024arXiv240211217L (2024).
Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).
Yan, B. et al. Style-Aware Radiology Report Generation with RadGraph and Few-Shot Prompting. In Findings of the Association for Computational Linguistics: EMNLP 2023. 14676–14688 (Association for Computational Linguistics, Singapore, 2023).
Boag, W., Kané, H., Rawat, S., Wei, J. & Goehler, A. A Pilot Study in Surveying Clinical Judgments to Evaluate Radiology Report Generation. In Proc. 2021 ACM Conference on Fairness, Accountability, and Transparency (Association for Computing Machinery, 2021).
Youssef, A. et al. External validation of AI models in health should be replaced with recurring local validation. Nat. Med. 29, 2686–2687 (2023).
Blankemeier, L. et al. Merlin: a vision language foundation model for 3D computed tomography. Preprint at Res. Sq. rs.3.rs-4546309 (2024).
Chilamkurthy, S. et al. Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study. Lancet 392, 2388–2396 (2018).
Lin, T.-Y. et al. Microsoft COCO: Common Objects in Context. In Computer Vision – ECCV 2014 (eds Fleet, D., Pajdla, T., Schiele, B. & Tuytelaars, T.) (Springer International Publishing, 2014).
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. BERTScore: evaluating text generation with BERT. Preprint at https://arxiv.org/abs/1904.09675 (2019).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1 (Long and Short Papers) (Association for Computational Linguistics, 2019).
Feng, F., Yang, Y., Cer, D., Arivazhagan, N. & Wang, W. Language-agnostic BERT Sentence Embedding. In Proc. 60th Annual Meeting of the Association for Computational Linguistics, Vol. 1 (Long Papers), 878–891 (Association for Computational Linguistics, Dublin, Ireland, 2022).
Li, B. et al. On the Sentence Embeddings from Pre-trained Language Models. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 9119–9130 Online. (Association for Computational Linguistics, 2020).
Ethayarajh, K. How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 55–65 (Association for Computational Linguistics, Hong Kong, China, 2019).
Yu, F. et al. Evaluating progress in automatic chest X-ray radiology report generation. Patterns 4, 100802 (2023).
Jackson, V. P. et al. RADPEER™ scoring white paper. J. Am. Coll. Radiol. 6, 21–25 (2009).
Li, C. et al. Towards a Holistic Framework for Multimodal LLM in 3D Brain CT Radiology Report Generation: Code Release. https://doi.org/10.5281/zenodo.14852686 (2025).
Acknowledgements
This work was supported by grants from the Taiwan National Science and Technology Council (NSTC 113-2321-B-A49-013, NSTC 111-2314-B-075-039-MY3, and NSTC 113-2320-B-075-007), the Ministry of Health and Welfare (MOHW113-TDU-B-211-114007 and MOHW114-TDU-B-211-124002), the Veterans Affairs Council (112VACS-008, 113VACS-011, and 114VACS-011), and the "Center for Intelligent Drug Systems and Smart Bio-devices (IDS2B)" and "Cancer and Immunology Research Center" under the Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education in Taiwan. This study is based in part on data from the Big Data Center, Taipei Veterans General Hospital (TPEVGH), with technical support provided in part by Chief and Professor Albert C. Yang, Chief and Dr. Yu-Cheng Lo, Dr. Kung-Hao Liang, Pin-Hsuan Chiang, and Zih-Kai Kao from TPEVGH, Taiwan. Cheng-Yi Li is supported by the Gen. & Mrs. M.C. Peng Fellowship (MD-SY-112-B-02), the Samuel Yin Medical Engineering Fellowship from the School of Medicine, National Yang Ming Chiao Tung University (NYCU), and the Yin Shu-Tien Foundation TPEVGH-NYCU Excellent Physician Scientists Cultivation Program (113-Y-A-002). This work is also partially funded by the Yun-San Ophthalmology Education Research Foundation and an unrestricted gift from the Google PhD Fellowship awarded to Kao-Jung Chang. Disclaimer: Any opinions, findings, interpretations, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of TPEVGH, NYCU, or Google.
Author information
Contributions
C.Y.L. collected data, designed visual instruction tuning instructions, ran experiments, analyzed results, created figures, and wrote the manuscript. K.J.C. proposed the FORTE framework, analyzed results, and wrote the manuscript. C.F.Y. assisted in model fine-tuning and designed sentence pairing. H.Y.W. conducted qualitative analysis and created figures. W.T.C. assisted in model fine-tuning and provided technical advice. H.B. assisted in model survey and academic writing. L.C., Y.P.Y., Y.C.C., S.P.C., and S.J.C. advised on study design and provided meaningful feedback. J.F.L. directed the Turing Test and, as a radiologist, provided feedback on instruction design and the FORTE framework. K.W.C. and S.H.C. guided the project and provided critical suggestions for overall direction. All authors contributed to the interpretation of the results and provided critical feedback on the analyses and manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Julian Acosta and Zhi Huang, who co-reviewed with Sheng Wang, for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, CY., Chang, KJ., Yang, CF. et al. Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation. Nat Commun 16, 2258 (2025). https://doi.org/10.1038/s41467-025-57426-0