ICU Management & Practice, Volume 25 - Issue 5, 2025


Artificial intelligence holds unprecedented potential to transform healthcare, yet the current evidence base for AI applications in critical care remains limited. This article presents the ABCDEF framework to guide critical care clinicians in evaluating AI-based tools and in demonstrating their safety and effectiveness in real-world clinical environments.

 

Introduction

The "measure of a man" has been a topic of discussion for two millennia with common themes being how one treats others with less power, how many individuals one serves, and how one comports oneself in times of challenge and pressure (Samuel Johnson, Dwight Moody, Martin Luther King Jr, respectively). Artificial intelligence (AI) poses unprecedented power. What, then, is the measure of AI?

 

Nowhere is this a more salient question than in AI applications in healthcare, where medical ethics are a pervasive and perpetual source of philosophical and practical discussion. The core principles of medical ethics (beneficence, non-maleficence, justice, autonomy, and truth) are upheld both in the practice and enquiry of healthcare (Varkey 2021). One facet is the central role of evidence-based medicine, or the conscientious use of current best evidence in making decisions about the care of individual patients (Masic et al. 2008). From medications to oxygen targets, the standard is to evaluate interventions using rigorous trial designs with clinically relevant endpoints (a common gold standard being a randomised, controlled trial with mortality as an endpoint).

 

The methodologic excitement over AI has led to a plethora of publications showing the potential for AI to revolutionise healthcare, but ultimately, the impact of AI in healthcare is limited by a lack of robust evidence in real-world clinical environments. Indeed, of 172 articles about AI in critical care, only 1% explored clinical outcomes, 1% attempted real-time integration and testing, 5% pursued external validation, and none were deemed ready for implementation (Fleuren et al. 2020). The current lack of high-quality evidence has led to high-profile calls for "outcomes-centric" regulation of AI platforms, requiring that products demonstrate they do indeed have their intended clinical impact (similar to approval standards for medications) (Ayers et al. 2024; Wardi et al. 2023).

 

In the domain of medication approval, there are five key phases: pre-clinical (spanning compound synthesis through *in vitro* and *ex vivo* testing), Phase I (safety), Phase II (pharmacokinetics to establish safe and efficacious doses), Phase III (efficacy), and Phase IV (safety and effectiveness in large populations). An Investigational New Drug (IND) application to the United States (U.S.) Food and Drug Administration (FDA) is required to enter Phase I testing, while Phase IV follows formal FDA approval of a New Drug Application (NDA) (Kandi et al. 2023). In this parlance, most AI studies have barely moved beyond pre-clinical testing, much less reached Phase III. Concerningly, no metrics or standards have been established for post-marketing analysis of AI tools, even though such surveillance is deemed essential for medications from both a safety and an effectiveness standpoint and is known to be essential for keeping algorithms up to date and free from bias (Bedoya et al. 2022). Yet another concern is the capacity of high-quality peer review to regulate this vast new frontier.

 

Notably, a recent FDA Perspective on AI regulation published in the Journal of the American Medical Association stated that "The sheer volume of these changes and their impact also suggests the need for industry and other external stakeholders to ramp up assessment and quality management of AI across the larger ecosystem beyond the remit of the FDA" and emphasised that "all involved sectors will need to attend to AI with the care and rigour this potentially transformative technology merits." In this way, clinicians have an imperative to critique and contextualise AI studies as they would medication therapy or any other technology used in healthcare (Warraich et al. 2025; Gilbert et al. 2023).

 

Here, we present clinicians with a framework for the evaluation of AI in the ICU setting.

 

Regulation of AI in Healthcare

AI is a broad field. Machine learning (ML) serves as the conceptual backbone, with the defining characteristic of models 'learning' from data to perform a task without the necessity of specific rules (Rajkomar et al. 2019). Deep learning (DL) is a branch of ML describing more complex models built on neural networks arranged in intricate computational layers (Esteva et al. 2019). Natural language processing (NLP) refers to the methods by which computers understand and process human spoken and written language; large language models (LLMs) build on the concepts of DL and NLP to support a wide range of tasks. In recent years, applications like medical note summarisation and drafting have been enabled by generative LLMs (or Gen AI), which can produce new and coherent text in response to prompts (Thirunavukarasu et al. 2023). As the field of AI continues to evolve, concepts such as artificial general intelligence and agentic (autonomous, goal-oriented) AI bring the issues of oversight and regulation of AI in healthcare into ever-sharper focus (Wu et al. 2025).

 

Given the broad applicability to healthcare and the potential impact on patients, as well as the pitfalls associated with industry self-governance, regulation of AI in healthcare is expected to fall to national and international agencies (e.g., FDA, European Commission). Many of these agencies are represented in the International Medical Device Regulators Forum (IMDRF), which has a mission to harmonise medical device regulation around the world (International Medical Device Regulators Forum 2025). In brief, the IMDRF defines a medical device as an instrument (inclusive of software) intended for the diagnosis, prevention, monitoring, or treatment of disease that does not achieve its action by pharmacological means (International Medical Device Regulators Forum 2022); individual agency definitions are largely consistent (US Food and Drug Administration 2025). However, current regulatory frameworks define software as a medical device (SaMD) more narrowly, and determining whether a specific software product may be subject to regulation is complex; an example decision tree used by the US FDA to classify clinical decision support (CDS) tools is presented in Figure 1.

 

 

There are currently more than 1200 AI-enabled medical devices with marketing authorisation from the U.S. FDA (US Food and Drug Administration 2025); only a small fraction of these devices are relevant to the care of critically ill patients, and an even smaller number have undergone rigorous clinical testing or offer comprehensive reporting (Lee et al. 2023; Joshi et al. 2024; Windecker et al. 2025). The vast majority (>95%) of these AI-enabled medical devices are cleared based on substantial equivalence to predicate devices (i.e., devices that have been previously authorised), a pathway reserved for devices deemed to present low-to-moderate risk to patients (US Food and Drug Administration 2025; Van Norman 2016). These predicate devices may or may not contain similar AI-enabled features and may have themselves been cleared based on predicate equivalence (Muehlematter et al. 2023). The 2024 European Union AI Act established the first comprehensive harmonised regulatory legislation specific to AI, with medical applications designated high-risk (EU Artificial Intelligence Act 2025). In 2025, the U.S. FDA released draft non-binding guidance specific to AI-enabled device software functions that is currently open for comment (US Food and Drug Administration 2025).

 

Most AI with intended or potential healthcare applications is not currently regulated as a medical device in the US, and no applications marketed as LLMs are currently regulated as medical devices (Warraich et al. 2025; Gilbert et al. 2023), with most avoiding regulation by stipulating that outputs are not a substitute for professional advice and should not be used to make medical decisions about an individual (OpenAI 2025). Despite such disclaimers, LLMs can be prompted to provide output consistent with current regulatory definitions of a medical device CDS (Weissman et al. 2025). Proprietary systems like GPT-4, now integrated into clinical platforms, are already being used in practice despite offering limited transparency into their training data or model behaviour. Open-source LLMs, which offer modifiability and auditability, may be more suitable for rigorous validation and regulatory oversight, suggesting that regulatory attention should focus on deployment context, intended use, and safety practices rather than licensing models. A growing number of publications exploring the use of LLMs in healthcare highlights the enthusiasm for their potential impact (Chase et al. 2025; Eriksen et al. 2024; Heinz et al. 2025; Chen et al. 2023).

 

While there is a recognised need for oversight and regulation of LLMs and other AI-enabled tools with definite or potential healthcare applications, the challenges presented by these new technologies are daunting (Warraich et al. 2025; Meskó et al. 2023). Existing regulatory frameworks are not equipped to manage newer AI-based tools. Medical devices are typically approved for a single indication, while LLMs and other AI-based tools have more general potential applications with contextually sensitive performance. Algorithms may require updates or may adjust their own model based on input data or performance. LLMs also present new challenges regarding data security and ownership, transparency, liability, and serious ethical concerns. The pace of change of AI requires that any regulatory framework be forward-thinking or be doomed to obsolescence in short order. Proposed and/or non-binding guidance from the US FDA and IMDRF seeks to address many of these challenges, with a focus on the total product life cycle (TPLC) of AI-enabled devices (US Food and Drug Administration 2024), change management (US Food and Drug Administration 2019; US Food and Drug Administration 2024), principles for good machine learning practice (GMLP) (International Medical Device Regulators Forum 2025), transparency (US Food and Drug Administration 2024), clinical evaluation of software as a medical device (International Medical Device Regulators Forum 2017), structures for quality management systems (International Medical Device Regulators Forum 2015), and frameworks for risk categorisation (International Medical Device Regulators Forum 2014), amongst others. Clinician involvement in deployment and oversight is essential for safe use.

 

The ABCDEF Framework for Clinical Acceptance of AI at the Bedside

Here, we present an ABCDEF framework for clinical acceptance of AI at the bedside (Figure 2).

 

A. Alignment to patient-centred outcomes and key performance indicators (KPIs)

For an AI-based product to achieve clinical acceptance, the gold standard is a demonstration of safety and effectiveness towards a patient-centred outcome via an evaluation in the appropriate clinical setting. Silent deployment studies measuring KPIs or clinically accepted secondary endpoints are likely an intermediate step.

 

At present, many evaluations focus on the development of a model [e.g., evaluating various supervised machine learning methods to optimise discrimination metrics such as area under the receiver operating characteristic curve (AUROC) on a given dataset], with some progressing to validation of that model (i.e., testing the developed model in an external dataset to ensure appropriate performance). However, demonstrating that a model has excellent performance in a retrospective dataset is not tantamount to real-world utility. Key factors affecting model performance in real-world environments can only be assessed in prospective implementation studies, which are critical for evaluating these and other important considerations before adopting and scaling AI-based tools in clinical environments (Marwaha et al. 2022).
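
To make the distinction concrete, the sketch below (illustrative only; all data are simulated and not drawn from any study cited here) contrasts internal test performance with external validation of a frozen model under distribution shift, the scenario in which retrospective AUROCs can flatter real-world utility.

```python
# A minimal sketch: internal development vs external validation of a
# binary outcome model. Cohorts are simulated placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-ins for a development cohort and an external cohort.
X_dev = rng.normal(size=(1000, 10))
y_dev = (X_dev[:, 0] + rng.normal(size=1000) > 0).astype(int)
X_ext = rng.normal(loc=0.3, size=(500, 10))  # simulated distribution shift
y_ext = (X_ext[:, 0] + rng.normal(size=500) > 0).astype(int)

# Internal development: fit and evaluate on one dataset only.
X_train, X_test, y_train, y_test = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
print("internal AUROC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# External validation: the same frozen model on an independent cohort,
# where case mix and measurement practices differ.
print("external AUROC:", roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1]))
```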

 

This critique is not to say that initial steps of model development are not relevant; reports of early-phase methods development and technical performance [e.g., retrieval augmented generation (RAG) to improve the accuracy of recommendations provided by an LLM] remain important steps in the development of AI tools. However, even early-phase methodological studies must consider downstream evaluation and implementation. To this end, studies can still include a schematic illustrating how the AI tool would be integrated into a clinical workflow or outline important next steps in the development pipeline.

 

Alignment tuning is a key part of the development pipeline, ensuring that model outputs not only perform well technically but also reflect clinician reasoning, preferences, and priorities. Both the design of the task and evaluation should incorporate alignment considerations: specifically, whether the AI tool is solving the task in a way that reflects how a clinician would reason through and approach the problem. Alignment tuning is not only about model performance, but also about shaping behaviour to meet real-world expectations, which can improve trust, usability, and clinical adoption.

 

B. Benchmarking against existing clinical standards

It is incumbent upon AI developers to demonstrate that described models are meaningfully better than the current standard of care or less computationally intensive alternatives. AI runs the risk of being a Rube Goldberg machine, performing the same task as simple, extensively validated tools but in an intricate and often expensive 'black box' manner. An example is an ML model that incorporated 19 parameters from 16,189 patients who underwent extubation in the MIMIC-IV dataset to predict the risk of re-intubation (Johnson et al. 2023; Zhao et al. 2021). The development of this model was no doubt time, resource, and labour intensive, yet its publication offered no comparison against the rapid shallow breathing index (RSBI, or Tobin index), a simple quotient of two numbers readily available in the ICU (respiratory rate and tidal volume) that has been used in clinical practice for over 30 years to predict successful extubation (Yang et al. 1991).
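
As an illustration of such benchmarking, the sketch below uses simulated data (not MIMIC-IV) to compute the discrimination of the RSBI (respiratory rate divided by tidal volume in litres) used directly as a risk score; any proposed ML model would be reported alongside this baseline.

```python
# A minimal sketch of benchmarking against a simple clinical index:
# RSBI = respiratory rate / tidal volume (litres). Simulated data only.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 800
resp_rate = rng.normal(22, 6, n).clip(8, 45)           # breaths/min
tidal_vol = rng.normal(0.45, 0.12, n).clip(0.2, 0.9)   # litres
rsbi = resp_rate / tidal_vol

# Simulated outcome: re-intubation made more likely at higher RSBI.
reintubated = (rng.random(n) < 1 / (1 + np.exp(-(rsbi - 105) / 20))).astype(int)

# The simple index used directly as a risk score is the baseline.
print("RSBI AUROC:", round(roc_auc_score(reintubated, rsbi), 3))
# A proposed ML model should be reported alongside this, e.g.
# roc_auc_score(reintubated, ml_model.predict_proba(X)[:, 1]).
```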

 

In the realm of mortality prediction in the ICU, both the Sequential Organ Failure Assessment (SOFA) and the Acute Physiology and Chronic Health Evaluation II (APACHE II) are validated composite metrics that describe severity of illness and are predictive of mortality (Vincent et al. 1996; Knaus et al. 1985). Given their simplicity and predictive performance, these metrics are widely used in quality improvement and as baseline severity criteria in clinical trials. A SOFA score can be quickly calculated at the bedside using readily available data without need for advanced computational methods (although it can be done quickly and accurately by hand, many online calculators exist). The utility of an ML model predicting mortality must be evaluated in the context of such a parsimonious tool.

 

Even in cases when simple composite metrics fail to offer strong predictive performance, comparison of complex ML methods to traditional regression methods is warranted. ML and regression offer differing lenses through which to analyse and interpret data. ML methods offer certain advantages over regression that may be particularly salient in the ICU setting, including robust handling of both large numbers of covariates and non-linear relationships. However, traditional regression methods are computationally less intensive and have the distinct advantage of offering clear insight into the factors most important for the performance of a given model.
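
A minimal sketch of this comparison, on simulated data with one deliberately non-linear term, follows; the regression's coefficients are directly inspectable, while the forest's internals are not.

```python
# A minimal sketch contrasting a transparent regression with a more
# complex ML model on the same simulated task.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 6))
# Outcome with one non-linear term: the setting in which ML may help.
logit = 0.8 * X[:, 0] + 1.2 * (X[:, 1] ** 2 - 1)
y = (rng.random(2000) < 1 / (1 + np.exp(-logit))).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

lr = LogisticRegression().fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=200, random_state=2).fit(X_tr, y_tr)

print("logistic AUROC:", round(roc_auc_score(y_te, lr.predict_proba(X_te)[:, 1]), 3))
print("random forest AUROC:", round(roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]), 3))
# The regression coefficients are directly inspectable; the forest is not.
print("logistic coefficients:", np.round(lr.coef_[0], 2))
```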

 

There is a disconnect between teams that develop LLMs and those that apply them clinically. This disconnect is evident in the space of benchmarking, where common benchmark scores do not fully capture real-world utility. As an example, a high Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score, which measures the number of overlapping words between a reference text and an LLM-generated summary, may correlate with impressive feats like passing licensing exams but does not immediately translate to an accurate diagnosis from a patient's symptoms.
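
For readers unfamiliar with the metric, the sketch below implements ROUGE-1 recall (the fraction of reference words also present in the candidate summary) from its definition; the example strings are invented and show that the score captures word overlap, not clinical correctness.

```python
# A minimal sketch of ROUGE-1 recall: the fraction of reference words
# that also appear in the model summary. High overlap does not imply
# clinical correctness.
from collections import Counter

def rouge1_recall(reference: str, summary: str) -> float:
    ref = Counter(reference.lower().split())
    cand = Counter(summary.lower().split())
    overlap = sum(min(n, cand[w]) for w, n in ref.items())
    return overlap / max(sum(ref.values()), 1)

reference = "patient remains febrile on broad spectrum antibiotics"
summary = "patient febrile despite broad spectrum antibiotics"
print(round(rouge1_recall(reference, summary), 2))  # word overlap only
```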

 

Benchmarks serve an important role in selecting models for given tasks and in measuring technical advances with promising performance for future study and development. These exploratory steps help identify gaps, surface alignment issues, and serve as building blocks between model capability and real-world application. An ideal benchmark would be designed around general principles of medical decision-making and would be shown to be transferable across a set of tasks, thus reducing the need to evaluate each task in isolation while still providing meaningful insights. Such benchmarks must be built on a solid framework, ideally informed by cognitive models of clinical reasoning: this is also a form of 'alignment' across dimensions including tasks, datasets, and models. This alignment can make early-phase technical evaluation more realistic and transferable to the bedside [e.g., developing a metric specifically designed to evaluate the safety and interpretability of a medication order, as aligned with how medication orders are presented in the electronic health record, written in progress notes, or spoken aloud amongst the team].

 

C. Clinician-oriented via use of human-machine teaming and clinically relevant tasks

It has been quipped that "artificial intelligence is a solution looking for a problem." The AI development process is often simplified by having AI solve structured problems with limited potential solutions, established rules and principles, and limited uncertainty (e.g., AI is well equipped to win at games of chess). In contrast, much of medicine may be conceived of as unstructured problems rife with uncertainty and not readily described by rules-based algorithms. It is essential that clinicians be involved to identify not only relevant problems that need to be solved but also problems for which AI may be the best solution.

 

A critical element throughout each phase of the AI life cycle is clinician-centred design. Clinicians are important end-users of AI-based tools in healthcare, and model conceptualisation, design, and development must account for this group. AI-based tools must perform clinically relevant functions, rather than offering solutions to nonexistent problems or clinically insignificant improvements over current state. Output from tools must be actionable, meaning that required inputs are routinely available in the clinical environment and can be integrated in real time, model output is provided to clinicians with enough lead time for clinicians to act, and there are defined actions a clinician can take to alter an explicit outcome of interest. The tool must be usable, with a focus on information technology capabilities and workflow integration with minimisation of clinician barriers. Developers must consider acceptability and clinician buy-in versus over-reliance, and reporting of models must be sufficiently transparent regarding model performance and limitations to clearly define appropriate use cases and discourage automation bias. Finally, clinician input and feedback should be considered during model deployment, monitoring, and optimisation (i.e., human-in-the-loop).

 

The foundational theorem of bioinformatics holds that a person plus a computer is better than either alone (Nelson et al. 2020). While this may not always be the case [e.g., lack of model acceptability and trust, overreliance on the model at the expense of clinical judgement (automation bias)] (Yu et al. 2024), human-machine teaming (HMT) is a fundamental concept for AI implementation. In healthcare in particular, the concept is not only fundamental but necessary: high-stakes decision-making, liability concerns, and issues of public trust and consent require, for the time being, that human judgement remain a part of the system. HMT implies autonomy on the part of the human (i.e., the capacity to solve problems by applying judgement, skills, experience, and ethical considerations) and the machine (i.e., the ability to process large amounts of data quickly and with high fidelity within its environment), as well as interdependence and collaboration toward a common goal (Boy et al. 2022). For this partnership to succeed, the most critical element remains the human contribution: situational awareness, risk assessment, and decision-making. Key themes for HMT are explored in Table 1.

 

 

A clear description of which variables matter in a model and how outputs are determined contributes to clinical acceptance; this insight is captured in the concepts of interpretability and explainability (Abgrall et al. 2024). Interpretability refers to the ability to understand the inner workings of a model (i.e., "how" and "why"). Simple, transparent 'white box' models (e.g., decision trees, linear regression) are inherently interpretable. More complex 'black box' models are not interpretable, but their outputs may still be explained (the "why"), often in a post-hoc fashion using methods such as feature importance graphs, SHapley Additive exPlanations (SHAP), and Local Interpretable Model-agnostic Explanations (LIME). There is an argument to be made that when a model works much better than alternatives and does so reliably, there is less need to understand the "how" or "why" (e.g., many individuals use navigation applications without needing to understand how the technology works). However, healthcare is a different space, with much higher stakes and liability frameworks that are not adapted for AI (Mello et al. 2024). Even accepting the inevitable tradeoffs between performance and explainability, is it reasonable to expect clinicians to act on the outputs of a model without knowing what variables the model considers or why the output occurred, or before they have built trust in the model through repeated HMT experiences (Henry et al. 2022)? What if acting on model output contradicts clinical assessment and judgement or the standard of care? Trust in and adoption of AI under these considerations will likely be user dependent to a degree, but interpretability and explainability remain relevant considerations.
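
As an illustration, the sketch below applies SHAP (assuming the third-party 'shap' package alongside scikit-learn; data are simulated) to obtain per-patient, per-feature contributions from an otherwise opaque tree ensemble.

```python
# A minimal sketch of post-hoc explanation with SHAP (assumes the
# third-party 'shap' package is installed); data are simulated.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
y = (X[:, 0] - X[:, 2] + rng.normal(size=500) > 0).astype(int)
model = GradientBoostingClassifier().fit(X, y)

# TreeExplainer attributes each prediction to the input features.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])
print(np.round(shap_values, 2))  # per-patient, per-feature contributions
```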

 

D. Delineated purpose and metrics

Reporting all phases of model development has relevance for the healthcare and research communities, but as previously noted, not all reported models currently have clinical utility or have been appropriately vetted for deployment in the clinical space. A clearly defined purpose is essential for evaluating the relevance of study methodology and chosen outcomes. Investigators should therefore clarify their purpose as (1) hypothesis generation (e.g., exploring the performance of AI-based tools for new use cases), (2) novel methods development, or (3) clinical application.

 

Using unsupervised ML to look for novel patterns or developing new methods are important and time-consuming endeavours worthy of dissemination, but they are not necessarily directly clinically applicable. Studies showing that a well-known classification method (e.g., XGBoost) can classify correctly (e.g., high AUROC) in a stand-alone dataset, without a clear pathway towards implementation, have questionable value to the clinician but may inform future research. As an example, a large body of research focuses on addressing heterogeneity of disease and treatment effect in critical care. Notable examples include studies exploring identification and description of phenotypes (i.e., subclasses) within sepsis and the acute respiratory distress syndrome (Calfee et al. 2014; Seymour et al. 2019). While these reports are methodologically innovative and expand our understanding of these disease states, phenotypes may demonstrate significant overlap and within-phenotype variation; variables used to describe phenotypes may not be routinely available, or not available in a useful timeframe, in clinical practice; and isolating a phenotype may not provide additional insight into appropriate management strategies, limiting the current clinical utility of these approaches (Wick et al. 2021).

 

In addition to clearly defining a purpose, investigators must also be clear, intentional, and appropriately comprehensive with reported performance metrics and outcomes. For methods studies, this should include performance metrics relevant to clinicians and appropriate for the use case. For example, for a prediction algorithm, a clinician will be interested not only in the AUROC or accuracy but also in the rate of false negatives (i.e., missed cases) and false positives (i.e., how many false alerts can be expected, leading to unnecessary use of resources or requiring user override of the model and contributing to alert fatigue). Furthermore, a model must be well calibrated to avoid misleading and potentially harmful predictions (Van Calster et al. 2019). Standard metrics of model performance (e.g., AUROC, sensitivity, specificity) interpretable to clinicians may be reported, but a clear description of their clinical implications is warranted. Similarly, if the defined purpose is clinical application, reporting of model performance is necessary but not sufficient; as with a standard clinical trial, outcomes must reflect efficacy on a patient-centred outcome and safety of the model in prospective implementation in a well-defined population.
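
The sketch below (simulated predictions, arbitrary 0.5 alert threshold) illustrates how these clinician-facing quantities, including missed cases, false alerts, and a simple calibration check, can be derived from standard model outputs.

```python
# A minimal sketch of clinician-facing metrics at a chosen alert
# threshold, plus a simple calibration check; predictions are simulated.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(4)
y_true = rng.integers(0, 2, 1000)
y_prob = np.clip(0.3 * y_true + rng.random(1000) * 0.7, 0, 1)

alerts = (y_prob >= 0.5).astype(int)          # hypothetical alert threshold
tn, fp, fn, tp = confusion_matrix(y_true, alerts).ravel()
print("sensitivity:", round(tp / (tp + fn), 2))   # missed cases = fn
print("specificity:", round(tn / (tn + fp), 2))   # false alerts = fp
print("false alerts per 1000 patients:", fp)

# Calibration: do predicted risks match observed event rates?
obs, pred = calibration_curve(y_true, y_prob, n_bins=5)
print(np.round(np.c_[pred, obs], 2))  # mean predicted vs observed per bin
```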

 

E. Ethical

Clinicians are expected to uphold the foundational bioethical principles of healthcare in practice, and AI-based tools must be held to the same standard. The potential ethical concerns associated with implementation of AI in healthcare are well described and extensively discussed elsewhere (World Health Organization 2021; World Health Organization 2025). It is imperative that clinicians be aware of these concerns and that mitigation strategies be undertaken and described in the design and development of AI-based tools, with clear evaluation of how the technology mitigates bias and operates within legal and ethical frameworks (Cohen et al. 2025). It is also imperative that patients be protected and informed (US Office of Science and Technology Policy 2022). A non-comprehensive list of ethical concerns with real-world examples is provided in Table 2.

 

 

 

F. F.A.I.R. and standardised reporting

When large clinical trials are conducted, the data generated can live a second life through post-hoc and patient-level analyses (Calfee et al. 2014; Seymour et al. 2019). Reporting studies with appropriate methodologic rigour and following the Findable, Accessible, Interoperable, and Reusable (F.A.I.R.) guiding principles for data management is key to supporting collaboration and discovery and maximising the value of data (Wilkinson et al. 2016). This consideration is nowhere more pertinent than in the development of AI-based tools, where generation of datasets is expensive and time consuming and large amounts of high-quality data are necessary for model training and testing.

 

Development of large research-ready databases such as the Medical Information Mart for Intensive Care (MIMIC) and the eICU Collaborative Research Database (eICU-CRD) requires standardisation and transfer of data between institutions (Johnson et al. 2023; Pollard et al. 2018). Common data models (CDMs) create a shared language and data format and are critical for operationalising interoperability and reusability of data. Examples of CDMs frequently utilised in critical care research are the Common Longitudinal ICU Data Format (CLIF) (Rojas et al. 2025; CLIF Consortium 2025) and the Observational Medical Outcomes Partnership (OMOP) (Observational Health Data Science and Informatics 2025). Development and operationalisation of CDMs specific to ICU medications is an ongoing area of interest, with investigators hoping to address this significant shortcoming of current CDMs given the importance of medications to patient outcomes (Sikora et al. 2024). CDMs rely on the use of standardised vocabularies such as the C2D2 critical care data dictionary (Murphy et al. 2025), RxNorm, and International Classification of Diseases (ICD) codes. Additionally, standards for exchanging electronic health information such as Fast Healthcare Interoperability Resources (FHIR) can facilitate real-time data exchange and mitigate barriers to sharing data between health systems (Duda et al. 2022; HL7 FHIR 2025). *Federated learning* is an emerging approach in which data remain decentralised at each institution, reducing barriers related to data privacy and regulatory restrictions on transfer between institutions (Rieke et al. 2020).
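
A minimal sketch of the federated averaging (FedAvg) idea follows, using a plain logistic-regression update on simulated data from three hypothetical sites; only model weights, never patient-level records, leave each site.

```python
# A minimal sketch of federated averaging (FedAvg): model weights, not
# patient-level data, leave each site. Sites and data are simulated.
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """Plain logistic-regression gradient steps on one site's data."""
    w = weights.copy()
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(5)
sites = []
for _ in range(3):  # three hypothetical hospitals; data never pooled
    X = rng.normal(size=(200, 4))
    y = (X[:, 0] + rng.normal(size=200) > 0).astype(float)
    sites.append((X, y))

global_w = np.zeros(4)
for _ in range(10):
    local_ws = [local_update(global_w, X, y) for X, y in sites]
    sizes = [len(y) for _, y in sites]
    # A size-weighted average of site updates forms the new global model.
    global_w = np.average(local_ws, axis=0, weights=sizes)
print(np.round(global_w, 2))
```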

 

Studies of AI-based tools must also be reported with transparency and rigour appropriate for the phase of development and with necessary and sufficient detail for clinicians to complete the thorough vetting process vital to patient safety and to draw conclusions. Established reporting frameworks for a range of study designs have been developed to standardise the approach to reporting, ensuring not only that investigators are aware of necessary components but that evaluators can appropriately critique methodology and judge the merit of the intervention (von Elm et al. 2007; Tripathi et al. 2025; Vasey et al. 2022; Tejani et al. 2024; Norgeot et al. 2020; Liu et al. 2020; Hernandez-Boussard et al. 2020; Gallifant et al. 2025; Cruz Rivera et al. 2020; Collins et al. 2024). These standards are available through the EQUATOR Network.

 

 

Use-Case: Large Language Models (LLMs) for Medication Management

Traditionally, LLMs have been evaluated using a standard set of metrics that assess the quality of generated text, including fluency, coherence, accuracy, and relevance. Such metrics include perplexity, accuracy, F1-score, ROUGE score, Bilingual Evaluation Understudy (BLEU) score, Metric for Evaluation of Translation with Explicit Ordering (METEOR) score, question answering metrics, sentiment analysis metrics, named entity recognition metrics, and contextualised word embeddings (Canales et al. 2021). While these metrics are automated, support rapid evaluation, and have underpinned the development of today's advanced LLMs, their clinical relevance is substantially lacking, particularly given the risk of "hallucinations" (i.e., output that is factually incorrect or misleading) and the challenge of training LLMs to use standard resources for certain answers (e.g., drug dosing) while allowing for novel connections and applications (e.g., re-purposing drugs for new disease states). There are few things more dangerous than an overly confident clinician or tool: teaching LLMs to favour Socrates' maxim that "the only true wisdom is in knowing you know nothing" over a confident (but wrong) answer is likely essential for safe use of AI. A necessary step for the use of LLMs in the clinical setting is establishing the benchmarks that clinicians would expect of a tool being used for direct patient care. Figure 4 illustrates a progression of study phases with the expected purposes of those studies. Benchmarks used will be specific to the study phase and can include standard metrics like AUROC, accuracy, and consistency, but can also include elements from implementation science like the Reach, Effectiveness, Adoption, Implementation, and Maintenance (RE-AIM) framework and others.
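
As a purely hypothetical illustration of a clinically oriented benchmark, the sketch below scores dosing answers against a reference table and explicitly rewards abstention over a confident wrong answer; the reference entries, questions, and model outputs are all invented.

```python
# A hypothetical sketch of a phase-appropriate LLM benchmark for drug
# dosing: exact-match accuracy plus credit for abstaining rather than
# answering wrongly. Reference table and model outputs are invented.
REFERENCE = {"vancomycin loading dose": "25-30 mg/kg"}

def score(question: str, answer: str) -> str:
    truth = REFERENCE.get(question)
    if answer.strip().lower() in {"i don't know", "uncertain"}:
        return "abstained"  # safer than a confident wrong answer
    return "correct" if answer == truth else "confidently wrong"

outputs = {"vancomycin loading dose": "15 mg/kg"}  # hypothetical LLM output
for q, a in outputs.items():
    print(q, "->", score(q, a))
```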

 

 

 

Conclusion

AI is a transformative technology that has the potential to improve healthcare delivery; however, we are at the earliest stages of measuring that power in humane, clinically relevant ways. Defining that measure, more than almost any other task, is the first step for AI in healthcare. The ABCDEF framework provides a structured approach for clinicians to judge whether AI tools truly improve patient care.

 

Conflict of Interest

None.


References:

Abgrall G, Holder AL, Chelly Dagdia Z, Zeitouni K, Monnet X. Should AI models be explainable to clinicians? Crit Care. 2024;28(1):301.

Ayers JW, Desai N, Smith DM. Regulate artificial intelligence in health care by prioritizing patient outcomes. JAMA. 2024;331(8):639-640.

Bedoya AD, Economou-Zavlanos NJ, Goldstein BA, et al. A framework for the oversight and local deployment of safe and high-quality prediction models. J Am Med Inform Assoc. 2022;29(9):1631-1636.

Boy GA, Morel C. The machine as a partner: human-machine teaming design using the PRODEC method. Work. 2022;73(s1):S15-S30.

Calfee CS, Delucchi K, Parsons PE, Thompson BT, Ware LB, Matthay MA. Subphenotypes in acute respiratory distress syndrome: latent class analysis of data from two randomised controlled trials. Lancet Respir Med. 2014;2(8):611-620.

Canales L, Menke S, Marchesseau S, D’Agostino A, Del Rio-Bermudez C, Taberna M, Tello J. Assessing the performance of clinical natural language processing systems: development of an evaluation methodology. JMIR Med Inform. 2021;9(7):e20492.

Chase A, Most A, Sikora A, et al. Evaluation of large language models' ability to identify clinically relevant drug-drug interactions and generate high-quality clinical pharmacotherapy recommendations. Am J Health Syst Pharm. 2025;77(19):1556-1570.

Chen S, Kann BH, Foote MB, Aerts HJWL, Savova GK, Mak RH, Bitterman DS. Use of artificial intelligence chatbots for cancer treatment information. JAMA Oncol. 2023;9(10):1459-1462.

CLIF Consortium. CLIF: Common Longitudinal ICU data format. Available from: https://clif-consortium.github.io/website/

Cohen IG, Ajunwa I, Parikh RB. Medical AI and clinician surveillance: the risk of becoming quantified workers. N Engl J Med. 2025;392(23):2289-2291.

Collins GS, Moons KGM, Dhiman P, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. 2024;385:e078378.

Cruz Rivera S, Liu X, Chan AW, Denniston AK, Calvert MJ. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nat Med. 2020;26(9):1351-1363.

Duda SN, Kennedy N, Conway D, Cheng AC, Nguyen V, Zayas-Cabán T, Harris PA. HL7 FHIR-based tools and initiatives to support clinical research: a scoping review. J Am Med Inform Assoc. 2022;29(9):1642-1653.

Eriksen AV, Möller S, Ryg J. Use of GPT-4 to diagnose complex clinical cases. NEJM AI. 2024;1(1):AIp2300031.

Esteva A, Robicquet A, Ramsundar B, et al. A guide to deep learning in healthcare. Nat Med. 2019;25(1):24-29.

EU Artificial Intelligence Act. Available from: https://artificialintelligenceact.eu/

Fleuren LM, Thoral P, Shillan D, Ercole A, Elbers PWG; Right Data Right Now Collaborators. Machine learning in intensive care medicine: ready for take-off? Intensive Care Med. 2020;46(7):1486-1488.

Gallifant J, Afshar M, Ameen S, et al. The TRIPOD-LLM reporting guideline for studies using large language models. Nat Med. 2025;31(1):60-69.

Gilbert S, Harvey H, Melvin T, Vollebregt E, Wicks P. Large language model AI chatbots require approval as medical devices. Nat Med. 2023;29(10):2396-2398.

Heinz MV, Mackin DM, Trudeau BM, et al. Randomized trial of a generative AI chatbot for mental health treatment. NEJM AI. 2025;2(4):AIoa2400802.

Henry KE, Kornfield R, Sridharan A, et al. Human-machine teaming is key to AI adoption: clinicians’ experiences with a deployed machine learning system. NPJ Digit Med. 2022;5(1):97.

Hernandez-Boussard T, Bozkurt S, Ioannidis JPA, Shah NH. MINIMAR (minimum information for medical AI reporting): developing reporting standards for artificial intelligence in health care. J Am Med Inform Assoc. 2020;27(12):2011-2015.

HL7 FHIR. HL7 FHIR Foundation: enabling health interoperability through FHIR. Available from: https://fhir.org/

International Medical Device Regulators Forum. Available from: https://www.imdrf.org/

International Medical Device Regulators Forum. “Software as a Medical Device”: possible framework for risk categorization and corresponding considerations. 2014. Available from: https://www.imdrf.org/sites/default/files/docs/imdrf/final/technical/imdrf-tech-140918-samd-framework-risk-categorization-141013.pdf

International Medical Device Regulators Forum. Software as a medical device (SaMD): application of quality management system. 2015. Available from: https://www.imdrf.org/sites/default/files/docs/imdrf/final/technical/imdrf-tech-151002-samd-qms.pdf

International Medical Device Regulators Forum. Software as a medical device (SaMD): clinical evaluation. 2017. Available from: https://www.imdrf.org/sites/default/files/docs/imdrf/final/technical/imdrf-tech-170921-samd-n41-clinical-evaluation_1.pdf

International Medical Device Regulators Forum. Machine learning-enabled medical devices: key terms and definitions. 2022. Available from: https://www.imdrf.org/sites/default/files/2022-05/IMDRF%20AIMD%20WG%20Final%20Document%20N67.pdf

International Medical Device Regulators Forum. Good machine learning practice for medical device development: guiding principles. 2025. Available from: https://www.imdrf.org/sites/default/files/2025-02/IMDRF_AIML%20WG_GMLP_N88%20Final.pdf

Johnson AEW, Bulgarelli L, Shen L, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data. 2023;10(1):1.

Joshi G, Jain A, Araveeti SR, Adhikari S, Garg H, Bhandari M. FDA-approved artificial intelligence and machine learning (AI/ML)-enabled medical devices: an updated landscape. Electronics. 2024;13(3):498.

Kandi V, Vadakedath S. Clinical trials and clinical research: a comprehensive review. Cureus. 2023;15(2):e35077.

Knaus WA, Draper EA, Wagner DP, Zimmerman JE. APACHE II: a severity of disease classification system. Crit Care Med. 1985;13(10):818-829.

Lee JT, Moffett AT, Maliha G, Faraji Z, Kanter GP, Weissman GE. Analysis of devices authorized by the FDA for clinical decision support in critical care. JAMA Intern Med. 2023;183(12):1399-1401.

Liu X, Cruz Rivera S, Moher D, Calvert MJ, Denniston AK. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med. 2020;26(9):1364-1374.

Marwaha JS, Kvedar JC. Crossing the chasm from model performance to clinical impact: the need to improve implementation and evaluation of AI. NPJ Digit Med. 2022;5(1):25.

Masic I, Miokovic M, Muhamedagic B. Evidence based medicine: new approaches and challenges. Acta Inform Med. 2008;16(4):219-225.

McDermott P, Dominguez C, Kasdaglis N, Ryan M, Trahan I, Nelson A. Human-machine teaming systems engineering guide. MITRE; 2018. Available from: https://www.mitre.org/sites/default/files/2021-11/prs-17-4208-human-machine-teaming-systems-engineering-guide.pdf

Mello MM, Guha N. Understanding liability risk from using health care artificial intelligence tools. N Engl J Med. 2024;390(3):271-278.

Meskó B, Topol EJ. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. NPJ Digit Med. 2023;6(1):120.

Muehlematter UJ, Bluethgen C, Vokinger KN. FDA-cleared artificial intelligence and machine learning-based medical devices and their 510(k) predicate networks. Lancet Digit Health. 2023;5(9):e618-e626.

Murphy DJ, Anderson W, Heavner SH, et al. Development of a core critical care data dictionary with common data elements to characterize critical illness and injuries using a modified Delphi method. Crit Care Med. 2025;51(2):1-12.

Nelson SD, Walsh CG, Olsen CA, McLaughlin AJ, LeGrand JR, Schutz N, Lasko TA. Demystifying artificial intelligence in pharmacy. Am J Health Syst Pharm. 2020;77(19):1556-1570.

Norgeot B, Quer G, Beaulieu-Jones BK, et al. Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist. Nat Med. 2020;26(9):1320-1324.

Observational Health Data Science and Informatics. Standardized data: the OMOP common data model. Available from: https://www.ohdsi.org/data-standardization/

OpenAI. Terms of use. Available from: https://openai.com/en-GB/policies/row-terms-of-use/

Park Y, Jackson GP, Foreman MA, Gruen D, Hu J, Das AK. Evaluating artificial intelligence in medicine: phases of clinical research. JAMIA Open. 2020;3(3):326-331.

Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG, Badawi O. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Sci Data. 2018;5(1):180178.

Rajkomar A, Dean J, Kohane I. Machine learning in medicine. N Engl J Med. 2019;380(14):1347-1358.

Rieke N, Hancox J, Li W, et al. The future of digital health with federated learning. NPJ Digit Med. 2020;3:119.

Rojas JC, Lyons PG, Chhikara K, et al. A common longitudinal intensive care unit data format (CLIF) for critical illness research. Intensive Care Med. 2025;51(3):556-569.

Seymour CW, Kennedy JN, Wang S, et al. Derivation, validation, and potential treatment implications of novel clinical phenotypes for sepsis. JAMA. 2019;321(20):2003-2017.

Sikora A, Keats K, Murphy DJ, et al. A common data model for the standardization of intensive care unit medication features. JAMIA Open. 2024;7(2):ooae033.

Tejani AS, Klontzas ME, Gatti AA, Mongan JT, Moy L, Park SH, Kahn CE Jr. Checklist for artificial intelligence in medical imaging (CLAIM): 2024 update. Radiol Artif Intell. 2024;6(4):e240300.

Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930-1940.

Tripathi S, Alkhulaifat D, Doo FX, Rajpurkar P, McBeth R, Daye D, Cook TS. Development, evaluation, and assessment of large language models (DEAL) checklist: a technical report. NEJM AI. 2025;2(6):AIp2401106.

US Food and Drug Administration. Federal Food, Drug, and Cosmetic Act. Available from: https://www.govinfo.gov/content/pkg/COMPS-973/pdf/COMPS-973.pdf

US Food and Drug Administration. Artificial intelligence-enabled medical devices. Available from: https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-enabled-medical-devices

US Food and Drug Administration. Artificial intelligence-enabled device software functions: lifecycle management and marketing submission recommendations. 2025. Available from: https://www.fda.gov/media/184856/download

US Food and Drug Administration. Executive summary for the Digital Health Advisory Committee meeting: total product lifecycle considerations for generative AI-enabled devices. 2024. Available from: https://www.fda.gov/media/182871/download

US Food and Drug Administration. Marketing submission recommendations for a predetermined change control plan for artificial intelligence-enabled device software functions. 2024. Available from: https://www.fda.gov/media/166704/download

US Food and Drug Administration. Proposed regulatory framework for modifications to artificial intelligence/machine learning (AI/ML)-based software as a medical device (SaMD). 2019. Available from: https://www.fda.gov/media/122535/download

US Food and Drug Administration. Transparency for machine learning-enabled medical devices: guiding principles. 2024. Available from: https://www.fda.gov/medical-devices/software-medical-device-samd/transparency-machine-learning-enabled-medical-devices-guiding-principles

US Office of Science and Technology Policy. Blueprint for an AI Bill of Rights: making automated systems work for the American people. 2022. Available from: https://bidenwhitehouse.archives.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf

Van Calster B, McLernon DJ, van Smeden M, et al. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17(1):230.

Van Norman GA. Drugs, devices, and the FDA: part 2: an overview of approval processes: FDA approval of medical devices. JACC Basic Transl Sci. 2016;1(4):277-287.

Varkey B. Principles of clinical ethics and their application to practice. Med Princ Pract. 2021;30(1):17-28.

Vasey B, Nagendran M, Campbell B, et al. Reporting guideline for the early stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. BMJ. 2022;377:e070904.

Vincent JL, Moreno R, Takala J, et al. The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure. Intensive Care Med. 1996;22(7):707-710.

von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP. Strengthening the reporting of observational studies in epidemiology (STROBE) statement: guidelines for reporting observational studies. BMJ. 2007;335(7624):806-808.

Wardi G, Owens R, Josef C, Malhotra A, Longhurst C, Nemati S. Bringing the promise of artificial intelligence to critical care: what the experience with sepsis analytics can teach us. Crit Care Med. 2023;51(8):985-991.

Warraich HJ, Tazbaz T, Califf RM. FDA perspective on the regulation of artificial intelligence in health care and biomedicine. JAMA. 2025;333(3):241-247.

Weissman GE, Mankowitz T, Kanter GP. Unregulated large language models produce medical device-like output. NPJ Digit Med. 2025;8(1):148.

Wick KD, McAuley DF, Levitt JE, et al. Promises and challenges of personalized medicine to guide ARDS therapy. Crit Care. 2021;25(1):404.

Wilkinson MD, Dumontier M, Aalbersberg IJ, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3:160018.

Windecker D, Baj G, Shiri I, Kazaj PM, Kaesmacher J, Gräni C, Siontis GCM. Generalizability of FDA-approved AI-enabled medical devices for clinical use. JAMA Netw Open. 2025;8(4):e258052.

World Health Organization. Ethics and governance of artificial intelligence for health. 2021. Available from: https://iris.who.int/bitstream/handle/10665/341996/9789240029200-eng.pdf?sequence=1

World Health Organization. Ethics and governance of artificial intelligence for health: guidance on large multi-modal models. 2025. Available from: https://iris.who.int/bitstream/handle/10665/375579/9789240084759-eng.pdf?sequence=1

Wu J, You H, Du J. AI generations: from AI 1.0 to AI 4.0. Front Artif Intell. 2025;8:1585629.

Yang KL, Tobin MJ. A prospective study of indexes predicting the outcome of trials of weaning from mechanical ventilation. N Engl J Med. 1991;324(21):1445-1450.