Varvara Bogdanova: Welcome to today’s webinar. We are delighted to co‑host this session with our partner AWS, and we will be discussing AI‑driven medical devices. In particular, we will talk about the validation of the AI component in Software as a Medical Device in the context of evolving European regulations, including MDR, GDPR, and of course, the European AI Act.
Varvara Bogdanova: Before we jump into the discussion, let’s start with a short round of introductions. My name is Varvara Bogdanova, and I’m an Innovations Manager at DataArt within the Healthcare and Life Sciences Practice.
Varvara Bogdanova: I will be moderating today’s session, and I’m very happy to welcome our speakers. I’ll let you introduce yourselves.
Sara Jaworska: Hello, everyone, and thank you for having me today. My name is Sara Jaworska, and I am a Quality and Regulatory Affairs Senior Manager at DataArt.
Varvara Bogdanova: Thank you, Sara.
Ian Sutcliffe: And thanks for inviting me to this—very exciting. My name is Ian Sutcliffe. I am a Principal Tech Strategist at AWS, and I support all of our healthcare and life sciences customers.
Varvara Bogdanova: Perfect. Thank you.
Andrei Sorokin: Thank you for the invitation. I’m Andrei Sorokin, Solutions Architect for Healthcare and Life Sciences at DataArt.
Varvara Bogdanova: Thank you, Andrei. Before I move to the first question, I would like to remind our audience: if you have any questions or comments, don’t hesitate to post them in the comment section under our webinar.
Varvara Bogdanova: Okay, let’s move on with the first question. Why does validation of the AI component in medical devices matter so much? Sara, I’ll let you be the first to address this.
Sara Jaworska: Thank you, Varvara. AI validation matters for three main reasons. First, from a clinical point of view: AI influences real medical decisions, so errors directly impact patients depending on the device’s risk classification. Second, for technical reasons: AI isn’t static software and must be tested for robustness, bias, and real‑world performance. We need to understand what we can expect from the AI device and identify root causes of anything that might go wrong. Third, from a business perspective: a recent report shows links between AI market introduction and the number of recalls, including devices that didn’t undergo proper validation. Although the report is based on US data, the lessons are relevant globally. To sum up, validation protects the patients, the technology, and the business.
Varvara Bogdanova: Thank you very much, Sara. Now that we know why it’s important, let’s talk about the challenges. What are the biggest challenges medical device manufacturers face today when developing and validating AI‑based devices? Ian, based on your work with AWS clients, where do companies struggle most?
Ian Sutcliffe: Absolutely. Across the customers I support, I see several themes. First, regulatory guidance hasn’t fully kept pace with technological advancement. Some agencies are more progressive than others. There is machine learning‑related guidance from FDA, but not yet for large language models, foundation models, or agentic AI. Guidance is behind the curve, creating uncertainty. Second, requirements vary by region—FDA, EMA, DKMA—there is overlap but also key differences. Third, documentation and reporting requirements remain very manual and labor‑intensive. Internally, manufacturers rely on document‑centric quality systems, which presents opportunities for automation. On the technical side, cybersecurity risks increase as devices and components become more connected. The rapid evolution of technology means teams may be unaware of new capabilities in tools like SageMaker or Bedrock that address regulatory risks. The model landscape also moves quickly. Traditional approaches like “locking” models don’t work when third‑party models continuously evolve. Organizations must accept change and plan for it. Collectively, these issues create a shortage of specialists who understand both cloud and compliance—this is why partnerships with companies like DataArt are important.
Varvara Bogdanova: Thank you, Ian. Sara, from the regulatory and data perspective, where do you see the challenges?
Sara Jaworska: I fully agree with Ian. To avoid repeating him, I’ll focus on data as one of the major challenges. High‑quality, representative datasets are essential for clinical‑grade AI. Poor data may lead to biased models, unreliable predictions, or nonsensical outputs in rare edge cases. Companies must demonstrate where data came from, ensure lawful data processing under GDPR, and confirm the data’s relevance to the intended clinical use. This requires strict data governance and lifecycle controls. Without it, even technically strong AI may fail clinical evaluation or CE‑marking. Another challenge is managing the full AI lifecycle in a compliant way. Medical device regulations already impose many obligations, and AI adds more. Building AI in healthcare is not a one‑time activity and does not end with IEC 62304. On top of existing requirements like ISO 13485, companies must implement AI‑specific frameworks: data acquisition and preprocessing with GDPR and MDR compliance from the start; model training, testing, and validation using good practices, separate datasets, and stress testing on edge cases; deployment using secure cloud infrastructure; and post‑market monitoring, including performance tracking, detecting risks, and revalidating the model. Many organizations underestimate how operationally demanding continuous evidence generation is. Evidence must always reflect the current version of the AI system. As the model evolves, so must the evidence. Without automation, teams quickly fall back to document‑heavy processes that don’t scale.
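The separate-dataset requirement Sara mentions is easy to break when pipelines are re-run on growing data. A minimal sketch of one common safeguard, a deterministic hash-based split that keeps held-out records permanently held out; the identifiers and split fractions here are illustrative, not from the discussion:

```python
import hashlib

def assign_split(record_id: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Deterministically assign a record to train/val/test by hashing its ID.

    Hashing a stable identifier (not the clinical content) keeps the split
    reproducible across pipeline runs, so a record can never drift from the
    held-out test set into the training data.
    """
    digest = hashlib.sha256(record_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # roughly uniform in [0, 1)
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"

splits = [assign_split(f"patient-{i}") for i in range(10_000)]
print({name: splits.count(name) for name in ("train", "val", "test")})
```

Because the assignment depends only on the record ID, adding new records later never reshuffles existing ones.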
Varvara Bogdanova: Thank you, Sara. Ian, how does AWS support highly regulated end‑to‑end workflows in practice? How can cloud services help teams meet quality requirements while operating more efficiently?
Ian Sutcliffe: Regulators are increasingly taking a lifecycle approach to AI systems, which aligns with the evolution from DevOps to MLOps, and now to FMOps and LLMOps. AWS services like SageMaker and Bedrock support the entire lifecycle—development, deployment, monitoring, drift detection, retraining, and controlled change—with built‑in automation and guardrails. A major area of interest is automated testing. Deterministic systems are straightforward, but LLMs are non‑deterministic, meaning testing must consider a range of acceptable outputs based on risk. Methods include human‑in‑the‑loop and LLM‑as‑a‑judge. Human validation benchmarks the judge model. In production, sampling methods can evaluate ongoing performance. Change control is also critical. Traditional approaches that avoid change won’t work, because models evolve and real‑world data changes. The FDA’s Predetermined Change Control Plans acknowledge that change must be planned for. AWS helps shift organizations from manual, document‑centric processes to data‑centric automated processes, generating IT records that demonstrate control effectiveness.
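The judge-plus-production-sampling approach Ian outlines could be sketched roughly as below. The `judge` function is a stand-in for a real judge model (which, as he notes, must itself be benchmarked against human reviewers); the rubric, sample rate, and pass threshold are all illustrative assumptions:

```python
import random

def judge(question: str, answer: str) -> float:
    """Stubbed rubric-based judge. In practice this would call a judge model
    whose scoring is validated against human reviewers; here it is a simple
    keyword check purely for illustration."""
    required = {"dosage", "contraindication"}
    hits = sum(1 for term in required if term in answer.lower())
    return hits / len(required)  # score in [0, 1]

def sample_and_score(production_log, sample_rate=0.2, pass_threshold=0.5, seed=42):
    """Sample a fraction of production Q/A pairs and score them with the judge."""
    rng = random.Random(seed)
    sampled = [qa for qa in production_log if rng.random() < sample_rate]
    scores = [judge(q, a) for q, a in sampled]
    pass_rate = sum(s >= pass_threshold for s in scores) / max(len(scores), 1)
    return pass_rate, len(sampled)

log = [("What should I check?", "Verify the dosage and any contraindication.")] * 50
pass_rate, n = sample_and_score(log)
print(f"judged {n} sampled answers, pass rate {pass_rate:.2f}")
```

Because the outputs are non-deterministic, the gate is a pass rate over a sample rather than an exact-match check, with the acceptable rate set by risk class.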
Varvara Bogdanova: Thank you, Ian. Let’s continue with evidence. What does good AI evidence look like today, and how much data is enough?
Ian Sutcliffe: I believe in separating IT quality from product quality. IT teams focus on engineering best practices and collaborate with product quality teams to determine which IT records qualify as evidence. AWS tools generate IT records showing that good practices are followed, but the manufacturer’s QMS defines what is required for regulatory compliance. Andrei can speak from the manufacturer side.
Andrei Sorokin: There is no fixed number of samples that solves validation. Regulators don’t expect a magic number—they expect a justified, risk‑based rationale. In practice, we often start with a few hundred high‑quality evaluation samples—human‑labeled gold datasets—to detect major failures early. But covering multiple languages, edge cases, and diverse populations can push this number into the thousands. Metrics like accuracy, recall, precision, or faithfulness are only statistical estimates. Narrowing confidence intervals is expensive: halving an interval’s width requires roughly four times as many samples. This makes evaluation costly due to the need for expert labeling or LLM‑assisted annotation. The bottleneck today has shifted from training to evaluation dataset preparation.
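Andrei's point about confidence intervals can be made concrete with the normal approximation for a proportion: the interval's half-width shrinks only with the square root of the sample size. The accuracy figure and target widths below are illustrative:

```python
import math

def accuracy_ci_halfwidth(accuracy: float, n: int, z: float = 1.96) -> float:
    """95% normal-approximation half-width for an accuracy estimate on n samples."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)

def samples_for_halfwidth(accuracy: float, halfwidth: float, z: float = 1.96) -> int:
    """Samples needed to shrink the half-width to the given target."""
    return math.ceil((z / halfwidth) ** 2 * accuracy * (1 - accuracy))

# A few hundred samples give a rough estimate of a 90%-accurate model...
print(round(accuracy_ci_halfwidth(0.90, 300), 3))  # → 0.034  (i.e. roughly ±3.4%)
# ...but pinning accuracy down to ±1% already needs thousands of labeled samples.
print(samples_for_halfwidth(0.90, 0.01))           # → 3458
```

Since every one of those samples needs expert or judge-assisted labeling, this is exactly why evaluation dataset preparation becomes the bottleneck.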
Varvara Bogdanova: Thank you, Andrei. Very insightful. Sara, from the regulatory standpoint under MDR, GDPR, and the AI Act, what does good AI evidence look like?
Sara Jaworska: Good AI validation evidence is risk‑based, complete, and up to date. Regulators don’t expect every component of an AI system to be documented or tested equally; they expect us to identify high‑risk areas and test them more extensively. High‑risk areas require more detailed documentation, larger datasets, and stronger statistical justification. Evidence must reflect the current model version. AI‑enabled products must meet the requirements of several frameworks: the AI Act, MDR/IVDR, GDPR, and sometimes the Data Act. Many concepts overlap: technical documentation, risk management, intended use, data governance, transparency, and traceability. For example, data governance and data quality are central to the AI Act and aligned with GDPR. Risk management and post‑market monitoring appear in MDR and the AI Act. Transparency and explainability appear across all frameworks. Automation like LLM‑as‑a‑judge can support workflows, but cannot replace human oversight. Without meaningful human review, such evidence is not accepted by regulators.
Varvara Bogdanova: Thank you, Sara. Now, our last main question: how can cloud services like AWS, together with partners like DataArt, support secure and compliant AI validation workflows? Ian?
Ian Sutcliffe: AWS provides the technology, but we are one step removed from medical device delivery and cannot provide regulatory advice. We offer engineering best practices and automated capabilities. DataArt provides the regulatory and product‑quality expertise. Together, we complement each other and support manufacturers more effectively.
Varvara Bogdanova: Andrei, your perspective from DataArt?
Andrei Sorokin: We build solutions on AWS services, which include model evaluation, monitoring, and explainability. For LLMs, Amazon Bedrock supports managed evaluation for bias, toxicity, and correctness using both human‑in‑the‑loop and LLM‑as‑a‑judge. Bedrock Guardrails prevent unsafe outputs, redact PII, and restrict topics. Grounding responses helps reduce hallucinations. Beyond LLMs, SageMaker Clarify provides numerous explainability and robustness metrics. These artifacts are often expected by regulators. For post‑market surveillance, automatic detection of drift and triggering revalidation or retraining is essential. Combining Model Monitor, Clarify, and custom metrics enables reliable, automated pipelines from evaluation to redeployment.
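A rough sketch of the kind of automated evaluation-to-redeployment gate Andrei describes. In a real pipeline the metrics would come from Model Monitor and Clarify reports, and the thresholds would be risk-based values defined in the QMS; every name and number here is an illustrative assumption:

```python
# Illustrative, risk-based release thresholds. In practice these belong in the
# quality management system, not hard-coded, and metrics come from monitoring
# reports rather than a hand-built dict.
THRESHOLDS = {"accuracy": 0.92, "recall": 0.95, "max_bias_disparity": 0.05}

def release_gate(metrics: dict) -> tuple[bool, list[str]]:
    """Return (approved, reasons). Any failed check blocks redeployment."""
    failures = []
    if metrics["accuracy"] < THRESHOLDS["accuracy"]:
        failures.append("accuracy below threshold")
    if metrics["recall"] < THRESHOLDS["recall"]:
        failures.append("recall below threshold")
    if metrics["bias_disparity"] > THRESHOLDS["max_bias_disparity"]:
        failures.append("bias disparity above threshold")
    return (not failures, failures)

ok, reasons = release_gate({"accuracy": 0.94, "recall": 0.93, "bias_disparity": 0.02})
print(ok, reasons)  # recall fails, so the candidate is held for human review
```

The point of the gate is that every blocked or approved release leaves a machine-generated record, the data-centric evidence discussed earlier, rather than a manually authored document.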
Varvara Bogdanova: Thank you, Andrei. Let’s go to questions from the audience. First: will manufacturers need two separate notified bodies to certify medical AI?
Sara Jaworska: We don’t know yet. No notified body is currently designated to certify AI systems in Europe. I hope a single notified body will handle everything, but we must wait for clarification.
Varvara Bogdanova: Thank you. Another question: how much time do manufacturers have to prepare for the AI Act?
Sara Jaworska: The AI Act rollout is phased. Most high‑risk medical device manufacturers must comply by August 2027. I strongly recommend starting now because the requirements are extensive.
Varvara Bogdanova: Thank you. Next question: how do you control and document bias, drift, and robustness post‑deployment? Andrei?
Andrei Sorokin: Data and models evolve. Input distributions change. Populations vary across regions. You must expand datasets, evaluate and label new data, and track metrics. When drift thresholds are exceeded—measured through statistical tests—you retrain and redeploy. Some labeling can be automated, but human oversight remains essential.
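The statistical tests Andrei mentions can be as simple as a two-sample Kolmogorov-Smirnov statistic on a monitored feature. A self-contained sketch; the drift threshold and the synthetic distributions are illustrative, and a real threshold would itself be risk-based and pre-validated:

```python
import bisect

def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of the reference and current samples."""
    ref, cur = sorted(reference), sorted(current)

    def ecdf(sample, x):  # fraction of the (sorted) sample <= x
        return bisect.bisect_right(sample, x) / len(sample)

    return max(abs(ecdf(ref, v) - ecdf(cur, v)) for v in sorted(set(ref) | set(cur)))

DRIFT_THRESHOLD = 0.2  # illustrative; set per feature from validation data

reference = [i / 100 for i in range(100)]      # feature distribution at training time
shifted = [0.5 + i / 200 for i in range(100)]  # production distribution, shifted

stat = ks_statistic(reference, shifted)
needs_retraining = stat > DRIFT_THRESHOLD
print(f"KS statistic {stat:.2f}; retrain: {needs_retraining}")
```

Crossing the threshold would not retrain automatically in a regulated setting; it would open a revalidation task, keeping the human oversight Andrei insists on in the loop.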
Varvara Bogdanova: Thank you. Final question: what prevents an LLM on Bedrock from giving unsafe or hallucinated answers?
Andrei Sorokin: Nothing inherently prevents hallucinations. We mitigate them by grounding responses using relevant factual context (RAG), applying Guardrails for safety and PII protection, and regularly evaluating performance using a strong gold dataset. Without that dataset, you cannot reliably detect hallucinations.
Varvara Bogdanova: Thank you, Andrei. Before we wrap up, I’d like each speaker to share a key takeaway.
Ian Sutcliffe: Focus on establishing a culture of quality, not just compliance. Good engineering practices and automation strengthen compliance naturally.
Sara Jaworska: Plan frameworks early to avoid duplicated work. Integrated management systems reduce silos and improve efficiency.
Andrei Sorokin: Model training and maintenance are ongoing processes. Maintain your evaluation datasets, build reliable ML pipelines, and integrate trusted services to ensure robust, dependable systems.
Varvara Bogdanova: Thank you, everyone—Andrei, Sara, and Ian—for joining today. If you have questions or comments, feel free to contact our speakers. Stay tuned for upcoming webinars. Thank you, and goodbye.