BROWN, PHIL M. | NORTHEASTERN UNIVERSITY | Stackable Trainings in the FAIRification and AI/ML-Readiness of Data with Applications to Environmental Health and Justice With the PROTECT Superfund Research Program and the multinational Observational Health Data Sciences and Informatics (OHDSI) community, we provide stackable training modules for researchers to prepare data for artificial intelligence and machine learning applications and to understand the related ethical issues. The ability to find, combine, and analyze multiple large-scale biomedical datasets to make better and ethical decisions for the future of patients, populations, and health systems is now a necessary skill set for modern analysts. However, most current data analytics courses and workshops focus on deriving or applying modern techniques (such as statistical learning procedures, PyTorch, TensorFlow, neural networks, and other large-scale prediction models) rather than on the steps needed to prepare data for such analyses. Further, the next (and current) generation of biomedical researchers must be cognizant of FAIR principles and prepared to make their data accessible to machines, so that they can fully leverage continued methodological developments to properly analyze large amounts of data across multiple studies, systems, and countries. In addition to a methodologic toolkit, educating the biomedical analyst workforce must include training that builds their ability to locate and store data for future analyses in an automated manner. We provide a suite of stackable modules that add a rich foundation to the existing, robust educational offerings on the applications of artificial intelligence and machine learning (AI/ML) to biomedical data that many trainees already receive. 
Through our close partnerships with the National Institute of Environmental Health Sciences PROTECT Center and the multinational Observational Health Data Sciences and Informatics (OHDSI) community, we train researchers to prepare data for AI and ML applications in a rigorous and reproducible way and to understand the ethical issues around AI and ML, and we offer hands-on training in FAIR principles for storing and accessing such data. These modules prepare researchers for successful careers as data analysts, ready to exploit the power of available AI/ML frameworks. | NIEHS |
BRUCE, MARINO A. | THE UNIVERSITY OF MISSISSIPPI MEDICAL CENTER | Integrating FAIR Guiding Principles into Biomedical Research Training Integrating FAIR principles into research training programs for investigators from diverse backgrounds can advance data-informed interventions to reduce health disparities. There is an urgent need to leverage existing scholarly data to develop data-informed interventions to reduce health disparities. Stakeholders associated with the production and use of data have developed a set of principles to make research data findable, accessible, interoperable, and reusable (FAIR). The FAIR guiding principles can facilitate biomedical advancements by bolstering data labeling and management practices to enable artificial intelligence and machine learning (AI/ML) innovations. The application of FAIR principles addresses challenges associated with data annotation and management and can support greater efficiency and effectiveness of data used in this area. Providing training on the FAIR principles to early-career faculty members from groups under-represented in the biomedical sciences will bolster a diverse scientific workforce capable of understanding and redressing cardiovascular health disparities, such as disparities in obesity and related areas crucial to minority health. The three aims of the proposed project are to (1) apply an Educational Design Research approach to the development, refinement, and finalization of online training modules for biomedical scientists to enhance their knowledge and skills in the competencies needed to make research data FAIR and AI/ML-ready; (2) assemble a multidisciplinary advisory committee consisting of ethicists, legal scholars, policy analysts, biomedical investigators, data scientists, and learning designers to provide feedback; and (3) conduct formative and summative assessments. 
The training modules developed and tested in the proposed project can inform the development of data-informed interventions to prevent obesity, mitigate its current high rates, and slow its projected rise as crucial steps in reducing cardiovascular disease. | NHLBI |
BUTLER-PURRY, KAREN | TEXAS A&M UNIVERSITY | Maximizing Student Development in Data- and Information Science-Related Disciplines for Biomedical Ph.D. Trainees at Texas A&M University and Beyond “Is my data too big, too small, or just right?”: Learning to interpret biomedical data through the data science lens This program aims to provide opportunities for graduate students in biomedical fields to acquire additional learning on topics at the interface of data sciences and biomedical sciences. To accomplish this, we will develop a new curriculum of exportable and shareable training modules and integrated training plans. We will deliver this curriculum through monthly training events for a broad audience of current trainees across the biomedical graduate training programs at Texas A&M University and other institutions. We will perform a rigorous evaluation of the effectiveness of the training and the outputs and outcomes of the learning objectives to inform future amendments to the training curriculum content and delivery methods. The Texas A&M Institute of Data Science is partnering with the university’s Graduate and Professional School to engage broadly across a number of biomedical programs (medical sciences, biomedical sciences, genetics, toxicology, biochemistry and biophysics, and biomedical engineering) that encompass 600-plus current doctoral trainees. The 12-month-long curriculum will be delivered through a combination of in-person and remote delivery options (2- to 4-hour sessions once per month for one year) and will include topics ranging from the basics of the R programming language and statistical foundations to best practices for findable, accessible, interoperable, and reusable (FAIR) data management and algorithmic fairness, as well as data ethics, privacy preservation, confidentiality, and legal and regulatory requirements for biomedical data. 
All sessions will include a blend of preparatory self-study materials (if needed), lecture-style delivery of the theory and principles of the topic, and hands-on exercises focused on development of practical skills and competencies. To maximize trainee engagement with this program, both at Texas A&M University and other domestic and international institutions, the curriculum will be made available to the broadest possible audience through (1) targeted and broad advertisement, (2) delivery through a hybrid approach in a synchronous format, and (3) making the recordings broadly available on the web. Finally, the program will be reimplemented for scalable future, asynchronous delivery through Texas A&M Continuing and Professional Education, as well as for once-a-year synchronous delivery to new cohorts of trainees. | NIGMS |
CASADEVALL, ARTURO | JOHNS HOPKINS UNIVERSITY | BioMAR3's Infectious Disease-Related Case Study Workshops BioMAR3's infectious disease-related case study workshops make biomedical researchers and their data machine learning– and artificial intelligence–R3eady The health problems that humanity faces from current and (re)emerging microbial threats require collaboration, critical systems thinking, and effective communication across the disciplines, particularly across biomedicine and advanced data science fields. For today’s biomedical researchers focused on global public health and pandemics biology, it is no longer sufficient to know the basics of biostatistics. Rather, pre- and postdoctoral trainees (as well as more experienced practitioners) who generate big biomedical datasets need to develop capacities that allow them to harness the potential that artificial intelligence (AI) and machine learning (ML) approaches can bring to their research projects. Yet a lack of skills in the adequate preparation and handling of large datasets, as well as a lack of awareness of the technical and communication gaps between biologists and data scientists, has been a source of considerable errors in research practice. To address the resulting need for advanced, interdisciplinary training, the Johns Hopkins Bloomberg School of Public Health BioMAR3 project will produce a series of authentic, case study-based learning modules offered in the form of open access workshops. They will be developed by a team of active microbiological and biochemical researchers, data scientists, and data management specialists at Johns Hopkins University. 
Committed to interdisciplinary biomedical and AI/ML training in Rigor, Reproducibility and Responsibility (R3), the BioMAR3 modules address five objectives: (1) the introduction to characteristics, opportunities, and uses of AI/ML techniques in the biomedical sciences, with particular emphasis on infectious disease research; (2) the appreciation of the impact of mistakes in biomedical big data preparation, handling, and communication on rigor and reproducibility; (3) the implementation of AI/ML concepts into biomedical big data science; (4) the application of FAIR principles and ethical best practices to data storage and management; and (5) the evaluation of newly established workflow processes that aim to avoid errors and develop strategies for troubleshooting. BioMAR3 effectiveness will be judged by several criteria including the assessment of workshop learning outcomes and the development of capacities for communication and collaboration, as well as performance-level observation of workshop participants' abilities to translate learned skills into real-world laboratory settings. | NIAID |
CRESS, WILLIAM DOUGLAS | H. LEE MOFFITT CANCER CENTER AND RESEARCH INSTITUTE | Cancer Research Workforce Development in FAIR Artificial Intelligence and Machine Learning This project will help develop a workforce with the skills required to make the ever-growing mountains of biomedical data findable, accessible, interoperable, and reusable (FAIR) for artificial intelligence and machine learning applications to improve health care. There is a rapidly growing mountain of clinical and molecular data available for cancer and other diseases. The potential application of artificial intelligence and machine learning (AI/ML) approaches in medicine for data-driven, clinical decision making is compelling. Unfortunately, most of these data are not used to make health care decisions because they are not usable. The success of the application of AI/ML algorithms, especially in the clinical domain, hinges on the availability and quality of data used to develop and test the AI/ML models. Therefore, there is an unmet need for the development of the competencies and skills needed to make biomedical data ready for AI/ML applications. This means making the data FAIR: findable, accessible, interoperable, and reusable. This supplement to our Integrated T32 Program in Cancer Data Science addresses this need by developing a short course in FAIR application that will be distributed in varying formats at three different venues: (1) a short course for Ph.D. students offered through the University of South Florida, which hosts our Ph.D. program in cancer biology, (2) a hands-on workshop for postdoctoral fellows and other early-stage investigators at the Moffitt Cancer Center, and (3) public dissemination of videotaped lectures via the Moffitt Cancer Center YouTube channel. | NCI |
ESQUERRA, RAYMOND M. | SAN FRANCISCO STATE UNIVERSITY | Demystifying Machine Learning and Best Data Practices Workshop Series for Underrepresented STEM Undergraduate and M.S. Researchers Bound for Ph.D. Training Programs Inclusive AI and Machine Learning Modules for the Next Generation of Biomedical Researchers. The purpose of the parent grant (T34GM008574) is to provide biomedical research training to 22 diverse undergraduates at SF State to prepare them for biomedical Ph.D. programs. This proposal will introduce these and other students to FAIR practices (findability, accessibility, interoperability, and reusability) in AI/ML. The modules will be developed focusing on Demystifying Machine Learning, Best Data Practices, and Machine Learning for Biology. These modules will use best pedagogical practices to make content accessible to students who are under-represented in biomedical research and who are from basic science training backgrounds. These modules will therefore help future Ph.D. students become familiar and comfortable with Machine Learning and Data Science, which will help their careers and thus increase diversity and inclusion in the biomedical sciences. An expected outcome of this supplemental activity will be a platform for delivering inclusive and equitable AI/ML training to undergraduate and M.S. students nationwide. | NIGMS |
GRIMES, CATHERINE LEIMKUHLER | UNIVERSITY OF DELAWARE | FAIR and Practical Data Science Training at the Chemistry–Biology Interface Go big at the interface: training chemical biologists in the analysis of large datasets The University of Delaware’s Chemistry–Biology Interface Program has been running for 28 years, and each training year, it strives to evolve to fit the needs of current trainees. With this NOT-OD-21-079 administrative supplement, we are looking forward to collaborating with the University of Delaware’s Data Science Center to expose our trainees to the thought processes of machine learning (ML) and large datasets. The convergence of advances in high-throughput experimental biology and chemistry; the digital transformation of biomedicine; and breakthroughs in artificial intelligence (AI), machine learning, and data science have created an unprecedented opportunity to train the next generation of biomedical scientists. In this CBI T32 administrative supplement, we aim to develop curricula and training activities to provide our T32 trainees with the competencies and skills needed to make biomedical data findable, accessible, interoperable, and reusable (FAIR) and AI/ML-ready. Our goal is to bring awareness and practices to our trainees and faculty mentors so that their data are collected and prepared to support AI/ML applications, with attention to the (1) use of data and metadata standards to make data FAIR; (2) presentation and labelling of data, including noise, uncertainty, and missing data issues; and (3) ethical and social considerations and collaborative team science. We will do this through a series of introductory bootcamps and workshops aimed at helping students learn how to make the most of these new technologies. Trainees will be encouraged to bring their own datasets and to dream big about datasets that they might acquire. | NIGMS |
GUILLEMIN, KAREN J. | UNIVERSITY OF OREGON | Next Generation Sequencing and Biological Imaging in the era of Machine Learning Workshops that prepare researchers to jump the gap between the lab bench and ML/AI-enabled research. In the era of big data, expertise in advanced statistical analyses including machine learning (ML) and artificial intelligence (AI) is increasingly important for application to biological questions. In some cases, these approaches can be applied by biologists themselves, but in many cases collaboration between biological researchers and ML/AI experts is needed. For these collaborations to be successful, streamlined communication between experts in research domains and ML/AI is essential. To optimize communication, biologists need a foundational understanding of ML/AI algorithms, how they can be appropriately applied to biological datasets, and how biological experiments need to be designed for these ML/AI approaches. Here we propose an intensive three-week workshop designed to teach trainees the fundamentals of ML/AI applications to biological data. This workshop will synergize well with other existing courses, workshops, and trainings developed by The University of Oregon Presidential Initiative in Data Science that will also be available to trainees interested in other foundational computational approaches, data management, and expanded trainings in ML/AI. Our proposed workshop will combine lecture components, discussions of recent peer-reviewed literature, and hands-on experience working with real data to train and apply ML/AI algorithms. Week one will cover necessary fundamentals with lectures on ML/AI concepts intermixed with the basics of data manipulation and principles of FAIR (Findable, Accessible, Interoperable, and Reusable) data management. Week two will cover how data from next generation sequencing technologies, including single-cell sequencing, can be analyzed using ML/AI and will include hands-on training. 
Week three will focus on applying ML/AI to image analysis, including training on the analysis and annotation of image data, manipulating and transforming image files, and training a neural network image classifier to automate lab processes. By the end of the workshops, trainees will have the foundational skills needed to collaborate with ML/AI experts, ask new research questions about existing data and explore novel research directions through applications of ML and AI. | NIGMS |
JASPERS, ILONA | THE UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL | The UNC inTelligence And Machine lEarning (TAME) Training Program Promoting trainee-driven data generation, management, and analysis methods to “TAME” data in environmental health studies Trainees within toxicology and environmental health science programs are collectively underprepared for managing and analyzing datasets; if they were given the training and tools, the resulting impact on research programs worldwide would be tremendous. We are uniquely positioned to leverage our successful graduate and postgraduate training program in the UNC Curriculum in Toxicology and Environmental Medicine (CiTEM) to develop data science tools that are in high demand through this foundational T32 training program. Here, we propose to develop and launch the UNC inTelligence And Machine lEarning (TAME) Training Program. This program will promote trainee-driven data generation and management methods to “TAME” data using approaches that align with and lead to improvements of digital data that are findable, accessible, interoperable, and reusable (FAIR). Trainees will also receive tools to analyze data through artificial intelligence, machine learning, and other advanced computational tools. The UNC TAME Training Program will be organized into training modules, spanning topics of FAIR principles and data analysis methods to address environmental health questions. These training modules will incorporate applications-driven learning exercises that guide trainees on how to organize, analyze, and visualize environmental health datasets in R, all leveraging example environmental health datasets. Training modules will be disseminated to four target audiences. First, a new course will be launched at UNC titled “Computational Toxicology and Exposure Science,” led by Dr. Julia Rager. Second, these training resources will be disseminated through ongoing program-level initiatives at UNC (e.g., the UNC Superfund Research Program). 
Third, the TAME training modules will be disseminated to trainees outside of UNC through webinars and workshops, with a specific effort to reach trainees underrepresented in STEM through UNC’s existing partnerships with local HBCUs, such as North Carolina Agricultural and Technical State University (NC A&T) and North Carolina Central University (NCCU). Lastly, the TAME training modules will be published online through a dedicated GitHub Bookdown site and a parallel manuscript (both publicly available) to highlight the accessibility and utility of this critical resource worldwide. In summary, the UNC TAME training pipeline will help ensure that the data generated are “TAME” and thus organized, suitable for public sharing, and analyzed using cutting-edge bioinformatics tools. | NIEHS |
JULIAN, DAVID | UNIVERSITY OF FLORIDA | Adding a FAIR Data Practices Curriculum to UF’s Practicum AI AI/ML Training Workshops Equipping diverse trainees with the competencies to make biomedical data FAIR and AI/ML-ready through flexibly delivered curricula Artificial intelligence and machine learning (AI/ML) promise to transform biomedical research, but effective application of these technologies relies on experimental data that are FAIR—findable, accessible, interoperable, and reusable. This applies not only to data storage, but also to the design of experiments and to the algorithms, tools, and workflows that produce the data. Therefore, widespread adoption of AI/ML in the biomedical sciences necessitates training researchers in practices that impact the entire data resource lifecycle. We will develop open, exportable educational resources to integrate FAIR AI/ML competency into the trainings of diverse biomedical workforce development programs. The FAIR modules will be incorporated into a broader AI training program called Practicum AI, and will provide hands-on exercises for trainees to apply FAIR principles using case studies and real data. In collaboration with GO FAIR US, the project will develop a FAIR AI/ML curriculum suitable for undergraduate and predoctoral trainees, and will implement the curriculum across all institutional disciplines that participate in training the biomedical workforce. The supplement has two specific aims: (1) develop and assess a synchronously delivered skill-training curriculum to equip diverse trainees with competencies to make biomedical data FAIR and AI/ML-ready; and (2) adapt and assess the initial training curriculum for flexible delivery via synchronous workshops, formal academic courses, and asynchronous self-paced online modules. | NIGMS |
KELLER, KATE E. | OREGON HEALTH & SCIENCE UNIVERSITY | AI Training Module for Vision Science Training the next generation of vision scientists in the preparation and use of artificial intelligence-ready data for developing models that can improve patient health and vision outcomes The parent T32 application focuses on educating the next generation of vision researchers, equipping them with the skills, technical expertise, and cutting-edge technology to become leaders in the field of translational vision science. The application of artificial intelligence (AI) to ophthalmological research is a rapidly evolving field. Machine learning (ML) models are trained to interpret medical data from patients’ electronic health records and imaging. Several of these ML models have received FDA approval for use in the clinic, exemplifying how AI research can be translated from bench-to-bedside. In this supplement, an educational module will be developed to train T32 predoctoral and postdoctoral trainees, as well as K12 scholars, in the concepts of AI and ML and provide them with practical hands-on training on how to produce findable, accessible, interoperable, and reusable (FAIR) datasets. The module will consist of two components: (1) a short lecture series that will be integrated into an existing predoctoral curriculum at Oregon Health & Science University; and (2) a web-based module that will include recorded video lectures (from experts in data science, medical informatics, ML methodology, image processing, and public health), reading materials, and knowledge assessments. Automated pre- and post-tests will be incorporated to track performance improvement for each topic included in the module. This module will be available online to a global audience. The number of registered users (including T32 and K12-supported trainees) and the number and percentage of users who view the lectures and download the self-assessment coursework will be tracked. 
Location of IP addresses will be monitored to assess the geographic distribution of the participants. This training will provide a generation of young researchers with new and innovative tools to optimize translational vision research. | NEI |
LAIRD, ANGELA R. | FLORIDA INTERNATIONAL UNIVERSITY | ABCD Course on Reproducible AI/ML Data Analyses This course provides research training focusing on reproducible AI/ML analyses of data from the Adolescent Brain Cognitive Development (ABCD) Study, with an emphasis on making ABCD data FAIR and AI/ML-ready. The ABCD–ReproNim Course is a collaborative partnership to provide research educational training in reproducible analyses of data from the ABCD Study. The course integrates curricula from ReproNim: A Center for Reproducible Neuroimaging Computation, which is a National Institute of Biomedical Imaging and Bioengineering-funded P41 Biomedical Technology Resource Center (BTRC) whose vision is to help neuroimaging researchers achieve more reproducible data analysis workflows and outcomes. The ReproNim approach relies on both technical development of readily accessible, user-friendly computational tools and services that can be readily integrated into current research practices, as well as a broad educational outreach about reproducibility to the neuroimaging community at large, including developers and applied researchers across basic sciences and clinical disciplines. This administrative supplement will provide dedicated research training on making data from the ABCD Study findable, accessible, interoperable, and reusable (FAIR) and AI/ML-ready. AI/ML applications have increased relevance in the discovery of biomarkers, predicting intervention outcomes, and integrating information across datasets. However, the knowledge required to perform effective biomedical ML research spans knowledge about data, scientific questions, and computing technologies alongside AI/ML platforms and tools. The ABCD–ReproNim AI/ML Course will extend the current training to make trainees aware of the tools, concepts, and caveats for multimodal AI/ML processing of ABCD data. 
Students will first receive training in a 5-week, online course that includes lectures, readings, and ABCD data exercises on topics including FAIR AI/ML Applications, Core Concepts in ML, Neuroimaging ML, Interpretable/Explainable ML, and Introduction to Deep Learning. Competencies and skills addressed will include training and publishing ML models, organizing and evaluating data for ML applications, and reusing existing models efficiently. Didactic instruction will be followed by a 5-day, remote Project Week, where students will apply the skills learned and work towards completion of AI/ML data analysis projects. Success will result in well-trained researchers who are able to apply reproducible AI/ML practices to test generalizability of AI/ML models for cross-sectional and longitudinal predictions across the ABCD dataset. | NIDA |
MARCUS, CRAIG B. | OREGON STATE UNIVERSITY | Workforce Training for Making Data FAIR and Compatible with Machine Learning and Artificial Intelligence Applications Develop and disseminate online training materials for scientists and trainees to generate research data which is FAIR (findable, accessible, interoperable, and reusable) and artificial intelligence– and machine learning–compliant The objective of the parent grant, “Integrated Regional Training Program in Environmental Health Sciences,” continues to be to recruit and train scientists in the environmental health sciences (EHS). The supplementary project “Workforce Training for Making Data FAIR and Compatible with Machine Learning and Artificial Intelligence Applications” will develop and provide online training modules designed to provide scientists and trainees with the competencies and skills needed to make their research data FAIR (findable, accessible, interoperable, and reusable) and AI/ML-compatible. These skills are emerging as essential to enable scientists to share and reuse the enormous datasets currently being generated. These skills are also essential to allow scientists to format their data so that it is accessible and usable by other scientists employing powerful AI/ML software to extract meaningful information from large, complex dataset resources. The project will develop asynchronous online training modules which will include real-time self-assessment exercises and evaluation by participants. The training modules will provide the option of three levels of expertise to participants completing the sequential modules: basic, intermediate, and advanced. The training materials will cover the basic concepts of information science, “big data,” AI, and ML, as well as how to design research projects to generate data compatible with FAIR requirements and with AI/ML applications. 
Trainees completing all three modules will be able to actively participate in the data analytics process of developing, accessing, and sharing FAIR-compliant datasets for reuse. The learning modules will confer skill sets addressing the following three categories of competency at increasing levels of sophistication and complexity: (1) data integration, (2) multimodal data analytics, and (3) AI/ML. These training materials will be made freely available via the internet. | NIEHS |
MILLER, GARY W. | COLUMBIA UNIVERSITY | Making Environmental Health Data FAIR and AI/ML-Ready Developing next-generation data scientists for environmental health Our National Institute of Environmental Health Sciences–supported training grant (T32 ES007322) provides a single, unified training program for 18 predoctoral students and 8 postdoctoral fellows within the environmental health sciences. Our program is designed to ensure trainees acquire skills in advanced data analytics to complement their primary training in environmental epidemiology, climate science, molecular mechanisms of disease, and the exposome. We aim to leverage our collective expertise to develop a multidisciplinary curriculum that supplements our existing data science training and enables our trainees to develop the competencies and skills needed to make diverse biomedical data FAIR and AI/ML-ready. This curriculum will be designed to be flexible and module-based so it can be implemented in full, as part of existing training seminars or as stand-alone bootcamps, depending upon the needs of individual training programs. We will leverage the infrastructure of our existing seminar series for T32 trainees, led by Drs. Marianthi-Anna Kioumourtzoglou and Jeanette Stingone, to develop, implement, and evaluate our curriculum. Our novel curriculum will combine didactic seminars, guided discussions, and hands-on training activities to develop competencies and skills in use of data standards, the FAIR principles, and AI/ML-readiness. This module-based curriculum will be centered on core foundational concepts, such as ontologies, common data elements, and metadata annotation. To construct these modules, we will draw upon expertise from faculty, both internal and external to Columbia University, from within the fields of semantic science, information science, environmental health data science, and computer science. 
We will consult with educational professionals who will advise on evidence-based curriculum design and provide an independent evaluation of our curriculum and training activities using both quantitative and qualitative measures. Following successful evaluation, we propose to incorporate the developed curriculum and training activities into multiple existing training programs. Recorded lectures, discussion guides, and training materials will be made available within a shared resource library. Formalizing supplementary training in the FAIR principles and AI/ML-readiness across our multiple training programs will accelerate the achievement of research training aims and develop a cadre of scientists poised to advance biomedical research through the application of data science. | NIEHS |
ORTIZ, ANA PATRICIA | UNIVERSITY OF PUERTO RICO COMPREHENSIVE CANCER CENTER | Making Data FAIR and AI/ML Applications for Cancer Prevention and Control (AI/ML-CAPAC) Research Among Hispanics Preparing a workforce to apply AI/ML techniques to datasets derived from Hispanic populations to advance cancer prevention and control research This supplement aims to expand the scope of the parent Cancer Prevention and Control (CAPAC) Research Training Program (1R25CA240120) and prepare a research workforce on (1) the techniques and approaches to manipulate and pre-process cancer datasets from Hispanic populations to make them FAIR and AI/ML-ready, and (2) the available methods for developing ML-based models to analyze these data and create predictive models for cancer diagnosis and treatments with a focus on datasets from Hispanic populations. We will develop an online course based on the data science project lifecycle, which includes four phases: (1) Data Understanding/Data Pre-processing, (2) Data Wrangling, (3) Model Planning, and (4) Model Building. The online course is organized in modules within two components. Component 1 will include the following topics: fundamentals of cancer data types; identifying and understanding cancer datasets; data science concepts and project lifecycles; basic programming concepts; programming with Python; exploring, pre-processing, and conditioning the cancer datasets; and performing extract, transform, and load (ETL) prior to AI/ML modeling. Component 2 will add topics such as principles of AI/ML; variable correlations and associations; determining datasets for training and testing; supervised and unsupervised ML approaches; classification, regression, and ensemble ML-algorithms; and familiarizing with ML tools. To develop our course, examples, and projects, we will use cancer datasets from Hispanic populations in the United States and Puerto Rico. 
The course will be voluntary and free for interested participants (capacity of 40 trainees), including CAPAC participants (alumni), applicants, and mentors, as well as trainees and research staff from collaborating grants and institutions. Students’ acquired skills will be evaluated with quizzes and a final practical project, while the course will be evaluated with the support of the evaluation component of the parent grant. This online, asynchronous course will be self-paced, and it is expected to be completed in approximately 24 hours. The supplement will impact the development of human resources (e.g., students, researchers, clinicians) from the United States and Puerto Rico with the competencies and skills needed to make cancer datasets from Hispanic populations FAIR and to apply AI/ML approaches for creating ML-based predictive models for cancer. | NCI |
PENEDO, FRANK J. | UNIVERSITY OF MIAMI | Postdoctoral Training in AI/ML Approaches in Cancer Control Research To Address Cancer Disparities Application of artificial intelligence (AI) and machine learning (ML) methodology to characterize multilevel social, biological, and psychosocial processes that influence cancer disparities across the cancer control continuum The South Florida Cancer Control Training in Disparities and Equity (South Florida C-TIDE; NCI T32 CA251064) offers an innovative approach to train the next generation of cancer researchers in cancer disparities across the cancer control continuum. The C-TIDE T32 grant (1) implements comprehensive, multidisciplinary, and community-engaged research (CER) training that addresses multilevel determinants of cancer disparities across traditional and unique underserved communities that experience cancer disparities; (2) provides immersive training opportunities in diverse educational and community environments for enhanced didactic and experiential learning; and (3) increases diversity in the workforce by aiming to recruit trainees who are racial and ethnic minorities, are part of the LGBTQ community, have disabilities, or are members of other under-represented groups in the cancer disparities workforce. Given complex, multilevel interactions that influence cancer disparities across several communities, work aimed at reducing such disparities also necessitates the implementation of novel and advanced methodology, such as artificial intelligence (AI) and machine learning (ML), two training components that are currently unavailable to C-TIDE trainees. 
This supplement will fill critical gaps in our training plan by providing highly relevant advanced methodology in AI/ML that is applicable to research in cancer equity, such as identifying risk factors that drive disparities in incidence and early detection; developing risk-prediction models by assessing patient data to better understand and develop risk stratification and disease progression trajectories; and evaluating predictors of disparate clinical outcomes across our specific populations. The aims of this supplement are (1) to train six fellows and four mentors on AI/ML methodology and (2) to apply AI/ML methodology to trainees’ ongoing and/or planned research projects and grants under preparation for future submissions. Didactics, curricula, and other AI/ML training activities are supported by the Institute for Data Science and Computing (IDSC). Training will involve data processing using the FAIR principles, programming with a scripting language, cloud computing, and AI/ML concepts. The supplement will provide competencies in data science with RStudio; cloud-based AI/ML; and AI/ML methods for data publishing, sharing, and grant applications. Evaluation, dissemination, and sustainability plans for the supplement are in place. | NCI |
QUIGLEY, HARRY ALAN | JOHNS HOPKINS UNIVERSITY | AI/ML-Ready Ophthalmic Data Best practices for generating and disseminating AI/ML-ready data that adheres to FAIR principles in ophthalmology Artificial intelligence (AI), in the form of machine learning (ML) and deep learning (DL), has revolutionized the field of medicine, especially in subspecialties with access to a large number of images, such as ophthalmology. While promising, the application of AI/ML in ophthalmology is limited by the lack of clear guidelines on how to extract ophthalmic data, convert these data into a form that is usable for AI/ML model training, and ensure these data adhere to the FAIR principles (findability, accessibility, interoperability, and reusability). Our proposal aims to address these limitations by delineating the best practices for data curation via an online lecture series that will be available free of charge to the public once the curriculum is assessed and validated by another academic institution (Harvard/Massachusetts Eye and Ear Infirmary). If successful, the proposal will be impactful, as it will: 1. Accelerate reproducible research in AI/ML by enabling the creation of standardized AI/ML-ready and FAIR datasets. 2. Disseminate best practices in data curation beyond the Johns Hopkins University community, as the online lectures will be made available free of charge to the public. 3. Foster the creation of standardized multi-institutional datasets that will improve both performance and generalizability of AI/ML algorithms in ophthalmology, by enabling federated learning approaches and external validation of models. | NEI |
RAMACHANDRAN, SOHINI | BROWN UNIVERSITY | Learner-Centered Training in Biological Data Science Focused training on the medical opportunities and computational challenges of genomic research from base pairs to bedside to global health The objective of this program is to train biological data scientists who can generate and analyze biological data, and develop theoretical models and testable hypotheses for biological processes. Data science techniques, such as artificial intelligence and machine learning (AI/ML), have enormous potential for knowledge discovery, model development, and hypothesis testing. There are many challenges in realizing this potential. The generation and analysis of biomedical datasets require (1) careful consideration of ethical, legal, privacy, and social concerns; (2) a deep understanding of potential biases in both data and algorithms; and (3) thoughtful reflection on health and socioeconomic disparities. These issues affect the entire data pipeline, from data generation to data analysis, and often need to be revisited and reconsidered iteratively throughout the entire research process. Making data findable, accessible, interoperable, and reusable (FAIR) and AI/ML-ready is a fundamentally important part of this process as it helps uncover biases, document data in meaningful ways, select and apply AI/ML algorithms, and find and reuse datasets. Our trainees are currently exposed to these issues indirectly during their coursework, research rotations, and thesis research, but we have not developed any training opportunities that have these topics as their main focus. 
With the institutional capacity enabled by this award and the growth of Brown’s Data Science Initiative, the primary outcome of this award will be to expand the current training activities by developing a series of focused training modules that current trainees will utilize to acquire the competencies needed to make data FAIR and AI/ML-ready in the context of interdisciplinary biological and biomedical research. Each training module will focus on the application of a concrete AI/ML algorithm to a specific dataset, or a collection of datasets, to develop or test a research hypothesis. Modules will begin with a guided scaffolded discussion of the ethical, legal, and privacy issues that may affect usage of the dataset; a thorough overview of the FAIR guiding principles and their relevance to AI/ML readiness; and an initial assessment of the FAIRness and AI/ML-readiness of the given dataset. These questions will then be revisited iteratively throughout the data analysis. | NIGMS |
RICHARDSON, ARLAN G. | THE UNIVERSITY OF OKLAHOMA HEALTH SCIENCES CENTER | Support for a Nathan Shock Center Data Science Workshop Workshops will be held to teach people interested in learning artificial intelligence and machine learning approaches how to prepare their data and to make their results findable, accessible, interoperable and reusable (FAIR). A Data Science Workshop (DSW) hosted at the University of Oklahoma Health Sciences Center (OUHSC) originated as a small gathering of local scientific computing specialists to share techniques and methods. However, as high-throughput data generation and AI/ML have gained prominence, there has been greatly increased demand from nonspecialists, including principal investigators, postdoctoral fellows, and graduate students whose work mostly entails bench biology, for introductory and intermediate training in all aspects of data acquisition, storage, and analysis. They are generally computer-savvy and able to program to some degree, and their interest is in short, hands-on demonstrations of how to apply AI/ML methods to data generated or accessible via their research. The interest in data science classes is growing at OUHSC, and this need is not met. Currently, the DSW is a volunteer effort, and the nascent OUHSC data science curriculum has minimal instruction on how to make data AI/ML-ready or how to follow FAIR principles. This supplement will enable us to develop material on these topics and promulgate it via workshops, coursework, and freely accessible online content. Importantly, it will allow us to reach a much broader audience. | NIA |
SOBIE, ERIC A. | ICAHN SCHOOL OF MEDICINE AT MOUNT SINAI | Teaching Biomedical and Pharmacological Trainees to Produce FAIR Data for AI/ML Applications A new course will educate students on issues related to structures and storage of imaging and clinical data that must be addressed to fully exploit these data in artificial intelligence applications. The answers to many fundamental questions in medicine and biology currently lie buried inside data collections that are too large and heterogeneous to be analyzed and visualized by traditional approaches. Although emerging tools of artificial intelligence and machine learning (AI/ML) provide tremendous opportunities to exploit these data, the capabilities of AI tools are often overstated by vendors, and the deployment of these tools without sufficient training and understanding can waste resources and harm patient outcomes. Our project will develop a new course dedicated to data science for AI/ML in biomedicine. A module within this course will focus on competencies needed to make data FAIR and AI/ML-ready, and on the skills that biomedical scientists will require to collaborate effectively with researchers in information sciences and AI/ML. The new course synergizes extremely well with the existing efforts of the training program, which has emphasized quantitative competencies for more than a decade. Although we are a program in pharmacological sciences rather than in computational biology, all trainees are required to learn fundamental concepts in programming, mathematical modeling, and systems pharmacology during the first-year core curriculum. The didactic focus, however, has been on mechanism-based mathematical models, with only limited offerings in AI/ML thus far. The new course will fill a substantial need in the current curriculum by educating our predoctoral students on contemporary issues in data storage and availability that prevent AI/ML from reaching its full potential in biomedicine. 
By teaching our students these issues and emphasizing the principles that make data FAIR-compliant, we will facilitate future collaborations with data scientists and AI experts by allowing groups with different perspectives to understand the relevant issues and speak the same language. | NIGMS |
TULLIUS, THOMAS D. | BOSTON UNIVERSITY | Predoctoral Training in Biological Data Management for Advanced Computational Analysis and the Ethical Usage of Biological Data This project at Boston University will develop predoctoral training activities to enhance the enduring value of biological data for advanced computational analysis and underscore the ethical considerations necessary for unbiased data usage. In order to improve the availability and enduring value of data developed through biological research, and to underscore the ethical considerations necessary for unbiased data usage, the Boston University T32 program in Bioinformatics and Computational Biology in cooperation with the T32 programs in Quantitative Biology and Physiology and Synthetic Biology and Biotechnology will develop and teach, as part of their training activities for predoctoral students, a series of workshops on (1) biological data management practices to make data findable, accessible, interoperable, and reusable (FAIR) in order to facilitate subsequent computational analysis; (2) the knowledge and skills required to additionally make biological data suitable for analysis with artificial intelligence and machine learning methods (AI/ML); and (3) the ethical implications of FAIR and AI/ML-ready data practices, including data privacy, algorithmic fairness, and ethical algorithm design. The programs will develop and conduct a series of assessment activities to gauge how the training has influenced the acquisition of data management skills by the students and how it has transformed data management practices in mentor labs. All training resources developed as part of this project will be freely shared through a link on the Boston University Bioinformatics Graduate Program website. | NIGMS |
WANDINGER-NESS, ANGELA | THE UNIVERSITY OF NEW MEXICO HEALTH SCIENCES CENTER | FAIR Data Competency and Machine Learning Readiness for Biomedical Scientists: A Supplement Award to the Academic Science Education and Research Training Institutional Research and Career Development Award (IRACDA) Program A new course will train postdoctoral fellows and advanced graduate students in findable, accessible, interoperable, and reusable (FAIR) data management and data analyses using machine learning. Trainees in the biomedical sciences are often involved in the acquisition of high-dimensional datasets, the analysis of which depends on machine learning. A 3-credit-hour training program will guide trainees in the biomedical sciences to reframe biomedical problems in terms of supervised and unsupervised machine learning, which depends on FAIR data. By learning the essential characteristics of FAIR data and the basics of machine learning, participants will (1) practice evaluating data to ensure it is FAIR-compliant and (2) implement and investigate how to evaluate and improve machine learning algorithms. | NIGMS |
WANG, CHUNYU | RENSSELAER POLYTECHNIC INSTITUTE | Development of Data Science Course and Summer Bootcamp for Alzheimer’s Disease and Related Dementia Researchers Development of data science course and summer bootcamp for Alzheimer’s disease and related dementia researchers Given the rising prominence of data science in Alzheimer’s disease and related dementia (ADRD) research, a need has emerged to broadly train predoctoral biology students in data science to prepare them for creative and productive careers in ADRD research. To this end, we will develop a spring 2022 course and a summer 2022 bootcamp to train the students in data science methods. The course will provide an overview of the principles of data science and its applications to Alzheimer’s disease, with topics ranging from machine learning to medical imaging analysis, multi-omics, and drug discovery. The summer bootcamp will be an R-based intensive training program utilizing datasets from ADRD research (e.g., gene expression data from a brain organoid model of neurodegeneration). Both the new course and summer bootcamp will be assessed for efficacy and sustainability and will be made available online to the wider student community at Rensselaer Polytechnic Institute. | NIA |
WESTENDORF, JENNIFER J. | MAYO CLINIC | Interdisciplinary Training on Data Science and AI/ML for Musculoskeletal and Orthopedic Conditions Training the next generation of musculoskeletal disease data scientists who can improve orthopedic care with AI Musculoskeletal (MSK) diseases and orthopedic conditions affect one out of every two adults in the United States and constitute a significant burden to society and to the affected individuals who live with long-term pain, disability, and poor quality of life. Despite the significant individual and health care burden, the evidence for many medical and surgical interventions is based on imperfect data. Important questions remain unresolved, including the indications for and timing of joint replacement, as well as variation in disease progression and surgical outcomes due to patient and implant characteristics. Electronic health records (EHR) and radiographs from routine clinical care are a rich, untapped data source for MSK and orthopedic research. With artificial intelligence and machine learning (AI/ML) and deep learning technologies, there is an unprecedented opportunity to create FAIR and AI/ML-ready EHR and imaging datasets; to develop standardized, repeatable, scalable, high-capacity, low-cost automated algorithms; and to use them effectively to advance MSK research and clinical practice. There is a great need to train the new generation of MSK data scientists who can drive information and technology innovations using these datasets. The goal of this project is to create an interdisciplinary scholarly environment by providing a blended program of both didactic and informal learning experiences in AI/ML to improve diagnosis and treatment of MSK diseases. With both didactic and informal offerings, we will ensure that trainees are exposed to the foundational concepts and practical competencies and skills in a collaborative environment. 
Furthermore, we will develop brief educational videos targeting a wider audience. | NIAMS |