# PiCo Lightning Talks

## PiCo Talks 2016

Join the celebration by registering to give a PiCo talk or poster/demo. All NIH staff are invited!

PiCo talks are structured lightning talks consisting of **3 slides** to share **1 idea** in **4 minutes**. Talks will be videocast and archived.

After an introduction by **Dr. Michael Gottesman, Deputy Director for Intramural Research**, PiCo talks will be presented in the following order:

- Alejandro Alvarez-Prats, NICHD: Phosphatidylinositol 4-kinase type IIIα (PI4KA) is required for proper myelination of peripheral nerves in mice.
- Christopher Belter, OD: Advanced citation searching for systematic reviews
- Ben Busby, NCBI: NCBI and regional hackathons: Building community software for bioinformatics
- Maxwell Lee, NCI: Interpreting multivariate data analysis through the lens of geometric representation
- Philip McQueen, CIT: Information in neuron morphology
- Thorsten Prüstel, NIAID: Space-time approach to fast stochastic simulation algorithms
- Hoo Chang Shin, Clinical Center: Interleaved text/image deep mining on a large-scale radiology image database
- Parmit Kumar Singh, NICHD: LEDGF/p75 interacts with mRNA splicing factors and targets HIV-1 integration to highly spliced genes
- Paul Wakim, Clinical Center: Correlation with repeated measures: what you should and should not do
- Xiaosong Wang, Clinical Center: Towards large-scale radiology image database auto-annotation: a Deep Pseudo-task Learning approach
- Lana Yeganova, NCBI: Hypergeometric in action!

For questions, contact Lisa Federer (NIH/OD) (lisa.federer@nih.gov).

Dr. Warren Kibbe, director of NCI's Center for Biomedical Informatics and Information Technology, gives a PiCo talk.

Dr. DJ Patil, Chief Data Scientist of the United States, shows off his Pi Day spirit.

**Alejandro Alvarez-Prats, NICHD: Phosphatidylinositol 4-kinase type IIIα (PI4KA) is required for proper myelination of peripheral nerves in mice.**

Phosphatidylinositol 4-kinase IIIα (PI4KA) is responsible for the generation of PI4P, a phospholipid involved in the regulation of several aspects of cellular physiology. Myelin, a very important component of the nervous system, is in turn formed from different lipids. Here, we demonstrate that PI4KA is required by Schwann cells to properly myelinate the murine peripheral nervous system (PNS). We generated three groups of mice: Pi4ka(fl/fl)Cre+, Pi4ka(fl/wt)Cre+, and Pi4ka(wt/wt)Cre+, the latter two groups serving as controls. Histological and DNA analyses of collected tissues were performed using H&E staining and genomic PCR, respectively. Protein expression was detected by western blot (WB) and immunohistochemistry (IHC). Gait behavior was analyzed using the TreadScan™ system, an instrument that takes video of the animal and determines various characteristic parameters related to possible pathophysiological conditions. Electron microscopy (EM) and lipidomic analyses were performed on the sciatic nerves, the latter to study their lipid profile. We conclude that genetic ablation of PI4KA causes defects in myelination during development in mice. To elucidate the underlying cause of the myelination problem, we will analyze the lipid profile, signal transduction pathways, and migrational response of Schwann cells after pharmacological inhibition of PI4KA.

**Christopher Belter, OD: Advanced citation searching for systematic reviews**

Finding literature for inclusion in a systematic review has always been a challenge. Synonymous terms, differences in database coverage, and indexing errors make it difficult to retrieve all of the studies that have been performed on a particular topic. In this talk, I propose a new way of searching that uses both direct and indirect citation links to identify papers on a particular topic and to rank those papers by their relevance to that topic. The method can be run in a matter of hours, typically retrieves 80-100% of the papers on a topic, and typically identifies around 80% of the relevant papers in a pool of 20-30% of the total papers retrieved.

**Ben Busby, NCBI: NCBI and Regional Hackathons: Building Community Software for Bioinformatics**

Over the past year, we have run several hackathons, bringing genomics professionals from all over the world together to build software. In addition to being an amazing educational and networking experience, attendees efficiently build user-designed functional software that can be used and modified by other bioinformaticians all over the world. We are expanding this effort to increase the diversity of genomic scientists, bioinformaticians, software developers, and other scientists who can attend and contribute. Software built thus far and resources for running such hackathons are available at https://github.com/NCBI-Hackathons.

**Maxwell Lee, NCI: Interpreting multivariate data analysis through the lens of geometric representation**

Some of the most common statistical methods for analyzing multivariate data are linear regression models, principal component analysis (PCA), and multiple discriminant analysis (MDA). Although many software tools are readily available for analyzing multivariate data, understanding the concepts of matrix algebra and calculus is important for gaining deep insight into the meaning of data analyses and for interpreting the results more effectively. We often visualize the data and multivariate methods by plotting samples in a two-dimensional or three-dimensional Cartesian coordinate system. A complementary approach, of which most biologists are not aware, uses geometric representation to illustrate relationships among variables, which can be viewed as vectors in a vector space. Many of the matrix operations used in multivariate statistical methods can be visualized through geometric representations. In this talk, I will illustrate how geometric representations give more insight into the interpretation of linear regression models and PCA, and I will discuss the meaning of R², coefficients, and fitted values of regression models, as well as eigenvectors and eigenvalues of PCA, from a geometric perspective.
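As one illustration of the geometric view described above, the following minimal sketch treats the centered response and the centered fitted values of a simple linear regression as vectors and checks that R² equals the squared cosine of the angle between them. The data values and helper names are made up for illustration and are not taken from the talk.

```python
def mean(values):
    return sum(values) / len(values)

def center(values):
    """Subtract the mean so the variable becomes a vector around the origin."""
    m = mean(values)
    return [v - m for v in values]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    xc, yc = center(x), center(y)
    slope = dot(xc, yc) / dot(xc, xc)
    intercept = mean(y) - slope * mean(x)
    return slope, intercept

def r_squared(y, fitted):
    """Classical definition: 1 - residual sum of squares / total sum of squares."""
    residuals = [a - b for a, b in zip(y, fitted)]
    yc = center(y)
    return 1 - dot(residuals, residuals) / dot(yc, yc)

def cos_squared_angle(u, v):
    """Squared cosine of the angle between two vectors."""
    return dot(u, v) ** 2 / (dot(u, u) * dot(v, v))

# Hypothetical data, chosen only to keep the arithmetic small.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]

slope, intercept = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]

r2_classical = r_squared(y, fitted)
r2_geometric = cos_squared_angle(center(y), center(fitted))
print(r2_classical, r2_geometric)
```

The two printed values agree up to floating-point error, because the fitted values are the orthogonal projection of the response onto the space spanned by the predictors.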

**Philip McQueen, CIT: Information in Neuron Morphology**

Even in Drosophila, neurons form intricate circuits for specialized tasks. Thus, the dendritic fields of neurons have a huge variety of morphologies depending on the circuit function. Using examples from the Drosophila visual system, I show that ideas from information theory, such as entropy and the Jensen-Shannon metric, allow characterization and classification of the distinctive dendritic field morphologies that neurons in this system develop as the organism matures.
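For readers unfamiliar with the two quantities named above, here is a minimal sketch of Shannon entropy and the Jensen-Shannon distance (the square root of the Jensen-Shannon divergence) between two discrete distributions. How the talk turns a dendritic field into a distribution is not stated here; the binned counts below are a purely hypothetical stand-in.

```python
from math import log2, sqrt

def normalize(counts):
    """Turn non-negative counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def entropy(p):
    """Shannon entropy in bits; zero-probability bins contribute nothing."""
    return -sum(x * log2(x) for x in p if x > 0)

def js_distance(p, q):
    """Jensen-Shannon distance between two distributions over the same bins."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]
    divergence = entropy(mixture) - (entropy(p) + entropy(q)) / 2
    # Guard against tiny negative values caused by floating-point rounding.
    return sqrt(max(divergence, 0.0))

# Hypothetical example: dendrite length binned into four spatial regions
# for two neurons (made-up counts, not data from the talk).
neuron_a = normalize([8, 4, 2, 2])
neuron_b = normalize([2, 2, 4, 8])

print(entropy(neuron_a), js_distance(neuron_a, neuron_b))
```

With base-2 logarithms the distance lies between 0 (identical distributions) and 1 (distributions with no overlapping bins), which makes it convenient as an input to clustering or classification.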

**Thorsten Prüstel, NIAID: Space-Time Approach to Fast Stochastic Simulation Algorithms**

The stochastic, diffusive motion of molecules influences the rate of molecular encounters and hence many biochemical processes in living cells. Tiny time steps are typically required to accurately resolve the probabilistic encounter events in a simulation of cellular signaling, creating the dilemma that only simplistic models can be studied at cell-biologically relevant timescales. Here, we present a simulation algorithm that accurately propagates a molecule pair using large time steps and allows for position updates that are two to three orders of magnitude faster than those of corresponding schemes, while maintaining the same degree of accuracy. The method is flexible and applicable in 1, 2, and 3 dimensions, suggesting that it may find broad usage in various stochastic simulation algorithms.

**Hoo Chang Shin, Clinical Center: Interleaved Text/Image Deep Mining on a Large-Scale Radiology Image Database**

For a complete abstract, see Pi Day poster abstracts.

**Parmit Kumar Singh, NICHD: LEDGF/p75 interacts with mRNA splicing factors and targets HIV-1 integration to highly spliced genes**

The host chromatin-binding factor LEDGF/p75 interacts with HIV-1 integrase and directs integration to active transcription units. To understand how LEDGF/p75 recognizes transcription units, we sequenced 1 million HIV-1 integration sites isolated from cultured HEK293T cells. Analysis of integration sites showed that cancer genes were preferentially targeted, raising concerns about using lentivirus vectors for gene therapy. Additional analysis led to the discovery that introns and alternative splicing contributed significantly to integration site selection. These correlations were independent of transcription levels, size of transcription units, and length of the introns. Multivariate analysis with five parameters previously found to predict integration sites showed that intron density is the strongest predictor of integration density in transcription units. Analysis of previously published HIV-1 integration site data showed that integration density in transcription units in mouse embryonic fibroblasts also correlated strongly with intron number, and this correlation was absent in cells lacking LEDGF. Affinity purification showed that LEDGF/p75 is associated with a number of splicing factors, and RNA sequencing (RNA-seq) analysis of HEK293T cells lacking LEDGF/p75 or the LEDGF/p75 integrase-binding domain (IBD) showed that LEDGF/p75 contributes to splicing patterns in half of the transcription units that have alternative isoforms. Thus, LEDGF/p75 interacts with splicing factors, contributes to exon choice, and directs HIV-1 integration to transcription units that are highly spliced.

**Paul Wakim, Clinical Center: Correlation with repeated measures: what you should and should not do**

How should one calculate the correlation coefficient between two variables X and Y, when both have been measured repeatedly over time on the same individuals?

**Xiaosong Wang, Clinical Center: Towards large-scale radiology image database auto-annotation: a Deep Pseudo-task Learning approach**

Obtaining ImageNet-level semantic labels on a large-scale radiology image database (215,786 key images from 61,845 unique patients) is a prerequisite, yet also a bottleneck, for training highly effective deep convolutional neural network (CNN) models for image recognition. Conventional methods for collecting image labels (e.g., Google search followed by crowd-sourcing) are not applicable, however, because medical annotation tasks are formidably difficult for those who are not clinically trained.

In this paper, we present a looped deep pseudo-task optimization procedure for automatic category discovery of visually coherent and clinically semantic (concept) clusters. Our system can be initialized by domain-specific (CNN trained on radiology images and text report derived labels) or generic (ImageNet) CNN models. Afterwards, a sequence of pseudo-tasks is exploited by the looped deep image feature clustering (to refine image labels) and deep CNN training/classification using new labels (to obtain more task-representative deep features). Our method is conceptually simple and based on the hypothesized "convergence" of better labels leading to better-trained CNN models, which consequently feed more effective deep image features to facilitate more meaningful clustering/labels. We have empirically validated the convergence and demonstrated promising quantitative and qualitative results. Category labels of significantly higher quality than those in are discovered. This allows for further investigation of the hierarchical semantic nature of the given large-scale radiology image database.

**Lana Yeganova, NCBI: Hypergeometric in Action!**

Communicating with a search engine has become so much a part of everyday life that we do not pause to think about what actually happens behind the scenes when we perform a search. We enter a query and expect a search engine to understand our question, swiftly shuffle through terabytes of data, and return the answers sorted by relevance. What's more, we expect the system to recognize that cancer and neoplasm are synonyms, that fmf is the abbreviated form of familial Mediterranean fever, and that ophthalmologie and ophthalmology are different spellings of the same word. Behind every search engine is intricate mathematics that relies on the interplay of words, probabilities, and co-occurrences. In this talk I would like to highlight one such tool that relies on probabilities and co-occurrences of words to establish the relationships between them. We refer to it as the hypergeometric test which, in a nutshell, is based on computing the p-value of a hypergeometric random variable. We have found this simple test to be remarkably useful across an array of different applications. We will present the idea and demonstrate it in action on a sample problem.
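To make the idea concrete, the following minimal sketch computes a hypergeometric p-value for the co-occurrence of two terms in a document collection: the probability of seeing at least the observed number of shared documents if the two terms were distributed independently. The function name, the framing in terms of document counts, and the numbers are assumptions for illustration, not the speaker's implementation.

```python
from math import comb

def hypergeom_pvalue(total_docs, docs_with_a, docs_with_b, docs_with_both):
    """Upper-tail p-value P(X >= docs_with_both).

    X counts how many of the docs_with_b documents also contain term A when
    they are drawn at random, without replacement, from a collection of
    total_docs documents of which docs_with_a contain term A.
    """
    largest_overlap = min(docs_with_a, docs_with_b)
    favorable = sum(
        comb(docs_with_a, i) * comb(total_docs - docs_with_a, docs_with_b - i)
        for i in range(docs_with_both, largest_overlap + 1)
    )
    return favorable / comb(total_docs, docs_with_b)

# Made-up counts: 1000 documents, 50 mention term A, 40 mention term B,
# and 12 mention both.
p_value = hypergeom_pvalue(1000, 50, 40, 12)
print(p_value)
```

A very small p-value indicates that the two terms co-occur far more often than chance would predict, which is the kind of evidence that could link a pair such as an abbreviation and its long form.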

This page last reviewed on September 5, 2018