Informatics School Theses and Dissertations
Please go to "Informatics Graduate Theses and PhD Dissertations" to submit dissertations and theses for the School of Informatics and Computing, at: http://hdl.handle.net/1805/303.
Recent Submissions
Discovery and Interpretation of Subspace Structures in Omics Data by Low-Rank Representation (2022-10)
Lu, Xiaoyu; Cao, Sha; Zhang, Chi; Yan, Jingwen; Zang, Yong
Biological functions in cells are highly complicated and heterogeneous, and can be reflected by omics data, such as gene expression levels. Detecting subspace structures in omics data and understanding the diversity of biological processes are essential to the full comprehension of biological mechanisms and complicated biological systems. In this thesis, we develop novel statistical learning approaches to reveal the subspace structures in omics data. Specifically, we focus on three types of subspace structures: low-rank subspace, sparse subspace, and covariate-explainable subspace. For the low-rank subspace, we developed a semi-supervised model, SSMD, to detect cell-type-specific low-rank structures and predict their relative proportions across different tissue samples. SSMD is the first computational tool that utilizes semi-supervised identification of cell types and their marker genes specific to each mouse tissue transcriptomics data set, for better understanding of the disease microenvironment and downstream disease mechanisms. For the sparsity-driven sparse subspace, we proposed a novel positive and unlabeled learning model, namely PLUS, that can identify cancer metastasis-related genes, predict cancer metastasis status, and specifically address the under-diagnosis issue in studying metastasis potential. We found that the PLUS-predicted metastasis potential at diagnosis has a significantly strong association with patients’ progression-free survival in their follow-up data. Lastly, to discover the covariate-explainable subspace, we proposed an analytical pipeline based on covariance regression, namely scCovReg.
We utilized scCovReg to detect pathway-level second-order variations using scRNA-Seq data in a statistically powerful manner, and to associate the second-order variations with important subject-level characteristics, such as disease status. In conclusion, we presented a set of state-of-the-art computational solutions for identifying sparse subspaces in omics data, which promise to provide insights into the mechanisms of complex diseases.

A Comprehensive Survey and Deep Learning-Based Prediction on G-quadruplex Formation and Biological Functions (2022-09)
Fang, Shuyi; Wan, Jun; Liu, Yunlong; Yan, Jingwen; Zhang, Jie
G-quadruplexes (G4s) are guanine-rich four-stranded DNA/RNA structures that have been found throughout the human genome. G4s have been reported to affect chromatin structure and are involved in important biological processes at the transcriptional and epigenetic levels. However, the underlying molecular mechanisms and the localization of G4s remain elusive due to the complexity of G4s. Taking advantage of the development of high-throughput sequencing technologies and machine learning approaches, we conducted a comprehensive investigation of G4 structures, comprising the discovery of a novel marker for functional human hematopoietic stem cells, which sparked our interest in G4 structures; an exploration of the associations between G4s and genomic factors by incorporating multi-omics data; and the development of a deep-learning-based G4 prediction tool with G4 motifs. First, we discovered ADGRG1 as a novel marker for functional human hematopoietic stem cells, along with its regulation through transcription activities. Our interest in G4s was stimulated during these transcription-related investigations. Next, we analyzed the genome-wide distribution properties of G4s and uncovered the associations of G4s with other epigenetic and transcriptional mechanisms to coordinate gene transcription.
We found that G4 groups of different confidence levels correlated differently with epigenetic regulatory elements, and revealed that G4 structures could correlate with gene expression in two opposite ways depending on their locations and forming strands. Some transcription factors were identified as over-represented with G4 emergence. We found distinct consensus sequences enriched in the G4 feet, with a high GC content in the feet of high-confidence G4s and a high TA content in the feet of solely predicted G4s. In the last part, we developed a novel deep-learning-based prediction tool for DNA G4s with G4 motifs. Considering the classical G4 motif, we applied a bi-directional LSTM model with an attention mechanism, which captures sequential information, and showed good performance in whole-genome-level prediction of DNA G4s with the certified G4 pattern. Our comprehensive work investigated G4s together with their functions and prediction, and provided a better understanding of G4s at the multi-omics level and of computational information capture, riding the wave of deep learning.

Deciphering Gene Regulatory Mechanisms Through Multi-omics Integration (2022-09)
Chen, Duojiao; Liu, Yunlong; Wan, Jun; Zhang, Chi; Yan, Jingwen
Complex biological systems are composed of many regulatory components, which can be measured with the advent of genomics technology. Each molecular assay is normally designed to interrogate one aspect of the cell state. However, a comprehensive understanding of the regulatory mechanism requires characterization at multiple levels, such as the genome, epigenome, and transcriptome. Integration of multi-omics data is urgently needed for understanding the global regulatory mechanism of gene expression. In recent years, single-cell technology has offered unprecedented resolution for a deeper characterization of cellular diversity and states. High-quality single-cell suspensions from tissue biopsies are required for single-cell sequencing experiments.
Tissue biopsies need to be processed as soon as they are collected to avoid gene expression changes and RNA degradation. Although cryopreservation is a feasible solution for preserving freshly isolated samples, its effect on transcriptome profiles still needs to be investigated. Investigation of multi-omics data at the single-cell level can provide new insights into biological processes. In addition to the common approach of integrating multi-omics data, it is now also possible to profile the transcriptome and epigenome simultaneously at single-cell resolution, enhancing the power to discover new gene regulatory interactions. In this dissertation, we integrated bulk RNA-seq with ATAC-seq and several additional assays and revealed the complex mechanisms of ER–E2 interaction with nucleosomes. A comparative analysis of fresh and frozen multiple myeloma single-cell RNA sequencing data concluded that cryopreservation is a feasible protocol for preserving cells. We also analyzed single-cell multiome data for mesenchymal stem cells. With the unified landscape obtained from simultaneously profiling gene expression and chromatin accessibility, we discovered distinct osteogenic differentiation potential of mesenchymal stem cells and different associations with bone disease-related traits. We gained a deeper insight into the underlying gene regulatory mechanisms with this frontier single-cell multiome sequencing technique.

Intron Retention Induced Neoantigen as Biomarkers in Diseases (2022-08)
Dong, Chuanpeng; Yan, Jingwen; Liu, Yunlong; Huang, Kun; Wan, Jun; Liu, Xiaowen
Alternative splicing is a regulatory mechanism that generates multiple mRNA transcripts from a single gene, allowing significant expansion in proteome diversity. Disruption of splicing mechanisms has a large impact on the transcriptome and is a significant driver of complex diseases by producing condition-specific transcripts.
Recent studies have reported that mis-spliced RNA transcripts can be another major source of neoantigens directly associated with immune responses. In particular, aberrant peptides derived from unspliced introns can be presented by major histocompatibility complex (MHC) class I molecules on the cell surface and elicit immunogenicity. In this dissertation, we first developed an integrated computational pipeline for identifying intron retention (IR)-induced neoantigens (IR-neoAg) from RNA sequencing (RNA-Seq) data. Our workflow also included a random forest classifier for prioritizing the neoepitopes with the highest likelihood of inducing a T cell response. Second, we analyzed IR neoantigens using RNA-Seq data for multiple myeloma patients from the MMRF study. Our results suggested that the IR-neoAg load could serve as a prognostic biomarker, and that immunosuppression in the myeloma microenvironment might offset the effect of an increasing neoantigen load. Third, we demonstrated that a high IR-neoAg load predicts better overall survival in TCGA pancreatic cancer patients. Moreover, our results indicated that the IR-neoAg load might be useful in identifying pancreatic cancer patients who might benefit from immune checkpoint blockade (ICB) therapy. Finally, we explored the association of IR-induced neo-peptides with neurodegenerative disease pathology and susceptibility. In conclusion, we presented a state-of-the-art computational solution for identifying IR-neoAgs, which might aid neoantigen-based vaccine development and the prediction of patient immunotherapy responses. Our studies provide remarkable insights into the roles of alternative splicing in complex diseases by directly mediating immune responses.

Computational Modeling of Cell and Tissue Level Metabolic Characterization of the Human Metabolic Network by Using scRNA-seq Data (2022-06)
Alghamdi, Norah Saeed; Zhang, Chi; Cao, Sha; Yan, Jingwen; Jones, Josette
The heterogeneity of metabolic pathways is a hallmark of many common disease types.
Several sources of knowledge on the core components of metabolic networks and sub-networks are now available; however, our knowledge of the integrated behavior and metabolic reprogramming of the cellular microenvironment remains limited. Metabolic changes can be characterized by different factors, and they differ from one cell to another because of the cells' high plasticity. The large amount of single-cell and tissue data gained from disease tissue has the potential to provide information on a cell's functional state and its underlying phenotypic changes. Hence, advanced systems biology models and computational tools are urgently needed to enable reliable characterization of metabolic variations in disease using scRNA-seq data. Our preliminary data include (1) a new computational method to estimate cell-wise metabolic flux and states from single-cell and tissue transcriptomics data, and (2) matched scRNA-seq data and metabolomics experiments on cells under perturbed biochemical conditions and knock-down of metabolic genes, both of which form the computational and experimental foundations of this project. In this dissertation, we proposed to develop a suite of novel computational methods, systems biology models, and quantitative metrics to deliver the following unmet capabilities: (1) reconstruction of context-specific and subcellular-resolution metabolic networks for different disease types, (2) estimation of cell-/sample-wise metabolic flux by considering metabolic imbalance and metabolic exchange between cells in the disease microenvironment, and (3) a systematic evaluation of the functional impact of variations in gene expression, metabolite availability, and network structure on the context-specific metabolic network and flux.
By implementing these methods using scRNA-seq data, we addressed the following outstanding biological questions: (i) identification of genes, metabolites, and network topology with high impact on metabolic variations, (ii) estimation of metabolic flux, and (iii) assessment of metabolic changes across the metabolic network. Successful execution of the proposed research provides a suite of computational capabilities to analyze metabolic variations that could be broadly utilized by the biomedical research community.

Biomarker-And Pathway-Informed Polygenic Risk Scores for Alzheimer's Disease and Related Disorders (2022-05)
Chasioti, Danai; Yan, Jingwen; Saykin, Andrew J.; Nho, Kwangsik; Risacher, Shannon L.; Wu, Huanmei
Determining an individual’s genetic susceptibility in complex diseases like Alzheimer’s disease (AD) is challenging, as multiple variants each contribute a small portion of the overall risk. Polygenic Risk Scores (PRS) are a mathematical construct or composite that aggregates the small effects of multiple variants into a single score. Potential applications of PRS include risk stratification, biomarker discovery, and increased prognostic accuracy. A systematic review demonstrated that methodological refinement of PRS is an active research area, mostly focused on large case-control genome-wide association studies (GWAS). In AD, where there is considerable phenotypic and genetic heterogeneity, we hypothesized that PRS based on endophenotypes and pathway-relevant genetic information would be particularly informative. In the first study, data from the NIA Alzheimer’s Disease Neuroimaging Initiative (ADNI) were used to develop endophenotype-based PRS based on amyloid (A), tau (T), neurodegeneration (N), and cerebrovascular (V) biomarkers, as well as an overall/combined endophenotype-PRS. Results indicated that the combined phenotype-PRS predicted neurodegeneration biomarkers and overall AD risk.
By contrast, amyloid- and tau-PRSs were strongly linked to the corresponding biomarkers. Finally, the extrinsic significance of the PRS approach was demonstrated by applying AD biological pathway-informed PRS to the prediction of cognitive changes among older women with breast cancer (BC). Results from PRS analysis of the multicenter Thinking and Living with Cancer (TLC) study indicated that older BC patients with high AD genetic susceptibility within the immune-response and endocytosis pathways have worse cognition following chemotherapy±hormonal therapy than following hormonal-only therapy. In conclusion, PRSs based on biomarker- or pathway-specific genetic information may provide mechanistic insights beyond disease susceptibility, supporting the development of precision medicine with potential application to AD and other age-associated cognitive disorders.

Analyzing Chlamydia and Gonorrhea Health Disparities from Health Information Systems: A Closer Examination Using Spatial Statistics and Geographical Information Systems (2022-05)
Lai, Patrick T. S.; Jones, Josette; Dixon, Brian E.; Wilson, Jeffrey; Wu, Huanmei; Shih, Patrick
The emergence and development of electronic health records have contributed to an abundance of patient data that can be widely used and analyzed to promote health outcomes and even eliminate health disparities. However, the data received pose challenges, including data inconsistencies, accuracy issues, and unstructured formatting. Furthermore, current electronic health records and clinical information systems do not contain the social determinants of health that may enhance our understanding of the characteristics and mechanisms of disease risk and transmission and support health disparities research. Linkage to external population health databases to incorporate these social determinants of health is often necessary.
This study provides an opportunity to identify and analyze health disparities using geographical information systems for two important sexually transmitted diseases, chlamydia and gonorrhea, using Marion County, Indiana as the geographical location of interest. Population health data from the Social Assets and Vulnerabilities Indicators community information system and electronic health record data from the Indiana Network for Patient Care will be merged to measure the distribution and variability of chlamydia and gonorrhea risk and to determine where the greatest health disparities exist. A series of statistical and spatial statistical methods, such as a longitudinal measurement of health disparity through the Gini index, a hot-spot and cluster analysis, and a geographically weighted regression, will be applied in this study. The outcome and broader impact of this research will contribute to enhanced surveillance and more effective strategies for identifying the level of health disparities for sexually transmitted diseases in vulnerable localities and high-risk communities. Additionally, the findings from this study will lead to improved standardization and accuracy in data collection to facilitate subsequent studies involving multiple disparate data sources. Finally, this study will likely introduce ideas for potential social determinants of health to be incorporated into electronic health records and clinical information systems.

Advocacy in Mental Health Social Interactions on Public Social Media (2022-02)
Cornet, Victor P.; Holden, Richard J.; Bolchini, Davide; Brady, Erin; Mohler, George; Hong, Michin; Lee, Sangwon
Health advocacy is a social phenomenon in which individuals and collectives attempt to raise awareness and change opinions and policies about health-related causes. Mental health advocacy is health advocacy to advance treatment, rights, and recognition of people living with a mental health condition.
The Internet is reshaping how mental health advocacy is performed on a global scale, by facilitating and broadening the reach of advocacy activities, while also giving more room to opposition to mental health advocacy. Another factor contributing to mental health advocacy lies in the cultural underpinnings of mental health in different societies; East Asian countries like South Korea attach higher stigma to mental health than Western countries like the US. This study examines interactions about schizophrenia, a specific mental health diagnosis, on public social media (Facebook, Instagram, and Twitter) in two different languages, English and Korean, to determine how mental health advocacy and its opposition are expressed on social media. After delineation of a set of keywords for retrieval of content about schizophrenia, three months’ worth of social media posts were collected; a subset of these posts was then analyzed qualitatively using constant comparison with a proposed model describing online mental health advocacy based on existing literature. Various expressions of light mental health advocacy, such as sharing facts about schizophrenia, and of strong advocacy, such as showcasing offline engagement, were found in English posts; however, many of these expressions were absent from the analyzed Korean posts, which heavily featured jokes, insults, and criticisms. These findings were used to train machine learning classifiers to detect advocacy and counter-advocacy. The classifiers confirmed the predominance of counter-advocacy in Korean posts, compared with the substantial prevalence of advocacy in English posts.
These findings informed culturally sensitive recommendations for social media use by mental health advocates and implications for international social media studies in human-computer interaction.

Investigating Disease Mechanisms and Drug Response Differences in Transcriptomics Sequencing Data (2022-01)
Simpson, Edward Ronald Jr.; Liu, Yunlong; Janga, Sarath; Wan, Jun; Wu, Huanmei; Yan, Jingwen
In eukaryotes, genetic information is encoded by DNA, transcribed to precursor messenger RNA (pre-mRNA), processed into mature messenger RNA (mRNA), and translated into functional proteins. Splicing of pre-mRNA is an important epigenetic process that alters the function of proteins by modifying the exon structure of mature mRNA transcripts and is known to contribute greatly to the diversity of the human proteome. The vast majority of human genes are expressed through multiple transcript isoforms. Expression of genes through splicing of pre-mRNA plays crucial roles in cellular development, identity, and processes. Both the identity of genes selected for transcription and the specific transcript isoforms that are expressed are essential for normal cellular function. Deviations in gene expression or isoform proportion can be an indication or the cause of disease. RNA sequencing (RNAseq) is a high-throughput next-generation sequencing technology that allows for the interrogation of gene expression on a massive scale. RNAseq generates short sequences that reflect pieces of the mRNAs present in a sample. RNAseq can therefore be used to explore differences in gene expression, reveal transcript isoform identities, and compare changes in isoform proportions. In this dissertation, I design and apply advanced analysis techniques to RNAseq, phenotypic, and drug response data to investigate disease mechanisms and drug sensitivity. Research Goals: The work described in this dissertation accomplishes four aims.
Aim 1) Evaluate the gene expression signature of concussion in collegiate athletes and identify potential biomarkers for response and recovery. Aim 2) Implement a machine-learning algorithm to determine if splicing can predict drug response in cancer cell lines. Aim 3) Design a fast, scalable method to identify differentially spliced events related to cancer drug response. Aim 4) Construct a drug-splicing network and use a systems biology approach to search for similarities in underlying splicing events.

Identifying Metaphors Used by Clinicians That Help Patients Conceptualize Complex Cardiac Device Data for Managing Their Health (2021-12)
Daley, Carly Noel; Holden, Richard; Jones, Josette; Bolchini, Davide; Bute, Jennifer
Metaphors are used to conceptualize one thing in terms of another that is more familiar or concrete. The use of metaphors in patient-provider communication has helped providers generate empathy and explain concepts effectively, improving patient satisfaction and understanding of health-related concepts. With advances in technology, concepts related to health monitoring have become increasingly complex, making the use of metaphors in health communication more relevant than ever. With the increase in health data, there is a need to improve tools that help people understand complex information. Ethical considerations, such as possible misinterpretation of health data, as well as the potential to widen disparities because of factors such as health literacy, must be addressed. Metaphors are powerful tools that can make the explanation of information accessible, accurate, and effective for people who are monitoring their data. The current research aims to contribute design recommendations for using metaphors in communication between clinicians and patients for monitoring biventricular (BiV) pacing, a complex device data element used in the monitoring of patients with heart failure (HF) who have cardiac resynchronization therapy (CRT) devices.
The overarching goal is to understand this process such that it can be applied to broader communication needs in health informatics. The study addresses the following aims: Aim 1: Identify metaphors clinicians use to conceptualize BiV pacing for CRT devices using semi-structured interviews with clinician experts. Aim 2: Identify metaphors that help patients conceptualize BiV pacing for CRT devices using semi-structured interviews with patients, and explore the metaphors identified in Aim 1. Aim 3: Develop design recommendations for health informatics interventions using an understanding of metaphors that help patients understand BiV pacing for CRT devices. Themes from the analysis for Aims 1 and 2 contribute to recommendations for the use of metaphors in health informatics interventions. The purpose of this work is to contribute to an in-depth understanding of metaphors in a specific health informatics context. Importantly, this research applies methods and principles from the field of health communication to address a communication-related issue in health informatics.