IU Indianapolis ScholarWorks :: Browsing by Subject "Big Data"

Browsing by Subject "Big Data"

Now showing 1 - 7 of 7

A-Optimal Subsampling For Big Data General Estimating Equations
(2019-08) Cheung, Chung Ching; Peng, Hanxiang; Rubchinsky, Leonid; Boukai, Benzion; Lin, Guang; Al Hasan, Mohammad
A significant hurdle for analyzing big data is the lack of effective technology and statistical inference methods. A popular approach for analyzing data with large sample is subsampling. Many subsampling probabilities have been introduced in literature (Ma, \emph{et al.}, 2015) for linear model. In this dissertation, we focus on generalized estimating equations (GEE) with big data and derive the asymptotic normality for the estimator without resampling and estimator with resampling. We also give the asymptotic representation of the bias of estimator without resampling and estimator with resampling. we show that bias becomes significant when the data is of high-dimensional. We also present a novel subsampling method called A-optimal which is derived by minimizing the trace of some dispersion matrices (Peng and Tan, 2018). We derive the asymptotic normality of the estimator based on A-optimal subsampling methods. We conduct extensive simulations on large sample data with high dimension to evaluate the performance of our proposed methods using MSE as a criterion. High dimensional data are further investigated and we show through simulations that minimizing the asymptotic variance does not imply minimizing the MSE as bias not negligible. We apply our proposed subsampling method to analyze a real data set, gas sensor data which has more than four millions data points. In both simulations and real data analysis, our A-optimal method outperform the traditional uniform subsampling method.
Correlations of Online Search Engine Trends with Coronavirus Disease (COVID-19) Incidence: Infodemiology Study
(JMIR Publications, 2020-05-21) Higgins, Thomas S.; Wu, Arthur W.; Sharma, Dhruv; Illing, Elisa A.; Rubel, Kolin E.; Ting, Jonathan Y.; Otolaryngology -- Head and Neck Surgery, School of Medicine
Background: The coronavirus disease (COVID-19) is the latest pandemic of the digital age. With the internet harvesting large amounts of data from the general population in real time, public databases such as Google Trends (GT) and the Baidu Index (BI) can be an expedient tool to assist public health efforts. Objective: The aim of this study is to apply digital epidemiology to the current COVID-19 pandemic to determine the utility of providing adjunctive epidemiologic information on outbreaks of this disease and evaluate this methodology in the case of future pandemics. Methods: An epidemiologic time series analysis of online search trends relating to the COVID-19 pandemic was performed from January 9, 2020, to April 6, 2020. BI was used to obtain online search data for China, while GT was used for worldwide data, the countries of Italy and Spain, and the US states of New York and Washington. These data were compared to real-world confirmed cases and deaths of COVID-19. Chronologic patterns were assessed in relation to disease patterns, significant events, and media reports. Results: Worldwide search terms for shortness of breath, anosmia, dysgeusia and ageusia, headache, chest pain, and sneezing had strong correlations (r>0.60, P<.001) to both new daily confirmed cases and deaths from COVID-19. GT COVID-19 (search term) and GT coronavirus (virus) searches predated real-world confirmed cases by 12 days (r=0.85, SD 0.10 and r=0.76, SD 0.09, respectively, P<.001). Searches for symptoms of diarrhea, fever, shortness of breath, cough, nasal obstruction, and rhinorrhea all had a negative lag greater than 1 week compared to new daily cases, while searches for anosmia and dysgeusia peaked worldwide and in China with positive lags of 5 days and 6 weeks, respectively, corresponding with widespread media coverage of these symptoms in COVID-19. Conclusions: This study demonstrates the utility of digital epidemiology in providing helpful surveillance data of disease outbreaks like COVID-19. Although certain online search trends for this disease were influenced by media coverage, many search terms reflected clinical manifestations of the disease and showed strong correlations with real-world cases and deaths.
Developing Bottom-Up, Integrated Omics Methodologies for Big Data Biomarker Discovery
(2020-11) Kechavarzi, Bobak David; Wu, Huanmei; Doman, Thompson; Dow, Ernst; Liu, Yunlong; Liu, Xiaowen; Yan, Jingwen
The availability of highly-distributed computing compliments the proliferation of next generation sequencing (NGS) and genome-wide association studies (GWAS) datasets. These data sets are often complex, poorly annotated or require complex domain knowledge to sensibly manage. These novel datasets provide a rare, multi-dimensional omics (proteomics, transcriptomics, and genomics) view of a single sample or patient. Previously, biologists assumed a strict adherence to the central dogma: replication, transcription and translation. Recent studies in genomics and proteomics emphasize that this is not the case. We must employ big-data methodologies to not only understand the biogenesis of these molecules, but also their disruption in disease states. The Cancer Genome Atlas (TCGA) provides high-dimensional patient data and illustrates the trends that occur in expression profiles and their alteration in many complex disease states. I will ultimately create a bottom-up multi-omics approach to observe biological systems using big data techniques. I hypothesize that big data and systems biology approaches can be applied to public datasets to identify important subsets of genes in cancer phenotypes. By exploring these signatures, we can better understand the role of amplification and transcript alterations in cancer.
Distributed graph decomposition algorithms on Apache Spark
(2018-04-20) Mandal, Aritra; Hasan, Mohammad Al; Mohler, George; Song, Fengguang
Structural analysis and mining of large and complex graphs for describing the characteristics of a vertex or an edge in the graph have widespread use in graph clustering, classiﬁcation, and modeling. There are various methods for structural analysis of graphs including the discovery of frequent subgraphs or network motifs, counting triangles or graphlets, spectral analysis of networks using eigenvectors of graph Laplacian, and ﬁnding highly connected subgraphs such as cliques and quasi cliques. Unfortunately, the algorithms for solving most of the above tasks are quite costly, which makes them not-scalable to large real-life networks. Two such very popular decompositions, k-core and k-truss of a graph give very useful insight about the graph vertex and edges respectively. These decompositions have been applied to solve protein functions reasoning on protein-protein networks, fraud detection and missing link prediction problems. k-core decomposition with is linear time complexity is scalable to large real-life networks as long as the input graph ﬁts in the main memory. k-truss on the other hands is computationally more intensive due to its deﬁnition relying on triangles and their is no linear time algorithm available for it. In this paper, we propose distributed algorithms on Apache Spark for k-truss and k-core decomposition of a graph. We also compare the performance of our algorithm with state-of-the-art Map-Reduce and parallel algorithms using openly available real world network data. Our proposed algorithms have shown substantial performance improvement.
Learning Analytics and the Academic Library: Professional Ethics Commitments at a Crossroads
(ACRL, 2018) Jones, Kyle M. L.; Library and Information Science, School of Informatics and Computing
In this paper, the authors address learning analytics and the ways academic libraries are beginning to participate in wider institutional learning analytics initiatives. Since there are moral issues associated with learning analytics, the authors consider how data mining practices run counter to ethical principles in the American Library Association’s “Code of Ethics.” Specifically, the authors address how learning analytics implicates professional commitments to promote intellectual freedom; protect patron privacy and confidentiality; and balance intellectual property interests between library users, their institution, and content creators and vendors. The authors recommend that librarians should embed their ethical positions in technological designs, practices, and governance mechanisms.
A Smart and Interactive Edge-Cloud Big Data System
(2021-08) Stauffer, Jake; Zhang, Qingxue; King, Brian; Fang, Shiaofen
Data and information have increased exponentially in recent years. The promising era of big data is advancing many new practices. One of the emerging big data applications is healthcare. Large quantities of data with varying complexities have been leading to a great need in smart and secure big data systems. Mobile edge, more specifically the smart phone, is a natural source of big data and is ubiquitous in our daily lives. Smartphones offer a variety of sensors, which make them a very valuable source of data that can be used for analysis. Since this data is coming directly from personal phones, that means the generated data is sensitive and must be handled in a smart and secure way. In addition to generating data, it is also important to interact with the big data. Therefore, it is critical to create edge systems that enable users to access their data and ensure that these applications are smart and secure. As the first major contribution of this thesis, we have implemented a mobile edge system, called s2Edge. This edge system leverages Amazon Web Service (AWS) security features and is backed by an AWS cloud system. The implemented mobile application securely logs in, signs up, and signs out users, as well as connects users to the vast amounts of data they generate. With a high interactive capability, the system allows users (like patients) to retrieve and view their data and records, as well as communicate with the cloud users (like physicians). The resulting mobile edge system is promising and is expected to demonstrate the potential of smart and secure big data interaction. The smart and secure transmission and management of the big data on the cloud is essential for healthcare big data, including both patient information and patient measurements. The second major contribution of this thesis is to demonstrate a novel big data cloud system, s2Cloud, which can help enhance healthcare systems to better monitor patients and give doctors critical insights into their patients' health. s2Cloud achieves big data security through secure sign up and log in for the doctors, as well as data transmission protection. The system allows the doctors to manage both patients and their records effectively. The doctors can add and edit the patient and record information through the interactive website. Furthermore, the system supports both real-time and historical modes for big data management. Therefore, the patient measurement information can, not only be visualized and demonstrated in real-time, but also be retrieved for further analysis. The smart website also allows doctors and patients to interact with each other effectively through instantaneous chat. Overall, the proposed s2Cloud system, empowered by smart secure design innovations, has demonstrated the feasibility and potential for healthcare big data applications. This study will further broadly benefit and advance other smart home and world big data applications.
Word Adjacency Graph Modeling: Separating Signal From Noise in Big Data
(Sage, 2017-01) Miller, Wendy R.; Groves, Doyle; Knopf, Amelia; Otte, Julie L.; Silverman, Ross D.; School of Nursing
There is a need to develop methods to analyze Big Data to inform patient-centered interventions for better health outcomes. The purpose of this study was to develop and test a method to explore Big Data to describe salient health concerns of people with epilepsy. Specifically, we used Word Adjacency Graph modeling to explore a data set containing 1.9 billion anonymous text queries submitted to the ChaCha question and answer service to (a) detect clusters of epilepsy-related topics, and (b) visualize the range of epilepsy-related topics and their mutual proximity to uncover the breadth and depth of particular topics and groups of users. Applied to a large, complex data set, this method successfully identified clusters of epilepsy-related topics while allowing for separation of potentially non-relevant topics. The method can be used to identify patient-driven research questions from large social media data sets and results can inform the development of patient-centered interventions.

Browsing by Subject "Big Data"

Results Per Page

Sort Options