Robert R. McCormick School of Engineering and Applied Science Electrical Engineering and Computer Science Department Center for Ultra-scale Computing and Information Security at Northwestern University
At CUCIS, we are focused on developing sophisticated solutions to problems relating to scalable processing and I/O; computer security and information assurance; and high performance data mining.

To learn more about these projects, please see our publications page.







Title: Data Centered Materials Knowledge Discovery
Duration: October 2012 -
Objectives: The materials discovery process can be significantly expedited and simplified if we can learn effectively from available knowledge and data regarding materials processing, structures, properties, and performance. Interdisciplinary research between materials science and computer science promises to revolutionize the way materials are discovered, designed, and manufactured by integrating information across length and time scales for all relevant materials phenomena. At its core, it involves the development of materials models that quantitatively describe processing-structure-property relationships while handling the profound complexity and breadth of issues that must be addressed in the engineering of materials.

Title: Health Informatics Data Mining
Duration: August 2011 -
Objectives: The rapid growth of publicly available health-related datasets, the increasing volume of electronic medical records, and the availability of health-related social media posts introduce new opportunities to develop clinical decision-making systems. The goal of this project is to develop algorithms and frameworks that extract information from these diverse datasets and build decision-making systems on top of them.

Title: Parallel NetCDF
Duration: July 2001 - Present
Objectives: Parallel netCDF is a parallel I/O library for accessing files in the netCDF format. NetCDF is a software package widely used in the scientific community for storing data files; it consists of a set of application programming interfaces (APIs) and a portable file format. Parallel netCDF is built on top of MPI-IO to guarantee portability and high performance.
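The core idea behind Parallel netCDF's performance, in which each MPI process writes its own subarray of a shared variable at an independently computed offset, can be sketched without MPI. The sketch below is a conceptual illustration in plain Python using POSIX-style seeks; it is not the PnetCDF API itself, and the file layout and helper names are assumptions.

```python
import os
import struct
import tempfile

# Conceptual sketch: each of `nprocs` "processes" writes its own slice of a
# shared 1-D variable of doubles at a byte offset derived from its rank.
# This is the access pattern that MPI-IO (and hence Parallel netCDF) makes
# portable and efficient; plain seek/write stands in for the PnetCDF API.

def write_slice(path, rank, data):
    """Write this rank's doubles at byte offset rank * len(data) * 8."""
    with open(path, "r+b") as f:
        f.seek(rank * len(data) * 8)
        f.write(struct.pack(f"{len(data)}d", *data))

nprocs, per_rank = 4, 3
path = os.path.join(tempfile.mkdtemp(), "var.bin")
# Pre-size the file, as a netCDF fixed-size variable would be.
with open(path, "wb") as f:
    f.truncate(nprocs * per_rank * 8)

for rank in range(nprocs):  # under MPI these writes would run concurrently
    write_slice(path, rank, [rank * 10.0 + i for i in range(per_rank)])

with open(path, "rb") as f:
    values = struct.unpack(f"{nprocs * per_rank}d", f.read())
print(values[:3])   # rank 0's slice: (0.0, 1.0, 2.0)
```

Because each rank touches a disjoint byte range, no locking is needed; MPI-IO additionally lets such writes be expressed collectively so the library can optimize them.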

Title: Understanding, Analyzing, and Retrieving Knowledge from Social Media
Duration: September 2010 -
Objectives: The rapid growth of social media sites and the vast amount of user-generated content create a need for specialized data mining, sentiment analysis, and network analysis algorithms. Large amounts of user activity, in the form of messages and ratings, are downloaded via publicly available social media APIs such as the Twitter API. This raw data is processed to reveal structures, patterns, and communities in networks. The goal of this project is to develop algorithms and frameworks that retrieve valuable nuggets of knowledge from this huge amount of data.
On-line Tools:
  • Sentiment Analysis for Social Media Data -- an API for a sentiment analysis service and benchmark data
  • Digital Flu Surveillance -- real-time flu surveillance using social media data
  • Digital Cancer Surveillance -- real-time cancer surveillance using social media data
  • Digital Allergy Surveillance -- real-time allergy surveillance using social media data
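The kind of classification the sentiment analysis service above performs can be illustrated with a minimal lexicon-based scorer. The word lists and scoring rule below are toy assumptions for illustration, not the model behind the CUCIS API.

```python
# Toy lexicon-based sentiment scorer: count positive vs. negative words.
# The lexicons and the thresholding rule are illustrative assumptions.
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "sad", "awful"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this great phone"))   # positive
print(sentiment("terrible battery life"))     # negative
```

Real systems replace the hand-built lexicon with supervised models trained on labeled posts, but the input/output contract (text in, polarity label out) is the same.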

Title: Understanding Climate Change: A Data Driven Approach
Duration: September 2010 - August 2016
Objectives: Climate change is the defining environmental challenge now facing our planet. Whether it is an increase in the frequency or intensity of hurricanes, rising sea levels, droughts, floods, or extreme temperatures and severe weather, the social, economic, and environmental consequences are great as the resource-stressed planet nears 7 billion inhabitants later this century. Yet there is considerable uncertainty as to the social and environmental impacts because the predictive potential of numerical models of the earth system is limited. Data driven methods that have been highly successful in other facets of the computational sciences are now being used in the environmental sciences with success. The objective of the Expedition project is to significantly advance key challenges in climate change science by developing exciting and innovative new data driven approaches that take advantage of the wealth of climate and ecosystem data now available from satellite and ground-based sensors, the observational record for atmospheric, oceanic, and terrestrial processes, and physics-based climate model simulations.

Title: High-Performance DNA Sequence Mapping
Duration: September 2006 - August 2011
Objectives: This project studies methods to improve DNA sequence mapping in terms of both performance and result accuracy. In particular, we focus on mapping long DNA reads: much prior research has addressed short reads, the mapping problem for long reads is more computationally intensive, and existing short-read algorithms are not directly applicable to long reads. We have developed AGILE (AliGnIng Long rEads), a hash-table-based sequence mapping algorithm. This project also studies software approaches that use GPUs to accelerate DNA pairwise statistical significance estimation, a component of numerous bioinformatics applications. Several enhanced GPU memory-access algorithms have been developed, including a tile-based scheme that produces a contiguous access pattern to global memory and sustains a large number of threads to achieve high GPU occupancy.
On-line Tools:
  • AGILE -- a sequence mapping tool to map long reads to a given reference genome
  • Pairwise Statistical Significance -- a software suite that estimates the statistical significance of local sequence alignment (in MPI, OpenMP, and CUDA)
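AGILE is described above as a hash-table-based mapping algorithm. The sketch below shows a generic seed-and-vote mapper built on a k-mer hash table; the choice of k, the toy reference string, and the voting rule are assumptions for illustration, not AGILE's actual design.

```python
from collections import defaultdict, Counter

def build_index(reference, k):
    """Hash table from each k-mer to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, index, k):
    """Seed-and-vote: each seed hit votes for the implied read start position."""
    votes = Counter()
    for i in range(len(read) - k + 1):
        for pos in index.get(read[i:i + k], []):
            votes[pos - i] += 1           # seed at read offset i implies start pos - i
    return votes.most_common(1)[0][0] if votes else None

reference = "ACGTACGTTAGCCGATTACAGGCAT"
index = build_index(reference, k=5)
print(map_read("AGCCGATTA", index, k=5))  # 9: the read occurs at reference offset 9
```

Voting over many seeds makes the mapping robust to a few mismatches in long reads, since the correct start position still collects the most seed hits.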

Title: DiscKNet - Discovering Knowledge from Scientific Research Networks
Duration: September 2011 - August 2013
Objectives: Advancement in the scientific research and discovery process can be significantly accelerated by mining and analyzing scientific data generated from various sources. In addition to the experimental data produced from simulations and observations, there is another category of scientific data: scientific research and process data, such as the discussions and outcomes of completed and in-progress research projects, in the form of technical reports, research papers, discussion forums, mailing lists, research blogs, etc., and the connections between research activities. The goal of this project is to develop an infrastructure called DiscKNet (Discovering Knowledge from Scientific Research Networks) to mine this enriched scientific research network for emerging trends, new research areas, potential collaborations, and more.

Title: Damsel - A Data Model Storage Library for Exascale
Duration: October 2010 - September 2013
Objectives: Computational science applications on modern petascale computers use increasingly complex grids, solution methods on those grids, and data that link the two together. Several common motifs can be described, each with distinct grid types and computation/communication patterns; examples include structured AMR, unstructured finite element/volume, and spectral elements. The goal of Damsel is to enable exascale computational science applications to interact conveniently and efficiently with storage through abstractions that match their data models.

Title: X-Analytics - Scalable and Power Efficient Data Analytics for Hybrid Exascale Systems
Duration: October 2010 - September 2013
Objectives: A number of emerging trends in high performance computing necessitate the creation of next-generation algorithms and libraries to accelerate data mining and analysis, namely, the increasing size and complexity of data, the emergence of heterogeneous HPC systems, and the increasing importance of energy in HPC. The goal of the X-Analytics project is to produce a library of advanced and energy-efficient algorithms and data kernels for data analysis, improving the productivity of both scientists and HPC systems.

Title: Hecura - An Application Driven I/O Optimization Approach for Petascale Systems and Scientific Discoveries
Duration: May 2010 - April 2013
Objectives: The main goals of this project are to develop an application-driven approach to large-scale I/O that is fast, portable, scalable, metadata-rich, and easy to use for scientific workflows on petascale systems. Another major goal is to bring the real benefits of the scalable I/O techniques developed in this project to production applications, including reading and writing petabytes of data, scalable checkpoint/restart, and data analysis and post-processing, in a way that allows scientists to choose or adapt different formats or I/O libraries based on their needs and requirements.

Title: ELLF - Extensible Language and Library Frameworks for Scalable and Efficient Data-Intensive Applications
Duration: August 2009 - August 2012
Objectives: The growth of scientific data sets to terabyte and petabyte sizes offers significant opportunities for important discoveries in fields such as combustion chemistry, nanoscience, astrophysics, cosmology, fusion, climate prediction, and biology. To address the challenges of data-intensive applications, we propose to implement an extensible language framework, backed by a rich and expressive collection of high-performance I/O and analytics libraries, which will provide an application development environment in which domain- and application-specific language extensions allow programmers and scientists to more easily and directly express solutions to data-intensive problems as programs written in domain-adapted languages.

Title: Active Storage with Analytics Capabilities and I/O Runtime System for Petascale Systems
Duration: July 2008 - July 2012
Objectives: As data sizes continue to increase, the concept of active storage is well suited to many data analysis kernels. Nevertheless, while this concept has been investigated and deployed in a number of forms, enabling it from the parallel I/O software stack has been largely unexplored. We propose and evaluate an active storage system that allows data analysis, mining, and statistical operations to be executed from within a parallel I/O interface.

Title: NSF SDCI: Scalable I/O Optimizations for Peta-scale Systems
Duration: September 2007 - 2010
Objectives: This project addresses the I/O software problem for petascale parallel machines and aims to improve, enhance, develop, and deploy robust software infrastructure that provides end-to-end scalable I/O performance.

Title: Design, Development and Evaluation of High Performance Data Mining Systems
Duration: January 2004 - Present
Objectives: The primary goal of this project is to enable the design and development of next generation systems for data mining. We plan to design customized data mining systems that provide high performance through massive speedups. The goal of this research is not only to provide novel system architectures but also to provide reliable, high-performance data mining algorithms, applications and software.

Title: Scalable Optimizations for MPI-IO
Duration: September 2004 - August 2011
Objectives: In this project, we address several I/O problems in today's parallel computing environments, including client-side file caching, versioning, non-contiguous I/O, and MPI I/O semantics for file atomicity and consistency. Traditional solutions, such as byte-range file locking, often centralize I/O control, which can significantly hamper I/O parallelism. We propose several alternative strategies and demonstrate that they are highly scalable.
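One scalable alternative to lock-based I/O control is request aggregation, the idea behind two-phase collective I/O: many small strided requests from different processes are merged so the file system sees a single contiguous write. Below is a minimal sketch of the merge step; the names and the in-memory layout are assumptions for illustration.

```python
# Toy illustration of I/O aggregation (the exchange phase of two-phase
# collective I/O is elided): interleaved per-process byte-range requests
# are merged into one contiguous buffer, turning many small strided writes
# into a single large one.
def aggregate(requests):
    """requests: list of (offset, bytes). Returns (start_offset, merged_buffer)."""
    requests = sorted(requests)
    start = requests[0][0]
    size = max(off + len(buf) for off, buf in requests) - start
    merged = bytearray(size)
    for off, buf in requests:
        merged[off - start:off - start + len(buf)] = buf
    return start, bytes(merged)

# Three "processes"; process p owns 4-byte records p and p+3 of a shared file,
# so the six requests interleave across the first 24 bytes.
reqs = [((p + 3 * r) * 4, bytes([p]) * 4) for p in range(3) for r in range(2)]
start, merged = aggregate(reqs)
print(start, len(merged))   # 0 24: six strided requests become one write
```

With the requests merged, no byte-range locks are needed and the underlying file system receives one well-formed write instead of six tiny ones.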

Title: Hardware/Compiler Co-Design Approaches to Software Protection
Duration: September 2003 - Present
Objectives: The overriding objective of this project is to open a line of compiler-FPGA research for software security through the development of compiler algorithms, FPGA design optimizations, and an assessment of the overall approach. The main idea behind the proposed approach is to hide code sequences within instructions in executables that are then interpreted by supporting FPGA hardware to provide both a "language" (the code sequences) and a "virtual machine within a machine" (the FPGA) that will allow designers considerable flexibility in providing software protection. We hope to stimulate research on security techniques at the hardware-software boundary using FPGAs, similar to research focusing on performance. A larger goal is to develop solutions to National CyberInfrastructure problems.

Title: Enabling High Performance Application I/O
Duration: July 2001 - June 2011
Objectives: Many scientific applications are constrained by the rate at which data can be moved on and off storage resources. The goal of this work is to provide software that enables scientific applications to more efficiently access available storage resources. This includes work in parallel file systems, optimizations to middleware such as MPI-IO implementations, and the creation of new high-level application programmer interfaces (APIs) designed with high-performance parallel access in mind.

Title: High-Performance Data Management, Access, and Storage Techniques for Tera-scale Scientific Applications
Duration: August 1999 - September 2002
Objectives: To develop a scalable high-performance data management system that provides support for data management, query capability, and high-performance access to large datasets stored in hierarchical storage systems (HSS). This system will provide the flexibility of databases for indexing, searching, managing objects, and creating and keeping histories and trails of data accesses, while providing high-performance access methods and optimizations (pooled striping, pre-fetching, caching, collective I/O) for accessing the large-scale data objects found in scientific computations.
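Two of the optimizations named above, caching and pre-fetching, can be sketched with a small block cache. The block granularity, capacity, LRU eviction policy, and fetch-next-block heuristic below are illustrative assumptions, not the system's actual design.

```python
from collections import OrderedDict

# Toy block cache with sequential prefetch: a read of block b also fetches
# block b+1, so a sequential scan finds every block after the first already
# cached. Eviction is least-recently-used.
class BlockCache:
    def __init__(self, fetch, capacity=4):
        self.fetch = fetch          # fetch(block_id) -> data (the slow path)
        self.capacity = capacity
        self.cache = OrderedDict()  # block_id -> data, in LRU order
        self.misses = 0

    def _load(self, block):
        if block not in self.cache:
            self.misses += 1
            self.cache[block] = self.fetch(block)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)   # evict least recently used
        else:
            self.cache.move_to_end(block)
        return self.cache[block]

    def read(self, block):
        data = self._load(block)
        self._load(block + 1)       # prefetch the next block for sequential scans
        return data

cache = BlockCache(fetch=lambda b: f"block-{b}")
for b in range(4):                  # sequential scan of blocks 0..3
    cache.read(b)
print(cache.misses)   # 5: five fetches total; reads of blocks 1-3 all hit
```

In a real HSS, hiding three of four accesses behind prefetch matters far more, since the slow path may be a tape or remote tier rather than a lambda.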

Title: Design, Implementation and Evaluation of Parallel Pipelined Space-Time Adaptive Processing on Parallel Computers
Duration: January 1997 - May 1999
Objectives: To design, implement and evaluate computationally intensive real-time signal processing applications on high-performance parallel embedded systems. Another important goal of this project is to achieve a balance of throughput and latency through optimal use of the finite computational resources.

Northwestern University EECS Home | McCormick Home | Northwestern Home | Calendar: Plan-It Purple
© 2011 Robert R. McCormick School of Engineering and Applied Science, Northwestern University
"Tech": 2145 Sheridan Rd, Tech L359, Evanston IL 60208-3118  |  Phone: (847) 491-5410  |  Fax: (847) 491-4455
"Ford": 2133 Sheridan Rd, Ford Building, Rm 3-320, Evanston, IL 60208  |  Fax: (847) 491-5258
Email Director

Last Updated: April 22, 2017