NU-MineBench

Sponsors:

National Science Foundation (grants CCF-0444405, CNS-0406341, CCR-0325207, CNS-0551639, CNS-0551551)
Department of Energy (grant DE-FC02-01ER25485)
Intel Corporation

Description:

NU-MineBench is a data mining benchmark suite containing a mix of representative data mining applications from different application domains. It is intended for use in computer architecture research, systems research, performance evaluation, and high-performance computing. The well-known applications assembled in the suite were collected from research groups in industry and academia and contain highly optimized versions of the data mining algorithms. Scalable versions of the applications, designed and implemented by developers at Northwestern University, are also provided. Currently, the benchmark includes applications with algorithms based on clustering, association rules, classification, Bayesian networks, pattern recognition, support vector machines, and several other well-known data mining methodologies. These applications are used in diverse fields such as bioinformatics, network intrusion detection, customer relationship management, and marketing.
If you would like to contribute a well-known, stable application to our benchmark suite, please do not hesitate to contact us.

List of algorithms and applications

Approximate Frequent Itemset Miner
Apriori association rule mining
Naive Bayesian Network data classifier
BIRCH data clustering
ECLAT association rule mining
GeneNet, a DNA sequencing application using a Bayesian network
HOP, density-based data clustering
K-means and Fuzzy K-means data clustering
Parallel ETI Mining
PLSA (Parallel Linear Space Alignment)
Recursive_Weak, Recursive_Weak_pp
RSearch, sequence database searching with RNA structure queries
ScalParC decision-tree based data classification
Semphy, a structure learning algorithm that is based on phylogenetic trees
SNP (Single Nucleotide Polymorphisms) data classification
SVM-RFE (Support Vector Machines - Recursive Feature Elimination), a feature selection algorithm
Utility mining, an association rule-based mining algorithm

Download:

The NU-MineBench software package can be obtained from its download page.

PLEASE NOTE: NU-MineBench is copyrighted by CUCIS@Northwestern. This benchmark is intended for use in computer architecture research, systems research, performance evaluation, and high-performance computing. The codes in the suite have been modified by the development team at Northwestern University to produce a uniform and consolidated benchmark suite. All rights reserved.
README file - README.txt
Our technical report for NU-MineBench-2.0 - CUCIS-2004-08-001.pdf
Data Mining Benchmark Presentation, Speaker: Dr. Alok Choudhary, Location: Intel Corporation (MRL), Date: March 15, 2004

Release notes:

December 2010 - A data generator for clustering algorithms was added. Note that NU-MineBench version 3.0 also reads data generated by the IBM Quest Data Generator, which creates synthetic data sets for association rule mining and classification applications. The IBM Quest Data Generator can be downloaded from IBM's website.
October 2008 - New applications of Approximate Frequent Pattern Mining Algorithms were added.
August 2005 - New applications of bioinformatics, network intrusion, probabilistic networks, etc. were added.

Acknowledgments:

The NU-MineBench benchmark is an effort to bring together a diverse mix of applications from multiple application domains. Researchers from academia and industry contributed to our efforts. We would like to sincerely thank the following groups and personnel for their valuable contributions (code, technology, algorithms, etc.):

Mohammed J. Zaki at Rensselaer Polytechnic Institute for the association rule based algorithms
Vipin Kumar at University of Minnesota for the classification algorithms
Intelligent Information Systems Research Group at IBM Almaden Research Center for the association rule algorithms and the dataset generators
Daniel Eisenstein and Piet Hut at the Institute for Advanced Study for their astrophysics application
Computation Astrophysics Laboratory at University of California, San Diego for the cosmological application and the dataset generators
Christian Borgelt at Otto-von-Guericke-University of Magdeburg (Germany) for the classification and association rule applications
Brendan McCane at University of Otago (New Zealand) for the fuzzy-based applications
Intel Corporation for contributing the Parallel Linear Space Alignment (PLSA) application, the high-performance GeneNet application, and the OpenMP-based RSearch application, as well as their computer vision library
Nir Friedman at Hebrew University, Jerusalem for the Semphy application
Eddy Lab at Washington University in St. Louis for the RSearch application
Kernel Machines and Intel Microprocessor Research Labs (MRL) for the SVM-RFE package
Intel Microprocessor Research Labs (MRL) and University of Wisconsin for the PCx package
Intel Microprocessor Research Labs (MRL) for testing the benchmark suite
Carole Dulong at Intel MRL for providing futuristic platform workloads and for feedback on characterization
Pradeep Dubey at Intel MRL for providing feedback on characterization
Sanjay Goil at Sun Microsystems for contributing the performance evaluation software for SPARC systems
Intel Corporation for their compiler tools and performance evaluation software

Publications:

Ramanathan Narayanan, Berkin Ozisikyilmaz, Joseph Zambreno, Jayaprakash Pisharath, Gokhan Memik, and Alok Choudhary, "MineBench: A Benchmark Suite for Data Mining Workloads", Proceedings of the International Symposium on Workload Characterization (IISWC), October 2006.
Berkin Ozisikyilmaz, Ramanathan Narayanan, Joseph Zambreno, Gokhan Memik, and Alok Choudhary, "An Architectural Characterization Study of Data Mining and Bioinformatics Workloads", Proceedings of the International Symposium on Workload Characterization (IISWC), October 2006.
R. Narayanan, D. Honbo, G. Memik, A. Choudhary, J. Zambreno, "An FPGA Implementation of Decision Tree Classification", Proceedings of Design, Automation, and Test in Europe (DATE), April 2007.
R. Narayanan, B. Ozisikyilmaz, G. Memik, A. Choudhary, J. Zambreno, "Quantization Error and Accuracy-Performance Tradeoffs for Embedded Data Mining Workloads", Proceedings of High Performance Data Mining Workshop (HPDM), May 2007.
B. Ozisikyilmaz, G. Memik, A. Choudhary, "Machine Learning Models to Predict Performance of Computer System Design Alternatives", In Proc. of the 37th International Conference on Parallel Processing (ICPP), Portland, OR, 2008.
B. Ozisikyilmaz, G. Memik, A. Choudhary, "Efficient System Design Space Exploration Using Machine Learning Techniques", In Proc. of Design Automation Conference (DAC), Anaheim, CA, 2008.
Rohit Gupta, Tushar Garg, Gaurav Pandey, Michael Steinbach, and Vipin Kumar, "Comparative Study of Various Genomic Data Sets for Protein Function Prediction and Enhancements Using Association Analysis", In Proc. Workshop on Data Mining for Biomedical Informatics, held in conjunction with the SIAM International Data Mining Conference, 2007.
Gaurav Pandey, Michael Steinbach, Rohit Gupta, Tushar Garg, and Vipin Kumar, "Association Analysis-based Transformations for Protein Interaction Networks: A Function Prediction Case Study", ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007.
Gaurav Pandey, Lakshmi Naarayanan Ramakrishnan, Michael Steinbach, and Vipin Kumar, "Systematic Evaluation of Scaling Methods for Gene Expression Data", IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2008.
B.V. Ness, C. Ramos, M. Haznadar, A. Hoering, J. Haessler, J. Crowley, S. Jacobus, M. Oken, V. Rajkumar, P. Greipp, B. Barlogie, B. Durie, M. Katz, G. Atluri, G. Fang, R. Gupta, M. Steinbach, V. Kumar, R. Mushlin, D. Johnson, and G. Morgan, "Genomic variation in myeloma: design, content, and initial application of the Bank On A Cure SNP Panel to detect associations with progression-free survival", BMC Medicine, vol. 6, 2008.
Boriah, S., Kumar, V., Steinbach, M., Potter, C., and Klooster, S., "Land cover change detection: a case study", 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008.
Shyam Boriah, Varun Chandola, and Vipin Kumar, "Similarity Measures for Categorical Data: A Comparative Evaluation", SIAM International Conference on Data Mining (SDM), 2008.
Varun Chandola, Arindam Banerjee, and Vipin Kumar, "Anomaly Detection: A Survey", ACM Computing Surveys, 2009.

Contact:

Most files in the suite are self-explanatory and include comments. If you run into issues you cannot resolve, or would like to offer suggestions or contribute software, please email us.