NU-MineBench
Core Team:
- Jay Pisharath
- Ying Liu
- Berkin Ozisikyilmaz
- Ramanathan Narayanan
- Wei-keng Liao
- Alok Choudhary (PI)
- Gokhan Memik
Sponsors:
- National Science Foundation (grants CCF-0444405, CNS-0406341, CCR-0325207, CNS-0551639, CNS-0551551)
- Department of Energy (grant DE-FC02-01ER25485)
- Intel Corporation
Description:
NU-MineBench is a data mining benchmark suite containing a mix of representative data mining applications from different application domains. This benchmark is intended for use in computer architecture research, systems research, performance evaluation, and high-performance computing. The well-known applications assembled in this benchmark suite have been collected from research groups in industry and academia. The applications contain highly optimized versions of the data mining algorithms. Scalable versions of the applications are also provided; these extensions were designed and implemented by developers at Northwestern University. Currently, the benchmark includes applications with algorithms based on clustering, association rules, classification, Bayesian networks, pattern recognition, support vector machines, and several other well-known data mining methodologies. These applications are used in diverse fields such as bioinformatics, network intrusion, customer relationship management, and marketing. If you would like to contribute a well-known and stable application to our benchmark suite, please do not hesitate to contact us.
List of algorithms and applications
- Approximate Frequent Itemset Miner
- Apriori association rule mining
- Naive Bayesian Network data classifier
- BIRCH data clustering
- ECLAT association rule mining
- GeneNet, a DNA sequencing application using a Bayesian network
- HOP, density-based data clustering
- K-means and Fuzzy K-means data clustering
- Parallel ETI Mining
- PLSA (Parallel Linear Space Alignment)
- Recursive_Weak, Recursive_Weak_pp
- RSearch, sequence database searching with RNA structure queries
- ScalParC decision-tree based data classification
- Semphy, a structure learning algorithm that is based on phylogenetic trees
- SNP (Single Nucleotide Polymorphisms) data classification
- SVM-RFE (Support Vector Machines - Recursive Feature Elimination), a feature selection algorithm
- Utility mining, an association rule-based mining algorithm
Download:
The NU-MineBench software package can be obtained from its download page.
- PLEASE NOTE: NU-MineBench is copyrighted by CUCIS@Northwestern. This benchmark is intended for use in computer architecture research, systems research, performance evaluation, and high-performance computing. The codes in the suite have been modified by the development team at Northwestern University in order to produce a uniform and consolidated benchmark suite. All rights reserved.
- README file - README.txt
- Our technical report for NU-MineBench-2.0 - CUCIS-2004-08-001.pdf
- Data Mining Benchmark Presentation, Speaker: Dr. Alok Choudhary, Location: Intel Corporation (MRL), Date: March 15, 2004
Release notes:
- December 2010 - A data generator for clustering algorithms was added. Note that NU-MineBench version 3.0 also reads the data generated by the IBM Quest Data Generator, which creates synthesized data sets for association rule mining as well as classification applications. The data generator can be downloaded from IBM's website.
- October 2008 - New applications of Approximate Frequent Pattern Mining Algorithms were added.
- August 2005 - New applications of bioinformatics, network intrusion, probabilistic networks, etc. were added.
Acknowledgments:
The NU-MineBench benchmark is an effort to bring together a diverse mix of applications from multiple application domains. Researchers from academia and industry contributed to our efforts. We would like to sincerely thank the following groups and personnel for their valuable contributions (code, technology, algorithms, etc.):
- Mohammed J. Zaki at Rensselaer Polytechnic Institute for the association rule based algorithms
- Vipin Kumar at University of Minnesota for the classification algorithms
- Intelligent Information Systems Research Group at IBM Almaden Research Center for the association rule algorithms and the dataset generators
- Daniel Eisenstein and Piet Hut at the Institute for Advanced Study for their astrophysics application
- Computation Astrophysics Laboratory at University of California, San Diego for the cosmological application and the dataset generators
- Christian Borgelt at Otto-von-Guericke-University of Magdeburg (Germany) for the classification and association rule applications
- Brendan McCane at University of Otago (New Zealand) for the fuzzy-based applications
- Intel Corporation for contributing the Parallel Linear Space Alignment (PLSA) application, the high-performance GeneNet application, the OpenMP-based RSearch application, and their computer vision library
- Nir Friedman at Hebrew University, Jerusalem for the Semphy application
- Eddy Lab at Washington University in St. Louis for the RSearch application
- Kernel Machines and Intel Microprocessor Research Labs (MRL) for the SVM-RFE package
- Intel Microprocessor Research Labs (MRL) and University of Wisconsin for the PCx package
- Intel Microprocessor Research Labs (MRL) for testing the benchmark suite
- Carole Dulong at Intel MRL for providing futuristic platform workloads and for feedback on characterization
- Pradeep Dubey at Intel MRL for providing feedback on characterization
- Sanjay Goil at Sun Microsystems for contributing the performance evaluation software for SPARC systems
- Intel Corporation for their compiler tools and performance evaluation software
Publications:
- Ramanathan Narayanan, Berkin Ozisikyilmaz, Joseph Zambreno, Jayaprakash Pisharath, Gokhan Memik, and Alok Choudhary, "MineBench: A Benchmark Suite for Data Mining Workloads", Proceedings of the International Symposium on Workload Characterization (IISWC), October 2006. pdf
- Berkin Ozisikyilmaz, Ramanathan Narayanan, Joseph Zambreno, Gokhan Memik, and Alok Choudhary, "An Architectural Characterization Study of Data Mining and Bioinformatics Workloads", Proceedings of the International Symposium on Workload Characterization (IISWC), October 2006. pdf
- R. Narayanan, D. Honbo, G. Memik, A. Choudhary, J. Zambreno, "An FPGA Implementation of Decision Tree Classification", Proceedings of Design, Automation, and Test in Europe (DATE), April 2007. pdf
- R. Narayanan, B. Ozisikyilmaz, G. Memik, A. Choudhary, J. Zambreno, "Quantization Error and Accuracy-Performance Tradeoffs for Embedded Data Mining Workloads", Proceedings of High Performance Data Mining Workshop (HPDM), May 2007. pdf
- B. Ozisikyilmaz, G. Memik, A. Choudhary, "Machine Learning Models to Predict Performance of Computer System Design Alternatives", In Proc. of the 37th International Conference on Parallel Processing (ICPP), Portland, OR, 2008. pdf
- B. Ozisikyilmaz, G. Memik, A. Choudhary, "Efficient System Design Space Exploration Using Machine Learning Techniques", In Proc. of Design Automation Conference (DAC), Anaheim, CA, 2008. pdf
- Rohit Gupta, Tushar Garg, Gaurav Pandey, Michael Steinbach, and Vipin Kumar, "Comparative Study of Various Genomic Data Sets for Protein Function Prediction and Enhancements Using Association Analysis", In Proc. of the Workshop on Data Mining for Biomedical Informatics, held in conjunction with the SIAM International Data Mining Conference, 2007. pdf
- Gaurav Pandey, Michael Steinbach, Rohit Gupta, Tushar Garg, and Vipin Kumar, "Association Analysis-based Transformations for Protein Interaction Networks: A Function Prediction Case Study", ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007. pdf
- Gaurav Pandey, Lakshmi Naarayanan Ramakrishnan, Michael Steinbach, and Vipin Kumar, "Systematic Evaluation of Scaling Methods for Gene Expression Data", IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2008. pdf
- B.V. Ness, C. Ramos, M. Haznadar, A. Hoering, J. Haessler, J. Crowley, S. Jacobus, M. Oken, V. Rajkumar, P. Greipp, B. Barlogie, B. Durie, M. Katz, G. Atluri, G. Fang, R. Gupta, M. Steinbach, V. Kumar, R. Mushlin, D. Johnson, and G. Morgan, "Genomic variation in myeloma: design, content, and initial application of the Bank On A Cure SNP Panel to detect associations with progression-free survival", BMC Medicine, vol. 6, 2008. pdf
- Boriah, S., Kumar, V., Steinbach, M., Potter, C., and Klooster, S., "Land cover change detection: a case study", 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008. pdf
- Shyam Boriah, Varun Chandola and Vipin Kumar, "Similarity Measures for Categorical Data: A Comparative Evaluation", the SIAM International Conference on Data Mining, SDM 2008. pdf
- Varun Chandola, Arindam Banerjee, and Vipin Kumar, "Anomaly Detection: A Survey", ACM Computing Surveys, 2009. pdf