Publications: |
-
Allen D. Malony and Kevin A. Huck
General Hybrid Parallel Profiling.
Proceedings of PDP 2014
Abstract:
A hybrid parallel measurement system offers the potential to fuse the
principal advantages of probe-based tools, with their exact measures
of performance and ability to capture event semantics, and sampling-based tools,
with their ability to observe performance detail with less overhead.
Creating a hybrid profiling solution is challenging because it requires new
mechanisms for integrating probe and sample measurements and calculating
profile statistics during execution. In this paper, we describe a general hybrid
parallel profiling tool that has been implemented in the TAU Performance
System. Its generality comes from the fact that all of the features of the
individual methods are retained and can be flexibly controlled when
combined to address the measurement requirements for a particular parallel
application. The design of the hybrid profiling approach is described and
the implementation of the prototype in TAU presented. We demonstrate
hybrid profiling functionality first on a simple sequential program and
then show its use for several OpenMP parallel codes from the NAS Parallel
Benchmarks. These experiments also highlight the improvements in overhead
efficiency made possible by hybrid profiling. A large-scale ocean modeling
code based on OpenMP and MPI, MPAS-Ocean, is used to show how the TAU hybrid
profiling tool can be effective at exposing performance-limiting behavior
that would be difficult to identify otherwise.
-
Harald Servat, Germán Llort, Kevin Huck, Judit Giménez, Jesús Labarta
Framework for a productive performance optimization.
Parallel Computing
Abstract:
Modern supercomputers deliver large computational power, but it is difficult for an application to exploit such power. One factor that limits the application performance is the single node performance. While many performance tools use the microprocessor performance counters to provide insights on serial node performance issues, the complex semantics of these counters pose an obstacle to an inexperienced developer.
We present a framework that allows easy identification and qualification of serial node performance bottlenecks in parallel applications. The output of the framework is precise and it is capable of correlating performance inefficiencies with small regions of code within the application. The framework not only points to regions of code but also simplifies the semantics of the performance counters into metrics that refer to processor functional units. With such information the developer can focus on the identified code and improve it by knowing which processor execution unit is degrading the performance. To demonstrate the usefulness of the framework we apply it to three already optimized applications using realistic inputs and, according to the results, modify their source code. By making modifications that require little effort, we successfully increase the applications’ performance by 10% to 30%, and thus shorten the time required to reach the solution and/or allow tackling larger problem sizes.
-
Kevin Huck, Sameer Shende, Allen Malony, Hartmut Kaiser, Allan Porterfield, Rob Fowler, Ron Brightwell
An Early Prototype of an Autonomic Performance Environment for Exascale.
Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers
Abstract:
Extreme-scale computing requires a new perspective on the role of performance observation in the Exascale system software stack. Because of the anticipated high concurrency and dynamic operation in these systems, it is no longer reasonable to expect that a post-mortem performance measurement and analysis methodology will suffice. Rather, there is a strong need for performance observation that merges first- and third-person observation, in situ analysis, and introspection across stack layers that serves online dynamic feedback and adaptation. In this paper we describe the DOE-funded XPRESS project and the role of autonomic performance support in Exascale systems. XPRESS will build an integrated Exascale software stack (called OpenX) that supports the ParalleX execution model and is targeted towards future Exascale platforms. An initial version of an autonomic performance environment called APEX has been developed for OpenX using the current TAU performance technology and results are presented that highlight the challenges of highly integrative observation and runtime analysis.
-
Ahmad Qawasmeh, Abid Malik, Barbara Chapman, Kevin Huck, Allen Malony
Open Source Task Profiling by Extending the OpenMP Runtime API.
OpenMP in the Era of Low Power Devices and Accelerators.
Abstract:
The introduction of tasks in the OpenMP programming model brings a new level of parallelism. This also creates new challenges with respect to its meaning and applicability for event-based performance profiling. The OpenMP Architecture Review Board (ARB) has approved an interface specification known as the “OpenMP Runtime API for Profiling” to enable performance tools to collect performance data for OpenMP programs. In this paper, we propose new extensions to the OpenMP Runtime API for profiling task level parallelism. We present an efficient method to distinguish individual task instances in order to track their associated events at the micro level. We implement the proposed extensions in the OpenUH compiler, an open-source OpenMP compiler. With negligible overheads, we are able to capture important events like task creation, execution, suspension, and exiting. These events help in identifying overheads associated with the OpenMP tasking model, e.g., the time a task waits before it starts execution, task cleanup, etc. These events also help in constructing important parent-child relationships that define tasks' call paths. The proposed extensions are in line with the newest specifications recently proposed by the OpenMP tools committee for task profiling.
-
Jay Alameda, Wyatt Spear, Jeffrey L. Overbey, Kevin Huck, Gregory R. Watson, and Beth Tibbitts.
The Eclipse parallel tools platform: toward an integrated development environment for XSEDE resources.
In Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond (XSEDE '12). ACM, New York, NY, USA, Article 48, 8 pages.
Abstract:
Eclipse [1] is a widely used, open source integrated development environment that includes support for C, C++, Fortran, and Python. The Parallel Tools Platform (PTP) [2] extends Eclipse to support development on high performance computers. PTP allows the user to run Eclipse on her laptop, while the code is compiled, run, debugged, and profiled on a remote HPC system. PTP provides development assistance for MPI, OpenMP, and UPC; it allows users to submit jobs to the remote batch system and monitor the job queue. It also provides a visual parallel debugger.
The XSEDE community comprises a large part of PTP's user base, and we are actively working to make PTP a productive, easy-to-use development environment for the full breadth of XSEDE resources. In this paper, we will describe capabilities we have recently added to PTP to better support XSEDE resources. These capabilities include submission and monitoring of jobs on systems running Sun/Oracle Grid Engine, support for GSI authentication and MyProxy logon, support for environment modules, and integration with compilers from Cray and PGI. We will describe ongoing work and directions for future collaboration, including OpenACC support and parallel debugger integration.
-
Germán Llort, Marc Casas, Harald Servat, Kevin Huck, Judit Giménez, Jesús Labarta.
Trace Spectral Analysis toward Dynamic Levels of Detail.
2011 IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS), pp. 332-339, 7-9 Dec. 2011.
Abstract:
The emergence of Petascale systems has raised new challenges to performance analysis tools. Understanding every single detail of an execution is important to bridge the gap between the theoretical peak and the actual performance achieved. Tracing tools are the best option when it comes to providing detailed information about the application behavior, but not without liabilities. The amount of information that a single execution can generate grows so fast that it easily becomes unmanageable. An effective analysis in such scenarios necessitates the intelligent selection of information. In this paper we present an on-line performance tool based on spectral analysis of signals that automatically identifies the different computing phases of the application as it runs, selects a few representative periods and decides the granularity of the information gathered for these regions. As a result, the execution is completely characterized at different levels of detail, reducing the amount of data collected while maximizing the amount of useful information presented for the analysis.
-
Harald Servat, Germán Llort, Judit Giménez, Kevin A. Huck, Jesús Labarta
Unveiling Internal Evolution of Parallel Application Computation Phases.
40th International Conference on Parallel Processing (ICPP2011).
Taipei, Taiwan. September 13-16, 2011
Abstract:
As access to supercomputing resources is becoming more and more commonplace,
performance analysis tools are gaining importance in order to decrease the
gap between the application performance and the supercomputers' peak performance.
Performance analysis tools allow the analyst to understand the idiosyncrasies of
an application in order to improve it. However, these tools require monitoring
regions of the application to provide information to the analysts, leaving
non-monitored regions of code unknown, which may result in a lack of understanding
of important regions of the application. In this paper we describe an automated
methodology that reports very detailed application insights and improves the
analysis experience of performance tools based on traces. We apply this
methodology to three production applications and provide suggestions on how to
improve their performance. Our methodology uses computation burst clustering
and a mechanism called folding. While clustering automatically detects
application structure, folding combines instrumentation and sampling to
augment the performance analysis details. Folding provides fine grain
performance information from coarse grain sampling on iterative applications.
Folding results closely resemble the performance data gathered from
fine-grained sampling, with an absolute mean difference of less than 5%,
but without the overhead of fine-grained sampling.
-
Kevin Huck and Jesús Labarta
Detailed Load Balance Analysis of Large Scale Parallel Applications.
39th International Conference on Parallel Processing (ICPP 2010),
San Diego, CA, USA, September 13-16, 2010
Abstract:
Balancing the workload in parallel applications is a difficult task, even in
conventional cases. Many computing cycles are wasted when the load is not
evenly balanced across processing nodes. Global load balance analysis may
determine that an application is well balanced, when in fact the application
has hidden inefficiencies. In this paper, we consider the load balance of
parallel applications which present unique challenges in the analysis process.
We have performed trace analysis and simulation to demonstrate the existence of
otherwise undiscovered performance issues. We also demonstrate that by
collecting dynamic phase profiles, we are able to approximate the analysis
results of trace analysis and simulation, and more accurately represent the
performance behavior of complex parallel applications than through flat or
callpath profiles alone.
-
Alan Morris, Sameer Shende, Allen Malony, and Kevin Huck
Design and Implementation of a Hybrid Parallel Performance Measurement System.
International Conference on Parallel Processing (ICPP 2010),
San Diego, CA, USA, September 13-16, 2010
Abstract:
Modern parallel performance measurement
systems collect performance information either through probes
inserted in the application code or via statistical sampling.
Probe-based techniques measure performance metrics directly
using calls to a measurement library that execute as part of
the application. In contrast, sampling-based systems interrupt
program execution to sample metrics for statistical analysis
of performance. Although both measurement approaches are
represented by robust tool frameworks in the performance
community, each has its strengths and weaknesses. In this
paper, we investigate the creation of a hybrid measurement
system, the goal being to exploit the strengths of both systems
and mitigate their weaknesses. We show how such a system
can be used to provide the application programmer with a
more complete analysis of their application. Simple example
and application codes are used to demonstrate its capabilities.
We also show how the hybrid techniques can be combined
to provide real cross-language performance evaluation of
an uninstrumented run for mixed compiled/interpreted
execution environments (e.g., Python and C/C++/Fortran).
-
L. Li, J. P. Kenny, M. Wu , K. Huck, A. Gaenko, M. S. Gordon , C. L. Janssen, L. Curfman McInnes, H. Mori, H. M. Netzloff, B. Norris, and T. L. Windus
Adaptive Application Composition in Quantum Chemistry.
The 5th International Conference on the Quality of Software Architectures (QoSA 2009), East Stroudsburg University, Pennsylvania, USA, June 22-26, 2009
Abstract:
Component interfaces, as advanced by the Common Component Architecture
(CCA), enable easy access to complex software packages for
high-performance scientific computing. A recent focus has been
incorporating support for computational quality of service (CQoS), or
the automatic composition, substitution, and dynamic reconfiguration of
component applications. Several leading quantum chemistry packages
have achieved interoperability by adopting CCA components. Running
these computations on diverse computing platforms requires selection
among many algorithmic and hardware configuration parameters; typical
educated guesses or trial and error can result in unexpectedly low
performance. Motivated by the need for faster runtimes and increased
productivity for chemists, we present a flexible CQoS approach for
quantum chemistry that uses a generic CQoS database component to create
a training database with timing results and metadata for a range of
calculations. The database then interacts with a chemistry CQoS
component and other infrastructure to facilitate adaptive application
composition for new calculations.
-
Kevin A. Huck, Oscar Hernandez, Van Bui, Sunita Chandrasekaran, Barbara Chapman, Allen D. Malony, Lois Curfman McInnes, and Boyana Norris
Capturing Performance Knowledge for Automated Analysis.
SC '08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, 2008
Abstract:
Automating the process of parallel performance experimentation, analysis, and
problem diagnosis can enhance environments for performance-directed application
development, compilation, and execution. This is especially true when
parametric studies, modeling, and optimization strategies require large amounts
of data to be collected and processed for knowledge synthesis and reuse. This
paper describes the integration of the PerfExplorer performance data mining
framework with the OpenUH compiler infrastructure. OpenUH provides
auto-instrumentation of source code for performance experimentation and
PerfExplorer provides automated and reusable analysis of the performance
data through a scripting interface. More importantly, PerfExplorer inference
rules have been developed to recognize and diagnose performance characteristics
important for optimization strategies and modeling. Three case studies are
presented which show our success with automation in OpenMP and MPI code tuning,
parametric characterization, and power modeling. The paper discusses how the
integration supports performance knowledge engineering across applications and
feedback-based compiler optimization in general.
-
Allen D. Malony, Sameer Shende, Alan Morris, Scott Biersdorff, Wyatt Spear, Kevin A. Huck, and Aroon Nataraj
Evolution of a Parallel Performance System.
2nd International Workshop on Tools for High Performance Computing, 2008
Abstract:
The TAU Performance System(R) is an integrated suite of tools for instrumentation,
measurement, and analysis of parallel programs targeting large-scale,
high-performance computing (HPC) platforms. Representing over fifteen
calendar years and fifty person-years of research and development effort,
TAU's driving concerns have been portability, flexibility, interoperability,
and scalability. The result is a performance system which has evolved into a
leading framework for parallel performance evaluation and problem solving. This
paper presents the current state of TAU, overviews the design and function of
TAU's main features, discusses best practices of TAU use, and outlines future development.
-
Kevin A. Huck, Wyatt Spear, Allen D. Malony, Sameer Shende, and Alan Morris
Parametric Studies in Eclipse with TAU and PerfExplorer.
Proceedings of Workshop on Productivity and Performance (PROPER 2008) at EuroPar 2008, (Las Palmas de Gran Canaria, Spain), 2008.
Abstract:
With support for C/C++, Fortran, MPI, OpenMP, and performance tools, the
Eclipse integrated development environment (IDE) is a serious contender as
a programming environment for parallel applications. There is interest in
adding capabilities in Eclipse for conducting workflows where an
application is executed under different scenarios and its outputs are
processed. For instance, parametric studies are a requirement in many
benchmarking and performance tuning efforts, yet there was no experiment
management support available for the Eclipse IDE. In this paper, we
describe an extension of the Parallel Tools Platform (PTP) plugin for the
Eclipse IDE. The extension provides a graphical user interface for
selecting experiment parameters, launches build and run jobs, manages the
performance data, and launches an analysis application to process the data.
We describe our implementation, and discuss three experiment examples which
demonstrate the experiment management support.
-
Van Bui, Boyana Norris, Kevin Huck, Lois Curfman McInnes, Li Li, Oscar Hernandez, and Barbara Chapman
A Component Infrastructure for Performance and Power Modeling of Parallel Scientific Applications.
Component-Based High Performance Computing (CBHPC 2008), 2008
Abstract:
Characterizing the performance of scientific applications is essential for effective code optimization, both by compilers and by high-level adaptive numerical algorithms. While maximizing power efficiency is becoming increasingly important in current high-performance architectures, little or no hardware or software support exists for detailed power measurements. Hardware counter-based power models are a promising method for guiding software-based techniques for reducing power. We present a component-based infrastructure for performance and power modeling of parallel scientific applications. The power model leverages on-chip performance hardware counters and is designed to model power consumption for modern multiprocessor and multicore systems. Our tool infrastructure includes application components as well as performance and power measurement and analysis components. We collect performance data using the TAU performance component and apply the power model in the performance and power analysis of a PETSc-based parallel fluid dynamics application by using the PerfExplorer component.
-
Kevin A. Huck, Allen D. Malony, Sameer Shende and Alan Morris
Knowledge Support and Automation for Performance Analysis with PerfExplorer 2.0.
Large-Scale Programming Tools and Environments, special issue of Scientific Programming, vol. 16, no. 2-3, pp. 123-134, 2008. (email for copies)
Abstract:
The integration of scalable performance analysis in parallel development tools
is difficult. The potential size of data sets and the need to compare results
from multiple experiments presents a challenge to manage and process the
information. Simply to characterize the performance of parallel applications
running on potentially hundreds of thousands of processor cores requires new
scalable analysis techniques. Furthermore, many exploratory analysis processes
are repeatable and could be automated, but are now implemented as manual
procedures. In this paper, we will discuss the current version of
PerfExplorer, a performance analysis framework which provides dimension
reduction, clustering and correlation analysis of individual trials of large
dimensions, and can perform relative performance analysis between multiple
application executions. PerfExplorer analysis processes can be captured in the
form of Python scripts, automating what would otherwise be time-consuming
tasks. We will give examples of large-scale analysis results, and discuss the
future development of the framework, including the encoding and processing of
expert performance rules, and the increasing use of performance metadata.
-
Kevin A. Huck, Allen D. Malony, Sameer Shende and Alan Morris
Scalable, Automated Performance Analysis with TAU and PerfExplorer.
Proceedings of Parallel Computing 2007, Aachen, Germany, 2007.
Abstract:
Scalable performance analysis is a challenge for parallel development tools.
The potential size of data sets and the need to compare results from multiple
experiments presents a challenge to manage and process the information, and to
characterize the performance of parallel applications running on potentially
hundreds of thousands of processor cores. In addition, many exploratory
analysis processes are repeatable and can and should be automated.
In this paper, we will discuss the current version of PerfExplorer, a
performance analysis framework which provides dimension reduction, clustering
and correlation analysis of individual trials of large dimensions, and can
perform relative performance analysis between multiple application executions.
PerfExplorer analysis processes can be captured in the form of Python scripts,
automating what would otherwise be time-consuming tasks. We will give examples
of large-scale analysis results, and discuss the future development of the
framework, including the encoding and processing of expert performance rules,
and the increasing use of performance metadata.
-
D. Gunter, K. Huck, K. Karavanic, J. May, A. Malony, K. Mohror, S. Moore, A. Morris, S. Shende, V. Taylor, X. Wu, and Y. Zhang.
Performance database technology for SciDAC applications.
Journal of Physics: Conference Series, Vol. 78, 24--28 June 2007, Boston, Massachusetts, USA.
Abstract:
As part of the Performance Engineering Research Institute (PERI) effort, the
Performance Database Working Group, which involves PERI researchers as well as
outside researchers at the University of Oregon, Portland State University, and Texas
A&M University, has developed technology for storing performance data collected by a
number of performance measurement and analysis tools, including TAU, PerfTrack,
Prophesy, and SvPablo. In addition to the performance data, metadata capturing the
experimental setup and conditions (e.g., source code version; input data; platform,
compiler, library, and operating system versions and configurations; runtime
environment) are exported to a common metadata schema, along with some basic
performance information. The exported information can be viewed from a common web
interface, and a link or contact information is provided for accessing the original
performance data in its home database. Analysis tools provided by the individual
databases support tasks such as parallel profile browsing and analysis, cross-experiment
analysis, and scalability studies. Performance data are currently being collected and
analyzed for the GTC and MILC SciDAC applications. The tools are being installed on
machines used by SciDAC researchers so that they can easily collect data and upload it to
an associated performance database. Work on a deeper level of interoperability that will
allow exchange of actual performance data between databases is underway.
-
Y. Zhang, R. Fowler, K. Huck, A. Malony, A. Porterfield, D. Reed, S. Shende, V. Taylor, and X. Wu.
US QCD Computational Performance Studies with PERI.
Journal of Physics: Conference Series, Vol. 78, 24--28 June 2007, Boston, Massachusetts, USA.
Abstract:
We report on some of the interactions between two SciDAC projects: The National Computational Infrastructure for Lattice Gauge Theory (USQCD), and the Performance Engineering Research Institute (PERI). Many modern scientific programs consistently report the need for faster computational resources to maintain global competitiveness. However, as the size and complexity of emerging high end computing (HEC) systems continue to rise, achieving good performance on such systems is becoming ever more challenging. In order to take full advantage of the resources, it is crucial to understand the characteristics of relevant scientific applications and the systems these applications are running on. Using tools developed under PERI and by other performance measurement researchers, we studied the performance of two applications, MILC and Chroma, on several high performance computing systems at DOE laboratories. In the case of Chroma, we discuss how the use of C++ and modern software engineering and programming methods are driving the evolution of performance tools.
-
Kevin A. Huck, Allen D. Malony, Sameer Shende, and Alan Morris.
TAUg: Runtime Global Performance Data Access Using MPI.
EuroPVM/MPI, pp. 313-321, Bonn, Germany, 2006.
Abstract:
To enable a scalable parallel application to view its global performance state,
we designed and developed TAUg, a portable runtime framework layered on the TAU
parallel performance system. TAUg leverages the MPI library to communicate
between application processes, creating an abstraction of a global performance
space from which profile views can be retrieved. We describe the TAUg design
and implementation and show its use on two test benchmarks up to 512
processors. Overhead evaluation for the use of TAUg is included in our
analysis. Future directions for improvement are discussed.
-
Li Li, Allen D. Malony and Kevin Huck.
Model-Based Relative Performance Diagnosis of Wavefront Parallel Computations.
Euro-Par 2006 Parallel Processing Conference, September 2006 (LNCS 4128), pp. 35-46.
Abstract:
Parallel performance diagnosis can be improved with the use of performance
knowledge about parallel computation models. The Hercule diagnosis system
applies model-based methods to automate performance diagnosis processes and
explain performance problems from high-level computation semantics. However,
Hercule is limited by a single experiment view. Here we introduce the concept
of relative performance diagnosis and show how it can be integrated in a
model-based diagnosis framework. The paper demonstrates the effectiveness of
Hercule's approach to relative diagnosis of the well-known Sweep3D application
based on a Wavefront model. Relative diagnoses of Sweep3D performance anomalies
in strong and weak scaling cases are given.
-
Kevin Huck and Allen D. Malony.
PerfExplorer: A Performance Data Mining Framework For Large-Scale Parallel Computing.
SC '05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing, 2005. Seattle, Washington, USA.
Abstract:
Parallel applications running on high-end computer systems manifest a
complexity of performance phenomena. Tools to observe parallel performance
attempt to capture these phenomena in measurement datasets rich with
information relating multiple performance metrics to execution dynamics and
parameters specific to the application-system experiment. However, the
potential size of datasets and the need to assimilate results from multiple
experiments makes it a daunting challenge to not only process the information,
but discover and understand performance insights. In this paper, we present
PerfExplorer, a framework for parallel performance data mining and knowledge
discovery. The framework architecture enables the development and integration
of data mining operations that will be applied to large-scale parallel
performance profiles. PerfExplorer operates as a client-server system and is
built on a robust parallel performance database (PerfDMF) to access the
parallel profiles and save its analysis results. Examples are given
demonstrating these techniques for performance analysis of ASCI applications.
-
Karen L. Karavanic, John May, Kathryn Mohror, Brian Miller, Kevin Huck, Rashawn Knapp, Brian Pugh.
Integrating Database Technology with Comparison-based Parallel Performance
Diagnosis: The PerfTrack Performance Experiment Management Tool.
SC '05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing, 2005. Seattle, Washington, USA.
Abstract:
PerfTrack is a data store and interface for managing performance data from
large-scale parallel applications. Data collected in different locations and
formats can be compared and viewed in a single performance analysis session.
The underlying data store used in PerfTrack is implemented with a database
management system (DBMS). PerfTrack includes interfaces to the data store and
scripts for automatically collecting data describing each experiment, such as
build and platform details. We have implemented a prototype of PerfTrack that
can use Oracle or PostgreSQL for the data store. We demonstrate the prototype's
functionality with three case studies: one is a comparative study of an ASC
purple benchmark on high-end Linux and AIX platforms; the second is a parameter
study conducted at Lawrence Livermore National Laboratory (LLNL) on two high
end platforms, a 128 node cluster of IBM Power 4 processors and BlueGene/L; the
third demonstrates incorporating performance data from the Paradyn Parallel
Performance Tool into an existing PerfTrack data store.
-
P Worley, J Candy, L Carrington, K Huck, T Kaiser, G Mahinthakumar, A Malony, S Moore, D Reed, P Roth, H Shan, S Shende, A Snavely, S Sreepathi, F Wolf, Y Zhang
Performance Analysis of GYRO: a tool evaluation.
Journal of Physics: Conference Series, vol. 16, pp. 551-555, 2005.
Abstract:
The performance of the Eulerian gyrokinetic-Maxwell solver code GYRO is
analyzed on five high performance computing systems. First, a manual approach
is taken, using custom scripts to analyze the output of embedded wallclock
timers, floating point operation counts collected using hardware performance
counters, and traces of user and communication events collected using the
profiling interface to Message Passing Interface (MPI) libraries. Parts of the
analysis are then repeated or extended using a number of sophisticated
performance analysis tools: IPM, KOJAK, SvPablo, TAU, and the PMaC modeling
tool suite. The paper briefly discusses what has been discovered via this
manual analysis process, what performance analyses are inconvenient or
infeasible to attempt manually, and to what extent the tools show promise in
accelerating or significantly extending the manual performance analyses.
-
Kevin Huck, Allen D. Malony, Robert Bell and Alan Morris.
Design and Implementation of a Parallel Performance Data Management Framework.
(Winner: The Chuan-lin Wu Best Paper Award),
Proceedings of the 2005 International Conference on Parallel Processing.
June 14-17, 2005. Oslo, Norway.
Abstract:
Empirical performance evaluation of parallel systems and applications can
generate significant amounts of performance data and analysis results from
multiple experiments as performance is investigated and problems diagnosed.
Hence, the management of performance information is a core component of
performance analysis tools. To better support tool integration, portability,
and reuse, there is a strong motivation to develop performance data management
technology that can provide a common foundation for performance data storage,
access, merging, and analysis. This paper presents the design and
implementation of the Performance Data Management Framework (PerfDMF). PerfDMF
addresses objectives of performance tool integration, interoperation, and reuse
by providing common data storage, access, and analysis infrastructure for
parallel performance profiles. PerfDMF includes an extensible parallel profile
data schema and relational database schema, a profile query and analysis
programming interface, and an extensible toolkit for profile import/export and
standard analysis. We describe the PerfDMF objectives and architecture, give
detailed explanation of the major components, and show examples of PerfDMF
application.
|
Posters: |
-
Joseph Kenny (Sandia National Laboratories), Kevin Huck (University of Oregon), Li Li (Argonne National Laboratory), Lois Curfman McInnes (Argonne National Laboratory), Heather Netzloff (Ames Laboratory), Boyana Norris (Argonne National Laboratory), Meng-Shiou Wu (Ames Laboratory)
Computational Quality of Service in Quantum Chemistry.
Poster, SC'08. November, 2008.
Abstract:
Component interfaces, as advanced by the Common Component Architecture (CCA) Forum, enable easy access to software packages. A recent focus of the CCA Forum has been adding support for computational quality of service (CQoS): automatic composition, substitution and dynamic reconfiguration. Several quantum chemistry developers (GAMESS, MPQC and NWChem) have adopted CCA components, creating shared capabilities and infrastructure. These computations require many algorithmic and hardware configuration options, including the configuration of processing elements (nodes, processors/sockets and cores); typical educated guesses or trial and error result in erratic performance and efficiency. This situation is driving the development of a flexible CQoS approach for quantum chemistry applications. Our approach uses a general CQoS database component to create a training database containing timing results and metadata for a range of calculations. Once this database is populated, the chemistry CQoS component uses general CQoS infrastructure analysis capabilities to provide appropriate configuration for a new calculation.
-
D. Gunter, K. Huck, K. Karavanic, J. May, A. Malony, K. Mohror, S. Moore, A. Morris, S. Shende, V. Taylor, X. Wu, and Y. Zhang.
Performance Database Technology for SciDAC Applications.
Poster, SciDAC. June, 2007.
Abstract:
As part of the Performance Engineering Research Institute (PERI) effort, the Performance Database Working Group, which involves PERI researchers as well as outside researchers at the University of Oregon, Portland State University, and Texas A&M University, has developed technology for storing performance data collected by a number of performance measurement and analysis tools, including TAU, PerfTrack, Prophesy, and SvPablo. In addition to the performance data, metadata capturing the experimental setup and conditions (e.g., source code version; input data; platform, compiler, library, and operating system versions and configurations; runtime environment) are exported to a common metadata schema, along with some basic performance information. The exported information can be viewed from a common web interface, and a link or contact information is provided for accessing the original performance data in its home database. Analysis tools provided by the individual databases support tasks such as parallel profile browsing and analysis, cross-experiment analysis, and scalability studies. Performance data are currently being collected and analyzed for the GTC and MILC SciDAC applications. The tools are being installed on machines used by SciDAC researchers so that they can easily collect data and upload it to an associated performance database.
-
R. Fowler, Y. Zhang, A. Porterfield, D. Reed, J. Mellor-Crummey, N. Tallent, K. Huck, A. Malony, S. Shende, V. Taylor, and X. Wu.
PERI and USQCD Computational Performance Studies.
Poster, SciDAC. June, 2007.
Abstract:
USQCD is a SciDAC collaboration of US scientists developing and using large-scale computers for calculations in lattice quantum chromodynamics; its software emphasis is improved scientific productivity through modular, reusable, cross-platform, high-performance libraries. PERI is a SciDAC Institute focused on delivering petascale performance to complex scientific applications running on Leadership Class computing systems; its emphasis is improved productivity through automation of measurement, analysis, and tuning of HPC applications.
-
Kevin Huck, Kathryn Mohror, John May, Brian Miller, Karen Karavanic.
PerfTrack: Performance Database & Analysis Tool.
Poster, Lawrence Livermore National Laboratory, UCRL-POST-205871. September, 2004.
Introduction:
Our goal is to create a tool that helps scientific programmers answer
difficult questions about application performance as the source code, build
parameters, runtime environment, and hardware vary over time. We are developing
PerfTrack to explore technologies in parallel performance measurement,
modeling, analysis, and prediction. We store performance data and the
associated environment data in a relational database, which provides a
foundation for building analysis tools that scale to large numbers of threads
(over 1024) and can compare multiple executions. The tools we develop will
automate data gathering, storage, and analysis to encourage their use in the
software development cycle.
|