Publications: |
-
Allen D. Malony and Kevin A. Huck
General Hybrid Parallel Profiling.
Proceedings of PDP 2014
Abstract:
A hybrid parallel measurement system offers the potential to fuse the
principal advantages of probe-based tools, with their exact measures
of performance and ability to capture event semantics, and sampling-based tools,
with their ability to observe performance detail with less overhead.
Creating a hybrid profiling solution is challenging because it requires new
mechanisms for integrating probe and sample measurements and calculating
profile statistics during execution. In this paper, we describe a general hybrid
parallel profiling tool that has been implemented in the TAU Performance
System. Its generality comes from the fact that all of the features of the
individual methods are retained and can be flexibly controlled when
combined to address the measurement requirements for a particular parallel
application. The design of the hybrid profiling approach is described and
the implementation of the prototype in TAU presented. We demonstrate
hybrid profiling functionality first on a simple sequential program and
then show its use for several OpenMP parallel codes from the NAS Parallel
Benchmarks. These experiments also highlight the improvements in overhead
efficiency made possible by hybrid profiling. A large-scale ocean modeling
code based on OpenMP and MPI, MPAS-Ocean, is used to show how the TAU hybrid
profiling tool can be effective at exposing performance-limiting behavior
that would be difficult to identify otherwise.
-
Harald Servat, Germán Llort, Kevin Huck, Judit Giménez, Jesús Labarta
Framework for a productive performance optimization.
Parallel Computing
Abstract:
Modern supercomputers deliver large computational power, but it is difficult for an application to exploit such power. One factor that limits the application performance is the single node performance. While many performance tools use the microprocessor performance counters to provide insights on serial node performance issues, the complex semantics of these counters pose an obstacle to an inexperienced developer.
We present a framework that allows easy identification and qualification of serial node performance bottlenecks in parallel applications. The output of the framework is precise and it is capable of correlating performance inefficiencies with small regions of code within the application. The framework not only points to regions of code but also simplifies the semantics of the performance counters into metrics that refer to processor functional units. With such information the developer can focus on the identified code and improve it by knowing which processor execution unit is degrading the performance. To demonstrate the usefulness of the framework we apply it to three already optimized applications using realistic inputs and, according to the results, modify their source code. By making modifications that require little effort, we successfully increase the applications’ performance by 10% to 30%, and thus shorten the time required to reach the solution and/or allow tackling larger problem sizes.
-
Kevin Huck, Sameer Shende, Allen Malony, Hartmut Kaiser, Allan Porterfield, Rob Fowler, Ron Brightwell
An Early Prototype of an Autonomic Performance Environment for Exascale.
Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers
Abstract:
Extreme-scale computing requires a new perspective on the role of performance observation in the Exascale system software stack. Because of the anticipated high concurrency and dynamic operation in these systems, it is no longer reasonable to expect that a post-mortem performance measurement and analysis methodology will suffice. Rather, there is a strong need for performance observation that merges first- and third-person observation, in situ analysis, and introspection across stack layers that serves online dynamic feedback and adaptation. In this paper we describe the DOE-funded XPRESS project and the role of autonomic performance support in Exascale systems. XPRESS will build an integrated Exascale software stack (called OpenX) that supports the ParalleX execution model and is targeted towards future Exascale platforms. An initial version of an autonomic performance environment called APEX has been developed for OpenX using the current TAU performance technology and results are presented that highlight the challenges of highly integrative observation and runtime analysis.
-
Ahmad Qawasmeh, Abid Malik, Barbara Chapman, Kevin Huck, Allen Malony
Open Source Task Profiling by Extending the OpenMP Runtime API.
OpenMP in the Era of Low Power Devices and Accelerators.
Abstract:
The introduction of tasks in the OpenMP programming model brings a new level of parallelism. This also creates new challenges with respect to its meaning and applicability for event-based performance profiling. The OpenMP Architecture Review Board (ARB) has approved an interface specification known as the “OpenMP Runtime API for Profiling” to enable performance tools to collect performance data for OpenMP programs. In this paper, we propose new extensions to the OpenMP Runtime API for profiling task level parallelism. We present an efficient method to distinguish individual task instances in order to track their associated events at the micro level. We implement the proposed extensions in the OpenUH compiler, an open-source OpenMP compiler. With negligible overheads, we are able to capture important events like task creation, execution, suspension, and exiting. These events help in identifying overheads associated with the OpenMP tasking model, e.g., the time a task waits before it starts execution, task cleanup, etc. These events also help in constructing important parent-child relationships that define tasks' call paths. The proposed extensions are in line with the newest specifications recently proposed by the OpenMP tools committee for task profiling.
-
Jay Alameda, Wyatt Spear, Jeffrey L. Overbey, Kevin Huck, Gregory R. Watson, and Beth Tibbitts.
The Eclipse parallel tools platform: toward an integrated development environment for XSEDE resources.
In Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond (XSEDE '12). ACM, New York, NY, USA, Article 48, 8 pages.
Abstract:
Eclipse [1] is a widely used, open source integrated development environment that includes support for C, C++, Fortran, and Python. The Parallel Tools Platform (PTP) [2] extends Eclipse to support development on high performance computers. PTP allows the user to run Eclipse on her laptop, while the code is compiled, run, debugged, and profiled on a remote HPC system. PTP provides development assistance for MPI, OpenMP, and UPC; it allows users to submit jobs to the remote batch system and monitor the job queue. It also provides a visual parallel debugger.
The XSEDE community comprises a large part of PTP's user base, and we are actively working to make PTP a productive, easy-to-use development environment for the full breadth of XSEDE resources. In this paper, we will describe capabilities we have recently added to PTP to better support XSEDE resources. These capabilities include submission and monitoring of jobs on systems running Sun/Oracle Grid Engine, support for GSI authentication and MyProxy logon, support for environment modules, and integration with compilers from Cray and PGI. We will describe ongoing work and directions for future collaboration, including OpenACC support and parallel debugger integration.
-
Germán Llort, Marc Casas, Harald Servat, Kevin Huck, Judit Giménez, Jesús Labarta.
Trace Spectral Analysis toward Dynamic Levels of Detail.
2011 IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS), pp. 332-339, 7-9 Dec. 2011.
Abstract:
The emergence of Petascale systems has raised new challenges to performance analysis tools. Understanding every single detail of an execution is important to bridge the gap between the theoretical peak and the actual performance achieved. Tracing tools are the best option when it comes to providing detailed information about the application behavior, but not without liabilities. The amount of information that a single execution can generate grows so fast that it easily becomes unmanageable. An effective analysis in such scenarios necessitates the intelligent selection of information. In this paper we present an on-line performance tool based on spectral analysis of signals that automatically identifies the different computing phases of the application as it runs, selects a few representative periods and decides the granularity of the information gathered for these regions. As a result, the execution is completely characterized at different levels of detail, reducing the amount of data collected while maximizing the amount of useful information presented for the analysis.
-
Harald Servat, Germán Llort, Judit Giménez, Kevin A. Huck, Jesús Labarta
Unveiling Internal Evolution of Parallel Application Computation Phases.
40th International Conference on Parallel Processing (ICPP2011).
Taipei, Taiwan. September 13-16, 2011
Abstract:
As access to supercomputing resources is becoming more and more commonplace,
performance analysis tools are gaining importance in order to decrease the
gap between the application performance and the supercomputers' peak performance.
Performance analysis tools allow the analyst to understand the idiosyncrasies of
an application in order to improve it. However, these tools require monitoring
regions of the application to provide information to the analysts, leaving
non-monitored regions of code unknown, which may result in a lack of understanding
of important regions of the application. In this paper we describe an automated
methodology that reports very detailed application insights and improves the
analysis experience of performance tools based on traces. We apply this
methodology to three production applications and provide suggestions on how to
improve their performance. Our methodology uses computation burst clustering
and a mechanism called folding. While clustering automatically detects
application structure, folding combines instrumentation and sampling to
augment the performance analysis details. Folding provides fine grain
performance information from coarse grain sampling on iterative applications.
Folding results closely resemble the performance data gathered from
fine-grained sampling, with an absolute mean difference of less than 5%,
but without the overhead of fine-grained sampling.
-
Kevin Huck and Jesús Labarta
Detailed Load Balance Analysis of Large Scale Parallel Applications.
39th International Conference on Parallel Processing (ICPP 2010),
San Diego, CA, USA, September 13-16, 2010
Abstract:
Balancing the workload in parallel applications is a difficult task, even in
conventional cases. Many computing cycles are wasted when the load is not
evenly balanced across processing nodes. Global load balance analysis may
determine that an application is well balanced, when in fact the application
has hidden inefficiencies. In this paper, we consider the load balance of
parallel applications which present unique challenges in the analysis process.
We have performed trace analysis and simulation to demonstrate the existence of
otherwise undiscovered performance issues. We also demonstrate that by
collecting dynamic phase profiles, we are able to approximate the analysis
results of trace analysis and simulation, and more accurately represent the
performance behavior of complex parallel applications than through flat or
callpath profiles alone.
-
Alan Morris, Sameer Shende, Allen Malony, and Kevin Huck
Design and Implementation of a Hybrid Parallel Performance Measurement System.
International Conference on Parallel Processing (ICPP 2010),
San Diego, CA, USA, September 13-16, 2010
Abstract:
Modern parallel performance measurement
systems collect performance information either through probes
inserted in the application code or via statistical sampling.
Probe-based techniques measure performance metrics directly
using calls to a measurement library that execute as part of
the application. In contrast, sampling-based systems interrupt
program execution to sample metrics for statistical analysis
of performance. Although both measurement approaches are
represented by robust tool frameworks in the performance
community, each has its strengths and weaknesses. In this
paper, we investigate the creation of a hybrid measurement
system, the goal being to exploit the strengths of both systems
and mitigate their weaknesses. We show how such a system
can be used to provide the application programmer with a
more complete analysis of their application. Simple example
and application codes are used to demonstrate its capabilities.
We also show how the hybrid techniques can be combined
to provide real cross-language performance evaluation of
an uninstrumented run for mixed compiled/interpreted
execution environments (e.g., Python and C/C++/Fortran).
-
L. Li, J. P. Kenny, M. Wu , K. Huck, A. Gaenko, M. S. Gordon , C. L. Janssen, L. Curfman McInnes, H. Mori, H. M. Netzloff, B. Norris, and T. L. Windus
Adaptive Application Composition in Quantum Chemistry.
The 5th International Conference on the Quality of Software Architectures (QoSA 2009), East Stroudsburg University, Pennsylvania, USA, June 22-26, 2009
Abstract:
Component interfaces, as advanced by the Common Component Architecture
(CCA), enable easy access to complex software packages for
high-performance scientific computing. A recent focus has been
incorporating support for computational quality of service (CQoS), or
the automatic composition, substitution, and dynamic reconfiguration of
component applications. Several leading quantum chemistry packages
have achieved interoperability by adopting CCA components. Running
these computations on diverse computing platforms requires selection
among many algorithmic and hardware configuration parameters; typical
educated guesses or trial and error can result in unexpectedly low
performance. Motivated by the need for faster runtimes and increased
productivity for chemists, we present a flexible CQoS approach for
quantum chemistry that uses a generic CQoS database component to create
a training database with timing results and metadata for a range of
calculations. The database then interacts with a chemistry CQoS
component and other infrastructure to facilitate adaptive application
composition for new calculations.
-
Kevin A. Huck, Oscar Hernandez, Van Bui, Sunita Chandrasekaran, Barbara Chapman, Allen D. Malony, Lois Curfman McInnes, and Boyana Norris
Capturing Performance Knowledge for Automated Analysis.
SC '08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, 2008
Abstract:
Automating the process of parallel performance experimentation, analysis, and
problem diagnosis can enhance environments for performance-directed application
development, compilation, and execution. This is especially true when
parametric studies, modeling, and optimization strategies require large amounts
of data to be collected and processed for knowledge synthesis and reuse. This
paper describes the integration of the PerfExplorer performance data mining
framework with the OpenUH compiler infrastructure. OpenUH provides
auto-instrumentation of source code for performance experimentation and
PerfExplorer provides automated and reusable analysis of the performance
data through a scripting interface. More importantly, PerfExplorer inference
rules have been developed to recognize and diagnose performance characteristics
important for optimization strategies and modeling. Three case studies are
presented which show our success with automation in OpenMP and MPI code tuning,
parametric characterization, and power modeling. The paper discusses how the
integration supports performance knowledge engineering across applications and
feedback-based compiler optimization in general.
-
Allen D. Malony, Sameer Shende, Alan Morris, Scott Biersdorff, Wyatt Spear, Kevin A. Huck, and Aroon Nataraj
Evolution of a Parallel Performance System.
2nd International Workshop on Tools for High Performance Computing, 2008
Abstract:
The TAU Performance System(R) is an integrated suite of tools for instrumentation,
measurement, and analysis of parallel programs targeting large-scale,
high-performance computing (HPC) platforms. Representing over fifteen
calendar years and fifty person-years of research and development effort,
TAU's driving concerns have been portability, flexibility, interoperability,
and scalability. The result is a performance system which has evolved into a
leading framework for parallel performance evaluation and problem solving. This
paper presents the current state of TAU, overviews the design and function of
TAU's main features, discusses best practices of TAU use, and outlines future development.
-
Kevin A. Huck, Wyatt Spear, Allen D. Malony, Sameer Shende, and Alan Morris
Parametric Studies in Eclipse with TAU and PerfExplorer.
Proceedings of Workshop on Productivity and Performance (PROPER 2008) at EuroPar 2008, (Las Palmas de Gran Canaria, Spain), 2008.
Abstract:
With support for C/C++, Fortran, MPI, OpenMP, and performance tools, the
Eclipse integrated development environment (IDE) is a serious contender as
a programming environment for parallel applications. There is interest in
adding capabilities in Eclipse for conducting workflows where an
application is executed under different scenarios and its outputs are
processed. For instance, parametric studies are a requirement in many
benchmarking and performance tuning efforts, yet there was no experiment
management support available for the Eclipse IDE. In this paper, we
describe an extension of the Parallel Tools Platform (PTP) plugin for the
Eclipse IDE. The extension provides a graphical user interface for
selecting experiment parameters, launches build and run jobs, manages the
performance data, and launches an analysis application to process the data.
We describe our implementation, and discuss three experiment examples which
demonstrate the experiment management support.
-
Van Bui, Boyana Norris, Kevin Huck, Lois Curfman McInnes, Li Li, Oscar Hernandez, and Barbara Chapman
A Component Infrastructure for Performance and Power Modeling of Parallel Scientific Applications.
Component-Based High Performance Computing (CBHPC 2008), 2008
Abstract:
Characterizing the performance of scientific applications is essential for effective code optimization, both by compilers and by high-level adaptive numerical algorithms. While maximizing power efficiency is becoming increasingly important in current high-performance architectures, little or no hardware or software support exists for detailed power measurements. Hardware counter-based power models are a promising method for guiding software-based techniques for reducing power. We present a component-based infrastructure for performance and power modeling of parallel scientific applications. The power model leverages on-chip performance hardware counters and is designed to model power consumption for modern multiprocessor and multicore systems. Our tool infrastructure includes application components as well as performance and power measurement and analysis components. We collect performance data using the TAU performance component and apply the power model in the performance and power analysis of a PETSc-based parallel fluid dynamics application by using the PerfExplorer component.
-
Kevin A. Huck, Allen D. Malony, Sameer Shende and Alan Morris
Knowledge Support and Automation for Performance Analysis with PerfExplorer 2.0.
Large-Scale Programming Tools and Environments, special issue of Scientific Programming, vol. 16, no. 2-3, pp. 123-134, 2008. (email for copies)
Abstract:
The integration of scalable performance analysis in parallel development tools
is difficult. The potential size of data sets and the need to compare results
from multiple experiments presents a challenge to manage and process the
information. Simply to characterize the performance of parallel applications
running on potentially hundreds of thousands of processor cores requires new
scalable analysis techniques. Furthermore, many exploratory analysis processes
are repeatable and could be automated, but are now implemented as manual
procedures. In this paper, we will discuss the current version of
PerfExplorer, a performance analysis framework which provides dimension
reduction, clustering and correlation analysis of individual trials of large
dimensions, and can perform relative performance analysis between multiple
application executions. PerfExplorer analysis processes can be captured in the
form of Python scripts, automating what would otherwise be time-consuming
tasks. We will give examples of large-scale analysis results, and discuss the
future development of the framework, including the encoding and processing of
expert performance rules, and the increasing use of performance metadata.
-
Kevin A. Huck, Allen D. Malony, Sameer Shende and Alan Morris
Scalable, Automated Performance Analysis with TAU and PerfExplorer.
Proceedings of Parallel Computing 2007, Aachen, Germany, 2007.
Abstract:
Scalable performance analysis is a challenge for parallel development tools.
The potential size of data sets and the need to compare results from multiple
experiments presents a challenge to manage and process the information, and to
characterize the performance of parallel applications running on potentially
hundreds of thousands of processor cores. In addition, many exploratory
analysis processes are repeatable and can and should be automated.
In this paper, we will discuss the current version of PerfExplorer, a
performance analysis framework which provides dimension reduction, clustering
and correlation analysis of individual trials of large dimensions, and can
perform relative performance analysis between multiple application executions.
PerfExplorer analysis processes can be captured in the form of Python scripts,
automating what would otherwise be time-consuming tasks. We will give examples
of large-scale analysis results, and discuss the future development of the
framework, including the encoding and processing of expert performance rules,
and the increasing use of performance metadata.
-
D. Gunter, K. Huck, K. Karavanic, J. May, A. Malony, K. Mohror, S. Moore, A. Morris, S. Shende, V. Taylor, X. Wu, and Y. Zhang.
Performance database technology for SciDAC applications.
Journal of Physics: Conference Series, Vol. 78, 24--28 June 2007, Boston, Massachusetts, USA.
Abstract:
As part of the Performance Engineering Research Institute (PERI) effort, the
Performance Database Working Group, which involves PERI researchers as well as
outside researchers at the University of Oregon, Portland State University, and Texas
A&M University, has developed technology for storing performance data collected by a
number of performance measurement and analysis tools, including TAU, PerfTrack,
Prophesy, and SvPablo. In addition to the performance data, metadata capturing the
experimental setup and conditions (e.g., source code version; input data; platform,
compiler, library, and operating system versions and configurations; runtime
environment) are exported to a common metadata schema, along with some basic
performance information. The exported information can be viewed from a common web
interface, and a link or contact information is provided for accessing the original
performance data in its home database. Analysis tools provided by the individual
databases support tasks such as parallel profile browsing and analysis, cross-experiment
analysis, and scalability studies. Performance data are currently being collected and
analyzed for the GTC and MILC SciDAC applications. The tools are being installed on
machines used by SciDAC researchers so that they can easily collect data and upload it to
an associated performance database. Work on a deeper level of interoperability that will
allow exchange of actual performance data between databases is underway.
-
Y. Zhang, R. Fowler, K. Huck, A. Malony, A. Porterfield, D. Reed, S. Shende, V. Taylor, and X. Wu.
US QCD Computational Performance Studies with PERI.
Journal of Physics: Conference Series, Vol. 78, 24--28 June 2007, Boston, Massachusetts, USA.
Abstract:
We report on some of the interactions between two SciDAC projects: The National Computational Infrastructure for Lattice Gauge Theory (USQCD), and the Performance Engineering Research Institute (PERI). Many modern scientific programs consistently report the need for faster computational resources to maintain global competitiveness. However, as the size and complexity of emerging high end computing (HEC) systems continue to rise, achieving good performance on such systems is becoming ever more challenging. In order to take full advantage of the resources, it is crucial to understand the characteristics of relevant scientific applications and the systems these applications are running on. Using tools developed under PERI and by other performance measurement researchers, we studied the performance of two applications, MILC and Chroma, on several high performance computing systems at DOE laboratories. In the case of Chroma, we discuss how the use of C++ and modern software engineering and programming methods are driving the evolution of performance tools.
-
Kevin A. Huck, Allen D. Malony, Sameer Shende, and Alan Morris.
TAUg: Runtime Global Performance Data Access Using MPI.
EuroPVM/MPI, pp. 313-321, Bonn, Germany, 2006.
Abstract:
To enable a scalable parallel application to view its global performance state,
we designed and developed TAUg, a portable runtime framework layered on the TAU
parallel performance system. TAUg leverages the MPI library to communicate
between application processes, creating an abstraction of a global performance
space from which profile views can be retrieved. We describe the TAUg design
and implementation and show its use on two test benchmarks up to 512
processors. Overhead evaluation for the use of TAUg is included in our
analysis. Future directions for improvement are discussed.
-
Li Li, Allen D. Malony and Kevin Huck.
Model-Based Relative Performance Diagnosis of Wavefront Parallel Computations.
Euro-Par 2006 Parallel Processing Conference, September 2006 (LNCS 4128), pp. 35-46.
Abstract:
Parallel performance diagnosis can be improved with the use of performance
knowledge about parallel computation models. The Hercule diagnosis system
applies model-based methods to automate performance diagnosis processes and
explain performance problems from high-level computation semantics. However,
Hercule is limited by a single experiment view. Here we introduce the concept
of relative performance diagnosis and show how it can be integrated in a
model-based diagnosis framework. The paper demonstrates the effectiveness of
Hercule's approach to relative diagnosis of the well-known Sweep3D application
based on a Wavefront model. Relative diagnoses of Sweep3D performance anomalies
in strong and weak scaling cases are given.
-
Kevin Huck and Allen D. Malony.
PerfExplorer: A Performance Data Mining Framework For Large-Scale Parallel Computing.
SC '05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing, 2005. Seattle, Washington, USA.
Abstract:
Parallel applications running on high-end computer systems manifest a
complexity of performance phenomena. Tools to observe parallel performance
attempt to capture these phenomena in measurement datasets rich with
information relating multiple performance metrics to execution dynamics and
parameters specific to the application-system experiment. However, the
potential size of datasets and the need to assimilate results from multiple
experiments makes it a daunting challenge to not only process the information,
but discover and understand performance insights. In this paper, we present
PerfExplorer, a framework for parallel performance data mining and knowledge
discovery. The framework architecture enables the development and integration
of data mining operations that will be applied to large-scale parallel
performance profiles. PerfExplorer operates as a client-server system and is
built on a robust parallel performance database (PerfDMF) to access the
parallel profiles and save its analysis results. Examples are given
demonstrating these techniques for performance analysis of ASCI applications.
-
Karen L. Karavanic, John May, Kathryn Mohror, Brian Miller, Kevin Huck, Rashawn Knapp, Brian Pugh.
Integrating Database Technology with Comparison-based Parallel Performance
Diagnosis: The PerfTrack Performance Experiment Management Tool.
SC '05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing, 2005. Seattle, Washington, USA.
Abstract:
PerfTrack is a data store and interface for managing performance data from
large-scale parallel applications. Data collected in different locations and
formats can be compared and viewed in a single performance analysis session.
The underlying data store used in PerfTrack is implemented with a database
management system (DBMS). PerfTrack includes interfaces to the data store and
scripts for automatically collecting data describing each experiment, such as
build and platform details. We have implemented a prototype of PerfTrack that
can use Oracle or PostgreSQL for the data store. We demonstrate the prototype's
functionality with three case studies: one is a comparative study of an ASC
purple benchmark on high-end Linux and AIX platforms; the second is a parameter
study conducted at Lawrence Livermore National Laboratory (LLNL) on two high
end platforms, a 128 node cluster of IBM Power 4 processors and BlueGene/L; the
third demonstrates incorporating performance data from the Paradyn Parallel
Performance Tool into an existing PerfTrack data store.
-
P Worley, J Candy, L Carrington, K Huck, T Kaiser, G Mahinthakumar, A Malony, S Moore, D Reed, P Roth, H Shan, S Shende, A Snavely, S Sreepathi, F Wolf, Y Zhang
Performance Analysis of GYRO: a tool evaluation.
Journal of Physics: Conference Series, vol. 16, pp. 551-555, 2005.
Abstract:
The performance of the Eulerian gyrokinetic-Maxwell solver code GYRO is
analyzed on five high performance computing systems. First, a manual approach
is taken, using custom scripts to analyze the output of embedded wallclock
timers, floating point operation counts collected using hardware performance
counters, and traces of user and communication events collected using the
profiling interface to Message Passing Interface (MPI) libraries. Parts of the
analysis are then repeated or extended using a number of sophisticated
performance analysis tools: IPM, KOJAK, SvPablo, TAU, and the PMaC modeling
tool suite. The paper briefly discusses what has been discovered via this
manual analysis process, what performance analyses are inconvenient or
infeasible to attempt manually, and to what extent the tools show promise in
accelerating or significantly extending the manual performance analyses.
-
Kevin Huck, Allen D. Malony, Robert Bell and Alan Morris.
Design and Implementation of a Parallel Performance Data Management Framework.
(Winner: The Chuan-lin Wu Best Paper Award),
Proceedings of the 2005 International Conference on Parallel Processing.
June 14-17, 2005. Oslo, Norway.
Abstract:
Empirical performance evaluation of parallel systems and applications can
generate significant amounts of performance data and analysis results from
multiple experiments as performance is investigated and problems diagnosed.
Hence, the management of performance information is a core component of
performance analysis tools. To better support tool integration, portability,
and reuse, there is a strong motivation to develop performance data management
technology that can provide a common foundation for performance data storage,
access, merging, and analysis. This paper presents the design and
implementation of the Performance Data Management Framework (PerfDMF). PerfDMF
addresses objectives of performance tool integration, interoperation, and reuse
by providing common data storage, access, and analysis infrastructure for
parallel performance profiles. PerfDMF includes an extensible parallel profile
data schema and relational database schema, a profile query and analysis
programming interface, and an extensible toolkit for profile import/export and
standard analysis. We describe the PerfDMF objectives and architecture, give
detailed explanation of the major components, and show examples of PerfDMF
application.
|
Posters: |
-
Joseph Kenny (Sandia National Laboratories), Kevin Huck (University of Oregon), Li Li (Argonne National Laboratory), Lois Curfman McInnes (Argonne National Laboratory), Heather Netzloff (Ames Laboratory), Boyana Norris (Argonne National Laboratory), Meng-Shiou Wu (Ames Laboratory)
Computational Quality of Service in Quantum Chemistry.
Poster, SC'08. November, 2008.
Abstract:
Component interfaces, as advanced by the Common Component Architecture (CCA) Forum, enable easy access to software packages. A recent focus of the CCA Forum has been adding support for computational quality of service (CQoS): automatic composition, substitution and dynamic reconfiguration. Several quantum chemistry developers (GAMESS, MPQC and NWChem) have adopted CCA components, creating shared capabilities and infrastructure. These computations require many algorithmic and hardware configuration options, including the configuration of processing elements (nodes, processors/sockets and cores); typical educated guesses or trial and error result in erratic performance and efficiency. This situation is driving the development of a flexible CQoS approach for quantum chemistry applications. Our approach uses a general CQoS database component to create a training database containing timing results and metadata for a range of calculations. Once this database is populated, the chemistry CQoS component uses general CQoS infrastructure analysis capabilities to provide appropriate configuration for a new calculation.
-
D. Gunter, K. Huck, K. Karavanic, J. May, A. Malony, K. Mohror, S. Moore, A. Morris, S. Shende, V. Taylor, X. Wu, and Y. Zhang.
Performance Database Technology for SciDAC Applications.
Poster, SciDAC. June, 2007.
Abstract:
As part of the Performance Engineering Research Institute (PERI) effort, the Performance Database Working Group, which involves PERI researchers as well as outside researchers at the University of Oregon, Portland State University, and Texas A&M University, has developed technology for storing performance data collected by a number of performance measurement and analysis tools, including TAU, PerfTrack, Prophesy, and SvPablo. In addition to the performance data, metadata capturing the experimental setup and conditions (e.g., source code version; input data; platform, compiler, library, and operating system versions and configurations; runtime environment) are exported to a common metadata schema, along with some basic performance information. The exported information can be viewed from a common web interface, and a link or contact information is provided for accessing the original performance data in its home database. Analysis tools provided by the individual databases support tasks such as parallel profile browsing and analysis, cross-experiment analysis, and scalability studies. Performance data are currently being collected and analyzed for the GTC and MILC SciDAC applications. The tools are being installed on machines used by SciDAC researchers so that they can easily collect data and upload it to an associated performance database.
-
R. Fowler, Y. Zhang, A. Porterfield, D. Reed, J. Mellor-Crummey, N. Tallent, K. Huck, A. Malony, S. Shende, V. Taylor, and X. Wu.
PERI and USQCD Computational Performance Studies.
Poster, SciDAC. June, 2007.
Abstract:
USQCD is a SciDAC collaboration of US scientists developing and using large-scale computers for calculations in lattice quantum chromodynamics; its software emphasis is improved scientific productivity through modular, reusable, cross-platform, high-performance libraries. PERI is a SciDAC Institute focused on delivering petascale performance to complex scientific applications running on Leadership Class computing systems; its emphasis is improved productivity through automation of measurement, analysis, and tuning of HPC applications.
-
Kevin Huck, Kathryn Mohror, John May, Brian Miller, Karen Karavanic.
PerfTrack: Performance Database & Analysis Tool.
Poster, Lawrence Livermore National Laboratory, UCRL-POST-205871. September, 2004.
Introduction:
Our goal is to create a tool that helps scientific programmers answer
difficult questions about application performance as the source code, build
parameters, runtime environment, and hardware vary over time. We are developing
PerfTrack to explore technologies in parallel performance measurement,
modeling, analysis, and prediction. We store performance data and the
associated environment data in a relational database, which provides a
foundation for building analysis tools that scale to large numbers of threads
(over 1024) and can compare multiple executions. The tools we develop will
automate data gathering, storage, and analysis to encourage their use in the
software development cycle.
|