Understanding the Software Needs of High End Computer Users with XALT

DOI: 10.15781/T2PP4P
Creators: McLay, Robert; Fahey, Mark R.
Publisher: Texas Advanced Computing Center
Year: 2015
Resource type: Computer Log
Description: The dataset is produced by XALT, software installed on Stampede, a High Performance Computing (HPC) resource at the Texas Advanced Computing Center (TACC). XALT tracks and collects job-level information about software libraries and executables on open-science HPC systems, also known as supercomputers. Open-science HPC resources are shared via powerful networks by researchers across the country and are maintained by a handful of supercomputer centers. To use the computational resources, researchers submit jobs, which consist of computational workflows designed to conduct analysis and calculations. XALT data is used to determine which software libraries are used most often on a given system, a fundamental administrative function for shared HPC resources. Because nodes and memory are finite resources, software libraries must be selected for continued use and maintenance to ensure optimal performance for users.
In addition to running on Stampede, XALT has been tested or installed at the National Institute for Computational Sciences, the Oak Ridge Leadership Computing Facility, the National Center for Supercomputing Applications, Baden-Württemberg, the National Energy Research Scientific Computing Center, the Swiss National Supercomputing Centre, the National Oceanic and Atmospheric Administration, and the KAUST Supercomputing Centre. Other current uses of XALT data include debugging software libraries, indirect measurement of performance, and cost analysis based on run time and the number of nodes in use. Sociologists, digital anthropologists, and scientific software producers have identified possible additional uses for this data, such as inferring collaborations, types of relationships, and practices of domain scientists working on computational projects.
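The two administrative uses mentioned above, finding the most-used libraries and estimating cost from run time and node count, can be illustrated with a minimal Python sketch. The record layout here is an assumption made for illustration: the field names `user`, `num_nodes`, `run_time_hours`, and `libraries` are hypothetical and are not taken from the XALT data dictionary, which should be consulted for the actual element names.

```python
from collections import Counter

# Hypothetical job records. These field names and values are illustrative
# only and are NOT taken from the XALT data dictionary.
jobs = [
    {"user": "u001", "num_nodes": 4, "run_time_hours": 2.0,
     "libraries": ["libmpi.so", "libfftw3.so"]},
    {"user": "u002", "num_nodes": 16, "run_time_hours": 0.5,
     "libraries": ["libmpi.so", "libhdf5.so"]},
    {"user": "u001", "num_nodes": 1, "run_time_hours": 3.0,
     "libraries": ["libhdf5.so", "libmpi.so", "libmpi.so"]},
]


def library_job_counts(records):
    """Count, for each library, the number of jobs that used it."""
    counts = Counter()
    for record in records:
        # set() ensures a library listed twice in one job is counted once.
        counts.update(set(record["libraries"]))
    return counts


def node_hours(records):
    """Sum nodes x run time over all jobs as a rough cost measure."""
    return sum(r["num_nodes"] * r["run_time_hours"] for r in records)


usage = library_job_counts(jobs)
total_node_hours = node_hours(jobs)
print(usage.most_common(2))
print(total_node_hours)
```

Counting jobs rather than raw occurrences is one possible design choice; a center interested in link frequency instead could drop the `set()` call.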
XALT may also be used to gather provenance metadata during computational jobs. Provenance information for the XALT dataset comprises the software, associated libraries, and usage metrics that show the initial stage of computational analysis for scientific work. The XALT dataset, in JSON format, contains information on the number of nodes and the libraries and executables used by each user running a given computational job on Stampede. It also includes the science domain that users identify with for their projects. Personal identification information is sanitized prior to publication, but all jobs can be related to a particular user through an anonymous user ID.
This dataset will continue to grow past the initial date of publication. TACC started releasing data in September 2015. Daily collections of data are released as quarterly collections of three files, one for each month in the quarter. Releases occur in October, January, April, and July and cover the previous three months.
Additional documentation is available to contextualize and understand the dataset: the data dictionary describing each data element, a copy of the CC-BY license for the dataset, metadata in DataCite XML format, and a listing of software libraries identified from the data. The data may be downloaded as quarterly zipped files with a metadata file from the following URL: http://web.corral.tacc.utexas.edu/XALT/
To accommodate the anticipated growth of this dataset over time, the data is not hosted at this location. The description of the data elements, copy of the CC-BY license, catalogue metadata file, and a listing of the software libraries at time of initial publication are available for download.
Size: Due to the dynamic nature of the data, the size will change over time. At the time of publication, individual files started at approximately 1 GB per file, uncompressed.
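The quarterly release schedule described above (releases in October, January, April, and July, each covering the previous three months) can be expressed as a small Python helper. This is a sketch under stated assumptions: the function name `months_covered` is hypothetical, and the helper only encodes the calendar rule; it does not say which releases actually exist on the server.

```python
# Months in which a quarterly release is published, per the description.
RELEASE_MONTHS = (1, 4, 7, 10)


def months_covered(release_year, release_month):
    """Return the (year, month) pairs covered by a quarterly release.

    Hypothetical helper: it encodes only the stated rule that each
    release covers the three months preceding the release month.
    """
    if release_month not in RELEASE_MONTHS:
        raise ValueError("releases occur in January, April, July, and October")
    covered = []
    for offset in (3, 2, 1):
        # Work in absolute month numbers so that stepping back from
        # January wraps correctly into the previous year.
        index = release_year * 12 + (release_month - 1) - offset
        covered.append((index // 12, index % 12 + 1))
    return covered


# A January release covers October through December of the previous year.
print(months_covered(2016, 1))
```

Working with an absolute month index avoids special-casing the January release, which is the only one that spans a year boundary.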
Format: JSON
To access and download XALT data and metadata, please highlight the link and paste it into your address bar: http://web.corral.tacc.utexas.edu/XALT/
Keywords: high performance computing; data curation; computer science
Contributors: McLay, Robert; Fahey, Mark R.; Esteva, Maria; Sweat, Sandra; Kulasekaran, Sivakumar (Siva)
License: CC-BY