| Multithreaded sparse matrix-matrix multiplication for many-core and GPU architectures | 8 |
| Optimizations of the eigensolvers in the ELPA library | 7 |
| Batched QR and SVD algorithms on GPUs with applications in hierarchical matrix compression | 7 |
| DVFS-aware application classification to improve GPGPUs energy efficiency | 5 |
| Accelerating the SVD two stage bidiagonal reduction and divide and conquer using GPUs | 4 |
| Comparing load-balancing algorithms for MapReduce under Zipfian data skews | 4 |
| Proteus: Exploiting precision variability in deep neural networks | 3 |
| SAGE: Percipient Storage for Exascale Data Centric Computing | 3 |
| Manila: Using a densely populated PMC-space for power modelling within large-scale systems | 3 |
| Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors | 3 |
| Benchmarking the GPU memory at the warp level | 3 |
| Performance of asynchronous optimized Schwarz with one-sided communication | 3 |
| IR+: Removing parallel I/O interference of MPI programs via data replication over heterogeneous storage devices | 3 |
| Exponential integrators with parallel-in-time rational approximations for the shallow-water equations on the rotating sphere | 3 |
| A distributed-memory hierarchical solver for general sparse linear systems | 3 |
| Distributed ant colony optimization based on actor model | 2 |
| PSelInv - A distributed memory parallel algorithm for selected inversion: The non-symmetric case | 2 |
| A hybrid CPU/GPU approach for optimizing sorting throughput | 2 |
| Characterizing the performance benefit of hybrid memory system for HPC applications | 2 |
| Overcoming the No Free Lunch Theorem in Cut-off Algorithms for Fork-Join programs | 2 |
| The time and energy efficiency of modern multicore systems | 2 |
| Evaluating the SW26010 many-core processor with a micro-benchmark suite for performance optimizations | 2 |
| Microwave tomographic imaging of cerebrovascular accidents by using high-performance computing | 2 |
| Incomplete Sparse Approximate Inverses for Parallel Preconditioning | 2 |
| PMIx: Process management for exascale environments | 2 |
| Machine Learning in Multi-Agent Systems using Associative Arrays | 2 |
| Optimized large-message broadcast for deep learning workloads: MPI, MPI+NCCL, or NCCL2? | 2 |
| Integrating blocking and non-blocking MPI primitives with task-based programming models | 2 |
| Utility-based resource management in an oversubscribed energy-constrained heterogeneous environment executing parallel applications | 2 |
| A comparative evaluation of three volume rendering libraries for the visualization of sheared thermal convection | 1 |
| Targeting GPUs with OpenMP directives on Summit: A simple and effective Fortran experience | 1 |
| Searching for common patterns on protein sequences by means of a parallel hybrid honey-bee mating optimization algorithm | 1 |
| Client-side straggler-aware I/O scheduler for object-based parallel file systems | 1 |
| Concurrency of three-dimensional refined isogeometric analysis | 1 |
| Parallel eigenvalue computation for banded generalized eigenvalue problems | 1 |
| A time-stamping system to detect memory consistency errors in MPI one-sided applications | 1 |
| Characterizing MPI matching via trace-based simulation | 1 |
| Hybrid parallelization of a multi-tree path search algorithm: Application to highly-flexible biomolecules | 1 |
| Petascale scramjet combustion simulation on the Tianhe-2 heterogeneous supercomputer | 1 |
| Comparing the performance of rigid, moldable and grid-shaped applications on failure-prone HPC platforms | 1 |
| Accelerating the task/data-parallel version of ILUPACK's BiCG in multi-CPU/GPU configurations | 1 |
| GeneaLog: Fine-grained data streaming provenance in cyber-physical systems | 1 |
| Computation of the 100 quadrillionth hexadecimal digit of pi on a cluster of Intel Xeon Phi processors | 1 |
| Introducing the explicitly many-processor approach | 1 |
| Parallel accelerated vector similarity calculations for genomics applications | 1 |
| Practical, distributed, low overhead algorithms for irregular gather and scatter collectives | 1 |
| Superlinear speedup phenomenon in parallel 3D Discrete Element Method (DEM) simulations of complex-shaped particles | 1 |
| Data staging for efficient high throughput stream processing | 1 |
| Exploring stream parallel patterns in distributed MPI environments | 1 |
| The OpenACC data model: Preliminary study on its major challenges and implementations | 1 |