{"target": 0, "func": "[PATCH] PME-gather: Use templated functor instead of preprocessor\n\nAdded restrict in several places, but this does not affect performance\nwith gcc and icc.\n\nChange-Id: Id366621fa3ad02ca182b8a4da48cae940059cf46", "idx": 686} | |
{"target": 0, "func": "[PATCH] Clean up documentation slightly for the case of slow runs.", "idx": 1528} | |
{"target": 0, "func": "[PATCH] Adding Changelog for Release 2.8.00\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 2.8", "idx": 1536} | |
{"target": 0, "func": "[PATCH] Adding Changelog for Release 2.9.99\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 2.9.99", "idx": 227} | |
{"target": 0, "func": "[PATCH] Simplify the usage of the performance function by reducing\n the template parameter.", "idx": 839} | |
{"target": 1, "func": "[PATCH] Attempt to work-around old gcc bugs in a more efficient\n fashion that does not lose performance on newer gcc's. [empty commit message]", "idx": 1272} | |
{"target": 0, "func": "[PATCH] the mystery of the missing MIC performance on Intel 16 might\n be fixed with -qopt-assume-safe-padding", "idx": 1032} | |
{"target": 0, "func": "[PATCH] Move clang-tidy build\n\nThis has seemed too slow when combined with the ASAN build.\n\nChange-Id: I45ea5856ca05edbb6107b62f219e8afd3cdbda3f", "idx": 743} | |
{"target": 0, "func": "[PATCH] Use AVX512 also for DGEMM\n\nthis required switching to the generic gemm_beta code (which is faster anyway on SKX)\nfor both DGEMM and SGEMM\n\nPerformance for the not-retuned version is in the 30% range", "idx": 927} | |
{"target": 0, "func": "[PATCH] Add Elem::loose_bounding_box()\n\nThe default implementation is what we were using previously to\ncalculate bounding boxes for PointLocatorTree. For higher order\nelements we add fast approximations that are strict in the case of\nlinear geometry but that should still be bounds in the case of higher\norder geometry.", "idx": 1525} | |
{"target": 0, "func": "[PATCH] Added hacks for SIMD rvec/load store in lincs & bondeds\n\nWe have added proper gather/scatter operations to work on\nrvecs for all SIMD architectures, but that will not make it into\nGromacs-5.1. Since Berk already wrote a few routines to use\nmaskloads at least for AVX & AVX2, this is a bit of a hack to\nget the performance benefits of that code already in Gromacs-5.1\n(for AVX/AVX2), without altering the SIMD module. This is definitely\na hack, and the code will be replaced once the extended SIMD\nmodule is in place.\n\nChange-Id: I385acb5f989b2ecf463948be84947fe1f6dfd19b", "idx": 1499} | |
{"target": 0, "func": "[PATCH] Benchmark can now export performance data to an XML file +\n added Tanglecube function to benchmark", "idx": 577} | |
{"target": 1, "func": "[PATCH] replaced std::endl with \\n in all file IO and stringstreams. \n std::endl forces a flush, which kills performance on some machines\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@834 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 440} | |
{"target": 0, "func": "[PATCH] USER-DPD: specialize PairTableRXKokkos's compute_all_items()\n on NEWTON_PAIR No noticeable performance change, but it does eliminate a deep\n conditional.", "idx": 1294} | |
{"target": 0, "func": "[PATCH] Cleaning code, Sylvain should take a look to the previous\n version of this file and Root_of_2.h because there is an obvious slow down in\n the execution of all the filtering kernels", "idx": 972} | |
{"target": 0, "func": "[PATCH] Add a single-tree depth-first traverser. It is not as fast\n as it could be.", "idx": 360} | |
{"target": 0, "func": "[PATCH] In the performance miniapps, add options to set the mesh\n refinements.\n\nRun the miniapps/performance tests at coarser mesh resolutions, so that\nthe tests run faster.\n\nExplicitly remove and ignore the temporary files created by the tests,\nbecause, in some cases, they may not be removed automatically.", "idx": 239} | |
{"target": 0, "func": "[PATCH] Fixed FAST on Mac OS X", "idx": 1342} | |
{"target": 0, "func": "[PATCH] Update to Kokkos r2.04.04 and add workaround for performance\n regression", "idx": 279} | |
{"target": 0, "func": "[PATCH] Accelerate 3D version of inexact_locate as we do it for 2D\n\nsee commit 4c477c853c82bb1ca77bb496cdce993178c3dfa2", "idx": 1071} | |
{"target": 0, "func": "[PATCH] ClangCuda: Make Cuda compilation with clang 3.9.0 work\n\nThis makes all the unit and performance tests work, with only\none test disabled (view of views).\nOn the other hand it breaks the cuda build because clang is ok with\n__device__ void foo() {}\nvoid foo() {}\nnvcc sees that as a redeclaration.", "idx": 819} | |
{"target": 0, "func": "[PATCH] PetscMatrix::print_personal now prints to file when requested\n (rather than just cout). The implementation is not particularly efficient\n (since print_personal gets passed an ostream) but it does work. And how\n efficient do you need to be if you are printing out matrices anyway?\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@4244 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 1219} | |
{"target": 0, "func": "[PATCH] nonUniformTableThermophysicalFunction: New non-uniform table\n thermophysicalFunction for liquid properties\n\nDescription\n Non-uniform tabulated property function that linearly interpolates between\n the values.\n\n To speed-up the search of the non-uniform table a uniform jump-table is\n created on construction which is used for fast indirect addressing into\n the table.\n\nUsage\n \\nonUniformTable\n Property | Description\n values | List of (temperature property) value pairs\n \\endnonUniformTable\n\n Example for the density of water between 280 and 350K\n \\verbatim\n rho\n {\n type nonUniformTable;\n\n values\n (\n (280 999.87)\n (300 995.1)\n (350 973.7)\n );\n }\n \\endverbatim", "idx": 1044} | |
{"target": 0, "func": "[PATCH] Adding Changelog for Release 3.2.00\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 3.2", "idx": 1244} | |
{"target": 0, "func": "[PATCH] First experimental prototype of Efficient Ransac (written by\n Yannick)\n\nPlane detection only.\nSome work to do to make it CGAL-conforming.", "idx": 1439} | |
{"target": 1, "func": "[PATCH] subCycle: Add special treatment for nSubCycles = 1\n\nNow running sub-cycling with nSubCycles = 1 is as efficient as running the same\ncode without the sub-cycling loop.", "idx": 208} | |
{"target": 0, "func": "[PATCH] Prevent PME tuning excessive grid scaling\n\nWe limit the maximum grid scaling to a factor 1.8. This allows\nplenty of room for shifting work from PME on CPU to short-range\nGPU kernels, but avoids excessive scaling for diminishing return\nin performance for a significant increase in power consumption,\ncommunication volume (which may with fluctuating network load not\nshow up during tuning) as well as limiting load balancing.\n\nChange-Id: I85c02478faa6b67c063b6e1b45a9ac1755b2d81e", "idx": 184} | |
{"target": 0, "func": "[PATCH] This was the version of code used for the FastMKS benchmarks\n in the recently submitted paper, \"Dual-tree Fast Exact Max-Kernel Search\".", "idx": 1250} | |
{"target": 0, "func": "[PATCH] Update testing matrices for coverage and speed\n\nMoved slow aspect of pre-submit matrix to nightly (icc with release\nmode and SIMD support).\n\nRemoved slow gcc-7 config adjusting similar builds to achieve its\nformer objectives.\n\nRemoved outdated TODOs, noted new ones\n\nAdded hwloc test specifier, and todo for hwloc 2\n\nAdded tng test logic, and a specifier for each non-default case. Fixed\nmissing return values for no-tng case, and clarified the docs.\n\nChange-Id: I340b9a64dc4e4958f260657d3d82480be62ef979", "idx": 1179} | |
{"target": 1, "func": "[PATCH] Wrote the compute_children_node_keys() function in elem.h\n which allows one to generate appropriate node keys while reading in a mesh\n with multiple refinement levels. This allows us to avoid a linear search in\n the MeshRefinement::add_point routine since all the nodes can now be found in\n the nodes hash table. The resulting performance improvement was significant.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1263 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 174} | |
{"target": 0, "func": "[PATCH] Remove RangeSet, DenseIntMap, and fast allocation routines. \n They were not used anywhere.", "idx": 445} | |
{"target": 0, "func": "[PATCH] BJP: Cleaned up code a little to make it easier to use for\n performance testing.", "idx": 996} | |
{"target": 0, "func": "[PATCH] Create dedicated subcounter for nonbonded FEP\n\nNow all nonbonded work has their own separate subcounters which allows\nmeasuring the performance of each task separately.\n\nRefs #2997\n\nChange-Id: I601445364592923d08087a858da4629b0b58ae76", "idx": 863} | |
{"target": 0, "func": "[PATCH] Added FAST unit tests.", "idx": 59} | |
{"target": 0, "func": "[PATCH] Some performance issues due to profiling", "idx": 540} | |
{"target": 0, "func": "[PATCH] Lazily Init AWS ClientConfiguration\n\nAWS changes the ClientConfiguration in the 1.8 SDK to do the checking for env\nvariables and ec2 metadata in [1]. This can cause TileDB to behavior slow if\nS3 support is built but the environment is not configured. The AWS SDK\ncheck for the ec2 metadata and has to wait for a timeout. We need to\nlazily init the ClientConfiguration.\n\n[1] https://github.com/aws/aws-sdk-cpp/commit/147469373c9fec1037bd2d75d7cd949250c6f7c5", "idx": 1116} | |
{"target": 0, "func": "[PATCH] Fixed OpenMP compile and added a tux regression test\n\nThe MGR OpenMP code has not been tested yet, so I commented it out for now\nAdded a tux OpenMP compile test and reorganized the tux tests from fast to slow (more or less)", "idx": 115} | |
{"target": 1, "func": "[PATCH] Sparse refactored readers: Better vectorization for tile\n bitmaps calculations. (#2711) (#2734)\n\n* Sparse unordered with duplicates: Better vectorization for tile bitmaps.\n\nWhen computing tile bitmaps, there are a few places where we were not\nvectorizing properly and it caused performance impacts on some customer\nscenario. Fixing the use of covered/overlap in the dimension class to\nenable proper vectorization.", "idx": 1163} | |
{"target": 0, "func": "[PATCH] Some performance tuning", "idx": 254} | |
{"target": 0, "func": "[PATCH] fixed a bug with efficient IMVJ and reuse of old data - need\n to clear the matrix Wtil after convergence of the current time step, as\n columns are the result of J*V and J is outdated now.", "idx": 1552} | |
{"target": 0, "func": "[PATCH] Update unit tests and performance tests\n\nCo-authored-by: Daniel Arndt <arndtd@ornl.gov>", "idx": 966} | |
{"target": 0, "func": "[PATCH] add SDG Linf fast examples to doc file", "idx": 1083} | |
{"target": 0, "func": "[PATCH] PERF Batching + Blocks images in rotate and transform\n\n* Batching all images in single block set is slow at high blocks\n* Divide batches of images into sets and launch blocks for that", "idx": 1491} | |
{"target": 0, "func": "[PATCH] .EXPORT_ALL_VARIABLES: commented out since it seems to slow\n down the compilation", "idx": 535} | |
{"target": 0, "func": "[PATCH] Recursive ThreadPools (#1774)\n\nCurrently, ThreadPool instances are only recursive with their own instance. This\npatch allows multiple ThreadPool instances to recurse among themselves without\nthe potential for bottlenecking (either in performance reduction or a deadlock)\non available threads.\n\nFor instance, ThreadPool instances A and B may be called in any order:\nA->execute([&]() {\n B->execute([&]() {\n A->execute([&]() {\n foo();\n });\n });\n});\n\nThe motivation for this patch is to allow us to use our `compute_tp` and\n`io_tp` without worrying about their order in a call stack. This patch reduces\nthe runtime of our `bench_large_io` from 280587ms to 269621.\n\nSee the new unit tests in `unit-threadpool.cc` for more examples.", "idx": 756} | |
{"target": 0, "func": "[PATCH] Introduce HostAllocationPolicy\n\nThis permits host-side standard containers and smart pointers to have\ntheir contents placed in memory suitable for efficient GPU transfer.\n\nThe behaviour can be configured at run time during simulation setup,\nso that if we are not running on a GPU, then none of the buffers that\nmight be affected actually are. The downside is that all such\ncontainers now have state.\n\nChange-Id: I9367d0f996de04c21312cef2081cc08148f80561", "idx": 1144} | |
{"target": 0, "func": "[PATCH] Changed innerloop optimization options, and made SSE/3dnow\n loops default together with fast truncation on linux.", "idx": 1428} | |
{"target": 0, "func": "[PATCH] for three star segments, return fast common point\n\nSigned-off-by: Panagiotis Cheilaris <philaris@cs.ntua.gr>", "idx": 1196} | |
{"target": 0, "func": "[PATCH] Split up travis 'full' target; it became too slow", "idx": 1330} | |
{"target": 0, "func": "[PATCH] Use Array<T> objects instead of Param objects in several\n functions\n\n* Update SIFT with new RAII memAlloc\n* Workaround for function resolution in ternary operator\n* Fix Fast and Orb functions", "idx": 173} | |
{"target": 1, "func": "[PATCH] Force field updates.\n\nNew order for SWM4-NDP and SWM6 topologies, with SETTLE\ninstead of 3 constraints. Performance is slightly better\nin this case. Uploading a SWM4-NDP water box that has\nbeen equilibrated for 100 ps to serve as input for gmx\nsolvate.\n\nChange-Id: I67e10693ca76e77b99b371ea9887402e7ac0acc1", "idx": 982} | |
{"target": 0, "func": "[PATCH] RJH: Tweaked slow convergence threshold and improved\n stability of RJH: the line search", "idx": 285} | |
{"target": 0, "func": "[PATCH] removed call to fast localization for the time being...EJB", "idx": 1200} | |
{"target": 0, "func": "[PATCH] Restore locking optimizations for OpenMP case\n\nrestore another accidentally dropped part of #1468 that was missed in #2004 to address performance regression reported in #1461", "idx": 1460} | |
{"target": 0, "func": "[PATCH] fixed issue with slow insertion in the presence of dummy\n points", "idx": 1259} | |
{"target": 0, "func": "[PATCH] document \"slow\" and \"unstable\" labels for unit tests", "idx": 904} | |
{"target": 0, "func": "[PATCH] Update GMM code. It should be a little faster training, but\n it is still too slow for my preferences. I am not sure what is making it so\n slow.", "idx": 936} | |
{"target": 0, "func": "[PATCH] IA64: efc with Optimiz break moints2x and moints6x", "idx": 1465} | |
{"target": 0, "func": "[PATCH] Keep COARSEN_INACTIVE flags in sync in corner case\n\nThis isn't the most efficient solution - I think we could probably let\nthese flags stay out of sync for a while, weaken the assertions that\ncomplain, and trust to make_coarsening_compatible to eventually fix\nthe inconsistencies.\n\nThis is the safest solution, though.", "idx": 620} | |
{"target": 0, "func": "[PATCH] HvD: Adding a few performance tests to see if certain\n optimizations are needed. It turns out that aa**ii is about 20 times faster\n than aa**bb, where aa and bb are double precision numbers and ii is an\n integer. Whereas aa*ii is about as fast as aa*bb (measured by running \"time\n <program>\"). So it seems as if we need to add exponentiation to an integer\n power to the NWAD module.", "idx": 662} | |
{"target": 0, "func": "[PATCH] Finally, a fully parallel version of the refinement. Not very\n efficient, though, but the idea was to identify all data races and to protect\n it using locks, atomics, TLS... Needs some tests now, to check if we didn't\n miss any rare data race.", "idx": 1017} | |
{"target": 1, "func": "[PATCH] Fix nblib pairlist update function\n\nPreviously the function put_atoms_in_box was called only by the\nnblib integrator, or when constructing a system via a SimulationState\nobject. In the case of the integrator, this is a needless performance\ndegradation. When using an nb force calculator without\nfirst putting atoms in the box, this could lead to a cryptic\nerror from nbnxm grid search failure. Both of these issues are\nrectified with this change, which also adds a member to the\nnon-bonded force calculator to hold the requested number of OMP\nthreads to use in a call to put_atoms_in_box_omp, which is faster\nthan the non OMP version.", "idx": 373} | |
{"target": 0, "func": "[PATCH] Initial fast reciprocal space LJPME implementation, with\n test.", "idx": 339} | |
{"target": 0, "func": "[PATCH] Got parallelization working with CollocationIntegrator, very\n slow", "idx": 223} | |
{"target": 0, "func": "[PATCH] default to 1 thread, make the user manually specify the\n number of threads to avoid mpi/thread contention\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@2672 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 1166} | |
{"target": 0, "func": "[PATCH] performance updates...EJB", "idx": 702} | |
{"target": 0, "func": "[PATCH] Further work on fast diagonalization", "idx": 1417} | |
{"target": 0, "func": "[PATCH] Sort cell slab ranges for ND arrays (#1736)\n\nThe selective decompression intersection algorithm requires cell slab ranges\nto be sorted in ascending order. This is true for 1D arrays, but not for ND\narrays. In this scenario, we must sort them.\n\nThe ranges are sorted in vectors that have already been partitioned per-tile.\nThis should keep the sort runtimes relatively quick. In the future, we can\nbenchmark this on large arrays+queries and measure its timing. With this sort,\nwe now may coalesce cell ranges as a future optimization.\n\nIf the N*LOG(N) sorting is too slow, the alternative approach is to leave them\nunsorted and perform O(N*M) range-chunk intersection comparisons, where M is\nthe number of chunks. If the number of chunks is less than LOG(N), this may\nbe faster.\n\nThis solves the following error message:\n[TileDB::ChunkedBuffer] Error: Chunk read error; chunk unallocated error\n\nCo-authored-by: Joe Maley <joe@tiledb.com>", "idx": 1091} | |
{"target": 0, "func": "[PATCH] Restore evGW performance", "idx": 1430} | |
{"target": 0, "func": "[PATCH] Adding logic for detecting whether or not a Mac must link\n vecLib or Accelerate", "idx": 398} | |
{"target": 0, "func": "[PATCH] Use pre-trained network to accelerate test", "idx": 992} | |
{"target": 0, "func": "[PATCH] need to slow down in some changes ... cancel eckart for the\n time being", "idx": 178} | |
{"target": 0, "func": "[PATCH] Change from single Performance test Executable to different\n executables per backend.", "idx": 300} | |
{"target": 0, "func": "[PATCH] Auto cache compiled CUDA kernels on disk to speed up\n compilation (#2848)\n\n* Adds CMake variable AF_CACHE_KERNELS_TO_DISK to enable kernel caching. It is turned ON by default.\n* cuda::buildKernel() now dumps cubin to disk for reuse\n* Adds cuda::loadKernel() for loading cached cubin files\n* cuda::loadKernel() returns empty kernel on failure\n* Uses XDG_CACHE_HOME as cache directory for Linux\n* Adds common::deterministicHash() - This uses the FNV-1a hashing algorithm for fast and reproducible hashing of string or binary data. This is meant to replace the use of std::hash in some place, since std::hash does not guarantee its return value will be the same in subsequence executions of the program.\n* Write cached kernel to temporary file before moving into final file. This prevents data races where two threads or two processes might write to the same file.\n* Uses deterministicHash() for hashing kernel names and kernel binary data.\n* Adds kernel binary data file integrity check upon loading from disk", "idx": 122} | |
{"target": 0, "func": "[PATCH] add generic Mac toolchain file\n\nI prefer that CMake find the MPI compiler wrappers instead of the base\ncompilers, if for no other reason than this often means mixing clang and\ngfortran, since clang will likely be found first in PATH.\n\nthis toolchain assumes MPI wrappers are in the PATH but one can set the\nbase directory (with a trailing \"/\") if desired.\n\nthe BLAS/LAPACK used is \"-framework Accelerate\"", "idx": 627} | |
{"target": 0, "func": "[PATCH] Added performance data", "idx": 867} | |
{"target": 0, "func": "[PATCH] Workaround for Visual Studio bug that causes very slow\n compilation", "idx": 1394} | |
{"target": 0, "func": "[PATCH] Adding BoundaryInfo::add_elements()\n\nThis allows us to easily add boundary elements to an existing mesh\nwithout creating a new mesh for them.\n\nFactoring out _find_id_maps simplifies the code a bit. If\nBoundaryInfo::sync() is too slow we might also factor out\n_add_elements to avoid a redundant _find_id_maps call, but I doubt\nthe redundancy will ever show up in profiling.", "idx": 699} | |
{"target": 0, "func": "[PATCH] Make mallocs uniformly an error in PetscMatrix\n\nThis adds the setting to the other two init methods that\nif a new malloc occurs during MatSetValues then it is an\nerror. I think this is an improvement because it establishes\nuniformity across our init methods and it can prevent users\nfrom running extremely slow simulations.\n\nI fully expect this to cause failures in MOOSE...", "idx": 1307} | |
{"target": 0, "func": "[PATCH] Add the possibility to remove the far points\n\nThe far points are added by the parallel version to reduce the contention\non the infinite vertex", "idx": 30} | |
{"target": 0, "func": "[PATCH] Use more accurate error message.\n\nAs discussed over in idaholab/moose#9097, Nanoflann does not actually\nimplement an *approximate* nearest node search algorithm. In contrast\nto \"exact\" nearest node searches, approximate nearest node searches\nare not guaranteed to return the nearest node, in return for potential\nperformance improvements.\n\nSee, for example, their README file [0], which states that Nanoflann\ncan \"[Build] KD-trees with a single index (no randomized KD-trees, no\napproximate searches).\"\n\n[0]: https://github.com/jlblancoc/nanoflann/blob/v1.2.3/README.md", "idx": 531} | |
{"target": 0, "func": "[PATCH] Update performance section", "idx": 315} | |
{"target": 0, "func": "[PATCH] Made FAST CPU results match CUDA results", "idx": 32} | |
{"target": 0, "func": "[PATCH] Remove OpenMP compile flag in CUDA backend\n\nThis flag isn't needed based on recent tests. If it is causing any\nperformance regression, it will be reverted and the following\nflag to disable two-phase lookup for cuda backend on windows will\nbe added back.\n\n/permissive flag does not work with two-phase-lookup enabled\nfor projects with openmp support enabled.", "idx": 423} | |
{"target": 1, "func": "[PATCH] WIP Fixed performance of multi-range subarray result\n estimation", "idx": 1248} | |
{"target": 1, "func": "[PATCH] Improve performance of random number generator calls", "idx": 1514} | |
{"target": 1, "func": "[PATCH] QN update is now in Base class - subclasses only compute the\n update. Changed computation of QN-Update for MVQN: use QR decomposition of\n matrix V, instead of LU decomposition of VTV. More efficient and more robust\n implementation", "idx": 954} | |
{"target": 1, "func": "[PATCH] new implementation using boost CSR graph, it can be 1.5x\n faster from prev implementation but there is a performance problem that I\n couldn't solve using public functionality of graph (however there might be a\n solution) will look it back.", "idx": 1266} | |
{"target": 1, "func": "[PATCH] Improve meanshift filter performance on CPU\n\nImproved the meanshift filter performance on the CPU backend by replacing\nvectors with std::arrays and moving them out of the for loops. Also\nreduced a few conversion operations.", "idx": 288} | |
{"target": 1, "func": "[PATCH] 128-bit AVX2 SIMD for AMD Ryzen\n\nWhile Ryzen supports 256-bit AVX2, the internal units are organized\nto execute either a single 256-bit instruction or two 128-bit SIMD\ninstruction per cycle. Since most of our kernels are slightly\nless efficient for wider SIMD, this improves performance by roughly\n10%.\n\nChange-Id: Ie601b1dbe13d70334cdf9284e236ad9132951ec9", "idx": 1453} | |
{"target": 1, "func": "[PATCH] gauge_field::FloatNOrder can now use __ldg loads. Generally\n improves performance across the board, but some regressions at 12/8\n reconstruct so left switched off for now (USE_LDG macro in\n include/gauge_field_order.h).", "idx": 962} | |
{"target": 1, "func": "[PATCH] Better performance (~10-15% better)\n\nBy removing several tests (and use CGAL::max instead), the generated\nassembly is more efficient.", "idx": 772} | |
{"target": 1, "func": "[PATCH] Replacing dynamic_cast with libmesh_cast where appropriate -\n depending on the error checking replaced this will either lead to slightly\n more efficient NDEBUG runs or slightly more run-time checking in debug mode\n runs.", "idx": 1291} | |
{"target": 1, "func": "[PATCH] Improve performance of PFMG", "idx": 1357} | |
{"target": 1, "func": "[PATCH] Adds sfence and lfence options if user builds with assembly\n to enable more efficient use of out of order engines", "idx": 1384} | |
{"target": 1, "func": "[PATCH] Fix IR issue causing very slow to no convergence when using\n an inaccurate inner solver", "idx": 1560} | |
{"target": 1, "func": "[PATCH] Add new constructor to Iso_rectangle_2(Point_2, Point_2,\n int). The additional dummy \"int\" specifies that the 2 points are the\n lower-left and upper-right corner. This is more efficient when one knows\n they are already in this configuration.\n\nSame thing for Iso_cuboid_3, and the functors.\n\nUse them in Cartesian_converter and Homogeneous_converter.", "idx": 1393} | |
{"target": 1, "func": "[PATCH] improved performance of free energy runs in water\n significantly by allowing water-water loops and added a slight speed up for\n neighborsearching for free-energy runs", "idx": 859} | |
{"target": 1, "func": "[PATCH] Adjusted scheduling, removed slow atomic section", "idx": 820} | |
{"target": 1, "func": "[PATCH] Enforced rotation: Minor performance optimization in\n do_flex[1,2]_lowlevel", "idx": 1141} | |
{"target": 1, "func": "[PATCH] allows requesting of multiple moments. restructures blocks\n for performance improvements", "idx": 1099} | |
{"target": 1, "func": "[PATCH] some more performance improvements", "idx": 1537} | |
{"target": 0, "func": "[PATCH] Refactor threading model\n\nRemoves:\n- global_tp_ (the replacement TBB thread pool)\n- StorageManager::async_thread_pool_\n- StorageManager::reader_thread_pool_\n- StorageManager::writer_thread_pool_\n- VFS::thread_pool_\n\nAdds:\n- StorageManager::compute_tp_\n- StorageManager::io_tp_\n\nUsage changes:\n1. Our three parallel functions (`parallel_sort`, `parallel_for[_2d]`) now use\n the `StorageManager::compute_tp_`.\n2. Both the `Reader::read_tiles()` and `Writer::write_tiles()` now execute on\n `StorageManager::io_tp_`.\n3. The VFS is now initialized with a thread pool, where the storage manager\n initializes it with the `StorageManager::io_tp_`. This means that both the\n VFS and Reader/Writer io paths execute on the same thread pool. There was\n previously a deadlock scenario if both used the same thread pool, but that\n is no longer an issue now that the threadpools are recursive.\n4. The async queries are executed on `StorageManager::compute_tp_`.\n\nConfig changes:\n- Adds configuration parameters for the compute and IO thread pool \"concurrency\n levels\". A level of \"1\" is serial execution while all other levels have a\n maximum concurrency of N but allocate N-1 OS threads.\n- Deprecate the async/reader/writer/vfs thread num configurations. If any of\n these are set and larger than the new \"sm.compute_concurrency_level\" and\n \"sm.io_concurrency_level\", the old values will be used instead. The motivation\n is so that existing users will not see a drop in performance if they are\n currently using larger-than-default values.", "idx": 192} | |
{"target": 1, "func": "[PATCH] Modified nb_accv to improve performance of accumulates to\n processors on the same SMP node.", "idx": 1504} | |
{"target": 1, "func": "[PATCH] fast dynamic_cast in Lazy_kernel::Construct_point_3", "idx": 773} | |
{"target": 1, "func": "[PATCH] Disable CUDA textures on NVIDIA Volta\n\nThis has significant performance benefit for the nbnxm kernels with\ntabulated Ewald correction and it has negligible impact on the PME kernels.\n\nPartially addresses #3845", "idx": 542} | |
{"target": 1, "func": "[PATCH] Increase memory allocation for NPT simulation in Ewald.cpp.\n Assign linear molecule with 3 and more atoms to DCGraph at higher level to\n improve code performance", "idx": 986} | |
{"target": 1, "func": "[PATCH] #1295 Refactored MX::setSub(single argument) This should be\n significantly more efficient, though I didn't do any performance testing", "idx": 959} | |
{"target": 1, "func": "[PATCH] Performance improvements. BuildWeightMatrix() is probably\n unnecessary entirely.", "idx": 171} | |
{"target": 1, "func": "[PATCH] Move orb LUT in CUDA backend to texture memory\n\ncuda::kernel::extract_orb is the CUDA kernel that uses the orb\nlookup table. Shared below is performance of the kernel using constant\nmemory vs texture memory. There is negligible to no difference between two\nversions. Hence, shifted to texture memory LUT to reduce global constant\nmemory usage.\n\nPerformance using constant memory LUT\n-------------------------------------\n\nTime(%) Time Calls Avg Min Max Name\n\n3.02% 292.26us 24 12.177us 11.360us 14.528us void cuda::kernel::extract_orb<float>\n2.16% 209.00us 16 13.062us 11.616us 16.033us void cuda::kernel::extract_orb<double>\n\nPerformance using texture LUT\n-----------------------------\n\nTime(%) Time Calls Avg Min Max Name\n\n2.84% 270.63us 24 11.276us 9.6970us 15.040us void cuda::kernel::extract_orb<float>\n2.20% 209.28us 16 13.080us 10.688us 16.960us void cuda::kernel::extract_orb<double>", "idx": 1566} | |
{"target": 1, "func": "[PATCH] thermophysicalModels: Added caching of Cp and Cv for\n efficiency\n\nIn multi-specie systems the calculation of Cp and Cv is relatively expensive due\nto the cost of mixing the coefficients of the specie thermo model, e.g. JANAF,\nand it is significantly more efficient if these are calculated at the same time\nas the rest of the thermo-physical properties following the energy solution.\n\nAlso the need for CpByCpv is also avoided by the specie thermo providing the\nenergy type in the form of a boolean 'enthalpy()' function which returns true if\nthe energy type is an enthalpy form.", "idx": 42} | |
{"target": 1, "func": "[PATCH] performance improvements on grid_ssw", "idx": 1060} | |
{"target": 1, "func": "[PATCH] fast pool allocator", "idx": 560} | |
{"target": 1, "func": "[PATCH] Improved performance mostly by using hints to insert to the\n status line.", "idx": 744} | |
{"target": 1, "func": "[PATCH] Made gmx_numzero static for performance reasons.", "idx": 565} | |
{"target": 1, "func": "[PATCH] PERFFIX: improved 2d convolve perf in cuda by 33%\n\n* templating cuda kernel for filter lengths increased\n performance by 30% which is 93% of closed-source ArrayFire\n implementation of 2d convolution\n* templating separable cuda kernel improved performance by 20%\n* separated separable convolution kernel and wrapper into their\n own file to speed up compilation time", "idx": 393} | |
{"target": 1, "func": "[PATCH] modified to have 3c and 2c two electron eri sums also prints\n outer loop index to show user about where the computation is. Also\n statically (modulo) parallelized for better performance\n\nRick Kendall", "idx": 790} | |
{"target": 1, "func": "[PATCH] lagrangian: Rationalized the handling of multi-component\n liquids and solids Ensures consistency between the mixture thermodynamics and\n composition specifications for the parcels. Simpler more efficient\n implementation. Resolves bug-report\n http://www.openfoam.org/mantisbt/view.php?id=1395 as well as other\n consistency issues not yet reported.", "idx": 362} | |
{"target": 1, "func": "[PATCH] Optimize the performance of rot by using universal intrinsics", "idx": 109} | |
{"target": 1, "func": "[PATCH] reverted commit 76d5bddd5c3dfdef76beaab8222231624eb75e89.\n Split ga_acc in moints2x_trf2K in smaller ga_acc on MPI-PR since gives large\n performance improvement on NERSC Cori", "idx": 1516} | |
{"target": 1, "func": "[PATCH] This commit introduces VariableGroups as an optimization when\n there are repeated variables of the same type inside a system. Presently,\n these are only activated through the system.add_variables() API, but in the\n future there may be provisions for automatically identifying groups.\n\nThe memory usage for DofObjects now scales like\nN_sys+N_var_group_per_sys instead of N_sys+N_vars. The DofMap\ndistribution code has been refactored to use VariableGroups.\n\nAll existing loops over Variables within a system will work unchanged,\nbut can be replaced with more efficient loops over VariableGroups.", "idx": 1452} | |
{"target": 1, "func": "[PATCH] Simplifying ARMv8 build parameters\n\nARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode\n(which is not right because TX2 is ARMv8.1) as well as requiring a few\nredundancies in the defines, making it harder to maintain and understand\nwhat core has what. A few other minor issues were also fixed.\n\nTests were made on the following cores: A53, A57, A72, Falkor, ThunderX,\nThunderX2, and XGene.\n\nTests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester.\n\nA summary:\n * Removed TX2 code from ARMv8 build, to make sure it is compatible with\n all ARMv8 cores, not just v8.1. Also, the TX2 code has actually\n harmed performance on big cores.\n * Commoned up ARMv8 architectures' defines in params.h, to make sure\n that all will benefit from ARMv8 settings, in addition to their own.\n * Adding a few more cores, using ARMv8's include strategy, to benefit\n from compiler optimisations using mtune. Also updated cache\n information from the manuals, making sure we set good conservative\n values by default. Removed Vulcan, as it's an alias to TX2.\n * Auto-detecting most of those cores, but also updating the forced\n compilation in getarch.c, to make sure the parameters are the same\n whether compiled natively or forced arch.\n\nBenefits:\n * ARMv8 build is now guaranteed to work on all ARMv8 cores\n * Improved performance for ARMv8 builds on some cores (A72, Falkor,\n ThunderX1 and 2: up to 11%) over current develop\n * Improved performance for *all* cores comparing to develop branch\n before TX2's patch (9% ~ 36%)\n * ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than\n current develop's branch and 8% faster than develop before tx2 patches\n\nIssues:\n * Regression from current develop branch for A53 (-12%) and A57 (-3%)\n with ARMv8 builds, but still faster than before TX2's commit (+15%\n and +24% respectively). This can be improved with a simplification of\n TX2's code, to be done in future patches. At least the code is\n guaranteed to be ARMv8.0 now.\n\nComments:\n * CortexA57 builds are unchanged on A57 hardware from develop's branch,\n which makes sense, as it's untouched.\n * CortexA72 builds improve over A57 on A72 hardware, even if they're\n using the same includes due to new compiler tuning in the makefile.", "idx": 932} | |
{"target": 1, "func": "[PATCH] Fixed performance regression on Kepler", "idx": 1422} | |
{"target": 1, "func": "[PATCH] Optimize load_tile_offsets for only relevant frags\n\nThis optimization is to adjust the `Reader::load_tile_offsets` class to\nonly loop over the relevant fragments as computed by the subarray class\nbased on intersection. This is a performance optimization for arrays\nwhich have a large number of fragments and which we are incorrectly\nfetching a large amount of unneeded data.", "idx": 775} | |
{"target": 1, "func": "[PATCH] Removed debug code which would slow down mdrun when writing\n trajectories", "idx": 1082} | |
{"target": 1, "func": "[PATCH] Fix performances of Triangulation_2 with EPEC\n\nThere was a performance degradation between CGAL-3.7 and CGAL-3.8, when\nTriangulation_2 is used with EPEC. This patch fixes the issue. Using a\nfunctor that is specialized for EPEC, in inexact_orientation, to_double is\nnot called on p.x() but on p.approx().x().", "idx": 268} | |
{"target": 1, "func": "[PATCH] USER-DPD: performance optimizations to ssa_update() in\n fix_shardlow Overall improvements range from 2% to 18% on our benchmarks 1)\n Newton has to be turned on for SSA, so remove those conditionals 2) Rework\n the math in ssa_update() to eliminate many ops and temporaries 3) Split\n ssa_update() into two versions, based on DPD vs. DPDE 4) Reorder code in\n ssa_update_*() to reduce register pressure", "idx": 152} | |
{"target": 1, "func": "[PATCH] Sparse refactored readers: disable filtered buffer tile\n cache. (#2702)\n\nFrom tests, it's been found that writing the cache for the filter\npipeline takes a significant amount of time for the tile unfiltering\noperation. For example, 2.25 seconds with and 1.88 seconds without in\nsome cases. The cache improved performance before multi-range subarrays\nwere implemented, so dropping it is fine at least for the refactored\nreaders.", "idx": 499} | |
{"target": 1, "func": "[PATCH] more efficient jacobian of mapping modes", "idx": 934} | |
{"target": 1, "func": "[PATCH] rocSPARSE does not require sorted columns for csrgemm\n\nThis is specific to the rocSPARSE CSR implementation of SpGEMM.\nBut this is a substantial performance savings.", "idx": 408} | |
{"target": 1, "func": "[PATCH] improve the ell performance", "idx": 1197} | |
{"target": 1, "func": "[PATCH] Step towards better performance regarding second convergence\n test in bicgstab", "idx": 917} | |
{"target": 1, "func": "[PATCH] Added code to improve performance for the structure factor\n calculations.....Works in serial...Still need to check parallel\n performance......EJB", "idx": 512} | |
{"target": 1, "func": "[PATCH] EA:increased MAXMEM size to 512 since it gives large\n performance improvement on big matrix multiplies", "idx": 1215} | |
{"target": 0, "func": "[PATCH] Robust retries for S3 request-limit retries (#1651)\n\nThis patch provides a retry handler that is identical to the existing, default\nhandler with an exception for CoreErrors::SLOW_DOWN. For this error, we will\nunconditionally retry every 1.25-1.75 seconds.\n\nThe motivation for this patch is to allow the TileDB client to remain functional\neven when performance may be bottlenecked on this error. The server returns\nthis error when we exceed a fixed number of requests per second. The client\nwill eventually make progress.", "idx": 1088} | |
{"target": 1, "func": "[PATCH] Improved GB performance in mixed precision", "idx": 62} | |
{"target": 1, "func": "[PATCH] added option to keep transpose to hybrid for better gpu\n performance", "idx": 612} | |
{"target": 1, "func": "[PATCH] changed DofMap::build_constraint_matrix to be more efficient\n in the (usual) case that the element has no constraints. Also fixed for the\n case that an element has constraints in terms of its *own* dofs, (not others)\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@870 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 1429} | |
{"target": 1, "func": "[PATCH] Use more efficient iterators in MeshCommunication", "idx": 442} | |
{"target": 1, "func": "[PATCH] Check in changes to improve performance for nb_puts and\n nb_gets on GPU-hosted global arrays.", "idx": 783} | |
{"target": 1, "func": "[PATCH] Fixed race condition in GPU restrictor. Improved performance\n of both prolongator and restrictor using compile-time evaluated fine-spin ->\n coarse-spin mapper instead of an array.", "idx": 1410} | |
{"target": 0, "func": "[PATCH] Improve mdrun performance user guide\n\n- Extended GPU update related section;\n- Tweaked the bonded GPU offload related section to reflect the current\n state of the code.\n\nChange-Id: I89ed39750d449df8b273f36ee6530df29ff31b7e", "idx": 27} | |
{"target": 1, "func": "[PATCH] Modified locate region to use more efficient algorithms for\n most block-cyclic distributions.", "idx": 902} | |
{"target": 1, "func": "[PATCH] MultiReduce kernels now instantiate a series of power of two\n block sizes: this improves performance significantly", "idx": 225} | |
{"target": 1, "func": "[PATCH] Unroll dynamic indexing in SYCL gather kernel\n\nThread ID-based dynamic indexing of constant memory data in the SYCL\ngather kernel caused a large amount of register spills and poor\nperformance of the gather kernel on AMD. Avoiding dynamic indexing\neliminates spills and improves performance of this kernel >10x.\n\nRefs #3927", "idx": 244} | |
{"target": 1, "func": "[PATCH] More efficient generation of random numbers", "idx": 1206} | |
{"target": 1, "func": "[PATCH] replaced std::endl with \\n in all file IO and stringstreams. \n std::endl forces a flush, which kills performance on some machines", "idx": 711} | |
{"target": 1, "func": "[PATCH] Performance improvements for find_*_neighbors, and a new\n find_point_neighbors version for finding neighbors at just one point", "idx": 378} | |
{"target": 1, "func": "[PATCH] Improved performance of Python State objects", "idx": 921} | |
{"target": 1, "func": "[PATCH] On behalf of Ryan Olson: Checking in the changes for server\n side registration to improve performance", "idx": 395} | |
{"target": 1, "func": "[PATCH] pathf90 v2.1 better performance with ro=1 vs. ro=2", "idx": 1538} | |
{"target": 1, "func": "[PATCH] reworked the project_vector to be more efficient for Lagrange\n elements. Changed the corresponding calls in reinit(). amr.cc now tests the\n projection stuff.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@279 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 1346} | |
{"target": 1, "func": "[PATCH] use local variables for more efficient force accumulation", "idx": 1112} | |
{"target": 1, "func": "[PATCH] Optimize the performance of daxpy by using universal\n intrinsics", "idx": 493} | |
{"target": 1, "func": "[PATCH] Improved consistency of the ApplyPackedReflectors routines\n (as well as performance in several cases) and several more implementations,\n fixed mistakes in the build section of the documentation, added a short\n description of the new SVD function, and fixed mistakes in HouseholderSolve\n after adding a simple example driver.", "idx": 1230} | |
{"target": 1, "func": "[PATCH] issue #2389: re-adding MapSum Function for efficient\n reduce_in/out", "idx": 891} | |