{"target": 1, "func": "[PATCH] s390x: Use new sgemm kernel also for strmm on Z14 and newer\n\nEmploy the newly added GEMM kernel also for STRMM on Z14. The\nimplementation in C with vector intrinsics exploits FP32 SIMD operations\nand thereby gains performance over the existing assembly code. Extend\nthe implementation for handling triangular matrix multiplication,\naccordingly. As added benefit, the more flexible C code enables us to\nadjust register blocking in the subsequent commit.\n\nTested via make -C test / ctest / utest and by a couple of additional\nunit tests that exercise blocking.\n\nSigned-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>", "idx": 495} | |
{"target": 0, "func": "[PATCH] added benchmark comparing the performance of the two traits\n classes (CK vs CORE::Expr)", "idx": 1040} | |
{"target": 0, "func": "[PATCH] Auto cache compiled CUDA kernels on disk to speed up\n compilation (#2848)\n\n* Adds CMake variable AF_CACHE_KERNELS_TO_DISK to enable kernel caching. It is turned ON by default.\n* cuda::buildKernel() now dumps cubin to disk for reuse\n* Adds cuda::loadKernel() for loading cached cubin files\n* cuda::loadKernel() returns empty kernel on failure\n* Uses XDG_CACHE_HOME as cache directory for Linux\n* Adds common::deterministicHash() - This uses the FNV-1a hashing algorithm for fast and reproducible hashing of string or binary data. This is meant to replace the use of std::hash in some places, since std::hash does not guarantee its return value will be the same in subsequent executions of the program.\n* Write cached kernel to temporary file before moving into final file. This prevents data races where two threads or two processes might write to the same file.\n* Uses deterministicHash() for hashing kernel names and kernel binary data.\n* Adds kernel binary data file integrity check upon loading from disk", "idx": 719} | |
{"target": 0, "func": "[PATCH] Workaround for libHilbert bug until we manage to get it\n fixed. Part of this workaround may be permanent - the long term fix may\n require efficient user code to call some reinitialization, renumbering\n function manually after reading in a mesh and solution.", "idx": 561} | |
{"target": 1, "func": "[PATCH] Wrote the compute_children_node_keys() function in elem.h\n which allows one to generate appropriate node keys while reading in a mesh\n with multiple refinement levels. This allows us to avoid a linear search in\n the MeshRefinement::add_point routine since all the nodes can now be found in\n the nodes hash table. The resulting performance improvement was significant.", "idx": 287} | |
{"target": 0, "func": "[PATCH] Mesh_3: Address TBB performance warning on hashing", "idx": 12} | |
{"target": 0, "func": "[PATCH] Minor fixes to mdrun performance documentation\n\nIn \"Examples for mdrun on one node,\" third example description, the respective number\nof thread-mpi ranks and OpenMP threads per rank were reversed.\n\nIn \"Examples for mdrun on one node,\" 6th example. For 12 logical cores, the pinoffsets\nshould be 0 and 6, respectively (I think)\n\nA few command line examples of running mdrun with more than 1 node used gmx rather\nthan gmx_mpi\n\nSeveral spelling/grammar/tense error/linking issues addressed.\n\nChange-Id: I014bc52d55cda1cbd05843cb8e960c2a2d7cbb47", "idx": 925} | |
{"target": 0, "func": "[PATCH] Added workaround for performance issue in CasADi 2.4", "idx": 1161} | |
{"target": 0, "func": "[PATCH] performance analysis, use of multipoles for long range bits,\n lots of additional screening", "idx": 1353} | |
{"target": 0, "func": "[PATCH] In the performance miniapps, print the SIMD width in terms of\n \"doubles\".", "idx": 1194} | |
{"target": 0, "func": "[PATCH] Refactor UpdateTree() to sometimes Hamerly prune. We aren't\n properly retaining pruned nodes between iterations, but this is definitely a\n start and it's basically as fast as any of these attempted algorithms I've\n written.", "idx": 233} | |
{"target": 0, "func": "[PATCH] Only use AVX512 in own-FFTW if GROMACS also uses it\n\nBuilding the own FFTW with AVX512 enabled for all AVX-flavors means that\nan AVX2 build can end up losing a significant amount of performance due\nto clock throttle if the FFTW auto-tuner inadvertently picks an AVX512\nkernel. This is not unlikely as measurements at startup are very noisy\nand often lead to inconsistent kernel choice (observed in practice).\n\nChange-Id: I857326a13a7c4dd1a6f5ab44360211301b05d3ac", "idx": 666} | |
{"target": 0, "func": "[PATCH] Performance updates...EJB", "idx": 384} | |
{"target": 0, "func": "[PATCH] Fix clang-3.7 build warnings in core unit and performance\n tests\n\nFix mismatched use of array new and scalar delete\nT * data = new T[1];\n...\ndelete [] data;", "idx": 1114} | |
{"target": 0, "func": "[PATCH] Code for ordinary, slow, ewald electrostatics", "idx": 1546} | |
{"target": 0, "func": "[PATCH] Accelerate KSInitialization tests by using fewer CV folds and\n fewer training epochs as well as relaxed tolerances.", "idx": 1364} | |
{"target": 1, "func": "[PATCH] changed the ndelta to setting the number of cells per cut-off\n radius to per cut-off diameter and changed the default value to 3 in\n do_inputrec, this gives at most 64 cells per icg iso 125, this gives a few\n percent performance increase in ns and domain decomposition (the cg sorting)", "idx": 56} | |
{"target": 0, "func": "[PATCH] CDGW now working; slow", "idx": 372} | |
{"target": 0, "func": "[PATCH] Increase release build timeout, macos is being a bit slow on\n Azure (#2495)", "idx": 1165} | |
{"target": 0, "func": "[PATCH] Initial check in of performance test for read-only property", "idx": 858} | |
{"target": 1, "func": "[PATCH] Possible performance enhancement in Mesh::delete_elem. We use\n the passed Elems id() as a guess for the location of the Elem in the\n _elements vector. If the guess does not succeed, then we revert to the linear\n search.", "idx": 1029} | |
{"target": 0, "func": "[PATCH] rPolynomial: New equation of state for liquids and solids\n\nDescription\n Reciprocal polynomial equation of state for liquids and solids\n\n \\f[\n 1/\\rho = C_0 + C_1 T + C_2 T^2 - C_3 p - C_4 p T\n \\f]\n\n This polynomial for the reciprocal of the density provides a much better fit\n than the equivalent polynomial for the density and has the advantage that it\n supports coefficient mixing to support liquid and solid mixtures in an\n efficient manner.\n\nUsage\n \\table\n Property | Description\n C | Density polynomial coefficients\n \\endtable\n\n Example of the specification of the equation of state for pure water:\n \\verbatim\n equationOfState\n {\n C (0.001278 -2.1055e-06 3.9689e-09 4.3772e-13 -2.0225e-16);\n }\n \\endverbatim\n Note: This fit is based on the small amount of data which is freely\n available for the range 20-65degC and 1-100bar.\n\nThis equation of state is a much better fit for water and other liquids than\nperfectFluid and in general polynomials for the reciprocal of the density\nconverge much faster than polynomials of the density. Currently rPolynomial is\nquadratic in the temperature and linear in the pressure which is sufficient for\nmodest ranges of pressure typically encountered in CFD but could be extended to\nhigher order in pressure and/or temperature if necessary. The other huge advantage\nin formulating the equation of state in terms of the reciprocal of the density\nis that coefficient mixing is simple.\n\nGiven these advantages over the perfectFluid equation of state the libraries and\ntutorial cases have all been updated to use rPolynomial rather than perfectFluid\nfor liquids and water in particular.", "idx": 46} | |
{"target": 0, "func": "[PATCH] Adding performance patch for trmm, just like #2836", "idx": 723} | |
{"target": 0, "func": "[PATCH] fewer qas for slow mpi-pt", "idx": 910} | |
{"target": 0, "func": "[PATCH] BJP: Initial checkin of test program for performance\n evaluation of DRA routines in 3 dimensions.", "idx": 643} | |
{"target": 0, "func": "[PATCH] Add TaskDAG performance test to CMake", "idx": 137} | |
{"target": 0, "func": "[PATCH] Random engines & distributions as proper C++11 classes\n\nThis change implements the ThreeFry2x64 random engine with flexible\nnumber of encryption rounds and internal counter bits. The class is\ncompatible with the C++11 random number generators, and the GROMACS\ntabulated normal distribution has likewise been turned into a\nrandom distribution compatible with C++11, meaning they can be used in\nalmost any combination with the standard library distributions.\n- The ThreeFry2x64 implementation uses John Salmon's idea of a template-\n selected internal counter so a number of bits are reserved to generate\n an arbitrary random stream. This makes it possible to use ThreeFry as\n a normal random engine, and even in counter mode it is possible to\n draw an arbitrary amount of random numbers before restarting counters.\n- Both accurate (20-round) and fast (13-round) versions are available.\n- There is a gmx::DefaultRandomEngine when we don't care about details.\n- gmx::GammaDistribution has been added to work around bugs in\n libstdc++-4.4.7 headers, and to avoid getting different results\n for libstdc++ vs. libc++.\n- Custom Uniform, normal, and exponential distributions have been added\n to make all results reproducible across platforms since stdlibc++ and\n libc++ do not use the same generating algorithms.\n- Code using random numbers has been updated, but no changes have been\n made to turn random seeds into 64bits yet.\n- The selection nbsearch unit test was a bit fragile and very sensitive\n to the coordinate specific values; this has been fixed so it should\n be resilient no matter what RNG is used in the future.\n\nChange-Id: I47a04d03e2f264e1a6ef0aa0a2174cb464ed9af7", "idx": 1155} | |
{"target": 0, "func": "[PATCH] performance miniapps: don't enable -march=native on ARM\n\nSee e.g. https://stackoverflow.com/questions/65966969/why-does-march-native-not-work-on-apple-m1", "idx": 1553} | |
{"target": 0, "func": "[PATCH] Made a pass on Performance tutorials in docs.", "idx": 622} | |
{"target": 0, "func": "[PATCH] Add Z-batch to fast kernels", "idx": 1052} | |
{"target": 0, "func": "[PATCH] a slow version of triangle split", "idx": 204} | |
{"target": 0, "func": "[PATCH] Enable fp-exceptions\n\nThis can help with finding errors quicker because mdrun crashes as soon\nas a floating point value overflows or is invalid. fp-exceptions are\nonly enabled for builds with asserts (without NDEBUG), mainly because\nit isn't always possible to avoid invalid fp operations for SIMD math\nwithout a performance penalty.\n\nAlso, fix a few places where we had 1/0 or other invalid fp operations.\n\nFixes #1582\n\nChange-Id: Ib1b3afc525706f4b171564fcaf08ebf3b2be3122", "idx": 763} | |
{"target": 0, "func": "[PATCH] SYCL NBNXM offload support\n\nAssociated changes:\n\n- Added function stubs to PME: necessary for compilation.\n- Stricter SYCL hardware compatibility checks: limits on subgroup size\n and the availability of local memory.\n- The kernel implementation and overall logic closely follow the OpenCL\n implementation. Divergences are documented locally.\n\nLimitations:\n\n- No fine-grained timings yet.\n- Code-duplication with CUDA and OpenCL: see #2608.\n- Minor differences in local/nonlocal synchronization: see #3895,\n related to #2608.\n- Only the OpenCL backend was extensively tested. LevelZero works fine\n without MPI but stalls due to a known bug. The fix for DPCPP runtime\n is available, but not yet part of any OneAPI release:\n https://github.com/intel/llvm/pull/3045.\n- The complex/position-restraints regression test fails: see #3846.\n- No performance tuning: see #3847.\n\nPerformance on rnase-cubic system is similar to OpenCL implementation.", "idx": 1277} | |
{"target": 0, "func": "[PATCH] Added parentheses to performance logging messages\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1518 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 496} | |
{"target": 0, "func": "[PATCH] Modified operations so that MPI_Type_free is called after\n wait. Also tried adding an MPI_Win_flush_local call to force both local and\n remote completion before calling MPI_Type_free, but does not seem to get rid\n of failures in some of the performance tests.", "idx": 1130} | |
{"target": 0, "func": "[PATCH] Updated article reference. Added configuration flags to\n support altivec on powerpc. Updated configuration files from gnu.org Removed\n the fast x86 truncation when double precision was used - it would just crash\n since the registers are 32bit only. Added powerpc altivec innerloop support\n to fnbf.c Removed x86 assembly truncation when double precision is used - the\n registers can only handle 32 bits. Dont use general solvent loops when we\n have altivec support in the normal ones. Added workaround for clash with\n altivec keyword. Added workaround for clash with altivec keyword.", "idx": 104} | |
{"target": 0, "func": "[PATCH] MDRange: Minor perf test fixes for KNL\n\nTests are still commented out for now, but removed running a slow test,\nreduced the number of tests, and made minor fixes when checking results\nduring the first iteration (vectorization may have occurred and made\nthe check no longer bitwise correct; epsilon comparison should be added\nas well as option to not check correctness which will speed up the tests)", "idx": 1556} | |
{"target": 0, "func": "[PATCH] Knowing when the tree fails and we're stuck with a linear\n search is useful for performance testing too", "idx": 886} | |
{"target": 0, "func": "[PATCH] Continued stats refactor + subarray stats (#2200)\n\nTYPE: IMPROVEMENT\nDESC: Added additional stats for subarrays and subarray partitioners", "idx": 255} | |
{"target": 0, "func": "[PATCH] Fixup re-enable core performance tests\n\nThese had been inadvertently disabled in #3839\n\nCo-Authored-By: Nick Curtis <nicholas.curtis@amd.com>\nCo-Authored-By: Bruno Turcksin <bruno.turcksin@gmail.com>", "idx": 281} | |
{"target": 0, "func": "[PATCH] Fixing a bug in measuring performance (didn't reset the\n timer)", "idx": 731} | |
{"target": 0, "func": "[PATCH] Mark closed flag in EigenSparseVector regardless of METHOD\n\nI don't think it's a performance hit to set these flags\nregardless of METHOD, and it's a nice state flag that the user\ncan check", "idx": 810} | |
{"target": 1, "func": "[PATCH] Fix performance for range copy", "idx": 1320} | |
{"target": 0, "func": "[PATCH] Update performance test case (to use polar_decomposition)\n Make polar default closest rotation computer", "idx": 426} | |
{"target": 0, "func": "[PATCH] Added upper triangular fast tridiagonalization routines.", "idx": 704} | |
{"target": 0, "func": "[PATCH] fast nary union in OFF2nef3 also constructor finds more\n problems and handles them", "idx": 541} | |
{"target": 0, "func": "[PATCH] wallDist/patchDistMethods/Poisson: New method for fast\n calculation of an approximate wall-distance field by solving Poisson's\n equation", "idx": 660} | |
{"target": 0, "func": "[PATCH] Added support for gettimeofday() when available to get\n microsecond resolution for wallclock time. This enables accurate performance\n benchmarks from short simulations even in parallel. When gettimeofday() is\n not available, we use time(). Time is still stored as seconds, but now as\n double instead of time_t.", "idx": 852} | |
{"target": 0, "func": "[PATCH] some more prints to debug comex_malloc performance", "idx": 700} | |
{"target": 0, "func": "[PATCH] Avoiding redundant metadata calculations in order to\n accelerate AbstractDistMatrix::QueueUpdate and\n AbstractDistMatrix::ProcessQueues (as well as generalizing\n AbstractDistMatrix::ProcessPullQueue to support not including 'viewer'\n processes)", "idx": 515} | |
{"target": 0, "func": "[PATCH] Expanded the GetIrTest behavior to enable efficient testing", "idx": 251} | |
{"target": 0, "func": "[PATCH] IQN-ILS modified such that the communication of the rhs uses\n the efficient MPI_Reduce operation if MasterSlave comm is configured with\n mpi-single, i.e., MPIDirect. All tests are working ... still check coupling\n iterations, i.e., performance of IQN-ILS", "idx": 893} | |
{"target": 0, "func": "[PATCH] Added support for manual load balancing with a -load option\n for grompp. It takes the relative performance of each of the processors in\n your system in arbitrary units, and normalizes it.", "idx": 1343} | |
{"target": 0, "func": "[PATCH] Performance updates ....EJB", "idx": 818} | |
{"target": 0, "func": "[PATCH] added driver for testing performance", "idx": 1168} | |
{"target": 0, "func": "[PATCH] WIP fast single-element neighbor calculation.", "idx": 1359} | |
{"target": 0, "func": "[PATCH] Move performance logging of solve()s into solver classes", "idx": 792} | |
{"target": 0, "func": "[PATCH] Rename GPU launch/wait cycle counters\n\nIn preparation for the PME GPU task and GPU launch overhead to be\ncounted together in the same counter for all GPU tasks, the current main\ncounters have been renamed to be more general. The label of GPU waits in\nthe performance table has also been renamed to reflect the task name.\nAdditionally a non-bonded specific sub-counter has been added.\n\nChange-Id: I65a15b0090c1ccebb300cf425c7b3be4100e17a0", "idx": 1360} | |
{"target": 0, "func": "[PATCH] 1 : Store performance function 2 : Pass error into Error\n function 3 : Pass network into Error function 4 : Add public api to access\n underlying network 5 : Use perfect forwarding to accept LayerTypes", "idx": 990} | |
{"target": 0, "func": "[PATCH] Add a CMake warning about FFTW with --enable-avx\n\nFFTW_MEASURE runs single-threaded tests for FFT performance, which is\nvery different from the GROMACS usage pattern, particularly with how\nthe cache access pattern works. In practice, with FFTW 3.3.2 and\n3.3.3, the performance of FFTW with --enable-avx is considerably\nworse than that of FFTW with --enable-sse or --enable-sse2. It's\nunlikely but theoretically possible that performance might change,\nso we prompt the user both to avoid --enable-avx now, and to\nperhaps consider it in the future.\n\nChange-Id: Ib4906645587cfc6a6306a7f7f46d612a6446b156", "idx": 952} | |
{"target": 0, "func": "[PATCH] Create even less contention\n\nAll the readers are waiting on our writer thread to mark our condition\n(`_array_is_present`) as ready before they can even attempt to read, so\nthere is no data race for our writer thread. Hence we can remove the\nlock on the mutex the readers are using. And it's much better this way\nbecause one could imagine that our writer hits the `std::unique_lock`\nfirst which would then prevent our reader threads from even getting to\nthe condition variable `wait`, which is the logical place we want them\nto get to while the writer thread is doing its job.\n\nAnd the second change is unlocking as soon as we're through waiting\nbecause we are through the read-write portion of the program and are\nonly reading so it's safe to let everyone through at once", "idx": 1507} | |
{"target": 0, "func": "[PATCH] remove int to bool conversion performance warning with VC", "idx": 86} | |
{"target": 0, "func": "[PATCH] Read tiles: fixing preallocation size for var and validity\n buffers. (#2781) (#2782)\n\nPreallocation size for var buffer and validity buffer was not using the\ncorrect size, which will have a performance impact.", "idx": 47} | |
{"target": 0, "func": "[PATCH] --enable-distmesh, --with-mapvector-chunk-size\n\nThe parmesh argument probably should have been deprecated when the\nParallelMesh name was.\n\nWe'll want to select chunked_mapvector array size at configure time,\nsince the exact performance optimization/pessimization results are\nlikely to be system-dependent.", "idx": 812} | |
{"target": 0, "func": "[PATCH] vector<vector> specialization for parallel_sync\n\nWe can't do this the efficient way in the general algorithm, so let's\ndo it as best we can right now, with blocking receives.\n\nThis should be replaced by Derek's algorithm in #1684 as soon as we\ncan support that.", "idx": 282} | |
{"target": 0, "func": "[PATCH] Interactive Molecular Dynamics (IMD)\n\nIMD allows to interact with and to monitor a running molecular dynamics\nsimulation. The protocol goes back to 2001 (\"A system for interactive\nmolecular dynamics simulation\", JE Stone, J Gullingsrud, K Schulten,\nP Grayson, in: ACM symposium on interactive 3D graphics, Ed. JF Hughes\nand CH Sequin, pp. 191--194, ACM SIGGRAPH). The user can watch the\nrunning simulation (e.g. using VMD) and optionally interact with\nit by pulling atoms or residues with a mouse or a force-feedback\ndevice.\nCommunication between GROMACS and VMD is achieved via TCP sockets\nand thus enables controlling an mdrun locally or one running on a\nremote cluster. Every N steps, mdrun receives the applied forces from\nthe VMD client and sends the new positions to VMD.\nOther features:\n- correct PBC treatment, molecules of a (parallel) simulation are made\n whole (with respect to the configuration found in the .tpr file)\n- in the .mdp file, one can define an IMD group (including the protein\n but not the water for example is useful). Only the coordinates of\n atoms belonging to this group are then transferred between mdrun and\n VMD. This can be used to reduce the performance impact to an almost\n negligible level.\n- adds only two single-line function calls in the main MD loop\n- an mdrun test fixture checks whether grompp and mdrun understand\n the IMD options\n\nChange-Id: I235e07e204f2fb77f05c2f06a14b37efca5e70ea", "idx": 854} | |
{"target": 0, "func": "[PATCH] Python 3 does not have dict.itervalues\n\nReally we should move to using six and importing the iterator versions of dict\nmethods and range/zip from there, but for now just use \"values\" since using a\nraw list doesn't seem like it will cause memory or performance problems here.", "idx": 578} | |
{"target": 0, "func": "[PATCH] DLB can now turn off, when slower\n\nUnder certain conditions, especially with (shared) GPUs, DLB can\ndecrease the performance. We now measure the cycles per step before\nturning on DLB. When the running average of cycles per step with DLB\ngets above the average without DLB, we turn off DLB. We then measure\nagain without DLB. DLB can then turn on again. If we turn on DLB of\nDLB multiple times in close succession and we measure performance loss,\nwe keep DLB off for the remainder of the run. This procedure ensures\nthat the performance will never deteriorate due to DLB.\nUpdated and expanded the DLB section in the manual.\n\nChange-Id: I6e0291c1a41adf6da94fae46d36e0fcb95585a02", "idx": 689} | |
{"target": 0, "func": "[PATCH] Added configure flag to disable FFTW measurements, to enable\n binary reproducible runs. Note that this typically WILL deteriorate\n performance, so it is usually better to run the optimized versions and use\n the -reprod flag to mdrun when you need binary identity. However, if you\n compile FFTW3 with SSE support (which is NOT the default) the selected\n kernels seems to be close-to-optimal even without measurements, and then you\n can use this option to always get binary reproducible runs.", "idx": 1317} | |
{"target": 0, "func": "[PATCH] reformatted the flops and performance output", "idx": 1413} | |
{"target": 0, "func": "[PATCH] Fix fast for CUDA 9. Use CUB library for reductions\n\nFAST was failing on CUDA 9 because of insufficient synchronization in the\nreduction of the non_max_count function. The reduction is now implemented\nusing BlockReduce from CUB.\n\nThis also adds CUB as a dependency which is brought in as a submodule.", "idx": 953} | |
{"target": 0, "func": "[PATCH] 1. BoomerAMG keeps track of the number of iterations\n accumulated over all calls. This is needed for user-level performance\n monitoring if it is a preconditioner for a Krylov method such as PCG. The\n regular iteration count only tells you about the last time PCG invoked\n BoomerAMG. There are ifdefs so you can eliminate this if you like - remove\n #define CUMNUMIT.\n\n2. very minor code fixes, comments, etc.", "idx": 729} | |
{"target": 0, "func": "[PATCH] problem with c-macro? but slow fabs works fine", "idx": 624} | |
{"target": 0, "func": "[PATCH] Use Evaluate(const arma::mat& parameters) instead of\n Evaluate(const arma::mat& parameters, const size_t i) to calculate the\n objective and to accelerate the evaluation process.", "idx": 530} | |
{"target": 1, "func": "[PATCH] Better workaround of g++ 4.1 optimizer bug:\n -fno-strict-aliasing. Performance penalty is 5% vs 24% with -O", "idx": 971} | |
{"target": 0, "func": "[PATCH] Reducing dependencies. Print functions are generally not\n fast anyway, inlining them leads to unnecessary dependencies and larger\n headers. Removing print functions from headers.", "idx": 286} | |
{"target": 0, "func": "[PATCH] -added support for the new version of RS -fixed some minor\n bugs -now the kernel uses directly the extremely fast RS refinement function\n -updated the generic kernel tests accordingly", "idx": 922} | |
{"target": 0, "func": "[PATCH] additional pre-compiler command for performance testing", "idx": 16} | |
{"target": 1, "func": "[PATCH] Add doSetup parameter to Matrix::init\n\nNot calling doSetup can give you performance gains when using\npreallocation on the matrix.", "idx": 1054} | |
{"target": 0, "func": "[PATCH] Don't resize() as that assembles the matrix and makes\n add_coef() slow", "idx": 200} | |
{"target": 0, "func": "[PATCH] update test (Cactus_deformation_session.cpp): make it\n suitable for test performance (not active by default) make it suitable for\n test suite (precomputed mesh difs active by default)", "idx": 201} | |
{"target": 0, "func": "[PATCH] output detailed multi-thread performance data only with\n \"timer full\"", "idx": 1177} | |
{"target": 0, "func": "[PATCH] log_name would be unused without perf_log on\n\nSo if we're not performance logging, we need to comment out that\nvariable entirely to avoid an unused variable warning.", "idx": 529} | |
{"target": 0, "func": "[PATCH] Cleanup of the performance test", "idx": 1265} | |
{"target": 0, "func": "[PATCH] Disks are too fast, fix formatting of speed as *****", "idx": 397} | |
{"target": 1, "func": "[PATCH] Deprecate version of BoundaryInfo::boundary_ids(const Node*)\n that returns a vector.\n\nAdd new version that must be passed a std::set. The new version\nshould be more efficient for making repeated calls to boundary_ids(),\nsince the container does not need to be created and destroyed\nrepeatedly...", "idx": 1391} | |
{"target": 1, "func": "[PATCH] performance improvement through avoiding function call and\n dereference overhead\n\n- make i_to_potl() and ij_to_potl() functions inline and const\n- don't dereference inside the functions, but cache, if possible in external variables\n=> up to 15% speedup.", "idx": 1372} | |
{"target": 0, "func": "[PATCH] turned performance fix for 7.30 compilers off by default", "idx": 806} | |
{"target": 1, "func": "[PATCH] Use range for in dof_map.C\n\nWe can use more efficient iterators in a couple places here too.", "idx": 191} | |
{"target": 1, "func": "[PATCH] Greatly improved the performance of copy::GeneralPurpose by\n exploiting the tensor product structure of the integer metadata calculations", "idx": 626} | |
{"target": 1, "func": "[PATCH] Update RAS-IR.\n\nNotes:\n1. Overlap and row ordering have significant effects on convergence.\n2. Lower the matrix bandwidth, better the performance ?\n3. Sync is lower, but communication becomes higher, tradeoff!\n4. Parallelism is higher, but overhead is also higher, tradeoff!\n5. Very high accurate solves, do not provide anything.\n6. Two level and multi-levels might significantly reduce number of\niterations to converge.", "idx": 657} | |
{"target": 1, "func": "[PATCH] resolved performance-degrading change introduced in revision\n 1319", "idx": 1022} | |
{"target": 1, "func": "[PATCH] Tiny libmesh_assert_valid_boundary_ids speedup\n\nThis is actually pretty slow in parallel. Truly making it faster\nwould require lumping multiple verification communications together,\nbut eliminating redundant verification is a start, at a slightly lower\npenalty to readability and usability.", "idx": 3} | |
{"target": 1, "func": "[PATCH] Added preconditions and made it more efficient", "idx": 655} | |
{"target": 1, "func": "[PATCH] Added single-accuracy SIMD double math functions\n\nApart from double SIMD variables typically being\nhalf the width of single, the math functions are\nconsiderably more expensive due to higher-order\npolynomials, which can drop the throughput to 25%\nof single. In some cases we do not need the full\ndouble precision in SIMD operations, so these\nnew math functions use double precision\nSIMD variables but only target single precision\naccuracy, which can improve performance twofold.\nThe patch also makes the target precision in\nsingle and double SIMD an advanced CMake variable,\nand the unit test tolerance is set based on these\nvariables. This can be used (decided by the user)\nfor a few platforms where the rsqrt/inv table\nlookups provide one bit too little to get by\nwith a single N-R iteration based on our default\ntarget accuracy of 22 bits.\n\nChange-Id: Id4b1c7800e16cb0eb3d564e89a368b4db6eede3e", "idx": 89} | |
{"target": 1, "func": "[PATCH] Performance improvements to kd_tree_test, added peer bounds\n checking.", "idx": 1021} | |
{"target": 1, "func": "[PATCH] Reworked function 'get_best_weight()' of Slivers_exuder.h\n\n- Avoid computing incident cells multiple times\n- Make it more efficient for P3M3 by not having to use\n tr.min_squared_distance() to compute the distance\n between neighboring vertices", "idx": 716} | |
{"target": 1, "func": "[PATCH] #1295 Refactored Matrix::setSub(IMatrix,IMatrix) The new\n implementation should be much more efficient and handle non-monotone indices\n correctly", "idx": 117} | |
{"target": 1, "func": "[PATCH] add (T) kernels optimized for OpenMP+SIMD\n\nThese kernels are taken from https://github.com/jeffhammond/nwchem-tce-triples-kernels/,\nwhich were previously part of private development branch of NWChem hosted by Argonne.\nThe code was developed by Jeff Hammond from 2013-2014 with help from Karol Kowalski.\n\nThese kernels have been tested on Intel Xeon, Intel Xeon Phi, IBM Blue Gene/Q,\nIBM POWER7, AMD Bulldozer and ARM32 processors using the Intel, Cray, IBM XL,\nand GCC compilers. In rare instances, the optimal loop order is different between\nIntel, Cray and IBM compilers. In such cases, we default to the Intel compiler case\nbecause it is the most commonly used Fortran compiler for NWChem. In particular,\nNWChem as a whole cannot be compiled with Cray Fortran, so the only context in which\nit would be used for these kernels is if someone did a mixed build.\nThe performance differences with XLF were observed on POWER7, which is a relatively\nrare platform for NWChem.\n\nIn any case, these optimizations are better than the serial version any time OpenMP\nis used. Detailed performance information for some platforms can be found at\nhttps://github.com/jeffhammond/nwchem-tce-triples-kernels/tree/master/results.\n\nFinally, it should be noted that all of Jeff Hammond's developments for non-Intel\narchitectures were done prior to his employment at Intel, which can be verified\nfrom the Github commit log associated with the aforementioned repo.", "idx": 737} | |
{"target": 1, "func": "[PATCH] Improved performance of interaction groups on CPU", "idx": 1419} | |
{"target": 1, "func": "[PATCH] reverted commit 76d5bddd5c3dfdef76beaab8222231624eb75e89.\n Split ga_acc in moints2x_trf2K in smaller ga_acc on MPI-PR since gives large\n performance improvement on NERSC Cori", "idx": 989} | |
{"target": 1, "func": "[PATCH] wmkdep: Added path string substitution support\n\nto avoid the need for sed'ing the output. This improves performance by avoiding\nthe need for calling additional commands and generating a temporary file.", "idx": 1554} | |
{"target": 1, "func": "[PATCH] Replaced Vertex_handle and Cell_handle by const & versions \n in order to regain performance", "idx": 338} | |
{"target": 1, "func": "[PATCH] performance improvement for DD assignment of settles with\n cg's", "idx": 1039} | |
{"target": 0, "func": "[PATCH] AABB tree: the projection does not construct the KD-tree at\n the first projection query anymore. for efficient projection queries either\n the user calls for its explicit construction during the AABB construction or\n calls \"construct_search_Tree()\". otherwise the first primitive reference\n point is used as (naive) hint.", "idx": 235} | |
{"target": 1, "func": "[PATCH] Improved ORB performance and memory usage on CUDA backend", "idx": 635} | |
{"target": 1, "func": "[PATCH] Introduce self-pairs search in nbsearch\n\nMake it possible to search for all pairs within a single set of\npositions using AnalysisNeighborhood. This effectively excludes half of\nthe pairs from the search, speeding things up.\n\nNot used yet anywhere, but this makes the code a better reference for\nperformance comparisons, and for places where this is applicable it has\npotential for speeding things up quite a bit.\n\nChange-Id: Ib0e6f36460b8dbda97704447222c864c149d8e56", "idx": 127} | |
{"target": 1, "func": "[PATCH] Improve the performance of the `Record` logger by using\n deques of `std::unique_ptr` instead of plain object.", "idx": 811} | |
{"target": 1, "func": "[PATCH] Used sparse identity for more efficient memory utilization", "idx": 1502} | |
{"target": 1, "func": "[PATCH] bond/react: efficient competing reactions", "idx": 280} | |
{"target": 1, "func": "[PATCH] Fixed typo in HegstRLVar3/HegstRUVar3 and improved\n performance of Trmm and Symm for relatively small numbers of right-hand\n sides.", "idx": 399} | |
{"target": 0, "func": "[PATCH] Add initial reinit_func(1.) call in Euler2Solver\n\nWe need the call to reinit_func to set the time, t, in the context\nto the correct value. Also added clarifying comments in EulerSolver\nand Euler2Solver that we're also setting the time in addition to\npossibly resetting the mesh if there's mesh motion.\n\nThere is probably a way to make this more efficient such that we\nonly call reinit_func twice in Euler2Solver, but I didn't put any\nthought into it.", "idx": 371} | |
{"target": 1, "func": "[PATCH] Changed number of nonbonded thread blocks to improve\n performance", "idx": 788} | |
{"target": 0, "func": "[PATCH] Removed nbnxn kernel blendv optimization\n\nThe nbnxn simd kernel blendv optimization, which was accidentally\ndeactivated since 5.0, has been removed. It made assumptions about\nthe internal storage of SIMD representations. With gcc 4.x blendv\nwould give a small performance improvement, but with gcc 5 performance\nis equal or deteriorates.\n\nChange-Id: I2b07895257a2fde0ade2a627369ed22683dd89e1", "idx": 1158} | |
{"target": 1, "func": "[PATCH] Fix performance problems for large molecular systems", "idx": 1061} | |
{"target": 1, "func": "[PATCH] Optimize the script (from 20 minutes to 0.3 seconds!)\n\n- avoid opening and reading the file `processed_test_results` thousands\n of time: its content is stored once in a hash, for fast lookup,\n\n- do not call `fuser` for files that are already processed", "idx": 978} | |
{"target": 1, "func": "[PATCH] more efficient live variables in SX virtual machine", "idx": 226} | |
{"target": 1, "func": "[PATCH] #1295 Refactored MX::setSub(IMatrix,IMatrix) The new\n implementation should be much more efficient and handle non-monotone indices\n correctly", "idx": 210} | |
{"target": 1, "func": "[PATCH] LJ combination rule kernels for OpenCL\n\nThe current implementation enables combination rules for both AMD and\nNVIDIA OpenCL (also ports the changes to the \"nowarp\" test/CPU kernel).\n\nLike in the CUDA implementation, all kernels support it, but only for\nplain cut-off are combination rules used.\n\nNotes:\n- On AMD tested on Hawaii, Fiji, Spectre and Oland devices;\n combination rules in all cases improve performance, although combined\n with the i-prefetching, the improvement is typically only ~10%.\n- On NVIDIA tested on Kepler and Maxwell; in most cases the combination\n rule kernels are fastest.\n However, with certain inputs these kernels are 25% slower on Maxwell\n (e.g. pure water box, cut-off LJ, pot shift), but not on Kepler.\n This is likely a compiler mis-optimization, so we'll just leave the\n defaults the same as AMD.\n\nChange-Id: I05396e000cdf93c1d872729e6b477192af152495", "idx": 1337} | |
{"target": 1, "func": "[PATCH] Performance Improvements - changed Cartesian to\n Simple_cartesian in the examples - changed list to vector in the code -\n removed unnecessary includes - introduced multipass_distance", "idx": 883} | |
{"target": 1, "func": "[PATCH] more efficient comparison function", "idx": 99} | |
{"target": 1, "func": "[PATCH] new, more efficient jacobian calculation for integrator", "idx": 447} | |
{"target": 1, "func": "[PATCH] Fix AMD OpenCL float3 array optimization bug\n\nBecause float3 by OpenCL spec is 16-byte, when used as an array type\nthe allocation needs to be optimized to avoid unnecessary register use.\nThe nbnxm kernels use a float3 i-force accumulator array in registers.\n\nStarting with ROCm 2.3 the AMD OpenCL compiler regressed and lost\nits ability to effectively optimize code that uses float3 register\narrays. The large amount of extra registers used limits the kernel\noccupancy and significantly impacts performance.\nOnly the AMD platform is affected, other vendors' compilers are able to\ndo the necessary transformations to avoid the extra register use.\n\nThis change converts the float3 array to a float[3] saving 8*4 bytes\nregister space. This improves nonbonded kernel performance\non an AMD Vega GPU by 25% and 40% for the most common flavor of the\nEwald and RF force-only kernels, respectively.\n\nNote that eliminating the rest of the non-array use of float3 has no\nsignificant impact.", "idx": 1545} | |
{"target": 1, "func": "[PATCH] increased granularity of performance logging, fixed a bug in\n DofMap::add_neighbors_to_send_list() which caused the _send_list to become\n excessively large. Further, this slowed the DofMap::sort_send_list() method\n considerably.", "idx": 881} | |
{"target": 1, "func": "[PATCH] Removed unnecessary flush of trn,xtc,edr. Important for\n performance for very frequent writes (on small systems). Fixed a bug related\n to setting the duty of pp/pme/io", "idx": 1145} | |
{"target": 1, "func": "[PATCH] Use fixed-length arrays, avoid heap allocation.\n\nAlso reduce the default number of elements so that it runs fast enough in DBG mode.", "idx": 840} | |
{"target": 1, "func": "[PATCH] Improve performance of GEMM for small matrices when SMP is\n defined.\n\nAlways checking num_cpu_avail() regardless of whether threading will actually\nbe used adds noticeable overhead for small matrices. Most other uses of\nnum_cpu_avail() do so only if threading will be used, so do the same here.", "idx": 1492} | |
{"target": 0, "func": "[PATCH] Grid-based utility nbsearch implementation.\n\nMore efficient implementations are possible, but the present one should\nwork reasonably well in most cases, also for triclinic cells, without too\nmuch complexity.", "idx": 517} | |
{"target": 1, "func": "[PATCH] Reduced the cost of the pull communication\n\nWith more than 32 ranks, a sub-communicator will be used\nfor the pull communication. This reduces the pull communication\nsignificantly with small pull groups. With large pull groups the total\nsimulation performance might not improve much, because ranks\nthat are not in the sub-communicator will later wait for the pull\nranks during the communication for the constraints.\n\nAdded a pull_comm_t struct to separate the data used for communication.\n\nChange-Id: I92b64d098b508b11718ef3ae175b771032ad7be2", "idx": 1008} | |
{"target": 1, "func": "[PATCH] make skylakex sgemm code more friendly for readers\n\nBTW some kernels were adjusted to improve performance", "idx": 1457} | |
{"target": 1, "func": "[PATCH] Minor performance improvements\n\nMostly useful as lesson learned.\n\n1) The double precision constant forces the compiler to convert\n the single precision variable to double, then do the multiplication\n in double and then convert back. Using the single precision\n constant in double reduces the accuracy (the calculation is still done\n double but the constant has only single precision).\n2) Using a temporary array instead of a temporary scalar causes ICC14 to\n generate an extra store.\n\nChange-Id: Ib320ac2ae4ff80ce48277544abff468c483cc83a", "idx": 20} | |
{"target": 1, "func": "[PATCH] efficient sparsity pattern computation for the case when the\n user specifies the DOF coupling", "idx": 793} | |
{"target": 1, "func": "[PATCH] Beginning to add support for freezing the sparsity pattern of\n graphs and sparse matrices to improve the performance of subsequent updates", "idx": 289} | |
{"target": 1, "func": "[PATCH] Improving performance of BigInt/BigFloat routines (such as\n Cholesky) by more than a factor of three by avoiding allocations within the\n templated BLAS routines", "idx": 559} | |
{"target": 0, "func": "[PATCH] Encapsulate code in ifdef NUMPY clauses. Efficient pythoncode\n for toArray.", "idx": 71} | |
{"target": 1, "func": "[PATCH] Enforce memory alignment to improve performance of vector\n operations. Also fixed bugs in an earlier optimization.", "idx": 1458} | |
{"target": 0, "func": "[PATCH] propagate lower bound for culling on TM1 to accelerate\n symmetric distance", "idx": 1011} | |
{"target": 1, "func": "[PATCH] Removed Reaction-Field-nec\n\nThe RF no exclusion correction option was only introduced for\nbackward compatibility and a performance advantage for systems\nwith only rigid molecules (e.g. water). For all other systems\nthe forces are incorrect. The Verlet scheme did not support this\noption and if it would, it wouldn't even improve performance.\n\nChange-Id: Ic22ccf76d50b5bb7951fcac2293621b5eef285c5", "idx": 1443} | |
{"target": 1, "func": "[PATCH] Replace vpermpd with vpermilpd in the Haswell DTRMM kernel\n\nto improve performance on AMD Zen (#2180) applying wjc404's improvement of the DGEMM kernel from #2186", "idx": 142} | |
{"target": 0, "func": "[PATCH] query t on side of bounded square\n\nI moved a lot of the functionality for deciding the Linf incircle\ntest for four points to the side of bounded square predicate.\n\nIn the case of query point t being on one of the sides of the\nbounded square, I use the predicate test1d. Maybe even this can\nbe optimized, or made even more robust with some more checks.\n\nA bug that is fixed with the current commit is in the following\ninput:\n\n$ cat ~/Dropbox/cgal/sdg/panos/sqch1a.cin\np -51 -180\np -180 -30\np -180 20\np -7 -180\n\nI also fixed a small bug when expanding both sides of the bounded\nsquare.\n\nThe next step is to completely remove the slow \"side of oriented\nsquare\" test.\n\nSigned-off-by: Panagiotis Cheilaris <philaris@cs.ntua.gr>", "idx": 762} | |
{"target": 0, "func": "[PATCH] all gonzalez stuff uploaded for trying to fix gonzalez, make\n it fast and accurate", "idx": 754} | |
{"target": 1, "func": "[PATCH] #1285 Refactored substituteInPlace. Now more efficient and has\n same signature for SX and MX.", "idx": 418} | |
{"target": 1, "func": "[PATCH] made GMX_FORCE_ENERGY a separate flag\n\nGMX_FORCE_ENERGY was (temporarily) defined as GMX_FORCE_VIRIAL.\nNow it is a separate flag, which is less confusing. This allows\nnstcalcenergy to be larger than nstpcouple, which improves performance\nwith the SSE and CUDA kernels.", "idx": 950} | |
{"target": 1, "func": "[PATCH] sbgemm: cooperlake: reorder ptr increase for performance", "idx": 1005} | |
{"target": 1, "func": "[PATCH] Adds a URIManager to manage all URIs within an array\n directory. This introduces several performance improvements, especially\n around redundant URI listings, parallelizing URI listings, etc. Also makes\n VFS::ls a noop for POSIX and HDFS when the listed directory does not exist\n instead of throwing an error, matching the functionality of the object\n stores. Finally, it removes partial vacuuming, as that leads to incorrect\n behavior with time traveling.", "idx": 1561} | |
{"target": 1, "func": "[PATCH] using int instead of size_t should be more efficient and\n range doesn't seem to be needed", "idx": 801} | |
{"target": 1, "func": "[PATCH] Stage bonded kernel atomics through shared memory\n\nFixes performance bug introduced in 01b2f20bd5 by staging energy step\natomics through shared memory rather than have all threads write\natomically directly to global memory.\n\nFixes #3443", "idx": 654} | |
{"target": 1, "func": "[PATCH] improved performance MatvecCommPkgCreate", "idx": 870} | |
{"target": 1, "func": "[PATCH] 2d convolve performance improvements\n\nchanged the shared memory loading access pattern in 2d convolve\nkernel for cuda and opencl backends", "idx": 1297} | |
{"target": 1, "func": "[PATCH] Use new style with make_array(), more compact and efficient", "idx": 1395} | |
{"target": 1, "func": "[PATCH] Restructured nonbonded calculation to allow more efficient\n vectorization", "idx": 1356} | |
{"target": 1, "func": "[PATCH] tutorials: Changed compressed ascii output to binary to\n improve IO performance\n\nalso rationalized the writeCompression specification", "idx": 1257} | |
{"target": 1, "func": "[PATCH] NumericVector::add_vector refactoring\n\nSimilar to #411 and #413\n\nThis was originally intended to be just another additional T* API plus\na refactoring; however, the new PetscVector::add_vector(DenseVector)\ncode path should be a performance improvement as well.", "idx": 5} | |
{"target": 1, "func": "[PATCH] Edit for faster performance", "idx": 1238} | |
{"target": 1, "func": "[PATCH] Made some performance improvements and fixed a bug when\n running on a single processor but compiled with mpi.", "idx": 435} | |
{"target": 1, "func": "[PATCH] bond/react: performance improvement", "idx": 1396} | |
{"target": 1, "func": "[PATCH] 128-bit AVX2 SIMD support\n\nAdd 128 bit support for AVX2. Similar to AVX-128, this\nimproves slightly on SSE2 due to more efficient instructions,\nand the shorter SIMD width is beneficial in some cases. Both\n128- and 256-bit flavors will be built automatically with\n--enable-avx2, and the timing routines will choose the best one\nautomatically.", "idx": 452} | |
{"target": 1, "func": "[PATCH] Use std::make_shared instead of new...\n\nIt is more efficient, since it requires only one memory allocation in\ncontrast to two.", "idx": 548} | |