{"target": 0, "func": "[PATCH] fix internal/external OpenMP thread affinity clash\n\nThread affinity set by the OpenMP library, either automatically or\nrequested by the user through environment variables, can conflict with\nthe mdrun internal affinity setting.\nTo avoid performance degradation, as Intel OpenMP has affinity setting\non by default, we will explicitly disable it unless the user manually\nset OpenMP affinity through one of the KMP_AFFINITY or GOMP_CPU_AFFINITY\nenvironment variables. If any of these variables is set, we honor the\nexternally set affinity and turn off the internal one.\n\nChange-Id: I78c6347154d6f11695ee04243db17bbb2e5cb0a7", "idx": 502} {"target": 0, "func": "[PATCH] bugfix of set_coef introduced by\n c84cc28d1c5550876d2d648f905716ffa69335ef\n\nThe problem is that building the matrix from a set of triplets sums\nthe value provided in case several values at the same position are\nprovided. In order to overwrite a value, we have no other choice\nthan to build the matrix and set the value (with current Eigen API).\nThe following fix is as efficient if the matrix is assembled\nin one pass. 
Using the boolean member of set_coef is very important now\nas it can imply a premature building of the eigen matrix.", "idx": 1563} {"target": 0, "func": "[PATCH] default to 1 thread, make the user manually specify the\n number of threads to avoid mpi/thread contention\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@2673 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 123} {"target": 0, "func": "[PATCH] Don't construct perflog strings unless needed\n\nPreviously, if we disabled the perflog at runtime (rather than compile\ntime), it would *not* disable the implicit construction of C++\nstd::strings from C char* strings, which turns out to be the most\ncostly part of perf log operation.\n\nWe still need to rework this whole class for efficiency, but at the\nvery least it's now efficient when disabled.", "idx": 1135} {"target": 0, "func": "[PATCH] fast QA for LINUX", "idx": 1515} {"target": 0, "func": "[PATCH] Working on the umutual2b kernel, the tdipdip values are\n computed on the fly for now, maybe a separate neigh list as in the CPU version\n will be more efficient", "idx": 92} {"target": 0, "func": "[PATCH] Add implementation of the mean squared error performance\n function.", "idx": 1369} {"target": 0, "func": "[PATCH] Fast code now computes confidence bands", "idx": 376} {"target": 0, "func": "[PATCH] Set interior parent only when mesh contains multiple\n dimensions\n\nThis commit is to address the issues brought up in libMesh/libmesh/#709,\nnamely, handle ParallelMesh more appropriately by using\nLIBMESH_BEST_UNORDERED_MAP and asserting when an element id is greater\nthan the maximum element id rather than the number of elements. 
Also,\nautomatically setting the interior parent should occur when a mesh has\nmultiple dimensions and skipped when the mesh has only one dimension.\n\nIn order to utilize mesh.elem_dimensions(), which allows a user to\ndetermine the dimensions of a mesh, the code to automatically set the\ninterior parent was moved into mesh_base.C as a separate method. This\nway mesh.cache_elem_dimensions() can be called prior to setting the\ninterior parents and thus mesh.elem_dimensions() will be available.\n\nAlso, the methods were moved prior to the partitioning in order to avoid\nthe complexities of one processor containing the interior parent of an\nelement on another processor.\n\nLastly, there is one noticeable performance penalty for moving the\nautomatic setting of the interior parents to a separate method and that\nis populating the node_to_elem map which requires iterating through all\nactive elements. Previously, the node_to_elem map was populated during\nan existing element iteration inside find_neighbors().", "idx": 908} {"target": 0, "func": "[PATCH] Extend Force sub-counters\n\nNeed more data for understanding performance variation\n\nImplemented subcounter \"restart\" and used it for accumulating\nposition-restraints time with FEP to the position-restraints\nsubcounter.\n\nNoted TODOs for some future extensions not currently possible.\n\nAlso added logfile output from GMX_CYCLE_BARRIER where people\nanalyzing the performance will see it.\n\nRefs #1686\n\nChange-Id: I9d60d0a683f56549879bb739269e9466c96572c4", "idx": 766} {"target": 0, "func": "[PATCH] remove fast hash", "idx": 516} {"target": 0, "func": "[PATCH] Replacing Mish and Derivative of Mish by Fast Mish and\n Derivative of Fast Mish resp.", "idx": 1367} {"target": 0, "func": "[PATCH] moved performance fix for 7.30 compilers to FFLAGS", "idx": 1093} {"target": 0, "func": "[PATCH] Made FAST OpenCL results match CUDA results", "idx": 1152} {"target": 0, "func": "[PATCH] RJH: Inserted performance statistics 
into the SCF using\n pstat. RJH: File cscfps.fh contains the handles.", "idx": 1103} {"target": 0, "func": "[PATCH] use problem-state pointer to write SPE mailbox with lower\n latency (makes a significant performance difference for N < 32k), thanks to\n Jan Wagner for suggestion [empty commit message]", "idx": 1050} {"target": 0, "func": "[PATCH] Fix config for PERFORMANCE category build (#3632)\n\nSome test executables have the default category (BASIC), so they\nare not added in a build with \"CATEGORIES PERFORMANCE\". Check that\nthese executables exist before setting properties on them.", "idx": 9} {"target": 0, "func": "[PATCH] Add a barrier() to ~LibMeshInit\n\nThis didn't fix the bug I was trying to hunt down, and as best as I\ncan tell this won't fix any actual bugs (we'll be synchronizing at the\nFinalize() calls later regardless), but it couldn't hurt to wait for\nother processes to exit before we start spewing reference counter\nand/or performance log data to the screen, lest other processes'\nconsole error messages get buried.", "idx": 1063} {"target": 0, "func": "[PATCH] Updated article reference. Added configuration flags to\n support altivec on powerpc. Updated configuration files from gnu.org. Removed\n the fast x86 truncation when double precision was used - it would just crash\n since the registers are 32bit only. 
Added powerpc altivec innerloop support\n to fnbf.c", "idx": 1267} {"target": 0, "func": "[PATCH] 3 fast kernels", "idx": 943} {"target": 0, "func": "[PATCH] Command-line override of default Metis-vs-Parmetis\n\nUsing Parmetis for ReplicatedMesh doesn't seem to make any difference in\nperformance on our benchmark set.", "idx": 1117} {"target": 0, "func": "[PATCH] Modified some of the tests so that they run reasonably fast\n through Insure++.", "idx": 1210} {"target": 0, "func": "[PATCH] Add IntRange helper class\n\nThis seems to be the tersest efficient way to iterate over a range of\nintegers, and since there are quite a few libMesh methods which take\nlocal integer indices, we still have to iterate over those indices\nquite often.", "idx": 231} {"target": 0, "func": "[PATCH] After merging pmatrix-dev, at least one thing got broken:\n ParMesh::Print for AMR meshes. Since in pmatrix-dev, slave faces are no\n longer considered shared (their P rows are not needed by the processor owning\n the master face), they are also not printed when visualizing the parallel\n solution. I suspect this also might have broken NC face neighbors. This\n branch contains a temporary solution, a downgrade of\n ParNCMesh::AddMasterSlaveRanks, so that it works the old way: slave faces are\n grouped with the masters. This fixes visualization and maybe other things,\n but may negatively impact performance of (or even break) the P matrix\n construction. 
I need to look more into this to find a permanent solution.", "idx": 1316} {"target": 0, "func": "[PATCH] bad performance with some data", "idx": 946} {"target": 0, "func": "[PATCH] tuned performance of script; added linked index", "idx": 102} {"target": 0, "func": "[PATCH] undo slow dgemm/skylake microoptimization\n\nthe compare is more costly than the work", "idx": 1474} {"target": 0, "func": "[PATCH] SDG Linf fast insertion examples\n\nSigned-off-by: Panagiotis Cheilaris ", "idx": 276} {"target": 0, "func": "[PATCH] JN: Solaris uses now PTR_ALIGN for performance and\n compatibility with JN: different compilers", "idx": 558} {"target": 0, "func": "[PATCH] Moved FAST description to docs directory.\n\nAdditionally indented FAST parameter documentation.", "idx": 899} {"target": 0, "func": "[PATCH] performance updates G_indx replaced with Pack_G_indx...EJB", "idx": 221} {"target": 0, "func": "[PATCH] HvD: In response to the development of a new ARMCI over MPI\n implementation I have updated the build_nwchem script and the tools\n GNUmakefile to be able to drive this target. This implementation goes by the\n name MPI_TS (short for MPI Two-Sided) as it eliminates the data-server and\n uses MPI two-sided communications to implement the ARMCI functionality. It is\n expected to be highly portable as it relies only on MPI. Trying the\n performance of this implementation will be interesting.\n\nIn all cases it is currently required to set the environment variable\n\n EXP_GA\n\nany value for this variable will do. This variable is currently required because\nthe implementation lives in a separate branch of GA. In addition you need to\nrun get-tools with this environment variable set to pick up the correct GA\nbranch.\n\nThe build_nwchem script will under certain conditions automatically pick the\nMPI_TS implementation up. 
However, it is safer to set\n\n ARMCI_NETWORK=MPI_TS\n\nas that eliminates any guess work that might lead to a different result.\n\nIf you want to try this implementation, please go ahead, as I am sure the GA\nteam will value any feedback we provide.", "idx": 218} {"target": 0, "func": "[PATCH] Preload Tile Offsets (#1795)\n\nParticularly for cloud storage backends, loading tile offsets is a performance\nsensitive path. Sequential reads perform much better than random reads because\nthey make better use of the read-ahead cache. This patch aims to increase\nthe number of sequential reads when loading tile offsets.\n\nCurrently, tile offsets are loaded in the following path:\nfor each attribute:\n read_tiles\n parallel_for each fragment:\n load_tile_offsets\n parallel_for each fragment:\n load_var_tile_offsets\n\nThis patch refactors it to:\nparallel_for each fragment:\n for each attribute:\n load_tile_offsets\n load_var_tile_offsets\nfor each attribute:\n read_tiles\n\nBy inverting the order in which we iterate fragments and attributes, we can\nsort the attributes by their index in the fragment metadata file. By loading\nattributes in ascending order of their offsets, we ensure a sequential read\nto maximize hits in the read-ahead cache.\n\nAdditionally, this defers loading var offsets until all fixed offsets have\nbeen loaded. This is because the fixed offsets exist before the var size\noffsets in the file format: https://github.com/TileDB-Inc/TileDB/blob/dev/format_spec/fragment.md\n\nMost importantly, this is a pre-requisite to parallelizing the read_tiles()\nfor each attribute. 
When read_tiles are parallel, we can't control the order\nthat they are executed and therefore load tiles, which may reduce hits in\nthe read-ahead.\n\nCo-authored-by: Joe Maley ", "idx": 377} {"target": 0, "func": "[PATCH] Fast forward porting work to master\n\nChange-Id: Ieb428e4a001efadf880dbe2c64c2a685cebdd4ae", "idx": 1066} {"target": 0, "func": "[PATCH] Add a \"sgemm direct\" mode for small matrices\n\nOpenBLAS has a fancy algorithm for copying the input data while laying\nit out in a more CPU friendly memory layout.\n\nThis is great for large matrices; the cost of the copy is easily\namortized by the gains from the better memory layout.\n\nBut for small matrices (on CPUs that can do efficient unaligned loads) this\ncopy can be a net loss.\n\nThis patch adds (for SKYLAKEX initially) a \"sgemm direct\" mode, that bypasses\nthe whole copy machinery for ALPHA=1/BETA=0/... standard arguments,\nfor small matrices only.\n\nWhat is small? For the non-threaded case this has been measured to be\nin the M*N*K = 28 * 512 * 512 range, while in the threaded case it's\nless, around M*N*K = 1 * 512 * 512", "idx": 1026} {"target": 0, "func": "[PATCH] Fixes for QM/MM MdModule\n\nSeveral fixes for problems found in QM/MM during beta1:\n\n* Additional check for external input files\n* Changed writing from `gmx::TextWriter` to `gmx::TextOutputFile` (`TextWriter` tries to format lines which is very slow in case of big files)\n* Added QM/MM to highlights\n\nRefs #3172", "idx": 1109} {"target": 0, "func": "[PATCH] driver: more reasonable thread wait timeout on Windows.\n\nIt used to be 5ms, which might not be long enough in some cases for the\nthread to exit well, but then when set to 5000 (5s), it would slow down\nany program depending on OpenBlas.\n\nLet's just set it to 50ms, which is at least 10 times longer than\noriginally, but still reasonable in case of failed thread termination.", "idx": 131} {"target": 0, "func": "[PATCH] Fix bonded atom communication performance\n\nThe 
filtering of atoms to communicate for bonded interactions that\nare beyond the non-bonded cut-off was effectively missing, because\nthe home atom indices were not set. This led to many more atoms\nbeing communicated.\n\nChange-Id: I4bd5b9b561a3077e055186f312939221dba6cefa", "idx": 860} {"target": 0, "func": "[PATCH] Fix code so that gradient is not wrapped into the box; fast\n exit out of symmetry routines if C1", "idx": 1226} {"target": 0, "func": "[PATCH 1/9] Update all versions to v1.1.1.", "idx": 425} {"target": 0, "func": "[PATCH] Restarted the hierarchical pca algorithm, now replaced with\n Vempala's fast SVD algorithm. Need to finish the subspace combining part...", "idx": 1216} {"target": 0, "func": "[PATCH] Added DefaultInitializationAllocator\n\nAdded an allocator that can be used to default initialize elements\nof a std::vector on resize(). This is useful to avoid initialization\nin performance critical code.\n\nChange-Id: I65bd52a760c68c73555e8bb9e017de353a6e9a81", "idx": 450} {"target": 0, "func": "[PATCH] Update CUDA Performance Tests for Stream interoperability", "idx": 532} {"target": 0, "func": "[PATCH] Added ODE diagnostics to FixRxKokkos using Kokkos managed\n data.\n\n- Added the diagnostics performance analysis routine to FixRxKokkos\n using Kokkos views.\nTODO:\n - Switch to using Kokkos data for the per-iteration scratch data.\n How to allocate only enough for each work-unit and not all\n iterations? Can the shared-memory scratch memory work for this,\n even for large sizes?", "idx": 484} {"target": 0, "func": "[PATCH] all modules and domqtests.mpi fast [travis skip]", "idx": 168} {"target": 0, "func": "[PATCH] Fix RDTSCP handling\n\nCommit 13def2872ae5311d tried to make all builds default to using\nRDTSCP, which would have broken non-x86 builds. But it also\ndeactivated the implementation of RDTSCP support because HAVE_RDTSCP\nwas left undefined. So all it did was make timing on x86 less\nefficient (plus e.g. 
DLB effects from that).\n\nUsed GMX_RDTSCP everywhere. Only the GROMACS project depends on\nthread-MPI, so it's reasonable to let a GMX symbol leak in there (and\nit's easily fixed if ever needed).", "idx": 1274} {"target": 0, "func": "[PATCH] Add access to base kernel in Efficient RANSAC traits", "idx": 800} {"target": 0, "func": "[PATCH] Don't preallocate USMObjectMem\n\nThis might hurt performance in a bad case, should profile & check.", "idx": 929} {"target": 0, "func": "[PATCH] fixed slow memory reallocation, especially at the start of\n runs and for large xvg file by replacing the linear increment by a scaling\n with a factor of 1.19 and renamed the dd over_alloc to over_alloc_dd", "idx": 348} {"target": 0, "func": "[PATCH] Set cmake build to release for LZ4/Zlib and Zstd\n\nThis enables vectorization which yields a boost in performance\nfor these compressors. #1033\n\nPR #1034\n\n(cherry picked from commit 3db1195e2e8df48ca399b18d0250256443651bda)", "idx": 2} {"target": 0, "func": "[PATCH] Compile without debugging or profiling symbols by default\n (i.e. 
be fast unless the user asks otherwise).", "idx": 1133} {"target": 0, "func": "[PATCH] Remove use of omp 5.0 feature\n\nomp_init_lock_with_hint is an OpenMP 5.0 feature.\nThis is not significant from a performance perspective.", "idx": 498} {"target": 0, "func": "[PATCH] The Viewer declares if the current drawing is a fast draw or\n not.", "idx": 1495} {"target": 0, "func": "[PATCH] Add performance function test.", "idx": 1253} {"target": 0, "func": "[PATCH] new version of forward and adjoint jacobian calculation for\n SXFunction, ticket #127, still need an efficient way of calculating seed\n matrices", "idx": 1181} {"target": 0, "func": "[PATCH] moved some test after fast exit [ci skip]", "idx": 911} {"target": 0, "func": "[PATCH] Convert nbnxn_pairlist_set_t to class PairlistSet\n\nThis change is only refactoring.\nTwo implementation details have changed:\nThe CPU and GPU pairlist objects are now straight lists instead\nof arrays of pointers to lists. This means that the pairlist objects are\nno longer allocated on their respective threads. But since the lists\nhave buffers to avoid cache pollution and the actual i- and j-lists are\nallocated thread local, this should not affect performance\nsignificantly.\nThe free-energy lists are now only allocated with perturbed atoms.\n\nChange-Id: Ifc76608215518edfc61c0ca8eb71ea2a928cf57c", "idx": 1475} {"target": 0, "func": "[PATCH] threading test performance updates", "idx": 1351} {"target": 0, "func": "[PATCH] Fast KDE is now using submodule for organizing parameters", "idx": 1220} {"target": 0, "func": "[PATCH] fast exit out of symmetry routines if C1 No longer print \"ok\"\n on first geometry step", "idx": 1540} {"target": 0, "func": "[PATCH] performance update ... EJB", "idx": 782} {"target": 0, "func": "[PATCH] AABB tree: enrich performance section with a summary (general\n comments and advice about how to put the tree at work with good\n performance). 
This is not exhaustive nor conclusive, of course, but I believe\n documentation must also state the obvious.", "idx": 143} {"target": 0, "func": "[PATCH] Add content to user guide\n\nConverted sections on environment variables, mdrun features, mdrun\nperformance to reStructuredText.\n\nChange-Id: I2a18528729dc6756be093e52e6f87f9df9fe3b94", "idx": 896} {"target": 0, "func": "[PATCH] Fixed FAST edge assertions", "idx": 320} {"target": 0, "func": "[PATCH] Add debug output; don't adjust second bound. This provides\n another minor speedup, but this still is nowhere near as fast as it should be\n with a properly working Hamerly prune.", "idx": 410} {"target": 0, "func": "[PATCH] Remove non-backported entry from NEWS\n\nWe decided against this one after all, due to risk of side effects\noutweighing the fixes to performance issues in some cases.", "idx": 1217} {"target": 0, "func": "[PATCH] Disable DD again in serial with GPU or without PME\n\nPerformance is better without DD in most of those cases.\n\nRefs #4195, #4198, #4171", "idx": 1278} {"target": 0, "func": "[PATCH] VT: Added a few lines about performance tuning", "idx": 1467} {"target": 0, "func": "[PATCH] Implement user guide\n\nRenamed former user manual to reference manual.\n\nThe content for the new user guide has mostly migrated in from the\nwiki, install guide, and mdrun -h, and updated as appropriate. This\nguide is intended for documenting practical use, whereas the reference\nmanual should document algorithms and high-level implementations, etc.\n\nEstablished references.md to do automatic linking of frequently\nused things. This can be automatically concatenated by pandoc\nonto any Markdown file to do easy link generation.\n\nSection on mdrun and performance imported and enhanced from the\nAcceleration and Parallelization wiki page.\n\nAdded section on mdrun features, e.g. 
rerun and multi-simulation.\n\nSection on getting started imported from online/getting_started.html\nand updated - there used to be a tutorial here, but there isn't any\nmore. Linked to more up-to-date tutorials.\n\nAdded TNG to docs/manual/files.tex.\n\nRemoved gmx options, now that its content is in the user guide (in\ntools.md).\n\nMoved old mdp_opt.html to docs/mdp-options.html, for now. Removed from\nreference manual, left pointer to new location. This is not an ideal\nformat or location either, but it's a step closer to being able to\ngenerate it from the code. Some trivial fixes to content. Generating\nlinks and references to follow in a future commit.\n\nMoved environment-variable section from reference manual to user guide.\nMinor fixes here.\n\nRemoved superseded reference manual sections on running in parallel or\nwith GPUs. Renamed install.tex as technical-details.tex, because that\nis all that is left there. Moved section on use of floating-point\nprecision to chapter on definitions and units, and thus eliminated the\nformer Appendix A.\n\nCross-references from user-guide.pdf don't work well yet, but that\nshould be dealt with when we decide on the final publishing platform.\n\nSome TODO comments for documentation sections remain for work in\nfuture patches, but please note the other new content in existing\nchild patches, so we don't duplicate any work.\n\nChange-Id: I026d67353863ae069c6c45b840a61fcaf205a377", "idx": 1209} {"target": 0, "func": "[PATCH] HvD: In order to get reasonable performance out of the\n automatic differentiation approach the compiler has to inline the various\n overloaded operators. If the compiler fails to do that performance\n degradation exceeding an order of magnitude has been observed. For the GNU\n compilers code inlining requires all the code to be present in a single\n source file. The compiler cannot inline code from a different file. 
Hence the\n NWAD code module must be included in the same source file as the density\n functional subroutines that use it. To do that we need a fixed format version\n of the NWAD module (typically identified to the compiler with the .F extension\n rather than the .F90 extension for free format Fortran). This commit creates\n the appropriate file that will be converted from free format to fixed format.\n At a later stage the original nwad.F90 file can be deleted.", "idx": 857} {"target": 0, "func": "[PATCH] Adding rough drafts of fast tridiagonalization on square\n process grids.", "idx": 1125} {"target": 0, "func": "[PATCH] initial caching of AO integrals for triples; SO integrals;\n performance measurement for the triples", "idx": 21} {"target": 0, "func": "[PATCH] I tried several things but it is still slow", "idx": 1187} {"target": 0, "func": "[PATCH] ATW: Interim commit - new sparse packing and IO OK but slow", "idx": 791} {"target": 0, "func": "[PATCH] Read tiles: fixing preallocation size for var and validity\n buffers. (#2781)\n\nPreallocation size for var buffer and validity buffer was not using the\ncorrect size, which will have a performance impact.", "idx": 323} {"target": 0, "func": "[PATCH] HvD: Eliminating a strange way of evaluating rho^1/3. If this\n costs performance then a better way would be to evaluate rho^4/3 = rho^1/3 *\n rho.", "idx": 912} {"target": 0, "func": "[PATCH] Accelerate GMM test by reducing number of EM iterations.", "idx": 97} {"target": 0, "func": "[PATCH] JJU: performance test for put get and acc", "idx": 1381} {"target": 0, "func": "[PATCH] Avoid confusing message at end of non-dynamical runs\n\nEM, TPI, NM, etc. are not targets for performance optimization\nso we will not write performance reports. 
This commit fixes\nan oversight whereby we would warn a user when the lack of\nperformance report is normal and expected.\n\nFixes #2172\n\nChange-Id: I1097304d79701be748612510572382729f7f26be", "idx": 574} {"target": 0, "func": "[PATCH] rearranged TRANSPOSED format, numerous speedups\n\nSplit the TRANSPOSED and non-TRANSPOSED rank-geq2 solvers, and changed\nthe DFT TRANSPOSED format to be more like fftw2 (both globally and\nlocally transposed). In general, more emphasis on arranging the data\ncontiguously for the DFTs, and more flexibility in intermediate\ntransposed formats. Also disable NO_SLOW when planning transposes,\nsince otherwise non-square in-place transposes gratuitously put the\nplanner in SLOW mode.\n\nCurrently, dft-rank1-bigvec has 5 variants (or 10, if DESTROY_INPUT).\nIt looks like only 2 of these are commonly used, so I should probably\nadd some UGLY tags once I do more benchmarking.", "idx": 948} {"target": 0, "func": "[PATCH] performance update....EJB", "idx": 1295} {"target": 0, "func": "[PATCH] Allow to pop the context menu with `Key_Menu`\n\nAs the item selection is rather slow, for the moment, that is a lot\nfaster than `Shift+Rightbutton`.", "idx": 1237} {"target": 0, "func": "[PATCH] Target haswell or AVX2 for prebuilt libraries\n\nPrebuilt artifacts that we publish for releases will now target\n`haswell` for linux and macos and `AVX2` for windows to allow for\ngreater compatibility while maintaining the performance of AVX\noptimizations.", "idx": 813} {"target": 0, "func": "[PATCH] More extensive performance logging of the Kelly Error\n Estimator.", "idx": 461} {"target": 0, "func": "[PATCH] ODESolvers: Add support for efficient ODE solver resizing\n\nNote: this reuses the existing storage rather than costly reallocation\nwhich requires the initial allocation to be sufficient for the largest\nsize the ODE system might have. 
Attempt to set a size larger than the\ninitial size is a fatal error.", "idx": 364} {"target": 0, "func": "[PATCH] Fix slow QuadratureFunction::GetElementValues when provided\n int pt", "idx": 1012} {"target": 0, "func": "[PATCH] Add Edge4::volume().\n\nThis speeds up calling volume() on every element in a mesh with 4M\nEdge4's by about 450x. Though I'm not sure this was ever going to be a\nperformance concern, I still want to have a custom volume()\nimplementation for all the different Elem types for completeness.\n\nThe 4-pt quadrature is also reasonably accurate. The biggest\ndiscrepancies seem to be introduced by perturbing the interior nodes\nof the EDGE4. For example, perturbing an interior node of the\nreference element by 0.125 in the y (or z) direction leads to a\nrelative error of much less than 1%, compared to a 12th-order (7-pt)\napproximation of the volume.\n\nApproximate volume = 2.0412301724043291e+00\nVerification volume = 2.0410437292661414e+00\nRel err = 9.1346959163268581e-03%", "idx": 1204} {"target": 0, "func": "[PATCH] Keep track of FE requests for mapping data\n\nBefore, if *only* mapping data was requested, we wouldn't realize that,\nwe would think that nothing had been requested, and we would calculate\nall data at reinit(), for backwards compatibility. 
This was making the\nElem::volume() fallback code too slow in the general case, and was\nbreaking it (due to unimplemented second derivatives) in the Rational\nBezier case.", "idx": 1089} {"target": 0, "func": "[PATCH] Added test to log the performance of the periodic Delaunay\n triangulation", "idx": 544} {"target": 0, "func": "[PATCH] removed order natoms^2 loops which made grompp vsite stuff\n extremely slow for large systems", "idx": 183} {"target": 0, "func": "[PATCH] fem_system_ex2 local->distributed solution\n\nThere doesn't seem to be any way to get this behavior back in reinit()\nwithout creating a performance regression: most applications simply\nnever need to move data in that direction.", "idx": 304} {"target": 0, "func": "[PATCH] fix a few performance drops in some matrix sizes per data type\n\nSigned-off-by: Wang,Long ", "idx": 1363} {"target": 0, "func": "[PATCH] Refactor threading model (#1766)\n\n* Refactor threading model\n\nRemoves:\n- global_tp_ (the replacement TBB thread pool)\n- StorageManager::async_thread_pool_\n- StorageManager::reader_thread_pool_\n- StorageManager::writer_thread_pool_\n- VFS::thread_pool_\n\nAdds:\n- StorageManager::compute_tp_\n- StorageManager::io_tp_\n\nUsage changes:\n1. Our three parallel functions (`parallel_sort`, `parallel_for[_2d]`) now use\n the `StorageManager::compute_tp_`.\n2. Both the `Reader::read_tiles()` and `Writer::write_tiles()` now execute on\n `StorageManager::io_tp_`.\n3. The VFS is now initialized with a thread pool, where the storage manager\n initializes it with the `StorageManager::io_tp_`. This means that both the\n VFS and Reader/Writer io paths execute on the same thread pool. There was\n previously a deadlock scenario if both used the same thread pool, but that\n is no longer an issue now that the threadpools are recursive.\n4. 
The async queries are executed on `StorageManager::compute_tp_`.\n\nConfig changes:\n- Adds configuration parameters for the compute and IO thread pool \"concurrency\n levels\". A level of \"1\" is serial execution while all other levels have a\n maximum concurrency of N but allocate N-1 OS threads.\n- Deprecate the async/reader/writer/vfs thread num configurations. If any of\n these are set and larger than the new \"sm.compute_concurrency_level\" and\n \"sm.io_concurrency_level\", the old values will be used instead. The motivation\n is so that existing users will not see a drop in performance if they are\n currently using larger-than-default values.\n\n* Recursive ThreadPool::execute() (#1772)\n\nCurrently, we break recursive deadlock in the ThreadPool::wait*() routines. This\nworks well for the type of \"execute-and-wait\" model we use. For instance:\n```\nThreadPool tp;\nauto task = tp.execute(...);\ntp.wait_all(task);\n```\n\nWe are currently unable to break recursive deadlock if the threadpool user does\nnot use our \"wait\" routine. For instance:\n```\ncondition_variable cv;\nauto task = tp.execute([&]() {\n cv.signal_all();\n});\ncv.wait(...);\n```\n\nThe S3 client uses the above style of synchronization. With our compute/io\nthreadpool refactor, we encounter recursive deadlock. 
This patch allows breaking\nrecursive deadlock on the call to ThreadPool::execute().\n\nWith this patch, ThreadPool::execute() checks if 1) the calling thread belongs\nto the thread pool instance and 2) all other threads are non-idle.\n\nCo-authored-by: Joe Maley \n\nCo-authored-by: Joe Maley ", "idx": 441} {"target": 0, "func": "[PATCH] Tightening default tolerances on the ADMM algorithms where\n possible (the linear program solver was kept the same due to its slow\n convergence rate).", "idx": 513} {"target": 0, "func": "[PATCH] Refactor the performance measurement system", "idx": 1160} {"target": 0, "func": "[PATCH] replaced fr->solvent_opt by fr->cginfo, which also contains\n the energy group id and implemented efficient cg sorting with DD, which is\n now done at every DD decomposition", "idx": 275} {"target": 0, "func": "[PATCH] Added support for (arbitrary) high-order Nedelec finite\n element discretizations in AMS.\n\nSeems to be working quite well on hex meshes and reasonably well on tet meshes,\ncorrelating with the performance of BoomerAMG on the associated high-order nodal\nproblems.", "idx": 113} {"target": 0, "func": "[PATCH] gemini performance updates for strided gets; code in the\n ARMCI_Get/Put/Acc subroutines should be incorporated into the ARMCI_Nb\n equivalent routines", "idx": 146} {"target": 0, "func": "[PATCH] - Some clean-up - Function get_number_of_bad_elements in the\n mesher levels (for debugging) - option\n CGAL_MESH_3_ADD_OUTSIDE_POINTS_ON_A_FAR_SPHERE to reduce contention on the\n infinite vertex", "idx": 1494} {"target": 0, "func": "[PATCH] Fix performance report when init_step!=0\n\nChange-Id: Ia4e15c2fb9b0e3debe7fc7f2aa8a1cdf346f90cb", "idx": 1028} {"target": 0, "func": "[PATCH] Added complex routines for blas lapack but they compile as a\n separate library so that it doesn't slow down applications that do not need\n complex routines", "idx": 887} {"target": 0, "func": "[PATCH] Adding Changelog for Release 3.4.01\n\nPart of Kokkos 
C++ Performance Portability Programming EcoSystem 3.4", "idx": 446} {"target": 0, "func": "[PATCH] add comment regarding OpenMP performance", "idx": 659} {"target": 0, "func": "[PATCH] fix for no performance logging\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@485 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 1053} {"target": 0, "func": "[PATCH] First cut at the performance statistics library", "idx": 1403} {"target": 0, "func": "[PATCH] MKK: A new classic test case for matmul that gives the\n performance numbers", "idx": 855} {"target": 0, "func": "[PATCH] Add an example to show how to debug performance with loggers.", "idx": 590} {"target": 0, "func": "[PATCH] added docu for efficient MVJ and restart mode for MVJ", "idx": 1224} {"target": 0, "func": "[PATCH] Fix input ndims validation in fast,orb,sift", "idx": 422} {"target": 0, "func": "[PATCH] Simplify handling of DD bonded distances\n\nTo simplify and clarify the DD setup code, we now always store\nthe systemInfo.minCutoffForMultiBody and use a separate flag to tell\nif we should increase the cut-off distance for bonded communication.\nThere is a minor behavioral change in that with large domains and\nbonded communication filtering or DLB, the bonded cut-off is now\n5% of the bonded cut-off longer as the margin is now included.\nThis has a negligible effect on performance in all cases.\n\nChange-Id: Id409353c517181ac56e8d3f1f36c22c705aa8077", "idx": 331} {"target": 0, "func": "[PATCH] Remove OpenMP from KOKKOS_DEVICES in Kokkos CUDA Makefiles\n since normally this doesn't improve performance", "idx": 197} {"target": 0, "func": "[PATCH] Fixed performance figures. 
TODO: new graphs", "idx": 61} {"target": 0, "func": "[PATCH] Fix memory alloc for fast opencl", "idx": 1440} {"target": 0, "func": "[PATCH] Enable dynamic pair list pruning\n\nThis change activates the dynamic pruning scheme and the pruning\nonly kernels added in previous commits.\nA heuristic estimate is used to select value for nstlist and\nnstlistPrune that should result in performance that is reasonably\nclose to optimal. The nstlist increase code has been moved from\nrunner.cpp to nbnxn_tuning.cpp. The KNL check in that code has been\nreplaced by a check for Xeon Phi.\nA paragraph has been added to the manual to describe the dynamic\nand rolling list pruning scheme. A reference with all the details\nwill be added once the paper has been published.\n\nChange-Id: Ic625858a07083916c8aa3e07f7497488dcfaee9e", "idx": 240} {"target": 0, "func": "[PATCH] Allow Gromacs to run on more than 64 threads by default\n\nAvoids Gromacs dying on machines when starting more than 64 threads,\nwhich is getting more common today - including our own tests.\n\nThis increases the default (hard) OpenMP thread limit from\n64 to 128, since at least Roland has not seen any performance\ndrawbacks from that.\n\nSecond, we no longer attempt to start more threads than Gromacs\nhas been configured for. 
In principle it would have been cleanest\nto limit gmx_omp_get_max_threads() to GMX_OPENMP_MAX_THREADS,\nbut unfortunately the first value is used to detect the total\nnumber of threads for all ranks, while the latter is rather used\nas a limit for the number of threads to start inside each rank.\nAll this threading code needs to be cleaned up later, but to avoid\nseverely limiting MPI parallelism on these machines, for now we\ninstead need to apply the limit once we have adjusted for the\nnumber of ranks in the higher-level routine.\n\nCloses #4370.", "idx": 658} {"target": 0, "func": "[PATCH] Added CUDA LJ-PME nbnxn kernels\n\nThis change implements CUDA non-bonded kernels for LJ-PME introduced\nin the Verlet scheme with 99029d.\n\nThe CUDA kernels implement geometric as well as Lorentz-Berthelot (LB)\ncombination rules (unlike the CPU SIMD) mostly because even though PME\nis very slow with LB, it is still beneficial to let the user offload the\nnon-bondeds to a GPU and potentially bump up the cut-off to further\nreduce the CPU PME load.\n\nNote that as now we have 120 kernels compiled for up to four different\ntarget architectures, the nbnxn_cuda module takes a very long time to\nbuild and can become the bottleneck during compilation. We will deal\nwith this later.\n\nChange-Id: I819b59a8948da0c8492eac6a43d4a7fb6dc98354", "idx": 1127} {"target": 0, "func": "[PATCH] Huge typo on the fast parameter", "idx": 706} {"target": 0, "func": "[PATCH] Keep track of old dof_indices distribution between\n processors; this makes it easier to construct an efficient send_list in\n System::project_vector.", "idx": 375} {"target": 0, "func": "[PATCH] aabb tree: more on performance section (benchmark across\n kernels)", "idx": 237} {"target": 0, "func": "[PATCH] ManipulatedFrame issues : fix\n\n- The call to bbox() at each top of a manipulated frame made it very slow to manipulate frames\n on a big item, because the bbox was computed at every call. 
The result is now kept in a\n member and updated only when invalidate_buffers is called.\n\n- The color of the cutting plane is repaired.", "idx": 1080} {"target": 0, "func": "[PATCH] Add implementation of the cross-entropy error performance\n function.", "idx": 841} {"target": 0, "func": "[PATCH] Demo updates, (I added a \"paint-like smoothing\" feature but I\n am going to remove it since real-time smoothing is not efficient due to AABB\n reconstruction.)", "idx": 107} {"target": 0, "func": "[PATCH] Run include order check in doc-check\n\nNow the doc-check target also checks that all files conform to the\ninclude ordering produced by the include sorter.\n\nAdd support into the include sorter for only checking the ordering\nwithout changing anything, and partially improve things such that the\nfull contents of the file are no longer required for some parts of the\nchecking. There seems to be no performance impact for now from storing\nall the file contents in memory, so did not go through this, but the\npartial changes also improve readability of the code.\nAdd support to gmxtree for loading the git attributes, to know which\nfiles not to check for include ordering.\n\nChange-Id: I919850dab2dfa742f9fb5b216cc163bc118082cc", "idx": 1128} {"target": 0, "func": "[PATCH] removed explicit assignment of BLASOPT=-mkl for MACX64. 
I prefer to\n have the slow blas baseline by default for generic users", "idx": 581} {"target": 0, "func": "[PATCH] tables added on cache performance", "idx": 257} {"target": 0, "func": "[PATCH] AABB tree: update performance section with more details about\n memory occupancy (table here is better than a curve as the memory grows\n linearly)", "idx": 404} {"target": 0, "func": "[PATCH] Changed FFTW warning from AVX to no SSE\n\nChanged the cmake FFTW SIMD check warning from complaining about\nAVX to complaining about missing SSE or SSE2.\nWith FFTW 3.3.4 the performance of FFTW with both SSE and AVX enabled\nis often a bit better and never much worse than SSE alone. Newer\nIntel processors probably also perform better with AVX with FFTW 3.3.3\nso we should not complain about the combination of SSE(2) and AVX,\nbut only when SSE is missing.\n\nChange-Id: I3665a35ec98616f015d05e314c8fbb80a8862092", "idx": 478} {"target": 0, "func": "[PATCH] Reinstate fast copy methods", "idx": 1308} {"target": 0, "func": "[PATCH] removed _infostream output (slow performance on massively\n parallel systems), documentation", "idx": 1271} {"target": 0, "func": "[PATCH] fixed bugs in BLT sorting, added makeSemiExplicit function\n (working, but not as efficient as it could be)", "idx": 695} {"target": 0, "func": "[PATCH] Add performance graph for region growing", "idx": 521} {"target": 0, "func": "[PATCH] Fix for calculate_dphiref-only\n\nI doubt anyone's ever triggered this, and it shouldn't have been more\nthan a performance issue if they had, but now that we're deprecating the\nold fallback it'll become important.", "idx": 1407} {"target": 0, "func": "[PATCH] Traits class inherits from Hyperbolic traits now; TODO:\n investigate why triangulation is so slow", "idx": 949} {"target": 0, "func": "[PATCH] TCE GPU: add environment variable NWC_OFFLOAD_SPAN\n\nThe NWC_OFFLOAD_SPAN environment variable is meant to control which GA\nranks are offloading to GPUs. 
When setting the variable, every\nNWC_OFFLOAD_SPANth rank (starting at rank 0) will be offloading its work\nto a GPU.\n\nThe current solution is a placeholder for the final implementation that\nstill needs to be figured out with actual performance tests.", "idx": 94} {"target": 0, "func": "[PATCH] Revert \"Use correct number of atoms in GPU Update kernels\"\n\nThis reverts commit 4d9a6d110b614a299a5bab4120c87a04f9ac14c1 (MR !2523).\n\nIt was supposed to be a trivial fix, but things turned out to be more\ncomplicated, and my testing was insufficient (#4401).\n\nThe fix is still needed, but the bug is not causing incorrect physics,\nmerely harmless sanitizer errors and slightly higher resource\nconsumption by kernels that are very fast regardless.\n\nI suggest doing it in master or delaying till the patch release.\nRight now is not the best time to fix such issues.\n\nRefs #4398", "idx": 1321} {"target": 0, "func": "[PATCH] Enabling a fast pass option for reorder", "idx": 1399} {"target": 0, "func": "[PATCH] Introduced python typemaps for std::vector,\n std::vector.\n\nWas having difficulties with 5-argument DMatrix constructor and numpy. Does seem to slow down compilation and linking significantly (factor 2?).", "idx": 414} {"target": 0, "func": "[PATCH] Adding Changelog for Release 3.2.00\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 3.2\nUpdate Kokkos version macros to 3.2.0", "idx": 1532} {"target": 0, "func": "[PATCH] Performance and thread-safety require a lock around each\n constraint row acquisition, not just each constraint row entry.", "idx": 671} {"target": 0, "func": "[PATCH] Possible performance enhancement in Mesh::delete_elem. We use\n the passed Elem's id() as a guess for the location of the Elem in the\n _elements vector. 
If the guess does not succeed, then we revert to the linear\n search.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1187 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 916} {"target": 0, "func": "[PATCH] Optimizations to Compute[Yi/Zi/Bi], switching over to an\n AoSoA data layout on the GPU. CPU vs GPU code paths are now maximally\n divergent, will include some discussion of that in PR. Small performance\n tweaks in Compute[UiTot/FusedDeidrj].", "idx": 1247} {"target": 0, "func": "[PATCH] performance timing...EJB", "idx": 394} {"target": 0, "func": "[PATCH] Update citations.\n\nGoogle scholar claims this paper cites libmesh but I obtained a copy and it does not...\n\n@InProceedings{Monteiro_2016,\n author = {S.~Monteiro and F.~Iandola and D.~Wong},\n title = {{STOMP: Statistical Techniques for Optimizing and Modeling Performance of blocked sparse matrix vector multiplication}},\n booktitle = {{28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)}},\n pages = {93--100},\n publisher = {IEEE},\n month = oct,\n year = 2016,\n note = {\\url{http://dx.doi.org/10.1109/SBAC-PAD.2016.20}}\n}", "idx": 969} {"target": 0, "func": "[PATCH] macos accelerate does not contain dcombossq", "idx": 439} {"target": 0, "func": "[PATCH] Ensure PME with OpenCL does not attempt to pin\n\nHost-only memory pinning was designed with CUDA in mind, while OpenCL\nrequires managing both host and device memory buffer for efficient\nmapping, which is not yet implemented.\n\nThis change teaches the PME module to understand what pinning policy\nis appropriate to the build configuration, so that the setup of data\nstructures in various parts of the code can use a pinning policy that\nalways works.\n\nRefs #2498\n\nChange-Id: I2a294aee460947cd3aad5e23869cead1b99fd610", "idx": 1195} {"target": 0, "func": "[PATCH] performance updates??....EJB", "idx": 1111} {"target": 0, "func": "[PATCH] Fixing issues from @rcurtin's review + adding unit test 
for\n cross entropy performance function", "idx": 1519} {"target": 0, "func": "[PATCH] Adding timers into ID to help facilitate @YingzhouLi's\n performance benchmarks", "idx": 1123} {"target": 0, "func": "[PATCH] ex1p: implement pa jacobi preconditioning\n\nPerformance is disappointing at the moment.", "idx": 924} {"target": 0, "func": "[PATCH] convert coul/dsf/omp styles to use fast analytical erfc()", "idx": 263} {"target": 0, "func": "[PATCH] Removed AVX code, since it had very little effect on\n performance and would have required a more complicated build process. Also\n worked around a compilation error with clang.", "idx": 607} {"target": 0, "func": "[PATCH] performance update ...EJB", "idx": 483} {"target": 0, "func": "[PATCH] added performance figures in the doc", "idx": 111} {"target": 0, "func": "[PATCH] Modified waterChannel tutorial to make case better posed\n Existing case did not properly converge and suffered slow convergence with\n the water level failing to reach an equilibrium. A slight rise in the\n channel appears to help the water level reach an equilibrium when the flow\n rate over the rise matches the inlet flow rate.", "idx": 51} {"target": 0, "func": "[PATCH] finished consolidating the InRCut device function. It uses\n double3's for axes, halfAx, and dist. It passes the currentPArticle,\n neighborParticle, and gpu_x,y,z arrays to calculate distance further down the\n trace. Also, I replaced the diff_com and virComponents with double3s. It\n would be more elegant to use an array of double3's as opposed to three\n separate arrays for coords. 
Right now this change isn't implemented due to\n possible performance concerns.", "idx": 245} {"target": 0, "func": "[PATCH] efficient sparsity pattern computation for the case when the\n user specifies the DOF coupling\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1130 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 1098} {"target": 0, "func": "[PATCH] s390x/GEMM: replace 0-init with peeled first iteration\n\n... since it gains another ~2% of SGEMM and DGEMM performance on z15;\nalso, the code just called for that cleanup.\n\nSigned-off-by: Marius Hillenbrand ", "idx": 1331} {"target": 0, "func": "[PATCH] Misc. Doxygen build system improvements\n\n- Only generate the installed header list once after CMake is run.\n It cannot change unless CMake is run again. This wasn't particularly\n slow earlier either, but now it can be added as a dependency also in\n the -fast targets without any impact on the behavior.\n- Do not update the Doxyfile-common each time CMake is run if\n GMX_COMPACT_DOXYGEN=ON.\n- Partition the Markdown pages into subdirectories based on the\n documentation level where they should appear. 
Exclude things from\n the documentation based on the directory, not individually.\n- Use a CMake function to create the various Doxygen targets to remove\n some duplication.\n- Some cleanup in the directories.cpp and misc.cpp documentation files.\n- Some cleanup to use consistent casing throughout CMakeLists.txt.\n\nChange-Id: I30de6f36841f25260700ec92284762e989f66507", "idx": 18} {"target": 0, "func": "[PATCH] Fixed FAST CPU backend case when no features are found", "idx": 829} {"target": 0, "func": "[PATCH] Add implementation of convolution using the naive approach\n (it's pretty fast for small filter).", "idx": 1136} {"target": 0, "func": "[PATCH] Issue #724 Gradient of the Lagrangian function is always\n generated for NLP solvers Might slow down initialization a bit, but probably\n not much", "idx": 1432} {"target": 0, "func": "[PATCH] - Performance issue, in Standard_criteria.h: Quality were\n not defined correctly, and then facets were not ordered correctly, in the\n Double_map.", "idx": 524} {"target": 0, "func": "[PATCH] add Apple Accelerate to the list of BLAS libraries\n\nSigned-off-by: Jeff Hammond ", "idx": 947} {"target": 0, "func": "[PATCH] Added a little extra information (which inner loop) to the\n printout of performance info.", "idx": 1426} {"target": 0, "func": "[PATCH] removed a lot of slow FFT grid sizes from calc_grid and made\n calcgrid.c thread safe", "idx": 1522} {"target": 0, "func": "[PATCH 1/8] use dot", "idx": 33} {"target": 0, "func": "[PATCH] Added workaround for bad performance with CUDA 7.x on clover\n sigma oprod", "idx": 269} {"target": 0, "func": "[PATCH] added an efficient code path for plain leap-frog update", "idx": 37} {"target": 0, "func": "[PATCH] Introduce GMX_USE_SIMD_KERNELS cmake option\n\nMost GROMACS development does not need to recompile the SIMD nbnxm\n(and fep) kernels whenever their dependencies change. 
These\ndependencies are large in number, and include frequently changed\nfiles, including config.h and various utility and nbnxm module\nheaders. This flag permits people to efficiently recompile while\nworking on code that doesn't directly target changes to the SIMD\nkernels.\n\nIt also means that CI builds not aimed at efficient mdrun execution\ntimes can instead minimize compilation times and ccache db sizes. This\nwill also have the side effect of testing more of the reference NBNXM\nkernels.\n\nThere's other SIMD code (particularly PME, bonded, LINCS, update)\nwhich still compiles and runs in the usual way. Currently these are\nless costly to compile and harder to disable. That could change in\nfuture.", "idx": 587} {"target": 0, "func": "[PATCH] Fixed up bugs and add performance tests for get and\n accumulate.", "idx": 22} {"target": 0, "func": "[PATCH] Add an accurate but slow way to generate a sparse point set", "idx": 1345} {"target": 0, "func": "[PATCH] avoid using epeck in slow (when using leda) tests", "idx": 764} {"target": 0, "func": "[PATCH] Set cmake build to release for LZ4/Zlib and Zstd\n\nThis enables vectorization which yields a boost in performance\nfor these compressors. 
#1033", "idx": 303} {"target": 0, "func": "[PATCH] Working on the performance tests", "idx": 474} {"target": 0, "func": "[PATCH] Workaround for very slow compilation on Windows", "idx": 1281} {"target": 0, "func": "[PATCH] Fix for md_parallel_for for Cuda reported in #1057\n\nmd_parallel_for is intended to be deprecated, but fix the issue for not\nrespecting lower bounds while resolving difference in performance\nbetween md_parallel_for and parallel_for calls with MDRangePolicy", "idx": 1555} {"target": 0, "func": "[PATCH] Add functions to Quantity to compute the max, min, standard\n deviation (as the sqrt of the variance), and average, returning a Quantity\n with the proper units.\n\nThis should be reasonably efficient, as it takes advantage of numpy-accelerated\nmethods if they're present.", "idx": 605} {"target": 0, "func": "[PATCH] Bring performance estimation up to date\n\nThe performance estimation code for estimating the PME/PP load\nand the optimal DD grid setup used outdated numbers.\nWe now estimate using actual cycle counts on Haswell and estimate\nfor other architectures through a scaling factor that takes into\naccount the SIMD width and FMA.\nThe DD grid automation now ignores PBC cost for exclusions with\nthe Verlet scheme and for angles and dihedrals with SIMD.\n\nThe effect of this is a more reliable PME load estimate that's\nnow a factor 1.4 to 1.7 higher on Haswell.\nThe DD grid automation will now often choose a setup that better\nmatches the PME decomposition and reduces the PME redist cost.\n\nChange-Id: I5daa6a6856f2b09ba6d17fda0eea800b816d21e4", "idx": 433} {"target": 0, "func": "[PATCH] clang-tidy: more misc+readability\n\nAuto fixes:\nmisc-macro-parentheses\nreadability-named-parameter\n\nEnabled with few violations disabled by configuration:\nmisc-throw-by-value-catch-by-reference\nreadability-function-size\n\nSet clang-tidy checks as list to allow comments.\n\nRefactored the operator << used by GMX_THROW to take the exception 
by\nvalue and return a copy, which is easier for tools to see is a proper\nthrow-by-value (of a temporary). Performance on the throwing path is\nnot important, but is anyway not affected because the inlining of the\noperator allows the compiler to elide multiple copies. This also\navoids casting away the constness.\n\nChange-Id: I85c3e3c8a494119ef906c0492680c0d0b177a38d", "idx": 676} {"target": 0, "func": "[PATCH] Add data set that shows the performance gain when running\n self_intersections_example.cpp (4.6 sec master, 0.6 sec this PR when run\n sequentially", "idx": 895} {"target": 0, "func": "[PATCH] aabb tree: added curve in performance section", "idx": 973} {"target": 0, "func": "[PATCH] Replacement for pdb2gmx tests\n\nThis test directly asserts upon the .top and .gro files that are\nwritten out, using fragments of the\nregressiontests/complex/aminoacids/conf.gro because these cover all\nbasic amino acid types. It also adds testing for hydrogen vsites for\namber and charmm.\n\nWe now omit doing an energy minimization after the string checks,\nwhich was always a doubtful way to test pdb2gmx. These tests are still\ntoo slow to run with other pre- and post-submit testing, so a new\nCTest category has been made for them, and that category is excluded\nfrom Jenkins builds by default. Developers will still run these by\ndefault with \"make check\" or \"ctest\" but that should be fast enough on\na workstation. Later we can probably refactor them to use in-memory\nbuffers and be fast enough to put with the other tests.\n\nModified pdb2gmx to avoid writing fractional charges for every atom in\nthe [atoms] output, which isn't very useful for users and makes\nwriting tests more difficult.\n\nFixed unstable sorting of dihedrals whose parameters are strings that\nidentify macros.\n\nAdded new capability for refdata tests to filter out lines that vary\nat run time by supplying a regex that matches the lines to skip.\nThat's not ideal, but useful for now. 
Better would be to refactor\ntools so that e.g. header output can go to a different stream, but\nfirst we should have basic tests in place.\n\nAdded tests for Regex. Fixed minor bug in c++ stdlib regex\nimplementation of Regex. Noted the remaining reason why we have\nRegex supported by two different implementations.\n\nMinor updates to use compat::make_unique\n\nExtended functionality of CommandLine for convenient use.\n\nRefs #1587, #2566\n\nChange-Id: I6a4aeb1a4c460621ca89a0dc6177567fa09d9200", "idx": 346} {"target": 0, "func": "[PATCH] Use inFastDrawing instead of quick_camera and provide direct\n access to fast drawing state", "idx": 1412} {"target": 0, "func": "[PATCH] Added documentation for FAST", "idx": 1006} {"target": 0, "func": "[PATCH] Replaced static arrays of cl::program/kernels with maps\n\nfast, fftconvolve, orb and random kernels were using static\narrays which are now replaced with std::maps. This fixes the\npure virtual function error that is happening on windows for\nintel OpenCL devices.", "idx": 730} {"target": 0, "func": "[PATCH] Accelerate hmm and gmm", "idx": 1251} {"target": 0, "func": "[PATCH] Code cleanup in swapcoords.cpp\n\nFor ion/water position exchanges with DD, the positions of the ion group,\nthe split group 0 and the split group 1 are assembled into an array known\non all processors (g->xc). Only if ion/water exchanges need to be done, the\npositions of the solvent group need to be assembled as well.\n\nBefore this patch, the group index ran from 0 to eGrpNr, so therefore also\nthe solvent group positions were assembled in every swap step. This was\nsuperfluous since they would be assembled again if bSwap is TRUE.\n\nIf the swap protocol is called very frequently (nstswap << 100), the\nperformance impact is now smaller.\n\nChange-Id: I8eff6bbd33810d6641ec97aa00a537fe782214d3", "idx": 431} {"target": 0, "func": "[PATCH] ranks/device changed from 2 to 1. 
Value of 2 doubles memory\n usage (and hits MPOSS bugs) but does not help with performance", "idx": 464} {"target": 0, "func": "[PATCH] Add Line_2 to Efficient RANSAC traits", "idx": 975} {"target": 0, "func": "[PATCH] Ordinary slow Ewald electrostatics implemented for md runs", "idx": 1268} {"target": 0, "func": "[PATCH] AdResS: lost performance tweak\n\nChange-Id: I164bc6a60f62d117fef83844da74c3707455a980", "idx": 1334} {"target": 0, "func": "[PATCH] r70387, r70573, r70574 from Mesh_3-experimental-GF\n\nAdd incident_cells_3(Vertex_handle, std::vector)\n\nThis function avoids the construction of two additional std::vectors.\nThe performance gain is between 30% (g++) and 50% (VC++)\nfor points on surfaces as well as for points filling space.\n\nWe at the same time change the implementation of the function\nincident_cells(Vertex_handle, OutputIterator).\nIn order to save one additional std::vector,\nthe cells are reported in bfs and not in dfs order", "idx": 1214} {"target": 0, "func": "[PATCH] Add CUDA bonded kernels\n\nCUDA bonded kernels are added for the most common bonded and LJ-14\ninteractions.\nThe default auto settings of mdrun offloads these interactions\nto the GPU when possible.\nCurrently these interactions are computed in the local or non-local\nnbnxn non-bonded streams. We should consider using a separate stream.\nThis change uses synchronous transfers. A child change will change\nthese to asynchronous.\n\nUpdated release notes and performance guide.\n\nFixes #2678\nRefs #2675\n\nChange-Id: Ifc6d97854cc7afa8526602942ec3b1712ba45bac", "idx": 357} {"target": 0, "func": "[PATCH] use consistent constants from math_const.h and fast integer\n powers from math_special", "idx": 1411} {"target": 0, "func": "[PATCH] Use numerical_jacobian_h_for_var\n\nThe refactoring required makes the ALE perturbations a little simpler\nto boot. 
This may be a tiny bit slower, but numerical jacobians are\nslow to begin with and the difference shouldn't be noticeable.", "idx": 232} {"target": 0, "func": "[PATCH] The Lazy Kernel doesn't compile the assertion and the more\n efficient bbox() function. (Because Interval_nt<> doesn't have gamma(),\n is_rational(), etc...)", "idx": 1282} {"target": 0, "func": "[PATCH] Adding pair style dpd/intel and dihedral style fourier/intel\n Adding raw performance numbers for Skylake xeon server. Fixes for using older\n Intel compilers and compiling without OpenMP. Fix adding in hooks for using\n USER-INTEL w/ minimization.", "idx": 745} {"target": 0, "func": "[PATCH] Don't store zero entries in IGA constraint rows\n\nMost meshes aren't going to have many of these, but omitting them should\nbe a slight performance increase. More importantly, this makes\ndebugging on trivial meshes much easier.", "idx": 1092} {"target": 0, "func": "[PATCH] Remove Isidore_only_equalized_KLT_5000_with_normals.xyz from\n test suite as the Surface Mesher/APSS bug is fixed and processing\n Isidore_only_equalized_KLT_5000_with_normals.xyz is very slow", "idx": 1107} {"target": 0, "func": "[PATCH] Reverting optimizations that hurt performance on some\n compilers\n\ngit-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@15551 f3b2605a-c512-4ea7-a41b-209d697bcdaa", "idx": 1379} {"target": 0, "func": "[PATCH] Added parentheses to performance logging messages", "idx": 593} {"target": 0, "func": "[PATCH] nishant, added efficient jackknifed m spacing estimator", "idx": 156} {"target": 0, "func": "[PATCH] Remove no-inline-max-size and suppress remark\n\nTo avoid the remark that inlining isn't possible I added the flag\nin d28edf2a07dcf11. 
This causes slow compile and should be avoided.\nInstead suppress the remark.\n\nTODO (for later): Check whether the additional inlining can improve\nperformance and consider enabling it for release build.\n\nChange-Id: I5866fcc5865fb44ca3dca0cf217e0cab2afbea0c", "idx": 290} {"target": 0, "func": "[PATCH] More extensive performance logging of the Kelly Error\n Estimator.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1122 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 741} {"target": 0, "func": "[PATCH] Fix issue with HWLOC + OpenMP on XeonPhi\n\nUsing omp_get_max_threads(); is problematic in conjunction with\nHwloc on Intel (essentially an initial call to the OpenMP runtime\nwithout a parallel region before will set a process mask for a single core).\nThe runtime will then bind threads for a parallel region to other cores upon\nentering the first parallel region and make the process mask the aggregate of\nthe thread masks. The intent seems to be to make serial code run fast, if you\ncompile with OpenMP enabled but don't actually use parallel regions or so", "idx": 150} {"target": 0, "func": "[PATCH] less efficient but maybe portable logarithm", "idx": 832} {"target": 0, "func": "[PATCH] Templated options are now runtime compile options for opencl\n FAST", "idx": 873} {"target": 0, "func": "[PATCH] Guard performance test which uses Lambda dispatch #821", "idx": 492} {"target": 0, "func": "[PATCH] Enable mac CI testing on azure pipelines\n\n- use native minio bin from brew because Docker for Mac launches slow,\n annoying, etc.", "idx": 968} {"target": 0, "func": "[PATCH] Request flushing denorms to zero in OpenCL\n\nThis change adds by default the -cl-denorms-are-zero to the flags used\nfor kernel compilation. 
This is done to:\n- avoid a large performance penalty on AMD Vega with ROCm (which by\n default handles denorms on GFX9 or later).\n- make the defaults uniform across CUDA and OpenCL.\n\nFixes #2593\n\nChange-Id: I9e6183c4367b5960e0e21f1dd342d7695acfbc44", "idx": 57} {"target": 0, "func": "[PATCH] FAST will return (af_)features instead of (af_)features *", "idx": 919} {"target": 0, "func": "[PATCH] Fixed bug in 4th order 3D elements. Serendipity now works up\n to order 4 in 3D. Doing some performance testing + trying to get AMR to work\n with serendipity", "idx": 1014} {"target": 0, "func": "[PATCH] Added some profiling to catch performance numbers.", "idx": 220} {"target": 0, "func": "[PATCH] Wrap up unit and performance testing\n\nWith OpenMP, the data-duplicated non-atomic\nversion is 4X faster than the non-duplicated\natomic version using 16 threads, and\n2-3X faster using 2 threads.", "idx": 960} {"target": 0, "func": "[PATCH] Add a performance warning when a dynamic property map is used\n as index map", "idx": 39} {"target": 0, "func": "[PATCH] working but still slow, about to switch to intervals", "idx": 1104} {"target": 0, "func": "[PATCH] Moving coord_string from returning a std::string to\n std::string_view. 
(#2704)\n\nThe coord_string function is used in a lot of performance critical paths.\nMoving it to return a string_view as none of these paths benefit from\nmaking a copy of the value.", "idx": 591} {"target": 0, "func": "[PATCH] additional precompiler lines for performance testing", "idx": 306} {"target": 0, "func": "[PATCH] 64-bit alignment makes a huge difference on ga_dgemm\n performance on KNL", "idx": 249} {"target": 0, "func": "[PATCH] Stabilize and accelerate the test case by using a smaller\n network architecture.", "idx": 172} {"target": 0, "func": "[PATCH] Reduce noise slightly and increase dataset size, which will\n slow down the test but make the results more accurate.", "idx": 230} {"target": 0, "func": "[PATCH] \n tutorials/combustion/reactingFoam/laminar/counterFlowFlame2D(LTS): changed to\n Wilke transport mixing\n\nChanged the laminar methane combustion cases to use the Wilke mixing rule for\nthe transport properties obtained from the Sutherland model but with coefficient\nmixing for thermodynamic properties for efficient evaluation of reaction\nequilibria.\n\nThis provides significantly more accurate results for laminar combustion,\nproducing a thinner flame and a 10K reduction in peak temperature.", "idx": 258} {"target": 0, "func": "[PATCH] Redesigned experimental memory pool to eliminate race\n conditions by condensing state representation to a single integer and\n simplifying algorithm. Addresses issues #320 , #487 , #452\n\nCreating power-of-two Kokkos::Impl::concurrent_bitset size to streamline\nimplementation and align with MemoryPool needs.\n\nUnit testing over a range of superblocks the following sequence:\n 1) allocate N of varying size\n 2) deallocate N/3 of these\n 3) reallocation deallocated\n 4) concurrently deallocate and allocate N/3 of these\n\nAdd performance test for memory pool.\nAdd performance enhancement note for multiple hints per block size.", "idx": 958} {"target": 0, "func": "[PATCH] added synchs for main events again. 
this may have a negative\n performance influence in some cases. Such cases should however be really rare.\n Otherwise, these events make no sense without a sync. Imagine a master rank\n which is not at the interface", "idx": 644} {"target": 0, "func": "[PATCH] Adding benchmark directory and a bytes_and_flops benchmark\n\nThis directory is intended for hardware and software benchmarks.\nThese are not necessarily performance unit tests, but usually a bit\nmore complex than that.\n\nAnd they are intended to be used on their own.", "idx": 1113} {"target": 0, "func": "[PATCH] - Workaround bugs and misfeatures of GCC 3 in FPU.h. \n Unfortunately at a performance cost :((", "idx": 827} {"target": 0, "func": "[PATCH] Containers: DynRankView Performance Test was using too much\n memory.\n\nThe test was using close to 9GB in memory, with parallel testing this\nled to out of memory failures.", "idx": 1298} {"target": 0, "func": "[PATCH] Tested a new locking grid with thread priorities. Actually,\n it decreases performance with the current algorithm, but we'll try it again\n later...", "idx": 920} {"target": 0, "func": "[PATCH] performance update ....EJB", "idx": 76} {"target": 0, "func": "[PATCH] Extra options for computational electrophysiology.\n\n* Added two extra .mdp file parameters 'bulk-offset' that allow to specify\nan offset of the swap layers from the compartment midplanes. This is useful\nfor setups where e.g. a transmembrane protein extends far into at least one\nof the compartments. Without an offset, ions would be swapped in the vicinity\nof the protein, which is not wanted. 
Adding an extended water layer comes\nat the cost of performance, which is not the case for the offset solution.\n* Also made the wording a bit clearer in some places\n* Described the new parameters in the PDF manual, updated figure\n* replaced usage of sprintf in output routine print_ionlist_legend() by snprintf\n* Turned comments describing the variables entering the swapcoords.cpp\n functions into doxygen comments\n\nChange-Id: I2a5314d112384b30f9c910135047cc2441192421", "idx": 728} {"target": 0, "func": "[PATCH] Added performance chart", "idx": 272} {"target": 0, "func": "[PATCH] SMD multi step added with fast Hv product in main", "idx": 641} {"target": 0, "func": "[PATCH] sgemm/dgemm: add a way for an arch kernel to specify prefered\n sizes\n\nThe current gemm threading code can make very unfortunate choices, for\nexample on my 10 core system a 1024x1024x1024 matrix multiply ends up\nchunking into blocks of 102... which is not a vector friendly size\nand performance ends up horrible.\n\nthis patch adds a helper define where an architecture can specify\na preference for size multiples.\nThis is different from existing defines that are minimum sizes and such.\n\nThe performance increase with this patch for the 1024x1024x1024 sgemm\nis 2.3x (!!)", "idx": 1183} {"target": 0, "func": "[PATCH] Add the NeighborSearchRules class, which defines how the\n SingleTreeDepthFirstTraverser can perform a NeighborSearch. Adapt the\n NeighborSearch class to use this. 
It is not as fast as it could be.", "idx": 670} {"target": 0, "func": "[PATCH] Use fast pool allocator", "idx": 1284} {"target": 0, "func": "[PATCH] Distinghuish cases for performance reasons", "idx": 787} {"target": 0, "func": "[PATCH] Fix a warning\n\nhttps://cgal.geometryfactory.com/CGAL/testsuite/CGAL-5.1-Ic-152/Installation/TestReport_Friedrich_Ubuntu-gcc-7.gz\n```\nCMake Warning at /home/gimeno/foutoir/cgal_root/CGAL-5.1-Ic-152/cmake/modules/CGAL_enable_end_of_configuration_hook.cmake:99 (message):\n =======================================================================\n\n CGAL performance notice:\n\n The variable CMAKE_BUILD_TYPE is set to \"\". For performance reasons, you\n should set CMAKE_BUILD_TYPE to \"Release\".\n\n Set CGAL_DO_NOT_WARN_ABOUT_CMAKE_BUILD_TYPE to TRUE if you want to disable\n this warning.\n\n =======================================================================\nCall Stack (most recent call first):\n CMakeLists.txt:9223372036854775807 (CGAL_run_at_the_end_of_configuration)\n```", "idx": 161} {"target": 0, "func": "[PATCH] attempt to reduce the negative performance impact of adding\n the shift option", "idx": 534} {"target": 0, "func": "[PATCH] KokkosContainers: Mark perf tests as CATEGORY PERFORMANCE\n\nFix https://github.com/kokkos/kokkos/issues/374 by marking\nKokkosContainers' performance tests as CATEGORY PERFORMANCE. They\nwill always build, but they will only run when doing performance\ntests.", "idx": 155} {"target": 0, "func": "[PATCH] Performance test for memory pool. 
Update v2 API to better\n align with v1.", "idx": 43} {"target": 0, "func": "[PATCH] New version of dgemm_kernel_4x4_bulldozer.S The peak\n performance with 8 cores is now 90 GFlops", "idx": 1273} {"target": 0, "func": "[PATCH] additional precomiler command for performance testing", "idx": 527} {"target": 0, "func": "[PATCH] Fixed backend API for join\n\n* Removed output dimensions as parameter\n* Fixed unitialized warnings for fast cpu", "idx": 1121} {"target": 0, "func": "[PATCH] performance updates....EJB", "idx": 1300} {"target": 0, "func": "[PATCH] Added to PME on GPU information in mdrun performance\n\nAdded -pme to glossary\n\nMoved and modified a previous mdrun with GPU example to follow\na more logical progression - first introduce the simpler use cases\nof gputasks and show examples, then at the end show how to avoid a\ngraphics-dedicated GPU\n\nAdded more GPU task assignment examples\n\nChange-Id: I63304a511d5d98d85fdbb1cea497627a80a14418", "idx": 1484} {"target": 0, "func": "[PATCH] Timer: Available in Kokkos namespace\n\nUpdated unit tests, performance tests, and examples using\nKokkos::Impl::Timer to use Kokkos::Timer", "idx": 1223} {"target": 0, "func": "[PATCH] Update paper.md\n\nSpecify SpMV performance results.", "idx": 1048} {"target": 0, "func": "[PATCH] Layout on the bibliography page\nMIME-Version: 1.0\nContent-Type: text/plain; charset=UTF-8\nContent-Transfer-Encoding: 8bit\n\nWhen having a bit a long citation description, the description runs, in the HTML output on the bibliography page, into 3 or more lines where the 3rd and following lines continue underneath the citation number like:\n```\n [1] Eric Berberich, Arno Eigenwillig, Michael Hemmer, Susan Hert, Lutz Kettner, Kurt Mehlhorn, Joachim Reichel, Susanne Schmitt, Elmar Sch\u00f6mer, and Nicola Wolpert. Exacus: Efficient and exact\n algorithms for curves and surfaces. In Gerth S. 
Brodal and Stefano Leonardi, editors, 13th Annual European Symposium on Algorithms (ESA 2005), volume 3669 of Lecture Notes in Computer Science,\npages 155\u2013166, Palma de Mallorca, Spain, October 2005. European Association for Theoretical Computer Science (EATCS), Springer.\n```\n\nThe example was found in e.g. https://doc.cgal.org/latest/Algebraic_foundations/citelist.html\n\n- corrected the \"overflow\"\n- made the citation number right aligned", "idx": 1375} {"target": 0, "func": "[PATCH] moved MeshBase::contract() up to Mesh. Unfortunately, there\n is no good way to make MeshBase::delete_elem() efficient, so the old\n implementation of MeshBase::contract() was (potentially) O(n_elem^2), and\n consumed approximately 20 percent of the runtime in ex10. This new\n implementation exploits the fact that the elements are stored in a vector\n (which is why it was moved up to the Mesh class) and is linear in the number\n of elements. The new implementation is less that 1 percent of the run time\n in ex10.", "idx": 1030} {"target": 0, "func": "[PATCH] Use much less PaddedRVecVector and more ArrayRef of RVec\n\nOnly code that handles allocations needs to know the concrete type of\nthe container. In some cases that do need the container type,\ntemplating on the allocator will be needed in future, so that is\narranged here. This prepares for changing the allocator for state->x\nso that we can use one that can be configured at run time for\nefficient GPU transfers.\n\nAlso introduced PaddedArrayRef to use in code that relies on the\npadding and/or alignedness attributes of the PaddedRVecVector. 
This\nkeeps partial type safety, although a proper implementation of such a\nview should replace the current typedef.\n\nHad to make some associate changes to helper functionality to\nuse more ArrayRef, rather than rely on the way rvec pointers could\ndecay to real pointers.\n\nUsed some compat::make_unique since that is better style.\n\nChange-Id: I1ed3feb016727665329e919433bece9773b46969", "idx": 923} {"target": 0, "func": "[PATCH] AABB tree: added internal search tree (CGAL K-orth search\n tree) to accelerate the projection queries. substantial complication... may\n be improved.", "idx": 145} {"target": 0, "func": "[PATCH] enable GPU sharing among tMPI ranks\n\nIt turns out that the only issue preventing sharing GPUs among thread-MPI\nthreads was that when the thread arriving to free_gpu() first destroys\nthe context, it is highly likely that the other thread(s) sharing a GPU\nwith this are still freeing their resources - operation which fails as\nsoon as the context is destroyed by the \"fast\" thread.\n\nSimply placing a barrier between the GPU resource freeing and context\ndestruction solves the issue. However, there is still a very unlikely\nconcurrency hazard after CUDA texture reference updates (non-bonded\nparameter table and coulomb force table initialization). 
To be on the\nsafe side, with tMPI a barrier is placed after these operations.\n\nChange-Id: Iac7a39f841ca31a32ab979ee0012cfc18a811d76", "idx": 162} {"target": 0, "func": "[PATCH] Kokkos::Experimental View refactoring #define\n KOKKOS_USING_EXPERIMENTAL_VIEW to alias Kokkos::View to\n Kokkos::Experimental::View.\n\nRevise core unit and performance tests to execute correctly.", "idx": 262} {"target": 0, "func": "[PATCH] - locate() cleanups, performance impact unnoticeable.", "idx": 1421} {"target": 0, "func": "[PATCH] Evaluating Tom Forsyth's fast mesh reordering.", "idx": 573} {"target": 0, "func": "[PATCH] Fix performance regression in KOKKOS package", "idx": 449} {"target": 0, "func": "[PATCH] Adding Changelog for Release 3.2.01\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 3.2", "idx": 533} {"target": 0, "func": "[PATCH] More performance enhancements for the transformation --\n almost done now!", "idx": 738} {"target": 0, "func": "[PATCH] Adjust spin wait in attempt to address performance issue #935\n Bring Windows code back in, just in case...", "idx": 724} {"target": 0, "func": "[PATCH] Use fast returns in md5 computation\n\nThis logic is easier to follow than a recycled ret integer\n\nChange-Id: Idc47cdae3d0453f1645a82582b13c86aa8eadcb8", "idx": 928} {"target": 0, "func": "[PATCH] Remove BoundaryInfo::n_boundary_ids(elem, side) copy/paste\n job.\n\nIt was basically an exact copy of the now-deprecated\nBoundaryInfo::boundary_ids(elem, side). Reusing the set filling code\nmight be *slightly* less efficient, but I think the savings in\nmaintainability and readability is worth it...", "idx": 539} {"target": 0, "func": "[PATCH] Move calcgrid.* to fft/\n\nOne more file out of gmxlib/. 
These are related to selecting an FFT grid\nsize, and contain some numbers coming from performance measurements, so\nfft/ should be a natural place.\n\nChange-Id: I386965665a92bc47d4c0c3ca0201a6a4b13b5886", "idx": 1568} {"target": 0, "func": "[PATCH] bug fix for fast box intersection in presense of the Infi Box", "idx": 14} {"target": 0, "func": "[PATCH] Completed the base fast SVD case for which the number of\n points is less than the dimension...", "idx": 1512} {"target": 0, "func": "[PATCH] Make jenkins own-fftw verify use local tarball\n\nWe should not bombard the FFTW servers with downloads, plus these can be\nrelatively slow too, so use our local ftp server instead.\n\nChange-Id: Id6ccebf0ac1ae6410cd4f7f13f2ff76d275af5d2", "idx": 568} {"target": 0, "func": "[PATCH] Handle a case where the step size ends up being 0 but the\n gradient is not yet the minimum gradient size. Maybe it is a little slow\n (elementwise comparison is not incredibly fast)...", "idx": 65} {"target": 0, "func": "[PATCH] Serial-only atomics implementation\n\n[#607] [#549]\nA few details:\n - Accepting volatile pointers was necessary\n for compatibility with existing calls which\n pass in volatile pointers, hence the const_cast\n - Special implementations of atomic_increment\n were needed to get equal performance in the\n one application I tested (it was doing its\n own serial special cases before).\n - Compilers have a harder time matching templates\n as opposed to overloads, so some call sites\n had to be modified to specify the scalar\n type explicitly", "idx": 494} {"target": 0, "func": "[PATCH] Use fast PerfLog methods in PerfItem\n\nThis speeds up our LOG_SCOPE usage several fold.", "idx": 1249} {"target": 0, "func": "[PATCH] Add implementation of the sum squared error performance\n function.", "idx": 1451} {"target": 0, "func": "[PATCH] In debug mode it makes no sense to run a performance test", "idx": 403} {"target": 0, "func": "[PATCH] Adjust the performance function test; use the 
simplified\n performance function.", "idx": 1454} {"target": 0, "func": "[PATCH] Override any OpenCL fast math JIT settings for\n born/coul/wolf{/cs}/gpu to resolve numerical deviations seen with some OpenCL\n implementations.", "idx": 543} {"target": 0, "func": "[PATCH] Draw points with line width 0, otherwise it is too slow when\n we have 1mio points", "idx": 1490} {"target": 0, "func": "[PATCH] Removed init_state\n\nMade a simple zero-initializing constructor for t_state and the\nstructs of some of its members. Called them classes. Later, we might\nprefer to require explicit initialization with actual values, so that\ntools can detect the use of uninitialized values and find our bugs,\nbut for now having a constructor is a useful initial step in that\ndirection.\n\nExtracted some new functions that cover some of the incidental\nfunctionality that was also present in init_state.\n\nMade state.lambda a std::array, thereby removing the need to consider\nresizing it, and converted client code to be passed an ArrayRef rather\nthan hard-code the name of the specific container. This caters for\nconvenient future refactoring of the underlying storage, and sometimes\nneeding to implicitly know what the size of the container is.\n\nPassing an ArrayRef by value is consistent with the CppCoreGuidelines,\nbut has potential for performance impact. Doing this means that a\ncaller pushes onto the stack a copy of the object (containing two\npointers), rather than previous idioms such as pointer + size, or\npointer + implicit constant size from an enum, or pointer + implicit\nsize in some other parameter. This could mean an extra argument is\npushed to the stack for the function call, compared with the\nalternatives of pushing a pointer to data, pointer to container, or\npointer to ArrayRef. In all cases, the caller has to load the pointer\nvalue via an offset that is known to the compiler, so that aspect is\nprobably irrelevant. 
So, we would probably prefer to avoid calling\nfunctions that take such parameters in a tight loop, or where multiple\ncontainers share a common size. But the uses in this patch seem to be\nof sufficiently high level to be an acceptable trade of possible\nperformance for improved maintainability.\n\nChange-Id: I17e7d83cfc89566f76fa9949c425b950ad6aef62", "idx": 1447} {"target": 0, "func": "[PATCH] Non-reduction boxloops done\n\nThe non-reduction boxloops are all in and pass the struct tests.\nPerformance is VERY slow, but this may just be due to the machine\nI am running on. Reduction boxloops are in progress.", "idx": 690} {"target": 0, "func": "[PATCH] Reword CPU/GPU imbalance notes\n\nChanges text in CPU/GPU imbalance from \"performance loss\" to \"wasting\nresources\", since in some cases one can not get higher performance.\nReplaced \"GPU has less load\" by \"CPU has more load\".\nRemoved hint to reduce the cut-off, since one often can not do this.\nNote that with CUDA all theses notes are never printed, since we no\nlonger have timings on (by default), unlike with OpenCL.\n\nFixes #2253\n\nChange-Id: Ib4a9752ad27c1cd2a3cd751a217249694a56d3b7", "idx": 678} {"target": 0, "func": "[PATCH] Distinguished mutexes from semaphores. The distinction is\n useful because the linux implementation of sem_post() in unnecessarily slow\n when semaphores are used for mutual exclusion. This change made spinlocks\n messier to implement, so I excised them.", "idx": 1550} {"target": 0, "func": "[PATCH] Worked on the performance of point location.", "idx": 547} {"target": 0, "func": "[PATCH] Added fast square tridiagonalization for lower-triangular\n storage.", "idx": 823} {"target": 0, "func": "[PATCH] Fixed essential dynamics / flooding group PBC serial\n\nIn former versions, the PBC representation of essential dynamics /\nflooding group atoms could be incorrect in serial runs if the ED group\ncontained more than a single molecule. 
In multi-molecule cases, the required\nsteps to choose the correct PBC image in communicate_group_positions()\ntherefore need to be performed also in serial runs. Since the PBC representation\ncan only change in neigborsearching steps, we only need to check the\nshifts then. In parallel, NS is signalled by the bUpdateShifts\nvariable, which is set in dd_make_local_ed_indices(). The latter\nfunction is however not called in serial runs; but still we can pass\nthe bNS boolean to do_flood() to signal the NS status. For essential\ndynamics, unfortunately, since do_edsam() is called from constrain(), there\nis no information about the NS status at that point. Until someone\ncomes up with a better idea, we therefore do the PBC check in every step\nin serial essential dynamics - the performance impact will be negligible\nanyway.\n\nChange-Id: I86336a5e34131bdeac7e28f35b1ccb633450e54e", "idx": 1332} {"target": 0, "func": "[PATCH] reverted change to matrix product sparsity calculation: no\n one is interested in how fast you can calculate the wrong answer....", "idx": 1245} {"target": 0, "func": "[PATCH] Apply suggestions from code review\n\nEverything following are changes to the rcm implementation.\nRemove unneccessary consts from rcm declaration.\nDo not store adjacency matrix and degrees as members in the class.\nReplace explicit gko::vector with vector.\nreplace std::memcpy with std::copy_n.\nVarious documentation improvments.\nReverse description order of return paramters.\nMake IndexType explicit instead of auto.\nUse std::min_elelemt\nFix various spelling errors, typos, rewordings.\nRemoves the 'explicit' from the ExectorAllocators rebinding construtor.\nChanges occurences of array to vector, to save memory.\nMinor cleanup\nRemove the unnecessary test for the rcm adjacency matrix.\nAdd a test case for the correct rcm result.\nRewrite the assert_correct_permutation function to comparing with iota.\nMove test matrices to test class.\nImprove nested vector 
initialization.\nAllocate degrees inside rcm::generate.\nChange from goto to immediately evaluated lambda expression.\nRefactor loop body into an inline helper function.\nRelace some autos.\nReorganize includes to conform to include order.\nReplace autos with IndexType where necessary.\nReplace size_type with IndexType for num_vertices.\nRefactor for the number of levels to be more explicit.\nMake perm signal value -1.\nFactor out sort_by_degree in a generalized, fast small sorting function.\nMake level_processed a constexpr.\nIn the small_sort make some types explicit.\nWrap omp locks in a omp_mutex, use RAII guards.\n\nCo-authored-by: Tobias Ribizel ", "idx": 1370} {"target": 0, "func": "[PATCH] Added timers to test performance of MPI_File_sync,\n collection, and compression", "idx": 13} {"target": 0, "func": "[PATCH] BJP: Initial checkin of test programs to test performance of\n put and get operations using mirrored arrays.", "idx": 293} {"target": 0, "func": "[PATCH] issue #1871, #1310: inv efficient for SX, evaluatable for MX", "idx": 121} {"target": 0, "func": "[PATCH] Tetrahedral mesh\n\n- Displays the item after its creation without having to move the manipulated frame.\n\n- Enhances the performance when moving the manipulated frame.", "idx": 185} {"target": 0, "func": "[PATCH] Fixed FAST type comparison mismatch warning", "idx": 875} {"target": 0, "func": "[PATCH] code optimiz", "idx": 584} {"target": 0, "func": "[PATCH] fixed a horrendously inefficient minibatch implementation.\n now the cardinality-k-index-set sampler (sampling without replacement) is\n very efficient", "idx": 566} {"target": 0, "func": "[PATCH] added precompiler commands for performance tests", "idx": 1149} {"target": 0, "func": "[PATCH] More performance stats.", "idx": 575} {"target": 0, "func": "[PATCH] New bench for comparing sweep performance", "idx": 459} {"target": 0, "func": "[PATCH] Some performance problem. 
I am looking into it.", "idx": 1472} {"target": 0, "func": "[PATCH] #1044 disable some slow unittests by default", "idx": 1459} {"target": 0, "func": "[PATCH] Create a AVX512 enabled version of DGEMM\n\nThis patch adds dgemm_kernel_4x8_skylakex.c which is\n* dgemm_kernel_4x8_haswell.s converted to C + intrinsics\n* 8x8 support added\n* 8x8 kernel implemented using AVX512\n\nPerformance is a work in progress, but already shows a 10% - 20%\nincrease for a wide range of matrix sizes.", "idx": 73} {"target": 0, "func": "[PATCH] Due to decrease in code performance, calculation of pressure\n tensor for W12, W13, W23 is commented out and in log file only the diameter\n of pressure tensor W11, W22, W33 will be printed. In case if those\n calculation are required, they need to be uncommented.", "idx": 522} {"target": 0, "func": "[PATCH] Fixed FAST unit tests", "idx": 108} {"target": 0, "func": "[PATCH] Commit the new version of the static filter. Too slow for the\n moment.", "idx": 189} {"target": 0, "func": "[PATCH] Jeff: same as 6.0 branch. should not include ARMCI build\n system headers. support for T3D and T3E, especially something which is only\n for performance, not function, is silly. s/TARGET/NWCHEM_TARGET/ for the\n RTDB case. i don't see any reason for it in the first place but at least now\n the build system is cleaner.", "idx": 1348} {"target": 0, "func": "[PATCH] Remove BoundaryInfo::n_edge_boundary_ids(elem, side)\n copy/paste job.\n\nIt was basically an exact copy of the now-deprecated\nBoundaryInfo::edge_boundary_ids(elem, side). Reusing the set filling code\nmight be *slightly* less efficient, but I think the savings in\nmaintainability and readability is worth it...", "idx": 646} {"target": 0, "func": "[PATCH] use MinMaxTuple as return value of minmax\n\nThis might slow things down.\n\nSigned-off-by: Panagiotis Cheilaris ", "idx": 182} {"target": 0, "func": "[PATCH] first short a performance matrix stuff. 
Papi interface so\n far.", "idx": 780} {"target": 0, "func": "[PATCH] Coordinate Propagators\n\nThis change introduces the propagator element, which, thanks to templating,\ncan cover the different propagation types used in NVE MD. The combination\nof templating, static functions, and having only the inner-most operations in\nthe static functions allows to have performance comparable to fused update\nelements while keeping easily reordable single instructions.\n\nNote that the two velocity update functions are only necessary to allow\nexact replication of the legacy do_md code for both md and md-vv. The\nparentheses or the lack thereof lead to numerical errors which build up very\nrapidly to make the (very strict) integrator comparison test fail. Relaxing this\ncondition will make getting rid of one of the two variants possible.\n\nAn interesting further development would be to unify the OpenMP loops for\ncoordinate propagation and constraining by using loops over constraint\ngroups in both cases.\n\nChange-Id: I1a1f66f1efe63c791ef3fe51ce2f99da3367adca", "idx": 505} {"target": 0, "func": "[PATCH] Add --skip-partitioning command line option\n\nThis makes testing tweaks to that option easier, and might be useful\nlater when doing performance testing.", "idx": 1365} {"target": 0, "func": "[PATCH] Add basic list of performance factors to tutorials.", "idx": 528} {"target": 0, "func": "[PATCH] Sparse unordered with duplicates: Better vectorization for\n tile bitmaps.\n\nWhen computing tile bitmaps, there are a few places where we were not\nvectorizing properly and it caused performance impacts on some customer\nscenario. 
Fixing the use of covered/overlap in the dimension class to\nenable proper vectorization.", "idx": 1559} {"target": 0, "func": "[PATCH] There is no need to switch off ITERATOR_DEBUGGING as the\n performance problem was somewhere else", "idx": 865} {"target": 0, "func": "[PATCH] Jan 29 1999\tCalls to Robert's fast esp routines", "idx": 465} {"target": 0, "func": "[PATCH] << operator for segments does clipping (QT advice) only for\n the segments that intersect the boundaries of the screen rectangle use the\n old x_real function because the new one is too slow in doing the\n transformation (use GMP if CGAL_USE_GMP is defined) we should document the\n old one too, it will never be removed.", "idx": 795} {"target": 0, "func": "[PATCH] Various cleanup in Kokkos SNAP, replacing verbose Kokkos\n MDRangePolicy and TeamPolicy types with simpler `using` definitions. No\n performance implications.", "idx": 617} {"target": 0, "func": "[PATCH] normalize_border is there only for performance", "idx": 103} {"target": 0, "func": "[PATCH] Cherry-picking changes on top of our current hash. There are\n things that hurt performance in laster mfem@master commits, not sure what\n yet.\n\nThis includes:\n- feature/artv3/cusparse-Spmv\n- feature/tomstitt/temp-mem-type\n- patches to tmop.cpp and tmop_tools.cpp to address memory issues\n- patch to mem_manager/device to use mfem's default allcator instead of\nthe host umpire one", "idx": 504} {"target": 0, "func": "[PATCH] Break apart update_constraints\n\nThere are four distinct kinds of work being done, and never was any\ncall to update_constraints doing all of them, so it's better to have a\ngroup of functions, each of which do one thing, and the relevant ones\ncalled. This also makes it simpler to express by returning fast that\nwhen we don't have constraints, we do nothing.\n\nMade the logic for whether this is a log or energy step match that of\nthe main MD loop. 
The old implementation may not have prepared for the\nlast step correctly when it was triggered by something other than the\nnsteps inputrec value.\n\nRemoved a commment mentioning iteration, which is a feature\nthat was removed a while ago.\n\nRemoved some ancient debug dump output.\n\nRefs #2423, #1793\n\nChange-Id: I21c10826721ddc9a79a33b1dc75971a20d0855d9", "idx": 181} {"target": 0, "func": "[PATCH] Fast Haswell CGEMM kernel", "idx": 837} {"target": 0, "func": "[PATCH] -Mvect option causes severe performance problems with\n R1.2.5.1 on Paragon", "idx": 582} {"target": 0, "func": "[PATCH] USER-DPD Kokkos: Remove the SSA's ALLOW_NON_DETERMINISTIC_DPD\n option. There was no measurable performance benefit to turning it on.", "idx": 1383} {"target": 0, "func": "[PATCH] thermophysicalModels::coefficientWilkeMultiComponentMixture:\n New Wilke mixing model for gaseous transport properties\n\nThe new generalised framework for thermophysical mixing models has allowed the\nefficient implementation of the useful combination for gases of coefficient\nmixing for thermodynamic properties with the Wilke model for transport\nproperties:\n\nDescription\n Thermophysical properties mixing class which applies mass-fraction weighted\n mixing to the thermodynamic coefficients and Wilke's equation to\n transport properties.\n\n Reference:\n \\verbatim\n Wilke, C. R. (1950).\n A viscosity equation for gas mixtures.\n The journal of chemical physics, 18(4), 517-519.\n \\endverbatim", "idx": 599} {"target": 0, "func": "[PATCH] Policy performance tests: Added test and sample scripts\n\nThis commit address Github issue #737\nRangePolicy and TeamPolicy tests with nested parallelism added for\nbenchmarking performance - e.g. 
compare master vs develop\n\npolicy_performance: add functor for parallel_scan", "idx": 419} {"target": 0, "func": "[PATCH] Fix conditional on when DtoH forces copy occur\n\nd2d4a50b4c636c203028c5bff311924ec15e7825 introduced performance\nregression with forces copied from device to host on each step.\nThis fixes the issue by reinstantiating proper condition on the\ncopy call.\n\nFixes #4001\nRefs #2608", "idx": 785} {"target": 0, "func": "[PATCH] Fix bug in polyline simplification: We had hardwired that we\n use Exact_predicates_tag which is slow for EPEC in particular with\n Quotient or leda::real\n\nWe determine the appropriate tag using Algebraic_structure_traits::Is_exact", "idx": 10} {"target": 0, "func": "[PATCH] Speed up a slow test case by using a smaller topology file", "idx": 680} {"target": 0, "func": "[PATCH] Back to bebug aabb slow problem", "idx": 1176} {"target": 0, "func": "[PATCH] completed adding of support point indices, made remark on\n buggy setup of problem in fast and exact case", "idx": 656} {"target": 0, "func": "[PATCH] Fix hardware topology detection for modern systems\n\nThis is a partial rewrite of our hardware topology detection\ncode to handle modern systems where we might not be allowed to\nrun on all threads present, where there might be cpu limits\nthat are lower than the total number of threads, or hybrid\nCPUs that contain combinations of performance and efficiency\ncores. 
In particular, it includes\n\n- The hwloc detection has been fixed to work for more systems,\n and we are better at properly separating internal logical\n cpu indices from OS-provided logical indices.\n- The cpuinfo code will properly handle the case where we\n are not allowed to run on some cores when detecting a simple\n topology.\n- When compiled without hwloc support, we can now also parse\n cpu topologies from Linux filesystems, which is important\n for non-x86 processors.\n- All detection layers properly handle the case where there is\n a cpuset mask that disallows some cpus from being used.\n- When available, we use linux cgroups (either v1 or v2) to\n detect cpu limits set e.g. in container environments, and use\n this to decide the number of threads rather than the total\n logical core count. This should avoid overloading runs\n in container environments, including our CI system.\n- We no longer assume that all sockets/cores are identical,\n which will commonly not be the case e.g. if slurm has set\n custom cpusets (so only those cpus are visible).", "idx": 250} {"target": 0, "func": "[PATCH] Reduce dataset sizes and number of iterations to accelerate\n tests.", "idx": 536} {"target": 0, "func": "[PATCH] Added Kokkos-like array datatype into RK4 and RHS in\n FixRXKokkos.\n\n- Created an Array class that provides stride access for operator[]\n w/o needing Kokkos views. 
This was designed to avoid the performance\n issues encountered with Views and sub-views throughout the RHS and\n ODE solver functions.", "idx": 781} {"target": 0, "func": "[PATCH] Fix pixel tests in fast kernels", "idx": 1101} {"target": 0, "func": "[PATCH] The split source is unnecessary and slow on some machines.", "idx": 147} {"target": 0, "func": "[PATCH] Revert \"suppress performance warning concerning an assertion\"\n\nThis reverts commit 580b65d8a5937b99c6206885a6b0504399923501.\n\nThe warning was already fixed by commit\n63ae26eb0ffa8460abddf6843345996f9e4de603.", "idx": 1193} {"target": 0, "func": "[PATCH] Add CUDA compiler support for CC 5.0\n\nWith CUDA 6.5 and later compute capability 5.0 devices are supported, so\nwe generate cubin and PTX for these too and remove PTX 3.5.\nThis change also removes explicit optimization for CC 2.1 where\nsm_20 binary code runs equally fast as sm_21.\n\nChange-Id: I5a277c235b873afb2d1b2b12b5db64b370f1bade", "idx": 1233} {"target": 0, "func": "[PATCH] Added the option to set a close_to_point tolerance in\n PointLocatorTree and MeshFunction, as a fallback option.\n\nThis is helpful when we have numerical tolerance issues which can lead to points being outside a mesh within\nsome tolerance. Note that we use a linear search with close_to_point, so it's slow, but at least it's a robust\nbackup.", "idx": 130} {"target": 0, "func": "[PATCH] Sun Performance Library debugged and tested", "idx": 166} {"target": 0, "func": "[PATCH] Used EAF for all IO on T and cleaned up IO operations so that\n reads/writes on T happen in oV blocks and all reads of T go thru one routine\n (mp2_read_tiajb). Put in performance stats for the basic steps of the\n gradient. Use screening of the density in the non-seperable part of the\n gradient. 
Reduced the tol2e from 10^-12 to 10^-9/10 but it really needs to\n be an input parameter.", "idx": 1401} {"target": 0, "func": "[PATCH] - Experiment with compacting the in_conflict_flag showed 7%\n performance drop. So it's probably not worth it in practice.", "idx": 267} {"target": 0, "func": "[PATCH] Updated article reference. Added configuration flags to\n support altivec on powerpc. Updated configuration files from gnu.org Removed\n the fast x86 truncation when double precision was used - it would just crash\n since the registers are 32bit only. Added powerpc altivec innerloop support\n to fnbf.c Removed x86 assembly truncation when double precision is used - the\n registers can only handle 32 bits. Dont use general solvent loops when we\n have altivec support in the normal ones.", "idx": 707} {"target": 0, "func": "[PATCH] PetscMatrix::print_personal now prints to file when requested\n (rather than just cout). The implementation is not particularly efficient\n (since print_personal gets passed an ostream) but it does work. And how\n efficient do you need to be if you are printing out matrices anyway?", "idx": 1510} {"target": 0, "func": "[PATCH] Core: fix another issue with the memory tracking, introduced\n recently\n\nThere was a bug introduced when solving the performance issues with\nsubviews. 
It seemed not to have triggered any errors in either Trilinos\ntests or Kokkos tests, but did crash Nalu.", "idx": 1296} {"target": 0, "func": "[PATCH] EA: test for ga_dgemm performance (N=400 1600 3200)", "idx": 1503} {"target": 0, "func": "[PATCH] implemented plain-C SIMD macros for reference\n\nThis is mainly code reorganization.\nAdds reference plain-C, slow, arbitrary width SIMD for testing.\nAdds FMA for gmx_calc_rsq_pr.\nAdds generic SIMD acceleration (also AVX or double) for pme solve.\nMoved SIMD vector operations to gmx_simd_vec.h\nThe math functions invsqrt, inv, pmecorrF and pmecorrV have been\ncopied from the x86 specific single/double files to generic files\nusing the SIMD macros from gmx_simd_macros.h.\nMoved all architecture specific nbnxn_kernel_simd_utils code to\nseparate files for each SIMD architecture and replaced all macros\nby inline functions.\nThe SIMD reference nbnxn 2xnn kernels now support 16-wide SIMD.\nAdds FMA in nbnxn kernels for calc_rsq and Coulomb forces.\n\nRefs #1173\n\nChange-Id: Ieda78cc3bcb499e8c17ef8ef539c49cbc2d6d74d", "idx": 551} {"target": 0, "func": "[PATCH] fast point location should work now", "idx": 1523} {"target": 0, "func": "[PATCH] rbOOmit change: If a CouplingMatrix is attached to an\n RBConstruction then we should only increment the designated blocks in matrix\n assembly, otherwise matrix assembly is extremely slow since we go outside the\n sparsity pattern.", "idx": 1368} {"target": 0, "func": "[PATCH] Rebuild the OpenMP runtime library with Clang\n\nTo get rid of the warning:\n\nclang-9: warning: No library 'libomptarget-nvptx-sm_70.bc' found in the default\nclang lib directory or in LIBRARY_PATH. Expect degraded performance due to no\ninlining of runtime functions on target devices. 
[-Wopenmp-target]", "idx": 133} {"target": 0, "func": "[PATCH] Reducers: fix performance issue #680\n\nAdding non-volatile join helps.", "idx": 1045} {"target": 0, "func": "[PATCH] chasing the performance issue on the aump2 QA test", "idx": 651} {"target": 0, "func": "[PATCH] performance are getting better with grid_nbfm", "idx": 511} {"target": 0, "func": "[PATCH] Make stepWorkload.useGpuXBufferOps flag consistent\n\nOn search steps we do not use x buffer ops, so the workload flag should\ncorrectly reflect that.\n\nAlso slightly refactored a conditional block to clarify the scope of\nworkload flags.\n\nNote that as a side-effect of this change, coordinate H2D copy will be\ndelayed from the beginning of do_force() to just before update on search\nsteps when there are no force tasks that require it (i.e. without PME).\nWhile this is not ideal for performance, the code is easier to reason\nabout.\n\nRefs #3915 #3913 #4268", "idx": 93} {"target": 0, "func": "[PATCH] remove peak_memory_sizer that uses Taucs, slow computation\n and is not working on all platforms.\n\nBy default poisson now uses Eigen is available and Taucs otherwise", "idx": 337} {"target": 0, "func": "[PATCH] Move .f to .F - Including log of .f\n\nRCS file: /msrc/proj/mss/nwchem/src/ddscf/fast/cheby.f,v\nWorking file: cheby.f\nhead: 1.4\nbranch:\nlocks: strict\naccess list:\nsymbolic names:\n release-4-5-patches: 1.4.0.8\n release-4-5: 1.4\n bettis: 1.4\n release-4-1-patches: 1.4.0.6\n release-4-1: 1.4\n release-4-0-1: 1.4\n release-4-0-patches: 1.4.0.4\n release-4-0: 1.4\n v3-3-1: 1.4\n release-3-3-patches: 1.4.0.2\n release-3-3: 1.4\nkeyword substitution: kv\ntotal revisions: 4; selected revisions: 4\ndescription:\n----------------------------\nrevision 1.4\ndate: 1999/07/29 00:53:56; author: d3e129; state: Exp; lines: +3 -0\nadded cvs ID tags\n----------------------------\nrevision 1.3\ndate: 1999/05/10 18:43:30; author: d3g681; state: Exp; lines: +37 -14\nmajor changes to improve precision, speed, 
stability; fully dynamic FMM to permit deep trees; rtdb input parameters; this is the version used for the paper\n----------------------------\nrevision 1.2\ndate: 1999/01/11 18:18:49; author: d3g681; state: Exp; lines: +250 -124\nbetter linear algebra in the fitting, more varied fit routines, removed dead code\n----------------------------\nrevision 1.1\ndate: 1999/01/01 04:57:29; author: d3g681; state: Exp;\nFirst phase of integrating the fast coulomb code into nwchem. The nested grid evaluation of the density, fourier interpolation, FMM, and fourier solution of the free space Poisson equation", "idx": 1115} {"target": 0, "func": "[PATCH] Fixing bug in build system noticed by Jeff. This should\n improve the performance of default builds", "idx": 291} {"target": 0, "func": "[PATCH] SIMD acceleration for F_FOURDIHS and F_PIDIHS\n\nSince these are computed by the RB and proper dihedral code for which\nwe already have SIMD acceleration, this acceleration comes \"for free\"\n(or rather, we have been stupid not to accelerate these before).\n\nChange-Id: I456af11c23fe1cb3749a889c5d92ec3ba06ab237", "idx": 25} {"target": 0, "func": "[PATCH] Keep track of old dof_indices distribution between\n processors; this makes it easier to construct an efficient send_list in\n System::project_vector.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@3796 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 82} {"target": 0, "func": "[PATCH] Add fast kernel version number", "idx": 1269} {"target": 0, "func": "[PATCH] Continue working on FParser JIT\n\n* Path fix for Linux\n* Add optimizer opcodes, fix bugs\n* Vastly expand JIT opcode support, fix AD bug... 
oops!\n* Add JIT support for if(a,b,c) control flow opcodes\n* Make compares fuzzy using Fparser's epsilon\n* Fix compile error when fpoptimizer is disabled\n* Enable support for more Value_t types\n* Fix stack recover bug, add test/example\n* Add performance test code\n* Fix another AD bug involving cDup\n* Tweak/expand tests\n* Put JIT cache files into subdirectory", "idx": 229} {"target": 0, "func": "[PATCH] rhoTabulated, hTabulatedThermo, tabulatedTransport: New (p,\n T) tabulated thermophysical functions\n\nThis is a prototype implementation of (p, T) tabulated density, enthalpy,\nviscosity and thermal conductivity using a uniform table in pressure and\ntemperature for fast lookup and interpolation. The standard Newton method is\nused for h->T inversion which could be specifically optimised for this kind of\ntable in the future.", "idx": 326} {"target": 0, "func": "[PATCH] Add helper to reuse generated TPR files in testing\n\nUsed static class members in GoogleTest and provided option in\ntestfilemanager to allow file path specification before test case is\nstarted.\n\nThis should speed up some of the test cases that have been slow due to\nrepeated calls to grompp.\n\nChange-Id: I50e29d04550d78f2324e3665e903d45515464298", "idx": 1055} {"target": 0, "func": "[PATCH] Attempt to accelerate the test a little bit.", "idx": 370} {"target": 0, "func": "[PATCH] Make OverlappingCouplingFunctor threadable\n\nThis *particular* fix is probably efficient enough for use, but when we\nfix #2334 we should change this to use (and get test coverage of) the\nnew GhostingFunctor::clone() API instead.", "idx": 939} {"target": 0, "func": "[PATCH] Use much smaller GroupLens dataset.\n\nThis helps keep the repository a bit smaller and should accelerate some tests.", "idx": 1202} {"target": 0, "func": "[PATCH] Concerning #100: implementing fast dependsOn for SX and MX", "idx": 1137} {"target": 0, "func": "[PATCH] Shift transition to multithreading towards larger matrix\n sizes\n\nSee 
#1886 and JuliaRobotics issue 500. trsm benchmarks on Haswell and Zen showed that with these values performance is roughly doubled for matrix sizes between 8x8 and 14x14, and still 10 to 20 percent better near the new cutoff at 32x32.", "idx": 835} {"target": 0, "func": "[PATCH] Allow to do gemv and ger buffer allocation on the stack\n\nger and gemv call blas_memory_alloc/free which in their turn\ncall blas_lock. blas_lock create thread contention when matrices\nare small and the number of thread is high enough. We avoid\ncall blas_memory_alloc by replacing it with stack allocation.\nThis can be enabled with:\nmake -DMAX_STACK_ALLOC=2048\nThe given size (in byte) must be high enough to avoid thread contention\nand small enough to avoid stack overflow.\n\nFix #478", "idx": 1094} {"target": 0, "func": "[PATCH] Restore wallcycle subcounter name to \"Bonded F\"\n\nThis makes it easier to check for performance behaviour\n\nChange-Id: Icb67bd75ee58fe280beb9f1cb123d0eeca229f09", "idx": 631} {"target": 0, "func": "[PATCH] Adding Changelog for Release 2.7.00\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 2.7", "idx": 963} {"target": 0, "func": "[PATCH] reorder folders, so that the fast tests are run first", "idx": 964} {"target": 0, "func": "[PATCH] follow up of 2a71e019: VC performance warning", "idx": 391} {"target": 0, "func": "[PATCH] Identity: special type derived from SphericalTensor to\n provide the concept of identity (I)\n\nAllows efficient operators to be defined for the interaction between\ntypes and the equivalent identity.", "idx": 1201} {"target": 0, "func": "[PATCH] Initial support for SkylakeX / AVX512\n\nThis patch adds the basic infrastructure for adding the SkylakeX (Intel Skylake server)\ntarget. 
The SkylakeX target will use the AVX512 (AVX512VL level) instruction set,\nwhich brings 2 basic things:\n1) 512 bit wide SIMD (2x width of AVX2)\n2) 32 SIMD registers (2x the number on AVX2)\n\nThis initial patch only contains a trivial transformation of the Haswell SGEMM kernel\nto AVX512VL; more will follow later but this patch aims to get the infrastructure\nin place for this \"later\".\n\nFull performance tuning has not been done yet; with more registers and wider SIMD\nit's in theory possible to retune the kernels but even without that there's an\ninteresting enough performance increase (30-40% range) with just this change.", "idx": 1180} {"target": 0, "func": "[PATCH] Option to add temporary points on a far sphere before\n insertion\n\nHelps to reduce contention on the infinite vertex\nBut removing those points in the end takes time\nso it's only worth it when points lie on a surface.", "idx": 874} {"target": 0, "func": "[PATCH] 1 : Store performance function 2 : Pass error into Error\n function 3 : Pass network into Error function", "idx": 909} {"target": 0, "func": "[PATCH] Implement OpenCL support\n\nStreamComputing (http://www.streamcomputing.eu) has implemented the\nshort-ranged non-bonded interaction acceleration features previously\naccelerated with CUDA using OpenCL 1.1. Supported devices include\nGCN-based AMD GPUs and NVIDIA GPUs.\n\nCompilation requires an OpenCL SDK installed. This is included in\nthe CUDA SDK in that case.\n\nThe overall project is not complete, but Gromacs runs correctly on\nsupported devices. It only runs fast on AMD devices, because of a\nlimitation in the Nvidia driver. A list of known TODO items can be\nfound in docs/OpenCLTODOList.txt. 
Only devices with a warp/wavefront\nsize that is a multiple of 32 are compatible with the implementation.\n\nKnown issues include that tabulated Ewald kernels do not work (but the\nanalytical kernels are on by default, as with CUDA), and the blocking\nbehaviour of clEnqueue in Nvidia drivers means no overlap of CPU and\nGPU computation occurs. Concerns about concurrency correctness with\ncontext management, JIT compilation, and JIT caching means several\nfeatures are disabled for now. FastGen is enabled by default, so the\nJIT compilation will only compile kernels needed for the current\nsimulation.\n\nThere is some duplication between the two GPU implementations, but\nthe active development expected for both of them suggests it is\nnot worthwhile consolidating the implementations more closely.\n\nChange-Id: Ideaf16929028eb60e785feb8298c08e917394d0f", "idx": 628} {"target": 0, "func": "[PATCH] Fixes MSVC warnings. Bugfix in gmx_density in single\n precision\n\nDisables some unhelpful MSVC warnings for:\n* forcing value to bool (C4800)\n performance warning: not an issue for our code\n* \"this\" in initializer list (C4355)\n level 4 warning (informational) - shouldn't be shown at level 3\n* deprecated (posix, secure) functions (C4996)\n won't be removed soon - so not helpful\n\nChange-Id: I7ea62f88f687f45e169244ed60025c7c7d42f237", "idx": 1239} {"target": 0, "func": "[PATCH] Moving coord_string from returning a std::string to\n std::string_view. 
(#2704) (#2707)\n\nThe coord_string function is used in a lot of performance critical paths.\nMoving it to return a string_view as none of these paths benefit from\nmaking a copy of the value.", "idx": 956} {"target": 0, "func": "[PATCH] New thread_mpi library: waits now yield to the OS scheduler\n without significant performance penalty.", "idx": 613} {"target": 0, "func": "[PATCH] Separate Windows ci(gh-action) workflow and some improvs\n\nSplitting the windows ci job into a separate workflow enables the\nci to re-run windows specific jobs independent of unix jobs.\n\nUpdated Ninja dependency to 1.10.2 fix release in all ci(gh-actions)\n\nRefactored boost dependency to be installed via packages managers as\nGitHub Actions is removing pre-installed versions from March 8, 2021\n\nUpdate VCPKG hash to newer version to enable fast and better ports.\n\n(cherry picked from commit 58573eda4ded71fe4e0be6305a6f71386d175d12)", "idx": 879} {"target": 0, "func": "[PATCH] Finished initial implementation for arbitrary sizes atomic\n operations\n\nThis implements atomic operations for arbitrarily sized objects.\nThe implementation uses a hash table approach, where a lock is set\nbased on a hash of the memory address of the object for which an\natomic operation should be performed.\nInitial performance results indicate that it is comparable to\nother non-native atomics (i.e. CAS loops with casting to integer\ntypes).\nThe commit implements the full set of supported atomics from\nKokkos and it works in all currently existing execution spaces.\nThe hashtables are static sized global arrays.\n\nNote: this commit requires relocatable-device-code being\nenabled for Cuda.", "idx": 485} {"target": 0, "func": "[PATCH] Domain decomposition and PME load balancing for modular\n simulator\n\nThis change introduces two infrastructure elements responsible for\ndomain decomposition and PME load balancing, respectively. 
These\nencapsulate function calls which are important for performance, but\noutside the scope of this effort. They rely on legacy data structures\nfor the state (both) and the topology (domdec).\n\nThe elements do not implement the ISimulatorElement interface, as\nthe Simulator is calling them explicitly between task queue population\nsteps. This allows elements to receive the new topology before\ndeciding what functionality they need to run.\n\nThis commit is part of the commit chain introducing the new modular\nsimulator. Please see docs/doxygen/lib/modularsimulator.md for details\non the chosen approach. As the elements of the new simulator cannot all\nbe introduced in one commit, it might be worth to view Iaae1e205 to see\na working prototype of the approach.\n\nChange-Id: I1be444270e79cf1391f5a228c8ce3a9934d92701", "idx": 588} {"target": 0, "func": "[PATCH] committed gmx_random.c and .h written by Erik Lindahl; I have\n added an extremely fast tabulated gaussian random number generator", "idx": 868} {"target": 0, "func": "[PATCH] Change TopologyInformation implementation\n\nThis changes TopologyInformation so that we will be able to use it\nalso in the legacy tools, providing a better migration path for them,\nas well as making progress to removing t_topology and a lot of calls\nto legacy file-reading functions.\n\nIt can now lazily build and cache atom and expanded topology data\nstructures, re-using the gmx_localtop_t type (intended for use by the\nDD code for domain-local topologies). The atoms data structure can\nalso be explicitly copied out, so that tools who need to modify it can\ndo so without necessarily incurring a performance penalty. All these\nare convenient for tools to use.\n\nThe atom coordinate arrays are now maintained as std::vector, which\nmight want a getter overload to make rvec * for the legacy tools.\n\nAdded tests that the reading behaviour for various kinds of inputs is\nas expected. 
Converted lysozyme.gro to pdb, added a 'B' chain ID,\ngenerated a .top (which needed an HG for CYS) so updated pdb and gro\naccordingly. Some sasa test refdata needed fixing for that minor\nchange.\n\nProvided a convenience overload of gmx_rmpbc_init that takes\nTopologyInformation as input, as this will frequently be used by\ntools.\n\nExtended getTopologyConf to also return velocities, which will\nbe needed by some tools.\n\nAdapted the trajectoryanalysis modules to use the new approach, which\nis well covered by tests.\n\nRefs #1862\n\nChange-Id: I2f43e62bc2d97f5e654f15c6e474b9b71d7106ec", "idx": 1478} {"target": 0, "func": "[PATCH] Removed KV (key-value) functionality. We will add it back in\n a few months in a much more efficient way. (#1415)", "idx": 1064} {"target": 0, "func": "[PATCH] reactingEulerFoam: Un-templated interface composition models\n\nThe recent field-evaluation additions to basicSpecieMixture means that\nthe interface composition models no longer need knowledge of the\nthermodynamic type in order to do efficient evaluation of individual\nspecie properties, so templating on the thermodynamics is unnecessary.\nThis greatly simplifies the implementation.", "idx": 120} {"target": 0, "func": "[PATCH] Code for measuring performance presented in user manual and\n associated data", "idx": 1344} {"target": 0, "func": "[PATCH] Use vector in atoms2md instead of pointer\n\nAt all call sites for atoms2md the underlying vector was cast to an int pointer.\nThis change makes it easier to inspect the values passed to atoms2md in a\ndebugger while incurring no performance penalty.", "idx": 509} {"target": 0, "func": "[PATCH] Implemented basic serial 2D fast diagonalization", "idx": 1312} {"target": 0, "func": "[PATCH] Remove support for sparse writes in dense arrays. 
(#2504)\n\nSupporting sparse writes in dense arrays caused performance issues and\nsince customer use of the feature is infrequent, the decision was made\nto remove support.", "idx": 321} {"target": 0, "func": "[PATCH] Add support for Hygon Dhyana processor\n\nThis change adds hardware detection and related task assignment\nheuristics support for the Hygon Dhyana CPUs.\n\nChengdu Haiguang IC Design Co., Ltd (Hygon) is a Joint Venture\nbetween AMD and Haiguang Information Technology Co.,Ltd., aims\nat providing high performance x86 processor for China server\nmarket. Its first generation processor codename is Dhyana, which\noriginates from AMD technology and shares most of the architecture\nwith AMD's family 17h, but with different CPU Vendor ID (\"HygonGenuine\")\n/Family series number (Family 18h).\n\nMore details can be found on:\nhttp://lkml.kernel.org/r/5ce86123a7b9dad925ac583d88d2f921040e859b.1538583282.git.puwen@hygon.cn\n\nChange-Id: Ic91b032e69dfc13abad3fbfe6ab5e4f0e57fc7c0", "idx": 72} {"target": 0, "func": "[PATCH] Reduce number of points to accelerate test.", "idx": 538} {"target": 0, "func": "[PATCH] Shape smoothing: some comments added to accelerate matrix\n construction. Konstantinos: this is what is so slow, not the solver!", "idx": 602} {"target": 0, "func": "[PATCH] Converted iir, fir, fftconvolve to async calls\n\nAdded eval, sync statements to orb, fast to make them work with\ntheir asynchronous counter parts. Currently, one test of ORB is failing.\nWill fix it later.", "idx": 732} {"target": 0, "func": "[PATCH] Use importlib_resources in Python 3.6 images.\n\nPython 3.7 adds importlib.resources to the standard library, which\nprovides an efficient built in alternative to pkg_resources.\nBackported functionality is available in the importlib_resources\npackage. 
We should add it to our Docker images to allow testing new\nfeatures while we still officially support Python 3.6.\n\nSee also issue #2961", "idx": 1549} {"target": 0, "func": "[PATCH] Add CUDA nvcc >=7.0 support\n\nWith CUDA 7.x, there is a few % performance benefit to using sm_52\narch as target instead of JIT-ed compute_50, mostly relevant with\nthe newly released v7.5 (as v7.0 has other regressions which make it\nslower).\n\nThis change adds a single new target architecture (5.2) and changes\nthe virtual architecture included in the binary from 5.0 to 5.2 with\nnew enough nvcc to make 5.1.x versions future-proof when new hardware is\nreleased.\n\nChange-Id: I062cc48a151da3ab15b0508f4ebd59d95880ae9a", "idx": 112} {"target": 0, "func": "[PATCH] aabb tree: added performance section [to be completed with\n distance queries and curves]", "idx": 1147} {"target": 0, "func": "[PATCH] added reduce operation (op=sum) in communication class.\n Efficient MPI_Reduce for MPIDirect, inefficient trivial loop over all slaves\n for all other communication methods. Added simple test in\n ParallelMAtrixOperationsTest.cpp", "idx": 710} {"target": 0, "func": "[PATCH] Avoid corner-only mesh connection in SlitMesh test\n\nI've long said we don't support meshes where a manifold is only\nconnected at one node, but apparently we did kind of support it\naccidentally in at least simple cases? But improving\nGhostPointNeighbors performance breaks this case, so let's extend the\nmesh to something we've actually committed to handle.", "idx": 808} {"target": 0, "func": "[PATCH] Adding: 1. two reading functions. (one with a default\n scanner, and one taking it as a parameter). 2. 
Copy constructor.\n\nRewriting the assignment operator to be as efficient as the copy constructor.", "idx": 1287} {"target": 0, "func": "[PATCH] Enable some performance tests again", "idx": 1255} {"target": 0, "func": "[PATCH] performance update...EJB", "idx": 1382} {"target": 0, "func": "[PATCH] Fix cycle counting in StatePropagatorDataGpu\n\nDouble-counting resulted in broken/truncated performance accounting\ntable.\n\nFixes #3764", "idx": 945} {"target": 0, "func": "[PATCH] routines to control threading for Accelerate framework\n https://github.com/nwchemgit/nwchem/pull/331", "idx": 213} {"target": 0, "func": "[PATCH] Updated article reference. Added configuration flags to\n support altivec on powerpc. Updated configuration files from gnu.org Removed\n the fast x86 truncation when double precision was used - it would just crash\n since the registers are 32bit only. Added powerpc altivec innerloop support\n to fnbf.c Removed x86 assembly truncation when double precision is used - the\n registers can only handle 32 bits. Dont use general solvent loops when we\n have altivec support in the normal ones. Added workaround for clash with\n altivec keyword.", "idx": 437} {"target": 0, "func": "[PATCH] Squeezed some air out of the spread, gather and solve\n routines. Should still be tested for parallel performance and correctness", "idx": 216} {"target": 0, "func": "[PATCH] dgecop is a matrix transpose routine. since ESSL has this\n routine in it, i added a link to that. also, Qingda wrote a fast transpose\n routine that can be used if one has the OSU source code.\n\nby default, nothing changes. the faster transposes must be activated manually with ESSL_TRANSPOSE or OSU_TRANSPOSE.", "idx": 1466} {"target": 0, "func": "[PATCH] HvD: Moved the printing routine of the NWAD library out from\n nwxc.F to a separate file nwxc_print.F. 
The performance of the expression\n printing is not performance critical but when the routines are included in\n nwxc.F they become candidates for inlining. This causes the compiler to spend\n effort on inlining and optimizing these printing routines which is effort\n wasted. By moving these routines into a file of their own they are not\n getting inlined and the compiler can spend its time on optimizing the actual\n compute routines.", "idx": 462} {"target": 0, "func": "[PATCH] Remove selective unfiltering. (#2410)\n\nFrom what we know about our customer's use case, they either will read\nthe full array where this gains nothing (and actually this affects the\nperformance negatively because of the added code complexity), or very\ntargeted ones, for which the gains would be very small, unless the tile\ncapacity/extent is unusually large if the users have a poorly configured\narray.\n\nAlso, this will allow us to implement compression codec for video and\nimaging, which won't work with selective unfiltering.", "idx": 157} {"target": 0, "func": "[PATCH] travis: try to make the slow tests pass again", "idx": 1434} {"target": 0, "func": "[PATCH] Fix SIMD configuration management\n\nSubsequent runs of cmake gave inconsistent diagnostic messages because\nSUGGEST_BINUTILS_UPDATE was not set on subsequent runs because we were\ncaching the result of logic, as well as caching the results of\ncompilation tests. This made life confusing, e.g. when compiling with\ngcc on MacOS with clang assembler not available.\n\nInstead, we now re-run the fast logic (quietly, if this is a\nsubsequent run).\n\nImproved the handling of ${VARIABLE}, because there was no need to use\nFORCE because the semantics of an unset variable in CMake just work.\nThere was also no need for such variables to be put into the cache,\nand we were using one more variable than we needed to use. 
This meant\nit was no longer worth implementing the redundant hints about perhaps\nupdating the binutils package, nor suppressing the redundant special\nstatus-line output.\n\nNoted some TODOs for future simplification. Changed the use of SIMD to\nSOURCE, since this utility code doesn't have to relate to SIMD flags.\n\nChange-Id: Id9605ccff0903c55e2621ddd8af10c8da523bebe", "idx": 1072} {"target": 0, "func": "[PATCH] SolverPerformance: Complete the integration of the templated\n SolverPerformance\n\nNow solvers return solver performance information for all components\nwith backward compatibility provided by the \"max\" function which created\nthe scalar solverPerformance from the maximum component residuals from\nthe SolverPerformance.\n\nThe residuals functionObject has been upgraded to support\nSolverPerformance so that now the initial residuals for all\n(valid) components are tabulated, e.g. for the cavity tutorial case the\nresiduals for p, Ux and Uy are listed vs time.\n\nCurrently the residualControl option of pimpleControl and simpleControl\nis supported in backward compatibility mode (only the maximum component\nresidual is considered) but in the future this will be upgraded to\nsupport convergence control for the components individually.\n\nThis development started from patches provided by Bruno Santos, See\nhttp://www.openfoam.org/mantisbt/view.php?id=1824", "idx": 1031} {"target": 0, "func": "[PATCH] Remove mdrun -testverlet\n\nThis was only intended for quick performance testing of old .tpr files\nduring the transition period. The window where that was useful has\npassed, and ongoing abuse of it has been observed. 
There is no need to\npreserve this until the formal removal of the group scheme.\n\nFixes #1424\n\nChange-Id: I589a8e316beeba6819cd01d9655bfc069bcbb174", "idx": 482} {"target": 0, "func": "[PATCH] Updated whitespace, implemented low hanging performance\n boosts", "idx": 1533} {"target": 0, "func": "[PATCH] - added performance note to solving functions doc - changed\n unbounded direction w so that x + tw is the unbounded ray - added certificate\n iterators to QP_solution - added example programs that demonstrate the\n certificates - fixed examples so that 2D instead of D is given", "idx": 470} {"target": 0, "func": "[PATCH] Make linspace() non-recursive\n\nFor a \"linspace(a, b, num)\" call with \"a\" and/or \"b\" symbolic expressions,\nthe recursive implementation would lead to a graph of \"num\" depth. For\nlarge \"num\" this can lead to e.g. a very slow \"is_equal\" check due to the\ndepth requirement.\n\nAnother reason to prefer a multiplication over recursive addition is\neliminating the compounding error due to rounding when calling linspace\nwith numeric arguments.", "idx": 261} {"target": 0, "func": "[PATCH] Amendments to density fitting manual section\n\n - Fixed a sign error in the energy and force definition\n - Added performance considerations\n - Fixed whitespace\n - Changed vector notation to mathbf as in the other parts of the manual\n - Added pressure-coupling considerations\n - Added considerations when using multiple-time-stepping\n\nrefs #2282\n\nChange-Id: I8421ccf09ac960fa04508234e738967f51a27fab", "idx": 1551} {"target": 0, "func": "[PATCH] Fix performance noexcept move constructor", "idx": 1404} {"target": 0, "func": "[PATCH] new performance tests", "idx": 271} {"target": 0, "func": "[PATCH] Modernize read_inpfile\n\nUsed std::string and gmx::TextReader to simplify this. It is likely no\nlonger as efficient because it now makes several std::string objects\nper line, but this is not a significant part of the execution time of\ne.g. 
grompp.\n\nMove COMMENTSIGN to cstringutil.cpp, now that this is an\nimplementation detail of string-handling code, rather than also used\nin readinp.cpp.\n\nMoved responsibility for stripping of comments to TextReader, reworked\nits whitespace-trimming behaviour, and introduced some tests for that\nclass.\n\nIntroduced some TODO items to reconsider multiple behaviours where\nread_inpfile ignores what could be malformed input. It is used for\ngrompp, xpm2ps and membed, however, so it is not immediately clear\nwhat fixes might be appropriate, and we might anyway remove this\ncode shortly.\n\nIntroduced catches for exceptions that might be thrown while calling\nread_inpfile (and related code that might change soon).\n\nAdded tests for newly introduced split-and-trim functionality that\nsupports read_inpfile.\n\nRefs #2074\n\nChange-Id: Id9c46d60a3ec7ecdcdb9529bba2fdb68ce241914", "idx": 848} {"target": 0, "func": "[PATCH] temporary changes to use MemoryType::HOST instead of\n HOST_UMPIRE for performance", "idx": 26} {"target": 0, "func": "[PATCH] Don't use Boost.Operators for +-*/ of Gmpq.\n\nIt shouldn't change the performance significantly (the time is spent in\nmalloc/free and the mpq_* calls), but at least I can follow the\n(smaller) generated code.", "idx": 101} {"target": 0, "func": "[PATCH] Start nvprof profiling at counter reset\n\nWhen running in the NVIDIA profiler, to eliminate initial kernel\nperformance fluctuations, load balancing effects, as well as\ninitialization API calls from the traces, we now start the NVIDIA\nprofiler through the CUDA runtime API at the performance counter\nresetting. 
This has an effect only if mdrun is started in nvprof with\nprofiling off.\n\nChange-Id: Idfb3c86a96cb8b55cd874f641f4922b5517de6e3", "idx": 1541} {"target": 0, "func": "[PATCH] commented out fast f77 compile options", "idx": 1027} {"target": 0, "func": "[PATCH] added performance in ps/hour (CPU and Real)", "idx": 1305} {"target": 0, "func": "[PATCH] moved MeshBase::contract() up to Mesh. Unfortunately, there\n is no good way to make MeshBase::delete_elem() efficient, so the old\n implementation of MeshBase::contract() was (potentially) O(n_elem^2), and\n consumed approximately 20 percent of the runtime in ex10. This new\n implementation exploits the fact that the elements are stored in a vector\n (which is why it was moved up to the Mesh class) and is linear in the number\n of elements. The new implementation is less that 1 percent of the run time\n in ex10.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1057 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 915} {"target": 0, "func": "[PATCH] Update tmop_pa_h3m kernel with fast mode", "idx": 862} {"target": 0, "func": "[PATCH] Adding Changelog for Release 3.4.00\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 3.4", "idx": 750} {"target": 0, "func": "[PATCH] fem_system_ex2 local->distributed solution\n\nThere doesn't seem to be any way to get this behavior back in reinit()\nwithout creating a performance regression: most applications simply\nnever need to move data in that direction.\n\nRefs #1593.", "idx": 1} {"target": 0, "func": "[PATCH] HvD: Committing this data mainly for future record. The\n timings in these outputs will be reported in the paper as a demonstration of\n the performance difference between different implementations of the density\n functionals. The automatic differentiation timings reported here were\n generated using the intrinsic POPCNT, LEADZ and TRAILZ functions. 
At the\n moment the code does not use those by default as they are Fortran 2008\n intrinsics that are not supported by every compiler yet. However, using these\n intrinsics speeds the automatic differentiation code up by a factor of about\n 2.5 and it would not be a fair representation of the automatic\n differentiation technique not to use the intrinsics.", "idx": 363} {"target": 0, "func": "[PATCH] PASSMESS_LOG_SCOPE - disabled for now\n\nI don't really want to drag the whole PerfLog into PassMess, but can't\nthink of a better way to do built-in performance logging here.", "idx": 1020} {"target": 0, "func": "[PATCH] New test file for simple performance test", "idx": 1151} {"target": 0, "func": "[PATCH] As std::fabs is slow on Windows, we switch to an\n implementation using sse2.\n\nThis version is already in CGAL, but it is protected with an #ifdef.\nSo this commit consists of a #define for VC++", "idx": 1427} {"target": 0, "func": "[PATCH] Initial checkin of code to test performance of strided\n onesided operations.", "idx": 295} {"target": 0, "func": "[PATCH] Use MPI_IN_PLACE in minloc/maxloc\n\nWe've already broken MPI-2 compatibility; in for a penny, in for a\npound.\n\nIn addition to the infinitesimal performance gain, this quells a gcc\n8.1 -Wmaybe-uninitialized false positive.", "idx": 1405} {"target": 0, "func": "[PATCH] Ditch --enable-debug-malloc and --enable-debug-alignment\n\nWe wrote DEBUG_MALLOC in 1997 to debug memory leaks. Nowadays\nDEBUG_MALLOC is just confusing. Better tools are available, and\nDEBUG_MALLOC is not thread-safe and it does not respect SIMD\nalignment. It confused at least one user.\n\nIn the gcc-2.SOMETHING days, gcc would allocate doubles on the stack\nat 4-byte boundary (vs. 
8) reducing performance by a factor of 3.\nThat's when we introduced --enable-debug-alignment, which is totally\nobsolete by now.", "idx": 353} {"target": 0, "func": "[PATCH] Issue #2779 Ensure outputs up-to-date before linearization\n Not sure if this adds overhead and, if so, can be replaced by something more\n efficient", "idx": 645} {"target": 0, "func": "[PATCH] convert a few more styles to use fast full DP erfc()", "idx": 994} {"target": 0, "func": "[PATCH] replace the edge map by a vector of flat_map\n\nit is very efficient since there should not be isolated vertices.\nOn large data, the runtime of the function is divided by 3 to 4", "idx": 1038} {"target": 0, "func": "[PATCH] SIC performance updates...EJB", "idx": 816} {"target": 0, "func": "[PATCH] Turn assert into error in PBCs::neighbor()\n\nThis should fix #2958\n\nI don't like putting opt mode tests into anything called on a single\nelement; that's just asking to bloat kernel runtimes ... but in this\ncase, when we don't actually have broken PBC ids/displacements, we never\nhit the test on a replicated or serialized mesh, and we never hit the\ntest when calling PBCs::neighbor() on a local element, and at least\nwithin the library itself it looks like we're only calling\nPBCs::neighbor from local elements. So this change ought to be safe for\nperformance after all.", "idx": 1293} {"target": 0, "func": "[PATCH] Removed #error in case the file gets included twice. There\n are protected #ifdefs anyway. Also, we do not care about protected includes\n within the .C file anymore. 
It does not make compilation slow as before.", "idx": 771} {"target": 0, "func": "[PATCH] Removed cudaMemset from FAST", "idx": 1496} {"target": 0, "func": "[PATCH] issue #2155: setting initial guesses slow", "idx": 722} {"target": 0, "func": "[PATCH] added the hybrid PCG solver which uses diagonal scaling first\n and switches to AMG if convergence too slow, solver 20", "idx": 556} {"target": 0, "func": "[PATCH] HvD: Dealing with the issues associated with higher order\n derivatives in automatic differentiation and closed shell calculations. The\n problem addressed in particular is that of triplet excited states in TDDFT.\n Even though the ground state is closed shell the excitation energy is clearly\n a spin dependent quantity. Essentially we can deal with this situation\n effectively only if we evaluate the functional as if we are doing an open\n shell calculation. Doing so degrades performance but the automatic\n differentiation approach is significantly slower than the symbolic algebra\n generated code anyway. Hence performance cannot be the main reason for using\n this code in any case.", "idx": 1034} {"target": 0, "func": "[PATCH] Adding Changelog for Release 3.3.00\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 3.3\n\n(cherry picked from commit bd0c2c3713448f4d00b54998a42696ced1c1a188)", "idx": 1023} {"target": 0, "func": "[PATCH] - Insert a random sample of the polyhedron points, instead of\n the first points, to avoid having a triangulation of dimension < 3 - Set\n the error_behavior to ABORT, so that the try/catch of the Qt4 main loop\n does not intercept our CGAL assertions (that prevents efficient debugging).", "idx": 1016} {"target": 0, "func": "[PATCH] avoid name clash with fast directory", "idx": 1462} {"target": 0, "func": "[PATCH] Fix indexing issue in the pull code\n\nWhen determining if the COM of pull groups should be computed,\nthe indexing range of group[] for each pull coordinate is one element\ntoo long. 
In most cases this element is 0, in which case it only\nleads to extra, useless compute when a cylinder group is used.\nNote that for dihedral geometry the extra element is actually dim[0]\nin pull coord, which is 0 or 1, which is harmless.\n\nNo release note, since this did not affect results, it could only\ncause a minor performance loss with cylinder pulling.\n\nFixes #2486\n\nChange-Id: Ie5785181fbe28d8db57e37c58553ae3835e657b7", "idx": 277} {"target": 0, "func": "[PATCH] polygonTriangulate: Added robust polygon triangulation\n algorithm\n\nThe new algorithm provides robust quality triangulations of non-convex\npolygons. It also produces a best attempt for polygons that are badly\nwarped or self intersecting by minimising the area in which the local\nnormal is in the opposite direction to the overall polygon normal. It is\nmemory efficient when applied to multiple polygons as it maintains and\nreuses its workspace.\n\nThis algorithm replaces implementations in the face and\nfaceTriangulation classes, which have been removed.\n\nFaces can no longer be decomposed into mixtures of tris and\nquadrilaterals. Polygonal faces with more than 4 sides are now\ndecomposed into triangles in foamToVTK and in paraFoam.", "idx": 997} {"target": 0, "func": "[PATCH] Add checks for inefficient resource usage\n\nChecks have been added for using too many OpenMP threads and, when\nusing GPUs, for using a single OpenMP thread. A fatal error is generated\nin cases where we are quite sure performance is very sub-optimal. This\nis nasty, but a fatal error is the only way to ensure that users don't\nignore this warning. 
The fatal error can be circumvented by explicitly\nsetting -ntomp; in that case a note is printed to log and stderr.\n\nNow also avoids rank counts with thread-MPI that don't fit with the\ntotal number of threads requested.\n\nWith a GPU and without DD, the thread count limit is now doubled.\n\nDisabled GPU sharing with OpenCL.\n\nChange-Id: Ib2d892dbac3d5716246fbfdb2e8f246cdc169787", "idx": 1264} {"target": 0, "func": "[PATCH] New and reorganized documentation\n\nCovers more mdrun options, moves a bit of \"practical\" content from the\nreference manual to the user guide.\n\nImported and updated information from wiki page on cutoff\nschemes. Consolidated with information from reference manual.\n\nUpdated some use of \"atom\" to \"particle\" in both guides.\n\nWe could update the performance numbers, but with the impending\nremoval of the group scheme, I don't think that's worth bothering\nabout. e.g. on Haswell, Erik already tested that performance of group is a\nbit slower than Verlet, even for unbuffered water systems.\n\nChange-Id: I6410ba9fc08bb133ec8669e14dba11bcbd454fe3", "idx": 1010} {"target": 0, "func": "[PATCH] Use of WallClockTimer to measure performance", "idx": 1498} {"target": 0, "func": "[PATCH] Gredner performance benchmark test", "idx": 770} {"target": 0, "func": "[PATCH] too conservative check removed, for fast removal in delaunay\n 2d", "idx": 675} {"target": 0, "func": "[PATCH] UList::swap: implemented fast version which swaps the size\n and storage pointer", "idx": 914} {"target": 0, "func": "[PATCH] Simplified neighborlist setup with GB and en route to more\n efficient DD performance", "idx": 310} {"target": 0, "func": "[PATCH] Workaround for libHilbert bug until we manage to get it\n fixed. 
Part of this workaround may be permanent - the long term fix may\n require efficient user code to call some reinitialization, renumbering\n function manually after reading in a mesh and solution.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@3647 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 640} {"target": 0, "func": "[PATCH] Fix pme gather in double with AVX(2)_128\n\nThe 4NSIMD PME gather change did not change the conditional\nfor grid alignment. This is made consistent here.\nNote that the 4NSIMD change lowered the performance of PME gather\non AVX_128_FMA and AVX2_128 in double precision. We should consider\nusing 256-bit AVX for double precision instead.\n\nRefs #2326\n\nChange-Id: I07bfb3ca8d334bce18ed0b6989405bbc02c25b7b", "idx": 1192} {"target": 0, "func": "[PATCH] Added parallel performance to the doc", "idx": 878} {"target": 0, "func": "[PATCH] Created Qt Demo for Polyhedron_shortest_path\n\nA shortest paths object is created using 'Make Shortest Path' from the\nmenu on a polyhedron.\nSource points are created by shift-clicking on the polyhedron.\nPoints can also be removed by selecting the appropriate option in the\ncombo box. Point locations can be snapped to the nearest edge or vertex,\nor placed anywhere on the face. Choosing 'Shortest Path' from the combo\nbox and shift-clicking will create a polyline object representing the\nshortest path from any of the source points to that destination. Note\nthat computing the shortest paths tree may be slow on the first query.", "idx": 1174} {"target": 0, "func": "[PATCH] moved ocl kernel resources from stack to heap\n\ncl::Program and cl::Kernel objects were being created on the global\nstack earlier. Even though that is efficient and worked fine on\nlinux and macosx platforms, it caused out-of-control runtime errors\non windows platforms at the exit of the application that is using\nthe library. 
This could be mostly a driver bug in OpenCL drivers\nfor windows, therefore this change might be reverted in future.", "idx": 1189} {"target": 0, "func": "[PATCH] skylake dgemm: Add a 16x8 kernel\n\nThe next step for the avx512 dgemm code is adding a 16x8 kernel.\nIn the 8x8 kernel, each FMA has a matching load (the broadcast);\nin the 16x8 kernel we can reuse this load for 2 FMAs, which\nin turn reduces pressure on the load ports of the CPU and gives\na nice performance boost (in the 25% range).", "idx": 571} {"target": 0, "func": "[PATCH] Add example for distributed spmv scaling performance", "idx": 475} {"target": 0, "func": "[PATCH] dcopy hack into util/diff & reversing performance degradation\n change", "idx": 1198} {"target": 0, "func": "[PATCH] making the thread checker (valgrind drd) happy has no\n performance disadvantage because it is in the planning phase", "idx": 610} {"target": 0, "func": "[PATCH] Add BoundaryVolumeSolutionTransfer class.\n\nThis class can be used for transferring solutions between the surface\nof a volume mesh and the BoundaryMesh associated with that surface.\nThis is joint work with with Xikai Jiang from Argonne National\nLaboratory (@xikaij) who is the original author of the transfer.\n\nSee also: X. Jiang et al., \"An O(N) and parallel approach to integral\nproblems by a kernel-independent fast multipole method: Application to\npolarization and magnetization of interacting particles,\" The Journal\nof Chemical Physics vol. 145, 064307, http://dx.doi.org/10.1063/1.4960436.", "idx": 96} {"target": 0, "func": "[PATCH] Changed par_amg_setup so that\n hypre_BoomerAMGBuildCoarseGridOperator() is no longer used (Ulrike ran some\n performance studies of this). Two separate mat-mults are now used to compute\n (P^T (A P)). This change is important to the non-Galerkin code, because\n here, AP[Cpts,:] is used to compute a minimal sparsity pattern. 
So, by\n keeping (A P) from (P^T (A P)), we avoid a lot of duplicate communication and\n computation in non-Galerkin code.\n\nThis commit also removes any lumping to the diagonal inside the non-Galerkin\ncode.", "idx": 872} {"target": 0, "func": "[PATCH] Loosen default tolerance: accelerate convergence.\n\nFor larger optimizations the default termination value of 1e-10 may be way too small.\nSo 1e-5 is generally better, but the user can always change it themselves...", "idx": 1524} {"target": 0, "func": "[PATCH] Changed the performance test application", "idx": 634} {"target": 0, "func": "[PATCH] BUILD Adding option MIN_BUILD_TIME to CMake. Option sets O0\n for fast compile\n\n* Od on MSVC\n* Default is OFF. Flags are set when toggled to ON.\n* Resets the flags to default release when toggled back to OFF.", "idx": 664} {"target": 0, "func": "[PATCH] default to 1 thread, make the user manually specify the\n number of threads to avoid mpi/thread contention", "idx": 207} {"target": 0, "func": "[PATCH] ring non-axis parallel segment in pps subcase\n\nThis is only partially implemented. Still, it improves the\nperformance of the norway.cin benchmark.\n\nSigned-off-by: Panagiotis Cheilaris ", "idx": 455} {"target": 0, "func": "[PATCH] Disable tests using EPECK (for performance reasons)", "idx": 1565} {"target": 0, "func": "[PATCH] Explaining the performance test better.", "idx": 1562} {"target": 0, "func": "[PATCH] Code clean up in FAST and ORB for all backends\n\n- C++ features class no longer used inside backends", "idx": 83} {"target": 0, "func": "[PATCH] fix parallel build issues with APFS/HFS+/ext2/3 in\n netlib-lapack\n\nThe problem is that OpenBLAS sets the LAPACKE_LIB and the TMGLIB to the\nsame object and uses the `ar` feature to update the archive file. 
If the\nunderlying filesystem does not have sub-second timestamp resolution and\nthe system is fast enough (or `ccache` is used), the timestamp of the\nbuilds which should be added to the previously generated archive is the\nsame as the archive file itself and therefore `make` does not update the\narchive.\n\nSince OpenBLAS takes care to not run the different targets updating the\narchive in parallel, the easiest solution is to declare the respective\ntargets `.PHONY`, forcing `make` to always update them.\n\nfixes #1682", "idx": 665} {"target": 0, "func": "[PATCH] WIP NCMesh fast single element neighbor calculation, works\n for quads", "idx": 851} {"target": 0, "func": "[PATCH] Update citations.\n\nRahman_2020 and Wimmer_2020 both cite BISON as a fuel performance code\nand Wimmer actually discusses some of the coefficients used in BISON,\nbut neither presents simulation results using BISON. Chen_2020 talks\nabout some models that are in MARMOT, but does not use MOOSE.", "idx": 755} {"target": 0, "func": "[PATCH] Accelerate the distracted sequence recall test.", "idx": 368} {"target": 0, "func": "[PATCH] Refactor sign to signbit internally\n\nThe operation being performed is equivalent to std::signbit;\nthus, using signbit is more apt and removes the unnecessary redefinition\nof the sign function in the opencl jit kernel.\n\n(cherry picked from commit 2c8fb67ce5d07e573396eb8764470fb79086c797)", "idx": 715} {"target": 0, "func": "[PATCH] HvD: On the Macintosh the BLAS and LAPACK libraries provided\n as part of the compiler framework are broken according to Jeff Daily. They\n work for small matrices but start to produce rubbish for large matrices.\n\nAlso the GA configure scans the machine for linear algebra routines on its\nown account. Now setting BLAS_LIB=\" \" will force the --without-blas option\non the GA configure. This way the GA will build BLAS from source. 
This avoids\nconflicts between NWChem and GA where NWChem built BLAS from source and GA\nplanned to load BLAS from a library, introducing conflicting views of the\ninteger types.\n\nFinally on contemporary Macintosh machines there does not seem to be a need\nto specify \"--framework veclib\" or \"--framework accelerate\" anymore. The\ncompilers and linkers seem to use this automatically. So we can remove that\nfrom LIST_LINLIBS again.", "idx": 701} {"target": 0, "func": "[PATCH] ATW: Fast sparsity-optimized sigma AB and more scalable sigma\n AA using dgop", "idx": 1323} {"target": 0, "func": "[PATCH] performance updates??...EJB", "idx": 1378} {"target": 0, "func": "[PATCH] Revert \"Adding a Fast configuration\"\n\nThis reverts commit 091cdf9143a944784c5e35671927509d4f8b3d70.", "idx": 579} {"target": 0, "func": "[PATCH] tutorials/lagrangian: Added mixedVesselAMI2D\n\nThis tutorial demonstrates moving mesh and AMI with a Lagrangian cloud.\nIt is very slow, as interaction lists (required to compute collisions)\nare not optimised for moving meshes. The simulation time has therefore\nbeen made very short, so that it finishes in a reasonable time. The\nmixer only completes a small fraction of a rotation in this time. This\nis still sufficient to test tracking and collisions in the presence of\nAMI and mesh motion.\n\nIn order to generate a convincing animation, however, the end time must\nbe increased and the simulation run for a number of days.", "idx": 961} {"target": 0, "func": "[PATCH] combustionModels::EDC: New Eddy Dissipation Concept (EDC)\n turbulent combustion model\n\nincluding support for TDAC and ISAT for efficient chemistry calculation.\n\nDescription\n Eddy Dissipation Concept (EDC) turbulent combustion model.\n\n This model considers that the reaction occurs in the regions of the flow\n where the dissipation of turbulence kinetic energy takes place (fine\n structures). 
The mass fraction of the fine structures and the mean residence\ntime are provided by an energy cascade model.\n\n There are many versions and developments of the EDC model, 4 of which are\n currently supported in this implementation: v1981, v1996, v2005 and\n v2016. The model variant is selected using the optional \c version entry in\n the \c EDCCoeffs dictionary, \eg\n\n \verbatim\n EDCCoeffs\n {\n version v2016;\n }\n \endverbatim\n\n The default version is \c v2005 if the \c version entry is not specified.\n\n Model versions and references:\n \verbatim\n Version v2005:\n\n Cgamma = 2.1377\n Ctau = 0.4083\n kappa = gammaL^exp1 / (1 - gammaL^exp2),\n\n where exp1 = 2, and exp2 = 2.\n\n Magnussen, B. F. (2005, June).\n The Eddy Dissipation Concept -\n A Bridge Between Science and Technology.\n In ECCOMAS thematic conference on computational combustion\n (pp. 21-24).\n\n Version v1981:\n\n Changes coefficients exp1 = 3 and exp2 = 3\n\n Magnussen, B. (1981, January).\n On the structure of turbulence and a generalized\n eddy dissipation concept for chemical reaction in turbulent flow.\n In 19th Aerospace Sciences Meeting (p. 42).\n\n Version v1996:\n\n Changes coefficients exp1 = 2 and exp2 = 3\n\n Gran, I. R., & Magnussen, B. F. (1996).\n A numerical study of a bluff-body stabilized diffusion flame.\n Part 2. Influence of combustion modeling and finite-rate chemistry.\n Combustion Science and Technology, 119(1-6), 191-217.\n\n Version v2016:\n\n Use local constants computed from the turbulent Da and Re numbers.\n\n Parente, A., Malik, M. R., Contino, F., Cuoci, A., & Dally, B. 
B.\n (2016).\n Extension of the Eddy Dissipation Concept for\n turbulence/chemistry interactions to MILD combustion.\n Fuel, 163, 98-111.\n \endverbatim\n\nTutorial cases provided: reactingFoam/RAS/DLR_A_LTS, reactingFoam/RAS/SandiaD_LTS.\n\nThis code was developed and contributed by\n\n Zhiyi Li\n Alessandro Parente\n Francesco Contino\n from BURN Research Group\n\nand updated and tested for release by\n\n Henry G. Weller\n CFD Direct Ltd.", "idx": 118} {"target": 0, "func": "[PATCH] Clang-tidy: enable further tests\n\nThose out of misc, performance, readability, mpi with a manageable\nnumber of required fixes.\n\nRemaining checks:\n 4 readability-redundant-smartptr-get\n 4 readability-redundant-string-cstr\n 4 readability-simplify-boolean-expr\n 5 misc-misplaced-widening-cast\n 5 readability-named-parameter\n 6 performance-noexcept-move-constructor\n 8 readability-misleading-indentation\n 10 readability-container-size-empty\n 13 misc-suspicious-string-compare\n 13 readability-redundant-control-flow\n 17 performance-unnecessary-value-param\n 17 readability-static-definition-in-anonymous-namespace\n 18 misc-suspicious-missing-comma\n 20 readability-redundant-member-init\n 40 misc-misplaced-const\n 75 performance-type-promotion-in-math-fn\n 88 misc-incorrect-roundings\n 105 misc-macro-parentheses\n 151 readability-function-size\n 201 readability-else-after-return\n 202 readability-inconsistent-declaration-parameter-name\n 316 misc-throw-by-value-catch-by-reference\n 383 readability-non-const-parameter\n 10284 readability-implicit-bool-conversion\n\nChange-Id: I5b35ce33e723349fa583f527fec55bbf29a57508", "idx": 576} {"target": 0, "func": "[PATCH] Call full Kokkos::initialize in performance tests", "idx": 1068} {"target": 0, "func": "[PATCH] trivial change for DofMap::dof_indices to increase\n performance when there are no element-based DOFs", "idx": 1398} {"target": 0, "func": "[PATCH] Update performance tables with more details (memory, etc.)", "idx": 457} 
{"target": 0, "func": "[PATCH] Switch to a RWLock for S3 Class multipart upload\n\nA new RWLock class is introduced into the common folder. @joe-maley\nprovided this excellent class. We then use this RWLock in the S3 VFS\nclass to handle the multipart uploads and to manage the locks at a more\ngranular level to remove locking and contention for multiple concurrent upload\noperations.", "idx": 1289} {"target": 0, "func": "[PATCH] Near final release version. Performance for 300^3 is 4.64\n 1.96 3.13s on 1,4 and 20 ranks. Up to 4% variation if temps above 40C on GPU", "idx": 432} {"target": 0, "func": "[PATCH] thermophysicalModels: Added new tabulated equation of state,\n thermo and transport models\n\nusing the new nonUniformTable to interpolate between the values vs temperature\nprovided. All properties (density, heat capacity, viscosity and thermal\nconductivity) are considered functions of temperature only and the equation of\nstate is thus incompressible. Built-in mixing rules corresponding to those in\nthe other thermo and transport models are not efficient or practical for\ntabulated data and so these models are currently only instantiated for the pure\nspecie/mixture rhoThermo package but a general external mixing method will be\nadded in the future.\n\nTo handle reactions the Jacobian function dKcdTbyKc has been rewritten to use\nthe Gstd and S functions directly removing the need for the misnamed dGdT\nfunction and hence removing the bugs in the implementation of that function for\nsome of the thermo models. Additionally the Hc() function has been renamed\nHf() (heat of formation) which is more commonly used terminology and consistent\nwith the internals of the thermo models.", "idx": 1557} {"target": 0, "func": "[PATCH] enable GPU emulation without GPU support\n\nGPU emulation can be useful to estimate the performance one could get\nby adding GPU(s) to the machine by running with GMX_EMULATE_GPU and\nGMX_NO_NONBONDED environment variables set. 
As this feature is useful\neven with mdrun compiled without GPU support, this commit makes GPU\nemulation mode always available.\n\nChange-Id: I0b90b8ec1c6e3116f28f66aac4f3c8ae0831239d", "idx": 1037} {"target": 0, "func": "[PATCH] Global thread pool when TBB is disabled (#1760)\n\nThis introduces a global thread pool for use when TBB is disabled. The\nperformance has not been exhaustively benchmarked against TBB. However, I did\ntest this on two readily available scenarios that I had been recently performance\nbenchmarking for other reasons. One scenario makes heavy use of the parallel_sort\npath while the other does not. Surprisingly, disabling TBB performs about 10%\nquicker with this patch.\n\n// Scenario #1\nTBB: 3.4s\nTBB disabled, this patch: 3.0s\nTBB disabled, on dev: 10.0s\n\n// Scenario #2\nTBB: 3.1\nTBB disabled, this patch: 2.7s\nTBB disabled, on dev: 9.1s\n\nFor now, this patch uses the threadpool at the same scope as the TBB scheduler.\nIt is a global thread pool, shared among a single process, and conditionally\ncompiled. 
The concurrency level that the thread pool is configured with is\ndetermined from the \"sm.num_tbb_threads\" config.\n\nThis patch does not disable TBB by default.\n\nCo-authored-by: Joe Maley ", "idx": 283} {"target": 0, "func": "[PATCH] Modified version of cgal_test_with_cmake which: - is\n cross-platform Unix/make and Cygwin/VisualC++ - concats all log files to\n cgal_test_with_cmake.log - does not clean up object files and executables\n (too slow when called by developer)", "idx": 405} {"target": 0, "func": "[PATCH] moved warning of slow option to 'bugs'", "idx": 214} {"target": 0, "func": "[PATCH] Use .p2align instead of .align for portability\n\nThe OSX assembler apparently mishandles the argument to decimal .align, leading to a significant loss of performance\nas observed in #730, #901 and most recently #1470", "idx": 1479} {"target": 0, "func": "[PATCH] working, but slow", "idx": 379} {"target": 0, "func": "[PATCH] performance measurement for the triples plus attempts at\n optimizing one routine on a pentium", "idx": 296} {"target": 0, "func": "[PATCH] MDRange: Refactored HostIterateTile to use macros\n\nRemoves the recursive way for nesting the for loops which may hinder\nchances at vectorization during iteration over tiles.\nPerformance test revised to perform a stencil-like operation", "idx": 420} {"target": 0, "func": "[PATCH] Make the use of GpuEventSynchronizer in SYCL conformant with\n CUDA/OpenCL\n\nMR !1035 refactored the use of GpuEventSynchronizer in CUDA and OpenCL\nto make merging the code paths easier. Here, we update SYCL to the same\nstandard.\n\nNote, that it introduces additional synchronization between local and\nnon-local queues. It is present in CUDA and OpenCL, but was implicit in\nSYCL. To simplify code, it is added here. 
If it turns out to be\ndetrimental to performance, it can be (conditionally) NOPed.\n\nRefs #2608, #3895.", "idx": 1389} {"target": 0, "func": "[PATCH] Change tolerance as it becomes pretty slow", "idx": 1126} {"target": 0, "func": "[PATCH] Move performance logging of solve()s into solver classes\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1514 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 1100} {"target": 0, "func": "[PATCH] Refactor nbnxn exclusion setting\n\nConsolidate common parts of the simple and GPU exclusion mask\ngeneration code. Made variable names more descriptive.\nNo functionality or performance changes, except that the direct\nj-cluster lookup now also works when the first j-cluster does not\nequal the i-cluster.\n\nChange-Id: I3ef6344ae2796e649ae30bf5ff0668a4548c011f", "idx": 1487} {"target": 0, "func": "[PATCH] resolved performance-degrading change introduced in revision\n 1319 (4)", "idx": 570} {"target": 0, "func": "[PATCH] Correct Pelleg-Moore prunes that finish a node. There were\n cases where a Pelleg-Moore prune would happen before committing the point.\n This is actually getting pretty fast in terms of base cases, so I am happy\n with that (for once).", "idx": 125} {"target": 0, "func": "[PATCH] Don't print invalid performance data\n\nIf mdrun finished before a scheduled reset of the timing information\n(e.g. from mdrun -resetstep or mdrun -resethway), then misleading\ntiming information should not be reported.\n\nFixes #2041\n\nChange-Id: I4bd4383c924a342c01e9a3f06b521da128f96a35", "idx": 1358} {"target": 0, "func": "[PATCH] Added missing memory deletions on FAST unit test.", "idx": 1003} {"target": 0, "func": "[PATCH] Deprecate Node/DofObject copy methods\n\nFixes #1451\n\nAny Node object copying is almost certainly a bug (always in\nperformance, sometimes in functionality), in code that should have\nbeen taking references instead. 
Since the node_ref_range() method\nmakes it too easy to write such bugs, we should deprecate the Node\ncopy constructor and make it impossible soon.\n\nThe only places where DofObject(DofObject) was being used were in the\nNode copy constructor (no longer relevant) and in old_dof_object\ncreation (where the not-quite-a-proper-copy-constructor behavior is\nintentional), so for added safety let's deprecate non-private access\nto that constructor. Since it was already protected before, this\nshouldn't cause any hardship to downstream users.", "idx": 1529} {"target": 0, "func": "[PATCH] adding parallel region around gmx_parallel_3dfft_execute.\n Makes it work correctly but the unnecessary barrier should be removed for\n performance reasons", "idx": 308} {"target": 0, "func": "[PATCH] Threw away the manual text-parsing, xml parser is fast enough", "idx": 669} {"target": 0, "func": "[PATCH] Fix a critical performance issue\n\nAs decided by `MainWindow`, the `Scene_c3t3_item::toolTip()` method is\ncalled by `MainWindow::updateInfo()` for each `modified()` event of the\nmanipulated frame. While the frame is manipulated, that generates a lot\nof events, and a lot of calls to `toolTip()`.\n\nBefore this commit, the call to `Scene_c3t3_item::toolTip()`\nwas `O(n)`. After this commit it is `O(1)`.\n\nThat greatly speeds up the drawing of the item while the frame is\nmanipulated!", "idx": 685} {"target": 0, "func": "[PATCH] Cuda: Adding performance test for instance overlapping", "idx": 1241} {"target": 0, "func": "[PATCH] Simplify the uniform-refinement mesh methods.\n\nIn the classes Mesh and ParMesh:\n\n* Small optimization in Mixed3DUniformRefinement for hex-only meshes.\n* In Mixed3DUniformRefinement, use marker array instead of std::map.\n* Rename the methods Mixed{2D,3D}UniformRefinement to\n UniformRefinement{2D,3D}.\n* Remove the methods {Quad,Hex,Wedge}UniformRefinement and use\n UniformRefinement{2D,3D} instead. 
In terms of performance, the\n difference was negligible.", "idx": 1414} {"target": 0, "func": "[PATCH] Separate Windows ci(gh-action) workflow and some improvements\n\nSplitting the windows ci job into a separate workflow enables the\nci to re-run windows-specific jobs independently of unix jobs.\n\nUpdated Ninja dependency to the 1.10.2 fix release in all ci(gh-actions)\n\nRefactored boost dependency to be installed via package managers as\nGitHub Actions is removing pre-installed versions from March 8, 2021\n\nUpdate VCPKG hash to newer version to enable fast and better ports.", "idx": 1436} {"target": 0, "func": "[PATCH] Use a separate Performance log line for compute_affine_map\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1506 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 742} {"target": 0, "func": "[PATCH] changed the performance figure note: this version can test\n the parallel performance", "idx": 436} {"target": 0, "func": "[PATCH] Expand Performance section of the user guide\n\nSalvage and clean up the content from the wiki to expand the user\nguide. 
Minor fixes to the rest of the Performance section.\n\nChange-Id: I39aba257c4c761a3a1ef428c64424da6fa449158", "idx": 974} {"target": 0, "func": "[PATCH] Quadrature rule fixes for >double precision\n\nMake sure calculations are done in Real precision where possible; fall\nback on less-efficient-but-more-accurate defaults where the more\nefficient cases are tabulated in double precision.", "idx": 409} {"target": 0, "func": "[PATCH] s390x: for clang use fp-contract=on instead of fast\n\nMake clang slightly more cautious when contracting floating-point\noperations (e.g., when applying fused multiply add) by setting\n-ffp-contract=on (instead of fast).\n\nSigned-off-by: Marius Hillenbrand ", "idx": 597} {"target": 0, "func": "[PATCH] FEAT: Enabling additional interpolation types\n\n- Enabling cubic support for rotate and transform\n- resize falls back to use scale\n\nResize is not using common interp.cl because of compilation\nissues that arose from using too much constant memory when\ncompiling FAST in the CUDA backend.", "idx": 740} {"target": 0, "func": "[PATCH] gmres restart converges but slow", "idx": 1120} {"target": 0, "func": "[PATCH] s390x: allow clang to emit fused multiply-adds (replicates\n gcc's default behavior)\n\ngcc's default setting for floating-point expression contraction is\n\"fast\", which allows the compiler to emit fused multiply adds instead of\nseparate multiplies and adds (amongst others). Fused multiply-adds,\nwhich assembly kernels typically apply, also bring a significant\nperformance advantage to the C implementation for matrix-matrix\nmultiplication on s390x. 
To enable that performance advantage for builds\nwith clang, add -ffp-contract=fast to the compiler options.\n\nSigned-off-by: Marius Hillenbrand ", "idx": 116} {"target": 0, "func": "[PATCH] More fine grained ParmetisPartitioner logging\n\nThis is ludicrously slow outside opt mode", "idx": 1518} {"target": 0, "func": "[PATCH] Make current slow growth behaviour consistent and communicate\n it better\n\nIn lambda dynamics/slow growth, lambda can be used to interpolate the lambda vector or to set its components directly if no lambda vector is provided by the user. In this latter case, lambda was allowed to be > 1, but it was silently kept within [0,1] if a lambda vector was specified. Moreover, setting the components of the lambda vector > 1 as a user produced an error.\n\nNow, it is consistently ensured that lambda vector components are in [0,inf), but warnings are issued if a user provides settings that somehow result in lambda vector components being > 1. If soft-core potentials are used, lambda vector components for Coulomb and vdW are consistently enforced to be in [0,1] (errors are issued else). If lambda is used to interpolate a user-provided lambda vector, it is kept in [0,1]. 
If user input results in lambda leaving the above ranges during the simulation, lambda will be kept at the respective interval boundary, and warnings are issued from which simulation step on the lambda vector will not change anymore.\n\nFixes #3584.", "idx": 585} {"target": 0, "func": "[PATCH] fast ColumnCovariance", "idx": 1062} {"target": 0, "func": "[PATCH] Changed default cuda stream to be non-zero\n\n* Added additional following api functions specific to cuda backend\n * afcu_get_stream\n * afcu_get_native_id\n* Removed duplicate class in fast kernel that helps declare\n dynamic shared memory based on template type", "idx": 203} {"target": 0, "func": "[PATCH] MKK: a test case to measure the performance of aggregate\n put/get calls.", "idx": 24} {"target": 0, "func": "[PATCH] aabb tree: added plane queries in the performance section and\n demo.", "idx": 847} {"target": 0, "func": "[PATCH] Added performance benchmark", "idx": 110} {"target": 0, "func": "[PATCH] it seems to work comments The armijo rule is really slow The\n statistical update of sigmas is good We need to stay some time in low sigmas\n otherwise we do not optimize the trace Next step is BFGS method", "idx": 81} {"target": 0, "func": "[PATCH 01/29] Add support for factorization in\n create_new_algorithm.sh", "idx": 714} {"target": 0, "func": "[PATCH] AMG-DD implementation (#145)\n\nThis includes the implementation of the AMG-DD algorithm, a variant of BoomerAMG designed to limit communication.\n\nAMG-DD may be used as a standalone solver or a preconditioner for Krylov methods (note that AMG-DD is a non-symmetric preconditioner). 
For an example of how to set up and use AMG-DD, see the IJ driver (src/test/ij.c).\n\nA list with the parameters of AMG-DD is given below:\n\nPadding (recommended default 1): HYPRE_BoomerAMGDDSetPadding(...)\nNumber of ghost layers (recommended default 1): HYPRE_BoomerAMGDDSetNumGhostLayers(...)\nNumber of inner FAC cycles per AMG-DD iteration (default 2): HYPRE_BoomerAMGDDSetFACNumCycles(...)\nFAC cycle type: HYPRE_BoomerAMGDDSetFACCycleType(...)\n1 = V-cycle (default)\n2 = W-cycle\n3 = F-cycle\nNumber of relaxations on each level during FAC cycle: HYPRE_BoomerAMGDDSetFACNumRelax(...)\nType of local relaxation during FAC cycle: HYPRE_BoomerAMGDDSetFACRelaxType(...)\n0 = Jacobi\n1 = Gauss-Seidel\n2 = ordered Gauss-Seidel\n3 = C/F L1-scaled Jacobi (default)\n\nFor more details of the algorithm, see Mitchell W.B., R. Strzodka, and R.D. Falgout (2020), Parallel Performance of Algebraic Multigrid Domain Decomposition (AMG-DD).", "idx": 1146} {"target": 0, "func": "[PATCH] MDRange: Disabling performance test for MDRange: takes way too\n long on KNL", "idx": 721} {"target": 0, "func": "[PATCH] In the performance miniapps, print the MFEM SIMD width.\n\nIn the miniapps/performance makefile, print the auto-detected compiler\nand if that fails, then print the output used for auto-detection.", "idx": 1485} {"target": 0, "func": "[PATCH] PBiCGStab: New preconditioned bi-conjugate gradient\n stabilized solver for asymmetric lduMatrices using a run-time selectable\n preconditioner\n\nReferences:\n Van der Vorst, H. A. (1992).\n Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG\n for the solution of nonsymmetric linear systems.\n SIAM Journal on scientific and Statistical Computing, 13(2), 631-644.\n\n Barrett, R., Berry, M. W., Chan, T. F., Demmel, J., Donato, J.,\n Dongarra, J., Eijkhout, V., Pozo, R., Romine, C. & Van der Vorst, H.\n (1994).\n Templates for the solution of linear systems:\n building blocks for iterative methods\n (Vol. 43). 
Siam.\n\nSee also: https://en.wikipedia.org/wiki/Biconjugate_gradient_stabilized_method\n\nTests have shown that PBiCGStab with the DILU preconditioner is more\nrobust, reliable and shows faster convergence (~2x) than PBiCG with\nDILU, in particular in parallel where PBiCG occasionally diverges.\n\nThis remarkable improvement over PBiCG prompted the update of all\ntutorial cases currently using PBiCG to use PBiCGStab instead. If any\nissues arise with this update please report on Mantis: http://bugs.openfoam.org", "idx": 1033} {"target": 0, "func": "[PATCH] - Anti-aliasing is quite slow (but in OpenGL mode). It is\n deactivated by default.\n\n- Add a temp message in the status bar when the aliasing mode is changed.", "idx": 831} {"target": 0, "func": "[PATCH] Changed the way hardwall works.\n\nThere are several ways to go about this, but the latest seems\nmost sensible in the context of DD. The hardwall function now\nuses local shells to get coordinates and such. The problem is\nthat after make_local_shells(), the indices are all messed up.\nCommitting this to preserve the logic, but need to fix this\notherwise parallelization is limited to only OpenMP and performance\nis not very good above 4 threads.\n\nChange-Id: I3208519d8704da622b81835604249771d634a28a", "idx": 514} {"target": 0, "func": "[PATCH] ATW: Removed slow serial code from density construction", "idx": 661} {"target": 0, "func": "[PATCH] Use the simplified performance function.", "idx": 1366} {"target": 0, "func": "[PATCH] Adding Changelog for Release 3.5.00\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 3.5", "idx": 803} {"target": 0, "func": "[PATCH] add a warning if the bonded cutoff is large\n\nThis should print a warning when 2x the bonded interaction cutoff is larger than other cutoffs, as was the setting before the performance optimization with the change in https://github.com/lammps/lammps/pull/758/commits/269007540569589aa7c81d9ba1a4b93d34b8c95d", "idx": 1182} {"target": 
0, "func": "[PATCH] Make ImdSession into a Pimpl-ed class with factory function\n\nThis prepares to make IMD into a proper module. No\nfunctionality changes in this commit.\n\nReplaced gmx_bool with bool\n\nUsed fast returns when IMD is inactive, for better\nreadability of code.\n\nRefs #2877\n\nChange-Id: Ibbe8c452f6f480e9a357fe1b87da3ab0ae166317", "idx": 931} {"target": 0, "func": "[PATCH] Attempting to accelerate the build by only including most of\n the internal copy:: headers when necessary", "idx": 387} {"target": 0, "func": "[PATCH] Adding performance section", "idx": 734} {"target": 0, "func": "[PATCH] * Refactored stats (#1594)\n\n* Fixed performance of multi-range subarray result estimation\n* Fixed bug in multi-range result estimation", "idx": 563} {"target": 0, "func": "[PATCH] AVX512 CGEMM & ZGEMM kernels\n\n96-99% 1-thread performance of MKL2018", "idx": 693} {"target": 0, "func": "[PATCH] Adding Changelog for Release 2.9.00\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 2.9", "idx": 6} {"target": 0, "func": "[PATCH] Make performance ex1 and ex1p templated on dimension", "idx": 29} {"target": 0, "func": "[PATCH] slow DGOP...EJB", "idx": 153} {"target": 0, "func": "[PATCH] The current version allows for triclinic boxes as well. It is\n very slow though.", "idx": 761} {"target": 0, "func": "[PATCH] MKK: Non-blocking performance test", "idx": 1131} {"target": 0, "func": "[PATCH] Changing switchpoint of kernel of local HerkLN update to be\n twice the blocksize times the grid width. This preserves the local gemm\n performance at the kernel.", "idx": 1285} {"target": 0, "func": "[PATCH] Incorporated Poisson solver in the mdrun code. 
It is dead\n slow but may be parallelized easily and is a good reference code for PPPM,\n since it is about twice as accurate as PPPM at the same number of grid\n points.", "idx": 278} {"target": 0, "func": "[PATCH] Switch to unordered_multimap in UNVIO.\n\nAfter trying out a few different container types, it was determined\nthat the cost of building every side (in order to search for it in the\nmultiset) killed the performance of the unordered_set method,\nand that the unordered_multimap approach was superior to a standard\nmultimap across a range of mesh sizes (below, there are 6*N^2 sides to\nsearch through in the container in question).\n\nN unordered_set multimap unordered_multimap\n10 0.0012 0.0006 0.0005\n15 0.0049 0.0021 0.0018\n20 0.0102 0.0055 0.0058\n25 0.0201 0.0118 0.0062\n30 0.0415 0.0288 0.0126\n35 0.0579 0.0377 0.0169\n40 0.0946 0.0612 0.0281\n45 0.1467 0.0996 0.0426\n50 0.1929 0.1213 0.0499\n55 0.2662 0.1688 0.0716\n60 0.4040 0.2250 0.1018\n65 0.5981 0.3388 0.1376\n100 2.2838 1.5329 0.6257", "idx": 98} {"target": 0, "func": "[PATCH] additional output for performance tests", "idx": 843} {"target": 0, "func": "[PATCH] Remove the need for most locking in memory.c.\n\nUsing thread local storage for tracking memory allocations means that threads\nno longer have to lock at all when doing memory allocations / frees. 
This\nparticularly helps the gemm driver since it does an allocation per invocation.\nEven without threading at all, this helps, since even calling a lock with\nno contention has a cost:\n\nBefore this change, no threading:\n```\n----------------------------------------------------\nBenchmark Time CPU Iterations\n----------------------------------------------------\nBM_SGEMM/4 102 ns 102 ns 13504412\nBM_SGEMM/6 175 ns 175 ns 7997580\nBM_SGEMM/8 205 ns 205 ns 6842073\nBM_SGEMM/10 266 ns 266 ns 5294919\nBM_SGEMM/16 478 ns 478 ns 2963441\nBM_SGEMM/20 690 ns 690 ns 2144755\nBM_SGEMM/32 1906 ns 1906 ns 716981\nBM_SGEMM/40 2983 ns 2983 ns 473218\nBM_SGEMM/64 9421 ns 9422 ns 148450\nBM_SGEMM/72 12630 ns 12631 ns 112105\nBM_SGEMM/80 15845 ns 15846 ns 89118\nBM_SGEMM/90 25675 ns 25676 ns 54332\nBM_SGEMM/100 29864 ns 29865 ns 47120\nBM_SGEMM/112 37841 ns 37842 ns 36717\nBM_SGEMM/128 56531 ns 56532 ns 25361\nBM_SGEMM/140 75886 ns 75888 ns 18143\nBM_SGEMM/150 98493 ns 98496 ns 14299\nBM_SGEMM/160 102620 ns 102622 ns 13381\nBM_SGEMM/170 135169 ns 135173 ns 10231\nBM_SGEMM/180 146170 ns 146172 ns 9535\nBM_SGEMM/189 190226 ns 190231 ns 7397\nBM_SGEMM/200 194513 ns 194519 ns 7210\nBM_SGEMM/256 396561 ns 396573 ns 3531\n```\nwith this change:\n```\n----------------------------------------------------\nBenchmark Time CPU Iterations\n----------------------------------------------------\nBM_SGEMM/4 95 ns 95 ns 14500387\nBM_SGEMM/6 166 ns 166 ns 8381763\nBM_SGEMM/8 196 ns 196 ns 7277044\nBM_SGEMM/10 256 ns 256 ns 5515721\nBM_SGEMM/16 463 ns 463 ns 3025197\nBM_SGEMM/20 636 ns 636 ns 2070213\nBM_SGEMM/32 1885 ns 1885 ns 739444\nBM_SGEMM/40 2969 ns 2969 ns 472152\nBM_SGEMM/64 9371 ns 9372 ns 148932\nBM_SGEMM/72 12431 ns 12431 ns 112919\nBM_SGEMM/80 15615 ns 15616 ns 89978\nBM_SGEMM/90 25397 ns 25398 ns 55041\nBM_SGEMM/100 29445 ns 29446 ns 47540\nBM_SGEMM/112 37530 ns 37531 ns 37286\nBM_SGEMM/128 55373 ns 55375 ns 25277\nBM_SGEMM/140 76241 ns 76241 ns 18259\nBM_SGEMM/150 102196 ns 102200 
ns 13736\nBM_SGEMM/160 101521 ns 101525 ns 13556\nBM_SGEMM/170 136182 ns 136184 ns 10567\nBM_SGEMM/180 146861 ns 146864 ns 9035\nBM_SGEMM/189 192632 ns 192632 ns 7231\nBM_SGEMM/200 198547 ns 198555 ns 6995\nBM_SGEMM/256 392316 ns 392330 ns 3539\n```\n\nBefore, when built with USE_THREAD=1, GEMM_MULTITHREAD_THRESHOLD = 4, the cost\nof small matrix operations was overshadowed by thread locking (look smaller than\n32) even when not explicitly spawning threads:\n```\n----------------------------------------------------\nBenchmark Time CPU Iterations\n----------------------------------------------------\nBM_SGEMM/4 328 ns 328 ns 4170562\nBM_SGEMM/6 396 ns 396 ns 3536400\nBM_SGEMM/8 418 ns 418 ns 3330102\nBM_SGEMM/10 491 ns 491 ns 2863047\nBM_SGEMM/16 710 ns 710 ns 2028314\nBM_SGEMM/20 871 ns 871 ns 1581546\nBM_SGEMM/32 2132 ns 2132 ns 657089\nBM_SGEMM/40 3197 ns 3196 ns 437969\nBM_SGEMM/64 9645 ns 9645 ns 144987\nBM_SGEMM/72 35064 ns 32881 ns 50264\nBM_SGEMM/80 37661 ns 35787 ns 42080\nBM_SGEMM/90 36507 ns 36077 ns 40091\nBM_SGEMM/100 32513 ns 31850 ns 48607\nBM_SGEMM/112 41742 ns 41207 ns 37273\nBM_SGEMM/128 67211 ns 65095 ns 21933\nBM_SGEMM/140 68263 ns 67943 ns 19245\nBM_SGEMM/150 121854 ns 115439 ns 10660\nBM_SGEMM/160 116826 ns 115539 ns 10000\nBM_SGEMM/170 126566 ns 122798 ns 11960\nBM_SGEMM/180 130088 ns 127292 ns 11503\nBM_SGEMM/189 120309 ns 116634 ns 13162\nBM_SGEMM/200 114559 ns 110993 ns 10000\nBM_SGEMM/256 217063 ns 207806 ns 6417\n```\nand after, it's gone (note this includes my other change which reduces calls\nto num_cpu_avail):\n```\n----------------------------------------------------\nBenchmark Time CPU Iterations\n----------------------------------------------------\nBM_SGEMM/4 95 ns 95 ns 12347650\nBM_SGEMM/6 166 ns 166 ns 8259683\nBM_SGEMM/8 193 ns 193 ns 7162210\nBM_SGEMM/10 258 ns 258 ns 5415657\nBM_SGEMM/16 471 ns 471 ns 2981009\nBM_SGEMM/20 666 ns 666 ns 2148002\nBM_SGEMM/32 1903 ns 1903 ns 738245\nBM_SGEMM/40 2969 ns 2969 ns 
473239\nBM_SGEMM/64 9440 ns 9440 ns 148442\nBM_SGEMM/72 37239 ns 33330 ns 46813\nBM_SGEMM/80 57350 ns 55949 ns 32251\nBM_SGEMM/90 36275 ns 36249 ns 42259\nBM_SGEMM/100 31111 ns 31008 ns 45270\nBM_SGEMM/112 43782 ns 40912 ns 34749\nBM_SGEMM/128 67375 ns 64406 ns 22443\nBM_SGEMM/140 76389 ns 67003 ns 21430\nBM_SGEMM/150 72952 ns 71830 ns 19793\nBM_SGEMM/160 97039 ns 96858 ns 11498\nBM_SGEMM/170 123272 ns 122007 ns 11855\nBM_SGEMM/180 126828 ns 126505 ns 11567\nBM_SGEMM/189 115179 ns 114665 ns 11044\nBM_SGEMM/200 89289 ns 87259 ns 16147\nBM_SGEMM/256 226252 ns 222677 ns 7375\n```\n\nI've also tested this with ThreadSanitizer and found no data races during\nexecution. I'm not sure why 200 is always faster than its neighbors, we must\nbe hitting some optimal cache size or something.", "idx": 1097} {"target": 0, "func": "[PATCH] HvD: Initial Maxima generated code for the various\n functionals. At present only the energy expressions and the 1st order\n derivatives are implemented. This is sufficient to test whether the energy\n expressions are correct. 
Next the 2nd and 3rd order derivatives will be\n generated but optimizing the expressions to generate fast Fortran will take a\n while.", "idx": 1464} {"target": 0, "func": "[PATCH] Replace gmx::Mutex with std::mutex\n\nWe use no mutexes during the MD loop, so performance is not a serious\nconsideration and we should simplify by using std components.\nEliminated components of thread-MPI that are now unused.\n\nIn particular, this reduces the cross-dependencies between the\nlibgromacs and threadMPI libraries.\n\nMinor style improvements around set_over_alloc_dd.\n\nPart of #3892", "idx": 833} {"target": 0, "func": "[PATCH] Add Fast LSTM layer implementation.", "idx": 1456} {"target": 0, "func": "[PATCH] WIP: add CPR approach and fast small block inverse.", "idx": 614} {"target": 0, "func": "[PATCH] Extend performance considerations on bonded offload\n\nRefs #2793\n\nChange-Id: I4a8ae8554cf2aad540eb4eb485898f8cabeb3966", "idx": 1184} {"target": 0, "func": "[PATCH] use integer and reduce the number of tests\n\nleda_rational is not automatically doing gcd calls so Quotient\nis faster for our applications.\nThe test is still slow with EPECK", "idx": 480} {"target": 0, "func": "[PATCH] Because I wasn't seeing a speed-up on grapes, I made the\n accumulation put (add_hash_block) non-blocking in hopes that that\n communication, which probably hits more contention because it is a write, can\n be overlapped with all the local computation going on prior to it.", "idx": 1132} {"target": 0, "func": "[PATCH] First phase of integrating the fast coulomb code into nwchem.\n The nested grid evaluation of the density, fourier interpolation, FMM, and\n fourier solution of the free space Poisson equation", "idx": 736} {"target": 0, "func": "[PATCH] Add task dag performance test based upon fibonnaci", "idx": 598} {"target": 0, "func": "[PATCH] Compile using g++ on OSX\n\nThe Accelerate framework requires -flax-vector-conversions to\ncompile on g++", "idx": 87} {"target": 0, "func": 
"[PATCH] Some simple memory leak fixes.\n\nAlthough the fixed leaks are not important for performance, the fewer\nleaks there are, the more convenient it is to run valgrind leak checking\non parts of the code.", "idx": 11} {"target": 0, "func": "[PATCH] Append all ICC Performance flags only to Release Flags\n\nSome of the ICC performance flags were appended to GMXC_CFLAGS\nand thus also used e.g. for Debug.\n\nChange-Id: Iadfaa29fb347f24208e6f2406e0d1ad41f037804", "idx": 1070} {"target": 0, "func": "[PATCH] Enable performance test", "idx": 616} {"target": 0, "func": "[PATCH] New normals orientation method:\n radial_normals_orientation_3() does a radial orientation of the normals of a\n point set. Normals are oriented towards exterior of the point set. This very\n simple and very fast method is intended for convex objects.", "idx": 789} {"target": 0, "func": "[PATCH] Allow useful CI to run in forks\n\n* Move the fast jobs with no dependencies to the first stage.\n* Remove the global KTH-specific job runner tag from jobs in the pre-build stage.\n* Use the `pre-build` stage as the dependency for all later stages, rather than the `simple-build` job, specifically.\n* Convert rule sets to new *rules* syntax.\n* Use '$CI_PROJECT_NAMESPACE == \"gromacs\"' to distinguish jobs created with access to GROMACS GitLab infrastructure.\n\nFixes #3458", "idx": 329} {"target": 0, "func": "[PATCH] Make Constraints a proper class\n\nConverted the opaque C struct to a pimpl-ed C++ class.\n\nNumerous callers of constraint routines now don't have to pass\nparameters that are embedded within the class at setup time,\ne.g. for logging, communication, per-atom information,\nperformance counters.\n\nSome of those parameters have been converted to use const references\nper style, which requires various callers and callees to be modified\naccordingly. 
In particular, the mtop utility functions that take const\npointers have been deprecated, and some temporary wrapper functions\nused so that we can defer the update of code completely unrelated to\nconstraints until another time. Similarly, t_commrec is retained as a\npointer, since it also makes sense to change globally.\n\nMade ConstraintVariable an enum class. This generates some compiler\nwarnings to force us to cover potential bug cases with fatal errors.\nUsed more complete names for some of the enum members.\n\nIntroduced a factory function to continue the design that constr is\nnullptr when we're not using constraints.\n\nAdded some const correctness where it now became necessary.\n\nRefs #2423\n\nChange-Id: I7a3833489b675f30863ca37c0359cd3e950b5494", "idx": 1013} {"target": 0, "func": "[PATCH] Fixing bugs in slow (non-shared memory) variant of\n lj/charmm/coul/charmm/gpu", "idx": 653} {"target": 0, "func": "[PATCH] checked in the wrong file last time, this is the really\n efficient version !", "idx": 313} {"target": 0, "func": "[PATCH] KokkosCore: Mark perf tests as CATEGORY PERFORMANCE\n\nFix https://github.com/kokkos/kokkos/issues/374 by marking\nKokkosCore's performance tests as CATEGORY PERFORMANCE. They will\nalways build, but they will only run when doing performance tests.", "idx": 260} {"target": 0, "func": "[PATCH] Fast Haswell ZGEMM kernel", "idx": 301} {"target": 0, "func": "[PATCH] Experimental new template class Set. * This class is similar\n to Jakub's new HashTable class from the 'ncmesh-mem-opt-dev' branch, but\n more generic. * It is a container for unique elements of any type T. *\n Supports fast insertion, removal, and searching of elements. * Each element\n is assigned an index (int) upon insertion. * Indices of removed elements are\n reused when inserting new elements. * Elements are stored in a random access\n container that supports fast insertion at the end, like mfem::Array or\n std::vector. 
* Such container classes require a simple adaptor class to be\n used with class Set. * The indices assigned to elements are indices into\n the container object, which stores \"nodes\", a struct with two fields one of\n type T, and second of type int (index of next element in a bin). * The\n entries of the Set are separated into bins using a generic hash function.\n Bins are represented as linked lists where instead of pointers, the\n link-\"nodes\" use int indices.\n\nThis class can be useful in various contexts:\n* Creating a local enumeration for a set of processor ranks, e.g.\n enumerating the processor neighbors in class GroupTopology.\n* Creating off-diagonal column maps for HypreParMatrix, given a set of\n global column indices.\n* Enumeration of edges and faces as Sets of (sorted) pairs or 3-tuples,\n similar to HashTable - which can be build on top of Set as well.", "idx": 1446} {"target": 0, "func": "[PATCH] HvD: Adding a small performance test to check whether it\n makes sense to use a Taylor series for exp().", "idx": 170} {"target": 0, "func": "[PATCH] Reduce number of epochs for training to accelerate tests.", "idx": 650} {"target": 0, "func": "[PATCH] Accelerate a couple of tests.", "idx": 647} {"target": 0, "func": "[PATCH] Revert simd-avx.h changes from b606e3191\n\nThey didn't improve performance at all as far as I can tell,\nand they ended up breaking the PGI compiler.\n\nIt is always tempting to use the fancy addsub instructions in FFTW to\ndo complex multiplications, but the reality is that FFTW is designed\nto avoid complex multiplications in most cases (we started in the SSE\ndays), and thus they don't make any difference. We are better off\nusing the minimal possible set of AVX instructions to minimize the\nchance of triggering compiler bugs.\n\nThe same statement holds for _mm256_shuffle_pd() versus\n_mm256_permute_pd(): in theory the latter is better, in practice\neither one is rarely used. 
However, SHUFFLE is older (since the SSE\ndays) and has a higher chance of working.", "idx": 1178} {"target": 0, "func": "[PATCH] Add PointLocatorBase::locate_node()\n\nThis isn't nearly as efficient as it could be if we override it in\nPointLocatorTree, but it's simple and I'm not planning on using it in\nany inner loops.", "idx": 234} {"target": 0, "func": "[PATCH] Task assignment for bonded interactions on CUDA GPUs\n\nMade a query function to find whether any interactions of supported\ntypes exist in the global topology, so that we can make efficient\nhigh-level decisions.\n\nAdded free for gpuBondedLists pointer.\n\nMinor cleanup in manage-threading.h\n\nFixes #2679\n\nChange-Id: I0ebbbd33c2cba5808561111b0ec6160bfd2f840d", "idx": 70} {"target": 0, "func": "[PATCH] Minor changes to mdrun -h descriptions\n\nHopefully these are easier to understand. The suggested application\nfor -pinoffset is covered in the new mdrun performance section\nof the user guide, on release-5-0 branch.\n\nChange-Id: I7bc6172a70c39c02f6ca6db17e26b08d2ca3b444", "idx": 853} {"target": 0, "func": "[PATCH] resolved an exception issue which could make PME slow", "idx": 976} {"target": 0, "func": "[PATCH] Fix backwards atomic defaults\n\nThanks @stanmoore1 for reporting this.\nThis doesn't affect performance results\nbecause those tests don't use the defaults.", "idx": 45} {"target": 0, "func": "[PATCH] added (inner and outer) relaxation parameters to Gauss-Seidel\n routines, also added a backward solve procedure. This required an additional\n parameter for hypre_BoomerAMGRelax. 
Complete list of choices for smoothers\n are now: relax_type = 0 -> Jacobi or CF-Jacobi relax_type = 1 -> Gauss-Seidel\n <--- very slow, sequential relax_type = 2 -> Gauss_Seidel: interior points in\n parallel, boundary sequential relax_type = 3 -> hybrid: SOR-J mix\n off-processor, SOR on-processor with outer relaxation parameters (forward\n solve) relax_type = 4 -> hybrid: SOR-J mix off-processor, SOR on-processor \n with outer relaxation parameters (backward solve) relax_type = 5 -> hybrid:\n GS-J mix off-processor, chaotic GS on-node relax_type = 6 -> hybrid: SSOR-J\n mix off-processor, SSOR on-processor with outer relaxation parameters\n relax_type = 9 -> Direct Solve", "idx": 1500} {"target": 0, "func": "[PATCH] Complete all the complex single-precision functions of\n level3, but the performance needs further improvement.", "idx": 506} {"target": 0, "func": "[PATCH] Fix performance penalty in bspline derivatives", "idx": 1506} {"target": 0, "func": "[PATCH] Adding a SymvCtrl data structure and subsequently\n extending HermitianTridiagCtrl and HermitianEigCtrl to support it due\n to the large differences in performance from different approaches to the\n local portion of a distributed Symv", "idx": 1542} {"target": 0, "func": "[PATCH] Modified Newton Raphson method added and some basic timers A\n modified newton raphson method has been added in place of the old pure newton\n raphson. It's nothing fancy it just sets the relaxation factor for the next\n time step to 0.5 if it notes that the ratio in the norm of the current\n residual and the previous residual hasn't decreased by a factor of 10 or\n greater. The relaxation factor is set to 1.0 if it is converging fast enough.\n The main motivation behind this is to push the solution in the right\n direction when it starts to oscillate the actual answer. 
Next, a few basic\n timers have been set around the entire solution set to provide some very\n basic profiling for how long each time step takes to solve.", "idx": 55} {"target": 0, "func": "[PATCH] Iterative version of the incident_...(vertex) methods.\n\nSee Andreas' e-mail:\n\n> I just had a look at the code. The problem is that it calls\n> incident_cells, which is implemented recursively, and for a\n> vertex with many incident cells, as in your case the infinite\n> vertex, the stack is full.\n>\n> We have to put it on our todo list.\n\nI did it for 3D only because the degenerate 2D case should be\nhandled by the circulator anyway.\n\nI did not add the test which explodes the call stack (in case we plug\nthe recursive version): too slow for a testsuite. But incident_...\nmethods are used everywhere in the code anyway.", "idx": 468} {"target": 0, "func": "[PATCH] Adding Changelog for Release 3.3.00\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 3.3", "idx": 933} {"target": 0, "func": "[PATCH] Fix automated GPU ID assignment\n\nWhen we permitted separate PME-GPU ranks, we should have relaxed\nthis logic also.\n\nHowever, the performance in such cases is not very predictable, so if\nthere's a distribution of tasks with more than one task to a GPU that\nis uneven, then we should require the user to specify exactly what\nthey want. This also reinstates the 2016-era behaviour where, if\nrunning multiple PP ranks on GPUs, that mdrun will not by default\nproduce an unbalanced mapping with more than one task per GPU.\n\nChange-Id: I5b2fad317ecbb4e5e02fccd68e15350b678df34c", "idx": 350} {"target": 0, "func": "[PATCH] moved diff, fast, gradient, harris, histogram to kernel\n namespace", "idx": 105} {"target": 0, "func": "[PATCH] Faster version, making use of the fast /tmp directory. 
Also\n removes diffs*.gz from web server.", "idx": 549} {"target": 0, "func": "[PATCH] Add DenseMatrix::svd_solve().\n\nThis function fills a missing requirement in the DenseMatrix classes,\nallowing us to solve non-square systems of equations in a\nleast-squares sense. The user can pass a tolerance to svd_solve()\nwhich determines the cutoff for small singular values. svd_solve() is\na const member function: we make a copy internally instead of allowing\nLapack to modify A.\n\nNote that Eigen also has the capability to solve non-square systems of\nequations, but it is relatively slow, as discussed in this thread:\nhttps://forum.kde.org/viewtopic.php?f=74&t=102088, so having our own\nLapack-based implementation is worthwhile.", "idx": 991} {"target": 0, "func": "[PATCH] C++ math function cleanup\n\nmath/functions.h now implements a number of old and new math\nfunctions with either float, double, or integer arguments.\nManual SIMD versions of 1/sqrt have been tested with gcc and icc\non x86, Power8, Arm32 and Arm64, but with correct 'f' suffixes\non constants there is only 10-15% performance difference, so for\nnow we always use the system versions to avoid having this file\ndepend on config.h. 
Functions for third and sixth roots have\nbeen introduced to replace many of our pow() calls, and the code\nhas been cleaned up to use the new functions.\n\nRefs #1111.\n\nChange-Id: I74340987fff68bc70d268f07dbddf63eb706db32", "idx": 1260} {"target": 0, "func": "[PATCH] Add A100 performance paper.", "idx": 918} {"target": 0, "func": "[PATCH] Fixed data filename on FAST unit test.", "idx": 1018} {"target": 0, "func": "[PATCH] Revert \"move update of the status outside of the constructor\"\n\nThis reverts commit 6378a51191df7cb28a24dadc1706112c0c7df926.\n\nThe commit was incorrect and was introducing a huge performance issue", "idx": 1415} {"target": 0, "func": "[PATCH] Changed the way FAST handles different datatypes internally", "idx": 1213} {"target": 0, "func": "[PATCH] Import AMD Piledriver DGEMM kernel generated by AUGEM. So\n far, this kernel doesn't deal with edge.\n\nAUGEM: Automatically Generate High Performance Dense Linear Algebra\nKernels on x86 CPUs.\nQian Wang, Xianyi Zhang, Yunquan Zhang, and Qing Yi. In the\nInternational Conference for High Performance Computing, Networking,\nStorage and Analysis (SC'13). Denver, CO. Nov, 2013.", "idx": 28} {"target": 0, "func": "[PATCH] added a hybrid method that will first try to solve with\n diagonal scaled CG and then if convergence is too slow , attempt to use\n BoomerAMG", "idx": 91} {"target": 0, "func": "[PATCH] Remove constant acceleration groups\n\nPer Redmine discussion, this has been broken for about 10\nyears. Simplifying the update and global energy code is nice, and some\nintegrators will now be a trifle faster.\n\nThe value of SimulationAtomGroupType::Count can't change because reading old .tpr files relies\non it, but the enumeration value has a new name so that any relevant\ncode rebasing over this change can't silently still compile. This does\nmean that loops over egcNR are now slightly less efficient than they\ncould be, and some of those are in the per-step update code. 
But that\nis likely not worth changing in the current form of the code.\n\nOriginally authored by Mark Abraham at\nhttps://gerrit.gromacs.org/c/gromacs/+/8944.\n\nFixes #1354", "idx": 1110} {"target": 0, "func": "[PATCH] Added several HemmAccumulate routines so that Hegst could\n avoid cache-unfriendly SumScatter routines. The warning messages for calling\n these potentially slow redistribution routines were also toned down.", "idx": 1373} {"target": 0, "func": "[PATCH] Removed lots of units code from AMBER file loader, which made\n it unnecessarily slow", "idx": 523} {"target": 0, "func": "[PATCH] added two performance charts", "idx": 708} {"target": 0, "func": "[PATCH] Refactor sign to signbit internally\n\nThe operation being performance is equivalent of std::signbit\nthus, using signbit is more apt and removes unnecessary redefine\nof sign function in opencl jit kernel.", "idx": 206} {"target": 0, "func": "[PATCH] corrected bug in performance testing", "idx": 604} {"target": 0, "func": "[PATCH] Add Conjugate Gradient example to benchmarks - Includes\n sparse matrix\n\nCompare the performance and memory usage of sparse vs dense using conjugate\ngradient example", "idx": 663} {"target": 0, "func": "[PATCH] Example now works in parallel. 
The solver is still slow but\n it works", "idx": 140} {"target": 0, "func": "[PATCH] switch from int to bool to avoid a performance warning", "idx": 1004} {"target": 0, "func": "[PATCH] added benchmark comparing the performance of the two traits\n classes (CK vs CORE::Expr)", "idx": 114} {"target": 0, "func": "[PATCH] added a pragma to suppress a performance warning in std::map", "idx": 243} {"target": 0, "func": "[PATCH] Multi-phase solvers: Improved handling of inflow/outflow BCs\n in MULES\n\nAvoids slight phase-fraction unboundedness at entrainment BCs and improved\nrobustness.\n\nAdditionally the phase-fractions in the multi-phase (rather than two-phase)\nsolvers are adjusted to avoid the slow growth of inconsistency (\"drift\") caused\nby solving for all of the phase-fractions rather than deriving one from the\nothers.", "idx": 1470} {"target": 0, "func": "[PATCH] Update MFEM to not run the performance miniapp if code\n coverage is enabled. Runtime exceeds automated testing engine allowance.", "idx": 1142} {"target": 0, "func": "[PATCH] Final bugs fixed, now have to look at the iterator\n performance for the subtable.", "idx": 67} {"target": 0, "func": "[PATCH] Fixed FAST CUDA backend case when no features are found", "idx": 901} {"target": 0, "func": "[PATCH] Replaced Vertex_circulator by Edge_circulator to gain\n performance", "idx": 794} {"target": 0, "func": "[PATCH] This is a more correct implementation. But it isn't efficient\n and it may fail on corner cases.\n\nThat can be a problem for another day... 
(are you reading this from the future?\nSorry...!)", "idx": 248} {"target": 0, "func": "[PATCH] Adding Changelog for Release 3.2.01 [ci skip]\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 3.2\n\n(cherry picked from commit 0e0b28fd78e696f74ab8d1d6bfc1c3e2b9667f49)", "idx": 299} {"target": 0, "func": "[PATCH] Changed slow convergence stop criteria.", "idx": 1009} {"target": 0, "func": "[PATCH] EA: VCALLS turned off for ELAN4 because of performance and\n stability", "idx": 1349} {"target": 0, "func": "[PATCH] Fix the Jacobi compilation time.\n\nThis is an approach at fixing the Jacobi kernels compilation which aims at\nsacrificing as little as possible the runtime performance and get down the total\ncompilation time for this file as much as possible, all the while having a way\nto enable the full performance optimizations in an easy way.\n\nThis approach only touches the `generate` kernels in the end, and provides the\nfull performance for the `apply` and other kernels as they are fast enough to\ncompile.\n\n+ Split the Jacobi kernels into multiple files for parallel compilation and\n smaller code size.\n+ Tone down some optimizations, namely use `noinline` for one function and\n `#pragma unroll 1` in two places to prevent unrolling. This impacts the\n Jacobi `generate` kernels only.\n+ Add a compilation flag for enabling back the full optimizations.\n\n__NOTE:__ the generate kernel compilation with full optimizations still takes\nabove 30 minutes on my laptop for one architecture (Maxwell). 
Without the\noptimizations enabled, it takes less than 3 minutes.", "idx": 718} {"target": 0, "func": "[PATCH] MDRange: Performance test (3D)\n\nRuns test over multiple ranges (3D), adjusting the tile dims by powers of 2.\nCompare results of MDRange to RangePolicy implemented as Collapse<2> and\nCollapse<3>", "idx": 1377} {"target": 0, "func": "[PATCH] Add fix elstop to USER-MISC\n\nImplements inelastic energy loss for fast particles in solids.", "idx": 1095} {"target": 0, "func": "[PATCH] The test was too slow in Debug mode", "idx": 438} {"target": 0, "func": "[PATCH] corrected accidental disables of fast global to local atom\n lookup for large systems", "idx": 1073} {"target": 0, "func": "[PATCH] Add evaluable_elements_begin()/end()\n\nI put these in MeshBase like all the other iterator ranges, for\nsimplicity and consistency, but we do still need a DofMap reference to\ninitialize them.\n\nIterating through these might be slow, but creating a correct mutable\ncache instead will definitely be a huge pain, so let's just get the\ncode correct for now and optimize later if we need to.", "idx": 1203} {"target": 0, "func": "[PATCH] Limit SMT with PME on GPU\n\nFor small numbers of atoms per core, SMT can seriously deteriorate\nperformance when running both non-bondeds and PME on GPU.\nWith fewer than 10000 atoms per core, SMT is now always off by default\nwith PME on GPU and auto settings.\n\nChange-Id: I1a6b83bc81f68e89bf443e2b0ddb1fde44e2361d", "idx": 1015} {"target": 0, "func": "[PATCH] Next-generation SIMD, for SSE2, SSE4.1 and 128-bit AVX\n\nThis adds the same functionality that was previously done for the\nreference SIMD implementation This includes all the 128-bit x86\nflavors, since SSE4.1 and AVX-128 only overrides a few SSE2\ninstructions/functions. Performance appears to be identical to the\nstate before the new SIMD code on x86 when using SSE2. 
For the most\nperformance-sensitive functions I expect we will later test a few\ndifferent alternative implementations once we can benchmark the\nroutines inside actual kernels using them.\n\nChange-Id: I59d5741df345b38745f9a6d1ea3a4d27b0a66034", "idx": 726} {"target": 0, "func": "[PATCH] AMOEBA uses fast approximation for erfc()", "idx": 525} {"target": 0, "func": "[PATCH] fix for no performance logging", "idx": 944} {"target": 0, "func": "[PATCH] Fixed g_sham using more than three dimensions\n\nWhen the value given with g_sham -n was greater than 3, arrays used to\noverflow and pick_minima() did not work. pick_minima() has been updated\nto treat an arbitrary number of dimensions, but retains the particular\ncode for the two- and three-dimensional cases in the hope that these\nare faster. The logic of the complex conditionals has hopefully been\nmade easier to follow without compromising performance with modern\ncompilers. Index variables are now of gmx_large_int_t type, as\nhigh-dimensional cases can have large numbers of grid points very fast.\n\nChange-Id: If0c2f9d9ceaf2b5c4c8b1a28a942fae8349fb600", "idx": 317} {"target": 0, "func": "[PATCH] updated log performance stats\n\n- added #of threads column\n- renamed \"Number\" column to \"Count\"\n- swapped \"Second\" and \"G-cycles\" columns to have the former aligned\n with the GPU timing table\n- removed Mnbf/s and GFlops from default output, turning these back on is\n still possible via the GMX_DETAILED_PERF_STATS env var\n- normalized the NODE time and % stats with the number of cores", "idx": 979} {"target": 0, "func": "[PATCH] Move early return for nbnxm force reduction\n\nTo reduce dependencies and code complexity, the early return for\navoiding overhead of a force reduction reducing no forces at all\nhas been moved from nonbonded_verlet_t to atomdata.cpp. 
The check\nhas been changed from no non-local work to no non-local atoms, which\nshould not affect performance much.\n\nChange-Id: I3315699e15918482b321b702f6ba24209aa3a6b2", "idx": 1024} {"target": 0, "func": "[PATCH] 1. Removed \"task uccsdt energy\"-related input examples\n because they may not work and are pointless and distracting. 2. Added 2EMET\n subsection and filled in details for all new 2emet options. 3. Added\n subsection on RESTART which is complete except for examples. 4. Updated\n response properties section slightly. 5. Added new section on performance\n suggestions since so many users need this, but only the outline exists. I\n will fill in this content in a few days. 6. Removed other Hirata-era stuff\n that just seems out of place now.", "idx": 856} {"target": 0, "func": "[PATCH] add slow tag to about 60 tests that take about as much time\n as the 430 others", "idx": 988} {"target": 0, "func": "[PATCH] Minor cleanup of NMF code; I think the residue should be\n displayed in non-debugging mode (optionally with -v) so I switched to\n Log::Info. Comment on the change to pinv() then rewrite\n RandomAcolInitialization a little bit to avoid allocating memory\n unnecessarily. Unfortunately insert_cols() is a little slow because it\n allocates more memory and memcpy()s.", "idx": 564} {"target": 0, "func": "[PATCH] Increase release build timeout, macos is being a bit slow on\n Azure (#2495) (#2496)\n\nCo-authored-by: Seth Shelnutt ", "idx": 880} {"target": 0, "func": "[PATCH] Update management of linear algebra libraries\n\nManagement of detection and/or linking to BLAS and LAPACK libraries is\nre-organized. The code has migrated to its own module. This will\nhelp future extension and maintenance. This version communicates\nthings that are newsworthy and stays out of the way when nothing\nis changing.\n\nWe no longer over-write the values specified by the user for\nGMX_EXTERNAL_(BLAS|LAPACK). 
Previously, this was used to signal\nwhether detection succeeded, but that does not really get the job\ndone. Instead, the user is notified that detection failed (repeatedly,\nif they deliberately set such an option on).\n\nCorrect usage and expected behaviour in all cases is documented both\nin the code and the install guide.\n\nThe user interface is pretty much unchanged. We still don't offer full\nconfigurability (e.g. MKL for FFTs must use MKL for linear algebra\nunless GMX_*_USER is used, and the only way to get MKL for linear\nalgebra is to use it for FFTs). The size of any performance difference\nis probably very small, and if the user really needs mdrun with\ncertain FFT and tools with certain linear algebra library, they can do\ntwo configurations. Note that mdrun never calls any linear algebra\nroutines (tested empirically)!\n\nExpanded the solution of #771 by testing that the user supplied\nlibraries that actually work. If not, we emit a warning and try to use\nthem anyway.\n\nWe also now check that MKL really does provide linear algebra\nroutines, and fall back to the default treatment if it does not.\n\nRefs #771,#1186\n\nChange-Id: Ife5c59694e29a3ce73fc55975e26f6c083317d9b", "idx": 633} {"target": 0, "func": "[PATCH] Use a separate Performance log line for compute_affine_map", "idx": 460} {"target": 0, "func": "[PATCH] Tests for valid periodic actions\n\nWe expect that mdrun propagation is unaffected by changing mdp options\nthat determine whether output is written. However, orchestrating mdrun\nto collect and compute such data without affecting the propagation is\ncomplex and currently very fragile. New propagation approaches must be\nable to be tested.\n\nMany mdrun combinations of periodic outputs and periodic action of\nsimulation modules that affect propagation are compared for\ncorrectness against a simulation that did every action at every step.\n\nThese tests are fairly slow, so are in their own test binary and\nannotated appropriately. 
They run by default only in release-type\nbuilds. As they target testing the kind of coordination issues that\ntend to appear in multi-rank runs, those runs are specifically\ntargeted.\n\nThe energy tolerance for the mdrun test were far too tight. It seems\nthat tests passed anyhow because they compared runs under exactly\nthe same run conditions using (nearly) the same summation order.\nThis change slightly increases the tolerance for energies and\nmassively the tolerance for pressure comparison.\n\nChange-Id: I88ea643873ebec0e5e2b12181f4b51ad90c7b0f7", "idx": 942} {"target": 0, "func": "[PATCH] Pass Cell_handle and Vertex_handle by value instead of by\n const&. This undoes :\n\n r19107 | afabri | 2003-10-17 10:49:19 +0200 (Ven 17 oct 2003) | 2 lignes\n Added const& for gaining performance\n\nwhich was justified at the time by the fact that on VC++, handles encapsulated iterators.", "idx": 270} {"target": 0, "func": "[PATCH] AABB tree demo: - added color ramps for signed and unsigned\n distance functions (thermal for unsigned, and red/blue for signed) - fix moc\n warning issue (thanks Manuel Caroli) - fix strange value in performance\n section of user manual (we now understand that the KD-tree is making\n something really nasty only for large point sets - and we'll investigate the\n KD tree more)", "idx": 1075} {"target": 0, "func": "[PATCH] Isolate PME GPU spline parameter indexing in inline functions\n\nThis makes the spread/gather kernels more readable and allows\nto change the spline indexing scheme much easier in\nthe future. 
Performance should not be affected.\nMore TODO regarding the spline indexing scheme is marked.\n\nChange-Id: If735cccf2ce82f46b483c9ada6f309425c51f67e", "idx": 54} {"target": 0, "func": "[PATCH] Convert nbnxn_atomdata_t to C++\n\nChanged all manually managed pointer to std::vector.\nSplit of a Params and a SimdMasks struct.\nChanged some data members to be private, more to be done.\n\nThis change is ony refactoring, no functional changes.\n\nNote: minor, negligible performance impact of the nbnxn gridding\ndue to (unnecessary) initialization of std::vector during resize().\n\nChange-Id: I9c70a1f8f272c80a7cf335fcbd867bd79c4102a2", "idx": 748} {"target": 0, "func": "[PATCH] This test was too slow in Debug mode", "idx": 1425} {"target": 0, "func": "[PATCH] Update tests for changed APIs.\n\nRandom forests generally work better, but it is not guaranteed for the vc2\ndataset, so I am still requiring only that it gets 90% of the decision tree\nperformance in the worst case. I expect it will generally be better, but there\nare still situations where it may not be (because of the randomness).", "idx": 1526} {"target": 0, "func": "[PATCH] Final performance tuning on the dense version of the\n algorithm", "idx": 158} {"target": 0, "func": "[PATCH] Added wrappers for ScaLAPACK QR factorization (and an ability\n to test it from tests/lapack_like/QR) and subsequently revealed a (perhaps\n recently introduced) performance issue in Elemental's QR", "idx": 19} {"target": 0, "func": "[PATCH] Minor tweaks to the DD setup\n\nDynamic load balancing is now turned on when the total performance\nloss is more than 2% (lower than that will not help).\nThe check for large prime factors should be done on the PP node\ncount when -npme is set by the user.\n\nChange-Id: Ib81b56a7cb071540b143a4bfc98758788a8ac07d", "idx": 17} {"target": 0, "func": "[PATCH] Added free-energy kernel performance note\n\nChange-Id: Iea5d2b124633c4188753c7b1ebb6f964edb2f644", "idx": 1270} {"target": 0, "func": "[PATCH] 
#1009 Matrix return type for det, getMinor and cofactor Could\n result in a performance loss, but the implementation is not efficient anyway.\n No deprecation since users (and unit tests) have relied on automatic\n typecasting to Matrix.", "idx": 1362} {"target": 0, "func": "[PATCH] Support pinning in HostAllocator\n\nWe want the resize / reserve behaviour to handle page locking that is\nuseful for efficient GPU transfer, while making it possible to avoid\nlocking more pages than required for that vector. By embedding the\npin()/unpin() behaviour into malloc() and free() for the allocation\npolicy, this can be safely handled in all cases.\n\nAdditionally, high-level code can now choose for any individual vector\nwhen and whether a pinning policy is required, and even manually\npin and unpin in any special cases that might arise.\n\nWhen using the policy that does not support pinning, we now use\nAlignedAllocator, so that we minimize memory consumption.\n\nChange-Id: I807464222c7cc7718282b1e08204f563869322a0", "idx": 274} {"target": 0, "func": "[PATCH] Make sure multiprocessor performance for single vectors is\n unaffected by the multivector capability.", "idx": 1444} {"target": 0, "func": "[PATCH] Performance and thread-safety requires a lock around each\n constraint row acquisition, not just each constraint row entry.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@5893 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 359} {"target": 0, "func": "[PATCH] Reducing dependencies. Print functions are generally not\n fast anyway, inlining them leads to unnecessary dependencies and larger\n headers. Removing print functions from headers.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1112 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 176} {"target": 0, "func": "[PATCH] Extensive commenting and instructions on how to run fast kde\n is completed", "idx": 1508} {"target": 0, "func": "[PATCH] genbox now performs as expected. 
fast and reliable.", "idx": 866} {"target": 0, "func": "[PATCH] trivial change for DofMap::dof_indices to increase\n performance when there are no element-based DOFs\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1127 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 1326} {"target": 0, "func": "[PATCH] Added FAST feature detector example", "idx": 429} {"target": 0, "func": "[PATCH] add nvtx to measure performance", "idx": 769} {"target": 0, "func": "[PATCH] Adding a Fast configuration", "idx": 205} {"target": 0, "func": "[PATCH] Fixed FAST files to comply with new directory structure.", "idx": 343} {"target": 0, "func": "[PATCH] Adding specific code for Tet + Bbox_3 do_intersect as it\n should be ok for Bbox_3 to degenerate.\n\nAdding specific code for Tet + Bbox_3 do_intersect as it should be ok for Bbox_3 to degenerate.\n\nThe previous code failed in case Bbox_3 is degenerate.\n\nI use the result = result || predicate(); to keep the maybe inside result.\nIf certain the code returns early.\nI also avoid the %4 as this is a slow operation, but not sure that this is worth compared to the rest.", "idx": 265} {"target": 0, "func": "[PATCH] fixed performance print when run is terminated or with\n minimizers", "idx": 456} {"target": 0, "func": "[PATCH] Revert \"Updated boost compute version tags\"\n\nThis reverts commit b6d8e2d85c358f9d2be69bd09fb8b64bda9fcc04.\n\nThis commit was causing performance drops on certain hardware", "idx": 1339} {"target": 0, "func": "[PATCH] flag two more subroutines can trigger the variable tracking\n message and slow down compilation", "idx": 1303} {"target": 0, "func": "[PATCH] some performance stuff (matrix products)", "idx": 667} {"target": 0, "func": "[PATCH] Knowing when the tree fails and we're stuck with a linear\n search is useful for performance testing too\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@4595 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 1000} {"target": 0, "func": "[PATCH] Add workaround 
for performance regression", "idx": 298} {"target": 0, "func": "[PATCH] suppress performance warning concerning an assertion", "idx": 1486} {"target": 0, "func": "[PATCH] Separate CPU NB kernel and buffer clearing subcounters\n\nThis is aimed to allow comparing the performance of the pair-interaction\nkernels separately from the force buffer clearing.\n\nChange-Id: Ifb2b4b3e5a43ac2ee547da651f9432a22fe58421", "idx": 572} {"target": 0, "func": "[PATCH] On second thoughts, follow Kokkos best performance\n recommendations when setting OpenMP env vars", "idx": 1067} {"target": 0, "func": "[PATCH] Backport bug-fix from next: |\n ------------------------------------------------------------------------ |\n r65327 | odevil | 2011-09-06 17:21:27 +0200 (Tue, 06 Sep 2011) | 1 line |\n Changed paths: | M\n /branches/next/Triangulation_2/include/CGAL/Triangulation_2.h | | too\n conservative check removed, for fast removal in delaunay 2d |\n ------------------------------------------------------------------------", "idx": 64} {"target": 0, "func": "[PATCH] List: Reinstated construction from two iterators and added\n construction from an initializer list\n\nUntil C++ supports 'concepts' the only way to support construction from\ntwo iterators is to provide a constructor of the form:\n\n template\n List(InputIterator first, InputIterator last);\n\nwhich for some types conflicts with\n\n //- Construct with given size and value for all elements\n List(const label, const T&);\n\ne.g. to construct a list of 5 scalars initialized to 0:\n\n List sl(5, 0);\n\ncauses a conflict because the initialization type is 'int' rather than\n'scalar'. 
This conflict may be resolved by specifying the type of the\ninitialization value:\n\n    List sl(5, scalar(0));\n\nThe new initializer list constructor provides a convenient and efficient alternative\nto using 'IStringStream' to provide an initial list of values:\n\n    List list4(IStringStream(\"((0 1 2) (3 4 5) (6 7 8))\")());\n\nor\n\n    List list4\n    {\n        vector(0, 1, 2),\n        vector(3, 4, 5),\n        vector(6, 7, 8)\n    };", "idx": 913} {"target": 1, "func": "[PATCH] Fixed performance bug in mixed precision", "idx": 469} {"target": 1, "func": "[PATCH] added early quit option to accelerate distance vs user\n defined distance check", "idx": 491} {"target": 1, "func": "[PATCH] Copy team rendezvous implementation from master to address\n performance issue #936", "idx": 1258} {"target": 1, "func": "[PATCH] Re-enable i-atom type local mem prefetch in OpenCL\n\nFor reasons unknown this has been disabled in the original OpenCL\nimplementation. However, it turns out that prefetching does have\nsubstantial performance benefits, especially on AMD (>10%) and in some\ncases on NVIDIA too (although not on Maxwell).\n\nThis change re-enables prefetching code-path and turns it on\nfor AMD devices. For NVIDIA the decision will be revisited later.\n\nThe GMX_OCL_ENABLE_I_PREFETCH/GMX_OCL_DISABLE_I_PREFETCH environment\nvariables allow testing prefetching with future architectures/compilers.\n\nChange-Id: I8324d62d3d78e0a1577dd3125edf059d3b311c2f", "idx": 1164} {"target": 1, "func": "[PATCH] Duplicate and modify functions as they are more efficient", "idx": 1188} {"target": 1, "func": "[PATCH] Write element truth table to exodus to improve performance\n\nThe Exodus API allows for an element truth table to optionally be\nsent to the Exodus library before any element data is written. 
The\ntruth table simply tells which variables exist on which blocks.\nSending this truth table to Exodus allows for memory to be allocated\nin advance, making for much more efficient writing of data,\nespecially if there is a large number of element blocks.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@5358 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 1513} {"target": 1, "func": "[PATCH] Template free-energy kernel on soft-core\n\nTemplated the free-energy kernel on the presence of soft-core\nand the soft-core r-power.\nNot doing the soft-core math when not using soft-core doubles\nthe speed of the free-energy kernel.\nTemplating for the soft-core power gives 15% performance improvement\nwith soft-core power 6.\nDouble precision r variables are only needed with r-power 48.\n\nChange-Id: I5a37307b2a83304a40343a0708afce46f4bdcf75", "idx": 352} {"target": 1, "func": "[PATCH] Improve Ginkgo device arch flags management.\n\n1. Use a newer version of CAS which makes Auto become All when no\n architecture was detected. This ensures that compiling on a node without\n a GPU will provide good performance (at the cost of binary size and\n compile time), instead of failing to provide optimized architecture\n specific kernels.\n2. Use the `variable` version of CAS instead of `target` so that we do\n not call the CAS redundantly in several tests.\n3. 
Fix HIP AMDGPU flags which were not properly passed:\n `target_compile_options` do not seem to work with `hip_add_library`, so\n the flags need to be passed to `hip_add_library` (or executable)\n directly.", "idx": 180} {"target": 1, "func": "[PATCH] moved loading of harmonic bonds to the top level, more\n efficient this way", "idx": 487} {"target": 1, "func": "[PATCH] Performance improvements to CPU Anisotropic Diffusion (#2174)\n\n* Performance improvements to CPU Anisotropic Diffusion\n\n(cherry picked from commit a4713f1aa102ad693129086bfdb9aa2a9d2fb1f7)", "idx": 434} {"target": 1, "func": "[PATCH] Implemented reordering of loads and stores so that the real\n and imaginary part are loaded/stored together. This should improve\n out-of-cache performance in the presence of associativity conflicts, and\n maybe worsen in-cache performance because of worse scheduling. Enabled for\n now, for experimental purposes.", "idx": 334} {"target": 1, "func": "[PATCH] Fix CUDA architecture dependent issues\n\nOnly device code gets generated in multiple passes and therefore\ntarget architecture-dependent macros like __CUDA_ARCH__ or our own\nIATYPE_SHMEM (which also depends on __CUDA_ARCH__) are not usable in\nhost code as these will be both undefined. As a result, current code\nover-allocated dynamic shared memory. This has no negative side-effect.\nThis change replaces the use of macros with runtime device compute\ncapability checks. 
Also texture objects are now actually enabled,\nwhich give very minor performance improvements.\nNote that on Maxwell + CUDA 7.0 there is a 20% performance regression\nfor the tabulated Ewald kernel (which is not used by default), which\nmagically disappears when texture references are used instead.\n\nChange-Id: I1f911caad85eb38d6a8e95f3b3923561dbfccd0e", "idx": 557} {"target": 1, "func": "[PATCH] Improve object_type detection performance (#2792) (#2793)\n\nThis improves the object APIs performance for detecting types by\nswitching from listing all items in the URI to checking only for the\nexistence of the group indicator, array schema file or array schema\nfolder. We also switch the order to check for the array schema folder\nfirst, since it is most likely to exist based on the assumption that\nthere are more arrays than there are groups.\n\nCo-authored-by: Seth Shelnutt ", "idx": 592} {"target": 1, "func": "[PATCH] Fix performance of 4-dslash_domain_Wall_4d.cuh kernels with\n xpay enabled: store coefficients in __constant__ memory to remove register\n spilling", "idx": 1535} {"target": 1, "func": "[PATCH] Use more efficient way of getting number of classes.", "idx": 247} {"target": 1, "func": "[PATCH] Serious performance bug in threaded code fixed. Now the main\n thread goes after the children are launched.", "idx": 347} {"target": 1, "func": "[PATCH] Avoid calculating distances after an Elkan prune. 
Slight,\n nearly negligible performance gains.", "idx": 698} {"target": 1, "func": "[PATCH] Fixed put_in_list for better performance", "idx": 190} {"target": 1, "func": "[PATCH] Add TypeTensor::solve().\n\nThis is slightly more efficient than inversion followed by multiplication.", "idx": 537} {"target": 1, "func": "[PATCH] POWER10: Optimize dgemv_n\n\nHandling as 4x8 with vector pairs gives better performance than\nexisting code in POWER10.", "idx": 828} {"target": 1, "func": "[PATCH] Halved the cost of the pull communication\n\nWith DD the PBC reference coordinates are now only communicated\nafter DD repartitioning. This reduces the number of MPI_alltoall\ncalls from 2 to 1 per step, which can significantly improve\nperformance at high parallelization.\n\nAdded a cycle counter for pull potential.\n\nAdded checks for zero pull vectors to avoid div by 0.\n\nChange-Id: Ib89ba9e14eaa887f59a5087135580bc29a20d7d0", "idx": 297} {"target": 1, "func": "[PATCH] Added the CGAL::Multiset class (based on a red-black tree),\n which extends the std::multiset class and makes it more efficient in many\n cases.", "idx": 1235} {"target": 1, "func": "[PATCH] Convert aligned moves to unaligned\n\nshould have no performance impact on reasonably modern cpus and fixes occasional crashes in actual user code.", "idx": 224} {"target": 1, "func": "[PATCH] More efficient nlp_solver_evaluate", "idx": 1139} {"target": 1, "func": "[PATCH] adding requested size threshold to improve performance of\n small size requests, removed bug check", "idx": 1301} {"target": 1, "func": "[PATCH] functionObjects::wallHeatFlux: More efficient evaluation of\n heat-flux\n\nwhich avoids the need for field interpolation and snGrad specification and\nevaluation.\n\nResolves patch request https://bugs.openfoam.org/view.php?id=2725", "idx": 1043} {"target": 1, "func": "[PATCH] fixed several bugs in the load balance performance prints", "idx": 1313} {"target": 1, "func": "[PATCH] Remove unnecessary ICC flags affecting 
performance\n\nSince we add -msse/../-mavx based on the acceleration we shouldn't\nadd -mtune=core2 anymore. Especially because it is added later\nand takes precedence over the (higher) acceleration flag.\n\n-ip and -funroll-all-loops could also be deleted because they don't\nseem to give any significant performance improvement, and might\nincrease compilation time, but they don't hurt gromacs performance.\n\nIn theory it could help to use -xavx instead of -mavx but I can't\nmeasure a difference.\n\nChange-Id: Icd11c40c3cd3ef2ae6ef42f07d5d75c228593f51", "idx": 1283} {"target": 1, "func": "[PATCH] erased the function Oriented_Side::oriented_side(Plane_3,\n Point_3): it was buggy before and not efficient enough after it has been\n corrected.", "idx": 1086} {"target": 1, "func": "[PATCH] Optimize get_nz by avoiding integer division\n\nInteger divisions are very costly, and typically take ~20-26 clocks\ncycles for 32-bit integers, and ~85-100 clock cycles for 64-bit\nintegers. The default size of indices was changed to 64-bit in commit\na0c6de6c6, causing a significant performance hit in one of the get_nz\nmethods.\n\nThis commit rewrites said get_nz method to avoid divisions, and instead\nuse additions. These have a typical latency of 1 cycle, and a throughput\nof 0.25-0.33 cycles, regardless of argument size.", "idx": 1340} {"target": 1, "func": "[PATCH] add collisions, split vectors into components for performance", "idx": 427} {"target": 1, "func": "[PATCH] make use of CUDA stream priorities\n\nCUDA 5.5 introduced steam priorities with 2 levels. We make use of this\nfeature by launching the non-local non-bonded kernel in a high priority\nstream. As a consequence, the non-local kernel will preempt the local\none and finish first. 
This will improve performance in multi-node runs\nby reducing the possibility of late arrival of non-local forces.\n\nChange-Id: I4efc65546e4135f12006c0422e1fca42a788129f", "idx": 1124} {"target": 1, "func": "[PATCH] Optimize Hex8::volume() slightly.\n\nIf my calculations are correct, the \"geometric\" formula I used\npreviously had about 12 dot products and 12 cross products, while the\ncurrent one only has 4 of each. This formula is derived by writing out the\nstandard volume formula and dropping terms which are zero due to the\nsymmetry of the integrand and/or triple products containing two copies\nof the same vector.\n\nWhile the original geometric formula was already pretty fast, this one\nis about 1.7x faster (about 0.1697s vs 0.2866s to call volume() on\n3.375M Hex8 elements).", "idx": 1390} {"target": 1, "func": "[PATCH] Parallelize s3 multipart on disconnect\n\nOn disconnect of S3 we can parallelize the marking of multi-part uploads\nas complete. This allows us to remove the exclusive lock early and\nincrease performance if we have a large number of outstanding requests.", "idx": 374} {"target": 1, "func": "[PATCH] shorten ESTIMATE planning time for certain weird sizes\n\nFFTW includes a collection of \"solvers\" that apply to a subset of\n\"problems\". Assume for simplicity that a \"problem\" is a single 1D\ncomplex transform of size N, even though real \"problems\" are much more\ngeneral than that. FFTW includes three \"prime\" solvers called\n\"generic\", \"bluestein\", and \"rader\", which implement different\nalgorithms for prime sizes.\n\nNow, for a \"problem\" of size 13 (say) FFTW also includes special code\nthat handles that size at high speed. It would be a waste of time to\nmeasure the execution time of the prime solvers, since we know that\nthe special code is way faster. However, FFTW is modular and one may\nor may not include the special code for size 13, in which case we must\nresort to one of the \"prime\" solvers. 
To address this issue, the\n\"prime\" solvers (and others) are proclaimed to be \"SLOW\". When\nplanning, FFTW first tries to produce a plan ignoring all the SLOW\nsolvers, and if this fails FFTW tries again allowing SLOW solvers.\n\nThis heuristic works ok unless the sizes are too large. For example\nfor 1044000=2*2*2*2*2*3*3*5*5*5*29 FFTW explores a huge search tree of\nall zillion factorizations of 1044000/29, failing every time because\n29 is SLOW; then it finally allows SLOW solvers and finds a solution\nimmediately.\n\nThis patch proclaims solvers to be SLOW only for small values of N.\nFor example, the \"generic\" solver implements an O(n^2) DFT algorithm;\nwe say that it is SLOW only for N<=16.\n\nThe side effects of this choice are as follows. If one modifies FFTW to\ninclude a fast solver of size 17, then planning for N=17*K will be\nslower than today, because FFTW will try both the fast solver and the\ngeneric solver (which is SLOW today and therefore not tried, but is no\nlonger SLOW after the patch). If one removes a fast solver, of size say\n13, then he may still fall into the current exponential-search behavior\nfor \"problems\" of size 13*HIGHLY_FACTORIZABLE_N.\n\nIf somebody had complained about transforms of size 1044000 ten years\nago, \"don't do that\" would have been an acceptable answer. I guess the\nbar is higher today, so I am going to include this patch in our 3.3.1\nrelease despite its side-effects for people who want to modify FFTW.", "idx": 746} {"target": 1, "func": "[PATCH] New data parallel routines to improve performance", "idx": 177} {"target": 1, "func": "[PATCH] ORourkeCollision: Corrected bugs and added more efficient\n collision detection See http://bugs.openfoam.org/view.php?id=2097", "idx": 367} {"target": 1, "func": "[PATCH] Add check to remove zero Charmm dihedrals\n\nProper torsions where the force constant is zero\nin both A and B states are now removed. 
We also\ncheck for other angle, torsion, and restraint\nfunctional types, and if all parameters are zero\nfor these the interaction is not added. This will\nnot change any results, but increase performance\nslightly by not calculating unnecessary interactions.\nFixes #810.\n\nChange-Id: I37ecd06d0641008593edab29e5b08433bde7b6cc", "idx": 1261} {"target": 1, "func": "[PATCH] Fix Wundef warnings\n\nAlso fixes a performance bug in gmx_simd_invsqrt_pair_d. Previously it\ndid an unnecessary number of iterations because it used a non-existing\npreprocessor variable.\n\nChange-Id: Idcdf3872b5a169e8690721bbe83922a4ab280da8", "idx": 386} {"target": 1, "func": "[PATCH] optimize create_atoms performance for large boxes and small\n regions. warn if taking a long time", "idx": 194} {"target": 1, "func": "[PATCH] Improve performance of pair_reaxc, this change is safe\n because the non-bonded i-loop doesn't include ghost atoms; this optimization\n is already included in the USER-OMP version", "idx": 1420} {"target": 1, "func": "[PATCH] Add FEMContext algebraic_type(), custom_solution()\n\nThe NONE and DOFS_ONLY options enable more efficient use of FEMContext\nin cases where it's only used as a container of FE objects.\n\nThe OLD option (with a properly parallelized custom_solution) enables\nthe use of FEMContext to assist in evaluating old solutions during\nprojections.", "idx": 1275} {"target": 1, "func": "[PATCH] exclude slow macos mpich steps", "idx": 95} {"target": 1, "func": "[PATCH] ResultTile::coord() performance (#1689)\n\nThis removes the branch within the ResultTile::coord() implementation to\ntest if it contains zipped or unzipped coordinates.\n\nThe function contract is slightly modified because now the caller must ensure\nthat an underlying coordinate exists at their requested position and dimension.\nCurrently, the `ResultTile::coord()` may return null if nothing is found.
This\ndoes not appear to be an expected output anywhere in the unit tests or code.\n\nIn a certain read benchmark that passes through this routine multiple times,\nthis patch reduces the avg execution time from 18.15s to 17.00s.", "idx": 1081} {"target": 1, "func": "[PATCH] Use arma::pow to use fast armadillo computation", "idx": 1143} {"target": 1, "func": "[PATCH] Template free-energy kernel on differing coul/vdw soft-core\n\nThe power function used for the soft-core potential is expensive.\nWhen using the same lambda and alpha parameters for Coulomb and VdW,\nwe can skip one power core, giving a 10% performance improvement.\n\nChange-Id: I8733838c6c32ef2b6fee5a6fb97657679f9bd3b3", "idx": 1424} {"target": 1, "func": "[PATCH] Performance fix - Stop setting the color of each edge to\n black, which could be quite long for a big item, and set the attribute value\n of the shader to black instead.", "idx": 1314} {"target": 1, "func": "[PATCH] Use CUSPARSE_CSRMV_ALG2 for seemingly better performance", "idx": 519} {"target": 1, "func": "[PATCH] Fix (ish) slow searchbox on windows.", "idx": 1262} {"target": 1, "func": "[PATCH] Change fix box/relax example to be more efficient\n\ngit-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12532 f3b2605a-c512-4ea7-a41b-209d697bcdaa", "idx": 779} {"target": 1, "func": "[PATCH] s390x: Add vectorized sgemm kernel for Z14 and newer\n\nAdd a new GEMM kernel implementation to exploit the FP32 SIMD\noperations introduced with z14 and employ it for SGEMM on z14 and newer\narchitectures.\n\nThe SIMD extensions introduced with z13 support operations on\ndouble-sized scalars in vector registers. Thus, the existing SGEMM code\nwould extend floats to doubles before operating on them. z14 extended\nSIMD support to operations on 32-bit floats.
By employing these\ninstructions, we can operate on twice the number of scalars per\ninstruction (four floats in each vector register) and avoid the\nconversion operations.\n\nThe code is written in C with explicit vectorization. In experiments,\nthis kernel improves performance on z14 and z15 by around 2x over the\ncurrent implementation in assembly. The flexibility of the C code paves\nthe way for adjustments in subsequent commits.\n\nTested via make -C test / ctest / utest and by a couple of additional\nunit tests that exercise blocking (e.g., partial register blocks with\nfewer than UNROLL_M rows and/or fewer than UNROLL_N columns).\n\nSigned-off-by: Marius Hillenbrand ", "idx": 594} {"target": 1, "func": "[PATCH] Improve performance of PairReaxCKokkos", "idx": 807} {"target": 1, "func": "[PATCH] functionObjects: surfaceFieldValue, volFieldValue: Various\n improvements\n\nA number of changes have been made to the surfaceFieldValue and\nvolFieldValue function objects to improve their usability and\nperformance, and to extend them so that similar duplicate functionality\nelsewhere in OpenFOAM can be removed.\n\nWeighted operations have been removed. Weighting for averages and sums\nis now triggered simply by the existence of the \"weightField\" or\n\"weightFields\" entry. Multiple weight fields are now supported in both\nfunctions.\n\nThe distinction between oriented and non-oriented fields has been\nremoved from surfaceFieldValue. There is now just a single list of\nfields which are operated on. Instead of oriented fields, an\n\"orientedSum\" operation has been added, which should be used for\nflowRate calculations and other similar operations on fluxes.\n\nOperations minMag and maxMag have been added to both functions, to\ncalculate the minimum and maximum field magnitudes respectively.
The min\nand max operations are performed component-wise, as was the case\npreviously.\n\nIn volFieldValue, minMag and maxMag (and min and max operations when\napplied to scalar fields) will report the location, cell and processor\nof the maximum or minimum value. There is also a \"writeLocation\" option\nwhich if set will write this location information into the output file.\nThe fieldMinMax function has been made obsolete by this change, and has\ntherefore been removed.\n\nsurfaceFieldValue now operates in parallel without accumulating the\nentire surface on the master processor for calculation of the operation.\nCollecting the entire surface on the master processor is now only done\nif the surface itself is to be written out.", "idx": 889} {"target": 1, "func": "[PATCH] Speed up mtop_util atom lookup\n\nThe lookup of atom indices and properties on global atom index have\nbeen sped up by moving functions to a new header file mtop_lookup.h\nand by storing start and end global atom indices in gmx_mtop_t.\nAnother performance improvement is that the previous molblock index is\nused as starting value for the next search.\nThe atom+residue lookup function now also returns the residue index.\nThis change also simplifies the code, since we no longer need a lookup\ndata structure.\nA large number of files are touched because the t_atom return pointer\nis changed to const also in the atomloop functions.\n\nChange-Id: I185b8c2e614604e9561190dd5e447077d88933ca", "idx": 242} {"target": 1, "func": "[PATCH] Add temporary allocations to PCG to avoid continual\n allocation / freeing.
This restores performance to regular CG level", "idx": 448} {"target": 1, "func": "[PATCH] Non-PME OPT calculations now use a much more efficient\n Cartesian field algorithm for the dipole response force terms.", "idx": 673} {"target": 1, "func": "[PATCH] Fixed a potential performance bug in the KDE code that\n resulted in a stricter pruning criterion.", "idx": 885} {"target": 1, "func": "[PATCH] Fixed a performance regression in multi-GPU on CUDA", "idx": 842} {"target": 1, "func": "[PATCH] Kokkos: ViewAssignment: fix critical performance bug for\n unmanaged\n\nTurns out the unmanaged views were not actually unmanaged. The trait\nwas not actually used in the determination of tracking in case of\nassigning a managed view to an unmanaged. The only way to get a truly\nunmanaged view was to start with an unmanaged view wrapping a pointer.", "idx": 1286} {"target": 1, "func": "[PATCH] potentially improved performance of coarsening and\n interpolation by using different Commpkg for strength matrix S. Added a new\n parameter S_commpkg_switch which sets the smallest strength threshold, for\n which this capability is used. 
This required the addition of a new parameter\n (int array that maps S-indices to A-indices) to the interpolation routine.\n Note that while this change does not affect Falgout, CLJP, PMIS and HMIS\n convergence behaviour and complexities, it affects ruge, ruge2b and ruge3c.\n This can be avoided by setting S_commpkg_switch to 1.", "idx": 430} {"target": 1, "func": "[PATCH] Fixed performance bug.", "idx": 1483} {"target": 1, "func": "[PATCH] Use acq/rel semantics to pass flags/pointers in\n getrf_parallel.\n\nThe current implementation has locks, but the locks each only\nhave a critical section of one variable so atomic reads/writes\nwith barriers can be used to achieve the same behavior.\n\nLike the previous patch, pthread_mutex_lock isn't fair, so in a\ntight loop the previous thread that has the lock can keep it\nstarving another thread, even if that thread is about to write\nthe data that will stop the current thread from spinning.\n\nOn a 64c Arm system this improves performance by 20x on sgesv.goto.", "idx": 236} {"target": 1, "func": "[PATCH] More efficient constructor for Rep", "idx": 1318} {"target": 1, "func": "[PATCH] critical performance bug fix", "idx": 999} {"target": 1, "func": "[PATCH] Adding conditional compilation(#if defined(LOONGSON3A)) to\n avoid affecting the performance of other platforms.", "idx": 217} {"target": 1, "func": "[PATCH] Called MPI_Send_Recv at appropriate locations instead of the\n more elaborate dual calls. Performance improvement is not measurable on Linux\n though. But in principle an MPI implementation could optimize that.", "idx": 1547} {"target": 1, "func": "[PATCH] Additional parameter tweaking for performance enhancement.", "idx": 554} {"target": 1, "func": "[PATCH] SNAP optimizations, kernel fusion, large reduction of memory\n usage on the GPU, misc. 
performance optimizations.", "idx": 238} {"target": 1, "func": "[PATCH] Send Points rather than individual coordinates\n\nThis should fix a bug with LIBMESH_DIM!=3 and DistributedMesh, and it\nshould be more efficient in some cases, and it should be easier to\nrefactor.", "idx": 1488} {"target": 1, "func": "[PATCH] Set default GMX_OPENMP_MAX_THREADS to 64\n\nAs there are many new CPU with more than 32 hardware threads and\nGROMACS scales quite well to more than 32 threads,\nGMX_OPENMP_MAX_THREADS is increased from 32 to 64 threads.\nThe performance impact of this is that bitmasks are by default\n64-bit instead of 32-bit integers, which on 64-bit systems should\nonly have a (negligible) effect on cache pressure.\n\nChange-Id: I73d1c79e86f30f7fc69e1f49e1195271435e77b6", "idx": 413} {"target": 1, "func": "[PATCH] small performance fix: reuse conic1, conic2 in set(5)", "idx": 606} {"target": 1, "func": "[PATCH] resolved performance degrading changes introduced in revision\n 1319 (3)", "idx": 75} {"target": 1, "func": "[PATCH] Added a base class with a lookup table for functions cw(int)\n and ccw(int).
It results in a performance improvement.", "idx": 757} {"target": 1, "func": "[PATCH] Anisotropic smoothing perf improvements in CUDA\n\nThis improves CUDA backend performance by about 24%", "idx": 1096} {"target": 1, "func": "[PATCH] turned simplewater on again and added performance fix for\n 7.30 compilers", "idx": 1409} {"target": 1, "func": "[PATCH] Implemented the efficient computation of the second centroid.\n\nThe hierarchical clustering algorithm gets about 15% faster\n(on test Eglise Fontaine, from 91s to 76s).", "idx": 760} {"target": 1, "func": "[PATCH] Add Quad8::volume().\n\nThis speeds up a test code which calls elem->volume() on each\ndistorted QUAD9 in a 4M element mesh by approximately 650x (from about\n206s down to 0.3146s).\n\nAs far as the accuracy of the four-point quadrature rule goes, it\nappears to be quite good up to about 15% distorted elements (in the\ncontext of MeshTools::Modification::distort()). More distortion than\nthis is probably not really usable for finite element analysis,\nanyway. 
In the table below, max_rel_diff is computed using a\ntwelfth-order quadrature rule (49 points) in the reference volume\ncomputation.\n\ndistortion max_rel_diff\n0.05 3.0166222190035792e-15 (<1%)\n0.1 2.9337828206309171e-15 (<1%)\n0.15 1.9220118208802948e-02 (2%)\n0.2 1.4316734561689504e-01 (14%)\n0.3 3.6570789827049105e-01 (36.5%)\n\nIt would also be possible to implement an \"early return\" branch for\nthis function, but you would have to check the value of 8 different\n3-vectors against zero so I didn't think it was worth the extra\ncomplexity for the minimal performance improvement.", "idx": 1517} {"target": 1, "func": "[PATCH] more efficient version, which only searches when necessary\n and only once per element", "idx": 41} {"target": 1, "func": "[PATCH] fixed performance issue for vacuum with DD", "idx": 753} {"target": 1, "func": "[PATCH] Tune param.h for SkylakeX\n\nparam.h defines a per-platform SWITCH_RATIO, which is used as a measure for how fine\ngrained the blocks for gemm need to be split up. Many platforms define this to 4.\n\nThe reality is that the gemm low level implementation for SkylakeX likes bigger blocks\ndue to the nature of SIMD... 
by tuning the SWITCH_RATIO to 32 the threading performance\nimproves significantly:\n\nBefore\n Matrix SGEMM cycles MPC DGEMM cycles MPC\n 48 x 48 10756.0 10.5 -0.5% 18296.7 6.1 -1.7%\n 64 x 64 20490.0 12.9 1.4% 40615.0 6.5 0.0%\n 65 x 65 83528.3 3.3 -210.9% 96319.0 2.9 -83.3%\n 80 x 80 101453.5 5.1 -166.3% 128021.7 4.0 -76.6%\n 96 x 96 149795.1 5.9 -143.1% 168059.4 5.3 -47.4%\n 112 x 112 191481.2 7.3 -105.8% 204165.0 6.9 -14.6%\n 128 x 128 265019.2 7.9 -99.0% 272006.4 7.7 -5.3%\n\nAfter\n Matrix SGEMM cycles MPC DGEMM cycles MPC\n 48 x 48 10666.3 10.6 0.4% 18236.9 6.2 -1.4%\n 64 x 64 20410.1 13.0 1.8% 39925.8 6.6 1.7%\n 65 x 65 34983.0 7.9 -30.2% 51494.6 5.4 2.0%\n 80 x 80 39769.1 13.0 -4.4% 63805.2 8.1 12.0%\n 96 x 96 45169.6 19.7 26.7% 80065.8 11.1 29.8%\n 112 x 112 57026.1 24.7 38.7% 99535.5 14.2 44.1%\n 128 x 128 64789.8 32.5 51.3% 117407.2 17.9 54.6%\n\nWith this change, threading starts to be a win already at 96x96", "idx": 674} {"target": 1, "func": "[PATCH] More efficient SystemBase::reinit, flux_jump indicator should\n work but needs testing", "idx": 1170} {"target": 1, "func": "[PATCH] Adding optional wrappers for MKL CSR mat-vec, allowing for\n more configure-time math library detection, adding a Hermitian version of\n Lanczos, and greatly improving the performance of right\n DiagonalScale/DiagonalSolve. It appears that the current distributed sparse\n matrix-vector multiplication has scalability issues.", "idx": 1056} {"target": 1, "func": "[PATCH] allow compilation to optimize for CUDA compute cap. 3.5\n\nEnabling optimizations targeting compute capability 3.5 devices\n(GK110) slightly improves performance of both PME and RF kernels.\nThis requires a hint for the compiler optimization indicating\nthe maximum number of threads/block and minimum number of\nblocks/multiprocessor. 
This change allows nvcc >=5.0 to generate\ncode for CC 3.5 devices and switches to including PTX 3.5 code\n(instead of 3.0) in the binary.\n\nChange-Id: If7e14d31165bc05859250db7468bf6bd8c186264", "idx": 814} {"target": 1, "func": "[PATCH] Pair of send(vector) -> send(vector)\n\nThis may be more efficient in some cases and it should be easier to\nrefactor shortly.", "idx": 935} {"target": 1, "func": "[PATCH] Add class ListOfLists\n\nThis is a replacement for t_blocka, i.e. a performance optimized\nimplementation of a list of lists. It only allows appending a list\nat the end of the list of lists.\n\nChange-Id: Ib4b7f5f0e57b82c939f53e9805dc16e9d76db22b", "idx": 1172} {"target": 1, "func": "[PATCH] Modify aligned address of sa and sb to improve the\n performance of multi-threads.", "idx": 336} {"target": 1, "func": "[PATCH] Initial implementation of LP IPM-based real Basis Pursuit,\n fixing performance bugs in the real sequential KKT matrix construction for\n LPs and QPs, and adding DistSparseMatrix::GlobalRow", "idx": 1232} {"target": 1, "func": "[PATCH] Modified nb_putv to improve performance of transfers to\n processes on the same SMP node.", "idx": 219} {"target": 1, "func": "[PATCH] Even better performance figures in Poisson reconstruction\n through less pre-allocation in CGAL::Eigen_matrix", "idx": 340} {"target": 1, "func": "[PATCH] Replacing FT by RT, more efficient Sign_at", "idx": 1234} {"target": 1, "func": "[PATCH] Unroll middle jm loop in the nbnxm kernels on Ampere\n\nThe unrolling improves performance of the non-bonded kernels by up to\n12%.\n\nNote: cherry-picked backport, skip when merging.\n\nRefs #3873", "idx": 467} {"target": 1, "func": "[PATCH] minor performance fix; map double_conic to\n double_coefficients", "idx": 758} {"target": 1, "func": "[PATCH] Further improvements to multi-GPU performance", "idx": 1450} {"target": 1, "func": "[PATCH] GetPot: Use a more efficient container for UFO detection", "idx": 888} {"target": 1, "func": "[PATCH] 
Replacing dynamic_cast with libmesh_cast where appropriate -\n depending on the error checking replaced this will either lead to slightly\n more efficient NDEBUG runs or slightly more run-time checking in debug mode\n runs.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@4246 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 1386} {"target": 1, "func": "[PATCH] Fixed performance problems when many boxes are on a process:\n Removed a Sort from the communication stuff. Reordered the periodic boxes in\n the neighborhood. Changed the CommInfoFromStencil algorithm.", "idx": 1157} {"target": 1, "func": "[PATCH] Latest modifications that increased the parallel performance", "idx": 52} {"target": 1, "func": "[PATCH] Very simple change to get ~20% performance improvement in\n reading Amber prmtop files.", "idx": 141} {"target": 1, "func": "[PATCH] Fixed slow COO SpMV for the OpenMP executor\n\nMoved the `omp parallel for` to the most outer loop of the apply,\nso it is parallelized over the matrix entries instead over the number\nof right hand sides for every single matrix entry.", "idx": 1036} {"target": 1, "func": "[PATCH] Speed improvement with CGAL_HEADER_ONLY and\n WITH_{tests|examples}..\n\nWhen `CGAL_HEADER_ONLY` and `WITH_{examples|tests|demos}`, then only\nthe first call to `find_package(CGAL)` does the job. 
The subsequent\ncalls return very fast, by caching the results in global properties.", "idx": 1569} {"target": 1, "func": "[PATCH] Optimize the performance of dot by using universal intrinsics\n in X86/ARM", "idx": 876} {"target": 1, "func": "[PATCH] DynaIO option to keep or drop spline nodes\n\nThis should let us easily build meshes that retain the exact geometry of\nan IsoGeometric Analysis mesh but that don't have any topologically\ndisconnected NodeElem elements (so they should scale better with our\nexisting partitioner code) and don't have constraint equations (so they\nmight be more efficient with our existing solvers and should be\ncompatible with our reduced_basis code).", "idx": 1511} {"target": 1, "func": "[PATCH] Improve the performance of dasum and sasum when SMP is\n defined", "idx": 202} {"target": 1, "func": "[PATCH] Improved performance of prolongator through exposing\n instruction-level parallelism. This enables full bandwidth to be achieved\n on Maxwell.", "idx": 965} {"target": 1, "func": "[PATCH] Moved various operations for polynomial smoothers into setup\n phase to improve performance and added new parameters to allow use of\n different polynomials", "idx": 727} {"target": 1, "func": "[PATCH] performance improvement through moving inlinable functions to\n header file", "idx": 1480} {"target": 1, "func": "[PATCH] Hopefully improve singledot performance", "idx": 636} {"target": 1, "func": "[PATCH] s390x: Use new sgemm kernel also for DGEMM and DTRMM on Z14\n\nApply our new GEMM kernel implementation, written in C with vector intrinsics,\nalso for DGEMM and DTRMM on Z14 and newer (i.e., architectures with FP32 SIMD\ninstructions).
As a result, we gain around 10% in performance on z15, in\naddition to improving maintainability.\n\nSigned-off-by: Marius Hillenbrand ", "idx": 1338} {"target": 1, "func": "[PATCH] GEMM: skylake: improve the performance when m is small", "idx": 545} {"target": 1, "func": "[PATCH] Improved performance traces", "idx": 1354} {"target": 1, "func": "[PATCH] Exploiting memory layout for better performance in\n matrix-free restriction", "idx": 938} {"target": 1, "func": "[PATCH] CouplingMatrix::operator&=\n\nThe user and library may need to use the output of a bunch of coupling\nfunctors, and this should be more efficient than carrying around a\nbunch of matrices and iterating through all of them.", "idx": 906} {"target": 1, "func": "[PATCH] perf: minor performance improvements for bilateral", "idx": 390} {"target": 1, "func": "[PATCH] Switching to standard essential BC treatment for solver\n performance gain (thanks to Socratis for catching this!)", "idx": 345} {"target": 1, "func": "[PATCH] Replace vpermpd with vpermilpd\n\nto improve performance on Zen/Zen2 (as demonstrated by wjc404 in #2180)", "idx": 844} {"target": 1, "func": "[PATCH] bug fixes in double_conic, minor performance improvements", "idx": 458} {"target": 1, "func": "[PATCH] Test show an improvement in octree performance", "idx": 477} {"target": 1, "func": "[PATCH] nbnxn utils performance improvement for Phi\n\nAlso remove usage of unpack to load half/quarter aligned data, because\nin case of misaligned data, instead of SegF it only loaded partial data.\n\nChange-Id: Ib0f7807986e6fcbe998bd6ee41ce104666446321", "idx": 388} {"target": 1, "func": "[PATCH] Optimize consolidated fragment metadata loading\n\nThis is part one of two changes for optimizing consolidated fragment\nmetadata loading. This changes removes using VFS to always fetch the\nmetadata file size. Instead we select the file size only if not reading\nfrom a consolidated file buffer. 
We also adjust the `fragment_size`\nfunction, used by consolidation, to fetch the fragment metadata file\nsize on request. The fetching of the size on-demand for consolidation\nwill yield the same performance degradation we see in open array,\nhowever this is acceptable for patch one and will be addressed in the\nnext series with a format change.", "idx": 1292} {"target": 1, "func": "[PATCH] clover::FloatNOrder now uses vectorized load/store for\n improved performance of all algorithms that use this (clover inversion sees a\n 1.5x speedup). Added missing support for clover norm field save/restore in\n tuning.", "idx": 615} {"target": 1, "func": "[PATCH] Fixed nbnxn_4xN performance regression\n\nCommit 8e92fd67 changed the 2xNN kernel to use gmx_simd_blendnotzero_r\nand the 4xN kernel to use gmx_simd_blendv_r. Making the 4xN kernel\nconsistent with the 2xNN kernel improves the performance with AVX2\nwith 4% and 3% for the RF and PME kernels, respectively.\n\nChange-Id: Iac334865c2b2340493639300d07e7ab9c78e129f", "idx": 199} {"target": 1, "func": "[PATCH] More performance tweak in Monte Carlo sampling.", "idx": 443} {"target": 1, "func": "[PATCH] performance improvement bugfix", "idx": 1279} {"target": 1, "func": "[PATCH] Pass Uncertain and T (enum or bool) by value instead of by\n reference.
It is generally accepted that it is more efficient for small\n classes like this (and it's definitely shorter and more readable).", "idx": 1280} {"target": 1, "func": "[PATCH] Switching to Simple_cartesian leads to a performance gain of\n 50% for 100K points", "idx": 332} {"target": 1, "func": "[PATCH] fixed reallocation of nsbox bounding boxes with SSE\n\nThe reallocation of the bounding box array was not aligned and not initialized.\nThis probably did not give incorrect results, but could give a performance hit.", "idx": 1442} {"target": 1, "func": "[PATCH] implemented efficient sparsity detection for SXFunction,\n ticket #126", "idx": 1476} {"target": 1, "func": "[PATCH] Fix OMP num specify issue\n\nIn current code, no matter what number of threads specified, all\navailable CPU count is used when invoking OMP, which leads to very bad\nperformance if the workload is small while all available CPUs are big.\nLots of time are wasted on inter-thread sync. Fix this issue by really\nusing the number specified by the variable 'num' from calling API.\n\nSigned-off-by: Chen, Guobing ", "idx": 733} {"target": 1, "func": "[PATCH] Improving performance of Kokkos ReaxFF\n\ngit-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@15828 f3b2605a-c512-4ea7-a41b-209d697bcdaa", "idx": 817} {"target": 1, "func": "[PATCH] Remove memory churn in compute_affine_map, improving\n threading performance", "idx": 648} {"target": 1, "func": "[PATCH] Accelerate CSR->Ell,Hybrid conversions on CUDA.\n\n+ The previous grid dimensions for `initialize_zero_ell` were `stride *\n num_rows`, i.e. 
roughly the dense matrix dimension.\n+ Using `max_nnz_per_row * num_rows` reduces significantly the amount of threads\n created which makes this kernel call more efficient (less useless thread\n creation).", "idx": 44} {"target": 1, "func": "[PATCH] More efficient SetNonzerosSlice::evaluate", "idx": 416} {"target": 1, "func": "[PATCH] HvD: The density functionals with derivatives up to and\n including third order generated by automatic differentiation using the\n univariate Taylor series approach by Griewank et al. DOI:\n 10.1090/S0025-5718-00-01120-0.\n\nFor this code to perform well it is essential that the compiler inlines a lot\nof code. If the overloaded operators are evaluated as function calls the\ncalculations slow down by more than an order of magnitude. Some compilers\ncan only inline code within a single file. Hence the addition of nwxc.F\nthat includes all the source code of the various functionals. In particular\nthe GCC generated code saw significant performance improvements as a result.\nFor GCC one is advised to use compiler versions post 4.6 as apparently\nsignificant improvements to the inline capabilities were introduced at that\npoint.\n\nAt this point the gradient and 2nd derivative evaluation have been tested.\nFor the 3rd order derivatives test cases that actually exploit those derivatives\nare still needed. If there are any deficiencies at that level they are most\nlikely to stem from the driver routine NWXC_EVAL_DF3_DRIVER which has to\ngenerate the appropriate partial derivatives and interpolate the final results.\nThe generation of the derivatives themselves should work alright as the unit\ntests in src/nwxc/nwad/unit_tests work and the lower order derivatives\nof the functionals work.\n\nAt the moment the performance is still not as good as I would like it. I suspect\nthat the univariate approach is in part to blame for that.
The fact that all\nlower order derivatives need to be re-evaluated for every highest order partial\nderivative introduces an overhead that is close to proportional to the\nhighest order of derivative requested. I am planning to try a multivariate\napproach instead. This however has increased memory access as a downside,\nin particular during assignments. We will see what is best on balance.", "idx": 1129} {"target": 1, "func": "[PATCH] Optimizations to kspace_style MSM, including improving the\n single-core performance and increasing the parallel scalability. A bug in MSM\n for mixed periodic and non-periodic boundary conditions was also fixed.\n\ngit-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@9597 f3b2605a-c512-4ea7-a41b-209d697bcdaa", "idx": 488} {"target": 1, "func": "[PATCH] Unroll middle jm loop in the nbnxm kernels on Ampere\n\nThe unrolling improves performance of the non-bonded kernels by up to\n12%.\n\nRefs #3872", "idx": 1105} {"target": 1, "func": "[PATCH] Made Triangulation_data_structure_2::create_face more\n efficient", "idx": 1310} {"target": 1, "func": "[PATCH] Tree serialization is more efficient now.", "idx": 1482} {"target": 1, "func": "[PATCH] fixed remains of xj shifting in legacy CUDA NB kernel\n\nThis removes three extra useless flops, hence the fix results in a\nslight performance improvement.", "idx": 601} {"target": 1, "func": "[PATCH] Fix performance unnecessary copy initialization", "idx": 583} {"target": 1, "func": "[PATCH] Temporary fix for OpenCL PME gather\n\nThere is a race on the z-component of the PME forces in the OpenCL\nforce reduction in the gather kernel.
This change avoids that race.\nBut a better solution is a different, more efficient reduction.\n\nRefs #2737\n\nChange-Id: I45068c9187873548dff585044d2c8541444e385c", "idx": 668} {"target": 1, "func": "[PATCH] Deprecate the Side class\n\nFor repeated use our new Elem-based side_ptr APIs are more efficient\nnow.", "idx": 774} {"target": 1, "func": "[PATCH] Enabling of NVidia PTX backend for SYCL & nbnxmKernel\n performance optimizations", "idx": 834} {"target": 1, "func": "[PATCH] removed (harmless) left-over in nbnxn SIMD kernels\n\nThis improves performance of PME + p-coupling by about 5%.\nWith Ewald and virial, the nbnxn SIMD energy kernels were used\n(some left-over development code). The plain-C code did not do this.\n\nChange-Id: I039044fcb393bf0bcaa06f38498b2a57d60cf080", "idx": 824} {"target": 1, "func": "[PATCH] Don't use _Atomic for jobs sometimes...\n\nThe use of _Atomic leads to really bad code generation in the compiler\n(on x86, you get 2 \"mfence\" memory barriers around each access with gcc8, despite\nx86 being ordered and cache coherent).
But there's a fallback in the code that\njust uses volatile which is more than plenty in practice.\n\nIf we're nervous about cross thread synchronization for these variables, we should\nmake the YIELD function be a compiler/memory barrier instead.\n\nperformance before (after last commit)\n\n Matrix SGEMM cycles MPC DGEMM cycles MPC\n 48 x 48 10630.0 10.6 0.7% 18112.8 6.2 -0.7%\n 64 x 64 20374.8 13.0 1.9% 40487.0 6.5 0.4%\n 65 x 65 141955.2 1.9 -428.3% 146708.8 1.9 -179.2%\n 80 x 80 178921.1 2.9 -369.6% 186032.7 2.8 -156.6%\n 96 x 96 205436.2 4.3 -233.4% 224513.1 3.9 -97.0%\n 112 x 112 244408.2 5.8 -162.7% 262158.7 5.4 -47.1%\n 128 x 128 321334.5 6.5 -141.3% 333829.0 6.3 -29.2%\n\nPerformance with this patch (roughly a 2x improvement):\n\n Matrix SGEMM cycles MPC DGEMM cycles MPC\n 48 x 48 10756.0 10.5 -0.5% 18296.7 6.1 -1.7%\n 64 x 64 20490.0 12.9 1.4% 40615.0 6.5 0.0%\n 65 x 65 83528.3 3.3 -210.9% 96319.0 2.9 -83.3%\n 80 x 80 101453.5 5.1 -166.3% 128021.7 4.0 -76.6%\n 96 x 96 149795.1 5.9 -143.1% 168059.4 5.3 -47.4%\n 112 x 112 191481.2 7.3 -105.8% 204165.0 6.9 -14.6%\n 128 x 128 265019.2 7.9 -99.0% 272006.4 7.7 -5.3%", "idx": 1309} {"target": 1, "func": "[PATCH] Use non caching segment traits to accelerate arrangement\n computations", "idx": 609} {"target": 1, "func": "[PATCH] added a special cache for efficient conversion of\n Sqrt_extension to Bigfloat_interval, disabled by default", "idx": 246} {"target": 1, "func": "[PATCH] Deprecate attempts to build proxy Side objects\n\nNow that we've made Elem side objects much more efficient, we don't need\nthe old proxies anymore.", "idx": 490} {"target": 1, "func": "[PATCH] Improved performance of CCMA", "idx": 1108} {"target": 1, "func": "[PATCH] Fixed an uninitialized memory read problem and a potential\n performance issue for problems in less than 3 dimensions.", "idx": 38} {"target": 1, "func": "[PATCH] Performance enhancement", "idx": 562} {"target": 1, "func": "[PATCH] More efficient Sparsity::serialize", "idx": 
877} {"target": 1, "func": "[PATCH] Fixed a performance bug in the sweep. When calling the insert\n functions on the planar map, I made sure the non-intersect version of the\n inserts are being called. I also cleaned up the code a bit.", "idx": 1084} {"target": 1, "func": "[PATCH] whitespace cleanup, fix bug in looking for empty strings,\n improve read performance and handling of comments", "idx": 1299} {"target": 1, "func": "[PATCH] avoid computing factorization of linear matrix again, clean,\n use more efficient code by avoiding duplicates when calculating stiffness\n matrix", "idx": 501} {"target": 1, "func": "[PATCH] AABB tree: do_intersect now calls the First_primitive\n traversal traits (much faster) performance section updated", "idx": 684} {"target": 1, "func": "[PATCH] Improve performance of computeCoarseClover kernel: split\n triple matrix product into two steps to prevent redundant computation", "idx": 8} {"target": 1, "func": "[PATCH] thermophysicalModels: Changed specie thermodynamics from mole\n to mass basis\n\nThe fundamental properties provided by the specie class hierarchy were\nmole-based, i.e. provide the properties per mole whereas the fundamental\nproperties provided by the liquidProperties and solidProperties classes are\nmass-based, i.e. per unit mass. This inconsistency made it impossible to\ninstantiate the thermodynamics packages (rhoThermo, psiThermo) used by the FV\ntransport solvers on liquidProperties. In order to combine VoF with film and/or\nLagrangian models it is essential that the physical properties of the three\nrepresentations of the liquid are consistent which means that it is necessary to\ninstantiate the thermodynamics packages on liquidProperties. This requires\neither liquidProperties to be rewritten mole-based or the specie classes to be\nrewritten mass-based.
Given that most of OpenFOAM solvers operate\nmass-based (solve for mass-fractions and provide mass-fractions to sub-models) it\nis more consistent and efficient if the low-level thermodynamics is also\nmass-based.\n\nThis commit includes all of the changes necessary for all of the thermodynamics\nin OpenFOAM to operate mass-based and supports the instantiation of\nthermodynamics packages on liquidProperties.\n\nNote that most users, developers and contributors to OpenFOAM will not notice\nany difference in the operation of the code except that the confusing\n\n nMoles 1;\n\nentries in the thermophysicalProperties files are no longer needed or used and\nhave been removed in this commit. The only substantial change to the internals\nis that species thermodynamics are now \"mixed\" with mass rather than mole\nfractions. This is more convenient except for defining reaction equilibrium\nthermodynamics for which the molar rather than mass composition is usually known.\nThe consequence of this can be seen in the adiabaticFlameT, equilibriumCO and\nequilibriumFlameT utilities in which the species thermodynamics are\npre-multiplied by their molecular mass to effectively convert them to mole-basis\nto simplify the definition of the reaction equilibrium thermodynamics, e.g.
in\nequilibriumCO\n\n // Reactants (mole-based)\n thermo FUEL(thermoData.subDict(fuelName)); FUEL *= FUEL.W();\n\n // Oxidant (mole-based)\n thermo O2(thermoData.subDict(\"O2\")); O2 *= O2.W();\n thermo N2(thermoData.subDict(\"N2\")); N2 *= N2.W();\n\n // Intermediates (mole-based)\n thermo H2(thermoData.subDict(\"H2\")); H2 *= H2.W();\n\n // Products (mole-based)\n thermo CO2(thermoData.subDict(\"CO2\")); CO2 *= CO2.W();\n thermo H2O(thermoData.subDict(\"H2O\")); H2O *= H2O.W();\n thermo CO(thermoData.subDict(\"CO\")); CO *= CO.W();\n\n // Product dissociation reactions\n\n thermo CO2BreakUp\n (\n CO2 == CO + 0.5*O2\n );\n\n thermo H2OBreakUp\n (\n H2O == H2 + 0.5*O2\n );\n\nPlease report any problems with this substantial but necessary rewrite of the\nthermodynamics at https://bugs.openfoam.org\n\nHenry G. Weller\nCFD Direct Ltd.", "idx": 1199} {"target": 1, "func": "[PATCH] Inverse and make_sqrt functions added. RT / Root_of_2\n division added. Some operations with int. Comparisons function performance\n improved. Added the idea of representing a rational (when we know, by using\n the Root_of_2(FT) construction) inside the Root_of_2. The constructor\n Root_of_2(const RT&, const RT&, const RT&, bool) has now another boolean\n parameter at the end, in the case you know delta is not zero.
Some other\n goodies.", "idx": 1078} {"target": 1, "func": "[PATCH] Complete OMP version except for the inner product.\n Performance below host.", "idx": 717} {"target": 1, "func": "[PATCH] new optimization of dgemm kernel for bulldozer: 10%\n performance increase", "idx": 1548} {"target": 1, "func": "[PATCH] Replace some std::endl with newline character\n\nNote that std::endl flushes the buffer, so we can get better\nperformance by using newlines when there is no need to flush.", "idx": 926} {"target": 1, "func": "[PATCH] Fixed FAST memory leaks on CPU backend", "idx": 77} {"target": 1, "func": "[PATCH] Accelerate distance queries with a kd-tree", "idx": 1240} {"target": 1, "func": "[PATCH] Improvements to multi-GPU performance", "idx": 1355} {"target": 1, "func": "[PATCH] Adjust serialized query buffer sizes (#2115)\n\nAdjust serialized query buffer sizes\n\nThis changes the client/server flow to always send the server the\noriginal user requested buffer sizes. This solves a bug in which, with\nserialized queries, incompletes would cause the \"server\" to use smaller\nbuffers for each iteration of the incomplete query. This yields\ndecreasing performance as the buffers approached zero.
The fix here lets\nthe server always get the original user's buffer size.", "idx": 489} {"target": 1, "func": "[PATCH] made nbnxn analytical Ewald consistent\n\nThe recent addition of nbnxn analytical Ewald kernels switched those\nkernels on for local interactions, but not for non-local domains.\nThis performance bug has been fixed.\n\nChange-Id: I28abc822ee8f1cf8f7dbb5c516703145400441b2", "idx": 314} {"target": 1, "func": "[PATCH] dgemm: use dgemm_ncopy_8_skylakex.c also for Haswell\n\nThe dgemm_ncopy_8_skylakex.c code is not avx512 specific and gives\na nice performance boost for medium sized matrices", "idx": 603} {"target": 1, "func": "[PATCH] AABB tree: more on internal KD-tree used to accelerate the\n distance queries.", "idx": 682} {"target": 1, "func": "[PATCH] Use new ranges in MeshTools\n\nPlus manual caching, where it's more efficient", "idx": 1065} {"target": 1, "func": "[PATCH] MKK: Tuned to optimize performance by setting an optimized\n chunk size", "idx": 1225} {"target": 1, "func": "[PATCH] Improvement of rectangular slicing: part 2 - fast\n implementation for large blocks of sparse matrices", "idx": 836} {"target": 1, "func": "[PATCH] Fast transfers and diffusion kernels", "idx": 507} {"target": 1, "func": "[PATCH] Accelerate AABB tree traversal by passing the tolerance as\n initial min_dist", "idx": 951} {"target": 1, "func": "[PATCH] More efficient SetNonzerosSlice2::evaluate", "idx": 692} {"target": 1, "func": "[PATCH] fix for sigfpe (underflow) in ECPs: NWints/ecp is compiled\n with \"-math_library accurate\" instead of \"-math_library fast\"", "idx": 78} {"target": 1, "func": "[PATCH] Improved the performance on Powerpc by tweaking the altivec\n innerloops and changing sqrt(x) to x*invsqrt(x)", "idx": 1173} {"target": 1, "func": "[PATCH] Fixed FAST memory leaks on OpenCL backend", "idx": 995} {"target": 1, "func": "[PATCH] Inline a more efficient implementation of\n BoundingBox::intersects.\n\nThis ends up being about 9x faster due to inlining 
and improved short\ncircuiting.", "idx": 796} {"target": 1, "func": "[PATCH] sbgemm: spr: enlarge P to 256 for performance", "idx": 900} {"target": 1, "func": "[PATCH] This should slightly improve performance in non-threaded proj\n constraint generation, may save us from a race condition leading to\n inaccurate proj constraints in a few corner cases (3D, level one rule off; or\n AMR combined with periodic BCs) when we're threaded.", "idx": 1087} {"target": 1, "func": "[PATCH] Update RAS-IR.\n\nNotes:\n1. Overlap and row ordering have significant effects on convergence.\n2. Lower the matrix bandwidth, better the performance ?\n3. Sync is lower, but communication becomes higher, tradeoff!\n4. Parallelism is higher, but overhead is also higher, tradeoff!\n5. Very high accurate solves, do not provide anything.\n6. Two level and multi-levels might significantly reduce number of\niterations to converge.", "idx": 160} {"target": 1, "func": "[PATCH] This should slightly improve performance in non-threaded proj\n constraint generation, may save us from a race condition leading to\n inaccurate proj constraints in a few corner cases (3D, level one rule off; or\n AMR combined with periodic BCs) when we're threaded.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@5889 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 66} {"target": 1, "func": "[PATCH] Fix IBM VSX SIMD compiles with xlc\n\nRemove most of the previous inline asm to improve\nperformance (the optimizer works better w/o asm),\nand make sure the VSX SIMD code compiles with XLC.\n\nChange-Id: I3e8e9b4dd6102dd5503210e3b49b844ee5492342", "idx": 90} {"target": 1, "func": "[PATCH] s390x: Optimize SGEMM/DGEMM blocks for z14 with explicit loop\n unrolling/interleaving\n\nImprove performance of SGEMM and DGEMM on z14 and z15 by unrolling and\ninterleaving the inner loop of the SGEMM 16x4 and DGEMM 8x4 blocks.\nSpecifically, we explicitly interleave vector register loads and\ncomputation of two iterations.\n\nNote 
that this change only adds one C function, since SGEMM 16x4 and\nDGEMM 8x4 actually map to the same C code: they both hold intermediate\nresults in a 4x4 grid of vector registers, and the C implementation is\nbuilt around that.\n\nSigned-off-by: Marius Hillenbrand ", "idx": 1162} {"target": 1, "func": "[PATCH] Use SIMD transpose scatter in bondeds\n\nThe angle and dihedral SIMD functions now use the SIMD transpose\nscatter functions for force reduction. This change gives a massive\nperformance improvement for bondeds, mainly because the dihedral\nforce update did a lot of vector operations without SIMD that are\nnow fully replaced by SIMD operations.\n\nChange-Id: Id08e6c83d4c9943d790bfe2a40c70fa4697077af", "idx": 871} {"target": 1, "func": "[PATCH] Improved CUDA SIFT coalescing and performance", "idx": 34} {"target": 1, "func": "[PATCH] Improve the performance of zasum and casum with AVX512\n intrinsic", "idx": 1376} {"target": 1, "func": "[PATCH] JN: ELAN_ACC for fast accumulate", "idx": 383} {"target": 1, "func": "[PATCH] Improve time performance", "idx": 503} {"target": 1, "func": "[PATCH] Code factorization + \"manual\" min/max for slightly better\n performance", "idx": 428} {"target": 1, "func": "[PATCH] More getElementByMass performance improvements\n\nReduces time taken from 15.7 seconds (last commit) to 2.2 seconds by avoiding\nQuantity comparisons altogether.", "idx": 132} {"target": 1, "func": "[PATCH] PERF: improvements to element wise operations in CPU backend\n\n- Improved performance when all buffers can be indexed linearly", "idx": 319} {"target": 1, "func": "[PATCH] Use boost::thread::hardware_concurrency instead of the\n std::thread one\n\nstd::thread::hardware_concurrency is not implemented in my GCC version\nWas causing a huge performance problem on Linux.", "idx": 1153} {"target": 1, "func": "[PATCH] Add DofMap::is_evaluable()\n\nThis is O(log(send_list.size())), which may be fast enough for most\nusers; there's no obvious way to do better 
without unordered_set.", "idx": 993} {"target": 1, "func": "[PATCH] Improve the performance of rot by using AVX512 and AVX2\n intrinsic", "idx": 119} {"target": 1, "func": "[PATCH] made the code MUCH more efficient in various places", "idx": 1186} {"target": 1, "func": "[PATCH] Improve performance running on multiple GPUs (#3347)\n\n* Use multiple streams to broadcast positions\n\n* Use multiple streams to reduce forces\n\n* Adds sync between default stream and peer-copy\n\n* Minor cleanup\n\nCo-authored-by: David Clark ", "idx": 759} {"target": 1, "func": "[PATCH] HvD: Mainly optimized bits of code. The automatic\n differentiation module still has scope for optimization (which was also\n pointed out by one of the reviewers of the paper). So I have improved a few\n things:\n\n1. I have added a USE_FORTRAN2008 flag, when set, to use the popcnt, trailz and\n leadz Fortran 2008 intrinsics rather than the corresponding Fortran\n implementations util_popcnt, util_leadz and util_trailz. The util routines\n are still used by default.\n\n2. I optimized the powx routine which calculates x**y to use less exponentiation\n evaluations.\n\n3. I added a routine (powix) to do x**i where i is an integer as that is 20\n times faster than doing x**y where y is a double precision variable with an\n integer value. This is the only case for an integer argument because in this\n specific case the performance difference is really significant.\n\n4.
I have also changed nwad_print_int_opx to print double precision numbers that\n hold integer values as integers rather than floating point numbers in the\n expectation that that will filter through the code generated by Maxima.\n\nOverall performance improvement was only 10% though.", "idx": 739} {"target": 1, "func": "[PATCH] Adding manually inlined variants of Sweep functions since\n they were observed to be more efficient in practice", "idx": 552} {"target": 1, "func": "[PATCH] Improve rendezvous performance when hardware is\n oversubscribed", "idx": 324} {"target": 1, "func": "[PATCH] Creating a boolean operations object from map overlay. It\n gives the user the possibility to create the boolean operation object with\n the walk along a line point-location, which is more efficient when using\n sweep-line.", "idx": 60} {"target": 1, "func": "[PATCH] Removing gigaflop utility functions, removing illegal 'const'\n declarations in unblocked TwoSidedTrmm routines, fixing comments in several\n DistMatrix declarations, and adding a Trsv test driver in preparation for a\n significant performance improvement from an upcoming patch.", "idx": 1058} {"target": 1, "func": "[PATCH] Drastically improve performance of getElementByMass\n\nThe old approach iterated through the entire periodic table by atomic number and\nsubtracted the provided mass by the element's mass and kept track of the\nsmallest difference. The new approach steps through the elements in order of\natomic number and bails once it hits an element with a higher mass than the\ntarget mass (assuming masses are monotonically increasing).\n\nOn my desktop, processing 4TVP-dmj_wat-ion.psf dropped from 297 s to 15.4 s.
But\n15.4 s is still a bit too long...", "idx": 228} {"target": 1, "func": "[PATCH] Use templates to improve performance when not using triclinic\n boxes", "idx": 1150} {"target": 1, "func": "[PATCH] NumericVector::add_vector refactoring\n\nSimilar to #411 and #413\n\nThis was originally intended to be just another additional T* API plus\na refactoring; however, the new PetscVector::add_vector(DenseVector)\ncode path should be a performance improvement as well.", "idx": 302} {"target": 1, "func": "[PATCH] Added allow_rules_with_negative_weights flag to QBase. \n Default is true (which was the standard behavior) but you can set this to\n false to use more expensive (but potentially safer) quadrature rules instead.\n\nReplaced the 15-point tet Gauss quadrature rule with a 14-point rule\nby Walkington of equivalent order.\n\nAdded Dunavant quadrature rules for triangles up to THIRTEENTH order. These\nare more efficient than the conical product rules they are replacing. Up to\nTWENTIETH order still to come.\n\nReplaced SECOND-order rule for triangles with a rule having interior integration\npoints. The previous rule had points on the boundary of the reference element.\n\n\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@2889 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 720} {"target": 1, "func": "[PATCH] Comment out the do_intersect tests\n\nThey caused a performance problem when used with the tweaked AABB_traits\nof Surface_mesh_segmentation.", "idx": 1527} {"target": 1, "func": "[PATCH] Conditional tweak in the nonbonded GPU kernels\n\nGPU compilers miss an easy optimization of a loop invariant in the\ninner-loop conditional.
Precomputing part of the conditional together\nwith using bitwise instead of logical and/or improves performance with\nmost compilers by up to 5%.\n\nChange-Id: I3ba0b9025b11af3d8465e0d26ca69a78e32a0ece", "idx": 1074} {"target": 1, "func": "[PATCH] issue #939: more efficient kronecker product for\n Sparsity/Matrix", "idx": 1046} {"target": 1, "func": "[PATCH] GetPot: Use a more efficient container for UFO detection\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@3880 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 998} {"target": 1, "func": "[PATCH] Initially assign spans of empty superblocks to block sizes\n and enable stealing of empty superblocks among block sizes.\n\nExpand block size superblock hint array to \"N\" values per block size\nto provide space for TBD superblock search optimizations.\n\nConstruct memory pool with min block, max block, and superblock size\nand introduce performance optimizations related to max vs. min block size.\n\nIssues #487, #320, #738, #215", "idx": 327} {"target": 1, "func": "[PATCH] Add framework for extensible ArrayFire memory managers\n (#2461)\n\nMany different use cases require performance across many different memory\nallocation patterns. Even different devices/backends have different costs\nassociated with memory allocations/manipulations. Having the flexibility to\nimplement different memory management schemes can help optimize performance for\nthe use case and backend.\n\nThis commit adds the ability to replace the default memory manager to a\nspecialized user defined memory manager. This commit also exposes the events API\nto the user which allows you to synchronize tasks between two streams. The\nevents API will be disabled in a future commit but it can be used in the future\nonce we add support for streams.\n\nArrayFire will use the user defined memory manager whenever it allocates or\nfrees memory. The memory manager is exposed using the C API using function\npointers. 
The memory manager handle is created by the user during initialization\nand the user sets several function pointers to define the behavior.\n\nThe default memory manager behavior has not changed with this commit.", "idx": 1329} {"target": 1, "func": "[PATCH] take out assert to avoid performance issue", "idx": 1336} {"target": 1, "func": "[PATCH] improved copy_ccb method by replacing the used insert methods\n with more efficient versions of those methods.", "idx": 307} {"target": 1, "func": "[PATCH] performance improved", "idx": 1431} {"target": 1, "func": "[PATCH] same as commit 76d5bddd5c3dfdef76beaab8222231624eb75e89:\n Split ga_acc in smaller ga_acc on MPI-PR since it gives large performance\n improvement on NERSC Cori", "idx": 652} {"target": 1, "func": "[PATCH] performance improvements that disable erf2 and ssf more\n changes needed", "idx": 23} {"target": 1, "func": "[PATCH] more efficient way to extract submatrices, remove some\n unnecessary members, better naming of functions and variables", "idx": 1489} {"target": 1, "func": "[PATCH] Improving the performance, 0-10% slower than Triangulation_2,\n not removing the initial vertices", "idx": 389} {"target": 1, "func": "[PATCH] More efficient Sign_at", "idx": 970} {"target": 1, "func": "[PATCH] Specialized the new estimate_errors() version for more\n efficient use in UniformRefinementEstimator", "idx": 1435} {"target": 1, "func": "[PATCH] Preserve an old partitioning in copy_nodes_and_elements -\n should be more efficient and more reliable.", "idx": 1288} {"target": 1, "func": "[PATCH] Enable SIMD register calling convention with gmx_simdcall\n\nCmake now checks if the compiler supports __vectorcall or\n__regcall calling convention modifiers, and sets gmx_simdcall\nto one of these if supported, otherwise a blank string.\nThis should enable 32-bit MSVC to accept our SIMD routines\n(starting from MSVC 2013), and with ICC it can at least in\ntheory improve performance slightly by using more registers\nfor argument
passing in 64-bit mode too. Presently this is\nonly useful on x86, but the infrastructure will work if we\nfind similar calling conventions on other architectures.\n\nFixes #1541.\n\nChange-Id: I7026fb4e1fb6b88c8aa18b060a631cbb80231cd4", "idx": 1543} {"target": 1, "func": "[PATCH] fix obvious performance bug in Isolate_1", "idx": 1190} {"target": 1, "func": "[PATCH] fixing performance of nwpw_gauss_weights...EJB", "idx": 50} {"target": 1, "func": "[PATCH] made the domain decomposition a bit more efficient and cleaned\n up the DD code", "idx": 798} {"target": 1, "func": "[PATCH] Performance improvement differentiating between sparse and\n dense arrays when computing sparse results (#1605)", "idx": 1315} {"target": 1, "func": "[PATCH] In OpenMP threading, preallocate the thread buffer instead of\n allocating the buffer every time. This patch improved the performance\n slightly.", "idx": 100} {"target": 1, "func": "[PATCH] mesh::data: Use a DynamicList for the performance data for\n efficiency", "idx": 35} {"target": 1, "func": "[PATCH] Added gpu implementation for second convergence test in\n bicgstab. Improved performance by moving x += alpha * y from step_3 to step_2\n in all implementations.", "idx": 256} {"target": 1, "func": "[PATCH] HvD: Adjusted these scripts removing the use of\n \"svnversion\" and replacing it with \"svn info | grep Revision:\". Svnversion\n works out exactly what revisions are contributing to your source code. While\n it is accurate it also takes quite some time to work this out. On systems with\n slow disk access it can take 15 minutes or so. Svn info by contrast only\n checks the revision of the current directory. Hence it is much faster but not\n so accurate. However, for the source code distributions we generate we take a\n clean copy of the repository anyway and then the svn info result must match\n svnversion.
So this should be good enough and much faster.", "idx": 148} {"target": 1, "func": "[PATCH] Improve performance of SSAMGRelax", "idx": 1448} {"target": 1, "func": "[PATCH] Replace the comparison of 2 lines with a more efficient one\n (just for this specific case where direction doesn't matter).", "idx": 1069} {"target": 1, "func": "[PATCH] Fixes for style and performance issues found by cppcheck\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@5614 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 1019} {"target": 1, "func": "[PATCH] Improve object_type detection performance (#2792)\n\nThis improves the object APIs performance for detecting types by\nswitching from listing all items in the URI to checking only for the\nexistence of the group indicator, array schema file or array schema\nfolder. We also switch the order to check for the array schema folder\nfirst, since it is most likely to exist based on the assumption that\nthere are more arrays than there are groups.", "idx": 497} {"target": 1, "func": "[PATCH] Rewriting of Attribute_elevation (shorter and more efficient\n code)", "idx": 518} {"target": 1, "func": "[PATCH] Use stack allocation in zgemv and zger\n\nFor better performance with small matrices\nRef #727", "idx": 805} {"target": 1, "func": "[PATCH] PERF improvements and bugfixes for select and replace\n\n- Performance improvements to CUDA backend\n- Bugs fixed in OpenCL backend", "idx": 580} {"target": 1, "func": "[PATCH] Use [..] instead of at(..) in LINCS GPU data management code\n\nThe .at(..) was used to make sure that the indices are within bounds\nwhile the code is not thoroughly tested. Now it can be replaced with\ndirect access [..]
for performance reasons.", "idx": 163} {"target": 1, "func": "[PATCH] * modified termination policies * fast SVDBatch\n implementation", "idx": 412} {"target": 1, "func": "[PATCH] better performance with hanging nodes", "idx": 328} {"target": 1, "func": "[PATCH] Removed unnecessary synchronization that hurt performance on\n Nvidia", "idx": 144} {"target": 1, "func": "[PATCH] Correct CUDA kernel energy flag\n\nThe CUDA kernels calculated energies based on the GMX_FORCE_VIRIAL\nflag. This did not cause errors, since (currently) GMX_FORCE_ENERGY\nis always set when the virial flag is set. But using the latter flag\ngives a small performance improvement when using pressure coupling.\n\nChange-Id: If874e651058dc06c464f0fa810b17ba83146c9a3", "idx": 1501} {"target": 1, "func": "[PATCH] fix performance issue--forgot to move name parameters", "idx": 987} {"target": 1, "func": "[PATCH] More efficient IntegratorInternal::getDerivative #936", "idx": 406} {"target": 1, "func": "[PATCH] updated traits class by replacing Side_of_hyperbolic_triangle\n with Side_of_oriented_hyperbolic_segment; added function\n side_of_hyperbolic_triangle to class Periodic_4_hyperbolic_triangulation;\n changed locate() function to more efficient version", "idx": 318} {"target": 1, "func": "[PATCH] s390x/Z14: Change register blocking for SGEMM to 16x4\n\nChange register blocking for SGEMM (and STRMM) on z14 from 8x4 to 16x4\nby adjusting SGEMM_DEFAULT_UNROLL_M and choosing the appropriate copy\nimplementations. Actually make KERNEL.Z14 more flexible, so that the\nchange in param.h suffices. 
As a result, performance for SGEMM improves\nby around 30% on z15.\n\nOn z14, FP SIMD instructions can operate on float-sized scalars in\nvector registers, while z13 could do that for double-sized scalars only.\nThus, we can double the amount of elements of C that are held in\nregisters in an SGEMM kernel.\n\nSigned-off-by: Marius Hillenbrand ", "idx": 1154} {"target": 1, "func": "[PATCH] Simplify gmx chi\n\nThis change is pure refactoring that prepares for performance\nimprovements to ResidueType handling that will benefit both grompp and\npdb2gmx.\n\nUse vector and ArrayRef to replace C-style memory handling. Some\nhistogram vectors were being over-allocated by 1, which is no longer\nsafe to do now that the size of the vector is relevant when looping,\nso those are reduced.\n\nEliminated and reduce scope of iteration variables. Removed an unused\nfunction and some debug code in comments. Used const references rather\nthan pointers where possible. Used range-based for and algorithms in\nsome places that are now possible to do so.", "idx": 167} {"target": 1, "func": "[PATCH] #914 Replaced casadi_copy_sparse runtime function with\n casadi_project With work vector and more cache efficient", "idx": 725} {"target": 1, "func": "[PATCH] Performance bug fix.", "idx": 358} {"target": 1, "func": "[PATCH] rewrite of T double kernels to improve performance with Intel\n 16", "idx": 361} {"target": 1, "func": "[PATCH] Minor code reordering in GPU kernels\n\nUpdating bCalcFshift just before use instead at the top of the kernel\nimproves performance by 1-2% on CUDA. 
This also improves readability.\nMaking specialized (no)shift kernels will only add 1% gain.\nAlso updated the OpenCL kernels for consistency and readability\n(the performance impact is negligible with current hardware/compiler).\n\nChange-Id: I309f90ad61e5815726d55254e2cd38d5e4e7662d", "idx": 424} {"target": 1, "func": "[PATCH] Try to improve parallel performance of CBMC", "idx": 1205} {"target": 1, "func": "[PATCH] GPU+DD performance improvements and code clean-up", "idx": 1335} {"target": 1, "func": "[PATCH] Optimize cf_vs parallel performance", "idx": 1350} {"target": 1, "func": "[PATCH] Made some modifications that hopefully improve the\n performance of the non-blocking code. Test functionality is still sketchy.", "idx": 1463} {"target": 1, "func": "[PATCH] avoid casts to the wrong derived class, which upsets code\n analysis tools. seems to improve performance, too.", "idx": 451} {"target": 1, "func": "[PATCH] Added allow_rules_with_negative_weights flag to QBase. \n Default is true (which was the standard behavior) but you can set this to\n false to use more expensive (but potentially safer) quadrature rules instead.\n\nReplaced the 15-point tet Gauss quadrature rule with a 14-point rule\nby Walkington of equivalent order.\n\nAdded Dunavant quadrature rules for triangles up to THIRTEENTH order. These\nare more efficient than the conical product rules they are replacing. Up to\nTWENTIETH order still to come.\n\nReplaced SECOND-order rule for triangles with a rule having interior integration\npoints.
The previous rule had points on the boundary of the reference element.", "idx": 421} {"target": 1, "func": "[PATCH] JN: memcpy on T3D is now fast enough", "idx": 826} {"target": 1, "func": "[PATCH] Performance improvements for find_*_neighbors, and a new\n find_point_neighbors version for finding neighbors at just one point\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@4557 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 79} {"target": 1, "func": "[PATCH] made PME load balancing + DD DLB more efficient\n\nThe DD dynamic load balancing is now limited, such that the fastest\ntimed PME load balancing cut-off setting can always be used.\nFixes #1089\n\nChange-Id: I3216dfd5a8b2b0676eee5519e08cf36e06047251", "idx": 555} {"target": 1, "func": "[PATCH] Refactor CalculateTopRecommendations(), including a complete\n overhaul of how recommendations are actually calculated. std::pair<> and\n std::map<> are often quite slow, especially in that implementation. This is\n faster.", "idx": 696} {"target": 1, "func": "[PATCH] Use efficient intersection traits, Used kernel as template\n parameters", "idx": 1418} {"target": 1, "func": "[PATCH] more efficient treatment of geometries in qmmm", "idx": 259} {"target": 1, "func": "[PATCH] Extending LLL to support linearly dependent bases (and\n returning the nullity). A small performance optimization was also made by\n removing lll::ExpandQR from lll::HouseholderStep", "idx": 825} {"target": 1, "func": "[PATCH] New DCEL structure that allows more efficient handling of\n holes and isolated vertices. (Ron) Automatic handling of holes to remove in\n the construction visitor. (Baruch)", "idx": 846} {"target": 1, "func": "[PATCH] Improve the performance of read_data of gzip'ed files using\n taskset. 
Normally, the gzip process would be pinned to the same core as the\n MPI rank 0 process, which makes the pipe stay in one core's cache, but forces\n the two process to fight for that core, slowing things down.", "idx": 815} {"target": 1, "func": "[PATCH] Revert \"Minor changes to use libmesh indentation,\n initialization styles.\"\n\nThis reverts commit ddf20ff42df730dfb38bf0d618b7d1d9264948b7.\n\nRevert \"Write element truth table to exodus to improve performance\"\n\nThis reverts commit 3c400489f41dbd72fe704c3dfe94db6e9bcd200e.", "idx": 1090} {"target": 1, "func": "[PATCH] Preserve an old partitioning in copy_nodes_and_elements -\n should be more efficient and more reliable.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@3388 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 1388} {"target": 1, "func": "[PATCH] Fix performance for resize, and deep_copy", "idx": 596} {"target": 1, "func": "[PATCH] Performance drop fix - Added a QTime to reduce the number of\n calls to deform().", "idx": 797} {"target": 1, "func": "[PATCH] Fix excessive list splitting\n\nDue to a possible integer overflow, the pair list splitting code could\nend up over-splitting pair lists and causing large performance\ndegradation. Due to the larger processor count, runs using AMD GPUs,\nusing 100k+ simulation systems are more prone to suffer from the issue.\n\nFixes #1904\n\nChange-Id: I29139ec80aa75c78fa93de0858f7c60cdae88d5b", "idx": 713} {"target": 1, "func": "[PATCH] Fix the integer overflow issue for large matrix size\n\nFor large matrix, e.g. M=N=K, and M>1290, int mnk=M*N*K will overflow.\nThis will lead to wrong branching to single-threading. 
The performance\nis downgraded significantly.\n\nSigned-off-by: Wang, Long ", "idx": 683} {"target": 1, "func": "[PATCH] Fixed a performance regression on AMD GPUs", "idx": 694} {"target": 1, "func": "[PATCH] Fixing bug that arose in parallel performance example ex1p", "idx": 677} {"target": 1, "func": "[PATCH] made routine more efficient", "idx": 1007} {"target": 1, "func": "[PATCH] Working on performance improvements", "idx": 903} {"target": 1, "func": "[PATCH] Fix for #1139 performance regression bug (and #1140 for\n tracking). Set default CUDA launch bounds to <0,0> and when do not use CUDA\n __launch_bounds__ unless CUDA launch bounds are explicitly specified.", "idx": 1041} {"target": 1, "func": "[PATCH] Cache n_nodes/sides/edges in DofMap constraints\n\nWe loop over each of these ranges multiple times, so doing the loop\nend manually should more efficient than even the new range idiom.", "idx": 967} {"target": 1, "func": "[PATCH] new macos accelerate step", "idx": 712} {"target": 1, "func": "[PATCH] changed DofMap::build_constraint_matrix to be more efficient\n in the (usual) case that the element has no constraints. Also fixed for the\n case that an element has constraints in terms of its *own* dofs, (not others)", "idx": 1347} {"target": 1, "func": "[PATCH] Use efficient intersection traits, Used the kernel\n appropriately, Added a constructor", "idx": 1311} {"target": 1, "func": "[PATCH] Sparse refactored readers: disable filtered buffer tile\n cache.\n\nFrom tests, it's been found that writing the cache for the filter\npipeline takes a significant amount of time for the tile unfiltering\noperation. For example, 2.25 seconds with and 1.88 seconds without in\nsome cases. 
The cache improved performance before multi-range subarrays\nwere implemented, so dropping it is fine at least for the refactored\nreaders.", "idx": 48} {"target": 1, "func": "[PATCH] Improved performance of bicgstab after adding the second\n convergence test.", "idx": 1035} {"target": 1, "func": "[PATCH] POWER10: Improving dasum performance\n\nUnrolling a loop in dasum micro code to help in improving\nPOWER10 performance.", "idx": 1497} {"target": 1, "func": "[PATCH] Extending QP and NNLS to handle multiple right-hand sides in\n an efficient way and adding a Non-negative Matrix Factorization (NMF).", "idx": 705} {"target": 1, "func": "[PATCH] - Handle_for memory leak fixed : initialize_with() now\n assigns instead of constructing, so that it works correctly after\n Handle_for has been default constructed. There's a new way of constructing\n a Handle_for : Handle_for(TO_BE_USED_ONLY_WITH_CONSTRUCT_WITH) followed by \n construct_with(), which is supposed to produce more efficient code. \n Simple_handle_for also accepts it.", "idx": 341} {"target": 1, "func": "[PATCH] Improving the performance of the HessenbergSchur QR sweeps", "idx": 632} {"target": 1, "func": "[PATCH] Use a map for pushed_ids\n\nWe won't be handing this to parallel_sync but we want it to be more\nefficient in the large processor count case anyway.", "idx": 444} {"target": 1, "func": "[PATCH] More efficient TNG selection group creation\n\nDo not create a TNG selection group if no selection is specified\nexplicitly, or if the selection contains all atoms in the system.\n\nChange-Id: Ibe2a14e55aff829fdb74de074447f00f0e85f090", "idx": 292} {"target": 1, "func": "[PATCH] Add Pelleg-Moore type prune. This improves performance -- at\n least a bit.", "idx": 129} {"target": 1, "func": "[PATCH] changed dim_type typedef to int from long long\n\nUsing long long for indexing data type is causing drastic\nperformance drop for CUDA and OpenCL kernels. 
Hence, changing the\ntypedef to point to int.", "idx": 69} {"target": 1, "func": "[PATCH] Global optimizers: better parallel performance\n\n- We used to have a thread-local variable for cell::TDS_data to make\n incident_cells concurrently callable but it was slow and memory-consuming\n => new incident_cells function which do not use cell::TDS_data\n => faster and lighter\n- update_restricted_delaunay now uses parallel_for instead of parallel_do\n (it was quite slow with the implicit oracle)\n => faster (but requires to fill a temporary vector)", "idx": 1002} {"target": 1, "func": "[PATCH] improved performance in some of the\n IsoparametricTransformation::*RevDiff methods by stack allocating dFdx_bar\n inside. Added additional FiniteElement::*RevDiff methods for differentiating\n various methods. Added EvalRevDiff methods to additional coefficient classes,\n and added tests for the coefficient differentiation as well as the finite\n element differentiation", "idx": 1171} {"target": 1, "func": "[PATCH] CUDA PME kernels with analytical Ewald correction\n\nThe analytical Ewald kernels have been used in the CPU SIMD kernels, but\ndue to CUDA compiler issues it has been difficult to determine in which\ncases does this provide a performance advantage compared to the\ntabulated kernels.Although the nvcc optimizations are rather unreliable,\non Kepler (SM 3.x) the analytical Ewald kernels are up to 5% faster, but\non Fermi (SM 2.x) 7% slower than the tabulated. 
Hence, this commit\nenables the analytical kernels as default for Kepler GPUs, but keeps the\ntabulated kernels as default on Fermi.\n\nNote that the analytical Ewald correction is not implemented in the\nlegacy kernels as these are anyway only used on Fermi.\n\nAdditional minor change is the back-port of some variable (re)naming and\nsimple optimizations from the default to the legacy CUDA kernels which\ngive 2-3% performance improvement and better code readability.\n\nChange-Id: Idd4659ef3805609356fe8865dc57fd19b0b614fe", "idx": 821} {"target": 1, "func": "[PATCH] Add a new example called `custom-logger`.\n\nThe purpose of this example is to show the users how to customize Ginkgo by\nadding a new logger, which is useful and more efficient for application specific\nproblems. This is also one of the most basic (and simple) ways of customizing\nGinkgo, therefore this is a good entry level example.\n\nThis example simply prints a table of the recurrent residual norm against the\nreal residual norm.\n\nThis example is documented as much as well as the `simple-solver` example for\nthe user's convenience.", "idx": 1558} {"target": 1, "func": "[PATCH] Improved CUDA non-bonded kernel performance\n\nSome old tweak which was supposed to improve performance had in fact\nthe opposite effect. Removing this tweak and with it eliminating\nshared memory bank conflicts it caused improved performance by up\nto 2.5% in the force-only CUDA kernel.\n\nChange-Id: I7fcb24defed2c68627457522c39805afc83b3276", "idx": 479} {"target": 1, "func": "[PATCH] Improve performance of some mathematical functions", "idx": 1106} {"target": 1, "func": "[PATCH] Refs JuliaLang/julia#5728. Fix gemv performance bug on\n Haswell Mac OSX.\n\nOn Mac OS X, it should use .align 4 (equal to .align 16 on Linux).\nI didn't get the performance benefit from .align. 
Thus, I deleted it.", "idx": 356} {"target": 1, "func": "[PATCH] Cache FEMContext::point_value() FE Objects\n\nThis really ought to be redone entirely, but hopefully we can get a\nlittle performance improvement right away by just avoiding de and\nreallocations.", "idx": 128} {"target": 1, "func": "[PATCH] THUNDERX2T99: Performance fix for ZGEMM", "idx": 569} {"target": 1, "func": "[PATCH] Adjust serialized query buffer sizes (#2115) (#2117)\n\nAdjust serialized query buffer sizes\n\nThis change the client/server flow to always send the server the\noriginal user requested buffer sizes. This solves a bug in which with\nserialized queries incompletes would cause the \"server\" to use smaller\nbuffers for each iteration of the incomplete query. This yield\ndecreasing performance as the buffers approached zero. The fix here lets\nthe server always get the original user's buffer size.", "idx": 776} {"target": 1, "func": "[PATCH] Fix mis-branching in CouplingMatrix iterator++\n\nThis was causing a *huge* performance penalty in cases where we had\nmany variables.\n\nHopefully https://github.com/idaholab/moose/issues/9480 will be fully\nresolved by the fix. 
I see orders of magnitude speedup in our test\ncase.", "idx": 312} {"target": 1, "func": "[PATCH] Removed FillSubcellsForNode, due to more efficient\n implementation where I add contributions for all nodes of a certain subcell.", "idx": 679} {"target": 1, "func": "[PATCH] Small Matrix: skylakex: sgemm nn: add n6 to improve\n performance", "idx": 1212} {"target": 1, "func": "[PATCH] More efficient CEED matrix assembly", "idx": 589} {"target": 1, "func": "[PATCH] More efficient set reassignment", "idx": 1185} {"target": 1, "func": "[PATCH] BUGFIX Fixed memory leak in image io, performance\n improvements", "idx": 1380} {"target": 1, "func": "[PATCH] Performance bug fix in single node case.", "idx": 752} {"target": 1, "func": "[PATCH] Move fast LUT in CUDA backend to texture memory\n\ncuda::kernel::locate_features is the CUDA kernel that uses the fast\nlookup table. Shared below is performance of the kernel using constant\nmemory vs texture memory. There is neglible to no difference between two\nversions. 
Hence, shifted to texture memory LUT to reduce global constant\nmemory usage.\n\nPerformance using constant memory LUT\n-------------------------------------\n\nTime(%) Time Calls Avg Min Max Name\n1.48% 101.09us 3 33.696us 32.385us 34.976us void cuda::kernel::locate_features\n1.34% 91.713us 2 45.856us 45.792us 45.921us void cuda::kernel::locate_features\n1.02% 69.505us 2 34.752us 34.400us 35.105us void cuda::kernel::locate_features\n0.99% 67.456us 2 33.728us 32.768us 34.688us void cuda::kernel::locate_features\n0.95% 65.186us 2 32.593us 31.201us 33.985us void cuda::kernel::locate_features\n0.93% 63.874us 2 31.937us 30.817us 33.057us void cuda::kernel::locate_features\n\nPerformance using texture LUT\n-----------------------------\n\nTime(%) Time Calls Avg Min Max Name\n1.45% 99.776us 3 33.258us 32.896us 33.504us void cuda::kernel::locate_features\n1.33% 91.105us 2 45.552us 44.961us 46.144us void cuda::kernel::locate_features\n1.02% 70.017us 2 35.008us 34.273us 35.744us void cuda::kernel::locate_features\n0.97% 66.689us 2 33.344us 32.065us 34.624us void cuda::kernel::locate_features\n0.95% 65.249us 2 32.624us 31.585us 33.664us void cuda::kernel::locate_features\n0.95% 65.025us 2 32.512us 30.945us 34.080us void cuda::kernel::locate_features", "idx": 1263} {"target": 1, "func": "[PATCH] This commit introduces VariableGroups as an optimization when\n there are repeated variables of the same type inside a system. Presently,\n these are only activated through the system.add_variables() API, but in the\n future there may be provisions for automatically identifying groups.\n\nThe memory usage for DofObjects now scales like\nN_sys+N_var_group_per_sys instead of N_sys+N_vars. 
The DofMap\ndistribution code has been refactored to use VariableGroups.\n\nAll existing loops over Variables within a system will work unchanged,\nbut can be replaced with more efficient loops over VariableGroups.\n\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@6521 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 697} {"target": 1, "func": "[PATCH] Use the Side<> class as a proxy when building element sides. \n This eliminates the need for allocating and deallocating connectivity storage\n arrays when building sides, thus making the Elem::build_side() member more\n efficient. Note that this has not been implemented yet in the case of\n infinite elements, however it would be easy to add. Assuming there are many\n more interior elements than infinite elements there is also probably little\n performance impact.", "idx": 1169} {"target": 1, "func": "[PATCH] added fast global atom nr. to molecule lookup\n\nMost atom search functions in mtop_util now use binary search.", "idx": 955} {"target": 1, "func": "[PATCH] More efficient IntegratorInternal::getDerivative for\n SXFunction #936", "idx": 838} {"target": 1, "func": "[PATCH] Simplify AVX integer load/store\n\nAlso has the potential to improve performance on some architectures\n(if there is a domain crossing penalty - not sure whether any AVX capable\ngeneration has a penalty).\n\nChange-Id: Icc7b136571fc9ad1dbabeabe446c93e8816ec678", "idx": 980} {"target": 1, "func": "[PATCH] Implement a suggestion from cppcheck.\n\nUsing vector::empty() to check for emptiness, instead of vector::size()\n== 0 may be better for performance, since it's guaranteed, to be 0(1).\nRelated to #30.", "idx": 471} {"target": 1, "func": "[PATCH] Fixed FAST C++ API, added proper destructor calls.", "idx": 382} {"target": 1, "func": "[PATCH] Use a priority queue (heap) to store the list of candidates\n while searching. This makes the code more efficient, especially when k is\n greater. 
For example, for knn, given a list of k candidate neighbors, we\n need to do 2 fast operations: - know the furthest of them. - insert a new\n candidate. This is the appropriate situation for using a heap.", "idx": 905} {"target": 1, "func": "[PATCH] Fixed up some bugs in the host to device accumulate for same\n process communication in nb_accv and added operations to support improved\n performance of scatter operation within same SMP node.", "idx": 149} {"target": 1, "func": "[PATCH] Added parallel LCG, a way to avoid an MPI performance bug on\n BG/P, a function for creating random Hermitian matrices, and modified LAPACK\n wrappers to throw an error if the 'info' parameter is nonzero, even in\n RELEASE mode.", "idx": 1544} {"target": 1, "func": "[PATCH] More efficient implementation of atoms class.", "idx": 621} {"target": 1, "func": "[PATCH] Moving quantities around in ExactSolution::_compute_error\n\nThis is in preparation for updating to support mixed-dimension\nmeshes. Although this is a bit more inefficient, the code is simpler\nand I think that since this is usually used for debugging/regression\npurposes, simpler code would be preferred over higher performance\ncode.", "idx": 1455} {"target": 1, "func": "[PATCH] Add size optimization to HashedMap\n\nThe table size in HashedMap is now optimized when calling clear()\nusing the old number of keys. 
Also the number of keys is now set\nto a power of 2, so we can use bit masking instead of modulo.\nThe bit masking allows for negative keys, which is also tested.\n\nThis is preparation for replacing gmx_hash_t with HashedMap,\nbut also improves performance for gmx_ga2la_t.\n\nChange-Id: I90c5a602cb7e213eb6d2e8259a0effc4fd7c4e14", "idx": 453} {"target": 1, "func": "[PATCH] minor changes for performance improvement and documentation", "idx": 4} {"target": 1, "func": "[PATCH] Adding a block diagonal preconditioner in the serial example\n to improve solver performance", "idx": 58} {"target": 1, "func": "[PATCH] added a switch -Ssw which can lead to better performance\n during AMG setup", "idx": 1333} {"target": 1, "func": "[PATCH] Use processor_id_type where appropriate\n\nI was accidentally using dof_id_type instead, and since that will\nalways be equal or larger this mistake shouldn't have led to bugs, so\nI won't bother turning this commit into a half dozen different fixup\ncommits to rebase.\n\nHowever, using the correct type should be infintesimally more\nefficient and significantly less confusing.", "idx": 193} {"target": 1, "func": "[PATCH] Add non-virtual Elem::vertex_average() and update\n Elem::centroid()\n\nWhen the Elem has an elevated p_level, this will affect the Order of\nthe FE that gets reinit()'d for computing the centroid. Rather than do\nany non-const hacking of the p_level value, we just work around this\nissue by making a copy of the Elem with non-elevated p_level and\nreturn its centroid instead.\n\nThis obviously introduces an additional performance hit, but truly\noptimized code should not be calling the base class Elem::centroid()\nimplementation to begin with. 
This approach also has the benefit of\nnot requiring const_cast and avoiding potential thread safety issues.", "idx": 1385} {"target": 1, "func": "[PATCH] QUDA: Overall of BLAS, better tuning, bugs fixed, fixed\n performance regressions, support for 32 way reductions, removed bank\n conflicts from complex and triple reductions\n\ngit-svn-id: http://lattice.bu.edu/qcdalg/cuda/quda@1121 be54200a-260c-0410-bdd7-ce6af2a381ab", "idx": 138} {"target": 1, "func": "[PATCH] NVIDIA Volta performance tweaks\n\nRemoved ballot syncs and replaced all computed masks with full warp\nmask (as all branches in question are warp-synchronous).\nThis improves performance by 7-12%.\n\nChange-Id: I769d6d8f0d171eb528d30868d567624d5e246dbf", "idx": 1134} {"target": 1, "func": "[PATCH] * Refactored stats * Fixed performance of multi-range\n subarray result estimation * Fixed bug in multi-range result estimation", "idx": 672} {"target": 1, "func": "[PATCH] New edge_map(), not inverse_map(), in edge reinit\n\nThis should be much more efficient, and on top of that it seems to fix\nthe edge projection problems I was seeing on the Reactor Pressure Vessel\nIGA mesh.", "idx": 1222} {"target": 1, "func": "[PATCH] use CUDA texture objects when supported\n\nCUDA texture objects are more efficient than texture references, their\nuse reduces the kernel launch overhead by up to 20%. The kernel\nperformance is not affected.\n\nChange-Id: Ifa7c148eb2eea8e33ed0b2f1d8ef092d59ba768e", "idx": 1374} {"target": 1, "func": "[PATCH] Cuda: Enabling SHFL based reduction for static\n value_type>128bit\n\nThis improves performance significantly when reducing structs, since\nthe shared memory footprint is massively reduced. 
On smaller reduction\ntypes it is still beneficial to go through SHMEM.", "idx": 1505} {"target": 1, "func": "[PATCH] fast pool allocator\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@4485 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 642} {"target": 1, "func": "[PATCH] reactingMultiphaseEulerFoam: Added referencePhase option\n\nIn multiphase systems it is only necessary to solve for all but one of the\nmoving phases. The new referencePhase option allows the user to specify which\nof the moving phases should not be solved, e.g. in constant/phaseProperties of the\ntutorials/multiphase/reactingMultiphaseEulerFoam/RAS/fluidisedBed tutorial case with\n\nphases (particles air);\n\nreferencePhase air;\n\nthe particles phase is solved for and the air phase fraction and fluxes obtained\nfrom the particles phase which provides equivalent behaviour to\nreactingTwoPhaseEulerFoam and is more efficient than solving for both phases.", "idx": 957} {"target": 1, "func": "[PATCH] performance optimizations for sgemv_n", "idx": 508} {"target": 1, "func": "[PATCH] Fixes for style and performance issues found by cppcheck", "idx": 941} {"target": 1, "func": "[PATCH] - Be more storage efficient, and general clean up.", "idx": 159} {"target": 1, "func": "[PATCH] type 2 instability fixed; type 2.5 still not fixed but Rick\n disabled texas in this case; significant performance optimizations for '95\n esp. gradients", "idx": 284} {"target": 1, "func": "[PATCH] Use block size stepping of 16 in new dslash kernels: improves\n performance of exterior-x kernels and 5 dimensional stencils. Also, set max\n shared bytes to full dynamic limit, since this gives a small improvement on\n Volta", "idx": 134} {"target": 1, "func": "[PATCH] Use smaller block size on smaller system I/O - this seems to\n fix a performance problem found by Jens Lohne Eftang", "idx": 937} {"target": 1, "func": "[PATCH] more efficient usage of mutex. The lock is only done if the\n build needs to be done. 
We have an extra \"if (m_need_build)\" but otherwise we\n would need to use mutex::try_lock() which results in more code and as\n efficient.", "idx": 74} {"target": 1, "func": "[PATCH] Specialized dotInterpolate for the efficient calculation of\n flux fields\n\ne.g. (fvc::interpolate(HbyA) & mesh.Sf()) -> fvc::flux(HbyA)\n\nThis removes the need to create an intermediate face-vector field when\ncomputing fluxes which is more efficient, reduces the peak storage and\nimproved cache coherency in addition to providing a simpler and cleaner\nAPI.", "idx": 1402} {"target": 1, "func": "[PATCH] ReplicateMesh should only read once\n\nReading once and broadcasting should be more efficient than hammering\nthe filesystem on every processor.", "idx": 1252} {"target": 1, "func": "[PATCH] decreased hl_tol (level shift threshold) from 0.05 to 0.01\n introduce 2 new keywords: stable (same as default) fast (faster than default\n but less safe)", "idx": 1227} {"target": 1, "func": "[PATCH] use fast insertion functions of arr", "idx": 31} {"target": 1, "func": "[PATCH] Improved performance of building neighbor list on AMD GPUs", "idx": 407} {"target": 1, "func": "[PATCH] Some debug instrumentation and big performance improvements.", "idx": 1254} {"target": 1, "func": "[PATCH] some improvements in performance on it2 (previous name of\n routine was hferi.f)", "idx": 151} {"target": 1, "func": "[PATCH] Use analysis nbsearch in insert-molecules\n\nAdvantages:\n - This reduces the amount of code by ~90% compared to what addconf.c\n has, making it significantly easier to understand.\n - Now the tool is independent of potential changes in the\n mdrun-specific neighborhood search.\n - Memory leaks related to addconf.c are gone.\n - The neighborhood search is terminated as soon as one pair within the\n cutoff is found, potentially making it faster. 
This likely offsets\n any performance differences between the nbsearch implementations.\n The unit tests are ~35% faster.\n - Confusing mdrun-specific output related to the neighborhood\n searching is gone. This includes notes that \"This file uses the\n deprecated 'group' cutoff_scheme\" and references to Coulomb or VdW\n tables and cutoffs.\n\nChange-Id: Iba82858b9a2b43b6e10a49cd3964b99b22996166", "idx": 625} {"target": 1, "func": "[PATCH] Refactor parallel_algebra.h\n\nWe can simplify our derived types slightly by noting that the\nLIBMESH_DIM entries are homogenous.\n\nWe should wrap MPI calls in the new error checking macro.\n\nWe should commit intermediate MPI types; failing to do so is\ninfinitesimally more efficient and seems to work in practice, but the\nerror descriptions in the resize docs I've read suggest that using an\nuncommitted type may not be strictly allowed.", "idx": 126} {"target": 1, "func": "[PATCH] Add performance improvement features to matrix loading.", "idx": 629} {"target": 1, "func": "[PATCH] Update RAS-IR.\n\nNotes:\n1. Overlap and row ordering have significant effects on convergence.\n2. Lower the matrix bandwidth, better the performance ?\n3. Sync is lower, but communication becomes higher, tradeoff!\n4. Parallelism is higher, but overhead is also higher, tradeoff!\n5. Very high accurate solves, do not provide anything.\n6. 
Two level and multi-levels might significantly reduce number of\niterations to converge.", "idx": 266} {"target": 1, "func": "[PATCH] Add SIMD intrinsics version of simple update\n\nTo get better performance in cases where the compiler can't vectorize\nthe simple leap frog integrator loop and to reduce cache pressure of\nthe invMassPerDim, introduced a SIMD intrinsics version of the simple\nleap-frog update without pressure coupling and one T-scale factor.\nTo achieve this md->invmass now uses the aligned allocation policy\nand is padded by GMX_REAL_MAX_SIMD_WIDTH elements.\nAsserts have been added to check for the padding.\n\nChange-Id: I98f766e32adc292403782dc67f941a816609e304", "idx": 850} {"target": 1, "func": "[PATCH] Add locks only for non-OPENMP multithreading\n\nto mitigate performance problems caused by #1052 and #1299 as seen in #1461", "idx": 342} {"target": 1, "func": "[PATCH] PERF: Anisotropic smoothing improvements (#2713)\n\nThis improves CUDA/OpenCL backend performance by about 24%", "idx": 1156} {"target": 1, "func": "[PATCH] Use reference to improve performance in pair_reaxc_kokkos", "idx": 1477} {"target": 1, "func": "[PATCH] performance fixes in fftconvolve kernels", "idx": 1325} {"target": 1, "func": "[PATCH] replace patch routines that call scatter with more efficient\n ones; combine multiple ddots into one call to reduce no. of gops; added\n inheritance to print control", "idx": 1102} {"target": 1, "func": "[PATCH] Update VSX SIMD to avoid inline assembly\n\nThanks to some help from Michael Gschwind of\nIBM, this removes the remaining inline assembly\ncalls and replaces them with vector functions. This\navoids interfering with the optimizer both on GCC\nand XLC, and gets us another 3-10% of performance\nwhen using VSX SIMD. 
Tested with GCC-4.9, XLC-13.1\nin single and double on little-endian power 8.", "idx": 1481} {"target": 1, "func": "[PATCH] fixed weak scaling performance which was hampered to do a\n commit on Apr 27 2010", "idx": 1236} {"target": 1, "func": "[PATCH] efficient jacobian of parallelizer, little speed penalty\n remaining", "idx": 749} {"target": 1, "func": "[PATCH] foamDictionary: Added support for reading files as case\n IOdictionary in parallel\n\nIf the -case option is specified time is created from the case\nsystem/controlDict enabling support for parallel operation, e.g.\n\nmpirun -np 4 \\\n foamDictionary -case . 0/U -entry boundaryField.movingWall.value \\\n -set \"uniform (2 0 0)\" \\\n -parallel\n\nThis will read and modify the 0/U field file from the processor directories even\nif it is collated. To also write the 0/U file in collated format the collated\nfileHandler can be specified, e.g.\n\nmpirun -np 4 \\\n foamDictionary -case . 0/U -entry boundaryField.movingWall.value \\\n -set \"uniform (2 0 0)\" \\\n -fileHandler collated -parallel\n\nThis provides functionality for field manipulation equivalent to that provided\nby the deprecated changeDictionary utility but in a more flexible and efficient\nmanner and with the support of fileHandlers for collated parallel operation.", "idx": 325} {"target": 1, "func": "[PATCH] Fix performance regression with gcc-3.3", "idx": 1493} {"target": 1, "func": "[PATCH] Performance increase for charge-implicit ReaxFF/changed\n cutoff selection", "idx": 1304} {"target": 1, "func": "[PATCH] fixed nbnxn x86 SIMD non-bonded performance regression\n\nCommit f40969c2 broke the LJ combination detection,\nwhich effectively made all runs use the full combination rule\nmatrix for x86 SIMD kernels. This is now corrected.\n\nChange-Id: I1073801546fde23e6a53199120246697a7c61b5f", "idx": 40} {"target": 1, "func": "[PATCH] Sparse refactored readers: Better vectorization for tile\n bitmaps calculations. 
(#2711)\n\n* Sparse unordered with duplicates: Better vectorization for tile bitmaps.\n\nWhen computing tile bitmaps, there are a few places where we were not\nvectorizing properly and it caused performance impacts on some customer\nscenario. Fixing the use of covered/overlap in the dimension class to\nenable proper vectorization.", "idx": 1047} {"target": 1, "func": "[PATCH] Use Traits::Vector to accelerate 2D optimization", "idx": 1221} {"target": 1, "func": "[PATCH] general SIMD acceleration for angles+dihedrals\n\nImplemented SIMD intrinsics for angle potential and pbc_dx.\nChanged SSE2 intrinsics to general SIMD using gmx_simd_macros.h.\nImproves performance significantly, especially with AVX-256\nand reduces load imbalance, especially with GPUs.\n\nChange-Id: Ic83441cce68714ae91c6d5ca2a6e1069a62cd2ae", "idx": 768} {"target": 1, "func": "[PATCH] Optimized the performance of the CG-based LPR algorithm", "idx": 1001} {"target": 1, "func": "[PATCH] Improved performance of computing sums with CustomIntegrator", "idx": 1191} {"target": 1, "func": "[PATCH] resolved performance degrading changed introduced in revision\n 1319 (2)", "idx": 351} {"target": 1, "func": "[PATCH] Tabulated function parameters are hardcoded in the kernel\n instead of being stored in an array. This makes the code simpler and may\n help performance slightly.", "idx": 845} {"target": 1, "func": "[PATCH] fixed slow reading of large xvg files with LAM MPI", "idx": 1416} {"target": 1, "func": "[PATCH] Avoid setting config on transient subarray (#2740)\n\nFor the existing `tiledb_query_add_range` APIs there is a transient\nsubarray object that is referenced. In one API the query config was also\nbeing set on this subarray. This is not required, and other variations\nof `tiledb_query_add_range` did not set the config. 
Removing this saves\na few seconds of the WALL time when a user is setting hundreds of\nthousands of ranges.\n\nThe performance bottleneck comes in when the config is copied which is a\nfull copy of the config map. This was being done once per range setting.\n\nInstead to maintain current behavior setting the query config also sets\nthe subarray config.", "idx": 882} {"target": 1, "func": "[PATCH] added while loops in pbc_dx to correct for multiple box\n vectors shifts, and added set_pbc_ss for efficient single shift pbc_dx", "idx": 344} {"target": 1, "func": "[PATCH] Parallel Performance bug in structure factor fixed for pspw.\n The new code works ok in serial, but still need to check parallel\n performance.\n\n...EJB", "idx": 623} {"target": 1, "func": "[PATCH] More efficient SetNonzeros::propagateSparsities", "idx": 1534} {"target": 1, "func": "[PATCH] replace std::set with\n std::array\n\nfor facets vertices\nthis should be a lot more efficient", "idx": 1509} {"target": 1, "func": "[PATCH] fixed SD and BD integrator OpenMP performance\n\nSD and BD integrator always integrated single threaded.\nReally fixes #1121\n\nChange-Id: I2217c40e9c188c7cd57801e413750035c6488f56", "idx": 476} {"target": 1, "func": "[PATCH] CUDA: change heuristic for BlockSize to prefer 128 threads\n\nSome experiments deomnstrated that for certain kernels the\ncurrent heuristic isn't great. 
In particular copy and memset\nkernels were bad.\n\nUsing the updated stream benchmark I got before this change:\n\nSet 327316.30 MB/s\nCopy 654344.27 MB/s\nScale 654263.20 MB/s\nAdd 846497.84 MB/s\nTriad 844604.40 MB/s\n\nWith this change:\n\nSet 652713.29 MB/s\nCopy 807649.65 MB/s\nScale 808014.29 MB/s\nAdd 847403.47 MB/s\nTriad 845885.63 MB/s\n\nExaminidMD also improved from 2.48e+08 to 2.82e+08:\n\n1 256000 | 0.906401 0.480328 0.142917 0.165107 0.117937 | 1103.264687 2.824358e+08 2.824358e+08 PERFORMANCE\n\n1 256000 | 1.030611 0.501819 0.243033 0.163163 0.122484 | 970.297956 2.483963e+08 2.483963e+08 PERFORMANCE", "idx": 1423} {"target": 1, "func": "[PATCH] A few, perhaps not so useless and dangerous, changes to qmmm\n code 1) more efficient use of Bq module 2) enabling QMMM runs for property\n calculations", "idx": 1228} {"target": 1, "func": "[PATCH] improve skylakex paralleled sgemm performance", "idx": 1449} {"target": 1, "func": "[PATCH] Fixes to internal functions\n\n- Was using incorrect number of elements for the total\n- Fixed copy because right now isOwner() does not mean isLinear()\n - Potentially improves performance when isLinear() is not isOwner()", "idx": 1397} {"target": 1, "func": "[PATCH] Made DD exclusion processing more efficient\n\nWith the Verlet scheme exclusions no longer need to be assigned only\nonce and there are no charge groups. 
This means the global to local\nexclusion conversion can be more than twice as fast.\n\nChange-Id: I80e1213715f051864d2989389212510428896cb8", "idx": 553} {"target": 1, "func": "[PATCH] Added Grsm_ggm_sym_dot subroutine for more efficient\n calculation of product that result in symmetry matrices...EJB", "idx": 381} {"target": 1, "func": "[PATCH] Hard CPU affinity is set when Nthreads == Ncores.\n\nThis causes a slight thread_mpi performance gain on NUMA systems.", "idx": 520} {"target": 1, "func": "[PATCH] made changes which removed further operations and made CLJP\n and Falgout coarsenings more efficient", "idx": 822} {"target": 1, "func": "[PATCH] converted part of rhogen to daxpy getting performance\n improvement on ia64", "idx": 985} {"target": 1, "func": "[PATCH] VFS Read-Ahead Cache (#1785)\n\nThis introduces a read-ahead cache within VFS::read(). This is an LRU cache\nthat maintains a single cached buffer for an arbitrary number unique URIs, not\nto exceed 10MiB (by default). Each cached buffer has a max size of 100KiB (by\ndefault). These parameters can be tweaked with the following config items:\n\n`vfs.read_ahead_size` (100KiB default)\n`vfs.read_ahead_cache_size` (10Mib default)\n\nThe motiviation for this patch is to optimize IO patterns of small, relatively\nsequential reads against cloud storage backends. Only the S3, Azure, and GCS\nbackends utilize this read cache. The POSIX/Windows/HDFS backends are\nunaffected by this patch.\n\nBoth performing and caching the read-ahead incur a performance penalty:\n1. We must read more than the requested bytes.\n2. We must make a copy of the read buffer (one to store in the cache, one to\nreturn to the user).\n\nWe will only perform a read-ahead if the requested read is smaller than the\ndefault 100KB cached buffer size. IO patterns of large reads will be unaffected.\nThe assumption is that fragment data is large and that reading fragment\ndata will not incur a performance penalty. 
Additionally, reads to tile data\nare bypassed because tiles have their own separate tile cache.\n\nOn the recent S3 workload we've been discussing, this read cache has a 78%\nhit rate, where every cache hit is in the fragment metadata. I've observed a\nbest-case runtime of 6.5s with this patch, and a 27s runtime without this\npatch.\n\nCo-authored-by: Joe Maley ", "idx": 80} {"target": 1, "func": "[PATCH] Added a few lines regarding the GA array distribution but,\n only performance bug and results were still correct.", "idx": 984} {"target": 1, "func": "[PATCH] Update RAS-IR.\n\nNotes:\n1. Overlap and row ordering have significant effects on convergence.\n2. Lower the matrix bandwidth, better the performance ?\n3. Sync is lower, but communication becomes higher, tradeoff!\n4. Parallelism is higher, but overhead is also higher, tradeoff!\n5. Very high accurate solves, do not provide anything.\n6. Two level and multi-levels might significantly reduce number of\niterations to converge.", "idx": 366} {"target": 1, "func": "[PATCH] New traits classes for efficient handling of circles.", "idx": 1076} {"target": 1, "func": "[PATCH] introducing info on elements in the overlap regions for more\n efficient calculation of the characteristic function (restriction to part of\n the mesh)", "idx": 864} {"target": 1, "func": "[PATCH] Performance improvements and fixed bug with memory budget and\n multi-range subarrays (#1601)", "idx": 1306} {"target": 1, "func": "[PATCH] Improvement of rectangular slicing: part 1 - memory efficient\n formulation", "idx": 135} {"target": 1, "func": "[PATCH] replace custom kdtree by a fast version of\n Orthogonal_k_neighbor_search", "idx": 1077} {"target": 1, "func": "[PATCH] Fix false sharing in RandomNumber Pools\n\nThe pool arrays are accessed by all threads, but the\nelements per thread are rather small and thus share cachelines.\n\nMaking the arrays 2D using the second dimension for padding only,\nsolves that problem. 
I have seen up to 200x improvement for\n20 threads on skylake running the random number example with 100k and 1\nas parameters. Serial performance is slightly reduced, but it only makes\nthe \"grep a generator slower\" by the equivalent of one double load more\nand a few integer ops.", "idx": 1085} {"target": 1, "func": "[PATCH] DynRankView: operator() performance improvements\n\nDebug macros added to check active memory space, rank, and bounds\nDynRankView: Simple performance test added\n compare performance to View and rank 7 View", "idx": 486} {"target": 1, "func": "[PATCH] Improve the performance of BoundingBox::contains_point by\n marking is_between as an inline function\n\n3.28s 130: if (bboxes[i_from].contains_point(*node + _to_positions[i_to]))\n\n520ms 130: if (bboxes[i_from].contains_point(*node + _to_positions[i_to]))", "idx": 481} {"target": 1, "func": "[PATCH] Removed unnecessary synchronization that hurt performance on\n Nvidia", "idx": 691} {"target": 1, "func": "[PATCH] slow draw of polyhedron_items fixed", "idx": 215} {"target": 1, "func": "[PATCH] Modifying a couple paramaters in the \"POWER10\"-specific\n section of param.h, for performance enhancements for SGEMM and DGEMM.", "idx": 392} {"target": 1, "func": "[PATCH] Undo deprecation of BoundaryInfo::add_side().\n\n* Remove new version of add_side() that takes a reference to a std::set.\n We are now ensuring that the entries in the input vector are unique\n while storing them in the BoundaryInfo object.\n\n* Refactor MeshTools::Modification::change_boundary_id(). 
This\n function used to call boundary_ids() for every side, edge, and node\n in the mesh, but it should be more efficient to only call that for\n sides, edges, and nodes that have boundary ids on them.", "idx": 369} {"target": 1, "func": "[PATCH] 28-30% improvement in cuda vs opencl speedup for bilateral\n\n* Replacing exp cuda device function with __expf improved the cuda\n bilateral kernel performance vs opencl kernel.\n* Removed redundant multiplication calculation in the for loop\n of cuda/opencl kernels", "idx": 179} {"target": 1, "func": "[PATCH] Significant DBSCAN refactoring and improvements.\n\n - Use PARAM_MATRIX() instead of strings.\n - Add a single-point mode that handles RAM better.\n - Use UnionFind for much more efficient cluster finding.\n - Make --single_mode use the single point mode.", "idx": 907} {"target": 1, "func": "[PATCH] template cases: added cylindrical background mesh in rotating\n geometry cases\n\nsnappyHexMesh produces a far better quality AMI interface using a cylindrical background mesh,\nleading to much more robust performance, even on a relatively coarse mesh. The min/max AMI\nweights remain close to 1 as the mesh moves, giving better conservation.\n\nThe rotating geometry template cases are configured with a blockMeshDict file for a cylindrical\nbackground mesh aligned along the z-axis. The details of use are found in the README and\nblockMeshDict files.", "idx": 211} {"target": 1, "func": "[PATCH] improved CUDA kernel performance by pre-loading cj\n\nChange-Id: Ic725a82d550e2ecffd4d32edd2c44205aef99b8d", "idx": 630} {"target": 1, "func": "[PATCH] Fixed FAST memory leaks on CUDA backend", "idx": 892} {"target": 1, "func": "[PATCH] Reorder indices of Gamma_P_ia\n\nThis step increases spatial cache locality of the exchange of Gamma_P_ia. 
The increased locality increases the performance of its redistribution.", "idx": 618} {"target": 1, "func": "[PATCH] Improve performance for SYCL parallel_reduce", "idx": 195} {"target": 1, "func": "[PATCH] Using fast approximation for erfc instead of tabulated values", "idx": 784} {"target": 1, "func": "[PATCH] Improve performance of MIC exponential\n\nThe limited precision is due to argument scaling\nrather than the exponential function lookup, so\ninstead of iterating we can improve the accuracy\nwith a simple correction step, similar to what\nwas done for the recent AVX-512ER implementation.\n\nChange-Id: If55e7c4cefac5022e7211dfa56686cb9ee03a54a", "idx": 1059} {"target": 1, "func": "[PATCH] MeshCommunication new_nodes methods\n\nThese should be more efficient in many cases and may be necessary for\ncorrectness in others.", "idx": 124} {"target": 1, "func": "[PATCH] Fix Kokkos performance regression for small systems", "idx": 7} {"target": 1, "func": "[PATCH] Added support of IBM's MASS library that optimizes\n performance on Power architectures", "idx": 68} {"target": 1, "func": "[PATCH] accelerate distance queries for offset meshing", "idx": 36} {"target": 1, "func": "[PATCH] Various performance improvements: (#1573)\n\n- Each fragment metadata directory is now associated with an empty file with the same name as the directory, but with added suffix '.ok'. This prevents an extra REST request on object stores when opening the array.\n- Consolidation does not delete the consolidated fragments or array metadata. This is to prevent locking the array during consolidation and to enable time traveling and fine granularities.\n- Added new consolidation functionality that enables consolidation of all fragment metadata footers in a single file. 
This boosts the performance of opening an array significantly.\n- Added vacuum API to clean up consolidated fragments, array metadata, or consolidated fragment metadata.\n- Parallelized the reader in various places, significantly boosting the read performance.", "idx": 1530} {"target": 1, "func": "[PATCH] Added more constructors, for more efficient handling of\n circles with rational radii.", "idx": 1148} {"target": 1, "func": "[PATCH] more efficient comparison function\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@623 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 212} {"target": 1, "func": "[PATCH] Improving the performance, 10-20% slower than\n Triangulation_2, not removing the initial vertices", "idx": 88} {"target": 1, "func": "[PATCH] PERF: Improve performance for sort_by_key\n\n- Has added benefit of cutting build times by half", "idx": 1319} {"target": 1, "func": "[PATCH] Improve the performance of sending the AmoebaVdwLambda to\n Cuda using pinned host memory; Updated the AmoebaVdwForceProxy to version 3,\n and added backward compatibility to version 2; updated TestAPIUnits.py to\n handle the per particle lambda flag", "idx": 804} {"target": 1, "func": "[PATCH] Adding the ability to convert the dist rank, cross rank, and\n redundant rank for a particular distribution to the VC rank and subsequently\n making use of this to improve the performance of the GetSubmatrix routine for\n AbstractDistMatrix", "idx": 983} {"target": 1, "func": "[PATCH] HvD: Adding a first pass of the 2nd and 3rd derivatives\n generated with the Maxima symbolic algebra program. Starting the code\n generation on Sunday at 10:41 AM and finishing on Wednesday at 00:11 AM it\n took approximately 62 hours to generate this code (for most part using two\n cores on my machine). The main offender was the TPSS correlation functional\n which took well over 12 hours. 
Part of this was due to Maxima exhausting all\n 4 GB of physical memory in my desktop machine and the subsequent paging\n reduced the performance significantly.\n\nNevertheless the results look good right now (still need to test the higher\norder derivative which could significantly impact this assessment). In\nparticular the code for the M06 correlation functional was originally 44 MB\nin size. This was the result of simply generating the relevant derivative\nexpressions and expressing them as Fortran code. Now that the autoxc script\nhas been enhanced to optimize the expressions before generating Fortran the\nsame routines only amount to 470 KB (a reduction of almost a factor 100 (98 to\nbe precise)).\n\nIt is obvious that optimizing these expressions does take time. However, the\ngood thing is that with this result the compiler does not need to take that time\nwhen you try to build NWChem. I am sure that anyone compiling NWChem will be\ngrateful for not having to wait 62 hours.", "idx": 1352} {"target": 1, "func": "[PATCH] Fixed the bug where dumping distribution file would decrease\n the performance", "idx": 15} {"target": 1, "func": "[PATCH] basicThermo: Cache thermal conductivity kappa rather than\n thermal diffusivity alpha\n\nNow that Cp and Cv are cached it is more convenient and consistent and slightly\nmore efficient to cache thermal conductivity kappa rather than thermal\ndiffusivity alpha which is not a fundamental property, the appropriate form\ndepending on the energy solved for. 
kappa is converted into the appropriate\nthermal diffusivity for the energy form solved for by dividing by the\ncorresponding cached heat capacity when required, which is efficient.", "idx": 154} {"target": 1, "func": "[PATCH] small performance optimization for pair style comb", "idx": 649} {"target": 1, "func": "[PATCH] Tabulated log(x) to improve GB performance", "idx": 1138} {"target": 1, "func": "[PATCH] small performance improvement for nbnxn SSE kernels", "idx": 1445} {"target": 1, "func": "[PATCH] Converts PADiffusionSetup3D and\n QuadratureInterpolator::Eval3D kernels from 1 element per thread to 1 qpt/dof\n per thread for better performance when offloading (there are not enough units\n of work with 1 element/thread)", "idx": 586} {"target": 1, "func": "[PATCH] Performance improvements and fixed bug with memory budget and\n multi-range subarrays", "idx": 1406} {"target": 1, "func": "[PATCH] Fix for processors being offline on Arm\n\nUse the number of configured rather than online CPUs.\nWe will still get a warning about failures when trying to\npin to offline CPUs, which hurts performance slightly.\nTo fix this, we also check if there is a mismatch between\nconfigured and online processors and warn the user that\nthey should force all their processors online\nfor better performance.\n\nChange-Id: Iebdf0d5b820edcd7d06859a2b814adf06589ef96", "idx": 1564} {"target": 1, "func": "[PATCH] improved load balancing on the GPU\n\nFor the GPU, small pair list entries are now sorted to the end.\nThis improves performance by 5 to 20%.\n\nChange-Id: I25e5efeb813ad5dde48f0955366519db699f21a2", "idx": 1118} {"target": 1, "func": "[PATCH] Clean up and use fast pool allocator (faster)", "idx": 1290} {"target": 1, "func": "[PATCH] Slightly improved performance, and small refinements", "idx": 884} {"target": 1, "func": "[PATCH] Performance optimization of Tokenizer\n\nReduces string allocations and removes std::vector from Tokenizer\nMost processing now happens on-demand.", 
"idx": 415} {"target": 1, "func": "[PATCH] Small improvements for readability and performance", "idx": 786} {"target": 1, "func": "[PATCH] Issue #2056 More efficient mapped evaluation", "idx": 1159} {"target": 1, "func": "[PATCH] Changes to internal memory manager\n\n- Manager now contains list of locked and free buffers separately\n- Should improve performance when allocating new buffers\n- Added proper documentation", "idx": 1327} {"target": 1, "func": "[PATCH] Adjust s3 multi-part locking to unlock early\n\nThis switches from iterators to using find + at to take references to\nthe state objects for manipulations. This allows us to release the S3 class\nlevel locks earlier and faster, removing performance bottlenecks.", "idx": 1119} {"target": 1, "func": "[PATCH] Speedup of kernel caching mechanism by hashing sources at\n compile time (#3043)\n\n* Reduced overhead of kernel caching for OpenCL & CUDA.\n\nThe program source files memory footprint is reduced (-30%) by eliminating\ncomments in the generated kernel headers. Hash calculation of each source\nfile is performed at compile time and incrementally extended at runtime\nwith the options & tInstance vectors. 
Overall performance increased up to\n21%, up to the point that the GPU becomes the bottleneck, and the overhead\nto launch the same (small) kernel was improved by 63%.\n\n* Fix a couple of minor cmake changes\n\n* Move spdlog fetch to use it in bin2cpp link command\n\nCo-authored-by: pradeep \n(cherry picked from commit 3cde757face979cd9f51a4c01bd26107e69e4605)", "idx": 1167} {"target": 1, "func": "[PATCH] Matrix: Replace the row-start pointer array with computed\n offsets\n\nThe row-start pointer array provided performance benefits on old\ncomputers but now that computation is often cache-miss limited the\nbenefit of avoiding an integer multiply is more than offset by the\nadditional memory access into a separately allocated array.\n\nWith the new addressing scheme LUsolve is 15% faster.", "idx": 1322} {"target": 1, "func": "[PATCH] fixed bug with Verlet + DD + bonded atom communication\n\nAtoms communicated for bonded interactions can be beyond the cut-off\ndistance. Such atoms are now placed in an extra row in the grid.\nFixes #1114\n\nAlso improved the performance of the nbnxn grid sorting, especially\nfor inhomogeneous systems.\n\nChange-Id: Ibe5ba24af95959f5dadd89584e2315da60b55091", "idx": 1567} {"target": 1, "func": "[PATCH] Enhancement: replace SQRT and POW by more efficient\n computations", "idx": 305} {"target": 1, "func": "[PATCH] added precompiler commands for performance tests bug fix in\n SNC_FM_decorator and SNC_constructor: plane sweep must be done on correct\n planes", "idx": 186} {"target": 1, "func": "[PATCH] impr performance of resultant", "idx": 187} {"target": 1, "func": "[PATCH] added fast return, if m or n < 1", "idx": 335} {"target": 1, "func": "[PATCH] More efficient SystemBase::reinit, flux_jump indicator should\n work but needs testing\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@304 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 861} {"target": 1, "func": "[PATCH] add more efficient getppn for BGQ and MPI-3\n\nBGQ has a 
system call for PPN etc.\nMPI-3 has a routine to get a node communicator and the size of this communicator is the number of PPN.\n\nI have efficient implementations for Cray, but since Cray has supported MPI-3 for over a year, there is no need.\n\nThe MPI-2 implementation could be optimized but it should not be a bottleneck and is unlikely to be used except\nby users that insist on using old MPI implementations, since all relevant platforms support MPI-3 or are BGQ.", "idx": 1471} {"target": 1, "func": "[PATCH] Improve implementation of cycle subcounting\n\nConfiguring with GMX_CYCLE_SUBCOUNTERS on is intended to make active\nsome counters that show finer-grained timing details, but its\nimplementation with the preprocessor was more complex and more bug\nprone than this one. Static analysis now finds the bug where we\nover-run buf (fixed here and in release-5-1).\n\nThe only place we care about performance of the subcounter\nimplementation is that it doesn't do work when GMX_CYCLE_SUBCOUNTERS\nis off, and constant propagation and dead code elimination will handle\nthat.\n\nAlso moved some declarations into the blocks where they are used.\n\nChange-Id: I3d7a06a65c636c11557a094997a7a81f86a1ed8a", "idx": 869} {"target": 1, "func": "[PATCH] finally fixed a major performance bug. 
New implementation now\n slightly faster than multiple on old implementation", "idx": 619} {"target": 1, "func": "[PATCH] Modified vector calls to improve performance when copying data\n to the same processor.", "idx": 396} {"target": 1, "func": "[PATCH] OPENCL: Disabling greedy assignment for csrmm and csrmv\n\n- Was causing performance issues on intel and amd devices", "idx": 735} {"target": 1, "func": "[PATCH] Fixed a small performance bug", "idx": 639} {"target": 1, "func": "[PATCH] Replace SIMD copy patch routine to obtain better parallel\n performance", "idx": 463} {"target": 1, "func": "[PATCH] constrainPressure: Updated to use the more efficient\n patch-based MRF::relative function", "idx": 1468} {"target": 1, "func": "[PATCH] TurbulenceModels::kOmegaSST.*: Updated source-terms and\n associated functions to use volScalarField::Internal\n\nThis is more efficient, avoids divide-by-0 when evaluating unnecessary\nboundary values and avoids unnecessary communications when running in parallel.", "idx": 49} {"target": 1, "func": "[PATCH] more efficient binary operations in MX, ticket #192", "idx": 1531} {"target": 1, "func": "[PATCH] implemented more efficient Ruge-coarsening", "idx": 897} {"target": 1, "func": "[PATCH] More efficient gain calculation in SQPMethod, #551", "idx": 1461} {"target": 1, "func": "[PATCH] Added an efficient parallel version of the ADMM linear\n program solver of Boyd et al.", "idx": 510} {"target": 1, "func": "[PATCH] Improved OpenCL SIFT coalescing and performance", "idx": 1433} {"target": 1, "func": "[PATCH] Fixing a performance bug in trsm_[LR].c.", "idx": 106} {"target": 1, "func": "[PATCH] Build a default DiffSolver at init() not at construction, to\n be more efficient when the user wants to create a DiffSolver themselves", "idx": 1400} {"target": 1, "func": "[PATCH] - Added the ability to add points on a sphere outside the\n domain in the sequential case => better performance for the fandisk model\n (x2). I'm still wondering why... 
- Code refactoring/clean-up", "idx": 472} {"target": 1, "func": "[PATCH] In coarse dslash, added parallelization of color-column\n multiplication by splitting up the warp into different regions of the\n column-wise multiplication across the space-time dimension. Presently 1-way,\n 2-way and 4-way warp splitting is implemented. Prior to writing out the\n result a warp-level reduction into the first segment is performed, which\n writes out the result. This adds additional levels of parallelism that\n improves the performance on small lattices. The level of warp splitting is\n autotuned using the auxiliary tuning dimension.", "idx": 1246} {"target": 1, "func": "[PATCH] When need to realloc, double the size. This is more\n efficient if the ultimate size is very large.", "idx": 1387} {"target": 1, "func": "[PATCH] Always more efficient Face Partial Assembly Kernels. Lot of\n simplifications in the design of Domain Kernels. Remove inefficient Kernels\n Based on Eigen.", "idx": 402} {"target": 1, "func": "[PATCH] Improved performance on AMD GPUs", "idx": 1025} {"target": 1, "func": "[PATCH] fixed GPU particle gridding performance issue\n\nThe scaling factor for the grid binning for the GPU pair search\nwas set incorrectly, which made the binning 50% slower.\n\nChange-Id: I146592c37094a3d81a7ae50b3903fcc615e748d5", "idx": 777} {"target": 1, "func": "[PATCH] Acquire BC lists outside the CheckPointIO id loop\n\nThis should be just as fast for restarts and O(Nsplits/Nprocs) times\nfaster for mesh splitting. Thanks to @friedmud for the idea.", "idx": 273} {"target": 1, "func": "[PATCH] Try to use inline in boxDimension class to increase the\n performance", "idx": 608} {"target": 1, "func": "[PATCH] hypre's GPU SpGemm (#433)\n\nThis PR improves the performance of hypre's sparse matrix-matrix multiplication on NVIDIA GPUs, and fixes it on AMD GPUs with hip.\n\nCo-authored-by: Ruipeng Li \nCo-authored-by: Paul T. 
Bauman ", "idx": 400} {"target": 1, "func": "[PATCH] Optimize multi-fragment unfiltering, part 1 (#1692)\n\nThis patch improves the execution time of the attribute unfiltering path.\n\n```\n// Current\nTotal read query time (array open + init state + read): 15.0957 secs\n Time to unfilter attribute tiles: 8.86625 secs\n```\n\n```\n// With this patch\nTotal read query time (array open + init state + read): 7.32202 secs\n Time to unfilter attribute tiles: 1.80354 secs\n```\n\nThe issue is that the destruction time of `forward_list` is obscenely\nslow within my OSX environment. On the Linux environment that I\noriginally wrote this in, the `forward_list` did not cause a performance\nproblem. This patch just replaces the `forward_list` with a `vector`.\n\nI have titled this as a \"part 1\" because I have tentatively explored\nmulti-threading the `unfilter_tiles` path and have observed a speedup.\nMore on this later.\n\nCo-authored-by: Joe Maley ", "idx": 799} {"target": 1, "func": "[PATCH] Replaced LU by QR decomposition.\n\nColPivHouseholderQR is, according to the manual, both fast and rank\nrevealing.\n\nRelated to #59.", "idx": 380} {"target": 1, "func": "[PATCH] POWER10: Change dgemm unroll factors\n\nChanging the unroll factors for dgemm to 8 shows improved performance with\nPOWER10 MMA feature. Also made some minor changes in sgemm for edge cases.", "idx": 209} {"target": 1, "func": "[PATCH] Fixing Performance Bug with Atomics\n\nCalling templated versions of atomics will prevent matching of\nnon-templated code, thus it would never call the optimized atomic\nroutines. 
This affected in particular atomic increment and decrement.", "idx": 681} {"target": 1, "func": "[PATCH] Core: Atomic Performance Test Reduce Loop count\n\nThis was taking too long, reduced the loop count.", "idx": 778} {"target": 1, "func": "[PATCH] Build a default DiffSolver at init() not at construction, to\n be more efficient when the user wants to create a DiffSolver themselves\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1513 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 1324} {"target": 1, "func": "[PATCH] Improved the performance of the gmx_sum routines on cluster\n with multi-core nodes by using a two step communication procedure", "idx": 1079} {"target": 1, "func": "[PATCH] slight performance improvement for 4x4 search", "idx": 322} {"target": 1, "func": "[PATCH] Initial volume() optimization for HEX27.\n\nThis initial, relatively straightforward optimization improved the\nperformance of Hex27::volume() by about 16x (1476.46s down to 92.09s\nfor 150^3 elements). This is good, but I still think it can be made\nfaster.", "idx": 1276} {"target": 1, "func": "[PATCH] Parallelize closing of files on write\n\nThis change parallelizes the closing of files on writes. This solves a\nperformance problem when the user was using S3 or other object store\nwhere we buffer the multi-part writes. If the user's data was below the\nbuffer size, then no io would have occurred until the closing when we\nflush buffers. 
This causes a large performance penalty relative to\nexpectations because up to three files per field had to be uploaded\nserially.", "idx": 252} {"target": 1, "func": "[PATCH] CUDA nb kernel performance improvement for CUDA 4.1\n\nThe manual unrolling of the jm4 loop improves somewhat the performance\nof the nonbonded CUDA kernels, but there is still a 5-7% performance\nregression with CUDA 4.1 compared to 3.2/4.0.", "idx": 500} {"target": 1, "func": "[PATCH] improved nbnxn PME kernel performance on AMD\n\nThe performance of the nbnxn PME kernels on AMD was much worse with\ngcc than with icc. Now the table load macro has been changed,\nwhich roughly halves the performance difference.", "idx": 1570} {"target": 1, "func": "[PATCH] make ::localize() more efficient, still need to handle\n ::localize_to_one()", "idx": 1437} {"target": 1, "func": "[PATCH] performance optimizations in sgemm_kernel_16x2_bulldozer.S", "idx": 1051} {"target": 1, "func": "[PATCH] Restructured the subdivision package\n\n-- Integrated the doc in the header files\n-- Split and moved files to have a proper internal structure and to distinguish\n between hosts, stencils and methods at the filename level.\n-- Removed all instances of Polyhedron to have PolygonMesh instead\n-- Cleaned off useless functions (Polyhedron_decorator remnants)\n-- Improved general documentation\n-- Minor performance improvements", "idx": 309} {"target": 1, "func": "[PATCH] Added full double precision support to the hisq-force\n routines. 
Replaced the macro used to write the hisq fermion force to global\n memory with a much more efficient device function.", "idx": 1302} {"target": 1, "func": "[PATCH] Significantly improved restrictor performance through the use\n of CTA index swizzling to improve spatial locality, improving cache line\n utilization", "idx": 222} {"target": 1, "func": "[PATCH] #1295 More efficient implementation of A[I] = B", "idx": 1042} {"target": 1, "func": "[PATCH] Accelerate \"Checking if non utf-8 characters are used\"", "idx": 401} {"target": 1, "func": "[PATCH] Improved performance of transpose\n\n* Using int instead of dim_type\n* Unrolling loops using static consts\n* Using output dimensions", "idx": 355} {"target": 1, "func": "[PATCH] fixed problem with index group for system size and made the\n diameter calculation more efficient", "idx": 1175} {"target": 1, "func": "[PATCH] on itanium (with Intel compiler) **2 was verrrry slow,\n replaced with ()*()", "idx": 1539} {"target": 1, "func": "[PATCH] Added shared memory carve out setting for Volta - improves\n dslash performance by ~5%", "idx": 1243} {"target": 1, "func": "[PATCH] Tensor, SymmTensor: Simplified invariantII\n\nNow the calculation of the 2nd-invariant is more efficient and\naccumulates less round-off error.", "idx": 53} {"target": 1, "func": "[PATCH] Improved performance of g_bar while reading .edr files with\n small nstenergy.", "idx": 849} {"target": 1, "func": "[PATCH] some more performance improvements from Krys. Mostly in\n gradients", "idx": 1229} {"target": 1, "func": "[PATCH] Changed increment ordering for performance", "idx": 687} {"target": 1, "func": "[PATCH] Better optimized ICC release flags\n\nAdd those flags included in -fast which both help performance\nand are appropriate for GROMACS.\n\nThe flags included in -fast for Linux we weren't using were:\n-ipo, -no-prec-div, -static, -fimf-domain-exclusion=15\n\nFull static depends on static libraries to be installed and thus\nwill not always work. 
IPO increases compile time by a huge factor.\nWe do require that extreme values (e.g. large negative arguments\nto exp and large positive to erfc) are computed correctly.\n\nThis leaves -no-prec-div -fimf-domain-exclusion=14 -static-intel\nas safe and useful flags for GROMACS.\n\nChange-Id: Ifbee69431841e3051c95f0b4c0ad204aac965c4e", "idx": 977} {"target": 1, "func": "[PATCH] Performance Improvements", "idx": 241} {"target": 1, "func": "[PATCH] BJP: Changed determination of io procs so that IO is more\n efficient for single files using parallel IO.", "idx": 550} {"target": 1, "func": "[PATCH] optimization, improve performance by more than 20%", "idx": 333} {"target": 1, "func": "[PATCH] Switching to Simple_cartesian gives better performance\"", "idx": 1441} {"target": 1, "func": "[PATCH] Use HostVector for PME CPU Force Buffer\n\nFix performance bug: PME CPU force buffer should be a HostVector to\nallow pinned memory GPU transfers (which occur in PME-PP\ncommunications on virial steps).", "idx": 169} {"target": 1, "func": "[PATCH] improved efficiency for communication somewhat and fixed Ssw\n option to work correctly, albeit not as efficient as possible yet.", "idx": 198} {"target": 1, "func": "[PATCH] Optimize the performance of sum by using universal intrinsics", "idx": 1341} {"target": 1, "func": "[PATCH] - Changed Hash_map to Unique_hash_map. Kept old file. -\n Separated Handle_hash_function into own file. - Improved performance of\n default Handle_hash_function. - Rewrote manual pages including the\n UniqueHandleFunction concept. 
- Made protected methods in chained_map.h\n public such that Unique_hash_map can be implemented using a private member \n instead of private inheritance.", "idx": 1122} {"target": 1, "func": "[PATCH] eliminate an extra assignment to increase performance", "idx": 1218} {"target": 1, "func": "[PATCH] Fix performance regression bug of unnecessary destruction of\n IPC comms buffers", "idx": 802} {"target": 1, "func": "[PATCH] Residuals: New MeshObject class to store solver performance\n residuals\n\nThis is more efficient and modular than the previous approach of storing the\nresiduals in the mesh data dictionary.", "idx": 595} {"target": 1, "func": "[PATCH] More efficient GetNonzeros::evaluateGen (sensitivities)", "idx": 1242} {"target": 1, "func": "[PATCH] checking in change for 1-1-48 pathway, a more efficient free\n energy path. See Pham and Shirts,\n http://jcp.aip.org/resource/1/jcpsa6/v135/i3/p034114_s1", "idx": 1520} {"target": 1, "func": "[PATCH] refactor reading last line of potential file code to be more\n efficient", "idx": 1207} {"target": 1, "func": "[PATCH] Stopping criterion: improve the performance one last time by\n using a kernel for the boolean initialization instead of a synchronous copy.", "idx": 365} {"target": 1, "func": "[PATCH] POWER10: Improve dgemm performance\n\nThis patch uses vector pair pointer for input load operation\nwhich helps to generate power10 lxvp instructions.", "idx": 637} {"target": 1, "func": "[PATCH] Crucial bug fix - table lookup was too slow in the previous\n version.", "idx": 940} {"target": 1, "func": "[PATCH] VT:for infiniband VT:disabled HBNA get for performance\n reasons, needs to be enabled in the future", "idx": 703} {"target": 1, "func": "[PATCH] SOLVE, MATMUL and INVERSE now use af_mat_prop\n\naf_mat_prop values can be used for performance improvements\nby calling specialized routines", "idx": 1438} {"target": 1, "func": "[PATCH] remove legacy CUDA non-bonded kernels\n\nThis commit drops the legacy set of 
kernels which were optimized for use\nwith CUDA compilers 3.2 and 4.0 (previous to the switch to llvm backend\nin 4.1).\n\nFor now the only consequence is slight performance degradation with CUDA\n3.2/4.0, the build system still requires CUDA >=3.2 as the kernels do\nbuild with the older CUDA compilers. Whether to require at least CUDA\n4.1 will be decided later.\n\nRefs #1382\n\nChange-Id: I75d31b449e5b5e10f823408e23f35b9a7ac68bae", "idx": 765} {"target": 1, "func": "[PATCH] Improved the performance of finding whether a halfedge is on\n the outer ccb", "idx": 188} {"target": 1, "func": "[PATCH] Optimize `copy_cells`, part 2 (#1695)\n\nThis patch optimizes the `copy_cells` path for multi-fragment reads. The\nfollowing benchmarks are for the multi-fragment read scenario discussed\noffline:\n\n```\n// Current\nRead time: 4.75082 secs\n * Time to copy result attribute values: 3.54192 secs\n > Time to read attribute tiles: 0.311707 secs\n > Time to unfilter attribute tiles: 0.370434 secs\n > Time to copy fixed-sized attribute values: 0.898421 secs\n > Time to copy var-sized attribute values: 0.954925 secs\n```\n\n```\n// With this patch\nRead time: 3.04627 secs\n * Time to copy result attribute values: 1.83972 secs\n > Time to read attribute tiles: 0.274928 secs\n > Time to unfilter attribute tiles: 0.38196 secs\n > Time to copy fixed-sized attribute values: 0.517415 secs\n > Time to copy var-sized attribute values: 0.461847 secs\n```\n\nFor context, here are the benchmark results for the single-fragment read. The\nstats are similar with and without this patch:\n```\nRead time: 1.86883 secs\n * Time to copy result attribute values: 1.19411 secs\n > Time to read attribute tiles: 0.304055 secs\n > Time to unfilter attribute tiles: 0.351332 secs\n > Time to copy fixed-sized attribute values: 0.289661 secs\n > Time to copy var-sized attribute values: 0.142405 secs\n```\n\nThis patch does three things:\n1. 
Converts the `offset_offsets_per_cs` and `var_offsets_per_cs` in the var-sized\n path from a 2D array (vector>) to a 1D array (vector", "idx": 809} {"target": 1, "func": "[PATCH] Improve general performance, set up 1-attribute, fix\n correspondence issue between original and copied map", "idx": 600} {"target": 1, "func": "[PATCH] Transitioning and cleaning up toward a more efficient load\n balancer.", "idx": 1473} {"target": 1, "func": "[PATCH] HvD: It turns out that the performance improvements of a\n number of the optimizations don't carry over to other platforms. So I am\n removing most of them again. The ones that are here to stay are: the\n USE_FORTRAN2008 flag to pick Fortran 2008 intrinsic functions up for POPCNT,\n LEADZ, TRAILZ and ERF where available; the nwxc_dble_powix function to\n exploit that exponentiation with an integer power is much faster than\n exponentiation with a double precision power of the same value.", "idx": 1469} {"target": 1, "func": "[PATCH] fixed bug in CRSSparsity append, now efficient concat", "idx": 567} {"target": 1, "func": "[PATCH] Converted lots of ints to unsigned ints (might help\n performance a little by avoiding conversions)", "idx": 466} {"target": 1, "func": "[PATCH] Replaced the \"visited_facets\" array (parallel version) by an\n atomic char\n\nIt's as fast, and it required less memory.", "idx": 890} {"target": 1, "func": "[PATCH] fix minor CUDA NB kernel performance regression\n\nCommit f2b9db26 introduced the thread index z component as a stride in\nthe middle j4 loop. As this index is not a constant but a value\nloaded from a special register, this change caused up to a few %\nperformance loss in the force kernels. 
This went unnoticed because\nsome architectures (cc 3.5/5.2) and some compilers (CUDA 7.0) were\nbarely affected.\n\nChange-Id: I423790e8fb01a35f7234d26ff064dcc555e73c48", "idx": 898} {"target": 1, "func": "[PATCH] Improve GPU performance especially without electrostatics", "idx": 1392} {"target": 1, "func": "[PATCH] Changed workgroup size to work around NVIDIA bug. This also\n improves performance slightly.", "idx": 830} {"target": 1, "func": "[PATCH] Attempting to add diffusion term in a more efficient manner", "idx": 349} {"target": 1, "func": "[PATCH] Use non-allocating build_edge_ptr where possible\n\nThis may be noticeably more efficient in a few of these cases.", "idx": 253} {"target": 1, "func": "[PATCH] Fix performance bug when LDC is a multiple of 1024", "idx": 417} {"target": 1, "func": "[PATCH] made some bondeds slightly more efficient", "idx": 747} {"target": 1, "func": "[PATCH] performance slightly improved", "idx": 385} {"target": 1, "func": "[PATCH] Improved FAST performance on CUDA backend", "idx": 546} {"target": 1, "func": "[PATCH] Optimize\n\n- Makes the resizing of the points real fast\n- Makes the resizing of the normals applied on slider released instead of every tick when the point set size is bigger than 300 000\n- sets the initial value of the point size to 2 instead of 5", "idx": 311} {"target": 1, "func": "[PATCH] More efficient GetNonzeros::evaluateGen", "idx": 1361} {"target": 1, "func": "[PATCH] ARM64: Improve DAXPY for ThunderX2\n\nImprove performance of DAXPY for ThunderX2\nwhen the vector fits in L1 Cache.", "idx": 136} {"target": 1, "func": "[PATCH] More efficient ordering of constrained DoF index sets", "idx": 411} {"target": 1, "func": "[PATCH] performance improvements under ia64", "idx": 294} {"target": 1, "func": "[PATCH] Changed some compiler flags for better performance", "idx": 164} {"target": 1, "func": "[PATCH] Remove use of interaction_mask_indices on BG/Q\n\nThis field was degrading cache performance ~1% on x86. 
It probably\nmade little difference on BG/Q, because the extra integer operations\ncan use the second instruction-issue port, assuming the use of OpenMP\nto use more than one hardware thread per core. Overall, this code is\nabout 1% faster on BG/Q.\n\nMinor fix to the gmx_load_simd_4xn_interactions() function that looks\nup the exclusion masks, so that new non-x86 platforms won't silently\nfail for want of an implementation of this function.\n\nMinor simplification to always pass simd_interaction_indices to\ngmx_load_simd_4xn_interactions(), since it is only used on BG/Q and\nthen it is non-null.\n\nChange-Id: I140a11607810e9cf08b702cae0b48426c3592fec", "idx": 1057} {"target": 1, "func": "[PATCH] Add node ranking for increased nearest neighbour performance,\n currently failing tests for k > 1", "idx": 767} {"target": 1, "func": "[PATCH] Improve performance of Python integrator (NVE_Opt version)\n\nRemoving the loop over atoms by using NumPy array indexing allows to recover\nperformance close to that of plain fix nve.", "idx": 1408} {"target": 1, "func": "[PATCH] dtrsm_kernel_LT_8x2_bulldozer.S performance optimization", "idx": 330} {"target": 1, "func": "[PATCH] Performance improvements to CPU Anisotropic Diffusion (#2174)\n\n* Performance improvements to CPU Anisotropic Diffusion", "idx": 1049} {"target": 1, "func": "[PATCH] Modifications in sweepline algorithm for is_simple: - Made\n replacing edge by another edge more efficient (using insert with hint). 
-\n altered the output during debugging somewhat.", "idx": 1371} {"target": 1, "func": "[PATCH] changed the redistribution to x/y/z only moves, which\n improves the performance especially for 3D decomposition", "idx": 454} {"target": 1, "func": "[PATCH] convert pair styles in USER-OMP to use fast DP analytical\n coulomb", "idx": 894} {"target": 1, "func": "[PATCH] - Added C++ include guards to all installed .h files\n\n - Modified PCG:\n - usage is consistent with SMG\n - no HYPRE_*pcg.h files\n - no need to allocate extra work-space vectors\n - more efficient matvec (does not do setup every time)\n - default preconditioner is identity, i.e. default PCG is CG\n\n - All HYPRE_* interface routines are now consistently HYPRE_Struct*", "idx": 1208} {"target": 1, "func": "[PATCH] More efficient way of creating point set from selected points", "idx": 688} {"target": 1, "func": "[PATCH] Fixed return value of gmx_mtop_bondeds_free_energy\n\nThe return value was always true, which was harmless, since it\ncould only cause a small performance hit of useless sorting.\n\nFixes #1387\n\nChange-Id: I088a3747ddb3517fbb5e416b791bd542bd49fed2", "idx": 264} {"target": 1, "func": "[PATCH] Added const& for gaining performance", "idx": 63} {"target": 1, "func": "[PATCH] Precalculate pbc shift for analysis nbsearch\n\nInstead of using pbc_dx_aiuc(), precalculate the PBC shift between grid\ncells outside the inner loop when doing grid searching for analysis\nneighborhood searching. In addition to improving the performance, this\nencapsulates another piece of code that needs to be changed to implement\nmore generic grids.\n\nChange-Id: Ifbbe54596f820b01572fe7bb97a5354556a4981d", "idx": 1231} {"target": 1, "func": "[PATCH] USER-DPD: propagate a minor performance bugfix throughout the\n DPDE code\n\nThe fix_shardlow_kokkos.cpp code had already factored out a redundant\nsqrt() calculation in the innermost loop of ssa_update_dpde(). 
This\nchangeset propagates an equivalent optimization to:\n fix_shardlow.cpp\n pair_dpd_fdt_energy.cpp\n pair_dpd_fdt_energy_kokkos.cpp\nThe alpha_ij variable was really just an [itype][jtype] lookup parameter,\nreplacing a sqrt() and two multiplies per interacting particle pair\nby a cached memory read. Even if there isn't much time savings, the\ncode is now consistent across the various versions.", "idx": 751} {"target": 1, "func": "[PATCH] POWER10: Update param.h\n\nIncreasing the values of DGEMM_DEFAULT_P and DGEMM_DEFAULT_Q helps\nin improving performance ~10% for DGEMM.", "idx": 84} {"target": 1, "func": "[PATCH] Removed one star. No more five-star records. This makes the\n mallocs larger, therefore more efficient, and it works around buggy\n malloc library routines.", "idx": 709} {"target": 1, "func": "[PATCH] Issue #2738 More efficient clearing of DaeBuilder cache Wait\n until it's really needed", "idx": 85} {"target": 1, "func": "[PATCH] [ZARCH] Improve loading performance for camax/icamax", "idx": 1140} {"target": 1, "func": "[PATCH] improved performance", "idx": 165} {"target": 1, "func": "[PATCH] Performance bug", "idx": 1521} {"target": 1, "func": "[PATCH] dgemm: Use the skylakex beta function also for haswell\n\nit's more efficient for certain tall/skinny matrices", "idx": 526} {"target": 1, "func": "[PATCH] make ::localize() more efficient, still need to handle\n ::localize_to_one()\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@2242 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "idx": 1328} {"target": 1, "func": "[PATCH] Rewrote the BoxManGatherEntries() code for handling\n AddGraphEntries() to improve performance when using the assumed partition.", "idx": 316} {"target": 1, "func": "[PATCH] more efficient way to display a combinatorial map", "idx": 638} {"target": 1, "func": "[PATCH] performance improvements for Linear JIT CUDA kernels", "idx": 175} {"target": 1, "func": "[PATCH] Optimize atomic accumulation in CUDA NB kernel\n\nAs a 
result of this reorganization of the reduction, the final atomic\naccumulation of the three force components can happen on three threads\nconcurrently. As this can be serviced by the hardware in a single\ninstruction, the optimization improves overall performance by a few %.\nThis also results in fewer shuffle operations.\n\nChange-Id: I29519469b1e1848c026ee5b7a32256440031dbce", "idx": 1211} {"target": 1, "func": "[PATCH] Accelerate L-BFGS with a couple of tricks.\n\n1. Function objective calculation after optimization isn't needed.\n2. minPointIterate isn't actually used anywhere, so get rid of it.\n3. Return best result from line search, not the last result.\n\nAlso I cleaned up a few no-longer-needed sections of code and simplified a few\nlines.", "idx": 611} {"target": 1, "func": "[PATCH] Switch to \"parallel_for\" for cell scan => Better performance,\n in particular with implicit function domain.", "idx": 354} {"target": 1, "func": "[PATCH] Added OPTLD for optimized but slow loading which you want for\n time-critical programs such as mdrun.", "idx": 196} {"target": 1, "func": "[PATCH] Improving performance of right Trsm routines based on\n suggestions from Bryan Marker.", "idx": 981} {"target": 1, "func": "[PATCH] Rationalize HAVE_FMA\n\nDistinguish ARCH_PREFERS_FMA, for architectures that \"naturally\"\nprefer FMA (e.g., powerpc), from ISA_EXTENSION_PREFERS_FMA, for\ninstruction-set extensions that favor FMA where the base architecture\ndoes not (e.g., avx2 on x86).\n\nPreviously, --enable-avx2 would use FMA code for scalar and avx\ncodelets, which is wrong.\n\nThis change improves performance by a few percent on Ryzen (where FMA\ndoesn't really do anything), and is a wash on Haswell.", "idx": 930} {"target": 1, "func": "[PATCH] fvOptions: Changed to be a MeshObject to support automatic\n update for mesh changes\n\nNow cellSetOption correctly handles the update of the cell set following mesh\ntopology changes rather than every time any of the fvOption 
functions are\ncalled for moving meshes. This is more efficient and consistent with the rest\nof OpenFOAM and avoids a lot of unnecessary clutter in the log.", "idx": 1256} {"target": 1, "func": "[PATCH] Use gmx_mtop_t in selections, part 2\n\nUse gmx_mtop_t throughout low-level selection routines, i.e.,\ncenterofmass.cpp, poscalc.cpp, and indexutil.cpp. Adapt test code,\nwhich is now using gmx_mtop_t throughout as well.\n\nIn places where gmx_mtop_t is actually accessed, the changes are as\nlocal as possible. In most cases, some additional restructuring could\ngive better performance and/or much clearer code, but that is outside\nthe scope of this change.\n\nPart of #1862.\n\nChange-Id: Icc99432bddec04a325aef733df56571d709130fb", "idx": 139} {"target": 1, "func": "[PATCH] Call to change_notf->update_all_faces(result, a1, a2), which\n is the notifier function updating all the faces features. This was added\n since the Map overlay should use from now on the Post processing notifier,\n rather than the In processing notifier, since the former is more efficient\n than the latter.", "idx": 473}