{"commit_message": "[PATCH] fem_system_ex2 local->distributed solution\n\nThere doesn't seem to be any way to get this behavior back in reinit()\nwithout creating a performance regression: most applications simply\nnever need to move data in that direction.\n\nRefs #1593.", "target": 0} {"commit_message": "[PATCH] Set cmake build to release for LZ4/Zlib and Zstd\n\nThis enables vectorization which yields a boost in performance\nfor these compressors. #1033\n\nPR #1034\n\n(cherry picked from commit 3db1195e2e8df48ca399b18d0250256443651bda)", "target": 0} {"commit_message": "[PATCH] Tiny libmesh_assert_valid_boundary_ids speedup\n\nThis is actually pretty slow in parallel. Truly making it faster\nwould require lumping multiple verification communications together,\nbut eliminating redundant verification is a start, at a slightly lower\npenalty to readability and usability.", "target": 1} {"commit_message": "[PATCH] minor changes for performance improvement and documentation", "target": 1} {"commit_message": "[PATCH] NumericVector::add_vector refactoring\n\nSimilar to #411 and #413\n\nThis was originally intended to be just another additional T* API plus\na refactoring; however, the new PetscVector::add_vector(DenseVector)\ncode path should be a performance improvement as well.", "target": 1} {"commit_message": "[PATCH] Adding Changelog for Release 2.9.00\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 2.9", "target": 0} {"commit_message": "[PATCH] Fix Kokkos performance regression for small systems", "target": 1} {"commit_message": "[PATCH] Improve performance of computeCoarseClover kernel: split\n triple matrix product into two steps to prevent redundant computation", "target": 1} {"commit_message": "[PATCH] Fix config for PERFORMANCE category build (#3632)\n\nSome test executables have the default category (BASIC), so they\nare not added in a build with \"CATEGORIES PERFORMANCE\". 
Check that\nthese executables exist before setting properties on them.", "target": 0} {"commit_message": "[PATCH] Fix bug in polyline simplification: We had hardwired that we\n use Exact_predicates_tag which is slow for EPEC in particular with\n Quotient or leda::real\n\nWe determine the appropriate tag using Algebraic_structure_traits::Is_exact", "target": 0} {"commit_message": "[PATCH] Some simple memory leak fixes.\n\nAlthough the fixed leaks are not important for performance, the less\nthere are leaks, the more convenient it is to run valgrind leak checking\non parts of the code.", "target": 0} {"commit_message": "[PATCH] Mesh_3: Address TBB performance warning on hashing", "target": 0} {"commit_message": "[PATCH] Added timers to test performance of MPI_File_sync,\n collection, and compression", "target": 0} {"commit_message": "[PATCH] bug fix for fast box intersection in presense of the Infi Box", "target": 0} {"commit_message": "[PATCH] Fixed the bug where dumping distribution file would decrease\n the performance", "target": 1} {"commit_message": "[PATCH] additional pre-compiler command for performance testing", "target": 0} {"commit_message": "[PATCH] Minor tweaks to the DD setup\n\nDynamic load balancing is now turned on when the total performance\nloss is more than 2% (lower than that will not help).\nThe check for large prime factors should be done on the PP node\ncount when -npme is set by the user.\n\nChange-Id: Ib81b56a7cb071540b143a4bfc98758788a8ac07d", "target": 0} {"commit_message": "[PATCH] Misc. Doxygen build system improvements\n\n- Only generate the installed header list once after CMake is run.\n It cannot change unless CMake is run again. 
This wasn't particularly\n slow earlier either, but now it can be added as a dependency also in\n the -fast targets without any impact on the behavior.\n- Do not update the Doxyfile-common each time CMake is run if\n GMX_COMPACT_DOXYGEN=ON.\n- Partition the Markdown pages into subdirectories based on the\n documentation level where they should appear. Exclude things from\n the documentation based on the directory, not individually.\n- Use a CMake function to create the various Doxygen targets to remove\n some duplication.\n- Some cleanup in the directories.cpp and misc.cpp documentation files.\n- Some cleanup to use consistent casing throughout CMakeLists.txt.\n\nChange-Id: I30de6f36841f25260700ec92284762e989f66507", "target": 0} {"commit_message": "[PATCH] Added wrappers for ScaLAPACK QR factorization (and an ability\n to test it from tests/lapack_like/QR) and subsequently revealed a (perhaps\n recently introduced) performance issue in Elemental's QR", "target": 0} {"commit_message": "[PATCH] Minor performance improments\n\nMostly useful as lesson learned.\n\n1) The double precision constant forces the compiler to convert\n the single precision variable to double, then do the multiplication\n in double and then convert back. 
Using the single precsion\n constant in double reduces the accuracy (the calculation is still done\n double but the constant has only single precision).\n2) Using a temporary array instead of a temporary scalar causes ICC14 to\n generate an extra store.\n\nChange-Id: Ib320ac2ae4ff80ce48277544abff468c483cc83a", "target": 1} {"commit_message": "[PATCH] initial caching of AO integrals for triples; SO integrals;\n performance measurement for the triples", "target": 0} {"commit_message": "[PATCH] Fixed up bugs and add performance tests for get and\n accumulate.", "target": 0} {"commit_message": "[PATCH] performance improvemnets that disable erf2 and ssf more\n changes needed", "target": 1} {"commit_message": "[PATCH] MKK: a test case to measure the performance of aggregate\n put/get calls.", "target": 0} {"commit_message": "[PATCH] SIMD acceleration for F_FOURDIHS and F_PIDIHS\n\nSince these are computed by the RB and proper dihedral code for which\nwe already have SIMD acceleration, this acceleration comes \"for free\"\n(or rather, we have been stupid not to accelerate these before).\n\nChange-Id: I456af11c23fe1cb3749a889c5d92ec3ba06ab237", "target": 0} {"commit_message": "[PATCH] temporary changes to use MemoryType::HOST instead of\n HOST_UMPIRE for performance", "target": 0} {"commit_message": "[PATCH] Improve mdrun performance user guide\n\n- Extended GPU update related section;\n- Tweaked the bonded GPU offload related section to reflect the current\n state of the code.\n\nChange-Id: I89ed39750d449df8b273f36ee6530df29ff31b7e", "target": 1} {"commit_message": "[PATCH] Import AMD Piledriver DGEMM kernel generated by AUGEM. So\n far, this kernel doesn't deal with edge.\n\nAUGEM: Automatically Generate High Performance Dense Linear Algebra\nKernels on x86 CPUs.\nQian Wang, Xianyi Zhang, Yunquan Zhang, and Qing Yi. In the\nInternational Conference for High Performance Computing, Networking,\nStorage and Analysis (SC'13). Denver, CO. 
Nov, 2013.", "target": 0} {"commit_message": "[PATCH] Make performance ex1 and ex1p templated on dimension", "target": 0} {"commit_message": "[PATCH] Add the possibility to remove the far points\n\nThe far points are added by the parallel version to reduce the contention\non the infinite vertex", "target": 0} {"commit_message": "[PATCH] use fast insertion functions of arr", "target": 1} {"commit_message": "[PATCH] Made FAST CPU results match CUDA results", "target": 0} {"commit_message": "[PATCH 1/8] use dot", "target": 0} {"commit_message": "[PATCH] Improved CUDA SIFT coalescing and performance", "target": 1} {"commit_message": "[PATCH] mesh::data: Use a DynamicList for the performance data for\n efficiency", "target": 1} {"commit_message": "[PATCH] accelerate distance queries for offset meshing", "target": 1} {"commit_message": "[PATCH] added an efficient code path for plain leap-frog update", "target": 0} {"commit_message": "[PATCH] Fixed an uninitialized memory read problem and a potential\n performance issue for problems in less than 3 dimensions.", "target": 1} {"commit_message": "[PATCH] Add a performance warning when a dynamic property map is used\n as index map", "target": 0} {"commit_message": "[PATCH] fixed nbnxn x86 SIMD non-bonded performance regression\n\nCommit f40969c2 broke the LJ combination detection,\nwhich effectively made all runs use the full combination rule\nmatrix for x86 SIMD kernels. This is now corrected.\n\nChange-Id: I1073801546fde23e6a53199120246697a7c61b5f", "target": 1} {"commit_message": "[PATCH] more efficient version, which only searches when necessary\n and only once per element", "target": 1} {"commit_message": "[PATCH] thermophysicalModels: Added caching of Cp and Cv for\n efficiency\n\nIn multi-specie systems the calculation of Cp and Cv is relatively expensive due\nto the cost of mixing the coefficients of the specie thermo model, e.g. 
JANAF,\nand it is significantly more efficient if these are calculated at the same time\nas the rest of the thermo-physical properties following the energy solution.\n\nAlso the need for CpByCpv is also avoided by the specie thermo providing the\nenergy type in the form of a boolean 'enthalpy()' function which returns true if\nthe energy type is an enthalpy form.", "target": 1} {"commit_message": "[PATCH] Performance test for memory pool. Update v2 API to better\n align with v1.", "target": 0} {"commit_message": "[PATCH] Accelerate CSR->Ell,Hybrid conversions on CUDA.\n\n+ The previous grid dimensions for `initialize_zero_ell` were `stride *\n num_rows`, i.e. roughly the dense matrix dimension.\n+ Using `max_nnz_per_row * num_rows` reduces significantly the amount of threads\n created which makes this kernel call more efficient (less useless thread\n creation).", "target": 1} {"commit_message": "[PATCH] Fix backwards atomic defaults\n\nThanks @stanmoore1 for reporting this.\nThis doesn't affect performance results\nbecause those tests don't use the defaults.", "target": 0} {"commit_message": "[PATCH] rPolynomial: New equation of state for liquids and solids\n\nDescription\n Reciprocal polynomial equation of state for liquids and solids\n\n \\f[\n 1/\\rho = C_0 + C_1 T + C_2 T^2 - C_3 p - C_4 p T\n \\f]\n\n This polynomial for the reciprocal of the density provides a much better fit\n than the equivalent polynomial for the density and has the advantage that it\n support coefficient mixing to support liquid and solid mixtures in an\n efficient manner.\n\nUsage\n \\table\n Property | Description\n C | Density polynomial coefficients\n \\endtable\n\n Example of the specification of the equation of state for pure water:\n \\verbatim\n equationOfState\n {\n C (0.001278 -2.1055e-06 3.9689e-09 4.3772e-13 -2.0225e-16);\n }\n \\endverbatim\n Note: This fit is based on the small amount of data which is freely\n available for the range 20-65degC and 1-100bar.\n\nThis equation 
of state is a much better fit for water and other liquids than\nperfectFluid and in general polynomials for the reciprocal of the density\nconverge much faster than polynomials of the density. Currently rPolynomial is\nquadratic in the temperature and linear in the pressure which is sufficient for\nmodest ranges of pressure typically encountered in CFD but could be extended to\nhigher order in pressure and/temperature if necessary. The other huge advantage\nin formulating the equation of state in terms of the reciprocal of the density\nis that coefficient mixing is simple.\n\nGiven these advantages over the perfectFluid equation of state the libraries and\ntutorial cases have all been updated to us rPolynomial rather than perfectFluid\nfor liquids and water in particular.", "target": 0} {"commit_message": "[PATCH] Read tiles: fixing preallocation size for var and validity\n buffers. (#2781) (#2782)\n\nPreallocation size for var buffer and validity buffer was not using the\ncorrect size, which will have a performance impact.", "target": 0} {"commit_message": "[PATCH] Sparse refactored readers: disable filtered buffer tile\n cache.\n\nFrom tests, it's been found that writing the cache for the filter\npipeline takes a significant amount of time for the tile unfiltering\noperation. For example, 2.25 seconds with and 1.88 seconds without in\nsome cases. 
The cache improved performance before multi-range subarrays\nwere implemented, so dropping it is fine at least for the refactored\nreaders.", "target": 1} {"commit_message": "[PATCH] TurbulenceModels::kOmegaSST.*: Updated source-terms and\n associated functions to use volScalarField::Internal\n\nThis is more efficient, avoids divide-by-0 when evaluating unnecessary\nboundary values and avoids unnecessary communications when running in parallel.", "target": 1} {"commit_message": "[PATCH] fixing performance of nwpw_gauss_weights...EJB", "target": 1} {"commit_message": "[PATCH] Modified waterChannel tutorial to make case better posed\n Existing case did not properly converge and suffered slow convergence with\n the water level failing to reach an equilibrium. A slight rise in the\n channel appears to help the water level reach an equlibrium when the flow\n rate over the rise matches the inlet flow rate.", "target": 0} {"commit_message": "[PATCH] Latest modifications that increased the parallel performance", "target": 1} {"commit_message": "[PATCH] Tensor, SymmTensor: Simplified invariantII\n\nNow the calculation of the 2nd-invariant is more efficient and\naccumulates less round-off error.", "target": 1} {"commit_message": "[PATCH] Isolate PME GPU spline parameter indexing in inline functions\n\nThis makes the spread/gather kernels more readable and allows\nto change the spline indexing scheme much easier in\nthe future. Performance should not be affected.\nMore TODO regarding the spline indexing scheme is marked.\n\nChange-Id: If735cccf2ce82f46b483c9ada6f309425c51f67e", "target": 0} {"commit_message": "[PATCH] Modified Newton Raphson method added and some basic timers A\n modified newton raphson method has been added in place of the old pure newton\n raphson. 
It's nothing fancy it just sets the relaxation factor for the next\n time step to 0.5 if it notes that the ratio in the norm of the current\n residual and the previous residual hasn't decreased by a factor of 10 or\n greater. The relaxation factor is set to 1.0 if it is converging fast enough.\n The main motivation behind this is to push the solution in the right\n direction when it starts to oscillate the actual answer. Next, a few basic\n timers have been set around the entire solution set to provide some very\n basic profiling for how long each time step takes to solve.", "target": 0} {"commit_message": "[PATCH] changed the ndelta to setting the number of cells per cut-off\n radius to per cut-off diameter and changed the default value to 3 in\n do_inputrec, this gives at most 64 cells per icg iso 125, this gives a few\n percent performance increase in ns and domain decomposition (the cg sorting)", "target": 0} {"commit_message": "[PATCH] Request flushing denorms to zero in OpenCL\n\nThis change adds by default the -cl-denorms-are-zero to the flags used\nfor kernel compilation. This is done to:\n- avoid a large performance penalty on AMD Vega with ROCm (which by\n default handles denorms on GFX9 or later).\n- make the defaults uniform across CUDA and OpenCL.\n\nFixes #2593\n\nChange-Id: I9e6183c4367b5960e0e21f1dd342d7695acfbc44", "target": 0} {"commit_message": "[PATCH] Adding a block diagonal preconditioner in the serial example\n to improve solver performance", "target": 1} {"commit_message": "[PATCH] Added FAST unit tests.", "target": 0} {"commit_message": "[PATCH] Creating a boolean operations object from map overlay. It\n gives the user the possibility to create the boolean oepration object with\n the walk along aline point-location, which is more efficient when using\n sweep-line.", "target": 1} {"commit_message": "[PATCH] Fixed performance figures. 
TODO: new graphs", "target": 0} {"commit_message": "[PATCH] Improved GB performance in mixed precision", "target": 1} {"commit_message": "[PATCH] Added const& for gaining performance", "target": 1} {"commit_message": "[PATCH] Backport bug-fix from next: |\n ------------------------------------------------------------------------ |\n r65327 | odevil | 2011-09-06 17:21:27 +0200 (Tue, 06 Sep 2011) | 1 line |\n Changed paths: | M\n /branches/next/Triangulation_2/include/CGAL/Triangulation_2.h | | too\n conservative check removed, for fast removal in delaunay 2d |\n ------------------------------------------------------------------------", "target": 0} {"commit_message": "[PATCH] Handle a case where the step size ends up being 0 but the\n gradient is not yet the minimum gradient size. Maybe it is a little slow\n (elementwise comparison is not incredibly fast)...", "target": 0} {"commit_message": "[PATCH] This should slightly improve performance in non-threaded proj\n constraint generation, may save us from a race condition leading to\n inaccurate proj constraints in a few corner cases (3D, level one rule off; or\n AMR combined with periodic BCs) when we're threaded.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@5889 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 1} {"commit_message": "[PATCH] Final bugs fixed, now have to look at the iterator\n performance for the subtable.", "target": 0} {"commit_message": "[PATCH] Added support of IBM's MASS library that optimizes\n performance on Power architectures", "target": 1} {"commit_message": "[PATCH] changed dim_type typedef to int from long long\n\nUsing long long for indexing data type is causing drastic\nperformance drop for CUDA and OpenCL kernels. 
Hence, changing the\ntypedef to point to int.", "target": 1} {"commit_message": "[PATCH] Task assignment for bonded interactions on CUDA GPUs\n\nMade a query function to find whether any interactions of supported\ntimes exist in the global topology, so that we can make efficient\nhigh-level decisions.\n\nAdded free for gpuBondedLists pointer.\n\nMinor cleanup in manage-threading.h\n\nFixes #2679\n\nChange-Id: I0ebbbd33c2cba5808561111b0ec6160bfd2f840d", "target": 0} {"commit_message": "[PATCH] Encapsulate code in ifdef NUMPY clauses. Efficient pythoncode\n for toArray.", "target": 1} {"commit_message": "[PATCH] Add support for Hygon Dhyana processor\n\nThis change adds hardware detection and related task assignment\nheuristics support for the Hygon Dhyana CPUs.\n\nChengdu Haiguang IC Design Co., Ltd (Hygon) is a Joint Venture\nbetween AMD and Haiguang Information Technology Co.,Ltd., aims\nat providing high performance x86 processor for China server\nmarket. Its first generation processor codename is Dhyana, which\noriginates from AMD technology and shares most of the architecture\nwith AMD's family 17h, but with different CPU Vendor ID (\"HygonGenuine\")\n/Family series number (Family 18h).\n\nMore details can be found on:\nhttp://lkml.kernel.org/r/5ce86123a7b9dad925ac583d88d2f921040e859b.1538583282.git.puwen@hygon.cn\n\nChange-Id: Ic91b032e69dfc13abad3fbfe6ab5e4f0e57fc7c0", "target": 0} {"commit_message": "[PATCH] Create a AVX512 enabled version of DGEMM\n\nThis patch adds dgemm_kernel_4x8_skylakex.c which is\n* dgemm_kernel_4x8_haswell.s converted to C + intrinsics\n* 8x8 support added\n* 8x8 kernel implemented using AVX512\n\nPerformance is a work in progress, but already shows a 10% - 20%\nincrease for a wide range of matrix sizes.", "target": 0} {"commit_message": "[PATCH] more efficient usage of mutex. The lock is only done if the\n build need to be done. 
We have an extra \"if (m_need_build)\" but otherwise we\n would need to use mutex::try_lock() which results in more code and as\n efficient.", "target": 1} {"commit_message": "[PATCH] resolved performance degrading changed introduced in revision\n 1319 (3)", "target": 1} {"commit_message": "[PATCH] performance update ....EJB", "target": 0} {"commit_message": "[PATCH] Fixed FAST memory leaks on CPU backend", "target": 1} {"commit_message": "[PATCH] fix for sigfpe (underflow) in ECPs: NWints/ecp is compiled\n with \"-math_library accurate\" instead of \"-math_library fast\"", "target": 1} {"commit_message": "[PATCH] Performance improvements for find_*_neighbors, and a new\n find_point_neighbors version for finding neighbors at just one point\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@4557 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 1} {"commit_message": "[PATCH] VFS Read-Ahead Cache (#1785)\n\nThis introduces a read-ahead cache within VFS::read(). This is an LRU cache\nthat maintains a single cached buffer for an arbitrary number unique URIs, not\nto exceed 10MiB (by default). Each cached buffer has a max size of 100KiB (by\ndefault). These parameters can be tweaked with the following config items:\n\n`vfs.read_ahead_size` (100KiB default)\n`vfs.read_ahead_cache_size` (10Mib default)\n\nThe motiviation for this patch is to optimize IO patterns of small, relatively\nsequential reads against cloud storage backends. Only the S3, Azure, and GCS\nbackends utilize this read cache. The POSIX/Windows/HDFS backends are\nunaffected by this patch.\n\nBoth performing and caching the read-ahead incur a performance penalty:\n1. We must read more than the requested bytes.\n2. We must make a copy of the read buffer (one to store in the cache, one to\nreturn to the user).\n\nWe will only perform a read-ahead if the requested read is smaller than the\ndefault 100KB cached buffer size. 
IO patterns of large reads will be unaffected.\nThe assumption is that fragment data is large and that reading fragment\ndata will not incur a performance penalty. Additionally, reads to tile data\nare bypassed because tiles have their own separate tile cache.\n\nOn the recent S3 workload we've been discussing, this read cache has a 78%\nhit rate, where every cache hit is in the fragment metadata. I've observed a\nbest-case runtime of 6.5s with this patch, and a 27s runtime without this\npatch.\n\nCo-authored-by: Joe Maley ", "target": 1} {"commit_message": "[PATCH] it seems to work comments The armijo rule is really slow The\n statistical update of sigmas is good We need to stay sometime in low sigmas\n otherwise we do not optimize the trace Next setp is BFGS method", "target": 0} {"commit_message": "[PATCH] Keep track of old dof_indices distribution between\n processors; this makes it easier to construct an efficient send_list in\n System::project_vector.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@3796 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 0} {"commit_message": "[PATCH] Code clean up in FAST and ORB for all backends\n\n- C++ features class no longer used inside backends", "target": 0} {"commit_message": "[PATCH] POWER10: Update param.h\n\nIncreasing the values of DGEMM_DEFAULT_P and DGEMM_DEFAULT_Q helps\nin improving performance ~10% for DGEMM.", "target": 1} {"commit_message": "[PATCH] Issue #2738 More efficient clearing of DaeBuilder cache Wait\n until it's really needed", "target": 1} {"commit_message": "[PATCH] remove int to bool conversion performance warning with VC", "target": 0} {"commit_message": "[PATCH] Compile using g++ on OSX\n\nThe Accelerate framework requires -flax-vector-conversions to\ncompile on g++", "target": 0} {"commit_message": "[PATCH] Improving the performance, 10-20% slower than\n Triangulation_2, not removing the initial vertices", "target": 1} {"commit_message": "[PATCH] Added single-accuracy SIMD double 
math functions\n\nApart from double SIMD variables typically being\nhalf the width of single, the math functions are\nconsiderably more expensive due to higher-order\npolynomials, which can drop the throughput to 25%\nof single. In some cases we do not need the full\ndouble precision in SIMD operations, so these\nnew math functions use double precision\nSIMD variables but only target single precision\naccuracy, which can improve performance twofold.\nThe patch also makes the target precision in\nsingle and double SIMD an advanced CMake variable,\nand the unit test tolerance is set based on these\nvariables. This can be used (decided by the user)\nfor a few platforms where the rsqrt/inv table\nlookups provide one bit too little to get by\nwith a single N-R iteration based on our default\ntarget accuracy of 22 bits.\n\nChange-Id: Id4b1c7800e16cb0eb3d564e89a368b4db6eede3e", "target": 1} {"commit_message": "[PATCH] Fix IBM VSX SIMD compiles with xlc\n\nRemove most of the previous inline asm to improve\nperformance (the optimizer works better w/o asm),\nand make sure the VSX SIMD code compiles with XLC.\n\nChange-Id: I3e8e9b4dd6102dd5503210e3b49b844ee5492342", "target": 1} {"commit_message": "[PATCH] added a hybrid method that will first try to solve with\n diagonal scaled CG and then if convergence is too slow , attempt to use\n BoomerAMG", "target": 0} {"commit_message": "[PATCH] Working on the umutual2b kernel, the tdipdip values are\n computed on the fly for now, maybe a seprate neigh list as in the CPU version\n will be more efficient", "target": 0} {"commit_message": "[PATCH] Make stepWorkload.useGpuXBufferOps flag consistent\n\nOn search steps we do not use x buffer ops, so the workload flag should\ncorrectly reflect that.\n\nAlso slightly refactored a conditional block to clarify the scope of\nworkload flags.\n\nNote that as a side-effect of this change, coordinate H2D copy will be\ndelayed from the beginning of do_force() to just before update on search\nsteps 
when there are no force tasks that require it (i.e. without PME).\nWhile this is not ideal for performance, the code is easier to reason\nabout.\n\nRefs #3915 #3913 #4268", "target": 0} {"commit_message": "[PATCH] TCE GPU: add environment varibale NWC_OFFLOAD_SPAN\n\nThe NWC_OFFLOAD_SPAN environment variable is meant to control which GA\nranks are offloading to GPUs. When setting the variable, every\nNWC_OFFLOAD_SPANth rank (staring at rank 0) will be offloading its work\nto a GPU.\n\nThe current solution is a place holder for the final implementation that\nstill needs to be figured out with actual performance tests.", "target": 0} {"commit_message": "[PATCH] exclude slow macos mpich steps", "target": 1} {"commit_message": "[PATCH] Add BoundaryVolumeSolutionTransfer class.\n\nThis class can be used for transferring solutions between the surface\nof a volume mesh and the BoundaryMesh associated with that surface.\nThis is joint work with with Xikai Jiang from Argonne National\nLaboratory (@xikaij) who is the original author of the transfer.\n\nSee also: X. Jiang et al., \"An O(N) and parallel approach to integral\nproblems by a kernel-independent fast multipole method: Application to\npolarization and magnetization of interacting particles,\" The Journal\nof Chemical Physics vol. 
145, 064307, http://dx.doi.org/10.1063/1.4960436.", "target": 0} {"commit_message": "[PATCH] Accelerate GMM test by reducing number of EM iterations.", "target": 0} {"commit_message": "[PATCH] Switch to unordered_multimap in UNVIO.\n\nAfter trying out a few different container types, it was determined\nthat the cost of building every side (in order to search for it in the\nmultiset) killed the performance of the unordered_set method,\nand that the unordered_multimap approach was superior to a standard\nmultimap across a range of mesh sizes (below, there are 6*N^2 sides to\nsearch through in the container in question).\n\nN unordered_set multimap unordered_multimap\n10 0.0012 0.0006 0.0005\n15 0.0049 0.0021 0.0018\n20 0.0102 0.0055 0.0058\n25 0.0201 0.0118 0.0062\n30 0.0415 0.0288 0.0126\n35 0.0579 0.0377 0.0169\n40 0.0946 0.0612 0.0281\n45 0.1467 0.0996 0.0426\n50 0.1929 0.1213 0.0499\n55 0.2662 0.1688 0.0716\n60 0.4040 0.2250 0.1018\n65 0.5981 0.3388 0.1376\n100 2.2838 1.5329 0.6257", "target": 0} {"commit_message": "[PATCH] more efficient comparison function", "target": 1} {"commit_message": "[PATCH] In OpenMP threading, preallocate the thread buffer instead of\n allocating the buffer every time. This patch improved the performance\n slightly.", "target": 1} {"commit_message": "[PATCH] Don't use Boost.Operators for +-*/ of Gmpq.\n\nIt shouldn't change the performance significantly (the time is spent in\nmalloc/free and the mpq_* calls), but at least I can follow the\n(smaller) generated code.", "target": 0} {"commit_message": "[PATCH] tuned performance of script; added linked index", "target": 0} {"commit_message": "[PATCH] normalize_border is there only for performance", "target": 0} {"commit_message": "[PATCH] Updated article reference. Added configuration flags to\n support altivec on powerpc. Updated configuration files from gnu.org Removed\n the fast x86 truncation when double precision was used - it would just crash\n since the registers are 32bit only. 
Added powerpc altivec innerloop support\n to fnbf.c Removed x86 assembly truncation when double precision is used - the\n registers can only handle 32 bits. Dont use general solvent loops when we\n have altivec support in the normal ones. Added workaround for clash with\n altivec keyword. Added workaround for clash with altivec keyword.", "target": 0} {"commit_message": "[PATCH] moved diff, fast, gradient, harris, histogram to kernel\n namespace", "target": 0} {"commit_message": "[PATCH] Fixing a performance bug in trsm_[LR].c.", "target": 1} {"commit_message": "[PATCH] Demo updates, (I added a \"paint-like smoothing\" feature but I\n am going to remove it since real-time smoothing is not efficient due to AABB\n reconstruction.", "target": 0} {"commit_message": "[PATCH] Fixed FAST unit tests", "target": 0} {"commit_message": "[PATCH] Optimize the performance of rot by using universal intrinsics", "target": 1} {"commit_message": "[PATCH] Added performance benchmark", "target": 0} {"commit_message": "[PATCH] added performance figures in the doc", "target": 0} {"commit_message": "[PATCH] Add CUDA nvcc >=7.0 support\n\nWith CUDA 7.x, there is a few % performance benefit to using sm_52\narch as target instead of JIT-ed compute_50, mostly relevant with\nthe newly released v7.5 (as v7.0 has other regressions which make it\nslower).\n\nThis change adds a single new target architecture (5.2) and changes\nthe virtual architecture included in the binary from 5.0 to 5.2 with\nnew enough nvcc to make 5.1.x versions future-proof when new hardware is\nreleased.\n\nChange-Id: I062cc48a151da3ab15b0508f4ebd59d95880ae9a", "target": 0} {"commit_message": "[PATCH] Added support for (arbitrary) high-order Nedelec finite\n element discretizations in AMS.\n\nSeems to be working quite well on hex meshes and reasonably well on tet meshes,\ncorrelating with the performance of BoomerAMG on the associated high-order nodal\nproblems.", "target": 0} {"commit_message": "[PATCH] added benchmark 
comparing the performance of the two traits\n classes (CK vs CORE::Expr)", "target": 0} {"commit_message": "[PATCH] Fixed OpenMP compile and added a tux regression test\n\nThe MGR OpenMP code has not been tested yet, so I commented it out for now\nAdded a tux OpenMP compile test and reorganized the tux tests from fast to slow (more or less)", "target": 0} {"commit_message": "[PATCH] s390x: allow clang to emit fused multiply-adds (replicates\n gcc's default behavior)\n\ngcc's default setting for floating-point expression contraction is\n\"fast\", which allows the compiler to emit fused multiply adds instead of\nseparate multiplies and adds (amongst others). Fused multiply-adds,\nwhich assembly kernels typically apply, also bring a significant\nperformance advantage to the C implementation for matrix-matrix\nmultiplication on s390x. To enable that performance advantage for builds\nwith clang, add -ffp-contract=fast to the compiler options.\n\nSigned-off-by: Marius Hillenbrand ", "target": 0} {"commit_message": "[PATCH] #1295 Refactored Matrix::setSub(IMatrix,IMatrix) The new\n implementation should be much more efficient and handle non-monotone indices\n correctly", "target": 1} {"commit_message": "[PATCH] combustionModels::EDC: New Eddy Dissipation Concept (EDC)\n turbulent combustion model\n\nincluding support for TDAC and ISAT for efficient chemistry calculation.\n\nDescription\n Eddy Dissipation Concept (EDC) turbulent combustion model.\n\n This model considers that the reaction occurs in the regions of the flow\n where the dissipation of turbulence kinetic energy takes place (fine\n structures). The mass fraction of the fine structures and the mean residence\n time are provided by an energy cascade model.\n\n There are many versions and developments of the EDC model, 4 of which are\n currently supported in this implementation: v1981, v1996, v2005 and\n v2016. 
The model variant is selected using the optional \\c version entry in\n the \\c EDCCoeffs dictionary, \\eg\n\n \\verbatim\n EDCCoeffs\n {\n version v2016;\n }\n \\endverbatim\n\n The default version is \\c v2015 if the \\c version entry is not specified.\n\n Model versions and references:\n \\verbatim\n Version v2005:\n\n Cgamma = 2.1377\n Ctau = 0.4083\n kappa = gammaL^exp1 / (1 - gammaL^exp2),\n\n where exp1 = 2, and exp2 = 2.\n\n Magnussen, B. F. (2005, June).\n The Eddy Dissipation Concept -\n A Bridge Between Science and Technology.\n In ECCOMAS thematic conference on computational combustion\n (pp. 21-24).\n\n Version v1981:\n\n Changes coefficients exp1 = 3 and exp2 = 3\n\n Magnussen, B. (1981, January).\n On the structure of turbulence and a generalized\n eddy dissipation concept for chemical reaction in turbulent flow.\n In 19th Aerospace Sciences Meeting (p. 42).\n\n Version v1996:\n\n Changes coefficients exp1 = 2 and exp2 = 3\n\n Gran, I. R., & Magnussen, B. F. (1996).\n A numerical study of a bluff-body stabilized diffusion flame.\n Part 2. Influence of combustion modeling and finite-rate chemistry.\n Combustion Science and Technology, 119(1-6), 191-217.\n\n Version v2016:\n\n Use local constants computed from the turbulent Da and Re numbers.\n\n Parente, A., Malik, M. R., Contino, F., Cuoci, A., & Dally, B. B.\n (2016).\n Extension of the Eddy Dissipation Concept for\n turbulence/chemistry interactions to MILD combustion.\n Fuel, 163, 98-111.\n \\endverbatim\n\nTutorials cases provided: reactingFoam/RAS/DLR_A_LTS, reactingFoam/RAS/SandiaD_LTS.\n\nThis codes was developed and contributed by\n\n Zhiyi Li\n Alessandro Parente\n Francesco Contino\n from BURN Research Group\n\nand updated and tested for release by\n\n Henry G. 
Weller\n CFD Direct Ltd.", "target": 0} {"commit_message": "[PATCH] Improve the performance of rot by using AVX512 and AVX2\n intrinsic", "target": 1} {"commit_message": "[PATCH] reactingEulerFoam: Un-templated interface composition models\n\nThe recent field-evaluation additions to basicSpecieMixture means that\nthe interface composition models no longer need knowledge of the\nthermodynamic type in order to do efficient evaluation of individual\nspecie properties, so templating on the thermodynamics is unnecessary.\nThis greatly simplifies the implementation.", "target": 0} {"commit_message": "[PATCH] issue #1871, #1310: inv efficient for SX, evaluatable for MX", "target": 0} {"commit_message": "[PATCH] Auto cache compiled CUDA kernels on disk to speed up\n compilation (#2848)\n\n* Adds CMake variable AF_CACHE_KERNELS_TO_DISK to enable kernel caching. It is turned ON by default.\n* cuda::buildKernel() now dumps cubin to disk for reuse\n* Adds cuda::loadKernel() for loading cached cubin files\n* cuda::loadKernel() returns empty kernel on failure\n* Uses XDG_CACHE_HOME as cache directory for Linux\n* Adds common::deterministicHash() - This uses the FNV-1a hashing algorithm for fast and reproducible hashing of string or binary data. This is meant to replace the use of std::hash in some place, since std::hash does not guarantee its return value will be the same in subsequence executions of the program.\n* Write cached kernel to temporary file before moving into final file. 
This prevents data races where two threads or two processes might write to the same file.\n* Uses deterministicHash() for hashing kernel names and kernel binary data.\n* Adds kernel binary data file integrity check upon loading from disk", "target": 0} {"commit_message": "[PATCH] default to 1 thread, make the user manually specify the\n number of threads to avoid mpi/thread contention\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@2673 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 0} {"commit_message": "[PATCH] MeshCommunication new_nodes methods\n\nThese should be more efficient in many cases and may be necessary for\ncorrectness in others.", "target": 1} {"commit_message": "[PATCH] Correct Pelleg-Moore prunes that finish a node. There were\n cases where a Pelleg-Moore prune would happen before committing the point.\n This is actually getting pretty fast in terms of base cases, so I am happy\n with that (for once).", "target": 0} {"commit_message": "[PATCH] Refactor parallel_algebra.h\n\nWe can simplify our derived types slightly by noting that the\nLIBMESH_DIM entries are homogenous.\n\nWe should wrap MPI calls in the new error checking macro.\n\nWe should commit intermediate MPI types; failing to do so is\ninfinitesimally more efficient and seems to work in practice, but the\nerror descriptions in the resize docs I've read suggest that using an\nuncommitted type may not be strictly allowed.", "target": 1} {"commit_message": "[PATCH] Introduce self-pairs search in nbsearch\n\nMake it possible to search for all pairs within a single set of\npositions using AnalysisNeighborhood. 
This effectively excludes half of\nthe pairs from the search, speeding things up.\n\nNot used yet anywhere, but this makes the code a better reference for\nperformance comparisons, and for places where this is applicable it has\npotential for speeding things up quite a bit.\n\nChange-Id: Ib0e6f36460b8dbda97704447222c864c149d8e56", "target": 1} {"commit_message": "[PATCH] Cache FEMContext::point_value() FE Objects\n\nThis really ought to be redone entirely, but hopefully we can get a\nlittle performance improvement right away by just avoiding de and\nreallocations.", "target": 1} {"commit_message": "[PATCH] Add Pelleg-Moore type prune. This improves performance -- at\n least a bit.", "target": 1} {"commit_message": "[PATCH] Added the option to set a close_to_point tolerance in\n PointLocatorTree and MeshFunction, as a fallback option.\n\nThis is helpful when we have numerical tolerance issues which can lead to points being outside a mesh within\nsome tolerance. Note that we use a linear search with close_to_point, so it's slow, but at least it's a robust\nbackup.", "target": 0} {"commit_message": "[PATCH] driver: more reasonable thread wait timeout on Windows.\n\nIt used to be 5ms, which might not be long enough in some cases for the\nthread to exit well, but then when set to 5000 (5s), it would slow down\nany program depending on OpenBlas.\n\nLet's just set it to 50ms, which is at least 10 times longer than\noriginally, but still reasonable in case of failed thread termination.", "target": 0} {"commit_message": "[PATCH] More getElementByMass performance improvements\n\nReduces time taken from 15.7 seconds (last commit) to 2.2 seconds by avoiding\nQuantity comparisons altogether.", "target": 1} {"commit_message": "[PATCH] Rebuild the OpenMP runtime library with Clang\n\nTo get rid of the warning:\n\nclang-9: warning: No library 'libomptarget-nvptx-sm_70.bc' found in the default\nclang lib directory or in LIBRARY_PATH. 
Expect degraded performance due to no\ninlining of runtime functions on target devices. [-Wopenmp-target]", "target": 0} {"commit_message": "[PATCH] Use block size stepping of 16 in new dslash kernels: improves\n performance of exterior-x kernels and 5 dimensional stencils. Also, set max\n shared bytes to full dyanmic limit, since this gives a small improvement on\n Volta", "target": 1} {"commit_message": "[PATCH] Improvement of rectangular slicing: part 1 - memory efficient\n formulation", "target": 1} {"commit_message": "[PATCH] ARM64: Improve DAXPY for ThunderX2\n\nImprove performance of DAXPY for ThunderX2\nwhen the vector fits in L1 Cache.", "target": 1} {"commit_message": "[PATCH] Add TaskDAG performance test to CMake", "target": 0} {"commit_message": "[PATCH] QUDA: Overall of BLAS, better tuning, bugs fixed, fixed\n performance regressions, support for 32 way reductions, removed bank\n conflicts from complex and triple reductions\n\ngit-svn-id: http://lattice.bu.edu/qcdalg/cuda/quda@1121 be54200a-260c-0410-bdd7-ce6af2a381ab", "target": 1} {"commit_message": "[PATCH] Use gmx_mtop_t in selections, part 2\n\nUse gmx_mtop_t throughout low-level selection routines, i.e.,\ncenterofmass.cpp, poscalc.cpp, and indexutil.cpp. Adapt test code,\nwhich is now using gmx_mtop_t throughout as well.\n\nIn places where gmx_mtop_t is actually accessed, the changes are as\nlocal as possible. In most cases, some additional restructuring could\ngive better performance and/or much clearer code, but that is outside\nthe scope of this change.\n\nPart of #1862.\n\nChange-Id: Icc99432bddec04a325aef733df56571d709130fb", "target": 1} {"commit_message": "[PATCH] Example now works in parallel. 
The solver is still slow but\n it works", "target": 0} {"commit_message": "[PATCH] Very simple change to get ~20% performance improvement in\n reading Amber prmtop files.", "target": 1} {"commit_message": "[PATCH] Replace vpermpd with vpermilpd in the Haswell DTRMM kernel\n\nto improve performance on AMD Zen (#2180) applying wjc404's improvement of the DGEMM kernel from #2186", "target": 1} {"commit_message": "[PATCH] AABB tree: enrich performance section with a summary (general\n comments and advices about how to put the tree at work with good\n performances). this is not exhaustive nor conclusive of course but I believe\n a documentation must also tell the obvious.", "target": 0} {"commit_message": "[PATCH] Removed unnecessary synchronization that hurt performance on\n Nvidia", "target": 1} {"commit_message": "[PATCH] AABB tree: added internal search tree (CGAL K-orth search\n tree) to accelerate the projection queries. substantial complication... may\n be improved.", "target": 0} {"commit_message": "[PATCH] gemini performance updates for strided gets; code in the\n ARMCI_Get/Put/Acc subroutines should be incorporated into the ARMCI_Nb\n equivalent routines", "target": 0} {"commit_message": "[PATCH] The split source is unnecessary and slow on some machines.", "target": 0} {"commit_message": "[PATCH] HvD: Adjusted the these scripts removing the use of\n \"svnversion\" and replacing it with \"svn info | grep Revision:\". Svnversion\n works out exactly what revisions are contributing to your source code. While\n it is accurate it also takes quite some time to work this. On systems with\n slow disk access it can take 15 minutes or so. Svn info by contrast only\n checks the revision of the current directory. Hence it is much faster but not\n so accurate. However, for the source code distributions we generate we take a\n clean copy of the repository anyway and then the svn info result must match\n svnversion. 
So this should be good enough and much faster.", "target": 1} {"commit_message": "[PATCH] Fixed up some bugs in the host to device accumulate for same\n process communication in nb_accv and added operations to support improved\n performance of scatter operation within same SMP node.", "target": 1} {"commit_message": "[PATCH] Fix issue with HWLOC + OpenMP on XeonPhi\n\nUsing omp_get_max_threads(); is problematic in conjunction with\nHwloc on Intel (essentially an initial call to the OpenMP runtime\nwithout a parallel region before will set a process mask for a single core\nThe runtime will than bind threads for a parallel region to other cores on the\nentering the first parallel region and make the process mask the aggregate of\nthe thread masks. The intend seems to be to make serial code run fast, if you\ncompile with OpenMP enabled but don't actually use parallel regions or so", "target": 0} {"commit_message": "[PATCH] some improvements in performance on it2 (previous name of\n routine was hferi.f)", "target": 1} {"commit_message": "[PATCH] USER-DPD: performance optimizations to ssa_update() in\n fix_shardlow Overall improvements range from 2% to 18% on our benchmarks 1)\n Newton has to be turned on for SSA, so remove those conditionals 2) Rework\n the math in ssa_update() to eliminate many ops and temporaries 3) Split\n ssa_update() into two versions, based on DPD vs. DPDE 4) Reorder code in\n ssa_update_*() to reduce register pressure", "target": 1} {"commit_message": "[PATCH] slow DGOP...EJB", "target": 0} {"commit_message": "[PATCH] basicThermo: Cache thermal conductivity kappa rather than\n thermal diffusivity alpha\n\nNow that Cp and Cv are cached it is more convenient and consistent and slightly\nmore efficient to cache thermal conductivity kappa rather than thermal\ndiffusivity alpha which is not a fundamental property, the appropriate form\ndepending on the energy solved for. 
kappa is converted into the appropriate\nthermal diffusivity for the energy form solved for by dividing by the\ncorresponding cached heat capacity when required, which is efficient.", "target": 1} {"commit_message": "[PATCH] KokkosContainers: Mark perf tests as CATEGORY PERFORMANCE\n\nFix https://github.com/kokkos/kokkos/issues/374 by marking\nKokkosContainers' performance tests as CATEGORY PERFORMANCE. They\nwill always build, but they will only run when doing performance\ntests.", "target": 0} {"commit_message": "[PATCH] nishant, added efficient jackknifed m spacing estimator", "target": 0} {"commit_message": "[PATCH] Remove selective unfiltering. (#2410)\n\nFrom what we know about our customer's use case, they either will read\nthe full array where this gains nothing (and actually this affects the\nperformance negatively because of the added code complexity), or very\ntargeted ones, for which the gains would be very small, unless the tile\ncapacity/extent is unusually large if the users have a poorly configured\narray.\n\nAlso, this will allow us to implement compression codec for video and\nimaging, which won't work with selective unfiltering.", "target": 0} {"commit_message": "[PATCH] Final performance tuning on the dense version of the\n algorithm", "target": 0} {"commit_message": "[PATCH] - Be more storage efficient, and general clean up.", "target": 1} {"commit_message": "[PATCH] Update RAS-IR.\n\nNotes:\n1. Overlap and row ordering have significant effects on convergence.\n2. Lower the matrix bandwidth, better the performance ?\n3. Sync is lower, but communication becomes higher, tradeoff!\n4. Parallelism is higher, but overhead is also higher, tradeoff!\n5. Very high accurate solves, do not provide anything.\n6. 
Two level and multi-levels might significantly reduce number of\niterations to converge.", "target": 1} {"commit_message": "[PATCH] Fix a warning\n\nhttps://cgal.geometryfactory.com/CGAL/testsuite/CGAL-5.1-Ic-152/Installation/TestReport_Friedrich_Ubuntu-gcc-7.gz\n```\nCMake Warning at /home/gimeno/foutoir/cgal_root/CGAL-5.1-Ic-152/cmake/modules/CGAL_enable_end_of_configuration_hook.cmake:99 (message):\n =======================================================================\n\n CGAL performance notice:\n\n The variable CMAKE_BUILD_TYPE is set to \"\". For performance reasons, you\n should set CMAKE_BUILD_TYPE to \"Release\".\n\n Set CGAL_DO_NOT_WARN_ABOUT_CMAKE_BUILD_TYPE to TRUE if you want to disable\n this warning.\n\n =======================================================================\nCall Stack (most recent call first):\n CMakeLists.txt:9223372036854775807 (CGAL_run_at_the_end_of_configuration)\n```", "target": 0} {"commit_message": "[PATCH] enable GPU sharing among tMPI ranks\n\nIt turns out that the only issue preventing sharing GPUs among thread-MPI\nthreads was that when the thread arriving to free_gpu() first destroys\nthe context, it is highly likely that the other thread(s) sharing a GPU\nwith this are still freeing their resources - operation which fails as\nsoon as the context is destroyed by the \"fast\" thread.\n\nSimply placing a barrier between the GPU resource freeing and context\ndestruction solves the issue. However, there is still a very unlikely\nconcurrency hazard after CUDA texture reference updates (non-bonded\nparameter table and coulomb force table initialization). To be on the\nsafe side, with tMPI a barrier is placed after these operations.\n\nChange-Id: Iac7a39f841ca31a32ab979ee0012cfc18a811d76", "target": 0} {"commit_message": "[PATCH] Use [..] instead of at(..) in LINCS GPU data management code\n\nThe .at(..) was used to make sure that the indices are within bonds\nwhile the code is not thoroughly tested. 
Now it can be replaced with\ndirect access [..] for performance reasons.", "target": 1} {"commit_message": "[PATCH] Changed some compiler flags for better performance", "target": 1} {"commit_message": "[PATCH] improved performance", "target": 1} {"commit_message": "[PATCH] Sun Performance Library debugged and tested", "target": 0} {"commit_message": "[PATCH] Simplify gmx chi\n\nThis change is pure refactoring that prepares for performance\nimprovements to ResidueType handling that will benefit both grompp and\npdb2gmx.\n\nUse vector and ArrayRef to replace C-style memory handling. Some\nhistogram vectors were being over-allocated by 1, which is no longer\nsafe to do now that the size of the vector is relevant when looping,\nso those are reduced.\n\nEliminated and reduce scope of iteration variables. Removed an unused\nfunction and some debug code in comments. Used const references rather\nthan pointers where possible. Used range-based for and algorithms in\nsome places that are now possible to do so.", "target": 1} {"commit_message": "[PATCH] all modules and domqtests.mpi fast [travis skip]", "target": 0} {"commit_message": "[PATCH] Use HostVector for PME CPU Force Buffer\n\nFix performance bug: PME CPU force buffer should be a HostVector to\nallow pinned memory GPU transfers (which occur in PME-PP\ncommunications on virial steps).", "target": 1} {"commit_message": "[PATCH] HvD: Adding a small performance test to check whether it\n makes sense to use a Taylor series for exp().", "target": 0} {"commit_message": "[PATCH] Performance improvments. 
BuildWeightMatrix() is probably\n unnecessary entirely.", "target": 1} {"commit_message": "[PATCH] Stabilize and accelerate the test case by using a smaller\n network architecture.", "target": 0} {"commit_message": "[PATCH] Use Array objects instead of Param objects in several\n functions\n\n* Update SIFT with new RAII memAlloc\n* Workaround for function resolution in ternary operator\n* Fix Fast and Orb functions", "target": 0} {"commit_message": "[PATCH] Wrote the compute_children_node_keys() function in elem.h\n which allows one to generate appropriate node keys while reading in a mesh\n with multiple refinement levels. This allows us to avoid a linear search in\n the MeshRefinement::add_point routine since all the nodes can now be found in\n the nodes hash table. The resulting performance improvement was significant.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1263 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 0} {"commit_message": "[PATCH] performance improvements for Linear JIT CUDA kernels", "target": 1} {"commit_message": "[PATCH] Reducing dependencies. Print functions are generally not\n fast anyway, inlining them leads to unnecessary dependencies and larger\n headers. Removing print functions from headers.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1112 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 0} {"commit_message": "[PATCH] New data parallel routines to improve performance", "target": 1} {"commit_message": "[PATCH] need to slow down in some changes ... cancel eckart for the\n time being", "target": 0} {"commit_message": "[PATCH] 28-30% improvement in cuda vs opencl speedup for bilateral\n\n* Replacing exp cuda device function with __expf improved the cuda\n bilateral kernel performance vs opencl kernel.\n* Removed redundant multiplication calculation in the for loop\n of cuda/opencl kernels", "target": 1} {"commit_message": "[PATCH] Improve Ginkgo device arch flags management.\n\n1. 
Use a newer version of CAS which makes Auto become All when no\n architecture was detected. This ensures that compiling on a node without\n a GPU will provide good performance (at the cost of binary size and\n compile time), instead of failing to provide optimized architecture\n specific kernels.\n2. Use the `variable` version of CAS instead of `target` so that we do\n not call the CAS redundantly in several tests.\n3. Fix HIP AMDGPU flags which were not properly passed:\n `target_compile_options` do not seem to work with `hip_add_library`, so\n the flags need to be passed to `hip_add_libary` (or executable)\n directly.", "target": 1} {"commit_message": "[PATCH] Break apart update_constraints\n\nThere are four distinct kinds of work being done, and never was any\ncall to update_constraints doing all of them, so it's better to have a\ngroup of functions, each of which do one thing, and the relevant ones\ncalled. This also makes it simpler to express by returning fast that\nwhen we don't have constraints, we do nothing.\n\nMade the logic for whether this is a log or energy step match that of\nthe main MD loop. The old implementation may not have prepared for the\nlast step correctly when it was triggered by something other than the\nnsteps inputrec value.\n\nRemoved a commment mentioning iteration, which is a feature\nthat was removed a while ago.\n\nRemoved some ancient debug dump output.\n\nRefs #2423, #1793\n\nChange-Id: I21c10826721ddc9a79a33b1dc75971a20d0855d9", "target": 0} {"commit_message": "[PATCH] use MinMaxTuple as return value of minmax\n\nThis might slow things down.\n\nSigned-off-by: Panagiotis Cheilaris ", "target": 0} {"commit_message": "[PATCH] removed order natoms^2 loops which made grompp vsite stuff\n extremely slow for large systems", "target": 0} {"commit_message": "[PATCH] Prevent PME tuning excessive grid scaling\n\nWe limit the maximum grid scaling to a factor 1.8. 
This allows\nplenty of room for shifting work from PME on CPU to short-range\nGPU kernels, but avoids excessive scaling for diminishing return\nin performance for a significant increase in power consumption,\ncommunication volume (which may with fluctuating network load not\nshow up during tuning) as well as limiting load balancing.\n\nChange-Id: I85c02478faa6b67c063b6e1b45a9ac1755b2d81e", "target": 0} {"commit_message": "[PATCH] Tetrahedral mesh\n\n- Displays the item after its creation without having to move the manipulated frame.\n\n- Enhances the performance when moving the manipulated frame.", "target": 0} {"commit_message": "[PATCH] added precompiler commands for performance tests bug fix in\n SNC_FM_decorator and SNC_constructor: plane sweep must be done on correct\n planes", "target": 1} {"commit_message": "[PATCH] impr performance of resultant", "target": 1} {"commit_message": "[PATCH] Improved the performance of finding whether a halfedge is on\n the outer ccb", "target": 1} {"commit_message": "[PATCH] Commit the new version of the static filter. Too slow for the\n moment.", "target": 0} {"commit_message": "[PATCH] Fixed put_in_list for better performance", "target": 1} {"commit_message": "[PATCH] Use range for in dof_map.C\n\nWe can use more efficient iterators in a couple places here too.", "target": 1} {"commit_message": "[PATCH] Refactor threading model\n\nRemoves:\n- global_tp_ (the replacement TBB thread pool)\n- StorageManager::async_thread_pool_\n- StorageManager::reader_thread_pool_\n- StorageManager::writer_thread_pool_\n- VFS::thread_pool_\n\nAdds:\n- StorageManager::compute_tp_\n- StorageManager::io_tp_\n\nUsage changes:\n1. Our three parallel functions (`parallel_sort`, `parallel_for[_2d]`) now use\n the `StorageManager::compute_tp_`.\n2. Both the `Reader::read_tiles()` and `Writer::write_tiles()` now execute on\n `StorageManager::io_tp_`.\n3. 
The VFS is now initalized with a thread pool, where the storage manager\n initializes it with the `StorageManager::io_tp_`. This means that both the\n VFS and Reader/Writer io paths execute on the same thread pool. There was\n previously a deadlock scenario if both used the same thread pool, but that\n is no longer an issue now that the threadpools are recursive.\n4. The async queries are executed on `StorageManager::compute_tp_`.\n\nConfig changes:\n- Adds configuration parameters for the compute and IO thread pool \"concurrency\n levels\". A level of \"1\" is serial execution while all other levels have a\n maximum concurrency of N but allocate N-1 OS threads.\n- Deprecate the async/reader/writer/vfs thread num configurations. If any of\n these are set and larger than the new \"sm.compute_concurrency_level\" and\n \"sm.io_concurrency_level\", the old values will be used instead. The motiviation\n is so that existing users will not see a drop in performance if they are\n currently using larger-than-default values.", "target": 1} {"commit_message": "[PATCH] Use processor_id_type where appropriate\n\nI was accidentally using dof_id_type instead, and since that will\nalways be equal or larger this mistake shouldn't have led to bugs, so\nI won't bother turning this commit into a half dozen different fixup\ncommits to rebase.\n\nHowever, using the correct type should be infintesimally more\nefficient and significantly less confusing.", "target": 1} {"commit_message": "[PATCH] optimize create_atoms performance for large boxes and small\n regions. 
warn if taking a long time", "target": 1} {"commit_message": "[PATCH] Improve performance for SYCL parallel_reduce", "target": 1} {"commit_message": "[PATCH] Added OPTLD for optimized but slow loading which you want for\n time-critical programs such as mdrun.", "target": 1} {"commit_message": "[PATCH] Remove OpenMP from KOKKOS_DEVICES in Kokkos CUDA Makefiles\n since normally this doesn't improve performance", "target": 0} {"commit_message": "[PATCH] improved efficiency for communication somewhat and fixed Ssw\n option to work correctly, albeit not as efficient as possible yet.", "target": 1} {"commit_message": "[PATCH] Fixed nbnxn_4xN performance regression\n\nCommit 8e92fd67 changed the 2xNN kernel to use gmx_simd_blendnotzero_r\nand the 4xN kernel to use gmx_simd_blendv_r. Making the 4xN kernel\nconsistent with the 2xNN kernel improves the performance with AVX2\nwith 4% and 3% for the RF and PME kernels, respectively.\n\nChange-Id: Iac334865c2b2340493639300d07e7ab9c78e129f", "target": 1} {"commit_message": "[PATCH] Don't resize() as that assembles the matrix and makes\n add_coef() slow", "target": 0} {"commit_message": "[PATCH] update test (Cactus_deformation_session.cpp): make it\n suitable for test performance (not active by default) make it suitable for\n test suite (precomputed mesh difs active by default)", "target": 0} {"commit_message": "[PATCH] Improve the performance of dasum and sasum when SMP is\n defined", "target": 1} {"commit_message": "[PATCH] Changed default cuda stream to be non-zero\n\n* Added additional following api functions specific to cuda backend\n * afcu_get_stream\n * afcu_get_native_id\n* Removed duplicate class in fast kernel that helps declare\n dynamic shared memory based on template type", "target": 0} {"commit_message": "[PATCH] a slow version of triangle split", "target": 0} {"commit_message": "[PATCH] Adding a Fast configuration", "target": 0} {"commit_message": "[PATCH] Refactor sign to signbit internally\n\nThe operation being 
performance is equivalent of std::signbit\nthus, using signbit is more apt and removes unnecessary redefine\nof sign function in opencl jit kernel.", "target": 0} {"commit_message": "[PATCH] default to 1 thread, make the user manually specify the\n number of threads to avoid mpi/thread contention", "target": 0} {"commit_message": "[PATCH] subCycle: Add special treatment for nSubCycles = 1\n\nNow running sub-cycling with nSubCycles = 1 is as efficient as running the same\ncode without the sub-cycling loop.", "target": 0} {"commit_message": "[PATCH] POWER10: Change dgemm unroll factors\n\nChanging the unroll factors for dgemm to 8 shows improved performance with\nPOWER10 MMA feature. Also made some minor changes in sgemm for edge cases.", "target": 1} {"commit_message": "[PATCH] #1295 Refactored MX::setSub(IMatrix,IMatrix) The new\n implementation should be much more efficient and handle non-monotone indices\n correctly", "target": 1} {"commit_message": "[PATCH] template cases: added cylindrical background mesh in rotating\n geometry cases\n\nsnappyHexMesh produces a far better quality AMI interface using a cylindrical background mesh,\nleading to much more robust performance, even on a relatively coarse mesh. The min/max AMI\nweights remain close to 1 as the mesh moves, giving better conservation.\n\nThe rotating geometry template cases are configured with a blockMeshDict file for a cylindrical\nbackground mesh aligned along the z-axis. 
The details of use are found in the README and\nblockMeshDict files.", "target": 1} {"commit_message": "[PATCH] more efficient comparison function\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@623 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 1} {"commit_message": "[PATCH] routines to control threading for Accelerate framework\n https://github.com/nwchemgit/nwchem/pull/331", "target": 0} {"commit_message": "[PATCH] moved warning of slow option to 'bugs'", "target": 0} {"commit_message": "[PATCH] slow draw of polyhedron_items fixed", "target": 1} {"commit_message": "[PATCH] Squeezed some air out of the spread, gather and solve\n routines. Should still be tested for parallel performance and correctness", "target": 0} {"commit_message": "[PATCH] Adding conditional compilation(#if defined(LOONGSON3A)) to\n avoid affecting the performance of other platforms.", "target": 1} {"commit_message": "[PATCH] HvD: In response to the development of a new ARMCI over MPI\n implementation I have updated the build_nwchem script and the tools\n GNUmakefile to be able to drive this target. This implementation goes by the\n name MPI_TS (short for MPI Two-Sided) as it eliminates the data-server and\n uses MPI two-sided communications to implement the ARMCI functionality. It is\n expected to be highly portable as it relies only on MPI. Trying the\n performance of this implementation will be interesting.\n\nIn all cases it is currently required to set the environment variable\n\n EXP_GA\n\nany value for this variable will do. This variable is currently required because\nthe implementation lives in a separate branch of GA. In addition you need to\nrun get-tools with this environment variable set to pick up the correct GA\nbranch.\n\nThe build_nwchem script will under certain conditions automatically pick the\nMPI_TS implementation up. 
However, it is safer to set\n\n ARMCI_NETWORK=MPI_TS\n\nas that eliminates any guess work that might lead to a different result.\n\nIf you want to try this implementation, please go ahead, as I am sure the GA\nteam will value any feedback we provide.", "target": 0} {"commit_message": "[PATCH] Modified nb_putv to improve performance of transfers to\n processes on the same SMP node.", "target": 1} {"commit_message": "[PATCH] Added some profiling to catch performance numbers.", "target": 0} {"commit_message": "[PATCH] performance updates G_indx replaced with Pack_G_indx...EJB", "target": 0} {"commit_message": "[PATCH] Signifcantly improved restrictor performance through the use\n of CTA index swizzling to improve spatial locality improving cache line\n utilization", "target": 1} {"commit_message": "[PATCH] Got parallelization working with CollocationIntegrator, very\n slow", "target": 0} {"commit_message": "[PATCH] Convert aligned moves to unaligned\n\nshould have no performance impact on reasonably modern cpus and fixes occasional crashes in actual user code.", "target": 1} {"commit_message": "[PATCH] MultiReduce kernels now instantiate a series of power of two\n block sizes: this improves performance significantly", "target": 1} {"commit_message": "[PATCH] more efficient live variables in SX virtual machine", "target": 1} {"commit_message": "[PATCH] Adding Changelog for Release 2.9.99\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 2.9.99", "target": 0} {"commit_message": "[PATCH] Drastically improve performance of getElementByMass\n\nThe old approach iterated through the entire periodic table by atomic number and\nsubtracted the provided mass by the element's mass and kept track of the\nsmallest difference. 
The new approach steps through the elements in order of\natomic number and bails once it hits an element with a higher mass than the\ntarget mass (assuming masses are monotonically increasing).\n\nOn my desktop, processing 4TVP-dmj_wat-ion.psf dropped from 297 s to 15.4 s. But\n15.4 s is still a bit too long...", "target": 1} {"commit_message": "[PATCH] Continue working on FParser JIT\n\n* Path fix for Linux\n* Add optimizer opcodes, fix bugs\n* Vastly expand JIT opcode support, fix AD bug... oops!\n* Add JIT support for if(a,b,c) control flow opcodes\n* Make compares fuzzy using Fparser's epsilon\n* Fix compile error when fpoptimizer is disabled\n* Enable support for more Value_t types\n* Fix stack recover bug, add test/example\n* Add performance test code\n* Fix another AD bug involving cDup\n* Tweak/expand tests\n* Put JIT cache files into subdirectory", "target": 0} {"commit_message": "[PATCH] Reduce noise slightly and increase dataset size, which will\n slow down the test but make the results more accurate.", "target": 0} {"commit_message": "[PATCH] Add IntRange helper class\n\nThis seems to be the tersest efficient way to iterate over a range of\nintegers, and since there are quite a few libMesh methods which take\nlocal integer indices, we still have to iterate over those indices\nquite often.", "target": 0} {"commit_message": "[PATCH] Use numerical_jacobian_h_for_var\n\nThe refactoring required makes the ALE perturbations a little simpler\nto boot. This may be a tiny bit slower, but numerical jacobians are\nslow to begin with and the difference shouldn't be noticeable.", "target": 0} {"commit_message": "[PATCH] Refactor UpdateTree() to sometimes Hamerly prune. 
We aren't\n properly retaining pruned nodes between iterations, but this is definitely a\n start and it's basically as fast as any of these attempted algorithms I've\n written.", "target": 0} {"commit_message": "[PATCH] Add PointLocatorBase::locate_node()\n\nThis isn't nearly as efficient as it could be if we override it in\nPointLocatorTree, but it's simple and I'm not planning on using it in\nany inner loops.", "target": 0} {"commit_message": "[PATCH] AABB tree: the projection does not construct the KD-tree at\n the first projection query anymore. for efficient projection queries either\n the user calls for its explicit construction during the AABB construction or\n calls \"construct_search_Tree()\". otherwise the first primitive reference\n point is used as (naive) hint.", "target": 1} {"commit_message": "[PATCH] Use acq/rel semantics to pass flags/pointers in\n getrf_parallel.\n\nThe current implementation has locks, but the locks each only\nhave a critical section of one variable so atomic reads/writes\nwith barriers can be used to achieve the same behavior.\n\nLike the previous patch, pthread_mutex_lock isn't fair, so in a\ntight loop the previous thread that has the lock can keep it\nstarving another thread, even if that thread is about to write\nthe data that will stop the current thread from spinning.\n\nOn a 64c Arm system this improves performance by 20x on sgesv.goto.", "target": 1} {"commit_message": "[PATCH] aabb tree: more on performance section (benchmark across\n kernels)", "target": 0} {"commit_message": "[PATCH] SNAP optimizations, kernel fusion, large reduction of memory\n usage on the GPU, misc. 
performance optimizations.", "target": 1} {"commit_message": "[PATCH] In the performance miniapps, add options to set the mesh\n refinements.\n\nRun the miniapps/performance tests at coarser mesh resolutions, so that\nthe tests run faster.\n\nExplicitly remove and ignore the temporary files created by the tests,\nbecause, in some cases, they may not be removed automatically.", "target": 0} {"commit_message": "[PATCH] Enable dynamic pair list pruning\n\nThis change activates the dynamic pruning scheme and the pruning\nonly kernels added in previous commits.\nA heuristic estimate is used to select value for nstlist and\nnstlistPrune that should result in performance that is reasonably\nclose to optimal. The nstlist increase code has been moved from\nrunner.cpp to nbnxn_tuning.cpp. The KNL check in that code has been\nreplaced by a check for Xeon Phi.\nA paragraph has been added to the manual to describe the dynamic\nand rolling list pruning scheme. A reference with all the details\nwill be added once the paper has been published.\n\nChange-Id: Ic625858a07083916c8aa3e07f7497488dcfaee9e", "target": 0} {"commit_message": "[PATCH] Performance Improvements", "target": 1} {"commit_message": "[PATCH] Speed up mtop_util atom lookup\n\nThe lookup of atom indices and properties on global atom index have\nbeen sped up by moving functions to a new header file mtop_lookup.h\nand by storing start and end global atom indices in gmx_mtop_t.\nAnother performance improvement is that the previous molblock index is\nused as starting value for the next search.\nThe atom+residue lookup function now also returns the reside index.\nThis change also simplifies the code, since we no longer need a lookup\ndata structure.\nA large number of files are touched because the t_atom return pointer\nis changed to const also in the atomloop functions.\n\nChange-Id: I185b8c2e614604e9561190dd5e447077d88933ca", "target": 1} {"commit_message": "[PATCH] added a pragma to suppress a performance warning in 
std::map", "target": 0} {"commit_message": "[PATCH] Unroll dynamic indexing in SYCL gather kernel\n\nThread ID-based dynamic indexing of constant memory data in the SYCL\ngather kernel caused a large amount of register spills and poor\nperformance of the gather kernel on AMD. Avoiding dynamic indexing\neliminates spills and improves performance of this kernel >10x.\n\nRefs #3927", "target": 1} {"commit_message": "[PATCH] finished consolidating the InRCut device function. It uses\n double3's for axes, halfAx, and dist. It passes the currentPArticle,\n neighborParticle, and gpu_x,y,z arrays to calculate distance further down the\n trace. Also, I replaces the diff_com and virComponents with double3s. It\n would be more elegant to use an array of double3's as opposed to three\n separate arrays for coords. Right now this change isn't implemented due to\n possible performance concerns.", "target": 0} {"commit_message": "[PATCH] added a special cache for efficient conversion of\n Sqrt_extension to Bigfloat_interval, disabled by default", "target": 1} {"commit_message": "[PATCH] Use more efficient way of getting number of classes.", "target": 1} {"commit_message": "[PATCH] This is a more correct implementation. But it isn't efficient\n and it may fail on corner cases.\n\nThat can be a problem for another day... (are you reading this from the future?\nSorry...!)", "target": 0} {"commit_message": "[PATCH] 64-bit alignment makes a huge difference on ga_dgemm\n performance on KNL", "target": 0} {"commit_message": "[PATCH] Fix hardware topology detection for modern systems\n\nThis is a partial rewrite of our hardware topology detection\ncode to handle modern systems where we might not be allowed to\nrun on all threads present, where there might be cpu limits\nthat are lower than the total number of threads, or hybrid\nCPUs that contain combinations of performance and efficiency\ncores. 
In particular, it includes\n\n- The hwloc detection has been fixed to work for more systems,\n and we are better at properly separating internal logical\n cpu indices from OS-provided logical indices.\n- The cpuinfo code will properly handle the case where we\n are not allowed to run on some cores when detecting a simple\n topology.\n- When compiled without hwloc support, we can now also parse\n cpu topologies from Linux filesystems, which is important\n for non-x86 processors.\n- All detection layers properly handle the case where there is\n a cpuset mask that disallows some cpus from being used.\n- When available, we use linux cgroups (either v1 or v2) to\n detect cpu limits set e.g. in container environments, and use\n this to decide the number of threads rather than the total\n logical core count. This should avoid overloading runs\n in container environments, including our CI system.\n- We no longer assume that all sockets/cores are identical,\n which will commonly not be the case e.g. if slurm has set\n custom cpusets (so only those cpus are visible).", "target": 0} {"commit_message": "[PATCH] Expanded the GetIrTest behavior to enable efficient testing", "target": 0} {"commit_message": "[PATCH] Parallelize closing of files on write\n\nThis change parallelizes the closing of files on writes. This solves a\nperformance problem when the user was using S3 or other object store\nwhere we buffer the multi-part writes. If the user's data was below the\nbuffer size, then no io would have occurred until the closing when we\nflush buffers. 
This causes a large performance penalty relative to\nexpected because up to three files per field had to be uploaded\nserially.", "target": 1} {"commit_message": "[PATCH] Use non-allocating build_edge_ptr where possible\n\nThis may be noticeably more efficient in a few of these cases.", "target": 1} {"commit_message": "[PATCH] Some performance tuning", "target": 0} {"commit_message": "[PATCH] Continued stats refactor + subarray stats (#2200)\n\nTYPE: IMPROVEMENT\nDESC: Added additional stats for subarrays and subarray partitioners", "target": 0} {"commit_message": "[PATCH] Added gpu implementation for second convergence test in\n bicgstab. Improved performance by moving x += alpha * y from step_3 to step_2\n in all implementations.", "target": 1} {"commit_message": "[PATCH] tables added on cache performance", "target": 0} {"commit_message": "[PATCH] \n tutorials/combustion/reactingFoam/laminar/counterFlowFlame2D(LTS): changed to\n Wilke transport mixing\n\nChanged the laminar methane combustion cases to use the Wilke mixing rule for\nthe transport properties obtained from the Sutherland model but with coefficient\nmixing for thermodynamic properties for efficient evaluation of reaction\nequilibria.\n\nThis provides significantly more accurate results for laminar combustion,\nproducing a thinner flame and a 10K reduction in peak temperature.", "target": 0} {"commit_message": "[PATCH] more efficient treatment of geometries in qmmm", "target": 1} {"commit_message": "[PATCH] KokkosCore: Mark perf tests as CATEGORY PERFORMANCE\n\nFix https://github.com/kokkos/kokkos/issues/374 by marking\nKokkosCore's performance tests as CATEGORY PERFORMANCE. They will\nalways build, but they will only run when doing performance tests.", "target": 0} {"commit_message": "[PATCH] Make linspace() non-recursive\n\nFor a \"linspace(a, b, num)\" call with \"a\" and/or \"b\" symbolic expressions,\nthe recursive implementation would lead to a graph of \"num\" depth. 
For\nlarge \"num\" this can lead to e.g. a very slow \"is_equal\" check due to the\ndepth requirement.\n\nAnother reason to prefer a multiplication over recursive addition is\neliminating the compounding error due to rounding when calling linspace\nwith numeric arguments.", "target": 0} {"commit_message": "[PATCH] Kokkos::Experimental View refactoring #define\n KOKKOS_USING_EXPERIMENTAL_VIEW to alias Kokkos::View to\n Kokkos::Experimental::View.\n\nRevise core unit and performance tests to execute correctly.", "target": 0} {"commit_message": "[PATCH] convert coul/dsf/omp styles to use fast analytical erfc()", "target": 0} {"commit_message": "[PATCH] Fixed return value of gmx_mtop_bondeds_free_energy\n\nThe return value was always true, which was harmless, since it\ncould only cause a small performance hit of useless sorting.\n\nFixes #1387\n\nChange-Id: I088a3747ddb3517fbb5e416b791bd542bd49fed2", "target": 1} {"commit_message": "[PATCH] Adding specific code for Tet + Bbox_3 do_intersect as it\n should be ok for Bbox_3 to degenerate.\n\nAdding specific code for Tet + Bbox_3 do_intersect as it should be ok for Bbox_3 to degenerate.\n\nThe previous code failed in case Bbox_3 is degenerate.\n\nI use the result = result || predicate(); to keep the maybe inside result.\nIf certain the code returns early.\nI also avoid the %4 as this is a slow operation, but not sure that this is worth compared to the rest.", "target": 0} {"commit_message": "[PATCH] Update RAS-IR.\n\nNotes:\n1. Overlap and row ordering have significant effects on convergence.\n2. Lower the matrix bandwidth, better the performance ?\n3. Sync is lower, but communication becomes higher, tradeoff!\n4. Parallelism is higher, but overhead is also higher, tradeoff!\n5. Very high accurate solves, do not provide anything.\n6. 
Two level and multi-levels might significantly reduce number of\niterations to converge.", "target": 1} {"commit_message": "[PATCH] - Experiment with compacting the in_conflict_flag showed 7%\n performance drop. So it's probably not worth it in practice.", "target": 0} {"commit_message": "[PATCH] Fix performances of Triangulation_2 with EPEC\n\nThere was a performance degradation between CGAL-3.7 and CGAL-3.8, when\nTriangulation_2 is used with EPEC. This patch fixes the issue. Using a\nfunctor that is specialized for EPEC, in inexact_orientation, to_double is\nnot called on p.x() but on p.approx().x().", "target": 1} {"commit_message": "[PATCH] Added workaround for bad performance with CUDA 7.x on clover\n sigma oprod", "target": 0} {"commit_message": "[PATCH] Pass Cell_handle and Vertex_handle by value instead of by\n const&. This undoes :\n\n r19107 | afabri | 2003-10-17 10:49:19 +0200 (Ven 17 oct 2003) | 2 lignes\n Added const& for gaining performance\n\nwhich was justified at the time by the fact that on VC++, handles encapsulated iterators.", "target": 0} {"commit_message": "[PATCH] new performance tests", "target": 0} {"commit_message": "[PATCH] Added performance chart", "target": 0} {"commit_message": "[PATCH] Acquire BC lists outside the CheckPointIO id loop\n\nThis should be just as fast for restarts and O(Nsplits/Nprocs) times\nfaster for mesh splitting. Thanks to @friedmud for the idea.", "target": 1} {"commit_message": "[PATCH] Support pinning in HostAllocator\n\nWe want the resize / reserve behaviour to handle page locking that is\nuseful for efficient GPU transfer, while making it possible to avoid\nlocking more pages than required for that vector. 
By embedding the\npin()/unpin() behaviour into malloc() and free() for the allocation\npolicy, this can be safely handled in all cases.\n\nAdditionally, high-level code can now choose for any individual vector\nwhen and whether a pinning policy is required, and even manually\npin and unpin in any special cases that might arise.\n\nWhen using the policy that does not support pinning, we now use\nAlignedAllocator, so that we minimize memory consumption.\n\nChange-Id: I807464222c7cc7718282b1e08204f563869322a0", "target": 0} {"commit_message": "[PATCH] replaced fr->solvent_opt by fr->cginfo, which also contains\n the energy group id and implemented efficient cg sorting with DD, which is\n now done at every DD decomposition", "target": 0} {"commit_message": "[PATCH] SDG Linf fast insertion examples\n\nSigned-off-by: Panagiotis Cheilaris ", "target": 0} {"commit_message": "[PATCH] Fix indexing issue in the pull code\n\nWhen determining if the COM of pull groups should be computed,\nthe indexing range of group[] for each pull coordinate is one element\ntoo long. In most cases this element is 0, so in which case it only\nlead to extra, useless compute when a cylinder group is used.\nNote that for dihedral geometry the extra element is actually dim[0]\nin pull coord, which is 0 or 1, which is harmless.\n\nNo release note, since this did not affect results, it could only\ncause a minor performance loss with cylinder pulling.\n\nFixes #2486\n\nChange-Id: Ie5785181fbe28d8db57e37c58553ae3835e657b7", "target": 0} {"commit_message": "[PATCH] Incorporated Poisson solver in the mdrun code. 
It is dead\n slow but maybe parallellized easily and is a good reference code for PPPM,\n since it is about twice as accurate as PPPM at the same number of grid\n points.", "target": 0} {"commit_message": "[PATCH] Update to Kokkos r2.04.04 and add workaround for performance\n regression", "target": 0} {"commit_message": "[PATCH] bond/react: efficient competing reactions", "target": 1} {"commit_message": "[PATCH] Fixup re-enable core performance tests\n\nThese had been inadvertently disbaled in #3839\n\nCo-Authored-By: Nick Curtis \nCo-Authored-By: Bruno Turcksin ", "target": 0} {"commit_message": "[PATCH] vector specialization for parallel_sync\n\nWe can't do this the efficient way in the general algorithm, so let's\ndo it as best we can right now, with blocking receives.\n\nThis should be replaced by Derek's algorithm in #1684 as soon as we\ncan support that.", "target": 0} {"commit_message": "[PATCH] Global thread pool when TBB is disabled (#1760)\n\nThis introduces a global thread pool for use when TBB is disabled. The\nperformance has not been exhaustively benchmarked against TBB. However, I did\ntest this on two readily available scenarios that I had been recently performance\nbenchmarking for other reasons. One scenario makes heavy use of the parallel_sort\npath while the other does not. Surprisingly, disabling TBB performs about 10%\nquicker with this pach.\n\n// Scenario #1\nTBB: 3.4s\nTBB disabled, this patch: 3.0s\nTBB disabled, on dev: 10.0s\n\n// Scenario #2\nTBB: 3.1\nTBB disabled, this patch: 2.7s\nTBB disabled, on dev: 9.1s\n\nFor now, this patch uses the threadpool at the same scope as the TBB scheduler.\nIt is a global thread pool, shared among a single process, and conditionally\ncompiled. 
The concurrency level that the thread pool is configured with is\ndetermined from the \"sm.num_tbb_threads\" config.\n\nThis patch does not disable TBB by default.\n\nCo-authored-by: Joe Maley ", "target": 0} {"commit_message": "[PATCH] type 2 instability fixed; type 2.5 still not fixed but Rick\n disabled texas in this case; significant performance optimizations for '95\n esp. gradients", "target": 1} {"commit_message": "[PATCH] RJH: Tweaked slow convergence threshold and improved\n stability of RJH: the line search", "target": 0} {"commit_message": "[PATCH] Reducing dependencies. Print functions are generally not\n fast anyway, inlining them leads to unnecessary dependencies and larger\n headers. Removing print functions from headers.", "target": 0} {"commit_message": "[PATCH] Wrote the compute_children_node_keys() function in elem.h\n which allows one to generate appropriate node keys while reading in a mesh\n with multiple refinement levels. This allows us to avoid a linear search in\n the MeshRefinement::add_point routine since all the nodes can now be found in\n the nodes hash table. The resulting performance improvement was significant.", "target": 0} {"commit_message": "[PATCH] Improve meanshift filter performance on CPU\n\nImproved the meanshift filter performance on the CPU backend by replacing\nvectors with std::arrays and moving them out of the for loops. Also\nreduced a few conversion operations.", "target": 1} {"commit_message": "[PATCH] Beginning to add support for freezing the sparsity pattern of\n graphs and sparse matrices to improve the performance of subsequent updates", "target": 1} {"commit_message": "[PATCH] Remove no-inline-max-size and suppress remark\n\nTo avoid the remark that inlining isn't possible I added the flag\nin d28edf2a07dcf11. 
This causes slow compile and should be avoided.\nInstead suppress the remark.\n\nTODO (for later): Check whether the additional inlining can improve\npermance and consider enable it for release build.\n\nChange-Id: I5866fcc5865fb44ca3dca0cf217e0cab2afbea0c", "target": 0} {"commit_message": "[PATCH] Fixing bug in build system noticed by Jeff. This should\n improve the performance of default builds", "target": 0} {"commit_message": "[PATCH] More efficient TNG selection group creation\n\nDo not create a TNG selection group if no selection is specified\nexplicitly, or if the selection contains all atoms in the system.\n\nChange-Id: Ibe2a14e55aff829fdb74de074447f00f0e85f090", "target": 1} {"commit_message": "[PATCH] BJP: Initial checkin of test programs to test performance of\n put and get operations using mirrored arrays.", "target": 0} {"commit_message": "[PATCH] performance improvemnts under ia64", "target": 1} {"commit_message": "[PATCH] Initial checkin of code to test performance of strided\n onesided operations.", "target": 0} {"commit_message": "[PATCH] performance measurement for the triples plus attempts at\n optimizing one routine on a pentium", "target": 0} {"commit_message": "[PATCH] Halved the cost of the pull communication\n\nWith DD the PBC reference coordinates are now only communicated\nafter DD repartitioning. 
This reduces the number of MPI_alltoall\ncalls from 2 to 1 per step, which can significantly improve\nperformance at high parallelization.\n\nAdded a cycle counter for pull potential.\n\nAdded checks for zero pull vectors to avoid div by 0.\n\nChange-Id: Ib89ba9e14eaa887f59a5087135580bc29a20d7d0", "target": 1} {"commit_message": "[PATCH] Add workaround for performance regression", "target": 0} {"commit_message": "[PATCH] Adding Changelog for Release 3.2.01 [ci skip]\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 3.2\n\n(cherry picked from commit 0e0b28fd78e696f74ab8d1d6bfc1c3e2b9667f49)", "target": 0} {"commit_message": "[PATCH] Change from single Performance test Executable to different\n executables per backend.", "target": 0} {"commit_message": "[PATCH] Fast Haswell ZGEMM kernel", "target": 0} {"commit_message": "[PATCH] NumericVector::add_vector refactoring\n\nSimilar to #411 and #413\n\nThis was originally intended to be just another additional T* API plus\na refactoring; however, the new PetscVector::add_vector(DenseVector)\ncode path should be a performance improvement as well.", "target": 1} {"commit_message": "[PATCH] Set cmake build to release for LZ4/Zlib and Zstd\n\nThis enables vectorization which yields a boost in performance\nfor these compressors. 
#1033", "target": 0} {"commit_message": "[PATCH] fem_system_ex2 local->distributed solution\n\nThere doesn't seem to be any way to get this behavior back in reinit()\nwithout creating a performance regression: most applications simply\nnever need to move data in that direction.", "target": 0} {"commit_message": "[PATCH] Enhancement: replace SQRT and POW by more efficient\n computations", "target": 1} {"commit_message": "[PATCH] additional precompiler lines for performance testing", "target": 0} {"commit_message": "[PATCH] improved copy_ccb method by replacing the used insert methods\n with more efficient versions of those methods.", "target": 1} {"commit_message": "[PATCH] adding parallel region around gmx_parallel_3dfft_execute.\n Makes it work correctly but unnessary barrier should be removed for\n performance reasons", "target": 0} {"commit_message": "[PATCH] Restructured the subdivision package\n\n-- Integrated the doc in the header files\n-- Split and moved files to have a proper internal structure and to distinguish\n between hosts, stencils and methods at the filename level.\n-- Removed all instances of Polyhedron to have PolygonMesh instead\n-- Cleaned off useless functions (Polyhedron_decorator remnants)\n-- Improved general documentation\n-- Minor performance improvements", "target": 1} {"commit_message": "[PATCH] Simplified neighborlist setup with GB and enroute to more\n efficient DD performance", "target": 0} {"commit_message": "[PATCH] Optimize\n\n- Makes the resizing of the points real fast\n- Makes the resizing of the normals applied on slider released instead of every tick when the point set size is bigger than 300 000\n- sets the initial value of the point size to 2 instead of 5", "target": 1} {"commit_message": "[PATCH] Fix mis-branching in CouplingMatrix iterator++\n\nThis was causing a *huge* performance penalty in cases where we had\nmany variables.\n\nHopefully https://github.com/idaholab/moose/issues/9480 will be fully\nresolved by the fix. 
I see orders of magnitude speedup in our test\ncase.", "target": 1} {"commit_message": "[PATCH] checked in the wrong file last time, this is the really\n efficient version !", "target": 0} {"commit_message": "[PATCH] made nbnxn analytical Ewald consistent\n\nThe recent addition of nbnxn analytical Ewald kernels switched those\nkernels on for local interactions, but not for non-local domains.\nThis performance bug has been fixed.\n\nChange-Id: I28abc822ee8f1cf8f7dbb5c516703145400441b2", "target": 1} {"commit_message": "[PATCH] Update performance section", "target": 0} {"commit_message": "[PATCH] Rewrote the BoxManGatherEntries() code for handling\n AddGraphEntries() to improve performance when using the assumed partition.", "target": 1} {"commit_message": "[PATCH] Fixed g_sham using more than three dimensions\n\nWhen the value given with g_sham -n was greater than 3, arrays used to\noverflow and pick_minima() did not work. pick_minima() has been updated\nto treat an arbitrary number of dimensions, but retains the particular\ncode for the two- and three-dimensional cases in the hope that these\nare faster. The logic of the complex conditionals has hopefully been\nmade easier to follow without compromising performance with modern\ncompilers. 
Index variables are now of gmx_large_int_t type, as\nhigh-dimensional cases can have large numbers of grid points very fast.\n\nChange-Id: If0c2f9d9ceaf2b5c4c8b1a28a942fae8349fb600", "target": 0} {"commit_message": "[PATCH] updated traits class by replacing Side_of_hyperbolic_triangle\n with Side_of_oriented_hyperbolic_segment; added function\n side_of_hyperbolic_triangle to class Periodic_4_hyperbolic_triangulation;\n changed locate() function to more efficient version", "target": 1} {"commit_message": "[PATCH] PERF: improvements to element wise operations in CPU backend\n\n- Improved performance when all buffers can be indexed linearly", "target": 1} {"commit_message": "[PATCH] Fixed FAST edge assertions", "target": 0} {"commit_message": "[PATCH] Remove support for sparse writes in dense arrays. (#2504)\n\nSupporting sparse writes in dense arrays caused performance issues and\nsince customer use of the feature is infrequent, the decision was made\nto remove support.", "target": 0} {"commit_message": "[PATCH] slight performance improvement for 4x4 search", "target": 1} {"commit_message": "[PATCH] Read tiles: fixing preallocation size for var and validity\n buffers. (#2781)\n\nPreallocation size for var buffer and validity buffer was not using the\ncorrect size, which will have a performance impact.", "target": 0} {"commit_message": "[PATCH] Improve rendezvous performance when hardware is\n oversubscribed", "target": 1} {"commit_message": "[PATCH] foamDictionary: Added support for reading files as case\n IOdictionary in parallel\n\nIf the -case option is specified time is created from the case\nsystem/controlDict enabling support for parallel operation, e.g.\n\nmpirun -np 4 \\\n foamDictionary -case . 0/U -entry boundaryField.movingWall.value \\\n -set \"uniform (2 0 0)\" \\\n -parallel\n\nThis will read and modify the 0/U field file from the processor directories even\nif it is collated. 
To also write the 0/U file in collated format the collated\nfileHandler can be specified, e.g.\n\nmpirun -np 4 \\\n foamDictionary -case . 0/U -entry boundaryField.movingWall.value \\\n -set \"uniform (2 0 0)\" \\\n -fileHandler collated -parallel\n\nThis provides functionality for field manipulation equivalent to that provided\nby the deprecated changeDictionary utility but in a more flexible and efficient\nmanner and with the support of fileHandlers for collated parallel operation.", "target": 1} {"commit_message": "[PATCH] rhoTabulated, hTabulatedThermo, tabulatedTransport: New (p,\n T) tabulated thermophysical functions\n\nThis is a prototype implementation of (p, T) tabulated density, enthalpy,\nviscosity and thermal conductivity using a uniform table in pressure and\ntemperature for fast lookup and interpolation. The standard Newton method is\nused for h->T inversion which could be specifically optimised for this kind of\ntable in the future.", "target": 0} {"commit_message": "[PATCH] Initially assign spans of empty superblocks to block sizes\n and enable stealing of empty superblocks among block sizes.\n\nExpand block size superblock hint array to \"N\" values per block size\nto provide space for TBD superblock search optimizations.\n\nConstruct memory pool with min block, max block, and superblock size\nand introduce performance optimizations related to max vs. 
min block size.\n\nIssues #487, #320, #738, #215", "target": 1} {"commit_message": "[PATCH] better performance with hanging nodes", "target": 1} {"commit_message": "[PATCH] Allow useful CI to run in forks\n\n* Move the fast jobs with no dependencies to the first stage.\n* Remove the global KTH-specific job runner tag from jobs in the pre-build stage.\n* Use the `pre-build` stage as the dependency for all later stages, rather than the `simple-build` job, specifically.\n* Convert rule sets to new *rules* syntax.\n* Use '$CI_PROJECT_NAMESPACE == \"gromacs\"' to distinguish jobs created with access to GROMACS GitLab infrastructure.\n\nFixes #3458", "target": 0} {"commit_message": "[PATCH] dtrsm_kernel_LT_8x2_bulldozer.S performance optimization", "target": 1} {"commit_message": "[PATCH] Simplify handling of DD bonded distances\n\nTo simplify and clarify the DD setup code, we now always always store\nthe systemInfo.minCutoffForMultiBody and use a separate flag to tell\nif we should increase the cut-off distance for bonded communication.\nThere is a minor behavioral change in that with large domains and\nbonded communication filtering or DLB, the bonded cut-off is now\n5% of the bonded cut-off longer as the margin is now included.\nThis has a negligible effect on performance in all cases.\n\nChange-Id: Id409353c517181ac56e8d3f1f36c22c705aa8077", "target": 0} {"commit_message": "[PATCH] Switching to Simple_cartesian leads to a performance gain of\n 50% for 100K points", "target": 1} {"commit_message": "[PATCH] optimization, improve performance by more than 20%", "target": 1} {"commit_message": "[PATCH] Implemented reordering of loads and stores so that the real\n and imaginary part are loaded/stored together. This should improve\n out-of-cache performance in the presence of associativity conflicts, and\n maybe worsen in-cache performance because of worse scheduling. 
Enabled for\n now, for experimental purposes.", "target": 1} {"commit_message": "[PATCH] added fast return, if m or n < 1", "target": 1} {"commit_message": "[PATCH] Modify aligned address of sa and sb to improve the\n performance of multi-threads.", "target": 1} {"commit_message": "[PATCH] remove peak_memory_sizer that uses Taucs, slow computation\n and is not working on all platforms.\n\nBy default poisson now uses Eigen is available and Taucs otherwise", "target": 0} {"commit_message": "[PATCH] Replaced Vertex_handle and Cell_handle by const & versions \n in order to regain performance", "target": 1} {"commit_message": "[PATCH] Initial fast reciprocal space LJPME implementation, with\n test.", "target": 0} {"commit_message": "[PATCH] Even better performance figures in Poisson reconstruction\n through less pre-allocation in CGAL::Eigen_matrix", "target": 1} {"commit_message": "[PATCH] - Handle_for memory leak fixed : initialize_with() now\n assigns instead of constructing, so that it works correctly after\n Handle_for has been default constructed. There's a new way of constructing\n a Handle_for : Handle_for(TO_BE_USED_ONLY_WITH_CONSTRUCT_WITH) followed by \n construct_with(), which is supposed to produce more efficient code. 
\n Simple_handle_for also accepts it.", "target": 1} {"commit_message": "[PATCH] Add locks only for non-OPENMP multithreading\n\nto migitate performance problems caused by #1052 and #1299 as seen in #1461", "target": 1} {"commit_message": "[PATCH] Fixed FAST files to comply with new directory structure.", "target": 0} {"commit_message": "[PATCH] added while loops in pbc_dx to correct for multiple box\n vectors shifts, and added set_pbc_ss for efficient single shift pbc_dx", "target": 1} {"commit_message": "[PATCH] Switching to standard essential BC treatment for solver\n performance gain (thanks to Socratis for catching this!)", "target": 1} {"commit_message": "[PATCH] Replacement for pdb2gmx tests\n\nThis test directly asserts upon the .top and .gro files that are\nwritten out, using fragments of the\nregressiontests/complex/aminoacids/conf.gro because these cover all\nbasic amino acid types. It also adds testing for hydrogen vsites for\namber and charmm.\n\nWe now omit doing an energy minimization after the string checks,\nwhich was always a doubtful way to test pdb2gmx. These tests are still\ntoo slow to run with other pre- and post-submit testing, so a new\nCTest category has been made for them, and that category is excluded\nfrom Jenkins builds by default. Developers will still run these by\ndefault with \"make check\" or \"ctest\" but that should be fast enough on\na workstation. Later we can probably refactor them to use in-memory\nbuffers and be fast enough to put with the other tests.\n\nModified pdb2gmx to avoid writing fractional charges for every atom in\nthe [atoms] output, which isn't very useful for users and makes\nwriting tests more difficult.\n\nFixed unstable sorting of dihedrals whose parameters are strings that\nidentify macros.\n\nAdded new capability for refdata tests to filter out lines that vary\nat run time by supplying a regex that matches the lines to skip.\nThat's not ideal, but useful for now. 
Better would be to refactor\ntools so that e.g. header output can go to a different stream, but\nfirst we should have basic tests in place.\n\nAdded tests for Regex. Fixed minor bug in c++ stdlib regex\nimplementation of Regex. Noted the remaining reason why we have\nRegex supported by two different implementations.\n\nMinor updates to use compat::make_unique\n\nExtended functionality of CommandLine for convenient use.\n\nRefs #1587, #2566\n\nChange-Id: I6a4aeb1a4c460621ca89a0dc6177567fa09d9200", "target": 0} {"commit_message": "[PATCH] Serious performance bug in threaded code fixed. Now the main\n thread goes after the children are launched.", "target": 1} {"commit_message": "[PATCH] fixed slow memory reallocation, especially at the start of\n runs and for large xvg file by replacing the linear increment by a scaling\n with a factor of 1.19 and renamed the dd over_alloc to over_alloc_dd", "target": 0} {"commit_message": "[PATCH] Attempting to add diffusion term in a more efficient manner", "target": 1} {"commit_message": "[PATCH] Fix automated GPU ID assignment\n\nWhen we permitted separate PME-GPU ranks, we should have relaxed\nthis logic also.\n\nHowever, the performance in such cases is not very predictable, so if\nthere's a distribution of tasks with more than one task to a GPU that\nis uneven, then we should require the user to specify exactly what\nthey want. 
This also reinstates the 2016-era behaviour where, if\nrunning multiple PP ranks on GPUs, that mdrun will not by default\nproduce an unbalanced mapping with more than one task per GPU.\n\nChange-Id: I5b2fad317ecbb4e5e02fccd68e15350b678df34c", "target": 0} {"commit_message": "[PATCH] resolved performance degrading changed introduced in revision\n 1319 (2)", "target": 1} {"commit_message": "[PATCH] Template free-energy kernel on soft-core\n\nTemplated the free-energy kernel on the presence of soft-core\nand the soft-core r-power.\nNot doing the soft-core math when not using soft-core doubles\nthe speed of the free-energy kernel.\nTemplating for the soft-core power gives 15% performance improvement\nwith soft-core power 6.\nDouble precision r variables are only needed with r-power 48.\n\nChange-Id: I5a37307b2a83304a40343a0708afce46f4bdcf75", "target": 1} {"commit_message": "[PATCH] Ditch --enable-debug-malloc and --enable-debug-alignment\n\nWe wrote DEBUG_MALLOC in 1997 to debug memory leaks. Nowadays\nDEBUG_MALLOC is just confusing. Better tools are available, and\nDEBUG_MALLOC is not thread-safe and it does not respect SIMD\nalignment. It confused at least one user.\n\nIn the gcc-2.SOMETHING days, gcc would allocate doubles on the stack\nat 4-byte boundary (vs. 8) reducing performance by a factor of 3.\nThat's when we introduced --enable-debug-alignment, which is totally\nobsolete by now.", "target": 0} {"commit_message": "[PATCH] Switch to \"parallel_for\" for cell scan => Better performance,\n in particular with implicit function domain.", "target": 1} {"commit_message": "[PATCH] Improved performance of tranpose\n\n* Using int instead of dim_type\n* Unrolling loops using static consts\n* Using output dimensions", "target": 1} {"commit_message": "[PATCH] Refs JuliaLang/julia#5728. Fix gemv performance bug on\n Haswell Mac OSX.\n\nOn Mac OS X, it should use .align 4 (equal to .align 16 on Linux).\nI didn't get the performance benefit from .align. 
Thus, I deleted it.", "target": 1} {"commit_message": "[PATCH] Add CUDA bonded kernels\n\nCUDA bonded kernels are added for the most common bonded and LJ-14\ninteractions.\nThe default auto settings of mdrun offloads these interactions\nto the GPU when possible.\nCurrently these interactions are computed in the local or non-local\nnbnxn non-bonded streams. We should consider using a separate stream.\nThis change uses synchronous transfers. A child change will change\nthese to asynchronous.\n\nUpdated release notes and performance guide.\n\nFixes #2678\nRefs #2675\n\nChange-Id: Ifc6d97854cc7afa8526602942ec3b1712ba45bac", "target": 0} {"commit_message": "[PATCH] Performance bug fix.", "target": 1} {"commit_message": "[PATCH] Performance and thread-safety requires a lock around each\n constraint row acquisition, not just each constraint row entry.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@5893 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 0} {"commit_message": "[PATCH] Add a single-tree depth-first traverser. It is not as fast\n as it could be.", "target": 0} {"commit_message": "[PATCH] rewrite of T double kernels to improve performance with Intel\n 16", "target": 1} {"commit_message": "[PATCH] lagrangian: Rationalized the handling of multi-component\n liquids and solids Ensures consistency between the mixture thermodynamics and\n composition specifications for the parcels. Simpler more efficient\n implementation. Resolves bug-report\n http://www.openfoam.org/mantisbt/view.php?id=1395 as well as other\n consistency issues not yet reported.", "target": 1} {"commit_message": "[PATCH] HvD: Committing this data mainly for future record. The\n timings in these outputs will be reported in the paper as a demonstration of\n the performance difference between different implementations of the density\n functionals. The automatic differentiation timings reported here were\n generated using the intrinsic POPCNT, LEADZ and TRAILZ functions. 
At the\n moment the code does not use those by default as they are Fortran 2008\n intrinsics that are not supported by every compiler yet. However, using these\n intrinsics speeds the automatic differentiation code up by a factor of about\n 2.5 and it not generate a fair representation of the automatic\n differentiation technique not to use the intrinsics.", "target": 0} {"commit_message": "[PATCH] ODESolvers: Add support for efficient ODE solver resizing\n\nNote: this reuses the existing storage rather than costly reallocation\nwhich requires the initial allocation to be sufficient for the largest\nsize the ODE system might have. Attempt to set a size larger than the\ninitial size is a fatal error.", "target": 0} {"commit_message": "[PATCH] Stopping criterion: improve the performance one last time by\n using a kernel for the boolean initialization instead of a synchronous copy.", "target": 1} {"commit_message": "[PATCH] Update RAS-IR.\n\nNotes:\n1. Overlap and row ordering have significant effects on convergence.\n2. Lower the matrix bandwidth, better the performance ?\n3. Sync is lower, but communication becomes higher, tradeoff!\n4. Parallelism is higher, but overhead is also higher, tradeoff!\n5. Very high accurate solves, do not provide anything.\n6. Two level and multi-levels might significantly reduce number of\niterations to converge.", "target": 1} {"commit_message": "[PATCH] ORourkeCollision: Corrected bugs and added more efficient\n collision detection See http://bugs.openfoam.org/view.php?id=2097", "target": 1} {"commit_message": "[PATCH] Accelerate the distracted sequence recall test.", "target": 0} {"commit_message": "[PATCH] Undo deprecation of BoundaryInfo::add_side().\n\n* Remove new version of add_side() that takes a reference to a std::set.\n We are now ensuring that the entries in the input vector are unique\n while storing them in the BoundaryInfo object.\n\n* Refactor MeshTools::Modification::change_boundary_id(). 
This\n function used to call boundary_ids() for every side, edge, and node\n in the mesh, but it should be more efficient to only call that for\n sides, edges, and nodes that have boundary ids on them.", "target": 1} {"commit_message": "[PATCH] Attempt to accelerate the test a little bit.", "target": 0} {"commit_message": "[PATCH] Add initial reinit_func(1.) call in Euler2Solver\n\nWe need the call to reinit_func to set the time, t, in the context\nto the correct value. Also added clarifying comments in EulerSolver\nand Euler2Solver that we're also setting the time in addition to\npossibly resetting the mesh if there's mesh motion.\n\nThere is probably a way to make this more efficient such that we\nonly call reinit_func twice in Euler2Solver, but I didn't put any\nthought into it.", "target": 1} {"commit_message": "[PATCH] CDGW now working; slow", "target": 0} {"commit_message": "[PATCH] Fix nblib pairlist update function\n\nPreviously the function put_atoms_in_box was called only by the\nnblib integrator, or when constructing a system via a SimulationState\nobject. In the case of the integrator, this is a needless performance\ndegradation. When using an nb force calculator without\nfirst putting atoms in the box, this could lead to a cryptic\nerror from nbnxm grid search failure. Both of these issues are\nrectified with this change, which also adds a member to the\nnon-bonded force calculator to hold the requested number of OMP\nthreads to use in a call to put_atoms_in_box_omp, which is faster\nthan the non OMP version.", "target": 0} {"commit_message": "[PATCH] Parallelize s3 multipart on disconnect\n\nOn disconnect of S3 we can parallelize the marking of multi-part uploads\nas complete. 
This allows us to remove the exclusive lock early and\nincrease performance if we have a large number of outstanding requests.", "target": 1} {"commit_message": "[PATCH] Keep track of old dof_indices distribution between\n processors; this makes it easier to construct an efficient send_list in\n System::project_vector.", "target": 0} {"commit_message": "[PATCH] Fast code now computes confidence bands", "target": 0} {"commit_message": "[PATCH] Preload Tile Offsets (#1795)\n\nParticularly for cloud storage backends, loading tile offsets is a performance\nsensitive path. Sequential reads perform much better than random reads because\nthey make better use of the read-ahead cache. This patch aims to increase\nthe number of sequential reads when loading tile offsets.\n\nCurrently, tile offsets are loaded in the following path:\nfor each attribute:\n read_tiles\n parallel_for each fragment:\n load_tile_offsets\n parallel_for each fragment:\n load_var_tile_offsets\n\nThis patch refactors it to:\nparallel_for each fragment:\n for each attribute:\n load_tile_offsets\n load_var_tile_offsets\nfor each attribute:\n read_tiles\n\nBy inverting the order in which we iterate fragments and attributes, we can\nsort the attributes by their index in the fragment metadaa file. By loading\nattributes in ascending order of their offsets, we ensure a sequential read\nto maximum hits in the read-ahead cache.\n\nAdditionally, this defers loading var offsets until all fixed offsets have\nbeen loaded. This is because the fixed offsets exist before the var size\noffsets in the file format: https://github.com/TileDB-Inc/TileDB/blob/dev/format_spec/fragment.md\n\nMost importantly, this is a pre-requisite to parallelizing the read_tiles()\nfor each attribute. 
When read_tiles are parallel, we can't control the order\nthat they are executed and therefore load tiles, which may reduce hits in\nthe read-ahead.\n\nCo-authored-by: Joe Maley ", "target": 0} {"commit_message": "[PATCH] Performance improvements for find_*_neighbors, and a new\n find_point_neighbors version for finding neighbors at just one point", "target": 1} {"commit_message": "[PATCH] working, but slow", "target": 0} {"commit_message": "[PATCH] Replaced LU by QR decomposition.\n\nColPivHouseholderQR is, according to the manual both fast and rank\nrevealing.\n\nRelated to #59.", "target": 1} {"commit_message": "[PATCH] Added Grsm_ggm_sym_dot subroutine for more efficient\n calculation of product that result in symmetry matrices...EJB", "target": 1} {"commit_message": "[PATCH] Fixed FAST C++ API, added proper destructor calls.", "target": 1} {"commit_message": "[PATCH] JN: ELAN_ACC for fast accumulate", "target": 1} {"commit_message": "[PATCH] Performance updates...EJB", "target": 0} {"commit_message": "[PATCH] performance slightly improved", "target": 1} {"commit_message": "[PATCH] Fix Wundef warnings\n\nAlso fixes a performance bug in gmx_simd_invsqrt_pair_d. 
Previuosly it\ndid unnecessary number of iterations because it used an non-existing\npreprocessor variable.\n\nChange-Id: Idcdf3872b5a169e8690721bbe83922a4ab280da8", "target": 1} {"commit_message": "[PATCH] Attempting to accelerate the build by only including most of\n the internal copy:: headers when necessary", "target": 0} {"commit_message": "[PATCH] nbnxn utils performance improvement for Phi\n\nAlso remove usage of unpack to load half/quarter aligned data, because\nin case of misaligned data, instead of SegF it only loaded partial data.\n\nChange-Id: Ib0f7807986e6fcbe998bd6ee41ce104666446321", "target": 1} {"commit_message": "[PATCH] Improving the performance, 0-10% slower than Triangulation_2,\n not removing the initial vertices", "target": 1} {"commit_message": "[PATCH] perf: minor performance improvements for bilateral", "target": 1} {"commit_message": "[PATCH] follow up of 2a71e019: VC performance warning", "target": 0} {"commit_message": "[PATCH] Modifying a couple paramaters in the \"POWER10\"-specific\n section of param.h, for performance enhancements for SGEMM and DGEMM.", "target": 1} {"commit_message": "[PATCH] PERFFIX: improved 2d convolve perf in cuda by 33%\n\n* templating cuda kernel for filter lengths increased\n performance by 30% which is 93% of closed-source ArrayFire\n implementation of 2d convolution\n* templating separable cuda kernel improved performance by 20%\n* separated separable convolution kernel and wrapper into their\n own file to speed up compilation time", "target": 1} {"commit_message": "[PATCH] performance timing...EJB", "target": 0} {"commit_message": "[PATCH] On behalf of Ryan Olson: Checking in the changes for server\n side registration to improve performance", "target": 1} {"commit_message": "[PATCH] Modified vector calls to improve performance when copy data\n to same processor.", "target": 1} {"commit_message": "[PATCH] Disks are too fast, fix formatting of speed as *****", "target": 0} {"commit_message": "[PATCH] Adding 
logic for detecting whether or not a Mac must link\n vecLib or Accelerate", "target": 0} {"commit_message": "[PATCH] Fixed typo in HegstRLVar3/HegstRUVar3 and improved\n performance of Trmm and Symm for relatively small numbers of right-hand\n sides.", "target": 1} {"commit_message": "[PATCH] hypre's GPU SpGemm (#433)\n\nThis PR improves the performance of hypre's sparse matrix-matrix on NVIDIA GPUs, and fixes it on AMD GPUs with hip.\n\nCo-authored-by: Ruipeng Li \nCo-authored-by: Paul T. Bauman ", "target": 1} {"commit_message": "[PATCH] Accelerate \"Checking if non utf-8 characters are used\"", "target": 1} {"commit_message": "[PATCH] Always more efficient Face Partial Assembly Kernels. Lot of\n simplifications in the design of Domain Kernels. Remove inefficient Kernels\n Based on Eigen.", "target": 1} {"commit_message": "[PATCH] In debug mode it makes no sense to run a performance test", "target": 0} {"commit_message": "[PATCH] AABB tree: update performance section with more details about\n memory occupancy (table here is better than a curve as the memory grows\n linearly)", "target": 0} {"commit_message": "[PATCH] Modified version of cgal_test_with_cmake which: - is\n cross-platform Unix/make and Cygwin/VisualC++ - concats all log files to\n cgal_test_with_cmake.log - does not clean up object files and executables\n (too slow when called by developer)", "target": 0} {"commit_message": "[PATCH] More efficient IntegratorInternal::getDerivative #936", "target": 1} {"commit_message": "[PATCH] Improved performance of building neighbor list on AMD GPUs", "target": 1} {"commit_message": "[PATCH] rocSPARSE does not require sorted columns for csrgemm\n\nThis is specific to the rocSPARSE CSR implementation of SpGEMM.\nBut this is a substantial performance savings.", "target": 1} {"commit_message": "[PATCH] Quadrature rule fixes for >double precision\n\nMake sure calculations are done in Real precision where possible; fall\nback on less-efficient-but-more-accurate 
defaults where the more\nefficient cases are tabulated in double precision.", "target": 0} {"commit_message": "[PATCH] Add debug output; don't adjust second bound. This provides\n another minor speedup, but this still is nowhere near as fast as it should be\n with a properly working Hamerly prune.", "target": 0} {"commit_message": "[PATCH] More efficient ordering of constrained DoF index sets", "target": 1} {"commit_message": "[PATCH] * modified termination policies * fast SVDBatch\n implementation", "target": 1} {"commit_message": "[PATCH] Set default GMX_OPENMP_MAX_THREADS to 64\n\nAs there are many new CPU with more than 32 hardware threads and\nGROMACS scales quite well to more than 32 threads,\nGMX_OPENMP_MAX_THREADS is increased from 32 to 64 threads.\nThe performance impact of this is that bitmasks are by default\n64-bit instead of 32-but integers, which on 64-bit systems should\nonly have a (negligible) effect on cache pressure.\n\nChange-Id: I73d1c79e86f30f7fc69e1f49e1195271435e77b6", "target": 1} {"commit_message": "[PATCH] Introduced python typemaps for std::vector,\n std::vector.\n\nWas having difficulties with 5-argument DMatrix constructor and numpy. Does seem to slow down compilation and linking significantly (factor 2?).", "target": 0} {"commit_message": "[PATCH] Performance optimization of Tokenizer\n\nReduces string allocations and removes std::vector from Tokenizer\nMost processing now happens on-demand.", "target": 1} {"commit_message": "[PATCH] More efficient SetNonzerosSlice::evaluate", "target": 1} {"commit_message": "[PATCH] Fix performance bug when LDC is a multiple of 1024", "target": 1} {"commit_message": "[PATCH] #1285 Refactred substituteInPlace. 
Now more efficient and has\n same signature for SX and MX.", "target": 1} {"commit_message": "[PATCH] Policy performance tests: Added test and sample scripts\n\nThis commit address Github issue #737\nRangePolicy and TeamPolicy tests with nested parallelism added for\nbenchmarking performance - e.g. compare master vs develop\n\npolicy_performance: add functor for parallel_scan", "target": 0} {"commit_message": "[PATCH] MDRange: Refactored HostIterateTile to use macros\n\nRemoves the recursive way for nesting the for loops which may hinder\nchances at vectorization during iteration over tiles.\nPerformance test revised to perform a stencil-like operation", "target": 0} {"commit_message": "[PATCH] Added allow_rules_with_negative_weights flag to QBase. \n Default is true (which was the standard behavior) but you can set this to\n false to use more expensive (but potentially safer) quadrature rules instead.\n\nReplaced the 15-point tet Gauss quadrature rule with a 14-point rule\nby Walkington of equivalent order.\n\nAdded Dunavant quadrature rules for triangles up to THIRTEENTH order. These\nare more efficient than the conical product rules they are replacing. Up to\nTWENTIETH order still to come.\n\nReplaced SECOND-order rule for triangles with a rule having interior integration\npoints. The previous rule had points on the boundary of the reference element.", "target": 1} {"commit_message": "[PATCH] Fix input ndims validation in fast,orb,sift", "target": 0} {"commit_message": "[PATCH] Remove OpenMP compile flag in CUDA backend\n\nThis flag isn't needed based on recent tests. 
If it is causing any\nperformance regression, it will be reverted and the following\nflag to disable two-phase lookup for cuda backend on windows will\nbe added back.\n\n/permissive flag does not work with two-phase-lookup enabled\nfor projects with openmp support enabled.", "target": 0} {"commit_message": "[PATCH] Minor code reordering in GPU kernels\n\nUpdating bCalcFshift just before use instead at the top of the kernel\nimproves performance by 1-2% on CUDA. This also improves readability.\nMaking specialized (no)shift kernels will only add 1% gain.\nAlso updated the OpenCL kernels for consistency and readability\n(the perfromance impact is negligible with current hardware/compiler).\n\nChange-Id: I309f90ad61e5815726d55254e2cd38d5e4e7662d", "target": 1} {"commit_message": "[PATCH 1/9] Update all versions to v1.1.1.", "target": 0} {"commit_message": "[PATCH] Update performance test case (to use polar_decomposition)\n Make polar default closest rotation computer", "target": 0} {"commit_message": "[PATCH] add collisions, split vectors into components for performance", "target": 1} {"commit_message": "[PATCH] Code factorization + \"manual\" min/max for slightly better\n performance", "target": 1} {"commit_message": "[PATCH] Added FAST feature detector example", "target": 0} {"commit_message": "[PATCH] potentially improved performance of coarsening and\n interpolation by using different Commpkg for strength matrix S. Added a new\n parameter S_commpkg_switch which sets the smallest strength threshold, for\n which this capability is used. 
This required the addition of a new parameter\n (int array that maps S-indices to A-indices) to the interpolation routine.\n Note that while this change does not affect Falgout, CLJP, PMIS and HMIS\n convergence behaviour and complexities, it affects ruge, ruge2b and ruge3c.\n This can be avoided by setting S_commpkg_switch to 1.", "target": 1} {"commit_message": "[PATCH] Code cleanup in swapcoords.cpp\n\nFor ion/water position exchanges with DD, the positions of the ion group,\nthe split group 0 and the split group 1 are assembled into an array known\non all processors (g->xc). Only if ion/water exchanges need to be done, the\npositions of the solvent group need to be assembled as well.\n\nBefore this patch, the group index ran from 0 to eGrpNr, so therefore also\nthe solvent group positions were assembled in every swap step. This was\nsuperfluous since they would be assembled again if bSwap is TRUE.\n\nIf the swap protocol is called very frequently (nstswap << 100), the\nperformance impact is now smaller.\n\nChange-Id: I8eff6bbd33810d6641ec97aa00a537fe782214d3", "target": 0} {"commit_message": "[PATCH] Near final release version. Performance for 300^3 is 4.64\n 1.96 3.13s on 1,4 and 20 ranks. 
Upto 4% variation if temps above 40C on GPU", "target": 0} {"commit_message": "[PATCH] Bring performance estimation up to date\n\nThe performance estimation code for estimating the PME/PP load\nand the optimal DD grid setup used outdated numbers.\nWe now estimate using actual cycle counts on Haswell and esimate\nfor other architectures through a scaling factor that takes into\naccount the SIMD width and FMA.\nThe DD grid automation now ignores PBC cost for exclusions with\nthe Verlet scheme and the for angles and dihedrals with SIMD.\n\nThe effect of this is a more reliable PME load estimate that's\nnow a factor 1.4 to 1.7 higher on Haswell.\nThe DD grid automation will now often choose a setup that better\nmatches the PME `decomposition and reduce the PME redist cost.\n\nChange-Id: I5daa6a6856f2b09ba6d17fda0eea800b816d21e4", "target": 0} {"commit_message": "[PATCH] Performance improvments to CPU Anisotropic Diffusion (#2174)\n\n* Performance improvments to CPU Anisotropic Diffusion\n\n(cherry picked from commit a4713f1aa102ad693129086bfdb9aa2a9d2fb1f7)", "target": 1} {"commit_message": "[PATCH] Made some performance improvements and fixed a bug when\n running on a single processor but compiled with mpi.", "target": 1} {"commit_message": "[PATCH] changed the performance figure note: this version can test\n the parallel performance", "target": 0} {"commit_message": "[PATCH] Updated article reference. Added configuration flags to\n support altivec on powerpc. Updated configuration files from gnu.org Removed\n the fast x86 truncation when double precision was used - it would just crash\n since the registers are 32bit only. Added powerpc altivec innerloop support\n to fnbf.c Removed x86 assembly truncation when double precision is used - the\n registers can only handle 32 bits. Dont use general solvent loops when we\n have altivec support in the normal ones. 
Added workaround for clash with\n altivec keyword.", "target": 0} {"commit_message": "[PATCH] The test was too slow in Debug mode", "target": 0} {"commit_message": "[PATCH] macos accelerate does not contain dcombossq", "target": 0} {"commit_message": "[PATCH] replaced std::endl with \\n in all file IO and stringstreams. \n std::endl forces a flush, which kills performance on some machines\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@834 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 0} {"commit_message": "[PATCH] Refactor threading model (#1766)\n\n* Refactor threading model\n\nRemoves:\n- global_tp_ (the replacement TBB thread pool)\n- StorageManager::async_thread_pool_\n- StorageManager::reader_thread_pool_\n- StorageManager::writer_thread_pool_\n- VFS::thread_pool_\n\nAdds:\n- StorageManager::compute_tp_\n- StorageManager::io_tp_\n\nUsage changes:\n1. Our three parallel functions (`parallel_sort`, `parallel_for[_2d]`) now use\n the `StorageManager::compute_tp_`.\n2. Both the `Reader::read_tiles()` and `Writer::write_tiles()` now execute on\n `StorageManager::io_tp_`.\n3. The VFS is now initalized with a thread pool, where the storage manager\n initializes it with the `StorageManager::io_tp_`. This means that both the\n VFS and Reader/Writer io paths execute on the same thread pool. There was\n previously a deadlock scenario if both used the same thread pool, but that\n is no longer an issue now that the threadpools are recursive.\n4. The async queries are executed on `StorageManager::compute_tp_`.\n\nConfig changes:\n- Adds configuration parameters for the compute and IO thread pool \"concurrency\n levels\". A level of \"1\" is serial execution while all other levels have a\n maximum concurrency of N but allocate N-1 OS threads.\n- Deprecate the async/reader/writer/vfs thread num configurations. 
If any of\n these are set and larger than the new \"sm.compute_concurrency_level\" and\n \"sm.io_concurrency_level\", the old values will be used instead. The motiviation\n is so that existing users will not see a drop in performance if they are\n currently using larger-than-default values.\n\n* Recursive ThreadPool::execute() (#1772)\n\nCurrently, we break recursive deadlock in the ThreadPool:wait*() routines. This\nworks well for the type of \"execute-and-wait\" model we use. For instance:\n```\nThreadPool tp;\nauto task = tp.execute(...);\ntp.wait_all(task);\n```\n\nWe are currently unable to break recursive deadlock if the threadpool user does\nnot use our \"wait\" routine. For instance:\n```\ncondition_variable cv;\nauto task = tp.execute([&]() {\n cv.signal_all();\n});\ncv.wait(...);\n```\n\nThe S3 client uses the above style of synchronization. With our compute/io\nthreadpool refactor, we encounter recursive deadlock. This patch allows breaking\nrecursive deadlock on the call to ThreadPool::execute().\n\nWith this patch, ThreadPool::execute() checks if 1) the calling thread belongs\nto the thread pool instance and 2) all other threads are non-idle.\n\nCo-authored-by: Joe Maley \n\nCo-authored-by: Joe Maley ", "target": 0} {"commit_message": "[PATCH] Use more efficient iterators in MeshCommunication", "target": 1} {"commit_message": "[PATCH] More performance tweak in Monte Carlo sampling.", "target": 1} {"commit_message": "[PATCH] Use a map for pushed_ids\n\nWe won't be handing this to parallel_sync but we want it to be more\nefficient in the large processor count case anyway.", "target": 1} {"commit_message": "[PATCH] Remove RangeSet, DenseIntMap, and fast allocation routines. 
\n They were not used anywhere.", "target": 0} {"commit_message": "[PATCH] Adding Changelog for Release 3.4.01\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 3.4", "target": 0} {"commit_message": "[PATCH] new, more efficient jacobian calculation for integrator", "target": 1} {"commit_message": "[PATCH] Add temporary allocations to PCG to avoid continual\n allocation / freeing. This restores performance to regular CG level", "target": 1} {"commit_message": "[PATCH] Fix performance regression in KOKKOS package", "target": 0} {"commit_message": "[PATCH] Added DefaultInitializationAllocator\n\nAdded an allocator that can be used to default initialize elements\nof a std::vector on resize(). This is useful to avoid initialization\nin performance critical code.\n\nChange-Id: I65bd52a760c68c73555e8bb9e017de353a6e9a81", "target": 0} {"commit_message": "[PATCH] avoid casts to the wrong derived class, which upsets code\n analysis tools. seems to improve performance, too.", "target": 1} {"commit_message": "[PATCH] 128-bit AVX2 SIMD support\n\nAdd 128 bit support for AVX2. Similar to AVX-128, this\nimproves slightly on SSE2 due to more efficient instructions,\nand the shorter SIMD width is beneficial in some cases. Both\n128- and 256-bit flavors will be built automatically with\n--enable-avx2, and the timing routines will chose the best one\nautomatically.", "target": 1} {"commit_message": "[PATCH] Add size optimization to HashedMap\n\nThe table size in HashedMap is now optimized when calling clear()\nusing the old number of keys. 
Also the number of keys is now set\nto a power of 2, so we can use bit masking instead of modulo.\nThe bit masking allows for negative keys, which is also tested.\n\nThis is preparation for replacing gmx_hash_t with HashedMap,\nbut also improves performance for gmx_ga2la_t.\n\nChange-Id: I90c5a602cb7e213eb6d2e8259a0effc4fd7c4e14", "target": 1} {"commit_message": "[PATCH] changed the redistribution to x/y/z only moves, which\n improves the performance for especially 3D decomposition", "target": 1} {"commit_message": "[PATCH] ring non-axis parallel segment in pps subcase\n\nThis is only partially implemented. Still, it improves the\nperformance of the norway.cin benchmark.\n\nSigned-off-by: Panagiotis Cheilaris ", "target": 0} {"commit_message": "[PATCH] fixed performance print when run is terminated or with\n minimizers", "target": 0} {"commit_message": "[PATCH] Update performance tables with more details (memory, etc.)", "target": 0} {"commit_message": "[PATCH] bug fixes in double_conic, minor performance improvements", "target": 1} {"commit_message": "[PATCH] New bench for comparing sweep performance", "target": 0} {"commit_message": "[PATCH] Use a separate Performance log line for compute_affine_map", "target": 0} {"commit_message": "[PATCH] More extensive performance logging of the Kelly Error\n Estimator.", "target": 0} {"commit_message": "[PATCH] HvD: Moved the printing routine of the NWAD library out from\n nwxc.F to a separate file nwxc_print.F. The performance of the expression\n printing is not performance critical but when the routines are included in\n nwxc.F they become candidates for inlining. This causes the compiler to spend\n effort on inlining and optimizing these printing routines which is effort\n wasted. 
By moving these routines into a file of their own they are not\n getting inlined and the compiler can spend its time on optimizing the actual\n compute routines.", "target": 0} {"commit_message": "[PATCH] Replace SIMD copy patch routine to obtain better parallel\n performance", "target": 1} {"commit_message": "[PATCH] ranks/device changed from 2 to 1. Value of 2 doubles memory\n usage (and hits MPOSS bugs) but does not help with performance", "target": 0} {"commit_message": "[PATCH] Jan 29 1999\tCalls to Robert's fast esp routines", "target": 0} {"commit_message": "[PATCH] Converted lots of ints to unsigned ints (might help\n performance a little by avoiding conversions)", "target": 1} {"commit_message": "[PATCH] Unroll middle jm loop in the nbnxm kernels on Ampere\n\nThe unrolling improves performance of the non-bonded kernels by up to\n12%.\n\nNote: cherry-picked backport, skip when merging.\n\nRefs #3873", "target": 1} {"commit_message": "[PATCH] Iterative version of the incident_...(vertex) methods.\n\nSee Andreas' e-mail:\n\n> I just had a look at the code. The problem is that it calls\n> incident_cells, which is implemented recursively, and for a\n> vertex with many incident cells, as in your case the infinite\n> vertex, the stack is full.\n>\n> We have to put it on our todo list.\n\nI did it for 3D only because the degenerate 2D case should be\nhandled by the circulator anyway.\n\nI did not add the test which explodes the call stack (in case we plug\nthe recursive version): too slow for a testsuite. 
But incident_...\nmethods are used everywhere in the code anyway.", "target": 0} {"commit_message": "[PATCH] Fixed performance bug in mixed precision", "target": 1} {"commit_message": "[PATCH] - added performance note to solving functions doc - changed\n unbounded direction w so that x + tw is the unbounded ray - aded certificate\n iterators to QP_solution - added example programs that demonstrate the\n certificates - fixed examples so that 2D instead of D is given", "target": 0} {"commit_message": "[PATCH] Implement a suggestion from cppcheck.\n\nUsing vector::empty() to check for emptiness, instead of vector::size()\n== 0 may be better for performance, since it's guaranteed, to be 0(1).\nRelated to #30.", "target": 1} {"commit_message": "[PATCH] - Added the ability to add points on a sphere outside the\n domain in the sequential case => better performance for the fandisk model\n (x2). I'm still wondering why... - Code refactoring/clean-up", "target": 1} {"commit_message": "[PATCH] Call to change_notf->update_all_faces(result, a1, a2), which\n is the notifier function updating all the faces features. 
This was added\n since the Map overlay should use from now on the Post precessing notifier,\n rather then the In processing notifier, since the former is more efficient\n than the latter.", "target": 1} {"commit_message": "[PATCH] Working on the performance tests", "target": 0} {"commit_message": "[PATCH] Add example for distributed spmv scaling performance", "target": 0} {"commit_message": "[PATCH] fixed SD and BD integrator OpenMP performance\n\nSD and BD integrator always integrated single threaded.\nReally fixes #1121\n\nChange-Id: I2217c40e9c188c7cd57801e413750035c6488f56", "target": 1} {"commit_message": "[PATCH] Test show an improvement in octree performance", "target": 1} {"commit_message": "[PATCH] Changed FFTW warning from AVX to no SSE\n\nChanged the cmake FFTW SIMD check warning from complaining about\nAVX to complaining about missing SSE or SSE2.\nWith FFTW 3.3.4 the performance of FFTW with both SSE and AVX enabled\nis often a bit better and never much worse than SSE along. Newer\nIntel processors probably also perform better with AVX with FFTW 3.3.3\nso we should not complain about the combination of SSE(2) and AVX,\nbut only when SSE is missing.\n\nChange-Id: I3665a35ec98616f015d05e314c8fbb80a8862092", "target": 0} {"commit_message": "[PATCH] Improved CUDA non-bonded kernel performance\n\nSome old tweak which was supposed to improve performance had in fact\nthe opposite effect. 
Removing this tweak and with it eliminating\nshared memory bank conflicts it caused improved performance by up\nto 2.5% in the force-only CUDA kernel.\n\nChange-Id: I7fcb24defed2c68627457522c39805afc83b3276", "target": 1} {"commit_message": "[PATCH] use integer and reduce the number of tests\n\nleda_rational is not automatically doing gcd calls so Quotient\nis faster for our applications.\nThe test is still slow with EPECK", "target": 0} {"commit_message": "[PATCH] Improve the performance of BoundingBox::contains_point by\n marking is_between as an inline function\n\n3.28s 130: if (bboxes[i_from].contains_point(*node + _to_positions[i_to]))\n\n520ms 130: if (bboxes[i_from].contains_point(*node + _to_positions[i_to]))", "target": 1} {"commit_message": "[PATCH] Remove mdrun -testverlet\n\nThis was only intended for quick performance testing of old .tpr files\nduring the transition period. The window where that was useful has\npassed, and ongoing abuse of it has been observed. There is no need to\npreserve this until the formal removal of the group scheme.\n\nFixes #1424\n\nChange-Id: I589a8e316beeba6819cd01d9655bfc069bcbb174", "target": 0} {"commit_message": "[PATCH] performance update ...EJB", "target": 0} {"commit_message": "[PATCH] Added ODE diagnostics to FixRxKokkos using Kokkos managed\n data.\n\n- Added the diagnostics performance analysis routine to FixRxKokkos\n using Kokkos views.\nTODO:\n - Switch to using Kokkos data for the per-iteration scratch data.\n How to allocate only enouch for each work-unit and not all\n iterations? 
Can the shared-memory scratch memory work for this,\n even for large sizes?", "target": 0} {"commit_message": "[PATCH] Finished initial implementation for arbitrary sizes atomic\n operations\n\nThis implements atomic operations for arbitrarily sized objects.\nThe implementation uses a hash table approach, where a lock is set\nbased on a hash of the memory address of the object for which an\natomic operation should be performed.\nInitial performance results indicate that it is comparable to\nother non-native atomics (i.e. CAS loops with casting to integer\ntypes).\nThe commit implements the full set of supported atomics from\nKokkos and it works in all currently existing execution spaces.\nThe hashtables are static sized global arrays.\n\nNote: this commit requires relocatable-device-code being\nenabled for Cuda.", "target": 0} {"commit_message": "[PATCH] DynRankView: operator() performance improvements\n\nDebug macros added to check active memory space, rank, and bounds\nDynRankView: Simple performance test added\n compare performance to View and rank 7 View", "target": 1} {"commit_message": "[PATCH] moved loading of harmonic bonds to the top level, more\n efficient this way", "target": 1} {"commit_message": "[PATCH] Optimizations to kspace_style MSM, including improving the\n single-core performance and increasing the parallel scalability. A bug in MSM\n for mixed periodic and non-periodic boundary conditions was also fixed.\n\ngit-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@9597 f3b2605a-c512-4ea7-a41b-209d697bcdaa", "target": 1} {"commit_message": "[PATCH] Adjust serialized query buffer sizes (#2115)\n\nAdjust serialized query buffer sizes\n\nThis change the client/server flow to always send the server the\noriginal user requested buffer sizes. This solves a bug in which with\nserialized queries incompletes would cause the \"server\" to use smaller\nbuffers for each iteration of the incomplete query. 
This yield\ndecreasing performance as the buffers approached zero. The fix here lets\nthe server always get the original user's buffer size.", "target": 1} {"commit_message": "[PATCH] Deprecate attempts to build proxy Side objects\n\nNow that we've made Elem side objects much more efficient, we don't need\nthe old proxies anymore.", "target": 1} {"commit_message": "[PATCH] added early quit option to accelerate distance vs user\n defined distance check", "target": 1} {"commit_message": "[PATCH] Guard performance test which uses Lambda dispatch #821", "target": 0} {"commit_message": "[PATCH] Optimize the performance of daxpy by using universal\n intrinsics", "target": 1} {"commit_message": "[PATCH] Serial-only atomics implementation\n\n[#607] [#549]\nA few details:\n - Accepting volatile pointers was necessary\n for compatibility with existing calls which\n pass in volatile pointers, hence the const_cast\n - Special implementations of atomic_increment\n were needed to get equal performance in the\n one application I tested (it was doing its\n own serial special cases before).\n - Compilers have a harder time matching templates\n as opposed to overloads, so some call sites\n had to be modified to specify the scalar\n type explicitly", "target": 0} {"commit_message": "[PATCH] s390x: Use new sgemm kernel also for strmm on Z14 and newer\n\nEmploy the newly added GEMM kernel also for STRMM on Z14. The\nimplementation in C with vector intrinsics exploits FP32 SIMD operations\nand thereby gains performance over the existing assembly code. Extend\nthe implementation for handling triangular matrix multiplication,\naccordingly. 
As added benefit, the more flexible C code enables us to\nadjust register blocking in the subsequent commit.\n\nTested via make -C test / ctest / utest and by a couple of additional\nunit tests that exercise blocking.\n\nSigned-off-by: Marius Hillenbrand ", "target": 0} {"commit_message": "[PATCH] Added parentheses to performance logging messages\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1518 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 0} {"commit_message": "[PATCH] Improve object_type detection performance (#2792)\n\nThis improves the object APIs performance for detecting types by\nswitching from listing all items in the URI to checking only for the\nexistence of the group indicator, array schema file or array schema\nfolder. We also switch the order to check for the array schema folder\nfirst, since it is most likely to exist based on the assumption that\nthere are more arrays than there are groups.", "target": 1} {"commit_message": "[PATCH] Remove use of omp 5.0 feature\n\nomp_init_lock_with_hint is an OpenMP 5.0 feature.\nThis is not significant from a performance perspective.", "target": 0} {"commit_message": "[PATCH] Sparse refactored readers: disable filtered buffer tile\n cache. (#2702)\n\nFrom tests, it's been found that writing the cache for the filter\npipeline takes a significant amount of time for the tile unfiltering\noperation. For example, 2.25 seconds with and 1.88 seconds without in\nsome cases. 
The cache improved performance before multi-range subarrays\nwere implemented, so dropping it is fine at least for the refactored\nreaders.", "target": 1} {"commit_message": "[PATCH] CUDA nb kernel performance improvement for CUDA 4.1\n\nThe manual unrolling of the jm4 loop improves somewhat the performance\nof the nobonded CUDA kernels, but there is still a 5-7% performance\nregression with CUDA 4.1 compared to 3.2/4.0.", "target": 1} {"commit_message": "[PATCH] avoid computing factorization of linear matrix again, clean,\n use more efficient code by avoiding duplicates when calculating stiffness\n matrix", "target": 1} {"commit_message": "[PATCH] fix internal/external OpenMP thread affinity clash\n\nThread affinity set by the OpenMP library, either automatically or\nrequested by the user through environment variables, can conflict with\nthe mdrun internal affinity setting.\nTo avoid performance degradation, as Intel OpenMP has affinity setting\non by default, we will explicitly disable it unless the user manually\nset OpenMP affinity through one of the KMP_AFFINITY or GOMP_CPU_AFFINITY\nenvironment variables. If any of these variables is set, we honor the\nexternally set affinity and turn off the internal one.\n\nChange-Id: I78c6347154d6f11695ee04243db17bbb2e5cb0a7", "target": 0} {"commit_message": "[PATCH] Improve time performance", "target": 1} {"commit_message": "[PATCH] Cherry-picking changes on top of our current hash. There are\n things that hurt performance in laster mfem@master commits, not sure what\n yet.\n\nThis includes:\n- feature/artv3/cusparse-Spmv\n- feature/tomstitt/temp-mem-type\n- patches to tmop.cpp and tmop_tools.cpp to address memory issues\n- patch to mem_manager/device to use mfem's default allcator instead of\nthe host umpire one", "target": 0} {"commit_message": "[PATCH] Coordinate Propagators\n\nThis change introduces the propagator element, which, thanks to templating,\ncan cover the different propagation types used in NVE MD. 
The combination\nof templating, static functions, and having only the inner-most operations in\nthe static functions allows to have performance comparable to fused update\nelements while keeping easily reordable single instructions.\n\nNote that the two velocity update functions are only necessary to allow\nexact replication of the legacy do_md code for both md and md-vv. The\nparentheses or the lack thereof lead to numerical errors which build up very\nrapidly to make the (very strict) integrator comparison test fail. Relaxing this\ncondition will make getting rid of one of the two variants possible.\n\nAn interesting further development would be to unify the OpenMP loops for\ncoordinate propagation and constraining by using loops over constraint\ngroups in both cases.\n\nChange-Id: I1a1f66f1efe63c791ef3fe51ce2f99da3367adca", "target": 0} {"commit_message": "[PATCH] Complete all the complex single-precision functions of\n level3, but the performance needs further improve.", "target": 0} {"commit_message": "[PATCH] Fast transfers and diffusion kernels", "target": 1} {"commit_message": "[PATCH] performance optimizations for sgemv_n", "target": 1} {"commit_message": "[PATCH] Use vector in atoms2md instead of pointer\n\nAt all call sites for atoms2md the underlying vector was cast to an int pointer.\nThis change makes it easier to inspect the values passed to atoms2md in a\ndebugger while incurring no performance penalty.", "target": 0} {"commit_message": "[PATCH] Added an efficient parallel version of the ADMM linear\n program solver of Boyd et al.", "target": 1} {"commit_message": "[PATCH] performance are getting better with grid_nbfm", "target": 0} {"commit_message": "[PATCH] Added code to improve performance for the structure factor\n calculations.....Works in serial...Still need to check parallel\n performance......EJB", "target": 1} {"commit_message": "[PATCH] Tightening default tolerances on the ADMM algorithms where\n possible (the linear program solver was 
kept the same due to its slow\n convergence rate).", "target": 0} {"commit_message": "[PATCH] Changed the way hardwall works.\n\nThere are several ways to go about this, but the latest seems\nmost sensible in the context of DD. The hardwall function now\nuses local shells to get coordinates and such. The problem is\nthat after make_local_shells(), the indices are all messed up.\nCommitting this to preserve the logic, but need to fix this\notherwise parallelization is limited to only OpenMP and performance\nis not very good above 4 threads.\n\nChange-Id: I3208519d8704da622b81835604249771d634a28a", "target": 0} {"commit_message": "[PATCH] Avoiding redundant metadata calculations in order to\n accelerate AbstractDistMatrix::QueueUpdate and\n AbstractDistMatrix::ProcessQueues (as well as generalizing\n AbstractDistMatrix::ProcessPullQueue to support not including 'viewer'\n processes)", "target": 0} {"commit_message": "[PATCH] remove fast hash", "target": 0} {"commit_message": "[PATCH] Grid-based utility nbsearch implementation.\n\nMore efficient implementations are possible, but the present one should\nwork reasonably well in most cases, also for triclinic cells, without too\nmuch complexity.", "target": 1} {"commit_message": "[PATCH] Rewriting of Attribute_elevation (shorter and more efficient\n code)", "target": 1} {"commit_message": "[PATCH] Use CUSPARSE_CSRMV_ALG2 for seemingly better performance", "target": 1} {"commit_message": "[PATCH] Hard CPU affinity is set when Nthreads == Ncores.\n\nThis causes a slight thread_mpi performance gain on NUMA systems.", "target": 1} {"commit_message": "[PATCH] Add performance graph for region growing", "target": 0} {"commit_message": "[PATCH] Due to decrease in code performance, calculation of pressure\n tensor for W12, W13, W23 is commented out and in log file only the diameter\n of pressure tensor W11, W22, W33 will be printed. 
In case if those\n calculation are required, they need to be uncommented.", "target": 0} {"commit_message": "[PATCH] Removed lots of units code from AMBER file loader, which made\n it unnecessarily slow", "target": 0} {"commit_message": "[PATCH] - Performance issue, in Standard_criteria.h: Quality were\n not defined correctly, and then facets were not ordered correctly, in the\n Double_map.", "target": 0} {"commit_message": "[PATCH] AMOEBA uses fast approximation for erfc()", "target": 0} {"commit_message": "[PATCH] dgemm: Use the skylakex beta function also for haswell\n\nit's more efficient for certain tall/skinny matrices", "target": 1} {"commit_message": "[PATCH] additional precomiler command for performance testing", "target": 0} {"commit_message": "[PATCH] Add basic list of performance factors to tutorials.", "target": 0} {"commit_message": "[PATCH] log_name would be unused without perf_log on\n\nSo if we're not performance logging, we need to comment out that\nvariable entirely to avoid an unused variable warning.", "target": 0} {"commit_message": "[PATCH] Use Evaluate(const arma::mat& parameters) instead of\n Evaluate(const arma::mat& parameters, const size_t i) to calculate the\n objective and to accelerate the evaluation process.", "target": 0} {"commit_message": "[PATCH] Use more accurate error message.\n\nAs discussed over in idaholab/moose#9097, Nanoflann does not actually\nimplement an *approximate* nearest node search algorithm. 
In contrast\nto \"exact\" nearest node searches, approximate nearest node searches\nare not guaranteed to return the nearest node, in return for potential\nperformance improvements.\n\nSee, for example, their README file [0], which states that Nanoflann\ncan \"[Build] KD-trees with a single index (no randomized KD-trees, no\napproximate searches).\"\n\n[0]: https://github.com/jlblancoc/nanoflann/blob/v1.2.3/README.md", "target": 0} {"commit_message": "[PATCH] Update CUDA Performance Tests for Stream interoperability", "target": 0} {"commit_message": "[PATCH] Adding Changelog for Release 3.2.01\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 3.2", "target": 0} {"commit_message": "[PATCH] attempt to reduce the negative performance impact of adding\n the shift option", "target": 0} {"commit_message": "[PATCH] .EXPORT_ALL_VARIABLES: commented out since it seems to slow\n down the compilation", "target": 0} {"commit_message": "[PATCH] Reduce dataset sizes and number of iterations to accelerate\n tests.", "target": 0} {"commit_message": "[PATCH] Add TypeTensor::solve().\n\nThis is slightly more efficient than inversion followed by multiplication.", "target": 1} {"commit_message": "[PATCH] Reduce number of points to accelerate test.", "target": 0} {"commit_message": "[PATCH] Remove BoundaryInfo::n_boundary_ids(elem, side) copy/paste\n job.\n\nIt was basically an exact copy of the now-deprecated\nBoundaryInfo::boundary_ids(elem, side). 
Reusing the set filling code\nmight be *slightly* less efficient, but I think the savings in\nmaintainability and readability is worth it...", "target": 0} {"commit_message": "[PATCH] Some performance issues due to profiling", "target": 0} {"commit_message": "[PATCH] fast nary union in OFF2nef3 also constructor finds more\n problems and handles them", "target": 0} {"commit_message": "[PATCH] Disable CUDA textures on NVIDIA Volta\n\nThis has significant performance benefit for the nbnxm kernels with\ntabulated Ewald correction and it has negligible impact on the PME kernels.\n\nPartially addresses #3845", "target": 1} {"commit_message": "[PATCH] Override any OpenCL fast math JIT settings for\n born/coul/wolf{/cs}/gpu to resolve numerical deviations seen with some OpenCL\n implementations.", "target": 0} {"commit_message": "[PATCH] Added test to log the performance of the periodic Delaunay\n triangulation", "target": 0} {"commit_message": "[PATCH] GEMM: skylake: improve the performance when m is small", "target": 1} {"commit_message": "[PATCH] Improved FAST performance on CUDA backend", "target": 1} {"commit_message": "[PATCH] Worked on the performance of point location.", "target": 0} {"commit_message": "[PATCH] Use std::make_shared instead of new...\n\nIt is more efficient, since it requires only one memory allocation in\ncontrast to two.", "target": 1} {"commit_message": "[PATCH] Faster version, making use of the fast /tmp directory. 
Also\n removes diffs*.gz from web server.", "target": 0} {"commit_message": "[PATCH] BJP: Changed determination of io procs so that IO is more\n efficient for single files using parallel IO.", "target": 1} {"commit_message": "[PATCH] implemented plain-C SIMD macros for reference\n\nThis is mainly code reorganization.\nAdds reference plain-C, slow, arbitrary width SIMD for testing.\nAdds FMA for gmx_calc_rsq_pr.\nAdds generic SIMD acceleration (also AVX or double) for pme solve.\nMoved SIMD vector operations to gmx_simd_vec.h\nThe math functions invsqrt, inv, pmecorrF and pmecorrV have been\ncopied from the x86 specific single/double files to generic files\nusing the SIMD macros from gmx_simd_macros.h.\nMoved all architecture specific nbnxn_kernel_simd_utils code to\nseparate files for each SIMD architecture and replaced all macros\nby inline functions.\nThe SIMD reference nbnxn 2xnn kernels now support 16-wide SIMD.\nAdds FMA for in nbnxn kernels for calc_rsq and Coulomb forces.\n\nRefs #1173\n\nChange-Id: Ieda78cc3bcb499e8c17ef8ef539c49cbc2d6d74d", "target": 0} {"commit_message": "[PATCH] Adding manually inlined variants of Sweep functions since\n they were observed to be more efficient in practice", "target": 1} {"commit_message": "[PATCH] Made DD exclusion processing more efficient\n\nWith the Verlet scheme exclusions no longer need to be assigned only\nonce and there are no charge groups. 
This means the global to local\nexclusion conversion can be more than twice as fast.\n\nChange-Id: I80e1213715f051864d2989389212510428896cb8", "target": 1} {"commit_message": "[PATCH] Additional parameter tweaking for performance enhancement.", "target": 1} {"commit_message": "[PATCH] made PME load balancing + DD DLB more efficient\n\nThe DD dynamic load balancing is now limited, such that the fastest\ntimed PME load balancing cut-off setting can always be used.\nFixes #1089\n\nChange-Id: I3216dfd5a8b2b0676eee5519e08cf36e06047251", "target": 1} {"commit_message": "[PATCH] added the hybrid PCG solver which uses diagonal scaling first\n and switches to AMG if convergence too slow, solver 20", "target": 0} {"commit_message": "[PATCH] Fix CUDA architecture dependent issues\n\nOnly device code gets generated in multiple passes and therefore\ntarget architecture-dependent macros like __CUDA_ARCH__ or our own\nIATYPE_SHMEM (which also depends on __CUDA_ARCH__) are not usable in\nhost code as these will be both undefined. As a result, current code\nover-allocated dynamic shared memory. This has no negative side-effect.\nThis change replaces the use of macros with runtime device compute\ncapability checks. 
Also texture objects are now actually enabled,\nwhich give very minor performance improvements.\nNote that on Maxwell + CUDA 7.0 there is a 20% performance regression\nfor the tabulated Ewald kernel (which is not used by default), which\nmagically disappears when texture references are used instead.\n\nChange-Id: I1f911caad85eb38d6a8e95f3b3923561dbfccd0e", "target": 1} {"commit_message": "[PATCH] JN: Solaris uses now PTR_ALIGN for performance and\n compatibility with JN: different compilers", "target": 0} {"commit_message": "[PATCH] Improving performance of BigInt/BigFloat routines (such as\n Cholesky) by more than a factor of three by avoiding allocations within the\n templated BLAS routines", "target": 1} {"commit_message": "[PATCH] fast pool allocator", "target": 1} {"commit_message": "[PATCH] Workaround for libHilbert bug until we manage to get it\n fixed. Part of this workaround may be permanent - the long term fix may\n require efficient user code to call some reinitialization, renumbering\n function manually after reading in a mesh and solution.", "target": 0} {"commit_message": "[PATCH] Performance enhancement", "target": 1} {"commit_message": "[PATCH] * Refactored stats (#1594)\n\n* Fixed performance of multi-range subarray result estimation\n* Fixed bug in multi-range result estimation", "target": 0} {"commit_message": "[PATCH] Minor cleanup of NMF code; I think the residue should be\n displayed in non-debugging mode (optionally with -v) so I switched to\n Log::Info. Comment on the change to pinv() then rewrite\n RandomAcolInitialization a little bit to avoid allocating memory\n unnecessarily. 
Unfortunately insert_cols() is a little slow because it\n allocates more memory and memcpy()s.", "target": 0} {"commit_message": "[PATCH] Made gmx_numzero static for performance reasons.", "target": 1} {"commit_message": "[PATCH] fixed a horrendously inefficient minibatch implementation.\n now the cardinality-k-index-set sampler (sampling without replacement) is\n very efficient", "target": 0} {"commit_message": "[PATCH] fixed bug in CRSSparsity append, now efficient concat", "target": 1} {"commit_message": "[PATCH] Make jenkins own-fftw verify use local tarball\n\nWe should not bombard the FFTW servers with downloads, plus these can be\nrelatively slow too, so use our local ftp server instead.\n\nChange-Id: Id6ccebf0ac1ae6410cd4f7f13f2ff76d275af5d2", "target": 0} {"commit_message": "[PATCH] THUNDERX2T99: Performance fix for ZGEMM", "target": 1} {"commit_message": "[PATCH] resolved performance degrading changed introduced in revision\n 1319 (4)", "target": 0} {"commit_message": "[PATCH] skylake dgemm: Add a 16x8 kernel\n\nThe next step for the avx512 dgemm code is adding a 16x8 kernel.\nIn the 8x8 kernel, each FMA has a matching load (the broadcast);\nin the 16x8 kernel we can reuse this load for 2 FMAs, which\nin turn reduces pressure on the load ports of the CPU and gives\na nice performance boost (in the 25% range).", "target": 0} {"commit_message": "[PATCH] Separate CPU NB kernel and buffer clearing subcounters\n\nThis is aimed to allow comparing the performance of the pair-interaction\nkernels separately from the force buffer clearing.\n\nChange-Id: Ifb2b4b3e5a43ac2ee547da651f9432a22fe58421", "target": 0} {"commit_message": "[PATCH] Evaluating Tom Forsyth's fast mesh reordering.", "target": 0} {"commit_message": "[PATCH] Avoid confusing message at end of non-dynamical runs\n\nEM, TPI, NM, etc. are not targets for performance optimization\nso we will not write performance reports. 
This commit fixes\nand oversight whereby we would warn a user when the lack of\nperformance report is normal and expected.\n\nFixes #2172\n\nChange-Id: I1097304d79701be748612510572382729f7f26be", "target": 0} {"commit_message": "[PATCH] More performance stats.", "target": 0} {"commit_message": "[PATCH] Clang-tidy: enable further tests\n\nThose out of misc, performance, readiability, mpi with managable\nnumber of required fixes.\n\nRemaining checks:\n 4 readability-redundant-smartptr-get\n 4 readability-redundant-string-cstr\n 4 readability-simplify-boolean-expr\n 5 misc-misplaced-widening-cast\n 5 readability-named-parameter\n 6 performance-noexcept-move-constructor\n 8 readability-misleading-indentation\n 10 readability-container-size-empty\n 13 misc-suspicious-string-compare\n 13 readability-redundant-control-flow\n 17 performance-unnecessary-value-param\n 17 readability-static-definition-in-anonymous-namespace\n 18 misc-suspicious-missing-comma\n 20 readability-redundant-member-init\n 40 misc-misplaced-const\n 75 performance-type-promotion-in-math-fn\n 88 misc-incorrect-roundings\n 105 misc-macro-parentheses\n 151 readability-function-size\n 201 readability-else-after-return\n 202 readability-inconsistent-declaration-parameter-name\n 316 misc-throw-by-value-catch-by-reference\n 383 readability-non-const-parameter\n 10284 readability-implicit-bool-conversion\n\nChange-Id: I5b35ce33e723349fa583f527fec55bbf29a57508", "target": 0} {"commit_message": "[PATCH] Benchmark can now export performance data to an XML file +\n added Tanglecube function to benchmark", "target": 0} {"commit_message": "[PATCH] Python 3 does not have dict.itervalues\n\nReally we should move to using six and importing the iterator versions of dict\nmethods and range/zip from there, but for now just use \"values\" since using a\nraw list doesn't seem like it will cause memory or performance problems here.", "target": 0} {"commit_message": "[PATCH] Revert \"Adding a Fast configuration\"\n\nThis 
reverts commit 091cdf9143a944784c5e35671927509d4f8b3d70.", "target": 0} {"commit_message": "[PATCH] PERF improvements and bugfixes for select and replace\n\n- Performance improvements to CUDA backend\n- Bugs fixed in OpenCL backend", "target": 1} {"commit_message": "[PATCH] removed explict assigment of BLASOPT=-mkl for MACX64. Iprefer\n have the slow blas baseline by default for generic users", "target": 0} {"commit_message": "[PATCH] -Mvect option causes severe performance problems with\n R1.2.5.1 on Paragon", "target": 0} {"commit_message": "[PATCH] Fix performance unnecessary copy initialization", "target": 1} {"commit_message": "[PATCH] code optimiz", "target": 0} {"commit_message": "[PATCH] Make current slow growth behaviour consistent and communicate\n it better\n\nIn lambda dynamics/slow growth, lambda can be used to interpolate the lambda vector or to set its components directly if no lambda vector is provided by the user. In this latter case, lambda was allowed to be > 1, but it was silently kept within [0,1] if a lambda vector was specified. Moreover, setting the components of the lambda vector > 1 as a user produced an error.\n\nNow, it is consistently ensured that lambda vector components are in [0,inf), but warnings are issued if a user provides settings that somehow result in lambda vector components being > 1. If soft-core potentials are used, lambda vector components for Coulomb and vdW are consistently enforced to be in [0,1] (errors are issued else). If lambda is used to interpolate a user-provided lambda vector, it is kept in [0,1]. 
If user input results in lambda leaving the above ranges during the simulation, lambda will be kept at the respective interval boundary, and warnings are issued from which simulation step on the lambda vector will not change anymore.\n\nFixes #3584.", "target": 0} {"commit_message": "[PATCH] Converts PADiffusionSetup3D and\n QuadratureInterpolator::Eval3D kernels from 1 element per thread to 1 qpt/dof\n per thread for better performance when offloading (there are not enough units\n of work with 1 element/thread)", "target": 1} {"commit_message": "[PATCH] Introduce GMX_USE_SIMD_KERNELS cmake option\n\nMost GROMACS development does not need to recompile the SIMD nbnxm\n(and fep) kernels whenever their dependencies change. These\ndependencies are large in number, and include frequently changed\nfiles, including config.h and various utility and nbnxm module\nheaders. This flag permits people to efficiently recompile while\nworking on code that doesn't directly target changes to the SIMD\nkernels.\n\nIt also means that CI builds not aimed at efficient mdrun execution\ntimes can instead minimize compilation times and ccache db sizes. This\nwill also have the side effect of testing more of the reference NBNXM\nkernels.\n\nThere's other SIMD code (particularly PME, bonded, LINCS, update)\nwhich still compiles and runs in the usual way. Currently these are\nless costly to compile and harder to disable. That could change in\nfuture.", "target": 0} {"commit_message": "[PATCH] Domain decomposition and PME load balancing for modular\n simulator\n\nThis change introduces two infrastructure elements responsible for\ndomain decomposition and PME load balancing, respectively. These\nencapsulate function calls which are important for performance, but\noutside the scope of this effort. 
They rely on legacy data structures\nfor the state (both) and the topology (domdec).\n\nThe elements do not implement the ISimulatorElement interface, as\nthe Simulator is calling them explicitly between task queue population\nsteps. This allows elements to receive the new topology before\ndeciding what functionality they need to run.\n\nThis commit is part of the commit chain introducing the new modular\nsimulator. Please see docs/doxygen/lib/modularsimulator.md for details\non the chosen approach. As the elements of the new simulator cannot all\nbe introduced in one commit, it might be worth to view Iaae1e205 to see\na working prototype of the approach.\n\nChange-Id: I1be444270e79cf1391f5a228c8ce3a9934d92701", "target": 0} {"commit_message": "[PATCH] More efficient CEED matrix assembly", "target": 1} {"commit_message": "[PATCH] Add an example to show how to debug performance with loggers.", "target": 0} {"commit_message": "[PATCH] Moving coord_string from returning a std::string to\n std::string_view. (#2704)\n\nThe coord_string function is used in a lot of performance critical paths.\nMoving it to return a string_view as none of these paths benefit from\nmaking a copy of the value.", "target": 0} {"commit_message": "[PATCH] Improve object_type detection performance (#2792) (#2793)\n\nThis improves the object APIs performance for detecting types by\nswitching from listing all items in the URI to checking only for the\nexistence of the group indicator, array schema file or array schema\nfolder. 
We also switch the order to check for the array schema folder\nfirst, since it is most likely to exist based on the assumption that\nthere are more arrays than there are groups.\n\nCo-authored-by: Seth Shelnutt ", "target": 1} {"commit_message": "[PATCH] Added parentheses to performance logging messages", "target": 0} {"commit_message": "[PATCH] s390x: Add vectorized sgemm kernel for Z14 and newer\n\nAdd a new GEMM kernel implementation to exploit the FP32 SIMD\noperations introduced with z14 and employ it for SGEMM on z14 and newer\narchitectures.\n\nThe SIMD extensions introduced with z13 support operations on\ndouble-sized scalars in vector registers. Thus, the existing SGEMM code\nwould extend floats to doubles before operating on them. z14 extended\nSIMD support to operations on 32-bit floats. By employing these\ninstructions, we can operate on twice the number of scalars per\ninstruction (four floats in each vector registers) and avoid the\nconversion operations.\n\nThe code is written in C with explicit vectorization. In experiments,\nthis kernel improves performance on z14 and z15 by around 2x over the\ncurrent implementation in assembly. 
The flexibilty of the C code paves\nthe way for adjustments in subsequent commits.\n\nTested via make -C test / ctest / utest and by a couple of additional\nunit tests that exercise blocking (e.g., partial register blocks with\nfewer than UNROLL_M rows and/or fewer than UNROLL_N columns).\n\nSigned-off-by: Marius Hillenbrand ", "target": 1} {"commit_message": "[PATCH] Residuals: New MeshObject class to store solver performance\n residuals\n\nThis is more efficient and modular than the previous approach of storing the\nresiduals in the mesh data dictionary.", "target": 1} {"commit_message": "[PATCH] Fix performance for resize, and deep_copy", "target": 1} {"commit_message": "[PATCH] s390x: for clang use fp-contract=on instead of fast\n\nMake clang slightly more cautious when contracting floating-point\noperations (e.g., when applying fused multiply add) by setting\n-ffp-contract=on (instead of fast).\n\nSigned-off-by: Marius Hillenbrand ", "target": 0} {"commit_message": "[PATCH] Add task dag performance test based upon fibonnaci", "target": 0} {"commit_message": "[PATCH] thermophysicalModels::coefficientWilkeMultiComponentMixture:\n New Wilke mixing model for gaseous transport properties\n\nThe new generalised framework for thermophysical mixing models has allowed the\nefficient implementation of the useful combination for gases of coefficient\nmixing for thermodynamic properties with the Wilke model for transport\nproperties:\n\nDescription\n Thermophysical properties mixing class which applies mass-fraction weighted\n mixing to the thermodynamic coefficients and Wilke's equation to\n transport properties.\n\n Reference:\n \\verbatim\n Wilke, C. R. 
(1950).\n A viscosity equation for gas mixtures.\n The journal of chemical physics, 18(4), 517-519.\n \\endverbatim", "target": 0} {"commit_message": "[PATCH] Improve general performance, set up 1-attribute, fix\n correspondence issue between original and copied map", "target": 1} {"commit_message": "[PATCH] fixed remains of xj shifting in legacy CUDA NB kernel\n\nThis removes three extra useless flops, there the fix results in a\nslight performance improvement.", "target": 1} {"commit_message": "[PATCH] Shape smoothing: some comments added to accelerate matrix\n construction. Konstantinos: this is what is so slow, not the solver!", "target": 0} {"commit_message": "[PATCH] dgemm: use dgemm_ncopy_8_skylakex.c also for Haswell\n\nThe dgemm_ncopy_8_skylakex.c code is not avx512 specific and gives\na nice performance boost for medium sized matrices", "target": 1} {"commit_message": "[PATCH] corrected bug in performance testing", "target": 0} {"commit_message": "[PATCH] Add functions to Quantity to compute the max, min, standard\n deviation (as the sqrt of the variance), and average, returning a Quantity\n with the proper units.\n\nThis should be reasonably efficient, as it takes advantage of numpy-accelerated\nmethods if they're present.", "target": 0} {"commit_message": "[PATCH] small performance fix: reuse conic1, conic2 in set(5)", "target": 1} {"commit_message": "[PATCH] Removed AVX code, since it had very little effect on\n performance and would have required a more complicated build process. 
Also\n worked around a compilation error with clang.", "target": 0} {"commit_message": "[PATCH] Try to use inline in boxDimension class to increase the\n performance", "target": 1} {"commit_message": "[PATCH] Use non caching segment traits to accelerate arrangement\n computations", "target": 1} {"commit_message": "[PATCH] making the thread checker (valgrind drd) happy has no\n performance disadvantage because it is in the planning phase", "target": 0} {"commit_message": "[PATCH] Accelerate L-BFGS with a couple of tricks.\n\n1. Function objective calculation after optimization isn't needed.\n2. minPointIterate isn't actually used anywhere, so get rid of it.\n3. Return best result from line search, not the last result.\n\nAlso I cleaned up a few no-longer-needed sections of code and simplified a few\nlines.", "target": 1} {"commit_message": "[PATCH] added option to keep transpose to hybrid for better gpu\n performance", "target": 1} {"commit_message": "[PATCH] New thread_mpi library: waits now yield to the OS scheduler\n without significant performance penalty.", "target": 0} {"commit_message": "[PATCH] WIP: add CPR approach and fast small block inverse.", "target": 0} {"commit_message": "[PATCH] clover::FloatNOrder now uses vectorized load/store for\n improved performance of all algorithms that use this (clover inversion sees a\n 1.5x speedup). Added missing support for clover norm field save/restore in\n tuning.", "target": 1} {"commit_message": "[PATCH] Enable performance test", "target": 0} {"commit_message": "[PATCH] Various cleanup in Kokkos SNAP, replacing verbose Kokkos\n MDRangePolicy and TeamPolicy types with simpler `using` definitions. No\n performance implications.", "target": 0} {"commit_message": "[PATCH] Reorder indices of Gamma_P_ia\n\nThis step increases spatial cache locality of the exchange of Gamma_P_ia. 
The increased locality increase the performance of its redistribution.", "target": 1} {"commit_message": "[PATCH] finally fixed a major performance bug. New implementation now\n slightly faster than multiple on old implementation", "target": 1} {"commit_message": "[PATCH] Keep COARSEN_INACTIVE flags in sync in corner case\n\nThis isn't the most efficient solution - I think we could probably let\nthese flags stay out of sync for a while, weaken the assertions that\ncomplain, and trust to make_coarsening_compatible to eventually fix\nthe inconsistencies.\n\nThis is the safest solution, though.", "target": 0} {"commit_message": "[PATCH] More efficient implementation of atoms class.", "target": 1} {"commit_message": "[PATCH] Made a pass on Performance tutorials in docs.", "target": 0} {"commit_message": "[PATCH] Parallel Performance bug in structure factor fixed for pspw.\n The new code works ok in serial, but still need to check parallel\n performance.\n\n...EJB", "target": 1} {"commit_message": "[PATCH] problem with c-macro? but slow fabs works fine", "target": 0} {"commit_message": "[PATCH] Use analysis nbsearch in insert-molecules\n\nAdvantages:\n - This reduces the amount of code by ~90% compared to what addconf.c\n has, making it significantly easier to understand.\n - Now the tool is independent of potential changes in the\n mdrun-specific neighborhood search.\n - Memory leaks related to addconf.c are gone.\n - The neighborhood search is terminated as soon as one pair within the\n cutoff is found, potentially making it faster. This likely offsets\n any performance differences between the nbsearch implementations.\n The unit tests are ~35% faster.\n - Confusing mdrun-specific output related to the neighborhood\n searching is gone. 
This includes notes that \"This file uses the\n deprecated 'group' cutoff_scheme\" and references to Coulomb or VdW\n tables and cutoffs.\n\nChange-Id: Iba82858b9a2b43b6e10a49cd3964b99b22996166", "target": 1} {"commit_message": "[PATCH] Greatly improved the performance of copy::GeneralPurpose by\n exploiting the tensor product structure of the integer metadata calculations", "target": 1} {"commit_message": "[PATCH] add generic Mac toolchain file\n\nI prefer that CMake find the MPI compiler wrappers instead of the base\ncompilers, if for no other reason than this often means mixing clang and\ngfortran, since clang will likely be found first in PATH.\n\nthis toolchain assumes MPI wrappers are in the PATH but one can set the\nbase directory (with a trailing \"/\") if desired.\n\nthe BLAS/LAPACK used is \"-framework Accelerate\"", "target": 0} {"commit_message": "[PATCH] Implement OpenCL support\n\nStreamComputing (http://www.streamcomputing.eu) has implemented the\nshort-ranged non-bonded interaction accleration features previously\naccelerated with CUDA using OpenCL 1.1. Supported devices include\nGCN-based AMD GPUs and NVIDIA GPUs.\n\nCompilation requires an OpenCL SDK installed. This is included in\nthe CUDA SDK in that case.\n\nThe overall project is not complete, but Gromacs runs correctly on\nsupported devices. It only runs fast on AMD devices, because of a\nlimitation in the Nvidia driver. A list of known TODO items can be\nfound in docs/OpenCLTODOList.txt. Only devices with a warp/wavefront\nsize that is a multiple of 32 are compatible with the implementation.\n\nKnown issues include that tabulated Ewald kernels do not work (but the\nanalytical kernels are on by default, as with CUDA), and the blocking\nbehaviour of clEnqueue in Nvidia drivers means no overlap of CPU and\nGPU computation occurs. Concerns about concurrency correctness with\ncontext management, JIT compilation, and JIT caching means several\nfeatures are disabled for now. 
FastGen is enabled by default, so the\nJIT compilation will only compile kernels needed for the current\nsimulation.\n\nThere is some duplication between the two GPU implementations, but\nthe active development expected for both of them suggests it is\nnot worthwhile consolidating the implementations more closely.\n\nChange-Id: Ideaf16929028eb60e785feb8298c08e917394d0f", "target": 0} {"commit_message": "[PATCH] Add performance improvement features to matrix loading.", "target": 1} {"commit_message": "[PATCH] improved CUDA kernel performance by pre-loading cj\n\nChange-Id: Ic725a82d550e2ecffd4d32edd2c44205aef99b8d", "target": 1} {"commit_message": "[PATCH] Restore wallcycle subcounter name to \"Bonded F\"\n\nThis makes it easier to check for performance behaviour\n\nChange-Id: Icb67bd75ee58fe280beb9f1cb123d0eeca229f09", "target": 0} {"commit_message": "[PATCH] Improving the performance of the HessenbergSchur QR sweeps", "target": 1} {"commit_message": "[PATCH] Update management of linear algebra libraries\n\nManagement of detection and/or linking to BLAS and LAPACK libraries is\nre-organized. The code has migrated to its own module. This will\nhelp future extension and maintenance. This version communicates\nthings that are newsworthy and stays out of the way when nothing\nis changing.\n\nWe no longer over-write the values specified by the user for\nGMX_EXTERNAL_(BLAS|LAPACK). Previously, this was used to signal\nwhether detection succeeded, but that does not really get the job\ndone. Instead, the user is notified that detection failed (repeatedly,\nif they deliberately set such an option on).\n\nCorrect usage and expected behaviour in all cases is documented both\nin the code and the install guide.\n\nThe user interface is pretty much unchanged. We still don't offer full\nconfigurability (e.g. MKL for FFTs must use MKL for linear algebra\nunless GMX_*_USER is used, and the only way to get MKL for linear\nalgebra is to use it for FFTs). 
The size of any performance difference\nis probably very small, and if the user really needs mdrun with\ncertain FFT and tools with certain linear algebra library, they can do\ntwo configurations. Note that mdrun never calls any linear algebra\nroutines (tested empirically)!\n\nExpanded the solution of #771 by testing that the user supplied\nlibraries that actually work. If not, we emit a warning and try to use\nthem anyway.\n\nWe also now check that MKL really does provide linear algebra\nroutines, and fall back to the default treatment if it does not.\n\nRefs #771,#1186\n\nChange-Id: Ife5c59694e29a3ce73fc55975e26f6c083317d9b", "target": 0} {"commit_message": "[PATCH] Changed the performance test application", "target": 0} {"commit_message": "[PATCH] Improved ORB performance and memory usage on CUDA backend", "target": 1} {"commit_message": "[PATCH] Hopefully improve singledot performance", "target": 1} {"commit_message": "[PATCH] POWER10: Improve dgemm performance\n\nThis patch uses vector pair pointer for input load operation\nwhich helps to generate power10 lxvp instructions.", "target": 1} {"commit_message": "[PATCH] more efficient way to display a combinatorial map", "target": 1} {"commit_message": "[PATCH] Fixed a small performance bug", "target": 1} {"commit_message": "[PATCH] Workaround for libHilbert bug until we manage to get it\n fixed. 
Part of this workaround may be permanent - the long term fix may\n require efficient user code to call some reinitialization, renumbering\n function manually after reading in a mesh and solution.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@3647 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 0} {"commit_message": "[PATCH] SMD multi step added with fast Hv product in main", "target": 0} {"commit_message": "[PATCH] fast pool allocator\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@4485 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 1} {"commit_message": "[PATCH] BJP: Initial checkin of test program for performance\n evaluation of DRA routines in 3 dimensions.", "target": 0} {"commit_message": "[PATCH] added synchs for main events again. this may have a negative\n performance influence in some case. such cases should however be really rare.\n otherwise, these events make no sense without a sync. imagine a master rank\n which is not at the interface", "target": 0} {"commit_message": "[PATCH] Issue #2779 Ensure outputs up-to-date before linearization\n Not sure if this adds overhead and, if so, can be replaced by something more\n efficient", "target": 0} {"commit_message": "[PATCH] Remove BoundaryInfo::n_edge_boundary_ids(elem, side)\n copy/paste job.\n\nIt was basically an exact copy of the now-deprecated\nBoundaryInfo::edge_boundary_ids(elem, side). 
Reusing the set filling code\nmight be *slightly* less efficient, but I think the savings in\nmaintainability and readability is worth it...", "target": 0} {"commit_message": "[PATCH] Accelerate a couple of tests.", "target": 0} {"commit_message": "[PATCH] Remove memory churn in compute_affine_map, improving\n threading performance", "target": 1} {"commit_message": "[PATCH] small performance optimization for pair style comb", "target": 1} {"commit_message": "[PATCH] Reduce number of epochs for training to accelerate tests.", "target": 0} {"commit_message": "[PATCH] chasing the performance issue on the aump2 QA test", "target": 0} {"commit_message": "[PATCH] same as commit 76d5bddd5c3dfdef76beaab8222231624eb75e89:\n Split ga_acc in smaller ga_acc on MPI-PR since gives large performance\n improvement on NERSC Cori", "target": 1} {"commit_message": "[PATCH] Fixing bugs in slow (non-shared memory) variant of\n lj/charmm/coul/charmm/gpu", "target": 0} {"commit_message": "[PATCH] Stage bonded kernel atomics through shared memory\n\nFixes performance bug introduced in 01b2f20bd5 by staging energy step\natomics through shared memory rather than have all threads write\natomically directly to global memory.\n\nFixes #3443", "target": 1} {"commit_message": "[PATCH] Added preconditions and made it more efficient", "target": 1} {"commit_message": "[PATCH] completed adding of support point indices, made remark on\n buggy setup of problem in fast and exact case", "target": 0} {"commit_message": "[PATCH] Update RAS-IR.\n\nNotes:\n1. Overlap and row ordering have significant effects on convergence.\n2. Lower the matrix bandwidth, better the performance ?\n3. Sync is lower, but communication becomes higher, tradeoff!\n4. Parallelism is higher, but overhead is also higher, tradeoff!\n5. Very high accurate solves, do not provide anything.\n6. 
Two level and multi-levels might significantly reduce number of\niterations to converge.", "target": 1} {"commit_message": "[PATCH] Allow Gromacs to run on more than 64 threads by default\n\nAvoids Gromacs dying on machines when starting more than 64 threads,\nwhich is getting more common today - including our own tests.\n\nThis increases the default (hard) OpenMP thread limit from\n64 to 128, since at least Roland has not seen any performance\ndrawbacks from that.\n\nSecond, we no longer attempt to start more threads than Gromacs\nhas been configured for. In principle it would have been cleanest\nto limit gmx_omp_get_max_threads() to GMX_OPENMP_MAX_THREADS,\nbut unfortunately the first value is used to detect the total\nnumber of threads for all ranks, while the latter is rather used\nas a limit for the number of threads to start inside each rank.\nAll this threading code needs to cleaned up later, but to avoid\nseverely limiting MPI parallelism on these machines, for now we\ninstead need to apply the limit once we have adjusted for the\nnumber of ranks in the higher-level routine.\n\nCloses #4370.", "target": 0} {"commit_message": "[PATCH] add comment regarding OpenMP performance", "target": 0} {"commit_message": "[PATCH] wallDist/patchDistMethods/Poisson: New method for fast\n calculation of an approximate wall-distance field by solving Poisson's\n equation", "target": 0} {"commit_message": "[PATCH] ATW: Removed slow serial code from density construction", "target": 0} {"commit_message": "[PATCH] HvD: Adding a few performance tests to see if certain\n optimizations are needed. It turns out that aa**ii is about 20 times faster\n than aa**bb, where aa and bb are double precision numbers and ii is an\n integer. Whereas aa*ii is about as fast as aa*bb (measured by running \"time\n \"). 
So it seems as if we need to add exponentiation to an integer\n power to the NWAD module.", "target": 0} {"commit_message": "[PATCH] Add Conjugate Gradient example to benchmarks - Includes\n sparse matrix\n\nCompare the performance and memory usage of sparse vs dense using conjugate\ngradient example", "target": 0} {"commit_message": "[PATCH] BUILD Adding option MIN_BUILD_TIME to CMake. Options sets O0\n for fast compile\n\n* Od on MSVC\n* Default is OFF. Flags are set when toggled to ON.\n* Resets the flags to default release when toggled back to OFF.", "target": 0} {"commit_message": "[PATCH] fix parallel build issues with APFS/HFS+/ext2/3 in\n netlib-lapack\n\nThe problem is that OpenBLAS sets the LAPACKE_LIB and the TMGLIB to the\nsame object and uses the `ar` feature to update the archive file. If the\nunderlying filesystem does not have sub-second timestamp resolution and\nthe system is fast enough (or `ccache` is used), the timestamp of the\nbuilds which should be added to the previously generated archive is the\nsame as the archive file itself and therefore `make` does not update the\narchive.\n\nSince OpenBLAS takes care to not run the different targets updating the\narchive in parallel, the easiest solution is to declare the respective\ntargets `.PHONY`, forcing `make` to always update them.\n\nfixes #1682", "target": 0} {"commit_message": "[PATCH] Only use AVX512 in own-FFTW if GROMACS also uses it\n\nBuilding the own FFTW with AVX512 enabled for all AVX-flavors means that\nan AVX2 build can end up loosing a significant amount of performance due\nto clock throttle if the FFTW auto-tuner inadvertently picks and AVX512\nkernel. 
This is not unlikely as measurements at startup are very noisy\nand often lead to inconsistent kernel choice (observed in practice).\n\nChange-Id: I857326a13a7c4dd1a6f5ab44360211301b05d3ac", "target": 0} {"commit_message": "[PATCH] some performance stuff (matrix products)", "target": 0} {"commit_message": "[PATCH] Temporary fix for OpenCL PME gather\n\nThere is a race on the z-component of the PME forces in the OpenCL\nforce reduction in the gather kernel. This change avoid that race.\nBut a better solution is a different, more efficient reduction.\n\nRefs #2737\n\nChange-Id: I45068c9187873548dff585044d2c8541444e385c", "target": 1} {"commit_message": "[PATCH] Threw away the manual text-parsing, xml parser is fast enough", "target": 0} {"commit_message": "[PATCH] Add the NeighborSearchRules class, which defines how the\n SingleTreeDepthFirstTraverser can perform a NeighborSearch. Adapt the\n NeighborSearch class to use this. It is not as fast as it could be.", "target": 0} {"commit_message": "[PATCH] Performance and thread-safety requires a lock around each\n constraint row acquisition, not just each constraint row entry.", "target": 0} {"commit_message": "[PATCH] * Refactored stats * Fixed performance of multi-range\n subarray result estimation * Fixed bug in multi-range result estimation", "target": 1} {"commit_message": "[PATCH] Non-PME OPT calculations now use a much more efficient\n Cartesian field algorithm for the dipole response force terms.", "target": 1} {"commit_message": "[PATCH] Tune param.h for SkylakeX\n\nparam.h defines a per-platform SWITCH_RATIO, which is used as a measure for how fine\ngrained the blocks for gemm need to be split up. Many platforms define this to 4.\n\nThe reality is that the gemm low level implementation for SkylakeX likes bigger blocks\ndue to the nature of SIMD... 
by tuning the SWITCH_RATIO to 32 the threading performance\nimproves significantly:\n\nBefore\n Matrix SGEMM cycles MPC DGEMM cycles MPC\n 48 x 48 10756.0 10.5 -0.5% 18296.7 6.1 -1.7%\n 64 x 64 20490.0 12.9 1.4% 40615.0 6.5 0.0%\n 65 x 65 83528.3 3.3 -210.9% 96319.0 2.9 -83.3%\n 80 x 80 101453.5 5.1 -166.3% 128021.7 4.0 -76.6%\n 96 x 96 149795.1 5.9 -143.1% 168059.4 5.3 -47.4%\n 112 x 112 191481.2 7.3 -105.8% 204165.0 6.9 -14.6%\n 128 x 128 265019.2 7.9 -99.0% 272006.4 7.7 -5.3%\n\nAfter\n Matrix SGEMM cycles MPC DGEMM cycles MPC\n 48 x 48 10666.3 10.6 0.4% 18236.9 6.2 -1.4%\n 64 x 64 20410.1 13.0 1.8% 39925.8 6.6 1.7%\n 65 x 65 34983.0 7.9 -30.2% 51494.6 5.4 2.0%\n 80 x 80 39769.1 13.0 -4.4% 63805.2 8.1 12.0%\n 96 x 96 45169.6 19.7 26.7% 80065.8 11.1 29.8%\n 112 x 112 57026.1 24.7 38.7% 99535.5 14.2 44.1%\n 128 x 128 64789.8 32.5 51.3% 117407.2 17.9 54.6%\n\nWith this change, threading starts to be a win already at 96x96", "target": 1} {"commit_message": "[PATCH] too conservative check removed, for fast removal in delaunay\n 2d", "target": 0} {"commit_message": "[PATCH] clang-tidy: more misc+readability\n\nAuto fixes:\nmisc-macro-parentheses\nreadability-named-parameter\n\nEnabled with few violations disabled by configuration:\nmisc-throw-by-value-catch-by-reference\nreadability-function-size\n\nSet clang-tidy checks as list to allow comments.\n\nRefactored the operator <<< used by GMX_THROW to take the exception by\nvalue and return a copy, which is easier for tools to see is a proper\nthrow-by-value (of a temporary). Performance on the throwing path is\nnot important, but is anyway not affected because the inlining of the\noperator allows the compiler to elide multiple copies. 
This also\navoids casting away the constness.\n\nChange-Id: I85c3e3c8a494119ef906c0492680c0d0b177a38d", "target": 0} {"commit_message": "[PATCH] Fixing bug that arose in parallel performance example ex1p", "target": 1} {"commit_message": "[PATCH] Reword CPU/GPU imbalance notes\n\nChanges text in CPU/GPU imbalance from \"performance loss\" to \"wasting\nresources\", since in some cases one can not get higher performance.\nReplaced \"GPU has less load\" by \"CPU has more load\".\nRemoved hint to reduce the cut-off, since one often can not do this.\nNote that with CUDA all theses notes are never printed, since we no\nlonger have timings on (by default), unlike with OpenCL.\n\nFixes #2253\n\nChange-Id: Ib4a9752ad27c1cd2a3cd751a217249694a56d3b7", "target": 0} {"commit_message": "[PATCH] Removed FillSubcellsForNode, due to more efficient\n implementation where I add contributions for all nodes of a certain subcell.", "target": 1} {"commit_message": "[PATCH] Speed up a slow test case by using a smaller topology file", "target": 0} {"commit_message": "[PATCH] Fixing Performance Bug with Atomics\n\nCalling templated versions of atomics will prevent matching of\nnon-templated code, thus it would never call the optimized atomic\nroutines. This effected in particular atomic increment and decrement.", "target": 1} {"commit_message": "[PATCH] AABB tree: more on internal KD-tree used to accelerate the\n distance queries.", "target": 1} {"commit_message": "[PATCH] Fix the integer overflow issue for large matrix size\n\nFor large matrix, e.g. M=N=K, and M>1290, int mnk=M*N*K will overflow.\nThis will lead to wrong branching to single-threading. 
The performance\nis downgraded significantly.\n\nSigned-off-by: Wang, Long ", "target": 1} {"commit_message": "[PATCH] AABB tree: do_intersect now calls the First_primitive\n traversal traits (much faster) performance section updated", "target": 1} {"commit_message": "[PATCH] Fix a critical performance issue\n\nAs decided by `MainWindow`, the `Scene_c3t3_item::toolTip()` method is\ncalled by `MainWindow::updateInfo()` for each `modified()` event of the\nmanipulated frame. While the frame is manipulated, that generates a lot\nof events, and a lot of calls to `toolTip()`.\n\nBefore this commit, the call to `Scene_c3t3_item::toolTip()`\nwas `O(n)`. After this commit it is `O(1)`.\n\nThat speeds up a lot the drawing of the item while the frame is\nmanipulated!", "target": 0} {"commit_message": "[PATCH] PME-gather: Use templated functor instead of preprocessor\n\nAdded restrict in several places, but this does not affect performance\nwith gcc and icc.\n\nChange-Id: Id366621fa3ad02ca182b8a4da48cae940059cf46", "target": 0} {"commit_message": "[PATCH] Changed increment ordering for performance", "target": 1} {"commit_message": "[PATCH] More efficient way of creating point set from selected points", "target": 1} {"commit_message": "[PATCH] DLB can now turn off, when slower\n\nUnder certain conditions, especially with (shared) GPUs, DLB can\ndecrease the performance. We now measure the cycles per step before\nturning on DLB. When the running average of cycles per step with DLB\ngets above the average without DLB, we turn off DLB. We then measure\nagain without DLB. DLB can then turn on again. If we turn on DLB of\nDLB multiple times in close succesion and we measure performance loss,\nwe keep DLB off for the remainder of the run. 
This procedure ensures\nthat the performance will never deteriorate due to DLB.\nUpdated and expanded the DLB section in the manual.\n\nChange-Id: I6e0291c1a41adf6da94fae46d36e0fcb95585a02", "target": 0} {"commit_message": "[PATCH] Non-reduction boxloops done\n\nThe non-reduction boxloops are all in and pass the struct tests.\nPerformance is VERY slow, but this may just be due to the machine\nI am running on. Reduction boxloops are in progress.", "target": 0} {"commit_message": "[PATCH] Removed unnecessary synchronization that hurt performance on\n Nvidia", "target": 1} {"commit_message": "[PATCH] More efficient SetNonzerosSlice2::evaluate", "target": 1} {"commit_message": "[PATCH] AVX512 CGEMM & ZGEMM kernels\n\n96-99% 1-thread performance of MKL2018", "target": 0} {"commit_message": "[PATCH] Fixed a performance regression on AMD GPUs", "target": 1} {"commit_message": "[PATCH] fixed bugs in BLT sorting, added makeSemiExplicit function\n (working, but not as efficient as it coudl be)", "target": 0} {"commit_message": "[PATCH] Refactor CalculateTopRecommendations(), including a complete\n overhaul of how recommendations are actually calculated. std::pair<> and\n std::map<> are often quite slow, especially in that implementation. This is\n faster.", "target": 1} {"commit_message": "[PATCH] This commit introduces VariableGroups as an optimization when\n there are repeated variables of the same type inside a system. Presently,\n these are only activated through the system.add_variables() API, but in the\n future there may be provisions for automatically identifying groups.\n\nThe memory usage for DofObjects now scales like\nN_sys+N_var_group_per_sys instead of N_sys+N_vars. 
The DofMap\ndistribution code has been refactored to use VariableGroups.\n\nAll existing loops over Variables within a system will work unchanged,\nbut can be replaced with more efficient loops over VariableGroups.\n\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@6521 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 1} {"commit_message": "[PATCH] Avoid calculating distances after an Elkan prune. Slight,\n nearly negligible performance gains.", "target": 1} {"commit_message": "[PATCH] Adding BoundaryInfo::add_elements()\n\nThis allows us to easily add boundary elements to an existing mesh\nwithout creating a new mesh for them.\n\nFactoring out _find_id_maps simplifies the code a bit. If\nBoundaryInfo::sync() is too slow we might also factor out\n_add_elements to avoid a redundant _find_id_maps call, but I doubt\nthe redundancy will ever show up in profiling.", "target": 0} {"commit_message": "[PATCH] some more prints to debug comex_malloc performance", "target": 0} {"commit_message": "[PATCH] HvD: On the Macintosh the BLAS and LAPACK libraries provided\n as part of the compiler framework are broken according to Jeff Daily. They\n work for small matrices but start to produce rubbish for large matrices.\n\nAlso the GA configure scans the machine for linear algebra routines on its\nown account. Now setting BLAS_LIB=\" \" will force the --without-blas option\non the GA configure. This way the GA will build BLAS from source. This avoids\nconflicts between NWChem and GA where NWChem built BLAS from source and GA\nplanned to load BLAS from a library, introducing conflicting views of the\ninteger types.\n\nFinally on contemporary Macintosh machines there does not seem to be a need\nto specify \"--framework veclib\" or \"--framework accelerate\" anymore. The\ncompilers and linkers seem to use this automatically. 
So we can remove that\nfrom LIST_LINLIBS again.", "target": 0} {"commit_message": "[PATCH] performance updates...EJB", "target": 0} {"commit_message": "[PATCH] VT:for infiniband VT:disabled HBNA get for performance\n reasons, needs to be enabled in the future", "target": 1} {"commit_message": "[PATCH] Added upper triangular fast tridiagonalization routines.", "target": 0} {"commit_message": "[PATCH] Extending QP and NNLS to handle multiple right-hand sides in\n an efficient way and adding a Non-negative Matrix Factorization (NMF).", "target": 1} {"commit_message": "[PATCH] Enorme typo sur le parametre fast", "target": 0} {"commit_message": "[PATCH] Updated article reference. Added configuration flags to\n support altivec on powerpc. Updated configuration files from gnu.org Removed\n the fast x86 truncation when double precision was used - it would just crash\n since the registers are 32bit only. Added powerpc altivec innerloop support\n to fnbf.c Removed x86 assembly truncation when double precision is used - the\n registers can only handle 32 bits. Dont use general solvent loops when we\n have altivec support in the normal ones.", "target": 0} {"commit_message": "[PATCH] added two performance charts", "target": 0} {"commit_message": "[PATCH] Removed one star. Nomore five stars record. This makes that\n mollocs are larger, therefore more efficient, and it works around buggy\n malloc library routines.", "target": 1} {"commit_message": "[PATCH] added reduce operation (op=sum) in communication class.\n Efficient MPI_Reduce for MPIDirect, inefficient trivial loop over all slaves\n for all other communication methods. Added simple test in\n ParallelMAtrixOperationsTest.cpp", "target": 0} {"commit_message": "[PATCH] replaced std::endl with \\n in all file IO and stringstreams. 
\n std::endl forces a flush, which kills performance on some machines", "target": 1} {"commit_message": "[PATCH] new macos accelerate step", "target": 1} {"commit_message": "[PATCH] Fix excessive list splitting\n\nDue to a possible integer overflow, the pair list splitting code could\nend up over-splitting pair lists and causing large performance\ndegradation. Due to the larger processor count, runs using AMD GPUs,\nusing 100k+ simulation systems are more prone to suffer from the issue.\n\nFixes #1904\n\nChange-Id: I29139ec80aa75c78fa93de0858f7c60cdae88d5b", "target": 1} {"commit_message": "[PATCH 01/29] Add support for factorization in\n create_new_algorithm.sh", "target": 0} {"commit_message": "[PATCH] Refactor sign to signbit internally\n\nThe operation being performance is equivalent of std::signbit\nthus, using signbit is more apt and removes unnecessary redefine\nof sign function in opencl jit kernel.\n\n(cherry picked from commit 2c8fb67ce5d07e573396eb8764470fb79086c797)", "target": 0} {"commit_message": "[PATCH] Reworked function 'get_best_weight()' of Slivers_exuder.h\n\n- Avoid computing incident cells multiple times\n- Make it more efficient for P3M3 by not having to use\n tr.min_squared_distance() to compute the distance\n between neighboring vertices", "target": 1} {"commit_message": "[PATCH] Complete OMP version except for the inner product.\n Performance below host.", "target": 1} {"commit_message": "[PATCH] Fix the Jacobi compilation time.\n\nThis is an approach at fixing the Jacobi kernels compilation which aims at\nsacrificing as little as possible the runtime performance and get down the total\ncompilation time for this file as much as possible, all the while having a way\nto enable the full performance optimizations in an easy way.\n\nThis approach only touches the `generate` kernels in the end, and provides the\nfull performance for the `apply` and other kernels as they are fast enough to\ncompile.\n\n+ Split the Jacobi kernels into multiple 
files for parallel compilation and\n smaller code size.\n+ Tone down some optimizations, namely use `noinline` for one function and\n `#pragma unroll 1` in two places to prevent unrolling. This impacts the\n Jacobi `generate` kernels only.\n+ Add a compilation flag for enabling back the full optimizations.\n\n__NOTE:__ the generate kernel compilation with full optimizations still takes\nabove 30 minutes on my laptop for one architecture (Maxwell). Without the\noptimizations enabled, it takes less than 3 minutes.", "target": 0} {"commit_message": "[PATCH] Auto cache compiled CUDA kernels on disk to speed up\n compilation (#2848)\n\n* Adds CMake variable AF_CACHE_KERNELS_TO_DISK to enable kernel caching. It is turned ON by default.\n* cuda::buildKernel() now dumps cubin to disk for reuse\n* Adds cuda::loadKernel() for loading cached cubin files\n* cuda::loadKernel() returns empty kernel on failure\n* Uses XDG_CACHE_HOME as cache directory for Linux\n* Adds common::deterministicHash() - This uses the FNV-1a hashing algorithm for fast and reproducible hashing of string or binary data. This is meant to replace the use of std::hash in some place, since std::hash does not guarantee its return value will be the same in subsequence executions of the program.\n* Write cached kernel to temporary file before moving into final file. This prevents data races where two threads or two processes might write to the same file.\n* Uses deterministicHash() for hashing kernel names and kernel binary data.\n* Adds kernel binary data file integrity check upon loading from disk", "target": 0} {"commit_message": "[PATCH] Added allow_rules_with_negative_weights flag to QBase. 
\n Default is true (which was the standard behavior) but you can set this to\n false to use more expensive (but potentially safer) quadrature rules instead.\n\nReplaced the 15-point tet Gauss quadrature rule with a 14-point rule\nby Walkington of equivalent order.\n\nAdded Dunavant quadrature rules for triangles up to THIRTEENTH order. These\nare more efficient than the conical product rules they are replacing. Up to\nTWENTIETH order still to come.\n\nReplaced SECOND-order rule for triangles with a rule having interior integration\npoints. The previous rule had points on the boundary of the reference element.\n\n\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@2889 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 1} {"commit_message": "[PATCH] MDRange: Disabling performance test for MDRange: takes way to\n long on KNL", "target": 0} {"commit_message": "[PATCH] issue #2155: setting initial guesses slow", "target": 0} {"commit_message": "[PATCH] Adding performance patch for trmm, just like #2836", "target": 0} {"commit_message": "[PATCH] Adjust spin wait in attempt to address performance issue #935\n Bring Windows code back in, just in case...", "target": 0} {"commit_message": "[PATCH] #914 Replaced casadi_copy_sparse runtime function with\n casadi_project With work vector and more cache efficient", "target": 1} {"commit_message": "[PATCH] Next-generation SIMD, for SSE2, SSE4.1 and 128-bit AVX\n\nThis adds the same functionality that was previously done for the\nreference SIMD implementation This includes all the 128-bit x86\nflavors, since SSE4.1 and AVX-128 only overrides a few SSE2\ninstructions/functions. Performance appears to be identical to the\nstate before the new SIMD code on x86 when using SSE2. 
For the most\nperformance-sensitive functions I expect we will later test a few\ndifferent alternative implementations once we can benchmark the\nroutines inside actual kernels using them.\n\nChange-Id: I59d5741df345b38745f9a6d1ea3a4d27b0a66034", "target": 0} {"commit_message": "[PATCH] Moved various operations for polynomial smoothers into setup\n phase to improve performance and added new parameters to allow use of\n different polynomials", "target": 1} {"commit_message": "[PATCH] Extra options for computational electrophysiology.\n\n* Added two extra .mdp file parameters 'bulk-offset' that allow to specify\nan offset of the swap layers from the compartment midplanes. This is useful\nfor setups where e.g. a transmembrane protein extends far into at least one\nof the compartments. Without an offset, ions would be swapped in the vicinity\nof the protein, which is not wanted. Adding an extended water layer comes\nat the cost of performance, which is not the case for the offset solution.\n* Also made the wording a bit clearer in some places\n* Described the new parameters in the PDF manual, updated figure\n* replaced usage of sprintf in output routine print_ionlist_legend() by snprintf\n* Turned comments describing the variables entering the swapcoords.cpp\n functions into doxygen comments\n\nChange-Id: I2a5314d112384b30f9c910135047cc2441192421", "target": 0} {"commit_message": "[PATCH] 1. BoomerAMG keeps track of the number of iterations\n accumulated over all calls. This is needed for user-level performance\n monitoring if it is a preconditioner for a Krylov method such as PCG. The\n regular iteration count only tells you about the last time PCG invoked\n BoomerAMG. There are ifdefs so you can eliminate this if you like - remove\n #define CUMNUMIT.\n\n2. 
very minor code fixes, comments, etc.", "target": 0} {"commit_message": "[PATCH] Replaced static arrays of cl::program/kernels with maps\n\nfast, fftconvolve, orb and random kernels were using static\narrays which are not replaced with std::maps. This fixes the\npure virtual function error that is happening on windows for\nintel OpenCL devices.", "target": 0} {"commit_message": "[PATCH] Fixing a bug in measuring performance (didn't reset the\n timer)", "target": 0} {"commit_message": "[PATCH] Converted iir, fir, fftconvolve to async calls\n\nAdded eval, sync statements to orb, fast to make them work with\ntheir asynchronous counter parts. Currently, one test of ORB is failing.\nWill fix it later.", "target": 0} {"commit_message": "[PATCH] Fix OMP num specify issue\n\nIn current code, no matter what number of threads specified, all\navailable CPU count is used when invoking OMP, which leads to very bad\nperformance if the workload is small while all available CPUs are big.\nLots of time are wasted on inter-thread sync. 
Fix this issue by really\nusing the number specified by the variable 'num' from calling API.\n\nSigned-off-by: Chen, Guobing ", "target": 1} {"commit_message": "[PATCH] Adding performance section", "target": 0} {"commit_message": "[PATCH] OPENCL: Disabling greedy assignment for csrmm and csrmv\n\n- Was causing performance issues on intel and amd devices", "target": 1} {"commit_message": "[PATCH] First phase of integrating the fast coulomb code into nwchem.\n The nested grid evaluation of the density, fourier interpolation, FMM, and\n fourier solution of the free space Poisson equation", "target": 0} {"commit_message": "[PATCH] add (T) kernels optimized for OpenMP+SIMD\n\nThese kernels are taken from https://github.com/jeffhammond/nwchem-tce-triples-kernels/,\nwhich were previously part of private development branch of NWChem hosted by Argonne.\nThe code was developed by Jeff Hammond from 2013-2014 with help from Karol Kowalski.\n\nThese kernels have been tested on Intel Xeon, Intel Xeon Phi, IBM Blue Gene/Q,\nIBM POWER7, AMD Bulldozer and ARM32 processors using the Intel, Cray, IBM XL,\nand GCC compilers. In rare instances, the optimal loop order is different between\nIntel, Cray and IBM compilers. In such cases, we default to the Intel compiler case\nbecause it is the most commonly used Fortran compiler for NWChem. In particular,\nNWChem as a whole cannot be compiled with Cray Fortran, so the only context in which\nit would be used for these kernels is if someone did a mixed build.\nThe performance differences with XLF were observed on POWER7, which is a relatively\nrare platform for NWChem.\n\nIn any case, these optimizations are better than the serial version any time OpenMP\nis used. 
Detailed performance information for some platforms can be found at\nhttps://github.com/jeffhammond/nwchem-tce-triples-kernels/tree/master/results.\n\nFinally, it should be noted that all of Jeff Hammond's developments for non-Intel\narchitectures were done prior to his employment at Intel, which can be verified\nfrom the Github commit log associated with the aforementioned repo.", "target": 1} {"commit_message": "[PATCH] More performance enhancements for the transformation --\n almost done now!", "target": 0} {"commit_message": "[PATCH] HvD: Mainly optimized bits of code. The automatic\n differentiation module still has scope for optimization (which was also\n pointed out by one of the reviewers of the paper). So I have improved a few\n things:\n\n1. I have added USE_FORTRAN2008 a flag, when set, to use the popcnt, trailz and\n leadz Fortran 2008 intrinsics rather than the corresponding Fortran\n implementations util_popcnt, util_leadz and util_trailz. The util routines\n are still used by default.\n\n2. I optimized the powx routine which calculates x**y to use less exponentiation\n evaluations.\n\n3. I added a routine (powix) to do x**i where i is an integer as that is 20\n times faster than doing x**y where y is a double precision variable with an\n integer value. This is the only case for an integer argument because in this\n specific case the performance difference is really significant.\n\n4. 
I have also changed nwad_print_int_opx to print double precision numbers that\n hold integer values as integers rather than floating point numbers in the\n expectation that that will filter through the code generated by Maxima.\n\nOverall performance improvement was only 10% though.", "target": 1} {"commit_message": "[PATCH] FEAT: Enabling additional interpolation types\n\n- Enabling cubic support for rotate and transform\n- resize fallsback to use scale\n\nResize is not using common interp.cl because of compilation\nissues that arose from using too much constant memory when\ncompiling FAST in CUDA backend.", "target": 0} {"commit_message": "[PATCH] More extensive performance logging of the Kelly Error\n Estimator.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1122 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 0} {"commit_message": "[PATCH] Use a separate Performance log line for compute_affine_map\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1506 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 0} {"commit_message": "[PATCH] Move clang-tidy build\n\nThis has seemed too slow when combined with the ASAN build.\n\nChange-Id: I45ea5856ca05edbb6107b62f219e8afd3cdbda3f", "target": 0} {"commit_message": "[PATCH] Improved performance mostly by using hints to insert to the\n status line.", "target": 1} {"commit_message": "[PATCH] Adding pair style dpd/intel and dihedral style fourier/intel\n Adding raw performance numbers for Skylake xeon server. Fixes for using older\n Intel compilers and compiling without OpenMP. Fix adding in hooks for using\n USER-INTEL w/ minimization.", "target": 0} {"commit_message": "[PATCH] shorten ESTIMATE planning time for certain weird sizes\n\nFFTW includes a collection of \"solvers\" that apply to a subset of\n\"problems\". Assume for simplicity that a \"problem\" is a single 1D\ncomplex transform of size N, even though real \"problems\" are much more\ngeneral than that. 
FFTW includes three \"prime\" solvers called\n\"generic\", \"bluestein\", and \"rader\", which implement different\nalgorithms for prime sizes.\n\nNow, for a \"problem\" of size 13 (say) FFTW also includes special code\nthat handles that size at high speed. It would be a waste of time to\nmeasure the execution time of the prime solvers, since we know that\nthe special code is way faster. However, FFTW is modular and one may\nor may not include the special code for size 13, in which case we must\nresort to one of the \"prime\" solvers. To address this issue, the\n\"prime\" solvers (and others) are proclaimed to be SLOW\". When\nplanning, FFTW first tries to produce a plan ignoring all the SLOW\nsolvers, and if this fails FFTW tries again allowing SLOW solvers.\n\nThis heuristic works ok unless the sizes are too large. For example\nfor 1044000=2*2*2*2*2*3*3*5*5*5*29 FFTW explores a huge search tree of\nall zillion factorizations of 1044000/29, failing every time because\n29 is SLOW; then it finally allows SLOW solvers and finds a solution\nimmediately.\n\nThis patch proclaims solvers to be SLOW only for small values of N.\nFor example, the \"generic\" solver implements an O(n^2) DFT algorithm;\nwe say that it is SLOW only for N<=16.\n\nThe side effects of this choice are as follows. If one modifies FFTW to\ninclude a fast solver of size 17, then planning for N=17*K will be\nslower than today, because FFTW till try both the fast solver and the\ngeneric solver (which is SLOW today and therefore not tried, but is no\nlonger SLOW after the patch). If one removes a fast solver, of size say\n13, then he may still fall into the current exponential-search behavior\nfor \"problems\" of size 13*HIGHLY_FACTORIZABLE_N.\n\nIf somebody had compleined about transforms of size 1044000 ten years\nago, \"don't do that\" would have been an acceptable answer. 
I guess the\nbar is higher today, so I am going to include this patch in our 3.3.1\nrelease despite their side-effects for people who want to modify FFTW.", "target": 1} {"commit_message": "[PATCH] made some bondeds slightly more efficient", "target": 1} {"commit_message": "[PATCH] Convert nbnxn_atomdata_t to C++\n\nChanged all manually managed pointer to std::vector.\nSplit of a Params and a SimdMasks struct.\nChanged some data members to be private, more to be done.\n\nThis change is ony refactoring, no functional changes.\n\nNote: minor, negligible performance impact of the nbnxn gridding\ndue to (unnecessary) initialization of std::vector during resize().\n\nChange-Id: I9c70a1f8f272c80a7cf335fcbd867bd79c4102a2", "target": 0} {"commit_message": "[PATCH] efficient jacobian of parallelizer, little speed penalty\n remaining", "target": 1} {"commit_message": "[PATCH] Adding Changelog for Release 3.4.00\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 3.4", "target": 0} {"commit_message": "[PATCH] USER-DPD: propagate a minor performance bugfix throughout the\n DPDE code\n\nThe fix_shardlow_kokkos.cpp code had already factored out a redundant\nsqrt() calculation in the innermost loop of ssa_update_dpde(). This\nchangeset propagates an equivilent optimization to:\n fix_shardlow.cpp\n pair_dpd_fdt_energy.cpp\n pair_dpd_fdt_energy_kokkos.cpp\nThe alpha_ij variable was really just an [itype][jtype] lookup parameter,\nreplacing a sqrt() and two multiplies per interacting particle pair\nby a cached memory read. 
Even if there isn't much time savings, the\ncode is now consistent across the various versions.", "target": 1} {"commit_message": "[PATCH] Performance bug fix in single node case.", "target": 1} {"commit_message": "[PATCH] fixed performance issue for vacuum with DD", "target": 1} {"commit_message": "[PATCH] all gonzalez stuff uploaded for trying to fix gonzalez, make\n it fast and accurate", "target": 1} {"commit_message": "[PATCH] Update citations.\n\nRahman_2020 and Wimmer_2020 both cite BISON as a fuel performance code\nand Wimmer actually discusses some of the coefficients used in BISON,\nbut neither present simulation results using BISON. Chen_2020 talks\nabout some models that are in MARMOT, but does not use MOOSE.", "target": 0} {"commit_message": "[PATCH] Recursive ThreadPools (#1774)\n\nCurrently, ThreadPool instances are only recursive with their own isntance. This\npatch allows multiple ThreadPool instances to recurse among themselves without\nthe potential for bottlenecking (either in performance reduction or a deadlock)\non available threads.\n\nFor instance, ThreadPool instances A and B may be called in any order:\nA->execute([&]() {\n B->execute([&]() {\n A->execute([&]() {\n foo();\n });\n });\n});\n\nThe motiviation for this patch is to allow us to use our `compute_tp` and\n`io_tp` without worrying about their order in a call stack. This patch reduces\nthe runtime of our `bench_large_io` from 280587ms to 269621.\n\nSee the new unit tests in `unit-threadpool.cc` for more examples.", "target": 0} {"commit_message": "[PATCH] Added a base class with a lookup table for functions cw(int)\n and ccw(int). 
It results in a performance improvement.", "target": 1} {"commit_message": "[PATCH] minor performance fix; map double_conic to\n double_coefficients", "target": 1} {"commit_message": "[PATCH] Improve performance running on multiple GPUs (#3347)\n\n* Use multiple streams to broadcast positions\n\n* Use multiple streams to reduce forces\n\n* Adds sync between default stream and peer-copy\n\n* Minor cleanup\n\nCo-authored-by: David Clark ", "target": 1} {"commit_message": "[PATCH] Implemented the efficient computation of the second centroid.\n\nThe hierarchical clustering algorithm gets about 15% faster\n(on test Eglise Fontaine, from 91s to 76s).", "target": 1} {"commit_message": "[PATCH] The current version allows for triclinic boxes as well. It is\n very slow though.", "target": 0} {"commit_message": "[PATCH] query t on side of bounded square\n\nI moved a lot of the functionality for deciding the Linf incircle\ntest for four points to the side of bounded square predicate.\n\nIn the case of query point t being on one of the sides of the\nbounded square, I use the predicate test1d. Maybe even this can\nbe optimized, or made even more robust with some more checks.\n\nA bug that is fixed with the current commit is in the following\ninput:\n\n$ cat ~/Dropbox/cgal/sdg/panos/sqch1a.cin\np -51 -180\np -180 -30\np -180 20\np -7 -180\n\nI also fixed a small bug when expanding both sides of the bounded\nsquare.\n\nThe next step is to completely remove the slow \"side of oriented\nsquare\" test.\n\nSigned-off-by: Panagiotis Cheilaris ", "target": 1} {"commit_message": "[PATCH] Enable fp-exceptions\n\nThis can help with finding errors quicker because mdrun crashes as soon\nas a floating point value overflows or is invalid. 
fp-exceptions are\nonly enabled for builds with asserts (without NDEBUG), mainly because\nit isn't always possible to avoid invalid fp operations for SIMD math\nwithout a performance penalty.\n\nAlso, fix a few places where we had 1/0 or other invalid fp operations.\n\nFixes #1582\n\nChange-Id: Ib1b3afc525706f4b171564fcaf08ebf3b2be3122", "target": 0} {"commit_message": "[PATCH] avoid using epeck in slow (when using leda) tests", "target": 0} {"commit_message": "[PATCH] remove legacy CUDA non-bonded kernels\n\nThis commit drops the legacy set of kernels which were optimized for use\nwith CUDA compilers 3.2 and 4.0 (previous to the switch to llvm backend\nin 4.1).\n\nFor now the only consequence is slight performance degradation with CUDA\n3.2/4.0, the build system still requires CUDA >=3.2 as the kernels do\nbuild with the older CUDA compilers. Whether to require at least CUDA\n4.1 will be decided later.\n\nRefs #1382\n\nChange-Id: I75d31b449e5b5e10f823408e23f35b9a7ac68bae", "target": 1} {"commit_message": "[PATCH] Extend Force sub-counters\n\nNeed more data for understanding performance variation\n\nImplemented subcounter \"restart\" and used it for accumulating\nposition-restraints time with FEP to the position-restraints\nsubcounter.\n\nNoted TODOs for some future extensions not currently possible.\n\nAlso added logfile output from GMX_CYCLE_BARRIER where people\nanalyzing the performance will see it.\n\nRefs #1686\n\nChange-Id: I9d60d0a683f56549879bb739269e9466c96572c4", "target": 0} {"commit_message": "[PATCH] Add node ranking for increased nearest neighbour performance,\n currently failing tests for k > 1", "target": 1} {"commit_message": "[PATCH] general SIMD acceleration for angles+dihedrals\n\nImplemented SIMD intrinsics for angle potential and pbc_dx.\nChanged SSE2 intrinsics to general SIMD using gmx_simd_macros.h.\nImproves performance significantly, especially with AVX-256\nand reduces load imbalance, especially with GPUs.\n\nChange-Id: 
Ic83441cce68714ae91c6d5ca2a6e1069a62cd2ae", "target": 1} {"commit_message": "[PATCH] add nvtx to measure performance", "target": 0} {"commit_message": "[PATCH] Gredner performance benchmark test", "target": 0} {"commit_message": "[PATCH] Removed #error in case the file gets included twice. There\n are protected #ifdefs anyway. Also, we do not care about protected includes\n within the .C file anymore. It does not make compilation slow as before.", "target": 0} {"commit_message": "[PATCH] Better performance (~10-15% better)\n\nBy removing several tests (and use CGAL::max instead), the generated\nassembly is more efficient.", "target": 1} {"commit_message": "[PATCH] fast dynamic_cast in Lazy_kernel::Construct_point_3", "target": 1} {"commit_message": "[PATCH] Deprecate the Side class\n\nFor repeated use our new Elem-based side_ptr APIs are more efficient\nnow.", "target": 1} {"commit_message": "[PATCH] Optimize load_tile_offsets for only relevant frags\n\nThis optimization is to adjust the `Reader::load_tile_offsets` class to\nonly loop over the relevant fragments as computed by the subarray class\nbased on intersection. This is a performance optimization for arrays\nwhich have a large number of fragments and which we are incorrectly\nfetching a large amount of unneeded data.", "target": 1} {"commit_message": "[PATCH] Adjust serialized query buffer sizes (#2115) (#2117)\n\nAdjust serialized query buffer sizes\n\nThis change the client/server flow to always send the server the\noriginal user requested buffer sizes. This solves a bug in which with\nserialized queries incompletes would cause the \"server\" to use smaller\nbuffers for each iteration of the incomplete query. This yield\ndecreasing performance as the buffers approached zero. 
The fix here lets\nthe server always get the original user's buffer size.", "target": 1} {"commit_message": "[PATCH] fixed GPU particle gridding performance issue\n\nThe scaling factor for the grid binning for the GPU pair search\nwas set incorrectly, which made the binning 50% slower.\n\nChange-Id: I146592c37094a3d81a7ae50b3903fcc615e748d5", "target": 1} {"commit_message": "[PATCH] Core: Atomic Performance Test Reduce Loop count\n\nThis was taking to long, reduced the loop count.", "target": 1} {"commit_message": "[PATCH] Change fix box/relax example to be more efficient\n\ngit-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12532 f3b2605a-c512-4ea7-a41b-209d697bcdaa", "target": 1} {"commit_message": "[PATCH] first short a performance matrix stuff. Papi interface so\n far.", "target": 0} {"commit_message": "[PATCH] Added Kokkos-like array datatype into RK4 and RHS in\n FixRXKokkos.\n\n- Created an Array class that provides stride access for operator[]\n w/o needing Kokkos views. This was designed to avoid the performance\n issues encountered with Views and sub-views throughout the RHS and\n ODE solver functions.", "target": 0} {"commit_message": "[PATCH] performance update ... 
EJB", "target": 0} {"commit_message": "[PATCH] Check in changes to improve performance for nb_puts and\n nb_gets on GPU-hosted global arrays.", "target": 1} {"commit_message": "[PATCH] Using fast approximation for erfc instead of tabulated values", "target": 1} {"commit_message": "[PATCH] Fix conditional on when DtoH forces copy occur\n\nd2d4a50b4c636c203028c5bff311924ec15e7825 introduced performance\nregression with forces copied from device to host on each step.\nThis fixes the issue by reinstantiating proper condition on the\ncopy call.\n\nFixes #4001\nRefs #2608", "target": 0} {"commit_message": "[PATCH] Small improvements for readability and performance", "target": 1} {"commit_message": "[PATCH] Distinghuish cases for performance reasons", "target": 0} {"commit_message": "[PATCH] Changed number of nonbonded thread blocks to improve\n performance", "target": 1} {"commit_message": "[PATCH] New normals orientation method:\n radial_normals_orientation_3() does a radial orientation of the normals of a\n point set. Normals are oriented towards exterior of the point set. This very\n simple and very fast method is intended to convex objects.", "target": 0} {"commit_message": "[PATCH] modified to have 3c and 2c two electron eri sums also prints\n outer loop index to show user about where the computation is. 
Also\n statically (modulo) parallelized for better performance\n\nRick Kendall", "target": 1} {"commit_message": "[PATCH] ATW: Interim commit - new sparse packing and IO OK but slow", "target": 0} {"commit_message": "[PATCH] Move performance logging of solve()s into solver classes", "target": 0} {"commit_message": "[PATCH] efficient sparsity pattern computation for the case when the\n user specifies the DOF coupling", "target": 1} {"commit_message": "[PATCH] Replaced Vertex_circulator by Edge_circulator to gain\n performance", "target": 0} {"commit_message": "[PATCH] << operator for segments does clipping (QT advice) only for\n the segments that intersect the boundaries of the screen rectangle use the\n old x_real function because the new one is too slow in doing the\n transformation (use GMP if CGAL_USE_GMP is defined) we should document the\n old one too, it will never be removed.", "target": 0} {"commit_message": "[PATCH] Inline a more efficient implementation of\n BoundingBox::intersects.\n\nThis ends up being about 9x faster due to inlining and improved short\ncircuiting.", "target": 1} {"commit_message": "[PATCH] Performance drop fix - Added a QTime to reduce the number of\n calls to deform().", "target": 1} {"commit_message": "[PATCH] made the domain decompostion a bit more efficient and cleaned\n up the DD code", "target": 1} {"commit_message": "[PATCH] Optimize multi-fragment unfiltering, part 1 (#1692)\n\nThis patch improves the execution time of the attribute unfiltering path.\n\n```\n// Current\nTotal read query time (array open + init state + read): 15.0957 secs\n Time to unfilter attribute tiles: 8.86625 secs\n```\n\n```\n// With this patch\nTotal read query time (array open + init state + read): 7.32202 secs\n Time to unfilter attribute tiles: 1.80354 secs\n```\n\nThe issue is that the destruction time of `forward_list` is obscenely\nslow within my OSX environrment. 
On the Linux environment that I\noriginally wrote this in, the `forward_list` did not cause a performance\nproblem. This patch just replaces the `forward_list` with a `vector`.\n\nI have titled this as a \"part 1\" because I have tentatively explored\nmulti-threading the `unfilter_tiles` path and have observed a speedup.\nMore on this later.\n\nCo-authored-by: Joe Maley ", "target": 1} {"commit_message": "[PATCH] Add access to base kernel in Efficient RANSAC traits", "target": 0} {"commit_message": "[PATCH] using int instead of size_t should be more efficient and\n range doesn't seem to be needed", "target": 1} {"commit_message": "[PATCH] Fix performance regresssion bug of unnecessary destruction of\n IPC comms buffers", "target": 1} {"commit_message": "[PATCH] Adding Changelog for Release 3.5.00\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 3.5", "target": 0} {"commit_message": "[PATCH] Improve the performance of sending the AmoebaVdwLambda to\n Cuda using pinned host memory; Updated the AmoebaVdwForceProxy to version 3,\n and added backward compatibility to version 2; updated TestAPIUnits.py to\n handle the per particle lambda flag", "target": 1} {"commit_message": "[PATCH] Use stack allocation in zgemv and zger\n\nFor better performance with small matrices\nRef #727", "target": 1} {"commit_message": "[PATCH] turned performance fix for 7.30 compilers off by default", "target": 1} {"commit_message": "[PATCH] Improve performance of PairReaxCKokkos", "target": 1} {"commit_message": "[PATCH] Avoid corner-only mesh connection in SlitMesh test\n\nI've long said we don't support meshes where a manifold is only\nconnected at one node, but apparently we did kind of support it\naccidentally in at least simple cases? 
But improving\nGhostPointNeighbors performance breaks this case, so let's extend the\nmesh to something we've actually committed to handle.", "target": 0} {"commit_message": "[PATCH] Optimize `copy_cells`, part 2 (#1695)\n\nThis patch optimizes the `copy_cells` path for multi-fragment reads. The\nfollowing benchmarks are for the multi-fragment read scenario discussed\noffline:\n\n```\n// Current\nRead time: 4.75082 secs\n * Time to copy result attribute values: 3.54192 secs\n > Time to read attribute tiles: 0.311707 secs\n > Time to unfilter attribute tiles: 0.370434 secs\n > Time to copy fixed-sized attribute values: 0.898421 secs\n > Time to copy var-sized attribute values: 0.954925 secs\n```\n\n```\n// With this patch\nRead time: 3.04627 secs\n * Time to copy result attribute values: 1.83972 secs\n > Time to read attribute tiles: 0.274928 secs\n > Time to unfilter attribute tiles: 0.38196 secs\n > Time to copy fixed-sized attribute values: 0.517415 secs\n > Time to copy var-sized attribute values: 0.461847 secs\n```\n\nFor context, here are the benchmark results for the single-fragment read. The\nstats are similar with and without this patch:\n```\nRead time: 1.86883 secs\n * Time to copy result attribute values: 1.19411 secs\n > Time to read attribute tiles: 0.304055 secs\n > Time to unfilter attribute tiles: 0.351332 secs\n > Time to copy fixed-sized attribute values: 0.289661 secs\n > Time to copy var-sized attribute values: 0.142405 secs\n```\n\nThis patch does three things:\n1. 
Converts the `offset_offsets_per_cs` and `var_offsets_per_cs` in the var-sized\n path from a 2D array (vector>) to a 1D array (vector", "target": 1} {"commit_message": "[PATCH] Mark closed flag in EigenSparseVector regardless of METHOD\n\nI don't think it's a performance hit to set these flags\nregardless of METHOD, and it's a nice state flag that the user\ncan check", "target": 0} {"commit_message": "[PATCH] Improve the performance of the `Record` logger by using\n deques of `std::unique_ptr` instead of plain object.", "target": 1} {"commit_message": "[PATCH] --enable-distmesh, --with-mapvector-chunk-size\n\nThe parmesh argument probably should have been deprecated when the\nParallelMesh name was.\n\nWe'll want to select chunked_mapvector array size at configure time,\nsince the exact performance optimization/pessimization results are\nlikely to be system-dependent.", "target": 0} {"commit_message": "[PATCH] Target haswell or AVX2 for prebuilt libraries\n\nPrebuilt artifacts that we publish for releases will now target\n`haswell` for linux and macos and `AVX2` for windows to allow for\ngreater compatibility while maintaining the performance of AVX\noptimizations.", "target": 0} {"commit_message": "[PATCH] allow compilation to optimize for CUDA compute cap. 3.5\n\nEnabling optimizations targeting compute capability 3.5 devices\n(GK110) slightly improves performance of both PME and RF kernels.\nThis requires a hint for the compiler optimization indicating\nthe maximum number of threads/block and minimum number of\nblocks/multiprocessor. This change allows nvcc >=5.0 to generate\ncode for CC 3.5 devices and switches to including PTX 3.5 code\n(instead of 3.0) in the binary.\n\nChange-Id: If7e14d31165bc05859250db7468bf6bd8c186264", "target": 1} {"commit_message": "[PATCH] Improve the performance of read_data of gzip'ed files using\n taskset. 
Normally, the gzip process would be pinned to the same core as the\n MPI rank 0 process, which makes the pipe stay in one core's cache, but forces\n the two process to fight for that core, slowing things down.", "target": 1} {"commit_message": "[PATCH] SIC performance updates...EJB", "target": 0} {"commit_message": "[PATCH] Improving performance of Kokkos ReaxFF\n\ngit-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@15828 f3b2605a-c512-4ea7-a41b-209d697bcdaa", "target": 1} {"commit_message": "[PATCH] Performance updates ....EJB", "target": 0} {"commit_message": "[PATCH] ClangCuda: Make Cuda compilation with clang 3.9.0 work\n\nThis makes all the unit and performance tests work, with only\none test disabled (view of views).\nOn the other hand it brakes the cuda build because clang is ok with\n__device__ void foo() {}\nvoid foo() {}\nnvcc sees that as a redeclaration.", "target": 0} {"commit_message": "[PATCH] Adjusted scheduling, removed slow atomic section", "target": 1} {"commit_message": "[PATCH] CUDA PME kernels with analytical Ewald correction\n\nThe analytical Ewald kernels have been used in the CPU SIMD kernels, but\ndue to CUDA compiler issues it has been difficult to determine in which\ncases does this provide a performance advantage compared to the\ntabulated kernels.Although the nvcc optimizations are rather unreliable,\non Kepler (SM 3.x) the analytical Ewald kernels are up to 5% faster, but\non Fermi (SM 2.x) 7% slower than the tabulated. 
Hence, this commit\nenables the analytical kernels as default for Kepler GPUs, but keeps the\ntabulated kernels as default on Fermi.\n\nNote that the analytical Ewald correction is not implemented in the\nlegacy kernels as these are anyway only used on Fermi.\n\nAdditional minor change is the back-port of some variable (re)naming and\nsimple optimizations from the default to the legacy CUDA kernels which\ngive 2-3% performance improvement and better code readability.\n\nChange-Id: Idd4659ef3805609356fe8865dc57fd19b0b614fe", "target": 1} {"commit_message": "[PATCH] made changes which removed further operations and made CLJP\n and Falgout coarsenings more efficient", "target": 1} {"commit_message": "[PATCH] Added fast square tridiagonalization for lower-triangular\n storage.", "target": 0} {"commit_message": "[PATCH] removed (harmless) left-over in nbnxn SIMD kernels\n\nThis improves performance of PME + p-coupling by about 5%.\nWith Ewald and virial, the nbnxn SIMD energy kernels were used\n(some left-over development code). The plain-C code did not do this.\n\nChange-Id: I039044fcb393bf0bcaa06f38498b2a57d60cf080", "target": 1} {"commit_message": "[PATCH] Extending LLL to support linearly dependent bases (and\n returning the nullity). A small performance optimization was also made by\n removing lll::ExpandQR from lll::HouseholderStep", "target": 1} {"commit_message": "[PATCH] JN: memcpy on T3D is now fast enough", "target": 1} {"commit_message": "[PATCH] - Workaround bugs and misfeatures of GCC 3 in FPU.h. \n Unfortunately at a performance cost :((", "target": 0} {"commit_message": "[PATCH] POWER10: Optimize dgemv_n\n\nHandling as 4x8 with vector pairs gives better performance than\nexisting code in POWER10.", "target": 1} {"commit_message": "[PATCH] Fixed FAST CPU backend case when no features are found", "target": 0} {"commit_message": "[PATCH] Changed workgroup size to work around NVIDIA bug. 
This also\n improves performance slightly.", "target": 1} {"commit_message": "[PATCH] - Anti-aliasing is quite slow (but in OpenGL mode). It is\n deactivated by default.\n\n- Add a temp message in the status bar when the aliasing mode is changed.", "target": 0} {"commit_message": "[PATCH] less efficient but maybe portable logarithm", "target": 0} {"commit_message": "[PATCH] Replace gmx::Mutex with std::mutex\n\nWe use no mutexes during the MD loop, so performance is not a serious\nconsideration and we should simplify by using std components.\nEliminated components of thread-MPI that are now unused.\n\nIn particular, this reduces the cross-dependencies between the\nlibgromacs and threadMPI libraries.\n\nMinor style improvements around set_over_alloc_dd.\n\nPart of #3892", "target": 0} {"commit_message": "[PATCH] Enabling of NVidia PTX backend for SYCL & nbnxmKernel\n performance optimizations", "target": 1} {"commit_message": "[PATCH] Shift transition to multithreading towards larger matrix\n sizes\n\nSee #1886 and JuliaRobotics issue 500. 
trsm benchmarks on Haswell and Zen showed that with these values performance is roughly doubled for matrix sizes between 8x8 and 14x14, and still 10 to 20 percent better near the new cutoff at 32x32.", "target": 0} {"commit_message": "[PATCH] Improvement of rectangular slicing: part 2 - fast\n implementation for large blocks of sparse matrices", "target": 1} {"commit_message": "[PATCH] Fast Haswell CGEMM kernel", "target": 0} {"commit_message": "[PATCH] More efficient IntegratorInternal::getDerivative for\n SXFunction #936", "target": 1} {"commit_message": "[PATCH] Simplify the usage of the performance function by reducing\n the template parameter.", "target": 0} {"commit_message": "[PATCH] Use a fixed-length arrays, avoid heap allocation.\n\nAlso reduce the default number of elements so that it runs fast enough in DBG mode.", "target": 1} {"commit_message": "[PATCH] Add implementation of the cross-entropy error performance\n function.", "target": 0} {"commit_message": "[PATCH] Fixed a performance regression in multi-GPU on CUDA", "target": 1} {"commit_message": "[PATCH] additional output for performance tests", "target": 0} {"commit_message": "[PATCH] Replace vpermpd with vpermilpd\n\nto improve performance on Zen/Zen2 (as demonstrated by wjc404 in #2180)", "target": 1} {"commit_message": "[PATCH] Tabulated function parameters are hardcoded in the kernel\n instead of being stored in an array. This makes the code simpler and may\n help performance slightly.", "target": 1} {"commit_message": "[PATCH] New DCEL structure that allows more efficient handling of\n holes and isolated vertices. (Ron) Automatic handling of holes to remove in\n the construction visitor. (Baruch)", "target": 1} {"commit_message": "[PATCH] aabb tree: added plane queries in the performance section and\n demo.", "target": 0} {"commit_message": "[PATCH] Modernize read_inpfile\n\nUsed std::string and gmx::TextReader to simplify this. 
It is likely no\nlonger as efficient because it now makes several std::string objects\nper line, but this is not a significant part of the execution time of\ne.g. grompp.\n\nMove COMMENTSIGN to cstringutil.cpp, now that this is an\nimplementation detail of string-handling code, rather than also used\nin readinp.cpp.\n\nMoved responsibility for stripping of comments to TextReader, reworked\nits whitespace-trimming behaviour, and introduced some tests for that\nclass.\n\nIntroduced some TODO items to reconsider multiple behaviours where\nread_inpfile ignores what could be malformed input. It is used for\ngrompp, xpm2ps and membed, however, so it is not immediately clear\nwhat fixes might be appropriate, and we might anyway remove this\ncode shortly.\n\nIntroduced catches for exceptions that might be thrown while calling\nread_inpfile (and related code that might change soon).\n\nAdded tests for newly introduced split-and-trim functionality that\nsupports read_inpfile.\n\nRefs #2074\n\nChange-Id: Id9c46d60a3ec7ecdcdb9529bba2fdb68ce241914", "target": 0} {"commit_message": "[PATCH] Improved performance of g_bar while reading .edr files with\n small nstenergy.", "target": 1} {"commit_message": "[PATCH] Add SIMD intrinsics version of simple update\n\nTo get better performance in cases where the compiler can't vectorize\nthe simple leap frog integrator loop and to reduce cache pressure of\nthe invMassPerDim, introduced a SIMD intrinsics version of the simple\nleap-frog update without pressure coupling and one T-scale factor.\nTo achieve this md->invmass now uses the aligned allocation policy\nand is padded by GMX_REAL_MAX_SIMD_WIDTH elements.\nAsserts have been added to check for the padding.\n\nChange-Id: I98f766e32adc292403782dc67f941a816609e304", "target": 1} {"commit_message": "[PATCH] WIP NCMesh fast single element neighbor calculation, works\n for quads", "target": 0} {"commit_message": "[PATCH] Added support for gettimeofday() when available to get\n microsecond 
resolution for wallclock time. This enables accurate performance\n benchmarks from short simulations even in parallel. When gettimeofday() is\n not available, we use time(). Time is still stored as seconds, but now as\n double instead of time_t.", "target": 0} {"commit_message": "[PATCH] Minor changes to mdrun -h descriptions\n\nHopefully these are easier to understand. The suggested application\nfor -pinoffset is covered in the new mdrun performance section\nof the user guide, on release-5-0 branch.\n\nChange-Id: I7bc6172a70c39c02f6ca6db17e26b08d2ca3b444", "target": 0} {"commit_message": "[PATCH] Interactive Molecular Dynamics (IMD)\n\nIMD allows to interact with and to monitor a running molecular dynamics\nsimulation. The protocol goes back to 2001 (\"A system for interactive\nmolecular dynamics simulation\", JE Stone, J Gullingsrud, K Schulten,\nP Grayson, in: ACM symposium on interactive 3D graphics, Ed. JF Hughes\nand CH Sequin, pp. 191--194, ACM SIGGRAPH). The user can watch the\nrunning simulation (e.g. using VMD) and optionally interact with\nit by pulling atoms or residues with a mouse or a force-feedback\ndevice.\nCommunitcation between GROMACS and VMD is achieved via TCP sockets\nand thus enables controlling an mdrun locally or one running on a\nremote cluster. Every N steps, mdrun receives the applied forces from\nthe VMD client and sends the new positions to VMD.\nOther features:\n- correct PBC treatment, molecules of a (parallel) simulation are made\n whole (with respect to the configuration found in the .tpr file)\n- in the .mdp file, one can define an IMD group (including the protein\n but not the water for example is useful). Only the coordinates of\n atoms belonging to this group are then transferred between mdrun and\n VMD. 
This can be used to reduce the performance impact to an almost\n negligible level.\n- adds only two single-line function calls in the main MD loop\n- and mdrun test fixture checks whether grompp and mdrun understand\n the IMD options\n\nChange-Id: I235e07e204f2fb77f05c2f06a14b37efca5e70ea", "target": 0} {"commit_message": "[PATCH] MKK: A new classic test case for matmul that give the\n performance numbers", "target": 0} {"commit_message": "[PATCH] 1. Removed \"task uccsdt energy\"-related input examples\n because they may not work and are pointless and distracting. 2. Added 2EMET\n subsection and filled in details for all new 2emet options. 3. Added\n subsection on RESTART which is complete except for examples. 4. Updated\n response properties section slightly. 5. Added new section on performance\n suggestions since so many users need this, but only the outline exists. I\n will fill in this content in a few days. 6. Removed other Hirata-era stuff\n that just seems out of place now.", "target": 0} {"commit_message": "[PATCH] HvD: In order to get reasonable performance out of the\n automatic differentiation approach the compiler has to inline the various\n overloaded operators. If the compiler fails to do that performance\n degradation exceeding an order of magnitude has been observed. For the GNU\n compilers code inlining requires all the code to be present in a single\n source file. The compiler cannot inline code from a different file. Hence the\n NWAD code module must be included into the same source file as the density\n functional subroutines that use it. To do that we need a fixed format version\n of NWAD module (typically identified to the compiler with the .F extension\n rather than the .F90 extension for free format Fortran). 
This commit creates\n the appropriate file that will be converted from free format to fixed format.\n At a later stage the original nwad.F90 file can be deleted.", "target": 0} {"commit_message": "[PATCH] Initial check in of performance test for read-only property", "target": 0} {"commit_message": "[PATCH] improved performance of free energy runs in water\n significantly by allowing water-water loops and added a slight speed up for\n neighborsearching for free-energy runs", "target": 1} {"commit_message": "[PATCH] Fix bonded atom communication performance\n\nThe filtering of atoms to communicate for bonded interactions that\nare beyond the non-bonded cut-off was effectively missing, because\nthe home atom indices were not set. This lead to many more atoms\nbeing communicated.\n\nChange-Id: I4bd5b9b561a3077e055186f312939221dba6cefa", "target": 0} {"commit_message": "[PATCH] More efficient SystemBase::reinit, flux_jump indicator should\n work but needs testing\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@304 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 1} {"commit_message": "[PATCH] Update tmop_pa_h3m kernel with fast mode", "target": 0} {"commit_message": "[PATCH] Create dedicated subcounter for nonbonded FEP\n\nNow all nonbonded work has their own separate subcoutners which allows\nmeasuring the performance of each task separately.\n\nRefs #2997\n\nChange-Id: I601445364592923d08087a858da4629b0b58ae76", "target": 0} {"commit_message": "[PATCH] introducing info on elements in the overlap regions for more\n efficient calculation of the characteristic function (restriction to part of\n the mesh)", "target": 1} {"commit_message": "[PATCH] There is no need to switch off ITERATOR_DEBUGGING as the\n performance problem was somewhere else", "target": 0} {"commit_message": "[PATCH] genbox now performs as expected. 
fast and reliable.", "target": 0} {"commit_message": "[PATCH] Added performance data", "target": 0} {"commit_message": "[PATCH] committed gmx_random.c and .h written by Erik Lindahl; I have\n added an extremely fast tabulated gaussian random number generator", "target": 0} {"commit_message": "[PATCH] Improve implementation of cycle subcounting\n\nConfiguring with GMX_CYCLE_SUBCOUNTERS on is intended to make active\nsome counters that show finer-grained timing details, but its\nimplementation with the preprocessor was more complex and more bug\nprone than this one. Static analysis now finds the bug where we\nover-run buf (fixed here and in release-5-1).\n\nThe only place we care about performance of the subcounter\nimplementation is that it doesn't do work when GMX_CYCLE_SUBCOUNTERS\nis off, and constant propagation and dead code elimination will handle\nthat.\n\nAlso moved some declarations into the blocks where they are used.\n\nChange-Id: I3d7a06a65c636c11557a094997a7a81f86a1ed8a", "target": 1} {"commit_message": "[PATCH] improved performance MatvecCommPkgCreate", "target": 1} {"commit_message": "[PATCH] Use SIMD transpose scatter in bondeds\n\nThe angle and dihedral SIMD functions now use the SIMD transpose\nscatter functions for force reduction. This change gives a massive\nperformance improvement for bondeds, mainly because the dihedral\nforce update did a lot of vector operations without SIMD that are\nnow fully replaced by SIMD operations.\n\nChange-Id: Id08e6c83d4c9943d790bfe2a40c70fa4697077af", "target": 1} {"commit_message": "[PATCH] Changed par_amg_setup so that\n hypre_BoomerAMGBuildCoarseGridOperator() is no longer used (Ulrike ran some\n performance studies of this). Two separate mat-mults are now used to compute\n (P^T (A P)). This change is important to the non-Galerkin code, because\n here, AP[Cpts,:] is used to compute a minimal sparsity pattern. 
So, by\n keeping (A P) from (P^T (A P)), we avoid a lot of duplicate communication and\n computation in non-Galerkin code.\n\nThis commit also removes any lumping to the diagonal inside the non-Galerkin\ncode.", "target": 0} {"commit_message": "[PATCH] Templated options are now runtime compile options for opencl\n FAST", "target": 0} {"commit_message": "[PATCH] Option to add temporary points on a far sphere before\n insertion\n\nHelps to reduce contention on the infinite vertex\nBut removing those points in the end takes time\nso it's only worth it when points lie on a surface.", "target": 0} {"commit_message": "[PATCH] Fixed FAST type comparison mismatch warning", "target": 0} {"commit_message": "[PATCH] Optimize the performance of dot by using universal intrinsics\n in X86/ARM", "target": 1} {"commit_message": "[PATCH] More efficient Sparsity::serialize", "target": 1} {"commit_message": "[PATCH] Added parallel performance to the doc", "target": 0} {"commit_message": "[PATCH] Separate Windows ci(gh-action) workflow and some improvs\n\nSplitting the windows ci job into a separate workflow enables the\nci to re-run windows specific jobs independent of unix jobs.\n\nUpdated Ninja dependency to 1.10.2 fix release in all ci(gh-actions)\n\nRefactored boost dependency to be installed via packages managers as\nGitHub Actions is removing pre-installed versions from March 8, 2021\n\nUpdate VCPKG hash to newer version to enable fast and better ports.\n\n(cherry picked from commit 58573eda4ded71fe4e0be6305a6f71386d175d12)", "target": 0} {"commit_message": "[PATCH] Increase release build timeout, macos is being a bit slow on\n Azure (#2495) (#2496)\n\nCo-authored-by: Seth Shelnutt ", "target": 0} {"commit_message": "[PATCH] increased granularity of performance logging, fixed a bug in\n DofMap::add_neighbors_to_send_list() which caused the _send_list to become\n excessively large. 
Further, this slowed the DofMap::sort_send_list() method\n considerably.", "target": 1} {"commit_message": "[PATCH] Avoid setting config on transient subarray (#2740)\n\nFor the existing `tiledb_query_add_range` APIs there is a transient\nsubarray object that is referenced. In one API the query config was also\nbeing set on this subarray. This is not required, and other variations\nof `tiledb_query_add_range` did not set the config. Removing this saves\na few seconds of the WALL time when a user is setting hundreds of\nthousands of ranges.\n\nThe performance bottleneck comes in when the config is copied which is a\nfull copy of the config map. This was being done once per range setting.\n\nInstead to maintain current behavior setting the query config also sets\nthe subarray config.", "target": 1} {"commit_message": "[PATCH] Performance Improvements - changed Cartesian to\n Simple_cartesian in the examples - changed list to vector in the code -\n removed unnecessary includes - introduced multipass_distance", "target": 1} {"commit_message": "[PATCH] Slightly improved performance, and small refinements", "target": 1} {"commit_message": "[PATCH] Fixed a potential performance bug in the KDE code that\n resulted in a stricter pruning criterion.", "target": 1} {"commit_message": "[PATCH] Knowing when the tree fails and we're stuck with a linear\n search is useful for performance testing too", "target": 0} {"commit_message": "[PATCH] Added complex routines for blas lapack but they compile as a\n separate library so that it doesn't slow down applications that do not need\n complex routines", "target": 0} {"commit_message": "[PATCH] GetPot: Use a more efficient container for UFO detection", "target": 1} {"commit_message": "[PATCH] functionObjects: surfaceFieldValue, volFieldValue: Various\n improvements\n\nA number of changes have been made to the surfaceFieldValue and\nvolFieldValue function objects to improve their usability and\nperformance, and to extend them so that 
similar duplicate functionality\nelsewhere in OpenFOAM can be removed.\n\nWeighted operations have been removed. Weighting for averages and sums\nis now triggered simply by the existence of the \"weightField\" or\n\"weightFields\" entry. Multiple weight fields are now supported in both\nfunctions.\n\nThe distinction between oriented and non-oriented fields has been\nremoved from surfaceFieldValue. There is now just a single list of\nfields which are operated on. Instead of oriented fields, an\n\"orientedSum\" operation has been added, which should be used for\nflowRate calculations and other similar operations on fluxes.\n\nOperations minMag and maxMag have been added to both functions, to\ncalculate the minimum and maximum field magnitudes respectively. The min\nand max operations are performed component-wise, as was the case\npreviously.\n\nIn volFieldValue, minMag and maxMag (and min and mag operations when\napplied to scalar fields) will report the location, cell and processor\nof the maximum or minimum value. 
There is also a \"writeLocation\" option\nwhich if set will write this location information into the output file.\nThe fieldMinMax function has been made obsolete by this change, and has\ntherefore been removed.\n\nsurfaceFieldValue now operates in parallel without accumulating the\nentire surface on the master processor for calculation of the operation.\nCollecting the entire surface on the master processor is now only done\nif the surface itself is to be written out.", "target": 1} {"commit_message": "[PATCH] Replaced the \"visited_facets\" array (parallel version) by an\n atomic char\n\nIt's as fast, and it required less memory.", "target": 1} {"commit_message": "[PATCH] issue #2389: re-adding MapSum Function fore efficient\n reduce_in/out", "target": 1} {"commit_message": "[PATCH] Fixed FAST memory leaks on CUDA backend", "target": 1} {"commit_message": "[PATCH] IQN-ILS modified such that the coomunication of the rhs uses\n the efficient MPI_Reduce operation if MasterSlave comm is configured with\n mpi-single, i.e., MPIDirect. All tests are working ... still check coupling\n iterations, i.e., performance of IQN-ILS", "target": 0} {"commit_message": "[PATCH] convert pair styles in USER-OMP to use fast DP analytical\n coulomb", "target": 1} {"commit_message": "[PATCH] Add data set that shows the performance gain when running\n self_intersections_example.cpp (4.6 sec master, 0.6 sec this PR when run\n sequentially", "target": 0} {"commit_message": "[PATCH] Add content to user guide\n\nConverted sections on environment variables, mdrun features, mdrun\nperformance to reStructuredText.\n\nChange-Id: I2a18528729dc6756be093e52e6f87f9df9fe3b94", "target": 0} {"commit_message": "[PATCH] implemented more efficient Ruge-coarsening", "target": 1} {"commit_message": "[PATCH] fix minor CUDA NB kernel performance regression\n\nCommit f2b9db26 introduced the thread index z component as a stride in\nthe middle j4 loop. 
As this index is not a constant but a value\nloaded from a special register, this change caused up to a few %\nperformance loss in the force kernels. This went unnoticed because\nsome architectures (cc 3.5/5.2) and some compilers (CUDA 7.0) were\nbarely affected.\n\nChange-Id: I423790e8fb01a35f7234d26ff064dcc555e73c48", "target": 1} {"commit_message": "[PATCH] Moved FAST description to docs directory.\n\nAdditionally indented FAST parameter documentation.", "target": 0} {"commit_message": "[PATCH] sbgemm: spr: enlarge P to 256 for performance", "target": 1} {"commit_message": "[PATCH] Fixed FAST CUDA backend case when no features are found", "target": 0} {"commit_message": "[PATCH] Modified locate region to use more efficient algorithms for\n most block-cyclic distributions.", "target": 1} {"commit_message": "[PATCH] Working on performance improvements", "target": 1} {"commit_message": "[PATCH] document \"slow\" and \"unstable\" labels for unit tests", "target": 0} {"commit_message": "[PATCH] Use a priority queue (heap) to store the list of candidates\n while searching. This makes the code more efficient, especially when k is\n greater. For example, for knn, given a list of k candidates neighbors, we\n need to do 2 fast operations: - know the furthest of them. - insert a new\n candidate. 
This is the appropiate situation for using a heap.", "target": 1} {"commit_message": "[PATCH] CouplingMatrix::operator&=\n\nThe user and library may need to use the output of a bunch of coupling\nfunctors, and this should be more efficient than carrying around a\nbunch of matrices and iterating through all of them.", "target": 1} {"commit_message": "[PATCH] Significant DBSCAN refactoring and improvements.\n\n - Use PARAM_MATRIX() instead of strings.\n - Add a single-point mode that handles RAM better.\n - Use UnionFind for much more efficient cluster finding.\n - Make --single_mode use the single point mode.", "target": 1} {"commit_message": "[PATCH] Set interior parent only when mesh contains multiple\n dimensions\n\nThis commit is to address the issues brought up in libMesh/libmesh/#709,\nnamely, handle ParallelMesh more appropriately by using\nLIBMESH_BEST_UNORDERED_MAP and asserting when an element id is greater\nthan the maximum element id rather than the number of elements. Also,\nautomatically setting the interior parent should occur when a mesh has\nmultiple dimensions and skipped when the mesh has only one dimension.\n\nIn order to utilize mesh.elem_dimensions(), which allows a user to\ndetermine the dimensions of a mesh, the code to automatically set the\ninterior parent was moved into mesh_base.C as a separate method. This\nway mesh.cache_elem_dimensions() can be called prior to setting the\ninterior parents and thus mesh.elem_dimensions() will be available.\n\nAlso, the methods were moved prior to the partitioning in order to avoid\nthe complexities of one processor containing the interior parent of an\nelement on another processor.\n\nLastly, there is one noticeable performance penalty for moving the\nautomatic setting of the interior parents to a separate method and that\nis populating the node_to_elem map which requires iterating through all\nactive elements. 
Previously, the node_to_elem map was populated during\nan existing element iteration inside find_neighbors().", "target": 0} {"commit_message": "[PATCH] 1 : Store performance function 2 : Pass error into Error\n function 3 : Pass network into Error function", "target": 0} {"commit_message": "[PATCH] fewer qas for slow mpi-pt", "target": 0} {"commit_message": "[PATCH] moved some test after fast exit [ci skip]", "target": 0} {"commit_message": "[PATCH] HvD: Eliminating a strange way of evaluating rho^1/3. If this\n costs performance then a better way would be to evaluate rho^4/3 = rho^1/3 *\n rho.", "target": 0} {"commit_message": "[PATCH] List: Reinstated construction from two iterators and added\n construction from an initializer list\n\nUntil C++ supports 'concepts' the only way to support construction from\ntwo iterators is to provide a constructor of the form:\n\n template\n List(InputIterator first, InputIterator last);\n\nwhich for some types conflicts with\n\n //- Construct with given size and value for all elements\n List(const label, const T&);\n\ne.g. to construct a list of 5 scalars initialized to 0:\n\n List sl(5, 0);\n\ncauses a conflict because the initialization type is 'int' rather than\n'scalar'. This conflict may be resolved by specifying the type of the\ninitialization value:\n\n List sl(5, scalar(0));\n\nThe new initializer list contructor provides a convenient and efficient alternative\nto using 'IStringStream' to provide an initial list of values:\n\n List list4(IStringStream(\"((0 1 2) (3 4 5) (6 7 8))\")());\n\nor\n\n List list4\n {\n vector(0, 1, 2),\n vector(3, 4, 5),\n vector(6, 7, 8)\n };", "target": 0} {"commit_message": "[PATCH] UList::swap: implemented fast version which swaps the size\n and storage pointer", "target": 0} {"commit_message": "[PATCH] moved MeshBase::contract() up to Mesh. 
Unfortunately, there\n is no good way to make MeshBase::delete_elem() efficient, so the old\n implementation of MeshBase::contract() was (potentially) O(n_elem^2), and\n consumed approximately 20 percent of the runtime in ex10. This new\n implementation exploits the fact that the elements are stored in a vector\n (which is why it was moved up to the Mesh class) and is linear in the number\n of elements. The new implementation is less that 1 percent of the run time\n in ex10.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1057 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 0} {"commit_message": "[PATCH] Possible performance enhancement in Mesh::delete_elem. We use\n the passed Elems id() as a guess for the location of the Elem in the\n _elements vector. If the guess does not succeed, then we revert to the linear\n search.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1187 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 0} {"commit_message": "[PATCH] Step towards better performance regarding second convergence\n test in bicgstab", "target": 1} {"commit_message": "[PATCH] Add A100 performance paper.", "target": 0} {"commit_message": "[PATCH] FAST will return (af_)features instead of (af_)features *", "target": 0} {"commit_message": "[PATCH] Tested a new locking grid with thread priorities. Actually,\n it decreases performance with the current algorithm, but we'll try it again\n later...", "target": 0} {"commit_message": "[PATCH] Improved performance of Python State objects", "target": 1} {"commit_message": "[PATCH] -added support for the new version of RS -fixed some minor\n bugs -now the kernel uses directly the extremely fast RS refinement function\n -updated the generic kernel tests accordingly", "target": 0} {"commit_message": "[PATCH] Use much less PaddedRVecVector and more ArrayRef of RVec\n\nOnly code that handles allocations needs to know the concrete type of\nthe container. 
In some cases that do need the container type,\ntemplating on the allocator will be needed in future, so that is\narranged here. This prepares for changing the allocator for state->x\nso that we can use one that can be configured at run time for\nefficient GPU transfers.\n\nAlso introduced PaddedArrayRef to use in code that relies on the\npadding and/or alignedness attributes of the PaddedRVecVector. This\nkeeps partial type safety, although a proper implementation of such a\nview should replace the current typedef.\n\nHad to make some associate changes to helper functionality to\nuse more ArrayRef, rather than rely on the way rvec pointers could\ndecay to real pointers.\n\nUsed some compat::make_unique since that is better style.\n\nChange-Id: I1ed3feb016727665329e919433bece9773b46969", "target": 0} {"commit_message": "[PATCH] ex1p: implement pa jacobi preconditioning\n\nPerformance is disappointing at the moment.", "target": 0} {"commit_message": "[PATCH] Minor fixes to mdrun performance documentation\n\nIn \"Examples for mdrun on one node,\" third example description, the respective number\nof thread-mpi ranks and OpenMP threads per rank were reversed.\n\nIn \"Examples for mdrun on one node,\" 6th example. 
For 12 logical cores, the pinoffsets\nshould be 0 and 6, respectively (I think)\n\nA few command line examples of running mdrun with more than 1 node used gmx rather\nthan gmx_mpi\n\nSeveral spelling/grammar/tense error/linking issues addressed.\n\nChange-Id: I014bc52d55cda1cbd05843cb8e960c2a2d7cbb47", "target": 0} {"commit_message": "[PATCH] Replace some std::endl with newline character\n\nNote that std::endl flushes the buffer, so we can get better\nperformance by using newlines when there is no need to flush.", "target": 1} {"commit_message": "[PATCH] Use AVX512 also for DGEMM\n\nthis required switching to the generic gemm_beta code (which is faster anyway on SKX)\nfor both DGEMM and SGEMM\n\nPerformance for the not-retuned version is in the 30% range", "target": 0} {"commit_message": "[PATCH] Use fast returns in md5 computation\n\nThis logic is easier to follow than a recycled ret integer\n\nChange-Id: Idc47cdae3d0453f1645a82582b13c86aa8eadcb8", "target": 0} {"commit_message": "[PATCH] Don't preallocate USMObjectMem\n\nThis might hurt performance in a bad case, should profile & check.", "target": 0} {"commit_message": "[PATCH] Rationalize HAVE_FMA\n\nDistinguish ARCH_PREFERS_FMA, for architectures that \"naturally\"\nprefer FMA (e.g., powerpc), from ISA_EXTENSION_PREFERS_FMA, for\ninstruction-set extensions that favor FMA where the base architecture\ndoes not (e.g., avx2 on x86).\n\nPreviously, --enable-avx2 would use FMA code for scalar and avx\ncodelets, which is wrong.\n\nThis change improves performance by a few percent on Ryzen (where FMA\ndoesn't really do anything), and is a wash on Haswell.", "target": 1} {"commit_message": "[PATCH] Make ImdSession into a Pimpl-ed class with factory function\n\nThis prepares to make IMD into a proper module. 
No\nfunctionality changes in this commit.\n\nReplaced gmx_bool with bool\n\nUsed fast returns when IMD is inactive, for better\nreadability of code.\n\nRefs #2877\n\nChange-Id: Ibbe8c452f6f480e9a357fe1b87da3ab0ae166317", "target": 0} {"commit_message": "[PATCH] Simplifying ARMv8 build parameters\n\nARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode\n(which is not right because TX2 is ARMv8.1) as well as requiring a few\nredundancies in the defines, making it harder to maintain and understand\nwhat core has what. A few other minor issues were also fixed.\n\nTests were made on the following cores: A53, A57, A72, Falkor, ThunderX,\nThunderX2, and XGene.\n\nTests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester.\n\nA summary:\n * Removed TX2 code from ARMv8 build, to make sure it is compatible with\n all ARMv8 cores, not just v8.1. Also, the TX2 code has actually\n harmed performance on big cores.\n * Commoned up ARMv8 architectures' defines in params.h, to make sure\n that all will benefit from ARMv8 settings, in addition to their own.\n * Adding a few more cores, using ARMv8's include strategy, to benefit\n from compiler optimisations using mtune. Also updated cache\n information from the manuals, making sure we set good conservative\n values by default. 
Removed Vulcan, as it's an alias to TX2.\n * Auto-detecting most of those cores, but also updating the forced\n compilation in getarch.c, to make sure the parameters are the same\n whether compiled natively or forced arch.\n\nBenefits:\n * ARMv8 build is now guaranteed to work on all ARMv8 cores\n * Improved performance for ARMv8 builds on some cores (A72, Falkor,\n ThunderX1 and 2: up to 11%) over current develop\n * Improved performance for *all* cores comparing to develop branch\n before TX2's patch (9% ~ 36%)\n * ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than\n current develop's branch and 8% faster than deveop before tx2 patches\n\nIssues:\n * Regression from current develop branch for A53 (-12%) and A57 (-3%)\n with ARMv8 builds, but still faster than before TX2's commit (+15%\n and +24% respectively). This can be improved with a simplification of\n TX2's code, to be done in future patches. At least the code is\n guaranteed to be ARMv8.0 now.\n\nComments:\n * CortexA57 builds are unchanged on A57 hardware from develop's branch,\n which makes sense, as it's untouched.\n * CortexA72 builds improve over A57 on A72 hardware, even if they're\n using the same includes due to new compiler tunning in the makefile.", "target": 1} {"commit_message": "[PATCH] Adding Changelog for Release 3.3.00\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 3.3", "target": 0} {"commit_message": "[PATCH] more efficient jacobian of mapping modes", "target": 1} {"commit_message": "[PATCH] Pair of send(vector) -> send(vector)\n\nThis may be more efficient in some cases and it should be easier to\nrefactor shortly.", "target": 1} {"commit_message": "[PATCH] Update GMM code. It should be a little faster training, but\n it is still too slow for my preferences. 
I am not sure what is making it so\n slow.", "target": 0} {"commit_message": "[PATCH] Use smaller block size on smaller system I/O - this seems to\n fix a performance problem found by Jens Lohne Eftang", "target": 1} {"commit_message": "[PATCH] Exploiting memory layout for better performance in\n matrix-free restriction", "target": 1} {"commit_message": "[PATCH] Make OverlappingCouplingFunctor threadable\n\nThis *particular* fix is probably efficient enough for use, but when we\nfix #2334 we should change this to use (and get test coverage of) the\nnew GhostingFunctor::clone() API instead.", "target": 0} {"commit_message": "[PATCH] Crucial bug fix - table lookup was too slow in the previous\n version.", "target": 1} {"commit_message": "[PATCH] Fixes for style and performance issues found by cppcheck", "target": 1} {"commit_message": "[PATCH] Tests for valid periodic actions\n\nWe expect that mdrun propagation is unaffected by changing mdp options\nthat determine whether output is written. However, orchestrating mdrun\nto collect and compute such data without affecting the propagation is\ncomplex and currently very fragile. New propagation approaches must be\nable to be tested.\n\nMany mdrun combinations of periodic outputs and periodic action of\nsimulation modules that affect propagation are compared for\ncorrectness against a simulation that did every action at every step.\n\nThese tests are fairly slow, so are in their own test binary and\nannotated appropriately. They run by default only in release-type\nbuilds. As they target testing the kind of coordination issues that\ntend to appear in multi-rank runs, those runs are specifically\ntargeted.\n\nThe energy tolerance for the mdrun test were far too tight. 
It seems\nthat tests passed anyhow because they compared runs under exactly\nthe same run conditions using (nearly) the same summation order.\nThis change slightly increases the tolerance for energies and\nmassively the tolerance for pressure comparison.\n\nChange-Id: I88ea643873ebec0e5e2b12181f4b51ad90c7b0f7", "target": 0} {"commit_message": "[PATCH] 3 fast kernels", "target": 0} {"commit_message": "[PATCH] fix for no performance logging", "target": 0} {"commit_message": "[PATCH] Fix cycle counting in StatePropagatorDataGpu\n\nDouble-counting resulted in broken/truncated performance acounting\ntable.\n\nFixes #3764", "target": 0} {"commit_message": "[PATCH] bad performance with some data", "target": 0} {"commit_message": "[PATCH] add Apple Accelerate to the list of BLAS libraries\n\nSigned-off-by: Jeff Hammond ", "target": 0} {"commit_message": "[PATCH] rearranged TRANSPOSED format, numerous speedups\n\nSplit the TRANSPOSED and non-TRANSPOSED rank-geq2 solvers, and changed\nthe DFT TRANSPOSED format to be more like fftw2 (both globally and\nlocally transposed). In general, more emphasis on arranging the data\ncontiguously for the DFTs, and more flexibility in intermediate\ntransposed formats. Also disable NO_SLOW when planning transposes,\nsince otherwise non-square in-place transposes gratuitously put the\nplanner in SLOW mode.\n\nCurrently, dft-rank1-bigvec has 5 variants (or 10, if DESTROY_INPUT).\nIt looks like only 2 of these are commonly used, so I should probably\nadd some UGLY tags once I do more benchmarking.", "target": 0} {"commit_message": "[PATCH] Traits class inherits from Hyperbolic traits now; TODO:\n investigate why triangulation is so slow", "target": 0} {"commit_message": "[PATCH] made GMX_FORCE_ENERGY a separate flag\n\nGMX_FORCE_ENERGY was (temporarily) defined as GMX_FORCE_VIRIAL.\nNow it is a separate flags, which is less confusing. 
This allows\nnstcalcenergy to be larger than nstpcouple, which improves performance\nwith the SSE and CUDA kernels.", "target": 1} {"commit_message": "[PATCH] Accelerate AABB tree traversal by passing the tolerance as\n initial min_dist", "target": 1} {"commit_message": "[PATCH] Add a CMake warning about FFTW with --enable-avx\n\nFFTW_MEASURE runs single-threaded tests for FFT performance, which is\nvery different from the GROMACS usage pattern, particularly with how\nthe cache access pattern works. In practice, with FFTW 3.3.2 and\n3.3.3, the performance of FFTW with --enable-avx is considerably\nworse than that of FFTW with --enable-sse or --enable-sse2. It's\nunlikely but theoretically possible that performance might change,\nso we prompt the user both to avoid --enable-avx now, and to\nperhaps consider it in the future.\n\nChange-Id: Ib4906645587cfc6a6306a7f7f46d612a6446b156", "target": 0} {"commit_message": "[PATCH] Fix fast for CUDA 9. Use CUB library for reductions\n\nFAST was failing on CUDA 9 because of insufficient synchronization in the\nreduction of the non_max_count function. The reduction is now implemented\nusing BlockReduce from CUB.\n\nThis also adds CUB as a dependency which is brought in as a submodule.", "target": 0} {"commit_message": "[PATCH] QN update is now in Base class - subclasses onla compute the\n update. Changed computation of QN-Update for MVQN: use QR decomposition of\n matrix V, instead of LU decomposition of VTV. More efficient and more robust\n implementation", "target": 1} {"commit_message": "[PATCH] added fast global atom nr. to molecule lookup\n\nMost atom search functions in mtop_util now use binary search.", "target": 1} {"commit_message": "[PATCH] Moving coord_string from returning a std::string to\n std::string_view. 
(#2704) (#2707)\n\nThe coord_string function is used in a lot of performance critical paths.\nMoving it to return a string_view as none of these paths benefit from\nmaking a copy of the value.", "target": 0} {"commit_message": "[PATCH] reactingMultiphaseEulerFoam: Added referencePhase option\n\nIn multiphase systems it is only necessary to solve for all but one of the\nmoving phases. The new referencePhase option allows the user to specify which\nof the moving phases should not be solved, e.g. in constant/phaseProperties of the\ntutorials/multiphase/reactingMultiphaseEulerFoam/RAS/fluidisedBed tutorial case with\n\nphases (particles air);\n\nreferencePhase air;\n\nthe particles phase is solved for and the air phase fraction and fluxes obtained\nfrom the particles phase which provides equivalent behaviour to\nreactingTwoPhaseEulerFoam and is more efficient than solving for both phases.", "target": 1} {"commit_message": "[PATCH] Redesigned experimental memory pool to eliminate race\n conditions by condensing state representation to a single integer and\n simplifying algorithm. 
Addresses issues #320 , #487 , #452\n\nCreating power-of-two Kokkos::Impl::concurrent_bitset size to streamline\nimplementation and align with MemoryPool needs.\n\nUnit testing over a range of superblocks the following sequence:\n 1) allocate N of varying size\n 2) deallocate N/3 of these\n 3) reallocation deallocated\n 4) concurrently deallocate and allocate N/3 of these\n\nAdd performance test for memory pool.\nAdd performance enhancement note for multiple hints per block size.", "target": 0} {"commit_message": "[PATCH] #1295 Refactored MX::setSub(single argument) This should be\n significantly more efficient, though I didn't do any performance testing", "target": 1} {"commit_message": "[PATCH] Wrap up unit and performance testing\n\nWith OpenMP, the data-duplicated non-atomic\nversion is 4X faster than the non-duplicated\natomic version using 16 threads, and\n2-3X faster using 2 threads.", "target": 0} {"commit_message": "[PATCH] tutorials/lagrangian: Added mixedVesselAMI2D\n\nThis tutorial demonstrates moving mesh and AMI with a Lagrangian cloud.\nIt is very slow, as interaction lists (required to compute collisions)\nare not optimised for moving meshes. The simulation time has therefore\nbeen made very short, so that it finishes in a reasonable time. The\nmixer only completes a small fraction of a rotation in this time. This\nis still sufficient to test tracking and collisions in the presence of\nAMI and mesh motion.\n\nIn order to generate a convincing animation, however, the end time must\nbe increased and the simulation run for a number of days.", "target": 0} {"commit_message": "[PATCH] gauge_field::FloatNOrder can now use __ldg loads. 
Generally\n improves performance across the board, but some regressions at 12/8\n reconstruct so left switched off for now (USE_LDG macro in\n include/gauge_field_order.h).", "target": 1} {"commit_message": "[PATCH] Adding Changelog for Release 2.7.00\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 2.7", "target": 0} {"commit_message": "[PATCH] reorder folders, so that the fast tests are run first", "target": 0} {"commit_message": "[PATCH] Improved performance of prolongator through exposing\n intstruction-level parallelism. This enables full bandwidth to be achieved\n on Maxwell.", "target": 1} {"commit_message": "[PATCH] Update unit tests and performance tests\n\nCo-authored-by: Daniel Arndt ", "target": 0} {"commit_message": "[PATCH] Cache n_nodes/sides/edges in DofMap constraints\n\nWe loop over each of these ranges multiple times, so doing the loop\nend manually should more efficient than even the new range idiom.", "target": 1} {"commit_message": "[PATCH] Enable mac CI testing on azure pipelines\n\n- use native minio bin from brew because Docker for Mac launches slow,\n annoying, etc.", "target": 0} {"commit_message": "[PATCH] Update citations.\n\nGoogle scholar claims this paper cites libmesh but I obtained a copy and it does not...\n\n@InProceedings{Monteiro_2016,\n author = {S.~Monteiro and F.~Iandola and D.~Wong},\n title = {{STOMP: Statistical Techniques for Optimizing and Modeling Performance of blocked sparse matrix vector multiplication}},\n booktitle = {{28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)}},\n pages = {93--100},\n publisher = {IEEE},\n month = oct,\n year = 2016,\n note = {\\url{http://dx.doi.org/10.1109/SBAC-PAD.2016.20}}\n}", "target": 0} {"commit_message": "[PATCH] More efficient Sign_at", "target": 1} {"commit_message": "[PATCH] Better workaround of g++ 4.1 optimizer bug:\n -fno-strict-aliasing. 
Performance penalty is 5% vs 24% with -O", "target": 0} {"commit_message": "[PATCH] Cleaning code, Sylvain should take a look to the previous\n version of this file and Root_of_2.h because there is an obvious slow down in\n the execution of all the filtering kernels", "target": 0} {"commit_message": "[PATCH] aabb tree: added curve in performance section", "target": 0} {"commit_message": "[PATCH] Expand Performance section of the user guide\n\nSalvage and clean up the content from the wiki to expand the user\nguide. Minor fixes to the rest of the Performance section.\n\nChange-Id: I39aba257c4c761a3a1ef428c64424da6fa449158", "target": 0} {"commit_message": "[PATCH] Add Line_2 to Efficient RANSAC traits", "target": 0} {"commit_message": "[PATCH] resolved an exception issue which could make PME slow", "target": 0} {"commit_message": "[PATCH] Better optimized ICC release flags\n\nAdd those flags included in -fast which both helps performance\nand are appropriate for GROMACS.\n\nThe flags included in -fast for Linux we weren't using were:\n-ipo, -no-prec-div, -static, -fimf-domain-exclusion=15\n\nFull static depends on static libraries to be installed and thus\nwill not always work. IPO increases compile time by a huge factor.\nWe do require that extreme values (e.g. 
large negative arguments\nto exp and large positive to erfc) are computed correctly.\n\nThis leaves -no-prec-div -fimf-domain-exclusion=14 -static-intel\nas safe and useful flags for GROMACS.\n\nChange-Id: Ifbee69431841e3051c95f0b4c0ad204aac965c4e", "target": 1} {"commit_message": "[PATCH] Optimize the script (from 20 minutes to 0.3 seconds!)\n\n- avoid opening and reading the file `processed_test_results` thousands\n of time: its content is stored once in a hash, for fast lookup,\n\n- do not call `fuser` for files that are already processed", "target": 1} {"commit_message": "[PATCH] updated log performance stats\n\n- added #of threads column\n- renamed \"Number\" column to \"Count\"\n- swapped \"Second\" and \"G-cycles\" columns to have the former aligned\n with the GPU timing table\n- removed Mnbf/s and GFlops from default output, turning these back on is\n still possible via the GMX_DETAILED_PERF_STATS env var\n- normalized the NODE time and % stats with the number of cores", "target": 0} {"commit_message": "[PATCH] Simplify AVX integer load/store\n\nAlso has the potential to improve performance on some architectures\n(if there is a domain crossing penalty - not sure whether any AVX capable\ngeneration has a penalty).\n\nChange-Id: Icc7b136571fc9ad1dbabeabe446c93e8816ec678", "target": 1} {"commit_message": "[PATCH] Improving performance of right Trsm routines based on\n suggestions from Bryan Marker.", "target": 1} {"commit_message": "[PATCH] Force field updates.\n\nNew order for SWM4-NDP and SWM6 topologies, with SETTLE\ninstead of 3 constraints. Performance is slightly better\nin this case. 
Uploading a SWM4-NDP water box that has\nbeen equilibrated for 100 ps to serve as input for gmx\nsolvate.\n\nChange-Id: I67e10693ca76e77b99b371ea9887402e7ac0acc1", "target": 0} {"commit_message": "[PATCH] Adding the ability to convert the dist rank, cross rank, and\n redundant rank for a particular distribution to the VC rank and subsequently\n making use of this to improve the performance of the GetSubmatrix routine for\n AbstractDistMatrix", "target": 1} {"commit_message": "[PATCH] Added a few lines regarding the GA array distribution but,\n only performance bug and results were still correct.", "target": 1} {"commit_message": "[PATCH] converted part of rhogen to daxpy getting performance\n improvement on ia64", "target": 1} {"commit_message": "[PATCH] Increase memory allocation for NPT simulation in Ewald.cpp.\n Assign linear molecule with 3 and more atoms to DCGraph at higher level to\n improve code performance", "target": 1} {"commit_message": "[PATCH] fix performance issue--forgot to move name parameters", "target": 1} {"commit_message": "[PATCH] add slow tag to about 60 tests that take about as much time\n as the 430 others", "target": 0} {"commit_message": "[PATCH] reverted commit 76d5bddd5c3dfdef76beaab8222231624eb75e89.\n Split ga_acc in moints2x_trf2K in smaller ga_acc on MPI-PR since gives large\n performance improvement on NERSC Cori", "target": 1} {"commit_message": "[PATCH] 1 : Store performance function 2 : Pass error into Error\n function 3 : Pass network into Error function 4 : Add public api to access\n underlying network 5 : Use perfect forwarding to accept LayerTypes", "target": 0} {"commit_message": "[PATCH] Add DenseMatrix::svd_solve().\n\nThis function fills a missing requirement in the DenseMatrix classes,\nallowing us to solve non-square systems of equations in a\nleast-squares sense. The user can pass a tolerance to svd_solve()\nwhich determines the cutoff for small singular values. 
svd_solve() is\na const member function: we make a copy internally instead of allowing\nLapack to modify A.\n\nNote that Eigen also has the capability to solve non-square systems of\nequations, but it is relatively slow, as discussed in this thread:\nhttps://forum.kde.org/viewtopic.php?f=74&t=102088, so having our own\nLapack-based implementation is worthwhile.", "target": 0} {"commit_message": "[PATCH] Use pre-trained network to accelerate test", "target": 0} {"commit_message": "[PATCH] Add DofMap::is_evaluable()\n\nThis is O(log(send_list.size())), which may be fast enough for most\nusers; there's no obvious way to do better without unsorted_set.", "target": 1} {"commit_message": "[PATCH] convert a few more styles to use fast full DP erfc()", "target": 0} {"commit_message": "[PATCH] Fixed FAST memory leaks on OpenCL backend", "target": 1} {"commit_message": "[PATCH] BJP: Cleaned up code a little to make it easier to use for\n performance testing.", "target": 0} {"commit_message": "[PATCH] polygonTriangulate: Added robust polygon triangulation\n algorithm\n\nThe new algorithm provides robust quality triangulations of non-convex\npolygons. It also produces a best attempt for polygons that are badly\nwarped or self intersecting by minimising the area in which the local\nnormal is in the opposite direction to the overal polygon normal. It is\nmemory efficient when applied to multiple polygons as it maintains and\nreuses its workspace.\n\nThis algorithm replaces implementations in the face and\nfaceTriangulation classes, which have been removed.\n\nFaces can no longer be decomposed into mixtures of tris and\nquadrilaterals. 
Polygonal faces with more than 4 sides are now\ndecomposed into triangles in foamToVTK and in paraFoam.", "target": 0} {"commit_message": "[PATCH] GetPot: Use a more efficient container for UFO detection\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@3880 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 1} {"commit_message": "[PATCH] critical performance bug fix", "target": 1} {"commit_message": "[PATCH] Knowing when the tree fails and we're stuck with a linear\n search is useful for performance testing too\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@4595 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 0} {"commit_message": "[PATCH] Optimized the performance of the CG-based LPR algorithm", "target": 1} {"commit_message": "[PATCH] Global optimizers: better parallel performance\n\n- We used to have a thread-local variable for cell::TDS_data to make\n incident_cells concurrently callable but it was slow and memory-consuming\n => new incident_cells function which do not use cell::TDS_data\n => faster and lighter\n- update_restricted_delaunay now uses parallel_for instead of parallel_do\n (it was quite slow with the implicit oracle)\n => faster (but requires to fill a temporary vector)", "target": 1} {"commit_message": "[PATCH] Added missing memory deletions on FAST unit test.", "target": 0} {"commit_message": "[PATCH] switch from int to bool to avoid a performance warning", "target": 0} {"commit_message": "[PATCH] sbgemm: cooperlake: reorder ptr increase for performance", "target": 1} {"commit_message": "[PATCH] Added documentation for FAST", "target": 0} {"commit_message": "[PATCH] made routine more efficient", "target": 1} {"commit_message": "[PATCH] Reduced the cost of the pull communication\n\nWith more than 32 ranks, a sub-communicator will be used\nfor the pull communication. This reduces the pull communication\nsignificantly with small pull groups. 
With large pull groups the total\nsimulation performance might not improve much, because ranks\nthat are not in the sub-communicator will later wait for the pull\nranks during the communication for the constraints.\n\nAdded a pull_comm_t struct to separate the data used for communication.\n\nChange-Id: I92b64d098b508b11718ef3ae175b771032ad7be2", "target": 1} {"commit_message": "[PATCH] Changed slow convergence stop criteria.", "target": 0} {"commit_message": "[PATCH] New and reorganized documentation\n\nCovers more mdrun options, moves a bit of \"practical\" content from the\nreference manual to the user guide.\n\nImported and updated information from wiki page on cutoff\nschemes. Consolidated with information from reference manual.\n\nUpdated some use of \"atom\" to \"particle\" in both guides.\n\nWe could update the performance numbers, but with the impending\nremoval of the group scheme, I don't think that's worth bothering\nabout. e.g. on Haswell, Erik already tested performance of group is a\nbit slower than Verlet, even for unbuffered water systems.\n\nChange-Id: I6410ba9fc08bb133ec8669e14dba11bcbd454fe3", "target": 0} {"commit_message": "[PATCH] propagate lower bound for culling on TM1 to accelerate\n symmetric distance", "target": 1} {"commit_message": "[PATCH] Fix slow QuadratureFunction::GetElementValues when provided\n int pt", "target": 0} {"commit_message": "[PATCH] Make Constraints a proper class\n\nConverted the opaque C struct to a pimpl-ed C++ class.\n\nNumerous callers of constraint routines now don't have to pass\nparameters that are embedded within the class at setup time,\ne.g. for logging, communication, per-atom information,\nperformance counters.\n\nSome of those parameters have been converted to use const references\nper style, which requires various callers and callees to be modified\naccordingly. 
In particular, the mtop utility functions that take const\npointers have been deprecated, and some temporary wrapper functions\nused so that we can defer the update of code completely unrelated to\nconstraints until another time. Similarly, t_commrec is retained as a\npointer, since it also makes sense to change globally.\n\nMade ConstraintVariable an enum class. This generates some compiler\nwarnings to force us to cover potential bug cases with fatal errors.\nUsed more complete names for some of the enum members.\n\nIntroduced a factory function to continue the design that constr is\nnullptr when we're not using constraints.\n\nAdded some const correctness where it now became necessary.\n\nRefs #2423\n\nChange-Id: I7a3833489b675f30863ca37c0359cd3e950b5494", "target": 0} {"commit_message": "[PATCH] Fixed bug in 4th order 3D elements. Serendipity now works up\n to order 4 in 3D. Doing some performance testing + trying to get AMR to work\n with serendipity", "target": 0} {"commit_message": "[PATCH] Limit SMT with PME on GPU\n\nFor small numbers of atoms per core, SMT can seriously deteriorate\nperformance when running both non-bondeds and PME on GPU.\nWith fewer than 10000 atoms per core, SMT is now always off by default\nwith PME on GPU and auto settings.\n\nChange-Id: I1a6b83bc81f68e89bf443e2b0ddb1fde44e2361d", "target": 0} {"commit_message": "[PATCH] - Insert a random sample of the polyhedron points, instead of\n the first points, to avoid having a triangulation of dimension < 3 - Set\n the error_behavior to ABORT, so that the try/catch of the Qt4 main loop\n does not intercept our CGAL assertions (that prevents efficient debugging).", "target": 0} {"commit_message": "[PATCH] Finally, a fully parallel version of the refinement. Not very\n efficient, though, but the idea was to identify all data races and to protect\n it using locks, atomics, TLS... 
Needs some tests now, to check if we didn't\n miss any rare data race.", "target": 0} {"commit_message": "[PATCH] Fixed data filename on FAST unit test.", "target": 0} {"commit_message": "[PATCH] Fixes for style and performance issues found by cppcheck\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@5614 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 1} {"commit_message": "[PATCH] PASSMESS_LOG_SCOPE - disabled for now\n\nI don't really want to drag the whole PerfLog into PassMess, but can't\nthink of a better way to do built-in performance logging here.", "target": 0} {"commit_message": "[PATCH] Performance improvements to kd_tree_test, added peer bounds\n checking.", "target": 1} {"commit_message": "[PATCH] resolved performance degrating changed introduced in revision\n 1319", "target": 1} {"commit_message": "[PATCH] Adding Changelog for Release 3.3.00\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 3.3\n\n(cherry picked from commit bd0c2c3713448f4d00b54998a42696ced1c1a188)", "target": 0} {"commit_message": "[PATCH] Move early return for nbnxm force reduction\n\nTo reduce dependencies and code complexity, the early return for\navoiding overhead of a force reduction reducing no forces at all\nhas been moved from nonbonded_verlet_t to atomdata.cpp. 
The check\nhas been changed from no non-local work to no non-local atoms, which\nshould not affect performance much.\n\nChange-Id: I3315699e15918482b321b702f6ba24209aa3a6b2", "target": 0} {"commit_message": "[PATCH] Improved performance on AMD GPUs", "target": 1} {"commit_message": "[PATCH] Add a \"sgemm direct\" mode for small matrixes\n\nOpenBLAS has a fancy algorithm for copying the input data while laying\nit out in a more CPU friendly memory layout.\n\nThis is great for large matrixes; the cost of the copy is easily\nammortized by the gains from the better memory layout.\n\nBut for small matrixes (on CPUs that can do efficient unaligned loads) this\ncopy can be a net loss.\n\nThis patch adds (for SKYLAKEX initially) a \"sgemm direct\" mode, that bypasses\nthe whole copy machinary for ALPHA=1/BETA=0/... standard arguments,\nfor small matrixes only.\n\nWhat is small? For the non-threaded case this has been measured to be\nin the M*N*K = 28 * 512 * 512 range, while in the threaded case it's\nless, around M*N*K = 1 * 512 * 512", "target": 0} {"commit_message": "[PATCH] commented out fast f77 compile options", "target": 0} {"commit_message": "[PATCH] Fix performance report when init_step!=0\n\nChange-Id: Ia4e15c2fb9b0e3debe7fc7f2aa8a1cdf346f90cb", "target": 0} {"commit_message": "[PATCH] Possible performance enhancement in Mesh::delete_elem. We use\n the passed Elems id() as a guess for the location of the Elem in the\n _elements vector. If the guess does not succeed, then we revert to the linear\n search.", "target": 0} {"commit_message": "[PATCH] moved MeshBase::contract() up to Mesh. Unfortunately, there\n is no good way to make MeshBase::delete_elem() efficient, so the old\n implementation of MeshBase::contract() was (potentially) O(n_elem^2), and\n consumed approximately 20 percent of the runtime in ex10. 
This new\n implementation exploits the fact that the elements are stored in a vector\n (which is why it was moved up to the Mesh class) and is linear in the number\n of elements. The new implementation is less that 1 percent of the run time\n in ex10.", "target": 0} {"commit_message": "[PATCH] SolverPerformance: Complete the integration of the templated\n SolverPerformance\n\nNow solvers return solver performance information for all components\nwith backward compatibility provided by the \"max\" function which created\nthe scalar solverPerformance from the maximum component residuals from\nthe SolverPerformance.\n\nThe residuals functionObject has been upgraded to support\nSolverPerformance so that now the initial residuals for all\n(valid) components are tabulated, e.g. for the cavity tutorial case the\nresiduals for p, Ux and Uy are listed vs time.\n\nCurrently the residualControl option of pimpleControl and simpleControl\nis supported in backward compatibility mode (only the maximum component\nresidual is considered) but in the future this will be upgraded to\nsupport convergence control for the components individually.\n\nThis development started from patches provided by Bruno Santos, See\nhttp://www.openfoam.org/mantisbt/view.php?id=1824", "target": 0} {"commit_message": "[PATCH] the mistery of the missing MIC performance on Intel 16 might\n be fixed with -qopt-assume-safe-padding", "target": 0} {"commit_message": "[PATCH] PBiCGStab: New preconditioned bi-conjugate gradient\n stabilized solver for asymmetric lduMatrices using a run-time selectable\n preconditioner\n\nReferences:\n Van der Vorst, H. A. (1992).\n Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG\n for the solution of nonsymmetric linear systems.\n SIAM Journal on scientific and Statistical Computing, 13(2), 631-644.\n\n Barrett, R., Berry, M. W., Chan, T. F., Demmel, J., Donato, J.,\n Dongarra, J., Eijkhout, V., Pozo, R., Romine, C. 
& Van der Vorst, H.\n (1994).\n Templates for the solution of linear systems:\n building blocks for iterative methods\n (Vol. 43). Siam.\n\nSee also: https://en.wikipedia.org/wiki/Biconjugate_gradient_stabilized_method\n\nTests have shown that PBiCGStab with the DILU preconditioner is more\nrobust, reliable and shows faster convergence (~2x) than PBiCG with\nDILU, in particular in parallel where PBiCG occasionally diverges.\n\nThis remarkable improvement over PBiCG prompted the update of all\ntutorial cases currently using PBiCG to use PBiCGStab instead. If any\nissues arise with this update please report on Mantis: http://bugs.openfoam.org", "target": 0} {"commit_message": "[PATCH] HvD: Dealing with the issues associated with higher order\n derivatives in automatic differentiation and closed shell calculations. The\n problem addressed in particular is that of triplet excited states in TDDFT.\n Even though the ground state is closed shell the excitation energy is clearly\n a spin dependent quantity. Essentially we can deal with this situation\n effectively only if we evaluate the functional as if we are doing an open\n shell calculation. Doing so degrades performance but the automatic\n differentiation approach is significantly slower than the symbolic algebra\n generated code anyway. 
Hence performance cannot be the main reason for using\n this code in any case.", "target": 0} {"commit_message": "[PATCH] Improved performance of bicgstab after adding the second\n convergence test.", "target": 1} {"commit_message": "[PATCH] Fixed slow COO SpMV for the OpenMP executor\n\nMoved the `omp parallel for` to the most outer loop of the apply,\nso it is parallelized over the matrix entries instead over the number\nof right hand sides for every single matrix entry.", "target": 1} {"commit_message": "[PATCH] enable GPU emulation without GPU support\n\nGPU emulation can be useful to estimate the performance one could get\nby adding GPU(s) to the machine by running with GMX_EMULATE_GPU and\nGMX_NO_NONBONDED environment variables set. As this feature is useful\neven with mdrun compiled without GPU support, this commit makes GPU\nemulation mode always available.\n\nChange-Id: I0b90b8ec1c6e3116f28f66aac4f3c8ae0831239d", "target": 0} {"commit_message": "[PATCH] replace the edge map by a vector of flat_map\n\nit is very efficient since there should not be isolated vertices.\nOn large data, the runtime of the function is divided by 3 to 4", "target": 0} {"commit_message": "[PATCH] performance improvement for DD assignment of settles with\n cg's", "target": 1} {"commit_message": "[PATCH] added benchmark comparing the performance of the two traits\n classes (CK vs CORE::Expr)", "target": 0} {"commit_message": "[PATCH] Fix for #1139 performance regression bug (and #1140 for\n tracking). 
Set default CUDA launch bounds to <0,0> and when do not use CUDA\n __launch_bounds__ unless CUDA launch bounds are explicitly specified.", "target": 1} {"commit_message": "[PATCH] #1295 More efficient implementation of A[I] = B", "target": 1} {"commit_message": "[PATCH] functionObjects::wallHeatFlux: More efficient evaluation of\n heat-flux\n\nwhich avoids the need for field interpolation and snGrad specification and\nevaluation.\n\nResolves patch request https://bugs.openfoam.org/view.php?id=2725", "target": 1} {"commit_message": "[PATCH] nonUniformTableThermophysicalFunction: New non-uniform table\n thermophysicalFunction for liquid properties\n\nDescription\n Non-uniform tabulated property function that linearly interpolates between\n the values.\n\n To speed-up the search of the non-uniform table a uniform jump-table is\n created on construction which is used for fast indirect addressing into\n the table.\n\nUsage\n \\nonUniformTable\n Property | Description\n values | List of (temperature property) value pairs\n \\endnonUniformTable\n\n Example for the density of water between 280 and 350K\n \\verbatim\n rho\n {\n type nonUniformTable;\n\n values\n (\n (280 999.87)\n (300 995.1)\n (350 973.7)\n );\n }\n \\endverbatim", "target": 0} {"commit_message": "[PATCH] Reducers: fix performance issue #680\n\nAdding non-volatile join helps.", "target": 0} {"commit_message": "[PATCH] issue #939: more efficient kronecker product for\n Sparsity/Matrix", "target": 1} {"commit_message": "[PATCH] Sparse refactored readers: Better vectorization for tile\n bitmaps calculations. (#2711)\n\n* Sparse unordered with duplicates: Better vectorization for tile bitmaps.\n\nWhen computing tile bitmaps, there are a few places where we were not\nvectorizing properly and it caused performance impacts on some customer\nscenario. 
Fixing the use of covered/overlap in the dimension class to\nenable proper vectorization.", "target": 1} {"commit_message": "[PATCH] Update paper.md\n\nSpecify SpMV performance results.", "target": 0} {"commit_message": "[PATCH] Performance improvments to CPU Anisotropic Diffusion (#2174)\n\n* Performance improvments to CPU Anisotropic Diffusion", "target": 1} {"commit_message": "[PATCH] use problem-state pointer to write SPE mailbox with lower\n latency (makes a significant performance difference for N < 32k), thanks to\n Jan Wagner for suggestion [empty commit message]", "target": 0} {"commit_message": "[PATCH] performance optimizations in sgemm_kernel_16x2_bulldozer.S", "target": 1} {"commit_message": "[PATCH] Add Z-batch to fast kernels", "target": 0} {"commit_message": "[PATCH] fix for no performance logging\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@485 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 0} {"commit_message": "[PATCH] Add doSetup parameter to Matrix::init\n\nNot calling doSetup can give you performance gains when using\npreallocation on the matrix.", "target": 0} {"commit_message": "[PATCH] Add helper to reuse generated TPR files in testing\n\nUsed static class members in GoogleTest and provided option in\ntestfilemanager to allow file path specification before test case is\nstarted.\n\nThis should speed up some of the test cases that have been slow due to\nrepeated calls to grompp.\n\nChange-Id: I50e29d04550d78f2324e3665e903d45515464298", "target": 0} {"commit_message": "[PATCH] Adding optional wrappers for MKL CSR mat-vec, allowing for\n more configure-time math library detection, adding a Hermitian version of\n Lanczos, and greatly improving the performance of right\n DiagonalScale/DiagonalSolve. 
It appears that the current distributed sparse\n matrix-vector multiplication has scalability issues.", "target": 1} {"commit_message": "[PATCH] Remove use of interaction_mask_indices on BG/Q\n\nThis field was degrading cache performance ~1% on x86. It probably\nmade little difference on BG/Q, because the extra integer operations\ncan use the second instruction-issue port, assuming the use of OpenMP\nto use more than one hardware thread per core. Overall, this code is\nabout 1% faster on BG/Q.\n\nMinor fix to the gmx_load_simd_4xn_interactions() function that looks\nup the exclusion masks, so that new non-x86 platforms won't silently\nfail for want of an implementation of this function.\n\nMinor simplication to always pass simd_interaction_indices to\ngmx_load_simd_4xn_interactions(), since it is only used on BG/Q and\nthen it is non-null.\n\nChange-Id: I140a11607810e9cf08b702cae0b48426c3592fec", "target": 1} {"commit_message": "[PATCH] Removing gigaflop utility functions, removing illegal 'const'\n declarations in unblocked TwoSidedTrmm routines, fixing comments in several\n DistMatrix declarations, and adding a Trsv test driver in preparation for a\n significant performance improvement from an upcoming patch.", "target": 1} {"commit_message": "[PATCH] Improve performance of MIC exponential\n\nThe limited precision is due to argument scaling\nrather than the exponential function lookup, so\ninstead of iterating we can improve the accuracy\nwith a simple correction step, similar to what\nwas done for the recent AVX-512ER implementation.\n\nChange-Id: If55e7c4cefac5022e7211dfa56686cb9ee03a54a", "target": 1} {"commit_message": "[PATCH] performance improvements on grid_ssw", "target": 1} {"commit_message": "[PATCH] Fix performance problems for large molecular systems", "target": 1} {"commit_message": "[PATCH] fast ColumnCovariance", "target": 0} {"commit_message": "[PATCH] Add a barrier() to ~LibMeshInit\n\nThis didn't fix the bug I was trying to hunt down, and as 
best as I\ncan tell this won't fix any actual bugs (we'll be synchronizing at the\nFinalize() calls later regardless), but it couldn't hurt to wait for\nother processes to exit before we start spewing reference counter\nand/or performance log data to the screen, lest other processes\nconsole error messages get buried.", "target": 0} {"commit_message": "[PATCH] Removed KV (key-value) functionality. We will add it back in\n a few months in a much more efficient way. (#1415)", "target": 0} {"commit_message": "[PATCH] Use new ranges in MeshTools\n\nPlus manual caching, where it's more efficient", "target": 1} {"commit_message": "[PATCH] Fast forward porting work to master\n\nChange-Id: Ieb428e4a001efadf880dbe2c64c2a685cebdd4ae", "target": 0} {"commit_message": "[PATCH] On second thoughts, follow Kokkos best performance\n recommendations when setting OpenMP env vars", "target": 0} {"commit_message": "[PATCH] Call full Kokkos::initialize in performance tests", "target": 0} {"commit_message": "[PATCH] Replace the comparison of 2 lines with a more efficient one\n (just for this specific case where direction doesnt matter).", "target": 1} {"commit_message": "[PATCH] Append all ICC Performance flags only to Release Flags\n\nSome of the ICC performance flags were appended to GMXC_CFLAGS\nand thus also used e.g. for Debug.\n\nChange-Id: Iadfaa29fb347f24208e6f2406e0d1ad41f037804", "target": 0} {"commit_message": "[PATCH] Accelerate 3D version of inexact_locate as we do it for 2D\n\nsee commit 4c477c853c82bb1ca77bb496cdce993178c3dfa2", "target": 0} {"commit_message": "[PATCH] Fix SIMD configuration management\n\nSubsequent runs of cmake gave inconsistent diagnostic messages because\nSUGGEST_BINUTILS_UPDATE was not set on subsequent runs because we were\ncaching the result of logic, as well as caching the results of\ncompilation tests. This made life confusing, e.g. 
when compiling with\ngcc on MacOS with clang assembler not available.\n\nInstead, we now re-run the fast logic (quietly, if this is a\nsubsequent run).\n\nImproved the handling of ${VARIABLE}, because there was no need to use\nFORCE because the semantics of an unset variable in CMake just work.\nThere was also no need for such variables to be put into the cache,\nand we were using one more variable than we needed to use. This meant\nit was no longer worth implementing the redundant hints about perhaps\nupdating the binutils package, nor suppressing the redundant special\nstatus-line output.\n\nNoted some TODOs for future simplification. Changed the use of SIMD to\nSOURCE, since this utility code doesn't have to relate to SIMD flags.\n\nChange-Id: Id9605ccff0903c55e2621ddd8af10c8da523bebe", "target": 0} {"commit_message": "[PATCH] corrected accidental disables of fast global to local atom\n lookup for large systems", "target": 0} {"commit_message": "[PATCH] Conditional tweak in the nonbonded GPU kernels\n\nGPU compilers miss an easy optimization of a loop invariant in the\ninner-lop conditional. 
Precomputing part of the conditional together\nwith using bitwise instead of logical and/or improves performance with\nmost compilers by up to 5%.\n\nChange-Id: I3ba0b9025b11af3d8465e0d26ca69a78e32a0ece", "target": 1} {"commit_message": "[PATCH] AABB tree demo: - added color ramps for signed and unsigned\n distance functions (thermal for unsigned, and red/blue for signed) - fix moc\n warning issue (thanks Manuel Caroli) - fix strange value in performance\n section of user manual (we now understand that the KD-tree is making\n something really nasty only for large point sets - and we'll investigate the\n KD tree more)", "target": 0} {"commit_message": "[PATCH] New traits classes for efficient handling of circles.", "target": 1} {"commit_message": "[PATCH] replace custom kdtree by a fast version of\n Orthogonal_k_neighbor_search", "target": 1} {"commit_message": "[PATCH] Inverse and make_sqrt functions added. RT / Root_of_2\n division added. Some operations with int. Comparisons function performance\n improved. Added the idea of representing a rational (when we know, by using\n the Root_of_2(FT) construction) inside the Root_of_2. The constructor\n Root_of_2(const RT&, const RT&, const RT&, bool) has now another boolean\n parameter at the end, in the case you know delta is not zero. Some others\n goodies.", "target": 1} {"commit_message": "[PATCH] Improved the performance of the gmx_sum routines on cluster\n with multi-core nodes by using a two step communication procedure", "target": 1} {"commit_message": "[PATCH] ManipulatedFrame issues : fix\n\n- The call to bbox() at each top of a manipulated frame made it verry slow to manipulate frames\n on a big item, because the bbox was computed at every call. 
The result is now kept in a\n member and updated only when invalidate_buffers is called.\n\n- The color of the cutting plane is repaired.", "target": 0} {"commit_message": "[PATCH] ResultTile::coord() performance (#1689)\n\nThis removes the branch within the ResultTile::coord() implementation to\ntest if it contains zipped or unzipped coordinates.\n\nThe function contract is slightly modified because now the caller must ensure\nthat an underlying coordinate exists at there requested position and dimension.\nCurrently, the `ResultTile::coord()` may return null if nothing is found. This\ndoes not appear to be an expected output anywhere in the unit tests or code.\n\nIn a certain read benchmark that paths through this routine multiple times,\nthis patch reduces the avg execution time from 18.15s to 17.00s.", "target": 1} {"commit_message": "[PATCH] Removed debug code with would slow down mdrun when writing\n trajectories", "target": 1} {"commit_message": "[PATCH] add SDG Linf fast examples to doc file", "target": 0} {"commit_message": "[PATCH] Fixed a performance bug in the sweep. When calling the insert\n functions on the planar map, I made sure the non-intersect version of the\n inserts are being called. I also cleaned up the code a bit.", "target": 1} {"commit_message": "[PATCH] Fix false sharing in RandomNumber Pools\n\nThe pool arrays are accessed by all threads, but the\nelements per thread are rather small and thus share cachelines.\n\nMaking the arrays 2D using the second dimension for padding only,\nsolves that problem. I have seen up to 200x improvement for\n20 threads on skylake running the random number example with 100k and 1\nas parameters. 
Serial performance is slightly reduced, but it only makes\nthe \"grep a generator slower\" by the equivalent of one double load more\nand a few integer ops.", "target": 1} {"commit_message": "[PATCH] erased the function Oriented_Side::oriented_side(Plane_3,\n Point_3): it was buggy before and not efficient enough after it has been\n corrected.", "target": 1} {"commit_message": "[PATCH] This should slightly improve performance in non-threaded proj\n constraint generation, may save us from a race condition leading to\n inaccurate proj constraints in a few corner cases (3D, level one rule off; or\n AMR combined with periodic BCs) when we're threaded.", "target": 1} {"commit_message": "[PATCH] Robust retries for S3 request-limit retries (#1651)\n\nThis patch provides a retry handler that is identical to the existing, default\nhandler with an exception for CoreErrors::SLOW_DOWN. For this error, we will\nunconditionally retry every 1.25-1.75 seconds.\n\nThe motivation for this patch is to allow the TileDB client to remain functional\neven when performance may be bottlenecked on this error. The server returns\nthis error when we exceed a fixed number of requests per second. The client\nwill eventually make progress.", "target": 1} {"commit_message": "[PATCH] Keep track of FE requests for mapping data\n\nBefore, if *only* mapping data was requested, we wouldn't realize that,\nwe would think that nothing had been requested, and we would calculate\nall data at reinit(), for backwards compatibility. 
This was making the\nElem::volume() fallback code too slow in the general case, and was\nbreaking it (due to unimplemented second derivatives) in the Rational\nBezier case.", "target": 0} {"commit_message": "[PATCH] Revert \"Minor changes to use libmesh indentation,\n initialization styles.\"\n\nThis reverts commit ddf20ff42df730dfb38bf0d618b7d1d9264948b7.\n\nRevert \"Write element truth table to exodus to improve performance\"\n\nThis reverts commit 3c400489f41dbd72fe704c3dfe94db6e9bcd200e.", "target": 1} {"commit_message": "[PATCH] Sort cell slab ranges for ND arrays (#1736)\n\nThe selective decompression intersection algorithm requires cell slab ranges\nto be sorted in ascending order. This is true for 1D arrays, but not for ND\narrays. In this scenario, we must sort them.\n\nThe ranges are sorted in vectors that have already been partitioned per-tile.\nThis should keep the sort runtimes relatively quick. In the future, we can\nbenchmark this on large arrays+queries and measure its timing. With this sort,\nwe now may coalesce cell ranges as a future optimization.\n\nIf the N*LOG(N) sorting is too slow, the alternative approach is to leave them\nunsorted and perform O(N*M) range-chunk intersection comparisons, where M is\nthe number of chunks. If the number of chunks is less than LOG(N), this may\nbe faster.\n\nThis solves the following error message:\n[TileDB::ChunkedBuffer] Error: Chunk read error; chunk unallocated error\n\nCo-authored-by: Joe Maley ", "target": 0} {"commit_message": "[PATCH] Don't store zero entries in IGA constraint rows\n\nMost meshes aren't going to have many of these, but omitting them should\nbe a slight performance increase. 
More importantly, this makes\ndebugging on trivial meshes much easier.", "target": 0} {"commit_message": "[PATCH] moved performance fix for 7.30 compilers to FFLAGS", "target": 0} {"commit_message": "[PATCH] Allow to do gemv and ger buffer allocation on the stack\n\nger and gemv call blas_memory_alloc/free which in their turn\ncall blas_lock. blas_lock create thread contention when matrices\nare small and the number of thread is high enough. We avoid\ncall blas_memory_alloc by replacing it with stack allocation.\nThis can be enabled with:\nmake -DMAX_STACK_ALLOC=2048\nThe given size (in byte) must be high enough to avoid thread contention\nand small enough to avoid stack overflow.\n\nFix #478", "target": 0} {"commit_message": "[PATCH] Add fix elstop to USER-MISC\n\nImplements inelastic energy loss for fast particles in solids.", "target": 0} {"commit_message": "[PATCH] Anisotropic smoothing perf improvements in CUDA\n\nThis improves CUDA backend performance by about 24%", "target": 1} {"commit_message": "[PATCH] Remove the need for most locking in memory.c.\n\nUsing thread local storage for tracking memory allocations means that threads\nno longer have to lock at all when doing memory allocations / frees. 
This\nparticularly helps the gemm driver since it does an allocation per invocation.\nEven without threading at all, this helps, since even calling a lock with\nno contention has a cost:\n\nBefore this change, no threading:\n```\n----------------------------------------------------\nBenchmark Time CPU Iterations\n----------------------------------------------------\nBM_SGEMM/4 102 ns 102 ns 13504412\nBM_SGEMM/6 175 ns 175 ns 7997580\nBM_SGEMM/8 205 ns 205 ns 6842073\nBM_SGEMM/10 266 ns 266 ns 5294919\nBM_SGEMM/16 478 ns 478 ns 2963441\nBM_SGEMM/20 690 ns 690 ns 2144755\nBM_SGEMM/32 1906 ns 1906 ns 716981\nBM_SGEMM/40 2983 ns 2983 ns 473218\nBM_SGEMM/64 9421 ns 9422 ns 148450\nBM_SGEMM/72 12630 ns 12631 ns 112105\nBM_SGEMM/80 15845 ns 15846 ns 89118\nBM_SGEMM/90 25675 ns 25676 ns 54332\nBM_SGEMM/100 29864 ns 29865 ns 47120\nBM_SGEMM/112 37841 ns 37842 ns 36717\nBM_SGEMM/128 56531 ns 56532 ns 25361\nBM_SGEMM/140 75886 ns 75888 ns 18143\nBM_SGEMM/150 98493 ns 98496 ns 14299\nBM_SGEMM/160 102620 ns 102622 ns 13381\nBM_SGEMM/170 135169 ns 135173 ns 10231\nBM_SGEMM/180 146170 ns 146172 ns 9535\nBM_SGEMM/189 190226 ns 190231 ns 7397\nBM_SGEMM/200 194513 ns 194519 ns 7210\nBM_SGEMM/256 396561 ns 396573 ns 3531\n```\nwith this change:\n```\n----------------------------------------------------\nBenchmark Time CPU Iterations\n----------------------------------------------------\nBM_SGEMM/4 95 ns 95 ns 14500387\nBM_SGEMM/6 166 ns 166 ns 8381763\nBM_SGEMM/8 196 ns 196 ns 7277044\nBM_SGEMM/10 256 ns 256 ns 5515721\nBM_SGEMM/16 463 ns 463 ns 3025197\nBM_SGEMM/20 636 ns 636 ns 2070213\nBM_SGEMM/32 1885 ns 1885 ns 739444\nBM_SGEMM/40 2969 ns 2969 ns 472152\nBM_SGEMM/64 9371 ns 9372 ns 148932\nBM_SGEMM/72 12431 ns 12431 ns 112919\nBM_SGEMM/80 15615 ns 15616 ns 89978\nBM_SGEMM/90 25397 ns 25398 ns 55041\nBM_SGEMM/100 29445 ns 29446 ns 47540\nBM_SGEMM/112 37530 ns 37531 ns 37286\nBM_SGEMM/128 55373 ns 55375 ns 25277\nBM_SGEMM/140 76241 ns 76241 ns 18259\nBM_SGEMM/150 102196 ns 102200 
ns 13736\nBM_SGEMM/160 101521 ns 101525 ns 13556\nBM_SGEMM/170 136182 ns 136184 ns 10567\nBM_SGEMM/180 146861 ns 146864 ns 9035\nBM_SGEMM/189 192632 ns 192632 ns 7231\nBM_SGEMM/200 198547 ns 198555 ns 6995\nBM_SGEMM/256 392316 ns 392330 ns 3539\n```\n\nBefore, when built with USE_THREAD=1, GEMM_MULTITHREAD_THRESHOLD = 4, the cost\nof small matrix operations was overshadowed by thread locking (look smaller than\n32) even when not explicitly spawning threads:\n```\n----------------------------------------------------\nBenchmark Time CPU Iterations\n----------------------------------------------------\nBM_SGEMM/4 328 ns 328 ns 4170562\nBM_SGEMM/6 396 ns 396 ns 3536400\nBM_SGEMM/8 418 ns 418 ns 3330102\nBM_SGEMM/10 491 ns 491 ns 2863047\nBM_SGEMM/16 710 ns 710 ns 2028314\nBM_SGEMM/20 871 ns 871 ns 1581546\nBM_SGEMM/32 2132 ns 2132 ns 657089\nBM_SGEMM/40 3197 ns 3196 ns 437969\nBM_SGEMM/64 9645 ns 9645 ns 144987\nBM_SGEMM/72 35064 ns 32881 ns 50264\nBM_SGEMM/80 37661 ns 35787 ns 42080\nBM_SGEMM/90 36507 ns 36077 ns 40091\nBM_SGEMM/100 32513 ns 31850 ns 48607\nBM_SGEMM/112 41742 ns 41207 ns 37273\nBM_SGEMM/128 67211 ns 65095 ns 21933\nBM_SGEMM/140 68263 ns 67943 ns 19245\nBM_SGEMM/150 121854 ns 115439 ns 10660\nBM_SGEMM/160 116826 ns 115539 ns 10000\nBM_SGEMM/170 126566 ns 122798 ns 11960\nBM_SGEMM/180 130088 ns 127292 ns 11503\nBM_SGEMM/189 120309 ns 116634 ns 13162\nBM_SGEMM/200 114559 ns 110993 ns 10000\nBM_SGEMM/256 217063 ns 207806 ns 6417\n```\nand after, it's gone (note this includes my other change which reduces calls\nto num_cpu_avail):\n```\n----------------------------------------------------\nBenchmark Time CPU Iterations\n----------------------------------------------------\nBM_SGEMM/4 95 ns 95 ns 12347650\nBM_SGEMM/6 166 ns 166 ns 8259683\nBM_SGEMM/8 193 ns 193 ns 7162210\nBM_SGEMM/10 258 ns 258 ns 5415657\nBM_SGEMM/16 471 ns 471 ns 2981009\nBM_SGEMM/20 666 ns 666 ns 2148002\nBM_SGEMM/32 1903 ns 1903 ns 738245\nBM_SGEMM/40 2969 ns 2969 ns 
473239\nBM_SGEMM/64 9440 ns 9440 ns 148442\nBM_SGEMM/72 37239 ns 33330 ns 46813\nBM_SGEMM/80 57350 ns 55949 ns 32251\nBM_SGEMM/90 36275 ns 36249 ns 42259\nBM_SGEMM/100 31111 ns 31008 ns 45270\nBM_SGEMM/112 43782 ns 40912 ns 34749\nBM_SGEMM/128 67375 ns 64406 ns 22443\nBM_SGEMM/140 76389 ns 67003 ns 21430\nBM_SGEMM/150 72952 ns 71830 ns 19793\nBM_SGEMM/160 97039 ns 96858 ns 11498\nBM_SGEMM/170 123272 ns 122007 ns 11855\nBM_SGEMM/180 126828 ns 126505 ns 11567\nBM_SGEMM/189 115179 ns 114665 ns 11044\nBM_SGEMM/200 89289 ns 87259 ns 16147\nBM_SGEMM/256 226252 ns 222677 ns 7375\n```\n\nI've also tested this with ThreadSanitizer and found no data races during\nexecution. I'm not sure why 200 is always faster than it's neighbors, we must\nbe hitting some optimal cache size or something.", "target": 0} {"commit_message": "[PATCH] efficient sparsity pattern computation for the case when the\n user specifies the DOF coupling\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1130 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 0} {"commit_message": "[PATCH] allows requesting of multiple moments. restructures blocks\n for performance improvements", "target": 1} {"commit_message": "[PATCH] Move performance logging of solve()s into solver classes\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1514 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 0} {"commit_message": "[PATCH] Fix pixel tests in fast kernels", "target": 0} {"commit_message": "[PATCH] replace patch routines that call scatter with more efficient\n ones; combine multiple ddots into one call to reduce no. of gops; added\n inheritance to print control", "target": 1} {"commit_message": "[PATCH] RJH: Inserted performance statistics into the SCF using\n pstat. 
RJH: File cscfps.fh contains the handles.", "target": 0} {"commit_message": "[PATCH] working but still slow, about to switch to intervals", "target": 0} {"commit_message": "[PATCH] Unroll middle jm loop in the nbnxm kernels on Ampere\n\nThe unrolling improves performance of the non-bonded kernels by up to\n12%.\n\nRefs #3872", "target": 1} {"commit_message": "[PATCH] Improve performance of some mathematical functions", "target": 1} {"commit_message": "[PATCH] Remove Isidore_only_equalized_KLT_5000_with_normals.xyz from\n test suite as the Surface Mesher/APSS bug is fixed and processing\n Isidore_only_equalized_KLT_5000_with_normals.xyz is very slow", "target": 0} {"commit_message": "[PATCH] Improved performance of CCMA", "target": 1} {"commit_message": "[PATCH] Fixes for QM/MM MdModule\n\nSeveral fixes for problems found in QM/MM during beta1:\n\n* Additional check for external input files\n* Changed writing from `gmx::TextWriter` to `gmx::TextOutputFile` (`TextWriter` tries to format lines which is very slow in case of big files)\n* Added QM/MM to highlights\n\nRefs #3172", "target": 0} {"commit_message": "[PATCH] Remove constant acceleration groups\n\nPer Redmine discussion, this has been broken for about 10\nyears. Simplifying the update and global energy code is nice, and some\nintegrators will now be a trifle faster.\n\nThe value of SimulationAtomGroupType::Count can't change because reading old .tpr files relies\non it, but the enumeration value has a new name so that any relevant\ncode rebasing over this change can't silently still compile. This does\nmean that loops over egcNR are now slightly less efficient than they\ncould be, and some of those are in the per-step update code. 
But that\nis likely not worth changing in the current form of the code.\n\nOriginally authored by Mark Abraham at\nhttps://gerrit.gromacs.org/c/gromacs/+/8944.\n\nFixes #1354", "target": 0} {"commit_message": "[PATCH] performance updates??....EJB", "target": 0} {"commit_message": "[PATCH] use local variables for more efficient force accumulation", "target": 1} {"commit_message": "[PATCH] Adding benchmark directory and a bytes_and_flops benchmark\n\nThis directory is intended for hardware and software benchmarks.\nThese are not necessarily performance unit tests, but usually a bit\nmore complex than that.\n\nAnd they are intended to be used on their own.", "target": 0} {"commit_message": "[PATCH] Fix clang-3.7 build warnings in core unit and performance\n tests\n\nFix mismatched use of array new and scalar delete\nT * data = new T[1];\n...\ndelete [] data;", "target": 0} {"commit_message": "[PATCH] Move .f to .F - Including log of .f\n\nRCS file: /msrc/proj/mss/nwchem/src/ddscf/fast/cheby.f,v\nWorking file: cheby.f\nhead: 1.4\nbranch:\nlocks: strict\naccess list:\nsymbolic names:\n release-4-5-patches: 1.4.0.8\n release-4-5: 1.4\n bettis: 1.4\n release-4-1-patches: 1.4.0.6\n release-4-1: 1.4\n release-4-0-1: 1.4\n release-4-0-patches: 1.4.0.4\n release-4-0: 1.4\n v3-3-1: 1.4\n release-3-3-patches: 1.4.0.2\n release-3-3: 1.4\nkeyword substitution: kv\ntotal revisions: 4; selected revisions: 4\ndescription:\n----------------------------\nrevision 1.4\ndate: 1999/07/29 00:53:56; author: d3e129; state: Exp; lines: +3 -0\nadded cvs ID tags\n----------------------------\nrevision 1.3\ndate: 1999/05/10 18:43:30; author: d3g681; state: Exp; lines: +37 -14\nmajor changes to improve precision, speed, stability; fully dynamic FMM to permit deep trees; rtdb input parameters; this is the version used for the paper\n----------------------------\nrevision 1.2\ndate: 1999/01/11 18:18:49; author: d3g681; state: Exp; lines: +250 -124\nbetter linear algebra in the fitting, more varied 
fit routines, removed dead code\n----------------------------\nrevision 1.1\ndate: 1999/01/01 04:57:29; author: d3g681; state: Exp;\nFirst phase of integrating the fast coulomb code into nwchem. The nested grid evaluation of the density, fourier interpolation, FMM, and fourier solution of the free space Poisson equation", "target": 0} {"commit_message": "[PATCH] Lazily Init AWS ClientConfiguration\n\nAWS changes the ClientConfiguration in the 1.8 SDK to do the checking for env\nvariables and ec2 metadata in [1]. This can cause TileDB to behavior slow if\nS3 support is built but the environment is not configured. The AWS SDK\ncheck for the ec2 metadata and has to wait for a timeout. We need to\nlazily init the ClientConfiguration.\n\n[1] https://github.com/aws/aws-sdk-cpp/commit/147469373c9fec1037bd2d75d7cd949250c6f7c5", "target": 0} {"commit_message": "[PATCH] Command-line override of default Metis-vs-Parmetis\n\nUsing Parmetis for ReplicatedMesh doesn't seem to make any difference in\nperformance on our benchmark set.", "target": 0} {"commit_message": "[PATCH] improved load balancing on the GPU\n\nFor the GPU, small pair list entries are now sorted to the end.\nThe improves performance by 5 to 20%.\n\nChange-Id: I25e5efeb813ad5dde48f0955366519db699f21a2", "target": 1} {"commit_message": "[PATCH] Adjust s3 multi-part locking to unlock early\n\nThis switches from iteraters to using find + at to take references to\nthe state objects for manipulations. This allows us to release the S3 class\nlevel locks earlier and faster removing performance bottlenecks.", "target": 1} {"commit_message": "[PATCH] gmres restart converges but slow", "target": 0} {"commit_message": "[PATCH] Fixed backend API for join\n\n* Removed output dimensions as parameter\n* Fixed unitialized warnings for fast cpu", "target": 0} {"commit_message": "[PATCH] - Changed Hash_map to Unique_hash_map. Kept old file. -\n Separated Handle_hash_function into own file. 
- Improved performance of\n default Handle_hash_function. - Rewrote manual pages including the\n UniqueHandleFunction concept. - Made protected methods in chained_map.h\n public such that Unique_hash_map can be implemented using a private member \n instead of private inheritance.", "target": 1} {"commit_message": "[PATCH] Adding timers into ID to help facilitate @YingzhouLi's\n performance benchmarks", "target": 0} {"commit_message": "[PATCH] make use of CUDA stream priorities\n\nCUDA 5.5 introduced steam priorities with 2 levels. We make use of this\nfeature by launching the non-local non-bonded kernel in a high priority\nstream. As a consequence, the non-local kernel will preempt the local\none and finish first. This will improve performance in multi-node runs\nby reducing the possibility of late arrival of non-local forces.\n\nChange-Id: I4efc65546e4135f12006c0422e1fca42a788129f", "target": 1} {"commit_message": "[PATCH] Adding rough drafts of fast tridiagonalization on square\n process grids.", "target": 0} {"commit_message": "[PATCH] Change tolerance as it becomes pretty slow", "target": 0} {"commit_message": "[PATCH] Added CUDA LJ-PME nbnxn kernels\n\nThis change implements CUDA non-bonded kernels for LJ-PME introduced\nin the Verlet scheme with 99029d.\n\nThe CUDA kernels implement geometric as well as Lorentz-Berthelot (LB)\ncombinations rules (unlike the CPU SIMD) mostly because even though PME\nis very slow with LB, it is still beneficial to let the user offload the\nnon-bondeds to a GPU and potentially bump up the cut-off to further\nreduce the CPU PME load.\n\nNote that as now we have 120 kernels compiled for up to four different\ntarget architectures, the nbnxn_cuda module takes a very long time to\nbuild and can become the bottleneck during compilation. 
We will deal\nwith this later.\n\nChange-Id: I819b59a8948da0c8492eac6a43d4a7fb6dc98354", "target": 0} {"commit_message": "[PATCH] Run include order check in doc-check\n\nNow the doc-check target also checks that all files conform to the\ninclude ordering produced by the include sorter.\n\nAdd support into the include sorter for only checking the ordering\nwithout changing anything, and partially improve things such that the\nfull contents of the file are no longer required for some parts of the\nchecking. There seems to be no performance impact for now from storing\nall the file contents in memory, so did not go through this, but the\npartial changes also improve readability of the code.\nAdd support to gmxtree for loading the git attributes, to know which\nfiles not to check for include ordering.\n\nChange-Id: I919850dab2dfa742f9fb5b216cc163bc118082cc", "target": 0} {"commit_message": "[PATCH] HvD: The density functionals with derivatives up to and\n including third order generated by automatic differentiation using the\n univariate Taylor series approach by Griewank et al. DOI:\n 10.1090/S0025-5718-00-01120-0.\n\nFor this code to perform well it is essential that the compiler inlines a lot\nof code. If the overloaded operators are evaluated as function calls the\ncalculations slow down by more than an order of magnitude. Some compilers\ncan only inline code within a single file. Hence the addition of nwxc.F\nthat includes all the source code of the various functionals. In particular\nthe GCC generated code saw significant performance improvements as a result.\nFor GCC one is adviced to use compiler versions post 4.6 as apparently\nsignificant improvements to the inline capabilities were introduced at that\npoint.\n\nAt this point the gradient and 2nd derivative evaluation have been tested.\nFor the 3rd order derivatives test cases that actually exploit those derivatives\nare still needed. 
If there are any deficiencies at that level they are most\nlikely to stem from the driver routine NWXC_EVAL_DF3_DRIVER which has to\ngenerate the appropriate partial derivatives and interpolate the final results.\nThe generation of the derivatives themselves should work alright as the unit\ntests in src/nwxc/nwad/unit_tests work and the lower order order derivatives\nof the functionals work.\n\nAt the moment the performance is still not as good as I would like it. I suspect\nthat the univariate approach is in part to blame for that. The fact that all\nlower order derivatives need to be re-evaluated for every highest order partial\nderivative introduces an overhead that is close to proportional to the\nhighest order of derivative requested. I am planning to try a multivariate\napproach instead. This however has increased memory access as a downside,\nin particular during assignments. We will see what is best on balance.", "target": 1} {"commit_message": "[PATCH] Modified operations so that MPI_Type_free is called after\n wait. Also tried adding and MPI_Win_flush_local call to force both local and\n remote completion before calling MPI_Type_free, but does not seem to get rid\n of failures in some of the performance tests.", "target": 0} {"commit_message": "[PATCH] MKK: Non-blocking performance test", "target": 0} {"commit_message": "[PATCH] Because I wasn't seeing a speed-up on grapes, I made the\n accumulation put (add_hash_block) non-blocking in hopes that that\n communication, which probably hits more contention because it is a write, can\n be overlapped with all the local computation going on prior to it.", "target": 0} {"commit_message": "[PATCH] Compile without debugging or profiling symbols by default\n (i.e. 
be fast unless the user asks otherwise).", "target": 0} {"commit_message": "[PATCH] NVIDIA Volta performance tweaks\n\nRemoved ballot syncs and replaced all computed masks with full warp\nmask (as all branches in question are warp-synchronous).\nThis improves performance by 7-12%.\n\nChange-Id: I769d6d8f0d171eb528d30868d567624d5e246dbf", "target": 1} {"commit_message": "[PATCH] Don't construct perflog strings unless needed\n\nPreviously, if we disabled the perflog at runtime (rather than compile\ntime), it would *not* disable the implicit construction of C++\nstd::strings from C char* strings, which turns out to be the most\ncostly part of perf log operation.\n\nWe still need to rework this whole class for efficiency, but at the\nvery least it's now efficient when disabled.", "target": 0} {"commit_message": "[PATCH] Add implementation of convolution using the naive approach\n (it's pretty fast for small filter).", "target": 0} {"commit_message": "[PATCH] Concerning #100: implementing fast dependsOn for SX and MX", "target": 0} {"commit_message": "[PATCH] Tabulated log(x) to improve GB performance", "target": 1} {"commit_message": "[PATCH] More efficient nlp_solver_evaluate", "target": 1} {"commit_message": "[PATCH] [ZARCH] Improve loading performance for camax/icamax", "target": 1} {"commit_message": "[PATCH] Enforced rotation: Minor performance optimization in\n do_flex[1,2]_lowlevel", "target": 1} {"commit_message": "[PATCH] Update MFEM to not run the performance miniapp if code\n coverage is enabled. 
Runtime exceeds automated testing engine allowance.", "target": 0} {"commit_message": "[PATCH] Use arma::pow to use fast armadillo computation", "target": 1} {"commit_message": "[PATCH] Introduce HostAllocationPolicy\n\nThis permits host-side standard containers and smart pointers to have\ntheir contents placed in memory suitable for efficient GPU transfer.\n\nThe behaviour can be configured at run time during simulation setup,\nso that if we are not running on a GPU, then none of the buffers that\nmight be affected actually are. The downside is that all such\ncontainers now have state.\n\nChange-Id: I9367d0f996de04c21312cef2081cc08148f80561", "target": 0} {"commit_message": "[PATCH] Removed unnecassary flush of trn,xtc,edr. Important for\n performance for very frequent writes (on small systems) Fixed a bug related\n to setting the duty of pp/pme/io", "target": 1} {"commit_message": "[PATCH] AMG-DD implementation (#145)\n\nThis includes the implementation of the AMG-DD algorithm, a variant of BoomerAMG designed to limit communication.\n\nAMG-DD may be used as a standalone solver or a preconditioner for Krylov methods (note that AMG-DD is a non-symmetric preconditioner). 
For an example of how to set up and use AMG-DD, see the IJ driver (src/test/ij.c).\n\nA list with the parameters of AMG-DD is given below:\n\nPadding (recommended default 1): HYPRE_BoomerAMGDDSetPadding(...)\nNumber of ghost layers (recommended default 1): HYPRE_BoomerAMGDDSetNumGhostLayers(...)\nNumber of inner FAC cycles per AMG-DD iteration (default 2): HYPRE_BoomerAMGDDSetFACNumCycles(...)\nFAC cycle type: HYPRE_BoomerAMGDDSetFACCycleType(...)\n1 = V-cycle (default)\n2 = W-cycle\n3 = F-cycle\nNumber of relaxations on each level during FAC cycle: HYPRE_BoomerAMGDDSetFACNumRelax(...)\nType of local relaxation during FAC cycle: HYPRE_BoomerAMGDDSetFACRelaxType(...)\n0 = Jacobi\n1 = Gauss-Seidel\n2 = ordered Gauss-Seidel\n3 = C/F L1-scaled Jacobi (default)\n\nFor more details of the algorithm, see Mitchell W.B., R. Strzodka, and R.D. Falgout (2020), Parallel Performance of Algebraic Multigrid Domain Decomposition (AMG-DD).", "target": 0} {"commit_message": "[PATCH] aabb tree: added performance section [to be completed with\n distance queries and curves]", "target": 0} {"commit_message": "[PATCH] Added more constructors, for more efficient handling of\n circles with rational radii.", "target": 1} {"commit_message": "[PATCH] added precompiler commands for performance tests", "target": 0} {"commit_message": "[PATCH] Use templates to improve performance when not using triclinic\n boxes", "target": 1} {"commit_message": "[PATCH] New test file for simple performance test", "target": 0} {"commit_message": "[PATCH] Made FAST OpenCL results match CUDA results", "target": 0} {"commit_message": "[PATCH] Use boost::thread::hardware_concurrency instead of the\n std::thread one\n\nstd::thread::hardware_concurrency is not implemented in my GCC version\nWas causing a huge performance problem on Linux.", "target": 1} {"commit_message": "[PATCH] s390x/Z14: Change register blocking for SGEMM to 16x4\n\nChange register blocking for SGEMM (and STRMM) on z14 from 8x4 to 16x4\nby 
adjusting SGEMM_DEFAULT_UNROLL_M and choosing the appropriate copy\nimplementations. Actually make KERNEL.Z14 more flexible, so that the\nchange in param.h suffices. As a result, performance for SGEMM improves\nby around 30% on z15.\n\nOn z14, FP SIMD instructions can operate on float-sized scalars in\nvector registers, while z13 could do that for double-sized scalars only.\nThus, we can double the amount of elements of C that are held in\nregisters in an SGEMM kernel.\n\nSigned-off-by: Marius Hillenbrand ", "target": 1} {"commit_message": "[PATCH] Random engines & distributions as proper C++11 classes\n\nThis change implements the ThreeFry2x64 random engine with flexible\nnumber of encryption rounds and internal counter bits. The class is\ncompatible with the C++11 random number generators, and the GROMACS\ntabulated normal distribution has likewise been turned into a\nrandom distribution compatible with C++11, meaning they can be used in\nalmost any combination with the standard library distributions.\n- The ThreeFry2x64 implementation uses John Salmon's idea of a template-\n selected internal counter so a number of bits are reserved to generate\n an arbitrary random stream. This makes it possible to use ThreeFry as\n a normal random engine, and even in counter mode it is possible to\n draw an arbitrary amount of random numbers before restarting counters.\n- Both accurate (20-round) and fast (13-round) versions are available.\n- There is a gmx::DefaultRandomEngine when we don't care about details.\n- gmx::GammaDistribution has been added to work around bugs in\n libstdc++-4.4.7 headers, and to avoid getting different results\n for libstdc++ vs. 
libc++.\n- Custom Uniform, normal, and exponential distributions have been added\n to make all results reproducible across platforms since stdlibc++ and\n libc++ do not use the same generating algorithms.\n- Code using random numbers has been updated, but no changes have been\n made to turn random seeds into 64bits yet.\n- The selection nbsearch unit test was a bit fragile and very sensitive\n to the coordinate specific values; this has been fixed so it should\n be resilient no matter what RNG is used in the future.\n\nChange-Id: I47a04d03e2f264e1a6ef0aa0a2174cb464ed9af7", "target": 0} {"commit_message": "[PATCH] PERF: Anisotropic smoothing improvements (#2713)\n\nThis improves CUDA/OpenCL backend performance by about 24%", "target": 1} {"commit_message": "[PATCH] Fixed performance problems when many boxes are on a process:\n Removed a Sort from the communication stuff. Reordered the periodic boxes in\n the neighborhood. Changed the CommInfoFromStencil algorithm.", "target": 1} {"commit_message": "[PATCH] Removed nbnxn kernel blendv optimization\n\nThe nbnxn simd kernel blendv optmization, which was accidentally\ndeactivated since 5.0, has been removed. It made assumptions about\nthe internal storage of SIMD representations. 
With gcc 4.x blendv\nwould give a small performance improvement, but with gcc 5 performance\nis equal or deteriorates.\n\nChange-Id: I2b07895257a2fde0ade2a627369ed22683dd89e1", "target": 1} {"commit_message": "[PATCH] Issue #2056 More efficient mapped evaluation", "target": 1} {"commit_message": "[PATCH] Refactor the performance measurement system", "target": 0} {"commit_message": "[PATCH] Added workaround for performance issue in CasADi 2.4", "target": 0} {"commit_message": "[PATCH] s390x: Optimize SGEMM/DGEMM blocks for z14 with explicit loop\n unrolling/interleaving\n\nImprove performance of SGEMM and DGEMM on z14 and z15 by unrolling and\ninterleaving the inner loop of the SGEMM 16x4 and DGEMM 8x4 blocks.\nSpecifically, we explicitly interleave vector register loads and\ncomputation of two iterations.\n\nNote that this change only adds one C function, since SGEMM 16x4 and\nDGEMM 8x4 actually map to the same C code: they both hold intermediate\nresults in a 4x4 grid of vector registers, and the C implementation is\nbuilt around that.\n\nSigned-off-by: Marius Hillenbrand ", "target": 1} {"commit_message": "[PATCH] Sparse refactored readers: Better vectorization for tile\n bitmaps calculations. (#2711) (#2734)\n\n* Sparse unordered with duplicates: Better vectorization for tile bitmaps.\n\nWhen computing tile bitmaps, there are a few places where we were not\nvectorizing properly and it caused performance impacts on some customer\nscenario. Fixing the use of covered/overlap in the dimension class to\nenable proper vectorization.", "target": 0} {"commit_message": "[PATCH] Re-enable i-atom type local mem prefetch in OpenCL\n\nFor reasons unknown this has been disabled in the original OpenCL\nimplementation. However, it turns out that prefetching does have\nsubstantial performance benefits, especially on AMD (>10%) and in some\ncases on NVIDIA too (although not on Maxwell).\n\nThis change re-enables prefetching code-path and turns it on\nfor AMD devices. 
For NVIDIA the decision will be revisited later.\n\nThe GMX_OCL_ENABLE_I_PREFETCH/GMX_OCL_DISABLE_I_PREFETCH environment\nvariables allow testing prefetching with future architectures/compilers.\n\nChange-Id: I8324d62d3d78e0a1577dd3125edf059d3b311c2f", "target": 1} {"commit_message": "[PATCH] Increase release build timeout, macos is being a bit slow on\n Azure (#2495)", "target": 0} {"commit_message": "[PATCH] default to 1 thread, make the user manually specify the\n number of threads to avoid mpi/thread contention\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@2672 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 0} {"commit_message": "[PATCH] Speedup of kernel caching mechanism by hashing sources at\n compile time (#3043)\n\n* Reduced overhead of kernel caching for OpenCL & CUDA.\n\nThe program source files memory footprint is reduced (-30%) by eliminating\ncomments in the generated kernel headers. Hash calculation of each source\nfile is performed at compile time and incrementally extended at runtime\nwith the options & tInstance vectors. Overall performance increased up to\n21%, up to the point that the GPU becomes the bottleneck, and the overhead\nto launch the same (small) kernel was improved by 63%.\n\n* Fix couple of minor cmake changes\n\n* Move spdlog fetch to use it in bin2cpp link command\n\nCo-authored-by: pradeep \n(cherry picked from commit 3cde757face979cd9f51a4c01bd26107e69e4605)", "target": 1} {"commit_message": "[PATCH] added driver for testing performance", "target": 0} {"commit_message": "[PATCH] Use the Side<> class as a proxy when building element sides. \n This eliminates the need for allocating and deallocating connectivity storage\n arrays when building sides, thus making the Elem::build_side() member more\n efficient. Note that this has not been implemented yet in the case of\n infinite elements, however it would be easy to add. 
Assuming there are many\n more interior elements than infinite elements there is also probably little\n performance impact.", "target": 1} {"commit_message": "[PATCH] More efficient SystemBase::reinit, flux_jump indicator should\n work but needs testing", "target": 1} {"commit_message": "[PATCH] improved performance in some of the\n IsoparametricTransformation::*RevDiff methods by stack allocating dFdx_bar\n inside. Added additional FiniteElement::*RevDiff methods for differentiating\n various methods. Added EvalRevDiff methods to additional coefficient classes,\n and added tests for the coefficient differentiation as well as the finite\n element differentiation", "target": 1} {"commit_message": "[PATCH] Add class ListOfLists\n\nThis is a replacement for t_blocka, i.e. a performance optimized\nimplementation of a list of lists. It only allows appending a list\nat the end of the list of lists.\n\nChange-Id: Ib4b7f5f0e57b82c939f53e9805dc16e9d76db22b", "target": 1} {"commit_message": "[PATCH] Improved the performance on Powerpc by tweaking the altivec\n innerloops and changing sqrt(x) to x*invsqrt(x)", "target": 1} {"commit_message": "[PATCH] Created Qt Demo for Polyhedron_shortest_path\n\nA shortest paths object is created using 'Make Shortest Path' from the\nmenu on a polyhedron.\nSource points are created by shift-clicking on the polyhedron.\nPoints can also be removed by selecting the appropriate option in the\ncombo box. Point locations can be snapped to the nearest edge or vertex,\nor placed anywhere on the face. Choosing 'Shortest Path' from the combo\nbox and shift-clicking will create a polyline object representing the\nshortest path from any of the source points to that destination. 
Note\nthat computing the shortest paths tree may be slow on the first query.", "target": 0} {"commit_message": "[PATCH] fixed problem with index group for system size and made the\n diameter calculation more efficient", "target": 1} {"commit_message": "[PATCH] Back to bebug aabb slow problem", "target": 0} {"commit_message": "[PATCH] output detailed multi-thread performance data only with\n \"timer full\"", "target": 0} {"commit_message": "[PATCH] Revert simd-avx.h changes from b606e3191\n\nThey didn't improve performance at all as far as I can tell,\nand they ended up breaking the PGI compiler.\n\nIt is always tempting to use the fancy addsub instructions in FFTW to\ndo complex multiplications, but the reality is that FFTW is designed\nto avoid complex multiplications in most cases (we started in the SSE\ndays), and thus they don't make any difference. We are better off\nusing the minimal possible set of AVX instructions to minimize the\nchance of triggering compiler bugs.\n\nThe same statement holds for _mm256_shuffle_pd() versus\n_mm256_permute_pd(): in theory the latter is better, in practice\neither one is rarely used. However, SHUFFLE is older (since the SSE\ndays) and has a higher chance of working.", "target": 0} {"commit_message": "[PATCH] Update testing matrices for coverage and speed\n\nMoved slow aspect of pre-submit matrix to nightly (icc with release\nmode and SIMD support).\n\nRemoved slow gcc-7 config adjusting similar builds to achieve its\nformer objectives.\n\nRemoved outdated TODOs, noted new ones\n\nAdded hwloc test specifier, and todo for hwloc 2\n\nAdded tng test logic, and a specfier for each non-default case. Fixed\nmissing return values for no-tng case, and clarified the docs.\n\nChange-Id: I340b9a64dc4e4958f260657d3d82480be62ef979", "target": 0} {"commit_message": "[PATCH] Initial support for SkylakeX / AVX512\n\nThis patch adds the basic infrastructure for adding the SkylakeX (Intel Skylake server)\ntarget. 
The SkylakeX target will use the AVX512 (AVX512VL level) instruction set,\nwhich brings 2 basic things:\n1) 512 bit wide SIMD (2x width of AVX2)\n2) 32 SIMD registers (2x the number on AVX2)\n\nThis initial patch only contains a trivial transofrmation of the Haswell SGEMM kernel\nto AVX512VL; more will follow later but this patch aims to get the infrastructure\nin place for this \"later\".\n\nFull performance tuning has not been done yet; with more registers and wider SIMD\nit's in theory possible to retune the kernels but even without that there's an\ninteresting enough performance increase (30-40% range) with just this change.", "target": 0} {"commit_message": "[PATCH] new version of forward and adjoint jacobian calculation for\n SXFunction, ticket #127, still need an efficient way of calculating seed\n matrices", "target": 0} {"commit_message": "[PATCH] add a warning if the bonded cutoff is large\n\nThis should print a warning when 2x the bonded interaction cutoff list larger then other cutoffs, as was the setting before the performance optimization with the change in https://github.com/lammps/lammps/pull/758/commits/269007540569589aa7c81d9ba1a4b93d34b8c95d", "target": 0} {"commit_message": "[PATCH] sgemm/dgemm: add a way for an arch kernel to specify prefered\n sizes\n\nThe current gemm threading code can make very unfortunate choices, for\nexample on my 10 core system a 1024x1024x1024 matrix multiply ends up\nchunking into blocks of 102... 
which is not a vector friendly size\nand performance ends up horrible.\n\nthis patch adds a helper define where an architecture can specify\na preference for size multiples.\nThis is different from existing defines that are minimum sizes and such.\n\nThe performance increase with this patch for the 1024x1024x1024 sgemm\nis 2.3x (!!)", "target": 0} {"commit_message": "[PATCH] Extend performance considerations on bonded offload\n\nRefs #2793\n\nChange-Id: I4a8ae8554cf2aad540eb4eb485898f8cabeb3966", "target": 0} {"commit_message": "[PATCH] More efficient set reassignment", "target": 1} {"commit_message": "[PATCH] made the code MUCH more efficient in various places", "target": 1} {"commit_message": "[PATCH] I tried several things but it is still slow", "target": 0} {"commit_message": "[PATCH] Duplicate and modify functions as they are more efficient", "target": 1} {"commit_message": "[PATCH] moved ocl kernel resources from stack to heap\n\ncl::Program and cl::Kernel objects were being created on global\nstack earlier. Even though that is efficient and worked fine on\nlinux and macosx platforms. It caused out of control runtime errors\non windows platforms at the exit of the application that is using\nthe library. This could be mostly a driver bug in OpenCL drivers\nfor windows, therefore this change might be reverted in future.", "target": 0} {"commit_message": "[PATCH] fix obvious performance bug in Isolate_1", "target": 1} {"commit_message": "[PATCH] Improved performance of computing sums with CustomIntegrator", "target": 1} {"commit_message": "[PATCH] Fix pme gather in double with AVX(2)_128\n\nThe 4NSIMD PME gather change did not change the conditional\nfor grid alignment. This is made consistent here.\nNote that the 4NSIMD change lowered the performance of PME gather\non AVX_128_FMA and AVX2_128 in double precision. 
We should consider\nusing 256-bit AVX for double precision instead.\n\nRefs #2326\n\nChange-Id: I07bfb3ca8d334bce18ed0b6989405bbc02c25b7b", "target": 0} {"commit_message": "[PATCH] Revert \"suppress performance warning concerning an assertion\"\n\nThis reverts commit 580b65d8a5937b99c6206885a6b0504399923501.\n\nThe warning was already fixed by commit\n63ae26eb0ffa8460abddf6843345996f9e4de603.", "target": 0} {"commit_message": "[PATCH] In the performance miniapps, print the SIMD width in terms of\n \"doubles\".", "target": 0} {"commit_message": "[PATCH] Ensure PME with OpenCL does not attempt to pin\n\nHost-only memory pinning was designed with CUDA in mind, while OpenCL\nrequires managing both host and device memory buffer for efficient\nmapping, which is not yet implemented.\n\nThis change teaches the PME module to understand what pinning policy\nis appropriate to the build configuration, so that the setup of data\nstructures in various parts of the code can use a pinning policy that\nalways works.\n\nRefs #2498\n\nChange-Id: I2a294aee460947cd3aad5e23869cead1b99fd610", "target": 0} {"commit_message": "[PATCH] for three star segments, return fast common point\n\nSigned-off-by: Panagiotis Cheilaris ", "target": 0} {"commit_message": "[PATCH] improve the ell performance", "target": 1} {"commit_message": "[PATCH] dcopy hack into util/diff & reversing performance degradation\n change", "target": 0} {"commit_message": "[PATCH] thermophysicalModels: Changed specie thermodynamics from mole\n to mass basis\n\nThe fundamental properties provided by the specie class hierarchy were\nmole-based, i.e. provide the properties per mole whereas the fundamental\nproperties provided by the liquidProperties and solidProperties classes are\nmass-based, i.e. per unit mass. This inconsistency made it impossible to\ninstantiate the thermodynamics packages (rhoThermo, psiThermo) used by the FV\ntransport solvers on liquidProperties. 
In order to combine VoF with film and/or\nLagrangian models it is essential that the physical propertied of the three\nrepresentations of the liquid are consistent which means that it is necessary to\ninstantiate the thermodynamics packages on liquidProperties. This requires\neither liquidProperties to be rewritten mole-based or the specie classes to be\nrewritten mass-based. Given that most of OpenFOAM solvers operate\nmass-based (solve for mass-fractions and provide mass-fractions to sub-models it\nis more consistent and efficient if the low-level thermodynamics is also\nmass-based.\n\nThis commit includes all of the changes necessary for all of the thermodynamics\nin OpenFOAM to operate mass-based and supports the instantiation of\nthermodynamics packages on liquidProperties.\n\nNote that most users, developers and contributors to OpenFOAM will not notice\nany difference in the operation of the code except that the confusing\n\n nMoles 1;\n\nentries in the thermophysicalProperties files are no longer needed or used and\nhave been removed in this commet. The only substantial change to the internals\nis that species thermodynamics are now \"mixed\" with mass rather than mole\nfractions. This is more convenient except for defining reaction equilibrium\nthermodynamics for which the molar rather than mass composition is usually know.\nThe consequence of this can be seen in the adiabaticFlameT, equilibriumCO and\nequilibriumFlameT utilities in which the species thermodynamics are\npre-multiplied by their molecular mass to effectively convert them to mole-basis\nto simplify the definition of the reaction equilibrium thermodynamics, e.g. 
in\nequilibriumCO\n\n // Reactants (mole-based)\n thermo FUEL(thermoData.subDict(fuelName)); FUEL *= FUEL.W();\n\n // Oxidant (mole-based)\n thermo O2(thermoData.subDict(\"O2\")); O2 *= O2.W();\n thermo N2(thermoData.subDict(\"N2\")); N2 *= N2.W();\n\n // Intermediates (mole-based)\n thermo H2(thermoData.subDict(\"H2\")); H2 *= H2.W();\n\n // Products (mole-based)\n thermo CO2(thermoData.subDict(\"CO2\")); CO2 *= CO2.W();\n thermo H2O(thermoData.subDict(\"H2O\")); H2O *= H2O.W();\n thermo CO(thermoData.subDict(\"CO\")); CO *= CO.W();\n\n // Product dissociation reactions\n\n thermo CO2BreakUp\n (\n CO2 == CO + 0.5*O2\n );\n\n thermo H2OBreakUp\n (\n H2O == H2 + 0.5*O2\n );\n\nPlease report any problems with this substantial but necessary rewrite of the\nthermodynamic at https://bugs.openfoam.org\n\nHenry G. Weller\nCFD Direct Ltd.", "target": 1} {"commit_message": "[PATCH] removed call to fast localization for the time being...EJB", "target": 0} {"commit_message": "[PATCH] Identity: special type derived from SphericalTensor to\n provide the concept of identity (I)\n\nAllows efficient operators to be defined for the interaction between\ntypes and the equivalent identity.", "target": 0} {"commit_message": "[PATCH] Use much smaller GroupLens dataset.\n\nThis helps keep the repository a bit smaller and should accelerate some tests.", "target": 0} {"commit_message": "[PATCH] Add evaluable_elements_begin()/end()\n\nI put these in MeshBase like all the other iterator ranges, for\nsimplicity and consistency, but we do still need a DofMap reference to\ninitialize them.\n\nIterating through these might be slow, but creating a correct mutable\ncache instead will definitely be a huge pain, so let's just get the\ncode correct for now and optimize later if we need to.", "target": 0} {"commit_message": "[PATCH] Add Edge4::volume().\n\nThis speeds up calling volume() on every element in a mesh with 4M\nEdge4's by about 450x. 
Though I'm not sure this was ever going to be a\nperformance concern, I still want to have a custom volume()\nimplementation for all the different Elem types for completeness.\n\nThe 4-pt quadrature is also reasonably accurate. The biggest\ndiscrepancies seem to be introduced by perturbing the interior nodes\nof the EDGE4. For example, perturbing an interior node of the\nreference element by 0.125 in the y (or z) direction leads to a\nrelative error of much less than 1%, compared to a 12th-order (7-pt)\napproximation of the volume.\n\nApproximate volume = 2.0412301724043291e+00\nVerification volume = 2.0410437292661414e+00\nRel err = 9.1346959163268581e-03%", "target": 0} {"commit_message": "[PATCH] Try to improve parallel performance of CBMC", "target": 1} {"commit_message": "[PATCH] More efficient generation of random numbers", "target": 1} {"commit_message": "[PATCH] refactor reading last line of potential file code to be more\n efficient", "target": 1} {"commit_message": "[PATCH] - Added C++ include guards to all installed .h files\n\n - Modified PCG:\n - usage is consistent with SMG\n - no HYPRE_*pcg.h files\n - no need to allocate extra work-space vectors\n - more efficient matvec (does not do setup everytime)\n - default preconditioner is identity, i.e. default PCG is CG\n\n - All HYPRE_* interface routines are now consistently HYPRE_Struct*", "target": 1} {"commit_message": "[PATCH] Implement user guide\n\nRenamed former user manual to reference manual.\n\nThe content for the new user guide has mostly migrated in from the\nwiki, install guide, and mdrun -h, and updated as appropriate. This\nguide is intended for documenting practical use, whereas the reference\nmanual should document algorithms and high-level implementations, etc.\n\nEstablished references.md to do automatic linking of frequently\nused things. 
This can be automatically concatenated by pandoc\nonto any Markdown file to do easy link generation.\n\nSection on mdrun and performance imported and enhanced from the\nAcceleration and Parallelization wiki page.\n\nAdded section on mdrun features, e.g. rerun and multi-simulation.\n\nSection on getting started imported from online/getting_started.html\nand updated - there used to be a tutorial here, but there isn't any\nmore. Linked to more up-to-date tutorials.\n\nAdded TNG to docs/manual/files.tex.\n\nRemoved gmx options, now that its content is in the user guide (in\ntools.md).\n\nMoved old mdp_opt.html to docs/mdp-options.html, for now. Removed from\nreference manual, left pointer to new location. This is not an ideal\nformat or location either, but it's a step closer to being able to\ngenerate it from the code. Some trivial fixes to content. Generating\nlinks and references to follow in a future commit.\n\nMoved environment-variable section from reference manual to user guide.\nMinor fixes here.\n\nRemoved superseded reference manual sections on running in parallel or\nwith GPUs. Renamed install.tex as technical-details.tex, because that\nis all that is left there. 
Moved section on use of floating-point\nprecision to chapter on definitions and units, and thus eliminated the\nformer Appendix A.\n\nCross-references from user-guide.pdf don't work well yet, but that\nshould be dealt with when we decide on the final publishing platform.\n\nSome TODO comments for documentation sections remain for work in\nfuture patches, but please note the other new content in existing\nchild patches, so we don't duplicate any work.\n\nChange-Id: I026d67353863ae069c6c45b840a61fcaf205a377", "target": 0} {"commit_message": "[PATCH] Modified some of the tests so that they run reasonably fast\n through Insure++.", "target": 0} {"commit_message": "[PATCH] Optimize atomic accumulation in CUDA NB kernel\n\nAs a result of this reorganization of the reduction, the final atomic\naccumulation of the three force components can happen on three threads\nconcurrently. As this can be serviced by the hardware in a single\ninstruction, the optimization improves overall performance by a few %.\nThis also results in fewer shuffle operations.\n\nChange-Id: I29519469b1e1848c026ee5b7a32256440031dbce", "target": 1} {"commit_message": "[PATCH] Small Matrix: skylakex: sgemm nn: add n6 to improve\n performance", "target": 1} {"commit_message": "[PATCH] Changed the way FAST handles different datatypes internally", "target": 0} {"commit_message": "[PATCH] r70387, r70573, r70574 from Mesh_3-experimental-GF\n\nAdd incident_cells_3(Vertex_handle, std::vector)\n\nThis function avoids the construction of two additional std::vectors.\nThe performance gain is between 30% (g++) and 50% (VC++)\nfor points on surfaces as well as for points filling space.\n\nWe at the same time change the implementation of the function\nincident_cells(Vertex_handle, OutputIterator).\nIn order to save one additional std::vector,\nthe cells are reported in bfs and not in dfs order", "target": 0} {"commit_message": "[PATCH] EA:increased MAXMEM size to 512 since it gives large\n performance improvement on 
big matrix multiplies", "target": 1} {"commit_message": "[PATCH] Restarted the hierarchical pca algorithm, now replaced with\n Vempala's fast SVD algorithm. Need to finish the subspace combining part...", "target": 0} {"commit_message": "[PATCH] Remove non-backported entry from NEWS\n\nWe decided against this one after all, due to risk of side effects\noutweighing the fixes to performance issues in some cases.", "target": 0} {"commit_message": "[PATCH] eliminate it extra assignement to increase performance", "target": 1} {"commit_message": "[PATCH] PetscMatrix::print_personal now prints to file when requested\n (rather than just cout). The implementation is not particularly efficient\n (since print_personal gets passed an ostream) but it does work. And how\n efficient do you need to be if you are printing out matrices anyway?\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@4244 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 0} {"commit_message": "[PATCH] Fast KDE is now using submodule for organizing parameters", "target": 0} {"commit_message": "[PATCH] Use Traits::Vector to accelerate 2D optimization", "target": 1} {"commit_message": "[PATCH] New edge_map(), not inverse_map(), in edge reinit\n\nThis should be much more efficient, and on top of that it seems to fix\nthe edge projection problems I was seeing on the Reactor Pressure Vessel\nIGA mesh.", "target": 1} {"commit_message": "[PATCH] Timer: Available in Kokkos namespace\n\nUpdated unit tests, performance tests, and examples using\nKokkos::Impl::Timer to use Kokkos::Timer", "target": 0} {"commit_message": "[PATCH] added docu for efficient MVJ and restart modefor MVJ", "target": 0} {"commit_message": "[PATCH] MKK: Tuned to optimize performance by setting an optimized\n chunk size", "target": 1} {"commit_message": "[PATCH] Fix code so that gradient is not wrapped into the box fast\n exit out of symmetry routines if C1", "target": 0} {"commit_message": "[PATCH] decreased hl_tol (level shift 
threshold) from 0.05 to 0.01\n introduce 2 new keywords: stable (same as default) fast (faster than default\n but less safe)", "target": 1} {"commit_message": "[PATCH] A few, perhaps not so useless and dangerous, changes to qmmm\n code 1) more efficient use of Bq module 2) enabling QMMM runs for property\n calculations", "target": 1} {"commit_message": "[PATCH] some more performance improvements from Krys. Mostly in\n gradients", "target": 1} {"commit_message": "[PATCH] Improved consistency of the ApplyPackedReflectors routines\n (as well as performance in several cases) and several more implementations,\n fixed mistakes in the build section of the documentation, added a short\n description of the new SVD function, and fixed mistakes in HouseholderSolve\n after adding a simple example driver.", "target": 1} {"commit_message": "[PATCH] Precalculate pbc shift for analysis nbsearch\n\nInstead of using pbc_dx_aiuc(), precalculate the PBC shift between grid\ncells outside the inner loop when doing grid searching for analysis\nneighborhood searching. 
In addition to improving the performance, this\nencapsulates another piece of code that needs to be changed to implement\nmore generic grids.\n\nChange-Id: Ifbbe54596f820b01572fe7bb97a5354556a4981d", "target": 1} {"commit_message": "[PATCH] Initial implementation of LP IPM-based real Basis Pursuit,\n fixing performance bugs in the real sequential KKT matrix construction for\n LPs and QPs, and adding DistSparseMatrix::GlobalRow", "target": 1} {"commit_message": "[PATCH] Add CUDA compiler support for CC 5.0\n\nWith CUDA 6.5 and later compute capability 5.0 devices are supported, so\nwe generate cubin and PTX for these too and remove PTX 3.5.\nThis change also removes explicit optimization for CC 2.1 where\nsm_20 binary code runs equally fast as sm_21.\n\nChange-Id: I5a277c235b873afb2d1b2b12b5db64b370f1bade", "target": 0} {"commit_message": "[PATCH] Replacing FT by RT, more efficient Sign_at", "target": 1} {"commit_message": "[PATCH] Added the CGAL::Multiset class (based on a red-black tree),\n which extends the std::multiset class and makes it more efficient in many\n cases.", "target": 1} {"commit_message": "[PATCH] fixed weak scaling performance which was hampered to do a\n commit on Apr 27 2010", "target": 1} {"commit_message": "[PATCH] Allow to pop the context menu with `Key_Menu`\n\nAs the item selection is rather slow, for the moment, that is a lot\nfaster than `Shift+Rightbutton`.", "target": 0} {"commit_message": "[PATCH] Edit for faster performance", "target": 1} {"commit_message": "[PATCH] Fixes MSVC warnings. 
Bugfix in gmx_density in single\n precision\n\nDisables some unhelpful MSVC warnings for:\n* forcing value to bool (C4800)\n performance warning: not an issue for our code\n* \"this\" in initializer list (C4355)\n level 4 warning (informational) - shouldn't be shown at level 3\n* deprecated (posix, secure) functions (C4996)\n won't be removed soon - so not helpful\n\nChange-Id: I7ea62f88f687f45e169244ed60025c7c7d42f237", "target": 0} {"commit_message": "[PATCH] Accelerate distance queries with a kd-tree", "target": 1} {"commit_message": "[PATCH] Cuda: Adding performance test for instance overlapping", "target": 0} {"commit_message": "[PATCH] More efficient GetNonzeros::evaluateGen (sensitivities)", "target": 1} {"commit_message": "[PATCH] Added shared memory carve out setting for Volta - improves\n dslash performance by ~5%", "target": 1} {"commit_message": "[PATCH] Adding Changelog for Release 3.2.00\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 3.2", "target": 0} {"commit_message": "[PATCH] reverted change to matrix product sparsity calculation: no\n one is interested in how fast you can calculate the wrong answer....", "target": 0} {"commit_message": "[PATCH] In coarse dslash, added parallelization of color-column\n multiplication by splitting up the warp into different regions of the\n column-wise multiplication across the space-time dimention. Presently 1-way,\n 2-way and 4-way warp splitting is implemented. Prior to writing out the\n result a warp-level reduction into the first segment is performed, which\n writes out the result. This adds additional levels of parallelsim that\n improves the performance on small lattices. The level of warp splitting is\n autotuned using the auxillary tuning dimension.", "target": 1} {"commit_message": "[PATCH] Optimizations to Compute[Yi/Zi/Bi], switching over to an\n AoSoA data layout on the GPU. CPU vs GPU code paths are now maximally\n divergent, will include some discussion of that in PR. 
Small performance\n tweaks in Compute[UiTot/FusedDeidrj].", "target": 0} {"commit_message": "[PATCH] WIP Fixed performance of multi-range subarray result\n estimation", "target": 1} {"commit_message": "[PATCH] Use fast PerfLog methods in PerfItem\n\nThis speeds up our LOG_SCOPE usage several fold.", "target": 0} {"commit_message": "[PATCH] This was the version of code used for the FastMKS benchmarks\n in the recently submitted paper, \"Dual-tree Fast Exact Max-Kernel Search\".", "target": 0} {"commit_message": "[PATCH] Accelerate hmm and gmm", "target": 0} {"commit_message": "[PATCH] ReplicateMesh should only read once\n\nReading once and broadcasting should be more efficient than hammering\nthe filesystem on every processor.", "target": 1} {"commit_message": "[PATCH] Add performance function test.", "target": 0} {"commit_message": "[PATCH] Some debug instrumentation and big performance improvements.", "target": 1} {"commit_message": "[PATCH] Enable some performance tests again", "target": 0} {"commit_message": "[PATCH] fvOptions: Changed to be a MeshObject to support automatic\n update for mesh changes\n\nNow cellSetOption correctly handles the update of the cell set following mesh\ntopology changes rather than every time any of the fvOption functions are\ncalled for moving meshes. 
This is more efficient and consistent with the rest\nof OpenFOAM and avoids a lot of unnecessary clutter in the log.", "target": 1} {"commit_message": "[PATCH] tutorials: Changed compressed ascii output to binary to\n improve IO performance\n\nalso rationalized the writeCompression specification", "target": 1} {"commit_message": "[PATCH] Copy team rendezvous implementation from master to address\n performance issue #936", "target": 1} {"commit_message": "[PATCH] fixed issue with slow insertion in the presence of dummy\n points", "target": 0} {"commit_message": "[PATCH] C++ math function cleanup\n\nmath/functions.h now implements a number of old and new math\nfunctions with either float, double, or integer arguments.\nManual SIMD versions of 1/sqrt have been tested with gcc and icc\non x86, Power8, Arm32 and Arm64, but with correct 'f' suffixes\non constants there is only 10-15% performance difference, so for\nnow we always use the system versions to avoid having this file\ndepend on config.h. Functions for third and sixth roots have\nbeen introduced to replace many of our pow() calls, and the code\nhas been cleaned up to use the new functions.\n\nRefs #1111.\n\nChange-Id: I74340987fff68bc70d268f07dbddf63eb706db32", "target": 0} {"commit_message": "[PATCH] Add check to remove zero Charmm dihedrals\n\nProper torsions where the force constant is zero\nin both A and B states are now removed. We also\ncheck for other angle, torsion, and restraint\nfunctional types, and if all parameters are zero\nfor these the interaction is not added. This will\nnot change any results, but increase performance\nslightly by not calculating unnecessary interactions.\nFixes #810.\n\nChange-Id: I37ecd06d0641008593edab29e5b08433bde7b6cc", "target": 1} {"commit_message": "[PATCH] Fix (ish) slow serachbox on windows.", "target": 1} {"commit_message": "[PATCH] Move fast LUT in CUDA backend to texture memory\n\ncuda::kernel::locate_features is the CUDA kernel that uses the fast\nlookup table. 
Shared below is performance of the kernel using constant\nmemory vs texture memory. There is neglible to no difference between two\nversions. Hence, shifted to texture memory LUT to reduce global constant\nmemory usage.\n\nPerformance using constant memory LUT\n-------------------------------------\n\nTime(%) Time Calls Avg Min Max Name\n1.48% 101.09us 3 33.696us 32.385us 34.976us void cuda::kernel::locate_features\n1.34% 91.713us 2 45.856us 45.792us 45.921us void cuda::kernel::locate_features\n1.02% 69.505us 2 34.752us 34.400us 35.105us void cuda::kernel::locate_features\n0.99% 67.456us 2 33.728us 32.768us 34.688us void cuda::kernel::locate_features\n0.95% 65.186us 2 32.593us 31.201us 33.985us void cuda::kernel::locate_features\n0.93% 63.874us 2 31.937us 30.817us 33.057us void cuda::kernel::locate_features\n\nPerformance using texture LUT\n-----------------------------\n\nTime(%) Time Calls Avg Min Max Name\n1.45% 99.776us 3 33.258us 32.896us 33.504us void cuda::kernel::locate_features\n1.33% 91.105us 2 45.552us 44.961us 46.144us void cuda::kernel::locate_features\n1.02% 70.017us 2 35.008us 34.273us 35.744us void cuda::kernel::locate_features\n0.97% 66.689us 2 33.344us 32.065us 34.624us void cuda::kernel::locate_features\n0.95% 65.249us 2 32.624us 31.585us 33.664us void cuda::kernel::locate_features\n0.95% 65.025us 2 32.512us 30.945us 34.080us void cuda::kernel::locate_features", "target": 1} {"commit_message": "[PATCH] Add checks for inefficient resource usage\n\nChecks have been added for using too many OpenMP threads and when\nusing GPUs for using single OpenMP thread. A fatal error is generated\nin case where we are quite sure performance is very sub-optimal. This\nis nasty, but a fatal error is the only way to ensure that users don't\nignore this warning. 
The fatal error can be circumvented by explicitly\nsetting -ntomp, in that case a note is printed to log and stderr.\n\nNow also avoids ranks counts with thread-MPI that don't fit with the\ntotal number of threads requested.\n\nWith a GPU without DD thread count limit is now doubled.\n\nDisabled GPU sharing with OpenCL.\n\nChange-Id: Ib2d892dbac3d5716246fbfdb2e8f246cdc169787", "target": 0} {"commit_message": "[PATCH] Cleanup of the performance test", "target": 0} {"commit_message": "[PATCH] new implemenation using boost CSR graph, it can be 1.5x\n faster from prev implementation but there is a performance problem that I\n couldn't solve using public functionality of graph (however there might be a\n solution) will look it back.", "target": 1} {"commit_message": "[PATCH] Updated article reference. Added configuration flags to\n support altivec on powerpc. Updated configuration files from gnu.org Removed\n the fast x86 truncation when double precision was used - it would just crash\n since the registers are 32bit only. Added powerpc altivec innerloop support\n to fnbf.c", "target": 0} {"commit_message": "[PATCH] Ordinary slow Ewald electrostatics implemented for md runs", "target": 0} {"commit_message": "[PATCH] Add fast kernel version number", "target": 0} {"commit_message": "[PATCH] Added free-energy kernel performance note\n\nChange-Id: Iea5d2b124633c4188753c7b1ebb6f964edb2f644", "target": 0} {"commit_message": "[PATCH] removed _infostream output (slow performance on massively\n parallel systems), documentation", "target": 0} {"commit_message": "[PATCH] Attempt to work-around old gcc bugs in a more efficient\n fashion that does not lose performance on newer gcc's. 
[empty commit message]", "target": 0} {"commit_message": "[PATCH] New version of dgemm_kernel_4x4_bulldozer.S The peak\n performance with 8 cores is now 90 GFlops", "target": 0} {"commit_message": "[PATCH] Fix RDTSCP handling\n\nCommit 13def2872ae5311d tried to make all builds default to using\nRDTSCP, which would have broken non-x86 builds. But it also\ndeactivated the implementation of RDTSCP support because HAVE_RDTSCP\nwas left undefined. So all it did was make timing on x86 less\nefficient (plus e.g. DLB effects from that).\n\nUsed GMX_RDTSCP everywhere. Only the GROMACS project depends on\nthread-MPI, so it's reasonable to let a GMX symbol leak in there (and\nit's easily fixed if ever needed).", "target": 0} {"commit_message": "[PATCH] Add FEMContext algebraic_type(), custom_solution()\n\nThe NONE and DOFS_ONLY options enable more efficient use of FEMContext\nin cases where it's only used as a container of FE objects.\n\nThe OLD option (with a properly parallelized custom_solution) enables\nthe use of FEMContext to assist in evaluating old solutions during\nprojections.", "target": 1} {"commit_message": "[PATCH] Initial volume() optimization for HEX27.\n\nThis initial, relatively straightforward optimization improved the\nperformance of Hex27::volume() by about 16x (1476.46s down to 92.09s\nfor 150^3 elements). This is good, but I still think it can be made\nfaster.", "target": 1} {"commit_message": "[PATCH] SYCL NBNXM offload support\n\nAssociated changes:\n\n- Added function stubs to PME: necessary for compilation.\n- Stricter SYCL hardware compatibility checks: limits on subgroup size\n and the availability of local memory.\n- The kernel implementation and overall logic closely follow the OpenCL\n implementation. 
Divergences are documented locally.\n\nLimitations:\n\n- No fine-grained timings yet.\n- Code-duplication with CUDA and OpenCL: see #2608.\n- Minor differences in local/nonlocal synchronization: see #3895,\n related to #2608.\n- Only the OpenCL backend was extensively tested. LevelZero works fine\n without MPI but stalls due to a known bug. The fix for DPCPP runtime\n is available, but not yet part of any OneAPI release:\n https://github.com/intel/llvm/pull/3045.\n- The complex/position-restraints regression test fails: see #3846.\n- No performance tuning: see #3847.\n\nPerformance on rnase-cubic system is similar to OpenCL implementation.", "target": 0} {"commit_message": "[PATCH] Disable DD again in serial with GPU or without PME\n\nPerformance is better without DD in most of those cases.\n\nRefs #4195, #4198, #4171", "target": 0} {"commit_message": "[PATCH] performance improvement bugfix", "target": 1} {"commit_message": "[PATCH] Pass Uncertain and T (enum or bool) by value instead of by\n reference. It is generally accepted that it is more efficient for small\n classes like this (and it's definitely shorter and more readable).", "target": 1} {"commit_message": "[PATCH] Workaround for very slow compilation on Windows", "target": 0} {"commit_message": "[PATCH] The Lazy Kernel dont compile the assertion and the more\n efficient bbox() function. (Because Interval_nt<> dont have gamma(),\n is_rational(), etc...)", "target": 0} {"commit_message": "[PATCH] Remove unnecessary ICC flags affecting performance\n\nSince we add -msse/../-mavx based on the acceleration we shouldn't\nadd -mtune=core2 anymore. 
Especially because it is added later\nand takes precedence over the (higher) acceleration flag.\n\n-ip and -funroll-all-loops could also be deleted because they don't\nseem to give any significant performance improvement, and might\nincrease compilation time, but they don't hurt gromacs performance.\n\nIn theory it could help to use -xavx instead of -mavx but I can't\nmeasure a difference.\n\nChange-Id: Icd11c40c3cd3ef2ae6ef42f07d5d75c228593f51", "target": 1} {"commit_message": "[PATCH] Use fast pool allocator", "target": 0} {"commit_message": "[PATCH] Changing switchpoint of kernel of local HerkLN update to be\n twice the blocksize times the grid width. This preserves the local gemm\n performance at the kernel.", "target": 0} {"commit_message": "[PATCH] Kokkos: ViewAssignment: fix critical performance bug for\n unmanaged\n\nTurns out the unmanaged views were not actually unmanaged. The trait\nwas not actually used in the determination of tracking in case of\nassigning a managed view to an unmanaged. The only way to get a truly\nunmanaged view was to start with an unmanaged view wrapping a pointer.", "target": 1} {"commit_message": "[PATCH] Adding: 1. two reading functions. (one with a default\n scanner, and one taking it as a parameter). 2. Copy constructor.\n\nRewriting the assignment operator to be efficient as the copy constructor.", "target": 0} {"commit_message": "[PATCH] Preserve an old partitioning in copy_nodes_and_elements -\n should be more efficient and more reliable.", "target": 1} {"commit_message": "[PATCH] Switch to a RWLock for S3 Class multipart upload\n\nA new RWLock class is introduced into the common folder. @joe-maley\nprovided this excellent class. 
We then use this RWLock in the S3 VFS\nclass to handle the multipart uploads and to manage the locks more\ngranular to remove locking and contention for multiple concurrent upload\noperations.", "target": 0} {"commit_message": "[PATCH] Clean up and use fast pool allocator (faster)", "target": 1} {"commit_message": "[PATCH] Replacing dynamic_cast with libmesh_cast where appropriate -\n depending on the error checking replaced this will either lead to slightly\n more efficient NDEBUG runs or slightly more run-time checking in debug mode\n runs.", "target": 1} {"commit_message": "[PATCH] Optimize consolidated fragment metadata loading\n\nThis is part one of two changes for optimizing consolidated fragment\nmetadata loading. This changes removes using VFS to always fetch the\nmetadata file size. Instead we select the file size only if not reading\nfrom a consolidated file buffer. We also adjust the `fragment_size`\nfunction, used by consolidation, to fetch the fragment metadata file\nsize on request. The fetching of the size on-demand for consolidation\nwill yield the same performance degredation we see in open array,\nhowever this is acceptable for patch one and will be addressed in the\nnext series with a format change.", "target": 1} {"commit_message": "[PATCH] Turn assert into error in PBCs::neighbor()\n\nThis should fix #2958\n\nI don't like putting opt mode tests into anything called on a single\nelement; that's just asking to bloat kernel runtimes ... but in this\ncase, when we don't actually have broken PBC ids/displacements, we never\nhit the test on a replicated or serialized mesh, and we never hit the\ntest when calling PBCs::neighbor() on a local element, and at least\nwithin the library itself it looks like we're only calling\nPBCs::neighbor from local elements. 
So this change ought to be safe for\nperformance after all.", "target": 0} {"commit_message": "[PATCH] USER-DPD: specialize PairTableRXKokkos's compute_all_items()\n on NEWTON_PAIR No noticable performance change, but it does eliminate a deep\n conditional.", "target": 0} {"commit_message": "[PATCH] performance update....EJB", "target": 0} {"commit_message": "[PATCH] Core: fix another issue with the memory tracking, introduced\n recently\n\nThere was a bug introduced when solving the performance issues with\nsubviews. It seemed to not have trickered any errors in either Trilinos\ntests nor kokkos tests, but did crash Nalu.", "target": 0} {"commit_message": "[PATCH] 2d convolve performance improvements\n\nchanged the shared memory loading access pattern in 2d convolve\nkernel for cuda and opencl backends", "target": 1} {"commit_message": "[PATCH] Containers: DynRankView Performance Test was using to much\n memory.\n\nThe test was using close to 9GB in memory, with parallel testing this\nlead to out of memory failures.", "target": 0} {"commit_message": "[PATCH] whitespace cleanup, fix bug in looking for empty strings,\n improve read performance and handling of comments", "target": 1} {"commit_message": "[PATCH] performance updates....EJB", "target": 0} {"commit_message": "[PATCH] adding requested size threshold to improve performance of\n small size requests, removed bug check", "target": 1} {"commit_message": "[PATCH] Added full double precision support to the hisq-force\n routines. 
Replaced the macro used to write the hisq fermion force to global\n memory with a much more efficient device function.", "target": 1} {"commit_message": "[PATCH] flag two more subroutines can trigger the variable tracking\n message and slow down compilation", "target": 0} {"commit_message": "[PATCH] Performance increase for charge-implicit ReaxFF/changed\n cutoff selection", "target": 1} {"commit_message": "[PATCH] added performance in ps/hour (CPU and Real)", "target": 0} {"commit_message": "[PATCH] Performance improvements and fixed bug with memory budget and\n multi-range subarrays (#1601)", "target": 1} {"commit_message": "[PATCH] Make mallocs uniformly an error in PetscMatrix\n\nThis adds the setting to the other two init methods that\nif a new malloc occurs during MatSetValues then it is an\nerror. I think this is an improvement because it establishes\nuniformity across our init methods and it can prevent users\nfrom running extremely slow simulations.\n\nI fully expect this to cause failures in MOOSE...", "target": 0} {"commit_message": "[PATCH] Reinstate fast copy methods", "target": 0} {"commit_message": "[PATCH] Don't use _Atomic for jobs sometimes...\n\nThe use of _Atomic leads to really bad code generation in the compiler\n(on x86, you get 2 \"mfence\" memory barriers around each access with gcc8, despite\nx86 being ordered and cache coherent). 
But there's a fallback in the code that\njust uses volatile which is more than plenty in practice.\n\nIf we're nervous about cross thread synchronization for these variables, we should\nmake the YIELD function be a compiler/memory barrier instead.\n\nperformance before (after last commit)\n\n Matrix SGEMM cycles MPC DGEMM cycles MPC\n 48 x 48 10630.0 10.6 0.7% 18112.8 6.2 -0.7%\n 64 x 64 20374.8 13.0 1.9% 40487.0 6.5 0.4%\n 65 x 65 141955.2 1.9 -428.3% 146708.8 1.9 -179.2%\n 80 x 80 178921.1 2.9 -369.6% 186032.7 2.8 -156.6%\n 96 x 96 205436.2 4.3 -233.4% 224513.1 3.9 -97.0%\n 112 x 112 244408.2 5.8 -162.7% 262158.7 5.4 -47.1%\n 128 x 128 321334.5 6.5 -141.3% 333829.0 6.3 -29.2%\n\nPerformance with this patch (roughly a 2x improvement):\n\n Matrix SGEMM cycles MPC DGEMM cycles MPC\n 48 x 48 10756.0 10.5 -0.5% 18296.7 6.1 -1.7%\n 64 x 64 20490.0 12.9 1.4% 40615.0 6.5 0.0%\n 65 x 65 83528.3 3.3 -210.9% 96319.0 2.9 -83.3%\n 80 x 80 101453.5 5.1 -166.3% 128021.7 4.0 -76.6%\n 96 x 96 149795.1 5.9 -143.1% 168059.4 5.3 -47.4%\n 112 x 112 191481.2 7.3 -105.8% 204165.0 6.9 -14.6%\n 128 x 128 265019.2 7.9 -99.0% 272006.4 7.7 -5.3%", "target": 1} {"commit_message": "[PATCH] Made Triangulation_data_structure_2::create_face more\n efficient", "target": 1} {"commit_message": "[PATCH] Use efficient intersection traits, Used the kernel\n appropriately, Added a constructor", "target": 1} {"commit_message": "[PATCH] Implemented basic serial 2D fast diagonalization", "target": 0} {"commit_message": "[PATCH] fixed several bugs in the load balance performance prints", "target": 1} {"commit_message": "[PATCH] Performance fix - Stop setting the color of each edge to\n black, which could be quite long for a big item, and set the attribute value\n of the shader to black instead.", "target": 1} {"commit_message": "[PATCH] Performance improvement differentiating between sparse and\n dense arrays when computing sparse results (#1605)", "target": 1} {"commit_message": "[PATCH] After merging 
pmatrix-dev, at least one thing got broken:\n ParMesh::Print for AMR meshes. Since in pmatrix-dev, slave faces are no\n longer considered shared (their P rows are not needed by the processor owning\n the master face), they are also not printed when visualizing the parallel\n solution. I suspect this also might have broken NC face neighbors. This\n branch contains a temporary solution, a downgrade of\n ParNCMesh::AddMasterSlaveRanks, so that it works the old way: slave faces are\n grouped with the masters. This fixes visualization and maybe other things,\n but may negatively impact performance of (or even break) the P matrix\n construction. I need to look more into this to find a permanent solution.", "target": 0} {"commit_message": "[PATCH] Added configure flag to disable FFTW measurements, to enable\n binary reproducible runs. Note that this typically WILL deteriorate\n performance, so it is usually better to run the optimized versions and use\n the -reprod flag to mdrun when you need binary identity. 
However, if you\n compile FFTW3 with SSE support (which is NOT the default) the selected\n kernels seems to be close-to-optimal even without measurements, and then you\n can use this option to always get binary reproducible runs.", "target": 0} {"commit_message": "[PATCH] More efficient constructor for Rep", "target": 1} {"commit_message": "[PATCH] PERF: Improve performance for sort_by_key\n\n- Has added benefit of cutting build times by half", "target": 1} {"commit_message": "[PATCH] Fix performance for range copy", "target": 0} {"commit_message": "[PATCH] Revert \"Use correct number of atoms in GPU Update kernels\"\n\nThis reverts commit 4d9a6d110b614a299a5bab4120c87a04f9ac14c1 (MR !2523).\n\nIt was supposed to be a trivial fix, but things turned out to be more\ncomplicated, and my testing was insufficient (#4401).\n\nThe fix is still needed, but the bug is not causing incorrect physics,\nmerely harmless sanitizer errors and slightly higher resource\nconsumption by kernels that are very fast regardless.\n\nI suggest doing it in master or delaying till the patch release.\nRight now is not the best time to fix such issues.\n\nRefs #4398", "target": 0} {"commit_message": "[PATCH] Matrix: Replace the row-start pointer array with computed\n offsets\n\nThe row-start pointer array provided performance benefits on old\ncomputers but now that computation is often cache-miss limited the\nbenefit of avoiding a integer multiply is more than offset by the\naddition memory access into a separately allocated array.\n\nWith the new addressing scheme LUsolve is 15% faster.", "target": 1} {"commit_message": "[PATCH] ATW: Fast sparsity-optimized sigma AB and more scalable sigma\n AA using dgop", "target": 0} {"commit_message": "[PATCH] Build a default DiffSolver at init() not at construction, to\n be more efficient when the user wants to create a DiffSolver themselves\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1513 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 
1} {"commit_message": "[PATCH] performance fixes in fftconvolve kernels", "target": 1} {"commit_message": "[PATCH] trivial change for DofMap::dof_indices to increase\n performance when there are no element-based DOFs\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@1127 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 0} {"commit_message": "[PATCH] Changes to internal memory manager\n\n- Manager now contains list of locked and free buffers separately\n- Should improve performance when allocationg new buffers\n- Added proper documentation", "target": 1} {"commit_message": "[PATCH] make ::localize() more efficient, still need to handle\n ::localize_to_one()\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@2242 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 1} {"commit_message": "[PATCH] Add framework for extensible ArrayFire memory managers\n (#2461)\n\nMany different use cases require performance across many different memory\nallocation patterns. Even different devices/backends have different costs\nassociated with memory allocations/manipulations. Having the flexibility to\nimplement different memory management schemes can help optimize performance for\nthe use case and backend.\n\nThis commit adds the ability to replace the default memory manager to a\nspecialized user defined memory manager. This commit also exposes the events API\nto the user which allows you to synchronize tasks between two streams. The\nevents API will be disabled in a future commit but it can be used in the future\nonce we add support for streams.\n\nArrayFire will use the user defined memory manager whenever it allocates or\nfrees memory. The memory manager is exposed using the C API using function\npointers. 
The memory manager handle is created by the user during initialization\nand the user sets several function pointers to define the behavior.\n\nThe default memory manager behavior has not changed with this commit.", "target": 1} {"commit_message": "[PATCH] Split up travis 'full' target; it became too slow", "target": 0} {"commit_message": "[PATCH] s390x/GEMM: replace 0-init with peeled first iteration\n\n... since it gains another ~2% of SGEMM and DGEMM performance on z15;\nalso, the code just called for that cleanup.\n\nSigned-off-by: Marius Hillenbrand ", "target": 0} {"commit_message": "[PATCH] Fixed essential dynamics / flooding group PBC serial\n\nIn former versions, the PBC representation of essential dynamics /\nflooding group atoms could be incorrect in serial runs if the ED group\ncontained more than a single molecule. In multi-molecule cases, the required\nsteps to choose the correct PBC image in communicate_group_positions()\ntherefore need to be performed also in serial runs. Since the PBC representation\ncan only change in neigborsearching steps, we only need to check the\nshifts then. In parallel, NS is signalled by the bUpdateShifts\nvariable, which is set in dd_make_local_ed_indices(). The latter\nfunction is however not called in serial runs; but still we can pass\nthe bNS boolean to do_flood() to signal the NS status. For essential\ndynamics, unfortunately, since do_edsam() is called from constrain(), there\nis no information about the NS status at that point. 
Until someone\ncomes up with a better idea, we therefore do the PBC check in every step\nin serial essential dynamics - the performance impact will be negligible\nanyway.\n\nChange-Id: I86336a5e34131bdeac7e28f35b1ccb633450e54e", "target": 0} {"commit_message": "[PATCH] added a switch -Ssw which can lead to better performance\n during AMG setup", "target": 1} {"commit_message": "[PATCH] AdResS: lost performance tweak\n\nChange-Id: I164bc6a60f62d117fef83844da74c3707455a980", "target": 0} {"commit_message": "[PATCH] GPU+DD performance improvements and code clean-up", "target": 1} {"commit_message": "[PATCH] take out assert to avoid performance issue", "target": 1} {"commit_message": "[PATCH] LJ combination rule kernels for OpenCL\n\nThe current implementation enables combination rules for both AMD and\nNVIDIA OpenCL (also ports the changes to the \"nowarp\" test/CPU kernel).\n\nLike in the CUDA implementation, all kernels support it, but only for\nplain cut-off are combination rules used.\n\nNotes:\n- On AMD tested on Hawaii, Fiji, Spectre and Oland devices;\n combination rules in all cases improve performance, although combined\n with the i-prefetching, the improvement is typically only ~10%.\n- On NVIDIA tested on Kepler and Maxwell; in most cases the combination\n rule kernels are fastest.\n However, with certain inputs these kernels are 25% slower on Maxwell\n (e.g. pure water box, cut-off LJ, pot shift), but not on Kepler.\n This is likely a compiler mis-optimization, so we'll just leave the\n defaults the same as AMD.\n\nChange-Id: I05396e000cdf93c1d872729e6b477192af152495", "target": 1} {"commit_message": "[PATCH] s390x: Use new sgemm kernel also for DGEMM and DTRMM on Z14\n\nApply our new GEMM kernel implementation, written in C with vector intrinsics,\nalso for DGEMM and DTRMM on Z14 and newer (i.e., architectures with FP32 SIMD\ninstructions). 
As a result, we gain around 10% in performance on z15, in\naddition to improving maintainability.\n\nSigned-off-by: Marius Hillenbrand ", "target": 1} {"commit_message": "[PATCH] Revert \"Updated boost compute version tags\"\n\nThis reverts commit b6d8e2d85c358f9d2be69bd09fb8b64bda9fcc04.\n\nThis commit was causing performance drops on certain hardware", "target": 0} {"commit_message": "[PATCH] Optimize get_nz by avoiding integer division\n\nInteger divisions are very costly, and typically take ~20-26 clocks\ncycles for 32-bit integers, and ~85-100 clock cycles for 64-bit\nintegers. The default size of indices was changed to 64-bit in commit\na0c6de6c6, causing a significant performance hit in one of the get_nz\nmethods.\n\nThis commit rewrites said get_nz method to avoid divisions, and instead\nuse additions. These have a typical latency of 1 cycle, and a throughput\nof 0.25-0.33 cycles, regardless of argument size.", "target": 1} {"commit_message": "[PATCH] Optimize the performance of sum by using universal intrinsics", "target": 1} {"commit_message": "[PATCH] Fixed FAST on Mac OS X", "target": 0} {"commit_message": "[PATCH] Added support for manual load balancing with a -load option\n for grompp. It takes the relative performance of each of the processors in\n your system in arbitrary units, and normalizes it.", "target": 0} {"commit_message": "[PATCH] Code for measuring performance presented in user manual and\n associated data", "target": 0} {"commit_message": "[PATCH] Add an accurate but slow way to generate a sparse point set", "target": 0} {"commit_message": "[PATCH] reworked the project_vector to be more efficient for Lagrange\n elements. Changed the corresponding calls in reinit(). 
amr.cc now tests the\n projection stuff.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@279 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 1} {"commit_message": "[PATCH] changed DofMap::build_constraint_matrix to be more efficient\n in the (usual) case that the element has no constraints. Also fixed for the\n case that an element has constraints in terms of its *own* dofs, (not others)", "target": 1} {"commit_message": "[PATCH] Jeff: same as 6.0 branch. should not include ARMCI build\n system headers. support for T3D and T3E, especially something which is only\n for performance, not function, is silly. s/TARGET/NWCHEM_TARGET/ for the\n RTDB case. i don't see any reason for it in the first place but at least now\n the build system is cleaner.", "target": 0} {"commit_message": "[PATCH] EA: VCALLS turned off for ELAN4 because of performance and\n stability", "target": 0} {"commit_message": "[PATCH] Optimize cf_vs parallel performance", "target": 1} {"commit_message": "[PATCH] threading test performance updates", "target": 0} {"commit_message": "[PATCH] HvD: Adding a first pass of the 2nd and 3rd derivatives\n generated with the Maxima symbolic algebra program. Starting the code\n generation on Sunday at 10:41 AM and finishing on Wednesday at 00:11 AM it\n took approximately 62 hours to generate this code (for most part using two\n cores on my machine). The main offender was the TPSS correlation functional\n which took well over 12 hours. Part of this was due to Maxima exhausting all\n 4 GB of physical memory in my desktop machine and the subsequent paging\n reduced the performance significantly.\n\nNevertheless the results look good right now (still need to test the higher\norder derivative which could significantly impact this assessment). In\nparticular the code for the M06 correlation functional was originally 44 MB\nin size. This was the result of simply generating the relevant derivative\nexpressions and expressing them as Fortran code. 
Now that the autoxc script\nhas been enhanced to optimize the expressions before generating Fortran the\nsame routines only amount to 470 KB (a reduction of almost a factor 100 (98 to\nbe precise)).\n\nIt is obvious that optimizing these expressions does take time. However, the\ngood thing is that with this result the compiler does not need to take that time\nwhen you try to build NWChem. I am sure that anyone compiling NWChem will be\ngrateful for not having to wait 62 hours.", "target": 1} {"commit_message": "[PATCH] performance analysis, use of multipoles for long range bits,\n lots of additional screening", "target": 0} {"commit_message": "[PATCH] Improved performance traces", "target": 1} {"commit_message": "[PATCH] Improvements to multi-GPU performance", "target": 1} {"commit_message": "[PATCH] Restructured nonbonded calculation to allow more efficient\n vectorization", "target": 1} {"commit_message": "[PATCH] Improve performance of PFMG", "target": 1} {"commit_message": "[PATCH] Don't print invalid performance data\n\nIf mdrun finished before a scheduled reset of the timing information\n(e.g. from mdrun -resetstep or mdrun -resethway), then misleading\ntiming information should not be reported.\n\nFixes #2041\n\nChange-Id: I4bd4383c924a342c01e9a3f06b521da128f96a35", "target": 0} {"commit_message": "[PATCH] WIP fast single-element neighbor calculation.", "target": 0} {"commit_message": "[PATCH] Rename GPU launch/wait cycle counters\n\nIn preparation for the PME GPU task and GPU launch overhead to be\ncounted together in the same counter for all GPU tasks, the current main\ncounters have been renamed to be more general. 
The label of GPU waits in\nthe performance table have also been renamed to reflect the task name.\nAdditionally a non-bonded specific sub-counter is been added.\n\nChange-Id: I65a15b0090c1ccebb300cf425c7b3be4100e17a0", "target": 0} {"commit_message": "[PATCH] More efficient GetNonzeros::evaluateGen", "target": 1} {"commit_message": "[PATCH] #1009 Matrix return type for det, getMinor and cofactor Could\n result in a performance loss, but the implementation is not efficient anyway.\n No deprecation since users (and unit tests) have relied on automatic\n typecasting to Matrix.", "target": 0} {"commit_message": "[PATCH] fix a few performance drop in some matrix size per data type\n\nSigned-off-by: Wang,Long ", "target": 0} {"commit_message": "[PATCH] Accelerate KSInitialization tests by using fewer CV folds and\n fewer training epochs as well as relaxed tolerances.", "target": 0} {"commit_message": "[PATCH] Add --skip-partitioning command line option\n\nThis makes testing tweaks to that option easier, and might be useful\nlater when doing performance testing.", "target": 0} {"commit_message": "[PATCH] Use the simplified performance function.", "target": 0} {"commit_message": "[PATCH] Replacing Mish and Derivative of Mish by Fast Mish and\n Derivative of Fast Mish resp.", "target": 0} {"commit_message": "[PATCH] rbOOmit change: If a CouplingMatrix is attached to an\n RBConstruction then we should only increment the designated blocks in matrix\n assembly, otherwise matrix assembly is extremely slow since we go outside the\n sparsity pattern.", "target": 0} {"commit_message": "[PATCH] Add implementation of the mean squared error performance\n function.", "target": 0} {"commit_message": "[PATCH] Apply suggestions from code review\n\nEverything following are changes to the rcm implementation.\nRemove unneccessary consts from rcm declaration.\nDo not store adjacency matrix and degrees as members in the class.\nReplace explicit gko::vector with vector.\nreplace std::memcpy 
with std::copy_n.\nVarious documentation improvments.\nReverse description order of return paramters.\nMake IndexType explicit instead of auto.\nUse std::min_elelemt\nFix various spelling errors, typos, rewordings.\nRemoves the 'explicit' from the ExectorAllocators rebinding construtor.\nChanges occurences of array to vector, to save memory.\nMinor cleanup\nRemove the unnecessary test for the rcm adjacency matrix.\nAdd a test case for the correct rcm result.\nRewrite the assert_correct_permutation function to comparing with iota.\nMove test matrices to test class.\nImprove nested vector initialization.\nAllocate degrees inside rcm::generate.\nChange from goto to immediately evaluated lambda expression.\nRefactor loop body into an inline helper function.\nRelace some autos.\nReorganize includes to conform to include order.\nReplace autos with IndexType where necessary.\nReplace size_type with IndexType for num_vertices.\nRefactor for the number of levels to be more explicit.\nMake perm signal value -1.\nFactor out sort_by_degree in a generalized, fast small sorting function.\nMake level_processed a constexpr.\nIn the small_sort make some types explicit.\nWrap omp locks in a omp_mutex, use RAII guards.\n\nCo-authored-by: Tobias Ribizel ", "target": 0} {"commit_message": "[PATCH] Modifications in sweepline algorithm for is_simple: - Made\n replacing edge by another edge more efficient (using insert with hint). -\n altered the output during debugging somewhat.", "target": 1} {"commit_message": "[PATCH] performance improvement through avoiding function call and\n dereference overhead\n\n- make i_to_potl() and ij_to_potl() functions inline and const\n- don't dereference inside the functions, but cache, if possible in external variables\n=> up to 15% speedup.", "target": 1} {"commit_message": "[PATCH] Added several HemmAccumulate routines so that Hegst could\n avoid cache-unfriendly SumScatter routines. 
The warning messages for calling\n these potentially slow redistribution routines were also toned down.", "target": 0} {"commit_message": "[PATCH] use CUDA texture objects when supported\n\nCUDA texture objects are more efficient than texture references, their\nuse reduces the kernel launch overhead by up to 20%. The kernel\nperformance is not affected.\n\nChange-Id: Ifa7c148eb2eea8e33ed0b2f1d8ef092d59ba768e", "target": 1} {"commit_message": "[PATCH] Layout on the bibliography page\nMIME-Version: 1.0\nContent-Type: text/plain; charset=UTF-8\nContent-Transfer-Encoding: 8bit\n\nWhen having a bit a long citation description, the description runs, in the HTML output on the bibliography page, into 3 or more lines where the 3rd and following lines continue underneath the citation number like:\n```\n [1] Eric Berberich, Arno Eigenwillig, Michael Hemmer, Susan Hert, Lutz Kettner, Kurt Mehlhorn, Joachim Reichel, Susanne Schmitt, Elmar Sch\u00f6mer, and Nicola Wolpert. Exacus: Efficient and exact\n algorithms for curves and surfaces. In Gerth S. Brodal and Stefano Leonardi, editors, 13th Annual European Symposium on Algorithms (ESA 2005), volume 3669 of Lecture Notes in Computer Science,\npages 155\u2013166, Palma de Mallorca, Spain, October 2005. European Association for Theoretical Computer Science (EATCS), Springer.\n```\n\nThe example was found in e.g. 
https://doc.cgal.org/latest/Algebraic_foundations/citelist.html\n\n- corrected the \"overflow\"\n- made the citation number right aligned", "target": 0} {"commit_message": "[PATCH] Improve the performance of zasum and casum with AVX512\n intrinsic", "target": 1} {"commit_message": "[PATCH] MDRange: Performance test (3D)\n\nRuns test over multiple ranges (3D), adjusting the tile dims by powers of 2.\nCompare results of MDRange to RangePolicy implemented as Collapse<2> and\nCollapse<3>", "target": 0} {"commit_message": "[PATCH] performance updates??...EJB", "target": 0} {"commit_message": "[PATCH] Reverting optimizations that hurt performance on some\n compilers\n\ngit-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@15551 f3b2605a-c512-4ea7-a41b-209d697bcdaa", "target": 0} {"commit_message": "[PATCH] BUGFIX Fixed memory leak in image io, performance\n improvements", "target": 1} {"commit_message": "[PATCH] JJU: performance test for put get and acc", "target": 0} {"commit_message": "[PATCH] performance update...EJB", "target": 0} {"commit_message": "[PATCH] USER-DPD Kokkos: Remove the SSA's ALLOW_NON_DETERMINISTIC_DPD\n option. There was no measurable performance benefit to turning it on.", "target": 0} {"commit_message": "[PATCH] Adds sfence and lfence options if user builds with assembly\n to enable more efficient use of out of order engines", "target": 1} {"commit_message": "[PATCH] Add non-virtual Elem::vertex_average() and update\n Elem::centroid()\n\nWhen the Elem has an elevated p_level, this will affect the Order of\nthe FE that gets reinit()'d for computing the centroid. Rather than do\nany non-const hacking of the p_level value, we just work around this\nissue by making a copy of the Elem with non-elevated p_level and\nreturn its centroid instead.\n\nThis obviously introduces an additional performance hit, but truly\noptimized code should not be calling the base class Elem::centroid()\nimplementation to begin with. 
This approach also has the benefit of\nnot requiring const_cast and avoiding potential thread safety issues.", "target": 1} {"commit_message": "[PATCH] Replacing dynamic_cast with libmesh_cast where appropriate -\n depending on the error checking replaced this will either lead to slightly\n more efficient NDEBUG runs or slightly more run-time checking in debug mode\n runs.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@4246 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 1} {"commit_message": "[PATCH] When need to realloc, double the size. This is more\n efficient if the ultimate size is very large.", "target": 1} {"commit_message": "[PATCH] Preserve an old partitioning in copy_nodes_and_elements -\n should be more efficient and more reliable.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@3388 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 1} {"commit_message": "[PATCH] Make the use of GpuEventSynchronizer in SYCL conformant with\n CUDA/OpenCL\n\nMR !1035 refactored the use of GpuEventSynchronizer in CUDA and OpenCL\nto make merging the code paths easier. Here, we update SYCL to the same\nstandard.\n\nNote, that it introduces additional synchronization between local and\nnon-local queues. It is present in CUDA and OpenCL, but was implicit in\nSYCL. To simplify code, it is added here. If it turns out to be\ndetrimental to performance, it can be (conditionally) NOPed.\n\nRefs #2608, #3895.", "target": 0} {"commit_message": "[PATCH] Optimize Hex8::volume() slightly.\n\nIf my calculations are correct, the \"geometric\" formula I used\npreviously had about 12 dot products and 12 cross products, while the\ncurrent one only has 4 of each. 
This formula is derived by writing out the\nstandard volume formula and dropping terms which are zero due to the\nsymmetry of the integrand and/or triple products containing two copies\nof the same vector.\n\nWhile the original geometric formula was already pretty fast, this one\nis about 1.7x faster (about 0.1697s vs 0.2866s to call volume() on\n3.375M Hex8 elements).", "target": 1} {"commit_message": "[PATCH] Deprecate version of BoundaryInfo::boundary_ids(const Node*)\n that returns a vector.\n\nAdd new version that must be passed a std::set. The new version\nshould be more efficient for making repeated calls to boundary_ids(),\nsince the container does not need to be created and destroyed\nrepeatedly...", "target": 1} {"commit_message": "[PATCH] Improve GPU performance especially without electrostatics", "target": 1} {"commit_message": "[PATCH] Add new constructor to Iso_rectangle_2(Point_2, Point_2,\n int). The additional dummy \"int\" specifies that the 2 points are the\n lower-left and upper-right corner. 
This is more efficient when one knows\n they are already in this configuration.\n\nSame thing for Iso_cuboid_3, and the functors.\n\nUse them in Cartesian_converter and Homogeneous_converter.", "target": 1} {"commit_message": "[PATCH] Workaround for Visual Studio bug that causes very slow\n compilation", "target": 0} {"commit_message": "[PATCH] Use new style with make_array(), more compact and efficient", "target": 1} {"commit_message": "[PATCH] bond/react: performance improvement", "target": 1} {"commit_message": "[PATCH] Fixes to internal functions\n\n- Was using incorrect number of elements for the total\n- Fixed copy because right now isOwner() does not mean isLinear()\n - Potentially improves performance when isLinear() is not isOwner()", "target": 1} {"commit_message": "[PATCH] trivial change for DofMap::dof_indices to increase\n performance when there are no element-based DOFs", "target": 0} {"commit_message": "[PATCH] Enabling a fast pass option for reorder", "target": 0} {"commit_message": "[PATCH] Build a default DiffSolver at init() not at construction, to\n be more efficient when the user wants to create a DiffSolver themselves", "target": 1} {"commit_message": "[PATCH] Used EAF for all IO on T and cleaned up IO operations so that\n reads/writes on T happen in oV blocks and all reads of T go thru one routine\n (mp2_read_tiajb). Put in performance stats for the basic steps of the\n gradient. Use screening of the density in the non-seperable part of the\n gradient. Reduced the tol2e from 10^-12 to 10^-9/10 but it really needs to\n be an input parameter.", "target": 0} {"commit_message": "[PATCH] Specialized dotInterpolate for the efficient calculation of\n flux fields\n\ne.g. 
(fvc::interpolate(HbyA) & mesh.Sf()) -> fvc::flux(HbyA)\n\nThis removes the need to create an intermediate face-vector field when\ncomputing fluxes which is more efficient, reduces the peak storage and\nimproved cache coherency in addition to providing a simpler and cleaner\nAPI.", "target": 1} {"commit_message": "[PATCH] First cut at the performance statistics library", "target": 0} {"commit_message": "[PATCH] Fix performance noexcept move constructor", "target": 0} {"commit_message": "[PATCH] Use MPI_IN_PLACE in minloc/maxloc\n\nWe've already broken MPI-2 compatibility; in for a penny, in for a\npound.\n\nIn addition to the infintestimal performance gain, this quells a gcc\n8.1 -Wmaybe-initialized false positive.", "target": 0} {"commit_message": "[PATCH] Performance improvements and fixed bug with memory budget and\n multi-range subarrays", "target": 1} {"commit_message": "[PATCH] Fix for calculate_dphiref-only\n\nI doubt anyone's ever triggered this, and it shouldn't have been more\nthan a performance issue if they had, but now that we're deprecating the\nold fallback it'll become important.", "target": 0} {"commit_message": "[PATCH] Improve performance of Python integrator (NVE_Opt version)\n\nRemoving the loop over atoms by using NumPy array indexing allows to recover\nperformance close to that of plain fix nve.", "target": 1} {"commit_message": "[PATCH] turned simplewater on again and added performance fix for\n 7.30 compilers", "target": 1} {"commit_message": "[PATCH] Fixed race condition in GPU restrictor. 
Improved performance\n of both prolongator and restrictor using compile-time evaluated fine-spin ->\n coarse-spin mapper instead of an array.", "target": 1} {"commit_message": "[PATCH] use consistent constants from math_const.h and fast integer\n powers from math_special", "target": 0} {"commit_message": "[PATCH] Use inFastDrawing instead of quick_camera and provide direct\n access to fast drawing state", "target": 0} {"commit_message": "[PATCH] reformatted the flops and performance output", "target": 0} {"commit_message": "[PATCH] Simplify the uniform-refinement mesh methods.\n\nIn the classes Mesh and ParMesh:\n\n* Small optimization in Mixed3DUniformRefinement for hex-only meshes.\n* In Mixed3DUniformRefinement, use marker array instead of std::map.\n* Rename the methods Mixed{2D,3D}UniformRefinement to\n UniformRefinement{2D,3D}.\n* Remove the methods {Quad,Hex,Wedge}UniformRefinement and use\n UniformRefinement{2D,3D} instead. In terms of performance, the\n difference was negligible.", "target": 0} {"commit_message": "[PATCH] Revert \"move update of the status outside of the constructor\"\n\nThis reverts commit 6378a51191df7cb28a24dadc1706112c0c7df926.\n\nThe commit was incorrect and was introducing a huge performance issue", "target": 0} {"commit_message": "[PATCH] fixed slow reading of large xvg files with LAM MPI", "target": 1} {"commit_message": "[PATCH] Further work on fast diagonalization", "target": 0} {"commit_message": "[PATCH] Use efficient intersection traits, Used kernel as template\n parameters", "target": 1} {"commit_message": "[PATCH] Improved performance of interaction groups on CPU", "target": 1} {"commit_message": "[PATCH] Improve performance of pair_reaxc, this change is safe\n because the non-bonded i-loop doesn't include ghost atoms; this optimization\n is already included in the USER-OMP version", "target": 1} {"commit_message": "[PATCH] - locate() cleanups, performance impact unnoticeable.", "target": 0} {"commit_message": "[PATCH] Fixed 
performance regression on Kepler", "target": 1} {"commit_message": "[PATCH] CUDA: change heuristic for BlockSize to prefer 128 threads\n\nSome experiments deomnstrated that for certain kernels the\ncurrent heuristic isn't great. In particular copy and memset\nkernels were bad.\n\nUsing the updated stream benchmark I got before this change:\n\nSet 327316.30 MB/s\nCopy 654344.27 MB/s\nScale 654263.20 MB/s\nAdd 846497.84 MB/s\nTriad 844604.40 MB/s\n\nWith this change:\n\nSet 652713.29 MB/s\nCopy 807649.65 MB/s\nScale 808014.29 MB/s\nAdd 847403.47 MB/s\nTriad 845885.63 MB/s\n\nExaminidMD also improved from 2.48e+08 to 2.82e+08:\n\n1 256000 | 0.906401 0.480328 0.142917 0.165107 0.117937 | 1103.264687 2.824358e+08 2.824358e+08 PERFORMANCE\n\n1 256000 | 1.030611 0.501819 0.243033 0.163163 0.122484 | 970.297956 2.483963e+08 2.483963e+08 PERFORMANCE", "target": 1} {"commit_message": "[PATCH] Template free-energy kernel on differing coul/vdw soft-core\n\nThe power function used for the soft-core potential is expensive.\nWhen using the same lambda and alpha parameters for Coulomb and VdW,\nwe can skip one power core, giving a 10% performance improvement.\n\nChange-Id: I8733838c6c32ef2b6fee5a6fb97657679f9bd3b3", "target": 1} {"commit_message": "[PATCH] This test was too slow in Debug mode", "target": 0} {"commit_message": "[PATCH] Added a little extra information (which inner loop) to the\n printout of performance info.", "target": 0} {"commit_message": "[PATCH] As std::fabs is slow on Windows, we switch to an\n implementation using sse2.\n\nThis version is already in CGAL, but it is protected with an #ifdef\nSo this commit consists of a #define for VC++", "target": 0} {"commit_message": "[PATCH] Changed innerloop optimization options, and made SSE/3dnow\n loops default together with fast truncation on linux.", "target": 0} {"commit_message": "[PATCH] changed DofMap::build_constraint_matrix to be more efficient\n in the (usual) case that the element has no constraints. 
Also fixed for the\n case that an element has constraints in terms of its *own* dofs, (not others)\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@870 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 1} {"commit_message": "[PATCH] Restore evGW performance", "target": 0} {"commit_message": "[PATCH] performance improved", "target": 1} {"commit_message": "[PATCH] Issue #724 Gradient of the Lagrangian function is always\n generated for NLP solvers Might slow down initialization a bit, but probably\n not much", "target": 0} {"commit_message": "[PATCH] Improved OpenCL SIFT coalescing and performance", "target": 1} {"commit_message": "[PATCH] travis: try to make the slow tests pass again", "target": 0} {"commit_message": "[PATCH] Specialized the new estimate_errors() version for more\n efficient use in UniformRefinementEstimator", "target": 1} {"commit_message": "[PATCH] Separate Windows ci(gh-action) workflow and some improvs\n\nSplitting the windows ci job into a separate workflow enables the\nci to re-run windows specific jobs independent of unix jobs.\n\nUpdated Ninja dependency to 1.10.2 fix release in all ci(gh-actions)\n\nRefactored boost dependency to be installed via packages managers as\nGitHub Actions is removing pre-installed versions from March 8, 2021\n\nUpdate VCPKG hash to newer version to enable fast and better ports.", "target": 0} {"commit_message": "[PATCH] make ::localize() more efficient, still need to handle\n ::localize_to_one()", "target": 1} {"commit_message": "[PATCH] SOLVE, MATMUL and INVERSE now use af_mat_prop\n\naf_mat_prop values can be used for performance improvements\nby calling specialized routines", "target": 1} {"commit_message": "[PATCH] First experimental prototype of Efficient Ransac (written by\n Yannick)\n\nPlane detection only.\nSome work to do to make it CGAL-conforming.", "target": 0} {"commit_message": "[PATCH] Fix memory alloc for fast opencl", "target": 0} {"commit_message": "[PATCH] Switching to 
Simple_cartesian gives better performance\"", "target": 1} {"commit_message": "[PATCH] fixed reallocation of nsbox bounding boxes with SSE\n\nThe reallocation of the bounding box array was not aligned and not initialized.\nThis probably did not give incorrect results, but could give a performance hit.", "target": 1} {"commit_message": "[PATCH] Removed Reaction-Field-nec\n\nThe RF no exclusion correction option was only introduced for\nbackward compatibility and a performance advantage for systems\nwith only rigid molecules (e.g. water). For all other systems\nthe forces are incorrect. The Verlet scheme did not support this\noption and if it would, it wouldn't even improve performance.\n\nChange-Id: Ic22ccf76d50b5bb7951fcac2293621b5eef285c5", "target": 1} {"commit_message": "[PATCH] Make sure multiprocessor performance for single vectors is\n unaffected by the multivector capability.", "target": 0} {"commit_message": "[PATCH] small performance improvement for nbnxn SSE kernels", "target": 1} {"commit_message": "[PATCH] Experimental new template class Set. * This class is similar\n to Jakub's new HashTable class from the 'ncmesh-mem-opt-dev' branch, but\n more generic. * It is a container for unique elements of any type T. *\n Supports fast insertion, removal, and searching of elements. * Each element\n is assigned an index (int) upon insertion. * Indices of removed elements are\n reused when inserting new elements. * Elements are stored in a random access\n container that supports fast insertion at the end, like mfem::Array or\n std::vector. * Such container classes require a simple adaptor class to be\n used with class Set. * The indices assigned to elements are indices into\n the container object, which stores \"nodes\", a struct with two fields one of\n type T, and second of type int (index of next element in a bin). 
* The\n entries of the Set are separated into bins using a generic hash function.\n Bins are represented as linked lists where instead of pointers, the\n link-\"nodes\" use int indices.\n\nThis class can be useful in various contexts:\n* Creating a local enumeration for a set of processor ranks, e.g.\n enumerating the processor neighbors in class GroupTopology.\n* Creating off-diagonal column maps for HypreParMatrix, given a set of\n global column indices.\n* Enumeration of edges and faces as Sets of (sorted) pairs or 3-tuples,\n similar to HashTable - which can be build on top of Set as well.", "target": 0} {"commit_message": "[PATCH] Removed init_state\n\nMade a simple zero-initializing constructor for t_state and the\nstructs of some of its members. Called them classes. Later, we might\nprefer to require explicit initialization with actual values, so that\ntools can detect the use of uninitialized values and find our bugs,\nbut for now having a constructor is a useful initial step in that\ndirection.\n\nExtracted some new functions that cover some of the incidental\nfunctionality that was also present in init_state.\n\nMade state.lambda a std::array, thereby removing the need to consider\nresizing it, and converted client code to be passed an ArrayRef rather\nthan hard-code the name of the specific container. This caters for\nconvenient future refactoring of the underlying storage, and sometimes\nneeding to implicitly know what the size of the container is.\n\nPassing an ArrayRef by value is consistent with the CppCoreGuidelines,\nbut has potential for performance impact. Doing this means that a\ncaller pushes onto the stack a copy of the object (containing two\npointers), rather than previous idioms such as pointer + size, or\npointer + implicit constant size from an enum, or pointer + implicit\nsize in some other parameter. 
This could mean an extra argument is\npushed to the stack for the function call, compared with the\nalternatives of pushing a pointer to data, pointer to container, or\npointer to ArrayRef. In all cases, the caller has to load the pointer\nvalue via an offset that is known to the compiler, so that aspect is\nprobably irrelevant. So, we would probably prefer to avoid calling\nfunctions that take such parameters in a tight loop, or where multiple\ncontainers share a common size. But the uses in this patch seem to be\nof sufficiently high level to be an acceptable trade of possible\nperformance for improved maintainability.\n\nChange-Id: I17e7d83cfc89566f76fa9949c425b950ad6aef62", "target": 0} {"commit_message": "[PATCH] Improve performance of SSAMGRelax", "target": 1} {"commit_message": "[PATCH] improve skylakex paralleled sgemm performance", "target": 1} {"commit_message": "[PATCH] Further improvements to multi-GPU performance", "target": 1} {"commit_message": "[PATCH] Add implementation of the sum squared error performance\n function.", "target": 0} {"commit_message": "[PATCH] This commit introduces VariableGroups as an optimization when\n there are repeated variables of the same type inside a system. Presently,\n these are only activated through the system.add_variables() API, but in the\n future there may be provisions for automatically identifying groups.\n\nThe memory usage for DofObjects now scales like\nN_sys+N_var_group_per_sys instead of N_sys+N_vars. The DofMap\ndistribution code has been refactored to use VariableGroups.\n\nAll existing loops over Variables within a system will work unchanged,\nbut can be replaced with more efficient loops over VariableGroups.", "target": 1} {"commit_message": "[PATCH] 128-bit AVX2 SIMD for AMD Ryzen\n\nWhile Ryzen supports 256-bit AVX2, the internal units are organized\nto execute either a single 256-bit instruction or two 128-bit SIMD\ninstruction per cycle. 
Since most of our kernels are slightly\nless efficient for wider SIMD, this improves performance by roughly\n10%.\n\nChange-Id: Ie601b1dbe13d70334cdf9284e236ad9132951ec9", "target": 1} {"commit_message": "[PATCH] Adjust the performance function test; use the simplified\n performance function.", "target": 0} {"commit_message": "[PATCH] Moving quantities around in ExactSolution::_compute_error\n\nThis is in preparation for updating to support mixed-dimension\nmeshes. Although this is a bit more inefficient, the code is simpler\nand I thnk that since this is usually used for debugging/regression\npurposes, simpler code would be preferred over higher performance\ncode.", "target": 1} {"commit_message": "[PATCH] Add Fast LSTM layer implementation.", "target": 0} {"commit_message": "[PATCH] make skylakex sgemm code more friendly for readers\n\nBTW some kernels were adjusted to improve performance", "target": 1} {"commit_message": "[PATCH] Enforce memory alignment to improve performance of vector\n operations. Also fixed bugs in an earlier optimization.", "target": 1} {"commit_message": "[PATCH] #1044 disable some slow unittests by default", "target": 0} {"commit_message": "[PATCH] Restore locking optimizations for OpenMP case\n\nrestore another accidentally dropped part of #1468 that was missed in #2004 to address performance regression reported in #1461", "target": 0} {"commit_message": "[PATCH] More efficient gain calculation in SQPMethod, #551", "target": 1} {"commit_message": "[PATCH] avoid name clash with fast directory", "target": 0} {"commit_message": "[PATCH] Made some modifications that hopefully improve the\n performance of the non-blocking code. Test functionality is still sketchy.", "target": 1} {"commit_message": "[PATCH] HvD: Initial Maxima generated code for the various\n functionals. At present only the energy expressions and the 1st order\n derivatives are implemented. This is sufficient to test whether the energy\n expressions are correct. 
Next the 2nd and 3rd order derivatives will be\n generated but optimizing the expressions to generate fast Fortran will take a\n while.", "target": 0} {"commit_message": "[PATCH] IA64: efc with Optimiz break moints2x and moints6x", "target": 0} {"commit_message": "[PATCH] dgecop is a matrix transpose routine. since ESSL has this\n routine in it, i added a link to that. also, Qingda wrote a fast transpose\n routine that can be used if one has the OSU source code.\n\nby default, nothing changes. the faster transposes must be activated manually with ESSL_TRANSPOSE or OSU_TRANSPOSE.", "target": 0} {"commit_message": "[PATCH] VT: Added a few lines about performance tuning", "target": 0} {"commit_message": "[PATCH] constrainPressure: Updated to use the more efficient\n patch-based MRF::relative function", "target": 1} {"commit_message": "[PATCH] HvD: It turns out that the performance improvements of a\n number of the optimizations don't carry over to other platforms. So I am\n removing most of them again. 
The ones that are here to stay are: the\n USE_FORTRAN2008 flag to pick Fortran 2008 intrinsic functions up for POPCNT,\n LEADZ, TRAILZ and ERF where available; the nwxc_dble_powix function to\n exploit that exponentiation with an integer power is much faster than\n exponentiation with a double precision power of the same value.", "target": 1} {"commit_message": "[PATCH] Multi-phase solvers: Improved handling of inflow/outflow BCs\n in MULES\n\nAvoids slight phase-fraction unboundedness at entertainment BCs and improved\nrobustness.\n\nAdditionally the phase-fractions in the multi-phase (rather than two-phase)\nsolvers are adjusted to avoid the slow growth of inconsistency (\"drift\") caused\nby solving for all of the phase-fractions rather than deriving one from the\nothers.", "target": 0} {"commit_message": "[PATCH] add more efficient getppn for BGQ and MPI-3\n\nBGQ has a system call for PPN etc.\nMPI-3 has a routine to get a node communicator and the size of this communicator is the number of PPN.\n\nI have efficient implementations for Cray, but since Cray has supported MPI-3 for over a year, there is no need.\n\nThe MPI-2 implementation could be optimized but it should not be bottleneck and is unlikely to be used except\nby users that insist on using old MPI implementations, since all relevant platforms support MPI-3 or are BGQ.", "target": 1} {"commit_message": "[PATCH] Some performance problem. I am looking into it.", "target": 0} {"commit_message": "[PATCH] Transitioning and cleaning up toward a more efficient load\n balancer.", "target": 1} {"commit_message": "[PATCH] undo slow dgemm/skylake microoptimization\n\nthe compare is more costly than the work", "target": 0} {"commit_message": "[PATCH] Convert nbnxn_pairlist_set_t to class PairlistSet\n\nThis change is only refactoring.\nTwo implementation details have changed:\nThe CPU and GPU pairlist objects are now straight lists instead\nof array of pointer to lists. 
This means that the pairlist objects are\nno longer allocated on their respecitive thtreads. But since the lists\nhave buffers to avoid cache polution and the actual i- and j-lists are\nllocated thread local, this should not affect performance\nignificantly.\nThe free-energy lists are now only allocated with perturbed atoms.\n\nChange-Id: Ifc76608215518edfc61c0ca8eb71ea2a928cf57c", "target": 0} {"commit_message": "[PATCH] implemented efficient sparsity detection for SXFunction,\n ticket #126", "target": 1} {"commit_message": "[PATCH] Use reference to improve performance in pair_reaxc_kokkos", "target": 1} {"commit_message": "[PATCH] Change TopologyInformation implementation\n\nThis changes TopologyInformation so that we will be able to use it\nalso in the legacy tools, providing a better migration path for them,\nas well as making progress to removing t_topology and a lot of calls\nto legacy file-reading functions.\n\nIt can now lazily build and cache atom and expanded topology data\nstructures, re-using the gmx_localtop_t type (intended for use by the\nDD code for domain-local topologies). The atoms data structure can\nalso be explicitly copied out, so that tools who need to modify it can\ndo so without necessarily incurring a performance penalty. All these\nare convenient for tools to use.\n\nThe atom coordinate arrays are now maintained as std::vector, which\nmight want a getter overload to make rvec * for the legacy tools.\n\nAdded tests that the reading behaviour for various kinds of inputs is\nas expected. Converted lysozyme.gro to pdb, added a 'B' chain ID,\ngenerated a .top (which needed an HG for CYS) so updated pdb and gro\naccordingly. 
Some sasa test refdata needed fixing for that minor\nchange.\n\nProvided a convenience overload of gmx_rmpbc_init that takes\nTopologyInformation as input, as this will frequently be used by\ntools.\n\nExtended getTopologyConf to also return velocities, which will\nbe needed by some tools.\n\nAdapted the trajectoryanalysis modules to use the new approach, which\nis well covered by tests.\n\nRefs #1862\n\nChange-Id: I2f43e62bc2d97f5e654f15c6e474b9b71d7106ec", "target": 0} {"commit_message": "[PATCH] Use .p2align instead of .align for portability\n\nThe OSX assembler apparently mishandles the argument to decimal .align, leading to a significant loss of performance\nas observed in #730, #901 and most recently #1470", "target": 0} {"commit_message": "[PATCH] performance improvement through moving inlinable functions to\n header file", "target": 1} {"commit_message": "[PATCH] Update VSX SIMD to avoid inline assembly\n\nThanks to some help from Michael Gschwind of\nIBM, this removes the remaining inline assembly\ncalls and replace the with vector functions. This\navoid interfering with the optimizer both on GCC\nand XLC, and gets us another 3-10% of performance\nwhen using VSX SIMD. 
Tested with GCC-4.9, XLC-13.1\nin single and double on little-endian power 8.", "target": 1} {"commit_message": "[PATCH] Tree serialization is more efficient now.", "target": 1} {"commit_message": "[PATCH] Fixed performance bug.", "target": 1} {"commit_message": "[PATCH] Added to PME on GPU information in mdrun performance\n\nAdded -pme to glossary\n\nMoved and modified a previous mdrun with GPU example to follow\na more logical progression - first introduce the simpler use cases\nof gputasks and show examples, then at the end show how to avoid a\ngraphics-dedicated GPU\n\nAdded more GPU task assignment examples\n\nChange-Id: I63304a511d5d98d85fdbb1cea497627a80a14418", "target": 0} {"commit_message": "[PATCH] In the performance miniapps, print the MFEM SIMD width.\n\nIn the miniapps/performance makefile, print the auto-detected compiler\nand if that fails, the print the output used for auto-dection.", "target": 0} {"commit_message": "[PATCH] suppress performance warning concerning an assertion", "target": 0} {"commit_message": "[PATCH] Refactor nbnxn exclusion setting\n\nConsolidate common parts of the simple and GPU exclusion mask\ngeneration code. 
Made variable names more descriptive.\nNo functionality and performance changes, except that the direct\nj-cluster lookup now also works when the first j-cluster does not\nequal the i-cluster.\n\nChange-Id: I3ef6344ae2796e649ae30bf5ff0668a4548c011f", "target": 0} {"commit_message": "[PATCH] Send Points rather than individual coordinates\n\nThis should fix a bug with LIBMESH_DIM!=3 and DistributedMesh, and it\nshould be more efficient in some cases, and it should be easier to\nrefactor.", "target": 1} {"commit_message": "[PATCH] more efficient way to extract submatrices, remove some\n unnecessary members, better naming of functions and variables", "target": 1} {"commit_message": "[PATCH] Draw points with line width 0, otherwise it is too slow when\n we have 1mio points", "target": 0} {"commit_message": "[PATCH] PERF Batching + Blocks images in rotate and transform\n\n* Batching all images in single block set is slow at high blocks\n* Divide batches of images into sets and launch blocks for that", "target": 0} {"commit_message": "[PATCH] Improve performance of GEMM for small matrices when SMP is\n defined.\n\nAlways checking num_cpu_avail() regardless of whether threading will actually\nbe used adds noticeable overhead for small matrices. 
Most other uses of\nnum_cpu_avail() do so only if threading will be used, so do the same here.", "target": 1} {"commit_message": "[PATCH] Fix performance regression with gcc-3.3", "target": 1} {"commit_message": "[PATCH] - Some clean-up - Function get_number_of_bad_elements in the\n mesher levels (for debugging) - option\n CGAL_MESH_3_ADD_OUTSIDE_POINTS_ON_A_FAR_SPHERE to reduce contention on the\n infinite vertex", "target": 0} {"commit_message": "[PATCH] The Viewer declares if the current drawing is a fast draw or\n not.", "target": 0} {"commit_message": "[PATCH] Removed cudaMemset from FAST", "target": 0} {"commit_message": "[PATCH] POWER10: Improving dasum performance\n\nUnrolling a loop in dasum micro code to help in improving\nPOWER10 performance.", "target": 1} {"commit_message": "[PATCH] Use of WallClockTimer to measure performance", "target": 0} {"commit_message": "[PATCH] Added hacks for SIMD rvec/load store in lincs & bondeds\n\nWe have added proper gather/scatter operations to work on\nrvecs for all SIMD architectures, but that will not make it into\nGromacs-5.1. Since Berk already wrote a few routines to use\nmaskloads at least for AVX & AVX2, this is a bit of a hack to\nget the performance benefits of that code already in Gromacs-5.1\n(for AVX/AVX2), without altering the SIMD module. This is definitely\na hack, and the code will be replaced once the extended SIMD\nmodule is in place.\n\nChange-Id: I385acb5f989b2ecf463948be84947fe1f6dfd19b", "target": 0} {"commit_message": "[PATCH] added (inner and outer) relaxation parameters to Gauss-Seidel\n routines, also added a backward solve procedure. This required an additional\n parameter for hypre_BoomerAMGRelax. 
Complete list of choices for smoothers\n are now: relax_type = 0 -> Jacobi or CF-Jacobi relax_type = 1 -> Gauss-Seidel\n <--- very slow, sequential relax_type = 2 -> Gauss_Seidel: interior points in\n parallel, boundary sequential relax_type = 3 -> hybrid: SOR-J mix\n off-processor, SOR on-processor with outer relaxation parameters (forward\n solve) relax_type = 4 -> hybrid: SOR-J mix off-processor, SOR on-processor \n with outer relaxation parameters (backward solve) relax_type = 5 -> hybrid:\n GS-J mix off-processor, chaotic GS on-node relax_type = 6 -> hybrid: SSOR-J\n mix off-processor, SSOR on-processor with outer relaxation parameters\n relax_type = 9 -> Direct Solve", "target": 0} {"commit_message": "[PATCH] Correct CUDA kernel energy flag\n\nThe CUDA kernels calculated energies based on the GMX_FORCE_VIRIAL\nflag. This did not cause errors, since (currently) GMX_FORCE_ENERGY\nis always set when the virial flag is set. But using the latter flag\ngives a small performance improvement when using pressure coupling.\n\nChange-Id: If874e651058dc06c464f0fa810b17ba83146c9a3", "target": 1} {"commit_message": "[PATCH] Used sparse identity for more efficient memory utilization", "target": 1} {"commit_message": "[PATCH] EA: test for ga_dgemm performance (N=400 1600 3200)", "target": 0} {"commit_message": "[PATCH] Modified nb_accv to improve performance of accumulates to\n processors on the same SMP node.", "target": 1} {"commit_message": "[PATCH] Cuda: Enabling SHFL based reduction for static\n value_type>128bit\n\nThis improves performance significantly when reducing structs, since\nthe shared memory footprint is massively reduced. 
On smaller reduction\ntypes it is still beneficial to go through SHMEM.", "target": 1} {"commit_message": "[PATCH] Fix performance penalty in bspline derivatives", "target": 0} {"commit_message": "[PATCH] Create even less contention\n\nAll the readers are waiting on our writer thread to mark our condition\n(`_array_is_present`) as ready before they can even attempt to read, so\nthere is no data race for our writer thread. Hence we can remove the\nlock on the mutex the readers are using. And it's much better this way\nbecause one could imagine that our writer hits the `std::unique_lock`\nfirst which would then prevent our reader threads from even getting to\nthe condition variable `wait`, which is the logical place we want them\nto get to while the writer thread is doing its job.\n\nAnd the second change is unlocking as soon as we're through waiting\nbecause we are through the read-write portion of the program and are\nonly reading so it's safe to let everyone through at once", "target": 0} {"commit_message": "[PATCH] Extensive commenting and instructions on how to run fast kde\n is completed", "target": 0} {"commit_message": "[PATCH] replace std::set with\n std::array\n\nfor facets vertices\nthis should be a lot more efficient", "target": 1} {"commit_message": "[PATCH] PetscMatrix::print_personal now prints to file when requested\n (rather than just cout). The implementation is not particularly efficient\n (since print_personal gets passed an ostream) but it does work. 
And how\n efficient do you need to be if you are printing out matrices anyway?", "target": 0} {"commit_message": "[PATCH] DynaIO option to keep or drop spline nodes\n\nThis should let us easily build meshes that retain the exact geometry of\nan IsoGeometric Analysis mesh but that don't have any topologically\ndisconnected NodeElem elements (so they should scale better with our\nexisting partitioner code) and don't have constraint equations (so they\nmight be more efficient with our existing solvers and should be\ncompatible with our reduced_basis code).", "target": 1} {"commit_message": "[PATCH] Completed the base fast SVD case for which the number of\n points is less than the dimension...", "target": 0} {"commit_message": "[PATCH] Write element truth table to exodus to improve performance\n\nThe Exodus API allows for an element truth table to optionally be\nsent to the Exodus library before any element data is written. The\ntruth table simply tells which variables exist on which blocks.\nSending this truth table to Exodus allows for memory to be allocated\nin advance, making for much more efficient writing of data,\nespecially if there is a large number of element blocks.\n\ngit-svn-id: file:///Users/petejw/Documents/libmesh_svn_bak@5358 434f946d-2f3d-0410-ba4c-cb9f52fb0dbf", "target": 1} {"commit_message": "[PATCH] Improve performance of random number generator calls", "target": 1} {"commit_message": "[PATCH] fast QA for LINUX", "target": 0} {"commit_message": "[PATCH] reverted commit 76d5bddd5c3dfdef76beaab8222231624eb75e89.\n Split ga_acc in moints2x_trf2K in smaller ga_acc on MPI-PR since gives large\n performance improvement on NERSC Cori", "target": 1} {"commit_message": "[PATCH] Add Quad8::volume().\n\nThis speeds up a test code which calls elem->volume() on each\ndistorted QUAD9 in a 4M element mesh by approximately 650x (from about\n206s down to 0.3146s).\n\nAs far as the accuracy of the four-point quadrature rule goes, it\nappears to be quite good up to 
about 15% distorted elements (in the\ncontext of MeshTools::Modification::distort()). More distortion than\nthis is probably not really usable for finite element analysis,\nanyway. In the table below, max_rel_diff is computed using a\ntwelfth-order quadrature rule (49 points) in the reference volume\ncomputation.\n\ndistortion max_rel_diff\n0.05 3.0166222190035792e-15 (<1%)\n0.1 2.9337828206309171e-15 (<1%)\n0.15 1.9220118208802948e-02 (2%)\n0.2 1.4316734561689504e-01 (14%)\n0.3 3.6570789827049105e-01 (36.5%)\n\nIt would also be possible to implement an \"early return\" branch for\nthis function, but you would have to check the value of 8 different\n3-vectors against zero so I didn't think it was worth the extra\ncomplexity for the minimal performance improvement.", "target": 1} {"commit_message": "[PATCH] More fine grained ParmetisPartitioner logging\n\nThis is ludicrously slow outside opt mode", "target": 0} {"commit_message": "[PATCH] Fixing issues from @rcurtin's review + adding unit test for\n cross entropy performance function", "target": 0} {"commit_message": "[PATCH] checking in change for 1-1-48 pathway, a more efficient free\n energy path. See Pham and Shirts,\n http://jcp.aip.org/resource/1/jcpsa6/v135/i3/p034114_s1", "target": 1} {"commit_message": "[PATCH] Performance bug", "target": 1} {"commit_message": "[PATCH] removed a lot of slow FFT grid sizes from calc_grid and made\n calcgrid.c thread safe", "target": 0} {"commit_message": "[PATCH] fast point location should work now", "target": 0} {"commit_message": "[PATCH] Loosen default tolerance: accelerate convergence.\n\nFor larger optimizations the default termination value of 1e-10 may be way too small.\nSo 1e-5 is generally better, but the user can always change it themselves...", "target": 0} {"commit_message": "[PATCH] Add Elem::loose_bounding_box()\n\nThe default implementation is what we were using previously to\ncalculate bounding boxes for PointLocatorTree. 
For higher order\nelements we add fast approximations that are strict in the case of\nlinear geometry but that should still be bounds in the case of higher\norder geometry.", "target": 0} {"commit_message": "[PATCH] Update tests for changed APIs.\n\nRandom forests generally work better, but it is not guaranteed for the vc2\ndataset, so I am still requiring only that it gets 90% of the decision tree\nperformance in the worst case. I expect it will generally be better, but there\nare still situations where it may not be (because of the randomness).", "target": 0} {"commit_message": "[PATCH] Comment out the do_intersect tests\n\nThey caused a performance problem when used with the tweaked AABB_traits\nof Surface_mesh_segmentation.", "target": 1} {"commit_message": "[PATCH] Clean up documentation slightly for the case of slow runs.", "target": 0} {"commit_message": "[PATCH] Deprecate Node/DofObject copy methods\n\nFixes #1451\n\nAny Node object copying is almost certainly a bug (always in\nperformance, sometimes in functionality), in code that should have\nbeen taking references instead. Since the node_ref_range() method\nmakes it too easy to write such bugs, we should deprecate the Node\ncopy constructor and make it impossible soon.\n\nThe only places where DofObject(DofObject) was being used were in the\nNode copy constructor (no longer relevant) and in old_dof_object\ncreation (where the not-quite-a-proper-copy-constructor behavior is\nintentional), so for added safety let's deprecate non-private access\nto that constructor. Since it was already protected before this\nshouldn't cause any hardship to downstream users.", "target": 0} {"commit_message": "[PATCH] Various performance improvements: (#1573)\n\n- Each fragment metadata directory is now associated with an empty file with the same name as the directory, but with added suffix '.ok'. 
This prevents an extra REST request on object stores when opening the array.\n- Consolidation does not delete the consolidated fragments or array metadata. This is to prevent locking the array during consolidation and to enable time traveling and fine granularities.\n- Added new consolidation functionality that enables consolidation of all fragment metadata footers in a single file. This boosts the performance of opening an array significantly.\n- Added vacuum API to clean up consolidated fragments, array metadata, or consolidated fragment metadata.\n- Parallelized the reader in various places, significantly boosting the read performance.", "target": 1} {"commit_message": "[PATCH] more efficient binary operations in MX, ticket #192", "target": 1} {"commit_message": "[PATCH] Adding Changelog for Release 3.2.00\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 3.2\nUpdate Kokkos version macros to 3.2.0", "target": 0} {"commit_message": "[PATCH] Updated whitespace, implemented low hanging performance\n boosts", "target": 0} {"commit_message": "[PATCH] More efficient SetNonzeros::propagateSparsities", "target": 1} {"commit_message": "[PATCH] Fix performance of 4-dslash_domain_Wall_4d.cuh kernels with\n xpay enabled: store coefficients in __constant__ memory to remove register\n spilling", "target": 1} {"commit_message": "[PATCH] Adding Changelog for Release 2.8.00\n\nPart of Kokkos C++ Performance Portability Programming EcoSystem 2.8", "target": 0} {"commit_message": "[PATCH] some more performance improvements", "target": 1} {"commit_message": "[PATCH] pathf90 v2.1 better performance with ro=1 vs. 
ro=2", "target": 1} {"commit_message": "[PATCH] on itanium (with Intel compiler) **2 was verrrry slow,\n replaced with ()*()", "target": 1} {"commit_message": "[PATCH] fast exit out of symmetry routines if C1. No longer print \"ok\"\n on first geometry step", "target": 0} {"commit_message": "[PATCH] Start nvprof profiling at counter reset\n\nWhen running in the NVIDIA profiler, to eliminate initial kernel\nperformance fluctuations, load balancing effects, as well as\ninitialization API calls from the traces, we now start the NVIDIA\nprofiler through the CUDA runtime API at the performance counter\nresetting. This has an effect only if mdrun is started in nvprof with\nprofiling off.\n\nChange-Id: Idfb3c86a96cb8b55cd874f641f4922b5517de6e3", "target": 0} {"commit_message": "[PATCH] Adding a SymvCtrl data structure and subsequently\n extending HermitianTridiagCtrl and HermitianEigCtrl to support it due\n to the large differences in performance from different approaches to the\n local portion of a distributed Symv", "target": 0} {"commit_message": "[PATCH] Enable SIMD register calling convention with gmx_simdcall\n\nCmake now checks if the compiler supports __vectorcall or\n__regcall calling convention modifiers, and sets gmx_simdcall\nto one of these if supported, otherwise a blank string.\nThis should enable 32-bit MSVC to accept our SIMD routines\n(starting from MSVC 2013), and with ICC it can at least in\ntheory improve performance slightly by using more registers\nfor argument passing in 64-bit mode too. 
Presently this is\nonly useful on x86, but the infrastructure will work if we\nfind similar calling conventions on other architectures.\n\nFixes #1541.\n\nChange-Id: I7026fb4e1fb6b88c8aa18b060a631cbb80231cd4", "target": 1} {"commit_message": "[PATCH] Added parallel LCG, a way to avoid an MPI performance bug on\n BG/P, a function for creating random Hermitian matrices, and modified LAPACK\n wrappers to throw an error if the 'info' parameter is nonzero, even in\n RELEASE mode.", "target": 1} {"commit_message": "[PATCH] Fix AMD OpenCL float3 array optimization bug\n\nBecause float3 by OpenCL spec is 16-byte, when used as an array type\nthe allocation needs to optimized to avoid unnecessary register use.\nThe nbnxm kernels use a float3 i-force accumulator array in registers.\n\nStarting with ROCm 2.3 the AMD OpenCL compiler regressed and lost\nits ability to effectively optimize code that uses float3 register\narrays. The large amount of extra registers used limits the kernel\noccupancy and significantly impacts performance.\nOnly the AMD platform is affected, other vendors' compilers are able to\ndo the necessary transformations to avoid the extra register use.\n\nThis change converts the float3 array to a float[3] saving 8*4 bytes\nregister space. This improves nonbonded kernel performance\non an AMD Vega GPU by 25% and 40% for the most common flavor of the\nEwald and RF force-only kernels, respectively.\n\nNote that eliminating the rest of the non-array use of float3 has no\nsignificant impact.", "target": 1} {"commit_message": "[PATCH] Code for ordinary, slow, ewald electrostatics", "target": 0} {"commit_message": "[PATCH] Called MPI_Send_Recv at appropriate locations instead of the\n more elaborate dual calls. Performance improvement is not measurable on Linux\n though. 
But in principle an MPI implementation could optimize that.", "target": 1} {"commit_message": "[PATCH] new optimization of dgemm kernel for bulldozer: 10%\n performance increase", "target": 1} {"commit_message": "[PATCH] Use importlib_resources in Python 3.6 images.\n\nPython 3.7 adds importlib.resources to the standard library, which\nprovides an efficient built in alternative to pkg_resources.\nBackported functionality is available in the importlib_resources\npackage. We should add it to our Docker images to allow testing new\nfeatures while we still officially support Python 3.6.\n\nSee also issue #2961", "target": 0} {"commit_message": "[PATCH] Distinguished mutexes from semaphores. The distinction is\n useful because the linux implementation of sem_post() in unnecessarily slow\n when semaphores are used for mutual exclusion. This change made spinlocks\n messier to implement, so I excised them.", "target": 0} {"commit_message": "[PATCH] Amendments to density fitting manual section\n\n - Fixed a sign error in the energy and force definition\n - Added performance considerations\n - Fixed whitespace\n - Changed vector notation to mathbf as in the other parts of the manual\n - Added pressure-coupling considerations\n - Added considerations when using multiple-time-stepping\n\nrefs #2282\n\nChange-Id: I8421ccf09ac960fa04508234e738967f51a27fab", "target": 0} {"commit_message": "[PATCH] fixed a bug with efficient IMVJ and reuse of old data - need\n to clear the matrix Wtil after convergence of the current time step, as\n columns are the result of J*V and J is outdated now.", "target": 0} {"commit_message": "[PATCH] performance miniapps: don't enable -march=native on ARM\n\nSee e.g. https://stackoverflow.com/questions/65966969/why-does-march-native-not-work-on-apple-m1", "target": 0} {"commit_message": "[PATCH] wmkdep: Added path string substitution support\n\nto avoid the need for sed'ing the output. 
This improves performance by avoiding\nthe need for calling additional commands and generating a temporary file.", "target": 1} {"commit_message": "[PATCH] Fix for md_parallel_for for Cuda reported in #1057\n\nmd_parallel_for is intended to be deprecated, but fix the issue for not\nrespecting lower bounds while resolving difference in performance\nbetween md_parallel_for and parallel_for calls with MDRangePolicy", "target": 0} {"commit_message": "[PATCH] MDRange: Minor perf test fixes for KNL\n\nTests are still commented out for now, but removed running a slow test,\nreduced the number of tests, and made minor fixes when checking results\nduring the first iteration (vectorization may have occurred and made\nthe check no longer bitwise correct; epsilon comparison should be added\nas well as option to not check correctness which will speed up the tests)", "target": 0} {"commit_message": "[PATCH] thermophysicalModels: Added new tabulated equation of state,\n thermo and transport models\n\nusing the new nonUniformTable to interpolate between the values vs temperature\nprovided. All properties (density, heat capacity, viscosity and thermal\nconductivite) are considered functions of temperature only and the equation of\nstate is thus incompressible. Built-in mixing rules corresponding to those in\nthe other thermo and transport models are not efficient or practical for\ntabulated data and so these models are currently only instantiated for the pure\nspecie/mixture rhoThermo package but a general external mixing method will be\nadded in the future.\n\nTo handle reactions the Jacobian function dKcdTbyKc has been rewritten to use\nthe Gstd and S functions directly removing the need for the miss-named dGdT\nfunction and hence removing the bugs in the implementation of that function for\nsome of the thermo models. 
Additionally the Hc() function has been renamed\nHf() (heat of formation) which is more commonly used terminology and consistent\nwith the internals of the thermo models.", "target": 0} {"commit_message": "[PATCH] Add a new example called `custom-logger`.\n\nThe purpose of this example is to show the users how to customize Ginkgo by\nadding a new logger, which is useful and more efficient for application specific\nproblems. This is also one of the most basic (and simple) ways of customizing\nGinkgo, therefore this is a good entry level example.\n\nThis example simply prints a table of the recurrent residual norm against the\nreal residual norm.\n\nThis example is documented as much as well as the `simple-solver` example for\nthe user's convenience.", "target": 1} {"commit_message": "[PATCH] Sparse unordered with duplicates: Better vectorization for\n tile bitmaps.\n\nWhen computing tile bitmaps, there are a few places where we were not\nvectorizing properly and it caused performance impacts on some customer\nscenario. Fixing the use of covered/overlap in the dimension class to\nenable proper vectorization.", "target": 0} {"commit_message": "[PATCH] Fix IR issue causing very slow to no convergence when using\n an inaccurate inner solver", "target": 1} {"commit_message": "[PATCH] Adds a URIManager to manage all URIs within an array\n directory. This introduces several performance improvements, especially\n around redundant URI listings, parallelizing URI listings, etc. Also makes\n VFS::ls a noop for POSIX and HDFS when the listed directory does not exist\n instead of throwing an error, matching the functionality of the object\n stores. 
Finally, it removes partial vacuuming, as that leads to incorrect\n behavior with time traveling.", "target": 1} {"commit_message": "[PATCH] Explaining the performance test better.", "target": 0} {"commit_message": "[PATCH] bugfix of set_coef introduced by\n c84cc28d1c5550876d2d648f905716ffa69335ef\n\nThe problem is that building the matrix from a set of triplets sums\nthe value provided in case several values at the same position are\nprovided. In order to overwrite a value, we have no other choice\nthan to build the matrix and set the value (with current Eigen API).\nThe following fix is as efficient if the matrix is assembled\nin one pass. Using the boolean member of set_coef is very important now\nas it can imply a premature building of the eigen matrix.", "target": 0} {"commit_message": "[PATCH] Fix for processors being offline on Arm\n\nUse the number of configured rather than online CPUs.\nWe will still get a warning about failures when trying to\npin to offline CPUs, which hurts performance slightly.\nTo fix this, we also check if there is a mismatch between\nconfigured and online processors and warn the user that\nthey should force all their processors online\nfor better performance.\n\nChange-Id: Iebdf0d5b820edcd7d06859a2b814adf06589ef96", "target": 1} {"commit_message": "[PATCH] Disable tests using EPECK (for performance reasons)", "target": 0} {"commit_message": "[PATCH] Move orb LUT in CUDA backend to texture memory\n\ncuda::kernel::extract_orb is the CUDA kernel that uses the orb\nlookup table. Shared below is performance of the kernel using constant\nmemory vs texture memory. There is negligible to no difference between the two\nversions. 
Hence, shifted to texture memory LUT to reduce global constant\nmemory usage.\n\nPerformance using constant memory LUT\n-------------------------------------\n\nTime(%) Time Calls Avg Min Max Name\n\n3.02% 292.26us 24 12.177us 11.360us 14.528us void cuda::kernel::extract_orb\n2.16% 209.00us 16 13.062us 11.616us 16.033us void cuda::kernel::extract_orb\n\nPerformance using texture LUT\n-----------------------------\n\nTime(%) Time Calls Avg Min Max Name\n\n2.84% 270.63us 24 11.276us 9.6970us 15.040us void cuda::kernel::extract_orb\n2.20% 209.28us 16 13.080us 10.688us 16.960us void cuda::kernel::extract_orb", "target": 1} {"commit_message": "[PATCH] fixed bug with Verlet + DD + bonded atom communication\n\nAtoms communicated for bonded interactions can be beyond the cut-off\ndistance. Such atoms are now put placed in an extra row in the grid.\nFixes #1114\n\nAlso improved the performance of the nbnxn grid sorting, especially\nfor inhomogeneous systems.\n\nChange-Id: Ibe5ba24af95959f5dadd89584e2315da60b55091", "target": 1} {"commit_message": "[PATCH] Move calcgrid.* to fft/\n\nOne more file out of gmxlib/. These are related to selecting an FFT grid\nsize, and contain some numbers coming from performance measurements, so\nfft/ should be a natural place.\n\nChange-Id: I386965665a92bc47d4c0c3ca0201a6a4b13b5886", "target": 0} {"commit_message": "[PATCH] Speed improvement with CGAL_HEADER_ONLY and\n WITH_{tests|examples}..\n\nWhen `CGAL_HEADER_ONLY` and `WITH_{examples|tests|demos}`, then only\nthe first call to `find_package(CGAL)` does the job. The subsequent\ncalls return very fast, by caching the results in global properties.", "target": 1} {"commit_message": "[PATCH] improved nbnxn PME kernel performance on AMD\n\nThe performance of the nbnxn PME kernels on AMD was much worse with\ngcc than with icc. Now the table load macro has been changed,\nwhich roughly halves the performance difference.", "target": 1}