Blis

Latest version: v0.9.1

1.3

Modified config/zen/make_defs.mk, now CKVECFLAGS := -mavx2 -mfpmath=sse -mfma -march=znver1

Change-Id: Ia0942d285a21447cd0c470de1bc021fe63e80d81

commit 3bdab823fa93342895bf45d812439324a37db77c
Merge: 70f12f20 e2a02ebd
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Feb 28 14:07:24 2019 -0600

Merge branch 'master' into dev

commit e2a02ebd005503c63138d48a2b7d18978ee29205
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Feb 28 13:58:59 2019 -0600

Updates (from ls5) to test/3m4m/runme.sh.

Details:
- Lonestar5-specific updates to runme.sh.

commit f0dcc8944fa379d53770f5cae5d670140918f00c
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Feb 27 17:27:23 2019 -0600

Add symbol export macro for all functions (302)

* initial export of blis functions

* Regenerate def file for master

* restore bli_extern_defs exporting for now

commit 540ec1b479712d5e1da637a718927249c15d867f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Feb 24 19:09:10 2019 -0600

Updated level-3 BLAS to call object API directly.

Details:
- Updated the BLAS compatibility layer for level-3 operations so that
the corresponding BLIS object API is called directly rather than first
calling the typed BLIS API. The previous code based on the typed BLIS
API calls is still available in a deactivated cpp macro branch, which
may be re-activated by defining BLIS_BLAS3_CALLS_TAPI. (This does not
yet correspond to a configure option. If it seems like people might
want to toggle this behavior more regularly, a configure option can be
added in the future.)
- Updated the BLIS typed API to statically "pre-initialize" objects via
new initializor macros. Initialization is then finished via calls to
static functions bli_obj_init_finish_1x1() and bli_obj_init_finish(),
which are similar to the previously-called functions,
bli_obj_create_1x1_with_attached_buffer() and
bli_obj_create_with_attached_buffer(), respectively. (The BLAS
compatibility layer updates mentioned above employ this new technique
as well.)
- Transformed certain routines in bli_param_map.c--specifically, the
ones that convert netlib-style parameters to BLIS equivalents--into
static functions, now in bli_param_map.h. (The remaining three classes
of conversion routines were left unchanged.)
- Added the aforementioned pre-initializor macros to bli_type_defs.h.
- Relocated bli_obj_init_const() and bli_obj_init_constdata() from
bli_obj_macro_defs.h to bli_type_defs.h.
- Added a few macros to bli_param_macro_defs.h for testing domains for
real/complexness and precisions for single/double-ness.
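
The static "pre-initialization" described in the second item above can be pictured with a small sketch. The names used here (`obj_sketch_t`, `OBJ_PREINIT_DOUBLE`, `obj_init_finish_sketch`) are hypothetical stand-ins chosen for illustration and are not the BLIS definitions; the real macros live in bli_type_defs.h and cover many more fields.

```c
#include <stddef.h>

/* Hypothetical, simplified object type (not the real BLIS obj_t). */
typedef struct
{
    int    is_double;   /* known at compile time inside a typed wrapper */
    size_t m, n;        /* dimensions, known only at call time */
    size_t rs, cs;      /* row and column strides */
    void*  buffer;      /* caller-owned matrix data */
} obj_sketch_t;

/* Static "pre-initializer": fills in what is known at compile time and
   zeroes the rest. */
#define OBJ_PREINIT_DOUBLE { 1, 0, 0, 0, 0, NULL }

/* Finishes initialization using the values supplied by the caller. */
static inline void obj_init_finish_sketch( size_t m, size_t n, void* p,
                                           size_t rs, size_t cs,
                                           obj_sketch_t* obj )
{
    obj->m      = m;
    obj->n      = n;
    obj->rs     = rs;
    obj->cs     = cs;
    obj->buffer = p;
}
```

A caller would declare `obj_sketch_t ao = OBJ_PREINIT_DOUBLE;` and then call `obj_init_finish_sketch( m, n, a, rs, cs, &ao );`, so the compile-time part costs nothing at run time.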

commit 8e023bc914e9b4ac1f13614feb360b105fbe44d2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Feb 22 16:55:30 2019 -0600

Updates to 3m4m/matlab scripts.

Details:
- Minor updates to matlab graph-generating scripts.
- Added a plot_all.m script that is more of a scratchpad for copying and
pasting function invocations into matlab to generate plots that are
presently of interest to us.

commit b06244d98cc468346eb1a8eb931bc05f35ff280c
Merge: e938ff08 4c7e6680
Author: praveeng <praveen.gamd.com>
Date: Thu Feb 21 12:56:15 2019 +0530

Merge branch 'ut-austin-amd' of ssh://git.amd.com:29418/cpulibraries/er/blis into ut-austin-amd

commit e938ff08cea3d108c84524eb129d9e89d701ea90
Author: praveeng <praveen.gamd.com>
Date: Thu Feb 21 12:44:38 2019 +0530

deleted test.txt

Change-Id: I3871f5fe76e548bc29ec2733745b29964e829dd3

commit ed13ad465dcba350ad3d5e16c9cc7542e33f3760
Author: mkv <Mallikarjuna-Reddy.K-Vamd.com>
Date: Thu Feb 21 01:04:16 2019 -0500

added test file for initial commit

commit 4c7e6680832b497468cf50c2399e3ac4de0e3450
Author: praveeng <praveen.gamd.com>
Date: Thu Feb 21 12:44:38 2019 +0530

deleted test.txt

Change-Id: I3871f5fe76e548bc29ec2733745b29964e829dd3

commit 95e070581c54ed2edc211874faec56055ea298c8
Author: mkv <Mallikarjuna-Reddy.K-Vamd.com>
Date: Thu Feb 21 01:04:16 2019 -0500

added test file for initial commit

commit 70f12f209bc1901b5205902503707134cf2991a0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Feb 20 16:10:10 2019 -0600

Changed unsafe-loop to unsafe-math optimizations.

Details:
- Changed -funsafe-loop-optimizations (re-)introduced in 7690855 for
make_defs.mk files' CRVECFLAGS to -funsafe-math-optimizations (to
account for a miscommunication in issue 300). Thanks to Dave Love
for this suggestion and Jeff Hammond for his feedback on the topic.

commit 7690855c5106a56e5b341a350f8db1c78caacd89
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 18 19:16:01 2019 -0600

Restored -funsafe-loop-optimizations to subconfigs.

Details:
- Restored use of -funsafe-loop-optimizations in the definitions of
CRVECFLAGS (when using gcc), but only for sub-configurations (and
not configuration families such as amd64, intel64, and x86_64).
This more or less reverts 5190d05 and 6cf1550.

commit 44994d1490897b08cde52a615a2e37ddae8b2061
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 18 18:35:30 2019 -0600

Disable TBM, XOP, LWP instructions in AMD configs.

Details:
- Added -mno-tbm -mno-xop -mno-lwp to CKVECFLAGS in bulldozer,
piledriver, steamroller, and excavator configurations to explicitly
disable AMD's bulldozer-era TBM, XOP, and LWP instruction sets in an
attempt to fix the invalid instruction error that has plagued Travis
CI builds since 6a014a3. Thanks to Devin Matthews for pointing out
that the offending instruction was part of TBM (issue 300).
- Restored -O3 to piledriver configuration's COPTFLAGS.

commit 1e5b530744c1906140d47f43c5cad235eaa619cf
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 18 18:04:38 2019 -0600

Reverted piledriver COPTFLAGS from -O3 to -O2.

Details:
- Debugging continues; changing COPTFLAGS for piledriver subconfig from
-O3 to -O2, its original value prior to 6a014a3.

commit 6cf155049168652c512aefdd16d74e7ff39b98df
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 18 17:29:51 2019 -0600

Removed -funsafe-loop-optimizations from all configs.

Details:
- Error persists. Removed -funsafe-loop-optimizations from all remaining
sub-configurations.

commit 5190d05a27c5fa4c7942e20094f76eb9a9785c3e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 18 17:07:35 2019 -0600

Removed -funsafe-loop-optimizations from piledriver.

Details:
- Error persists; continuing debugging from bf0fb78c by removing
-funsafe-loop-optimizations from piledriver configuration.

commit bf0fb78c5e575372060d22f5ceeb5b332e8978ec
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 18 16:51:38 2019 -0600

Removed -funsafe-loop-optimizations from families.

Details:
- Removed -funsafe-loop-optimizations from the configuration families
affected by 6a014a3, specifically: intel64, amd64, and x86_64.
This is part of an attempt to debug why the sde, as executed by
Travis CI, is crashing via the following error:

TID 0 SDE-ERROR: Executed instruction not valid for specified chip
(ICELAKE): 0x9172a5: bextr_xop rax, rcx, 0x103

commit 6a014a3377a2e829dbc294b814ca257a2bfcb763
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 18 14:52:29 2019 -0600

Standardized optimization flags in make_defs.mk.

Details:
- Per Dave Love's recommendation in issue 300, this commit defines
COPTFLAGS := -O3
and
CRVECFLAGS := $(CKVECFLAGS) -funsafe-loop-optimizations
in the make_defs.mk for all Intel- and AMD-based configurations.

commit 565fa3853b381051ac92cff764625909d105644d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 18 11:43:58 2019 -0600

Redirect trsm pc, ir parallelism to ic, jr loops.

Details:
- trsm parallelization was temporarily simplified in 075143d to entirely
ignore any parallelism specified via the pc or ir loops. Now, any
parallelism specified to the pc loop will be redirected to the ic
loop, and any parallelism specified to the ir loop will be redirected
to the jr loop. (Note that because of inter-iteration dependencies,
trsm cannot parallelize the ir loop. Parallelism via the pc loop is
at least somewhat feasible in theory, but it would require tracking
dependencies between blocks--something for which BLIS currently lacks
the necessary supporting infrastructure.)
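
One way to read the redirection described above is as a folding of the per-loop ways of parallelism. The sketch below assumes that "redirected" means the pc factor is multiplied into ic and the ir factor into jr, which keeps the total thread count unchanged; the struct and function names are hypothetical and this is not the BLIS code.

```c
/* Hypothetical container for the ways of parallelism of the five loops. */
typedef struct { int jc, pc, ic, jr, ir; } ways_sketch_t;

/* Fold pc into ic and ir into jr. The product of all ways (the total
   number of threads) is the same before and after the call. */
static void trsm_redirect_ways( ways_sketch_t* w )
{
    w->ic *= w->pc;
    w->pc  = 1;
    w->jr *= w->ir;
    w->ir  = 1;
}
```

For example, a request of jc=2, pc=3, ic=4, jr=5, ir=2 would become jc=2, pc=1, ic=12, jr=10, ir=1 under this assumption.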

commit a023c643f25222593f4c98c2166212561d030621
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Feb 14 20:18:55 2019 -0600

Regenerated symbols in build/libblis-symbols.def.

Details:
- Reran ./build/regen-symbols.sh after running
'configure --enable-cblas auto'

commit 075143dfd92194647da9022c1a58511b20fc11f3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Feb 14 18:52:45 2019 -0600

Added support for IC loop parallelism to trsm.

Details:
- Parallelism within the IC loop (3rd loop around the microkernel) is
now supported within the trsm operation. This is done via a new branch
on each of the control and thread trees, which guide execution of a
new trsm-only subproblem from within bli_trsm_blk_var1(). This trsm
subproblem corresponds to the macrokernel computation on only the
block of A that contains the diagonal (labeled as A11 in algorithms
with FLAME-like partitioning), and the corresponding row panel of C.
During the trsm subproblem, all threads within the JC communicator
participate and parallelize along the JR loop, including any
parallelism that was specified for the IC loop. (IR loop parallelism
is not supported for trsm due to inter-iteration dependencies.) After
this trsm subproblem is complete, a barrier synchronizes all
participating threads and then they proceed to apply the prescribed
BLIS_IC_NT (or equivalent) ways of parallelism (and any BLIS_JR_NT
parallelism specified within) to the remaining gemm subproblem (the
rank-k update that is performed using the newly updated row-panel of
B). Thus, trsm now supports JC, IC, and JR loop parallelism.
- Modified bli_trsm_l_cntl_create() to create the new "prenode" branch
of the trsm_l cntl_t tree. The trsm_r tree was left unchanged, for
now, since it is not currently used. (All trsm problems are cast in
terms of left-side trsm.)
- Updated bli_cntl_free_w_thrinfo() to be able to free the newly shaped
trsm cntl_t trees. Fixed a potentially latent bug whereby a cntl_t
subnode is only recursed upon if there existed a corresponding
thrinfo_t node, which may not always exist (for problems too small
to employ full parallelization due to the minimum granularity imposed
by micropanels).
- Updated other functions in frame/base/bli_cntl.c, such as
bli_cntl_copy() and bli_cntl_mark_family(), to recurse on sub-prenodes
if they exist.
- Updated bli_thrinfo_free() to recurse into sub-nodes and prenodes
when they exist, and added support for growing a prenode branch to
bli_thrinfo_grow() via a corresponding set of help functions named
with the _prenode() suffix.
- Added a bszid_t field to thrinfo_t nodes. This field comes in handy when
debugging the allocation/release of thrinfo_t nodes, as it helps trace
the "identity" of each node as it is created/destroyed.
- Renamed
bli_l3_thrinfo_print_paths() -> bli_l3_thrinfo_print_gemm_paths()
and created a separate bli_l3_thrinfo_print_trsm_paths() function to
print out the newly reconfigured thrinfo_t trees for the trsm
operation.
- Trivial changes to bli_gemm_blk_var?.c and bli_trsm_blk_var?.c
regarding variable declarations.
- Removed subpart_t enum values BLIS_SUBPART1T, BLIS_SUBPART1B,
BLIS_SUBPART1L, BLIS_SUBPART1R. Then added support for two new labels
(semantically speaking): BLIS_SUBPART1A and BLIS_SUBPART1B, which
represent the subpartition ahead of and behind, respectively,
BLIS_SUBPART1. Updated check functions in bli_check.c accordingly.
- Shuffled layering/APIs for bli_acquire_mpart_[mn]dim() and
bli_acquire_mpart_t2b/b2t(), _l2r/r2l().
- Deprecated old functions in frame/3/bli_l3_thrinfo.c.

commit 78bc0bc8b6b528c79b11f81ea19250a1db7450ed
Author: Nicholai Tukanov <nicholaiutexas.edu>
Date: Thu Feb 14 13:29:02 2019 -0600

Power9 sub-configuration (298)

Formally registered power9 sub-configuration.

Details:
- Added and registered power9 sub-configuration into the build system.
Thanks to Nicholai Tukanov and Devangi Parikh for these contributions.
- Note: The sub-configuration does not yet have a corresponding
architecture-specific kernel set registered, and so for now the
sub-config is using the generic kernel set.

commit 6b832731261f9e7ad003a9ea4682e9ca973ef844
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Feb 12 16:01:28 2019 -0600

Generalized ref kernels' pragma omp simd usage.

Details:
- Replaced direct usage of _Pragma( "omp simd" ) in reference kernels
with PRAGMA_SIMD, which is defined as a function of the compiler being
used in a new bli_pragma_macro_defs.h file. That definition is cleared
when BLIS detects that the -fopenmp-simd command line option is
unsupported. Thanks to Devin Matthews and Jeff Hammond for suggestions
that guided this commit.
- Updated configure and bli_config.h.in so that the appropriate anchor
is substituted in (when the corresponding pragma omp simd support is
present).
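
As a rough illustration of the PRAGMA_SIMD idea above, the macro can expand to the pragma when support is present and to nothing otherwise. `HAVE_OMP_SIMD_SKETCH` is a hypothetical feature macro standing in for whatever configure substitutes into bli_config.h, and `daxpy_sketch` is an illustrative kernel, not a BLIS function.

```c
/* Hypothetical feature macro; define it (and compile with -fopenmp-simd)
   when the compiler supports the directive. */
#ifdef HAVE_OMP_SIMD_SKETCH
  #define PRAGMA_SIMD _Pragma( "omp simd" )
#else
  #define PRAGMA_SIMD   /* expands to nothing when unsupported */
#endif

/* y := y + alpha * x for n contiguous doubles. */
static void daxpy_sketch( int n, double alpha, const double* x, double* y )
{
    PRAGMA_SIMD
    for ( int i = 0; i < n; ++i )
        y[ i ] += alpha * x[ i ];
}
```

Because the kernel only ever mentions PRAGMA_SIMD, the same source compiles with or without the directive.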

commit b1f5ce8622b682b79f956fed83f04a60daa8e0fc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Feb 5 17:38:50 2019 -0600

Minor updates to scripts in test/mixeddt/matlab.

commit 38203ecd15b1fa50897d733daeac6850d254e581
Author: Devangi N. Parikh <dnpcs.utexas.edu>
Date: Mon Feb 4 15:28:28 2019 -0500

Added thunderx2 system in the mixeddt test scripts

Details:
- Added thunderx2 (tx2) as a system in the runme.sh in test/mixeddt

commit dfc91843ea52297bf636147793029a0c1345be04
Author: Devangi N. Parikh <dnpcs.utexas.edu>
Date: Mon Feb 4 15:23:40 2019 -0500

Fixed gcc flags for thunderx2 subconfiguration

Details:
- Fixed -march flag. Thunderx2 is an armv8.1a architecture not armv8a.

commit c665eb9b888ec7e41bd0a28c4c8ac4094d0a01b5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 28 16:22:23 2019 -0600

Minor updates to docs, Makefiles.

Details:
- Changed all occurrences of
micro-kernel -> microkernel
macro-kernel -> macrokernel
micro-panel -> micropanel
in all markdown documents in 'docs' directory. This change is being
made since we've reached the point in adoption and acceptance of
BLIS's insights where words such as "microkernel" are no longer new,
and therefore now merit being unhyphenated.
- Updated "Implementation Notes" sections of KernelsHowTo.md, which
still contained references to nonexistent cpp macros such as
BLIS_DEFAULT_MR_? and BLIS_PACKDIM_MR_?.
- Added 'run-fast' and 'check-fast' targets to testsuite/Makefile.
- Minor updates to Testsuite.md, including suggesting use of
'make check' and 'make check-fast' when running from the local
testsuite directory.
- Added a comment to top-level Makefile explaining the purpose behind
the TESTSUITE_WRAPPER variable, which at first glance appears to serve
no purpose.

commit 1aa280d0520ed5eaea3b119b4e92b789ecad78a4
Author: M. Zhou <5723047+cdluminateusers.noreply.github.com>
Date: Sun Jan 27 21:40:48 2019 +0000

Amend OS detection for kFreeBSD. (295)

commit fffc23bb35d117a433886eb52ee684ff5cf6997f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jan 25 13:35:31 2019 -0600

CREDITS file update.

commit 26c5cf495ce22521af5a36a1012491213d5a4551
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jan 24 18:49:31 2019 -0600

Fixed bug in skx subconfig related to bdd46f9.

Details:
- Fixed code in the skx subconfiguration that became a bug after
committing bdd46f9. Specifically, the bli_cntx_init_skx() function
was overwriting default blocksizes for the scomplex and dcomplex
microkernels despite the fact that only single and double real
microkernels were being registered. This was not a problem prior to
bdd46f9 since all microkernels used dynamically-queried (at runtime)
register blocksizes for loop bounds. However, post-bdd46f9, this
became a bug because the reference ukernels for scomplex and dcomplex
were written with their register blocksizes hard-coded as constant
loop bounds, which conflicted with the erroneous scomplex and dcomplex
values that bli_cntx_init_skx() was setting in the context. The
lesson here is that going forward, all subconfigurations must not set
any blocksizes for datatypes corresponding to default/reference
microkernels. (Note that a blocksize is left unchanged by the
bli_cntx_set_blkszs() function if it was set to -1.)
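
The "-1 means leave unchanged" convention mentioned above might look roughly like the following. This is a simplified sketch: `set_blkszs_sketch` and the flat integer arrays are hypothetical, and the numeric values in the usage note are invented for illustration.

```c
#define NUM_DT 4  /* single, double, scomplex, dcomplex */

/* Copy requested blocksizes into the context, skipping entries equal to -1
   so that the defaults paired with the reference kernels are preserved. */
static void set_blkszs_sketch( int ctx[ NUM_DT ], const int requested[ NUM_DT ] )
{
    for ( int dt = 0; dt < NUM_DT; ++dt )
    {
        if ( requested[ dt ] != -1 )
            ctx[ dt ] = requested[ dt ];
    }
}
```

A subconfiguration that registers only real-domain microkernels would then pass something like `{ 32, 16, -1, -1 }`, leaving the scomplex and dcomplex defaults intact.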

commit 180f8e42e167b83a757340ad4bd4a5c7a1d6437b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jan 24 18:01:15 2019 -0600

Fixed undefined behavior trsm ukr bug in bdd46f9.

Details:
- Fixed a bug that manifested anytime a configuration was used in which
optimized microkernels were registered and the trsm operation (or
kernel) was invoked. The bug resulted from the optimized microkernels'
register blocksizes conflicting with the hard-coded values--expressed
in the form of constant loop bounds--used in the new reference trsm
ukernels that were introduced in bdd46f9. The fix was easy: reverting
back to the implementation that uses variable-bound loops, which
amounted to changing an #if 0 to #if 1 (since I preserved the older
implementation in the file alongside the new code based on constant-
bound loops). It should be noted that this fix must be permanent,
since the trsm kernel code with constant-bound loops can never work
with gemm ukernels that use different register blocksizes.
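
To make the distinction concrete, here is a minimal sketch of a triangular solve written with variable-bound loops. The function name and the plain row-major storage are assumptions made for readability; the actual reference ukernels operate on packed micropanels.

```c
/* Forward substitution: overwrite the m x n block b with the solution X of
   L * X = B, where l is an m x m lower triangular matrix. Both operands are
   row-major here. m and n are runtime values, so the same code works for
   whatever register blocksizes the registered gemm microkernel uses. */
static void trsm_l_sketch( int m, int n, const double* l, double* b )
{
    for ( int i = 0; i < m; ++i )
    {
        for ( int j = 0; j < n; ++j )
        {
            double x = b[ i*n + j ];

            /* Subtract contributions of rows already solved. */
            for ( int k = 0; k < i; ++k )
                x -= l[ i*m + k ] * b[ k*n + j ];

            b[ i*n + j ] = x / l[ i*m + i ];
        }
    }
}
```

A constant-bound variant would replace m and n with compile-time constants, which is only safe when those constants match the blocksizes stored in the context.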

commit bdd46f9ee88057d52610161966a11c224e5a026c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jan 24 17:23:18 2019 -0600

Rewrote reference kernels to use pragma omp simd.

Details:
- Rewrote level-1v, -1f, and -3 reference kernels in terms of simplified
indexing annotated by the pragma omp simd directive, which a compiler
can use to vectorize certain constant-bounded loops. (The new kernels
actually use _Pragma("omp simd") since the kernels are defined via
templatizing macros.) Modest speedup was observed in most cases using
gcc 5.4.0, which may improve with newer versions. Thanks to Devin
Matthews for suggesting this via issue 286 and 259.
- Updated default blocksizes defined in ref_kernels/bli_cntx_ref.c to
be 4x16, 4x8, 4x8, and 4x4 for single, double, scomplex and dcomplex,
respectively, with a default row preference for the gemm ukernel. Also
updated axpyf, dotxf, and dotxaxpyf fusing factors to 8, 6, and 4,
respectively, for all datatypes.
- Modified configure to verify that -fopenmp-simd is a valid compiler
option (via a new detect/omp_simd/omp_simd_detect.c file).
- Added a new header in which prefetch macros are defined according to
which compiler is detected (via macros such as __GNUC__). These
prefetch macros are not yet employed anywhere, though.
- Updated the year in copyrights of template license headers in
build/templates and removed AMD as a default copyright holder.
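
The remark about _Pragma("omp simd") and templatizing macros can be illustrated as follows. A `#pragma` line cannot appear inside a macro body, so the operator form is used instead. The macro and function names below are hypothetical, the packed layouts are simplified, and the 4x16 and 4x8 shapes are taken from the single and double defaults listed above. Without -fopenmp-simd the pragma is simply not acted upon.

```c
/* Generates a reference gemm microkernel with constant loop bounds.
   a holds k columns of MR elements, b holds k rows of NR elements, and
   c is a row-major MR x NR block that is accumulated into. */
#define GEN_GEMM_REF( ftype, fname, MR, NR ) \
static void fname( int k, const ftype* a, const ftype* b, ftype* c ) \
{ \
    for ( int l = 0; l < k; ++l ) \
        for ( int i = 0; i < MR; ++i ) \
        { \
            _Pragma( "omp simd" ) \
            for ( int j = 0; j < NR; ++j ) \
                c[ i * NR + j ] += a[ l * MR + i ] * b[ l * NR + j ]; \
        } \
}

GEN_GEMM_REF( float,  sgemm_ref_sketch, 4, 16 )
GEN_GEMM_REF( double, dgemm_ref_sketch, 4, 8 )
```

Since MR and NR are literal constants after expansion, the compiler sees fixed trip counts on the two inner loops, which is what makes vectorization straightforward.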

commit 63de2b0090829677755eb5cdb27e73bc738da32d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jan 23 12:16:27 2019 -0600

Prevent redef of ftnlen in blastest f2c_types.h.

Details:
- Guard typedef of ftnlen in f2c_types.h with an #ifndef HAVE_BLIS_H
directive to prevent the redefinition of that type. Thanks to Jeff
Diamond for reporting this compiler warning (and apologies for the
delay in committing a fix).
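
The guard pattern can be sketched in a single file. The first two lines below merely simulate a build in which HAVE_BLIS_H is defined and ftnlen has already been declared elsewhere; the underlying type `int` and the helper `name_len_sketch` are assumptions for illustration only.

```c
/* Simulated environment: HAVE_BLIS_H is defined and ftnlen already exists. */
#define HAVE_BLIS_H
typedef int ftnlen;

/* f2c_types.h side: declare ftnlen only when it was not provided already. */
#ifndef HAVE_BLIS_H
typedef int ftnlen;
#endif

/* Small consumer of the type: length of a NUL-terminated name. */
static ftnlen name_len_sketch( const char* s )
{
    ftnlen n = 0;
    while ( s[ n ] != '\0' ) ++n;
    return n;
}
```

With the guard in place the second typedef is skipped, so the type is declared exactly once.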

commit eec2e183a7b7d67702dbd1f39c153f38148b2446
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 21 12:12:18 2019 -0600

Added escaping to '/' in os_name in configure.

Details:
- Add os_name to the list of variables into which the '/' character is
escaped. This is meant to address (or at least make progress toward
addressing) 293. Thanks to Isuru Fernando for spotting this as the
potential fix, and also thanks to M. Zhou for the original report.

commit adf5c17f0839fdbc1f4a1780f637928b1e78e389
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jan 18 15:14:45 2019 -0600

Formally registered thunderx2 subconfiguration.

Details:
- Added a separate subconfiguration for thunderx2, which now uses
different optimization flags than cortexa57/cortexa53.

commit 094cfdf7df6c2764c25fcbfce686ba29b933942c
Author: M. Zhou <5723047+cdluminateusers.noreply.github.com>
Date: Fri Jan 18 18:46:13 2019 +0000

Port BLIS to GNU Hurd OS. (294)

Prevent blis.h from misidentifying Hurd as OSX.

commit 5d7d616e8e591c2f3c7c2d73220eb27ea484f9c9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 15 20:52:51 2019 -0600

README.md update re: mixeddt TOMS paper.

commit 58c7fb4788177487f73a3964b7a910fe4dc75941
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 8 17:00:27 2019 -0600

Added more matlab scripts for mixeddt paper.

Details:
- Added a variant set of matlab scripts geared to producing plots that
reflect performance data gathered with and without extra memory
optimizations enabled. These scripts reside (for now) in
test/mixeddt/matlab/wawoxmem.

commit 34286eb914b48b56cdda4dfce192608b9f86d053
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 8 11:41:20 2019 -0600

Minor update to docs/HardwareSupport.md.

commit 108b04dc5b1b1288db95f24088d1e40407d7bc88
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 7 20:16:31 2019 -0600

Regenerated symbols in build/libblis-symbols.def.

Details:
- Reran ./build/regen-symbols.sh after running
'configure --enable-cblas auto' to reflect removal of
bli_malloc_pool() and bli_free_pool().

commit 706cbd9d5622f4690e6332a89cf41ab5c8771899
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 7 18:28:19 2019 -0600

Minor tweaks/cleanups to bli_malloc.c, _apool.c.

Details:
- Removed malloc_ft and free_ft function pointer arguments from the
interface to bli_apool_init() after deciding that there is no need to
specify the malloc()/free() for blocks within the apool. (The apool
blocks are actually just array_t structs.) Instead, we simply call
bli_malloc_intl()/_free_intl() directly. This has the added benefit
of allowing additional output when memory tracing is enabled via
--enable-mem-tracing. Also made corresponding changes elsewhere in
the apool API.
- Changed the inner pools (elements of the array_t within the apool_t)
to use BLIS_MALLOC_POOL and BLIS_FREE_POOL instead of BLIS_MALLOC_INTL
and BLIS_FREE_INTL.
- Disabled definitions of bli_malloc_pool() and bli_free_pool() since
there are no longer any consumers of these functions.
- Very minor comment / printf() updates.

commit 579145039d945adbcad1177b1d53fb2d3f2e6573
Author: Minh Quan Ho <1337056+hominhquanusers.noreply.github.com>
Date: Mon Jan 7 23:00:15 2019 +0100

Initialize error messages at compile time (289)

* Initialize error messages at compile time

- Assigning strings directly to the bli_error_string array, instead of
snprintf() at execution-time.

* Retired bli_error_init(), _finalize().

Details:
- Removed functions obviated by changes in 80e8dc6: bli_error_init(),
bli_error_finalize(), and bli_error_init_msgs(), as well as calls to
the former two in bli_init.c.

* Regenerated symbols in build/libblis-symbols.def.

Details:
- Reran ./build/regen-symbols.sh after running
'configure --enable-cblas auto'.
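
The compile-time initialization of error messages described at the top of this commit could look roughly like this. The error codes, message texts, and names are hypothetical and do not reproduce the BLIS tables.

```c
#include <stddef.h>

/* Hypothetical error codes used only for this sketch. */
enum { ERR_SUCCESS = 0, ERR_NULL_POINTER, ERR_MALLOC_FAILED, ERR_COUNT };

/* String literals are assigned directly in the initializer, so no init or
   finalize function and no run-time snprintf() is needed. */
static const char* const error_string_sketch[ ERR_COUNT ] =
{
    [ ERR_SUCCESS ]       = "Success.",
    [ ERR_NULL_POINTER ]  = "Encountered unexpected null pointer.",
    [ ERR_MALLOC_FAILED ] = "malloc() returned NULL."
};

/* Look up a message; returns NULL for codes outside the table. */
static const char* error_string_for_code( int code )
{
    if ( code < 0 || code >= ERR_COUNT ) return NULL;
    return error_string_sketch[ code ];
}
```

Because the table is fully built by the compiler, it is usable before any library initialization has run.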

commit aafbca086e36b6727d7be67e21fef5bd9ff7bfd9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 7 12:38:21 2019 -0600

Updated external package language in README.md.

Details:
- Updated/added comments about Fedora, OpenSUSE, and GNU Guix under the
newly-renamed "External GNU/Linux packages" section. Thanks to Dave
Love for providing these revisions.

commit daacfe68404c9cc8078e5e7ba49a8c7d93e8cda3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 7 12:12:47 2019 -0600

Allow running configure with python 3.4.

Details:
- Relax version blacklisting of python3 to allow 3.4 or later instead
of 3.5 or later. Thanks to Dave Love for pointing out that 3.4 was
sufficient for the purpose of BLIS's build system. (It should be
noted that we're not sure which, if any, python3 versions prior to
3.4 are insufficient, and that the only thing stopping us from
determining this is the fact that these earlier versions of python3
are not readily available for us to test with.)
- Updated docs/BuildSystem.md to be explicit about current python2 vs
python3 version requirements.

commit cdbf16aa93234e0d6a80f0d0e385ec81e7b75465
Author: prangana <pradeep.raoamd.com>
Date: Fri Jan 4 15:59:21 2019 +0530

Update version 1.3

Change-Id: I32a7d24af860e87a60396614075236afb65a28a9

commit cf9c1150515b8e9cc4f12e0d4787b3471b12ba4a
Author: kdevraje <Kiran.Devrajegowdaamd.com>
Date: Thu Jan 3 09:51:46 2019 +0530

This commit adds a macro that is to be enabled when BLIS is running in single-instance mode

Change-Id: I7f3fd654b78e64c4e6e24e9f0e245b1a30c492b0

commit ad8d9adb09a7dd267bbdeb2bd1fbbf9daf64ee76
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jan 3 16:08:24 2019 -0600

README.md, CREDITS update.

Details:
- Added "What's New" and "What People Are Saying About BLIS" sections to
README.md.
- Added missing github handles to various individuals' entries in the
CREDITS file.

commit 7052fca5aef430241278b67d24cef6fe33106904
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jan 2 13:48:40 2019 -0600

Apply f272c289 to bli_fmalloc_noalign().

Details:
- Perform the same check for NULL return values and error message output
in bli_fmalloc_noalign() as is performed by bli_fmalloc_align(). (This
change was intended for f272c289.)

commit 528e3ad16a42311a852a8376101959b4ccd801a5
Merge: 3126c52e f272c289
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jan 2 13:39:19 2019 -0600

Merge branch 'amd'

commit 3126c52ea795ffb7d30b16b7f7ccc2a288a6158d
Merge: 61441b24 8091998b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jan 2 13:37:37 2019 -0600

Merge branch 'amd'

commit f272c2899a6764eedbe05cea874ee3bd258dbff3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jan 2 12:34:15 2019 -0600

Add error message to malloc() check for NULL.

Details:
- Output an error message if and when the malloc()-equivalent called by
bli_fmalloc_align() ever returns NULL. Everything was already in place
for this to happen, including the error return code, the error string
sprintf(), the error checking function bli_check_valid_malloc_buf()
definition, and its prototype. Thanks to Minh Quan Ho for pointing out
the missing error message.
- Increased the default block_ptrs_len for each inner pool stored in the
small block allocator from 10 to 25. Under normal execution, each
thread uses only 21 blocks, so this change will prevent the sba from
needing to resize the block_ptrs array of any given inner pool as
threads initially populate the pool with small blocks upon first
execution of a level-3 operation.
- Nix stray newline echo in configure.
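
A minimal sketch of the NULL check from the first item above follows. It assumes a function-pointer based allocator interface and simply prints to stderr and returns NULL; how BLIS actually reports or aborts on the error is not shown here, and all names are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

typedef void* (*malloc_ft_sketch)( size_t size );

/* Call the supplied malloc()-like function and report a failure on stderr
   instead of silently handing back NULL. */
static void* fmalloc_checked( malloc_ft_sketch f, size_t size )
{
    void* p = f( size );

    if ( p == NULL )
        fprintf( stderr, "error: allocation of %zu bytes returned NULL\n", size );

    return p;
}

/* Stand-in allocator that simulates an out-of-memory condition. */
static void* failing_malloc( size_t size )
{
    (void)size;
    return NULL;
}
```

Passing `failing_malloc` exercises the error path, while passing `malloc` behaves like an ordinary allocation.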

commit eb97f778a1e13ee8d3b3aade05e479c4dfcfa7c0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 25 20:17:09 2018 -0600

Added missing AMD copyrights to previous commit.

Details:
- Forgot to add AMD copyrights to several touched files that did not
already have them in 2f31743.

commit 2f3174330fb29164097d664b7c84e05c7ced7d95
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 25 19:35:01 2018 -0600

Implemented a pool-based small block allocator.

Details:
- Implemented a sophisticated data structure and set of APIs that track
the small blocks of memory (around 80-100 bytes each) used when
creating nodes for control and thread trees (cntl_t and thrinfo_t) as
well as thread communicators (thrcomm_t). The purpose of the small
block allocator, or sba, is to allow the library to transition into a
runtime state in which it does not perform any calls to malloc() or
free() during normal execution of level-3 operations, regardless of
the threading environment (potentially multiple application threads
as well as multiple BLIS threads). The functionality relies on a new
data structure, apool_t, which is (roughly speaking) a pool of
arrays, where each array element is a pool of small blocks. The outer
pool, which is protected by a mutex, provides separate arrays for each
application thread while the arrays each handle multiple BLIS threads
for any given application thread. The design minimizes the potential
for lock contention, as only concurrent application threads would
need to fight for the apool_t lock, and only if they happen to begin
their level-3 operations at precisely the same time. Thanks to Kiran
Varaganti and AMD for requesting this feature.
- Added a configure option to disable the sba pools, which are enabled
by default; renamed the --[dis|en]able-packbuf-pools option to
--[dis|en]able-pba-pools; and rewrote the --help text associated with
this new option and consolidated it with the --help text for the
option associated with the sba (--[dis|en]able-sba-pools).
- Moved the membrk field from the cntx_t to the rntm_t. We now pass in
a rntm_t* to the bli_membrk_acquire() and _release() APIs, just as we
do for bli_sba_acquire() and _release().
- Replaced all calls to bli_malloc_intl() and bli_free_intl() that are
used for small blocks with calls to bli_sba_acquire(), which takes a
rntm (in addition to the bytes requested), and bli_sba_release().
These latter two functions reduce to the former two when the sba pools
are disabled at configure-time.
- Added rntm_t* arguments to various cntl_t and thrinfo_t functions, as
required by the new usage of bli_sba_acquire() and _release().
- Moved the freeing of "old" blocks (those allocated prior to a change
in the block_size) from bli_membrk_acquire_m() to the implementation
of the pool_t checkout function.
- Miscellaneous improvements to the pool_t API.
- Added a block_size field to the pblk_t.
- Harmonized the way that the trsm_ukr testsuite module performs packing
relative to that of gemmtrsm_ukr, in part to avoid the need to create
a packm control tree node, which now requires a rntm_t that has been
initialized with an sba and membrk.
- Re-enabled the explicit call to bli_finalize() in testsuite so that users who
run the testsuite with memory tracing enabled can check for memory
leaks.
- Manually imported the compact/minor changes from 61441b24 that cause
the rntm to be copied locally when it is passed in via one of the
expert APIs.
- Reordered parameters to various bli_thrcomm_*() functions so that the
thrcomm_t* to the comm being modified is last, not first.
- Added more descriptive tracing for allocating/freeing small blocks and
formalized via a new configure option: --[dis|en]able-mem-tracing.
- Moved some unused scalm code and headers into frame/1m/other.
- Whitespace changes to bli_pthread.c.
- Regenerated build/libblis-symbols.def.
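
The core idea behind the small block allocator in the first item above, reusing blocks so that steady-state execution performs no malloc() or free(), can be shown with a much simpler structure. This sketch is single-threaded and omits the mutex-protected outer pool and the per-thread arrays; the type, the function names, and the capacity of 8 are all hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

#define SB_POOL_LEN 8  /* illustrative capacity */

typedef struct
{
    size_t block_size;
    int    num_free;                     /* blocks currently cached */
    void*  free_blocks[ SB_POOL_LEN ];
    int    num_mallocs;                  /* real malloc() calls so far */
} sb_pool_t;

static void sb_pool_init( sb_pool_t* pool, size_t block_size )
{
    pool->block_size  = block_size;
    pool->num_free    = 0;
    pool->num_mallocs = 0;
}

/* Hand out a cached block if one exists; otherwise fall back to malloc(). */
static void* sb_pool_checkout( sb_pool_t* pool )
{
    if ( pool->num_free > 0 )
        return pool->free_blocks[ --pool->num_free ];

    pool->num_mallocs++;
    return malloc( pool->block_size );
}

/* Return a block to the pool; free it only if the pool is already full. */
static void sb_pool_checkin( sb_pool_t* pool, void* block )
{
    if ( pool->num_free < SB_POOL_LEN )
        pool->free_blocks[ pool->num_free++ ] = block;
    else
        free( block );
}

/* Release every cached block. */
static void sb_pool_finalize( sb_pool_t* pool )
{
    while ( pool->num_free > 0 )
        free( pool->free_blocks[ --pool->num_free ] );
}
```

After the first operation has populated the pool, later checkouts are served from the cache and the malloc counter stops growing.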

commit 61441b24f3244a4b202c29611a4899dd5c51d3a1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Dec 20 19:38:11 2018 -0600

Make local copy of user's rntm_t in level-3 ops.

Details:
- In the case that the caller passes in a non-NULL rntm_t pointer into
one of the expert APIs for a level-3 operation (e.g. bli_gemm_ex()),
make a local copy of the rntm_t and use the address of that local copy
in all subsequent execution (which may change the contents of the
rntm_t). This prevents a potentially confusing situation whereby a
user-initialized rntm_t is used once (in, say, gemm), and then found
by the user to be in a different state before it is used a second
time.
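
The local-copy behavior can be sketched as below. `runtime_sketch_t` and `gemm_ex_sketch` are hypothetical and heavily simplified; the internal adjustment is invented purely to show a modification that would otherwise be visible to the caller.

```c
#include <stddef.h>

/* Hypothetical, pared-down runtime object. */
typedef struct { int num_threads; int ic_ways; int jr_ways; } runtime_sketch_t;

static const runtime_sketch_t default_runtime = { 1, 1, 1 };

/* Expert-style entry point: it may adjust the runtime internally, but only
   ever modifies a private copy, so the caller's object stays unchanged.
   Returns the number of threads that would be used. */
static int gemm_ex_sketch( runtime_sketch_t* user_rt )
{
    runtime_sketch_t local = ( user_rt != NULL ) ? *user_rt : default_runtime;

    /* Internal adjustment that would otherwise leak back to the caller. */
    if ( local.ic_ways * local.jr_ways != local.num_threads )
    {
        local.ic_ways = local.num_threads;
        local.jr_ways = 1;
    }

    return local.ic_ways * local.jr_ways;
}
```

The caller can therefore reuse the same runtime object for a second operation and find it in the state it was left in.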

commit e809b5d2f1023b4249969e2f516291c9a3a00b80
Merge: 76016691 0476f706
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Dec 20 16:27:26 2018 -0600

Merge branch 'master' into amd

commit 1f4eeee5175a8fc9ac312847c796ce6db5fe75b9
Author: sraut <Biplab.Rautamd.com>
Date: Wed Dec 19 21:21:10 2018 +0530

Fixed BLAS test failures of small matrix SYRK for single and double precision.

Details:
- SYRK for small matrices was implemented by reusing the small GEMM routine. As a
result, output was written to the full C matrix, and because C is symmetric, the
lower and upper triangles of C contained the same results. The BLAS SYRK API
spec requires that only the lower or the upper triangle of C be written with
results. This caused BLAS test failures, even though the BLIS testsuite
passed the small SYRK operation.
- To fix BLAS test failures of small matrix SYRK, separate kernel routines are
implemented for small SYRK for both single and double precision. The newly
added small SYRK routines are in file kernels/zen/3/bli_syrk_small.c.
Now the intermediate results of matrix C are written to a scratch buffer.
Final results are written from scratch buffer to matrix C using SIMD
copy to either lower or upper triangle part of matrix C.
- Source and header files frame/3/syrk/bli_syrk_front.c and
frame/3/syrk/bli_syrk_front.h are changed to invoke new small SYRK routines.

Change-Id: I9cfb1116c93d150aefac673fca033952ecac97cb
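
The scratch-buffer approach described above can be sketched in scalar C. This is not the code in bli_syrk_small.c: demo_syrk_small() is a hypothetical name, the SIMD copy is replaced by a plain loop, and the sketch assumes column-major storage with alpha = 1 and beta = 0.

```c
#include <assert.h>
#include <stdlib.h>

/* Computes C := A * A^T for a column-major n x k matrix A, but writes only
   the requested triangle of C (uplo is 'L' or 'U', diagonal included).
   Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int demo_syrk_small( char uplo, int n, int k,
                     const double* a, int lda,
                     double* c, int ldc )
{
    assert( uplo == 'L' || uplo == 'U' );
    assert( n >= 0 && k >= 0 );

    if ( n == 0 ) return 0;

    double* tmp = malloc( ( size_t )n * ( size_t )n * sizeof( double ) );
    if ( tmp == NULL ) return -1;

    /* Step 1: gemm-style full product, written to the scratch buffer. */
    for ( int j = 0; j < n; ++j )
        for ( int i = 0; i < n; ++i )
        {
            double sum = 0.0;
            for ( int p = 0; p < k; ++p )
                sum += a[ i + p*lda ] * a[ j + p*lda ];
            tmp[ i + j*n ] = sum;
        }

    /* Step 2: copy only the requested triangle from scratch into C. */
    for ( int j = 0; j < n; ++j )
        for ( int i = 0; i < n; ++i )
        {
            int in_tri = ( uplo == 'L' ) ? ( i >= j ) : ( i <= j );
            if ( in_tri ) c[ i + j*ldc ] = tmp[ i + j*n ];
        }

    free( tmp );
    return 0;
}
```

Elements of C outside the requested triangle are never written, which is the property the BLAS tests check.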

commit 6d267375c3a0543f20604d74cc678ad91db3b6f1
Author: sraut <Biplab.Rautamd.com>
Date: Wed Dec 19 14:22:21 2018 +0530

This commit improves the performance of multi-instance DGEMM when these multiple threads are bound to a CCX.
Multi-Instance: Each thread runs a sequential DGEMM.
Change-Id: I306920c8061b6dad61efac1dae68727f4ac27df6

commit 0476f706b93e83f6b74a3d7b7e6e9cc9a1a52c3b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 18 14:56:20 2018 -0600

CHANGELOG update (0.5.1)

0.9.0

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 1 08:12:06 2022 -0500

Version file update (0.9.0)

commit 99bb9002f1aff598d347eae2821a3f7bdd1f48e8 (origin/master, origin/HEAD)
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 1 08:10:59 2022 -0500

ReleaseNotes.md update in advance of next version.

commit bee7678b2558a691ac850819dbe33fefe4fdbee3 (origin/dev, origin/amd, dev, amd)
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 31 14:09:39 2022 -0500

CREDITS file update.

commit cf06364327bd2d21d606392371ff3c5962bee5ba
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 29 16:18:25 2022 -0500

Fixed typo in BLAS gemm3m call to _check().

Details:
- Fixed an unresolved symbol issue leftover from 590 whereby ?gemm3m_()
as defined in bla_gemm3m.c was referencing bla_gemm3m_check(), which
does not exist. It should have simply called the _check() function for
gemm.

commit 1ec020b33ece1681c0041e2549eed2bd4c6cf356
Author: Dipal M Zambare <71366780+dzambareusers.noreply.github.com>
Date: Wed Mar 30 02:45:36 2022 +0530

AMD kernel updates; frame-specific AMD updates. (597)

Details:
- Allow building BLIS with certain framework files (each with the '_amd'
suffix) that have been customized by AMD for Zen-based hardware. These
customized files were derived from portable versions of the same files
(i.e., those without the '_amd' suffix). Whether the portable or AMD-
specific files are compiled is now controlled by a new configure
option, --[en|dis]able-amd-frame-tweaks. This option is disabled by
default in vanilla BLIS, though AMD may choose to enable it by default
in their fork. For now, the added AMD-specific files are:
- bli_gemv_unf_var2_amd.c
- bla_copy_amd.c
- bla_gemv_amd.c
These files reside in 'amd' subdirectories found within the directory
housing their generic counterparts.
- Register optimized real-domain copyv, setv, and swapv kernels in
bli_cntx_init_zen.c.
- Various minor updates to level-1v kernels in 'zen' kernel set.
- Added caxpyf kernel as well as saxpyf and multiple daxpyf kernels to
the 'zen' kernel set.
- If the problem passed to ?gemm_() in bla_gemm.c has a unit m or n dim,
call gemv instead and return early.
- Combined variable declarations with their initialization in various
level-2 and level-3 BLAS compatibility files, and also inserted
'const' qualifier in those same declaration statements.
- Moved frame/compat/bla_gemmt.c and .h to frame/compat/extra/ .
- Added copyv and swapv test drivers to 'test' directory.
- Whitespace, comment changes.
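
The early return for a unit dimension in ?gemm_() can be illustrated with a small sketch. The names demo_gemm() and demo_gemv() are hypothetical, the code assumes column-major storage with no transposition and no scaling, and only the n = 1 case is shown (the m = 1 case would need a transposed gemv).

```c
#include <assert.h>

/* y := A * x for a column-major m x k matrix A (hypothetical helper). */
static void demo_gemv( int m, int k, const double* a, int lda,
                       const double* x, double* y )
{
    for ( int i = 0; i < m; ++i ) y[ i ] = 0.0;

    for ( int p = 0; p < k; ++p )
        for ( int i = 0; i < m; ++i )
            y[ i ] += a[ i + p*lda ] * x[ p ];
}

/* C := A * B, column-major. Returns 1 if the gemv shortcut was taken,
   0 if the general triple loop ran. */
int demo_gemm( int m, int n, int k,
               const double* a, int lda,
               const double* b, int ldb,
               double* c, int ldc )
{
    assert( m >= 0 && n >= 0 && k >= 0 );

    if ( n == 1 )
    {
        /* B and C are single columns, so the product is a
           matrix-vector multiply. Call gemv and return early. */
        demo_gemv( m, k, a, lda, b, c );
        return 1;
    }

    for ( int j = 0; j < n; ++j )
        for ( int i = 0; i < m; ++i )
        {
            double sum = 0.0;
            for ( int p = 0; p < k; ++p )
                sum += a[ i + p*lda ] * b[ p + j*ldb ];
            c[ i + j*ldc ] = sum;
        }

    return 0;
}
```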

commit 0db2bd5341c5c3ed5f1cc2bffa90952735efa45f
Author: Bhaskar Nallani <Nallani.Bhaskaramd.com>
Date: Fri Mar 25 05:11:55 2022 +0530

Added BLAS/CBLAS APIs for gemm3m. (590)

Details:
- Created ?gemm3m_() and cblas_?gemm3m() APIs that (for now) simply
invoke the 1m implementation unconditionally. (Note that these APIs
bypass sup handling.)
- Added BLAS prototypes for gemm3m in frame/compat/bla_gemm3m.h.
- Added CBLAS prototypes for gemm3m in frame/compat/cblas/src/cblas.h.
- Relocated:
frame/compat/cblas/src/cblas_?gemmt.c
files into
frame/compat/cblas/src/extra/
- Relocated frame/compat/bla_gemmt.? into frame/compat/extra/ .
- Minor reorganization of prototypes and cpp macro directives in
bli_blas.h, cblas.h, and cblas_f77.h.
- Trivial whitespace change to cblas_zgemm.c.

commit d6810000e961fe807dc5a7db81180a8355f3eac0
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Mar 14 10:29:54 2022 -0500

Update Multithreading.md

Add notes about `BLIS_IR_NT` (should typically be 1) and `BLIS_JR_NT` (should typically be small, e.g. <= 4). [ci skip]

commit f1dbb0e514f53a3240d3a6cbdc3306b01a2206f5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Mar 11 13:38:28 2022 -0600

Trivial whitespace change; commit log addendum.

Details:
- A co-attribution to Mithun Mohan was inadvertently omitted from the
commit log for headline change in the previous commit, 7c07b47.

commit 7c07b477e432adbbce5812ed9341ba3092b03976
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Mar 11 13:28:50 2022 -0600

Avoid gemmsup barriers when not packing A or B. (622)

Details:
- Implemented a multithreaded optimization for the special (and common)
case of employing the gemmsup code path when the user requests
(implicitly or explicitly) that neither A nor B be packed during
computation. This optimization takes the form of a greatly reduced
code branch in bli_thrinfo_sup_create_for_cntl(), which avoids a
broadcast and two barriers, and results in higher performance when
obtaining two-way or higher parallelism within BLIS. Thanks to
Bhaskar Nallani of AMD for proposing this change via issue 605.
- Added an early return branch to bli_thrinfo_create_for_cntl() that
detects and quickly handles cases where no parallelism is being
obtained within BLIS (i.e., single-threaded execution). Note that
this special case handling was/is already present in
bli_thrinfo_sup_create_for_cntl().
- CREDITS file update.

commit cad10410b2305bc0e328c5f2517ab02593b53428
Author: Ivan Korostelev <ivan23korgmail.com>
Date: Thu Mar 10 09:58:14 2022 -0600

POWER10: edge cases in microkernel (620)

Use new API for POWER10 gemm microkernel

commit 71851a0549276b17db18a0a0c8ab4f54493bf033
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 8 17:38:09 2022 -0600

Fixed level-3 performance bug in haswell ukernels.

Details:
- Fixed a performance regression affecting nearly all level-3 operations
that use the 'haswell' sgemm and dgemm microkernels. This regression
was introduced in 54fa28b, caused by an ill-formed conditional
expression in the assembly code that controls whether cache lines of C
should be prefetched as rows or as columns. Essentially, the two
branches were reversed, causing incomplete prefetching to occur for
both row- and column-stored instances of matrix C. Thanks to Devin
Matthews for his help finding and fixing this bug.

commit 84732bf95634ac606c5f2661d9474318e366c386
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 28 12:19:31 2022 -0600

Revamp how tools are handled/checked by configure.

Details:
- Consolidate handling of tools that are specifiable via CC, CXX, FC,
PYTHON, AR, and RANLIB into one bash function, select_tool_w_env().
- If the user specifies a tool via an environment variable (e.g.
CC=gcc) and that tool does not seem valid, print an error message
and abort configure, unless the tool is optional (e.g. CXX or FC),
in which case a warning message is printed instead.
- The definition of "seems valid" above amounts to:
- responding to at least one of a basic set of command line options
(e.g. --version, -V, -h) if the os_name is Linux (since GNU tools
tend to respond to flags such as --version) or if the tool in
question is CC, CXX, FC, or PYTHON (which tend to respond to the
expected flags regardless of OS)
- the binary merely existing for AR and RANLIB on Darwin/OSX/BSD.
(These OSes tend to have non-GNU versions of ar and ranlib, which
typically do not respond to --version and friends.)
- This PR addresses 584. Thanks to Devin Matthews for suggesting some
of the changes in this commit.

commit d5146582b1f1bcdccefe23925d3b114d40cd7e31
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Feb 23 03:35:46 2022 +0900

ArmSVE Ensure Non-zero Block Size (615)

Fixes 613. There are several macros/environment variables which need to be tuned to get good cache block sizes. It would be nice to have a way of getting values automatically.

commit 4d8352309784403ed6719528968531ffb4483947
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Feb 23 01:03:47 2022 +0900

Add armsve to arm64 Metaconfig (614)

Availability of the `armsve` subconfig is controlled by the compiler version (gcc/clang). Tested for SVE and non-SVE. Fixes 612.

commit c9700f369aa84fc00f36c4b817ffb7dab72b865d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Feb 15 15:36:52 2022 -0600

Renamed SIMD-related macro constants for clarity.

Details:
- Renamed the following macros defined in bli_kernel_macro_defs.h:

BLIS_SIMD_NUM_REGISTERS -> BLIS_SIMD_MAX_NUM_REGISTERS
BLIS_SIMD_SIZE -> BLIS_SIMD_MAX_SIZE

Also updated all instances of these macros elsewhere, including
subconfigurations, source code, and documentation. Thanks to Devin
Matthews for suggesting this change.

commit ee9ff988c49f16696679d4c6cd3dcfcac7295be7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Feb 15 15:01:51 2022 -0600

Move edge cases to gemmtrsm ukrs; doc updates.

Details:
- Moved edge-case handling into the gemmtrsm microkernel. This required
changing the microkernel API to take m and n dimension parameters as
well as updating all existing gemmtrsm microkernel function pointer
types, function signatures, and related definitions to take m and n
dimensions. Also updated all existing gemmtrsm kernels in the
'kernels' directory (which for now is limited to haswell and penryn
kernel sets, plus native and 1m-based reference kernels in
'ref_kernels') to take m and n dimensions, and implemented edge-case
handling within those microkernels via a collection of new C
preprocessor macros defined within bli_edge_case_macro_defs.h. Note
that the edge-case handling for gemm-like operations had already
been relocated into the gemm microkernel in 54fa28b.
- Added descriptive comments to GEMM_UKR_SETUP_CT() and related macros in
bli_edge_case_macro_defs.h to allow for easier reading.
- Updated docs/KernelsHowTo.md to reflect above changes. Also cleaned up
the bullet under "Implementation Notes for gemm" that covers alignment
issues. (Thanks to Ivan Korostelev for pointing out the confusing and
outdated language in issue 591.)
- Other minor tweaks to KernelsHowTo.md.

commit 25061593460767221e1066f9d720fa6676bbed8f
Author: Devin Matthews <damatthewssmu.edu>
Date: Sun Feb 13 20:11:55 2022 -0600

Don't use `-Wl,-flat-namespace`.

Flat namespaces can cause problems due to conflicting system libraries,
etc., so just mark `xerbla_` as a weak symbol on macOS instead.
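
As a rough illustration of what a weak symbol provides, the sketch below uses the GCC/Clang weak attribute on a default error handler. The names demo_xerbla() and demo_check_arg() are hypothetical, and the commit does not show how BLIS itself marks xerbla_, so this should be read as the general technique only.

```c
#include <assert.h>
#include <stddef.h>

/* Records the last error code seen by the default handler. */
static int demo_last_info = 0;

/* Weak default definition. If another object file in the link provides
   a strong definition of demo_xerbla(), the linker uses that one and
   silently drops this one. */
__attribute__(( weak )) void demo_xerbla( const char* name, int info )
{
    assert( name != NULL );
    demo_last_info = info;
}

/* A caller that reports a bad argument through the handler. */
int demo_check_arg( int n )
{
    if ( n < 0 )
    {
        demo_xerbla( "DEMO", 1 );
        return -1;
    }
    return 0;
}
```

An application that defines its own strong demo_xerbla() overrides the default at link time, with no need for a flat namespace.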

commit 5a4d3f5208d3d8cc1827f8cc90414c764b7ebab3
Author: Devin Matthews <damatthewssmu.edu>
Date: Sun Feb 13 17:28:30 2022 -0600

Use -flat_namespace option to link on macOS

Fixes 611.

commit 26742910a087947780a089360e2baf82ea109e01
Author: Devin Matthews <damatthewssmu.edu>
Date: Sun Feb 13 16:53:45 2022 -0600

Update CC_VENDOR logic

Look for `GCC` in addition to `gcc` to handle weird conda version strings. [ci skip]

commit 2f3872e01d51545c687ae2c8b2650e00552111a7
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Mon Feb 7 17:14:49 2022 +0900

ArmSVE Adopts Label Wrapper

For clang (& armclang?) compilation.

Hopefully solves 609.

commit 72089bb2917b78d99cf4f27c69125bf213ee54e6
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat Feb 5 16:56:04 2022 +0900

ArmSVE Use Predicate in M-Direction

No need to query MR during kernel runtime.

commit 9cc897f37455d52fbba752e3801f1a9d4a5bfdc1
Author: Ruqing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Feb 3 16:40:02 2022 +0000

Fix SVE Compil.

commit b5df1811f1bc8212b2cda6bb97b79819afe236a8
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Feb 3 02:31:29 2022 +0900

Armv8a, ArmSVE: Simplify Gen-C

commit 35195bb5cea5d99eb3eaf41e3815137d14ceb52d
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Jan 31 10:29:50 2022 -0600

Add armclang detection to configure.

armclang is treated as regular clang. Fixes 606. [ci skip]

commit 0be9282cdccf73342d8571d3f7971a9b0af72363
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jan 26 17:46:24 2022 -0600

Updated zen3 macro constant names.

Details:
- In config/zen3/bli_family_zen3.h, renamed:
BLIS_SMALL_MATRIX_A_THRES_M_GEMMT -> _M_SYRK
BLIS_SMALL_MATRIX_A_THRES_N_GEMMT -> _N_SYRK
Thanks to Jeff Diamond for helping spot the stale _GEMMT naming.

commit 0ab20c0e72402ba0b17fe2c3ed3e16bf2ace0fd3
Author: Jeff Hammond <jehammondnvidia.com>
Date: Thu Jan 13 07:29:56 2022 -0800

the Apple local label thing is required by Clang in general

egaudry and I both saw this issue on Linux with Clang 10.


Compiling obj/thunderx2/kernels/armv8a/3/sup/bli_gemmsup_rv_armv8a_asm_d4x8m.o ('thunderx2' CFLAGS for kernels)
kernels/armv8a/3/bli_gemm_armv8a_asm_d6x8.c:171:49: fatal error: invalid symbol redefinition
" \n\t"
^
<inline asm>:90:5: note: instantiated into assembly here
.SLOOPKITER:
^
1 error generated.


Signed-off-by: Jeff Hammond <jehammondnvidia.com>

commit 81f93be0561c705ae6823d19e40849facc40bef7
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Jan 10 10:19:47 2022 -0600

Fix row-/column-major pref. in 16x8 haswell sgemm ukr (unused)

commit 268ce1f29a717d18304713ecc25a2eafe41838c7
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Jan 10 10:17:17 2022 -0600

Relax alignment constraints

Remove alignment of temporary AB buffer in edge case handling macros unless alignment is specifically requested (e.g. Core2, SDB/IVB). Fixes 595.

commit 3f2440b0226d5e23a43d12105d74aa917cd6c610
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jan 6 14:57:36 2022 -0600

Added m, n dims to gemmd/gemmlike ukernel calls.

Details:
- Updated the gemmd addon and the gemmlike sandbox code to use the new
microkernel calling sequence, which now includes m and n dimensions so
that the microkernel has all the information necessary to handle edge
cases. Thanks to Jeff Diamond for catching this, which ideally would
have been included in commit 54fa28b.
- Retired var2 of both gemmd and gemmlike to 'attic' directories and
removed their corresponding prototypes. In both cases, var2 was a
variant of the block-panel algorithm where edge-case handling was
abstracted away to a microkernel wrapper. (Since this is now the
official behavior of BLIS microkernels, I saw no need to have it
included as a separate code path.)
- Comment updates.

commit 864bfab4486ac910ef9a366e9ade4b45a39747fc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 4 15:10:34 2022 -0600

CREDITS file update.

commit 466b68a3ad118342dc49a8130b7b02f5e7748521
Author: Devin Matthews <damatthewssmu.edu>
Date: Sun Jan 2 14:59:41 2022 -0600

Add unique tag to branch labels for Apple ARM64.

Add `%=` tag to branch labels, which expands to a unique identifier for each inline assembly block. This prevents duplicate symbol errors on Apple Silicon (594). Fixes 594. [ci skip] since we can't test Apple Silicon anyways...
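
A minimal sketch of the `%=` mechanism, assuming GCC or Clang extended inline assembly. The function and label names are hypothetical, and the asm body contains only a label so that it assembles on any target; real kernels would branch to such labels.

```c
#include <assert.h>

/* Forced inline so that the asm statement is emitted once per call site
   within the same assembly output. With a fixed label name, two inlined
   copies would define the same symbol twice. */
static inline __attribute__(( always_inline )) int demo_touch( int x )
{
    assert( x >= 0 );

    __asm__ volatile
    (
        /* %= expands to a number unique to this instance of the asm
           statement, so each copy gets its own label. */
        ".Ldemo_label%=:\n\t"
        : /* no outputs */
        : /* no inputs */
        : "memory"
    );

    return x + 1;
}

/* Two call sites, hence two expansions of the asm template. */
int demo_twice( int x )
{
    return demo_touch( demo_touch( x ) );
}
```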

commit 08174a2f6ebbd8ed5aa2bc4edc45da80962f06bb
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat Jan 1 21:35:19 2022 +0900

Evict <arm_sve.h> Requirement for SVE GEMM

For 8<= GCC < 10 compatibility.

commit 54fa28bd847b389215cffb57a83dc9b3dce79c86
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Dec 24 08:00:33 2021 -0600

Move edge cases to gemm ukr; more user-custom mods. (583)

Details:
- Moved edge-case handling into the gemm microkernel. This required
changing the microkernel API to take m and n dimension parameters.
This required updating all existing gemm microkernel function pointer
types, function signatures, and related definitions to take m and n
dimensions. We also updated all existing kernels in the 'kernels'
directory to take m and n dimensions, and implemented edge-case
handling within those microkernels via a collection of new C
preprocessor macros defined within bli_edge_case_macro_defs.h. Also
removed the assembly code that formerly would handle general stride
IO on the microtile, since this can now be handled by the same code
that does edge cases.
- Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and
bli_trsm_cntl_create(), where this function pointer is used in lieu of
the default macrokernel when it is non-NULL, and ignored when it is
NULL.
- Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single
function using byte pointers rather than one function for each
floating-point datatype. Also, obtain the microkernel function pointer
from the .ukr field of the params struct embedded within the obj_t
for matrix C (assuming params is non-NULL and contains a non-NULL
value in the .ukr field). Communicate both the gemm microkernel
pointer to use as well as the params struct to the microkernel via
the auxinfo_t struct.
- Defined gemm_ker_params_t type (for the aforementioned obj_t.params
struct) in bli_gemm_var.h.
- Retired the separate _md macrokernel for mixed datatype computation.
We now use the reimplemented bli_gemm_ker_var2() instead.
- Updated gemmt macrokernels to pass m and n dimensions into microkernel
calls.
- Removed edge-case handling from trmm and trsm macrokernels.
- Moved most of bli_packm_alloc() code into a new helper function,
bli_packm_alloc_ex().
- Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c.
- Added test/syrk_diagonal and test/tensor_contraction directories with
associated code to test those operations.
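
The idea of handling edge cases inside the microkernel can be sketched as follows. This is a scalar illustration, not the macros in bli_edge_case_macro_defs.h: demo_gemm_ukr(), DEMO_MR, and DEMO_NR are hypothetical, and the sketch assumes alpha = 1 and beta = 1 (pure accumulation into C).

```c
#include <assert.h>

enum { DEMO_MR = 4, DEMO_NR = 4 };

/* C(0:m,0:n) += A * B, where a is a packed micropanel holding k columns
   of DEMO_MR elements and b is a packed micropanel holding k rows of
   DEMO_NR elements. m and n may be smaller than DEMO_MR and DEMO_NR. */
void demo_gemm_ukr( int m, int n, int k,
                    const double* a, const double* b,
                    double* c, int rs_c, int cs_c )
{
    assert( 0 <= m && m <= DEMO_MR );
    assert( 0 <= n && n <= DEMO_NR );
    assert( k >= 0 );

    /* Always compute the full DEMO_MR x DEMO_NR tile in a local buffer. */
    double ab[ DEMO_MR * DEMO_NR ] = { 0.0 };

    for ( int p = 0; p < k; ++p )
        for ( int j = 0; j < DEMO_NR; ++j )
            for ( int i = 0; i < DEMO_MR; ++i )
                ab[ i + j*DEMO_MR ] += a[ i + p*DEMO_MR ] * b[ j + p*DEMO_NR ];

    /* Write back only the m x n part that exists in C. Using rs_c and
       cs_c here also covers general stride storage. */
    for ( int j = 0; j < n; ++j )
        for ( int i = 0; i < m; ++i )
            c[ i*rs_c + j*cs_c ] += ab[ i + j*DEMO_MR ];
}
```

The caller no longer needs its own temporary tile for partial blocks; it simply passes the actual m and n.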

commit 961d9d509dd94f3a66f7095057e3dc8eb6d89839
Author: Kiran <kiran.varagantiamd.com>
Date: Wed Dec 8 03:00:38 2021 +0530

Re-add BLIS_ENABLE_ZEN_BLOCK_SIZES macro for 'zen'.

Details:
- Added previously-deleted cpp macro block to bli_cntx_init_zen.c
targeting the Naples microarchitecture that enabled different cache
blocksizes when the number of threads exceeds 16. This commit
represents PR 573.

commit cf7d616a2fd58e293b496770654040818bf5609c
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Dec 2 17:10:03 2021 -0600

Enable user-customized packm ukernel/variant. (549)

Details:
- Added four new fields to obj_t: .pack_fn, .pack_params, .ker_fn, and
.ker_params. These fields store pointers to functions and data that
will allow the user to more flexibly create custom operations while
recycling BLIS's existing partitioning infrastructure.
- Updated typed API to packm variant and structure-aware kernels to
replace the diagonal offset with panel offsets, and changed strides
of both C and P to inc/ldim semantics. Updated object API to the packm
variant to include rntm_t*.
- Removed the packm variant function pointer from the packm cntl_t node
definition since it has been replaced by the .pack_fn pointer in the
obj_t.
- Updated bli_packm_int() to read the new packm variant function pointer
from the obj_t and call it instead of from the cntl_t node.
- Moved some of the logic of bli_l3_packm.c to a new file,
bli_packm_alloc.c.
- Rewrote bli_packm_blk_var1.c so that it uses byte (char*) pointers
instead of typed pointers, allowing a single function to be used
regardless of datatype. This obviated having a separate implementation
in bli_packm_blk_var1_md.c. Also relegated handling of scalars to a
new function, bli_packm_scalar().
- Employed a new standard whereby right-hand matrix operands ("B") are
always packed as column-stored row panels -- that is, identically to
that of left-hand matrix operands ("A"). This means that while we pack
matrix A normally, we actually pack B in a transposed state. This
allowed us to simplify a lot of code throughout the framework, and
also affected some of the logic in bli_l3_packa() and _packb().
- Simplified bli_packm_init.c in light of the new B^T convention
described above. bli_packm_init()--which is now called from within
bli_packm_blk_var1()--also now calls bli_packm_alloc() and returns
a bool that indicates whether packing should be performed (or
skipped).
- Consolidated bli_gemm_int() and bli_trsm_int() into a bli_l3_int(),
which, among other things, defaults the new .pack_fn field of the
obj_t to bli_packm_blk_var1() if the field is NULL.
- Defined a new function, bli_obj_reset_origin(), which permanently
refocuses the view of an object so that it "forgets" any offsets from
its original pointer. This function also sets the object's root field
to itself. Calls to bli_obj_reset_origin() for each matrix operand
appear in the _front() functions, after the obj_t's are aliased. This
resetting of the underlying matrices' origins is needed in preparation
for more advanced features from within custom packm kernels.
- Redefined bli_pba_rntm_set_pba() from a regular function to a static
inline function.
- Updated gemm_ukr, gemmtrsm_ukr, and trsm_ukr testsuite modules to use
libblis_test_pobj_create() to create local packed objects. Previously,
these packed objects were created by calling lower-level functions.
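
The byte-pointer technique mentioned for bli_packm_blk_var1.c can be sketched like this. The function demo_pack() is hypothetical and far simpler than real packing (no panels, scaling, or conjugation); it only shows how one function can serve every datatype when addresses are computed in bytes. Non-negative strides are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies an m x n matrix with element strides rs and cs into a contiguous
   column-major buffer p. elem_size is the size of one element in bytes,
   so the same code handles float, double, or complex types. */
void demo_pack( size_t elem_size, int m, int n,
                const void* src, int rs, int cs,
                void* p )
{
    assert( elem_size > 0 );
    assert( m >= 0 && n >= 0 && rs >= 0 && cs >= 0 );

    const char* s = src;
    char*       d = p;

    for ( int j = 0; j < n; ++j )
        for ( int i = 0; i < m; ++i )
        {
            /* Offsets are computed in elements, then scaled to bytes. */
            size_t src_off = ( ( size_t )i * rs + ( size_t )j * cs ) * elem_size;
            size_t dst_off = ( ( size_t )j * m + ( size_t )i ) * elem_size;

            memcpy( d + dst_off, s + src_off, elem_size );
        }
}
```

For example, a row-major 2 x 3 float matrix and a row-major 2 x 3 double matrix can both be packed by the same call, differing only in the elem_size argument.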

commit e229e049ca08dfbd45794669df08a71dba892925
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 1 17:36:22 2021 -0600

Added recu-sed.sh script to 'build' directory.

Details:
- Added a recursive sed script to the 'build' directory.

commit 12c66a4acc77bf4927b01e2358e2ac10b61e0a53
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 19 14:43:53 2021 -0600

Minor updates to README.md, docs/Addons.md.

Details:
- Add additional mentions of addons to README.md, including in the
"What's New" section.
- Removed mention of sandboxes from the long list of advantages
provided by BLIS.
- Very minor description update to opening line of Addons.md.

commit a4bc03b990fe0572001eb6409efd12cd70677dcf
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 19 13:29:00 2021 -0600

Brief mention/link to Addons.md in README.md.

Details:
- Add a blurb about the new addons feature to the "Documentation for
BLIS developers" section of the README.md, which also links to the
Addons.md document.

commit b727645eb7a8df39dee74068f734da66322fe0b3
Merge: 9be97c15 7bde468c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 19 13:22:09 2021 -0600

Merge branch 'dev'

commit 9be97c150e19fa58bca30cb993a6509ae21e2025
Author: Madan mohan Manokar <86282872+madanm3users.noreply.github.com>
Date: Thu Nov 18 00:46:46 2021 +0530

Support all four dts in test/test_her[2][k].c (578)

Details:
- Replaced the hard-coded calls to double-precision real syr, syr2,
syrk, and syr2k in the corresponding standalone test drivers in the
'test' directory with conditional branches that will call the
appropriate BLAS interface depending on which datatype is enabled.
Thanks to Madan mohan Manokar for this improvement.
- CREDITS file update.

commit 26e4b6b29312b472c3cadf95ccdf5240764777f4
Author: Dipal M Zambare <71366780+dzambareusers.noreply.github.com>
Date: Thu Nov 18 00:32:00 2021 +0530

Added support for AMD's Zen3 microarchitecture.

Details:
- Added a new 'zen3' subconfiguration targeting support for the AMD Zen3
microarchitecture (561). Thanks to AMD for this contribution.
- Restructured clang and AOCC support for zen, zen2, and zen3
make_defs.mk files. The clang and AOCC version detection now happens
in configure, not in the subconfigurations' makefile fragments. That
is, we've added logic to configure that detects the version of
clang/AOCC, outputs an appropriate variable to config.mk
(ie: CLANG_OT_*, AOCC_OT_*), and then checks for it within the
makefile fragment (as is currently done for the GCC_OT_* variables).
- Added configure support for a GCC_OT_10_1_0 variable (and associated
substitution anchor) to communicate whether the gcc version is older
than 10.1.0, and use this variable to check for recent enough versions
of gcc to use -march=znver3 in the zen3 subconfig.
- Inlined the contents of config/zen/amd_config.mk into the zen and zen2
make_defs.mk so that the files are self-contained, harmonizing the
format of all three Zen-based subconfigurations' make_defs.mk files.
- Added indenting (with spaces) of GNU make conditionals for easier
reading in zen, zen2, and zen3 make_defs.mk files.
- Adjusted the range of models checked by bli_cpuid_is_zen() (which was
previously 0x00 ~ 0xff and is now 0x00 ~ 0x2f) so that it is
completely disjoint from the models checked by bli_cpuid_is_zen2()
(0x30 ~ 0xff). This is normally necessary because Zen and Zen2
microarchitectures share the same family (23, or 0x17), and so the
model code is the only way to differentiate the two. But in our case,
fixing the model range for zen *wasn't* actually necessary since we
checked for zen2 first, and therefore the wide zen range acted like
the 'else' of an 'if-else' statement. That said, the change helps
improve clarity for the reader by encoding useful knowledge, which
was obtained from https://en.wikichip.org/wiki/amd/cpuid .
- Added zen2.def and zen3.def files to the collection in travis/cpuid.
Note that support for zen, zen2, and zen3 is now present, and while
all the three microarchitectures have identical instruction sets from
the perspective of BLIS microkernels, they each correspond to
different subconfigurations and therefore merit separate testing.
Thanks to Devin Matthews for his guidance in hacking these files as
slight modifications of zen.def.
- Enabled testing of zen2 and zen3 via the SDE in travis/do_sde.sh.
Now, zen, zen2, and zen3 are tested through the SDE via Travis CI
builds.
- Updated travis/do_sde.sh to grab the SDE tarball from a new ci-utils
repository on GitHub rather than on Intel's website. This change was
made in an attempt to circumvent recent troubles with Travis CI not
being able to download the SDE directly from Intel's website via curl.
Thanks to Devin Matthews for suggesting the idea.
- Updated travis/do_sde.sh to grab the latest version (8.69.1) of the
Intel SDE from the flame/ci-utils repository.
- Updated .travis.yml to use gcc 9. The file was previously using gcc 8,
which did not support -march=znver2.
- Created amd64_legacy umbrella family in config_registry for targeting
older (bulldozer, piledriver, steamroller, and excavator)
microarchitectures and moved those same subconfigs out of the amd64
umbrella family. However, x86_64 retains amd64_legacy as a constituent
member.
- Fixed a bug in configure related to the building of the so-called
config list. When processing the contents of config_registry,
configure creates a series of structures and lists that allow for
various mappings related to configuration families, subconfigs, and
kernel sets. Two of those lists are built via substitution of
umbrella families with their subconfig members, and one of those
lists was improperly performing the substitution in a way that would
erroneously match on partial umbrella family names. That code was
changed to match the code that was already doing the substitution
properly, via substitute_words(). Also added comments noting the
importance of using substitute_words() in both instances.
- Comment updates.
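
The disjoint model ranges described for Zen and Zen2 can be written down as a small sketch. The functions below are hypothetical simplifications: they take only the family and model values quoted in this commit (family 0x17, models 0x00 ~ 0x2f and 0x30 ~ 0xff) and ignore the feature checks a real CPUID query would also perform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Zen and Zen2 share family 23 (0x17). */
#define DEMO_FAMILY_17H 0x17u

/* Zen: models 0x00 through 0x2f. */
bool demo_is_zen( uint32_t family, uint32_t model )
{
    assert( model <= 0xffu );
    return family == DEMO_FAMILY_17H && model <= 0x2fu;
}

/* Zen2: models 0x30 through 0xff. */
bool demo_is_zen2( uint32_t family, uint32_t model )
{
    assert( model <= 0xffu );
    return family == DEMO_FAMILY_17H && model >= 0x30u;
}
```

With disjoint ranges, the result no longer depends on which of the two checks is performed first.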

commit 74c0c622216aba0c24aa2c3a923811366a160cf5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 16 16:06:33 2021 -0600

Reverted cbc88fe.

Details:
- Reverted the annotation of some markdown code blocks with 'bash'
after realizing that the in-browser syntax highlighting was not
worthwhile.

commit cbc88feb51b949ce562d044cf9f99c4e46bb8a39
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 16 16:02:39 2021 -0600

Marked some markdown shell code blocks as 'bash'.

Details:
- Annotated the code blocks that represent shell commands and output as
'bash' in README.md and BuildSystem.md.

commit 78cd1b045155ddf0b9ec6e2ab815f2b216ad9a9e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 16 15:53:40 2021 -0600

Added 'Example Code' section to README.md.

Details:
- Inserted a new 'Example Code' section into the README.md immediately
after the 'Getting Started' section. Thanks to Devin Matthews for
recommending this addition.
- Moved the 'Performance' section of the README down slightly so that it
appears after the 'Documentation' section.

commit 7bde468c6f7ecc4b5322d2ade1ae9c0b88e6b9f3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Nov 13 16:39:37 2021 -0600

Added support for addons.

Details:
- Implemented a new feature called addons, which are similar to
sandboxes except that there is no requirement to define gemm or any
other particular operation.
- Updated configure to accept --enable-addon=<name> or -a <name> syntax
for requesting an addon be included within a BLIS build. configure now
outputs the list of enabled addons into config.mk. It also outputs the
corresponding include directives for the addons' headers to a new
companion to the bli_config.h header file named bli_addon.h. Because
addons may wish to make use of existing BLIS types within their own
definitions, the addons' headers must be included sometime after that
of bli_config.h (which currently is included before bli_type_defs.h).
This is why the include directives needed to go into a new top-level
header file rather than the existing bli_config.h file.
- Added a markdown document, docs/Addons.md, to explain addons, how to
build with them, and what assumptions their authors should keep in
mind as they create them.
- Added a gemmlike-like implementation of sandwich gemm called 'gemmd'
as an addon in addon/gemmd. The code uses a 'bao_' prefix for local
functions, including the user-level object and typed APIs.
- Updated .gitignore so that git ignores bli_addon.h files.

commit 7bc8ab485e89cfc6032932e57929e208a28f4be5
Author: Meghana-vankadari <74656386+Meghana-vankadariusers.noreply.github.com>
Date: Fri Nov 12 04:16:14 2021 +0530

Added BLAS/CBLAS APIs for axpby, gemm_batch. (566)

Details:
- Expanded the BLAS compatibility layer to include support for
?axpby_() and ?gemm_batch_(). The former is a straightforward
BLAS-like interface into the axpbyv operation while the latter
implements a batched gemm via loops over bli_?gemm(). Also
expanded the CBLAS compatibility layer to include support for
cblas_?axpby() and cblas_?gemm_batch(), which serve as wrappers to
the corresponding (new) BLAS-like APIs. Thanks to Meghana Vankadari
for submitting these new APIs via 566.
- Fixed a long-standing bug in common.mk that for some reason never
manifested until now. Previously, CBLAS source files were compiled
*without* the location of cblas.h being specified via a -I flag.
I'm not sure why this worked, but it may be due to the fact that
the cblas.h file resided in the same directory as all of the CBLAS
source, and perhaps compilers implicitly add a -I flag for the
directory that corresponds to the location of the source file being
compiled. This bug only showed up because some CBLAS-like source code
was moved into an 'extra' subdirectory of that frame/compat/cblas/src
directory. After moving the code, compilation for those files failed
(because the cblas.h header file, presumably, could not be found in
the same location). This bug was fixed within common.mk by explicitly
adding the cblas.h directory to the list of -I flags passed to the
compiler.
- Added test_axpbyv.c and test_gemm_batch.c files to 'test' directory,
and updated test/Makefile to build those drivers.
- Fixed typo in error message string in cblas_sgemm.c.
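
For readers unfamiliar with axpby, the operation computes y := alpha * x + beta * y. The sketch below is a real-domain, double-precision illustration with a hypothetical name (demo_daxpby), not the ?axpby_() implementation added here, and it assumes positive increments.

```c
#include <assert.h>

/* y := alpha * x + beta * y over n elements, with element increments
   incx and incy (assumed positive in this sketch). */
void demo_daxpby( int n,
                  double alpha, const double* x, int incx,
                  double beta,  double* y, int incy )
{
    assert( n >= 0 );
    assert( incx > 0 && incy > 0 );

    for ( int i = 0; i < n; ++i )
        y[ i*incy ] = alpha * x[ i*incx ] + beta * y[ i*incy ];
}
```

Setting beta to 1 reduces this to axpy, and setting alpha to 0 reduces it to a scaling of y.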

commit 28b0982ea70c21841fb23802d38f6b424f8200e1
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Nov 10 12:34:50 2021 -0600

Refactored her[2]k/syr[2]k in terms of gemmt. (531)

Details:
- Renamed herk macrokernels and supporting files and functions to gemmt,
which is possible since at the macrokernel level they are identical.
Then recast herk/her2k/syrk/syr2k in terms of gemmt within the expert
level-3 oapi (bli_l3_oapi_ex.c) while also redefining them as literal
functions rather than cpp macros that instantiate multiple functions.
Thanks to Devin Matthews for his efforts on this issue (531).
- Check that the maximum stack buffer size is sufficiently large
relative to the register blocksizes for each datatype, and do so when
the context is initialized rather than when an operation is called.
Note that with this change, users who pass in their own contexts into
the expert interfaces currently will *not* have any checks performed.
Thanks to Devin Matthews for suggesting this change.
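
The kind of check described in the second bullet can be sketched as below. This is only an illustration under stated assumptions: the limit DEMO_STACK_BUF_MAX_SIZE, the helper name, and the exact inequality (an mr x nr microtile must fit in the buffer) are all hypothetical and may not match what BLIS actually tests.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limit for illustration; BLIS's actual constant may differ. */
#define DEMO_STACK_BUF_MAX_SIZE 4096

/* Returns true if an mr x nr microtile of elements of size elem_size
   fits within the assumed maximum stack buffer size. */
static bool demo_stack_buf_ok(size_t mr, size_t nr, size_t elem_size)
{
    return mr * nr * elem_size <= DEMO_STACK_BUF_MAX_SIZE;
}
```

Running such a check once per datatype at context initialization, rather than on every operation call, is the design change the commit describes.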

commit cfa3db3f3465dc58dbbd842f4462e4b49e7768b4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 3 18:13:56 2021 -0500

Fixed bug in mixed-dt gemm introduced in e9da642.

Details:
- Fixed a bug that broke certain mixed-datatype gemm behavior. This
bug was introduced recently in e9da642 when the code that performs
the operation transposition (for microkernel IO preference purposes)
was moved up so that it occurred sooner. However, when I moved that
code, I failed to notice that there was a cpp-protected "if"
conditional that applied to the entire code block that was moved. Once
the code block was relocated, the orphaned if-statement was now
(erroneously) glomming on to the next thing that happened to be in the
function, which happened to be the call to bli_rntm_set_ways_for_op(),
causing a rather odd memory exhaustion error in the sba due to the
num_threads field of the rntm_t still being -1 (because the rntm_t
fields were never processed as they should have been). Thanks to
ArcadioN09 (Snehith) for reporting this error and helpfully including
relevant memory trace output.
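
The failure mode described above (a cpp-protected "if" left behind after its body was moved) can be reproduced in a few lines. This is a sketch with hypothetical names; the macro, the condition, and the stand-in function are not the actual BLIS code.

```c
#define DEMO_ENABLE_MIXED_DT  /* hypothetical feature macro */

static int demo_ways_set = 0;

/* Stand-in for the call that happened to follow the relocated block. */
static void demo_set_ways(void) { demo_ways_set = 1; }

/* Returns 1 if demo_set_ways() ran, 0 if the orphaned "if" skipped it. */
static int demo_front(int cond)
{
    demo_ways_set = 0;

#ifdef DEMO_ENABLE_MIXED_DT
    if (cond)
#endif
    /* The block this "if" used to guard was moved elsewhere, so the
       conditional now captures the next statement instead. */
    demo_set_ways();

    return demo_ways_set;
}
```

With the macro defined, the call only runs when the condition holds, so any state it was meant to initialize is silently left untouched in the other case.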

commit f065a8070f187739ec2b34417b8ab864a7de5d7e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 28 16:05:43 2021 -0500

Removed support for 3m, 4m induced methods.

Details:
- Removed support for all induced methods except for 1m. This included
removing code related to 3mh, 3m1, 4mh, 4m1a, and 4m1b as well as any
code that existed only to support those implementations. These
implementations were rarely used and posed code maintenance challenges
for BLIS's maintainers going forward.
- Removed reference kernels for packm that pack 3m and 4m micropanels,
and removed 3m/4m-related code from bli_cntx_ref.c.
- Removed support for 3m/4m from the code in frame/ind, then reorganized
and streamlined the remaining code in that directory. The *ind(),
*nat(), and *1m() APIs were all removed. (These additional API layers
no longer made as much sense with only one induced method (1m) being
supported.) The bli_ind.c file (and header) were moved to frame/base
and bli_l3_ind.c (and header) and bli_l3_ind_tapi.h were moved to
frame/3.
- Removed 3m/4m support from the code in frame/1m/packm.
- Removed 3m/4m support from trmm/trsm macrokernels and simplified some
pointer arithmetic that was previously expressed in terms of the
bli_ptr_inc_by_frac() static inline function (whose definition was
also removed).
- Removed the following subdirectories of level-0 macro headers from
frame/include/level0: ri3, rih, ri, ro, rpi. The level-0 scalar macros
defined in these directories were used exclusively for 3m and 4m
method codes.
- Simplified bli_cntx_set_blkszs() and bli_cntx_set_ind_blkszs() in
light of 1m being the only induced method left within BLIS.
- Removed dt_on_output field within auxinfo_t and its associated
accessor functions.
- Re-indexed the 1e/1r pack schemas after removing those associated with
variants of the 3m and 4m methods. This leaves two bits unused within
the pack format portion of the schema bitfield. (See bli_type_defs.h
for more info.)
- Spun off the basic and expert interfaces to the object and typed APIs
into separate files: bli_l3_oapi.c and bli_l3_oapi_ex.c; bli_l3_tapi.c
and bli_l3_tapi_ex.c.
- Moved the level-3 operation-specific _check function calls from the
operations' _front() functions to the corresponding _ex() function of
the object API. (This change roughly maintains where the _check()
functions are called in the call stack but lays the groundwork for
future changes that may come to the level-3 object APIs.) Minor
modifications to bli_l3_check.c to allow the check() functions to be
called from the expert interface APIs.
- Removed support within the testsuite for testing the aforementioned
induced methods, and updated the standalone test drivers in the 'test'
directory to reflect the retirement of those induced methods.
- Modified the sandbox contract so that the user is obliged to define
bli_gemm_ex() instead of bli_gemmnat(). (This change was made in light
of the *nat() functions no longer existing.) Also updated the existing
'power10' and 'gemmlike' sandboxes to come into compliance with the
new sandbox rules.
- Updated BLISObjectAPI.md, BLISTypedAPI.md, Testsuite.md documentation
to reflect the retirement of 3m/4m, and also modified Sandboxes.md to
bring the document into alignment with new conventions.
- Updated various comments; removed segments of commented-out code.

commit e8caf200a908859fa5f5ea2049911a9bdaa3d270
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 18 13:04:15 2021 -0500

Updated do_sde.sh to get SDE from GitHub.

Details:
- Updated travis/do_sde.sh so that the script downloads the SDE tarball
from a new ci-utils repository on GitHub rather than from Intel's
website. This change is being made in an attempt to circumvent Travis
CI's recent troubles with downloading the SDE from Intel's website via
curl. Thanks to Devin Matthews for suggesting the idea.

commit 290ff4b1c26737b074d5abbf76966bc22af8c562
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 14 16:09:43 2021 -0500

Disable SDE testing of old AMD microarchitectures.

Details:
- Skip testing on piledriver, steamroller, and excavator platforms
in travis/do_sde.sh.

commit 514fd101742dee557e5eb43d0023a221ae8a7172
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 14 13:50:28 2021 -0500

Fixed substitution bug in configure.

Details:
- Fixed a bug in configure related to the building of the so-called
config list. When processing the contents of config_registry,
configure creates a series of structures and lists that allow for
various mappings related to configuration families, subconfigs,
and kernel sets. Two of those lists are built via substitution
of umbrella families with their subconfig members, and one of
those lists was improperly performing the substitution in a way
that would erroneously match on partial umbrella family names.
That code was changed to match the code that was already doing
the substitution properly, via substitute_words().
- Added comments noting the importance of using substitute_words()
in both instances.
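
The difference between whole-word and partial substitution can be sketched as follows. The real substitute_words() lives in the configure shell script; the C helper below is a hypothetical analogue written only to illustrate the matching rule, and the family names used with it are made up.

```c
#include <string.h>

/* Hypothetical helper: copies the space-separated words of `list` into
   `out`, replacing every word that equals `name` exactly with `repl`.
   Returns 0 on success or -1 if `out` (of size `outsz`) is too small. */
static int demo_substitute_words(const char *list, const char *name,
                                 const char *repl, char *out, size_t outsz)
{
    size_t used = 0;
    size_t name_len = strlen(name);
    const char *p = list;

    if (outsz == 0) return -1;
    out[0] = '\0';

    while (*p != '\0') {
        while (*p == ' ') ++p;               /* skip separators */
        if (*p == '\0') break;

        size_t len = strcspn(p, " ");        /* length of current word */
        const char *word = p;
        size_t word_len = len;

        /* Substitute only when the whole word matches, not a prefix. */
        if (len == name_len && strncmp(p, name, len) == 0) {
            word = repl;
            word_len = strlen(repl);
        }

        size_t need = word_len + (used > 0 ? 1 : 0);
        if (used + need + 1 > outsz) return -1;
        if (used > 0) out[used++] = ' ';
        memcpy(out + used, word, word_len);
        used += word_len;
        out[used] = '\0';

        p += len;
    }
    return 0;
}
```

Given the list "fam fam_legacy c", substituting "fam" with "a b" leaves "fam_legacy" intact, whereas a naive substring replacement would also rewrite the prefix of "fam_legacy".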

commit e9da6425e27a9d63c9fef92afc2dd750c601ccd7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 13 14:15:38 2021 -0500

Allow use of 1m with mixing of row/col-pref ukrs.

Details:
- Fixed a bug that broke the use of 1m for dcomplex when the single-
precision real and double-precision real ukernels had opposing I/O
preferences (row-preferential sgemm ukernel + column-preferential
dgemm ukernel, or vice versa). The fix involved adjusting the API
to bli_cntx_set_ind_blkszs() so that the induced method context init
function (e.g., bli_cntx_init_<subconfig>_ind()) could call that
function for only one datatype at a time. This allowed the blocksize
scaling (which varies depending on whether we're doing 1m_r or 1m_c)
to happen on a per-datatype basis. This fixes issue 557. Thanks to
Devin Matthews and RuQing Xu for helping discover and report this bug.
- The aforementioned 1m fix required moving the 1m_r/1m_c logic from
bli_cntx_ref.c into a new function, bli_l3_set_schemas(), which is
called from each level-3 _front() function. The pack_t schemas in the
cntx_t were also removed entirely, along with the associated accessor
functions. This in turn required updating the trsm1m-related virtual
ukernels to read the pack schema for B from the auxinfo_t struct
rather than the context. This also required slight tweaks to
bli_gemm_md.c.
- Repositioned the logic for transposing the operation to accommodate
the microkernel IO preference. This mostly only affects gemm. Thanks
to Devin Matthews for his help with this.
- Updated dpackm pack ukernels in the 'armsve' kernel set to avoid
querying pack_t schemas from the context.
- Removed the num_t dt argument from the ind_cntx_init_ft type defined
in bli_gks.c. The context initialization functions for induced methods
were previously passed a dt argument, but I can no longer figure out
*why* they were passed this value. To reduce confusion, I've removed
the dt argument (including also from the function definition +
prototype).
- Commented out setting of cntx_t schemas in bli_cntx_ind_stage.c. This
breaks high-level implementations of 3m and 4m, but this is okay since
those implementations will be removed very soon.
- Removed some older blocks of preprocessor-disabled code.
- Comment update to test_libblis.c.

commit 81e103463214d589071ccbe2d90b8d7c19a186e4
Author: Minh Quan Ho <1337056+hominhquanusers.noreply.github.com>
Date: Wed Oct 13 20:28:02 2021 +0200

Alloc at least 1 elem in pool_t block_ptrs. (560)

Details:
- Previously, the block_ptrs field of the pool_t was allowed to be
initialized as any unsigned integer, including 0. However, a length of
0 could be problematic given that the result of malloc(0) is
implementation-defined and therefore varies across implementations.
As a safety measure, we check for
block_ptrs array lengths of 0 and, in that case, increase them to 1.
- Co-authored-by: Minh Quan Ho <minh-quan.hokalray.eu>
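
The safety measure amounts to clamping the requested length before allocating. Below is a minimal sketch, assuming a hypothetical helper name and a plain malloc() in place of BLIS's internal allocator.

```c
#include <stdlib.h>

/* Hypothetical helper: allocate an array of block pointers, never asking
   malloc() for zero bytes. The (possibly adjusted) length is written back
   through `len` so the caller records the real capacity. */
static void **demo_alloc_block_ptrs(size_t *len)
{
    if (*len == 0) *len = 1;
    return malloc(*len * sizeof(void *));
}
```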

commit 327481a4b0acf485d0cbdd8635dd9b886ba3f2a7
Author: Minh Quan Ho <1337056+hominhquanusers.noreply.github.com>
Date: Tue Oct 12 19:53:04 2021 +0200

Fix insufficient pool-growing logic in bli_pool.c. (559)

Details:
- The current mechanism for growing a pool_t doubles the length of the
block_ptrs array every time the array length needs to be increased
due to new blocks being added. However, that logic did not take into
account the new total number of blocks, and the fact that the caller
may be requesting more blocks than would fit even after doubling the
current length of block_ptrs. The code comments now contain two
illustrative examples that show why, even after doubling, we must
always have at least enough room to fit all of the old blocks plus
the newly requested blocks.
- This commit also happens to fix a memory corruption issue that stems
from growing any pool_t that is initialized with a block_ptrs length
of 0. (Previously, the memory pool for packed buffers of C was
initialized with a block_ptrs length of 0, but because it is unused
this bug did not manifest by default.)
- Co-authored-by: Minh Quan Ho <minh-quan.hokalray.eu>
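
The growth rule described in the first bullet can be sketched as "double, but never end up with less than what is needed". The helper below is hypothetical and only illustrates that rule; the actual code in bli_pool.c may be structured differently.

```c
#include <stddef.h>

/* Hypothetical helper: new length for block_ptrs when num_blocks_add
   blocks are appended to a pool already holding num_blocks_cur blocks. */
static size_t demo_grown_len(size_t cur_len, size_t num_blocks_cur,
                             size_t num_blocks_add)
{
    size_t needed  = num_blocks_cur + num_blocks_add;
    size_t doubled = 2 * cur_len;

    if (needed <= cur_len) return cur_len;      /* no growth required */
    return doubled > needed ? doubled : needed; /* doubling may fall short */
}
```

With a current length of 4 holding 4 blocks, adding 10 more needs 14 slots, which doubling alone (8) would not provide. Starting from a length of 0, doubling yields 0, which is the case behind the memory corruption mentioned in the second bullet.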

commit 32a6d93ef6e2af5e486dfd5e46f8272153d3d53d
Merge: 408906fd 2604f407
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Oct 9 15:53:54 2021 -0500

Merge pull request 543 from xrq-phys/armsve-packm-fix

ARMSVE Block SVE-Intrinsic Kernels for GCC 8-9

commit 408906fdd8892032aa11bd061b7971128f453bef
Merge: 4277fec0 ccf16289
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Oct 9 15:50:25 2021 -0500

Merge pull request 542 from xrq-phys/armsve-zgemm

Arm SVE CGEMM / ZGEMM Natural Kernels

commit ccf16289d2e71fd9511ccf2d13dcebbfa29deabc
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Fri Oct 8 12:34:14 2021 +0900

Arm SVE C/ZGEMM Fix FMOV 0 Mistake

FMOV [hsd]M, imm does not allow zero immediate.
Use wzr, xzr instead.

commit 82b61283b2005f900101056e6df2a108258db602
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Fri Oct 8 12:17:29 2021 +0900

SH Kernel Unused Either

commit 1749dfa493054abd2e4ddba7cb21278d337e4f74
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Fri Oct 8 12:11:53 2021 +0900

Arm SVE C/ZGEMM Support *beta==0

commit 4b648e47daad256ab8ab698173a97f71ab9f75eb
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Sep 22 16:42:09 2021 +0900

Arm SVE Config armsve Use ZGEMM/CGEMM

commit f76ea905e216cf640975e6319c6d2f54aeafed2e
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Tue Sep 21 20:38:44 2021 +0900

Arm SVE: Update Perf. Graph

Pic. size seems a bit different from upstream.
Generated w/ MATLAB. Open to any change.

commit 66a018e6ad00d9e8967b67e1aa3e23b20a7efdfe
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Mon Sep 20 00:16:11 2021 +0900

Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0

commit 9e1e781cb59f8fadb2a10a02376d3feac17ce38d
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sun Sep 19 23:30:42 2021 +0900

Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0

commit f7c6c2b119423e7ba7a24ae2156790e076071cba
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Sep 16 01:47:42 2021 +0900

A64FX Config Use ZGEMM/CGEMM

commit e4cabb977d038688688aca39b366f98f9c36b7eb
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Sep 16 01:34:26 2021 +0900

Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg

commit b677e0d61b23f26d9536e5c363fd6bbab6ee1540
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Sep 16 01:18:54 2021 +0900

Arm SVE Add SGEMM 2Vx10 Unindexed

commit 3f68e8309f2c5b31e25c0964395a180a80014d36
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Sep 16 01:00:54 2021 +0900

Arm SVE ZGEMM Support Gather Load / Scatt. St.

commit c19db2ff826e2ea6ac54569e8aa37e91bdf7cabe
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Sep 15 23:39:53 2021 +0900

Arm SVE Add ZGEMM 2Vx10 Unindexed

commit e13abde30b9e0e381c730c496e74bc7ae062a674
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Sep 15 04:19:45 2021 +0900

Arm SVE Add ZGEMM 2Vx7 Unindexed

commit 49b9d7998eb86f340ae7b26af3e5a135d6a8feee
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Tue Sep 14 04:02:47 2021 +0900

Arm SVE Add ZGEMM 2Vx8 Unindexed

commit 4277fec0d0293400497ae8bcfc32be5e62319ae9
Merge: 2329d990 f44149f7
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Oct 7 13:47:22 2021 -0500

Merge pull request 533 from xrq-phys/arm64-hi-bw

ARMv8 PACKM and GEMMSUP Kernels + Apple Firestorm Subconfig

commit 2329d99016fe1aeb86da4552295f497543cea311 (origin/1m_row_col_problem)
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Oct 7 12:37:58 2021 -0500

Update Travis CI badge

[ci skip]

commit f44149f787ae3d4b53d9c4d8e6f23b2818b7770d
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Fri Oct 8 02:35:58 2021 +0900

Armv8 Trash New Bulk Kernels

- They didn't bring much improvement.
- Can't register row-preferential and column-preferential ukrs at the same time.
Will break 1m.

commit 70b52cadc5ef4c16431e1876b407019e6286614e
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Oct 7 12:34:35 2021 -0500

Enable testing 1m in `make check`.

commit 2604f4071300d109f28c8438be845aeaf3ec44e4
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Oct 7 02:39:00 2021 +0900

Config ArmSVE Unregister 12xk. Move 12xk to Old

commit 1e3200326be9109eb0f8c7b9e4f952e45700cbba
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Oct 7 02:37:14 2021 +0900

Revert __has_include(). Distinguish w/ BLIS_FAMILY_**

commit a4066f278a5c06f73b16ded25f115ca4b7728ecb
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Oct 7 02:26:05 2021 +0900

Register firestorm into arm64 Metaconfig

commit d7a3372247c37568d142110a1537632b34b8f2ff
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Oct 7 02:25:14 2021 +0900

Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo

commit 2920dde5ac52e09f84aa42990aab8340421522ce
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Oct 7 02:01:45 2021 +0900

Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo

commit 14b13583f1802c002e195b3b48874b3ebadbeb20
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Oct 6 10:22:34 2021 -0500

Add test for Apple M1 (firestorm)

This test will run on Linux, but all the kernels should run just fine. This does not test autodetection but then none of the other ARM tests do either.

commit a024715065532400da6257b8b3124ca5aecda405
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Oct 7 00:15:54 2021 +0900

Firestorm CPUID Dispatcher

Commenting out <sys/sysctl.h>, possibly due to an Xcode bug.

commit b9da6d55fec447d05c8b67f34ce83617123d8357
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Oct 6 12:25:54 2021 +0900

Armv8 GEMMSUP Edge Cases Require Signed Ints

Fix a bug in bli_gemmsup_rd_armv8a_asm_d6x8m.c.
For safety upon similar strategies in the future,
change all [mn]_[iter/left] into signed ints.
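
The commit does not spell out the exact failure, but the general hazard with unsigned loop counters in edge-case handling can be sketched as follows. The function names and the step width of 8 are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Counts the steps of a loop of the form
   "for (left = n; left > 0; left -= 8)" using a signed counter, which
   terminates cleanly once left drops to zero or below. */
static int demo_steps_signed(int64_t n)
{
    int steps = 0;
    for (int64_t left = n; left > 0; left -= 8) ++steps;
    return steps;
}

/* With an unsigned counter, subtracting past zero wraps around to a huge
   value, so a "left > 0" test would keep succeeding. Returns 1 on wrap. */
static int demo_unsigned_wraps(uint64_t n)
{
    uint64_t left = n - 8;      /* wraps when n < 8 */
    return left > n;
}
```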

commit 34919de3df5dda7a06fc09dcec12ca46dc8b26f4
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Oct 2 18:48:50 2021 -0500

Make error checking level a thread-local variable.

Previously, this was a global variable. Setting the value was synchronized via a mutex but reading the value was not. Of course, these accesses are almost certainly atomic, but there is still the possibility of one thread attempting to set the value and then reading the value set by another thread. For correct operation under user threading (e.g. pthreads), this should probably be thread-local with no mutex.
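
A thread-local setting of this kind needs no mutex because each thread reads and writes its own copy. Below is a minimal sketch using the C11 _Thread_local storage class; the variable and accessor names are hypothetical, and BLIS may use a compiler-specific keyword or macro instead.

```c
/* Hypothetical names; C11 _Thread_local gives each thread its own copy,
   initialized to the same default value. */
static _Thread_local int demo_chk_level = 1;

/* Reads the calling thread's error checking level. */
static int demo_chk_level_get(void) { return demo_chk_level; }

/* Changes the level for the calling thread only. */
static void demo_chk_level_set(int level) { demo_chk_level = level; }
```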

commit c3024993c3d50236fad112822215f066496c5831
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Oct 5 15:20:27 2021 -0500

Fix data race in testsuite.

commit 353a0d82572f26e78102cee25693130ce6e0ea5b
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Oct 5 14:24:17 2021 -0500

Update .appveyor.yml

[ci skip]

commit 4bfadf9b561d4ebe0bbaf8b6d332f07ff531d618
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Oct 6 01:51:26 2021 +0900

Firestorm Block Size Fixes

commit 40baf83f0ea2749199b93b5a8ac45c01794b008c
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Oct 6 01:00:52 2021 +0900

Armv8 Handle *beta == 0 for GEMMSUP ??r Case.

commit 079fbd42ce8cf7ea67a939b0f80f488de5821319
Merge: f5c03e9f 9905f443
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Oct 4 17:21:48 2021 -0500

Merge branch 'master' into arm64-hi-bw

commit 9905f44347eea4c57ef4927b81f1c63e76a92739
Merge: 6d3036e3 64a421f6
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Oct 4 15:58:59 2021 -0500

Merge pull request 553 from flame/rpath-fix

Add an option to use an rpath-dependent install_name on macOS

commit 6d3036e31d8a2c1acbc1260489eeb8f535a8f97a
Merge: 53377fcc eaa554aa
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Oct 4 15:58:43 2021 -0500

Merge pull request 545 from hominhquan/clean_error

bli_error: more cleanup on the error strings array

commit 53377fcca91e595787b38e2a47780ac0c35a7e7c
Merge: d0a0b4b8 80c5366e
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Oct 4 15:45:53 2021 -0500

Merge pull request 554 from flame/armsve-cleanup

Move unused ARM SVE kernels to "old" directory.

commit 80c5366e4a9b8b72d97fba1eab89bab8989c44f4
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Oct 4 15:40:28 2021 -0500

Move unused ARM SVE kernels to "old" directory.

commit 64a421f6983ab5bc0b55df30a2ddcfff5bfd73be
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Oct 4 13:40:43 2021 -0500

Add an option to control whether or not to use rpath.

Adds `--enable-rpath/--disable-rpath` (default disabled) to use an install_name starting with @rpath/. Otherwise, set the install_name to the absolute path of the installed library, which was the previous behavior.

commit c4a31683dd6f4da3065d86c11dd998da5192740a
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Oct 4 13:27:10 2021 -0500

Fix $ORIGIN usage on linux.

commit d0a0b4b841fce56b7b2d3c03c5d93ad173ce2b97
Author: Dave Love <dave.lovemanchester.ac.uk>
Date: Mon Oct 4 18:03:04 2021 +0000

Arm micro-architecture dispatch (344)

Details:
- Reworked support for ARM hardware detection in bli_cpuid.c to parse
the result of a CPUID-like instruction.
- Added a64fx support to bli_gks.c.
- include arm64 and arm32 family headers from bli_arch_config.h.
- Fix the ordering of the "armsve" and "a64fx" strings in the
config_name string array in bli_arch.c. The ordering did not match
the ordering of the corresponding arch_t values in bli_type_defs.h,
as it should have all along.
- Added clang support to make_defs.mk in arm64, cortexa53, cortexa57
subconfigs.
- Updated arm64 and arm32 families in config_registry.
- Updated docs/HardwareSupport.md to reflect added ARM support.
- Thanks to Dave Love, RuQing Xu, and Devin Matthews for their
contributions in this PR (344).

commit 91408d161a2b80871463ffb6f34c455bdfb72492
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Oct 4 11:37:48 2021 -0500

Use path-based install name on MacOS and use relocatable RPATH entries for testsuite binaries.

- RPATH entries (and DYLD_LIBRARY_PATH) do nothing on macOS unless the install_name of the library starts with @rpath/. While the install_name can be set to the absolute install path, this makes the installation non-relocatable. When using path in the install_name, install paths within the normal DYLD_LIBRARY_PATH work with no changes on the user side, but for install paths off the beaten track, users must specify an RPATH entry when linking (or modify DYLD_LIBRARY_PATH at runtime). Perhaps this could be made into a configure-time option.
- Having relocatable testsuite binaries is not necessarily a priority but it is easy to do with @executable_path (macOS) or $ORIGIN (linux/BSD).

commit f5c03e9fe808f9bd8a3e0c62786334e13c46b0fc
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sun Oct 3 16:51:51 2021 +0900

Armv8 Handle *beta == 0 for GEMMSUP ?rc Case.

commit abc648352c591e26ceee436bd3a45400115b70c5
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sun Oct 3 13:14:19 2021 +0900

Armv8 Fix 6x8 Row-Maj Ukr

- Fixed for 6x8 only, 4x4 & 4x8 pending;
- Installed to config firestorm as benchmark seems to show better perf:
Old:
blis_dgemm_ukr_c 6 8 320 36.87 2.43e-17 PASS
blis_dgemm_ukr_c 6 8 352 40.55 1.04e-17 PASS
blis_dgemm_ukr_c 6 8 384 44.24 5.68e-17 PASS
blis_dgemm_ukr_c 6 8 416 41.67 3.51e-17 PASS
blis_dgemm_ukr_c 6 8 448 34.41 2.94e-17 PASS
blis_dgemm_ukr_c 6 8 480 42.53 2.35e-17 PASS

New:
blis_dgemm_ukr_r 6 8 352 50.69 1.59e-17 PASS
blis_dgemm_ukr_r 6 8 384 49.15 5.55e-17 PASS
blis_dgemm_ukr_r 6 8 416 50.44 2.86e-17 PASS
blis_dgemm_ukr_r 6 8 448 46.92 3.12e-17 PASS
blis_dgemm_ukr_r 6 8 480 48.08 4.08e-17 PASS

commit 0a45bc0fbc7aee3876c315ed567fc37f19cdc57f
Merge: 5013a6cb 13dbd5b5
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Oct 2 18:59:43 2021 -0500

Merge pull request 552 from flame/armsve_beta_0

Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs.

commit 13dbd5b5d3dbf27e33ecf0e98d43c97019a6339d
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Oct 2 20:40:25 2021 +0000

Apply patch from xrq-phys.

commit ae0eeeaf77c77892db17027cef10b95ec97c904f
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Sep 29 16:42:33 2021 -0500

Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs.
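
The usual reason for treating beta == 0 explicitly is the BLAS convention that the output need not be initialized in that case: multiplying by zero would still propagate any NaN or Inf already present in C. The following is a scalar sketch of the idea with hypothetical names (the actual kernels are written in assembly or intrinsics).

```c
#include <stddef.h>

/* Hypothetical scalar update c := beta*c + alpha*ab for an m*n tile stored
   contiguously. When beta == 0, c is overwritten without being read. */
static void demo_update_tile(size_t m, size_t n, double alpha,
                             const double *ab, double beta, double *c)
{
    for (size_t i = 0; i < m * n; ++i) {
        if (beta == 0.0)
            c[i] = alpha * ab[i];
        else
            c[i] = beta * c[i] + alpha * ab[i];
    }
}
```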

commit 5013a6cb7110746c417da96e4a1308ef681b0b88
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Sep 29 10:38:50 2021 -0500

More edits and fixes to docs/FAQ.md.

commit b36fb0fbc5fda13d9a52cc64953341d3d53067ee
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 28 18:47:45 2021 -0500

Fixed newly broken link to CREDITS in FAQ.md.

commit 3442d4002b3bfffd8848f72103b30691df2b19b1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 28 18:43:23 2021 -0500

More minor fixes to FAQ.md and Sandboxes.md.

commit 89aaf00650d6cc19b83af2aea6c8d04ddd3769cb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 28 18:34:33 2021 -0500

Updates to FAQ.md, Sandboxes.md, and README.md.

Details:
- Updated FAQ.md to include two new questions, reordered an existing
question, and also removed an outdated and redundant question about
BLIS vs. AMD BLIS.
- Updated Sandboxes.md to use 'gemmlike' as its main example, along with
other smaller details.
- Added ARM as a funder to README.md.

commit c52c43115ec2264fda9380c48d9e6bb1e1ea2ead
Merge: 1fc23d21 1f527a93
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Sep 26 15:56:54 2021 -0500

Merge branch 'dev'

commit 1fc23d2141189c7b583a5bff2cffd87fd5261444
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 21 14:54:20 2021 -0500

Safelist 'master', 'dev', 'amd' branches.

Details:
- Modified .travis.yml so that only commits to 'master', 'dev', and
'amd' branches get built by Travis CI. Thanks to Devin Matthews for
helping to track down the syntax for this change.

commit 1f527a93b996093e06ef7a8e94fb47ee7e690ce0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 20 17:56:36 2021 -0500

Re-enable and fix fb93d24.

Details:
- Re-enabled the changes made in fb93d24.
- Defined BLIS_ENABLE_SYSTEM in bli_arch.c, bli_cpuid.c, and bli_env.c,
all of which needed the definition (in addition to config_detect.c) in
order for the configure-time hardware detection binary to be compiled
properly. Thanks to Minh Quan Ho for helping identify these additional
files as needing to be updated.
- Added additional comments to all four source files, most notably to
prompt the reader to remember to update all of the files when updating
any of the files. Also made the cpp code in each of the files as
consistent/similar as possible.
- Refer to issue 532 and PR 546 for more history.

commit 7b39c1492067de941f81b49a3b6c1583290336fd
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 20 16:13:50 2021 -0500

Reverted fb93d24.

Details:
- The latest changes in fb93d24 are still causing problems. Reverting
and preparing to move them to a branch.

commit fb93d242a4fef4694ce2680436da23087bbdd5fe
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 20 15:42:08 2021 -0500

Re-enable and fix 8e0c425 (BLIS_ENABLE_SYSTEM).

Details:
- Re-enable the changes originally made in 8e0c425 but quickly reverted
in 2be78fc.
- Moved the include of bli_config.h so that it occurs before the
include of bli_system.h. This allows the #define BLIS_ENABLE_SYSTEM
or #define BLIS_DISABLE_SYSTEM in bli_config.h to be processed by the
time it is needed in bli_system.h. This change should have been
in the original 8e0c425, but was accidentally omitted. Thanks to Minh
Quan Ho for catching this.
- Add #define BLIS_ENABLE_SYSTEM to config_detect.c so that the proper
cpp conditional branch executes in bli_system.h when compiling the
hardware detection binary. The changes made in 8e0c425 were an attempt
to support the definition of BLIS_OS_NONE when configuring with
--disable-system (in issue 532). That commit failed because, aside
from the required but omitted header reordering (second bullet above),
AppVeyor was unable to compile the hardware detection binary as a
result of missing Windows headers. This commit, which builds on PR
546, should help fix that issue. Thanks to Minh Quan Ho for his
assistance and patience on this matter.

commit eaa554aa52b879d181fdc87ba0bfad3ab6131517
Author: Minh Quan HO <minh-quan.hokalray.eu>
Date: Wed Sep 15 15:39:36 2021 +0200

bli_error: more cleanup on the error strings array

- There was redundancy between the macro BLIS_MAX_NUM_ERR_MSGS (=200) and
the enum BLIS_ERROR_CODE_MAX (-170), while they both mean the same thing:
the maximal number of error codes/messages.
- The previous initialization of error messages at compile time ignored that
the 'bli_error_string' array still occupies useless memory due to 2D char[][]
declaration. Instead, it should be just an array of pointers, pointing at
strings in .rodata section.
- This commit makes two modifications:
* retires the macros BLIS_MAX_NUM_ERR_MSGS and BLIS_MAX_ERR_MSG_LENGTH everywhere
* switches bli_error_string from char[][] to char *[] to reduce its footprint
from 40KB (200*200) to 1.3KB (170*sizeof(char*)).
(No problem to use the enum BLIS_ERROR_CODE_MAX at compile-time,
since compiler is smart enough to determine its value is 170.)
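
The footprint figures quoted above follow directly from sizeof. The sketch below uses hypothetical DEMO_ names with the sizes taken from the commit message; the pointer-array size depends on the platform's pointer width (about 1.3KB assumes 8-byte pointers).

```c
#include <stddef.h>

#define DEMO_MAX_NUM_ERR_MSGS   200  /* sizes quoted in the commit message */
#define DEMO_MAX_ERR_MSG_LENGTH 200
#define DEMO_NUM_ERR_CODES      170

/* Old layout: every slot reserves a full fixed-length buffer. */
static char demo_strings_2d[DEMO_MAX_NUM_ERR_MSGS][DEMO_MAX_ERR_MSG_LENGTH];

/* New layout: one pointer per code; the string literals themselves live
   in read-only data. */
static const char *demo_strings_ptr[DEMO_NUM_ERR_CODES];

/* Footprint of each table in bytes. */
static size_t demo_size_2d(void)  { return sizeof(demo_strings_2d); }
static size_t demo_size_ptr(void) { return sizeof(demo_strings_ptr); }
```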

commit 52f29f739dbbb878c4cde36dbe26b82847acd4e9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Sep 17 08:38:29 2021 -0500

Removed last vestige of #define BLIS_NUM_ARCHS.

Details:
- Removed the commented-out #define BLIS_NUM_ARCHS in bli_type_defs.h
and its associated (now outdated) comments. BLIS_NUM_ARCHS has been
part of the arch_t enum for some time now, and so this change is
mostly about removing any opportunity for confusion for people who
may be reading the code. Thanks to Minh Quan Ho for leading me to
this cleanup.

commit 849aae09f4fbf8d7abf11f4df1471f1d057e874b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 16 14:47:45 2021 -0500

Added new packm var3 to 'gemmlike'.

Details:
- Defined a new packm variant for the 'gemmlike' sandbox. This new
variant (bls_l3_packm_var3.c) parallelizes the packing operation over
the k dimension rather than the m or n dimensions. Note that the
gemmlike implementation still uses var1 by default, and use of the new
code would require changing bls_l3_packm_a.c and/or bls_l3_packm_b.c
so that var3 is called instead. Thanks to Jeff Diamond for proposing
this (perhaps NUMA-friendly) solution.

commit b6f71fd378b7cd0cdc5c780e0b8c975a7abde998
Merge: 9293a68e e3dc1954
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Sep 16 12:24:33 2021 -0500

Merge pull request 544 from flame/haswell-gemmsup-fpe

Fix more copy-paste errors in the haswell gemmsup code.

commit e3dc1954ffb5eee2a8b41fce85ba589f75770eea
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Sep 16 10:59:37 2021 -0500

Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell.

The fix is to use the same (valid) source register twice in the horizontal addition.

commit 5191c43faccf45975f577c60b9089abee25722c9
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Sep 16 10:16:17 2021 -0500

Fix more copy-paste errors in the haswell gemmsup code.

Fixes 486.

commit 30c29b256ef13f0141ca9e9169cbdc7a45ce3a61
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Sep 16 05:01:03 2021 +0900

Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9

Affected configs: a64fx.

commit bffa85be59dece8e756b9444e762f18892c06ee1
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Sep 16 04:31:45 2021 +0900

Arm SVE: Correct PACKM Ker Name: Intrinsic Kers

SVE-Intrinsic-based kernels ought not to use asm in their names.

commit 9293a68eb6557a9ea43a846435908c3d52d4218b
Merge: ade10f42 98ce6e8b
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Sep 10 14:13:29 2021 -0500

Merge pull request 534 from flame/cxx_test

Add test to Travis using C++ compiler to make sure blis.h is C++-compatible

commit 98ce6e8bc916e952510872caa60d818d62a31e69
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Sep 10 14:12:13 2021 -0500

Do a fast test on OSX. [ci skip]

commit c76fcad0c2836e7140b6bef3942e0a632a5f2cda
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Sep 10 13:57:02 2021 -0500

Fix AArch64 tests and consolidate some other tests.

commit e486d666ffefee790d5e39895222b575886ac1ea
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Sep 10 13:50:16 2021 -0500

Use C++ cross-compiler for ARM tests.

commit fbb3560cb8e2aeab205c47c2b096d4fa306d93db
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Sep 10 13:38:27 2021 -0500

Attempt to fix cxx-test for OOT builds.

commit 9c0064f3f67d59263c62d57ae19605562bb87cc2
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Sep 10 10:39:04 2021 -0500

Fix config_name in bli_arch.c

commit ade10f427835d5274411cafc9618ac12966eb1e7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Aug 27 12:47:12 2021 -0500

Updated travis-ci.org link in README.md to .com.

commit 2be78fc97777148c83d20b8509e38aa1fc1b4540
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Aug 27 12:17:26 2021 -0500

Disabled (at least temporarily) commit 8e0c425.

Details:
- Reverted changes in 8e0c425 due to AppVeyor build failures that we do
not yet understand.

commit 820f11a4694aee5f234e24277aecca40885ae9d4
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Fri Aug 27 13:40:26 2021 +0900

Arm Whole GEMMSUP Call Route is Asm/Int Optimized

- `ref2` call in `bli_gemmsup_rv_armv8a_asm_d6x8m.c` is commented out.
- `bli_gemmsup_rv_armv8a_asm_d4x8m.c` contains a tail `ref2` call but
it's not called by any upper routine.

commit 8e0c4255de52a0a5cffecbebf6314aa52120ebe4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 26 15:29:18 2021 -0500

Define BLIS_OS_NONE when using --disable-system.

Details:
- Modified bli_system.h so that the cpp macro BLIS_OS_NONE is defined
when BLIS_DISABLE_SYSTEM is defined. Otherwise, the previous OS-
detecting macro conditionals are considered. This change is to
accommodate a solution to a cross-compilation issue described in
532.

commit d6eb70fbc382ad7732dedb4afa01cf9f53e3e027
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 26 13:12:39 2021 -0500

Updated stale calls to malloc_intl() in gemmlike.

Details:
- Updated two out-of-date calls to bli_malloc_intl() within the gemmlike
sandbox. These calls to malloc_intl(), which resided in
bls_l3_decor_pthreads.c, were missing the err_t argument that the
function uses to report errors. Thanks to Jeff Diamond for helping
isolate this issue.

commit 2f7325b2b770a15ff8aaaecc087b22238f0c67b7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Aug 23 15:04:05 2021 -0500

Blacklist clang10/gcc9 and older for 'armsve'.

Details:
- Prohibit use of clang 10.x and older or gcc 9.x and older for the
'armsve' subconfiguration. Addresses issue 535.

commit 7e2951e61fda1c325d6a76ca9956253482d84924
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Mon Aug 23 17:06:44 2021 +0900

Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref

Ref cannot handle panel strides (packed cases) and thus cannot be called
from the beginning of `gemmsup` (i.e. it cannot be the dispatch target
of gemmsup for other sizes).

commit 4fd82b0e9348553d83e258bd4969e49a81f8fcf0
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Mon Aug 23 05:18:32 2021 +0900

Header Typo

commit 35409ebe67557c0e7cf5ced138c8166c9c1c909f
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Mon Aug 23 04:51:47 2021 +0900

Arm: DGEMMSUP ??r(rv) Invoke Edge Size

Plus some fix at edges.

TODO: Should ensure that no ref kernel appears at the beginning of gemmsup
kernels, as ref does not recognise panel stride.

commit a361492c24fdd919ee037763fc6523e8d7d2967a
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Mon Aug 23 01:13:39 2021 +0900

Arm: DGEMMSUP ?rc(rd) Invoke Edge Size

commit eaea67401c2ab31f2e51eede59725f64c1a21785
Merge: 5fc65cdd e320ec6d
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Aug 21 16:09:31 2021 -0500

Merge branch 'master' into cxx_test

commit 5fc65cdd9e4134c5dcb16d21cd4a79ff426ca9f3
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Aug 21 15:59:27 2021 -0500

Add test to Travis using C++ compiler to make sure blis.h is C++-compatible.

commit e320ec6d5cd44e03cb2e2faa1d7625e84f76d668
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Aug 20 17:15:20 2021 -0500

Moved lang defs from _macro_def.h to _lang_defs.h.

Details:
- Moved miscellaneous language-related definitions, including defs
related to the handling of the 'restrict' keyword, from the top half
of bli_macro_defs.h into a new file, bli_lang_defs.h, which is now
included immediately after "bli_system.h" in blis.h. This change is
an attempt to fix a report of recent breakage of C++ compilers due
to the recent introduction of 'restrict' in bli_type_defs.h (which
previously was being included *before* bli_macro_defs.h and its
restrict handling therein). Thanks to Ivan Korostelev for reporting
this issue in 527.
- CREDITS file update.
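
The ordering problem described above can be illustrated with a minimal sketch. The macro `SKETCH_RESTRICT` and the function `sketch_axpy` are hypothetical names chosen for this example and are not the definitions found in bli_lang_defs.h; the only point is that the keyword handling has to be visible before any header that uses it.

```c
#include <stddef.h>

/* Hypothetical "language definitions" that must come first: C99 has
   'restrict' but C++ does not, so map it to a compiler extension (or
   to nothing) when the header is compiled as C++. */
#if defined(__cplusplus)
  #if defined(__GNUC__) || defined(__clang__)
    #define SKETCH_RESTRICT __restrict__
  #else
    #define SKETCH_RESTRICT
  #endif
#else
  #define SKETCH_RESTRICT restrict
#endif

/* A later "type definitions" header can now use the macro safely. */
static void sketch_axpy(size_t n, double alpha,
                        const double* SKETCH_RESTRICT x,
                        double* SKETCH_RESTRICT y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}
```

If the macro were defined only after the declaration of `sketch_axpy`, a C++ compiler would see the bare token and reject the header, which is the kind of breakage the commit appears to address.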

commit e6799b26a6ecf1e80661a77d857d1c9e9adf50dc
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat Aug 21 02:39:38 2021 +0900

Arm: Implement GEMMSUP Fallback Method

bli_dgemmsup_rv_armv8a_int_6x4mn

commit 7d5903d8d7570090eb37c592094424d1c64805d1
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat Aug 21 01:55:50 2021 +0900

Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin

Forgot to support `alpha`/`beta` in gemmsup_armv8a_int.

commit 3b275f810b2479eb5d6cf2296e97a658cf1bb769
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 19 16:06:46 2021 -0500

Minor tweaks to gemmlike sandbox.

Details:
- In the gemmlike sandbox, changed the loop index variable of inner
loop of packm_cxk() from 'd' to 'i' (and likewise for the
corresponding inlined code within packm_var2()).
- Pack matrices A and B using packm_var1() instead of packm_var2().

commit 3eccfd456e7e84052c9a429dcde1183a7ecfaa48
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 19 13:22:10 2021 -0500

Added local _check() code to gemmlike sandbox.

Details:
- Added code to the gemmlike sandbox that handles parameter checking.
Previously, the gemmlike implementation called bli_gemm_check(), which
resides within the BLIS framework proper. Certain modifications that a
user may wish to perform on the sandbox, such as adding a new matrix
or vector operand, would have required additional checks, and so these
changes make it easier for such a person to implement those checks for
their custom gemm-like operation.

commit 7144230cdb0653b70035ddd91f7f41e06ad8d011
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Aug 18 13:25:39 2021 -0500

README.md citation updates (e.g. BLIS7 bibtex).

commit 4a955e939044cfd2048cf9f3e33024e3ad1fbe00
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Aug 16 13:49:27 2021 -0500

Tweaks to gemmlike to facilitate 3rd party mods.

Details:
- Changed the implementation in the 'gemmlike' sandbox to more easily
allow others to provide custom implementations of packm. These changes
include:
- Calling a local version of packm_cxk() that can be modified. This
version of packm_cxk() uses inlined loops in packm_cxk() rather
than querying the context for packm kernels (or even using scal2m).
- Providing two variants of packm, one of which calls the
aforementioned packm_cxk(), the other of which inlines the contents
of packm_cxk() into the variant itself, making it self-contained.
To switch from one to the other, simply change which function gets
called within bls_packm_a() and bls_packm_b().
- Simplified and cleaned up some variant names in both variants of
packm, relative to their parent code.

commit 2c0b4150e40c83ea814f69ca766da74c19ed0a58
Merge: c99fae50 4b8ed99d
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Aug 14 18:41:35 2021 -0500

Merge pull request 527 from flame/obj_t_makeover

Implement proposed new function pointer fields for obj_t.

commit 4b8ed99d926876fbf54c15468feae4637268eb6b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Aug 13 15:31:10 2021 -0500

Whitespace tweaks.

commit c99fae50ac3de0b5380a085aeebebfe67a645407
Merge: e6d68bc4 4f70eb79
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Aug 13 14:48:00 2021 -0500

Merge pull request 530 from flame/fix_clang_warnings

Clean up some warnings that show up on clang/OSX.

commit e6d68bc4fd0981bea90d7f045779cacfe53f6ae8
Merge: 20a1c401 ec06b6a5
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Aug 13 14:47:46 2021 -0500

Merge pull request 529 from flame/fix_make_check_dependencies

Add dependency on the "flat" blis.h file for the BLIS and BLAS testsuite objects.

commit 1772db029e10e0075b5a59d3fb098487b1ad542a
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Aug 13 14:46:35 2021 -0500

Add row- and column-strides for A/B in obj_ukr_fn_t.

commit 4f70eb7913ad3ded193870361b6da62b20ec3823
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Aug 13 11:12:43 2021 -0500

Clean up some warnings that show up on clang/OSX.

commit 3cddce1e2a021be6064b90af30022b99cbfea986
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Aug 12 22:32:34 2021 -0500

Remove schema field on obj_t (redundant) and add new API functions.

commit ec06b6a503a203fa0cdb23273af3c0e3afeae7fa
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Aug 12 19:27:31 2021 -0500

Add dependency on the "flat" blis.h file for the BLIS and BLAS testsuite objects.

This fixes a bug where "make -j<N> check" may fail after a change to one or more header files, or where testsuite code doesn't get properly recompiled after internal changes.

commit 20a1c4014c999063e6bc1cfa605b152454c5cbf4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 12 14:44:04 2021 -0500

Disabled sanity check in bli_pool_finalize().

Details:
- Disabled a sanity check in bli_pool_finalize() that was meant to alert
the user if a pool_t was being finalized while some blocks were still
checked out. However, this is exactly the situation that might happen
when a pool_t is re-initialized for a larger blocksize, and currently
bli_pool_reinit() is implemented as _finalize() followed by _init().
So, this sanity check is not universally appropriate. Thanks to
AMD-India for reporting this issue.

commit e366665cd2b5ae8d7683f5ba2de345df0a41096f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 12 14:06:53 2021 -0500

Fixed stale API calls to membrk API in gemmlike.

Details:
- Updated stale calls to the bli_membrk API within the 'gemmlike'
sandbox. This API is now called bli_pba (packed block allocator).
Ideally, this forgotten update would have been included as part of
21911d6, which is when the branch that introduced the membrk->pba
changes was merged into 'master'.
- Comment updates.

commit e38ca28689f31c5e5bd2347704dc33042e5ea176
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Fri Aug 13 03:21:19 2021 +0900

Added Apple Firestorm (A14/M1) Subconfig

- Use the same bulk kernel as Cortex-A53 / ThunderX2;
- Larger block size;
- Use gemmsup kernels for double precision.

commit 3df0e9b653fbb1293cad93010273eea579e753d9
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat Jul 17 04:21:53 2021 +0900

Arm64 8x4 Kernel Use Less Regs

commit 4e7e225057a05b9722ce65ddf75a9c31af9fbf36
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Jun 9 15:46:36 2021 +0900

Armv8-A Supplementary GEMMSUP Sizes for RD

commit c792d506ba09530395c439051727631fd164f59a
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat Jun 5 04:20:24 2021 +0900

Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm

Suffixed NEON opcode is not supported by GNU assembler

commit ce4473520975c2c8790c82c65a69d75f8ad758ea
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat Jun 5 04:08:14 2021 +0900

Armv8-A Adjust Types for PACKM Kernels

GCC does not have full NEON intrinsics support.

commit 8a32d19af85b61af92fcab1c316fb3be1a8d42ce
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat Jun 5 03:31:30 2021 +0900

Armv8-A GEMMSUP-RD 6x8m

Armv8-A now has a complete set of GEMMSUP kernels.

commit afd0fa6ad1889ed073f781c8aa8635f99e76b601
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat Jun 5 01:19:01 2021 +0900

Armv8-A GEMMSUP-RD 6x8n

commit 3c5f7405148ab142dee565d00da331d95a7a07b9
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Fri Jun 4 21:50:51 2021 +0900

Armv8-A s/d Packing Kernels Fix Typo

For GCC.

commit 49b05df7929ec3abc0d27b475d2d406116fe2682
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Fri Jun 4 18:04:59 2021 +0900

Armv8-A Introduced s/d Packing Kernels

Sizes according to the 2014 kernels.

commit c3faf93168c3371ff48a2d40d597bdb27021cad4
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Jun 3 23:09:05 2021 +0900

Armv8-A DGEMMSUP 6x8m Kernel

Recommended kernels set:
...
BLIS_RRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
BLIS_RCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
BLIS_RCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
BLIS_CRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
BLIS_CCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
BLIS_CCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
...
bli_blksz_init ( &blkszs[ BLIS_MR ], -1, 6, -1, -1,
-1, 8, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_NR ], -1, 8, -1, -1 );
...

commit 3efe707b5500954941061d4c2363d6ed41d17233
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Jun 3 17:20:57 2021 +0900

Armv8-A DGEMMSUP Adjustments

commit 8ed8f5e625de9b77a0f14883283effe79af01771
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Jun 3 16:37:37 2021 +0900

Armv8-A Add More DGEMMSUP

- Add 6x8 GEMMSUP.
- Adjust prefetching.
- Workaround for Clang's inability to handle reg clobbering.
- Subproduct 6x8 row-major GEMM <- incomplete.

commit a9ba79ea14de3b5a271e5970cb473d3c52e2fa5f
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Jun 2 15:04:29 2021 +0900

Armv8-A Add GEMMSUP 4x8n Kernel

- Compile w/ both GCC & Clang.
- Edge cases use ref-kernels.
- Can give performance boost in some contexts.

commit df40efe8fbfd399d76c6000ec03791a9b76ffbdf
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Jun 2 00:04:20 2021 +0900

Armv8-A Add Part of GEMMSUP 8x4m Kernel

- Compile w/ both GCC & Clang
- Only block part is implemented. Edge cases WIP
- Not Optimal kernel scheme. Should do 4x8 instead

commit 66399992881316514f64d68ec9eb60a87d53f674
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat May 29 05:52:05 2021 +0900

Armv8A DGEMM 4x4 Kernel WIP. Slow

Quite slow.

commit a29c16394ccef02d29141c79b71fb408e20073e6
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat May 29 04:58:45 2021 +0900

Armv8-A Add 8x4 Kernel WIP

Test result: a bit lower GFLOPs than 6x8.

commit 64a1f786d58001284aa4f7faf9fae17f0be7a018
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Aug 11 17:53:12 2021 -0500

Implement proposed new function pointer fields for obj_t.

The added fields:
1. `pack_t schema`: storing the pack schema on the object allows the macrokernel to act accordingly without side-channel information from the rntm_t and cntx_t. The pack schema and "pack_[ab]" fields could be removed from those structs.
2. `void* user_data`: this field can be used to store any sort of additional information provided by the user. The pointer is propagated to submatrix objects and copies, but is otherwise ignored by the framework and the default implementations of the following three fields. User-specified pack, kernel, or ukr functions can do whatever they want with the data, and the user is 100% responsible for allocating, assigning, and freeing this buffer.
3. `obj_pack_fn_t pack`: the function called when a matrix is packed. This function receives the expected arguments, as well as a mdim_t and mem_t* as memory must be allocated inside this function, and behavior may differ based on which matrix is being packed (i.e. transposition for B). This could also be achieved by passing a desired pack schema, but this would require additional information to travel down the control tree.
4. `obj_ker_fn_t ker`: the function called when we get to the "second loop", or the macro-kernel. Behavior may depend on the pack schemas of the input matrices. The default implementation would perform the inner two loops around the ukr, and then call either the default ukr or a user-supplied one (next field).
5. `obj_ukr_fn_t ukr`: the function called by the default macrokernel. This would replace the various current "virtual" microkernels, and could also be used to supply user-defined behavior. Users could supply both a custom kernel (above) and microkernel, although the user-specified kernel does **not** necessarily have to call the ukr function specified on the obj_t.

Note that no macros or functions for accessing these new fields have been defined yet. That is next once these are finalized. Addresses https://github.com/flame/blis/projects/1#card-62357687.
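
As a rough illustration of fields 2 and 5 only, the following sketch shows an object carrying a user data pointer and a microkernel function pointer, with a default macrokernel that dispatches through the latter. All type and function names (`sketch_obj`, `sketch_ukr_fn`, and so on) are hypothetical and heavily simplified; they do not reflect the real obj_t layout or the eventual function signatures.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified types: not the real obj_t layout. */
struct sketch_obj;
typedef void (*sketch_ukr_fn)(const struct sketch_obj* c,
                              double* tile, size_t n);

typedef struct sketch_obj
{
    void*         user_data; /* allocated, assigned, freed by the user */
    sketch_ukr_fn ukr;       /* called by the default macrokernel      */
} sketch_obj;

/* Default microkernel stand-in: ignores user_data entirely. */
static void sketch_default_ukr(const sketch_obj* c, double* tile, size_t n)
{
    (void)c;
    for (size_t i = 0; i < n; ++i)
        tile[i] += 1.0;
}

/* User microkernel stand-in: reads a scale factor from user_data. */
static void sketch_user_ukr(const sketch_obj* c, double* tile, size_t n)
{
    const double scale = *(const double*)c->user_data;
    for (size_t i = 0; i < n; ++i)
        tile[i] *= scale;
}

/* Default macrokernel stand-in: dispatches through the ukr field. */
static void sketch_macrokernel(const sketch_obj* c, double* tile, size_t n)
{
    c->ukr(c, tile, n);
}
```

In this sketch the framework-side code never inspects `user_data`; only the user-supplied function interprets it, which mirrors the ownership rule stated in item 2.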

commit a32257eeab2e9946e71546a05a1847a39341ec6b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 5 16:23:02 2021 -0500

Fixed bli_init.c compile-time error on OSX clang.

Details:
- Fixed a compile-time error in bli_init.c when compiling with OSX's
clang. This error was introduced in 868b901, which introduced a
post-declaration struct assignment where the RHS was a struct
initialization expression (i.e. { ... }). This use of struct
initializer expressions apparently works with gcc despite it not
being strict C99. The fix included in this commit declares a temporary
variable for the purposes of being initialized to the desired value,
via the struct initializer, and then copies the temporary struct (via
'=' struct assignment) to the persistent struct. Thanks to Devin
Matthews for his help with this.
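
The fix pattern described above can be sketched as follows. The type `sketch_ctl_t` and the variable names are hypothetical; the actual code in bli_init.c operates on different types.

```c
/* Hypothetical struct type standing in for the real control variable. */
typedef struct { int a; int b; } sketch_ctl_t;

static sketch_ctl_t sketch_state = { 0, 0 };

static void sketch_reset(void)
{
    /* Not valid C: a brace initializer may only appear in a declaration.
       sketch_state = { 1, 2 };                                          */

    /* Portable form: initialize a temporary, then assign the struct. */
    const sketch_ctl_t fresh = { 1, 2 };
    sketch_state = fresh;
}
```

Plain struct assignment with '=' is standard C, so this form should not depend on how any particular initializer macro expands.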

commit c8728cfbd19ecde9d43af05829e00bcfe7d86eed
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 5 15:17:09 2021 -0500

Fixed configure breakage on OSX clang.

Details:
- Accept either 'clang' or 'LLVM' in vendor string when grepping for
the version number (after determining that we're working with clang).
Thanks to Devin Matthews for this fix.

commit 868b90138e64c873c780d9df14150d2a370a7a42
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Aug 4 18:31:01 2021 -0500

Fixed one-time use property of bli_init() (525).

Details:
- Fixes a rather obvious bug that resulted in segmentation fault
whenever the calling application tried to re-initialize BLIS after
its first init/finalize cycle. The bug resulted from the fact that
the bli_init.c APIs made no effort to allow bli_init() to be called
subsequent times at all due to it, and bli_finalize(), being
implemented in terms of pthread_once(). This has been fixed by
resetting the pthread_once_t control variable for initialization
at the end of bli_finalize_apis(), and by resetting the control
variable for finalization at the end of bli_init_apis(). Thanks to
lschork2 for reporting this issue (525), and to Minh Quan Ho and
Devin Matthews for suggesting the chosen solution.
- CREDITS file update.
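
A minimal sketch of the re-arming idea, assuming two control variables that reset each other. The names below are hypothetical and are not the actual BLIS symbols. Note that reassigning a pthread_once_t after first use goes beyond what POSIX explicitly describes, so this should be read as an illustration of the approach named in the commit rather than a portable guarantee.

```c
#include <pthread.h>

/* Hypothetical names; these are not the actual BLIS symbols. */
static pthread_once_t sketch_once_init     = PTHREAD_ONCE_INIT;
static pthread_once_t sketch_once_finalize = PTHREAD_ONCE_INIT;
static int sketch_init_calls     = 0;
static int sketch_finalize_calls = 0;

static void sketch_init_apis(void)
{
    const pthread_once_t fresh = PTHREAD_ONCE_INIT;
    sketch_init_calls++;
    /* Re-arm finalization so that a later finalize runs again. */
    sketch_once_finalize = fresh;
}

static void sketch_finalize_apis(void)
{
    const pthread_once_t fresh = PTHREAD_ONCE_INIT;
    sketch_finalize_calls++;
    /* Re-arm initialization so that a later init runs again. */
    sketch_once_init = fresh;
}

static void sketch_init(void)
{
    pthread_once(&sketch_once_init, sketch_init_apis);
}

static void sketch_finalize(void)
{
    pthread_once(&sketch_once_finalize, sketch_finalize_apis);
}
```

Calling `sketch_init()` twice in a row runs the body once; after `sketch_finalize()`, a further `sketch_init()` runs it again, which is the init/finalize cycle the fix is meant to allow.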

commit 8dba1e752c6846a85dea50907135bbc5cbc54ee5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 27 12:38:24 2021 -0500

CREDITS file update.

commit cc9206df667b7c710b57b190b8ad351176de53b8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 16 15:48:37 2021 -0500

Added Graviton2 Neoverse N1 performance results.

Details:
- Added single-threaded and multithreaded performance results to
docs/Performance.md. These results were gathered on a Graviton2
Neoverse N1 server. Special thanks to Nicholai Tukanov for
collecting these results via the Arm-HPC/AWS hackathon.
- Corrected what was supposed to be a temporary tweak to the legend
labels in test/3/octave/plot_l3_perf.m.

commit fab5c86d68137b59800715efb69214c0a7e458a7
Merge: 84f9dcd4 d073fc9a
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Jul 13 16:46:21 2021 -0500

Merge pull request 516 from nicholaiTukanov/p10-sandbox-rework

P10 sandbox rework

commit 84f9dcd449fa7a4cf4087fca8ec4ca0d10e9b801
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Jul 13 16:45:44 2021 -0500

Remove unnecessary windows/zen2 directory.

commit 21911d6ed3438ca4ba942d05851ba5d7e9835586
Merge: 17729cf4 689fa0f4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 9 18:10:46 2021 -0500

Merge branch 'dev'

commit 17729cf449919d1db9777cea5b65d2efc77e2692
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Jul 9 14:59:48 2021 -0500

Add vzeroupper to Haswell microkernels. (524)

Details:
- Added vzeroupper instruction to the end of all 'gemm' and 'gemmtrsm'
microkernels so as to avoid a performance penalty when mixing AVX
and SSE instructions. These vzeroupper instructions were once part
of the haswell kernels, but were inadvertently removed during a source
code shuffle some time ago when we were managing duplicate 'haswell'
and 'zen' kernel sets. Thanks to Devin Matthews for tracking this down
and re-inserting the missing instructions.

commit c9a7f59aa84daa54d8f8c771f1f1ef2bd8730da2
Merge: 75f03907 9a8e649c
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Jul 8 14:00:38 2021 -0500

Merge pull request 522 from flame/windows-avx512

Fix Win64 AVX512 bug.

commit 9a8e649c5ac89eba951bbee7136ca28aeb24d731
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Jul 7 15:23:57 2021 -0500

Fix Win64 AVX512 bug.

Use `-march=haswell` for kernels. Fixes 514.

commit 75f03907c58385b656c8bd35d111db245814a9f3
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Jul 7 15:44:11 2021 -0500

Add comment about make checkblas on Windows

[ci skip]

commit 4651583b1204a965e4aa672c7ad6de60f3ab1600
Merge: 69205ac2 174f7fc9
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Jul 7 01:11:20 2021 -0500

Merge pull request 520 from flame/travis-ci-install

Test installation in Travis CI

commit 69205ac266947723ad4d7bb028b7521fe5c76991
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 6 20:39:22 2021 -0500

CREDITS file update.

Details:
- Thanks to Chengguo Sun for submitting 515 (5ef7f68).
- Thanks to Andrew Wildman for submitting 519 (551c6b4).
- Whitespace update to configure (spaces to tabs).

commit 174f7fc9a11712c7bd1a61510bdc5c262b3e8e1f
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Jul 6 19:35:55 2021 -0500

Test installation in Travis CI

commit 551c6b4ee8cd9dd2e1d1b46c8dde09eb50b91b2c
Merge: 78eac6a0 f648df4e
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Jul 6 19:32:53 2021 -0500

Merge pull request 519 from awild82/oot_build_bugfix

Fix installation from out-of-tree builds

commit f648df4e5588f069b2db96f8be320ead0c1967ef
Author: Andrew Wildman <apw4uw.edu>
Date: Tue Jul 6 16:35:12 2021 -0700

Add symlink to blis.pc.in for out-of-tree builds

commit 78eac6a0ab78c995c3f4e46a9e87388b5c3e1af6
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Jul 6 11:05:43 2021 -0500

Revert "Always run `make check`."

This reverts commit a201a53440c51244739aaee20e3309b50121cc68.

commit a201a53440c51244739aaee20e3309b50121cc68
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Jul 5 21:39:18 2021 -0500

Always run `make check`.

I'm concerned that problems may lurk for `x86_64` builds on Windows which may be uncovered by a fuller `make check`.

commit 5ef7f684dc75fc707c82f919e0836615f90a2627
Merge: aaa10c87 ad6231cc
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Jul 5 21:35:07 2021 -0500

Merge pull request 515 from chengguosun/bug-fix

Fixed configure script bug.

commit ad6231cca3fc1e477752ecd31b1ee2323398a642
Author: sunchengguo <sunchengguohigon.com>
Date: Tue Jul 6 07:30:00 2021 -0400

Fixed configure script bug.
Details:
- Fixed a kernel list string substitution error by adding the function substitute_words to the configure script.
If the string contains both zen and zen2, and zen needs to be replaced with another string, then zen2
would also be incorrectly replaced.
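
The difference between substring and whole-word replacement can be sketched in C. The real fix lives in the configure shell script; the function `sketch_substitute_words` below is a hypothetical analogue written only to show the matching rule, not a port of that script.

```c
#include <stdio.h>
#include <string.h>

/* Replace every whole word equal to `from` with `to` in a
   space-separated list. Runs of spaces are collapsed to one space.
   Output is truncated if `out` is too small. */
static void sketch_substitute_words(const char* list, const char* from,
                                    const char* to,
                                    char* out, size_t out_size)
{
    size_t used = 0;
    if (out_size == 0) return;
    out[0] = '\0';
    list += strspn(list, " ");               /* skip leading spaces   */
    while (*list != '\0')
    {
        size_t      len  = strcspn(list, " "); /* current word length */
        const char* word = list;
        size_t      wlen = len;
        /* Match only when the whole word equals `from`. */
        if (len == strlen(from) && strncmp(list, from, len) == 0)
        {
            word = to;
            wlen = strlen(to);
        }
        int n = snprintf(out + used, out_size - used, "%s%.*s",
                         used ? " " : "", (int)wlen, word);
        if (n < 0 || (size_t)n >= out_size - used) return; /* truncated */
        used += (size_t)n;
        list += len;
        list += strspn(list, " ");           /* skip separators       */
    }
}
```

Because the comparison requires equal length as well as equal characters, a list entry such as zen2 is left alone when zen is the word being replaced.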

commit d073fc9acac9d702556cab9fbbb3a253eeb1f998
Author: nicholaiTukanov <nicholaitukanovgmail.com>
Date: Fri Jul 2 19:54:33 2021 -0500

Update POWER10.md

commit 907226c0af4afb6323b4e02be4f73f5fb89cddaf
Author: nicholaiTukanov <nicholaitukanovgmail.com>
Date: Fri Jul 2 19:47:18 2021 -0500

Rework POWER10 sandbox

- Add a testsuite for gathering performance (in GFLOPs) and measuring correctness for the POWER10 GEMM reduced precision/integer kernels.
- Reworked GENERIC_GEMM template to hardcode the cache parameters.
- Remove the kernel wrapper that allowed only matrices that weren't transposed or conjugated. However, the kernels still assume the matrices are not transposed. This wrapper was removed for performance reasons.
- Renamed and restructured files and functions for clarity.
- Edited the POWER10 document to reflect new changes.

commit aaa10c87e19449674a4ca30fa3b6392bb22c3a66
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jun 21 17:53:52 2021 -0500

Skip clearing temp microtile in gemmlike sandbox.

Details:
- Removed code from gemmlike sandbox files bls_gemm_bp_var1.c and
bls_gemm_bp_var2.c that initializes the elements of the temporary
microtile to zero. This code, introduced recently in 7f7d726, did
not actually fix any bug (despite that commit's log entry). The
microtile does not need to be initialized because it is completely
overwritten by a "beta = 0" invocation of gemm prior to it being
read. Any NaNs or Infs present at the outset would have no impact
on the output matrix C. Thanks to Devin Matthews for reminding me
of this.
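
The reasoning above relies on a "beta = 0" call storing its result without reading the old contents of the output. A minimal sketch, assuming a reference-style kernel with that convention (the name `sketch_ukr` and its signature are hypothetical, not a BLIS kernel):

```c
#include <stddef.h>

/* Hypothetical reference microkernel: C := beta*C + alpha*A*B for an
   m x n tile with inner dimension k; all matrices are row-major. */
static void sketch_ukr(size_t m, size_t n, size_t k, double alpha,
                       const double* a, const double* b,
                       double beta, double* c)
{
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
        {
            double ab = 0.0;
            for (size_t p = 0; p < k; ++p)
                ab += a[i*k + p] * b[p*n + j];
            /* With beta == 0 the old value of C is never read, so
               uninitialized contents (even NaN or Inf) cannot leak
               into the result. */
            c[i*n + j] = (beta == 0.0) ? alpha * ab
                                       : beta * c[i*n + j] + alpha * ab;
        }
}
```

Had the kernel instead computed `0.0 * c[...]` unconditionally, a NaN in the temporary tile would propagate, since 0 times NaN is NaN under IEEE 754 arithmetic.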

commit bc10a3f2ff518360c32bea825b3eb62a9e4c8a77
Merge: bf727636 6548ceba
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Jun 18 19:01:08 2021 -0500

Merge pull request 492 from flame/thunderx2-clang

Allow clang for ThunderX2 config

commit bf727636632a368f3247dc8ab1d4b6119e9c511a
Merge: e28f2a2d 5fc93e28
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Jun 18 18:59:43 2021 -0500

Merge pull request 506 from xrq-phys/arm64-mac

BLIS on Darwin_Aarch64

commit e28f2a2dfcff14e7094fce0b279b3a917b3ab98c
Merge: d10e05bb 56ffca6a
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Jun 15 19:35:07 2021 -0500

Merge pull request 513 from nicholaiTukanov/asm_warning_p9_fix

Fix assembler warning in POWER9 DGEMM

commit 56ffca6a9bc67432a7894298739895f406e5f467
Author: nicholai <nicholaiibm.com>
Date: Tue Jun 15 18:17:39 2021 -0500

Fix asm warning

commit 689fa0f40399bde1acc5367d6dd4e8fc4eb6f3ea
Merge: b683d01b d10e05bb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Jun 13 19:44:14 2021 -0500

Merge branch 'master' into dev

commit d10e05bbd1ce45ce2c0dfe5c64daae2633357b3f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Jun 13 19:36:16 2021 -0500

Sandbox header edits trigger full library rebuild.

Details:
- Adjusted the top-level Makefile so that any change to a sandbox header
file will result in blis.h being regenerated along with a full
recompilation of the library. Previously, sandbox files were omitted
from the list of header files that, when touched, could trigger a full
rebuild. Why was it like that previously? Because originally we only
envisioned using sandboxes to *replace* gemm, not augment the library
with new functionality. When replacing gemm, blis.h does not need to
contain any local sandbox definitions in order for the user to be able
to (indirectly) use that sandbox. But if you are adding functions to
the library, those functions need to be prototyped so the compiler
can perform type checking against the user's invocation of those new
functions. Thanks to Jeff Diamond for helping us discover this
deficiency in the build system.

commit 7c3eb44efaa762088c190bb820ef6a3c87db8f65
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Jun 2 11:28:22 2021 -0500

Add vhsubpd/vhsubpd.

Horizontal subtraction instructions added to bli_x86_asm_macros.h, currently unused [ci skip].

commit 7f7d72610c25f511ba8cd2a53be7b59bdb80f3f3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon May 31 16:50:18 2021 -0500

Fixed bugs in cpackm kernels, gemmlike code.

Details:
- Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and
bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the
kappa scalar was incorrectly loaded at an offset of 8 bytes (instead
of 4 bytes) from the real component. This was almost certainly a copy-
paste bug carried over from the corresponding zpackm kernels. Thanks to
Devin Matthews for bringing this to my attention.
- Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and
bls_gemm_bp_var2.c that initializes the elements of the temporary
microtile to zero. (This bug was never observed in output but rather
noticed analytically. It probably would have also manifested as
intermittent failures, this time involving edge cases.)
- Minor commented-out/disabled changes to testsuite/src/test_gemm.c
relating to debugging.
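
The offset mix-up in the first item can be checked with a small sketch. The struct names are hypothetical stand-ins for single- and double-precision complex types, assuming the usual layout of a real part followed immediately by an imaginary part.

```c
#include <stddef.h>

/* Hypothetical stand-ins for single- and double-precision complex. */
typedef struct { float  real; float  imag; } sketch_scomplex;
typedef struct { double real; double imag; } sketch_dcomplex;

/* Byte offset of the imaginary part relative to the real part. */
static size_t sketch_imag_offset_c(void)
{
    return offsetof(sketch_scomplex, imag);
}

static size_t sketch_imag_offset_z(void)
{
    return offsetof(sketch_dcomplex, imag);
}
```

On common platforms, where float is 4 bytes and double is 8 bytes, the two functions return 4 and 8 respectively, so assembly copied from a double-precision complex kernel would read the wrong location for the single-precision imaginary part.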

commit 5fc93e280614b4a21a9cff36cf873b4b9407285b
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat May 29 18:44:47 2021 +0900

Armv8A Rename Regs for Safe Darwin Compile

Avoid x18 use in FP32 kernel:
- C address lines x[18-26] renamed to x[19-27] (reg index +1)
- Original role of x27 fulfilled by x5 which is free after k-loop part.

FP64 does not require changing since x18 is not used there.

commit 9f4a4a3cfb2244e4024445e127dafd2a11f39fc5
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat May 29 17:21:28 2021 +0900

Armv8A Rename Regs for Clang Compile: FP32 Part

Roughly the same as 916e1fa, additionally with x15 clobbering removed.
- x15: Not used at all.

Compilation w/ Clang shows warning about x18 reservation, but
compilation itself is OK and all tests passed.

commit 916e1fa8be3cea0e3e2a4a7e8b00027ac2ee7780
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat May 29 16:46:52 2021 +0900

Armv8A Rename Regs for Clang Compile: FP64 Part

- x7, x8: Used to store address for Alpha and Beta.
As Alpha & Beta was not used in k-loops, use x0, x1 to load
Alpha & Beta's addresses after k-loops are completed, since A & B's
addresses are no longer needed there.
This "ldr [addr]; -> ldr val, [addr]" would not cause much performance
drawback since it is done outside k-loops and there are plenty of
instructions between Alpha & Beta's loading and usage.
- x9: Used to store cs_c. x9 is multiplied by 8 into x10 and not used
any longer. Directly loading cs_c into x10 and scaling by 8 spares
x9 straightforwardly.
- x11, x12: Not used at all. Simply remove from clobber list.
- x13: Like x9, loaded and scaled by 8 into x14, except that x13 is
also used in a conditional branch so that "cmp x13, 1" needs to be
modified into "cmp x14, 8" to completely free x13.
- x3, x4: Used to store next_a & next_b. Untouched in k-loops. Load
these addresses into x0 and x1 after Alpha & Beta are both loaded,
since then neither address of A/B nor address of Alpha/Beta is needed.

commit 7fabd896af773623ed01820a71bbff432e8a7d25
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat May 29 16:28:03 2021 +0900

Asm Flag Mingling for Darwin_Aarch64

Apple+Arm64 requires additional "tagging" of local symbols.

commit 213dce32d2eed8b7a38c6a3f6112072b0a89ecd0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri May 28 14:49:57 2021 -0500

Added a new 'gemmlike' sandbox.

Details:
- Added a new sandbox called 'gemmlike', which implements sequential and
multithreaded gemm in the style of gemmsup but also unconditionally
employs packing. The purpose of this sandbox is to
(1) avoid select abstractions, such as objects and control trees, in
order to allow readers to better understand how a real-world
implementation of high-performance gemm can be constructed;
(2) provide a starting point for expert users who wish to build
something that is gemm-like without "reinventing the wheel."
Thanks to Jeff Diamond, Tze Meng Low, Nicholai Tukanov, and Devangi
Parikh for requesting and inspiring this work.
- The functions defined in this sandbox currently use the "bls_" prefix
instead of "bli_" in order to avoid any symbol collisions in the main
library.
- The sandbox contains two variants, each of which implements gemm via a
block-panel algorithm. The only difference between the two is that
variant 1 calls the microkernel directly while variant 2 calls the
microkernel indirectly, via a function wrapper, which allows the edge
case handling to be abstracted away from the classic five loops.
- This sandbox implementation utilizes the conventional gemm microkernel
(not the skinny/unpacked gemmsup kernels).
- Updated some typos in the comments of a few files in the main
framework.
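
The wrapper idea behind variant 2 can be sketched as below. Everything here is a simplified assumption made for illustration: the tile sizes, the names `sk_ukr` and `sk_ukr_wrap`, the row-major panel layout, and the requirement that the packed panels are zero-padded to full size. It is not the sandbox's actual code.

```c
#include <stddef.h>

#define SK_MR ((size_t)4)   /* hypothetical microtile rows    */
#define SK_NR ((size_t)4)   /* hypothetical microtile columns */

/* Stand-in microkernel: always computes a full SK_MR x SK_NR tile,
   C := beta*C + A*B, with A (SK_MR x k) and B (k x SK_NR) stored
   row-major and C stored with row stride rs_c. */
static void sk_ukr(size_t k, const double* a, const double* b,
                   double beta, double* c, size_t rs_c)
{
    for (size_t i = 0; i < SK_MR; ++i)
        for (size_t j = 0; j < SK_NR; ++j)
        {
            double ab = 0.0;
            for (size_t p = 0; p < k; ++p)
                ab += a[i*k + p] * b[p*SK_NR + j];
            c[i*rs_c + j] = (beta == 0.0) ? ab
                                          : beta * c[i*rs_c + j] + ab;
        }
}

/* Wrapper in the spirit of variant 2: full tiles go straight to the
   microkernel; edge tiles (m < SK_MR or n < SK_NR) are computed into
   a temporary microtile and only the valid m x n part reaches C.
   Assumes a and b are packed and zero-padded to full panel size. */
static void sk_ukr_wrap(size_t m, size_t n, size_t k,
                        const double* a, const double* b,
                        double beta, double* c, size_t rs_c)
{
    if (m == SK_MR && n == SK_NR)
    {
        sk_ukr(k, a, b, beta, c, rs_c);
        return;
    }

    double ct[SK_MR * SK_NR];            /* temporary microtile       */
    sk_ukr(k, a, b, 0.0, ct, SK_NR);     /* beta = 0 overwrites ct    */

    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            c[i*rs_c + j] = (beta == 0.0)
                          ? ct[i*SK_NR + j]
                          : beta * c[i*rs_c + j] + ct[i*SK_NR + j];
}
```

With the edge handling inside the wrapper, the loops that call it do not need separate branches for partial tiles, which appears to be the abstraction the commit describes.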

commit 82af05f54c34526a60fd2ec46656f13e1ac8f719
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 25 15:25:08 2021 -0500

Updated Fugaku (a64fx) performance results.

Details:
- Updated the performance graphs (pdfs and pngs) for the Fugaku/a64fx
entry within Performance.md, and also updated the experiment details
accordingly. Thanks to RuQing Xu for re-running the BLIS and SSL2
experiments reflected in this commit.
- In Performance.md, added an English translation of the project name
under which the Fugaku results were gathered, courtesy of RuQing Xu.

commit e5c85da3763f73854ecd739ba3008bb467ed77c3
Merge: cbd8d393 5feb04e2
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon May 24 16:56:22 2021 -0500

Merge pull request 503 from flame/windows-compiler-check

Add explicit compiler check for Windows.

commit cbd8d3932599485727204479fded66ac19186db4
Merge: 6d4ab022 932dfe6a
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon May 24 16:32:42 2021 -0500

Merge pull request 500 from xrq-phys/armsve+travis

Upgrade Travis CI for Arm SVE

commit 5feb04e233e1e6f81c727578ad9eae1367a2562f
Author: Devin Matthews <damatthewssmu.edu>
Date: Sun May 23 18:46:56 2021 -0500

Add explicit compiler check for Windows.

Check the C compiler for a predefined macro `_WIN32` to indicate (cross-)compilation for Windows. Fixes 463.

commit 6d4ab0223d9014ac2a66d66759536aa305be5867
Merge: 61584ded 859fb77a
Author: Devin Matthews <damatthewssmu.edu>
Date: Sun May 23 18:39:53 2021 -0500

Merge pull request 502 from flame/rm-rm-dupls

Remove `rm-dupls` function in common.mk.

commit 859fb77a320a3ace71d25a8885c23639b097a1b6
Author: Devin Matthews <damatthewssmu.edu>
Date: Sun May 23 18:15:23 2021 -0500

Remove `rm-dupls` function in common.mk.

AMD requested removal due to unclear licensing terms; the original code was from Stack Overflow. The function is unused but could easily be replaced by a new implementation.

commit 932dfe6abb9617223bd26a249e53447169033f8c
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu May 20 02:07:31 2021 +0900

Travis CI Revert Unnecessary Extras from 91d3636

- Removed `V=1` in make line
- Removed `CFLAGS` in configure line
- Restored `pwd` surrounding OOT line

commit bd156a210d347a073a6939cc4adab3d9256c2e2b
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sun May 16 02:56:14 2021 +0900

Adjust TravisCI

- ArmSVE: don't test gemmt (seems to be a QEMU-only problem);
- Clang: use the TravisCI-provided version instead of pinning to clang-8,
since clang-8 seems to conflict with TravisCI's clang-7.

commit 91d3636031021af3712d14c9fcb1eb34b6fe2a31
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat May 15 17:05:16 2021 +0900

Travis Support Arm SVE

- Updated distro to 20.04 focal aarch64-gcc-10.
This is the minimal version required by aarch64-gcc-10.
SVE intrinsics would not compile without GCC >=10.
- x86 toolchains use official repo instead of ubuntu-toolchain-r/test.
20.04 focal is not supported by that PPA at the moment.
- Add extra configuration-time options to .travis.yml.
- Add Arm SVE entry to .travis.yml.

commit 61584deddf9b3af6d11a811e6e04328d22390202
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed May 19 23:52:29 2021 +0900

Added 512b SVE-based a64fx subconfig + SVE kernels.

Details:
- Added 512-bit specific 'a64fx' subconfiguration that uses empirically
tuned block size by Stepan Nassyr. This subconfig also sets the sector
cache size and enables memory-tagging code in SVE gemm kernels. This
subconfig utilizes (16, k) and (10, k) DPACKM kernels.
- Added a vector-length agnostic 'armsve' subconfiguration that computes
blocksizes according to the analytical model. This part is ported from
Stepan Nassyr's repository.
- Implemented vector-length-agnostic [d/s/sh] gemm kernels for Arm SVE
at size (2*VL, 10). These kernels use unindexed FMLA instructions
because indexed FMLA takes 2 FMA units in many implementations.
PS: There are indexed-FMLA kernels in Stepan Nassyr's repository.
- Implemented 512-bit SVE dpackm kernels with in-register transpose
support for sizes (16, k) and (10, k).
- Extended 256-bit SVE dpackm kernels by Linaro Ltd. to 512-bit for
size (12, k). This dpackm kernel is not currently used by any
subconfiguration.
- Implemented several experimental dgemmsup kernels that improve
performance in a few cases. However, those dgemmsup kernels
generally underperform, so they are not currently used in any
subconfig.
- Note: This commit squashes several commits submitted by RuQing Xu via
PR 424.

commit b683d01b9c4ea5f64c8031bda816beccfbf806a0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu May 13 15:23:22 2021 -0500

Use extra undef when including ba/ex API headers.

Details:
- Inserted a "include bli_xapi_undef.h" after each usage of the basic
and expert API macro setup headers: bli_oapi_ba.h, bli_oapi_ex.h,
bli_tapi_ba.h, and bli_tapi_ex.h. This is functionally equivalent to
the previous status quo, in which each header made minimal undef
prior to its own definitions and then a single instance of
"include bli_xapi_undef.h" cleaned up any remaining macro defs after
all other headers were used. This commit will guarantee that macro
defs from the setup of one header (say, bli_oapi_ex.h) don't "infect"
the definitions made in a subsequent header. As with the previous
commit, this change does not fix any issue but rather attempts to
avoid creating orphaned macro definitions that are only needed within
a very limited scope.
- Removed minimal undef from bli_?api_[ba|ex].h.
- Removed old commented-out lines from bli_?api_[ba|ex].h.

commit d4427a5b2f5cab5d2a64c58d87416628867c2b4a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu May 13 13:55:11 2021 -0500

Minor preprocessor/header cleanup.

Details:
- Added frame/include/bli_xapi_undef.h, which explicitly undefines all
macros defined in bli_oapi_ba.h, bli_oapi_ex.h, bli_tapi_ba.h, and
bli_tapi_ex.h. (This is for safety and good cpp coding practice, not
because it fixes anything.)
- Added include "bli_xapi_undef.h" to bli_l1v.h, bli_l1d.h, bli_l1f.h,
bli_l1m.h, bli_l2.h, bli_l3.h, and bli_util.h.
- Comment updates to bli_oapi_ba.h, bli_oapi_ex.h, bli_tapi_ba.h, and
bli_tapi_ex.h.
- Moved frame/3/bli_l3_ft_ex.h to local 'old' directory after realizing
that nothing in BLIS used those function pointer types. Also commented
out the "include bli_l3_ft_ex.h" directive in frame/3/bli_l3.h.

commit 5aa63cd927b22a04e581b07d0b68ef391f4f9b1f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed May 12 19:53:35 2021 -0500

Fixed typo in cpp guard in bli_util_ft.h.

Details:
- Changed ifdef BLIS_OAPI_BASIC to ifdef BLIS_TAPI_BASIC in
bli_util_ft.h. This typo was causing some types to be redefined when
they weren't supposed to be.

commit f0e8634775094584e89f1b03811ee192f2aaf67f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed May 12 18:45:32 2021 -0500

Defined eqsc, eqv, eqm to test object equality.

Details:
- Defined eqsc, eqv, and eqm operations, which set a bool depending on
whether the two scalars, two vectors, or two matrix operands are equal
(element-wise). eqsc and eqv support implicit conjugation and eqm
supports diagonal offset, diag, uplo, and trans parameters (in a
manner consistent with other level-1m operations). These operations
are currently housed under frame/util, at least for now, because they
are not computational in nature.
- Redefined bli_obj_equals() in terms of eqsc, eqv, and eqm.
- Documented eqsc, eqv, and eqm in BLISObjectAPI.md and BLISTypedAPI.md.
Also:
- Documented getsc and setsc in both docs.
- Reordered entry for setijv in BLISTypedAPI.md, and added separator
bars to both docs.
- Added missing "Observed object properties" clauses to various
level-1v entries in BLISObjectAPI.md.
- Defined bli_apply_trans() in bli_param_macro_defs.h.
- Defined supporting _check() function, bli_l0_xxbsc_check(), in
bli_l0_check.c for eqsc.
- Programming style and whitespace updates to bli_l1m_unb_var1.c.
- Whitespace updates to bli_l0_oapi.c, bli_l1m_oapi.c
- Consolidated redundant macro redefinition for copym function pointer
type in bli_l1m_ft.h.
- Added macros to bli_oapi_ba.h, _ex.h, and bli_tapi_ba.h, _ex.h that
allow oapi and tapi source files to forego defining certain expert
functions. (Certain operations such as printv and printm do not need
to have both basic and expert interfaces. This also includes eqsc, eqv,
and eqm.)
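
Illustration (not part of the commit): an element-wise vector equality test with implicit conjugation might look like the sketch below. The type zsk_t and the function eqv_sketch() are hypothetical names and do not reproduce the BLIS typed API signature.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical double-precision complex type for this sketch. */
typedef struct { double real; double imag; } zsk_t;

/* Returns true when conj?(x) equals y element-wise. The conjugation is
   "implicit": x is never modified, the sign of its imaginary part is
   flipped only during the comparison. incx and incy are element strides. */
bool eqv_sketch(bool conj_x, size_t n,
                const zsk_t *x, ptrdiff_t incx,
                const zsk_t *y, ptrdiff_t incy)
{
    for (size_t i = 0; i < n; ++i) {
        const zsk_t *xi = x + (ptrdiff_t)i * incx;
        const zsk_t *yi = y + (ptrdiff_t)i * incy;
        double xi_imag = conj_x ? -xi->imag : xi->imag;

        if (xi->real != yi->real || xi_imag != yi->imag)
            return false;
    }
    return true;
}
```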

commit 5d46dbee4a06ba5a422e19817836976f8574cb4f
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed May 12 18:42:09 2021 -0500

Replace bli_dlamch with something less archaic (498)

Details:
- Added new implementations of bli_slamch() and bli_dlamch() that use
constants from the standard C library in lieu of dynamically-computed
values (via code inherited from netlib). The previous implementation
is still available when the cpp macro BLIS_ENABLE_LEGACY_LAMCH is
defined by the subconfiguration at compile-time. Thanks to Devin
Matthews for providing this patch, and to Stefano Zampini for
reporting the issue (497) that prompted Devin to propose the patch.
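
Illustration (not part of the commit): answering machine-parameter queries from standard C library constants could be sketched as below. The function name is hypothetical, and the mapping of query characters to <float.h> constants follows the LAPACK DLAMCH convention as an assumption; it is not copied from the new bli_dlamch().

```c
#include <float.h>

/* Hypothetical sketch: answers a few LAPACK-style queries from <float.h>
   constants instead of computing them at run time. */
double dlamch_sketch(char cmach)
{
    switch (cmach) {
    case 'E': case 'e': return DBL_EPSILON * 0.5;   /* eps, assuming round-to-nearest */
    case 'S': case 's': return DBL_MIN;             /* safe minimum (assumed equal to tiny) */
    case 'B': case 'b': return (double)FLT_RADIX;   /* base of the machine */
    case 'P': case 'p': return DBL_EPSILON * 0.5 * FLT_RADIX;  /* eps * base */
    case 'N': case 'n': return (double)DBL_MANT_DIG;  /* digits in the mantissa */
    case 'M': case 'm': return (double)DBL_MIN_EXP;   /* minimum exponent */
    case 'U': case 'u': return DBL_MIN;               /* underflow threshold */
    case 'L': case 'l': return (double)DBL_MAX_EXP;   /* largest exponent */
    case 'O': case 'o': return DBL_MAX;               /* overflow threshold */
    default:            return 0.0;                   /* unknown query */
    }
}
```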

commit 6a89c7d8f9ac3f51b5b4d8ccb2630d908d951e6f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat May 1 18:54:48 2021 -0500

Defined setijv, getijv to set/get vector elements.

Details:
- Defined getijv, setijv operations to get and set elements of a vector,
in bli_setgetijv.c and .h.
- Renamed bli_setgetij.c and .h to bli_setgetijm.c and .h, respectively.
- Added additional bounds checking to getijm and setijm to prevent
actions with negative indices.
- Added documentation to BLISObjectAPI.md and BLISTypedAPI.md for getijv
and setijv.
- Added documentation to BLISTypedAPI.md for getijm and setijm, which
were inadvertently missing.
- Added a new entry to the FAQ titled "Why does BLIS have vector
(level-1v) and matrix (level-1m) variations of most level-1
operations?"
- Comment updates.
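
Illustration (not part of the commit): bounds-checked element access for a strided vector can be sketched as below. The names sk_err_t, getijv_sketch() and setijv_sketch() are hypothetical; the real BLIS functions take different arguments.

```c
#include <stddef.h>

/* Hypothetical status codes for this sketch. */
typedef enum { SK_SUCCESS = 0, SK_FAILURE = -1 } sk_err_t;

/* Reads element i of an n-element vector with stride incx.
   Negative and too-large indices are rejected before any access. */
sk_err_t getijv_sketch(ptrdiff_t i, ptrdiff_t n,
                       const double *x, ptrdiff_t incx, double *val)
{
    if (i < 0 || i >= n)
        return SK_FAILURE;

    *val = x[i * incx];
    return SK_SUCCESS;
}

/* Writes element i of an n-element vector with stride incx,
   applying the same bounds check. */
sk_err_t setijv_sketch(double val, ptrdiff_t i, ptrdiff_t n,
                       double *x, ptrdiff_t incx)
{
    if (i < 0 || i >= n)
        return SK_FAILURE;

    x[i * incx] = val;
    return SK_SUCCESS;
}
```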

commit 4534daffd13ed7a8983c681d3f5e9de17c9f0b96
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 27 18:16:44 2021 -0500

Minor API breakage in bli_pack API.

Details:
- Changed bli_pack_get_pack_a() and bli_pack_get_pack_b() so that
instead of returning a bool, they set a bool that is passed in by
address. This does break the public exported API, but I expect very
few users actually use this function. (This change is being made in
preparation for a much more extensive commit relating to error
checking.)

commit 6a4aa986ffc060d3e64ed230afe318b82630f8b2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 23 13:10:01 2021 -0500

Fixed typo in Table of Contents.

commit f6424b5b82160d346a09a0fbb526981ecf66cdb3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 23 13:08:06 2021 -0500

Added dedicated Performance section to README.md.

Details:
- Spun off the Performance.md and PerformanceSmall.md links in the
Documentation section into a new Performance section dedicated to
those two links. (The previous entries remain redundantly listed
within Documentation section.) Thanks to Robert van de Geijn for
suggesting this change.

commit 40ce5fd241b9ad140bf57278d440f0598d7f15d8
Merge: 6280757b 1f3461a5
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Apr 21 09:54:25 2021 -0500

Merge pull request 493 from cassiersg/patch-1

Fix typo in FAQ.md

commit 1f3461a5a5a88510f913451a93e3190ec1556f39
Author: Gaëtan Cassiers <cassiersgusers.noreply.github.com>
Date: Wed Apr 21 16:49:05 2021 +0200

Fix typo in FAQ.md

commit 6548cebaf55a1f9bdb8417cc89dd0444d8f9c2e4
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Apr 14 13:00:42 2021 -0500

Allow clang for ThunderX2 config

Needed for compiling on e.g. Mac M1. AFAIK clang supports the same -mcpu flag for ThunderX2 as gcc.

commit 6280757be32f90fd77d8dd9357b07d9306e6f80d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 7 13:03:56 2021 -0500

Minor updates to a64fx section of Performance.md.

commit 1e6ed823c6cd11f9b671779f3c8bdbd2bbb40f34
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Apr 8 02:59:26 2021 +0900

Additional A64fx Comments (490)

* Performance.md Update A64fx Comments

- Reason for ARMPL's missing data;
- Additional envs / flags for kernel selection;
- Update BLIS SRC commit.

* Include Another Fix in armsve-cfg-vendor

A prototype was forgotten, so the void* pointer was not fully returned.

commit 2688f21a5b073950f6f187c95917fdbb5aac234a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 6 19:02:37 2021 -0500

Added Fujitsu A64fx (512-bit SVE) perf results.

Details:
- Added single-threaded and multithreaded performance results to
docs/Performance.md. These results were gathered on the "Fugaku"
Fujitsu A64fx supercomputer at the RIKEN Center for Computational
Science in Kobe, Japan. Special thanks to RuQing Xu and Stepan
Nassyr for their work in developing and optimizing A64fx support in
BLIS and RuQing for gathering the performance data that is reflected
in these new graphs.

commit ba3ba8da83d48397162139e11337c036a631ba79
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 6 18:39:58 2021 -0500

Minor updates and fixes to test/3/octave scripts.

Details:
- Fixed an issue where the wrong string was being passed in for the
vendor legend string.
- Changed the graph in which the legends appear.
- Updates to runthese.m.

commit 09bd4f4f12311131938baa9f75d27e92b664d681
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 31 17:09:36 2021 -0500

Add err_t* "return" parameter to malloc functions.

Details:
- Added an err_t* parameter to memory allocation functions including
bli_malloc_intl(), bli_calloc_intl(), bli_malloc_user(),
bli_fmalloc_align(), and bli_fmalloc_noalign(). Since these functions
already use the return value to return the allocated memory address,
they can't communicate errors to the caller through the return value.
This commit does not employ any error checking within these functions
or their callers, but this sets up BLIS for a more comprehensive
commit that moves in that direction.
- Moved the typedefs for malloc_ft and free_ft from bli_malloc.h to
bli_type_defs.h. This was done so that what remains of bli_malloc.h
can be included after the definition of the err_t enum. (This ordering
was needed because bli_malloc.h now contains function prototypes that
use err_t.)
- Defined bli_is_success() and bli_is_failure() static functions in
bli_param_macro_defs.h. These functions provide easy checks for error
codes and will be used more heavily in future commits.
- Unfortunately, the additional err_t* argument discussed above breaks
the API for bli_malloc_user(), which is an exported symbol in the
shared library. However, it's quite possible that the only application
that calls bli_malloc_user()--indeed, the reason it was marked for
symbol exporting to begin with--is the BLIS testsuite. And if that's
the case, this breakage won't affect anyone. Nonetheless, the "major"
part of the so_version file has been updated accordingly to 4.0.0.
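
Illustration (not part of the commit): the err_t* "return" parameter pattern can be sketched as below. All names are hypothetical. Unlike the commit, which only adds the parameter, this sketch also sets the error code, to show the direction such a change makes possible.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical status codes for this sketch. */
typedef enum { SK_SUCCESS = 0, SK_FAILURE = -1 } sk_err_t;

/* Easy checks for error codes, in the spirit of the helpers the
   commit describes. */
int sk_is_success(sk_err_t e) { return e == SK_SUCCESS; }
int sk_is_failure(sk_err_t e) { return e != SK_SUCCESS; }

/* The return value already carries the allocated address, so the
   status is reported through the extra r_val parameter instead. */
void *malloc_sketch(size_t size, sk_err_t *r_val)
{
    void *p = malloc(size);

    /* malloc(0) may legitimately return NULL, so only a NULL result
       for a nonzero request is treated as a failure. */
    *r_val = (p != NULL || size == 0) ? SK_SUCCESS : SK_FAILURE;
    return p;
}
```

A caller would declare an sk_err_t, pass its address, and test it with sk_is_failure() before using the returned pointer.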

commit f9ad55ce7e12f59930605753959fcfd41a218d8d
Merge: 04502492 90508192
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 31 14:20:19 2021 -0500

Merge branch 'master' into dev

commit 90508192f2d6ae95adc2a3ba9f4e5bad2c8d6fd2
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Mar 30 21:16:44 2021 -0500

Update do_sde.sh (489)

Update to a newer version of SDE, and do a direct download as it seems you don't have to click-through the license anymore.

commit 22c6b5dc4c9cc21942f8ccc30891f9b4385a9504
Author: Nicholai Tukanov <nicholaitukanovgmail.com>
Date: Tue Mar 30 19:07:42 2021 -0500

Fixed bug in power10 microkernel I/O. (488)

Details:
- Fixed a bug in the POWER10 DGEMM kernel whereby the microkernel did
not store the microtile result correctly due to incorrect indices
calculations. (The error was introduced when I reorganized the
'kernels/power10/3' directory.)

commit 04502492671456b94bcdee60b9de347b6763a32d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Mar 28 19:11:43 2021 -0500

Always stay initialized after BLAS compat calls.

Details:
- Removed the option to finalize BLIS after every BLAS call, which also
means that BLIS would initialize at the beginning of every BLAS call.
This option never really made sense and wasn't even implemented
properly to begin with. (Because bli_init_auto() and _finalize_auto()
were implemented in terms of bli_init_once() and _finalize_once(),
respectively, the application would have only been able to call one
BLAS routine before BLIS would find itself in an unusable, permanently
uninitialized state.) Because this option was never meant for regular
use, it never made it into configure as an actual configure-time
option, and therefore this commit only removes parts of the code
affected by the cpp macro guard BLIS_ENABLE_STAY_AUTO_INITIALIZED.

commit 3a6f41afb8197e831b6ce2f1ae7f63735685fa0a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Mar 27 17:22:14 2021 -0500

Renamed membrk files/vars/functions to pba.

Details:
- Renamed the files, variables, and functions relating to the packing
block allocator from its legacy name (membrk) to its current name
(pba). This more clearly contrasts the packing block allocator with
the small block allocator (sba).
- Fixed a typo in bli_pack_set_pack_b(), defined in bli_pack.c, that
caused the function to erroneously change the value of the pack_a
field of the global rntm_t instead of the pack_b field. (Apparently
nobody has used this API yet.)
- Comment updates.

commit 36cb4116d15cfef2d42ec4a834efd4a958f261b5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Mar 27 15:15:09 2021 -0500

Switch allocator mutexes to static initialization.

Details:
- Switched the small block allocator (sba), as defined in bli_sba.c and
bli_apool.c, to static initialization of its internal mutex. Did a
similar thing for the packing block allocator (pba), which appears as
global_membrk in bli_membrk.c.
- Commented out bli_membrk_init_mutex() and bli_membrk_finalize_mutex()
to ensure they won't be used in the future.
- In bli_thrcomm_pthreads.c and .h, removed old, commented-out cpp
blocks guarded by BLIS_USE_PTHREAD_MUTEX.

commit 159ca6f01a5f91b93513134c9470b69ff78f5354
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 24 15:57:32 2021 -0500

Made test/3/octave scripts robust to missing data.

Details:
- Modified the octave scripts in test/3 so that the script does not
choke when one or more of the expected OpenBLAS, Eigen, or vendor data
files is missing. (The BLIS data set, however, must be complete.) When
a file is missing, that data series is simply not included on that
particular graph. Also factored out a lot of the redundant logic from
plot_panel_4x5.m into a separate function in read_data.m.

commit 545e6c2f6d09d023b353002a9a43b11aa0c1d701
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 22 17:42:33 2021 -0500

CHANGELOG update (0.8.1)

0.8.1

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 22 17:42:33 2021 -0500

Version file update (0.8.1)

commit e56d9f2d94ed247696dda2cbf94d2ca05c7fc089
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 22 17:40:50 2021 -0500

ReleaseNotes.md update in advance of next version.

commit ca83f955d45814b7d84f53933cdb73323c0dea2c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 22 17:21:21 2021 -0500

CREDITS file update.

commit 57ef61f6cdb86957f67212aa59407f2f8e7f3d1a
Merge: bf1b578e e7a4a8ed
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Mar 19 13:05:43 2021 -0500

Merge branch 'master' of github.com:flame/blis

commit bf1b578ea32ea1c9dbf7cb3586969e8ae89aa5ef
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Mar 19 13:03:17 2021 -0500

Reduced KC on skx from 384 to 256.

Details:
- Reduced the KC cache blocksize for double real on the skx subconfig
from 384 to 256. The maximum (extended) KC was also reduced
accordingly from 480 to 320. Thanks to Tze Meng Low for suggesting
this change.

commit e7a4a8edc940942357e8e4c4594383a29a962f93
Author: Nicholai Tukanov <nicholaitukanovgmail.com>
Date: Wed Mar 17 19:43:31 2021 -0500

Fix calculation of new pb size (487)

Details:
- Added missing parentheses to the i8 and i4 instantiations of the
GENERIC_GEMM macro in sandbox/power10/generic_gemm.c.
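
Illustration (not part of the commit): the log does not show which parentheses were missing, so the sketch below only demonstrates the general hazard of unparenthesized macro arguments. Both macros are hypothetical and are not the GENERIC_GEMM code.

```c
/* Hypothetical size macros. The unsafe version pastes its arguments
   into the expression without parentheses. */
#define PB_SIZE_UNSAFE(k, pack) k / pack
#define PB_SIZE_SAFE(k, pack)   ((k) / (pack))

/* Expands to k + 3 / 4, which is k + 0 because division binds tighter. */
int pb_unsafe(int k) { return PB_SIZE_UNSAFE(k + 3, 4); }

/* Expands to ((k + 3) / (4)), the intended rounded-up quotient. */
int pb_safe(int k)   { return PB_SIZE_SAFE(k + 3, 4); }
```

For k = 9 the unsafe version yields 9 while the safe version yields 3.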

commit 4493cf516e01aba82642a43abe350943ba458fe2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 15 13:12:49 2021 -0500

Redefined BLIS_NUM_ARCHS to update automatically.

Details:
- Changed BLIS_NUM_ARCHS from a cpp macro definition to the last enum
value in the arch_t enum. This means that it no longer needs to get
updated manually whenever new subconfigurations are added to BLIS.
Also removed the explicit initial index assignment of 0 from the
first enum value, which was unnecessary due to how the C language
standard mandates indexing of enum values. Thanks to Devin Matthews
for originally submitting this as a PR in 446.
- Updated docs/ConfigurationHowTo.md to reflect the aforementioned
change.
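
Illustration (not part of the commit): the "count as last enum value" idiom can be sketched as below. The enumerator names are hypothetical placeholders, not the real arch_t entries.

```c
/* The first enumerator is 0 by the C standard, and each following one
   is the previous value plus one, so the last entry equals the number
   of entries before it. */
typedef enum {
    ARCH_SK_A,
    ARCH_SK_B,
    ARCH_SK_C,
    ARCH_SK_GENERIC,
    NUM_ARCHS_SK      /* keep last: updates itself when entries are added */
} arch_sk_t;

/* A table sized by the count never needs a manually updated length. */
static const char *arch_names_sk[NUM_ARCHS_SK] = { "a", "b", "c", "generic" };

const char *arch_name_sketch(arch_sk_t id)
{
    return ((int)id >= 0 && (int)id < NUM_ARCHS_SK) ? arch_names_sk[id]
                                                    : "unknown";
}
```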

commit a4b73de84cdffcbe5cf71969a0f7f0f8202b3510
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Mar 12 17:12:27 2021 -0600

Disabled _self() and _equal() in bli_pthread API.

Details:
- Disabled the _self() and _equal() extensions to the bli_pthread API
introduced in d479654. These functions were disabled after I realized
that they aren't actually needed yet. Thanks to Devin Matthews for
helping me reason through the appropriate consumer code that will
appear in BLIS (eventually) in a future commit. (Also, I could never
get the Windows branch to link properly in clang builds in AppVeyor.
See the comment I left in the code, and 485, for more info.)

commit f9d604679d8715bc3e79a8630268446889b51388
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 11 16:57:55 2021 -0600

Added _self() and _equal() to bli_pthread API.

Details:
- Expanded the bli_pthread API to include equivalents to pthread_self()
and pthread_equal(). Implemented these two functions for all three cpp
branches present within bli_pthread.c: systemless, Windows, and
Linux/BSD.

commit fa9b3c8f6b3d5717f19832362104413e1a86dfb0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 11 15:13:51 2021 -0600

Shuffled code in Windows branch of bli_pthreads.c.

Details:
- Reordered the definitions in the cpp branch in bli_pthreads.c that
defines the bli_pthreads API in terms of Windows API calls. Also added
missing comments that mark sections of the API, which brings the code
into harmony with other cpp branches (as well as bli_pthread.h).

commit 95d4f3934d806b3563f6648d57a4e381d747caf5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 11 13:50:40 2021 -0600

Moved cpp macro redef of strerror_r to bli_env.c.

Details:
- Relocated the _MSC_VER-guarded cpp macro re-definition of strerror_r
(in terms of strerror_s) from bli_thread.h to bli_env.c. It was
likely left behind in bli_thread.h in a previous commit, when code
that now resides in bli_env.c was moved from bli_thread.c. (I couldn't
find any other instance of strerror_r being used in BLIS, so I moved
the define directly to bli_env.c rather than place it in bli_env.h.)
The code that uses strerror_r is currently disabled, though, so this
commit should have no effect on BLIS.

commit 8a3066c315358d45d4f5b710c54594455f9e8fc6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 9 17:52:59 2021 -0600

Relocated gemmsup_ref general stride handling.

Details:
- Moved the logic that checks for general stridedness in any of the
matrix operands in a gemmsup problem. The logic previously resided
near the top of bli_gemmsup_int(), which is the thread entry point
for the parallel region of the current gemmsup implementation. The
problem with this setup was that the code would attempt to reject
problems with any general-strided operands by returning BLIS_FAILURE,
and that return value was then being ignored by the l3_sup thread
decorator, which unconditionally returns BLIS_SUCCESS. To solve this
issue, rather than try to manage n return values, one from each of n
threads, I simply moved the logic into bli_gemmsup_ref(). I didn't
move it any higher (e.g. bli_gemmsup()) because I still want the
logic to be part of the current gemmsup handler implementation. That
is, perhaps someone else will create a different handler, and that
author wants to handle general stride differently. (We don't want to
force them into a particular way of handling general stride.)
- Removed the general stride handling from bli_gemmtsup_int(), even
though this function is inoperative for now.
- This commit addresses issue 484. Thanks to RuQing Xu for reporting
this issue.

commit 670bc7b60f6065893e8ec1bebd2fc9e5ba710dff
Author: Nicholai Tukanov <nicholaitukanovgmail.com>
Date: Fri Mar 5 13:53:43 2021 -0600

Add low-precision POWER10 gemm kernels (467)

Details:
- This commit adds a new BLIS sandbox that (1) provides implementations
based on low-precision gemm kernels, and (2) extends the BLIS typed
API for those new implementations. Currently, these new kernels can
only be used for the POWER10 microarchitecture; however, they may
provide a template for developing similar kernels for other
microarchitectures (even those beyond POWER), as changes would likely
be limited to select places in the microkernel and possibly the
packing routines. The new low-precision operations that are now
supported include: shgemm, sbgemm, i16gemm, i8gemm, i4gemm. For more
information, refer to the POWER10.md document that is included in
'sandbox/power10'.

commit b8dcc5bc75a746807d6f8fa22dc2123c98396bf5
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Tue Mar 2 06:58:24 2021 +0800

Fixed typed API definition for gemmt (476)

Details:
- Fixed incorrect definition and prototype of bli_?gemmt() in
frame/3/bli_l3_tapi.c and .h, respectively. gemmt was previously
defined identically to gemm, which was wrong because it did not
take into account the uplo property of C.
- Fixed incorrect API documentation for her2k/syr2k in BLISTypedAPI.md.
Specifically, the document erroneously listed only a single transab
parameter instead of transa and transb.
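
Illustration (not part of the commit): the role of the uplo property of C in gemmt can be sketched with the reference loop below. The function is hypothetical and omits alpha, beta, and the trans parameters; it only shows that, unlike gemm, just one triangle of C is updated.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reference loop: C := C + A * B for a square m x m
   matrix C, where only the lower or upper triangle (diagonal included)
   is updated. A is m x k, B is k x m, all column-major. */
void gemmt_sketch(bool lower, size_t m, size_t k,
                  const double *a, size_t lda,
                  const double *b, size_t ldb,
                  double *c, size_t ldc)
{
    for (size_t j = 0; j < m; ++j) {
        for (size_t i = 0; i < m; ++i) {
            bool stored = lower ? (i >= j) : (i <= j);
            if (!stored)
                continue;  /* the other triangle is left untouched */

            double acc = 0.0;
            for (size_t p = 0; p < k; ++p)
                acc += a[i + p * lda] * b[p + j * ldb];
            c[i + j * ldc] += acc;
        }
    }
}
```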

commit a0e4fe2340a93521e1b1a835a96d0f26dec8406a
Author: Ilknur <ilknuri607gmail.com>
Date: Tue Mar 2 02:06:56 2021 +0400

Fixed double free() in level1v example (482)

Details:
- In examples/tapi/00level1v.c, pointer 'z' was being freed twice and
pointer 'a' was not being freed at all. This commit correctly frees
each pointer exactly once.

commit f5871c7e06a75799251d6b55a8a5fbfa1a92cf95
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Feb 28 17:03:57 2021 -0600

Added complex asm packm kernels for 'haswell' set.

Details:
- Implemented assembly-based packm kernels for single- and double-
precision complex domain (c and z) and housed them in the 'haswell'
kernel set. This means c3xk, c8xk, z3xk, and z4xk are now all
optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
and zen2 subconfigs.
- Minor modifications to the corresponding s and d packm kernels that
were introduced in 426ad67.
- Thanks to AMD, who originally contributed the double-precision real
packm kernels (d6xk and d8xk), upon which these complex kernels are
partially based.

commit 426ad679f55264e381eb57a372632b774320fb85
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Feb 27 18:39:56 2021 -0600

Added assembly packm kernels for 'haswell' set.

Details:
- Implemented assembly-based packm kernels for single- and double-
precision real domain (s and d) and housed them in the 'haswell'
kernel set. This means s6xk, s16xk, d6xk, and d8xk are now all
optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
and zen2 subconfigs.
- Thanks to AMD, who originally contributed the double-precision real
packm kernels (d6xk and d8xk), which I have now tweaked and used to
create comparable single-precision real kernels (s6xk and s16xk).

commit f50c1b7e5886d29efe134e1994d05af9949cd4b6
Merge: 8f39aea1 b3953b93
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Feb 1 11:55:51 2021 -0600

Merge pull request 473 from ajaypanyala/pkgconfig

build: generate pkgconfig file

commit 8f39aea11f80a805b66cff4b4dc5e72727ea461d
Merge: f8db9fb3 2a815d5b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Jan 30 17:59:56 2021 -0600

Merge branch 'dev'

commit f8db9fb33b48844d6b47fdef699625bd9197745a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jan 28 08:04:52 2021 -0600

Fixed missing parentheses in README.md Citations.

commit b3953b938eee59f79b4a4162ba583a5cb59fa34e
Author: Ajay Panyala <ajay.panyalagmail.com>
Date: Tue Jan 12 17:07:04 2021 -0800

drop CFLAGS in the generated pkgconfig file

commit b02d9376bac31c1a1c7916f44c4946277a1425e2
Author: Ajay Panyala <ajay.panyalagmail.com>
Date: Mon Jan 11 20:50:01 2021 -0800

add datadir

commit d8d8deeb6d8b84adb7ae5fdb88c6dd4f06624a76
Author: Ajay Panyala <ajay.panyalagmail.com>
Date: Mon Jan 11 17:47:50 2021 -0800

generate pkgconfig file

commit 8c65411c7c8737248a6f054ffa0ce008c95cb515
Merge: 328b4f88 874c3f04
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Jan 11 16:01:45 2021 -0600

Merge pull request 471 from flame/fix-470

Fix kernel-to-config mapping for intel64

commit 874c3f04ece9af4d8fdf0e2713e21a259c117656
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Jan 8 13:56:30 2021 -0600

Update configure

Choose last sub-config in the kernel-to-config map if the config list doesn't contain the name of the kernel set. E.g. for "zen: skx knl haswell" pick "haswell" instead of "skx" which was chosen previously. Fixes 470.

commit 2a815d5b365d934cb351b2f2a8cd1366e997b2e1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 4 18:03:39 2021 -0600

Support trsm pre-inversion in 1m, bb, ref kernels.

Details:
- Expanded support for disabling trsm diagonal pre-inversion to other
microkernel types, including the reference microkernel as well as the
kernel implementations for 1m and the pre-broadcast B (bb) format used
by the power9 subconfig. This builds on the 'haswell' and 'penryn'
kernel support added in 7038bba. Thanks to Bhaskar Nallani for
reminding me, in 461 (post-closure), that 1m support was missing from
that commit.
- Removed cpp branch of ref_kernels/3/bli_trsm_ref.c that contained the
omp simd implementation after making a stripped-down copy in 'old'.
This code has been disabled for some time and it seemed better suited
to rot away out of sight rather than clutter up a file that is already
cluttered by the presence of lower and upper versions.
- Minor comment update to bli_ind_init().

commit c3ed2cbb9f60100fc9beb2a9d75476de9f711dc5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 4 16:16:32 2021 -0600

Enable 1m only if real domain ukr is not reference.

Details:
- Previously, BLIS would automatically enable use of the 1m method
for a given precision if the complex domain microkernel was a
reference kernel. This commit adds an additional constraint so that
1m is only enabled if the corresponding real domain microkernel is
NOT reference. That is, BLIS now forgoes use of 1m if both the real and
complex domain kernels are reference implementations. Note that this
does not prevent 1m from being enabled manually under those
conditions; it only means that 1m will not be enabled automatically
at initialization-time.
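
Illustration (not part of the commit): the automatic-enable rule described above reduces to a two-input predicate. The function and parameter names are hypothetical.

```c
#include <stdbool.h>

/* 1m is enabled automatically only when the complex-domain microkernel
   is a reference kernel AND the real-domain microkernel is not. */
bool auto_enable_1m_sketch(bool complex_ukr_is_ref, bool real_ukr_is_ref)
{
    return complex_ukr_is_ref && !real_ukr_is_ref;
}
```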

commit ed50c947385ba3b0b5d550015f38f7f0a31755c0
Merge: 0cef09aa 328b4f88
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 4 14:31:44 2021 -0600

Merge branch 'master' into dev

commit 328b4f8872b4bca9a53d2de8c6e285f3eb13d196
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Dec 30 17:54:18 2020 -0600

Shared object (dylib) was not built correctly for partial build.

The SO build rule used $? instead of $^. Observed on macOS, not sure if it affected Linux or not.

commit ae6ef66ef824da9bc6348bf9d1b588cd4f2ded9b
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Dec 30 17:34:55 2020 -0600

bli_diag_offset_with_trans had wrong return type. Fixes 468.

commit ebcf197fb86fdd0a864ea928140752bc2462e8c6
Merge: 472f138c 21aa67e1
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Dec 5 22:26:27 2020 -0600

Merge pull request 466 from isuruf/patch-3

fix cc_vendor for crosstool-ng toolchains

commit 21aa67e11cebbc5a6dd7c6353154256294df3c33
Author: Isuru Fernando <isurufgmail.com>
Date: Sat Dec 5 21:59:13 2020 -0600

fix cc_vendor for crosstool-ng toolchains

commit 472f138cb927b7259126ebb9c68919cfcc7a4ea3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Dec 5 14:13:52 2020 -0600

Fixed typo in README.md to CodingConventions.md.

commit 0cef09aa92208441a656bf097f197ea8e22b533b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Dec 4 16:40:59 2020 -0600

Consolidated code in level-3 _front() functions.

Details:
- Reduced a code segment that appears in all of the bli_*_front()
functions except for bli_gemm_front(). Previously, the code looked
like this (taken from bli_herk_front()):

    if ( bli_cntx_method( cntx ) == BLIS_NAT )
    {
        bli_obj_set_pack_schema( BLIS_PACKED_ROW_PANELS, &a_local );
        bli_obj_set_pack_schema( BLIS_PACKED_COL_PANELS, &ah_local );
    }
    else // if ( bli_cntx_method( cntx ) != BLIS_NAT )
    {
        pack_t schema_a = bli_cntx_schema_a_block( cntx );
        pack_t schema_b = bli_cntx_schema_b_panel( cntx );

        bli_obj_set_pack_schema( schema_a, &a_local );
        bli_obj_set_pack_schema( schema_b, &ah_local );
    }

This code segment is part of a sort-of-hack that allows us to
communicate the pack schemas into the level-3 thread decorator, which
needs them so that they can be passed into bli_l3_cntl_create_if(),
where the control tree is created. However, the first conditional case
above is unnecessary because the second case is fully generalized.
That is, even in the native case, the context contains correct,
queryable schemas. Thus, these code segments were reduced to something
like:

  pack_t schema_a = bli_cntx_schema_a_block( cntx );
  pack_t schema_b = bli_cntx_schema_b_panel( cntx );

  bli_obj_set_pack_schema( schema_a, &a_local );
  bli_obj_set_pack_schema( schema_b, &ah_local );

There's always a small chance that the seemingly unnecessary code
in the first branch case has some special use that is not apparent to
me, but the testsuite's default input parameters seem to think this
commit will be fine.

commit 7038bbaa05484141195822291cf3ba88cbce4980
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Dec 4 16:08:15 2020 -0600

Optionally disable trsm diagonal pre-inversion.

Details:
- Implemented a configure-time option, --disable-trsm-preinversion, that
optionally disables the pre-inversion of diagonal elements of the
triangular matrix in the trsm operation and instead uses division
instructions within the gemmtrsm microkernels. Pre-inversion is
enabled by default. When it is disabled, performance may suffer
slightly, but numerical robustness should improve for certain
pathological cases involving denormal (subnormal) numbers that would
otherwise result in overflow in the pre-inverted value. Thanks to
Bhaskar Nallani for reporting this issue via 461.
- Added preprocessor macro guards to bli_trsm_cntl.c as well as the
gemmtrsm microkernels for 'haswell' and 'penryn' kernel sets pursuant
to the aforementioned feature.
- Added macros to frame/include/bli_x86_asm_macros.h related to division
instructions.
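
The numerical issue described above can be illustrated with a minimal sketch in plain C. This is not BLIS code: the function names solve_preinverted() and solve_division() are hypothetical, and a single scalar solve step x = b / d stands in for the gemmtrsm microkernel. The sketch assumes IEEE 754 double arithmetic without flush-to-zero.

```c
#include <assert.h>
#include <float.h>

/* Pre-inversion: the reciprocal of the diagonal element is computed once
   and the solve step multiplies by it. The volatile qualifier forces the
   reciprocal to be rounded to double, as it would be when stored. */
double solve_preinverted( double b, double d )
{
    assert( d != 0.0 );
    volatile double inv_d = 1.0 / d;   /* overflows to +inf if d is tiny */
    return b * inv_d;
}

/* No pre-inversion: the solve step divides by the diagonal element. */
double solve_division( double b, double d )
{
    assert( d != 0.0 );
    return b / d;
}
```

With a subnormal diagonal element such as d = 0x1p-1060 and a right-hand side b = 0x1p-1059, the pre-inverted path produces infinity because 1/d exceeds DBL_MAX, while the division path produces 2.0.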

commit 78aee79452cce2691c40f05b3632bdfc122300af
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 2 13:02:36 2020 -0600

Allow amaxv testsuite module to run with dim = 0.

Details:
- Exit early from libblis_test_amaxv_check() when the vector dimension
(length) of x is 0. This allows the module to run when the testsuite
driver passes in a problem size of 0. Thanks to Meghana Vankadari for
alerting us to this issue via 459.
- Note: All other testsuite modules appear to work with problem sizes
of 0, except for the microkernel modules. I chose not to "fix" those
modules because a failure (or segmentation fault, as happens in this
case) is actually meaningful in that it alerts the developer that some
microkernels cannot be used with k = 0. Specifically, the 'haswell'
kernel set contains microkernels that preload elements of B. Those
microkernels would need to be restructured to avoid preloading in
order to support usage when k = 0.

commit 92d2b12a44ee0990c22735472aeaf1c17deb2d9b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 2 13:02:00 2020 -0600

Fixed obscure testsuite gemmt dependency bug.

Details:
- Fixed a bug in the gemmt testsuite module that only manifested when
testing of gemmt is enabled but testing of gemv is disabled. The bug
was due to a copy-paste error dating back to the introduction of gemmt
in 88ad841.

commit b43dae9a5d2f078c9bbe07079031d6c00a68b7de
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 1 16:44:38 2020 -0600

Fixed copy-paste bugs in edge-case sup kernels.

Details:
- Fixed bugs in two sup kernels, bli_dgemmsup_rv_haswell_asm_1x6() and
bli_dgemmsup_rd_haswell_asm_1x4(), which involved extraneous assembly
instructions that were left over from when the kernels were first
written. These instructions would cause segmentation faults in some
situations where extra memory was not allocated beyond the end of
the matrix buffers. Thanks to Kiran Varaganti for reporting these
bugs and to Bhaskar Nallani for identifying the cause and solution.

commit 11dfc176a3c422729f453f6c23204cf023e9954d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 1 19:51:27 2020 +0000

Reorganized thread auto-factorization logic.

Details:
- Reorganized logic of bli_thread_partition_2x2() so that the primary
guts were factored out into "fast" and "slow" variants. Then added
logic to the "fast" variant that allows for more optimal thread
factorizations in some situations where there is at least one factor
of 2.
- Changed BLIS_THREAD_RATIO_M from 2 to 1 in bli_kernel_macro_defs.h and
added comments to that file describing BLIS_THREAD_RATIO_? and
BLIS_THREAD_MAX_?R.
- In bli_family_zen.h and bli_family_zen2.h, preprocessed out several
macros not used in vanilla BLIS and removed the unused macro
BLIS_ENABLE_ZEN_BLOCK_SIZES from the former file.
- Disabled AMD's small matrix handling entry points in bli_syrk_front.c
and bli_trsm_front.c. (These branches of small matrix handling have
not been reviewed by vanilla BLIS developers.)
- Added commented-out calls to printf() in bli_rntm.c.
- Whitespace changes to bli_thread.c.

commit 6d3bafacd7aa7ad198762b39490876c172bfbbcb
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Nov 28 17:17:56 2020 -0600

Update BuildSystem.md

Add git version >= 1.8.5 requirement (see 462).

commit 64856ea5a61b01d585750815788b6a775f729647
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Nov 23 16:54:51 2020 -0600

Auto-reduce (by default) prime numbers of threads.

Details:
- When requesting multithreaded parallelism by specifying the total
number of threads (whether it be via environment variable, globally at
runtime, or locally at runtime), reduce the number of threads actually
used by one if the original value (a) is prime and (b) exceeds a
minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which is set
to 11 by default. If, when specifying the total number of threads (and
not the individual ways of parallelism for each loop), prime numbers
of threads are desired, this feature may be overridden by defining the
BLIS_ENABLE_AUTO_PRIME_NUM_THREADS macro in the bli_family_*.h that
corresponds to the configuration family targeted at configure-time.
(For now, there is no configure option(s) to control this feature.)
Thanks to Jeff Diamond for suggesting this change.
- Defined a new function in bli_thread.c, bli_is_prime(), that returns a
bool that determines whether an integer is prime. This function is
implemented in terms of existing functions in bli_thread.c.
- Updated docs/Multithreading.md to document the above feature, along
with unrelated minor edits.
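
The rule described above can be sketched as follows. This is an illustration only: the names sketch_is_prime(), sketch_adjust_num_threads(), and SKETCH_NT_MAX_PRIME are hypothetical stand-ins, and the primality test uses simple trial division rather than the existing bli_thread.c functions that the commit says bli_is_prime() is built on.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the threshold macro named in the commit. */
#define SKETCH_NT_MAX_PRIME 11

/* Trial-division primality test; adequate for thread counts. */
bool sketch_is_prime( long n )
{
    if ( n < 2 ) return false;
    for ( long f = 2; f * f <= n; ++f )
        if ( n % f == 0 ) return false;
    return true;
}

/* Return the number of threads actually used for a requested total:
   one fewer if the request is prime and exceeds the threshold. */
long sketch_adjust_num_threads( long nt )
{
    assert( nt >= 1 );
    if ( nt > SKETCH_NT_MAX_PRIME && sketch_is_prime( nt ) )
        return nt - 1;
    return nt;
}
```

Under these assumptions, a request for 13 threads is reduced to 12, while requests for 11 or 12 threads are left unchanged.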

commit 55933b6ff6b9b8a12041715f42bba06273d84b74
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 20 10:39:32 2020 -0600

Added missing attribution to docs/ReleaseNotes.md.

commit e310f57b4b29fbfee479e0f9fe2040851efdec4f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 19 13:33:37 2020 -0600

CHANGELOG update (0.8.0)

0.8.0

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 19 13:33:37 2020 -0600

Version file update (0.8.0)

commit 2928ec750d3a3e1e5d55de5b57ddc04e9d0bd796
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 18 18:31:35 2020 -0600

ReleaseNotes.md update in advance of next version.

Details:
- Updated docs/ReleaseNotes.md in preparation for next version.

commit b9899bedff6854639468daa7a973bb14ca131a74
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 18 16:52:41 2020 -0600

CREDITS file update.

commit 9bb23e6c2a44b77292a72093938ab1ee6e6cc26a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Nov 16 15:55:45 2020 -0600

Added support for systemless build (no pthreads).

Details:
- Added a configure option, --[enable|disable]-system, which determines
whether the modest operating system dependencies in BLIS are included.
The most notable example of this on Linux and BSD/OSX is the use of
POSIX threads to ensure thread safety for when application-level
threads call BLIS. When --disable-system is given, the bli_pthreads
implementation is dummied out entirely, allowing the calling code
within BLIS to remain unchanged. Why would anyone want to build BLIS
like this? The motivating example was submitted via 454 in which a
user wanted to build BLIS for a simulator such as gem5 where thread
safety may not be a concern (and where the operating system is largely
absent anyway). Thanks to Stepan Nassyr for suggesting this feature.
- Another, more minor side effect of the --disable-system option is that
the implementation of bli_clock() unconditionally returns 0.0 instead
of the time elapsed since some fixed point in the past. The reasoning
for this is that if the operating system is truly minimal, the system
function call upon which bli_clock() would normally be implemented
(e.g. clock_gettime()) may not be available.
- Refactored preprocess-guarded code in bli_pthread.c and bli_pthread.h
to remove redundancies.
- Removed old comments and commented include of "bli_pthread_wrap.h"
from bli_system.h.
- Documented bli_clock() and bli_clock_min_diff() in BLISObjectAPI.md
and BLISTypedAPI.md, with a note that both are non-functional when
BLIS is configured with --disable-system.

commit 88ad84143414644df4c56733b1cf91a36bfacaf8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Nov 14 09:39:48 2020 -0600

Squash-merge 'pr' into 'squash'. (457)

Merged contributions from AMD's AOCL BLIS (448).

Details:
- Added support for level-3 operation gemmt, which performs a gemm on
only the lower or upper triangle of a square matrix C. For now, only
the conventional/large code path will be supported (in vanilla BLIS).
This was accomplished by leveraging the existing variant logic for
herk. However, some of the infrastructure to support a gemmtsup is
included in this commit, including
- A bli_gemmtsup() front-end, similar to bli_gemmsup().
- A bli_gemmtsup_ref() reference handler function.
- A bli_gemmtsup_int() variant chooser function (with variant calls
commented out).
- Added support for inducing complex domain gemmt via the 1m method.
- Added gemmt APIs to the BLAS and CBLAS compatibility layers.
- Added gemmt test module to testsuite.
- Added standalone gemmt test driver to 'test' directory.
- Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md.
- Added a C++ template header (blis.hh) containing a BLAS-inspired
wrapper to a set of polymorphic CBLAS-like function wrappers defined
in another header (cblas.hh). These two headers are installed if
running the 'install' target with INSTALL_HH set to 'yes'. (Also
added a set of unit tests that exercise blis.hh, although they are
disabled for now because they aren't compatible with out-of-tree
builds.) These files now live in the 'vendor' top-level directory.
- Various updates to 'zen' and 'zen2' subconfigurations, particularly
within the context initialization functions.
- Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and
various minor updates to dotv and scalv kernels. Also added various
sup kernels contributed by AMD to kernels/zen/3. However, these
kernels are (for now) not yet used, in part because they caused
AppVeyor clang failures, and also because I have not found time to
review and vet them.
- Output the python found during configure into the definition of PYTHON
in build/config.mk (via build/config.mk.in).
- Added early-return checks (A, B, or C with zero dimension; alpha = 0)
to bli_gemm_front.c.
- Implemented explicit beta = 0 handling for the sgemm ukernel in
bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent
bug surfaced because the gemmt module verifies its computation using
gemm with its beta parameter set to zero, which, on a cortexa15 system
caused the gemm kernel code to unconditionally multiply the
uninitialized C data by beta. The C matrix likely contained
non-numeric values such as NaN, which then would have resulted in a
false failure.
- Fixed a bug whereby the implementation for bli_herk_determine_kc(),
in bli_l3_blocksize.c, was inadvertently being defined in terms of
helper functions meant for trmm. This bug was probably harmless since
the trmm code should have also done the right thing for herk.
- Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in
kernels/zen/3/bli_gemm_small.c since those macros are not used in
vanilla BLIS.
- Added cpp guard to definition of bli_mem_clear() in bli_mem.h to
accommodate C++'s stricter type checking.
- Added cpp guard to test/*.c drivers that facilitates compilation on
Windows systems.
- Various whitespace changes.
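
Two points from this commit, the gemmt operation itself and the explicit beta = 0 handling, can be illustrated together with a small reference sketch. It is not the BLIS implementation: sketch_gemmt_lower() is a hypothetical name, the matrices are assumed to be contiguous and row-major, and only the lower-triangular case in double precision is shown.

```c
#include <assert.h>
#include <stddef.h>

/* Reference sketch: C := beta*C + alpha*A*B, updating only the lower
   triangle (including the diagonal) of the n x n matrix C.
   A is n x k and B is k x n; all matrices are row-major and contiguous. */
void sketch_gemmt_lower( size_t n, size_t k,
                         double alpha, const double* a, const double* b,
                         double beta, double* c )
{
    assert( a != NULL && b != NULL && c != NULL );

    for ( size_t i = 0; i < n; ++i )
    {
        for ( size_t j = 0; j <= i; ++j )
        {
            double ab = 0.0;
            for ( size_t p = 0; p < k; ++p )
                ab += a[ i*k + p ] * b[ p*n + j ];

            /* beta == 0 must overwrite C rather than scale it: if C holds
               NaN (e.g. uninitialized memory), 0 * C would still be NaN. */
            if ( beta == 0.0 ) c[ i*n + j ] = alpha * ab;
            else               c[ i*n + j ] = beta * c[ i*n + j ] + alpha * ab;
        }
    }
}
```

Calling this with beta = 0 on a C matrix filled with NaN yields finite values in the lower triangle and leaves the strictly upper triangle untouched, which is the behavior the gemmt test module relies on when it verifies its result against gemm.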

commit 234b8b0cf48f1ee965bd7999b291fc7add3b9a54
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 12 19:11:16 2020 -0600

Increased dotxaxpyf testsuite thresholds.

Details:
- Increased the test thresholds used by the dotxaxpyf testsuite module
by a factor of five in order to avoid residuals that unnecessarily
fall in the MARGINAL range. This commit should fix 455. Thanks to
nagsingh for reporting this issue.

commit ed612dd82c50063cfd23576a6b2465213d31b14b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Nov 7 13:09:42 2020 -0600

Updated README.md with sgemmsup blurb.

Details:
- Added an entry to the "What's New" section of the README.md to
announce the availability of sgemmsup.

commit e14424f55b15d67e8d18384aea45a11b9b772e02
Merge: 0cfe1aac eccdd75a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Nov 7 13:02:50 2020 -0600

Merge branch 'dev'

commit 0cfe1aac222008a78dff3ee03ef5183413936706
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 30 17:10:36 2020 -0500

Relocated operation index to ToC in API docs.

Details:
- Moved the "Operation index" section of both the BLISObjectAPI.md and
BLISTypedAPI.md docs to appear immediately after the table of contents
of each document. This allows the reader to quickly jump to the
documentation for any operation without having to scroll through much
of the document (when rendered via a web browser).
- Fixed a mistake in the BLISObjectAPI.md for the setd operation, which
does *not* observe the diag property of its matrix argument. Thanks to
Jeff Diamond for reporting this.

commit 2a0682f8e5998be536da313525292f0da6193147
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Oct 18 18:04:03 2020 -0500

Implemented runtime subconfig selection (451).

Details:
- Implemented support for the user manually overriding the automatic
subconfiguration selection that happens at runtime. This override
can be requested by setting the BLIS_ARCH_TYPE environment variable.
The variable must be set to the arch_t id (as enumerated in
bli_type_defs.h) corresponding to the desired subconfiguration. If a
value outside this enumerated range is given, BLIS will abort with an
error message. If the value is in the valid range but corresponds to a
subconfiguration that was not activated at configure-time/compile-time,
BLIS will abort with a (different) error message. Thanks to decandia50
for suggesting this feature via issue 451.
- Defined a new function bli_gks_lookup_id to return the address of an
internal data structure within the gks. If this address is NULL, then
it indicates that the subconfig corresponding to the arch_t id passed
into the function was not compiled into BLIS. This function is used
in the second of the two abort scenarios described above.
- Defined the enumerated error code BLIS_UNINITIALIZED_GKS_CNTX, which
is returned for the latter of the two abort scenarios mentioned above,
along with a corresponding error message and a function to perform
the error check.
- Added cpp macro branching to bli_env.c to support compilation of the
auto-detect.x executable during configure-time. This cpp branch is
similar to the cpp code already found in bli_arch.c and bli_cpuid.c.
- Cleaned up the auto_detect() function to facilitate easier maintenance
going forward. Also added a convenient debug switch that outputs the
compilation command for the auto-detect.x executable and exits.
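
The override logic and its two failure scenarios can be sketched as below. Everything here is hypothetical and for illustration only: the status codes, SKETCH_NUM_ARCHS, and sketch_select_arch() are not BLIS names, the ids are not the real arch_t values, and the sketch returns a status code where BLIS would abort with an error message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical number of enumerated subconfiguration ids. */
enum { SKETCH_NUM_ARCHS = 4 };

typedef enum
{
    SKETCH_ARCH_OK,            /* override accepted                        */
    SKETCH_ARCH_UNSET,         /* no override: fall back to auto-detection */
    SKETCH_ARCH_OUT_OF_RANGE,  /* id is not a valid enumerated value       */
    SKETCH_ARCH_NOT_COMPILED   /* id is valid, but subconfig was not built */
} sketch_status_t;

/* 'value' is the string read from the environment (NULL if unset);
   'compiled_in[id]' says whether subconfig 'id' was built. */
sketch_status_t sketch_select_arch( const char* value,
                                    const bool* compiled_in,
                                    int*        id_out )
{
    assert( compiled_in != NULL && id_out != NULL );

    if ( value == NULL ) return SKETCH_ARCH_UNSET;

    char* end = NULL;
    long  id  = strtol( value, &end, 10 );

    /* Treat non-numeric or trailing garbage as an invalid id. */
    if ( end == value || *end != '\0' )     return SKETCH_ARCH_OUT_OF_RANGE;
    if ( id < 0 || id >= SKETCH_NUM_ARCHS ) return SKETCH_ARCH_OUT_OF_RANGE;
    if ( !compiled_in[ id ] )               return SKETCH_ARCH_NOT_COMPILED;

    *id_out = ( int )id;
    return SKETCH_ARCH_OK;
}
```

A caller would pass getenv( "BLIS_ARCH_TYPE" ) as the first argument. Keeping the environment lookup outside the function makes the validation logic easy to exercise with fixed strings.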

commit eccdd75a2d8a0c46e91e94036179c49aa5fa601c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 9 15:44:16 2020 -0500

Whitespace tweak in docs/PerformanceSmall.md.

commit 7677e9ba60ac27496e3421c2acc7c239e3f860e9
Merge: addcd46b a0849d39
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 9 15:41:25 2020 -0500

Merge branch 'dev' of github.com:flame/blis into dev

commit addcd46b0559d401aa7d33d4c7e6f63f5313a8e0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 9 15:41:09 2020 -0500

Added Epyc 7742 Zen2 ("Rome") sup perf results.

Details:
- Added single-threaded and multithreaded sup performance results to
docs/PerformanceSmall.md for both sgemm and dgemm. These results were
gathered on an Epyc 7742 "Rome" server featuring AMD's Zen2
microarchitecture. Special thanks to Jeff Diamond for facilitating
access to the system via the Oracle Cloud.
- Updates to octave scripts in test/sup/octave for use with Octave 5.2
and for use with subplot_tight().
- Minor updates to octave scripts in test/3/octave.
- Renamed files containing the previous Zen performance results for
consistency with the new results.
- Decreased line thickness slightly in large/conventional Zen2 graphs.
I'm done tweaking those this time. Really.
- Added missing line regarding eigen header installation for each
microarchitecture section.

commit a0849d390d04067b82af937cda8191b049b98915
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 9 20:22:17 2020 +0000

Register l3 sup kernels in zen2 subconfig.

Details:
- Registered full suite of sgemm and dgemm sup millikernels, blocksizes,
and crossover thresholds in bli_cntx_init_zen2.c.
- Minor updates to test/sup/runme.sh for running on Zen2 Epyc 7742
system.

commit d98368c32d5fbfaab8966ee331d9bcb5c4fe7a59
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 8 19:05:51 2020 -0500

Another tweak to line thickness of Zen2 graphs.

commit 1855dfbdaafa37892b36c97fd317fd5d8da76676
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 8 19:01:00 2020 -0500

Tweaked line thickness in Zen2 graphs once more.

Details:
- Decreased (relative to previous commit) line thickness in recent Zen2
graphs.

commit 0991611e7ed82889c53a5c3f1ef1d49552c50d61
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 8 18:54:49 2020 -0500

Increased line thickness in recent Zen2 graphs.

Details:
- Increased the width of the lines in the graphs introduced in 74ec6b8.

commit 8273cbacd7799e9af59e5320d66055f2f5d9cb31
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 7 14:51:33 2020 -0500

README.md, docs/FAQ.md updates.

Details:
- Added a frequently asked question to docs/FAQ.md regarding the
difference between upstream (vanilla) BLIS and AMD BLIS.
- Updated the name of ICES in the README.md to reflect the Oden
rebranding.

commit a178a822ad3d5021489a0e61f909d8550ae12a8f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Sep 30 16:00:52 2020 -0500

Added Zen2 links to docs/Performance.md Contents.

commit 74ec6b8f457cabe37d2382aaab35ba04fc737948
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Sep 30 15:54:18 2020 -0500

Added Epyc 7742 Zen2 ("Rome") performance results.

Details:
- Added single-threaded and multithreaded performance results to
docs/Performance.md. These results were gathered on an Epyc 7742
"Rome" server with AMD's Zen2 microarchitecture. Special thanks
to Jeff Diamond for facilitating access to the system via the
Oracle Cloud.
- Renamed files containing the previous Zen performance results for
consistency with the new results.

commit bc4a213a2c3dcf8bbfcbb3a1ef3e9fc9e3226c34
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Sep 30 15:28:20 2020 -0500

Updated matlab (now octave) plot code in test/3.

Details:
- Renamed test/3/matlab to test/3/octave.
- Within test/3, updated and tuned plot_l3_perf.m and plot_panel_4x5.m
files for use with octave (which is free and doesn't crash on me
mid-way through my use of subplot).
- Updated runthese.m scratchpad for zen2 invocations.
- Added Nikolay S.'s subplot_tight() function, along with its license.

commit c77ddc418187e1884fa6bcfe570eee295b9cb8bc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Sep 30 20:15:43 2020 +0000

Added optional numactl usage to test/3/runme.sh.

commit 2d8ec164e7ae4f0c461c27309dc1f5d1966eb003
Author: Nicholai Tukanov <nicholaiutexas.edu>
Date: Tue Sep 29 16:52:18 2020 -0500

Add POWER10 support to BLIS (450)

commit 4fd8d9fec2052257bf2a5c6e0d48ae619ff6c3e4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 28 23:39:05 2020 +0000

Tweaked zen2 subconfig's MC cache blocksizes.

Details:
- Updated the MC cache blocksizes registered by the 'zen2' subconfig.
- Minor updates to test/3/Makefile and test/3/runme.sh.

commit 5efcdeffd58af621476d179afc0c19c0f912baa8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Sep 25 14:25:24 2020 -0500

More minor README.md updates.

commit 9e940f8aad6f065ea1689e791b9a4e1fb7900c40
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Sep 25 13:53:35 2020 -0500

Added 1m SISC bibtex to README.md.

Details:
- Added final citation info to 1m bibtex in README.md file.
- Updated draft 1m paper link.
- Changed some http to https.

commit e293cae2d1b9067261f613f25eaa0e871356b317
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 15 16:09:11 2020 -0500

Implemented sgemmsup assembly kernels.

Details:
- Created a set of single-precision real millikernels and microkernels
comparable to the dgemmsup kernels that already exist within BLIS.
- Added prototypes for all kernels within bli_kernels_haswell.h.
- Registered entry-point millikernels in bli_cntx_init_haswell.c and
bli_cntx_init_zen.c.
- Added sgemmsup support to the Makefile, runme.sh script, and source
file in test/sup. This included edits that allow for separate "small"
dimensions for single- and double-precision as well as for single-
vs. multithreaded execution.

commit 2765c6f37c11cb7f71cd4b81c64cea6130636c68
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Sep 12 17:48:15 2020 -0500

Type saga continues; fixed sgemm ukernel signature.

Details:
- Changed double* pointers in sgemm function signature to float*. At
this point I've lost track of whether this was my fault or another
dormant bug like the one described in ece9f6a, but at this point I
no longer care. It's one of those days (aka I didn't ask for this).

commit 0779559509e0a1af077530d09ed151dac54f32ee
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Sep 12 17:37:21 2020 -0500

Fixed missing restrict in knl sgemm prototype.

Details:
- Added a missing 'restrict' qualifier in the sgemm ukernel prototype
for knl. (Not sure how that code was ever compiling before now.)

commit ece9f6a3ef1b26b53ecf968cd069df7a85b139fb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Sep 12 17:22:42 2020 -0500

Fixed dormant type bugs in bli_kernels_knl.h.

Details:
- Fixed dormant type mismatches in the use of the prototype-generating
macros in bli_kernels_knl.h. Specifically, some float prototypes
were incorrectly using double as their ctype. This didn't actually
matter until the type changes in 645d771, as previously those types
were not used since packm was prototyped with void* pointers.

commit 8ebb3b60e1c4c045ddb48e02de6e246cecde24a4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Sep 12 17:00:47 2020 -0500

Fixed accidental breakage in 645d771.

Details:
- In trying to clean up kappa_cast variables in the reference packm
kernels, which I initially believed to be redundant given the other
void* -> ctype* changes in 645d771, I accidentally ended up violating
restrict semantics for 1e/1r packing and possibly other packm kernels.
(Normally, my pre-commit testsuite run would have caught this, but I
was unknowingly using an edited input.operations file in which I'd
disabled most tests as part of unrelated work.) This commit reverts
the kappa_cast changes in 645d771.

commit 645d771a14ae89aa7131d6f8f4f4a8090329d05e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Sep 12 15:31:56 2020 -0500

Minor packm kernel type cleanup (void* -> ctype*).

Details:
- Changed all void* function arguments in reference packm kernels to
those of the native type (ctype*). These pointers no longer need to
be void* and are better represented by their native types anyway.
(See below for details.) Updated knl packm kernels accordingly.
- In the definition of the PACKM_KER_PROT prototype macro template in
frame/1m/bli_l1m_ker_prot.h, changed the pointer types for kappa, a,
and p from void* to ctype*. They were originally void* because these
function signatures had to share the same type so they could all be
stored in a single array of that shared type, from which they were
queried and called by packm_cxk(). This is no longer how the function
pointers are stored, and so it no longer makes sense to force the
caller of packm kernels to use void*, only so that the implementor
of the packm kernels can typecast back to the native datatype within
the kernel definition. This change has no effect internally within
BLIS because currently all packm kernels are called after querying
the function addresses from the context and then typecasting to the
appropriate function pointer type, which is based upon type-specific
function pointers like float* and double*.
- Removed a comment in frame/1m/bli_l1m_ft_ker.h that was outdated and
misleading due to changes to the handling of packm kernels since
moving them into the context.
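
The difference between the two signature styles can be seen in a simplified sketch. The functions below are hypothetical and far smaller than real packm kernels (they only scale and copy a contiguous vector, and omit restrict qualifiers and the remaining arguments); they are meant to show only where the typecasts live.

```c
#include <assert.h>
#include <stddef.h>

/* Older style: a shared signature with void* so that kernels for every
   datatype could sit in one array; the kernel casts back internally. */
void sketch_spackm_void( size_t n, const void* kappa, const void* a, void* p )
{
    const float* kappa_cast = kappa;
    const float* a_cast     = a;
    float*       p_cast     = p;

    assert( kappa_cast != NULL );
    for ( size_t i = 0; i < n; ++i ) p_cast[ i ] = *kappa_cast * a_cast[ i ];
}

/* Newer style: native pointer types in the signature, so the kernel needs
   no casts and the compiler can check the caller's argument types. */
void sketch_spackm_typed( size_t n, const float* kappa, const float* a, float* p )
{
    assert( kappa != NULL );
    for ( size_t i = 0; i < n; ++i ) p[ i ] = *kappa * a[ i ];
}

/* Type-specific function pointer type, as a caller might cast to after
   querying a kernel address. */
typedef void (*sketch_spackm_ft)( size_t, const float*, const float*, float* );
```

Both functions compute the same result; the typed version simply moves the type information from the kernel body into the prototype.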

commit 54bf6c35542a297e25bc8efec6067a6df80536f4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 10 15:42:01 2020 -0500

Minor README.md update.

Details:
- Added a new entry to the "What people are saying about BLIS" section.

commit e50b4d40462714ae33df284655a2faf7fa35f37c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Sep 9 14:12:53 2020 -0500

Minor update to README.md (SIAM Best Paper Prize).

commit a8efb72074691e2610372108becd88b4b392299e
Merge: b0c4da17 97e87f2c
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Sep 7 16:18:19 2020 -0500

Merge pull request 434 from flame/intel-zdot

Add an option to change the complex return type.

commit 97e87f2c9f3878a05e1b7c6ec237ee88d9a72a42
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 7 15:56:42 2020 -0500

Whitespace/comment updates to 434 PR.

commit b0c4da1732b6c6a9ff66f70c36e4722e0f9645ae
Merge: 810e90ee b1b5870d
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Sep 7 15:47:54 2020 -0500

Merge pull request 436 from flame/s390x

Add checks so that s390x is detected as 64-bit.

commit 810e90ee806510c57504f0cf8eeaf608d38bd9dd
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 1 16:11:40 2020 -0500

Minor README.md update.

Details:
- Added HPE to list of funders.
- Changed http to https in funders' website links.

commit 7d411282196e036991c26e52cb5e5f85769c8059
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Aug 13 17:50:58 2020 -0500

Use -O2 for all framework code. (435)

It seems that -O3 might be causing intermittent problems with the f2c'ed packed and banded code. -O3 is retained for kernel code. Fixes 341 and fixes 342.

commit 9c5b485d356367b0a1288761cd623f52036e7344
Author: Dave Love <dave.lovemanchester.ac.uk>
Date: Fri Aug 7 20:11:18 2020 +0000

Don't override -mcpu with -march on ARM (353)

* Use -mcpu for ARM
See the GCC doc about -march, -mtune, and -mcpu and maybe
https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu

* Fix typo in flags

* Fix typo in cortexa9 flags

* Modify cortexa53 compilation flags to fix failing BLAS check (341)

commit c253d14a72a746b670b3ffbb6e81bcafc73d1133
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Aug 7 09:39:04 2020 -0500

Also handle Intel-style complex return in CBLAS interface.

commit 5d653a11a0cc71305d0995507b1733995856f475
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Aug 6 17:58:26 2020 -0500

Update Multithreading.md

Addresses the issue raised in 426.

commit b1b5870dd3f9b1c78cf5f58a53514d73f001fc4c
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Aug 6 17:34:20 2020 -0500

Add checks so that s390x is detected as 64-bit.

commit 882dcb11bfc9ea50aa2f9044621833efd90d42be
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 6 17:28:14 2020 -0500

Mention example code at top of documentation docs.

Details:
- Steer the reader towards the example code section of each
documentation doc (object and typed).
- Trivial update to examples/oapi/README, examples/tapi/README.

commit f4894512e5bf56ff83701c07dd02972e300741a5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 6 17:20:00 2020 -0500

Very minor updates to previous commit.

commit adedb893ae8dfacd1dc54035979e15c44d589dbb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 6 17:14:01 2020 -0500

Documented mutator functions in BLISObjectAPI.md.

Details:
- Added documentation for commonly-used object mutator functions in
BLISObjectAPI.md. Previously, only accessor functions were documented.
Thanks to Jeff Diamond for pointing out this omission.
- Explicitly set the 'diag' property of objects in oapi example modules
(08level2.c and 09level3.c).

commit 5b5278ff494888509543a79c09ea82089f6c95d9
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Aug 6 14:19:37 2020 -0500

Use ifdef instead of if as macro may be undefined.

commit 7fdc0fc893d0c6727b725ea842053b65be2c20ba
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Aug 6 14:03:55 2020 -0500

Add an option to change the complex return type.

ifort apparently does not return complex numbers in registers as in C/C++ (or gfortran), but instead creates a "hidden" first parameter for the return value. The option --complex-return=gnu|intel has been added, as well as a guess based on a provided FC if not specified (otherwise default to gnu). This option affects the signatures of cdotc, cdotu, zdotc, and zdotu, and a single library cannot be used with both GNU and Intel Fortran compilers. Fixes 433.

commit 6e522e5823b762d4be09b6acdca30faafba56758
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jul 30 19:31:37 2020 -0500

Mention disabling of sup in docs/Sandboxes.md.

Details:
- Added language to remind the reader to disable sup if the intended
behavior is for the sandbox implementation to handle all problem
sizes, even the smaller ones that would normally be handled by the
sup code path.

commit 00e14cb6d849e963a2e1ac35e7dbbe186af00a58
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 29 14:24:34 2020 -0500

Replaced use of bool_t type with C99 bool.

Details:
- Textually replaced nearly all non-comment instances of bool_t with the
C99 bool type. A few remaining instances, such as those in the files
bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and
bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being
used not for boolean purposes but to index into an array.
- This commit constitutes the third phase of a transition toward using
C99's bool instead of bool_t, which was raised in issue 420. The first
phase, which cleaned up various typecasts in preparation for using
bool as the basis for bool_t (instead of gint_t), was implemented by
commit a69a4d7. The second phase, which redefined the bool_t typedef
in terms of bool (from gint_t), was implemented by commit 2c554c2.

commit 2c554c2fce885f965a425e727a0314d3ba66c06d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 24 15:57:19 2020 -0500

Redefined bool_t typedef in terms of C99 bool.

Details:
- Changed the typedef that defines bool_t from:

typedef gint_t bool_t;

where gint_t is a signed integer that forms the basis of most other
integers in BLIS, to:

typedef bool bool_t;

- Changed BLIS's TRUE and FALSE macro definitions from being in terms of
integer literals:

#define TRUE 1
#define FALSE 0

to being in terms of C99 boolean constants:

#define TRUE true
#define FALSE false

which are provided by stdbool.h.
- This commit constitutes the second phase of a transition toward using
C99's bool instead of bool_t, which will address issue 420. The first
phase, which cleaned up various typecasts in preparation for using
bool as the basis for bool_t (instead of gint_t), was implemented by
commit a69a4d7.
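
The practical difference between the two typedefs appears when a value other than 0 or 1 is stored. The following is a minimal sketch, not BLIS code: int_bool_t and c99_bool_t are hypothetical stand-ins for the old and new definitions of bool_t, and 'long' is only an assumption standing in for gint_t.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the old and new definitions of bool_t.
   'long' is an assumption standing in for BLIS's gint_t. */
typedef long int_bool_t;  /* old style: a signed integer */
typedef bool c99_bool_t;  /* new style: C99 bool */

/* Tests a bit flag and returns the result through an integer-based
   boolean: the raw masked value (e.g. 4) passes through unchanged. */
static int_bool_t has_flag_int( unsigned flags, unsigned mask )
{
    return ( int_bool_t )( flags & mask );
}

/* Same test through C99 bool: conversion to bool normalizes any
   non-zero value to 1 (true). */
static c99_bool_t has_flag_c99( unsigned flags, unsigned mask )
{
    return ( c99_bool_t )( flags & mask );
}
```

With flags = 0x4 and mask = 0x4, has_flag_int() returns 4 while has_flag_c99() returns true (1), so comparisons against TRUE behave differently under the two definitions.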

commit e01dd125581cec87f61e15590922de0dc938ec42
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 24 15:41:46 2020 -0500

Fail-safe updates to Makefiles in 'test' dir.

Details:
- Updated Makefiles in test, test/3, and test/sup so that running any of
the usual targets without having first built BLIS results in a helpful
error message. For example, if BLIS is not yet configured, make will
output:

Makefile:327: *** Cannot proceed: config.mk not detected! Run
configure first. Stop.

Similarly, if BLIS is configured but not yet built, make will output:

Makefile:340: *** Cannot proceed: BLIS library not yet built! Run
make first. Stop.

In previous commits, these actions would result in a rather cryptic
make error such as:

make: *** No rule to make target 'test_sgemm_2400_asm_blis_st.x',
needed by 'blis-nat-st'. Stop.

commit b4f47f7540062da3463e2cb91083c12fdda0d30a
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Jul 24 13:56:13 2020 -0500

Add BLIS_EXPORT_BLIS to bli_abort. (429)

Fixes 428.

commit a69a4d7e2f4607c919db30b14535234ce169c789
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 22 16:13:09 2020 -0500

Cleaned up bool_t usage and various typecasts.

Details:
- Fixed various typecasts in

frame/base/bli_cntx.h
frame/base/bli_mbool.h
frame/base/bli_rntm.h
frame/include/bli_misc_macro_defs.h
frame/include/bli_obj_macro_defs.h
frame/include/bli_param_macro_defs.h

that were missing or being done improperly/incompletely. For example,
many return values were being typecast as
(bool_t)x && y
rather than
(bool_t)(x && y)
Thankfully, none of these deficiencies had manifested as actual bugs
at the time of this commit.
- Changed the return type of bli_env_get_var() from dim_t to gint_t.
This reflects the fact that bli_env_get_var() needs to be able to
return a signed integer, and even though dim_t is currently defined
as a signed integer, it does not intuitively appear to necessarily be
signed by inspection (i.e., an integer named "dim_t" for matrix
"dimension"). Also, updated use of bli_env_get_var() within
bli_pack.c to reflect the changed return type.
- Redefined type of thrcomm_t.barrier_sense field from bool_t to gint_t
and added comments to the bli_thrcomm_*.h files that will explain a
planned replacement of bool_t with C99's bool type.
- Note: These changes are being made to facilitate the substitution of
'bool' for 'bool_t', which will eliminate the namespace conflict with
arm_sve.h as reported in issue 420. This commit implements the first
phase of that transition. Thanks to RuQing Xu for reporting this
issue.
- CREDITS file update.
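
To illustrate why the parenthesization of the typecast matters in general, here is a small sketch. It uses a deliberately narrow, hypothetical type (narrow_bool_t, assumed to be an 8-bit unsigned char) so that the truncation becomes visible. It is not the type BLIS used; with a wide signed integer the two forms agree, which is consistent with the note above that no actual bugs had manifested.

```c
/* Hypothetical narrow boolean type, chosen so that truncation is
   visible. Assumes an 8-bit unsigned char. */
typedef unsigned char narrow_bool_t;

/* The cast binds tighter than &&, so it applies to x only:
   x is truncated first, then logically ANDed with y. */
static int cast_then_and( int x, int y )
{
    return ( narrow_bool_t )x && y;
}

/* The cast applies to the result of the whole logical expression. */
static int and_then_cast( int x, int y )
{
    return ( narrow_bool_t )( x && y );
}
```

For x = 256 and y = 1, cast_then_and() yields 0 because 256 truncates to 0 before the && is evaluated, while and_then_cast() yields 1.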

commit a6437a5c11d364c6c88af527294d29734d7cc7d6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jul 20 19:21:07 2020 -0500

Replaced broken ref99 sandbox w/ simpler version.

Details:
- The 'ref99' sandbox was broken by multiple refactorings and internal
API changes over the last two years. Rather than try to fix it, I've
replaced it with a much simpler version based on var2 of gemmsup.
Why not fix the previous implementation? It occurred to me that the
old implementation was trying to be a lightly simplified duplication
of what exists in the framework. Duplication aside, this sandbox
would have worked fine if it had been completely independent of the
framework code. The problem was that it was only partially
independent, with many function calls calling a function in BLIS
rather than a duplicated/simplified version within the sandbox. (And
the reason I didn't make it fully independent to begin with was that
it seemed unnecessarily duplicative at the time.) Maintaining two
versions of the same implementation is problematic for obvious
reasons, especially when it wasn't even done properly to begin with.
This explains the reimplementation in this commit. The only catch is
that the newer implementation is single-threaded only and does not
perform any packing on either input matrix (A or B). Basically, it's
only meant to be a simple placeholder that shows how you could plug
in your own implementation. Thanks to Francisco Igual for reporting
this brokenness.
- Updated the three reference gemmsup kernels (defined in
ref_kernels/3/bli_gemmsup_ref.c) so that they properly handle
conjugation of conja and/or conjb. The general storage kernel, which
is currently identical to the column-storage kernel, is used in the
new ref99 sandbox to provide basic support for all datatypes
(including scomplex and dcomplex).
- Minor updates to docs/Sandboxes.md, including adding the threading
and packing limitations to the Caveats section.
- Fixed a comment typo in bli_l3_sup_var1n2m.c (upon which the new
sandbox implementation is based).

commit bca040be9da542dd9c75d91890fa7731841d733d
Merge: 2605eb4d 171ecc1d
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Jul 20 09:27:30 2020 -0500

Merge pull request 425 from gmargari/patch-1

Update Multithreading.md

commit 171ecc1dc6f055ea39da30e508f711b49a734359
Author: Giorgos Margaritis <gmargariprotonmail.com>
Date: Mon Jul 20 12:24:06 2020 +0300

Update Multithreading.md

commit 2605eb4d99d3813c37a624c011aa2459324a6d89
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 15 15:25:19 2020 -0500

Added missing rv_d?x6 edge cases to sup kernel.

Details:
- Added support to bli_gemmsup_rv_haswell_asm_d6x8n.c for handling
various n = 6 edge cases with a single sup kernel call. Previously,
only n = {4,2,1} were handled explicitly as single kernel calls;
that is, cases where n = 6 were previously being executed via two
kernel calls (n = 4 and n = 2).
- Added commented debug line to testsuite's test_libblis.c.

commit 72f6ed0637dfcb021de04ac7d214d5c87e55d799
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 3 17:55:54 2020 -0500

Declare/define static functions via BLIS_INLINE.

Details:
- Updated all static function definitions to use the cpp macro
BLIS_INLINE instead of the static keyword. This allows blis.h to
use a different keyword (inline) to define these functions when
compiling with C++, which might otherwise trigger "defined but
not used" warning messages. Thanks to Giorgos Margaritis for
reporting this issue and Devin Matthews for suggesting the fix.
- Updated the following files, which are used by configure's
hardware auto-detection facility, to unconditionally define
BLIS_INLINE to the static keyword (since we know BLIS will be
compiled with C, not C++):
build/detect/config/config_detect.c
frame/base/bli_arch.c
frame/base/bli_cpuid.c
- CREDITS file update.
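
A macro of this kind could be defined roughly as follows. This is a sketch based only on the description above; the actual definition in blis.h may differ, and example_min() is a hypothetical helper used for illustration.

```c
/* Sketch only: choose the keyword based on the language being compiled. */
#ifdef __cplusplus
  #define BLIS_INLINE inline
#else
  #define BLIS_INLINE static
#endif

/* Hypothetical header-defined helper declared via the macro rather
   than with the static keyword directly. */
BLIS_INLINE int example_min( int a, int b )
{
    return ( a < b ? a : b );
}
```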

commit 5fc701ac5f94c6300febbb2f24e731aa34f0f34a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 1 15:48:58 2020 -0500

Added -fomit-frame-pointer option to CKOPTFLAGS.

Details:
- Added the -fomit-frame-pointer compiler option to the CKOPTFLAGS
variable in the following make_defs.mk files:
config/haswell/make_defs.mk
config/skx/make_defs.mk
as well as comments that mention why the compiler option is needed.
This option is needed to prevent the compiler from using the rbp
frame register (in the very early portion of kernel code, typically
where k_iter and k_left are defined and computed), which, as of
1c719c9, is used explicitly by the gemmsup millikernels. Thanks to
Devin Matthews for identifying this missing option and to Jeff
Diamond for reporting the original bug in 417.
- The file
config/zen/amd_config.mk
which feeds into the make_defs.mk for both zen and zen2 subconfigs,
was also touched, but only to add a commented-out compiler option
(and the aforementioned explanatory comment) since that file already
uses -fomit-frame-pointer in COPTFLAGS, which forms the basis of
CKOPTFLAGS.

commit 6af59b705782dada47e45df6634b479fe781d4fe
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 1 14:54:23 2020 -0500

Fixed disabled edge case optimization in gemmsup.

Details:
- Fixed an inadvertently disabled edge case optimization in the two
gemmsup variants in bli_l3_sup_var1n2m.c. Background: These edge case
optimizations allow the last millikernel operation in the jr loop to
be executed with an inflated register blocksize if it is the last
(or only) iteration. For example, suppose mr=6 and nr=8 and the gemmsup
problem is m=8, n=100, k=100. (In this case, the panel-block variant
(var1n) is executed, which places the jr loop in the m dimension.)
In principle, this problem could be executed as two millikernels: one
with dimensions 6x100x100, and one as 2x100x100. However, with the
support for inflated blocksizes in the kernel, the entire 8x100x100
problem can be passed to the millikernel function, which will then
execute it more favorably as two 4x100x100 millikernel sub-calls.
Now, this optimization is disabled under certain circumstances, such
as when multithreading. Previously, the is_mt predicate was being set
incorrectly such that it was non-zero even when running
single-threaded.
- Upon fixing the is_mt issue above, another bit of code needed to be
moved so that the result of the optimization could have an impact on
the assignment of loop bounds ranges to threads.
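
The effect of the optimization on the number of millikernel calls can be sketched as below. This is a simplified illustration under stated assumptions, not the logic in bli_l3_sup_var1n2m.c: the function name, the mr_extra parameter (how many rows beyond mr the last call may absorb), and the allow_inflate flag (e.g. the negation of is_mt) are all hypothetical.

```c
/* Counts the millikernel calls needed to cover m rows with register
   blocksize mr. If inflation is allowed and the leftover rows fit
   within mr_extra, the last full iteration absorbs them. */
static int count_jr_calls( int m, int mr, int mr_extra, int allow_inflate )
{
    int n_iter = m / mr;  /* number of full-sized iterations */
    int m_left = m % mr;  /* leftover rows, if any */

    if ( allow_inflate && n_iter >= 1 && 0 < m_left && m_left <= mr_extra )
    {
        /* The last call handles mr + m_left rows by itself. */
        return n_iter;
    }

    /* Otherwise the leftover rows need a call of their own. */
    return n_iter + ( m_left > 0 ? 1 : 0 );
}
```

For m=8 and mr=6 with two extra rows permitted, the count is 1 when inflation is allowed and 2 when it is not. A predicate that is wrongly non-zero in single-threaded runs therefore forces the two-call path.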

commit b37634540fab0f9b8d4751b8356ee2e17c9e3b00
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jun 25 16:05:12 2020 -0500

Support ldims, packing in sup/test drivers.

Details:
- Updated the test/sup source file (test_gemm.c) and Makefile to support
building matrices with small or large leading dimensions, and updated
runme.sh to support executing both kinds of test drivers.
- Updated runme.sh to allow for executing sup drivers with unpacked (the
default) or packed matrices (via setting BLIS_PACK_A, BLIS_PACK_B
environment variables), and for capturing output to files that encode
both the leading dimension (small or large) and packing status into
the filenames.
- Consolidated octave scripts in test/sup/octave_st, test/sup/octave_mt
into test/sup/octave and updated the octave code in that consolidated
directory to read the new output filename format (encoding ldim and
packing). Also added comments and streamlined code, particularly in
plot_panel_trxsh.m. Tested the octave scripts with octave 5.2.0.
- Moved old octave_st, octave_mt directories to test/sup/old.

commit ceb9b95a96cc3844ecb43d9af48ab289584e76b6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jun 18 17:15:25 2020 -0500

Fixed incorrect link to shiftd in BLISTypedAPI.md.

Details:
- Previously, the entry for shiftd in the Operation index section of
BLISTypedAPI.md was incorrectly linking to the shiftd operation entry
in BLISObjectAPI.md. This has been fixed. Thanks to Jeff Diamond for
helping find this incorrect link.

commit b3c42016818797f79e55b32c8b7d090f9d0aa0ea
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jun 18 14:00:56 2020 -0500

CREDITS file update.

commit 31af73c11abae03248d959da0f81eacea015b57a
Author: Isuru Fernando <isurufgmail.com>
Date: Thu Jun 18 13:35:54 2020 -0500

Expand windows instructions (414)

* Expand windows instructions

* Windows: both static and shared don't work at the same time

commit b5b604e106076028279e6d94dc0e51b8ad48e802
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jun 17 16:42:24 2020 -0500

Ensure random objects' 1-norms are non-zero.

Details:
- Fixed an innocuous bug that manifested when running the testsuite on
extremely small matrices with randomization via the "powers of 2 in
narrow precision range" option enabled. When the randomization
function emits a perfect 0.0 to fill a 1x1 matrix, the testsuite will
then compute 0.0/0.0 during the normalization process, which leads to
NaN residuals. The solution entails smarter implementations of randv,
randnv, randm, and randnm, each of which will compute the 1-norm of
the vector or matrix in question. If the object has a 1-norm of 0.0,
the object is re-randomized until the 1-norm is not 0.0. Thanks to
Kiran Varaganti for reporting this issue (413).
- Updated the implementation of randm_unb_var1() so that it loops over
a call to the randv_unb_var1() implementation directly rather than
calling it indirectly via randv(). This was done to avoid the overhead
of multiple calls to norm1v() when randomizing the rows/columns of a
matrix.
- Updated comments.
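
The re-randomization idea can be sketched as follows. The function names and the generator are hypothetical stand-ins (the generator draws from {-1, 0, 1} so that exact zeros are likely); this is not the BLIS implementation.

```c
#include <stdlib.h>

/* 1-norm of a vector: the sum of the absolute values of its elements. */
static double example_norm1v( int n, const double* x )
{
    double sum = 0.0;
    for ( int i = 0; i < n; ++i )
        sum += ( x[i] < 0.0 ? -x[i] : x[i] );
    return sum;
}

/* Hypothetical generator that can emit exact zeros. */
static void example_randv_once( int n, double* x )
{
    for ( int i = 0; i < n; ++i )
        x[i] = ( double )( rand() % 3 - 1 );
}

/* Re-randomizes until the 1-norm is non-zero, so that a later
   normalization step never computes 0.0/0.0. */
static void example_randv_nonzero( int n, double* x )
{
    if ( n <= 0 ) return;  /* nothing to fill; avoids an endless loop */

    do
    {
        example_randv_once( n, x );
    }
    while ( example_norm1v( n, x ) == 0.0 );
}
```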

commit 35e38fb693e7cbf2f3d7e0505a63b2c05d3f158d
Author: Isuru Fernando <isurufgmail.com>
Date: Tue Jun 16 10:59:41 2020 -0500

Fix typo in FAQ

commit 1c719c91a3ef0be29a918097652beef35647d4b2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jun 4 17:21:08 2020 -0500

Bugfixes, cleanup of sup dgemm ukernels.

Details:
- Fixed a few not-really-bugs:
- Previously, the d6x8m kernels were still prefetching the next upanel
of A using MR*rs_a instead of ps_a (same for prefetching of next
upanel of B in d6x8n kernels using NR*cs_b instead of ps_b). Given
that the upanels might be packed, using ps_a or ps_b is the correct
way to compute the prefetch address.
- Fixed an obscure bug in the rd_d6x8m kernel that, by dumb luck,
executed as intended even though it was based on faulty pointer
management. Basically, in the rd_d6x8m kernel, the pointer for B
(stored in rdx) was loaded only once, outside of the jj loop, and in
the second iteration its new position was calculated by incrementing
rdx by the *absolute* offset (four columns), which happened to be the
same as the relative offset (also four columns) that was needed. It
worked only because that loop only executed twice. A similar issue
was fixed in the rd_d6x8n kernels.
- Various cleanups and additions, including:
- Factored out the loading of rs_c into rdi in rd_d6x8[mn] kernels so
that it is loaded only once outside of the loops rather than
multiple times inside the loops.
- Changed outer loop in rd kernels so that the jump/comparison and
loop bounds more closely mimic what you'd see in higher-level source
code. That is, something like:
for( i = 0; i < 6; i+=3 )
rather than something like:
for( i = 0; i <= 3; i+=3 )
- Switched row-based IO to use byte offsets instead of byte column
strides (e.g. via rsi register), which were known to be 8 anyway
since otherwise that conditional branch wouldn't have executed.
- Cleaned up and homogenized prefetching a bit.
- Updated the comments that show the before and after of the
in-register transpositions.
- Added comments to column-based IO cases to indicate which columns
are being accessed/updated.
- Added rbp register to clobber lists.
- Removed some dead (commented out) code.
- Fixed some copy-paste typos in comments in the rv_6x8n kernels.
- Cleaned up whitespace (including leading ws -> tabs).
- Moved edge case (non-milli) kernels to their own directory, d6x8,
and split them into separate files based on the "NR" value of the
kernels (Mx8, Mx4, Mx2, etc.).
- Moved config-specific reference Mx1 kernels into their own file
(e.g. bli_gemmsup_r_haswell_ref_dMx1.c) inside the d6x8 directory.
- Added rd_dMx1 assembly kernels, which seem marginally faster than
the corresponding reference kernels.
- Updated comments in ref_kernels/bli_cntx_ref.c and changed to using
the row-oriented reference kernels for all storage combos.

commit 943a21def0bedc1732c0a2453afe7c90d7f62e95
Author: Isuru Fernando <isurufgmail.com>
Date: Thu May 21 14:09:21 2020 -0500

Add build instructions for Windows (404)

commit fbef422f0d968df10e598668b427af230cfe07e8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu May 21 10:30:41 2020 -0500

Separate OS X and Windows into separate FAQs.

Details:
- Separated the unified Mac OS X / Windows frequently asked question
into two separate questions, one for each OS.

commit 28be1a4265ea67e3f177c391aba3dbbcf840bd52
Author: Guodong Xu <guodong.xulinaro.org>
Date: Thu May 21 02:22:22 2020 +0800

avoid loading twice in armv8a gemm kernel (403)

This bug happens at a corner case, when k_iter == 0 and we jump to
CONSIDERKLEFT.

In current design, first row/col. of a and b are loaded twice.

The fix is to rearrange a and b (first row/col.) loading instructions.

Signed-off-by: Guodong Xu <guodong.xulinaro.org>

commit d51245e58b0beff2717156b980007c90337150d8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri May 8 18:00:54 2020 -0500

Add support for Intel oneAPI in configure.

Details:
- Properly select cc_vendor based on the output of invoking CC with the
--version option, including cases where CC is the variant of clang
that is included with Intel oneAPI. (However, we continue to treat
the compiler as clang for other purposes, not icc.) Thanks to Ajay
Panyala and Devin Matthews for reporting on this issue via 402.

commit 787adad73bd5eb65c12c39d732723a1ac0448748
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri May 8 16:18:20 2020 -0500

Defined netlib equivalent of xerbla_array().

Details:
- Added a function definition for xerbla_array_(), which largely mirrors
its netlib implementation. Thanks to Isuru Fernando for suggesting the
addition of this function.

commit c53b5153bee585685bf95ce22e058a7af72ecef0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 5 12:39:12 2020 -0500

Documented Perl prerequisite for build system.

Details:
- Added Perl to list of prerequisites for building BLIS. This is in part
(and perhaps completely?) due to some substitution commands used at
the end of configure that include '\n' characters that are not
properly interpreted by the version of sed included on some versions
of OS X. This new documentation addresses issue 398.

commit f032d5d4a6ed34c8c3e5ba1ed0b14d1956d0097c
Author: Guodong Xu <guodong.xulinaro.org>
Date: Thu Apr 30 01:08:46 2020 +0800

New kernel set for Arm SVE using assembly (396)

This adds two kernels for Arm SVE vector extensions.
1. a gemm kernel for double at sizes 8x8.
2. a packm kernel for double at dimension 8xk.

To achieve best performance, variable length agnostic programming
is not used. Vector length (VL) of 256 bits is mandated in both kernels.
Kernels to support other VLs can be added later.

"SVE is a vector extension for AArch64 execution mode for the A64
instruction set of the Armv8 architecture. Unlike other SIMD architectures,
SVE does not define the size of the vector registers, but constrains into
a range of possible values, from a minimum of 128 bits up to a maximum of
2048 in 128-bit wide units. Therefore, any CPU vendor can implement the
extension by choosing the vector register size that better suits the
workloads the CPU is targeting. Instructions are provided specifically
to query an implementation for its register size, to guarantee that
the applications can run on different implementations of the ISA without
the need to recompile the code." [1]

[1] https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/arm-scalable-vector-extensions-and-application-to-machine-learning

Signed-off-by: Guodong Xu <guodong.xulinaro.org>

commit 4d87eb24e8e1f5a21e04586f6df4f427bae0091b
Author: Yingbo Ma <mayingbo5gmail.com>
Date: Mon Apr 27 17:02:47 2020 -0400

Update KernelsHowTo.md (395)

commit 477ce91c5281df2bbfaddc4d86312fb8c8f879e2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 22 14:26:49 2020 -0500

Moved #include "cpuid.h" to bli_cpuid.c.

Details:
- Relocated the #include "cpuid.h" directive from bli_cpuid.h to
bli_cpuid.c. This was done because cpuid.h (which is pulled into
the post-build blis.h developer header) doesn't protect its
definitions with a preprocessor guard of the form:

#ifndef FOOBAR_H
#define FOOBAR_H
// header contents.
#endif

and as a result, applications (previously) could not include both
blis.h and cpuid.h (since the former was already including the
latter). Thanks to Bhaskar Nallani for raising this issue via 393
and to Devin Matthews for suggesting this fix.
- CREDITS file update.
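
The effect of such a guard can be demonstrated in a single file by repeating a guarded body twice, which mimics a header being included both directly and indirectly. This is only an illustration; foobar_answer() is a hypothetical function.

```c
/* First "inclusion" of the hypothetical header's contents. */
#ifndef FOOBAR_H
#define FOOBAR_H
static int foobar_answer( void ) { return 42; }
#endif

/* Second "inclusion": FOOBAR_H is already defined, so the preprocessor
   skips the body. Without the guard, foobar_answer() would be defined
   twice and compilation would fail. */
#ifndef FOOBAR_H
#define FOOBAR_H
static int foobar_answer( void ) { return 42; }
#endif
```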

commit 8bde63ffd7474a97c3a3b0b0dc1eae45be0ab889
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Apr 18 12:50:12 2020 -0500

Adding missing conjy to her2/syr2 in typed API doc.

Details:
- Fixed a missing argument (conjy) in the function signatures of
bli_?her2() and bli_?syr2() in docs/BLISTypedAPI.md. Thanks to Robert
van de Geijn for reporting this omission.

commit 976902406b610afdbacb2d80a7a2b4b43ff30321
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 17 15:11:10 2020 -0500

Disable packing by default in expert rntm_t init.

Details:
- Changed the behavior of bli_rntm_init() as well as the static
initializer, BLIS_RNTM_INITIALIZER, so that user-initialized rntm_t
objects by default specify the disabling of packing for A and B.
Packing of A/B was already disabled by default when calling non-expert
APIs (and enabled only when the user set environment variables
BLIS_PACK_A or BLIS_PACK_B). With this commit, the default behavior of
using user-initialized rntm_t objects with expert APIs comes into line
with the default behavior of non-expert APIs--that is, they now both
lead to the avoidance of packing in the sup code path. (Note: The
conventional code path is unaffected by the environment variables
BLIS_PACK_A/BLIS_PACK_B and/or the disabling of packing in a rntm_t
object when calling an expert API.) This addresses issue 392. Thanks
to Kiran Varaganti for bringing this inconsistency to our attention.
- The above change was accomplished by changing the definitions of
static functions bli_rntm_clear_pack_a() and bli_rntm_clear_pack_b()
in bli_rntm.h, which are both for internal use only.

commit 5f2aee7c5fa5d562acaf8fbde3df0e2a04e1dd1b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 7 14:55:15 2020 -0500

README.md update to promote supmt dgemm.

Details:
- Updated the sup entry in the "What's New" section of the README.md
file to promote the multithreaded dgemm sup feature introduced in
c0558fd.

commit f5923cd9ff5fbd91190277dea8e52027174a1d57
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 7 14:41:45 2020 -0500

CHANGELOG update (0.7.0)

0.7.0

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 7 14:41:44 2020 -0500

Version file update (0.7.0)

commit b04de636c1702e4cb8e7ad82bab3cf43d2dbdfc6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 7 14:37:43 2020 -0500

ReleaseNotes.md update in advance of next version.

Details:
- Updated docs/ReleaseNotes.md in preparation for next version.

commit 2cb604ba472049ad498df72d4a2dc47a161d4c3c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 6 16:42:14 2020 -0500

Rename more bli_thread_obarrier(), _obroadcast().

Details:
- Renamed instances of bli_thread_obarrier() and bli_thread_obroadcast()
that were made in the supmt-specific code committed to the 'amd'
branch, which has now been merged with 'master'. Prior to the merge,
'master' received commit c01d249, which applied these renamings to
the existing, non-sup codebase.

commit efb12bc895de451067649d5dceb059b7827a025f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 6 15:01:53 2020 -0500

Minor updates/elaborations to RELEASING file.

commit 2e3b3782cfb7a2fd0d1a325844983639756def7d
Merge: 9f3a8d4d da0c086f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 6 14:55:35 2020 -0500

Merge branch 'master' into amd

commit da0c086f4643772e111318f95a712831b0f981a8
Author: Satish Balay <balaymcs.anl.gov>
Date: Tue Mar 31 17:09:41 2020 -0500

OSX: specify the full path to the location of libblis.dylib (390)

* OSX: specify the full path to the location of libblis.dylib so that it can be found at runtime

Before this change:

Application gives runtime error [when linked with blis]
dyld: Library not loaded: libblis.3.dylib

balaykpro lib % otool -L libblis.dylib
libblis.dylib:
libblis.3.dylib (compatibility version 0.0.0, current version 0.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1281.0.0)

After this change:
balaykpro lib % otool -L libblis.dylib
libblis.dylib:
/Users/balay/petsc/arch-darwin-c-debug/lib/libblis.3.dylib (compatibility version 0.0.0, current version 0.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1281.0.0)

* INSTALL_LIBDIR -> libdir as INSTALL_LIBDIR has DESTDIR

Co-Authored-By: Jed Brown <jedjedbrown.org>

* CREDITS file update.

Co-authored-by: Jed Brown <jedjedbrown.org>
Co-authored-by: Field G. Van Zee <fieldcs.utexas.edu>

commit 2bca03ea9d87c0da829031a5332545d05e352211
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Mar 28 22:10:00 2020 +0000

Updates, tweaks to runme.sh in test/1m4m.

Details:
- Made several updates to test/1m4m/runme.sh, including:
- Added missing handling for 1m and 4m1a implementations when setting
the BLIS_??_NT environment variables.
- Added support for using numactl to run the test executables.
- Several other cleanups.

commit c40a33190b94af5d5c201be63366594859b1233f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 26 16:55:00 2020 -0500

Warn user when auto-detection returns 'generic'.

Details:
- Added logic to configure that causes the script to output a warning
to the user if/when "./configure auto" is run and the underlying
hardware feature detection code is unable to identify the hardware.
In these cases, the auto-detect code will return 'generic', which
is likely not what the user expected, and a flag will be set so that
a message is printed at the end of the configure output. (Thankfully,
we don't expect this scenario to play out very often.) Thanks to
Devin Matthews for suggesting this fix 384.

commit 492a736fab5b9c882996ca024b64646877f22a89
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Mar 24 17:28:47 2020 -0500

Fix vectorized version of bli_amaxv (382)

* Fix vectorized version of bli_amaxv

To match Netlib, i?amax should return:
- the lowest index among equal values
- the first NaN if one is encountered

* Fix typos.

* And another one...

* Update ref. amaxv kernel too.

* Re-enabled optimized amaxv kernels.

Details:
- Re-enabled the optimized, intrinsics-based amaxv kernels in the 'zen'
kernel set for use in haswell, zen, zen2, knl, and skx subconfigs.
These two kernels (for s and d datatypes) were temporarily disabled in
e186d71 as part of issue 380. However, the key missing semantic
properties that prompted the disabling of these kernels--returning the
index of the *first* rather than of the last element with largest
absolute value, and returning the index of the first NaN if one is
encountered--were added as part of 382 thanks to Devin Matthews.
Thus, now that the kernels are working as expected once more, this
commit causes these kernels to once again be registered for the
affected subconfigs, which effectively reverts all code changes
included in e186d71.
- Whitespace/formatting updates to new macros in bli_amaxv_zen_int.c.

Co-authored-by: Field G. Van Zee <fieldcs.utexas.edu>

commit e186d7141a51f2d7196c580e24e7b7db8f209db9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Mar 21 18:40:36 2020 -0500

Disabled optimized amaxv kernels.

Details:
- Disabled use of optimized amaxv kernels, which use vector intrinsics
for both 's' and 'd' datatypes. We disable these kernels because the
current implementations fail to observe a semantic property of the
BLAS i?amax_() subroutine, which is to return the index of the
*first* element containing the maximum absolute value (that is, the
first element if there exist two or more elements that contain the
same value). With the optimized kernels disabled, the affected
subconfigurations (haswell, zen, zen2, knl, and skx) will use the
default reference implementations. Thanks to Mat Cross for reporting
this issue via 380.
- CREDITS file update.

commit 9f3a8d4d851725436b617297231a417aa9ce8c6a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Mar 14 17:48:43 2020 -0500

Added missing return to bli_thread_partition_2x2().

Details:
- Added a missing return statement to the body of an early case handling
branch in bli_thread_partition_2x2(). This bug only affected cases
where n_threads < 4, and even then, the code meant to handle cases
where n_threads >= 4 executes and does the right thing, albeit using
more CPU cycles than needed. Nonetheless, thanks to Kiran Varaganti
for reporting this bug via issue 377.
- Whitespace changes to bli_thread.c (spaces -> tabs).

commit 8c3d9b9eeb6f816ec8c32a944f632a5ad3637593
Merge: 71249fe8 0f9e0399
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 10 14:03:33 2020 -0500

Merge branch 'amd' of github.com:flame/blis into amd

commit 71249fe8ddaa772616698f1e3814d40e012909ea
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 10 13:55:29 2020 -0500

Merged test/sup, test/supmt into test/sup.

Details:
- Updated the Makefile, test_gemm.c, and runme.sh in test/sup to be able
to compile and run both single-threaded and multithreaded experiments.
This should help with maintenance going forward.
- Created a test/sup/octave_st directory of scripts (based on the
previous test/sup/octave scripts) as well as a test/sup/octave_mt
directory (based on the previous test/supmt/octave scripts). The
octave scripts are slightly different and not easily mergeable, and
thus for now I'll maintain them separately.
- Preserved the previous test/sup directory as test/sup/old/supst and
the previous test/supmt directory as test/sup/old/supmt.

commit 0f9e0399e16e96da2620faf2c0c3c21274bb2ebd
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 5 17:03:21 2020 -0600

Updated sup performance graphs; added mt results.

Details:
- Reran all existing single-threaded performance experiments comparing
BLIS sup to other implementations (including the conventional code
path within BLIS), using the latest versions (where appropriate).
- Added multithreaded results for the three existing hardware types
showcased in docs/PerformanceSmall.md: Kaby Lake, Haswell, and Epyc
(Zen1).
- Various minor updates to the text in docs/PerformanceSmall.md.
- Updates to the octave scripts in test/sup/octave, test/supmt/octave.

commit 90db88e5729732628c1f3acc96eeefab49f2da41
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 2 15:06:48 2020 -0600

Updated sup[mt] Makefiles for variable dim ranges.

Details:
- Updated test/sup/Makefile and test/supmt/Makefile to allow specifying
different problem size ranges for the drivers where one, two, or three
matrix dimensions is large. This will facilitate the generation of
more meaningful graphs, particularly when two dimensions are tiny.

commit 31f11a06ea9501724feec0d2fc5e4644d7dd34fc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Feb 27 14:33:20 2020 -0600

Updates to octave scripts in test/sup[mt]/octave.

Details:
- Optimized scripts in test/sup/octave and test/supmt/octave for use
with octave 5.2.0 on Ubuntu 18.04.
- Fixed stray 'end' keywords in gen_opsupnames.m and plot_l3sup_perf.m,
which were not only unnecessary but also causing issues with versions
5.x.

commit c01d249d7c546fe2e3cee3fe071cd4c4c88b9115
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Feb 25 14:50:53 2020 -0600

Renamed bli_thread_obarrier(), _obroadcast().

Details:
- Renamed two bli_thread_*() APIs:
bli_thread_obarrier() -> bli_thread_barrier()
bli_thread_obroadcast() -> bli_thread_broadcast()
The 'o' was a leftover from when thrcomm_t objects tracked both
"inner" and "outer" communicators. They have long since been
simplified to only support the latter, and thus the 'o' is
superfluous.

commit f6e6bf73e695226c8b23fe7900da0e0ef37030c1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 24 17:52:23 2020 -0600

List Gentoo under supported external packages.

Details:
- Add mention of Gentoo Linux under the list of external packages in
the README.md file. Thanks to M. Zhou for maintaining this package.

commit 9e5f7296ccf9b3f7b7041fe1df20b927cd0e914b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Feb 18 15:16:03 2020 -0600

Skip building thrinfo_t tree when mt is disabled.

Details:
- Return early from bli_thrinfo_sup_grow() if the thrinfo_t object
address is equal to either &BLIS_GEMM_SINGLE_THREADED or
&BLIS_PACKM_SINGLE_THREADED.
- Added preprocessor logic to bli_l3_sup_thread_decorator() in
bli_l3_sup_decor_single.c that (by default) disables code that
creates and frees the thrinfo_t tree and instead passes
&BLIS_GEMM_SINGLE_THREADED as the thrinfo_t pointer into the
sup implementation.
- The net effect of the above changes is that a small amount of
thrinfo_t overhead is avoided when running small/skinny dgemm
problems when BLIS is compiled with multithreading disabled.

commit 90081e6a64b5ccea9211bdef193c2d332c68492f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 17 14:57:25 2020 -0600

Fixed bug(s) in mt sup when single-threaded.

Details:
- Fixed a syntax bug in bli_l3_sup_decor_single.c as a result of
changing function interface for the thread entry point function
(of type l3supint_t).
- Unfortunately, fixing the interface was not enough, as it caused
a memory leak in the sba at bli_finalize() time. It turns out that,
due to the new multithreading-capable variant code using thrinfo_t
objects--specifically, their calling of bli_thrinfo_grow()--we
have to pass in a real thrinfo_t object rather than the global
objects &BLIS_PACKM_SINGLE_THREADED or &BLIS_GEMM_SINGLE_THREADED.
Thus, I inserted the appropriate logic from the OpenMP and pthreads
versions so that single-threaded execution would work as intended
with the newly upgraded variants.

commit c0558fde4511557c8f08867b035ee57dd2669dc6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 17 14:08:08 2020 -0600

Support multithreading within the sup framework.

Details:
- Added multithreading support to the sup framework (via either OpenMP
or pthreads). Both variants 1n and 2m now have the appropriate
threading infrastructure, including data partitioning logic, to
parallelize computation. This support handles all four combinations
of packing on matrices A and B (neither, A only, B only, or both).
This implementation tries to be a little smarter when automatic
threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
recalculate the factorization in units of micropanels (rather than
using the raw dimensions) in bli_l3_sup_int.c, when the final
problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
or column-stored matrices. (This is used for the rrc and crc storage
cases.) Previously, copym was used, but that would no longer suffice
because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
instead of from the variant functions. This has the effect of making
the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
and inserted usage of these functions within bli_thrinfo_init(), which
previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
tests, as well as appropriate octave/matlab scripts to plot the
resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
that specifying any BLIS_*_NT variable, even if it is set to 1, will
be considered manual specification for the purposes of determining
whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.
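
The rule added to docs/Multithreading.md (any BLIS_*_NT variable that is set, even to 1, counts as manual specification) can be pictured with the following minimal sketch. It is not the BLIS implementation: the thread_env_t struct, the use of -1 for "unset", the assumption of five loop-level variables, and the function name are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical illustration only; not the BLIS implementation.
   A value of -1 stands for "variable not set". */
typedef struct
{
    long num_threads; /* BLIS_NUM_THREADS, or -1 if unset */
    long ways[5];     /* BLIS_*_NT values (e.g. BLIS_JC_NT), or -1 if unset */
} thread_env_t;

/* Returns true if num_threads should be factorized automatically. */
bool use_auto_factorization( const thread_env_t* env )
{
    /* Any BLIS_*_NT variable that is set, even to 1, counts as manual. */
    for ( size_t i = 0; i < 5; ++i )
        if ( env->ways[i] != -1 ) return false;

    /* Otherwise, auto-factorize only if a total thread count was given. */
    return env->num_threads != -1;
}
```

Under these assumptions, setting a single loop-level variable to 1 alongside a total of 8 threads selects the manual path, so the total is not factorized.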

commit d7a7679182d72a7eaecef4cd9b9a103ee0a7b42b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Feb 7 17:37:03 2020 -0600

Fixed int-to-packbuf_t conversion error (C++ only).

Details:
- Fixed an error that manifests only when using C++ (specifically,
modern versions of g++) to compile drivers in 'test' (and likely most
other application code that includes blis.h). Thanks to Ajay Panyala
for reporting this issue (374).

commit d626112b8d5302f9585fb37a8e37849747a2a317
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jan 15 13:27:02 2020 -0600

Removed sorting on LDFLAGS in common.mk (373).

Details:
- Removed a line of code in common.mk that passed LDFLAGS through the
sort function. The purpose was not to sort the contents, but rather
to remove duplicates. However, there is valid syntax in a string of
linker flags that, when sorted, yields different/broken behavior.
So I've removed the line in common.mk that sorts LDFLAGS. Also, for
future use, I've added a new function, rm-dupls, that removes
duplicates without sorting. (This function was based on code from a
stackoverflow thread that is linked to in the comments for that
code.) Thanks to Isuru Fernando for reporting this issue (373).
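
The rm-dupls function itself lives in common.mk, so the C below is only an illustration of the idea of removing duplicates while keeping the original order. The function name and the sample flags are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Removes duplicate strings from argv[0..n-1] in place, keeping the first
   occurrence of each and preserving the original order. Returns the new
   count. Hypothetical helper, not part of BLIS. */
size_t remove_dups_keep_order( const char** argv, size_t n )
{
    size_t n_kept = 0;

    for ( size_t i = 0; i < n; ++i )
    {
        int seen = 0;

        /* Compare against the entries that have already been kept. */
        for ( size_t j = 0; j < n_kept; ++j )
            if ( strcmp( argv[i], argv[j] ) == 0 ) { seen = 1; break; }

        if ( !seen ) argv[ n_kept++ ] = argv[i];
    }

    return n_kept;
}
```

Because nothing is reordered, flags whose meaning depends on their position relative to their neighbors keep that position, which a sort-based approach cannot guarantee.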

commit e67deb22aaeab5ed6794364520190936748ef272
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 14 16:01:34 2020 -0600

CHANGELOG update (0.6.1)

0.6.1

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 14 16:01:33 2020 -0600

Version file update (0.6.1)

commit 5db8e710a2baff121cba9c63b61ca254a2ec097a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 14 15:59:59 2020 -0600

ReleaseNotes.md update in advance of next version.

Details:
- Updated ReleaseNotes.md in preparation for next version.

commit cde4d9d7a26eb51dcc5a59943361dfb8fda45dea
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 14 15:19:25 2020 -0600

Removed 'attic/windows' (to prevent confusion).

Details:
- Finally removed 'attic/windows' and its contents. This directory once
contained "proto" Windows support for BLIS, but we've since moved on
to (thanks to Isuru Fernando) providing Windows DLL support via
AppVeyor's build artifacts. Furthermore, since 'windows' was the only
subdirectory within 'attic', the directory path would show up in
GitHub's listing at https://github.com/flame/blis, which probably led
to someone being confused about how BLIS provides Windows support. I
assume (but don't know for sure) that nobody is using these files, so
this is admittedly a case of shoot first and ask questions later.

commit 7d3407d4681c6449f4bbb8ec681983700ab968f3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 14 15:17:53 2020 -0600

CREDITS file update.

commit f391b3e2e7d11a37300d4c8d3f6a584022a599f5
Author: Dave Love <dave.lovemanchester.ac.uk>
Date: Mon Jan 6 20:15:48 2020 +0000

Fix parsing in vpu_count on workstation SKX (351)

* Fix parsing in vpu_count on workstation SKX

* Document Skylake-X as Haswell for single FMA

* Update vpu_count for Skylake and Cascade Lake models

* Support printing the configuration selected, controlled by the environment

Intended particularly for diagnosing mis-selection of SKX through
unknown, or incorrect, number of VPUs.

* Move bli_log outside the cpp condition, and use it where intended

* Add Fixme comment (Skylake D)

* Mostly superficial edits to commits towards 351.

Details:
- Moved architecture/sub-config logging-related code from bli_cpuid.c
to bli_arch.c, tweaked names, and added more set/get layering.
- Tweaked log messages output from bli_cpuid_is_skx() in bli_cpuid.c.
- Content, whitespace changes to new bullet in HardwareSupport.md that
relates to single-VPU Skylake-Xs.

* Fix comment typos

Co-authored-by: Field G. Van Zee <fieldcs.utexas.edu>

commit 5ca1a3cfc1c1cc4dd9da6a67aa072ed90f07e867
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 6 12:29:12 2020 -0600

Fixed 'configure' breakage introduced in 6433831.

Details:
- Added a missing 'fi' (endif) keyword to a conditional block added in
the configure script in commit 6433831.

commit e7431b4a834ef4f165c143f288585ce8e2272a23
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 6 12:01:41 2020 -0600

Updated 1m draft article link in README.md.

commit 6433831cc3988ad205637ebdebcd6d8f7cfcf148
Author: Jeff Hammond <jeff.r.hammondintel.com>
Date: Fri Jan 3 17:52:49 2020 -0800

blacklist ICC 18 for knl/skx due to test failures

Signed-off-by: Jeff Hammond <jeff.r.hammondintel.com>

commit af3589f1f98781e3a94a8f9cea8d5ea6f155f7d2
Author: Jeff Hammond <jeff.sciencegmail.com>
Date: Fri Jan 3 13:23:24 2020 -0800

blacklist Intel 19+

Signed-off-by: Jeff Hammond <jeff.r.hammondintel.com>

commit 60de939debafb233e57fd4e804ef21b6de198caf
Author: Jeff Hammond <jeff.sciencegmail.com>
Date: Wed Jan 1 21:30:38 2020 -0800

fix link to docs

the comment contains an incorrect link, which is trivially fixed here.

fgvanzee I hope you don't mind that I committed directly to master but this cannot break anything.

commit 52711073789b6b84eb99bb0d6883f457ed3fcf80
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Dec 16 16:30:26 2019 -0600

Fixed bugs in cblas_sdsdot(), sdsdot_().

Details:
- Fixed a bug in sdsdot_sub() that redundantly added the "alpha" scalar,
named 'sb'. This value was already being added by the underlying
sdsdot_() function. Thus, we no longer add 'sb' within sdsdot_sub().
Thanks to Simon Lukas Märtens for reporting this bug via 367.
- Fixed a second bug in order of typecasting intermediate products in
sdsdot_(). Previously, the "alpha" scalar was being added after the
"outer" typecast to float. However, the operation is supposed to first
add the dot product to the (promoted) scalar and THEN downcast the sum
to float. Thanks to Devin Matthews for catching this bug.

commit fe2560a4b1d8ef8d0a446df6002b1e7decc826e9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Dec 6 17:12:44 2019 -0600

Annotated missing thread-related symbols for export.

Details:
- Added BLIS_EXPORT_BLIS annotation to function prototypes for

bli_thrcomm_bcast()
bli_thrcomm_barrier()
bli_thread_range_sub()

so that these functions are exported to shared libraries by default.
This (hopefully) fixes issue 366. Thanks to Kyungmin Lee for
reporting this bug.
- CREDITS file update.

commit 2853825234001af8f175ad47cef5d6ff9b7a5982
Merge: efa61a6c 61b1f0b0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Dec 6 16:06:46 2019 -0600

Merge branch 'master' into amd

commit 61b1f0b0602faa978d9912fe58c6c952a33af0ac
Author: Nicholai Tukanov <nicholaiutexas.edu>
Date: Wed Dec 4 14:18:47 2019 -0600

Add prototypes for POWER9 reference kernels (365)

Updates and fixes to power9 subconfig.

Details:
- Register s,c,z reference gemm and trsm ukernels that assume elements
of B have been broadcast.
- Added prototypes for level-3 ukernels that assume elements of B have
been broadcast. Also added prototype for an spackm function that
employs a duplication/broadcast factor of 4.
- Register virtual gemmtrsm ukernels that work with broadcasting of B.
- Disable right-side hemm, symm, trmm, and trmm3 in bli_family_power9.h.
- Thanks to Nicholai Tukanov for providing these updates.

commit efa61a6c8b1cfa48781fc2e4799ff32e1b7f8f77
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 29 16:17:04 2019 -0600

Added missing bli_l3_sup_thread_decorator() symbol.

Details:
- Defined dummy versions of bli_l3_sup_thread_decorator() for OpenMP
and pthreads so that those builds don't fail when performing shared
library linking (especially for Windows DLLs via AppVeyor). For now,
these dummy implementations of bli_l3_sup_thread_decorator() are
merely carbon-copies of the implementation provided for single-
threaded execution (ie: the one found in bli_l3_sup_decor_single.c).
Thus, an OpenMP or pthreads build will be able to use the gemmsup
code (including the new selective packing functionality), as it did
before 39fa7136, even though it will not actually employ any
multithreaded parallelism.

commit 39fa7136f4a4e55ccd9796fb79ad5f121b872ad9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 29 15:27:07 2019 -0600

Added support for selective packing to gemmsup.

Details:
- Implemented optional packing for A or B (or both) within the sup
framework (which currently only supports gemm). The request for
packing either matrix A or matrix B can be made via setting
environment variables BLIS_PACK_A or BLIS_PACK_B (to any
non-zero value; if set, zero means "disable packing"). It can also
be made globally at runtime via bli_pack_set_pack_a() and
bli_pack_set_pack_b() or with individual rntm_t objects via
bli_rntm_set_pack_a() and bli_rntm_set_pack_b() if using the expert
interface of either the BLIS typed or object APIs. (If using the
BLAS API, environment variables are the only way to communicate the
packing request.)
- One caveat (for now) with the current implementation of selective
packing is that any blocksize extension registered in the _cntx_init
function (such as is currently used by haswell and zen subconfigs)
will be ignored if the affected matrix is packed. The reason is
simply that I didn't get around to implementing the necessary logic
to pack a larger edge-case micropanel, though this is entirely
possible and should be done in the future.
- Spun off the variant-choosing portion of bli_gemmsup_ref() into
bli_gemmsup_int(), in bli_l3_sup_int.c.
- Added new files, bli_l3_sup_packm_a.c, bli_l3_sup_packm_b.c, along
with corresponding headers, in which higher-level packm-related
functions are defined for use within the sup framework. The actual
packm variant code resides in bli_l3_sup_packm_var.c.
- Pass the following new parameters into var1n and var2m: packa, packb
bool_t's, pointer to a rntm_t, pointer to a cntl_t (which is for now
always NULL), and pointer to a thrinfo_t* (which for now is the address
of the global single-threaded packm thread control node).
- Added panel strides ps_a and ps_b to the auxinfo_t structure so that
the millikernel can query the panel stride of the packed matrix and
step through it accordingly. If the matrix isn't packed, the panel
stride of interest for the given millikernel will be set to the
appropriate value so that the mkernel may step through the unpacked
matrix as it normally would.
- Modified the rv_6x8m and rv_6x8n millikernels to read the appropriate
panel strides (ps_a and ps_b, respectively) instead of computing them
on the fly.
- Spun off the environment variable getting and setting functions into
a new file, bli_env.c (with a corresponding prototype header). These
functions are now used by the threading infrastructure (e.g.
BLIS_NUM_THREADS, BLIS_JC_NT, etc.) as well as the selective packing
infrastructure (e.g. BLIS_PACK_A, BLIS_PACK_B).
- Added a static initializer for mem_t objects, BLIS_MEM_INITIALIZER.
- Added a static initializer for pblk_t objects, BLIS_PBLK_INITIALIZER,
for use within the definition of BLIS_MEM_INITIALIZER.
- Moved the global_rntm object to bli_rntm.c and extern it where needed.
This means that the function bli_thread_init_rntm() was renamed to
bli_rntm_init_from_global() and relocated accordingly.
- Added a new file, bli_pack.c, which serves as the home for
functions that manage the pack_a and pack_b fields of the global
rntm_t, including from environment variables, just as we have
functions to manage the threading fields of the global rntm_t in
bli_thread.c.
- Reorganized naming for files in frame/thread, which mostly involved
spinning off the bli_l3_thread_decorator() functions into their own
files. This change makes more sense when considering the further
addition of bli_l3_sup_thread_decorator() functions (for now limited
only to the single-threaded form found in the _single.c file).
- Explicitly initialize the reference sup handlers in both
bli_cntx_init_haswell.c and bli_cntx_init_zen.c so that it's more
obvious how to customize to a different handler, if desired.
- Removed various snippets of disabled code.
- Various comment updates.
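
The environment variable convention described in the first item above (any non-zero value requests packing; if set, zero disables it) could be read roughly as follows. This is a sketch, not the code in bli_env.c: the function names are hypothetical, and treating an unset variable as "packing disabled" is an assumption.

```c
#include <stdlib.h>

/* Hypothetical helper: interprets the text of a variable such as
   BLIS_PACK_A. NULL (unset) yields the supplied default; otherwise zero
   means "disable packing" and any non-zero integer means "enable". */
int parse_pack_flag( const char* text, int dflt )
{
    if ( text == NULL ) return dflt;
    return strtol( text, NULL, 10 ) != 0;
}

/* Queries the environment for the packing request on matrix A.
   ASSUMPTION: an unset variable is treated as "do not pack". */
int pack_a_requested( void )
{
    return parse_pack_flag( getenv( "BLIS_PACK_A" ), 0 );
}
```

Separating the parsing from the getenv() call keeps the rule itself easy to test without modifying the process environment.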

commit bbb21fd0a9be8c5644bec37c75f9396eeeb69e48
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 21 18:15:16 2019 -0600

Tweaked SIAM/SC Best Prize language in README.md.

commit 043366f92d5f5f651d5e3371ac3adb36baf4adce
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 21 18:13:51 2019 -0600

Fixed typo in previous commit (SIAM/SC prize).

commit 05a4d583e65a46ff2a1100ab4433975d905d91f9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 21 18:12:24 2019 -0600

Added SIAM/SC prize to "What's New" in README.md.

commit 881b05ecd40c7bc0422d3479a02a28b1cb48383f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 21 16:34:27 2019 -0600

Fixed blastest failure for 'generic' subconfig.

Details:
- Fixed a subtle and complicated bug that only manifested via the BLAS
test drivers in the generic subconfiguration, and possibly any other
subconfiguration that did not register complex-domain gemm ukernels,
or registered ONLY real-domain ukernels as row-preferential. This is
a long story, but it boils down to an exception to the "transpose the
operation to bring storage of C into agreement with ukernel pref"
optimization in bli_hemm_front.c and bli_symm_front.c sabotaging the
proper functioning of the 1m method, but only when the imaginary
component of beta is zero. See the comments in issue 342 for more
details. Thanks to Dave Love for identifying the commit in which this
bug was introduced, and other feedback related to this bug.

commit 0c7165fb01cdebbc31ec00124d446161b289942f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 14 16:48:14 2019 -0600

Fixed obscure bug in bli_acquire_mpart_[mn]dim().

Details:
- Fixed a bug in bli_acquire_mpart_mdim(), bli_acquire_mpart_ndim(),
and bli_acquire_mpart_mndim() that allowed the use of a blocksize b
that is too large given the current row/column index (i.e., the i/j
argument) and the size of the dimension being partitioned (i.e., the
m/n argument). This bug only affected backwards partitioning/motion
through the dimension and was the result of a misplaced conditional
check-and-redirect to the backwards code path. It should be noted
that this bug was not discovered because it manifested in BLIS
(the callers in BLIS always make sure to pass in the "correct"
blocksize b); rather, it could have manifested if the functions
were used by 3rd party callers. Thanks to Minh Quan Ho for
reporting the bug via issue 363.
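
The kind of check involved can be sketched generically. This is not the code in bli_acquire_mpart_mdim(): the struct, the function name, and the convention that i counts the elements already consumed in the direction of motion are assumptions made for illustration.

```c
#include <stddef.h>

typedef struct
{
    size_t offset; /* index of the first element of the block */
    size_t size;   /* number of elements in the block */
} block_t;

/* Hypothetical helper: returns the next block of at most b elements from
   a dimension of size m, moving forward or backward. */
block_t next_block( int backward, size_t i, size_t b, size_t m )
{
    block_t blk = { 0, 0 };

    if ( i >= m ) return blk; /* nothing left to partition */

    /* Clamp the blocksize BEFORE branching on direction, so that both
       code paths see a value no larger than what remains. */
    size_t left = m - i;
    if ( b > left ) b = left;

    blk.size   = b;
    blk.offset = backward ? ( m - i - b ) : i;

    return blk;
}
```

With m = 10, b = 4, and i = 8, both directions yield a block of size 2; the backward block starts at offset 0 and the forward block at offset 8.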

commit fb8bef9982171ee0f60bc39e41a33c4d31fd59a9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 14 13:05:28 2019 -0600

Fixed copy-paste bug in bli_spackm_6xk_bb4_ref().

Details:
- Fixed a copy-paste bug in the new bli_spackm_6xk_bb4_ref() that
manifested as failures in single-precision real level-3 operations.
Also replaced the duplication factor constants with a const-qualified
variable, dfac, so that this won't happen again.
- Changed NC for single-precision real from 4080 to 8160 so that the
packed matrix B will have the same byte footprint in both single
and double real.

commit 8f399c89403d5824ba767df1426706cf2d19d0a7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 12 15:32:57 2019 -0600

Tweaked/added notes to docs/Multithreading.md.

Details:
- Added language to docs/Multithreading.md cautioning the reader about
the nuances of setting multithreading parameters via the manual and
automatic ways simultaneously, and also about how these parameters
behave when multithreading is disabled at configure-time. These
changes are an attempt to address the issues that arose in issue 362.
Thanks to Jérémie du Boisberranger for his feedback on this topic.
- CREDITS file update.

commit bdc7ee3394500d8e5b626af6ff37c048398bb27e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Nov 11 15:47:17 2019 -0600

Various fixes to support packing duplication in B.

Details:
- Added cpp macros to trmm and trmm3 front-ends to optionally force
those operations to be cast so the structured matrix is on the left.
symm and hemm already had such macros, but these too were renamed so
that the macros were individual to the operation. We now have four
such macros:
define BLIS_DISABLE_HEMM_RIGHT
define BLIS_DISABLE_SYMM_RIGHT
define BLIS_DISABLE_TRMM_RIGHT
define BLIS_DISABLE_TRMM3_RIGHT
Also, updated the comments in the symm and hemm front-ends related to
the first two macro guards, and added corresponding comments to the
trmm and trmm3 front-ends for the latter two guards. (They all
functionally do the same thing, just for their specific operations.)
Thanks to Jeff Hammond for reporting the bugs that led me to this
change (via 359).
- Updated config/old/haswellbb subconfiguration (used to debug issues
related to duplicating B during packing) to register: a packing
kernel for single-precision real; gemmbb ukernels for s, c, and z;
trsmbb ukernels for s, c, and z; gemmtrsmbb virtual ukernels for s, c
and z; and to use non-default cache and register blocksizes for s, c,
and z datatypes. Also declared prototypes for all of the gemmbb,
trsmbb, and gemmtrsmbb ukernel functions within the
bli_cntx_init_haswellbb() function. This should, once applied to the
power9 configuration, fix the remaining issues in 359.
- Defined bli_spackm_6xk_bb4_ref(), which packs single reals with a
duplication factor of 4. This function is defined in the same file as
bli_dpackm_6xk_bb2_ref() (bli_packm_cxk_bb_ref.c).

commit 0eb79ca8503bd7b237994335b9687457227d3290
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 8 14:48:48 2019 -0600

Avoid unused variable warning in lread.c (356).

Details:
- Replaced the line

f = f;

with

( void )f;

for the unused variable 'f' in blastest/f2c/lread.c. (Hopefully)
addresses issue 356, but since we don't use xlc who knows. Thanks
to Jeff Hammond for reporting this.

commit f377bb448512f0b578263387eed7eaf8f2b72bb7
Author: Jérôme Duval <jerome.duvalgmail.com>
Date: Thu Nov 7 23:39:29 2019 +0100

Add Haiku to the known OS list (361)

commit e29b1f9706b6d9ed798b7f6325f275df4e6be973
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 5 17:15:19 2019 -0600

Fixed failing testsuite gemmtrsm_ukr for power9.

Details:
- Added code that fixes false failures in the gemmtrsm_ukr module of the
testsuite. The tests were failing because the computation (bli_gemv())
that performs the numerical check was not able to properly traverse
the matrix operands bx1 and b11 that are views into the micropanel of
B, which has duplicated/broadcast elements under the power9 subconfig.
(For example, a micropanel of B with duplication factor of 2 needs to
use a column stride of 2; previously, the column stride was being
interpreted as 1.)
- Defined separate bli_obj_set_row_stride() and bli_obj_set_col_stride()
static functions in bli_obj_macro_defs.h. (Previously, only the
function bli_obj_set_strides() was defined. Amazing to think that we
got this far without these former functions.)
- Updated/expounded upon comments.
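
The column stride issue from the first item above can be illustrated with a toy layout. The sketch assumes a micropanel of B stored row by row, with each logical element repeated dfac times; the function names and this exact layout are hypothetical and are not taken from the power9 kernels.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of logical element (p,j) in a k x nr micropanel in which every
   element is stored dfac times in a row (hypothetical layout). */
size_t bpanel_offset( size_t p, size_t j, size_t nr, size_t dfac )
{
    assert( dfac > 0 );

    size_t rs = nr * dfac; /* row stride: one full row of duplicates */
    size_t cs = dfac;      /* column stride: skip over the duplicates */

    return p * rs + j * cs;
}

/* Packs a row-major k x nr block b into the duplicated layout bp, which
   must have room for k * nr * dfac elements. */
void pack_b_dup( size_t k, size_t nr, size_t dfac, const double* b, double* bp )
{
    for ( size_t p = 0; p < k; ++p )
        for ( size_t j = 0; j < nr; ++j )
            for ( size_t d = 0; d < dfac; ++d )
                bp[ bpanel_offset( p, j, nr, dfac ) + d ] = b[ p * nr + j ];
}
```

With a duplication factor of 2, stepping through a row with a column stride of 1 would land on the second copy of the first element instead of on the next logical element, which is the kind of mis-traversal described above.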

commit 49177a6b9afcccca5b39a21c6fd8e243525e1505
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Nov 4 18:09:37 2019 -0600

Fixed latent testsuite ukr module bugs for power9.

Details:
- Fixed a latent bug in the testsuite ukernel modules (gemm, trsm, and
gemmtrsm) that only manifested once we began running with parameters
that mimic those of power9. The problem was rooted in the way those
modules were creating objects (and thus allocating memory) for the
micropanel operands to the microkernel being tested. Since power9
duplicates/broadcasts elements of B in memory, we needed an easy way
of asking for more than one storage element per logical element in
the matrix. I incorrectly expressed this as:

bli_obj_create( datatype, k, n, ldbp, 1, &bp );

The problem here is that bli_obj_create() is exceedingly efficient
at calculating the size it passes to malloc() and doesn't allocate a
full leading dimension's worth of elements for the last column (or
row, in this example). This would normally not bother anyone since
you're not supposed to access that memory anyway. But here, my
attempted "hack" for getting extra elements was insufficient, and
needed to be changed to:

bli_obj_create( datatype, k, ldbp, ldbp, 1, &bp );

That is, the extra elements needed to be baked into the dimensions of
the matrix object in order to have the intended effect on the number
of elements actually allocated. Thanks to Jeff Hammond for reporting
this bug.
- Fixed a typically harmless memory leak in the aforementioned test
modules (the objects for the packed micropanels were not being freed).
- Updated/expanded a common comment across all three ukr test modules.
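
The allocation shortfall described in the first item above can be made concrete with a small calculation. The formula below is an illustrative lower bound on the number of elements needed to address every entry of a strided matrix; it is not a copy of what bli_obj_create() computes, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest number of elements that covers every (i,j) of an m x n matrix
   with row stride rs and column stride cs (both assumed positive). */
size_t min_elems( size_t m, size_t n, size_t rs, size_t cs )
{
    assert( rs > 0 && cs > 0 );

    if ( m == 0 || n == 0 ) return 0;

    /* Offset of the last element, plus one. */
    return ( m - 1 ) * rs + ( n - 1 ) * cs + 1;
}
```

Taking the hypothetical values k = 4, n = 6, and ldbp = 12, a k x n object with strides (ldbp, 1) needs only 42 elements by this bound, whereas a k x ldbp object with the same strides needs 48, which is the full k * ldbp. Baking the extra elements into the dimensions is what guarantees the larger allocation.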

commit c84391314d4f1b3f73d868f72105324e649f2a72
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Nov 4 13:57:12 2019 -0600

Reverted minor temp/wspace changes from b426f9e.

Details:
- Added missing license header to bli_pwr9_asm_macros_12x6.h.
- Reverted temporary changes to various files in 'test' and 'testsuite'
directories.
- Moved testsuite/jobscripts into testsuite/old.
- Minor whitespace/comment changes across various files.

commit 4870260f6b8c06d2cc01b7147d7433ddee213f7f
Author: Jeff Hammond <jeff.r.hammondintel.com>
Date: Mon Nov 4 11:55:47 2019 -0800

blacklist GCC 5 and older for POWER9 (360)

commit b426f9e04e5499c6f9c752e49c33800bfaadda4c
Author: Nicholai Tukanov <nicholaiutexas.edu>
Date: Fri Nov 1 17:57:03 2019 -0500

POWER9 DGEMM (355)

Implemented and registered power9 dgemm ukernel.

Details:
- Implemented 12x6 dgemm microkernel for power9. This microkernel
assumes that elements of B have been duplicated/broadcast during the
packing step. The microkernel uses a column orientation for its
microtile vector registers and thus implements column storage and
general stride IO cases. (A row storage IO case via in-register
transposition may be added at a future date.) It should be noted that
we recommend using this microkernel with gcc and *not* xlc, as issues
with the latter cropped up during development, including but not
limited to slightly incompatible vector register mnemonics in the GNU
extended inline assembly clobber list.

commit 58102aeaa282dc79554ed045e1b17a6eda292e15
Merge: 52059506 b9bc222b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 28 17:58:31 2019 -0500

Merge branch 'amd'

commit 52059506b2d5fd4c3738165195abeb356a134bd4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 23 15:26:42 2019 -0500

Added "How to Download BLIS" section to README.md.

Details:
- Added a new section to the README.md, just prior to the "Getting
Started" section, titled "How to Download BLIS". This section details
the user's options for obtaining BLIS and lays out four common ways
of downloading the library. Thanks to Jeff Diamond for his feedback
on this topic.

commit e6f0a96cc59aef728470f6850947ba856148c38a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 14 17:05:39 2019 -0500

Updated README.md to ack Facebook as funder.

commit b9bc222bfc3db4f9ae5d7b3321346eed70c2c3fb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 14 16:38:15 2019 -0500

Call bli_syrk_small() before error checking.

Details:
- In bli_syrk_front(), moved the conditional call to bli_syrk_check()
(if error checking is enabled) and the conditional scaling of C by
beta (if alpha is zero) so that they occur after, instead of before,
the call to bli_syrk_small(). This sequencing now matches that of
bli_gemm_small() in bli_gemm_front() and bli_trsm_small() in
bli_trsm_front().

commit f0959a81dbcf30d8a1076d0a6348a9835079d31a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 14 15:46:28 2019 -0500

When manual config is blacklisted, output error.

Details:
- Fixed and adjusted the logic in configure so that a more informative
error message is output when a user runs './configure ... <conf>' and
<conf> is present in the configuration blacklist. Previously, this
particular set of conditions would result in the message:

'user-specified configuration '' is NOT registered!

That is, the error message mis-identified the targeted configuration
as the empty string, and (more importantly) mis-identifies the
problem. Thanks to Tze Meng Low for reporting this issue.
- Fixed a nearby error message somewhat unrelated to the issue above.
Specifically, the wrong string was being printed when the error
message was identifying an auto-detected configuration that did not
appear to be registered.

commit 6218ac95a525eefa8921baf8d0d7057dfacebe9c
Merge: 0016d541 a617301f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 11 11:53:51 2019 -0500

Merge branch 'master' into amd

commit 0016d541e6b0da617b1fae6612d2b314901b7a75
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 11 11:09:44 2019 -0500

Changed -march=znver2 to =znver1 for clang on zen2.

Details:
- In config/zen2/make_defs.mk, changed the -march= flag so that
-march=znver1 is used instead of -march=znver2 when CC_VENDOR is
clang. (The gcc branch attempts to differentiate between various
versions, but the equivalent version cutoffs for clang are not
yet known by us, so we have to use a single flag for all versions
of clang. Hopefully -march=znver1 is new enough. If not, we'll
fall back to -march=bdver4 -mno-fma4 -mno-tbm -mno-xop -mno-lwp.)
This issue was discovered thanks to AppVeyor.

commit e94a0530e5ac4c78a18f09105f40003be2b517f7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 11 10:48:27 2019 -0500

Corrected zen NC that was non-multiple of NR.

Details:
- Updated an incorrectly set cache blocksize NC for single real within
config/zen/bli_cntx_init_zen.c that was not a multiple of the
corresponding value of NR. This issue, which was caught by Travis CI,
was introduced in 29b0e1e.
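
The constraint in question (NC must be a multiple of NR) is simple to check mechanically. The helpers below are a hypothetical sketch and the numbers used with them are illustrative only; they are not the zen blocksizes.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the cache blocksize nc is a whole number of nr-wide
   micropanels, and 0 otherwise. */
int nc_is_valid( size_t nc, size_t nr )
{
    return nr != 0 && nc % nr == 0;
}

/* Rounds nc down to the nearest multiple of nr. */
size_t round_down_to_multiple( size_t nc, size_t nr )
{
    assert( nr != 0 );
    return ( nc / nr ) * nr;
}
```

For instance, with nr = 16 a candidate value of 1000 fails the check, and rounding down gives 992, which passes.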

commit a2ffac752076bf55eb8c1fe2c5da8d9104f1f85b
Merge: 1cfe8e25 29b0e1ef
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 11 10:31:18 2019 -0500

Merge branch 'amd-master' into amd

commit 29b0e1ef4e8b84ce76888d73c090009b361f1306
Merge: 1cfe8e25 fdce1a56
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 11 10:24:24 2019 -0500

Code review + tweaks to AMD's AOCL 2.0 PR (349).

Details:
- NOTE: This is a merge commit of 'master' of git://github.com/amd/blis
into 'amd-master' of flame/blis.
- Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was
inadvertently not incremented when the Zen2 subconfiguration was
added.
- In bli_gemm_front(), added a missing conditional constraint around the
call to bli_gemm_small() that ensures that the computation precision
of C matches the storage precision of C.
- In bli_syrk_front(), reorganized and relocated the notrans/trans logic
that existed around the call to bli_syrk_small() into bli_syrk_small()
to minimize the calling code footprint and also to bring that code
into stylistic harmony with similar code in bli_gemm_front() and
bli_trsm_front(). Also, replaced direct accessing of obj_t fields with
proper accessor static functions (e.g. 'a->dim[0]' becomes
'bli_obj_length( a )').
- Added ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for
bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is
strictly speaking unnecessary, but it serves as a useful visual cue to
those who may be reading the files.
- Removed cpp macro-protected small matrix debugging code from
bli_trsm_front.c.
- Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc
version check for availability of -march=znver2, and added appropriate
support to configure script.
- Cleanups to compiler flags common to recent AMD microarchitectures in
config/zen/amd_config.mk, including: removal of -march=znver1 et al.
from CKVECFLAGS (since the -march flag is added within make_defs.mk);
setting CRVECFLAGS similarly to CKVECFLAGS.
- Cleanups to config/zen/bli_cntx_init_zen.c.
- Cleanups, added comments to config/zen/make_defs.mk.
- Cleanups to config/zen2/make_defs.mk, including making use of newly-
added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct
set of compiler flags based on the version of gcc being used.
- Reverted downstream changes to test/test_gemm.c.
- Various whitespace/comment changes.

commit a617301f9365ac720ff286514105d1b78951368b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Oct 8 17:14:05 2019 -0500

Updates to docs/CodingConventions.md.

commit 171f10069199f0cd280f18aac184546bd877c4fe
Merge: 702486b1 05d58edf
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 4 11:18:23 2019 -0500

Merge remote-tracking branch 'loveshack/emacs'

commit 702486b12560b5c696ba06de9a73fc0d5107ca44
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 2 16:35:41 2019 -0500

Removed stray FAQ section introduced in 1907000.

commit 1907000ad6ea396970c010f07ae42980b7b14fa0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 2 16:31:54 2019 -0500

Updated to FAQ (AMD-related questions).

Details:
- Added a couple potential frequently-asked questions/answers related
to AMD's fork of BLIS.
- Updated existing answers to other questions.

commit 834f30a0dad808931c9d80bd5831b636ed0e1098
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 2 12:45:56 2019 -0500

Mention mixeddt paper in docs/MixedDatatypes.md.

commit 05d58edfe0ea9279971d74f17a5f7a69c4672ed5
Author: Dave Love <dave.lovemanchester.ac.uk>
Date: Wed Oct 2 10:33:44 2019 +0100

Note .dir-locals.el in docs

commit 531110c339f199a4d165d707c988d89ab4f5bfe8
Author: Dave Love <dave.lovemanchester.ac.uk>
Date: Wed Oct 2 10:16:22 2019 +0100

Modify Emacs config
Confine it to cc-mode and add comment-start/end.

commit 4bab365cab98202259c70feba6ec87408cba28d8
Author: Dave Love <dave.lovemanchester.ac.uk>
Date: Tue Oct 1 19:22:47 2019 +0000

Add .dir-locals.el for Emacs (348)

A minimal version that could probably do with extending, but at least
gets the indentation roughly right.

commit 4ec8dad66b3d37b0a2b47d19b7144bb62d332622
Author: Dave Love <dave.lovemanchester.ac.uk>
Date: Thu Sep 26 16:27:53 2019 +0100

Add .dir-locals.el for Emacs

A minimal version that could probably do with extending, but at least
gets the indentation roughly right.

commit bc16ec7d1e2a30ce4a751255b70c9cbe87409e4f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 23 15:37:33 2019 -0500

Set execute bits of shared library at install-time.

Details:
- Modified the 0644 octal code used during installation of shared
libraries to 0755 (for Linux/OSX only). Thanks to Adam J. Stewart
for reporting this issue via 343.
- CREDITS file update.

commit c60db26aee9e7b4e5d0b031b0881e58d23666b53
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 17 18:04:17 2019 -0500

Fixed bad loop counter in bli_[cz]scal2bbs_mxn().

Details:
- Fixed a typo in the loop counter for the 'd' (duplication) dimension
in the complex macros of frame/include/level0/bb/bli_scal2bbs_mxn.h.
They shouldn't be used by anyone yet, but thankfully clang via
AppVeyor spit out warnings that alerted me to the issue.

commit c766c81d628f0451d8255bf5e4b8be0a4ef91978
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 17 18:00:29 2019 -0500

Added missing schema arg to knl packm kernels.

Details:
- Added the pack_t schema argument to the knl packm kernel functions.
This change was intended for inclusion in 31c8657. (Thank you SDE +
Travis CI.)

commit 31c8657f1d6d8f6efd8a73fd1995e995fc56748b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 17 17:42:10 2019 -0500

Added support for pre-broadcast when packing B.

Details:
- Added support for being able to duplicate (broadcast) elements in
memory when packing matrix B (ie: the right-hand operand) in level-3
operations. This turns out to be advantageous for some architectures
that can afford the cost of the extra bandwidth and somehow benefit
from the pre-broadcast elements (and thus being able to avoid using
broadcast-style load instructions on micro-rows of B in the gemm
microkernel).
- Support optionally disabling right-side hemm and symm. If this occurs,
hemm_r is implemented in terms of hemm_l (and symm_r in terms of
symm_l). This is needed when broadcasting during packing because the
alternative--supporting the broadcast of B while also allowing matrix
B to be Hermitian/symmetric--would be an absolute mess.
- Support alignment factors for packed blocks of A, B, and C separately
(as well as for general-purpose buffers). In addition, we support
byte offsets from those alignment values (which is different from
aligning by align+offset bytes to begin with). The default alignment
values are BLIS_PAGE_SIZE in all four cases, with the offset values
defaulting to zero.
- Pass pack_t schema into bli_?packm_cxk() so that it can be then passed
into the packm kernel, where it will be needed by packm kernels that
perform broadcasts of B, since the idea is that we *only* want to
broadcast when packing micropanels of B and not A.
- Added definition for variadic bli_cntx_set_l3_vir_ukrs(), which can be
used to set custom virtual level-3 microkernels in the cntx_t, which
would typically be done in the bli_cntx_init_*() function defined in
the subconfiguration of interest.
- Added a "broadcast B" kernel function for use with NP/NR = 12/6,
defined in ref_kernels/1m/bli_packm_cxk_bb_ref.c.
- Added gemm, gemmtrsm, and trsm "broadcast B" reference kernels
defined in ref_kernels/3/bb. (These kernels have been tested with
double real with NP/NR = 12/6.)
- Added ifndef ... endif guards around several macro constants defined
in frame/include/bli_kernel_macro_defs.h.
- Defined a few "broadcast B" static functions in
frame/include/level0/bb for use by "broadcast B"-style packm reference
kernels. For now, only the real domain kernels are tested and fully
defined.
- Output the alignment and offset values for packed blocks of A and B
in the testsuite's "BLIS configuration info" section.
- Comment updates to various files.
- Bumped so_version to 3.0.0.
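
The idea of pre-broadcasting while packing B can be sketched as follows.
This is a minimal illustration under stated assumptions, not the reference
kernel from this commit: the function name pack_b_bcast, its argument list,
and the packed layout (each element stored dup times in a row) are all
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Pack a k x nr micropanel of B into p, storing each element dup times
   consecutively. b is addressed with row stride rs_b and column stride
   cs_b; p must have room for k * nr * dup doubles. */
static void pack_b_bcast( size_t k, size_t nr, size_t dup,
                          const double* b, size_t rs_b, size_t cs_b,
                          double* p )
{
    assert( dup >= 1 && b != NULL && p != NULL );

    for ( size_t l = 0; l < k; ++l )
    {
        for ( size_t j = 0; j < nr; ++j )
        {
            const double beta = b[ l * rs_b + j * cs_b ];

            /* Duplicate the element so that a later kernel could use a
               plain vector load instead of a broadcast-style load. */
            for ( size_t d = 0; d < dup; ++d )
                p[ ( l * nr + j ) * dup + d ] = beta;
        }
    }
}
```

Packing the row-major 2 x 2 matrix { 1, 2, 3, 4 } with dup = 2 yields
{ 1, 1, 2, 2, 3, 3, 4, 4 }, at the cost of twice the packed storage.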

commit fd9bf497cd4ff73ccdfc030ba037b3cb2f1c2fad
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 17 15:45:24 2019 -0500

CREDITS file update.

commit 6c8f2d1486ce31ad3c2083e5c2035acfd4409a43
Author: ShmuelLevine <shmuel.levinegmail.com>
Date: Tue Sep 17 16:43:46 2019 -0400

Fix description for function bli_*pxby2v (340)

Fix typo in BLISTypedAPI.md for bli_?axpy2v() description.

commit b5679c1520f8ae7637b3cc2313133461f62398dc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 17 14:00:37 2019 -0500

Inserted Multithreading links into BuildSystem.md.

Details:
- Inserted brief disclaimers about default disabled multithreading
and default single-threadedness to BuildSystem.md along with links to
the Multithreading.md document. Thanks to Jeff Diamond for suggesting
these additions.
- Trivial reword of sentence regarding automatically-detected
architectures.

commit f4f5170f8482c94132832eb3033bc8796da5420b
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Sep 11 07:34:48 2019 -0500

Update README.md (338)

commit 1cfe8e2562e5e50769468382626ce36b734741c1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 5 16:08:30 2019 -0500

Reimplemented bli_cpuid_query() for ARM.

Details:
- Rewrote bli_cpuid_query() for ARM architectures to use stdio-based
functions such as fopen() and fgets() instead of popen(). The new code
does more or less the same thing as before--searches /proc/cpuinfo for
various strings, which are then parsed in order to determine the
model, part number, and features. Thanks to Dave Love for suggesting
this change in issue 335.
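
A stdio-based search of a cpuinfo-style file, as opposed to spawning a
process with popen(), might look like the sketch below. The helper name
cpuinfo_find and its interface are hypothetical; the actual
bli_cpuid_query() code is not reproduced here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Scan a cpuinfo-style text file for the first line that begins with key
   and copy the text following the ':' into out (always NUL-terminated).
   Returns 1 if a matching line was found, 0 otherwise. */
static int cpuinfo_find( const char* path, const char* key,
                         char* out, size_t out_len )
{
    char  line[ 256 ];
    int   found = 0;
    FILE* fp;

    assert( path != NULL && key != NULL && out != NULL && out_len > 0 );

    fp = fopen( path, "r" );
    if ( fp == NULL ) return 0;

    while ( !found && fgets( line, sizeof( line ), fp ) != NULL )
    {
        const char* val;
        size_t      n;

        if ( strncmp( line, key, strlen( key ) ) != 0 ) continue;

        val = strchr( line, ':' );
        if ( val == NULL ) continue;

        /* Skip the colon and any whitespace that follows it. */
        ++val;
        while ( *val == ' ' || *val == '\t' ) ++val;

        /* Copy up to the end of the line, truncating if necessary. */
        n = strcspn( val, "\r\n" );
        if ( n >= out_len ) n = out_len - 1;
        memcpy( out, val, n );
        out[ n ] = '\0';
        found = 1;
    }

    fclose( fp );
    return found;
}
```

A caller would pass "/proc/cpuinfo" as path and a field name such as
"CPU part" as key, then interpret the returned string itself.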

commit 7c7819145740e96929466a248d6375d40e397e19
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Aug 30 16:52:09 2019 -0500

Always use sqsumv to compute normfv. (334)

* Always use sqsumv to compute normfv on MacOS.

* Unconditionally disable the "dot trick" in normfv.

* Added explanatory comment to normfv definition.

Details:
- Added a comment above the unconditional disabling of the dotv-based
implementation of normfv. Thanks to Roman Yurchak, Devin Matthews,
and Isuru Fernando for helping with this improvement.
- CREDITS file update.

commit 80e6c10b72d50863b4b64d79f784df7befedfcd1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 29 12:12:08 2019 -0500

Added reproduction section to Performance docs.

Details:
- Added section titled "Reproduction" to both Performance.md and
PerformanceSmall.md that briefly nudges the motivated reader in the
right direction if he/she wishes to run the same performance
benchmarks used to produce the graphs shown in those documents.
Thanks to Dave Love for making this suggestion.

commit 14cb426414856024b9ae0f84ac21efcc1d329467
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Aug 28 17:04:33 2019 -0500

Updated OpenBLAS, Eigen sup results.

Details:
- Updated the results shown in docs/PerformanceSmall.md for OpenBLAS and
Eigen.

commit b02e0aae8ce2705e91023b98ed416cd05430a78e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Aug 27 14:37:46 2019 -0500

Updated test drivers to iterate backwards.

Details:
- Updated test driver source in test, test/3, test/1m4m, and
test/mixeddt to iterate through the problem space backwards. This
can help avoid certain situations where the CPU frequency does not
immediately throttle up to its maximum. Thanks to Robert van de
Geijn for recommending this fix (originally made to test/sup drivers
in 57e422a).
- Applied off-by-one matlab output bugfix from b6017e5 to test drivers
in test, test/3, test/1m4m, and test/mixeddt directories.
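
A backwards sweep over problem sizes, together with a 1-based row index
for matlab-style output, could be written as in the sketch below. The
function names and the index formula are assumptions chosen for
illustration; they are not copied from the test drivers.

```c
#include <assert.h>
#include <stddef.h>

/* 1-based output row for problem size p in the sweep
   p_begin, p_begin + p_inc, p_begin + 2*p_inc, ... */
static size_t matlab_row( size_t p, size_t p_begin, size_t p_inc )
{
    assert( p_inc > 0 && p >= p_begin );
    return ( p - p_begin ) / p_inc + 1;
}

/* Visit problem sizes from largest to smallest, recording each size and
   its output row. Returns the number of sizes visited (at most cap). */
static size_t sweep_backwards( size_t p_begin, size_t p_max, size_t p_inc,
                               size_t* sizes, size_t* rows, size_t cap )
{
    size_t n = 0;

    assert( p_inc > 0 && sizes != NULL && rows != NULL );

    for ( size_t p = p_max; p >= p_begin && n < cap; p -= p_inc )
    {
        sizes[ n ] = p;
        rows[ n ]  = matlab_row( p, p_begin, p_inc );
        ++n;

        /* Stop before the unsigned subtraction could wrap around. */
        if ( p - p_begin < p_inc ) break;
    }

    return n;
}
```

With p_begin = 100, p_max = 400, and p_inc = 100 the sweep visits 400,
300, 200, 100 and assigns rows 4, 3, 2, 1. As a purely hypothetical
example of how an increment-dependent off-by-one can hide, a formula such
as ( p - p_begin + 1 ) / p_inc + 1 gives the same rows for p_inc = 100
under integer division but differs when p_inc is 1.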

commit b6017e53f4b26c99b14cdaa408351f11322b1e80
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Aug 27 14:18:14 2019 -0500

Bugfix of output text + tweaks to test/sup driver.

Details:
- Fixed an off-by-one bug in the output of matlab row indices in
test/sup/test_gemm.c that only manifested when the problem size
increment was equal to 1.
- Disabled the building of rrc, rcr, rcc, crr, crc, and ccr storage
combinations for blissup drivers in test/sup. This helps make the
building of drivers complete sooner.
- Trivial changes to test/sup/runme.sh.

commit 138d403b6bb15e687a3fe26d3d967b8ccd1ed97b
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Aug 26 18:11:27 2019 -0500

Use -funsafe-math-optimizations and -ffp-contract=fast for all reference kernels when using gcc or clang. (331)

commit d5a05a15a7fcc38fb2519031dcc62de8ea4a530c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Aug 26 16:54:31 2019 -0500

Cropped whitespace from new sup graphs.

Details:
- Previously forgot to crop whitespace from the new .png graphs
added/updated in docs/graphs/sup.

commit a6c80171a353db709e43f9e6e7a3da87ce4d17ed
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Aug 26 16:51:31 2019 -0500

Fixed contents links in docs/PerformanceSmall.md.

Details:
- Corrected links in contents section of docs/PerformanceSmall.md,
which were erroneously directing readers to the corresponding
sections of docs/Performance.md.

commit 40781774df56a912144ef19cc191ed626a89f0de
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Aug 26 16:47:37 2019 -0500

Updated sup performance graphs with libxsmm.

Details:
- Added libxsmm to column-stored sup graphs presented in
docs/PerformanceSmall.md.
- Updated sup results for BLASFEO.
- Added sup results for Lonestar5 (Haswell).
- Addresses issue 326.

commit bfddf671328e7e372ac7228f72ff2d9d8e03ae18
Author: figual <figualucm.es>
Date: Mon Aug 26 12:01:33 2019 +0200

Fixed context registration for Cortex A53 (329).

commit 4a0a6e89c568246d14de4cc30e3ff35aac23d774
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Aug 24 15:25:16 2019 -0500

Changed test/sup alpha to 1; test libxsmm+netlib.

Details:
- Changed the value of alpha to 1.0 in test/sup/test_gemm.c. This is
needed because libxsmm currently only optimizes gemm operations where
alpha is unit (and beta is unit or zero).
- Adjusted the test/sup/Makefile to test libxsmm with netlib BLAS as its
fallback library. This is the library that will be called when the
problem dimensions are deemed too large, or any other criteria for
optimization are not met. (This was done not because it is realistic,
but rather so that it would be very clear when libxsmm ceased handling
gemm calls internally when the data are graphed.)

commit 7aa52b57832176c5c13a48e30a282e09ecdabf73
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Aug 23 16:12:50 2019 -0500

Use libxsmm API in test/sup; add missing -ldl.

Details:
- Switch the driver source in test/sup so that libxsmm_?gemm() is called
instead of ?gemm_() when compiling for / linking against libxsmm.
libxsmm's documentation isn't clear on whether it is even *trying* to
provide BLAS API compatibility, and I got tired of trying to figure it
out.
- Added missing -ldl in LDFLAGS when linking against libxsmm.

commit 57e422aa168bee7416965265c93fcd4934cd7041
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Aug 23 14:17:52 2019 -0500

Added libxsmm support to test/sup drivers.

Details:
- Modified test/sup/Makefile to build drivers that test the performance
of skinny/small problems via libxsmm.
- Modified test/sup/runme.sh to run aforementioned drivers.
- Modified test/sup/test_gemm.c so that problem sizes are tested in
reverse order (from largest to smallest). This can help avoid certain
situations where the CPU frequency does not immediately throttle up
to its maximum. Thanks to Robert van de Geijn for recommending this
fix.

commit 661681fe33978acce370255815c76348f83632bc
Merge: 2f387e32 ef0a1a0f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 22 14:29:50 2019 -0500

Merge branch 'master' of github.com:flame/blis

commit 2f387e32ef5f9a17bafb5076dc9f66c38b52b32d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 22 14:27:30 2019 -0500

Added Eigen -march=native hack to perf docs.

Details:
- Spell out the hack given to me by Sameer Agarwal in order to get Eigen
to build with -march=native (which is critically important for Eigen)
in docs/Performance.md and docs/PerformanceSmall.md.

commit ef0a1a0faf683fe205f85308a54a77ffd68a9a6c
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Aug 21 17:40:24 2019 -0500

Update do_sde.sh (330)

* Update do_sde.sh

Automatically accept SDE license and download directly from Intel

* Update .travis.yml

[ci skip]

* Update .travis.yml

Enable SDE testing for PRs.

commit 0cd383d53a8c4a6871892a0395591ef5630d4ac0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Aug 21 13:39:05 2019 -0500

Corrected variable type and comment update.

Details:
- Forgot to save all changes from bli_gemmtrsm4m1_ref.c before commit
in 8122f59. Fixed type mismatch and referenced github issue in
comment.

commit 8122f59745db780987da6aa1e851e9e76aa985e0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Aug 21 13:22:12 2019 -0500

Pacify 'restrict' warning in gemmtrsm4m1 ref ukr.

Details:
- Previously, some versions of gcc would complain that the same
pointer, one_r, is being passed in for both alpha and beta in the
fourth call to the real gemm ukernel in bli_gemmtrsm4m1_ref.c. This
is understandable since the compiler knows that the real gemm ukernel
qualifies all of its floating-point arguments (including alpha and
beta) with restrict. A small hack has been inserted into the file
that defines a new variable to store the value 1.0, which is now used
in lieu of one_r for beta in the fourth call to the real gemm ukernel,
which should pacify the compiler now. Thanks to Dave Love for
reporting this issue (328) and to Devin Matthews for offering his
'restrict' expertise.
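
The pattern described above, where the same pointer is passed for two
restrict-qualified parameters, can be reproduced in miniature. The kernel
toy_axpby and its caller are hypothetical stand-ins, not the gemm
microkernel or the code in bli_gemmtrsm4m1_ref.c.

```c
#include <assert.h>
#include <stddef.h>

/* Toy kernel: y := beta * y + alpha * x. Every pointer argument is
   restrict-qualified, as the commit says of the real gemm ukernel. */
static void toy_axpby( size_t n,
                       const double* restrict alpha,
                       const double* restrict x,
                       const double* restrict beta,
                       double* restrict       y )
{
    assert( alpha != NULL && x != NULL && beta != NULL && y != NULL );

    for ( size_t i = 0; i < n; ++i )
        y[i] = ( *beta ) * y[i] + ( *alpha ) * x[i];
}

/* Calling toy_axpby( n, &one, x, &one, y ) passes one pointer for two
   restrict parameters, which some compilers may warn about. Giving beta
   its own variable with the same value avoids the shared pointer. */
static void toy_caller( size_t n, const double* x, double* y )
{
    const double one      = 1.0;
    const double one_beta = 1.0; /* separate object used only for beta */

    toy_axpby( n, &one, x, &one_beta, y );
}
```

The result is unchanged by the extra variable: with x = { 1, 2 } and
y = { 10, 20 }, toy_caller leaves y = { 11, 22 }.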

commit e8c6281f139bdfc9bd68c3b36e5e89059b0ead2e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Aug 21 12:38:53 2019 -0500

Add -march support for specific gcc version ranges.

Details:
- Added logic to configure that checks the version of the compiler
against known version ranges that could cause problems later in the
build process. For example, versions of gcc older than 4.9.0 use
different -march labels than version 4.9.0 or later
('-march=corei7-avx' vs '-march=sandybridge', respectively).
Similarly, before 6.1, compilation on Zen was possible, but you
need to start with -march=bdver4 and then disable instruction sets
that were discarded during the transition from Excavator to Zen. So
now, configure substitutes 'yes'/'no' values into anchors in
config.mk.in, which sets various make variables (e.g. GCC_OT_4_9_0),
which can be accessed and branched upon by the various
configurations' make_defs.mk files when setting their compiler flags.
- Updated config/haswell/make_defs.mk to branch on GCC_OT_4_9_0.
- Updated config/sandybridge/make_defs.mk to branch on GCC_OT_4_9_0.
- Updated config/zen/make_defs.mk to branch on GCC_OT_6_1_0.
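
The real checks live in the configure script and the make_defs.mk files,
which are not shown here. Purely to illustrate the comparison they
perform, the C sketch below tests whether a version string is older than a
threshold and picks between the two sandybridge labels quoted above. The
function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if the "major.minor[.patch]" string ver is older than
   maj.min.pat, 0 if it is the same or newer, and -1 if ver could not be
   parsed. A missing patch level is treated as 0. */
static int version_older_than( const char* ver, int maj, int min, int pat )
{
    int a = 0, b = 0, c = 0;

    assert( ver != NULL );

    if ( sscanf( ver, "%d.%d.%d", &a, &b, &c ) < 2 ) return -1;

    if ( a != maj ) return a < maj;
    if ( b != min ) return b < min;
    return c < pat;
}

/* Choose the -march label using the 4.9.0 threshold named in the commit
   message; unparseable versions fall through to the newer label. */
static const char* sandybridge_march( const char* gcc_ver )
{
    return version_older_than( gcc_ver, 4, 9, 0 ) == 1
           ? "-march=corei7-avx"
           : "-march=sandybridge";
}
```

Here "4.8.5" selects -march=corei7-avx, while "4.9.0" and "9.1.0" select
-march=sandybridge.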

commit e6ac4ebcb6e6a372820e7f509c0af3342966b84a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Aug 20 13:49:47 2019 -0500

Added page size, source location to perf docs.

Details:
- Added the page size, as returned via 'getconf -a | grep PAGE_SIZE',
and the location of the performance drivers to docs/Performance.md
(test/3) and docs/PerformanceSmall.md (test/sup). Thanks to Dave
Love for suggesting these additions in 325.

commit fdce1a5648d69034fab39943100289323011c36f
Author: Meghana <Meghana.Vankadariamd.com>
Date: Wed Jul 24 15:04:41 2019 +0530

changed gcc version check condition from 'ifeq' to 'if greater or equal'

Change-Id: Ie4c461867829bcc113210791bbefb9517e52c226

commit c9486e0c4f82cd9f58f5ceb71c0df039e9970a20
Author: Meghana <Meghana.Vankadariamd.com>
Date: Wed Jul 24 09:45:17 2019 +0530

code to detect version of gcc and set flags accordingly for zen2

Change-Id: I29b0311d0000dee1a2533ee29941acf53f9e9f34

commit 54afe3dfe6828a1aff65baabbf14c98d92e50692
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 23 16:54:28 2019 -0500

Added "Education and Learning" ToC entry to README.

commit 9f53b1ce7ac702e84e71801fe96986f6aa16040e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 23 16:50:35 2019 -0500

Added "Education and Learning" section to README.

Details:
- Added a short section after the Intro of the README.md file titled
"Education and Learning" that directs interested readers to the
"LAFF-On Programming for High-Performance" massive open online course
(MOOC) hosted via edX.

commit deda4ca8a094ee18d7c7c45e040e8ef180f33a48
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jul 22 13:59:05 2019 -0500

Added test/1m4m driver directory.

Details:
- Added a new standalone test driver directory named '1m4m' that can
build and run performance experiments for BLIS 1m, 4m1a, assembly,
OpenBLAS, and the vendor library (MKL). This new driver directory
was used to regenerate performance results for the 1m paper.
- Added alternate (commented-out) cache blocksizes to
config/haswell/bli_cntx_init_haswell.c. These blocksizes tend to
work well on a 12-core Intel Xeon E5-2650 v3.

commit dcc0ce12fde4c6dca2b4764a1922a2ab19725867
Author: Meghana <Meghana.Vankadariamd.com>
Date: Mon Jul 22 17:12:01 2019 +0530

Added a global Makefile for AMD architectures in config/zen folder
This Makefile (amd_config.mk) has all the flags that are common to the EPYC series

Change-Id: Ic02c60a8293ccdd37f0f292e631acd198e6895de

commit af17bca26a8bd3dcbee8ca81c18d7b25de09c483
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 19 14:46:23 2019 -0500

Updated haswell MC cache blocksizes.

Details:
- Updated the default MC cache blocksizes used by the haswell subconfig
for both row-preferential (the default) and column-preferential
microkernels.

commit b5e9bce4dde5bf014dd9771ae741048e1f6c7748
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 19 14:42:37 2019 -0500

Updated -march flags for sandybridge, haswell.

Details:
- Updated the '-march=corei7-avx' flag in the sandybridge subconfig
to '-march=sandybridge' and the '-march=core-avx2' flag in the
haswell subconfig to '-march=haswell'. The older flags were used
by older versions of gcc and should have been updated to the newer
forms a long time ago. (The older flags were clearly working, even
though they are no longer documented in the gcc man page.)

commit c22b9dba5859a9fc94c8431eccc9e4eb9be02be1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 16 13:14:47 2019 -0500

More updates to comments in testsuite modules.

Details:
- Updated most comments in testsuite modules that describe how the
correctness test is performed so that it is clear whether the vector
(normfv) or matrix (normfm) form of Frobenius norm is used.

commit c4cc6fa702f444a05963db01db51bc7d6669e979
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 16 13:00:35 2019 -0500

New cntx_t blksz "set" functions + misc tweaks.

Details:
- Defined two new static functions in bli_cntx.h:
bli_cntx_set_blksz_def_dt()
bli_cntx_set_blksz_max_dt()
which developers may find convenient when experimenting with different
values of cache blocksizes.
- Updated one- and two-socket multithreaded problem size range and
increment values in test/3/Makefile.
- Changed default to column storage in test/3/test_gemm.c.
- Fixed typo in comment in testsuite/src/test_subm.c.

commit b84cee29f42855dc1f263e42b83b1a46ac8def87
Merge: 1f80858a c7dd6e6c
Author: Meghana Vankadari <Meghana.Vankadariamd.com>
Date: Mon Jul 8 02:03:07 2019 -0400

Merge "Added compiler flags for vanilla clang" into amd-staging-rome2.0

commit 1f80858abf5ca220b2998fbe6f9b06c32d3864c3
Author: kdevraje <kiran.Devrajegowdaamd.com>
Date: Fri Jul 5 16:05:11 2019 +0530

This check-in resolves the dgemm performance issue (Jira ticket CPUPL 458): an 'else' was missed during integration, so the code always followed the else path to get the block sizes

Change-Id: I0084b5856c2513ab1066c08c15b5086db6532717

commit c7dd6e6cd2f910cbefcdc1e04a5adeb919a23de0
Author: Meghana <meghana.vankadariamd.com>
Date: Thu Jul 4 09:32:51 2019 +0530

Added compiler flags for vanilla clang

Change-Id: I13c00b4c0d65bbda4c929848fd48b0ab611952ab

commit 2acd49b76457635625a01e31c2abc8902b23cf51
Author: Meghana <meghana.vankadariamd.com>
Date: Mon Jul 1 15:42:38 2019 +0530

fix for test failures using AOCC 2.0

Change-Id: If44eaccc64bbe96bbbe1d32279b1b5773aba08d1

commit ceee2f973ebe115beca55ca77f9e3ce36b14c28a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jun 24 17:47:40 2019 -0500

Fixed thrinfo_t printing bug for small problems.

Details:
- Fixed a bug in bli_l3_thrinfo_print_gemm_paths() and
bli_l3_thrinfo_print_trsm_paths(), defined in bli_l3_thrinfo.c,
whereby subnodes of the thrinfo_t tree are "dereferenced" near the
beginning of the functions, which may lead to segfaults in certain
situations where the thread tree was not fully formed because the
matrix problem was too small for the level of parallelism specified.
(That is, too small because some problems were assigned no work due
to the smallest units in the m and n dimensions being defined by the
register blocksizes mr and nr.) The fix requires several nested levels
of if statements, and this is one of those few instances where use of
goto statements results in (mostly) prettier code, especially in the
case of _gemm_paths(). And while it wasn't necessary, I ported this
goto usage to the loop body that prints the thrinfo_t work_id and
comm_id values for each thread. Thanks to Nicholai Tukanov for helping
to find this bug.
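
The shape of such a fix, checking each level of a possibly incomplete tree
before dereferencing it and jumping to a common exit, is sketched below.
toy_node_t and toy_describe are hypothetical and much simpler than
thrinfo_t and the printing functions named above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one level of a thread-info tree. */
typedef struct toy_node_s
{
    int                n_way; /* ways of parallelism at this level */
    struct toy_node_s* sub;   /* next level down; NULL if never formed */
} toy_node_t;

/* Record n_way for up to three levels. Levels that were never created
   are reported as -1. Returns the number of levels actually present. */
static int toy_describe( const toy_node_t* lvl0, int ways[3] )
{
    const toy_node_t* lvl1 = NULL;
    const toy_node_t* lvl2 = NULL;
    int               levels = 0;

    assert( ways != NULL );
    ways[0] = ways[1] = ways[2] = -1;

    /* Check each node before touching it. A single goto target replaces
       three nested levels of if statements. */
    if ( lvl0 == NULL ) goto done;
    ways[0] = lvl0->n_way; levels = 1;

    lvl1 = lvl0->sub;
    if ( lvl1 == NULL ) goto done;
    ways[1] = lvl1->n_way; levels = 2;

    lvl2 = lvl1->sub;
    if ( lvl2 == NULL ) goto done;
    ways[2] = lvl2->n_way; levels = 3;

done:
    return levels;
}
```

For a tree with only two levels formed, the third entry stays at -1 and
no NULL pointer is dereferenced.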

commit cac127182dd88ed0394ad81e6b91b897198e168a
Merge: 565fa385 3a45ecb1
Author: kdevraje <Kiran.Devrajegowdaamd.com>
Date: Mon Jun 24 13:01:27 2019 +0530

Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis
with public repo commit id 565fa3853b381051ac92cff764625909d105644d.

Change-Id: I68b9824b110cf14df248217a24a6191b3df79d42

commit c152109e9a3b1cd74760e8a3215a676d25c18d2e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jun 19 13:23:24 2019 -0500

Updated BLASFEO results in PerformanceSmall.md.

Details:
- Updated the BLASFEO performance graphs shown in PerformanceSmall.md
using a new commit of BLASFEO (2c9f312); updated PerformanceSmall.md
accordingly.
- Updated test/sup/octave/plot_l3sup_perf.m so that the .m files
containing the mpnpkp results do not need to be preprocessed in order
to plot half the problem size range (ie: up to 400 instead of the
800 range of the other shape cases).
- Trivial updates to runme.m.

commit 4d19c98110691d33ecef09d7e1b97bd1ccf4c420
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Jun 8 11:02:03 2019 -0500

Trivial change to MixedDatatypes.md link text.

commit 24965beabe83e19acf62008366097a7f198d4841
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Jun 8 11:00:22 2019 -0500

Fixed typo in README.md's MixedDatatypes.md link.

commit 50dc5d95760f41c5117c46f754245edc642b2179
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jun 7 13:10:16 2019 -0500

Adjust -fopenmp-simd for icc's preferred syntax.

Details:
- Use -qopenmp-simd instead of -fopenmp-simd when compiling with Intel
icc. Recall that this option is used for SIMD auto-vectorization in
reference kernels only. Support for the -f option has been completely
deprecated and removed in newer versions of icc in favor of -q. Thanks
to Victor Eijkhout for reporting this issue and suggesting the fix.

commit ad937db9507786874c801b41a4992aef42d924a1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jun 7 11:34:08 2019 -0500

Added missing include "bli_family_thunderx2.h".

Details:
- Added a cpp-conditional directive block to bli_arch_config.h that
includes "bli_family_thunderx2.h". The code has been missing since
adf5c17f. However, this never manifested as an error because the file
is virtually empty and not needed for thunderx2 (or most subconfigs).
Thanks to Jeff Diamond for helping to spot this.

commit ce671917b2bc24895289247feef46f6fdd5020e7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jun 6 14:17:21 2019 -0500

Fixed formatting/typo in docs/PerformanceSmall.md.

commit 86c33a4eb284e2cf3282a1809be377785cdb3703
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jun 5 11:43:55 2019 -0500

Tweaked language in README.md related to sup/AMD.

commit cbaa22e1ca368d36a8510f2b4ecd6f1523d1e1f3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jun 4 16:06:58 2019 -0500

Added BLASFEO results to docs/PerformanceSmall.md.

Details:
- Updated the graphs linked in PerformanceSmall.md with BLASFEO results,
and added documenting language accordingly.
- Updated scripts in test/sup/octave to plot BLASFEO data.
- Minor tweak to language re: how OpenBLAS was configured for
docs/Performance.md.

commit 763fa39c3088c0e2c0155675a3ca868a58bffb30
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jun 4 14:46:45 2019 -0500

Minor tweaks to test/sup.

Details:
- Changed starting problem and increment from 16 to 4.
- Added 'lll' (square problems) to list of problem size shapes to
compile and run with.
- Define BLASFEO location and added BLASFEO-related definitions.

commit 5e1e696003c9151b1879b910a1957b7bdd7b0deb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jun 3 18:37:20 2019 -0500

CHANGELOG update (0.6.0)
