Blis

Latest version: v0.9.1


0.3.2

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Apr 28 14:07:31 2018 -0500

Version file update (0.3.2)

commit cdf041ddadd8725e578e2f59f37ae341f26655af
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Apr 28 14:05:00 2018 -0500

Use config.mk instead of common.mk in bump-version.sh.

Details:
- Fixed inadvertent targeting of common.mk when testing whether configure
had already been run, rather than config.mk.

commit 6ded8f9f0364b3c07255e2532ada3eeb2ed2a715
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Apr 28 14:01:29 2018 -0500

Account for recent 'make distclean' in bump-version.sh.

Details:
- Added logic to build/bump-version.sh that will run './configure auto'
if 'common.mk' is not present (usually because 'make distclean' was run
recently).

commit 7c16fdce433f5dea0e83d5047553c955d8e46fd2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Apr 28 13:50:55 2018 -0500

Fixed typo in RELEASING file.

commit 5e5ca4984fcf6d72d3036c338bb9cdc64520a325
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Apr 28 13:48:01 2018 -0500

README updates.

Details:
- Updates to the top-level README files in the top-level directory as
well as the 'examples/oapi' directory.

commit 627b045e301defea6770dc5b64e1110cbec25153
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 27 18:11:19 2018 -0500

Added an example of using transposition with gemm.

Details:
- Added an example to examples/oapi/8level3.c to show how to indicate
transposition when performing a gemm operation.

commit 13a0eadc69d72933e322901f5b44944834e3c787
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 27 18:00:07 2018 -0500

Added more transposition/conjugation examples.

Details:
- Added code to examples/oapi/5level1m.c that demonstrates transposing
(and conjugate-transposing) unstructured matrices.
- Comment updates to 6level1m_diag.c to maintain consistency with new
examples in 5level1m.c.

commit 5606cd8881e75264a96af45dc8ea1905bab054f5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 27 17:13:10 2018 -0500

Added utility module to examples/oapi.

Details:
- Added a new code example file to examples/oapi demonstrating how to use
various utility operations.
- Comment updates to other example files.
- README updates.

commit ff26c94c6486374c709f93c6965ea18903bd6a18
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 27 12:31:34 2018 -0500

Added missing gcc version constraint for knl.

Details:
- Previously forgot to add explicit enforcement of a minimum gcc version
in configure script when 'knl' sub-configuration is requested.
- Comment updates to configure.

commit 4d97574e477b3e55ddbb6044b0542a92cd9bab30
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 24 18:48:09 2018 -0500

Added object API example code.

Details:
- Added an 'examples' directory at the top level.
- Added an 'oapi' subdirectory in 'examples' that contains a tutorial-like
sequence of example code demonstrating the core functionality of BLIS's
object-based API, along with a Makefile and README. Thanks to Victor
Eijkhout for being the first to suggest including such code in BLIS.

commit d6ab25a3232aa52b9b855088fb4b0b46ff2c00c8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 24 18:43:03 2018 -0500

Add setijm, getijm operations.

Details:
- Added bli_setgetijm.c, which defines bli_setijm(), bli_getijm(), and
related functions that can be used to read and write individual
elements of an obj_t.
- Defined a new function, bli_obj_create_conf_to(), in bli_obj.c that will
create a new object with dimensions conformal to an existing object.
Transposition and conjugation states on the existing object are ignored,
as are structure and uplo fields.
- Defined a new function, bli_datatype_string(), in bli_obj.c that returns
a char* to a string representation of the name of each num_t datatype.
For example, BLIS_DOUBLE is "double" and BLIS_DCOMPLEX is "dcomplex".
BLIS_INT is included (as "int"), but BLIS_CONSTANT is not, and thus is
not a valid input argument to bli_datatype_string().
- Added calls to bli_init_once() to various functions in bli_obj.c, the
most important of which was bli_obj_create_without_buffer().
- Removed unintended/extra newline from the end of printv output.
- Whitespace changes to
- frame/base/bli_machval.c
- frame/base/bli_machval.h
- frame/0/copysc/bli_copysc.c
- Trivial changes to README.md and common.mk.
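
Reading or writing one element of a matrix object comes down to indexing a buffer through its row and column strides. The following is a minimal sketch of that idea only. The sketch_matrix_t type and the sketch_setij()/sketch_getij() helpers are hypothetical names invented for illustration; they are not the obj_t type or the signatures of bli_setijm()/bli_getijm(), and they handle only real double-precision data.

```c
#include <stddef.h>

/* Hypothetical stand-in for a matrix object: an m x n view over a buffer
   in which element (i,j) lives at buf[ i*rs + j*cs ]. */
typedef struct
{
    double* buf;
    size_t  m, n;   /* dimensions */
    size_t  rs, cs; /* row stride and column stride */
} sketch_matrix_t;

/* Write alpha to element (i,j). Returns 0 on success, -1 if out of range. */
int sketch_setij( double alpha, size_t i, size_t j, sketch_matrix_t* a )
{
    if ( i >= a->m || j >= a->n ) return -1;
    a->buf[ i * a->rs + j * a->cs ] = alpha;
    return 0;
}

/* Read element (i,j) into *alpha. Returns 0 on success, -1 if out of range. */
int sketch_getij( size_t i, size_t j, const sketch_matrix_t* a, double* alpha )
{
    if ( i >= a->m || j >= a->n ) return -1;
    *alpha = a->buf[ i * a->rs + j * a->cs ];
    return 0;
}
```

With rs = 1 and cs = m the view is column-major; with rs = n and cs = 1 it is row-major.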

commit a731a428f7fc02fd6ab4f953ead828c1d06fb5a1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 17 16:44:55 2018 -0500

Another README.md update.

commit c734ee928a824b27d280a9a67b1b4bc8423d5795
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 17 16:40:05 2018 -0500

README.md update.

commit 03ecad372d8eb603ee905a7b944d0544a813460a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 17 14:16:59 2018 -0500

Added RELEASING file.

Details:
- Added a file named 'RELEASING' that contains basic notes on how to
create a new version/release of BLIS. This is mostly just a reminder
to myself, but also may become useful if/when others take over
development and administration of the project.

commit 24b3c3149ce66546b9a1afc2cc794a637a86aa60
Merge: 60366a3f 817b67c0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 16 18:49:38 2018 -0500

Merge branch 'dev' of github.com:flame/blis into dev

commit 60366a3faba4e60cee85c3b87a3f69625f4b9026
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 16 18:46:21 2018 -0500

Updates to knl kernels and related code.

Details:
- Imported the 24x16 knl sgemm microkernel (and its corresponding spackm
kernel) from TBLIS and enabled its use in the knl sub-config. Also
added sgemm microkernel prototype to bli_kernels_knl.h.
- Updated dgemm and dpackm microkernels from TBLIS, which included an
important change regarding the offsets array (changed from extern
declaration to static declaration/definition).
- Activated use of level-1v and -1f zen kernels in skx and knl
sub-configs.
- Removed some old macros no longer needed in bli_family_skx.h now that
libmemkind support exists in configure.
- Moved bli_avx512_macros.h to frame/include and adjusted includes in
skx and knl kernels accordingly.
- Moved unused kernels in kernels/knl/3 to kernels/knl/3/other
directory.
- Fixed a minor bug in the 'make' output per compile when verboseness
is not turned on. The rule-generating function 'make-kernel-rule' was
previously passing in the name of the config, rather than the name of
the kernel set returned by get-config-for-kset, which could give
misleading information to the user when the kconfig_map mapped a
kernel set to a sub-configuration that did not share the same name.
(This didn't affect the CFLAGS that were actually used.)
- Updated test/3m4m/Makefile, removing acml targets and renaming the
remaining targets.

commit 817b67c01752e0ca8fe230bb8ad23afc7bd0f64e
Merge: 67c9c2f8 2b7108a8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 16 14:06:26 2018 -0500

Merge branch 'dev' of github.com:flame/blis into dev

commit 67c9c2f86d5ef2accc439b21581d73d82754a2e3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 16 14:03:12 2018 -0500

Retired haswell gemm microkernels.

Details:
- Moved microkernels in kernels/haswell/3 to kernels/haswell/3/old. These
microkernels were no longer being used and only sowed confusion to
anyone inspecting the repository without being fully cognizant of the
build system and how it works (and sometimes even to those who wrote
the build system). Note that the haswell configuration currently
employs the zen microkernels.

commit 2b7108a8ef8ce958b3acad028ff07c85ff97fd63
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 16 12:35:53 2018 -0500

Minor updates to test driver makefiles.

Details:
- Cleaned up and homogenized the various test driver Makefiles in
testsuite and test directories.
- Very minor updates to test driver code.

commit 9f56df95570a24587b910b169f342bd356ccbfb6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 11 14:51:36 2018 -0500

Trivial tweaks to configure blacklisting output.

Details:
- Updated output of information vis-a-vis configuration blacklisting.

commit f56481efebd9a7785c0618f3a12c0bec36f46333
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 10 19:02:21 2018 -0500

Cleaned up assembler version query on OS X.

Details:
- Switched from querying version of 'objdump' to 'as' (i.e. the
assembler).
- Fixed the outputting of the version of 'as' on OS X, which required
this beauty:
...=$(as -v /dev/null -o /dev/null 2>&1)
- Only add sub-configs to blacklist if the sub-config hasn't already
been added.

commit 088c474e629535affbe111f141f895af50d109be
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 10 18:09:56 2018 -0500

Added support for blacklisting via the assembler.

Details:
- Added logic to configure that attempts to assemble various small files
containing select instructions designed to reveal whether binutils
(specifically, the assembler) supports emitting those instruction sets.
This information provides additional opportunities to blacklist sub-
configurations that are unsupported by the environment. Thanks to Devin
Matthews for pointing me towards a similar solution in TBLIS as an
example.
- Various other cleanups in configure.
- Reorganized the detection code in the 'build' directory, bringing the
"auto-detect" configuration detection, libmemkind detection, and new
instruction set detection codes into a single new subdirectory named
'detect'.

commit 78a24e7dada52a3582f8488795bd1a44993989d9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 9 17:02:13 2018 -0500

Updated bli_avx512_macros.h in knl and skx configs.

Details:
- Downloaded updated version of bli_avx512_macros.h from TBLIS [1] in
attempt to address issue 192.
[1] https://github.com/devinamatthews/tblis/

commit 388f64d6ade14caa4a6c286845ad2d565378b2bb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 9 15:33:10 2018 -0500

Fixed failure to honor CC= argument to configure.

Details:
- Fixed a failure to observe the value of CC when selecting the compiler
in configure. Thanks to Devangi Parikh for reporting this bug.
- The semantics now also work for the CC environment variable. That is,
if CC is set prior to running configure, that value is used, but will
be overridden by specifying the CC= argument to configure. If the CC
environment variable is not set, the CC= value is used. If neither the
environment variable nor CC= are specified, then the choice is made
internally to configure: first attempting to find gcc, then clang, and
then cc.
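
The precedence described above can be sketched in C, even though the real logic lives in the configure shell script. Everything here is an assumption made for illustration: sketch_select_cc(), the sketch_found_fn predicate type, and sketch_only_clang_found() are hypothetical names, and the predicate stands in for whatever test configure uses to locate a compiler.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical predicate: nonzero if the named compiler can be found. */
typedef int (*sketch_found_fn)( const char* name );

/* Returns the compiler to use, or NULL if none could be selected.
   cc_arg models the CC= argument to configure; cc_env models the CC
   environment variable. Either may be NULL or empty when unset. */
const char* sketch_select_cc( const char* cc_arg, const char* cc_env,
                              sketch_found_fn found )
{
    static const char* fallbacks[] = { "gcc", "clang", "cc" };
    size_t i;

    /* The CC= argument overrides the environment variable. */
    if ( cc_arg != NULL && cc_arg[0] != '\0' ) return cc_arg;
    if ( cc_env != NULL && cc_env[0] != '\0' ) return cc_env;

    /* Neither was given: try gcc, then clang, then cc. */
    for ( i = 0; i < sizeof( fallbacks ) / sizeof( fallbacks[0] ); ++i )
        if ( found( fallbacks[i] ) ) return fallbacks[i];

    return NULL;
}

/* Example predicate that pretends only clang is installed. */
int sketch_only_clang_found( const char* name )
{
    return strcmp( name, "clang" ) == 0;
}
```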

commit 45fbe66b3e2ab92f0b4fdf437d57c5d06603803d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 9 14:01:08 2018 -0500

Fixed libmemkind dependency for x86_64.

Details:
- Removed some old conditional code in config/knl/make_defs.mk that
added -lmemkind to LDFLAGS if DEBUG_TYPE was not 'sde' and inserted
code into common.mk that affirmatively filters out -lmemkind from
LDFLAGS if DEBUG_TYPE is 'sde'. (Thanks to Dave Love for reporting
this issue.) Other minor cleanups to neighboring code in common.mk.
- Updated CRVECFLAGS in knl/make_defs.mk to be based on -march=knl,
and then AVX-512 functionality is manually removed via various
-mno-avx512* flags. Also, make the setting of CRVECFLAGS conditional
on CC_VENDOR. Similar change to skx/make_defs.mk.
- Comment/whitespace updates.

commit ca982148b3b419db063cad2fa74376ec383a5c80
Author: dnp <devangiparikhgmail.com>
Date: Sun Apr 8 21:27:10 2018 -0500

Fixed bug in SKX sgemm microkernel. Modified SKX dgemm microkernel to be consistent with the sgemm microkernel.

commit bd0276752ccdd56ff897b1a5ae022f2ffe6e0b38
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 6 18:51:43 2018 -0500

Track separate ref kernel flags for each sub-config.

Details:
- Renamed CVECFLAGS variables in sub-configurations' make_defs.mk files
to CKVECFLAGS.
- Added default definitions of two new make variables to most sub-
configurations' make_defs.mk files--CROPTFLAGS and CRVECFLAGS--
which correspond to reference kernel analogues of the CKOPTFLAGS
and CKVECFLAGS, which track optimization and vectorization flags for
optimized kernels. Currently, two sub-configurations (knl and skx)
explicitly set CRVECFLAGS to non-default values (using AVX2 instead of
AVX-512 for reference kernels). Thanks to Jeff Hammond, whose feedback
prompted me to make this change (issue 187).
- Changed common.mk so that the get-refkern-cflags-for function returns
the flags associated with the given sub-configuration's CROPTFLAGS
and CRVECFLAGS (instead of CKOPTFLAGS and CKVECFLAGS).

commit b9aebce19480448817373e2df2b36bd090eae41a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 6 18:37:33 2018 -0500

De-verbosify makefile fragment generation.

Details:
- Changed from -v1 to -v0 when calling gen-make-frag.sh from configure.
The directory-by-directory recursive output didn't add much value to
the user, so now we just echo a line for each top-level directory into
which we will recurse (e.g. 'config', 'ref_kernels', 'frame', etc.).
This also helps keep more interesting information (from earlier in the
execution of configure) from scrolling out of the terminal window.

commit b549b91f26948991e13364f1f26a878da0f43aa0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 6 16:31:33 2018 -0500

Added 64-bit integer support to BLAS test drivers.

Details:
- Updated the build system and BLAS test drivers to use 64-bit integers
when BLIS is configured for 64-bit integers in the BLAS layer. Also
updated blastest/Makefile accordingly. Thanks to Dave Love for
reporting the need for this feature.
- Added a 'check' target to blastest/Makefile so that the user can see
a summary of the tests.
- Commented out the initial definition of INCLUDE_PATHS in common.mk,
which was used pre-monolithic header, back when BLIS needed paths to
*all* headers, rather than just a select few. This line is no longer
needed since the value of INCLUDE_PATHS is overwritten by a later
definition limited to only the header paths that are needed now.

commit d39fa1c04265869bdf8b6f453076359eec2f3c59
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Apr 5 19:38:35 2018 -0500

Adjusted CFLAGS used to compile bli_cntx_ref.c.

Details:
- Removed CKOPTFLAGS and CVECFLAGS from the set of CFLAGS used to
compile bli_cntx_ref.c for each configuration. This is necessary
because the file defines functions like bli_cntx_init_skx_ref(),
which are called during BLIS's initialization of the global kernel
structure, potentially being executed by an architecture that lacks
the instruction set used to compile the kernels for, in this example,
skx, which would lead to an illegal instruction error. Thanks to
Dave Love for reporting this issue.
- Further adjusted CFLAGS used when compiling code in the 'config'
directory (e.g. bli_cntx_init_skx.c) as well as code in 'frame' so
as to avoid the aforementioned issue.

commit 08b123084d35680beab379012f8f5a5a8b44a443
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Apr 5 14:25:39 2018 -0500

Added color-coding to 'make check' output.

Details:
- Added color coding to output of check-blistest.sh, check-blastest.sh
scripts. Success messages are coded green and failures are coded red.
This helps draw the eye toward those messages as the 'make checkblis',
'make checkblis-fast', and 'make checkblas' targets are executed.
- Changed top-level Makefile so that execution will not halt if
'checkblis', 'checkblis-fast', or 'checkblas' targets fail, which
means that the second of the two tests (BLIS and BLAS) run by
'make check' will run even if the first test fails.

commit c9e4d7db7410b03c1ffe8c9727e9f1b2ba7fecfe
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 4 17:13:15 2018 -0500

CHANGELOG update (0.3.1)

0.3.1

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 4 17:13:15 2018 -0500

Version file update (0.3.1)

commit e6cc9ee26bcf0450f1120d5d12985b04d9fb8516
Merge: 786d15c5 3c91c7ae
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 4 16:08:18 2018 -0500

Merge branch 'dev' of github.com:flame/blis into dev

commit 786d15c5ef09f1f647b126b63d57e76d5810c58e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 4 16:06:47 2018 -0500

Added skx, knl to x86_64 configuration family.

Details:
- Added 'skx' and 'knl' sub-configurations to the 'x86_64' configuration
family in the config_registry file.
- Added logic to configure that avoids committing certain sub-configs to
the configuration/kernel registries if those sub-configs cannot be
handled properly by the chosen compiler. (This was modeled after
similar logic in TBLIS's configure; thanks to Devin Matthews for
pointing this out.) First, the compiler and its version are inspected
and, based on the results, certain configurations are added to a
"blacklist". Then, as the configuration registries are being created,
configurations and/or kernels that match items in the blacklist are
skipped over and not committed to the registries. Under certain
circumstances, omitting a blacklisted configuration will indirectly
invalidate other configurations due to the loss of availability of
the original blacklisted configuration's kernel set. This additional
indirect blacklist is also accounted for.
- Added output to the beginning of configure that echoes information
about the chosen compiler as well as the configurations that are
blacklisted and must be stripped from the registries.
- Various other cleanups in configure, especially with respect to
explicitly declaring local variables in functions.
- Comment updates to config/zen/make_defs.mk regarding choice of -march
flags based on compiler version.
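
The direct and indirect blacklisting described above can be illustrated with a simplified model. This is a sketch under stated assumptions, not the configure logic itself (which is a shell script): each registry entry is reduced to one configuration name plus the name of the one sub-configuration whose kernel set it uses, and sketch_entry_t and sketch_filter_registry() are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* Simplified registry entry: a configuration and the sub-configuration
   whose kernel set it relies on. */
typedef struct
{
    const char* config;
    const char* kset;
} sketch_entry_t;

/* Returns 1 if name appears in list, 0 otherwise. */
static int sketch_in_list( const char* name, const char* const* list, size_t n )
{
    size_t i;
    for ( i = 0; i < n; ++i )
        if ( strcmp( name, list[i] ) == 0 ) return 1;
    return 0;
}

/* Copies the surviving entries to out (which must have room for n_in
   entries) and returns how many were kept. */
size_t sketch_filter_registry( const sketch_entry_t* in, size_t n_in,
                               const char* const* blacklist, size_t n_bl,
                               sketch_entry_t* out )
{
    size_t i, n_out = 0;

    for ( i = 0; i < n_in; ++i )
    {
        /* Direct: the configuration itself is blacklisted. */
        if ( sketch_in_list( in[i].config, blacklist, n_bl ) ) continue;

        /* Indirect: the configuration depends on the kernel set of a
           blacklisted configuration, so it cannot be built either. */
        if ( sketch_in_list( in[i].kset, blacklist, n_bl ) ) continue;

        out[ n_out++ ] = in[i];
    }
    return n_out;
}
```

In this model, blacklisting a sub-configuration also removes any other configuration that borrows its kernels, which is the indirect effect the commit message refers to.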

commit 3c91c7aebafb446a2582267beb3b22c8bb475b3b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 2 12:40:25 2018 -0500

Fixed 64b type mismatch warning in cblas_xerbla.c.

Details:
- Fixed a compiler warning concerning a type mismatch between the
format specifier of the printf() call in cblas_xerbla.c and its
corresponding (info) argument. The warning manifested when the CBLAS
layer was enabled and the BLAS/CBLAS integer type size was set to 64
(the default is 32). The warning was fixed by changing the specifier
from %d to %jd and typecasting the argument to intmax_t. Thanks to
Dave Love for reporting this issue and submitting the patch.
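
The %jd-plus-cast idiom mentioned above works for any integer width, because intmax_t is at least as wide as every other signed integer type. A minimal sketch follows; sketch_format_info() and the message text are hypothetical and are not taken from cblas_xerbla.c.

```c
#include <stdint.h>
#include <stdio.h>

/* Formats an integer that may be 32 or 64 bits wide, depending on how the
   library was configured. Casting to intmax_t and printing with %jd is
   correct for either width, whereas %d matches only int. */
int sketch_format_info( char* buf, size_t len, int64_t info )
{
    return snprintf( buf, len, "info = %jd", ( intmax_t )info );
}
```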

commit 71eaf449a812fe2bd640d21513ec83974b2edb45
Merge: 6a628184 ae9a5be5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 27 17:21:43 2018 -0500

Merge branch 'dev'

commit ae9a5be56d6f9b87278d6032154d2dcf3fb7d54f
Author: dnp <devangiparikhgmail.com>
Date: Tue Mar 27 17:01:23 2018 -0500

Fixed bug in skx sgemm microkernel

commit 3f02af0905b1e2e2e065862f8afe5e9a52f282b2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 26 17:40:04 2018 -0500

Row storage optimizations to zen dotxf kernels.

Details:
- Split the main loop bodies of zen's [sd]dotxf kernels into two cases:
one to handle a column-stored matrix A and one to handle a row-stored
matrix A. This allows vector instructions to be employed even if A is
stored by rows (and A^T appears stored as columns). Both storage cases
use a common edge case loop. Thanks to Devin Matthews for this idea
and for prototyping the change needed for sdotxf kernel.
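
The two storage cases can be sketched in plain scalar C. This is an illustration of the loop structure only, assuming dotxf computes y := beta * y + alpha * A^T x for an m x b matrix A; it is not a transcription of the zen kernels, which use vector intrinsics, and sketch_dotxf() is a hypothetical name. Conjugation and any special handling of beta == 0 are left out.

```c
#include <stddef.h>

/* y := beta * y + alpha * A^T x, where A is m x b with element (i,j) at
   a[ i*rs + j*cs ], x has m elements, and y has b elements. */
void sketch_dotxf( size_t m, size_t b, double alpha,
                   const double* a, size_t rs, size_t cs,
                   const double* x, double beta, double* y )
{
    size_t i, j;

    if ( cs == 1 )
    {
        /* Row-stored A: the b elements of each row are contiguous, so
           sweep A one row at a time and update all b outputs together.
           The inner loop over j is the one that could be vectorized. */
        for ( j = 0; j < b; ++j ) y[j] *= beta;
        for ( i = 0; i < m; ++i )
        {
            const double axi = alpha * x[i];
            for ( j = 0; j < b; ++j ) y[j] += axi * a[ i * rs + j ];
        }
    }
    else
    {
        /* Column-stored (or general-stride) A: one dot product per
           column, with the inner loop running down the column. */
        for ( j = 0; j < b; ++j )
        {
            double rho = 0.0;
            for ( i = 0; i < m; ++i ) rho += a[ i * rs + j * cs ] * x[i];
            y[j] = beta * y[j] + alpha * rho;
        }
    }
}
```

Both branches produce the same result; they differ only in which loop touches contiguous memory.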

commit 679dcc331dd870ec680e135a3fb65ffa6e3a91c2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 26 15:35:17 2018 -0500

Make k_iter/k_left uint64_t in bulldozer fma ukrs.

Details:
- Changed the declaration of k_iter and k_left for d, c, z microkernels
from dim_t to uint64_t. This is needed to ensure compatibility with
the movq instruction used to load the value into registers. This
change should have been made a long time ago, but for some reason
only recently began showing up via Travis CI.

commit 6a628184f6938673440e4cdd4fed0208c51fd1f9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 26 14:48:16 2018 -0500

Fixed a memkind-related compile-time bug on knl.

Details:
- Fixed a compile-time error that occurred due to the fact that
BLIS_ENABLE_MEMKIND, defined in bli_config.h, was not being defined
soon enough to be used in bli_system.h where it is needed to determine
whether hbwmalloc.h should be included. bli_system.h is now included
after bli_config.h (and bli_config_macro_defs.h). Thanks to Dave Love
for reporting this issue.
- Tweaked the language used by configure to echo the status of the
--with[out]-memkind option.

commit e2192a8fd58ec3657434ddd407033e097edad8f4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Mar 23 12:53:48 2018 -0500

Removed vzeroupper intrinsics from zen kernels.

Details:
- Fixed a bug in the zen (also used by haswell) dotxf kernels whereby a
vzeroupper instruction destroyed part of the intermediate result
stored by the vdpps instructions that came right before. (The
vzeroupper intrinsic was removed.)
- Removed remaining vzeroupper intrinsics from other zen kernels.
Previously, the vzeroupper instructions were included because BLIS is
typically compiled with -mfpmath=sse. But it was brought to my
attention that inserting these vzeroupper instructions is unnecessary
for our purposes, since (a) -mfpmath=sse results in VEX-encoded scalar
code rather than literal SSE instructions, and (b) compilers already
(likely) insert vzeroupper instructions where necessary. Thanks to
Devin Matthews for zeroing in on the dotxf bug.
- Removed -malign-double from bulldozer make_defs.mk. This alignment
was already happening by default since bulldozer is an x86_64 system.

commit 22289ad23cd10b81451ce82f60d84b5f97e7fd85
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 22 18:21:30 2018 -0500

Added build system support for libmemkind.

Details:
- Added support for libmemkind to configure. configure attempts to
detect the presence of libmemkind by compiling a small program
containing '#include <hbwmalloc.h>' and a call to hbw_malloc(). If
successful, it is assumed that libmemkind is present and available.
If present, use of libmemkind is enabled by default, and otherwise
use is disabled by default. If libmemkind is present, the user may
explicitly disable use of the library by running configure with the
--without-memkind option. Furthermore, a configuration may disable
libmemkind, perhaps conditional on some aspect of the build system,
by including -DBLIS_DISABLE_MEMKIND in the configuration's CPPROCFLAGS
make variable and setting the BLIS_ENABLE_MEMKIND makefile variable,
set in config.mk, to 'no'. (The knl configuration makes use of this
latter feature; see below.)
- If enabled at configure-time, bli_system.h will include <hbwmalloc.h>
and bli_kernel_macro_defs.h will define BLIS_MALLOC_POOL and
BLIS_FREE_POOL to use hbw_malloc() and hbw_free(), respectively.
- Deprecated explicit use of BLIS_NO_HBWMALLOC in
config/knl/bli_family.knl.h and replaced use of -DBLIS_NO_HBWMALLOC in
config/knl/make_defs.mk with -DBLIS_DISABLE_MEMKIND, which overrides
(undefs) the definition of BLIS_ENABLE_MEMKIND in bli_system.h, if it
would otherwise be defined. Also, set the BLIS_ENABLE_MEMKIND makefile
variable to 'no'.
- common.mk now adds libmemkind to LDFLAGS if libmemkind is enabled.
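
The enable/disable interplay described above can be sketched with the preprocessor. The SKETCH_-prefixed macros are hypothetical stand-ins for the BLIS_ macros named in this commit, and the fallback to malloc()/free() when memkind is off is an assumption made so that the sketch compiles on systems without libmemkind.

```c
#include <stdlib.h>

/* A "disable" definition (e.g. passed via -D on the command line)
   overrides an "enable" definition that would otherwise be in effect. */
#ifdef SKETCH_DISABLE_MEMKIND
  #undef SKETCH_ENABLE_MEMKIND
#endif

#ifdef SKETCH_ENABLE_MEMKIND
  /* Provided by libmemkind; declares hbw_malloc() and hbw_free(). */
  #include <hbwmalloc.h>
  #define SKETCH_MALLOC_POOL hbw_malloc
  #define SKETCH_FREE_POOL   hbw_free
#else
  /* Assumed fallback when memkind is unavailable or disabled. */
  #define SKETCH_MALLOC_POOL malloc
  #define SKETCH_FREE_POOL   free
#endif

/* Allocates and releases a block through whichever pair was selected.
   Returns 1 if the allocation succeeded, 0 otherwise. */
int sketch_pool_roundtrip( size_t n_bytes )
{
    void* p = SKETCH_MALLOC_POOL( n_bytes );
    if ( p == NULL ) return 0;
    SKETCH_FREE_POOL( p );
    return 1;
}
```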

commit 7dc40eafdd9af3e8c4519a8d1b04d25830b4ca7a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 21 18:39:16 2018 -0500

Updates to top-level and test driver Makefiles.

Details:
- Added logic to common.mk that will choose a BLIS library against which
to link (LIBBLIS_LINK). The default choice is the static (.a) library;
the shared (.so) library is chosen only if the shared library build was
enabled and the static one was disabled.
- Updated the various test driver Makefiles to reference this common,
pre-chosen library against which to link. (Previously, these drivers
unconditionally linked against the static library and would have
failed if the static library build was disabled at configure-time.)
- Renamed many of the variables in common.mk and the top-level Makefile
so that variables relating to the libblis.[a|so] files, including
paths to those files, begin with "LIBBLIS".
- Shuffled around some of the library definitions from the top-level
Makefile to common.mk.
- Renamed BLIS_ENABLE_DYNAMIC_BUILD to BLIS_ENABLE_SHARED_BUILD, and
the enable_dynamic anchor to enable_shared in build/config.mk.in
and in configure.
- A few other cleanups in the top-level Makefile.

commit 97e1eeade3c51df1bae574a9bc1da34b05bf2bd3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 21 15:47:11 2018 -0500

Added input.operations.fast file for 'make check'.

Details:
- Added an 'input.operations.fast' file to testsuite directory to go
along with the 'input.general.fast' file used by the 'make check'
target in the top-level Makefile. This will allow the "fast" check
to prune operations and/or parameter combinations from the test
space in order to save time.
- Currently, input.operations.fast prunes trmm3 and all transposition
and conjugation parameters from the level-3 test space.
- Reduced problem size tested in input.general.fast to 100 and disabled
testing of 1m method.

commit c441caa95aabe69f54e2160eb67bf4ca76a66c34
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 20 17:56:02 2018 -0500

README update.

Details:
- Minor updates to README.md.
- Minor change to blastest/Makefile.

commit 6fe018eb4ac8c16f2edc916c24f5994848017b7f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 20 15:35:45 2018 -0500

Added .gitkeep file to blastest/obj.

Details:
- Added an empty file named '.gitkeep' to blastest/obj/ so that git will
track the otherwise empty directory. (This is already done for the BLIS
testsuite in testsuite/obj.)

commit 0e6d000db9291342913dc5f8590a28c67bbcbc95
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 20 15:08:43 2018 -0500

Updated .gitignore to ignore BLAS test out.* files.

commit 40c040a31d96fbadff11f761d0cad1ef03ef2cc5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 20 14:33:50 2018 -0500

Fixes to .travis.yml.

Details:
- Invoke the full BLIS testsuite via 'make testblis' instead of the fast
version via 'blistest-fast' (which was wrong anyway, since the correct
fast target is 'testblis-fast').
- Invoke the BLAS tests via 'make testblas' instead of 'blastest'.

commit 664ec4813d8b53121cce7a68bef47da656ece9cb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 20 13:54:58 2018 -0500

Integrated f2c'ed netlib BLAS test suite.

Details:
- Created a new test suite that exercises only the BLAS compatibility
found in BLIS. The test suite is a straightforward port of code
obtained from netlib LAPACK, run through f2c and linked to a stripped-
down version of libf2c that is compiled along with the test drivers
(to prevent any obvious ABI issues). The new BLAS test suite can be
run from within its new local directory, 'blastest' (through its local
'make ; make run' targets) or from the top-level Makefile (via the
'make testblas' target). Output files are created in whatever directory
the test drivers are run, whether it be the 'blastest' directory, the
top-level source distribution directory, or the out-of-tree directory
in which 'configure' was run. Also, the results of the BLAS test suite
can be checked via 'make checkblas', which summarizes the presence or
absence of test failures in a single line printed to stdout.
- Updated the 'test' target to run both 'testblis' and 'testblas'.
- Added a new 'testblis-fast' target that runs the BLIS testsuite with
smaller problem sizes, allowing it to finish more quickly.
- Added a 'make check' target, which runs 'checkblis-fast' and
'checkblas'.
- Changed .travis.yml so that Travis CI runs 'testblis-fast' instead of
'testblis' before calling the check-blistest.sh script to check the
result manually.
- Renamed some targets in the top-level Makefile to be consistent between
BLAS and BLIS.

commit fc53ad6c5b2e39238b1bbbf625cc0c638b9da4e1
Author: Nisanth M P <nisanth.padinharepattamd.com>
Date: Mon Mar 19 12:49:26 2018 +0530

Re-enabling the small matrix gemm optimization for target zen

Change-Id: I13872784586984634d728cd99a00f71c3f904395

commit d12d34e167d7dc32732c0ed135f8065a55088106
Author: Nisanth M P <nisanth.padinharepattamd.com>
Date: Mon Mar 19 11:34:32 2018 +0530

Re-enabling Zen optimized cache block sizes for config target zen

Change-Id: I8191421b876755b31590323c66156d4a814575f1

commit 40fa10396c0a3f9601cf49f6b6cd9922185c932e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 19 18:19:43 2018 -0500

Fixed a few obscure bugs in the BLAS API.

Details:
- Fixed a missing parameter in the definition of sdsdot_(). The 'sb'
argument was missing. Strangely, the argument is omitted from dsdot_()
in the BLAS API.
- Fixed the missing 'c' or 'u' in the "?gerc" or "?geru" operation string
passed to xerbla_() by the bla_ger_check() macro.
- For bla_syrk_check() and bla_syr2k_check() macros, only allow
conjugate-transpose (trans='c') as a valid argument for the real
domain functions [sd]syrk_() and [sd]syr2k_(). (Previously, the
argument was allowed even for the complex domain equivalents, which
was inconsistent with the BLAS API.)
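
For context on the 'sb' argument: in the reference BLAS, sdsdot adds a single-precision scalar to a dot product that is accumulated in double precision, then returns the result in single precision. A minimal sketch of that computation follows. sketch_sdsdot() is a hypothetical name, it takes its arguments by value rather than by pointer as the Fortran-callable sdsdot_() would, and it assumes positive increments.

```c
/* Returns sb + sum_i x[i*incx] * y[i*incy], accumulating in double.
   Assumes incx > 0 and incy > 0. */
float sketch_sdsdot( int n, float sb,
                     const float* x, int incx,
                     const float* y, int incy )
{
    double acc = ( double )sb;
    int    i;

    for ( i = 0; i < n; ++i )
        acc += ( double )x[ i * incx ] * ( double )y[ i * incy ];

    return ( float )acc;
}
```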

commit fe7d7f1e43e4c26249eed83d4188beee1ba96202
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Mar 18 19:43:06 2018 -0500

Fixed cpp macro parameter "ch" typo in bla_ger.c.

Details:
- Previously, the BLAS routine-generating macro in bla_ger.c was
incorrectly passing MKSTR(ch) into the _check() macro when it
should have been passing in the char that was available, chxy.
I've instead changed the name of the macro parameter from chxy
to ch. A similar change was made to bla_ger.h for consistency.
Thanks to Dave Love for helping track this down. (NOTE: This is
actually the root cause of the bug that was first patched by
increasing the length of the operation name strings passed into
xerbla_(), as defined by the constant BLIS_MAX_BLAS_FUNC_STR_LENGTH,
in 3d1a5a7. In theory, that change could be backed out now.)
- Applied aforementioned chxy->ch change to bla_dot.[ch], as well as
frame/compat/cblas/f77_sub/f77_dot_sub.[ch] (not because it needed
to happen, but for naming consistency).
- Reformatted function signatures/prototypes of CBLAS functions and
function calls to BLAS in frame/compat/cblas/f77_sub/*.c.
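
The kind of mistake described above is easy to reproduce in isolation. When a macro body stringifies a name that is not one of the macro's parameters, the name is not substituted and ends up in the string literally, which also makes the string longer than expected. The sketch below is an assumption about the general mechanism, not a copy of bla_ger.c; all SKETCH_ macros and sketch_ functions are hypothetical.

```c
/* Turns its argument into a string literal. */
#define SKETCH_MKSTR( s ) #s

/* Buggy: the parameter is named chxy, but the body refers to ch. Since ch
   is not a parameter here, it is stringified as the literal text "ch". */
#define SKETCH_OPNAME_BUGGY( chxy ) SKETCH_MKSTR( ch ) "ger"

/* Fixed: the parameter is named ch, so the caller's argument is
   substituted before stringification. */
#define SKETCH_OPNAME_FIXED( ch ) SKETCH_MKSTR( ch ) "ger"

const char* sketch_opname_buggy( void ) { return SKETCH_OPNAME_BUGGY( d ); }
const char* sketch_opname_fixed( void ) { return SKETCH_OPNAME_FIXED( d ); }
```

Here the buggy variant yields "chger" while the fixed one yields "dger".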

commit cb7ed90752d1ddbac11368c4510641ca4f3a02eb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Mar 16 13:05:56 2018 -0500

Convert op names to uppercase before calling xerbla_().

Details:
- Defined a new function, bli_string_mkupper(), that calls toupper() on
every non-NULL character in a string.
- Call bli_string_mkupper() prior to calling xerbla_() in the level-2/-3
BLAS _check() macros. This prevents the BLAS testsuite from complaining
that the operation name (e.g. "dgemm") does not match the expected
value (e.g. "DGEMM"). Thanks to Dave Love for reporting this issue.
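
An in-place uppercase conversion of this kind can be sketched as follows. sketch_string_mkupper() is a hypothetical name and the actual bli_string_mkupper() may differ in its details; the cast to unsigned char is there because passing a negative char value to toupper() is undefined behavior.

```c
#include <ctype.h>

/* Converts every character of a NUL-terminated string to uppercase,
   in place. Characters without an uppercase form are left unchanged. */
void sketch_string_mkupper( char* s )
{
    for ( ; *s != '\0'; ++s )
        *s = ( char )toupper( ( unsigned char )*s );
}
```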

commit 3d1a5a7c08fed3ba29f060fe1db2b0dc42dde223
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Mar 16 12:24:07 2018 -0500

Fixed printf() format overflow.

Details:
- Increased the length of operation name strings passed to xerbla_() in
the level-2 and level-3 operation _check() functions, found in
frame/compat/check. This avoids a format specifier overflow warning by
gcc 7. Thanks to Dave Love for reporting this issue and suggesting the
fix.

commit c73055f028684d998e03b2392093c393782bbfe7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 15 16:08:21 2018 -0500

Return after non-zero info in BLAS checks.

Details:
- Previously, when calling the BLAS compatibility layer, discovering a
parameter check failure would result in the proper setting of the
info parameter (printed by xerbla_()), but would also come with an
immediate abort() rather than a return. This was incorrect behavior
for two overlapping reasons.
(1) BLAS should return gracefully to the caller in the event of a
bad set of parameters, not abort().
(2) When BLIS was being tested via the BLAS testsuite, BLIS's
xerbla_() would correctly get preempted/overridden by the
xerbla_() in the BLAS testsuite, but execution would then
erroneously continue on to the BLIS implementation with bad
parameter values.
- The previous issue was addressed by disabling the abort() in BLIS's
xerbla_(), changing all of the BLAS _check() functions to cpp macros,
and adding a return statement to the end of each _check() macro's
"if ( info != 0 )" conditional.
Thanks to Dave Love for reporting this issue.
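
The reason the _check() functions became macros can be sketched as follows: a return statement inside a macro leaves the calling routine, which a helper function cannot do. Everything named here (report_bad_param(), CHECK_AND_RETURN, toy_scal()) is hypothetical and only illustrates the pattern; it is not the BLIS code.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for xerbla_(): reports the offending argument
   position and then returns normally (no abort()). */
static void report_bad_param( const char* op, int info )
{
    fprintf( stderr, "** On entry to %s, parameter number %d had an illegal value\n",
             op, info );
}

/* Because this is a macro, the 'return' exits the routine that uses it. */
#define CHECK_AND_RETURN( op, info ) \
    do { \
        if ( ( info ) != 0 ) \
        { \
            report_bad_param( ( op ), ( info ) ); \
            return; \
        } \
    } while ( 0 )

/* Counts how often the implementation below the check was reached. */
static int calls_reached_impl = 0;

/* Toy BLAS-like routine: x := alpha * x. */
static void toy_scal( int n, double alpha, double* x, int incx )
{
    int info = 0;

    if      ( n < 0 )     info = 1;
    else if ( incx <= 0 ) info = 4;

    /* On bad parameters, report and return gracefully to the caller. */
    CHECK_AND_RETURN( "TOY_SCAL", info );

    assert( x != NULL );
    ++calls_reached_impl;
    for ( int i = 0; i < n; ++i ) x[ i * incx ] *= alpha;
}
```

With a bad argument such as n = -1, toy_scal() reports the error and returns without touching x, so execution never continues into the implementation with invalid parameters.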

commit c4f1d18b97a6a8c3ea0366aa759db597a664062a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 14 19:10:09 2018 -0500

Minor typo fix to printing arch in testsuite.

Details:
- Mistakenly was calling bli_cpuid_query_id() instead of
bli_arch_query_id() in the recent addition to the testsuite output
that prints the active sub-configuration. The former function is
only used for multi-architecture builds, whereas the latter is the
more general option that also works for single configuration
(including 'configure auto') builds.

commit 8f2fabec800a720b3e94b33c0048cc8c4ead436d
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Mar 14 17:43:42 2018 -0500

Make arm32 and arm64 families work. (#176)

commit fc6a1842518a0820c6708c285611346d5a1419da
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 14 15:31:17 2018 -0500

Print sub-configuration name in testsuite output.

Details:
- Added a line to the testsuite output that prints the name of the
current/active sub-configuration. This is useful when linking the
testsuite against multi-configuration builds because it confirms
the sub-configuration that is actually being employed at runtime.
Thanks to Devin Matthews for suggesting this feature.

commit 9943a899d64bf7ec4a24106f6f4c70629bbe1f6e
Merge: 290dd4a9 b1a15ae6
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Mar 14 13:27:44 2018 -0500

Merge pull request #173 from devinamatthews/dev

Fix Cortex-A9 and Cortex-A15 configs.

commit b1a15ae6ee0f46c9a95cf59f9555925e0e8e21ff
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Mar 14 13:26:44 2018 -0500

Use BLIS_H_FLAT

commit 290dd4a9feee447e69b40ad108954af78e196f7e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 14 13:15:37 2018 -0500

Allow arbitrarily deep configuration families.

Details:
- Updated configure so that configuration families specified in the
config_registry are no longer constrained as being only one level
deep. For example, previously the x86_64 family could not be defined
concisely in terms of, say, intel64 and amd64 families, and instead
had to be defined as containing "haswell, sandybridge, penryn, zen,
etc." In other words, families were constrained to only having
singleton configurations as their members. That constraint is now
lifted.
- Redefined x86_64 family in config_registry in terms of intel64 and
amd64.

commit 9cee78e006d56543ac02fc9c488905c0434e60ae
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Mar 14 13:09:48 2018 -0500

Fix Cortex-A9 and Cortex-A15 configs.

Tested with QEMU.

commit 1a3031740f7fcbbcc2c99d5c4cb50d0413407455
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 13 16:04:40 2018 -0500

Updates to ARM hardware detection support.

Details:
- Updated/clarified the ARM preprocessor macro branch of bli_cpuid.c.
Going forward, cortexa57 (64-bit), cortexa15, and cortexa9 (32-bit)
sub-configurations are supported. However, the functions that detect
features specific to a15 and a9 are identical, and since a15 is tested
first, it will always be chosen for arm32 hardware (even if both
sub-configurations were enabled at configure-time and the library is
linked and run on an a9). Thus, more work needs to be done to
distinguish these two.
- Added cpp guard around x86_64 portions of bli_cpuid.c. Now, either
the x86_64 or ARM code will be compiled (or neither, if neither
environment is detected).
- In bli_arch_query_id(), call bli_cpuid_query_id() when the
BLIS_FAMILY_ARM64 or BLIS_FAMILY_ARM32 macros are defined.
- Added arm64 and arm32 configuration families to config_registry.
- Added a note to the arch_t typedef enum in bli_type_defs.h reminding
the developer to update the string array in bli_arch.c whenever new
enum values are added or existing values are reordered.

commit 1442d06886ebdc34d8f1cb620229ddc6062c2ce8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Mar 11 16:59:50 2018 -0500

Fixed misnamed kernels in _cntx_init_cortexa57.c.

Details:
- Changed incorrect kernel function names in bli_cntx_init_cortexa57.c:
bli_sgemm_cortexa57_asm_8x12 -> bli_sgemm_armv8a_asm_8x12
bli_dgemm_cortexa57_asm_6x8 -> bli_dgemm_armv8a_asm_6x8
Thanks to Jacob Gorm Hansen for reporting this issue.

commit 28bcea37dfcf0eb99a99da6f46de2a2830393d1d
Merge: b1ea3092 8b0475a8
Author: praveeng <praveen.gamd.com>
Date: Fri Mar 9 19:13:08 2018 +0530

Merge master code till 06_mar_2018 to amd-staging

Change-Id: I12267e5999c92417e3715fef4f36ac2131d00f1a

commit 48da9f5805f0a49f6ad181ae2bf57b4fde8e1b0a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 7 12:54:06 2018 -0600

Tweaked common.mk, Makefile, skx/knl make_defs.mk.

Details:
- Reorganized linker-related section of common.mk so that LDFLAGS set
in a sub-configuration's make_defs.mk file will not be immediately
(and erroneously) overridden by the default values.
- Re-enabled redirected (to file) output of the testsuite when run from
the top-level Makefile via 'make test'. (For some reason, it was
commented-out for the non-verbose case.)
- Removed old/unnecessary code from the make_defs.mk files of skx and
knl sub-configurations.

commit 8b0475a87daa177916e2caac0e530c6a57fa07cf
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 6 06:39:44 2018 -0600

Fixed typo in attempted fix in 1a8350f7.

Details:
- Mistakenly entered 148 as knl mc blocksize for double real when the
value should have been 144. Thanks to Dave Love for reporting this.

commit 8912e6886b97eabb4ce0c35a3609a0fd994d347b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 5 18:00:45 2018 -0600

Fixed missing flags during shared object build.

Details:
- Fixed a bug in common.mk that caused warning, position-independent
code, miscellaneous, and general preprocessor flags to be omitted
from the configuration family-specific variables that hold those
values, as registered by the family's make_defs.mk file. This would
most obviously manifest when targeting a configuration family such as
'intel64' while simultaneously configuring for a shared object build,
as the key '-fPIC' flag would be omitted at compile-time and prevent
successful linking. Thanks to Dave Love for reporting this bug.
- Other cleanups to common.mk for readability and clarity.

commit 1a8350f70557fc53ca0c2eadf2076710dd0d9bc9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 5 13:32:00 2018 -0600

Fixed cache blocksize bug in knl configuration.

Details:
- Changed the mc blocksize for double real execution in the knl sub-
configuration from 160 to 148. The old value was not a multiple of
mr (which is 24), and thus the safeguards in bli_gks_register_cntx()
were tripping. Thanks to Dave Love for reporting this issue.
- Switch knl sub-configuration to use default blocksizes for datatypes
not supported by native kernels.
- Fixed typos in bli_error.c that prevented certain error strings
(which report maximum cache blocksizes not being multiples of their
corresponding register blocksize) from properly initializing.
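
The constraint behind this fix is plain arithmetic: with mr = 24, 160 % 24 = 16 and 148 % 24 = 4, so neither value is a multiple of mr, while 144 % 24 = 0 (the value chosen by the follow-up commit 8b0475a8). A minimal sketch of such a check, using a hypothetical helper that is not part of BLIS:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: true when the cache blocksize mc is a whole
   multiple of the register blocksize mr. */
static bool blocksize_is_multiple( int mc, int mr )
{
    assert( mr > 0 );
    return mc % mr == 0;
}
```

For mr = 24 this returns false for both 160 and 148, and true for 144.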

commit c09fffa827fe6241dc20193a1c404496664220de
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Mar 3 13:13:39 2018 -0600

Added missing cntx_t* arg in knl packm kernels.

Details:
- Added the missing cntx_t* argument to the function signature of packm
kernels in kernels/knl/1m/. Thanks to Dave Love for reporting this
issue.

commit b1ea30925dff751eced23dfa94ff578a20ea0b94
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Feb 23 17:42:48 2018 -0600

CHANGELOG update (0.3.0)

Change-Id: Id038b00a62de51c9818ad249651ec5dc662f4415

commit 1ef9360b1fd0209fbeb5766f7a35402fbd080fcb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 1 14:36:39 2018 -0600

Enable non-unit vector stride tests by default.

Details:
- Change "vector storage schemes to test" parameter in testsuite's
input.general file to "cj". This means that both unit stride column
vectors and non-unit stride column vectors will be tested in
operations with vector operands (e.g. level-1v, level-1f, level-2).
- Very minor comment (typo) changes to input.operations.

commit 8c4e55a1a1ead9a5e970200fee027ffd2c7e8454
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Feb 28 17:01:47 2018 -0600

Added individual operation overrides in testsuite.

Details:
- Updated the testsuite driver so that setting one or more individual
operation test switches to "2" in input.operations will enable ONLY
those operations and disable all others, regardless of the values of
the section overrides and other operation switches. This makes it
very easy to quickly test only one or two operations, and equally
easy to revert back to the previous combination of operation tests.
- Added more comments to input.operations describing the use of
individual "enable only" overrides.

commit 34862aed89e5d5a8f35aeecd49f3052ada1f337b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Feb 28 15:30:14 2018 -0600

Use zen kernels in haswell sub-configuration.

Details:
- Register use of level-1v zen intrinsic kernels for amaxv, axpyv, dotv,
dotxv, and scalv, as well as level-1f zen intrinsic kernels for axpyf
and dotxf. This works because these kernels simply target AVX/AVX2,
and therefore work without modification on haswell hardware.
- Switch to use of zen microkernels in bli_cntx_init_haswell.c. The zen
kernels are essentially identical to those used by haswell, except that
now zen kernels are a bit more up-to-date. In the future, I may
continue to maintain duplicates, or I may keep the kernels named after
one architecture (zen or haswell) but used by both sub-configurations.
- In config_registry, enable use of both haswell and zen kernels for the
haswell sub-configuration. This is necessary in order to make zen
kernels visible when registering kernels in bli_cntx_init_haswell.c.
- Enable use of assembly-based complex gemm microkernels for zen,
bli_cgemm_zen_asm_3x8() and bli_zgemm_zen_asm_3x4(), in
bli_cntx_init_zen.c. This was actually intended for 1681333.

0.3.0

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Feb 23 17:42:48 2018 -0600

Version file update (0.3.0)

commit d9079655c9cbb903c6761d79194a21b7c0a322bc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Feb 23 17:42:48 2018 -0600

CHANGELOG update (0.3.0)

commit 3defc7265c12cf85e9de2d7a1f243c5e090a6f9d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Feb 23 17:38:19 2018 -0600

Applied 34b72a3 to non-active/unused microkernels.

Details:
- Applied the read-beyond-bounds bugfix in 34b72a3 to other haswell and
zen kernels (ie: other microtile shapes) which are not used by default.
This was done mostly in case someone decided to pick up these kernels
and start using them, not because it affects BLIS's behavior
out-of-the-box.

commit 34b72a351745aa0d47bb0b74ebcd0f0a616d613d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Feb 23 16:33:32 2018 -0600

Fixed obscure read-beyond-bounds bug in sgemm ukrs.

Details:
- Fixed an obscure bug in the bli_sgemm_haswell_asm_6x16 and
bli_sgemm_zen_asm_6x16 microkernels when the input/output matrix C
is stored with general stride (ie: both rs and cs are non-unit). The
bug was rooted in the way those microkernels read from matrix C--
namely, they used vmovlps/vmovhps instead of movss. By loading two
floats at a time, even if one of them was treated as junk, the
assembly code could be written in a more concise manner. However,
under certain conditions--if m % mr == 0 and n % nr == 0 and the
underlying matrix is not an internal "view" into a larger matrix--
this could result in the very last vmovhps of the last (bottom-right)
microkernel invocation reading beyond valid memory. Specifically, the
low 32 bits read would always be valid, but the high 32 bits could
reside beyond the bounds of the array in which the output C matrix is
contained. To remedy this situation, we now selectively use movss to
load any element that could be the last element in the matrix.
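
The out-of-bounds condition can be illustrated without assembly. The sketch below uses hypothetical helpers to model a load as a number of consecutive floats read from a starting element, then checks it against the extent of the array holding C. A two-float load (as with vmovlps/vmovhps) that starts at the very last element reaches one float past the end, whereas a one-float load (as with movss) does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Offset, in elements, of C(i,j) under general stride (rs, cs). */
static size_t elem_index( size_t i, size_t j, size_t rs, size_t cs )
{
    return i * rs + j * cs;
}

/* True when reading nload consecutive floats starting at element idx
   stays inside an array of len floats. */
static bool load_in_bounds( size_t idx, size_t nload, size_t len )
{
    assert( nload > 0 );
    return idx <= len && nload <= len - idx;
}
```

For example, take a 6x16 matrix with rs = 32 and cs = 2 in an array that ends exactly at its last element (191 floats). The last element sits at offset 190, so load_in_bounds( 190, 2, 191 ) is false while load_in_bounds( 190, 1, 191 ) is true.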

commit 5112e1859e7f8888f5555eb7bc02bd9fab9b4442 (origin/rt)
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Feb 23 14:31:26 2018 -0600

Added missing 'restrict' to some kernels' cntx_t*.

Details:
- Added missing 'restrict' keyword to cntx_t* argument of function
signatures corresponding to level-1v, level-1f, and level-1m kernels.
This affected bli_l1v_ker_prot.h, bli_l1f_ker_prot.h, and
bli_l1m_ker_prot.h. (The 'restrict' was already being used to
qualify cntx_t* arguments for kernels defined in bli_l3_ker_prot.h.)
- Added comments to bli_l1v_ker.h, bli_l1f_ker.h, bli_l1m_ker.h, and
bli_l3_ukr.h that help explain how those headers function to produce
kernel prototypes using the prototype macros defined in the files
mentioned above.

commit 1fa8af95d807168e0849adb668492601e7009be0
Merge: c084b03b 16813335
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Feb 21 17:54:02 2018 -0600

Merge branch 'rt'

commit c084b03b31d84427a120e391963db5419f1911ee
Merge: 5d03b6e6 fa74af4e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Feb 21 17:52:17 2018 -0600

Merge branch 'rt'

commit 16813335bdb5978bc9a26cd00a32bd5a130130c4
Merge: fa74af4e 5a7005dd
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Feb 21 17:43:32 2018 -0600

Merge branch 'amd' into rt

Details:
- Merged contributions made by AMD via 'amd' branch (see summary below).
Special thanks to AMD for their contributions to-date, especially with
regard to intrinsic- and assembly-based kernels.
- Added column storage output cases to microkernels in
bli_gemm_zen_asm_d6x8.c and bli_gemmtrsm_l_zen_asm_d6x8.c. Even with
the extra cost of transposing the microtile in registers, this is
much faster than using the general storage case when the underlying
matrix is column-stored.
- Added s and d assembly-based zen gemmtrsm_u microkernel (including
column storage optimization mentioned above).
- Updated zen sub-configuration to reflect presence of new native
kernels.
- Temporarily reverted zen sub-configuration's level-3 cache blocksizes
to smaller haswell values.
- Temporarily disabled small matrix handling for zen configuration
family in config/zen/bli_family_zen.h.
- Updated zen CFLAGS according to changes in 1e4365b.
- Updated haswell microkernels such that:
- only one vzeroupper instruction is called prior to returning
- movapd/movupd are used in lieu of movaps/movups for double-real
microkernels. (Note that single-real microkernels still use
movaps/movups.)
- Added kernel prototypes to kernels/zen/bli_kernels_zen.h, which is
now included via frame/include/bli_arch_config.h.
- Minor updates to bli_amaxv_ref.c (and to inlined "test" implementation
in testsuite/src/test_amaxv.c).
- Added early return for alpha == 0 in bli_dotxv_ref.c.
- Integrated changes from f07b176, including a fix for undefined
behavior when executing the 1m method under certain conditions.
- Updated config_registry; no longer need haswell kernels for zen
sub-configuration.
- Tweaked marginal and pass thresholds for dotxf.
- Reformatted level-1v, -1f, and -3 amd kernels and inserted additional
comments.
- Updated LICENSE file to explicitly mention that parts are copyright
UT-Austin and AMD.
- Added AMD copyright to header templates in build/templates.

Summary of previous changes from 'amd' branch.
- Added s and d assembly-based zen gemm microkernels (d6x8 and d8x6) and
s and d assembly-based zen gemmtrsm_l microkernels (d6x8).
- Added s and d intrinsics-based zen kernels for amaxv, axpyv, dotv, dotxv,
and scalv, with extra-unrolling variants for axpyv and scalv.
- Added a small matrix handler to bli_gemm_front(), with the handler
implemented in kernels/zen/3/bli_gemm_small_matrix.c.
- Added additional logic to sumsqv that first attempts to compute the
sum of the squares via dotv(). If there is a floating-point exception
(FE_OVERFLOW), then the previous (numerically conservative) code is
used; otherwise, the result of dotv() is square-rooted and stored as
the result. This new implementation is only enabled when FE_OVERFLOW
is defined. If the macro is not defined, then the previous
implementation is used.
- Added axpyv and dotv standalone test drivers to test directory.
- Added zen support to old cpuid_x86.c driver in build/auto-detect/old.
- Added thread-local and __attribute__-related macros to bli_macro_defs.h.

commit 5d03b6e6e19d5a07f0cccf1a158f02fbd62dfd99
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Feb 19 11:31:30 2018 -0600

Fix asm macro include line for KNL. Fixes #167.

commit f07b176c84dc9ca38fb0d68805c28b69287c938a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Feb 15 18:36:54 2018 -0600

Fixed an obscure bug in the 1m implementation.

Details:
- Fixed a bug in the way the bli_gemm1m_cntx_ref() function (defined in
ref_kernels/bli_cntx_ref.c) initializes its context for 1m execution.
Previously, the function probed the context that was in the process of
being updated for use with 1m--this context being previously
initialized/copied from a native context--for its storage preference
to determine which "variant" (row- or column-oriented) of 1m would be
needed. However, the _cntx_ref() function was not updating the method
field of the context until AFTER this query, and the conditional which
depended on it, had taken place, meaning the storage preference query
function would mistakenly think the context was for native execution,
since the context's method field would still be set to BLIS_NAT. This
would lead it to incorrectly grab the storage preference of the complex
domain microkernel rather than the corresponding real domain
microkernel, which could cause the storage preference predicate to
evaluate to the wrong value, which would lead to the _cntx_ref()
function choosing the wrong variant. This could lead to undefined
behavior at runtime. The method is now explicitly set within the
context prior to calling the storage preference query function.
- Updated comments in frame/ind/oapi/bli_l3_3m4m1m_oapi.c.
- Fixed a typo in the commented-out CFLAGS in config/zen/make_defs.mk,
which are appropriate for gcc 6.x and newer. (Mistakenly used
-march=bdver4 instead of -march=znver1.)

commit 1f94bb7b96eb2b67257e6c4df89e29c73e9ab386
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jan 19 12:46:53 2018 -0600

Document how to enable zen-specific instructions.

Details:
- Added as a comment in config/zen/make_defs.mk the list of compiler flags
that could be added to manually enable the instructions provided by the
Zen microarchitecture that are not already implied by -march=bdver4.
This information, along with the previous commit's flags to selectively
disable Bulldozer instructions no longer present in Zen, was gathered
from [1]. I hesitate to enable use of these instructions since I don't
have any Zen hardware to test on yet.
[1] https://wiki.gentoo.org/wiki/Ryzen

commit 1e4365b21bafa02bd108c5ac4705a25671fb9441
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jan 18 12:03:51 2018 -0600

Augment zen CFLAGS to prevent illegal instruction.

Details:
- Added various compiler flags (-mno-fma4 -mno-tbm -mno-xop -mno-lwp) so
that compiling with -march=bdver4 on zen-based architectures does not
result in an illegal instruction error at runtime. Note: This fix is
only needed for gcc 5.4; gcc 6.3 or later supports the use of
-march=znver1, which can be used in lieu of the augmented set of flags
based on bdver4. Thanks to Nisanth Padinharepatt for reporting this
error.

commit fa74af4e1fa7385ac3f3089fe1ea7bb88c906029
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 9 13:43:15 2018 -0600

Minor labeling update for './configure -c' output.

Details:
- Print the name of the configuration in the output of the
kernel-to-config map (and chosen pairs list) as a subtle way to remind
the user that these only apply to the targeted configuration (whereas
the config list and kernel list are printed without regard to which
configuration was actually targeted).

commit 5cdea756c7391e2c6cbfb38436ef9a205f860237
Merge: 9d8858b5 1e7a4896
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Jan 7 19:45:20 2018 -0600

Merge branch 'rt'

commit 9d8858b5cff4a4b078b87872847a5710073fff0a
Merge: 0b3ca3cf f7df64da
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Sun Jan 7 10:03:25 2018 -0600

Merge pull request #164 from devinamatthews/master

Don't use memkind for skx configuration.

commit f7df64daf6bbe6431effada6e13d8d1fab5aa221
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Sun Jan 7 09:37:25 2018 -0600

Don't use memkind for skx configuration. Fixes #163.

commit 1e7a4896e0cbe73c4685fa956278e3f28273cdf9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jan 5 12:33:48 2018 -0600

Minor error handling in update-version-file.sh.

Details:
- Added explicit handling of situations when 'git describe --tags'
returns an error. This command is used by update-version-file.sh
when deciding whether or not to update the version file prior to
configuration.
- Removed bli_packm.c and bli_unpackm.c, as they contained no source
code.

commit 0b3ca3cfb682715a3686fd93ebb10d4a695d1162
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jan 4 20:51:35 2018 -0600

Intelligently select compiler for auto-detection.

Details:
- Rewrote code that selects the compiler for the purposes of compiling
the auto-detection executable. CC (if specified) is tried first. Then
gcc. Then clang. The absolute fallback is cc. The previous code was
sort of broken, and seemed to unintentionally always use gcc.
- Moved various configuration-agnostic flags from config/*/make_defs.mk
files to common.mk. The new mechanism appends the configuration-
agnostic flags to the various compiler flag variables initialized in
make_defs.mk. Flags specific to the sub-configuration are still set
in make_defs.mk.
- Added -Wno-tautological-compare to CMISCFLAGS when clang is in use.
Also added the flag to the compiler instantiation during configure-
time hardware detection (when clang is selected).
- Added some missing (but mostly-optional) quotes to configure script.

commit 5a7005dd44ed3174abbe360981e367fd41c99b4b
Merge: 7be88705 3bc99a96
Author: Nisanth M P <nisanth.padinharepattamd.com>
Date: Wed Jan 3 12:05:12 2018 +0530

Merge changes in AMD beta release 0.95 into amd branch

commit 0b9c5127e91508c115228ca604ee2dac8de8f477
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Dec 23 15:53:44 2017 -0600

Enabled C99, added stdint.h to auto-detect build.

Details:
- Added "-std=c99" to compiler arguments when building auto-detection
driver in configure script.
- Added include <stdint.h> to all three source files needed by auto-
detection program.

commit 0ce5e19c318e04909d3e664d69accb3a0fc6b988
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Dec 23 15:32:03 2017 -0600

Reimplemented configure-time hardware detection.

Details:
- Reimplemented the hardware detection functionality invoked when running
"./configure auto". Previously, a standalone script in build/auto-detect
that used CPUID was used. However, the script attempted to enumerate all
models for each microarchitecture supported. The new approach recycles
the same code used for runtime hardware detection introduced in 2c51356.
This has two immediate benefits. First, it reduces and consolidates the
code required to detect microarchitectures via the CPUID instruction.
Second, it provides an indirect way of testing at configure-time the
code that is used to detect hardware at runtime. This code is (a) only
activated when targeting a configuration family (such as intel64 or
amd64) at configure-time and (b) somewhat difficult to test in
practice, since it relies on having access to older microarchitectures.
- The above change required placing conditional cpp macro blocks in
bli_arch.c and bli_cpuid.c which either include "blis.h" or include
a bare-bones set of headers that does not rely on the presence of a
bli_config.h header. This is needed because bli_config.h has not been
created yet when configure-time auto-detection takes place.
- Defined a new function in bli_arch.c, bli_arch_string(), which takes
an arch_t id and returns a pointer to a string that contains the
lowercase name of the corresponding microarchitecture. This function
is used by the auto-detection script to printf() the name of the
sub-configuration corresponding to the detected hardware.
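
An id-to-name lookup of the sort bli_arch_string() provides might look like the sketch below. The enum values, array, and function name are hypothetical, and the use of C99 designated initializers is a design choice made for this example (it keeps each string tied to its enum value even if the enum is reordered) rather than something the log states about BLIS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of architecture ids. */
typedef enum
{
    ARCH_HASWELL = 0,
    ARCH_ZEN,
    ARCH_CORTEXA57,
    ARCH_GENERIC,
    ARCH_NUM
} toy_arch_t;

/* Lowercase names, indexed by id. */
static const char* const arch_names[ ARCH_NUM ] =
{
    [ ARCH_HASWELL   ] = "haswell",
    [ ARCH_ZEN       ] = "zen",
    [ ARCH_CORTEXA57 ] = "cortexa57",
    [ ARCH_GENERIC   ] = "generic",
};

/* Returns the name for id, or NULL if id is out of range. */
static const char* toy_arch_string( toy_arch_t id )
{
    if ( ( int )id < 0 || ( int )id >= ARCH_NUM ) return NULL;

    /* Catches an enum value that was added without a matching string. */
    assert( arch_names[ id ] != NULL );
    return arch_names[ id ];
}
```

A detection program could then printf() the returned string to report the sub-configuration name.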

commit 9804adfd405056ec332bb8e13d68c7b52bd3a6c1 (origin/selfinit)
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Dec 21 19:22:57 2017 -0600

Added option to disable pack buffer memory pools.

Details:
- Added a new configure option, --[en|dis]able-packbuf-pools, which will
enable or disable the use of internal memory pools for managing buffers
used for packing. When disabled, the function specified by the cpp
macro BLIS_MALLOC_POOL is called whenever a packing buffer is needed
(and BLIS_FREE_POOL is called when the buffer is ready to be released,
usually at the end of a loop). When enabled, which was the status quo
prior to this commit, a memory pool data structure is created and
managed to provide threads with packing buffers. The memory pool
minimizes calls to bli_malloc_pool() (i.e., the wrapper that calls
BLIS_MALLOC_POOL), but does so through a somewhat more complex
mechanism that may incur additional overhead in some (but not all)
situations. The new option defaults to --enable-packbuf-pools.
- Removed the reinitialization of the memory pools from the level-3
front-ends and replaced it with automatic reinitialization within the
pool API's implementation. This required an extra argument to
bli_pool_checkout_block() in the form of a requested size, but hides
the complexity entirely from BLIS. And since bli_pool_checkout_block()
is only ever called within a critical section, this change fixes a
potential race condition in which threads using contexts with different
cache blocksizes--most likely a heterogeneous environment--can check
out pool blocks that are too small for the submatrices they wish to
pack. Thanks to Nisanth Padinharepatt for reporting this potential
issue.
- Removed several functions in light of the relocation of pool reinit,
including bli_membrk_reinit_pools(), bli_memsys_reinit(),
bli_pool_reinit_if(), and bli_check_requested_block_size_for_pool().
- Updated the testsuite to print whether the memory pools are enabled or
disabled.
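
The idea of passing a requested size to the checkout function, so that the pool can reinitialize itself when a block is too small, can be sketched with a deliberately tiny model. All names below are hypothetical, the pool caches only a single block, and locking is omitted on the assumption (as in the log) that checkout happens inside a critical section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A block carries its own size so the pool never misjudges it. */
typedef struct { void* buf; size_t size; } toy_block_t;

/* Toy pool that caches at most one block. */
typedef struct { toy_block_t cached; } toy_pool_t;

/* Check out a block of at least req_size bytes. A cached block that is
   too small is discarded and replaced (the "automatic reinit"). */
static toy_block_t toy_pool_checkout( toy_pool_t* pool, size_t req_size )
{
    assert( pool != NULL );

    toy_block_t b = pool->cached;
    pool->cached.buf  = NULL;
    pool->cached.size = 0;

    if ( b.buf == NULL || b.size < req_size )
    {
        free( b.buf );
        b.buf  = malloc( req_size );
        b.size = ( b.buf != NULL ) ? req_size : 0;
    }
    return b;
}

/* Return a block to the pool, releasing any block already cached. */
static void toy_pool_checkin( toy_pool_t* pool, toy_block_t b )
{
    assert( pool != NULL );

    free( pool->cached.buf );
    pool->cached = b;
}
```

A caller that needs a larger block than the cached one simply asks for it; no separate reinitialization step in the front-end is required.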

commit 107801aaae180c00022f1b990bc59038c14949d2
Merge: d9c05745 0084531d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Dec 18 16:29:28 2017 -0600

Merge branch 'master' into selfinit

commit 0084531d3eea730a319ecd7018428148c81bbba7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Dec 17 18:58:25 2017 -0600

Updated flatten-headers.py for python3.

Details:
- Modified flatten-headers.py to work with python 3.x. This mostly
amounted to removing print statements (which I replaced with calls
to my_print(), a wrapper to sys.stdout.write()). Thanks to Stefan
Husmann for pointing out the script's incompatibility with python 3.
- Other minor changes/cleanups.

commit 90b11b79c302f208791bdfb1ed754873103c7ce5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Dec 17 17:34:32 2017 -0600

Modest performance boost to flatten-headers.py.

Details:
- Updated flatten-headers.py to pre-compile the main regular expression
used to isolate include directives and the header filenames they
reference. The compiled regex object is then used over and over on
each header file in the tree of referenced headers. This appears to
have provided a 1.7-2x performance increase in the best case.
- Other minor tweaks, such as renaming the main recursive function from
replace_pass() to flatten_header().
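
The compile-once, match-many pattern described above is not specific to python. As an analogous sketch in C (not the script's code), the POSIX regex API separates regcomp() from regexec() in the same way. The pattern and function names below are assumptions chosen for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Compile the include-directive pattern once, up front.
   Returns 0 on success, as regcomp() does. */
static int include_regex_init( regex_t* re )
{
    assert( re != NULL );
    return regcomp( re,
                    "^[[:space:]]*#[[:space:]]*include[[:space:]]+\"([^\"]+)\"",
                    REG_EXTENDED );
}

/* Reuse the compiled pattern on each line. On a match, copies the header
   filename into out and returns 1; otherwise returns 0. */
static int match_include( const regex_t* re, const char* line,
                          char* out, size_t out_len )
{
    regmatch_t m[ 2 ];

    assert( re != NULL && line != NULL && out != NULL );

    if ( regexec( re, line, 2, m, 0 ) != 0 ) return 0;

    size_t len = ( size_t )( m[ 1 ].rm_eo - m[ 1 ].rm_so );
    if ( len >= out_len ) return 0;

    memcpy( out, line + m[ 1 ].rm_so, len );
    out[ len ] = '\0';
    return 1;
}
```

The caller invokes include_regex_init() once, calls match_include() for every line of every header, and releases the pattern with regfree() at the end.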

commit 99dee87f30b4d437fa6b5e4ba862526d07b9f08b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Dec 17 16:47:27 2017 -0600

Reimplemented flatten-headers.sh in python.

Details:
- Added flatten-headers.py, a python implementation of the bash script
flatten-headers.sh. The new script appears to be 25-100x faster,
depending on the operating system, filesystem, etc. The python script
abides by the same command line interface as its predecessor and
targets python 2.7 or later. (Thanks to Devin Matthews for suggesting
that I look into a python replacement for higher performance.)
- Activated use of flatten-headers.py in common.mk via the FLATTEN_H
variable.
- Made minor tweaks to flatten-headers.sh such as spelling corrections
in comments.

commit d9c0574599c3f97c0f9b6c334a077bab9452e1f4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Dec 14 17:13:42 2017 -0600

Allow travis failures of OS X builds that run testsuite.

Details:
- Added an allowance for OS X builds that run the testsuite to fail.
There seems to be an issue with 1m when running in Travis CI under
OS X and clang, but only in double-precision. Haven't been able to
reproduce the error on my own, and thus, I can't debug it. (Hopefully
it is simply a version-specific compiler bug.)

commit 86cd23b7379b00a42b4ecc04fa668f1e3f9b54ee
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Dec 14 15:47:41 2017 -0600

Fixed testsuite Makefile brokenness from 9091a207.

Details:
- Fixed a makefile error encountered when building the testsuite directly
in its directory (as opposed to indirectly via 'make test'). The fix
involves introducing a new variable, BUILD_PATH, alongside the existing
DIST_PATH variable. By default, BUILD_PATH is set to the current
directory, and is overridden by other Makefiles used by, for example,
the testsuite and standalone test drivers in testsuite or test,
respectively.
- Some files/directories in common.mk were redefined in terms of
BUILD_PATH, such as the locations of the config.mk file and the intermediate
include directory.

commit 6a3a8924c04d25507fc4aa593df30c56c7dc12f7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Dec 14 13:20:02 2017 -0600

Temporarily show Makefile's testsuite output.

Details:
- Disabled redirection of testsuite output for 'test' target. This is
part of an attempt to debug a segmentation fault on OS X via Travis.

commit 9a01080dd426915bed18229f70401bfa639dc283
Merge: 83316485 a32e8a47
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Dec 14 11:27:19 2017 -0600

Merge branch 'master' into selfinit

commit a32e8a47c022b6071302b2956af5728976c83ca9 (origin/travis)
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 13 16:31:36 2017 -0600

Added an exclusion to .travis.yml.

Details:
- Added exclusion for out-of-tree builds on OS X (clang).

commit b9f7d987df548965c86e16e0ba94d5cad0d9b399
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 13 16:22:09 2017 -0600

Cleaned up after previous travis oot debugging.

Details:
- Removed debugging output from common.mk related to Travis CI
out-of-tree builds.
- Other minor cleanups to common.mk.

commit 9091a207aa8c49e279676ea02be533480b3b0d5a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 13 16:12:34 2017 -0600

Attempted fix to travis oot build failure.

Details:
- Found the likely cause of the Travis CI out-of-tree build failures:
config.mk was being read from DIST_PATH, rather than the current
directory.

commit c01c71c33e236e6c91f5ddd3ec1e3faec89368c1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 13 15:58:50 2017 -0600

Added debugging output to Makefile.

Details:
- Added $(info ...) statements in key locations in an attempt to reveal
why Travis CI doesn't like building BLIS out-of-tree.

commit 784289d69dd6b3692444d3b3e290f6a014465b72
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 13 15:31:27 2017 -0600

Updated SHELL in common.mk from /bin/bash to bash.

commit d9bb1d1d4ebc89ea75d9d927d09882162a914f77
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 13 15:27:54 2017 -0600

Defined SHELL in common.mk so "echo -n" works.

Details:
- Defined the SHELL variable in common.mk as "/bin/bash" so that the
-n option can be used with echo in the Makefile rule for flattening
blis.h. Thanks to Devin Matthews for suggesting this fix.

commit 9289a08667df2044f3a37af54d893efe2b56d555
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 13 15:14:27 2017 -0600

Attempt 3 on .travis.yml.

commit 720bfcf0ef54fdc41df0dcaa94503edb0d5c8972
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 13 14:52:28 2017 -0600

More fixes to .travis.yml.

Details:
- Fixed a mistake (hopefully) in d0c4dd0 that resulted in many more
osx/clang sub-tests than intended.
- Shortened the variable names in an effort to make them more readable
via the Travis CI web interface.

commit 8717c9c97fe9b1ecd3b3192049a73976f8390ca7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 13 14:36:37 2017 -0600

Added 'pwd' commands to .travis.yml for debugging.

Details:
- Added 'pwd' commands to the script portion of the .travis.yml file in
an attempt to uncover the problem with the recent out-of-tree build
testing changes made in d0c4dd0.

commit 83316485ce10f6fcafe92a1c146282de0dd8068a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 13 14:14:50 2017 -0600

Simplified/fixed self-initialization.

Details:
- Fixed a race condition in self-initialization whereby the bli_is_init
static variable could be erroneously read as TRUE by thread 1 while
thread 0 is still executing bli_init_apis(), thus allowing thread 1 to
use the library before it is actually ready. Thanks to Minh Quan Ho
and Devin Matthews for pointing out this issue.
- Part of the solution to the aforementioned race condition involved
replacing the runtime initialization of the global scalar constants
(e.g., BLIS_ONE, BLIS_ZERO, etc.) in bli_const.c with a static
initialization of those same constants. This eliminates the need for
bli_const_init() altogether. (The static initialization is made concise
via preprocessor macros.)
- Defined bli_gks_query_cntx_noinit(), which behaves just like
bli_gks_query_cntx(), except that it does not call bli_init_once(). This
function is called in lieu of bli_gks_query_cntx() in bli_ind_init() and
bli_memsys_init() so as to not result in any recursion into
bli_init_once().
- Removed BLIS_ONE_HALF, BLIS_MINUS_ONE_HALF global scalar constants.
They have no use in BLIS or its test products, and we have little reason
to believe they are used by others.
- Removed testsuite/out file, which was accidentally committed as part
of 70640a3.
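
The static initialization of constants described above might look roughly like the following sketch. The struct layout and the names demo_const_t, DEMO_DEFINE_CONST, DEMO_ONE, and DEMO_ZERO are assumptions made for illustration; the actual definitions in bli_const.c are not shown in this log and will differ.

```c
/* One value stored in several precisions; hypothetical layout. */
typedef struct
{
    float  s;
    double d;
    int    i;
} demo_const_t;

/* Expands to a compile-time initializer, so no runtime init call is needed. */
#define DEMO_DEFINE_CONST(name, val) \
    static const demo_const_t name = { (float)(val), (double)(val), (int)(val) }

DEMO_DEFINE_CONST(DEMO_ONE,  1);
DEMO_DEFINE_CONST(DEMO_ZERO, 0);

/* Readers cannot race with an initializer because none exists at runtime. */
double demo_one_as_double(void)  { return DEMO_ONE.d; }
double demo_zero_as_double(void) { return DEMO_ZERO.d; }
```

Because the values are fixed at compile time, a thread can read them before any initialization routine has run.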

commit 6526d1d4ae6dbfa854ca8d1e5f224cd6ab3fa958
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 12 13:50:43 2017 -0600

Added temp_dir argument to flatten-headers.sh.

Details:
- Added "temp_dir" argument to flatten-headers.sh so that the caller can
specify where intermediate files should be created as the script runs.
- Updated flatten-headers.sh to create intermediate files in temp_dir
instead of alongside the corresponding source files. This should now
(once again) allow out-of-tree builds where the BLIS distribution is
read-only, or where the out-of-tree build is running concurrently with
another out-of-tree build. (Thanks to Devin Matthews for pointing out
the possibility of simultaneous out-of-tree builds.)

commit 94755017c967630daf2e31c1f63ed5e88ab0d6ab
Merge: d0c4dd00 5cf7b0c4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 12 12:50:41 2017 -0600

Merge branch 'master' of github.com:flame/blis

commit d0c4dd000ff38acc249e8acf7e0655a523991695
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 12 12:47:53 2017 -0600

Added out-of-tree build test to .travis.yml file.

Details:
- Modified .travis.yml file to include an out-of-tree build test (using
the "auto" configure target). Thanks to Devin Matthews for this
suggestion.

commit 5cf7b0c4e52922069183a87dc2aa177419644e04
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Dec 12 12:38:48 2017 -0600

Ignore blis.h.interm [ci skip]

commit 8d8ff74d15b4a584929cec36034ba6d3c53f7d27
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 12 12:32:50 2017 -0600

Further attempt to fix out-of-tree builds.

Details:
- Fix applied in 87978f6 was necessary but not sufficient to fix
out-of-tree builds. It turns out that using a source tree that had
already built the target erroneously gave the impression that
out-of-tree builds were working again, when in fact they were still
broken. The additional changes in this commit should complete the
fix that was started in the aforementioned commit. Thanks to Devin
Matthews and Shaden Smith for their help in isolating this issue.

commit 70640a37109290b57c344083c00624e13c496e30
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Dec 11 17:18:43 2017 -0600

Implemented library self-initialization.

Details:
- Defined two new functions in bli_init.c: bli_init_once() and
bli_finalize_once(). Each is implemented with pthread_once(), which
guarantees that, among the threads that pass in the same pthread_once_t
data structure, exactly one thread will execute a user-defined function.
(Thus, there is now a runtime dependency against libpthread even when
multithreading is not enabled at configure-time.)
- Added calls to bli_init_once() to top-level user APIs for all
computational operations as well as many other functions in BLIS to
all but guarantee that BLIS will self-initialize through the normal
use of its functions.
- Rewrote and simplified bli_init() and bli_finalize() and related
functions.
- Added -lpthread to LDFLAGS in common.mk.
- Modified the bli_init_auto()/_finalize_auto() functions used by the
BLAS compatibility layer to take and return no arguments. (The
previous API that tracked whether BLIS was initialized, and then
only finalized if it was initialized in the same function, was too
cute by half and borderline useless because by default BLIS stays
initialized when auto-initialized via the compatibility layer.)
- Removed static variables that track initialization of the sub-APIs in
bli_const.c, bli_error.c, bli_init.c, bli_memsys.c, bli_thread, and
bli_ind.c. We don't need to track initialization at the sub-API level,
especially now that BLIS can self-initialize.
- Added a critical section around the changing of the error checking
level in bli_error.c.
- Deprecated bli_ind_oper_has_avail() as well as all functions
bli_<opname>_ind_get_avail(), where <opname> is a level-3 operation
name. These functions had no use cases within BLIS and likely none
outside of BLIS.
- Commented out calls to bli_init() and bli_finalize() in testsuite's
main() function, and likewise for standalone test drivers in 'test'
directory, so that self-initialization is exercised by default.
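
The once-only pattern described above can be sketched as follows. This is a minimal illustration of pthread_once(), not the code in bli_init.c; the names demo_init_once(), demo_init_apis(), and demo_get_init_count() are hypothetical.

```c
#include <pthread.h>

/* Hypothetical stand-ins for illustration; not the actual BLIS symbols. */
static pthread_once_t demo_once = PTHREAD_ONCE_INIT;
static int demo_init_count = 0;

/* The routine that must run exactly once per process. */
static void demo_init_apis(void)
{
    demo_init_count++;
}

/* Safe to call from any thread, any number of times. */
void demo_init_once(void)
{
    pthread_once(&demo_once, demo_init_apis);
}

/* Reports how many times the initializer actually ran. */
int demo_get_init_count(void)
{
    return demo_init_count;
}
```

Every public entry point can call demo_init_once() first; all callers that share the same pthread_once_t object trigger the initializer at most once between them.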

commit 70a64432ee5a7adbee10fb7ff6d7b608c1940a7a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Dec 11 13:14:20 2017 -0600

Fixed off-by-one indexing in bli_cpuid.c.

Details:
- In bli_cpuid.c, fixed an off-by-one indexing statement in vpu_count()
whereby a string-terminating NULL character, '\0', is written beyond
the bounds of the model_num string.
- Minor whitespace and formatting edits to bli_cpuid.c.
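
As a generic illustration of this class of bug (not the actual vpu_count() code, which is not shown in this log), the sketch below copies a string into a fixed-size buffer and writes the terminator within bounds. DEMO_MODEL_LEN and demo_copy_model() are hypothetical names.

```c
#include <string.h>

#define DEMO_MODEL_LEN 8  /* hypothetical buffer size, including the NUL */

/* Copies at most DEMO_MODEL_LEN - 1 characters and terminates in bounds. */
void demo_copy_model(char dst[DEMO_MODEL_LEN], const char *src)
{
    size_t n = strlen(src);

    if (n > DEMO_MODEL_LEN - 1)
        n = DEMO_MODEL_LEN - 1;

    memcpy(dst, src, n);
    dst[n] = '\0';  /* n is at most DEMO_MODEL_LEN - 1, the last valid index */
}
```

Writing the terminator at index DEMO_MODEL_LEN instead would land one byte past the end of the buffer.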

commit 87978f6261a080d261d01f9acf4e9cc18855c833
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Dec 11 12:49:03 2017 -0600

Fixed broken out-of-tree builds since 52f9e6f.

Details:
- Added missing $(DIST_PATH)/ prefix to relative path to flatten-headers.sh
script in common.mk so that the script could be found during out-of-tree
builds. Thanks to Devin Matthews for reporting this bug.

commit 513ef4d040f89a18dda5154e8c4cf1aaf7463999
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Dec 11 12:35:59 2017 -0600

Various typecasting fixes, mis-typed enums, etc.

Details:
- Fixed implicit typecasting of conj_t to trans_t in bli_[un]packm_cxk.c.
- Properly typecast integer arguments to match format specifier in various
calls to printf() in bli_l3_thrinfo.c, bli_cntx.c, bli_pool.c, and
bli_util_oapi.c.
- Fixed "unsigned less-than-comparison with zero" checks in bli_check.c,
bli_cntx.h.
- Fixed mis-typed enums in bli_cntx.c (e.g., l1mkr_t that should have been
l1fkr_t or l1vkr_t).
- Fixed instances of opid_t value BLIS_GEMM that should have been l3ukr_t
value BLIS_GEMM_UKR in bli_cntx_ref.c.
- NOTE: These issues were identified via compiler warnings when building
BLIS with clang on a rather old installation of OS X:
$ clang --version
Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)
Target: x86_64-apple-darwin15.2.0
Thread model: posix

commit 3bc99a96a3648f51b9acdc8a8c7e1cf4eb815459
Merge: 3a441183 78199c53
Author: prangana <pradeep.raoamd.com>
Date: Mon Dec 11 12:53:03 2017 +0530

Fix merge conflicts after rebase with release branch

Change-Id: I581b26c6d515f717ff0dce91c7c0c92553aa2630

commit 3a44118398955d6f872e01f73ae5bb4a4f8500f7
Author: Nisanth M P <nisanth.padinharepattamd.com>
Date: Wed Nov 15 11:11:17 2017 +0530

Added AMD copyright line to the changed files in last 3 commits

Change-Id: I37d5dbbbe1b199e07529610a5e9cc9e49d067c66

commit 268a56c06e94d1c388766dbfe81d54efbe432809
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 1 11:51:41 2017 -0500

Revert to default SIMD alignment for bulldozer.

Details:
- Removed the default-overriding define of BLIS_SIMD_ALIGN_SIZE set in
config/bulldozer/bli_kernel.h. Not sure where this value came from, but
it would seem to allow for insufficient starting address alignment for
any matrices created via bli_malloc_user(), such as via
bli_obj_create(). Thanks to Rene Sitt for reporting the behavior that
led us to this bug.
- This commit is a manual patch of the same fix made to the 'rt' branch
in 8f150f2.

commit 510a6863e28277f9446abfb77f1aea9f01d37e7a
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Oct 30 10:04:42 2017 -0500

Fix CVECFLAGS for bulldozer config.

commit c669716790bdda5d2b11ea0a026cbc121b228842
Author: Nisanth M P <nisanth.padinharepattamd.com>
Date: Tue Oct 24 16:36:36 2017 +0530

Adding __attribute__((constructor/destructor)) for CLANG case.

CLANG supports __attribute__, but its documentation doesn't
mention support for constructor/destructor. Compiling with
clang and testing shows that it does support this.

Change-Id: Ie115b20634c26bda475cc09c20960d687fb7050b

commit 24e64a9d0877d788357fc63d4b947e977f8697f7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 18 13:41:25 2017 -0500

Removed a duplicate bli_avx512_macros.h header.

Details:
- Removed a duplicate header file that was causing problems during
installation for the 'knl' configuration. Thanks to Victor Eijkhout
for reporting this issue.

commit 9c0a3c4c0260cbfefb9f11532f46508b4fd19ec2
Author: Nisanth M P <nisanth.padinharepattamd.com>
Date: Mon Oct 16 22:06:57 2017 +0530

Thread Safety: Move bli_init() before and bli_finalize() after main()

BLIS provides APIs to initialize and finalize its global context.
One application thread can finalize BLIS, while other threads
in the application are still using BLIS.

This issue can be solved by removing bli_finalize() from API.
One way to do this is by getting bli_finalize() to execute by default
after application exits from main().

GCC supports this behaviour with the help of __attribute__((destructor))
added to the function that needs to be executed after main exits.

Similarly bli_init() can be made to run before application enters main()
so that application need not call it.

Change-Id: I7ce6cfa28b384e92c0bdf772f3baea373fd9feac
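
The mechanism described in the commit above can be sketched as follows. This assumes a compiler that supports GNU-style attributes (GCC or Clang); the names demo_auto_init(), demo_auto_finalize(), and demo_initialized() are hypothetical and stand in for the library's own init and finalize routines.

```c
/* Hypothetical flag tracking whether the library is ready. */
static int demo_is_init = 0;

/* Runs automatically before main() is entered. */
__attribute__((constructor))
static void demo_auto_init(void)
{
    demo_is_init = 1;
}

/* Runs automatically after main() returns or exit() is called. */
__attribute__((destructor))
static void demo_auto_finalize(void)
{
    demo_is_init = 0;
}

/* Lets the application observe the state from inside main(). */
int demo_initialized(void)
{
    return demo_is_init;
}
```

With this arrangement the application never calls the finalize routine itself, so one thread cannot finalize the library while others are still using it.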

commit 83f31253eb21c5ecd8a5907835e57720daae0b8b
Author: Nisanth M P <nisanth.padinharepattamd.com>
Date: Mon Oct 16 21:07:50 2017 +0530

Thread safety: Make the global induced method status array local to thread

BLIS retains a global status array for induced methods, and provides
APIs to modify this state during runtime. So, one application thread
can modify the state, before another starts the corresponding
BLIS operation.

This patch solves this issue by making the induced method status array
local to threads.

Change-Id: Iff59b6f473771344054c010b4eda51b7aa4317fe
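
A minimal sketch of a per-thread status array follows. It uses the __thread storage class, a GCC/Clang extension, which is an assumption; the log does not say which mechanism the patch used. All demo_* names are hypothetical.

```c
#include <pthread.h>

#define DEMO_NUM_OPS 4

/* Each thread gets its own zero-initialized copy of this array. */
static __thread int demo_status[DEMO_NUM_OPS];

void demo_set_status(int op, int enabled) { demo_status[op] = enabled; }
int  demo_get_status(int op)              { return demo_status[op]; }

/* Modifies only the worker thread's copy of the array. */
static void *demo_worker(void *arg)
{
    (void)arg;
    demo_set_status(0, 1);
    return NULL;
}

/* Runs demo_worker in a second thread; returns 0 on success. */
int demo_run_worker(void)
{
    pthread_t t;

    if (pthread_create(&t, NULL, demo_worker, NULL) != 0)
        return -1;

    return pthread_join(t, NULL);
}
```

A change made by the worker is invisible to the calling thread, which is the isolation the commit message asks for.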

commit e923402e68029be379a4297de3ac6fb155ffd928
Author: sthangar <Santanu.Thangarajamd.com>
Date: Thu Sep 28 12:15:36 2017 +0530

The inner loop parallelization is turned off by default; the JR and IR loop parameters are set to 1 by default

Change-Id: I8c3c2ecbbd636259f6ffb92768ec04148205c3e5

commit a64c15de19327c7595376d699be676c7003e850e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 26 19:02:53 2017 -0500

Fixed a pthread typo in previous commit.

Details:
- Misnamed 'pthread_mutex_t' type in bli_memsys.c as 'thread_mutex_t'.

commit 42dcd589c37e1a2473ab2e1539207da97aebc07f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 26 17:00:04 2017 -0500

Fixed bugs in gemm/gemmtrsm ukr tests in testsuite.

Details:
- Fixed a bug in gemmtrsm test module that was due to improper partitioning
into a k x k triangular matrix for the purposes of obtaining an mr x k
micropanel of A with which to test.
- Fixed a bug in gemm and gemmtrsm test modules that would only manifest for
very large k (depending on the product of mr x kc on that architecture).
The bug arose from the fact that the test module was triggering the
allocation of blocks from the internal memory pools, which are limited in
size. This allocation imposes an implicit assumption that the micropanel
being tested with will fit inside, and this assumption is violated
for large values of k. Arbitrarily large k may now be tested for both
operation tests.
- Added OpenMP/pthread critical sections around the setting or getting of
statuses from the induced method operation lookup table in bli_l3_ind.c.
- Added the 'static' keyword to all pthread_mutex_t global variables in BLIS.
- Thanks to Nisanth Padinharepatt of AMD for reporting the first and third
issues.

commit 206beb68ff73b75f5c382413967aacbb8a0aac3a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Sep 9 14:10:15 2017 -0500

Updated bibtex info for BLIS5 (3m4m) article.

commit 0c8c0363aeb1f4aa88f7ec2d02403dab05a6e014
Author: sthangar <Santanu.Thangarajamd.com>
Date: Mon Aug 28 16:44:42 2017 +0530

Bug fix for the testsuite build failing

Change-Id: I7cd8c9d187387c48b2564e45cbfb8df985e93d77

commit 63d1c84465b50f64787808dd3e8494e683c16821
Author: sthangar <Santanu.Thangarajamd.com>
Date: Wed Aug 23 13:01:14 2017 +0530

Adding auto hardware detection for Zen

Change-Id: I40ce6705dd66b35000c4ccddffad1c5b65998caf

commit 537fb2a895b09be94b11947696fd2da629be24dd
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Aug 15 10:02:25 2017 -0500

Add vzeroupper to Intel AVX kernels.

commit 7628de3f76f78a44788807605a4601ddda445854
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 10 16:24:28 2017 -0500

Removed trailing enum commas from bli_type_defs.h.

Details:
- Removed trailing commas from enums in bli_type_defs.h. Thanks to
Erling Andersen for pointing out this inconsistency and suggesting
the change.

commit a666fd4e267ffae3d4b21f38d569c61ff56adc9e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Aug 5 13:04:31 2017 -0500

Added edge handling to _determine_blocksize_b().

Details:
- Added explicit handling of situations where i == dim to
bli_determine_blocksize_b_sub(). This isn't actually needed by any
current use case within BLIS, but handling the situation is nonetheless
prudent. Thanks to Minh Quan for reporting this issue and requesting
the fix.

commit 0c8afa546d7f33760415519ba328d7c49eb7aa06
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Aug 4 14:17:44 2017 -0500

Fixed a minor bug in level-3 packm management.

Details:
- Fixed a bug in bli_l3_packm() that caused cntl_t-cached packed mem_t
entries to be released and then re-acquired unnecessarily. (In essence,
the "<" operands in the conditional that guards the
release-and-reacquire code block simply needed to be swapped.) The bug
should have only affected performance (rather than the computed result).
Thanks to Minh Quan for identifying and reporting the bug.
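
The guard in question can be pictured with the sketch below. It assumes the intended rule is "release and re-acquire only when the cached block is too small"; the function and parameter names are hypothetical and the real conditional in bli_l3_packm() is not reproduced here.

```c
#include <stddef.h>

/* Returns 1 when the cached block cannot hold the requested size. */
int demo_needs_reacquire(size_t cached_size, size_t size_needed)
{
    /* Swapping the operands would instead trigger on blocks that are
       already large enough, causing needless release/re-acquire cycles. */
    return cached_size < size_needed;
}
```

With the operands reversed the result stays correct as long as a block of the right size is eventually obtained, but the extra pool traffic costs time, which matches the performance-only impact noted above.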

commit 6cf68a185d83fa46d438fcef65258ace78e24b13
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Jul 31 15:19:51 2017 -0500

Change lsame_ signature to match lapacke.

commit 6a9bd97295cc4fb1cbcd28f69824a43c073c9a76
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Jul 29 20:17:05 2017 -0500

Fixed pthreads compile bug with previous commit.

Details:
- Erroneously passed family parameter into l3int_t function despite
that function not taking the parameter. Oops.

commit 95adc43d800431dc0a02ca83a51426dbef641ad6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Jul 29 14:53:39 2017 -0500

Moved 'family' field from cntx_t to cntl_t.

Details:
- Removed the family field inside the cntx_t struct and re-added it to the
cntl_t struct. Updated all accessor functions/macros accordingly, as well
as all consumers and intermediaries of the family parameter (such as
bli_l3_thread_decorator(), bli_l3_direct(), and bli_l3_prune_*()). This
change was motivated by the desire to keep the context limited, as much
as possible, to information about the computing environment. (The family
field, by contrast, is a descriptor about the operation being executed.)
- Added additional functions to bli_blksz_*() API.
- Added additional functions to bli_cntx_*() API.
- Minor updates to bli_func.c, bli_mbool.c.
- Removed 'obj' from bli_blksz_*() API names.
- Removed 'obj' from bli_cntx_*() API names.
- Removed 'obj' from bli_cntl_*(), bli_*_cntl_*() API names. Renamed routines
that operate only on a single struct to contain the "_node" suffix to
differentiate with those routines that operate on the entire tree.
- Added enums for packm and unpackm kernels to bli_type_defs.h.
- Removed BLIS_1F and BLIS_VF from bszid_t definition in bli_type_defs.h.
They weren't being used and probably never will be.

commit a98e4aa547f61ab09dd91d11478c2a2ef9882e11
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Jul 20 14:50:13 2017 -0500

Clang can't make up its mind what to support.

commit 32eb36c3e8c2add2528514272044de16faed0c8f
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Jul 20 12:54:58 2017 -0500

Add default define for __has_extension.

commit 2a9aa134f7c29d3d4fdc160022ff257e61885a95
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Jul 20 10:04:34 2017 -0500

Add fallbacks to __sync_* or __c11_atomic_* builtins when __atomic_* is not supported. Fixes 143.
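
A compile-time fallback of this kind might be sketched as below. The detection test (checking for the predefined __ATOMIC_SEQ_CST macro) and the names DEMO_ADD_FETCH and demo_increment() are assumptions for illustration; the commit's own feature checks are not shown in this log, and the __c11_atomic_* branch is omitted.

```c
#if defined(__ATOMIC_SEQ_CST)
  /* Compilers with __atomic_* builtins predefine the ordering macros. */
  #define DEMO_ADD_FETCH(p, v) __atomic_add_fetch((p), (v), __ATOMIC_SEQ_CST)
#else
  /* Legacy builtin; always acts as a full memory barrier. */
  #define DEMO_ADD_FETCH(p, v) __sync_add_and_fetch((p), (v))
#endif

/* Atomically increments *counter and returns the new value. */
int demo_increment(int *counter)
{
    return DEMO_ADD_FETCH(counter, 1);
}
```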

commit 6f07a034d575e1e9e30bb6417b8fcb77cf301297
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 19 15:40:48 2017 -0500

Updated ar option list used by all configurations.

Details:
- Dropped 'u' from the list of modifiers passed into the library archiver
ar. Previously, "cru" was used, while now we employ only "cr". This
change was prompted by a warning observed on Ubuntu 16.04:

ar: `u' modifier ignored since `D' is the default (see `U')

This caused me to realize that the default mode causes timestamps to be
zero, and thus the 'u' option, which causes only changed object files to
be inserted, is not applicable.

commit 32bc03f9eed8795cfd2f2615d1c9f8673e039c57
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 19 13:51:53 2017 -0500

Added --force-version=STRING option to configure.

Details:
- Added an option to configure that allows the user to force an arbitrary
version string at configure-time. The help text also now describes the
usage information.
- Changed the way the version string is communicated to the Makefile.
Previously, it was read into the VERSION variable from the 'version' file
via $(shell cat ...). Now, the VERSION variable is instead set in
config.mk (via a configure-substituted anchor from config.mk.in).

commit befaee6dd8b2a72de9e0461fe2ec1f36e9f88f3c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 18 17:56:00 2017 -0500

Updated openmp/pthread barriers with GNU atomics.

Details:
- Updated the non-tree openmp and pthreads barriers defined in
bli_thrcomm_openmp.c and bli_thrcomm_pthreads.c to instead call a common
implementation in bli_thrcomm.c, bli_thrcomm_barrier_atomic(). This new
implementation goes through the same motions as the previous codes, but
protects its loads and increments with GNU atomic built-ins. These atomic
statements take memory ordering parameters that allow us to specify just
enough constraints for the barrier to work as intended on weakly-ordered
hardware. The prior implementation was only guaranteed to work on systems
with strongly-ordered memory. (Thanks to Devin Matthews for suggesting
this change and his crash-course in atomics and memory ordering.)
- Removed 'volatile' from structs' barrier field declarations in
bli_thrcomm_*.h.
- Updated bli_thrcomm_pthread.? files to use renamed struct barrier fields
consistent with that of the _openmp.? files.
- Updated other bli_thrcomm_* files to rename "communicator" variables to
simply "comm".
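
A sense-reversing barrier built on the GNU atomic built-ins can be sketched as follows. This is an illustration of the technique under assumed names (demo_comm_t, demo_barrier) and assumed memory orderings; it is not a copy of bli_thrcomm_barrier_atomic().

```c
typedef struct
{
    int n_threads;  /* number of threads sharing the barrier */
    int sense;      /* flips each time the barrier completes */
    int arrived;    /* threads that have reached the barrier */
} demo_comm_t;

void demo_barrier(demo_comm_t *comm)
{
    /* Record the current sense before announcing arrival. */
    int orig_sense = __atomic_load_n(&comm->sense, __ATOMIC_RELAXED);

    /* Atomically count this thread as arrived. */
    int arrived = __atomic_add_fetch(&comm->arrived, 1, __ATOMIC_ACQ_REL);

    if (arrived == comm->n_threads)
    {
        /* Last arrival: reset the count, then release the waiters. */
        __atomic_store_n(&comm->arrived, 0, __ATOMIC_RELAXED);
        __atomic_fetch_xor(&comm->sense, 1, __ATOMIC_RELEASE);
    }
    else
    {
        /* Spin until the last arrival flips the sense. */
        while (__atomic_load_n(&comm->sense, __ATOMIC_ACQUIRE) == orig_sense)
            ;
    }
}
```

The release on the flip pairs with the acquire in the spin loop, so writes made before the barrier are visible to every thread after it, even on weakly-ordered hardware.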

commit 8f739cc847fcff2ddeeb336f8b2b9d080eb16f6c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jul 17 19:03:22 2017 -0500

Added API to set mt environment variables.

Details:
- Renamed bli_env_get_nway() -> bli_thread_get_env().
- Added bli_thread_set_env() to allow setting environment variables
pertaining to multithreading, such as BLIS_JC_NT or BLIS_NUM_THREADS.
- Added the following convenience wrapper routines:
bli_thread_get_jc_nt()
bli_thread_get_ic_nt()
bli_thread_get_jr_nt()
bli_thread_get_ir_nt()
bli_thread_get_num_threads()
bli_thread_set_jc_nt()
bli_thread_set_ic_nt()
bli_thread_set_jr_nt()
bli_thread_set_ir_nt()
bli_thread_set_num_threads()
- Added include "errno.h" to bli_system.h.
- This commit addresses issue 140.
- Thanks to Chris Goodyer for inspiring these updates.
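
The get/set pairing described above can be illustrated with plain POSIX calls. This is a sketch under the assumption that the values travel through the process environment via getenv() and setenv(); the demo_* functions are hypothetical and are not the BLIS routines listed above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* exposes setenv() under strict C modes */
#endif
#include <stdio.h>
#include <stdlib.h>

/* Reads an integer environment variable; returns -1 if it is unset. */
long demo_thread_get_env(const char *name)
{
    const char *str = getenv(name);

    if (str == NULL)
        return -1;

    return strtol(str, NULL, 10);
}

/* Writes an integer environment variable; returns 0 on success. */
int demo_thread_set_env(const char *name, long value)
{
    char buf[32];

    snprintf(buf, sizeof buf, "%ld", value);
    return setenv(name, buf, 1);  /* 1 = overwrite an existing value */
}

/* Convenience wrappers in the style of the list above. */
long demo_get_num_threads(void)   { return demo_thread_get_env("BLIS_NUM_THREADS"); }
int  demo_set_num_threads(long n) { return demo_thread_set_env("BLIS_NUM_THREADS", n); }
```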

commit 10163833075fd42be5b5b503acc855f91a484cfd
Author: Marat Dukhan <maratfb.com>
Date: Thu Jul 13 21:39:24 2017 -0700

Fix Emscripten builds

commit c09b30d115eade72f44f37bf90aa848c9c0e79af
Author: Minh Quan HO <mqhokalray.eu>
Date: Fri Jul 7 10:52:05 2017 +0200

set missing free_fp in bli_membrk_init for free-ing GEN_USE buffers

The membrk's free_fp is called when releasing GEN_USE buffers, but this free_fp is
not set in bli_membrk_init
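
The kind of omission fixed here can be pictured with a small allocator object that stores its allocation routines as function pointers. The struct and function names below are hypothetical; only the idea that free_fp must be set during initialization comes from the commit message.

```c
#include <stdlib.h>

typedef void *(*demo_malloc_ft)(size_t size);
typedef void  (*demo_free_ft)(void *p);

typedef struct
{
    demo_malloc_ft malloc_fp;  /* used when acquiring a buffer */
    demo_free_ft   free_fp;    /* used when releasing a buffer */
} demo_membrk_t;

void demo_membrk_init(demo_membrk_t *m)
{
    m->malloc_fp = malloc;
    m->free_fp   = free;  /* if omitted, release calls an unset pointer */
}

void *demo_membrk_acquire(demo_membrk_t *m, size_t size)
{
    return m->malloc_fp(size);
}

void demo_membrk_release(demo_membrk_t *m, void *p)
{
    m->free_fp(p);
}
```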

commit 997628ed9793c72e9ef576dd8d715cfec27c4862
Author: sthangar <Santanu.Thangarajamd.com>
Date: Fri Jun 30 12:23:19 2017 +0530

Reducing the framework overhead of GEMV routines

Change-Id: I83607ad767bff74e305e915b54b0ea34ec3e5684

commit ee869066168239b710ad9938bb0e1ae454883f3a
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Tue Jul 4 12:57:32 2017 +0530

Improved efficiency of dGEMM for large matrices by reducing TLB load misses and, chiefly, L3 cache misses. This is achieved by changing the packed block sizes of matrices A and B. The optimum values are now MC_D = 510 and KC_D = 1024.

Change-Id: I2d8bdd5f62f2d1f8782ae2997f3d7a26587d1ca4

commit 7b933b90b1859c96de49a402d48de82909bc73e5
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Jun 6 20:23:17 2017 -0500

Add new SSI acknowledgment

commit 3485abba4b426fbf42b146a9611a0841f6d236c6
Author: sthangar <Santanu.Thangarajamd.com>
Date: Wed May 24 11:48:16 2017 +0530

Checked in the small matrix code to compute GEMM called with A transpose case

Change-Id: I29f40046d43d7a4b037c1cb322503ee26495f462

commit de16beb83b29b4b9748f70db985b0fe04db85f7d
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri May 26 14:49:31 2017 -0400

PACKDIM_MR=8 didn't work out, but messing with the prefetching helps 2%.

commit 25d0e618544b6eea7d3f13c7aec513ac0139801d
Author: Devin Matthews <dmatthewsgator3.ufhpc>
Date: Fri May 26 14:47:36 2017 -0400

Revert "Change PACKDIM_MR (double) for haswell to 8."

This reverts commit 681eec913d7c2ebcff637cec5c1627ced9a92b99.

commit c5bdd84b35bc2a8ebf55b7763fb56c0c945be0cb
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri May 26 12:28:09 2017 -0500

Change PACKDIM_MR (double) for haswell to 8.

commit 172789d562001293b973bbdd8015bd27d37292e8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed May 17 13:03:52 2017 -0500

Restored deleted lines from makefile fragments.

commit 3ea9bd2c8e90dbd35655fa6a5b953dfea1f308fe
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed May 17 12:29:44 2017 -0500

Change to /bin/sh.

All scripts checked with Debian's checkbashisms. Also check for clang first in auto-detect.sh.

commit 49438409eedb98d3f0ebf00b8d1eee0ae45f4f8c
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed May 17 12:27:14 2017 -0500

Remove shebangs from makefiles.

commit 497e2640474c016d576dce3530fa6a66891642a0
Author: J M Dieterich <dieterichogolem.org>
Date: Tue May 16 23:11:22 2017 -0400

Fix if/else structure. Thanks to TravisCI.

commit 835035c56a8de36ad25bb8d1375db170d489ef57
Author: J M Dieterich <dieterichogolem.org>
Date: Tue May 16 22:23:27 2017 -0400

Mark piledriver compilable w/ clang.

commit 6cdb533472ee61af297c1f948307abbf45828887
Author: J M Dieterich <dieterichogolem.org>
Date: Tue May 16 22:12:12 2017 -0400

Mark bulldozer compilable w/ clang.

commit a85697d62272da06d28cd1c947f6cf1098df6467
Author: J M Dieterich <dieterichogolem.org>
Date: Tue May 16 22:06:59 2017 -0400

Correct error message.

commit e0c64cad271058688a2b999caf8c2767dc3aef7e
Author: J M Dieterich <dieterichogolem.org>
Date: Tue May 16 22:03:23 2017 -0400

Indeed one can compile for carrizo also using clang.

commit 4aafe0505d3f0954d095ded5459a76976e5093b4
Author: J M Dieterich <dieterichogolem.org>
Date: Tue May 16 21:50:49 2017 -0400

A bunch of shebang fixes from unportable /bin/bash to portable /usr/bin/env bash

commit abaeaa68ea11e84be1810f564d6f38d506cbeb6a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri May 5 15:06:56 2017 -0500

Fixed a bug in norm1v, norm1m.

Details:
- Fixed a bug that manifested as improperly-computed 1-norm for vectors
and matrices. This is one of the few operations in BLIS that does not
have its own test module within the testsuite, hence why it went
undetected for so long. The bad 1-norms were being used to normalize
matrices in the testsuite after initialization, which led to some
matrices containing a combination of "large" and "small" values. This
tended to push the residuals computed after each test away from zero.
In some cases, they were off *just* enough for the testsuite to label
it a "failure". Many thanks to Jeff Hammond for reporting this bug.
(Wonky details: the bug was due to improperly-defined level-0 scalar
macros for abval2, an operation that computes the absolute square,
or complex magnitude/modulus. Certain complex domain instances of
abval2 were being incorrectly defined in terms of real-only solutions,
leading to bad results. This level-0 operation forms the basis of
norm1v/norm1m. absq2 was also affected, but almost nothing uses
this operation.)
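
For reference, the quantities involved can be written out as below. This uses the standard definition of the complex modulus and assumes, consistent with the description above, that the 1-norm sums the moduli of the entries; the symbols \chi_i (entries of x) and n (vector length) are notation chosen here.

```latex
\[
|\chi| = \sqrt{\operatorname{Re}(\chi)^2 + \operatorname{Im}(\chi)^2},
\qquad
\|x\|_1 = \sum_{i=0}^{n-1} |\chi_i|
\]
```

Under this definition, a formula that ignores the imaginary part cannot produce the correct modulus for a general complex entry, so any norm built on it is also wrong.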

commit cc3107ae1c2074f72b724aa748d2e5b4cb290ed5
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu May 4 10:35:22 2017 -0500

Setting any one of BLIS_NT_[IJ][CR] overrides BLIS_NUM_THREADS. Missing BLIS_NT_XX's are defaulted to 1. Fixes 123.
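
The precedence rule stated in this commit can be sketched as follows. The struct, the function name, and the convention that a value of zero or less means "not set" are assumptions made for illustration.

```c
typedef struct
{
    int jc, ic, jr, ir;   /* per-loop ways of parallelism */
    int use_num_threads;  /* 1 if the global thread count still applies */
} demo_ways_t;

/* Inputs <= 0 are treated as "not set" (a convention assumed here). */
demo_ways_t demo_resolve_ways(int jc, int ic, int jr, int ir)
{
    demo_ways_t w;
    int any_set = (jc > 0) || (ic > 0) || (jr > 0) || (ir > 0);

    /* Any per-loop value that was not given defaults to 1. */
    w.jc = (jc > 0) ? jc : 1;
    w.ic = (ic > 0) ? ic : 1;
    w.jr = (jr > 0) ? jr : 1;
    w.ir = (ir > 0) ? ir : 1;

    /* A single per-loop setting is enough to override the global count. */
    w.use_num_threads = !any_set;

    return w;
}
```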

commit c8ab91f70d399ee14edd30a3a5c46b24c5d2f910
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed May 3 15:04:51 2017 -0500

Disable complex 3m/4m in testsuite by default.

Details:
- Disabled testsuite tests of all level-3 implementations based on 3m
and 4m. This will improve testing runtime on Travis CI as well as for
anyone manually running the testsuite using default test parameters.
Thanks to Devin Matthews for suggesting this change.

commit 9700f0e5785007ddafb72a5ca83800dee61fd35c
Author: Jeff Hammond <jeff.sciencegmail.com>
Date: Tue May 2 19:25:21 2017 -0700

allow KNL build without hbwmalloc.h (i.e. emulated)

we want to be able to run BLIS KNL binaries on non-KNL machines via SDE.
although it is possible to install hbwmalloc implementation on such
systems, it is easier not to, since obviously the performance of SDE
execution is not representative so there is no reason to emulate HBW
allocation.

commit 17dcd5a33ff91967f67e7c0ba09b4f18754609a4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 2 16:48:43 2017 -0500

Fixed stray parentheses in README citations.

commit 2910d44ff9e1d951d3249313f4ab39d18ea1b48d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 2 16:38:43 2017 -0500

CHANGELOG update (0.2.2)

commit 5ca3863220e07972fcefc6682ddd3f6e54fe4a94
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 2 15:48:30 2017 -0500

Fixed a trsm1m bug that affected right-side cases.

Details:
- Fixed a bug introduced in 1c732d3 that affected trsm1m_r. The result
was nondeterministic behavior (usually segmentation faults) for certain
problem sizes beyond the 1m instance of kc (e.g. 128 on haswell). The
cause of the bug was my commenting out lines in bli_gemm1m_ukr_ref.c
which explicitly directed the virtual gemm micro-kernel to use temporary
space if the storage preference of the [real domain] gemm ukernel did
not match the storage of the output matrix C. In the context of gemm,
this handling is not needed because agreement between the storage pref
and the matrix is guaranteed by a high-level optimization in BLIS.
However, this optimization is not applied to trsm because the storage
of C is not necessarily the same as the storage of the micro-panels of
B--both of which are updated by the micro-kernel during a trsm
operation. Thus, the guarantee of storage/preference agreement is not
in place for trsm, which means we must handle that case within the
virtual gemm micro-kernel.
- Comment updates and a minor macro change to bli_trsm*_cntx_init() for
3m1, 4m1a, and 1m.

commit 1af0b09f5c275ee7bac896cc6f36f42af721d9b5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 2 12:09:39 2017 -0500

README.md update.

Details:
- Updated bibtex entries for 4th BLIS paper, and added entries for 5th
and 6th BLIS papers.

commit db4a0bb8ba7cd697d68be8e5632371ee3e59fd63
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Mar 17 12:07:27 2017 -0500

Whitespace reformatting to armv8a kernels file.

Details:
- Updated formatting of function signature/header in
kernels/armv8a/3/bli_gemm_opt_4x4.c.

commit e3eb01f6b990e205b15edcbaffd3d54b3ddd1ca4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Feb 21 15:33:39 2017 -0600

Disabled experiment-related 1m code.

Details:
- Commented out code in frame/ind/oapi/bli_l3_3m4m1m_oapi.c that was
specifically inserted to facilitate the benchmarking of 1m block-panel
and panel-block algorithms.
- Updates to test/3m4m/Makefile, runme.sh script, and test_gemm.c to
reflect changes used/needed during benchmarking.

commit 4f61528d56eed6a139eeac9db0c44e56f2d2d136
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jan 25 16:25:46 2017 -0600

Added 1m-specific APIs for bp, pb gemm algorithms.

Details:
- Defined bli_gemmbp_cntl_create(), bli_gemmpb_cntl_create(), with the
body of bli_gemm_cntl_create() replaced with a call to the former.
- Defined bli_cntl_free_w_thrinfo(), bli_cntl_free_wo_thrinfo(). Now,
bli_cntl_free() can check if the thread parameter is NULL, and if so,
call the latter, and otherwise call the former.
- Defined bli_gemm1mbp_cntx_init(), bli_gemm1mpb_cntx_init(), both in
terms of bli_gemm1mxx_cntx_init(), which behaves the same as
bli_gemm1m_cntx_init() did before, except that an extra bool parameter
(is_pb) is used to support both bp and pb algorithms (including to
support the anti-preference field described below).
- Added support for "anti-preference" in context. The anti_pref field,
when true, will toggle the boolean return value of routines such as
bli_cntx_l3_ukr_eff_prefers_storage_of(), which has the net effect of
causing BLIS to transpose the operation to achieve disagreement (rather
than agreement) between the storage of C and the micro-kernel output
preference. This disagreement is needed for panel-block implementations,
since they induce a transposition of the suboperation immediately before
the macro-kernel is called, which changes the apparent storage of C. For
now, anti-preference is used only with the pb algorithm for 1m (and not
with any other non-1m implementation).
- Defined new functions,
bli_cntx_l3_ukr_eff_prefers_storage_of()
bli_cntx_l3_ukr_eff_dislikes_storage_of()
bli_cntx_l3_nat_ukr_eff_prefers_storage_of()
bli_cntx_l3_nat_ukr_eff_dislikes_storage_of()
which are identical to their non-"eff" (effectively) counterparts except
that they take the anti-preference field of the context into account.
- Explicitly initialize the anti-pref field to FALSE in
bli_gks_cntx_set_l3_nat_ukr_prefs().
- Added bli_gemm_ker_var1.c, which implements a panel-block macro-kernel
in terms of the existing block-panel macro-kernel _ker_var2(). This
technique requires inducing transposes on all operands and swapping
A and B.
- Changed bli_obj_induce_trans() macro so that pack-related fields are
also changed to reflect the induced transposition.
- Added a temporary hack to bli_l3_3m4m1m_oapi.c that allows us to easily
specify the 1m algorithm (block-panel or panel-block).
- Renamed the following cntx_t-related macros:
bli_cntx_get_pack_schema_a() -> bli_cntx_get_pack_schema_a_block()
bli_cntx_get_pack_schema_b() -> bli_cntx_get_pack_schema_b_panel()
bli_cntx_get_pack_schema_c() -> bli_cntx_get_pack_schema_c_panel()
and updated all instantiations. Also updated the field names in the
cntx_t struct.
- Comment updates.
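
The anti-preference toggle described above can be pictured with a minimal sketch. All names here (toy_cntx_t and the toy_ functions) are hypothetical stand-ins and not the BLIS API; the sketch only assumes, as the commit message states, that the "eff" query inverts the plain boolean answer when anti_pref is true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a context; not the real cntx_t. */
typedef struct
{
    bool ukr_prefers_rows; /* micro-kernel output preference (assumed field) */
    bool anti_pref;        /* when true, invert the preference answer */
} toy_cntx_t;

/* Plain query: does the micro-kernel prefer the storage of C? */
bool toy_ukr_prefers_storage_of( const toy_cntx_t* cntx, bool c_is_row_stored )
{
    assert( cntx != NULL );
    return cntx->ukr_prefers_rows == c_is_row_stored;
}

/* "Effective" query: the plain answer, toggled when anti_pref is set. */
bool toy_ukr_eff_prefers_storage_of( const toy_cntx_t* cntx, bool c_is_row_stored )
{
    bool r = toy_ukr_prefers_storage_of( cntx, c_is_row_stored );
    return cntx->anti_pref ? !r : r;
}
```

With anti_pref set, a caller that transposes the operation whenever the effective query returns false ends up with disagreement between the storage of C and the micro-kernel preference, which is what the panel-block case is said to need.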

commit 1d728ccb2394e77365e7c42683db6579c5fba014
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 25 18:29:49 2016 -0600

Implemented the 1m method.

Details:
- Implemented the 1m method for inducing complex domain matrix
multiplication. 1m support has been added to all level-3 operations,
including trsm, and is now the default induced method when native
complex domain gemm microkernels are omitted from the configuration.
- Updated _cntx_init() operations to take a datatype parameter. This was
needed for the corresponding function for 1m (because 1m requires us
to choose between column-oriented or row-oriented execution, which
requires us to query the context for the storage preference of the
gemm microkernel, which requires knowing the datatype) but I decided
that it made sense for consistency to add the parameter to all other
cntx initialization functions as well, even though those functions
don't use the parameter.
- Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take
a second scalar for each blocksize entry. The semantic meaning of the
two scalars now is that the first will scale the default blocksize
while the second will scale the maximum blocksize. This allows scaling
the two independently, and was needed to support 1m, which requires
scaling for a register blocksize but not the register storage
blocksize (i.e., "packdim") analogue.
- Deprecated bli_blksz_reduce_dt_to() and defined two new functions,
bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing
default and maximum blocksizes to some desired blocksize multiple.
These functions are needed in the updated definitions of
bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs().
- Added support for the 1e and 1r packing schemas to packm, including
1e/1r packing kernels.
- Added a minor optimization to bli_gemm_ker_var2() that allows, under
certain circumstances (specifically, real domain beta and row- or
column-stored matrix C), the real domain macrokernel and microkernel
to be called directly, rather than using the virtual microkernel
via the complex domain macrokernel, which carries a slight additional
amount of overhead.
- Added 1m support to the testsuite.
- Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified
some code in test_gemm.c driver.
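
As a rough illustration of the two-scalar blocksize semantics described above, the sketch below scales a default and a maximum blocksize independently. The type and function names are hypothetical, and treating each scalar as a multiplicative factor is an assumption made for the example; the actual convention in BLIS may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pair of blocksizes for one datatype; not the real blksz_t. */
typedef struct
{
    int def; /* default blocksize */
    int max; /* maximum blocksize */
} toy_blksz_pair_t;

/* Scale the default by def_scalar and the maximum by max_scalar.
   Assumption: both scalars act as multiplicative factors. */
void toy_blksz_scale( toy_blksz_pair_t* b, double def_scalar, double max_scalar )
{
    assert( b != NULL );
    b->def = ( int )( b->def * def_scalar );
    b->max = ( int )( b->max * max_scalar );
}
```

Passing different values for the two scalars changes the default while leaving the maximum untouched, which is the independence the commit message calls out.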

commit 0d1b90286e29aa8b768e280b5286d92c02ad87a1
Author: Jeff Hammond <jeff.sciencegmail.com>
Date: Tue Oct 25 21:15:26 2016 -0700

never use libm with Intel compilers

Intel compilers include a highly optimized math library (libimf) that
should be used instead of GNU libm.

yes, this change is for ALL targets, including those that are not
supported by the Intel compiler. there is no harm in doing this, and it
is future-proof in the event that the Intel compilers support other
architectures.

commit b150870397e7aee558e61d1bd72a0c0d1d99bee8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Dec 8 16:08:41 2017 -0600

Removed most "old" directories.

Details:
- Removed the vast majority of directories named "old", which contained
deprecated code that I wasn't quite ready to jettison from the source
tree.

commit 270c65985df849297ba1951aa3b56c03948d7775
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Dec 8 15:21:18 2017 -0600

Modified bli_getopt() for thread-safety.

Details:
- Changed the interface of bli_getopt() to take a new argument, a getopt_t
struct, that stores the values of optarg, optind, opterr, and optopt,
and updated the implementation accordingly. (Previously, these
variables were assumed to be global.)
- Added a function for initializing a getopt_t struct.
- Changed test_libblis.c--currently the only consumer of bli_getopt()--to
utilize the new getopt_t state object.
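
The idea of moving parser state out of globals and into a caller-owned struct can be sketched as follows. This is a simplified toy, not the bli_getopt() implementation: the toy_ names are hypothetical, only one option per argv element is handled, and no error messages are printed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Caller-owned parser state, replacing the traditional globals. */
typedef struct
{
    const char* optarg;
    int         optind;
    int         opterr;
    int         optopt;
} toy_getopt_t;

void toy_getopt_init_state( int opterr, toy_getopt_t* st )
{
    assert( st != NULL );
    st->optarg = NULL;
    st->optind = 1;
    st->opterr = opterr;
    st->optopt = 0;
}

/* Returns the option character, '?' for an unknown option or a missing
   argument, and -1 when there are no more options. */
int toy_getopt( int argc, const char* const argv[], const char* optstring,
                toy_getopt_t* st )
{
    st->optarg = NULL;
    if ( st->optind >= argc ) return -1;

    const char* arg = argv[ st->optind ];
    if ( arg[0] != '-' || arg[1] == '\0' ) return -1;

    char        c    = arg[1];
    const char* spec = ( c == ':' ) ? NULL : strchr( optstring, c );
    st->optind++;

    if ( spec == NULL ) { st->optopt = c; return '?'; }

    if ( spec[1] == ':' )
    {
        /* The argument is either attached ("-o3") or the next element. */
        if      ( arg[2] != '\0' )     st->optarg = arg + 2;
        else if ( st->optind < argc )  st->optarg = argv[ st->optind++ ];
        else                           { st->optopt = c; return '?'; }
    }
    return c;
}
```

Because every call reads and writes only the struct it is given, two threads (or two nested parses) can each keep their own toy_getopt_t without interfering with one another.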

commit ce4d8fabc2e39371f89c12192fb707be82ae021a
Merge: 39be59f2 e05a8dfa
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Dec 7 17:36:44 2017 -0600

Merge branch 'master' of github.com:flame/blis

commit 39be59f2a8470f40475907d9dd52639b8a911a92
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Dec 7 17:35:20 2017 -0600

Replaced several macros with static function APIs.

Details:
- Reimplemented several sets of get/set-style preprocessor macros with
static functions, including those in the following frame/base headers:
auxinfo, cntl, mbool, mem, membrk, opid, and pool. A few headers in
frame/thread were touched as well: mutex_*, thrcomm, and thrinfo.
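
For readers unfamiliar with this kind of refactoring, here is a small sketch of a get/set macro next to an equivalent static function. The toy_mem_t type and accessor names are hypothetical and only loosely modeled on the headers listed above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record; not the real mem_t. */
typedef struct
{
    void*  buf;
    size_t size;
} toy_mem_t;

/* Macro style: textual substitution, no type checking of the argument. */
#define toy_mem_size_macro( mem_p ) ( (mem_p)->size )

/* Static-function style: typed parameters, visible to debuggers. */
static inline size_t toy_mem_size( const toy_mem_t* mem )
{
    assert( mem != NULL );
    return mem->size;
}

static inline void toy_mem_set_size( size_t size, toy_mem_t* mem )
{
    assert( mem != NULL );
    mem->size = size;
}
```

Both forms return the same value; the function form lets the compiler reject a call that passes the wrong pointer type.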

commit e05a8dfa7cc7df41e966c1ad04e51c482b308b23
Merge: 79507337 4423e33d
Author: dnp <devangiparikhgmail.com>
Date: Wed Dec 6 16:45:24 2017 -0600

Merge branch 'rt'

commit 4423e33dc593115cda92c5763d756d7ad1298aa9
Author: dnp <devangiparikhgmail.com>
Date: Wed Dec 6 16:35:03 2017 -0600

Adding SKX kernels and configuration.

commit 79507337e140daec7639f6eb3ed9cfe6e123d342
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 6 16:21:35 2017 -0600

Various checks to ensure that arch_t id is in range.

Details:
- Expanded checking of the arch_t id in bli_gks.c--either passed in from
the caller or as returned from bli_arch_query_id()--against the expected
range of id values. Thanks to Devangi Parikh for suggesting these
additional sanity checks.

commit fde7c1126c58373ecde83471890b257399144876
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Dec 4 16:11:01 2017 -0600

Added 'uninstall-old-headers' target to Makefile.

Details:
- Defined a new 'uninstall-old-headers' target that allows users of BLIS to
uninstall no-longer-needed headers left over from previous installations.
- Fixed the 'uninstall-old' target so that it will uninstall both .a and .so
libraries.
- Renamed 'uninstall-old' to 'uninstall-old-libs'.
- Added 'uninstall-old' target (different from previous 'uninstall-old'
target) that combines 'uninstall-old-libs' and 'uninstall-old-headers'.

commit d4ee770bde213a87aa6049245145318324dc6b51
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Dec 4 14:53:43 2017 -0600

Create/install monolithic cblas.h.

Details:
- When CBLAS is enabled at configure-time, BLIS now creates a monolithic
cblas.h using the same flatten-header.sh script that was recently
introduced for creating monolithic blis.h header files. The top-level
Makefile will also install this cblas.h file into the install prefix
alongside blis.h when the 'install' target is invoked. The two header
files are compatible with one another. Regardless whether the user's
source includes cblas.h, both blis.h and cblas.h, or just blis.h,
the user will get the CBLAS function prototypes and enums, as expected.

commit 52f9e6f1b6468785af8947317656445d4729fc8b
Merge: ab57b979 21360dd8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Dec 1 12:28:09 2017 -0600

Merge branch 'rt'

commit 21360dd8e2c7287100645e109acaabcc6ba1140c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 29 14:11:34 2017 -0600

Fixed cntx_t packm query when ker_id > _NUM_PACKM_KERS.

Details:
- Fixed a subtle bug in bli_cntx_get_[un]packm_ker_dt() in which the
function fails to return NULL when passed a kernel id argument that is
equal to or beyond BLIS_NUM_[UN]PACKM_KERS. Instead, the function was
attempting to index into the cntx_t's packm kernel array, which resulted
in undefined behavior. Thanks to Devangi Parikh for finding this bug.
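
The class of bug fixed here can be illustrated with a bounds-checked lookup. The names below are hypothetical and the context is reduced to a bare array of function pointers; the point is only that an id equal to or beyond the array length must return NULL before any indexing happens.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NUM_PACKM_KERS 4u

typedef void ( *toy_packm_ker_ft )( void );

/* Hypothetical context holding only a packm kernel table. */
typedef struct
{
    toy_packm_ker_ft packm_kers[ TOY_NUM_PACKM_KERS ];
} toy_pk_cntx_t;

/* Placeholder kernel so the table has a non-NULL entry. */
void toy_dummy_ker( void ) { }

/* Return NULL for out-of-range ids instead of indexing past the array. */
toy_packm_ker_ft toy_cntx_get_packm_ker( const toy_pk_cntx_t* cntx,
                                         unsigned             ker_id )
{
    assert( cntx != NULL );
    if ( ker_id >= TOY_NUM_PACKM_KERS ) return NULL;
    return cntx->packm_kers[ ker_id ];
}
```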

commit 244a6f4e66e8ff091e995f8090ce779c1928aa8b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 28 17:48:48 2017 -0600

Fixed POSIX sed non-compliance in flatten-header.sh.

Details:
- Changed GNU usage of 'i' and 'a' sed commands used in flatten-header.sh
to POSIX-compliant usage that will work on OS X's sed.

commit 45078621676833e53a2878af8f89479c4f93b8ab
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 28 15:16:22 2017 -0600

Generate/compile with/install monolithic blis.h.

Details:
- Rewrote monolithify-header.sh (and renamed to flatten-header.sh) so that
headers are inserted recursively. This improves performance by a factor
of 3-4x.
- Modified configure to create an 'include/<configname>' directory in which
make can create a monolithic header.
- Modified the top-level Makefile so that a monolithic header is generated
unconditionally prior to compilation (stored in include/<configname>) and
so that the single header is installed instead of the 450 or so header
files that reside throughout the framework source tree.
- Added "include/*/*.h" to .gitignore file.
- Removed some pnacl/emscripten leftovers that I intended to include in
a1caeba (mostly in testsuite/Makefile).
- Trivial comment changes to frame/include/bli_f2c.h.

commit 1f30b1301bf6d6047ec29e57a5fde8eb1072a0ee
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Nov 25 16:54:26 2017 -0600

Added missing framework support for x86_64 family.

Details:
- Added support for the x86_64 configuration family to bli_arch.c and
bli_arch_config.h. Thanks to Johannes Dieterich for reporting this
issue.
- Bumped the default value for BLIS_SIMD_NUM_REGISTERS from 16 to 32 and
the default value for BLIS_SIMD_SIZE from 32 to 64. This will support
configuration families that include Skylake and newer processors without
any support needed in the bli_family_*.h file. The semantics of these
values have always been "maximum" and not exact values; comments in
bli_kernel_macro_defs.h and the github wiki have been adjusted
accordingly.
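
A sketch of how overridable "maximum" defaults of this kind are commonly written is shown below. The TOY_ macro names and the scratch-size helper are hypothetical, and the example assumes the SIMD size is expressed in bytes.

```c
#include <assert.h>
#include <stddef.h>

/* A family header could define these first; otherwise the defaults apply. */
#ifndef TOY_SIMD_NUM_REGISTERS
#define TOY_SIMD_NUM_REGISTERS 32
#endif

#ifndef TOY_SIMD_SIZE
#define TOY_SIMD_SIZE 64 /* assumed to be bytes per register */
#endif

/* Hypothetical use: an upper bound on scratch space for all registers.
   Because the values are maxima, the bound is safe on smaller hardware. */
size_t toy_max_simd_scratch_bytes( void )
{
    assert( TOY_SIMD_NUM_REGISTERS > 0 && TOY_SIMD_SIZE > 0 );
    return ( size_t )TOY_SIMD_NUM_REGISTERS * ( size_t )TOY_SIMD_SIZE;
}
```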

commit 9f39806c4ed484c9ed13edf96005838d977722a9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 21 16:03:56 2017 -0600

Fixed a bug in e31f0b3/b131b9a.

Details:
- Erroneously placed the "don't overwrite existing blocksize" logic in
bli_blksz_init*() rather than in bli_cntx_set_blkszs(). It belongs in
the latter because that function copies blocksizes as-is from the
blksz_t function argument to the appropriate field in the cntx_t. If
the blksz_t was previously initialized selectively, based on the sign
of the blocksize value passed into bli_blksz_init*(), that just leaves
some fields possibly uninitialized (with garbage values), which
definitely will not work.
- The aforementioned logic has been moved to bli_cntx_set_blkszs() via
a new function bli_blksz_copy_if_pos(), which selectively copies only
the blocksizes that are greater than zero.
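
The selective copy described above might look roughly like the following. The struct layout (one value per datatype) and the toy_ names are assumptions for illustration, not the real blksz_t or bli_blksz_copy_if_pos().

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NUM_DT 4 /* e.g. one slot each for s, d, c, z (assumed) */

/* Hypothetical blocksize record with one value per datatype. */
typedef struct
{
    int v[ TOY_NUM_DT ];
} toy_blksz_t;

/* Copy only the positive entries of src, leaving the rest of dst as-is. */
void toy_blksz_copy_if_pos( const toy_blksz_t* src, toy_blksz_t* dst )
{
    assert( src != NULL && dst != NULL );
    for ( int dt = 0; dt < TOY_NUM_DT; ++dt )
    {
        if ( src->v[ dt ] > 0 ) dst->v[ dt ] = src->v[ dt ];
    }
}
```

Here src is always fully initialized (with zeros where no value is supplied), so nothing is left holding garbage; the filtering happens at copy time, leaving the destination's existing defaults in place.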

commit b131b9a025c15f548d4c2952a9ec85eee3d139b1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 21 14:30:26 2017 -0600

Updated configs to omit setting some blocksizes.

Details:
- Employ the new semantics of bli_blksz_init*() in e31f0b3 in various
sub-configurations' bli_cntx_init_*() functions by passing in 0 for
register and cache blocksizes that correspond to gemm microkernel
datatypes that were not registered, allowing the default values
set by the bli_cntx_init_*_ref() function call to remain.

commit 499a4c002f895744ecaf81ef7f62d2d6d0d7d594
Merge: e31f0b3e 6c3ba502
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 21 14:25:08 2017 -0600

Merge branch 'rt' of github.com:flame/blis into rt

commit e31f0b3e2dba19ca8a2946bc21beb136a42d0f57
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 21 14:21:25 2017 -0600

Subtle update to bli_blksz_init*() API.

Details:
- Updated the semantics of bli_blksz_init() and bli_blksz_init_ed() so
that non-positive blocksize values are ignored entirely. This provides
an easy way to indicate that certain existing values should not be
touched by the update. Thanks to Devangi Parikh for feedback that led
to these changes.

commit 6c3ba502a11f87bc67555d26154cfd39d0af1bac
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 21 13:50:53 2017 -0600

Added 'x86_64' sub-config directory.

Details:
- Added missing x86_64 configuration directory, which was intended to be
part of b7ca580.
- Added -Wfatal-errors compiler warning flag to all configurations so that
compilation stops after the first error.
- Changed the vectorization flags for intel64 configuration to be compatible
with 'penryn', the oldest sub-config included in that family.
- Changed the vectorization flags for penryn to target the 'core2'
microarchitecture and ssse3.

commit 25eee3cc49b0631812485d4d5ceef0c23ed1b6dd
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 21 12:34:20 2017 -0600

Added a dummy file to kernels/generic.

Details:
- Added a dummy file to kernels/generic, which was previously empty, so
that git would begin tracking the otherwise-empty directory. This
directory's existence is necessary for proper execution of configure
for any configuration family that contains the 'generic'
sub-configuration. Thanks to Johannes Dieterich for reporting the
issue that led to this fix.

commit ef024ce4cafa217669eaabb31ff8ab6df93cca05
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Nov 20 18:08:29 2017 -0600

More tweaks to monolithify-header.sh

Details:
- Further fixes to the monolithify-header.sh script.
- Removed unnecessary include "blis.h" from frame/3/bli_l3_packm.h.

commit 5028e7dec269b62895511453272585da36e591b5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Nov 20 17:00:37 2017 -0600

Second attempt to implement travis_wait.

Details:
- Corrected accidental misplacement of the travis_wait prefix (on the
wrong line of the .travis.yml file) in commit 13e5d91.

commit 13e5d9107b3763cba46fb1bae87476852601b47c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Nov 20 15:57:06 2017 -0600

Added travis_wait prefix to testsuite via Travis.

Details:
- It appears that Travis CI has implemented a new policy that results in
a test failing if it does not produce any output for more than 10
minutes. (Two test instances are now failing in Travis despite the most
recent commit not affecting the library or testsuite.) This issue can
be worked around by executing the test run via travis_wait, which takes
an optional time parameter. This commit attempts to use 'travis_wait 30'
in the .travis.yml file to prevent the early failure at 10 minutes.

commit a1caeba0ea79c8fecb1abadca1f91c6367ab3afb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Nov 20 13:31:20 2017 -0600

Removed pnacl, emscripten support from Makefile.

commit 78199c539beaa50f37893add220261ce0dcb921a
Merge: b3d8ab2e ab57b979
Author: praveeng <praveen.gamd.com>
Date: Mon Nov 20 15:51:20 2017 +0530

Merge master code till 01-Nov-2017 to amd-staging

Change-Id: I40b53f876db84c8b947b3f2385c9b882245c6603

commit 9df6dda9ec51a0d40166169d2d8a2f84b42266e6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Nov 18 19:03:26 2017 -0600

Improvements, bugfixes to monolithify-header.sh.

commit 21d26201f90b884eb8d5de279ed74bbd244ffcb5
Merge: 43baa3b3 b7ca5806
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Nov 18 14:16:53 2017 -0600

Merge branch 'rt' of github.com:flame/blis into rt

commit 43baa3b327d5ae1e2ba619432687b4dd849b05e3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Nov 18 14:14:44 2017 -0600

Removed unnecessary flags for generic config.

Details:
- Removed -D_POSIX_C_SOURCE=200112L and -m64 flags from make_defs.mk file
of generic sub-configuration. These flags are generally not necessary,
and particularly not desirable for the generic configuration since they
unnecessarily restrict the environments in which the configuration can
be built.

commit b7ca580618f9382b7982168fd035ed058f83e4c2
Author: iotamudelta <dieterichogolem.org>
Date: Sat Nov 18 14:56:05 2017 -0500

[WIP] Add x86 and x86_64 processor families. (154)

* Add x86 and x86_64 processor families.
* Use generic config as fallback for more families.

After discussion with fgvanzee: a) it's "generic", and b) use it for all the families as a fallback. The goal is that if a specific CPU is not yet supported by a family (say, a new Intel microarchitecture on x86_64), it falls through and still works with the slower "generic" kernels.

commit 870597d1663aaba1b74d7654b1d4946280aa0d3f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 17 17:06:42 2017 -0600

Added bash script for creating monolithic headers.

Details:
- Added a new script, monolithify-header.sh, to the 'build' directory.
This script recursively replaces all include directives in a selected
file with the contents of the header files referenced by each directive.
The idea is to "flatten" a tree of .h files into a single file, with
the script acting as a C preprocessor that only processes include
directives.
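
The flattening idea can be sketched in C over an in-memory table of "files". The real tool is a bash script that works on files on disk; everything below (the toy_ names, the table, the depth limit, the fixed output buffer) is a hypothetical simplification that only handles quoted include directives starting at column one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory header: a name and its text. */
typedef struct { const char* name; const char* text; } toy_header_t;

static const toy_header_t* toy_find( const toy_header_t* hs, size_t n,
                                     const char* name, size_t len )
{
    for ( size_t i = 0; i < n; ++i )
        if ( strlen( hs[i].name ) == len &&
             strncmp( hs[i].name, name, len ) == 0 ) return &hs[i];
    return NULL;
}

/* Append len bytes of s to the NUL-terminated buffer out, truncating. */
static void toy_append( char* out, size_t cap, const char* s, size_t len )
{
    size_t used = strlen( out );
    if ( used + 1 >= cap ) return;
    if ( len > cap - 1 - used ) len = cap - 1 - used;
    memcpy( out + used, s, len );
    out[ used + len ] = '\0';
}

/* Copy text to out line by line, replacing each known quoted include
   with the recursively flattened contents of that header. */
void toy_flatten( const toy_header_t* hs, size_t n, const char* text,
                  char* out, size_t cap, int depth )
{
    static const char prefix[] = "#include \"";
    const size_t      plen     = sizeof( prefix ) - 1;

    assert( out != NULL && cap > 0 );
    for ( const char* line = text; *line != '\0'; )
    {
        const char* eol = strchr( line, '\n' );
        size_t      len = eol ? ( size_t )( eol - line ) + 1 : strlen( line );
        const toy_header_t* h = NULL;

        /* The depth limit guards against include cycles. */
        if ( depth < 16 && len > plen && strncmp( line, prefix, plen ) == 0 )
        {
            const char* name = line + plen;
            const char* endq = memchr( name, '"', len - plen );
            if ( endq ) h = toy_find( hs, n, name, ( size_t )( endq - name ) );
        }
        if ( h ) toy_flatten( hs, n, h->text, out, cap, depth + 1 );
        else     toy_append( out, cap, line, len );
        line += len;
    }
}
```

Directives that name an unknown header, such as system includes in angle brackets, are copied through unchanged, which matches the idea of a preprocessor that only processes include directives.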

commit c76f77f4cc1e71988251c5e63cf6ef137477bf9c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 17 15:10:52 2017 -0600

Removed unnecessary include "blis.h" from header.

Details:
- Removed an errant include "blis.h" directive from bli_cntx_ind_stage.h.
The general policy is that no header file in BLIS should include
blis.h. This will be important in the near future when using a tool to
recursively create a monolithic blis.h file from its constituent
headers.

commit 2bb9bc6e9536fa239fbc19a7efaaf151116e15b4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 17 13:50:14 2017 -0600

Miscellaneous tweaks to gks, rt functionality.

Details:
- Updated bli_cpuid_query_id() so that BLIS_ARCH_GENERIC is always returned
if the hardware fails to test positive for any supported sub-configuration.
- Defined bli_gks_init_ref_cntx(), which will call the context initialization
function bli_cntx_init_configname() for the sub-configuration 'configname'
associated with the arch_t id returned by bli_arch_query_id(). This makes
initializing a reference context easy for experts who wish to construct
those contexts.

commit b3d8ab2ea02c127ab241532abc214624f35bfaab
Merge: 189ffbb0 fe71c06e
Author: Santanu Thangaraj <Santanu.Thangarajamd.com>
Date: Wed Nov 15 01:33:12 2017 -0500

Merge "Added AMD copyright line to the changed files in last 3 commits" into amd-staging

commit fe71c06e42b072407c83112779055b0afb67173d
Author: Nisanth M P <nisanth.padinharepattamd.com>
Date: Wed Nov 15 11:11:17 2017 +0530

Added AMD copyright line to the changed files in last 3 commits

Change-Id: I37d5dbbbe1b199e07529610a5e9cc9e49d067c66

commit d5bf79e50bf97072bbe7117c86b7c45e6e707ea0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Nov 13 14:24:29 2017 -0600

Miscellaneous tweaks and fixes.

Details:
- Fixed incorrect calling sequence in bli_cntx_init_knl.c--an instance of
bli_blksz_init_easy() that should have been bli_blksz_init().
- Fixed a bug in code that is supposed to output the list of sub-directories
in the 'config' directory when configure script is run with no arguments.
- Expanded the output of "make showconfig" to include more info from config.mk.
- Minor changes to build/auto-detect/cpuid_x86.c, mostly in preparation for
someone to add excavator and zen support.
- Added a link to the ConfigurationHowTo wiki to config_registry.
- Other minor tweaks to configure.

commit 673e5184030532c4ebd9fdeecbaa6442bb3ad54f
Merge: 2c51356a 8f150f28
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 1 17:37:42 2017 -0500

Merge branch 'rt' of github.com:flame/blis into rt

commit 2c51356a8b2699c99f9507c80d69c08a35d45fe3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 1 17:37:02 2017 -0500

Implemented runtime hardware detection via cpuid.

Details:
- Added runtime support for selecting an appropriate arch_t value based
on the results of the cpuid instruction (for x86_64). This allows
deferral of choosing a context (kernels, blocksizes, etc.) until
runtime, which allows BLIS to be built with support for multiple
microarchitectures. Currently, only amd64 and intel64 configurations
are registered in the config_registry; however, one could create
custom configuration families to support arbitrary sets of x86_64
microarchitectures.
- Current Intel microarchitectures supported via cpuid are knl, haswell,
sandybridge, and penryn.
- Current AMD microarchitectures supported via cpuid are: zen, excavator,
steamroller, piledriver, and bulldozer.
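
A much-simplified picture of runtime selection is sketched below. It does not execute the cpuid instruction; instead it takes a feature bitmask as input so the logic is portable. The toy_ names and the specific feature tests per microarchitecture are assumptions for illustration, with a generic fallback when nothing matches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature bits, as they might be decoded from cpuid. */
#define TOY_FEAT_SSSE3 ( 1u << 0 )
#define TOY_FEAT_AVX   ( 1u << 1 )
#define TOY_FEAT_AVX2  ( 1u << 2 )
#define TOY_FEAT_FMA3  ( 1u << 3 )

typedef enum
{
    TOY_ARCH_HASWELL,
    TOY_ARCH_SANDYBRIDGE,
    TOY_ARCH_PENRYN,
    TOY_ARCH_GENERIC,
    TOY_NUM_ARCHS
} toy_arch_t;

static int toy_has_all( unsigned feats, unsigned mask )
{
    return ( feats & mask ) == mask;
}

/* Test the most capable target first; fall back to generic. */
toy_arch_t toy_arch_from_features( unsigned feats )
{
    if ( toy_has_all( feats, TOY_FEAT_AVX | TOY_FEAT_AVX2 | TOY_FEAT_FMA3 ) )
        return TOY_ARCH_HASWELL;
    if ( toy_has_all( feats, TOY_FEAT_AVX ) )   return TOY_ARCH_SANDYBRIDGE;
    if ( toy_has_all( feats, TOY_FEAT_SSSE3 ) ) return TOY_ARCH_PENRYN;
    return TOY_ARCH_GENERIC;
}

/* A per-arch table stands in for the contexts chosen at runtime. */
const char* toy_arch_name( toy_arch_t id )
{
    static const char* names[ TOY_NUM_ARCHS ] =
        { "haswell", "sandybridge", "penryn", "generic" };
    assert( ( unsigned )id < ( unsigned )TOY_NUM_ARCHS );
    return names[ id ];
}
```

Deferring this choice to runtime is what lets a single library carry kernels and blocksizes for several microarchitectures at once.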

commit ab57b979046479bcda7f83165838a80117c2ad95
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 1 11:51:41 2017 -0500

Revert to default SIMD alignment for bulldozer.

Details:
- Removed the default-overriding define of BLIS_SIMD_ALIGN_SIZE set in
config/bulldozer/bli_kernel.h. Not sure where this value came from, but
it would seem to allow for insufficient starting address alignment for
any matrices created via bli_malloc_user(), such as via
bli_obj_create(). Thanks to Rene Sitt for reporting the behavior that
led us to this bug.
- This commit is a manual patch of the same fix made to the 'rt' branch
in 8f150f2.

commit 8f150f28a678c4a0c1591400177ad7cca81fcaec
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 1 11:41:45 2017 -0500

Revert to default SIMD alignment for bulldozer.

Details:
- Removed the default-overriding define of BLIS_SIMD_ALIGN_SIZE set in
bli_family_bulldozer.h. Not sure where this value came from, but it
would seem to allow for insufficient starting address alignment for
any matrices created via bli_malloc_user(), such as via
bli_obj_create(). Thanks to Rene Sitt for reporting the behavior that
led us to this bug.
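
To make the alignment concern concrete, the sketch below shows one common way to obtain and verify a starting address aligned to a given boundary. It is not how bli_malloc_user() is implemented; the toy_ names are hypothetical, and the technique shown (over-allocating and saving the original pointer just before the aligned block) is a standard substitute chosen for portability.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Nonzero if p is a multiple of align. */
int toy_is_aligned( const void* p, size_t align )
{
    return ( ( uintptr_t )p % align ) == 0;
}

/* Allocate size bytes whose starting address is a multiple of align.
   align must be a power of two and at least sizeof( void* ). */
void* toy_malloc_aligned( size_t size, size_t align )
{
    assert( align >= sizeof( void* ) && ( align & ( align - 1 ) ) == 0 );

    /* Over-allocate so an aligned address plus a saved pointer always fit. */
    void* raw = malloc( size + align - 1 + sizeof( void* ) );
    if ( raw == NULL ) return NULL;

    uintptr_t start   = ( uintptr_t )raw + sizeof( void* );
    uintptr_t aligned = ( start + align - 1 ) / align * align;

    /* Remember the original pointer just below the aligned block. */
    ( ( void** )aligned )[ -1 ] = raw;
    return ( void* )aligned;
}

void toy_free_aligned( void* p )
{
    if ( p != NULL ) free( ( ( void** )p )[ -1 ] );
}
```

If the alignment passed in is smaller than what the kernels' vector loads and stores expect, the returned address can satisfy the allocator yet still be insufficiently aligned for those instructions.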

commit e3f10557caf114441fbfff990e3ce3576c177bdc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 30 13:37:54 2017 -0500

Use perl for some substitution for OS X compatibility.

Details:
- Discovered that sed commands where the replacement string contains '\n'
are problematic with the version of sed present in OS X. For these
cases in the configure script, we instead use 'perl -pe' for
search-and-replace functionality.
- Various other minor comment/whitespace tweaks to configure.
- Removed remaining lines of code related to setting/checking variables to
track "unregistered" configurations.

commit dd45cfdfc3d8f9acf4cf7f69138d9b83dafc8842
Merge: 3e4f42a4 f60c827b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 30 12:23:05 2017 -0500

Merge branch 'master' into rt

commit f60c827ba95f452c8454fb914f5564f4895bf644
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Oct 30 10:04:42 2017 -0500

Fix CVECFLAGS for bulldozer config.

commit 3e4f42a4d2ebb37b95988933d92e561c5b2cc201
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 27 11:41:37 2017 -0500

Typecast l1mkr_t enum value prior to comparison.

Details:
- Typecast l1mkr_t enum value in bli_cntx.h to guint_t before testing for
out-of-range value. This is an attempt to pacify a strange warning from
clang on OS X that is seemingly the result of the following compiler
warning flag:
-Wtautological-constant-out-of-range-compare
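
The shape of such a range test is sketched below with hypothetical names. Casting the enum value to an unsigned integer before comparing makes the comparison one between plain integers, and it also means a negative value wraps to a large one, so a single test covers both ends of the range.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel id enum; not the real l1mkr_t. */
typedef enum
{
    TOY_L1MKR_0 = 0,
    TOY_L1MKR_1,
    TOY_L1MKR_2
} toy_l1mkr_t;

#define TOY_NUM_L1M_KERS 3u

/* Nonzero if id names a valid kernel slot. */
int toy_l1mkr_in_range( toy_l1mkr_t id )
{
    return ( unsigned )id < TOY_NUM_L1M_KERS;
}

/* Look up a slot only after the range test has passed. */
const char* toy_l1mkr_name( toy_l1mkr_t id )
{
    static const char* names[ TOY_NUM_L1M_KERS ] = { "k0", "k1", "k2" };
    assert( names[0] != NULL );
    return toy_l1mkr_in_range( id ) ? names[ ( unsigned )id ] : NULL;
}
```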

commit aec6e038d942d35b81bbd723a640cce2c054fb8e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 26 16:12:36 2017 -0500

Removed associative arrays from configure.

Details:
- Implemented a replacement for associative arrays in the configure script
that does not utilize arrays, and therefore works in pre-4.0 versions of
bash. (It appears that Mac OS X will be stuck with version 3.2 indefinitely
due to bash switching to the GPL 3.0 license starting with version 4.0.)

commit 189ffbb0d37262b21acddc0d35b4a22f2cbbca94
Merge: 06e0e635 3eb44f67
Author: Santanu Thangaraj <Santanu.Thangarajamd.com>
Date: Wed Oct 25 02:00:30 2017 -0400

Merge changes Ie115b206,I7ce6cfa2,Iff59b6f4 into amd-staging

* changes:
Adding __attribute__((constructor/destructor)) for CLANG case.
Thread Safety: Move bli_init() before and bli_finalize() after main()
Thread safety: Make the global induced method status array local to thread

commit 3eb44f67618b91ae5f5f0aaaba67e38f16042ee4
Author: Nisanth M P <nisanth.padinharepattamd.com>
Date: Tue Oct 24 16:36:36 2017 +0530

Adding __attribute__((constructor/destructor)) for CLANG case.

CLANG supports __attribute__, but its documentation doesn't
mention support for constructor/destructor. Compiling with
clang and testing shows that it does support this.
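
A minimal sketch of the attribute in question follows. The toy_ function names are hypothetical; the attribute itself is a GCC extension that clang also accepts, so this block is not portable to compilers lacking it.

```c
#include <assert.h>

static int toy_lib_initialized = 0;

/* Runs automatically before main() when built with GCC or clang. */
__attribute__((constructor))
static void toy_init_auto( void )
{
    assert( toy_lib_initialized == 0 ); /* expected to run exactly once */
    toy_lib_initialized = 1;
}

/* Runs automatically after main() returns or exit() is called. */
__attribute__((destructor))
static void toy_finalize_auto( void )
{
    toy_lib_initialized = 0;
}

/* Lets callers observe that initialization already happened. */
int toy_is_initialized( void )
{
    return toy_lib_initialized;
}
```

By the time main() starts, toy_is_initialized() already returns 1, which is the effect the related "move bli_init() before main()" change relies on.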

Change-Id: Ie115b20634c26bda475cc09c20960d687fb7050b

commit 07c352188bf5265af242255f8e6fcb97050d973d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 23 16:59:22 2017 -0500

Added "generic" configuration.

Details:
- Added a "generic" configuration that leaves the default blocksizes and
kernels unchanged. This replaces the older "reference" configuration.
Updated auto-detect script and code accordingly.
- Added support for generic configuration to arch_t (bli_type_defs.h),
bli_gks_init() (bli_gks.c), and bli_arch_config.h
- Moved bli_arch_query_id() to bli_arch.c (and prototype to bli_arch.h).
- Whitespace changes to configurations' make_defs.mk files.

commit c1a98d6f70608b02a1e6bcad6ba020a60773dace
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 23 14:24:41 2017 -0500

Minor update to .travis.yml file.

commit 75b9383f01caa8b83f8be0117e15085b0d807ba6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 20 16:41:22 2017 -0500

Minor header renaming ahead of bli_arch.c.

Details:
- Renamed the various configurations' "bli_arch_<configname>.h" header files
(replacing "arch" with "family") to free up the 'bli_arch' namespace for a
different purpose (hardware detection).
- Renamed "bli_arch.h" and "bli_arch_pre_macro_defs.h" in frame/include to
"bli_arch_config.h" and "bli_arch_config_pre.h", respectively.

commit 482af51add26d5ed103c3e3f167657f273b32c7a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 20 15:44:26 2017 -0500

Fixed 'make test' target from top-level Makefile.

Details:
- Updated the top-level Makefile's build rule for testsuite object files to
properly obtain CFLAGS via get-frame-cflags-for() function instead of
simply using the $(CFLAGS) variable (which is empty). This means that
'make test' should now work as expected.

commit 3c269f700d207efe6c04193f09d519c88c1d4045
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 20 13:57:21 2017 -0500

Makefile updates for test drivers, testsuite.

Details:
- Fixed semi-broken testsuite Makefile and very-broken test driver Makefiles,
as well as those for test/3m4m, test/thread_ranges, and test/exec_sizes
sub-directories.
- Factored out much of the top-level Makefile into common.mk. A Makefile
needs only set DIST_PATH to the relative path to the top level of the
BLIS source distribution before including common.mk in order to acquire
all of the definitions typically needed in a Makefile that tests BLIS.

commit 0557189d463446b4c32077cdcf0467fa71ca68dc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 18 15:05:27 2017 -0500

Minor updates to .travis.yml, configure script.

commit 2553734d1d62043793f4e783a027349ef6d4d563
Merge: 453deb29 37534279
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 18 13:46:50 2017 -0500

Merge branch 'master' into rt

commit 375342799cbae981c28d831793af588d7951f3f6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 18 13:41:25 2017 -0500

Removed a duplicate bli_avx512_macros.h header.

Details:
- Removed a duplicate header file that was causing problems during
installation for the 'knl' configuration. Thanks to Victor Eijkhout
for reporting this issue.

commit 453deb29068889698e274f269c9aa90eea99b527
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 18 13:29:32 2017 -0500

Implemented runtime kernel management.

Details:
- Reworked the build system around a configuration registry file, named
'config_registry', that identifies valid configuration targets, their
constituent sub-configurations, and the kernel sets that are needed by
those sub-configurations. The build system now facilitates the building
of a single library that can contain kernels and cache/register
blocksizes for multiple configurations (microarchitectures). Reference
kernels are also built on a per-configuration basis.
- Updated the Makefile to use new variables set by configure via the
config.mk.in template, such as CONFIG_LIST, KERNEL_LIST, and KCONFIG_MAP,
in determining which sub-configurations (CONFIG_LIST) and kernel sets
(KERNEL_LIST) are included in the library, and which make_defs.mk files'
CFLAGS (KCONFIG_MAP) are used when compiling kernels.
- Reorganized 'kernels' directory into a "flat" structure. Renamed kernel
functions into a standard format that includes the kernel set name
(e.g. 'haswell'). Created a "bli_kernels_<kernelset>.h" file in each
kernels sub-directory. These files exist to provide prototypes for the
kernels present in those directories.
- Reorganized reference kernels into a top-level 'ref_kernels' directory.
This directory includes a new source file, bli_cntx_ref.c (compiled on
a per-configuration basis), that defines the code needed to initialize
a reference context and a context for induced methods for the
microarchitecture in question.
- Rewrote make_defs.mk files in each configuration so that the compiler
variables (e.g. CFLAGS) are "stored" (renamed) on a per-configuration
basis.
- Modified bli_config.h.in template so that bli_config.h is generated with
defines for the config (family) name, the sub-configurations that are
associated with the family, and the kernel sets needed by those
sub-configurations.
- Deprecated all kernel-related information in bli_kernel.h and transferred
what remains to new header files named "bli_arch_<configname>.h", which
are conditionally included from a new header bli_arch.h. These files
are still needed to set library-wide parameters such as custom
malloc()/free() functions or SIMD alignment values.
- Added bli_cntx_init_<configname>.c files to each configuration directory.
The files contain a function, named the same as the file, that initializes
a "native" context for a particular configuration (microarchitecture). The
idea is that optimized kernels, if available, will be initialized into
these contexts. Other fields will retain pointers to reference functions,
which will be compiled on a per-configuration basis. These bli_cntx_init_*()
functions will be called during the initialization of the global kernel
structure. They are thought of as initializing for "native" execution, but
they also form the basis for contexts that use induced methods. These
functions are prototyped, along with their _ref() and _ind() brethren, by
prototype-generating macros in bli_arch.h.
- Added a new typedef enum in bli_type_defs.h to define an arch_t, which
identifies the various sub-configurations.
- Redesigned the global kernel structure (gks) around a 2D array of cntx_t
structures (pointers to cntx_t, actually). The first dimension is indexed
over arch_t and the inner dimension is the ind_t (induced method) for
each microarchitecture. When a microarchitecture (configuration) is
"registered" at init-time, the inner array for that configuration in the
2D array is initialized (and allocated, if it hasn't been already). The
cntx_t slot for BLIS_NAT is initialized immediately and those for other
induced method types are initialized and cached on-demand, as needed. At
cntx_t registration, we also store function pointers to cntx_init functions
that will initialize (a) "reference" contexts and (b) contexts for use with
induced methods. We don't cache the full contexts for reference contexts
since they are rarely needed. The functions that initialize these two kinds
of contexts are generated automatically for each targeted sub-configuration
from cpp-templatized code at compile-time. Induced method contexts that
need "stage" adjustments can still obtain them via functions in
bli_cntx_ind_stage.c.
- Added new functions and functionality to bli_cntx.c, such as for setting
the level-1f, level-1v, and packm kernels, and for converting a native
context into one for executing an induced method.
- Moved the checking of register/cache blocksize consistency from being cpp
macros in bli_kernel_macro_defs.h to being runtime checks defined in
bli_check.c and called from bli_gks_register_cntx() at the time that the
global kernel structure's internal context is initialized for a given
microarchitecture/configuration.
- Deprecated all of the old per-operation bli_*_cntx.c files and removed
the previous operation-level cntx_t_init()/_finalize() invocations.
Instead, we now query the gks for a suitable context, usually via
bli_gks_query_cntx().
- Deprecated support for the 3m2 and 3m3 induced methods. (They required
hackery that I was no longer willing to support.)
- Consolidated the 1e and 1r packm kernels for any given register blocksize
into a single kernel that will branch on the schema and support packing
to both formats.
- Added the cntx_t* argument to all packm kernel signatures.
- Deprecated the local function pointer array in all bli_packm_cxk*.c files
and instead obtain the packm kernel from the cntx_t.
- Added bli_calloc_intl(), which serves as the calloc-equivalent to
bli_malloc_intl(). Useful when we wish to allocate and initialize to
zero/NULL.
- Converted existing cpp macro functions defined in bli_blksz.h, bli_func.h,
bli_cntx.h into static functions.
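
The global kernel structure described above (a 2D array of context pointers, indexed first by architecture and then by induced method, with the native slot filled at registration and the other slots built on demand) can be sketched as follows. This is a minimal illustration only: every name (ex_arch_t, ex_ind_t, ex_cntx_t, ex_gks_register(), ex_gks_query()) is a hypothetical stand-in rather than the BLIS definition, and the sketch is not thread-safe.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for arch_t, ind_t, and cntx_t. */
typedef enum { EX_ARCH_GENERIC, EX_ARCH_HASWELL, EX_NUM_ARCHS } ex_arch_t;
typedef enum { EX_NAT, EX_IND_1M, EX_NUM_IND } ex_ind_t;
typedef struct { ex_arch_t arch; ex_ind_t method; } ex_cntx_t;

/* First dimension: architecture. Inner dimension: induced method. */
static ex_cntx_t** ex_gks[EX_NUM_ARCHS];

static ex_cntx_t* ex_cntx_create(ex_arch_t id, ex_ind_t ind)
{
    ex_cntx_t* c = malloc(sizeof(*c));
    if (c != NULL) { c->arch = id; c->method = ind; }
    return c;
}

/* Register an architecture: allocate its inner array if needed and
   initialize the native slot immediately. Returns 0 on success. */
int ex_gks_register(ex_arch_t id)
{
    assert(id < EX_NUM_ARCHS);
    if (ex_gks[id] == NULL) {
        ex_gks[id] = calloc(EX_NUM_IND, sizeof(ex_cntx_t*));
        if (ex_gks[id] == NULL) return -1;
    }
    if (ex_gks[id][EX_NAT] == NULL)
        ex_gks[id][EX_NAT] = ex_cntx_create(id, EX_NAT);
    return ex_gks[id][EX_NAT] != NULL ? 0 : -1;
}

/* Query a context. Returns NULL for an unregistered architecture;
   induced-method slots are created and cached on first use. */
ex_cntx_t* ex_gks_query(ex_arch_t id, ex_ind_t ind)
{
    assert(id < EX_NUM_ARCHS && ind < EX_NUM_IND);
    if (ex_gks[id] == NULL) return NULL;
    if (ex_gks[id][ind] == NULL)
        ex_gks[id][ind] = ex_cntx_create(id, ind);
    return ex_gks[id][ind];
}
```

Repeated queries for the same (architecture, method) pair return the same cached pointer, which is the point of caching on demand.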

commit 4607aac297e55ad540cbe5fffbe02e6b1889c181
Author: Nisanth M P <nisanth.padinharepattamd.com>
Date: Mon Oct 16 22:06:57 2017 +0530

Thread Safety: Move bli_init() before and bli_finalize() after main()

BLIS provides APIs to initialize and finalize its global context.
One application thread can finalize BLIS while other threads
in the application are still using BLIS.

This issue can be solved by removing bli_finalize() from the API.
One way to do this is to have bli_finalize() execute by default
after the application exits from main().

GCC supports this behaviour with the help of __attribute__((destructor))
added to the function that needs to be executed after main() exits.

Similarly, bli_init() can be made to run before the application enters
main() so that the application need not call it.

Change-Id: I7ce6cfa28b384e92c0bdf772f3baea373fd9feac
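
The constructor/destructor mechanism mentioned in this commit can be illustrated with a small sketch. It assumes GCC or Clang, and the ex_lib_* names are hypothetical placeholders, not the BLIS functions.

```c
#include <assert.h>

static int ex_lib_initialized = 0;

/* Runs before main() is entered (GCC and Clang extension). */
__attribute__((constructor))
static void ex_lib_init(void) { ex_lib_initialized = 1; }

/* Runs after main() returns or exit() is called. */
__attribute__((destructor))
static void ex_lib_finalize(void) { ex_lib_initialized = 0; }

int ex_lib_is_initialized(void) { return ex_lib_initialized; }

/* Any entry point can rely on initialization having already happened. */
int ex_lib_compute(int x)
{
    assert(ex_lib_initialized);
    return 2 * x;
}
```

Because the finalizer only runs at process exit, no application thread can tear the library down while another thread is still inside it.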

commit 0f5ce26fc597cda6e8ae93a7526f52eb8cba01e9
Author: Nisanth M P <nisanth.padinharepattamd.com>
Date: Mon Oct 16 21:07:50 2017 +0530

Thread safety: Make the global induced method status array local to thread

BLIS retains a global status array for induced methods, and provides
APIs to modify this state during runtime. So, one application thread
can modify the state, before another starts the corresponding
BLIS operation.

This patch solves this issue by making the induced method status array
local to threads.

Change-Id: Iff59b6f473771344054c010b4eda51b7aa4317fe
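
A thread-local status array of the kind described here might look like the sketch below. It assumes a C11 compiler (_Thread_local); the method list and the tl_* names are hypothetical and do not reflect the actual BLIS table.

```c
#include <assert.h>
#include <stdbool.h>

enum { TL_IND_3M, TL_IND_4M, TL_IND_NAT, TL_NUM_METHODS };

/* One copy per thread; in this sketch each thread starts with only
   the native method enabled. */
static _Thread_local bool tl_ind_status[TL_NUM_METHODS] = { false, false, true };

void tl_ind_set(int method, bool enabled)
{
    assert(0 <= method && method < TL_NUM_METHODS);
    tl_ind_status[method] = enabled;   /* affects the calling thread only */
}

bool tl_ind_get(int method)
{
    assert(0 <= method && method < TL_NUM_METHODS);
    return tl_ind_status[method];
}
```

A change made by one thread is invisible to the others, so no thread can alter the method another thread is about to use.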

commit b882648af87deb1b365fc6b3e94151e69c5ccfa4
Merge: 8b379069 e02d3cb8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 11 16:32:21 2017 -0500

Merge branch 'master' into rt

commit 06e0e6351acb9481225975ad9a4e0b8925336621
Author: sthangar <Santanu.Thangarajamd.com>
Date: Thu Sep 28 12:15:36 2017 +0530

Inner loop parallelization is turned off by default; the JR and IR loop parameters are set to 1 by default

Change-Id: I8c3c2ecbbd636259f6ffb92768ec04148205c3e5

commit e02d3cb84190a345ebe9b32f53db03a1838976b1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 26 19:02:53 2017 -0500

Fixed a pthread typo in previous commit.

Details:
- Misnamed 'pthread_mutex_t' type in bli_memsys.c as 'thread_mutex_t'.

commit f5962a1aae0fb3c9be104d0035c0d73210e7f670
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 26 17:00:04 2017 -0500

Fixed bugs in gemm/gemmtrsm ukr tests in testsuite.

Details:
- Fixed a bug in gemmtrsm test module that was due to improper partitioning
into a k x k triangular matrix for the purposes of obtaining an mr x k
micropanel of A with which to test.
- Fixed a bug in gemm and gemmtrsm test modules that would only manifest for
very large k (depending on the product of mr x kc on that architecture).
The bug arose from the fact that the test module was triggering the
allocation of blocks from the internal memory pools, which are limited in
size. This allocation imposes an implicit assumption that the micro-
panel being tested with will fit inside, and this assumption is violated
for large values of k. Arbitrarily large k may now be tested for both
operation tests.
- Added OpenMP/pthread critical sections around the setting or getting of
statuses from the induced method operation lookup table in bli_l3_ind.c.
- Added the 'static' keyword to all pthread_mutex_t global variables in BLIS.
- Thanks to Nisanth Padinharepatt of AMD for reporting the first and third
issues.
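
The critical sections around the lookup table, together with the 'static' mutex, can be sketched as follows for the pthreads case. The table shape and the cs_* names are assumptions made for illustration; the OpenMP variant is not shown.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

enum { CS_NUM_OPS = 4, CS_NUM_METHODS = 3 };

/* Shared lookup table and the file-local mutex that guards it. */
static bool cs_table[CS_NUM_OPS][CS_NUM_METHODS];
static pthread_mutex_t cs_table_mutex = PTHREAD_MUTEX_INITIALIZER;

void cs_set(int op, int method, bool status)
{
    assert(0 <= op && op < CS_NUM_OPS && 0 <= method && method < CS_NUM_METHODS);
    pthread_mutex_lock(&cs_table_mutex);
    cs_table[op][method] = status;
    pthread_mutex_unlock(&cs_table_mutex);
}

bool cs_get(int op, int method)
{
    assert(0 <= op && op < CS_NUM_OPS && 0 <= method && method < CS_NUM_METHODS);
    pthread_mutex_lock(&cs_table_mutex);
    bool status = cs_table[op][method];
    pthread_mutex_unlock(&cs_table_mutex);
    return status;
}
```

Declaring the mutex 'static' keeps its symbol private to the translation unit, so it cannot collide with a like-named global elsewhere.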

commit 8e917b256ca2d4bcdc059fe98d86be8775c69561
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Sep 9 14:10:15 2017 -0500

Updated bibtex info for BLIS5 (3m4m) article.

commit 7be887057358df4978a4833eeae0c17e15acd9d1
Author: Nisanth M P <nisanth.padinharepattamd.com>
Date: Mon Aug 28 17:38:22 2017 +0530

Merging "Adding auto hardware detection for Zen"

Change-Id: Id450fb0c4f91a5cd5cbdc06970f4f9ed28dd8520

commit e056d810d16621891ead032603de0c2105cfc0f7
Author: sthangar <Santanu.Thangarajamd.com>
Date: Mon Aug 28 16:44:42 2017 +0530

Bug fix for the testsuite build failing

Change-Id: I7cd8c9d187387c48b2564e45cbfb8df985e93d77

commit 83796b7caf745fafc263e9e5e1bfcf5eff00c025
Merge: 8176f4e4 d1ee7762
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Mon Aug 28 05:23:28 2017 -0400

Merge "Adding auto hardware detection for Zen" into amd-staging

commit d1ee776202b26874333af7a91b6d2686342c4c81
Author: sthangar <Santanu.Thangarajamd.com>
Date: Wed Aug 23 13:01:14 2017 +0530

Adding auto hardware detection for Zen

Change-Id: I40ce6705dd66b35000c4ccddffad1c5b65998caf

commit 8176f4e43872714b997f1a5f83056daadb0ff1a5
Merge: 12413018 adafe974
Author: praveeng <praveen.gamd.com>
Date: Mon Aug 28 12:21:16 2017 +0530

resolving conflicts bli_gemm_front.c and LICENCE

Change-Id: Id24ce53896d4c1c7ceccc3e004014a0ecceb5474

commit 57e1e5cd51e7ffe8612c96a20b6a041b55426ddb
Merge: f86ce54d d6ef56c6
Author: Nisanth M P <nisanth.padinharepattamd.com>
Date: Tue Aug 22 17:07:44 2017 +0530

Merge AMD authored changes

commit adafe974b4bc3fc0663bc2f6f4ce2fde71a97988
Merge: f86ce54d 7dc78b49
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Aug 15 15:17:21 2017 -0500

Merge pull request 150 from devinamatthews/vzeroupper

Add vzeroupper to Intel AVX kernels.

commit 7dc78b49f97e6b3cd6d72fcdc588ace534d0e700
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Aug 15 10:02:25 2017 -0500

Add vzeroupper to Intel AVX kernels.

commit f86ce54d6f315006984534fe29e47a2deaacc9f5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 10 16:24:28 2017 -0500

Removed trailing enum commas from bli_type_defs.h.

Details:
- Removed trailing commas from enums in bli_type_defs.h. Thanks to
Erling Andersen for pointing out this inconsistency and suggesting
the change.

commit 60a1eeb2317939d732b9eb6ff1e0d6d668c9a1e5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Aug 5 13:04:31 2017 -0500

Added edge handling to _determine_blocksize_b().

Details:
- Added explicit handling of situations where i == dim to
bli_determine_blocksize_b_sub(). This isn't actually needed by any
current use case within BLIS, but handling the situation is nonetheless
prudent. Thanks to Minh Quan for reporting this issue and requesting
the fix.
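
The i == dim edge case can be illustrated with a simplified sketch. This is not the BLIS routine: it omits the maximum blocksize, uses a hypothetical function name, and assumes that backward partitioning handles the partial (edge) block first.

```c
#include <assert.h>
#include <stddef.h>

/* Blocksize for the step starting at offset i of a dimension of
   length dim, using algorithmic blocksize b_alg. */
size_t bs_blocksize_b(size_t i, size_t dim, size_t b_alg)
{
    assert(b_alg > 0 && i <= dim);
    size_t remaining = dim - i;
    if (remaining == 0) return 0;        /* i == dim: nothing left */
    size_t edge = remaining % b_alg;
    return edge != 0 ? edge : b_alg;     /* edge block first, then full blocks */
}
```

With dim = 10 and b_alg = 4, successive calls yield 2, 4, and 4, and a call with i == 10 yields 0 instead of a spurious full block.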

commit b01c80829907d50ec79977fba8e7b53cfe7db80a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Aug 4 14:17:44 2017 -0500

Fixed a minor bug in level-3 packm management.

Details:
- Fixed a bug in bli_l3_packm() that caused cntl_t-cached packed mem_t
entries to be released and then re-acquired unnecessarily. (In essence,
the "<" operands in the conditional that guards the
release-and-reacquire code block simply needed to be swapped.) The bug
should have only affected performance (rather than the computed result).
Thanks to Minh Quan for identifying and reporting the bug.

commit 8b379069fcd4811669855b1248ece831f190dff6
Merge: 1f3a5819 05925dd5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Aug 1 15:30:40 2017 -0500

Merge branch 'master' into rt

commit 05925dd5d30e8f403bb671ce33029170d65ce7c0
Merge: 803bbef0 cecdc05d
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Aug 1 09:31:02 2017 -0500

Merge pull request 146 from devinamatthews/master

Change lsame_ signature to match lapacke.

commit cecdc05d2834786a84ff85775d3f99a958c0765a
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Jul 31 15:19:51 2017 -0500

Change lsame_ signature to match lapacke.

commit 803bbef0a386dd0571ad389f69d55154dbfe3c50
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Jul 29 20:17:05 2017 -0500

Fixed pthreads compile bug with previous commit.

Details:
- Erroneously passed family parameter into l3int_t function despite
that function not taking the parameter. Oops.

commit c63980f4ca750618f359031d0691289b1abf5146
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Jul 29 14:53:39 2017 -0500

Moved 'family' field from cntx_t to cntl_t.

Details:
- Removed the family field inside the cntx_t struct and re-added it to the
cntl_t struct. Updated all accessor functions/macros accordingly, as well
as all consumers and intermediaries of the family parameter (such as
bli_l3_thread_decorator(), bli_l3_direct(), and bli_l3_prune_*()). This
change was motivated by the desire to keep the context limited, as much
as possible, to information about the computing environment. (The family
field, by contrast, is a descriptor about the operation being executed.)
- Added additional functions to bli_blksz_*() API.
- Added additional functions to bli_cntx_*() API.
- Minor updates to bli_func.c, bli_mbool.c.
- Removed 'obj' from bli_blksz_*() API names.
- Removed 'obj' from bli_cntx_*() API names.
- Removed 'obj' from bli_cntl_*(), bli_*_cntl_*() API names. Renamed routines
that operate only on a single struct to contain the "_node" suffix to
differentiate them from those routines that operate on the entire tree.
- Added enums for packm and unpackm kernels to bli_type_defs.h.
- Removed BLIS_1F and BLIS_VF from bszid_t definition in bli_type_defs.h.
They weren't being used and probably never will be.

commit 07837395560d413a1ba828163b41186e21a7bcfe
Merge: ca1d1d85 ad8610b4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 21 16:49:48 2017 -0500

Merge pull request 139 from Maratyszcza/emscripten

Fix Emscripten builds

commit ad8610b4415cc7982804d74f9aba29875e9e2b6c
Merge: 8772a0b3 ca1d1d85
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 21 15:18:33 2017 -0500

Merge branch 'master' into emscripten

commit ca1d1d8560c9ab1a7e3b0ac43ac70d08075bf904
Merge: b537b5bb 733faf84
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 21 09:49:50 2017 -0500

Merge pull request 144 from devinamatthews/fix_atomics_on_bgq

Add fallbacks to __sync_* or __c11_atomic_* builtins...

commit 733faf848dcc54834fcdfbb0185dc644978d8864
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Jul 20 14:50:13 2017 -0500

Clang can't make up its mind what to support.

commit 7425d0744d9e9cd29a887120e57c2b43ba287040
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Jul 20 12:54:58 2017 -0500

Add default define for __has_extension.

commit b537b5bbe8cbee459a85bac11458498ae2bce4de
Merge: 1f1ec0db 7f41bb0a
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Jul 20 10:58:39 2017 -0500

Merge pull request 133 from devinamatthews/haswell-packdim

Fix prefetching in haswell ukernel

commit 8823f91a14638ce6f4e45e67df03212bb61609d6
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Jul 20 10:04:34 2017 -0500

Add fallbacks to __sync_* or __c11_atomic_* builtins when __atomic_* is not supported. Fixes 143.

commit 1f1ec0db9380b87679d5c771c4594daa1cfc5f0d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 19 15:40:48 2017 -0500

Updated ar option list used by all configurations.

Details:
- Dropped 'u' from the list of modifiers passed into the library archiver
ar. Previously, "cru" was used, while now we employ only "cr". This
change was prompted by a warning observed on Ubuntu 16.04:

ar: `u' modifier ignored since `D' is the default (see `U')

This caused me to realize that the default mode causes timestamps to be
zero, and thus the 'u' option, which causes only changed object files to
be inserted, is not applicable.

commit 5caaba2d61cbbc36d63102a0786ece28ff797f72
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 19 13:51:53 2017 -0500

Added --force-version=STRING option to configure.

Details:
- Added an option to configure that allows the user to force an arbitrary
version string at configure-time. The help text also now describes the
usage information.
- Changed the way the version string is communicated to the Makefile.
Previously, it was read into the VERSION variable from the 'version' file
via $(shell cat ...). Now, the VERSION variable is instead set in
config.mk (via a configure-substituted anchor from config.mk.in).

commit 13175c5fb70fb6a378d5fff6ecede62e5ea6a1f6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 18 17:56:00 2017 -0500

Updated openmp/pthread barriers with GNU atomics.

Details:
- Updated the non-tree openmp and pthreads barriers defined in
bli_thrcomm_openmp.c and bli_thrcomm_pthreads.c to instead call a common
implementation in bli_thrcomm.c, bli_thrcomm_barrier_atomic(). This new
implementation goes through the same motions as the previous codes, but
protects its loads and increments with GNU atomic built-ins. These atomic
statements take memory ordering parameters that allow us to specify just
enough constraints for the barrier to work as intended on weakly-ordered
hardware. The prior implementation was only guaranteed to work on systems
with strongly-ordered memory. (Thanks to Devin Matthews for suggesting
this change and his crash-course in atomics and memory ordering.)
- Removed 'volatile' from structs' barrier field declarations in
bli_thrcomm_*.h.
- Updated bli_thrcomm_pthread.? files to use renamed struct barrier fields
consistent with that of the _openmp.? files.
- Updated other bli_thrcomm_* files to rename "communicator" variables to
simply "comm".

commit 0e58ba1b3aa84700ca51a96f1c0eed6067562fba
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jul 17 19:03:22 2017 -0500

Added API to set mt environment variables.

Details:
- Renamed bli_env_get_nway() -> bli_thread_get_env().
- Added bli_thread_set_env() to allow setting environment variables
pertaining to multithreading, such as BLIS_JC_NT or BLIS_NUM_THREADS.
- Added the following convenience wrapper routines:
bli_thread_get_jc_nt()
bli_thread_get_ic_nt()
bli_thread_get_jr_nt()
bli_thread_get_ir_nt()
bli_thread_get_num_threads()
bli_thread_set_jc_nt()
bli_thread_set_ic_nt()
bli_thread_set_jr_nt()
bli_thread_set_ir_nt()
bli_thread_set_num_threads()
- Added include "errno.h" to bli_system.h.
- This commit addresses issue 140.
- Thanks to Chris Goodyer for inspiring these updates.
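
Reading a thread-count environment variable defensively might look like the sketch below, which uses only standard C (getenv, strtol, errno). The env_* helpers are hypothetical and are not the bli_thread_get_env() implementation; they only illustrate why errno.h is useful for this kind of parsing.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Parses a positive integer; returns fallback for NULL, empty,
   malformed, out-of-range, or non-positive input. */
long env_parse_nt(const char* str, long fallback)
{
    if (str == NULL || *str == '\0') return fallback;
    char* end = NULL;
    errno = 0;
    long value = strtol(str, &end, 10);
    if (errno != 0 || *end != '\0' || value < 1) return fallback;
    return value;
}

/* Looks up an environment variable and parses it as a thread count. */
long env_get_nt(const char* name, long fallback)
{
    assert(name != NULL);
    return env_parse_nt(getenv(name), fallback);
}
```

A caller could, for example, use env_get_nt("BLIS_NUM_THREADS", -1) and treat -1 as "not set".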

commit 8772a0b33a90154c80d88b381dcdd66f824e041f
Author: Marat Dukhan <maratfb.com>
Date: Thu Jul 13 21:39:24 2017 -0700

Fix Emscripten builds

commit 72c8b49bb8d3b9370b2cc37718da22f065de9c57
Merge: 70cc825b ba7cada5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 12 14:58:12 2017 -0500

Merge pull request 138 from hominhquan/membrk_set_free_fp

Set missing free_fp in bli_membrk_init for free-ing GEN_USE buffers

commit ba7cada51a238d320528e3504ed0f0a17a6b022a
Author: Minh Quan HO <mqhokalray.eu>
Date: Fri Jul 7 10:52:05 2017 +0200

set missing free_fp in bli_membrk_init for free-ing GEN_USE buffers

The membrk's free_fp is called when releasing GEN_USE buffers, but this free_fp is
not set in bli_membrk_init

commit 1241301869957c96f16a2c6567e3ad70afa547de
Merge: 969b67e8 25ead66f
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Wed Jul 5 02:24:00 2017 -0400

Merge "Reducing the framework overhead of GEMV routines" into amd-staging

commit 25ead66fb78557f73af48bac305724d5d8aa3309
Author: sthangar <Santanu.Thangarajamd.com>
Date: Fri Jun 30 12:23:19 2017 +0530

Reducing the framework overhead of GEMV routines

Change-Id: I83607ad767bff74e305e915b54b0ea34ec3e5684

commit 969b67e8800fbd5d14a086606f3b5afbf66ed093
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Tue Jul 4 12:57:32 2017 +0530

Improved efficiency of dGEMM for large matrices by reducing TLB load misses and, above all, L3 cache misses. This is achieved by changing the packed block sizes of matrices A and B. The optimum values are now MC_D = 510 and KC_D = 1024.

Change-Id: I2d8bdd5f62f2d1f8782ae2997f3d7a26587d1ca4

commit 70cc825b552dec05165b9d70f9e6eb33d8abb118
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Jun 6 21:58:21 2017 -0500

Update LICENSE

Remove totally unnecessary first 9 lines and hopefully get Github to recognize it as 3BSD [ci skip].

commit cf54c77bc79a0f33a514be72c80a654c4e6e6f63
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Jun 6 20:23:17 2017 -0500

Add new SSI acknowledgment

commit d6ef56c6dbaf6df8ee1af1ca6a0f0792a811396a
Author: prangana <pradeep.raoamd.com>
Date: Thu Jun 1 16:11:09 2017 +0530

Update version number

Change-Id: Ib6e52d1d34c0791367ab9152dfab31f94deedeb4

commit 897bfa0e92082c30bbb74229562d7d7327cbbac8
Author: prangana <pradeep.raoamd.com>
Date: Thu Jun 1 16:11:09 2017 +0530

Update version number

Change-Id: Ib6e52d1d34c0791367ab9152dfab31f94deedeb4

commit 99d0ba5606d4b63e6a9c639aa78d4defc2455f79
Merge: be2c7eb8 6d17e012
Author: Santanu Thangaraj <Santanu.Thangarajamd.com>
Date: Thu Jun 1 02:19:02 2017 -0400

Merge "Checked in the small matrix code to compute GEMM called with A transpose case" into amd-staging

commit 6d17e0120fe5c127b941136ad2c0c08e91439535
Author: sthangar <Santanu.Thangarajamd.com>
Date: Wed May 24 11:48:16 2017 +0530

Checked in the small matrix code to compute GEMM called with A transpose case

Change-Id: I29f40046d43d7a4b037c1cb322503ee26495f462

commit 9d93f8481a1404695f7b78a3ced8ca47e890b649
Author: prangana <pradeep.raoamd.com>
Date: Tue May 30 09:58:10 2017 +0530

Update Licence File

Change-Id: I4c5cf1690d0cef92a68400f9a89e454ab6856ad2

commit be2c7eb85168937bd4318f4d05ded37620119310
Author: prangana <pradeep.raoamd.com>
Date: Tue May 30 09:58:10 2017 +0530

Update Licence File

Change-Id: I4c5cf1690d0cef92a68400f9a89e454ab6856ad2

commit 7f41bb0a0becde6a7de7df0f99668d7b4686c3b0
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri May 26 14:49:31 2017 -0400

PACKDIM_MR=8 didn't work out, but messing with the prefetching helps 2%.

commit d87614af3f3d9187be94d6e77984b282bf890928
Author: Devin Matthews <dmatthewsgator3.ufhpc>
Date: Fri May 26 14:47:36 2017 -0400

Revert "Change PACKDIM_MR (double) for haswell to 8."

This reverts commit 681eec913d7c2ebcff637cec5c1627ced9a92b99.

commit 681eec913d7c2ebcff637cec5c1627ced9a92b99
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri May 26 12:28:09 2017 -0500

Change PACKDIM_MR (double) for haswell to 8.

commit 0a3ae0ecaa0ddcb5887005d7051fa234499f1120
Merge: 0f4e6652 6e04f9df
Author: praveeng <praveen.gamd.com>
Date: Sat May 20 16:53:50 2017 +0530

frame/3/gemm/bli_gemm_front.c

Change-Id: I52a0fbc1d33bb948d430942323bbc5fe44e3ca13

commit 6e04f9df01d79c1b0e673943ca0d5d0a6095eb2e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed May 17 13:03:52 2017 -0500

Restored deleted lines from makefile fragments.

commit ec5c0c0448275280dca0991f6f33afeb73650450
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed May 17 12:29:44 2017 -0500

Change to /bin/sh.

All scripts checked with Debian's checkbashisms. Also check for clang first in auto-detect.sh.

commit 555ddc30d4c7e44f3f335e436c98606f56e1598b
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed May 17 12:27:14 2017 -0500

Remove shebangs from makefiles.

commit f26bd7f42e0c2a47fe321b2c452644990b689654
Merge: cbf8710a 169fb05f
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed May 17 11:58:41 2017 -0500

Merge pull request 128 from iotamudelta/master

Portability and clang

commit 169fb05f225c2f060265bcaa872f7f80dc638b70
Author: J M Dieterich <dieterichogolem.org>
Date: Tue May 16 23:11:22 2017 -0400

Fix if/else structure. Thanks to TravisCI.

commit 0579dfea0bcfbb90ebc073fcf78b92a5cf7238e1
Author: J M Dieterich <dieterichogolem.org>
Date: Tue May 16 22:58:07 2017 -0400

Restore version.

commit a75b05c23dc786a1fdc45dc1627a5ce2299f1a7b
Author: J M Dieterich <dieterichogolem.org>
Date: Tue May 16 22:23:27 2017 -0400

Mark piledriver compilable w/ clang.

commit 7541d46e2ba8659bb2e36b444edef112fefa1345
Author: J M Dieterich <dieterichogolem.org>
Date: Tue May 16 22:12:12 2017 -0400

Mark bulldozer compilable w/ clang.

commit 91f897073ec0df3330ede449c4d6af8158266ae3
Author: J M Dieterich <dieterichogolem.org>
Date: Tue May 16 22:06:59 2017 -0400

Correct error message.

commit f5131e1e49167f948bddd714bb1af1761829c212
Author: J M Dieterich <dieterichogolem.org>
Date: Tue May 16 22:03:23 2017 -0400

Indeed, one can also compile for carrizo using clang.

commit 5fa4e9439c04f35f89dd7d26ff742cb2dadc3180
Author: J M Dieterich <dieterichogolem.org>
Date: Tue May 16 21:50:49 2017 -0400

A bunch of shebang fixes from unportable /bin/bash to portable /usr/bin/env bash

commit 1f3a58197e5d5f9ac862bda91e7527cbfbab5d76
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon May 8 16:10:03 2017 -0500

Housekeeping, induced method file/function renames.

Details:
- Renamed all level-3 induced method files to use the "_vir.c" suffix
instead of "_ref.c". Also renamed functions within these files
accordingly.
- Renamed cpp macro definitions in frame/ind/include according to the
above changes.
- Removed frame/3/old.

commit cbf8710a1ba63e25aadaa6fc5da51ea81b3d596d
Merge: cf39d3ef fdc66f12
Author: Tyler Michael Smith <tmscs.utexas.edu>
Date: Mon May 8 11:21:20 2017 -0500

Merge pull request 127 from devinamatthews/fix_blis_nt_xx

Setting any one of BLIS_NT_[IJ][CR] overrides BLIS_NUM_THREADS

commit cf39d3ef3b29b8058c39fb4638c1a734fe64aaed
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri May 5 15:06:56 2017 -0500

Fixed a bug in norm1v, norm1m.

Details:
- Fixed a bug that manifested as improperly-computed 1-norm for vectors
and matrices. This is one of the few operations in BLIS that does not
have its own test module within the testsuite, hence why it went
undetected for so long. The bad 1-norms were being used to normalize
matrices in the testsuite after initialization, which led to some
matrices containing a combination of "large" and "small" values. This
tended to push the residuals computed after each test away from zero.
In some cases, they were off *just* enough for the testsuite to label
it a "failure". Many thanks to Jeff Hammond for reporting this bug.
(Wonky details: the bug was due to improperly-defined level-0 scalar
macros for abval2, an operation that computes the absolute square,
or complex magnitude/modulus. Certain complex domain instances of
abval2 were being incorrectly defined in terms of real-only solutions,
leading to bad results. This level-0 operation forms the basis of
norm1v/norm1m. absq2 was also affected, but almost nothing uses
this operation.)

commit 799485124f4d823e908d2e5d38b0c3a1e6172ade
Merge: 773a24ef 0df3541f
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu May 4 10:52:09 2017 -0500

Merge pull request 121 from jeffhammond/not-real-knl

allow KNL build without hbwmalloc (i.e. emulated)

commit fdc66f12d40754ff46179804bff592fddafbca02
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu May 4 10:35:22 2017 -0500

Setting any one of BLIS_NT_[IJ][CR] overrides BLIS_NUM_THREADS. Missing BLIS_NT_XX's are defaulted to 1. Fixes 123.
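
The override rule stated in this commit can be expressed as a small pure function. This is a sketch under assumptions: the nt_config_t struct, the NT_UNSET sentinel, and nt_resolve() are hypothetical, and how a total thread count is later factored across loops is out of scope.

```c
#include <assert.h>

#define NT_UNSET (-1L)

typedef struct {
    long jc, ic, jr, ir;   /* per-loop ways of parallelism */
    long num_threads;      /* total thread count, or NT_UNSET if overridden */
} nt_config_t;

/* Inputs are the parsed environment values, NT_UNSET where absent. */
nt_config_t nt_resolve(long num_threads, long jc, long ic, long jr, long ir)
{
    assert(num_threads == NT_UNSET || num_threads >= 1);
    nt_config_t cfg = { jc, ic, jr, ir, num_threads };

    int any_per_loop = (jc != NT_UNSET) || (ic != NT_UNSET) ||
                       (jr != NT_UNSET) || (ir != NT_UNSET);
    if (any_per_loop) {
        /* Any per-loop setting overrides the total thread count ... */
        cfg.num_threads = NT_UNSET;
        /* ... and the per-loop values left unset default to 1. */
        if (cfg.jc == NT_UNSET) cfg.jc = 1;
        if (cfg.ic == NT_UNSET) cfg.ic = 1;
        if (cfg.jr == NT_UNSET) cfg.jr = 1;
        if (cfg.ir == NT_UNSET) cfg.ir = 1;
    }
    return cfg;
}
```

For instance, a total of 8 threads combined with only the IC value set to 4 resolves to a 1 x 4 x 1 x 1 configuration, with the total ignored.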

commit 773a24efb2fa1c3a220bf0ce1dd621a3176196da
Merge: dd58c954 b8854259
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed May 3 15:07:59 2017 -0500

Merge branch 'master' of github.com:flame/blis

commit dd58c9545c877c3f7553eaebca7b5e9720a66f5d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed May 3 15:04:51 2017 -0500

Disable complex 3m/4m in testsuite by default.

Details:
- Disabled testsuite tests of all level-3 implementations based on 3m
and 4m. This will improve testing runtime on Travis CI as well as for
anyone manually running the testsuite using default test parameters.
Thanks to Devin Matthews for suggesting this change.

commit 0df3541f54b7fe0c604ab2ec47ba814f12391798
Author: Jeff Hammond <jeff.sciencegmail.com>
Date: Tue May 2 19:25:21 2017 -0700

allow KNL build without hbwmalloc.h (i.e. emulated)

we want to be able to run BLIS KNL binaries on non-KNL machines via SDE.
although it is possible to install hbwmalloc implementation on such
systems, it is easier not to, since obviously the performance of SDE
execution is not representative so there is no reason to emulate HBW
allocation.

commit b88542591d4dd0cde366e5ae35afd3205cb81bdc
Merge: 43007f7b c2c91e09
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 2 19:22:41 2017 -0500

Merge pull request 107 from jeffhammond/intel-compilers-no-use-libm

never use libm with Intel compilers

commit 43007f7b65ec7926cbbfc39965ff733fa251c15f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 2 16:48:43 2017 -0500

Fixed stray parentheses in README citations.

commit a4f1d0b8801c114e9ef8be39df01e1b8d27ebcb3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 2 16:38:43 2017 -0500

CHANGELOG update (0.2.2)

0.2.2

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 2 16:38:42 2017 -0500

Version file update (0.2.2)

commit d5a5e003ea9b24bb6abf12e88862e8eb61ffb03d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 2 15:48:30 2017 -0500

Fixed a trsm1m bug that affected right-side cases.

Details:
- Fixed a bug introduced in 1c732d3 that affected trsm1m_r. The result
was nondeterministic behavior (usually segmentation faults) for certain
problem sizes beyond the 1m instance of kc (e.g. 128 on haswell). The
cause of the bug was my commenting out lines in bli_gemm1m_ukr_ref.c
which explicitly directed the virtual gemm micro-kernel to use temporary
space if the storage preference of the [real domain] gemm ukernel did
not match the storage of the output matrix C. In the context of gemm,
this handling is not needed because agreement between the storage pref
and the matrix is guaranteed by a high-level optimization in BLIS.
However, this optimization is not applied to trsm because the storage
of C is not necessarily the same as the storage of the micro-panels of
B--both of which are updated by the micro-kernel during a trsm
operation. Thus, the guarantee of storage/preference agreement is not
in place for trsm, which means we must handle that case within the
virtual gemm micro-kernel.
- Comment updates and a minor macro change to bli_trsm*_cntx_init() for
3m1, 4m1a, and 1m.

commit e80993e71f4d571e9650a8e90ed386e32059eae5
Merge: a509fbd5 ca3a7924
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 2 12:30:28 2017 -0500

Merge branch 'master' into 1m

commit ca3a7924770d6cf203cce4ca9f5482e1d0d4e961
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 2 12:09:39 2017 -0500

README.md update.

Details:
- Updated bibtex entries for 4th BLIS paper, and added entries for 5th
and 6th BLIS papers.

commit 0f4e6652dfe9b30105d3bab328ac26d9d5c11182
Merge: 42e7f6fb 6e7de6ef
Author: praveeng <praveen.gamd.com>
Date: Wed Apr 19 17:54:10 2017 +0530

Merge master code till 2017_04_19 to amd-staging

Change-Id: Ibebe83c8ea2e7eb15798c2bcf214b7228a1c9518

commit 42e7f6fb2a531429ee600b2fe0293b67371c7ccb
Author: sthangar <Santanu.Thangarajamd.com>
Date: Tue Mar 28 18:10:03 2017 +0530

fixed license attribute issues in AMD added files

Change-Id: I303f870a777c7cd1c1af29ea0b93f3e0a27948e4

commit 5600001e973c6cea048bd3fdb28117f1d7c98b9d
Merge: 0b190293 b3ed4933
Author: prangana <pradeep.raoamd.com>
Date: Mon Mar 20 13:56:33 2017 +0530

Fix merge conflicts after sync with release branch

Change-Id: Icf14a09f728befb69a73fff9fa79c4128e728310

commit 6e7de6ef84babb273dc5528a9b9d01f0febe394b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Mar 17 12:10:24 2017 -0500

Minor updates to test/3m4m.

Details:
- Updated initial problem size and increment in Makefile.
- Updated code in test_gemm.c to correctly query kc from context.

commit f484c6cd4389dc7ae5b972849e12e98ad5bbf9a4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Mar 17 12:07:27 2017 -0500

Whitespace reformatting to armv8a kernels file.

Details:
- Updated formatting of function signature/header in
kernels/armv8a/3/bli_gemm_opt_4x4.c.

commit 0b19029342ffc530fa22ef20398a26221cb8f6ec
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Tue Mar 14 14:51:31 2017 +0530

Code cleanup, removed warnings from trsm, removed unused routines in axpyv & scalv

Change-Id: I02867f394c5f416194c4b1769a6c75f39243ec81

commit 825363bd2a5a60a923d4a6d9691dc143845a9cab
Merge: 093bdb80 513944e4
Author: praveeng <praveen.gamd.com>
Date: Wed Mar 8 15:42:49 2017 +0530

Merge code from master to amd-staging as on 2017_03_08 by praveeng

Change-Id: I80740081b2cb54c9b77a3e78b9fe540e170be23d

commit 093bdb80c86b06367e595aa17487139ae983822f
Author: sthangar <Santanu.Thangarajamd.com>
Date: Tue Mar 7 13:35:50 2017 +0530

Checked in Unpacked DGEMM code

Change-Id: I39dcc7b238b328f73ee2675d21a5e521d0488723

commit 33923da9a108854590d386e74b6ee66b971e7796
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Mon Mar 6 14:31:31 2017 +0530

Added variant 10 for double precision axpyv microkernel

Change-Id: I7a20cc113a422603250bc450825c965136354974

commit bc828f7f8e3ddb9f58af07edc0b935b21759fb0f
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Fri Mar 3 14:45:35 2017 +0530

Added new axpyv (single precision) microkernel that performs 10 FMAs per loop. This gives better performance than all other implementations of axpyv

Change-Id: Ic4f0e4c67e367d67d0b24febcf34f81a70a39972

commit c9949f4603419267c10973adf1d63ec38497475d
Author: sthangar <Santanu.Thangarajamd.com>
Date: Fri Feb 17 14:16:33 2017 +0530

Checked in DGEMMTRSM and edge case handling routine in DDOTXF

Change-Id: I65f00661af6c09b2507294fd43e0a10641c0597e

commit a509fbd5ac04fafd4e51b43d2f59ca56432dc212
Merge: 69b4846a 513944e4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Feb 21 17:06:16 2017 -0600

Merge branch 'master' into 1m

commit 69b4846ae9adb157c4171b52e159684db2867853
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Feb 21 15:33:39 2017 -0600

Disabled experiment-related 1m code.

Details:
- Commented out code in frame/ind/oapi/bli_l3_3m4m1m_oapi.c that was
specifically inserted to facilitate the benchmarking of 1m block-panel
and panel-block algorithms.
- Updates to test/3m4m/Makefile, runme.sh script, and test_gemm.c to
reflect changes used/needed during benchmarking.

commit 513944e4a951d8823b4de161b86ad7a965b4d99b
Merge: 8b462a0e 0e18f68c
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Feb 20 10:04:33 2017 -0500

Merge pull request 118 from devinamatthews/master

Handle k=0 correctly in KNL dgemm ukernel.

commit 0e18f68cf12eb9189ba901a20040b1cdae417670
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Feb 20 09:03:21 2017 -0600

Handle k=0 correctly in KNL dgemm ukernel.

commit 8b462a0e8c3e9252f0401940849e53cc772256fa
Merge: c362afc5 7d42fc07
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Sun Feb 19 23:03:03 2017 -0500

Merge pull request 117 from devinamatthews/master

Cast dim_t and inc_t parameters to 64-bit in KNL microkernels.

commit 7d42fc0796ef0c010375fd8e59b1240ba41ce4d2
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Sun Feb 19 21:10:55 2017 -0500

Cast dim_t and inc_t parameters to 64-bit in KNL microkernels.

commit 04245c9ff7f8b3c70d61003029c964bb9a4320ee
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Fri Feb 10 14:24:30 2017 +0530

Reoptimized scalv routines - two vector multiplies are done per iteration, and these routines are enabled in bli_kernel.h

Change-Id: Ic5654508573d1f6bde2edef06aefe117e581feb5

commit c362afc525bab4050581d1b0fcea2fe4d582c608
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Feb 9 11:54:59 2017 -0600

Added missing "level-0" BLAS [sd]cabs1_().

Details:
- Fixed issue 115 by adding implementations for scabs1_() and dcabs1_()
to the BLAS compatibility layer. Thanks to heroxbd for pointing out
their absence.
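
Illustration (not the BLIS source): in reference BLAS, scabs1 and dcabs1 return |Re(z)| + |Im(z)| for a complex argument. A minimal sketch of the double-precision case follows; the type ex_dcomplex and the function ex_dcabs1 are hypothetical names chosen for this example.

```c
/* Hypothetical stand-in for a double-precision complex type. */
typedef struct { double real; double imag; } ex_dcomplex;

/* Absolute value of a double, written out to avoid depending on libm. */
static double ex_abs(double v)
{
    return v < 0.0 ? -v : v;
}

/* |Re(z)| + |Im(z)|, the quantity reference BLAS calls dcabs1. */
double ex_dcabs1(const ex_dcomplex *z)
{
    return ex_abs(z->real) + ex_abs(z->imag);
}
```

Note that this is cheaper than the true modulus sqrt(re^2 + im^2), which is why BLAS uses it when searching for a largest element.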

commit 018180c938c32efbeaaf626ba71ec5b780664db1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Feb 8 11:20:52 2017 -0600

Fixed a minor bug in configure (issue 114).

Details:
- Fixed a bug in the configure script whereby a non-preferred value for
--enable-threading would cause problems in common.mk vis-a-vis detecting
which threading model was chosen. Thanks to heroxbd for reporting this
issue.

commit 58b5b77e5fdb179ea465e398e416e6a00d917e05
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Wed Feb 8 21:43:34 2017 +0530

Fixed a bug in axpyv: the arguments passed to the intrinsic fmad instruction are corrected

Change-Id: If12f24c6bc74b22ac9e4acd6b9378e06d79f2f5e

commit 85de4ebf74d0a5587d5a12724eb5489d51674db3
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Wed Feb 8 14:41:04 2017 +0530

variant 4 axpyv single precision modified: explicitly used FMA intrinsics, replaced vector multiply and add operations

Change-Id: I975feef56696d479d2b9e9441b0660021cf4f6ff

commit 3fa53e8af31d634779f40258c51483ae8af494fa
Merge: b5291a44 95be7b04
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Wed Feb 8 11:46:34 2017 +0530

Merged axpyv and gemm small in bli_kernel.h
Merge branch 'amd-staging' of ssh://git.amd.com:29418/cpulibraries/er/blis into amd-staging

modified: config/zen/bli_kernel.h
modified: frame/3/gemm/bli_gemm_front.c
modified: kernels/x86_64/zen/3/bli_gemm_small_matrix.c

Change-Id: If181cf9345178c448b3530beb8bef453917fe295

commit 95be7b04709e688a4cb01fba680081e30f4258ef
Author: sthangar <Santanu.Thangarajamd.com>
Date: Tue Feb 7 14:01:27 2017 +0530

Added logic for packing matrix A and prefetching matrix C in Unpacked SGEMM code

Change-Id: I99efeca9eb5b4449286ec0ec133fd554ef1bb4f0

commit b5291a445b1313e01f1e0e8102c5f3660ab07f69
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Tue Feb 7 12:39:31 2017 +0530

Added optimization variant 4 for axpyv single precision - this performs 5 FMA per loop, keeping the IPC always full

Change-Id: Ie77ed22584271136a257e673bcd3b1ba71136bc9

commit f4bfc1662af82aa4b98185334c44835e51f1cbec
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Mon Feb 6 15:04:27 2017 +0530

Implemented new axpyv routines to improve performance for small vector sizes. Vectorization is applied to vectors as small as 8 elements (single precision) or 4 elements (double precision). Because this operation has a low compute-to-memory ratio, memory operations dominate at larger sizes, so there is not much gain there. This still needs some work. Added saxpyv and daxpyv var 3 routines in the file bli_axpyv_opt_var1.c
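
Illustration (a portable sketch, not the AMD kernel): the general shape of such a routine is a main loop that handles a fixed block of elements per iteration, followed by a scalar remainder loop. The name ex_saxpyv is hypothetical, unit stride is assumed, and the block size of 8 is taken from the single-precision figure quoted above. No intrinsics are used; the compiler is left to vectorize the inner block.

```c
#include <stddef.h>

/* y := y + alpha * x for unit-stride vectors (hypothetical helper). */
void ex_saxpyv(size_t n, float alpha, const float *x, float *y)
{
    size_t i = 0;

    /* Main loop: 8 floats per iteration, which a compiler can map onto
       one 256-bit vector multiply-add. */
    for (; i + 8 <= n; i += 8)
        for (size_t j = 0; j < 8; ++j)
            y[i + j] += alpha * x[i + j];

    /* Remainder loop: the last n % 8 elements are handled one by one. */
    for (; i < n; ++i)
        y[i] += alpha * x[i];
}
```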

Change-Id: Ic1b33bd5516e10113b00e44ab41b97eb19d46072

commit ddf45e71770c55ea4a58ca24ea4913fe5d8beb9b
Merge: a6ab91bc 78e1b16e
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jan 27 14:25:40 2017 -0600

Merge pull request 113 from devinamatthews/knl_thread_params

Change default threading parameters for KNL.

commit 78e1b16e16d589ed31b2e712115ee282097f114d
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jan 27 14:22:20 2017 -0600

Change default threading parameters for KNL.

commit 574472ba5a89924eca7dbd10055d0e1dcd7f4c71
Author: sthangar <Santanu.Thangarajamd.com>
Date: Tue Jan 10 14:51:46 2017 +0530

checked in unpacked SGEMM optimization

Change-Id: I8e4ea374415c0c402c660b656fb076af15354181

commit 1c732d3ddc4ac0861d3b0e0dd15eb7e071615502
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jan 25 16:25:46 2017 -0600

Added 1m-specific APIs for bp, pb gemm algorithms.

Details:
- Defined bli_gemmbp_cntl_create(), bli_gemmpb_cntl_create(), with the
body of bli_gemm_cntl_create() replaced with a call to the former.
- Defined bli_cntl_free_w_thrinfo(), bli_cntl_free_wo_thrinfo(). Now,
bli_cntl_free() can check if the thread parameter is NULL, and if so,
call the latter, and otherwise call the former.
- Defined bli_gemm1mbp_cntx_init(), bli_gemm1mpb_cntx_init(), both in
terms of bli_gemm1mxx_cntx_init(), which behaves the same as
bli_gemm1m_cntx_init() did before, except that an extra bool parameter
(is_pb) is used to support both bp and pb algorithms (including to
support the anti-preference field described below).
- Added support for "anti-preference" in context. The anti_pref field,
when true, will toggle the boolean return value of routines such as
bli_cntx_l3_ukr_eff_prefers_storage_of(), which has the net effect of
causing BLIS to transpose the operation to achieve disagreement (rather
than agreement) between the storage of C and the micro-kernel output
preference. This disagreement is needed for panel-block implementations,
since they induce a transposition of the suboperation immediately before
the macro-kernel is called, which changes the apparent storage of C. For
now, anti-preference is used only with the pb algorithm for 1m (and not
with any other non-1m implementation).
- Defined new functions,
bli_cntx_l3_ukr_eff_prefers_storage_of()
bli_cntx_l3_ukr_eff_dislikes_storage_of()
bli_cntx_l3_nat_ukr_eff_prefers_storage_of()
bli_cntx_l3_nat_ukr_eff_dislikes_storage_of()
which are identical to their non-"eff" (effectively) counterparts except
that they take the anti-preference field of the context into account.
- Explicitly initialize the anti-pref field to FALSE in
bli_gks_cntx_set_l3_nat_ukr_prefs().
- Added bli_gemm_ker_var1.c, which implements a panel-block macro-kernel
in terms of the existing block-panel macro-kernel _ker_var2(). This
technique requires inducing transposes on all operands and swapping
A and B.
- Changed bli_obj_induce_trans() macro so that pack-related fields are
also changed to reflect the induced transposition.
- Added a temporary hack to bli_l3_3m4m1m_oapi.c that allows us to easily
specify the 1m algorithm (block-panel or panel-block).
- Renamed the following cntx_t-related macros:
bli_cntx_get_pack_schema_a() -> bli_cntx_get_pack_schema_a_block()
bli_cntx_get_pack_schema_b() -> bli_cntx_get_pack_schema_b_panel()
bli_cntx_get_pack_schema_c() -> bli_cntx_get_pack_schema_c_panel()
and updated all instantiations. Also updated the field names in the
cntx_t struct.
- Comment updates.
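
Illustration (hypothetical types and names, not the BLIS definitions): the anti-preference behavior described above amounts to inverting the answer of the ordinary storage-preference query whenever the context's anti_pref field is set. A minimal sketch under that reading:

```c
#include <stdbool.h>

typedef enum { EX_ROW_STORED, EX_COL_STORED } ex_storage_t;

/* Hypothetical, stripped-down context. */
typedef struct {
    ex_storage_t ukr_pref;  /* storage the micro-kernel prefers for C */
    bool         anti_pref; /* when true, invert the effective answer */
} ex_cntx_t;

/* Plain query: does the micro-kernel prefer the storage of C? */
bool ex_ukr_prefers_storage_of(const ex_cntx_t *cntx, ex_storage_t c)
{
    return cntx->ukr_pref == c;
}

/* "Effective" query: same as above, toggled by the anti_pref field. */
bool ex_ukr_eff_prefers_storage_of(const ex_cntx_t *cntx, ex_storage_t c)
{
    bool prefers = ex_ukr_prefers_storage_of(cntx, c);
    return cntx->anti_pref ? !prefers : prefers;
}
```

With anti_pref set, a caller that transposes the operation whenever the effective query returns false ends up producing disagreement between C and the micro-kernel preference, which is the stated goal for the panel-block case.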

commit 41595e98eedaf3f1f93802c14dcae490402f933f
Merge: d625c49e a6ab91bc
Author: praveeng <praveen.gamd.com>
Date: Wed Dec 7 15:13:21 2016 +0530

Merge master code as on 2016_12_07 to amd-staging

Change-Id: I5d9ecef9bff960aeb9b51ca4e4b21714e789e44f

commit d625c49e20bd3c50d6d44e330e34076cced114a3
Author: sthangar <Santanu.Thangarajamd.com>
Date: Tue Nov 29 15:05:19 2016 +0530

checked-in SGEMMTRSM microkernel for Zen

Change-Id: Ib61936418dea911b2154aa99f703b66e9669f94f

commit a6ab91bc61432490fadf18d596de4589645f37dd
Merge: 145a551d 7f31a630
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 30 09:26:58 2016 -0600

Merge pull request 111 from figual/master

Fixed missing cntx argument in ARMv8 microkernels.

commit 7f31a6307b7bd35f913c895947552c3a176f789b
Author: Francisco Igual <figualucm.es>
Date: Sun Nov 27 14:40:47 2016 +0100

Fixed missing cntx argument in ARMv8 microkernels.

commit 126482a3b609b9ad7026ba348f6c4bf6a29be8a1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 25 18:29:49 2016 -0600

Implemented the 1m method.

Details:
- Implemented the 1m method for inducing complex domain matrix
multiplication. 1m support has been added to all level-3 operations,
including trsm, and is now the default induced method when native
complex domain gemm microkernels are omitted from the configuration.
- Updated _cntx_init() operations to take a datatype parameter. This was
needed for the corresponding function for 1m (because 1m requires us
to choose between column-oriented or row-oriented execution, which
requires us to query the context for the storage preference of the
gemm microkernel, which requires knowing the datatype) but I decided
that it made sense for consistency to add the parameter to all other
cntx initialization functions as well, even though those functions
don't use the parameter.
- Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take
a second scalar for each blocksize entry. The semantic meaning of the
two scalars now is that the first will scale the default blocksize
while the second will scale the maximum blocksize. This allows scaling
the two independently, and was needed to support 1m, which requires
scaling for a register blocksize but not the register storage
blocksize (ie: "packdim") analogue.
- Deprecated bli_blksz_reduce_dt_to() and defined two new functions,
bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing
default and maximum blocksizes to some desired blocksize multiple.
These functions are needed in the updated definitions of
bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs().
- Added support for the 1e and 1r packing schemas to packm, including
1e/1r packing kernels.
- Added a minor optimization to bli_gemm_ker_var2() that allows, under
certain circumstances (specifically, real domain beta and row- or
column-stored matrix C), the real domain macrokernel and microkernel
to be called directly, rather than using the virtual microkernel
via the complex domain macrokernel, which carries a slight additional
amount of overhead.
- Added 1m support to the testsuite.
- Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified
some code in test_gemm.c driver.
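
Illustration (a sketch, not the BLIS implementation): the idea behind 1m can be shown for a single complex product. It assumes the layout described in the 1m literature, where the "1e" schema expands a = ar + i*ai into the real 2x2 block [ar -ai; ai ar] and the "1r" schema stores b = br + i*bi as the real column [br; bi]. A purely real matrix product of the two then yields the real and imaginary parts of a*b. All ex_ names are hypothetical.

```c
/* "1e"-style packing: expand a = ar + i*ai into a row-major 2x2 block. */
void ex_pack_1e(double ar, double ai, double e[4])
{
    e[0] = ar;  e[1] = -ai;
    e[2] = ai;  e[3] =  ar;
}

/* "1r"-style packing: store b = br + i*bi as a real 2x1 column. */
void ex_pack_1r(double br, double bi, double r[2])
{
    r[0] = br;
    r[1] = bi;
}

/* Real 2x2 times 2x1 product: c[0] = Re(a*b), c[1] = Im(a*b). */
void ex_real_mult_2x2(const double e[4], const double r[2], double c[2])
{
    c[0] = e[0] * r[0] + e[1] * r[1];
    c[1] = e[2] * r[0] + e[3] * r[1];
}
```

For example, packing a = 1 + 2i and b = 3 + 4i and multiplying gives c = (-5, 10), matching (1 + 2i)(3 + 4i) = -5 + 10i. Applied to whole packed panels, the same arithmetic lets a real-domain gemm microkernel do the work.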

commit d8f13beeea90338e0ecb0a3aeaa2d59d8ebd6c36
Merge: c25a9205 145a551d
Author: praveeng <praveen.gamd.com>
Date: Fri Nov 25 17:31:08 2016 +0530

Merge master code till 2016_11_25 to amd-staging

commit c25a9205fd8c8d8de7fd81b1e5621e7ac79f4e87
Merge: 65298762 bdc0a264
Author: praveeng <praveen.gamd.com>
Date: Fri Nov 25 17:06:36 2016 +0530

Merge master code till Switched to simpler trsm_r 2016_11_25 to amd-staging

Change-Id: Ibf71d224d8fb6cf0bc497f84d50c27d276512cc1

commit 145a551d524ae5492667a05fc248923d922df850
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 23 17:59:06 2016 -0600

Switched to simpler trsm_r implementation.

Details:
- Disabled the implementation of trsm_r that allows the right-hand matrix
B to be triangular, and switched to the implementation that simply
transposes the operation (and thus the storage of C) in order to recast
the operation as trsm_l. This avoids the need to use trsm_rl and trsm_ru
macrokernels, which require an awkward swapping of MR and NR. For now,
the support for trsm_r macrokernels, via separate control trees, remains.
- Modified bli_config_macro_defs.h so that BLIS_RELAX_MCNR_NCMR_CONSTRAINTS
is defined by default. This is mostly a safety precaution in case someone
tries to switch back to the previous trsm_r implementation, but also
serves as a convenience on some systems where one does not naturally
choose blocksizes in a way that satisfies MC % NR = 0 and NC % MR = 0.
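
Illustration (a naive sketch with hypothetical names, not BLIS code): recasting a right-side solve as a left-side solve relies on the identity X * U = B being equivalent to U^T * X^T = B^T. When matrices are addressed through row and column strides, transposing an operand only requires swapping its two strides, so no data is moved.

```c
#include <stddef.h>

/* Element (i,j) of a matrix addressed by row stride rs and column stride cs. */
#define EX_AT(p, i, j, rs, cs) ((p)[(i) * (rs) + (j) * (cs)])

/* Solve L * X = B in place (X overwrites B); L is n x n lower triangular
   with a nonzero diagonal, B is n x nrhs. */
void ex_trsm_l_lower(size_t n, size_t nrhs,
                     const double *l, size_t rs_l, size_t cs_l,
                     double *b, size_t rs_b, size_t cs_b)
{
    for (size_t j = 0; j < nrhs; ++j)
        for (size_t i = 0; i < n; ++i) {
            double s = EX_AT(b, i, j, rs_b, cs_b);
            for (size_t k = 0; k < i; ++k)
                s -= EX_AT(l, i, k, rs_l, cs_l) * EX_AT(b, k, j, rs_b, cs_b);
            EX_AT(b, i, j, rs_b, cs_b) = s / EX_AT(l, i, i, rs_l, cs_l);
        }
}

/* Solve X * U = B in place; U is n x n upper triangular, B is m x n.
   Swapping each operand's strides views it as its transpose, which turns
   the problem into the left-side lower-triangular solve above. */
void ex_trsm_r_upper(size_t m, size_t n,
                     const double *u, size_t rs_u, size_t cs_u,
                     double *b, size_t rs_b, size_t cs_b)
{
    ex_trsm_l_lower(n, m, u, cs_u, rs_u, b, cs_b, rs_b);
}
```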

commit b3e58ee30307cf1e11529f2113acb9abbeda25af
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 23 17:58:26 2016 -0600

Reimplemented 4x12 haswell ukernels (real only).

Details:
- Replaced permutation-based implementations in bli_gemm_asm_d4x12.c, which
defines 4x24 single real and 4x12 double real gemm microkernels, with
broadcast-based implementations. (The previous microkernel file has been
moved to an 'old' subdirectory.)

commit 65298762ff15c45e8588e0c279a9feaa98c927a0
Author: sthangar <Santanu.Thangarajamd.com>
Date: Tue Nov 22 12:15:33 2016 +0530

removed a redundant copy operation in DNRM2

Change-Id: I673b08efde4480e871779716f7715566740ad9ce

commit d6863e851adeef037e4d1476fe63bb293fb9d987
Author: sthangar <Santanu.Thangarajamd.com>
Date: Mon Nov 21 11:30:30 2016 +0530

checked-in DNRM2 optimizations

Change-Id: I3b31d768bd7f4fbf43042aa5a0762995c73c4522

commit bdc0a264d2fb5940bfd09298b1de823674a39053
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 16 14:13:08 2016 -0600

Adjusted stride selection of ct in macrokernels.

Details:
- Updated the changes introduced in 618f433 so that the strides of the
temporary microtile ct used in the macrokernels are determined based
on the storage preference of the microkernel (via the new functions
below), rather than the strides of c. In almost all cases, presently,
this change results in no net effect, as a high-level optimization
in the _front() functions aligns the storage of c to that of the
microkernel's preference. However, I encountered some cases where
this is not always the case in some development code that has yet
to be committed, and therefore I'm generalizing the framework code
in advance.
- Defined two new functions in bli_cntx.c:
bli_cntx_l3_ukr_prefers_rows_dt()
bli_cntx_l3_ukr_prefers_cols_dt()
which return bool_t's based on the current micro-kernel's storage
preferences. For induced methods, the preference of the underlying
real domain microkernel is returned.
- Updated definition of bli_cntx_l3_ukr_dislikes_storage_of(), and
by proxy bli_cntx_l3_ukr_prefers_storage_of(), to be in terms of
the above functions, rather than querying the preferences of the
native microkernel directly (which did the wrong thing for induced
methods).

commit 031978d2647cf08316858baf29c84ebba9c3133e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 16 14:04:33 2016 -0600

Fixed inactive trsm_r blocksize constraint code.

Details:
- Changed a cpp macro that was meant to prevent using certain trsm_r code
if BLIS_RELAX_MCNR_NCMR_CONSTRAINTS was defined. It was actually coded
incorrectly at first. I've now fixed its location and changed its
consequence to a compile-time error message.

commit 9772218cae57d55c252595b01e3669d8bed84944
Author: sthangar <Santanu.Thangarajamd.com>
Date: Wed Nov 16 15:19:19 2016 +0530

Added optimized DAMAX routines for Zen

Change-Id: I499c0c8f0f4ce6c19235c47b86d5608db6ba50f8

commit 9c448e30174e5eb76a94b43b30819704a5dfcb3f
Merge: 998d8240 e35d3c23
Author: Santanu Thangaraj <Santanu.Thangarajamd.com>
Date: Wed Nov 16 04:18:57 2016 -0500

Merge "Added new optimized micro-kernel for dotxv routine" into amd-staging

commit 998d824044adac0d54c921dcd44fb58f3d54aad2
Merge: 0d13e9a4 6b5a4032
Author: praveeng <praveen.gamd.com>
Date: Wed Nov 16 14:22:42 2016 +0530

Merge master code till devinamatthews/omp_num_thrds 2016_11_16 to amd-staging

Change-Id: I601ff1d3ec8a680e1be039ffc7b299744e8a27c5

commit 6b5a4032d2e3ed29a272c7f738b7e3ed6657e556
Merge: 3b524a08 a8220e3a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 10 15:28:24 2016 -0600

Merge pull request 109 from devinamatthews/omp_num_threads

Add automatic loop thread assignment.

commit a8220e3a86433b5d76789e32ea7ca014a11b6d17
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Nov 10 14:19:34 2016 -0600

- Fix typo in bli_cntx.c
- Bump BLIS_DEFAULT_NR_THREAD_MAX to 4

commit e35d3c23f28784e50ee13d2e77a69d60e0c24c1f
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Thu Nov 10 14:30:53 2016 +0530

Added new optimized micro-kernel for dotxv routine

Change-Id: I2c544e9b25a454d971ad690353502a55cd668391

commit 0d13e9a4f6f2fcda08f205215240cdf86442d6c6
Merge: e044fa62 3b524a08
Author: praveeng <praveen.gamd.com>
Date: Mon Nov 7 14:40:41 2016 +0530

bli_kernel.h

Change-Id: I425d089f79497a0de7d1622e829c3ca9edf7f091

commit c05b3862f6241486442b313eff0c8bee7b5e1274
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Nov 4 15:48:02 2016 -0500

Add automatic loop thread assignment.

- Number of threads is determined by BLIS_NUM_THREADS or OMP_NUM_THREADS, but can be overridden by BLIS_XX_NT as before.
- Threads are assigned to loops (jc, ic, jr, and ir) automatically by weighted partitioning and heuristics, both of which are tunable via bli_kernel.h.
- All level-3 BLAS covered.
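
Illustration (hypothetical helper, not the BLIS source): one way to resolve the total thread count from the two environment variables named above. The precedence shown here (BLIS_NUM_THREADS first, then OMP_NUM_THREADS, then a default of 1) is an assumption of this sketch, and the per-loop partitioning step is not shown.

```c
#include <stdlib.h>

/* Parse a positive integer; return 0 if the string is NULL or invalid. */
static long ex_parse_threads(const char *s)
{
    if (s == NULL) return 0;
    char *end;
    long v = strtol(s, &end, 10);
    if (end == s || *end != '\0' || v <= 0) return 0;
    return v;
}

/* Resolve the thread count from the two variable values (either may be
   NULL when the variable is unset). Falls back to a single thread. */
long ex_num_threads(const char *blis_val, const char *omp_val)
{
    long nt = ex_parse_threads(blis_val);
    if (nt == 0) nt = ex_parse_threads(omp_val);
    return nt > 0 ? nt : 1;
}
```

A caller would pass the live environment, for example ex_num_threads(getenv("BLIS_NUM_THREADS"), getenv("OMP_NUM_THREADS")).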

commit 3b524a08e3fb8380e7b8b2ba835312c51a331570
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 2 17:45:18 2016 -0500

Consolidated 3m1/4m1 gemmtrsm, trsm ukernel code.

Details:
- Consolidated the macros that define the lower and upper versions of the
gemmtrsm microkernels into a single macro that is instantiated twice.
Did this for both 3m1 and 4m1 microkernels.
- Consolidated lower and upper versions of the trsm microkernels for 3m1
and 4m1 into single files (each).

commit ead231aca635deb3db270f118454e4222c627f31
Merge: d25e6f8b 62987f60
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 2 13:03:50 2016 -0500

Merge pull request 108 from devinamatthews/patch-2

Update .travis.yml with additional tests

commit 62987f60a6a6ff0a75b31d0404f493593ce35ccc
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Nov 2 11:20:37 2016 -0500

Allow KNL to fail

commit 8f9010542c751ae3cbfe6121cb011d8985c1e00d
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Nov 2 11:18:32 2016 -0500

Fix some problems with OSX builds:

- Update CPU detection for Intel archs (esp. Skylake)
- Allow clang for the reference config

commit d25e6f8b63c57f30b8a67dffbf4995977cf9f235
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 1 14:35:15 2016 -0500

Can disable trsm_r-specific blocksize constraints.

Details:
- Added cpp guards around the constraints in bli_kernel_macro_defs.h
that enforce MC % NR = 0 and NC % MR = 0. These constraints are ONLY
needed when handling right-side trsm by allowing the matrix on the
right (matrix B) to be triangular, because it involves swapping
register, but not cache, blocksizes (packing A by NR and B by MR)
and then swapping the operands to gemmtrsm just before that kernel
is called. It may be useful to disable these constraints if, for
example, the developer wishes to test the configuration with
a different set of cache blocksizes where only MC % MR = 0 and
NC % NR = 0 are enforced.
- In summary, defining BLIS_RELAX_MCNR_NCMR_CONSTRAINTS will bypass
the enforcement of MC % NR = 0 and NC % MR = 0.
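
Illustration (a runtime analogue with hypothetical names; the actual checks are cpp guards in bli_kernel_macro_defs.h): the two groups of constraints described above can be written as a single predicate, where the relax flag plays the role of defining BLIS_RELAX_MCNR_NCMR_CONSTRAINTS.

```c
#include <stdbool.h>

/* Return true if the cache blocksizes (mc, nc) are compatible with the
   register blocksizes (mr, nr). */
bool ex_blkszs_ok(int mc, int nc, int mr, int nr, bool relax)
{
    /* Always required: MC % MR = 0 and NC % NR = 0. */
    if (mc % mr != 0 || nc % nr != 0)
        return false;

    /* Only needed for the trsm_r variant that swaps register blocksizes:
       MC % NR = 0 and NC % MR = 0. Skipped when relaxed. */
    if (!relax && (mc % nr != 0 || nc % mr != 0))
        return false;

    return true;
}
```

For instance, MC = 78, NC = 4080, MR = 6, NR = 8 fails the strict check (78 is not a multiple of 8) but passes once the extra constraints are relaxed.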

commit 1a67e3688edb073a9d44c160e7b0798e08796b8a
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Nov 1 13:53:18 2016 -0500

Bogus commit

Need to trigger another Travis build.

commit 2cd82d67b372cad1bed50cfd99e524f1f40b4e24
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Nov 1 13:25:50 2016 -0500

Some fixes for .travis.yml

- Switch to gcc-5 to support knl
- Don't run tests in parallel -- it is super slow.
- Use clang on OSX since gcc is only a zombie husk.

commit a3db4e6bdfe745083acf704ab0f51f74ea869538
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Nov 1 10:33:18 2016 -0500

Update .travis.yml with additional tests

- Test knl configuration (without running of course).
- Test openmp and pthreads threading for auto configuration with 4 threads.
- Test auto configuration with and without pthreads on OSX.
- Also, run make in parallel.

I don't know how the `addons:` section works on OSX; hopefully it is just ignored.

commit 8a11a2174a1a5b9426f13bbc5338dc86ab138cdd
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 31 19:07:55 2016 -0500

Updates to non-default haswell microkernels.

Details:
- Updated s and d microkernels in bli_gemm_asm_d8x6.c to relax alignment
constraints.
- Added missing c and z microkernels, which are based on the corresponding
kernels in the d6x8 set.
- This completes the d8x6 set (which may be used for situations when it
is desirable to have a microkernel with a column preference).

commit 618f4331eba209803ecab99747872eceb1b5f091
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 31 14:40:51 2016 -0500

Align strides of ct in macrokernels to that of c.

Details:
- Previously, rs_ct and cs_ct, the strides of the temporary microtile used
primarily in the macrokernels' edge case handling, were unconditionally
set to 1 and MR, respectively. However, Devin Matthews noted that this
ought to be changed so that the strides of ct were in agreement with the
strides of C. (That is, if C was row-stored, then ct should be accessed
by rows as well.) The implicit assumption is that the strides of C
have already been adjusted, via induced transposition, if the storage
preference of the microkernel is at odds with the storage of C. So, if
the microkernel prefers row storage, the macrokernel's interior cases
would present row-stored (ideal) microkernel subproblems to the
microkernel, but for edge cases, it would still see column-stored
subproblems (not ideal). This commit fixes this issue. Thanks to Devin
for his suggestion.
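
Illustration (hypothetical names, not the macrokernel source): here rs and cs denote the row and column strides of the MR x NR temporary microtile. The column-stored case (1, MR) is the previous unconditional setting quoted above; the row-stored case (NR, 1) is this sketch's assumption of the natural contiguous-row counterpart.

```c
/* Row and column strides of the temporary microtile ct. */
typedef struct { long rs; long cs; } ex_strides_t;

/* Pick strides for an mr x nr microtile so that it is stored the same
   way as C: contiguous rows if C is row-stored, contiguous columns
   otherwise. */
ex_strides_t ex_ct_strides(int c_is_row_stored, long mr, long nr)
{
    ex_strides_t s;
    if (c_is_row_stored) {
        s.rs = nr;  /* step over one full row of nr elements */
        s.cs = 1;
    } else {
        s.rs = 1;
        s.cs = mr;  /* step over one full column of mr elements */
    }
    return s;
}
```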

commit c2c91e09b4893cb81314774557f728a95080f81e
Author: Jeff Hammond <jeff.sciencegmail.com>
Date: Tue Oct 25 21:15:26 2016 -0700

never use libm with Intel compilers

Intel compilers include a highly optimized math library (libimf) that
should be used instead of GNU libm.

yes, this change is for ALL targets, including those that are not
supported by the Intel compiler. there is no harm in doing this, and it
is future-proof in the event that the Intel compilers support other
architectures.

commit 630391002325a589063aec2ab0a7d89ef2e178c0
Merge: 956b3edf 216206c1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Oct 25 19:34:51 2016 -0500

Merge pull request 105 from devinamatthews/knl

Support for Intel Knights Landing.

commit 216206c1d328a865c2192e35a4df6e9aff79a85b
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Oct 25 13:56:18 2016 -0500

Fix up for merge to master.

commit 11eb7957abbcdf02d5e312898e094260eadb1209
Merge: cd5b6681 956b3edf
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Oct 25 13:51:07 2016 -0500

Merge branch 'master' into knl

Conflicts:
frame/thread/bli_thread.h

commit cd5b6681838899283cd94e5427dfda206e7fbabe
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Oct 25 13:49:27 2016 -0500

Don't use %rbp in KNL packing kernels.

commit 956b3edf8eb09480f31f2e861c1b10f9ecbb2e52
Merge: b7e41d71 0662a3c1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Oct 25 13:02:57 2016 -0500

Merge pull request 104 from devinamatthews/misspellings

Add flexible options for thread model (pthread/posix for pthreads etc.).

commit 0662a3c1b1f4644a86bf8e5073d1391808c91b4a
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Oct 25 12:42:44 2016 -0500

Add flexible options for thread model (pthread/posix for pthreads etc.).

commit e044fa624008c161de32a39d734cddf1dd22dd41
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Tue Oct 25 13:03:05 2016 +0530

Changed double precision trsm kernel macro definition to bli_dtrsm_l_int_6x8 from 6x16 : it fixes the seg fault

Change-Id: Ia8c1de5fe13a370d691570a50136d55ffb18908a

commit b3ed4933aa0da72ad771fb0fdf1727e5ba9ad7b4
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Tue Oct 25 13:03:05 2016 +0530

Changed double precision trsm kernel macro definition to bli_dtrsm_l_int_6x8 from 6x16 : it fixes the seg fault

Change-Id: Ia8c1de5fe13a370d691570a50136d55ffb18908a

commit b7e41d71b07d2af6d22d632c70e0c5f7ce46852c
Merge: 4bd905bd 5117d444
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 24 16:47:46 2016 -0500

Merge pull request 103 from devinamatthews/patch-1

Change .align to .p2align in Bulldozer ukernels.

commit 5117d444f7f3a2bc327f067926eaf2398212edda
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Oct 24 16:20:47 2016 -0500

Change .align to .p2align in Bulldozer ukernels

Apparently OSX doesn't allow .align directives for >16B, so I've changed these to their .p2align counterparts.

commit 4bd905bd4597e0ad7bedf31e25e779d3e2dfda29
Merge: 936d5fdc 7f32dd57
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 21 14:48:44 2016 -0500

Merge pull request 93 from ShadenSmith/config_check

Adds sanity check to configuration choice.

commit 936d5fdc26c6c4dab199a8d11fde948975cfa1d6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 21 14:34:27 2016 -0500

Fixed multithreading compilation bug in 970745a.

Details:
- Moved the definition of the cpp macro BLIS_ENABLE_MULTITHREADING
from bli_thread.h to bli_config_macro_defs.h. Also moved the
sanity check that OpenMP and POSIX threads are not both enabled.
- Thanks to Krzysztof Drewniak for reporting this bug.

commit d250e6a3af3af8beedcda28f508ac03e94efb3c8
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Thu Oct 20 14:34:39 2016 +0530

Merged TRSM and scalv routines into zen folder

Change-Id: Ice897bc83e8fb70b90f23cc3ce892c39883aceb9

commit 8feb0f85a674e84bec2417486e3bcea584b14c04
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 19 16:05:41 2016 -0500

Removed auto-prototyping of malloc()/free() substitutes.

Details:
- Removed the header file, bli_malloc_prototypes.h, which automatically
generated prototypes for the functions specified by the following
cpp macros:
BLIS_MALLOC_INTL
BLIS_FREE_INTL
BLIS_MALLOC_POOL
BLIS_FREE_POOL
BLIS_MALLOC_USER
BLIS_FREE_USER
These prototypes were originally provided primarily as a convenience
to those developers who specified their own malloc()/free() substitutes
for one or more of the macros above. However, we generated these prototypes
regardless, even when the default values (malloc and free) of the
macros above were used. A problem arose under certain circumstances
(e.g., gcc in C++ mode on Linux with glibc) when including blis.h that
stemmed from the "throw" specification that was added to glibc's
malloc() prototype, resulting in a prototype mismatch. Therefore, going
forward, developers who specify their own custom malloc()/free()
substitutes must also prototype those substitutes via bli_kernel.h.
Thanks to Krzysztof Drewniak for reporting this bug, and Devin Matthews
for researching the nature and potential solutions.
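
Illustration (a sketch of what a developer-side configuration might look like; the ex_ functions are hypothetical): with auto-prototyping removed, custom substitutes have to be declared by the developer before the macros that name them are used. Only the macro names BLIS_MALLOC_USER and BLIS_FREE_USER come from the list above.

```c
#include <stddef.h>
#include <stdlib.h>

/* Developer-supplied prototypes; these are no longer generated. */
void *ex_counting_malloc(size_t size);
void  ex_counting_free(void *p);

/* Point the user-level allocation macros at the substitutes. */
#define BLIS_MALLOC_USER ex_counting_malloc
#define BLIS_FREE_USER   ex_counting_free

/* Number of blocks currently allocated through the substitutes. */
static long ex_live_allocs = 0;

void *ex_counting_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL) ++ex_live_allocs;
    return p;
}

void ex_counting_free(void *p)
{
    if (p == NULL) return;
    --ex_live_allocs;
    free(p);
}
```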

commit 970745a5fc7c29de3e202988e5eb104fabca4fdc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 19 15:58:03 2016 -0500

Reorganized typedefs to avoid compiler warnings.

Details:
- Relocated membrk_t definition from bli_membrk.h to bli_type_defs.h.
- Moved include of bli_malloc.h from blis.h to bli_type_defs.h.
- Removed standalone mtx_t and mutex_t typedefs in bli_type_defs.h.
- Moved include of bli_mutex.h from bli_thread.h to bli_type_defs.h.
- The redundant typedefs of membrk_t and mtx_t caused a warning on some C
compilers. Thanks to Tyler Smith for reporting this issue.

commit 1c2f7b57d557c05f5ef6148cccafaf0f70d910da
Author: sthangar <Santanu.Thangarajamd.com>
Date: Tue Oct 18 15:06:35 2016 +0530

Removed symlinks to zen kernels from haswell kernel folder and also modified the bli_kernel.h file accordingly

Change-Id: Ib3736af48e851c8243bbe10d937fb942c49ad048

commit d864ea9f4f039fe2b2dc395d0015bd9e8902bc8e
Merge: 7045fcbf 28b2af8a
Author: praveeng <praveen.gamd.com>
Date: Fri Oct 14 17:00:57 2016 +0530

Merge master code 2016_10_14 till Added disabled code thrinfo_t structures

Change-Id: If7db98d286c1471fcd30f00757abee9b253ef987

commit 28b2af8a71133ce68774e153b6e05afb05affba8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 13 14:50:08 2016 -0500

Added disabled code to print thrinfo_t structures.

Details:
- Added cpp-guarded code to bli_thrcomm_openmp.c that allows a curious
developer to print the contents of the thrinfo_t structures of each
thread, for verification purposes or just to study the way thread
information and communicators are used in BLIS.
- Enabled some previously-disabled code in bli_l3_thrinfo.c for freeing
an array of thrinfo_t* values that is used in the new, cpp-guarded code
mentioned above.
- Removed some old commented lines from bli_gemm_front.c.

commit 11eed3f683d09e65f721567b346b0f733bff9a64
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 13 14:23:23 2016 -0500

Fixed a configure -t omp/openmp bug from fd04869.

Details:
- Forgot to update certain occurrences of "omp" in common.mk during
commit fd04869, which changed the preferred configure option string
for enabling OpenMP from "omp" to "openmp".

commit 7045fcbf0bd349ebe6cb9ac4508c6a387bb05966
Merge: 7e044900 9cda6057
Author: praveeng <praveen.gamd.com>
Date: Thu Oct 13 12:02:28 2016 +0530

Merge master code 2016_10_13 Removed previously renamed/old files

Change-Id: I8106d371afaa0af474a8967388d44481b05de923

commit 7e04490002206d3557fcfb7dd893838a7f36916f
Author: sthangar <Santanu.Thangarajamd.com>
Date: Wed Oct 12 16:43:02 2016 +0530

Checked in the SAMAX optimizations

Change-Id: I7faf8c3adf52ff01432188ad3b9866ee4b9a9dfd

commit 9cda6057eaa16a24ac8785a9fa167df6c9edba44
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Oct 11 13:21:26 2016 -0500

Removed previously renamed/old files.

Details:
- Removed frame/base/bli_mem.c and frame/include/bli_auxinfo_macro_defs.h,
both of which were renamed/removed in 701b9aa. For some reason, these
files survived when the compose branch was merged back into master.
(Clearly, git's merging algorithm is not perfect.)
- Removed frame/base/bli_mem.c.prev (an artifact of the long-ago changed
memory allocator that I was keeping around for no particular reason).

commit 22377abd84b9e560ffe1c4e4d284eb443ddb7133
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 10 13:43:56 2016 -0500

Fixed bli_gemm() segfault on empty C matrices.

Details:
- Fixed a bug that would manifest in the form of a segmentation fault
in bli_cntl_free() when calling any level-3 operation on an empty
output matrix (ie: m = n = 0). Specifically, the code previously
assumed that the entire control tree was built prior to it being
freed. However, if the level-3 operation performs an early exit, the
control tree will be incomplete, and this scenario is now handled.
Thanks to Elmar Peise for reporting this bug.
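
Illustration (hypothetical node type, not the BLIS control tree): a free routine that tolerates a partially built tree simply stops at the first missing node, so it works whether zero, some, or all levels were created before an early exit.

```c
#include <stdlib.h>

/* Hypothetical control-tree node with a single child. */
typedef struct ex_cntl_s {
    struct ex_cntl_s *sub_node;
} ex_cntl_t;

/* Create a node whose child is sub (which may be NULL). */
ex_cntl_t *ex_cntl_create(ex_cntl_t *sub)
{
    ex_cntl_t *node = malloc(sizeof *node);
    if (node != NULL) node->sub_node = sub;
    return node;
}

/* Free however much of the tree exists; returns the number of nodes
   freed. A NULL root (nothing was built) is valid input. */
int ex_cntl_free(ex_cntl_t *node)
{
    int freed = 0;
    while (node != NULL) {
        ex_cntl_t *next = node->sub_node;
        free(node);
        node = next;
        ++freed;
    }
    return freed;
}
```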

commit 0b571cd94d9b175331c9453258a6b1389a718ae8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 6 14:48:15 2016 -0500

Fixed segfault in bli_free_align() for NULL ptrs.

Details:
- Fixed a bug in bli_free_align() caused by failing to handle NULL pointers
up-front, which led to performing pointer arithmetic on NULL pointers in
order to free the address immediately before the pointer. Thanks to Devin
Matthews for reporting this bug.
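
Illustration (a common aligned-allocation scheme with hypothetical names; the details of bli_free_align() may differ): the original malloc() pointer is stored in the slot immediately before the aligned address, and the free routine reads it back. Without the NULL check at the top, the free routine would index backwards from a NULL pointer.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate size bytes aligned to align, which must be a power of two
   and at least sizeof(void *). */
void *ex_malloc_align(size_t size, size_t align)
{
    void *raw = malloc(size + align + sizeof(void *));
    if (raw == NULL) return NULL;

    /* Leave room for the saved pointer, then round up to the alignment. */
    uintptr_t start   = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + align - 1) & ~(uintptr_t)(align - 1);

    /* Save the original pointer just before the aligned address. */
    ((void **)aligned)[-1] = raw;
    return (void *)aligned;
}

void ex_free_align(void *p)
{
    /* Handle NULL up front, before any pointer arithmetic. */
    if (p == NULL) return;
    free(((void **)p)[-1]);
}
```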

commit cd84fb95182514601d72c78ee0e36a394d0284d7
Author: praveeng <praveen.gamd.com>
Date: Thu Oct 6 15:08:21 2016 +0530

syntax errors in configure file

Change-Id: Ibe8a6071aad97df550df64c009fec33a9d8f43a1

commit f2e7ea113aa93b74f1d42408d5db2c5a7b00a653
Merge: 133983c3 86969873
Author: praveeng <praveen.gamd.com>
Date: Thu Oct 6 12:35:30 2016 +0530

conflicts merge for bli_kernel.h

Change-Id: I15d846bd34e11f86ebfd7ed091ff671a1f3366a0

commit 133983c36fa01c7acb6d666b3744f77f216314a5
Author: sthangar <Santanu.Thangarajamd.com>
Date: Thu Oct 6 11:26:22 2016 +0530

code clean up in bli_kernel.h

Change-Id: I11d9cdf2af8e8199209eb084f6c3a7c910b83d5d

commit 4fb9b4ef2e4cf2626a6e000a41628fb823f16da8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 5 14:41:35 2016 -0500

CHANGELOG update (0.2.1)

0.2.1

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 5 14:41:34 2016 -0500

Version file update (0.2.1)

commit 87fddeab3c8a5ccb1bbf02e5f89db1464e459ba9
Merge: 86969873 6f71cd34
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 5 13:35:01 2016 -0500

Merge branch 'compose'

commit 6f71cd344951854e4cff9ea21bbdfe536e72611d (origin/compose)
Merge: c0630c40 8d55033c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Oct 4 15:53:46 2016 -0500

Merge pull request 94 from flame/distcomm

Implemented distributed thrinfo_t management.

commit 86969873b5b861966d717d8f9f370af39e3d9de6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Oct 4 14:24:59 2016 -0500

Reclassified amaxv operation as a level-1v kernel.

Details:
- Moved amaxv from being a utility operation to being a level-1v operation.
This includes the establishment of a new amaxv kernel to live beside all
of the other level-1v kernels.
- Added two new functions to bli_part.c:
bli_acquire_mij()
bli_acquire_vi()
The first acquires a scalar object for the (i,j) element of a matrix,
and the second acquires a scalar object for the ith element of a vector.
- Added integer support to bli_getsc level-0 operation. This involved
adding integer support to the bli_*gets level-0 scalar macros.
- Added a new test module to test amaxv as a level-1v operation. The test
module works by comparing the value identified by bli_amaxv() to the
value found from reference-like code local to the test module
source file. In other words, it (intentionally) does not guarantee the
same index is found; only the same value. This allows for different
implementations in the case where a vector contains two or more elements
containing exactly the same floating point value (or values, in the case
of the complex domain).
- Removed the directory frame/include/old/.
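
The value-based check described for the amaxv test module might look roughly like the sketch below. It is restricted to real double-precision vectors, and the function names are hypothetical rather than taken from the testsuite.

```c
#include <assert.h>
#include <stddef.h>

static double toy_abs( double v ) { return v < 0.0 ? -v : v; }

/* Reference-like search: returns the largest absolute value in x. */
double toy_ref_max_abs( const double* x, size_t n )
{
    assert( x != NULL || n == 0 );

    double best = 0.0;
    for ( size_t i = 0; i < n; ++i )
        if ( toy_abs( x[ i ] ) > best ) best = toy_abs( x[ i ] );
    return best;
}

/* Accepts any index whose element has the maximum absolute value, so
   implementations that break ties differently all pass. */
int toy_amaxv_result_ok( const double* x, size_t n, size_t index )
{
    return index < n && toy_abs( x[ index ] ) == toy_ref_max_abs( x, n );
}
```

For x = { 1.0, -3.0, 3.0, 2.0 }, both index 1 and index 2 are accepted, since the two elements have the same absolute value.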

commit 8d55033c966feed99fcca2a58017c3ab5b1646dc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 27 15:20:58 2016 -0500

Implemented distributed thrinfo_t management.

Details:
- Implemented Ricardo Magana's distributed thread info/communicator
management. Rather than fully construct the thrinfo_t structures, from
root to leaf, prior to spawning threads, the threads individually
construct their thrinfo_t trees (or, chains), and do so incrementally,
as needed, reusing the same structure nodes during subsequent blocked
variant iterations. This required moving the initial creation of the
thrinfo_t structure (now, the root nodes) from the _front() functions
to the bli_l3_thread_decorator(). The incremental "growing" of the tree
is performed in the internal back-end (ie: _int()) function, and so
mostly invisible. Also, the incremental growth of the thrinfo_t tree is
done as a function of the current and parent control tree nodes (as well
as the parent thrinfo_t node), further reinforcing the parallel
relationship between the two data structures.
- Removed the "inner" communicator from thrinfo_t structure definition,
as well as its id. Changed all APIs accordingly. Renamed
bli_thrinfo_needs_free_comms() to bli_thrinfo_needs_free_comm().
- Defined bli_l3_thrinfo_print_paths(), which prints the information
in an array of thrinfo_t* structure pointers. (Used only as a
debugging/verification tool.)
- Deprecated the following thrinfo_t creation functions:
bli_packm_thrinfo_create()
bli_l3_thrinfo_create()
because they are no longer used. bli_thrinfo_create() is now called
directly when creating thrinfo_t nodes.
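
The incremental growth described above can be sketched as a chain of nodes that is extended only when a deeper level is first reached and then reused on later iterations. This is a simplified illustration with hypothetical names: it omits communicators and the dependence on control tree nodes.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct toy_thrinfo_s
{
    int                   depth;    /* distance from the root node */
    struct toy_thrinfo_s* sub_node; /* NULL until first needed */
} toy_thrinfo_t;

toy_thrinfo_t* toy_thrinfo_create( int depth )
{
    toy_thrinfo_t* t = malloc( sizeof( *t ) );
    if ( t == NULL ) return NULL;
    t->depth    = depth;
    t->sub_node = NULL;
    return t;
}

/* Creates the sub-node on first use and reuses it afterwards. */
toy_thrinfo_t* toy_thrinfo_grow( toy_thrinfo_t* parent )
{
    assert( parent != NULL );
    if ( parent->sub_node == NULL )
        parent->sub_node = toy_thrinfo_create( parent->depth + 1 );
    return parent->sub_node;
}

/* Stand-in for a blocked variant: each iteration descends one level. */
void toy_descend( toy_thrinfo_t* node, int levels_left, int n_iter )
{
    if ( levels_left == 0 ) return;
    for ( int i = 0; i < n_iter; ++i )
    {
        toy_thrinfo_t* sub = toy_thrinfo_grow( node );
        if ( sub == NULL ) return;
        toy_descend( sub, levels_left - 1, n_iter );
    }
}

int toy_thrinfo_chain_len( const toy_thrinfo_t* t )
{
    int len = 0;
    for ( ; t != NULL; t = t->sub_node ) ++len;
    return len;
}

void toy_thrinfo_free( toy_thrinfo_t* t )
{
    while ( t != NULL )
    {
        toy_thrinfo_t* next = t->sub_node;
        free( t );
        t = next;
    }
}
```

However many loop iterations run, each thread in this sketch ends up with one node per level, because later iterations find the sub-node already present.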

commit fd04869ae4d4a3b0ebb9052557c296456bce7c0d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 27 14:14:11 2016 -0500

Changed configure's 'omp' threading to 'openmp'.

Details:
- Changed the configure script so that the expected string argument to the
-t (or --enable-threading=) option that enables OpenMP multithreading is
'openmp'. The previous expected string, 'omp', is still supported but
should be considered deprecated.

commit 9424af87209e4e435e2e742430945152690170b0
Merge: efa7341d c0630c40
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 27 12:51:08 2016 -0500

Merge branch 'compose'

commit 7f32dd57c6bd41c0704341752842277dd6a4c8eb
Author: Shaden Smith <shadencs.umn.edu>
Date: Sat Sep 17 11:33:57 2016 -0500

Adds sanity check to configuration choice.

commit efa7341df0b0115926aa8a6e8a4ebfb24fdbf11e
Merge: 121c39d4 e1453f68
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Sep 16 11:01:57 2016 -0500

Merge pull request 92 from ShadenSmith/readme_fix

Fixes broken URL in README.md

commit e1453f68f6afd90ae9a29b7a5faa46aa79bbf741
Author: Shaden Smith <ShadenTSmithgmail.com>
Date: Fri Sep 16 09:29:28 2016 -0500

Fixes broken URL in README.md

commit b922d7563422e14c49a4677bc6ae088a408861ed
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Aug 23 13:38:36 2016 -0500

Avoid compiling BLAS/CBLAS files when disabled.

Details:
- Updated the top-level Makefile, build/config.mk.in template, and
configure script so that object files corresponding to source files
belonging to the BLAS compatibility layer are not compiled (or archived)
when the compatibility layer is disabled. (Same for CBLAS.) Thanks
to Devin Matthews for suggesting this optimization.
- Slight change to the way configure handles internal variables. Instead
of converting (overwriting) some, such as enable_blas2blis and
enable_cblas, from a "yes" or "no" to a "1" or "0" value, the latter are
now stored in new variables that live alongside the originals (with the
suffix "_01"). This is convenient since some values need to be
sed-substituted into the config.mk.in template, which requires "yes" or
"no", while some need to be written to the bli_config.h.in template,
which requires "0" or "1".

Updated BLIS4 TOMS citation in README.md.

Added complex gemm micro-kernels for haswell.

Details:
- Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based
architectures. As with their real domain brethren, these kernels prefer
row storage, (though this doesn't affect most users due to high-level
optimizations in most level-3 operations that induce a transpose to
whatever storage preference the kernel may have).

Change-Id: I512ab90784ecbb7cdaee24928d2ccebb544ba5c1

commit 69826110bab2a064ec76457c24843d28f2581281
Merge: 64598ee4 a58dd35e
Author: Pradeep Rao <Pradeep.Raoamd.com>
Date: Wed Sep 14 03:26:25 2016 -0400

Merge "Implemented trsm single precision for lower triangular matrices, files added bli_trsm_l_int_6x16.c; files modified bli_kernel.h to enable optimized trsm microkernel and test_trsm.c is modified to test trsm single precision" into amd-staging

commit c0630c4024b08750043a2942a3e8a037aa6b6259
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 12 13:59:02 2016 -0500

Added debugging printf()'s to bli_l3_thrinfo.c.

Details:
- Added optional printf() statements to print out thread communicator
info as the thrinfo_t structure is built in bli_l3_thrinfo.c.
- Minor changes to frame/thread/bli_thrinfo.h.

commit 7b3bf1ffcd7160ccbf6c2518af6d88f6742e4977
Merge: 35509818 121c39d4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 6 15:47:13 2016 -0500

Merge branch 'master' into compose

commit 121c39d455f2db6f7ce6802ba7f73ad5e088c68c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 5 13:11:42 2016 -0500

Added complex gemm micro-kernels for haswell.

Details:
- Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based
architectures. As with their real domain brethren, these kernels prefer
row storage, (though this doesn't affect most users due to high-level
optimizations in most level-3 operations that induce a transpose to
whatever storage preference the kernel may have).

commit 35509818cbea1598b123421f81c42120889a03c3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Aug 31 17:34:15 2016 -0500

Added, moved some thread barriers.

Details:
- Removed thread barriers from the end of the loop bodies of
bli_gemm_blk_var1(), bli_gemm_blk_var2(), bli_trsm_blk_var1(),
and bli_trsm_blk_var2().
- Moved the thread barrier at the end of bli_packm_int() to the
end of bli_l3_packm(), and added missing barriers to that function.
- Removed the no longer necessary (and now incorrect) ochief guard
in bli_gemm3m3_packa() on the bli_obj_scalar_reset() on C.
- Thanks to Tyler Smith for help with these changes.

commit 64598ee4cfb86f64abbd4bcef5a82ba0d5565b67
Author: sthangar <Santanu.Thangarajamd.com>
Date: Wed Aug 31 12:54:50 2016 +0530

fixed the symlink issue

Change-Id: I2186d529f295c576597c189e1ae219bc1a83f955

commit abd61f9fa75d77a96d1491b3e035451ee73238fe
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Aug 30 12:34:19 2016 -0500

Updated BLIS4 TOMS citation in README.md.

commit 8a2373f26ba8fcd5b2d7b2cc72cb8b2e1f841a03
Author: sthangar <Santanu.Thangarajamd.com>
Date: Mon Aug 29 14:10:45 2016 +0530

Norm 2 optimization

Change-Id: Ide9decaccd20bf0ccc32c9abb6556e038dceed2b

commit fdc663902347aa252ea88cf09ce24ab748958dff
Author: sthangar <Santanu.Thangarajamd.com>
Date: Mon Aug 29 10:43:38 2016 +0530

Placed 1 and 1f AMD optimized AVX routines under zen folder

Change-Id: I26795211ef11d232ed794ce36dd0a9c1f8706328

commit 701b9aa3ff028decbf90efac0dca5bd64fe26269
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Aug 26 19:04:45 2016 -0500

Redesigned control tree infrastructure.

Details:
- Altered control tree node struct definitions so that all nodes have the
same struct definition, whose primary fields consist of a blocksize id,
a variant function pointer, a pointer to an optional parameter struct,
and a pointer to a (single) sub-node. This unified control tree type is
now named cntl_t.
- Changed the way control tree nodes are connected, and what computation
they represent, such that, for example, packing operations are now
associated with nodes that are "inline" in the tree, rather than
off-shoot branches. The original tree for the classic Goto gemm algorithm was
expressed (roughly) as:

blk_var2 -> blk_var3 -> blk_var1 -> ker_var2
                |           |
                -> packb    -> packa

and now, the same tree would look like:

blk_var2 -> blk_var3 -> packb -> blk_var1 -> packa -> ker_var2

Specifically, the packb and packa nodes perform their respective packing
operations and then recurse (without any loop) to a subproblem. This means
there are now two kinds of level-3 control tree nodes: partitioning and
non-partitioning. The blocked variants are members of the former, because
they iteratively partition off submatrices and perform suboperations on
those partitions, while the packing variants belong to the latter group.
(This change has the effect of allowing greatly simplified initialization
of the nodes, which previously involved setting many unused node fields to
NULL.)
- Changed the way thrinfo_t tree nodes are arranged to mirror the new
connective structure of control trees. That is, packm nodes are no longer
off-shoot branches of the main algorithmic nodes, but rather connected
"inline".
- Simplified control tree creation functions. Partitioning nodes are created
concisely with just a few fields needing initialization. By contrast, the
packing nodes require additional parameters, which are stored in a
packm-specific struct that is tracked via the optional parameters pointer
within the control tree struct. (This parameter struct must always begin
with a uint64_t that contains the byte size of the struct. This allows
us to use a generic function to recursively copy control trees.) gemm,
herk, and trmm control tree creation continues to be consolidated into
a single function, with the operation family being used to select
among the parameter-agnostic macro-kernel wrappers. A single routine,
bli_cntl_free(), is provided to free control trees recursively, whereby
the chief thread within a group releases the blocks associated with
mem_t entries back to the memory broker from which they were acquired.
- Updated internal back-ends, e.g. bli_gemm_int(), to query and call the
function pointer stored in the current control tree node (rather than
index into a local function pointer array). Before being invoked, these
function pointers are first cast to a gemm_voft (for gemm, herk, or trmm
families) or trsm_voft (for trsm family) type, which is defined in
frame/3/bli_l3_var_oft.h.
- Retired herk and trmm internal back-ends, since all execution now flows
through gemm or trsm blocked variants.
- Merged forwards- and backwards-moving variants by querying the direction
from routines as a function of the variant's matrix operands. gemm and
herk always move forward, while trmm and trsm move in a direction that
is dependent on which operand (a or b) is triangular.
- Added functions bli_thread_get_range_mdim(), bli_thread_get_range_ndim(),
each of which takes additional arguments and hides complexity in managing
the difference between the way ranges are computed for the four families
of operations.
- Simplified level-3 blocked variants according to the above changes, so that
the only steps taken are:
1. Query partitioning direction (forwards or backwards).
2. Prune unreferenced regions, if they exist.
3. Determine the thread partitioning sub-ranges.
<begin loop>
4. Determine the partitioning blocksize (passing in the partitioning
direction)
5. Acquire the current iteration's partitions for the matrices affected
by the current variant's partitioning dimension (m, k, n).
6. Call the subproblem.
<end loop>
- Instantiate control trees once per thread, per operation invocation.
(This is a change from the previous regime in which control trees were
treated as stateless objects, initialized with the library, and shared
as read-only objects between threads.) This once-per-thread allocation
is done primarily to allow threads to use the control tree as a place
to cache certain data for use in subsequent loop iterations. Presently,
the only application of this caching is a mem_t entry for the packing
blocks checked out from the memory broker (allocator). If a non-NULL
control tree is passed in by the (expert) user, then the tree is copied
by each thread. This is done in bli_l3_thread_decorator(), in
bli_thrcomm_*.c.
- Added a new field to the context, an opid_t, which tracks the "family"
of the operation being executed. For example, gemm, hemm, and symm are
all part of the gemm family, while herk, syrk, her2k, and syr2k are
all part of the herk family. Knowing the operation's family is necessary
when conditionally executing the internal (beta) scalar reset on
C in blocked variant 3, which is needed for gemm and herk families,
but must not be performed for the trmm family (because beta has only
been applied to the current row-panel of C after the first rank-kc
iteration).
- Reexpressed 3m3 induced method blocked variant in frame/3/gemm/ind
to conform with the new control tree design, and renamed the
macro-kernel codes corresponding to 3m2 and 4m1b.
- Renamed bli_mem.c (and its APIs) to bli_memsys.c, and renamed/relocated
bli_mem_macro_defs.h from frame/include to frame/base/bli_mem.h.
- Renamed/relocated bli_auxinfo_macro_defs.h from frame/include to
frame/base/bli_auxinfo.h.
- Fixed a minor bug whereby the storage-to-ukr-preference matching
optimization in the various level-3 front-ends was not being applied
properly when the context indicated that execution would be via an
induced method. (Before, we always checked the native micro-kernel
corresponding to the datatype being executed, whereas now we check
the native micro-kernel corresponding to the datatype's real projection,
since that is the micro-kernel that is actually used by induced methods.)
- Added an option to the testsuite to skip the testing of native level-3
complex implementations. Previously, it was always tested, provided that
the c/z datatypes were enabled. However, some configurations use
reference micro-kernels for complex datatypes, and testing these
implementations can slow down the testsuite considerably.
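
The unified node layout and the size-prefixed parameter struct described above could be sketched as follows. All names here (toy_cntl_t, toy_packm_params_t, and the functions) are hypothetical stand-ins, not the actual BLIS definitions. The point is only to show why a leading uint64_t size field allows one generic routine to copy a tree recursively.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef void ( *toy_var_fp )( void );

/* One struct type for every node (hypothetical stand-in for cntl_t). */
typedef struct toy_cntl_s
{
    int                bszid;    /* blocksize id */
    toy_var_fp         var_func; /* variant function pointer */
    void*              params;   /* optional; begins with a uint64_t size */
    struct toy_cntl_s* sub_node; /* the single sub-node, or NULL */
} toy_cntl_t;

/* Hypothetical packing parameters. The leading size field lets generic
   code copy the struct without knowing its concrete type. */
typedef struct
{
    uint64_t size;
    int      pack_schema;
    int      invert_diag;
} toy_packm_params_t;

toy_cntl_t* toy_cntl_create( int bszid, toy_var_fp var_func,
                             void* params, toy_cntl_t* sub_node )
{
    toy_cntl_t* c = malloc( sizeof( *c ) );
    if ( c == NULL ) return NULL;
    c->bszid    = bszid;
    c->var_func = var_func;
    c->params   = params;
    c->sub_node = sub_node;
    return c;
}

/* Frees a node, its parameters, and everything below it. A NULL node is
   accepted, so partially built trees can be freed as well. */
void toy_cntl_free( toy_cntl_t* c )
{
    if ( c == NULL ) return;
    toy_cntl_free( c->sub_node );
    free( c->params );
    free( c );
}

/* Recursively copies a tree, using the size prefix to copy parameters. */
toy_cntl_t* toy_cntl_copy( const toy_cntl_t* src )
{
    if ( src == NULL ) return NULL;

    toy_cntl_t* dst = toy_cntl_create( src->bszid, src->var_func, NULL, NULL );
    if ( dst == NULL ) return NULL;

    if ( src->params != NULL )
    {
        uint64_t size;
        memcpy( &size, src->params, sizeof( size ) );
        assert( size >= sizeof( uint64_t ) );

        dst->params = malloc( ( size_t )size );
        if ( dst->params == NULL ) { toy_cntl_free( dst ); return NULL; }
        memcpy( dst->params, src->params, ( size_t )size );
    }

    dst->sub_node = toy_cntl_copy( src->sub_node );
    if ( src->sub_node != NULL && dst->sub_node == NULL )
    {
        toy_cntl_free( dst );
        return NULL;
    }
    return dst;
}
```

A per-thread copy made this way shares no storage with the original, which fits the once-per-thread instantiation described in the list above.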

commit a58dd35ed7b5b77a6b272655d2edd7a822b8fa87
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Fri Aug 26 14:55:12 2016 +0530

Implemented trsm single precision for lower triangular matrices, files added bli_trsm_l_int_6x16.c; files modified bli_kernel.h to enable optimized trsm microkernel and test_trsm.c is modified to test trsm single precision

Change-Id: Ibddf989f4aad577e89558673e1038cf6ece654d9

commit 73517f522b69de429dd7f3df60a70c068149ab28
Merge: c6f5c215 50293da3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Aug 23 13:46:59 2016 -0500

Merge branch 'master' into compose

commit 50293da38d5f2b7be9bbc94b9e85aacb6a10f672
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Aug 23 13:38:36 2016 -0500

Avoid compiling BLAS/CBLAS files when disabled.

Details:
- Updated the top-level Makefile, build/config.mk.in template, and
configure script so that object files corresponding to source files
belonging to the BLAS compatibility layer are not compiled (or archived)
when the compatibility layer is disabled. (Same for CBLAS.) Thanks
to Devin Matthews for suggesting this optimization.
- Slight change to the way configure handles internal variables. Instead
of converting (overwriting) some, such as enable_blas2blis and
enable_cblas, from a "yes" or "no" to a "1" or "0" value, the latter are
now stored in new variables that live alongside the originals (with the
suffix "_01"). This is convenient since some values need to be
sed-substituted into the config.mk.in template, which requires "yes" or
"no", while some need to be written to the bli_config.h.in template,
which requires "0" or "1".

commit 22dd6a353ddb56614309c01533b1a94c9fd32bca
Merge: cdfb3c3f f20ed388
Author: praveeng <praveen.gamd.com>
Date: Tue Aug 23 15:15:35 2016 +0530

Merge master code as on 2016_08_23 to amd-staging branch by praveeng

Changes to be committed:
modified: frame/thread/bli_mutex_openmp.h
modified: frame/thread/bli_mutex_pthreads.h

Change-Id: Ica522edbb1d0173f53f38d5057b1f7aef73666be

commit c6f5c215ee793d03ea834469fc2adc53feaffc42
Merge: d52cb767 16a4c7a8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Aug 22 17:33:02 2016 -0500

Merge branch 'master' into compose

commit f20ed3885d628992fab88690f629a5a2bab3eb88
Merge: 02ac597e 4bc842ca
Author: praveeng <praveen.gamd.com>
Date: Mon Aug 22 15:27:33 2016 +0530

Merge branch 'master' of https://github.com/clMathLibraries/blis-amd for "Fixed bugs in bli_mutex_init() and friends."

commit 02ac597e4b9be2670d9fff65d28552f8e1ec81b3
Author: praveeng <praveen.gamd.com>
Date: Thu Jul 28 15:11:08 2016 +0530

Revert commits 357c990bdd7bd5667aac5adf1bab3712973e7414

Change-Id: I12a34456d7eed93fda4369e76bcddb42ba7ccb99

commit 84e41cc73c9c87ce64582acd4264b8e1b5316482
Author: praveeng <praveen.gamd.com>
Date: Thu Jul 28 15:01:36 2016 +0530

Revert commits 8aee306

Change-Id: I3dd999c77c6779332a40dbb84371ca487216f189

commit 30ccfcee82db93d0109d1571242e2db925e95d0a
Author: praveeng <praveen.gamd.com>
Date: Mon Jul 25 14:14:00 2016 +0530

removed changes from readme file which are giving conflicts

Change-Id: Ic71ad1313e1404fed444e899466043704d875af6

commit aeca25cd63fc8971f8fe7809599c57853f976548
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 16:51:23 2016 +0530

first commit

Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit 6b2274864b36fd1019d97bcc4ca6dd7a57ef16d9
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 15:00:31 2016 +0530

small modification to readme for git push test

Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit daa7a9ecb25982f2551adbd95e65f8ba97cfe944
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 16:51:23 2016 +0530

first commit

Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit 5f66a4aa05aeffcb6eb587851d78d9527319466c
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 15:00:31 2016 +0530

small modification to readme for git push test

Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit c6cbd78d2388c08824822b91a1c36ac4349bb67f
Author: praveeng <praveen.gamd.com>
Date: Thu Jul 28 15:11:08 2016 +0530

Revert commits 357c990bdd7bd5667aac5adf1bab3712973e7414

Change-Id: I12a34456d7eed93fda4369e76bcddb42ba7ccb99

commit 9219a9060762525f87ebbf556d78fe8621858513
Author: praveeng <praveen.gamd.com>
Date: Thu Jul 28 15:01:36 2016 +0530

Revert commits 8aee306

Change-Id: I3dd999c77c6779332a40dbb84371ca487216f189

commit 728573296efa7cf14d2381570e116509dfe2a240
Author: praveeng <praveen.gamd.com>
Date: Mon Jul 25 14:14:00 2016 +0530

removed changes from readme file which are giving conflicts

Change-Id: Ic71ad1313e1404fed444e899466043704d875af6

commit ad7862e291c240505c733a41d231b1a126ade73c
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 16:51:23 2016 +0530

first commit

Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit ad4b471a25ce77867295e5529dfc787e7c18b03f
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 15:00:31 2016 +0530

small modification to readme for git push test

Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit 55d641363fcd8bdfdabbd7c22822fa2d0b7f3fa6
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 16:51:23 2016 +0530

first commit

Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit f3b6b15f6d591d323802bd6c81c522a02056506d
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 15:00:31 2016 +0530

small modification to readme for git push test

Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit 16a4c7a823d60707ed9272f5d36e5c5d54c0ba4b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Aug 19 11:38:36 2016 -0500

Fixed bugs in bli_mutex_init() and friends.

Details:
- Fixed a couple of bugs that affected OpenMP and POSIX threads
configurations that resulted in compiler errors and warnings due
to type mismatch, and in the case of pthreads, a missing function
argument. The bugs are fairly recent, introduced in a017062.

commit c8e4ef93953ba2b79fb7e0973c08469c0e28a2cd
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Aug 3 16:13:03 2016 -0500

Add prefetchw to 30x8 kernel.

commit 4b5a2f3d6e7ffeb5cc2be8448554f5c2083ad68f
Merge: 380736bf 9f52a587
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Aug 3 16:09:51 2016 -0500

Merge remote-tracking branch 'origin/knl' into knl

Conflicts:
kernels/x86_64/knl/3/bli_dgemm_opt_24x8.c

commit 380736bfe955efbdd7274c90b6fd635688e83bc4
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Aug 3 16:08:28 2016 -0500

Add (new) 30x8 KNL kernel and fix non-scatter prefetch bug.

commit 9f52a587dee855daa73c194e41b6951416544e9a
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Aug 3 16:03:53 2016 -0500

Try prefetchw[t1] instead of regular prefetch for C.

commit 8945a1512d366bc6a8a85718d12cbf5de6f2898b
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Aug 3 11:28:24 2016 -0500

This version gets ~1550 GFLOPs on KNL with 16x4.

commit cdfb3c3f29d321033fca106aa58ab67ead90a95d
Merge: 50a2f2ef 4bc842ca
Author: praveeng <praveen.gamd.com>
Date: Fri Jul 29 12:45:04 2016 +0530

Merge master code as on 2016_07_29 to amd-staging branch by praveeng

Change-Id: Ic78b84d8b8d10158fb2a612f9a64bbc7b1f9b486

commit 4bc842ca3a64e658c0808bfe4c5693a5ace97923
Merge: 117f8838 b0d510bf
Author: praveeng <praveen.gamd.com>
Date: Thu Jul 28 17:32:12 2016 +0530

Merge branch 'master' of publicrepo

commit 117f8838511a478aa16137e770d27dd21f4227c5
Author: praveeng <praveen.gamd.com>
Date: Thu Jul 28 15:11:08 2016 +0530

Revert commits 357c990bdd7bd5667aac5adf1bab3712973e7414

Change-Id: I12a34456d7eed93fda4369e76bcddb42ba7ccb99

commit 2fcdc28f1055d385b2e662aa920fb97c472394d7
Author: praveeng <praveen.gamd.com>
Date: Thu Jul 28 15:01:36 2016 +0530

Revert commits 8aee306

Change-Id: I3dd999c77c6779332a40dbb84371ca487216f189

commit 1b5d104afe0628b8b6c0650f1e58cfb08be67004
Author: praveeng <praveen.gamd.com>
Date: Mon Jul 25 14:14:00 2016 +0530

removed changes from readme file which are giving conflicts

Change-Id: Ic71ad1313e1404fed444e899466043704d875af6

commit d81273047bff56501e9413a90991d3d1f8b56a06
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 16:51:23 2016 +0530

first commit

Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit 65905c3011a11cda95761681d4ae84337e46bdb5
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 15:00:31 2016 +0530

small modification to readme for git push test

Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit 23cca231be10fe1797aed451bcbc69d38c78bc0c
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 16:51:23 2016 +0530

first commit

Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit 922e3091702f25e3287b417719a33adbd5bbf138
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 15:00:31 2016 +0530

small modification to readme for git push test

Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit b0d510bf0e4dfd177f9e4ae0069f41921e2ecdc1
Author: praveeng <praveen.gamd.com>
Date: Thu Jul 28 15:11:08 2016 +0530

Revert commits 357c990bdd7bd5667aac5adf1bab3712973e7414

Change-Id: I12a34456d7eed93fda4369e76bcddb42ba7ccb99

commit 5ebeece5b4a8df81d59ca7558b278a4263d15128
Author: praveeng <praveen.gamd.com>
Date: Thu Jul 28 15:01:36 2016 +0530

Revert commits 8aee306

Change-Id: I3dd999c77c6779332a40dbb84371ca487216f189

commit 6ce4c022ebdea00c2b951090e3c2e9e88735b9ce
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Jul 27 16:26:36 2016 -0500

Switch back to 24x8. I could only squeeze 24.5GFLOP out of 8x24, and scalability is not improved.

commit d52cb7671509592a8078729477b40b60380518a2
Merge: 95abea46 c31b1e7b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 27 16:04:55 2016 -0500

Merge branch 'master' into compose

commit c31b1e7b9d659b96433a87e5aecb90e457a104cc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 27 15:58:07 2016 -0500

Relax alignment restrictions for sandybridge ukrs.

Details:
- Relaxed the base pointer and leading dimension alignment restrictions
in the sandybridge gemm microkernels, allowing the use of vmovups/vmovupd
instead of vmovaps/vmovapd. These changes mimic those made to the haswell
microkernels in e0d2fa0 and ee2c139.
- Updated testsuite modules as well as standalone test drivers in 'test'
directory to use DBL_MAX as the initial time candidate. Thanks to Devin
Matthews for suggesting this change.
- Inserted #include "float.h" into bli_system.h (to gain access to DBL_MAX).
- Minor update (vis-a-vis contexts) to driver code in test/3m4m.
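
The DBL_MAX change refers to the usual best-of-n timing pattern, sketched below with a hypothetical helper name. Starting from DBL_MAX means the first measured sample always replaces the initial candidate, however large that sample is.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Returns the smallest of n measured times (in seconds). If no samples
   are given, the initial candidate DBL_MAX is returned unchanged. */
double toy_best_time( const double* samples, size_t n )
{
    assert( samples != NULL || n == 0 );

    double best = DBL_MAX; /* initial time candidate */
    for ( size_t i = 0; i < n; ++i )
        if ( samples[ i ] < best ) best = samples[ i ];
    return best;
}
```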

commit b8f2b55532849d45d379afbdd05a52ff6100800d
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Jul 27 15:22:55 2016 -0500

Try an 8x24 kernel for the hell of it.

commit 7ede5863ae3567f7c0852efc2d5cd649ca19e0f3
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Jul 27 13:41:27 2016 -0600

Allocate pack buffer on MCDRAM for KNL.

commit ad89ed2e829c7b261d8ba0998a3cb83ad576ee04
Merge: 2c9de740 81e2b05f
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Jul 27 11:45:40 2016 -0500

Merge branch 'knl' of github.com:devinamatthews/blis into knl

commit 2c9de740edb66c4692c200731763bbd1d3171ccb
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Jul 27 11:44:54 2016 -0500

This version gets ~26GF on one core.

commit 81e2b05f31bca4e1e1676e7b533d1868d9f9be33
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Jul 27 11:39:05 2016 -0500

Add optimized packing kernels for KNL.

commit a7d8ca97b8d835c32d90ff20a565c82733f014a8
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Jul 25 15:15:13 2016 -0500

All fixed.

commit 963d0393b023f4134bb0c682923faf9964c0e645
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Jul 25 14:40:53 2016 -0500

Add 24xk pack kernel.

commit 117b76739afba481768897d2580f8365d3345417
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Jul 25 13:53:07 2016 -0500

In the midst of debugging.

commit 8c0a4fd1d3535d608a9a309a61ffee0a73c3646f
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Jul 25 13:09:24 2016 -0500

Fix some row/column confusion.

commit c44f9f96930312125b15e64c326ab5ab5cc02633
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Jul 25 12:02:24 2016 -0500

Simplify displacements -- clang assembler was badly botching EVEX compressed displacements giving false alarms for instruction length.

commit e0cce177cc1b47ec9f11ac0556241feaa3564df1
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Jul 25 10:02:25 2016 -0500

Minor fixes for 8x24 KNL kernel.

commit 50a2f2efcbeb46537f1deaa8e44dc579a4e49eb8
Merge: 1aa77dfc cfd46c88
Author: praveeng <praveen.gamd.com>
Date: Mon Jul 25 17:01:20 2016 +0530

Merge master code as on 2016_07_25 to amd-staging branch by praveeng

Change-Id: I84886ae241db2aac0bef6b7ef399f04aa8bca16d

commit cfd46c88d59c8f61d5e7cf768d606e4c44623584
Merge: f493bf4d a017062f
Author: praveeng <praveen.gamd.com>
Date: Mon Jul 25 15:38:13 2016 +0530

Merge remote-tracking branch 'publicrepo/master'

commit f493bf4d704fe0e967783cd6e6877d3302c056a1
Author: praveeng <praveen.gamd.com>
Date: Mon Jul 25 14:14:00 2016 +0530

removed changes from readme file which are giving conflicts

Change-Id: Ic71ad1313e1404fed444e899466043704d875af6

commit 65735bbedf75784c48bd11e05b3fdc98fc66b4bc
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Sun Jul 24 21:50:32 2016 -0500

Switch to 24x8 kernel, unrolled by 16.

commit 45d5dc97177117220bd9dd0abf85aafc185acad1
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Sun Jul 24 14:25:26 2016 -0500

Add 24x8 "KNC-style" kernel for KNL.

commit 95abea46f86816fddfc9ff0abfa52880801461be
Merge: d0dfe5b5 a017062f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Jul 23 15:38:33 2016 -0500

Merge branch 'master' into compose

commit a017062fdf763037da9d971a028bb07d47aa1c8a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 22 17:02:59 2016 -0500

Integrated "memory broker" (membrk_t) abstraction.

Details:
- Integrated a patch originally authored and submitted by Ricardo Magana
of HP Enterprise. The changeset inserts use of a new object type, membrk_t,
(memory broker) that allows multiple sets of memory pools on, for example,
separate NUMA nodes, each of which has a separate memory space.
- Added membrk field to cntx_t and defined corresponding accessor macros.
- Added membrk field to mem_t object and defined corresponding accessor macros.
- Created new bli_membrk.c file, which contains the new memory broker API,
including:
bli_membrk_init(), bli_membrk_finalize()
bli_membrk_acquire_[mv](), bli_membrk_release(),
bli_membrk_init_pools(), bli_membrk_reinit_pools(),
bli_membrk_finalize_pools(),
bli_membrk_pool_size()
- In bli_mem.c, changed function calls to
bli_mem_init_pools() -> bli_membrk_init()
bli_mem_reinit_pools() -> bli_membrk_reinit()
bli_mem_finalize_pools() -> bli_membrk_finalize()
- In bli_packv_init.c, bli_packm_init.c, changed function calls to:
bli_mem_acquire_[mv]() -> bli_membrk_acquire_[mv]()
bli_mem_release() -> bli_membrk_release()
- Added bli_mutex.c and related files to frame/thread. These files define
abstract mutexes (locks) and corresponding APIs for pthreads, openmp, or
single-threaded execution. This new API is employed within functions
such as bli_membrk_acquire_[mv]() and bli_membrk_release().
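
As a rough illustration of how an abstract mutex can guard a broker's pool, the sketch below selects a lock implementation at compile time and uses it in acquire and release functions. Every name (toy_mutex_t, toy_membrk_t, TOY_ENABLE_PTHREADS) is hypothetical, the pool is a fixed array of byte blocks, and the OpenMP variant is omitted. It is not the bli_membrk API.

```c
#include <assert.h>
#include <stddef.h>

#ifdef TOY_ENABLE_PTHREADS
#include <pthread.h>
typedef pthread_mutex_t toy_mutex_t;
static void toy_mutex_init( toy_mutex_t* m )   { pthread_mutex_init( m, NULL ); }
static void toy_mutex_lock( toy_mutex_t* m )   { pthread_mutex_lock( m ); }
static void toy_mutex_unlock( toy_mutex_t* m ) { pthread_mutex_unlock( m ); }
#else
/* Single-threaded build: the lock is a no-op placeholder. */
typedef int toy_mutex_t;
static void toy_mutex_init( toy_mutex_t* m )   { *m = 0; }
static void toy_mutex_lock( toy_mutex_t* m )   { ( void )m; }
static void toy_mutex_unlock( toy_mutex_t* m ) { ( void )m; }
#endif

#define TOY_POOL_BLOCKS 4
#define TOY_BLOCK_SIZE  256

/* Each broker owns its own pool, so separate brokers can serve separate
   memory spaces. */
typedef struct
{
    toy_mutex_t   mutex;
    unsigned char storage[ TOY_POOL_BLOCKS ][ TOY_BLOCK_SIZE ];
    void*         free_list[ TOY_POOL_BLOCKS ];
    int           n_free;
} toy_membrk_t;

void toy_membrk_init( toy_membrk_t* b )
{
    toy_mutex_init( &b->mutex );
    for ( int i = 0; i < TOY_POOL_BLOCKS; ++i )
        b->free_list[ i ] = b->storage[ i ];
    b->n_free = TOY_POOL_BLOCKS;
}

/* Checks a block out of the pool; returns NULL if none are left. */
void* toy_membrk_acquire( toy_membrk_t* b )
{
    void* block = NULL;
    toy_mutex_lock( &b->mutex );
    if ( b->n_free > 0 ) block = b->free_list[ --b->n_free ];
    toy_mutex_unlock( &b->mutex );
    return block;
}

/* Returns a previously acquired block to the pool. */
void toy_membrk_release( toy_membrk_t* b, void* block )
{
    assert( block != NULL );
    toy_mutex_lock( &b->mutex );
    if ( b->n_free < TOY_POOL_BLOCKS ) b->free_list[ b->n_free++ ] = block;
    toy_mutex_unlock( &b->mutex );
}
```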

commit 8ff2e069c48c12fd06b9c48c6b3aeb4ea9b0e6e1
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 16:22:26 2016 -0500

Add 4x unrolled variant for KNL microkernel.

commit 9cb2ed9b0c25f31a22c1c9719b062fa665ad7adf
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 16:10:30 2016 -0500

Get rid of one RBX update.

commit 451bde076f0320d60cd2475cfb048ac4a2b798bb
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 15:43:00 2016 -0500

Add some more knobs to twiddle for KNL microkernel.

commit 8c6e621c099521e7a4d87e007bb8224faa5f33a3
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 15:05:15 2016 -0500

Make knl conform to new kernel dir structure.

commit ce7214c6618d6f22f4ce2ee452336236916d1f30
Merge: 119d0399 ce59f811
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 14:59:53 2016 -0500

Merge remote-tracking branch 'origin/master' into knl

commit ce59f81108ec9aea918a7e77030da8acfdd397ce
Merge: ff41153f 707a2b7f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 22 14:48:14 2016 -0500

Merge pull request 88 from devinamatthews/32bit-dim_t

Handle 32-bit dim_t in 64-bit microkernels.

commit 707a2b7faca137cca7cab7b11a12c44ddaf7ad53
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 13:49:44 2016 -0500

Somehow forgot the most important microkernel.

commit 47ec045056351ac4f0791c071fa0daaa81699c8c
Merge: 08f1d6b6 ff41153f
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 13:45:23 2016 -0500

Merge remote-tracking branch 'upstream/master' into 32bit-dim_t

commit 08f1d6b6fa344275de0f675f69737145ccf6646a
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 13:44:37 2016 -0500

Use 64-bit intermediate variable for k for architectures that do 64-bit loads in case dim_t is 32-bit.
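
A minimal sketch of the idea, assuming a 32-bit dim_t and a microkernel that splits k into an unrolled part and a remainder. The sketch_ names and the unroll factor of 4 are assumptions made for illustration, not taken from the kernels themselves.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t sketch_dim_t;  /* stand-in for a 32-bit dim_t */

typedef struct {
    uint64_t k_iter;  /* iterations of the (assumed) 4x unrolled loop */
    uint64_t k_left;  /* leftover iterations */
} sketch_kloop_t;

sketch_kloop_t sketch_prepare_k(sketch_dim_t k)
{
    assert(k >= 0);
    /* Widen once, up front. A 64-bit load from the address of k64 reads
       exactly this object, whereas a 64-bit load from the address of a
       32-bit k would also read four unrelated bytes next to it. */
    uint64_t k64 = (uint64_t)k;
    sketch_kloop_t loop = { k64 / 4, k64 % 4 };
    return loop;
}
```

Assembly that loads the loop counts with 64-bit moves can then take the addresses of k_iter and k_left safely regardless of how dim_t is configured.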

commit ff41153f4eb7f38ed94bdd9a3fd81fb979f3f401
Merge: f9214ced e0d2fa0d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 22 13:21:03 2016 -0500

Merge pull request 86 from devinamatthews/haswell-vmovups

Remove alignment restrictions on C in haswell kernel.

commit e0d2fa0d835ab49366aeb790363bb2b571d36ed8
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 12:56:51 2016 -0500

Relax alignment restrictions for haswell sgemm.

commit f9214ced97392861f5a0ea72abfcf6f41faf674c
Merge: 413d62ac 08666eaa
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 22 12:16:39 2016 -0500

Merge pull request 85 from devinamatthews/qopenmp

Change -openmp to -fopenmp for icc.

commit ee2c139df6ad53c6aec8a67ab23b3b1912e8d259
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 12:06:03 2016 -0500

Remove alignment restrictions on C in haswell kernel.

commit 08666eaa20d8a31f2f92f944e5bfa7c1558c53e4
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 11:07:34 2016 -0500

Change -openmp to -fopenmp for icc.

commit 119d0399428905053265f3aca1cc8cc1fde3b363
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 10:23:31 2016 -0500

Add 8x24 KNL kernel.

commit 1aa77dfc1dc183d16e0b6a1196d9c263f021e83d
Merge: 9101a9c8 ec9f5983
Author: praveeng <praveen.gamd.com>
Date: Thu Jul 21 14:22:40 2016 +0530

Merge master code as on 2016_07_21 to amd-staging branch by praveeng

Change-Id: Ic7d0a21101358f08147736e7f1884e7409937344

commit b58cda9eba0c1e175460aae109baf792d29ba5bf
Merge: 318f063d 413d62ac
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Jul 19 14:09:09 2016 -0500

Merge remote-tracking branch 'origin/master' into knl

Conflicts:
frame/base/bli_threading.h
frame/include/blis.h
frame/thread/bli_thread.c

commit ec9f59836b32260c29ff1cd24e629c7d8de14992
Merge: 197e182f 763babe4
Author: praveeng <praveen.gamd.com>
Date: Mon Jul 18 12:56:25 2016 +0530

Merge branch 'master' of https://github.com/clMathLibraries/blis-amd

commit 197e182fcbf1340fd4a202fac58bea6cfcfa9e2f
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 16:51:23 2016 +0530

first commit

Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit 41fb32711031e7ec86b062aa7f53255d1f5905e2
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 15:00:31 2016 +0530

small modification to readme for git push test

Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit d0dfe5b5372cc7558ee9c4104b29f82eecc7ed61
Merge: 31def12e 413d62ac
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jul 14 11:01:06 2016 -0500

Merge branch 'master' into compose

commit 9101a9c880e3934f8a63ffc7fe15f5fc1077a73d
Author: sthangar <Santanu.Thangarajamd.com>
Date: Wed Jul 13 16:51:14 2016 +0530

Checked in optimized 1V kernels along with benchmark codes. Also incorporated review comments for 1F kernels

Change-Id: I035c0d39e6b0bed28e6e2041242186c49f6ed55b

commit 763babe488880b42c86c7fc207aa7665bd0ff9f7
Merge: 357c990b 413d62ac
Author: praveeng <praveen.gamd.com>
Date: Wed Jul 13 11:57:19 2016 +0530

Merge remote-tracking branch 'publirepo/master'

commit 413d62aca28edabba56605a9f87d5b715831e1db
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 12 15:02:52 2016 -0500

README update (use official ACM TOMS links).

commit dfa431f696db2df4065ea454df268a2e0bc02eac
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 12 14:21:19 2016 -0500

README update (BLIS2 TOMS article now in-print).

commit 357c990bdd7bd5667aac5adf1bab3712973e7414
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 16:51:23 2016 +0530

first commit

Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit 8aee306300adb099b66036f2c2f7f3996433cf49
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 15:00:31 2016 +0530

small modification to readme for git push test

Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit 31def12e2629f187e40f93f6bae9e26a6c2660e2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jun 30 15:19:20 2016 -0500

First phase of control tree redesign.

Details:
- These changes constitute the first set of changes in preparation to
revamping the structure and use of control trees in BLIS. Modifications
in this commit don't affect the control tree code yet, but rather lay
the groundwork.
- Defined wrappers for the following functions, where the wrappers
each take a direction parameter of a new enumerated type, dir_t
(BLIS_BWD or BLIS_FWD), and execute the correct underlying function.
- bli_acquire_mpart_*() and _vpart_*()
- bli_*_determine_kc_[fb]()
- bli_thread_get_range_*() and bli_thread_get_range_weighted_*()
- Consolidated all 'f' (forwards-moving) and 'b' (backwards-moving)
blocked variants for trmm and trsm, and renamed gemm and herk variants
accordingly. The direction is now queried via routines such as
bli_trmm_direct(), which determines the direction from the implied side
and uplo parameters. For gemm and herk, it is unconditionally BLIS_FWD.
- Defined wrappers to parameter-specific macrokernels for herk, trmm, and
trsm, e.g. bli_trmm_xx_ker_var2(), that execute the correct underlying
macrokernel based on the implied parameters. The same logic used to
choose the dir_t in _direct() functions is used here.
- Simplified the function pointer arrays in _int() functions given the
consolidation and dir_t querying mentioned above.
- Function signature (whitespace) reformatting for various functions.
- Removed old code in various 'old' directories.
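
The direction-parameter wrappers described above might look roughly like the following sketch, which computes the i-th block of a dimension of length n when moving forwards or backwards. The sketch_ names and the half-open [start, end) convention are assumptions for illustration; the real wrappers operate on BLIS objects and thread info.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SKETCH_FWD, SKETCH_BWD } sketch_dir_t;

/* Forward variant: blocks are carved starting from index 0. */
void sketch_range_f(size_t i, size_t b, size_t n, size_t *start, size_t *end)
{
    *start = i * b;
    *end   = (n - *start < b) ? n : *start + b;
}

/* Backward variant: blocks are carved starting from index n. */
void sketch_range_b(size_t i, size_t b, size_t n, size_t *start, size_t *end)
{
    *end   = n - i * b;
    *start = (*end < b) ? 0 : *end - b;
}

/* Wrapper: one entry point that dispatches on the direction parameter. */
void sketch_range(sketch_dir_t dir, size_t i, size_t b, size_t n,
                  size_t *start, size_t *end)
{
    assert(b > 0 && i * b < n);
    if (dir == SKETCH_FWD) sketch_range_f(i, b, n, start, end);
    else                   sketch_range_b(i, b, n, start, end);
}
```

With a wrapper of this shape, a single blocked variant can serve both directions, which is what makes the consolidation of the 'f' and 'b' variants possible.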

commit 405c9d46344d93c3eab5572b233900b50ca50d68
Author: sthangar <Santanu.Thangarajamd.com>
Date: Wed Jun 22 12:18:54 2016 +0530

Check-in the fused kernels optimized for Zen

Change-Id: I7b2f467b960e7b9a285f06e47be87de122e5fa24

commit 232754feecf29452987666b9f5ebba2619bfd0b0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jun 21 14:25:39 2016 -0500

Fixed compiler warning in rand[vm], randn[vm].

Details:
- Fixed compiler warnings about unused variables related to the disabling
of normalization in the structured cases of the rand[vm] and randn[vm]
operations.

commit a89555d1605574f3685813dcc972b636dd61264d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jun 17 14:08:35 2016 -0500

Added randn[vm] operations, support in testsuite.

Details:
- Defined a new randomization operation, randn, on vectors and matrices.
The randnv and randnm operations randomize each element of the target
object with values from a narrow range of values. Presently, those
values are all integer powers of two, but they do not need to be powers
of two in order to achieve the primary goal, which is to initialize
objects that can be operated on with plenty of precision "slack"
available to allow computations that avoid roundoff. Using this method
of randomization makes it much more likely that testsuite residuals of
properly-functioning operations are close to zero, if not exactly zero.
- Updated existing randomization operations randv and randm to skip
special diagonal handling and normalization for matrices with structure.
This is now handled by the testsuite modules by explicitly calling a
testsuite function that loads the diagonal (and scales off-diagonal
elements).
- Added support for randnv and randnm in the testsuite with a new switch
in input.general that universally toggles between use of the classic
randv/randm, which use real values on the interval [-1,1], and
randnv/randnm, which use only values from a narrow range. Currently,
the narrow range is: +/-{2^0, 2^-1, 2^-2, 2^-3, 2^-4, 2^-5, 2^-6}, as
well as 0.0.
- Updated testsuite modules so that a testsuite wrapper function is called
instead of directly calling the randomization operations (such as
bli_randv() and bli_randm()). This wrapper also takes a bool_t that
indicates whether the object's elements should be normalized. (NOTE: As
alluded to above, in the test modules of triangular solve operations such
as trsv and trsm, we perform the extra step of loading the diagonal.)
- Defined a new level-0 operation, invertsc, which inverts a scalar.
- Updated the abval2ris and sqrt2ris level-0 macros to avoid an unlikely
but possible divide-by-zero.
- Updated function signature and prototype formatting in testsuite.
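
As a rough illustration of the narrow-range randomization described above, the sketch below draws values from +/-{2^0, ..., 2^-6} and 0.0. The function names, the use of rand(), and the uniform choice among the 15 candidates are assumptions; the commit does not state how randnv/randnm pick among the values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Returns 0.0 or a signed power of two between 2^-6 and 2^0. */
double sketch_randn_value(void)
{
    int pick = rand() % 15;   /* 14 signed powers of two, plus zero */
    if (pick == 14) return 0.0;
    double mag = 1.0 / (double)(1 << (pick / 2));   /* 2^0 .. 2^-6 */
    return (pick % 2 == 0) ? mag : -mag;
}

/* Fills a contiguous vector with narrow-range values. */
void sketch_randnv(size_t n, double *x)
{
    assert(n == 0 || x != NULL);
    for (size_t i = 0; i < n; ++i)
        x[i] = sketch_randn_value();
}
```

Every value is a multiple of 2^-6 with magnitude at most 1, so products and moderately long sums of such values remain exactly representable in double precision, which is the precision "slack" the commit message refers to.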

commit 318f063dcbd8b594969e401bc99146d24b01066a
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Jun 8 17:46:50 2016 -0500

Add new KNL microkernel derived from Haswell.

commit 096895c5d538a7f8817603d7cf28c52e99340def
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jun 6 13:32:04 2016 -0500

Reorganized code, APIs related to multithreading.

Details:
- Reorganized code and renamed files defining APIs related to multithreading.
All code that is not specific to a particular operation is now located in a
new directory: frame/thread. Code is now organized, roughly, by the
namespace to which it belongs (see below).
- Consolidated all operation-specific *_thrinfo_t object types into a single
thrinfo_t object type. Operation-specific level-3 *_thrinfo_t APIs were
also consolidated, leaving bli_l3_thrinfo_*() and bli_packm_thrinfo_*()
functions (aside from a few general purpose bli_thrinfo_*() functions).
- Renamed thread_comm_t object type to thrcomm_t.
- Renamed many of the routines and functions (and macros) for multithreading.
We now have the following API namespaces:
- bli_thrinfo_*(): functions related to thrinfo_t objects
- bli_thrcomm_*(): functions related to thrcomm_t objects.
- bli_thread_*(): general-purpose functions, such as initialization,
finalization, and computing ranges. (For now, some macros, such as
bli_thread_[io]broadcast() and bli_thread_[io]barrier() use the
bli_thread_ namespace prefix, even though bli_thrinfo_ may be more
appropriate.)
- Renamed thread-related macros so that they use a bli_ prefix.
- Renamed control tree-related macros so that they use a bli_ prefix (to be
consistent with the thread-related macros that were also renamed).
- Removed undef BLIS_SIMD_ALIGN_SIZE from dunnington's bli_kernel.h. This
undef was a temporary fix to some macro defaults which were being applied
in the wrong order, which was recently fixed.

commit 232530e88ff99f37abcae5b6fb5319a9a375a45f
Merge: 4bcabd1b eef37f8b
Author: Tyler Michael Smith <tmscs.utexas.edu>
Date: Wed Jun 1 15:14:10 2016 -0500

Merge commit 'refs/pull/81/head' of https://github.com/flame/blis

Conflicts:
frame/base/bli_threading_pthreads.c
frame/base/bli_threading_pthreads.h

commit 4bcabd1bf60688c38cf562459fc5e8be8b831756
Author: Tyler Michael Smith <tmscs.utexas.edu>
Date: Wed Jun 1 13:27:28 2016 -0500

Use spin locks instead of pthread barriers

commit eef37f8b4d81845a6ba4bf25586d32b50c3e8a68
Author: Jeff Hammond <jeff.sciencegmail.com>
Date: Sun May 29 22:28:13 2016 -0700

use GCC intrinsic instead of pthread_mutex for atomic increment and fetch
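
A minimal sketch of an atomic increment-and-fetch using a GCC-style builtin. The wrapper name is hypothetical, and the commit message does not say whether the __atomic_ builtin shown here or the older __sync_ family was used.

```c
#include <assert.h>
#include <stddef.h>

/* Atomically adds 1 to *counter and returns the new value, with no mutex. */
int sketch_atomic_inc_fetch(int *counter)
{
    assert(counter != NULL);
    return __atomic_add_fetch(counter, 1, __ATOMIC_SEQ_CST);
}
```

Compared with locking a pthread_mutex around a plain increment, this typically compiles to a single locked instruction on x86 and needs no mutex object to initialize or destroy.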

commit 9dcd6f05c4c3ff2ce7cd87a9951a96ebef22681e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 24 13:15:32 2016 -0500

Implemented developer-configurable malloc()/free().

Details:
- Replaced all instances of bli_malloc() and bli_free() with one of:
- bli_malloc_pool()/bli_free_pool()
- bli_malloc_user()/bli_free_user()
- bli_malloc_intl()/bli_free_intl()
each of which can be configured to call malloc()/free() substitutes,
so long as the substitute functions have the same function type
signatures as malloc() and free() defined by C's stdlib.h. The _pool()
function is called when allocating blocks for the memory pools (used
for packing buffers, primarily), the _user() function is called when
obj_t's are created (via bli_obj_create() and friends), and the _intl()
function is called for internal use by BLIS, such as when creating
control tree nodes or temporary buffers for manipulating internal data
structures. Substitutes for any of the three types of bli_malloc() may
be specified by defining the following pairs of cpp macros in
bli_kernel.h:
- BLIS_MALLOC_POOL/BLIS_FREE_POOL
- BLIS_MALLOC_USER/BLIS_FREE_USER
- BLIS_MALLOC_INTL/BLIS_FREE_INTL
to be the name of the substitute functions. (Obviously, the object
code that contains these functions must be provided at link-time.)
These macros default to malloc() and free(). Substitute functions are
also automatically prototyped by BLIS (in bli_malloc_prototypes.h).
- Removed definitions for bli_malloc() and bli_free().
- Note that bli_malloc_pool() and bli_malloc_user() are now defined in
terms of a new function, bli_malloc_align(), which aligns memory to an
arbitrary (power of two) alignment boundary, but does so manually,
whereas before alignment was performed behind the scenes by
posix_memalign(). Currently, bli_malloc_intl() is defined in terms
of bli_malloc_noalign(), which serves as a simple wrapper to the
designated function that is passed in (e.g. BLIS_MALLOC_INTL).
Similarly, there are bli_free_align() and bli_free_noalign(), which
are used in concert with their bli_malloc_*() counterparts.
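
Manual alignment on top of an arbitrary malloc()-compatible function is commonly done by over-allocating and stashing the original pointer just below the aligned address. The sketch below shows that common technique under hypothetical sketch_ names; it is not necessarily how bli_malloc_align() is written.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef void *(*sketch_malloc_ft)(size_t size);
typedef void  (*sketch_free_ft)(void *p);

/* Allocates size bytes aligned to align (a power of two) via malloc_fp. */
void *sketch_malloc_align(sketch_malloc_ft malloc_fp, size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    /* Keep the stashed pointer itself properly aligned. */
    if (align < sizeof(void *)) align = sizeof(void *);

    /* Room for the payload, worst-case padding, and the saved pointer. */
    unsigned char *raw = malloc_fp(size + align + sizeof(void *));
    if (raw == NULL) return NULL;

    uintptr_t first   = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (first + align - 1) & ~((uintptr_t)align - 1);

    ((void **)aligned)[-1] = raw;   /* remember what to hand back to free */
    return (void *)aligned;
}

/* Releases a block obtained from sketch_malloc_align() via free_fp. */
void sketch_free_align(sketch_free_ft free_fp, void *p)
{
    if (p != NULL) free_fp(((void **)p)[-1]);
}
```

Because the allocator is passed in as a function pointer, the same routine works whether the configured substitute is malloc() itself or a user-supplied replacement with the same signature.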

commit 9dd440109a9d964f5cd286e9f83c487ad703e1e4
Author: Jeff Hammond <jeff.sciencegmail.com>
Date: Sat May 21 15:21:58 2016 -0700

fix 404 link to BuildSystem

Google Code is dead. Long live GitHub!

commit d309f20b7376a68efa3b864ad790c2021c071655
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed May 18 15:13:53 2016 -0500

Added alignment switch to testsuite.

Details:
- Added a new input parameter to input.general that globally toggles
whether testsuite tests are performed on objects whose buffers and
leading dimensions have been aligned, and changed the implementation
of libblis_test_mobj_create() to employ alignment (or not) regardless
of whether row, column, or general storage is being tested.
- Updated configure script's "--help" text to indicate default behavior
for internal integer type size and BLAS/CBLAS integer type size
options.

commit 32db0adc218ea4ae370164dbe8d23b41cd3526d3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 17 15:20:16 2016 -0500

Generate prototypes for user-defined packm kernels.

Details:
- Created template prototypes for packm kernels (in bli_l1m_ker.h), and
then redefined reference packm kernels' prototyping headers in terms of
this template, as is already done for level-1v, -1f, and -3 kernels.
- Automatically generate prototypes for user-defined packm kernels in
bli_kernel_prototypes.h (using the new template prototypes in
bli_l1m_ker.h).
- Defined packm kernel function types in bli_l1m_ft.h, including for
packm kernels specific to induced methods, which are now used in
bli_packm_cxk.c and friends rather than using a locally-defined
function type.
- In bli_packm_cxk.c, extended the array of packm kernel function
pointers out to index 31 (from the previous maximum of 17). This allows us to
store the unrolled 30xk kernel in the array for use (on knc, for
example). Note: This should have been done a long time ago.
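
To illustrate why the array bound matters, here is a sketch of a kernel table indexed by panel dimension. All names, the kernel signature, and the NULL-means-no-kernel convention are assumptions for illustration rather than the contents of bli_packm_cxk.c.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_PANEL_DIM 31

typedef void (*sketch_packm_ker_ft)(size_t k,
                                    const double *restrict a,
                                    double *restrict p);

/* Indexed by panel dimension; entries default to NULL (no unrolled kernel). */
static sketch_packm_ker_ft sketch_packm_kers[SKETCH_MAX_PANEL_DIM + 1];

void sketch_packm_register(size_t panel_dim, sketch_packm_ker_ft ker)
{
    assert(panel_dim <= SKETCH_MAX_PANEL_DIM);
    sketch_packm_kers[panel_dim] = ker;
}

/* Returns the registered kernel, or NULL if the caller should fall back. */
sketch_packm_ker_ft sketch_packm_query(size_t panel_dim)
{
    return (panel_dim <= SKETCH_MAX_PANEL_DIM) ? sketch_packm_kers[panel_dim]
                                               : NULL;
}

/* Toy 30xk kernel: copies k contiguous columns of 30 elements each. */
void sketch_packm_30xk(size_t k, const double *restrict a, double *restrict p)
{
    for (size_t i = 0; i < 30 * k; ++i)
        p[i] = a[i];
}
```

With an upper bound of 17, a 30xk kernel would have had no slot in such a table; raising the bound to 31 gives it one.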

commit e3bd5ca64ae7c190ba689396c0de687b829a11fe
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu May 12 20:54:13 2016 -0500

Fix SIMD definitions in KNL config, and a couple of fixes to C update.

commit 4fe02e3d497995d94d34d3fcf5af895084cfc8b9
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu May 12 20:53:58 2016 -0500

Move bli_kernel.h before bli_threading.h in order of inclusion in blis.h.

commit 4bcf1b35abea3f3dfc8f2fe462dcf155cf199e55
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed May 11 16:09:49 2016 -0500

Fixed bli_get_range_*() bugs in trsm variants.

Details:
- Fixed incorrect calls to bli_get_range_*() from within trsm blocked
variants 1f, 2b, and 2f. The bug somehow went undetected since the
big commit (537a1f4), and, strangely, did not manifest via the BLIS
testsuite. The bug finally came to our attention when running the
libflame test suite while linking to BLIS. Thanks to Kiran Varaganti
for submitting the initial report that led to this bug.

commit 9cfa33023f123a6c17e987f72fba174ce073f0b6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed May 11 16:02:30 2016 -0500

Minor updates to bli_f2c.h.

Details:
- Added undef guards to certain define statements in bli_f2c.h,
and renamed the file guard to BLIS_F2C_H. This helps when
including "blis.h" from an application or library that already
includes an "f2c.h" header.

commit a09a2e23eacf5328858c8318bb637c5ff3b71d08
Merge: 4dcd37eb 7c604e1c
Author: Tyler Michael Smith <tmscs.utexas.edu>
Date: Wed May 11 10:47:11 2016 -0500

Merge pull request 76 from devinamatthews/move_simd_defs

Move default SIMD-related definitions to bli_kernel_macro_defs.h

commit 4dcd37eb1b12a6e08cc13df7b61391ef8363f5d8
Author: Tyler Smith <tmscs.utexas.edu>
Date: Tue May 10 16:28:59 2016 -0500

fixing knc simd align size

commit 619dee0daec3474b4e5a55df90a61aabcae194f2
Merge: b790b3d9 7c604e1c
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue May 10 12:13:24 2016 -0500

Merge branch 'move_simd_defs' into knl

commit 7c604e1cbc1609b6e12d3ee973c08b7af5035be4
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue May 10 12:11:55 2016 -0500

Move default SIMD-related definitions to bli_kernel_macro_defs.h. Otherwise, configurations which customize these fail as these are now defined in bli_kernel.h.

commit b790b3d9e1820f3b691676de48c291cae083452d
Merge: 4f8c05c9 a7be2d28
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue May 10 11:49:47 2016 -0500

Merge branch 'master' into knl

commit a7be2d28e8930b154d0da1d6929b54a96e210af6
Merge: 97b512ef 4b1e55ed
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 10 11:48:51 2016 -0500

Merge pull request 74 from devinamatthews/fix_common_symbols

Default-initialize all extern global variables to avoid generating common symbols.

commit 4b1e55edbfe0e1cb2e7b9428424903497cb7a841
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue May 10 10:08:47 2016 -0500

Default-initialize all extern global variables to avoid generating common symbols. Fixes 73.
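
For context, a file-scope definition without an initializer is a tentative definition in C, and some toolchains (for example GCC with -fcommon) emit it as a common symbol. A sketch of the pattern the commit describes, using a hypothetical variable name:

```c
#include <assert.h>
#include <stdbool.h>

/* Header side: declaration only. */
extern bool sketch_is_initialized;

/* Source side: the explicit initializer makes this a full definition
   rather than a tentative one, so no common symbol is generated. */
bool sketch_is_initialized = false;

void sketch_init(void)
{
    assert(!sketch_is_initialized);  /* guard against double initialization */
    sketch_is_initialized = true;
}
```

The runtime value is the same either way, since objects with static storage duration start at zero; only the symbol type in the object file changes.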

commit 97b512ef62c7e25c97ed5e9eca81cd7015b2ac91
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri May 6 10:24:30 2016 -0500

Include headers from cblas.h to pull in f77_int.

Details:
- Added include statements for certain key BLIS headers so that the
definition of f77_int is pulled in when a user compiles application
code with only include "cblas.h" (and no other BLIS header). This
is necessary since f77_int is now used within the cblas API.

commit c3a4d39d03665135f1616588b5ef7c3e9ef5688d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed May 4 17:22:56 2016 -0500

Updates to haswell gemm micro-kernels.

Details:
- Added two new sets of [sd]gemm micro-kernels for haswell architectures,
one that is 4x24/4x12 (s and d) and one that is 6x16/6x8.
- Changed the haswell configuration to use the 6x16/6x8 micro-kernels
by default.
- Updated various Makefiles, in test, test/3m4m, and testsuite.

commit 0b01d355ae861754ae2da6c9a545474af010f02e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 27 15:21:10 2016 -0500

Miscellaneous cleanups, fixes to recent commits.

Details:
- Fixed a typo in bli_l1f_ref.h, introduced into bbb8569, that only
manifested when non-reference level-1f kernels were used.
- Added an undef BLIS_SIMD_ALIGN_SIZE to bli_kernel.h of dunnington
configuration to prevent a compile-time warning until I can figure out
the proper permanent fix.
- Moved frame/1f/kernels/bli_dotxaxpyf_ref_var1.c out of the compilation
path (into 'other' directory). _ref_var2 is used by default, which is
the variant that is built on axpyf and dotxf instead of dotaxpyv.
- Removed section of frame/include/bli_config_macro_defs.h pertaining to
mixed datatype support.

commit ed7326c836f427e2f8420b015220ce293207b10c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 27 14:57:40 2016 -0500

Added 'restrict' to l1v/l1f code in 'kernels' dir.

Details:
- Added 'restrict' keyword to existing kernel definitions in 'kernels'
directory. These changes were meant for inclusion in bbb8569.

commit bbb8569b2a08c3bcd631d5a05eb389d01d94ac07
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 27 14:13:46 2016 -0500

Use 'restrict' in all kernel APIs; wspace changes.

Details:
- Updated level-1v, level-1f kernel function types (bli_l1?_ft.h) and
generic kernel prototypes (bli_l1?_ker.h) to use 'restrict' for all
numerical operand pointers (ie: all pointers except the cntx_t).
- Updated level-1f reference kernel definitions to use 'restrict' for
all numerical operand pointers. (Level-1v reference kernel definitions
were already updated in bdbda6e.)
- Rewrote the level-1v and level-1f reference kernel prototypes in
bli_l1v_ref.h and bli_l1f_ref.h, respectively, to simply include
bli_l1v_ker.h and bli_l1f_ker.h with redefined function base names
(as was already being done for the level-3 micro-kernel prototypes
in bli_l3_ref.h), rather than duplicate the signatures from the
_ker.h files.
- Added definitions to frame/include/bli_kernel_prototypes.h for axpbyv
and xpbyv, which were probably meant for inclusion in bdbda6e.
- Converted a number of instances of four spaces, as introduced in
bdbda6e, to tabs.
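
A simplified sketch of a kernel function type and a matching reference kernel that use 'restrict' on every numerical operand pointer. The names and the reduced signature (unit stride, no conjugation, no cntx_t) are assumptions for illustration and do not match the actual BLIS prototypes.

```c
#include <assert.h>
#include <stddef.h>

/* Function type: every numerical operand pointer is restrict-qualified. */
typedef void (*sketch_daxpyv_ft)(size_t n,
                                 const double *restrict alpha,
                                 const double *restrict x,
                                 double *restrict y);

/* Reference kernel matching the type: y := y + alpha * x. */
void sketch_daxpyv_ref(size_t n,
                       const double *restrict alpha,
                       const double *restrict x,
                       double *restrict y)
{
    assert(n == 0 || (alpha != NULL && x != NULL && y != NULL));
    for (size_t i = 0; i < n; ++i)
        y[i] += (*alpha) * x[i];
}
```

The qualifier is a promise from the caller that the operands do not overlap, which lets the compiler keep *alpha in a register and vectorize the loop without runtime alias checks.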

commit 4ea419c72c789825e1f93a1eee88219bbf873930
Merge: f1e9be2a bdbda6e6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 26 12:50:45 2016 -0500

Merge pull request 70 from devinamatthews/daxpby

Give the level1v operations some love

commit bdbda6e6acc682ab1b6ca680edebd09ae12a832c
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Apr 25 11:05:57 2016 -0500

Give the level1v operations some love:

- Add missing axpby and xpby operations (plus test cases).
- Add special case for scal2v with alpha=1.
- Add restrict qualifiers.
- Add special-case algorithms for incx=incy=1.
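
A real-domain sketch of axpby (y := beta * y + alpha * x) with a separate unit-stride path, in the spirit of the items above. The name, the by-value scalars, and the restriction to positive strides are simplifying assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* y := beta * y + alpha * x over n elements with strides incx and incy. */
void sketch_daxpbyv(size_t n,
                    double alpha, const double *restrict x, size_t incx,
                    double beta,  double *restrict y,       size_t incy)
{
    assert(incx > 0 && incy > 0);
    if (incx == 1 && incy == 1) {
        /* Unit-stride special case: contiguous access, easy to vectorize. */
        for (size_t i = 0; i < n; ++i)
            y[i] = beta * y[i] + alpha * x[i];
    } else {
        /* General strided case. */
        for (size_t i = 0; i < n; ++i)
            y[i * incy] = beta * y[i * incy] + alpha * x[i * incx];
    }
}
```

Setting alpha to 1 gives the xpby operation, and setting beta to 1 recovers the familiar axpy.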

commit f1e9be2aba1a057eedb947bbae96848597777408
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 22 15:34:02 2016 -0500

Minor tweak to test/Makefile.

Details:
- Just committing a minor change to test/Makefile that has been lingering
in my local working copy for longer than I can remember.

commit aa0bceec277938328dabeb744680623f24fb0b61
Merge: 4136553f e2784b4c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 22 12:01:31 2016 -0500

Merge branch 'master' of github.com:flame/blis

commit 4136553f0d0661a668dfdb9edcd7ce1c5773dde7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 22 11:53:53 2016 -0500

Clear level-3 cntx_t's via memset() before use.

Details:
- In all level-3 operations' _cntx_init() functions, replaced calls to
bli_cntx_obj_init() with calls to bli_cntx_obj_clear(), and in all
level-3 operations' _cntx_finalize() functions, removed calls to
bli_cntx_obj_finalize(), leaving those function definitions empty.
- Changed the definition of bli_cntx_obj_clear() so that the clearing
occurs via a single call to memset().
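
Clearing a structure with a single memset() call, as described above, can be sketched like this. The struct layout and names are hypothetical stand-ins for cntx_t and bli_cntx_obj_clear().

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a context object. */
typedef struct {
    void  *ukrs[4];     /* kernel addresses */
    size_t blkszs[4];   /* blocksizes */
    int    method;
} sketch_cntx_t;

/* Zeroes every field at once instead of assigning them one by one. */
void sketch_cntx_clear(sketch_cntx_t *cntx)
{
    assert(cntx != NULL);
    memset(cntx, 0, sizeof(*cntx));
}
```

This relies on all-bits-zero being a valid null pointer and zero integer, which holds on mainstream platforms, and it stays correct when fields are later added to the struct.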

commit 4f8c05c9e2ef4cbb82b35a3ebf1f0a0ac665830e
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Apr 21 10:00:59 2016 -0500

Rearrange KNL dgemm kernel again to streamline usage of ymm register. sgemm and dgemm now both working with Intel SDE.

commit e2784b4c921f706e756df3e146e20a4cb63f53e3
Merge: dd0ab1d9 a9b6c3ab
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 20 18:34:09 2016 -0500

Merge pull request 67 from devinamatthews/cblas-f77-int

Change CBLAS integer type to f77_int

commit a9b6c3abda6222a8b240361643932e83cf726c4f
Merge: e4c54c81 dd0ab1d9
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Apr 20 16:00:10 2016 -0500

Merge remote-tracking branch 'origin/master' into cblas-f77-int

Conflicts:
config/haswell/bli_config.h

commit e4c54c81463c2a19c9bb6b1f0f1be3fa9d018a45
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Apr 20 15:56:46 2016 -0500

Change integer type in CBLAS function signatures to f77_int, and add proper const-correctness to BLAS layer.

commit dd0ab1d93f33abca6af9edd7b8e52da62dcfa5b1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 20 14:38:23 2016 -0500

Converted some bli_cntx query functions to macros.

Details:
- Commented out several datatype-aware query functions (those ending in
_dt) from bli_cntx.c, as well as their prototypes in bli_cntx.h, and
added equivalent cpp query macros to bli_cntx.h.
- Added 'bli_config.h' to .gitignore.

commit 7193230f7d35edbd1d2f77842a613971f1603463
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Apr 20 09:37:30 2016 -0500

Work around missing VPMULLQ on KNL.

commit a30ccbc4c6a6e6460e78af6b5c530ee0d06f98fb
Merge: eb2f18e4 0e1a9821
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 19 15:04:33 2016 -0500

Merge pull request 66 from devinamatthews/blas-configure

Add configure options and generate bli_config.h automatically.

commit bd44cf13e886069bc66c10ac0db178be96629a0d
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Apr 19 13:43:04 2016 -0500

Fix copy-paste errors in KNL kernels.

commit eb2f18e4844d985715df20798f50f9cc12e3b5ad
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 19 12:50:32 2016 -0500

More compile-time fixes to bgq gemm ukernel code.

commit 0e1a9821d860f6c1d818baf4c48d21a23726c132
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Apr 19 11:44:37 2016 -0500

Add configure options and generate bli_config.h automatically.

Options to configure have been added for:
- Setting the internal BLIS and BLAS/CBLAS integer sizes.
- Enabling and disabling the BLAS and CBLAS layers.

Additionally, configure options which require defining macros (the above plus the threading model), write their macros to the automatically-generated bli_config.h file in the top-level build directory. The old bli_config.h files in the config dirs were removed, and any kernel-related macros (SIMD size and alignment etc.) were moved to bli_kernel.h. The Makefiles were also modified to find the new bli_config.h file.

Lastly, support for OMP in clang has been added (closes 56).

commit a11eec05928ddc5c43fa5dbcd35f2edd24ff35a1
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Apr 18 13:13:36 2016 -0500

Add sgemm ukernels for KNL. vpmullq is not implemented on KNL -- needs workaround.

commit ff84469a4575f1ef8a0010046fde52240a312cae
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 18 12:29:09 2016 -0500

Applied various compilation fixes to bgq kernels.

commit c38e0dab05b2dc36672eab96e1248fb7fb2d785b
Merge: bd5e2296 cbcd0b73
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Apr 18 10:21:35 2016 -0500

Merge remote-tracking branch 'origin/master' into knl

commit bd5e2296e98e042c31f1e8ece2c1ca8e4bdc2d4c
Merge: 4745def0 49f85177
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Apr 18 10:15:22 2016 -0500

Merge remote-tracking branch 'origin/knl' into knl

commit 4745def0c87377ae83ad73ac514d7de08a96b2ac
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Apr 18 10:15:05 2016 -0500

Add 64-bit offset vector so we can use vgatherqpd.

commit 49f85177f886f38889b60503a4e12fa7f04be1fd
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Apr 18 10:14:11 2016 -0500

KNL ukernel compiles with gcc.

commit cbcd0b739dc54bd14fbb46aeda267c26725cd70f
Author: Tyler Michael Smith <tmscs.utexas.edu>
Date: Mon Apr 18 03:12:57 2016 -0500

Changing ifdef for OSX pthread barriers

commit 58b2c3cf040134d1be913c585a3c6905629116c0
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Sat Apr 16 16:12:24 2016 -0500

Rewrite of KNL kernel in GNU extended asm syntax.

commit dd62080cea78f3a23616200d6640e52c102b2bb9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 15 11:15:41 2016 -0500

Compile-time fix to bgq l1f kernels.

Details:
- Fixed an old reference to bli_daxpyf_fusefac, which no longer exists,
by replacing it with the axpyf fusing factor (8), and cleaned up the
relevant section of config/bgq/bli_kernel.h.
- Removed most of the details of the level-3 kernels from the template
kernel code in config/template/kernels/3 and replaced it with a
reference to the relevant kernel wiki maintained on the BLIS github
website.

commit d5a915dd8d7a6ead42a68772e4420eb3647e6f1a
Merge: 4320b725 41694675
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Apr 14 12:56:36 2016 -0500

Merge branch 'master' of github.com:flame/blis

commit 4320b725a1f8fd34101470b6cf52ad504a79c517
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Apr 14 12:51:29 2016 -0500

Use kernel CFLAGS on "ukernels" directories.

Details:
- Updated the top-level Makefile so that the CFLAGS variable designated
for kernel source code is applied not only to source code in
directories named "kernels" but source code in any directory that
contains the substring "kernels", such as "ukernels".
- Formally disabled some code in gen-make-frag.sh script that was already
effectively disabled. The code was related to handling "noopt" and
"kernel" directories, which is now handled independently within the
top-level Makefile without needing to place these source files into
a separate makefile variable.

commit 41694675e4cb56e2e0323c7a7db48e0819606a31
Author: Tyler Smith <tmscs.utexas.edu>
Date: Wed Apr 13 15:51:08 2016 -0500

pthreads bugfixes

Getting pthreads to work on my Mac
Implemented a pthread barrier when _POSIX_BARRIER isn't defined
Now spawn n-1 threads instead of n threads so that master thread isn't just spinning the whole time
Add -lpthread instead of -pthread to LDFLAGS (for clang)
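
On platforms without POSIX barriers, a barrier is commonly built from a mutex and a condition variable. The sketch below shows that standard construction under hypothetical names; the commit does not describe how its fallback barrier is implemented.

```c
#include <assert.h>
#include <pthread.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    unsigned        n_threads;   /* threads that must arrive */
    unsigned        n_waiting;   /* threads that have arrived so far */
    unsigned        generation;  /* bumped each time the barrier opens */
} sketch_barrier_t;

void sketch_barrier_init(sketch_barrier_t *b, unsigned n_threads)
{
    assert(b != NULL && n_threads > 0);
    pthread_mutex_init(&b->mutex, NULL);
    pthread_cond_init(&b->cond, NULL);
    b->n_threads  = n_threads;
    b->n_waiting  = 0;
    b->generation = 0;
}

void sketch_barrier_wait(sketch_barrier_t *b)
{
    pthread_mutex_lock(&b->mutex);
    unsigned my_generation = b->generation;
    if (++b->n_waiting == b->n_threads) {
        /* Last arrival: reset the count and release everyone. */
        b->n_waiting = 0;
        b->generation++;
        pthread_cond_broadcast(&b->cond);
    } else {
        /* The generation check guards against spurious wakeups. */
        while (my_generation == b->generation)
            pthread_cond_wait(&b->cond, &b->mutex);
    }
    pthread_mutex_unlock(&b->mutex);
}

void sketch_barrier_destroy(sketch_barrier_t *b)
{
    pthread_cond_destroy(&b->cond);
    pthread_mutex_destroy(&b->mutex);
}
```

The generation counter also makes the barrier reusable, since a thread that races ahead into the next barrier episode cannot be confused with one still leaving the previous episode.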

commit f756dbfa0d542cbc497724981520c83abf049c4b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 13 11:25:33 2016 -0500

Removed stale include from bgq configuration.

Details:
- Removed an old include statement ("bli_gemm_8x8.h") from the
bli_kernel.h file in the bgq configuration. It turns out this
file was no longer needed even prior to 537a1f4.

commit 0bd4169ea75f690714e7d2912229932a75d8a7e2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 11 18:08:32 2016 -0500

Fixed context-broken dunnington/penryn kernels.

Details:
- Added missing context parameters to several instances where simpler
kernels, or reference kernels, are called instead of executing the
main body code contained in the kernel function in question.
- Renamed axpyv and dotv kernel files to use "opt" instead of "int"
substring, for consistency with level-1f kernels.

commit 7912af5db45b7372d19a9a3dfeb82df302a05628
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 11 17:32:13 2016 -0500

CHANGELOG update (0.2.0)

0.2.0

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 11 17:32:09 2016 -0500

Version file update (0.2.0)

commit 537a1f4f85ce1aa008901857cb3182e6b4546d7f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 11 17:21:28 2016 -0500

Implemented runtime contexts and reorganized code.

Details:
- Retrofitted a new data structure, known as a context, into virtually
all internal APIs for computational operations in BLIS. The structure
is now present within the type-aware APIs, as well as many supporting
utility functions that require information stored in the context. User-
level object APIs were unaffected and continue to be "context-free,"
however, these APIs were duplicated/mirrored so that "context-aware"
APIs now also exist, differentiated with an "_ex" suffix (for "expert").
These new context-aware object APIs (along with the lower-level, type-
aware, BLAS-like APIs) contain the address of a context as a last
parameter, after all other operands. Contexts, or specifically, cntx_t
object pointers, are passed all the way down the function stack into
the kernels and allow the code at any level to query information about
the runtime, such as kernel addresses and blocksizes, in a thread-
friendly manner--that is, one that allows thread-safety, even if the
original source of the information stored in the context changes at
run-time (see next bullet for more on this "original source" of info).
(Special thanks go to Lee Killough for suggesting the use of this kind
of data structure in discussions that transpired during the early
planning stages of BLIS, and also for suggesting such a perfectly
appropriate name.)
- Added a new API, in frame/base/bli_gks.c, to define a "global kernel
structure" (gks). This data structure and API will allow the caller to
initialize a context with the kernel addresses, blocksizes, and other
information associated with the currently active kernel configuration.
The currently active kernel configuration within the gks cannot be
changed (for now), and is initialized with the traditional cpp macros
that define kernel function names, blocksizes, and the like. However,
in the future, the gks API will be expanded to allow runtime management
of kernels and runtime parameters. The most obvious application of this
new infrastructure is the runtime detection of hardware (and the
implied selection of appropriate kernels). With contexts in place,
kernels may even be "hot swapped" at runtime within the gks. Once
execution enters a level-3 _front() function, the memory allocator will
be reinitialized on-the-fly, if necessary, to accommodate the new
kernels' blocksizes. If another application thread is executing with
another (previously loaded) kernel, it will finish in a deterministic
fashion because its kernel information was loaded into its context
before computation began, and also because the blocks it checked out
from the internal memory pools will be unaffected by the newer threads'
reinitialization of the allocator.
- Reorganized and streamlined the 'ind' directory, which contains much of
the code enabling use of induced methods for complex domain matrix
multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as
those APIs' functionality is now mostly subsumed within the global
kernel structure.
- Updated bli_pool.c to define a new function, bli_pool_reinit_if(),
that will reinitialize a memory pool if the necessary pool block size
has increased.
- Updated bli_mem.c to use bli_pool_reinit_if() instead of
bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed
usage of contexts where appropriate to communicate cache and register
blocksizes to bli_mem_compute_pool_block_sizes().
- Simplified control trees now that much of the information resides in
the context and/or the global kernel structure:
- Removed blocksize object pointer (blksz_t*) fields from all control
tree node definitions and replaced them with blocksize id (bszid_t)
values instead, which may be passed into a context query routine in
order to extract the corresponding blocksize from the given context.
- Removed micro-kernel function pointer (func_t*) fields from all
control tree node definitions. Now, any code that needs these function
pointers can query them from the local context, as identified by a
level-3 micro-kernel id (l3ukr_t), level-1f kernel id (l1fkr_t), or
level-1v kernel id (l1vkr_t).
- Removed blksz_t object creation and initialization, as well as kernel
function object creation and initialization, from all operation-
specific control tree initialization files (bli_*_cntl.c), since this
information will now live in the gks and, secondarily, in the context.
- Removed blocksize multiples from blksz_t objects. Now, we track
blocksize multiples for each blocksize id (bszid_t) in the context
object.
- Removed the bool_t's that were required when a func_t was initialized.
These bools are meant to allow one to track the micro-kernel's storage
preferences (by rows or columns). This preference is now tracked
separately within the gks and contexts.
- Merged and reorganized many separate-but-related functions into single
files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and
util directories, but has the most obvious effect of allowing BLIS
to compile noticeably faster.
- Reorganized execution paths for level-1v, -1d, -1m, and -2 operations
in an attempt to reduce overhead for memory-bound operations. This
includes removal of default use of object-based variants for level-2
operations. Now, by default, level-2 operations will directly call a
low-level (non-object based) loop over a level-1v or -1f kernel.
- Converted many common query functions in bli_blksz.c (renamed from
bli_blocksize.c) and bli_func.c into cpp macros, now defined in their
respective header files.
- Defined bli_mbool.c API to create and query "multi-bools", or
heterogeneous bool_t's (one for each floating-point datatype), in the
same spirit as blksz_t and func_t.
- Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS
and BLIS_SIMD_SIZE. These values are needed in order to compute a third
new parameter, which may be set indirectly via the aforementioned
macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to
statically allocate memory in macro-kernels and the induced methods'
virtual kernels to be used as temporary space to hold a single
micro-tile. These values are now output by the testsuite. The default
value of BLIS_STACK_BUF_MAX_SIZE is computed as
"2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE".
- Cleaned up the top-level 'kernels' directory (for example, renaming the
embarrassingly misleading "avx" and "avx2" directories to "sandybridge"
and "haswell," respectively) and gave more consistent and meaningful
names to many kernel files (as well as updating their interfaces to
conform to the new context-aware kernel APIs).
- Updated the testsuite to query blocksizes from a locally-initialized
context for test modules that need those values: axpyf, dotxf,
dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr.
- Reformatted many function signatures into a standard format that will
more easily facilitate future API-wide changes.
- Updated many "mxn" level-0 macros (i.e., those used to inline double loops
for level-1m-like operations on small matrices) in frame/include/level0
to use more obscure local variable names in an effort to avoid variable
shadowing. (Thanks to Devin Matthews for pointing out these gcc warnings,
which are only output using -Wshadow.)
- Added a conj argument to setm, so that its interface now mirrors that
of scalm. The semantic meaning of the conj argument is to optionally
allow implicit conjugation of the scalar prior to being populated into
the object.
- Deprecated all type-aware mixed domain and mixed precision APIs. Note
that this does not preclude supporting mixed types via the object APIs,
where it produces absolutely zero API code bloat.
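
As an illustration of the first bullet above (context-free APIs mirrored by "_ex" variants that take a context pointer as the last parameter), here is a minimal sketch in C. Every name in it (demo_cntx_t, demo_scal, demo_scal_ex, demo_default_cntx, demo_scal_ref) is hypothetical; it shows only the layering, not BLIS's actual cntx_t or function signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel signature: scale n elements of x by alpha. */
typedef void (*demo_scal_kernel_t)(size_t n, double alpha, double *x);

/* Hypothetical stand-in for a runtime context: a private snapshot of a
   kernel address and a blocksize. This is NOT the real cntx_t. */
typedef struct {
    demo_scal_kernel_t scal_kernel;
    size_t             blocksize;
} demo_cntx_t;

/* Reference kernel used by the default configuration. */
static void demo_scal_ref(size_t n, double alpha, double *x)
{
    size_t i;
    for (i = 0; i < n; ++i) x[i] *= alpha;
}

/* Returns a context by value, so the caller owns a copy that does not
   change even if the global defaults are later modified. */
static demo_cntx_t demo_default_cntx(void)
{
    demo_cntx_t c = { demo_scal_ref, 4 };
    return c;
}

/* Context-aware ("expert") API: the context is the last parameter and
   is the only source of kernel and blocksize information. */
static void demo_scal_ex(size_t n, double alpha, double *x,
                         const demo_cntx_t *cntx)
{
    size_t i;
    assert(cntx != NULL && cntx->blocksize > 0);
    for (i = 0; i < n; i += cntx->blocksize) {
        size_t len = (n - i < cntx->blocksize) ? n - i : cntx->blocksize;
        cntx->scal_kernel(len, alpha, x + i);
    }
}

/* Context-free API: builds a local context and forwards to the _ex
   variant, so existing callers need no changes. */
static void demo_scal(size_t n, double alpha, double *x)
{
    demo_cntx_t cntx = demo_default_cntx();
    demo_scal_ex(n, alpha, x, &cntx);
}
```

A caller that does not care about contexts keeps calling demo_scal(); a caller that wants a different kernel or blocksize fills in its own demo_cntx_t and calls demo_scal_ex() directly.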

commit dd856c2cb75a2221a503a73dde27790c34b91570
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Apr 11 10:39:18 2016 -0500

Translated MIC kernel to KNL and cleaned up a bit. Only real change is lack of swizzle modifiers for FMA instructions (used bcast from memory instead).

commit 7f27431d3fffdda99c282ec412731d0a90cb32a7
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Apr 8 10:04:39 2016 -0500

Copy mic kernel to knl for transliteration.

commit f8f02f0334ac020021e15a415bcd33aeea01deb4
Merge: 32c92d94 d1f8e5d9
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Apr 6 11:37:05 2016 -0500

Merge branch 'master' into const_correctness

commit 32c92d945c55708da0eb63be1771f8c5430e3910
Merge: 62914ccb 20af937b
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Apr 6 11:36:02 2016 -0500

Merge branch 'master' into const_correctness

commit d1f8e5d9b2ecd054ed103f4d642d748db2d4f173
Merge: 20af937b c11d28ee
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 5 12:21:27 2016 -0500

Merge pull request #60 from esauvage/master

sgemm µkernel for bulldozer : bug correction for k%4 != 0

commit c11d28eed89d65494bc4019f04d046520866c0ff
Author: Etienne Sauvage <etienne.sauvagegmail.com>
Date: Sat Apr 2 21:15:48 2016 +0200

cgemm µkernel for bulldozer : bug correction for k%4 != 0

commit 20af937b57f82bb3acb09418d5c0206e1b24f2c7
Merge: 36c3abb0 fc61a114
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 31 14:37:30 2016 -0500

Merge pull request #59 from devinamatthews/fix_testsuite_makefile

Fix testsuite makefile

commit fc61a1143edeba4946d4b9915f1775bb08e643fc
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Mar 31 10:53:01 2016 -0500

Fix formatting in configure.

commit 26379b14de630e3a6c6eef5dfe87ff001558a8a6
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Mar 31 10:45:48 2016 -0500

Adjust paths in common.mk to support building from testsuite dir.

commit 36c3abb05fecb02d4a9ab13b2b69d133adf34583
Merge: 64b41fa5 917ce754
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 31 10:26:17 2016 -0500

Merge pull request #58 from esauvage/master

cgemm & zgemm micro-kernels for FMA4 instruction set (bulldozer confi…

commit 356d854fc9e34642cc46e0e02a8ceb56114878af
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Mar 30 16:33:15 2016 -0500

Make symlink to common.mk in build directory.

commit edbb8470044f82ef959583ee09613a5a985292b5
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Mar 30 16:27:11 2016 -0500

Refactor out some definitions which moved from make_defs.mk to Makefile for use in testsuite Makefile.

commit 917ce75482a543fef46553efff6c246939761e59
Author: Etienne Sauvage <etienne.sauvagegmail.com>
Date: Wed Mar 30 22:03:09 2016 +0200

cgemm & zgemm micro-kernels for FMA4 instruction set (bulldozer configuration), based on x86_64/avx micro-kernel

commit 62914ccbcdb3c594f065dcfa65bd7e7b95c79283
Merge: bbf704bf 64b41fa5
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Mar 29 15:24:25 2016 -0500

Merge branch 'master' into const_correctness

commit 64b41fa554dff44b2f9ad48901b67c63836407a8
Merge: 1b09e343 0171ad58
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 29 15:19:41 2016 -0500

Merge pull request #54 from devinamatthews/more_config_opts

More config opts

commit 1b09e343dfe5b48b4842e2cb96f41c8cc249bad0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 29 12:55:28 2016 -0500

Updated gcc version from 4.8 to 4.9 in .travis.yml.

commit 0171ad58997b3a5a9b76301511dbe0751fffc940
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Mar 28 13:55:06 2016 -0500

Add icc and clang support for Intel architectures, fixes #47. 2bd036f fixes #49 BTW.

commit 3090fff64cc87ff2519a09f38e6b8699cf3cba11
Merge: 8624e365 4ca5d5b1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 28 12:36:25 2016 -0500

Merge pull request #44 from esauvage/master

sgemm micro-kernel for FMA4 instruction set

commit e6e566426ac3ded7ef87cd8ff9be98accfdc4acc
Merge: 469429ec 8624e365
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Sat Mar 26 14:10:15 2016 -0500

Merge branch 'master' into more_config_opts

commit 8624e36543160739d954c4dbcc5a5594458f3a12
Merge: a315833f 2bd036f1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Mar 26 13:56:28 2016 -0500

Merge pull request #50 from devinamatthews/fix_noopt_avx

Fix configuration issue where instruction set flags are not specified for debug builds.

commit 469429ec34e5b1a172ce35596f9c7afdaacac131
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Mar 25 20:45:41 2016 -0500

Fix LD_FLAGS -> LDFLAGS.

commit 8442d65c9ead0376fc5f2dfad62fd4862ab9b2b3
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Mar 25 20:06:48 2016 -0500

Replace -march=native with specific architecture flags to support cross-compiling, and add icc support for Intel architectures.

commit 76099f20be1b49ac960f7e3c5a8296bbf4e1782d
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Mar 25 17:22:58 2016 -0500

Add threading option to configure.

commit ad43eab4c7899d56d8d7caa6e2d92bc0581ea5a5
Merge: 9452bdb3 2bd036f1
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Mar 25 15:00:02 2016 -0500

Merge branch 'fix_noopt_avx' into more_config_opts

commit 9452bdb3afbf2d7f898134a091d7790817e7be9c
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Mar 25 14:59:50 2016 -0500

Add options for verbose make output and static/shared linking to configure.

commit 2bd036f1f9ce1ee0864365557f66d9415dd42de3
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Mar 25 12:16:49 2016 -0500

Fix configuration issue where instruction set flags are not specified for debug builds.

commit bbf704bf7501411964a63a68f1af541f612cf92d
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Mar 25 09:55:35 2016 -0500

Add missing const to bli_read_nway_from_env.

commit a315833f067944fb0bc14cf60f0c7dcb5dc897b6
Merge: 1d1a426d af92773f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 24 12:30:21 2016 -0500

Merge pull request #48 from figual/master

Updated and improved ARMv8 micro-kernels.

commit af92773f4f85a2441fe0c6e3a52c31b07253d08e
Author: figual <figualucm.es>
Date: Wed Mar 23 22:07:02 2016 +0100

Updated and improved ARMv8 micro-kernels.

commit a4d7729776d17d9bdf2341eacd70b9770b9ba8d2
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Mar 21 09:55:21 2016 -0500

Set default value for debug_type variable.

commit 0e2447fa55d8c5fa2b1fc4150073512495c5f9eb
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Mar 17 16:32:05 2016 -0500

Add const correctness to auxinfo_t struct (microkernels need update theoretically).

commit 1d1a426d18ec03754021456862a1f4d1dfec1fbf
Merge: 5a978fff d226dfa0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 7 15:17:53 2016 -0600

Merge pull request #46 from devinamatthews/new-config-opts

Add several changes to the build system.

commit d226dfa05190eb477b33563b1edccf8603973336
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Sat Mar 5 16:18:14 2016 -0600

Add several changes to the build system.

1) Add -- options.
2) Add -d/--enable-debug option to enable debugging symbols with and without optimization.
3) Allow user to specify CC at configure time, and determine vendor (gcc/icc/etc.). For now configurations enforce a particular vendor.
4) Add make V=[0,1] option to control build verbosity.

commit 5a978fffdb8f09a81c89541d541d4a6830cd70a4
Merge: adb2b4e0 63e26423
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Mar 4 17:26:58 2016 -0600

Merge pull request #45 from devinamatthews/high_prec_timers

Use clock_gettime(CLOCK_MONOTONIC) and mach_absolute_time instead of gettimeofday

commit 63e264239053b913164a849dd8a45829087eaddc
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Mar 4 13:17:50 2016 -0600

Make sure that -lrt is linked on Linux.

commit 44fddd48dc1708a956803d1948f04429ec0d8700
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Mar 4 12:36:38 2016 -0600

Add missing \.

commit 7cabd2131f953de23e7015d760b0ddfda51b1251
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Mar 3 11:43:07 2016 -0600

Use clock_gettime(CLOCK_MONOTONIC) and mach_absolute_time instead of gettimeofday.
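
For readers unfamiliar with the change, a monotonic wall-clock timer on POSIX systems can be sketched as follows. The helper name demo_monotonic_seconds is hypothetical, and the mach_absolute_time path used on macOS is not shown; this is only an illustration of the clock_gettime(CLOCK_MONOTONIC) call named in the commit, not the code the commit added.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* expose clock_gettime under strict -std= modes */
#endif
#include <assert.h>
#include <time.h>

/* Hypothetical helper: seconds elapsed since an arbitrary fixed point.
   Unlike gettimeofday(), CLOCK_MONOTONIC is not affected by changes to
   the system's wall-clock time, so differences between two readings
   are safe to use as timing intervals. */
static double demo_monotonic_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc; /* silence unused-variable warnings when NDEBUG is defined */
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1.0e-9;
}
```

Timing a region then amounts to subtracting two readings. On older glibc versions clock_gettime lives in librt, which is presumably why the neighboring commit makes sure -lrt is linked on Linux.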

commit adb2b4e096c78e8b2f85fd372cf0d5eb04af5be8
Author: Tyler Smith <tmscs.utexas.edu>
Date: Wed Mar 2 14:48:12 2016 -0600

Fixing guard for non implemented partitioning through packed matrices

commit 4ca5d5b1fd6f2e4a8b2e139c5405475239581e51
Author: Etienne Sauvage <etienne.sauvagegmail.com>
Date: Tue Mar 1 21:33:01 2016 +0100

sgemm micro-kernel for FMA4 instruction set (bulldozer configuration), based on x86_64/avx micro-kernel

commit 627d59b5ba06866b26f46e4434a0435b600925e3
Author: Etienne Sauvage <etienne.sauvagegmail.com>
Date: Mon Feb 29 21:53:12 2016 +0100

symbolic link for bulldozer configuration to kernels

commit 2dc5c0ae038ed175fab85751803ada05734d1ba1
Merge: f2809fc5 3d0fae81
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 29 12:22:51 2016 -0600

Merge pull request #40 from tkelman/bulldozer-symlink

Add symlink from config/bulldozer/kernels to kernels/x86_64/bulldozer

commit f2809fc5f74466c755da6a5b4632853e634060b5
Merge: f86b94f2 8624a33c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Feb 27 13:06:03 2016 -0600

Merge pull request #39 from devinamatthews/fix_f2c_conflicts

Devin's f2c type namespace update.

Details:
- Added "bla_" prefix to f2c type names to prevent conflicts with external user code.
- Removed most of the body of bli_f2c.h, which was unused.

commit 3d0fae810d942085d8f2d389820b4e0027577db8
Author: Tony Kelman <tonykelman.net>
Date: Thu Feb 25 23:24:03 2016 -0800

Add symlink from config/bulldozer/kernels to kernels/x86_64/bulldozer

to fix linking issue mentioned in #37 and https://groups.google.com/forum/#!topic/blis-devel/iypwljcaeEI

commit 8624a33ccc12dff6f6c4f92992ca5636af1576a6
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Feb 25 13:51:26 2016 -0600

Fix remaining f2c conflicts.

commit 372eef0b6c0a535bf88d4b46b72f61266e8491ba
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Feb 25 12:01:58 2016 -0600

Fixed most conflicts after hack-n-slash of bli_f2c.h, cleanup in
progress.

commit f86b94f206e2e09fa3221cc55c3dc5b05ca4775a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Feb 23 18:12:34 2016 -0600

Included missing blas2blis integer def to CBLAS.

Details:
- Added include "bli_config_macro_defs" to all cblas_*.c files in
compat/cblas/src. This has the effect of defining
BLIS_BLAS2BLIS_INT_TYPE_SIZE to the default value if bli_config.h does
not define it. Thanks to Tony Kelman for reporting this bug.
- In cblas_i?amax.c, changed the type of the variable 'iamax' from 'int'
to 'f77_int'. This eliminates a compiler warning and a potential
runtime bug and/or crash when the size of an int differs from the size
of f77_int (as determined by BLIS_BLAS2BLIS_INT_TYPE_SIZE).

commit 0b126de1342c11c65623bcb38e258e21e9244e3d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 13 16:29:12 2015 -0600

Consolidated packm_blk_var1 and packm_blk_var2.

Details:
- Consolidated the two blocked variants for packm into a single
implementation (packm_blk_var1) and removed the other variant.
- Updated all induced method _cntl_init() functions in frame/cntl/ind/
to use the new blocked variant 1.
- Defined two new macros, bli_is_ind_packed() and bli_is_nat_packed(),
to detect pack_t schemas for induced methods and native execution,
respectively.

commit 30e5eb29e060b97752f702d2ea5d101d950f53b2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 13 12:14:19 2015 -0600

Minor changes to treatment of rs, cs in bli_obj.c.

Details:
- Applied a patch submitted by Devin Matthews that:
- implements subtle changes to handling of somewhat unusual cases of
row and column strides to accommodate certain tensor cases, which
includes adding dimension parameters to _is_col_tilted() and
_is_row_tilted() macros,
- simplifies how buffers are sized when requesting BLIS-allocated
objects,
- re-consolidates bli_adjust_strides_*() into one function, and
- defines 'restrict' keyword as a "nothing" macro for C++ and pre-C99
environments.
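
The last item of the patch (a "nothing" macro for 'restrict') can be sketched like this. The guard conditions and the demo_axpy function are assumptions chosen for illustration; the actual conditions used in BLIS may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: 'restrict' is a keyword only in C99 and later. For C++ and
   pre-C99 compilers, define it away so shared headers still compile. */
#if defined(__cplusplus) || !defined(__STDC_VERSION__) || \
    (__STDC_VERSION__ < 199901L)
  #ifndef restrict
    #define restrict
  #endif
#endif

/* Hypothetical routine whose signature uses 'restrict'. With the macro
   above, the qualifier simply disappears where it is unsupported. */
static void demo_axpy(size_t n, double alpha,
                      const double *restrict x, double *restrict y)
{
    size_t i;
    assert(n == 0 || (x != NULL && y != NULL));
    for (i = 0; i < n; ++i) y[i] += alpha * x[i];
}
```

Dropping the qualifier changes only what the optimizer may assume about aliasing; the computed result is the same as long as callers pass non-overlapping buffers.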

commit f0a4f41b5acf55b41707ec821c4c5f9076dfbc24
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 12 15:22:50 2015 -0600

Fixed unimplemented case in core2 sgemm ukernel.

Details:
- Implemented the "beta == 0" case for general stride output for the
dunnington sgemm micro-kernel. This case had been, up until now,
identical to the "beta != 0" case, which does not work when the
output matrix has nan's and inf's. It had manifested as nan residuals
in the test suite for right-side tests of ctrsm4m1a. Thanks to Devin
Matthews for reporting this bug.
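
Why the "beta == 0" case needs its own branch: in IEEE arithmetic 0 * NaN is NaN, so computing beta * C + AB with beta equal to zero still propagates any NaN or Inf already sitting in the output buffer. The sketch below shows the usual remedy (overwrite C without reading it). The function name demo_update_c and its argument layout are hypothetical and far simpler than a real micro-kernel.

```c
#include <assert.h>
#include <math.h>   /* only for the NAN / INFINITY macros used by callers */
#include <stddef.h>

/* Hypothetical update step: C := beta * C + AB for an m x n tile.
   ab is a contiguous column-major m x n buffer; c uses general strides
   (rs between rows, cs between columns). */
static void demo_update_c(size_t m, size_t n, double beta,
                          const double *ab, double *c,
                          size_t rs, size_t cs)
{
    size_t i, j;
    assert(ab != NULL && c != NULL);
    if (beta == 0.0) {
        /* Overwrite only: C is never read, so stale NaN/Inf values in
           the output cannot leak into the result. */
        for (j = 0; j < n; ++j)
            for (i = 0; i < m; ++i)
                c[i * rs + j * cs] = ab[i + j * m];
    } else {
        for (j = 0; j < n; ++j)
            for (i = 0; i < m; ++i)
                c[i * rs + j * cs] = beta * c[i * rs + j * cs] + ab[i + j * m];
    }
}
```

With beta equal to zero the output is well defined even if C was never initialized, which is the behavior BLAS-style interfaces conventionally promise.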

commit 42810bbfa0b8f006ecc5128d903909ec13ea63f9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 12 12:07:46 2015 -0600

Fixed minor bugs for uncommon obj_create cases.

Details:
- Separated bli_adjust_strides() into _alloc() and _attach() flavors so
that the latter can avoid a test performed by the former, in which the
rs and cs are overridden and set to zero if either matrix dimension is
zero. Actually, we also disable this overriding behavior, even for the
_alloc() case, since keeping the original strides (probably) does not
hurt anything. The original code has been kept commented-out, though,
in case an unintended consequence is later discovered.
- Fixed a typo in an error check for general stride cases where rs == cs.

commit 3e6dd11467643fbc2cb45c13cec8dd6024232833
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 3 10:30:08 2015 -0600

Minor re-expression in quadratic partitioning code.

Details:
- Minor change to quadratic equation solution code that avoids
recomputation of the sqrt() parameter when the compiler is not
smart enough to perform this optimization automatically.

commit 0694b722f7e4df00efb32639095a2aca80e67f52
Merge: 3e116f0a 33557ecc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Nov 2 17:24:25 2015 -0600

Merge branch 'master' of github.com:flame/blis

commit 3e116f0a2953f50b3c068759a775ad7ffae04e49
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Nov 2 17:18:23 2015 -0600

Fixed imaginary bug in quadratic partitioning code.

Details:
- Fixed a bug in the relatively new quadratic partitioning code that,
under the right conditions, would perform sqrt() on a negative value.
If the solution is imaginary, we discard it and use an alternate
partition width that assumes no diagonal intersection. That alternate
width is actually already computed, so, the fix was quite simple.
Thanks to Devangi Parikh for reporting this bug.

commit 33557ecccaf49b2569b7f3d7bcea52c2aab94c68
Author: Jeff Hammond <jeff.sciencegmail.com>
Date: Mon Nov 2 12:18:43 2015 -0800

add Travis CI build status icon to the README

commit 4a502fbe77bd0f701108baaa559d9cfb483f88de
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Nov 2 13:28:34 2015 -0600

Laid groundwork for runtime memory pool resizing.

Details:
- Changed bli_pool_finalize() so that the freeing begins with the block
at top_index instead of block 0. This allows us to use the function
for terminal finalization as well as temporary cleanup prior to
reinitialization. Also, clear the pool_t struct upon _pool_finalize()
in case it is called in the terminal case with some blocks still
checked out to threads (in which case the threads will see the new
block size as 0 and thus release the block as intended).
- Added bli_pool_reinit(), which calls _pool_finalize() followed by
_pool_init() with new parameters.
- Added bli_mem_reinit(), which is based on bli_pool_reinit().
- Added new wrapper, _mem_compute_pool_block_sizes(), which calls
_mem_compute_pool_block_sizes_dt().
- Updated bli_mem_release() so that the pblk_t is freed, via
_pool_free_block(), if the block size recorded in the mem_t at the
time the pblk_t was acquired is now different from the value in the
pool_t.
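
The interplay between reinitialization and blocks that are still checked out can be sketched as below. This is a single-threaded toy with hypothetical names (demo_pool_t, demo_mem_t, and the demo_pool_* functions); it mirrors the ideas in the bullets above (finalize from the top index, clear the struct, free stale blocks on release) but is not the BLIS implementation and contains no locking.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_POOL_CAP 4

typedef struct {
    void  *blocks[DEMO_POOL_CAP]; /* blocks[top..num_blocks-1] are available */
    size_t num_blocks;
    size_t top;                   /* index of the next available block */
    size_t block_size;
} demo_pool_t;

typedef struct {
    void  *buf;
    size_t block_size;            /* pool block size at checkout time */
} demo_mem_t;

static int demo_pool_init(demo_pool_t *pool, size_t num_blocks, size_t block_size)
{
    size_t i;
    assert(num_blocks <= DEMO_POOL_CAP && block_size > 0);
    memset(pool, 0, sizeof *pool);
    for (i = 0; i < num_blocks; ++i) {
        pool->blocks[i] = malloc(block_size);
        if (pool->blocks[i] == NULL) {
            while (i > 0) free(pool->blocks[--i]);
            memset(pool, 0, sizeof *pool);
            return -1;
        }
    }
    pool->num_blocks = num_blocks;
    pool->block_size = block_size;
    return 0;
}

/* Free only the blocks still owned by the pool (from top onward), then
   clear the struct so later releases see a block size of 0. */
static void demo_pool_finalize(demo_pool_t *pool)
{
    size_t i;
    for (i = pool->top; i < pool->num_blocks; ++i) free(pool->blocks[i]);
    memset(pool, 0, sizeof *pool);
}

static int demo_pool_reinit(demo_pool_t *pool, size_t num_blocks, size_t block_size)
{
    demo_pool_finalize(pool);
    return demo_pool_init(pool, num_blocks, block_size);
}

static int demo_pool_checkout(demo_pool_t *pool, demo_mem_t *mem)
{
    if (pool->top >= pool->num_blocks) return -1;
    mem->buf = pool->blocks[pool->top++];
    mem->block_size = pool->block_size;
    return 0;
}

/* Return the block to the pool only if its recorded size still matches;
   otherwise it predates a reinit (or finalize) and is freed directly. */
static void demo_pool_release(demo_pool_t *pool, demo_mem_t *mem)
{
    if (mem->block_size == pool->block_size && pool->top > 0)
        pool->blocks[--pool->top] = mem->buf;
    else
        free(mem->buf);
    mem->buf = NULL;
}
```

A block checked out before demo_pool_reinit() carries the old size, so releasing it afterwards frees it instead of mixing it into the resized pool.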

commit 37e55ca39bdbddaec03ad30d43e8ad2b3e549c96
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 30 18:25:04 2015 -0500

Fixed obscure 3m1/4m1a bugs in trmm[3] and trsm.

Details:
- Fixed a family of bugs in the triangular level-3 operations for
certain complex implementations (3m1 and 4m1a) that only manifest if
one of the register blocksizes (PACKMR/PACKNR, actually) is odd:
- Fixed incorrect imaginary stride computation in bli_packm_blk_var2()
for the triangular case.
- Fixed the incorrect computation of imaginary stride, as stored in
the auxinfo_t struct in trmm and trsm macro-kernels.
- Fixed incorrect pointer arithmetic in the trsm macro-kernels in the
cases where the register blocksize for the triangular matrix is
odd. Introduced a new byte-granular pointer arithmetic macro,
bli_ptr_add(), that computes the correct value.
- Added cpp macro to bli_macro_defs.h for typeof() operator, defined in
terms of __typeof__, which is used by bli_ptr_add() macro.
- Disabled the row- vs. column-storage optimization in bli_trmm_front()
for singleton problems because the inherent ambiguity of whether a
scalar is row-stored or column-stored causes the wrong parameter
combination code to be executed (by dumb luck of our checking for
row storage first).
- Added commented-out debugging lines to 3m1/4m1a and reference
micro-kernels, and trsm_ll macro-kernel.
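
A byte-granular pointer-advance macro built on __typeof__, in the spirit of the bli_ptr_add() description above, might look like the following. The names demo_typeof, demo_ptr_add_bytes, demo_scomplex, and demo_byte_offset are hypothetical, and __typeof__ is a GCC/Clang extension, so this sketch assumes one of those compilers.

```c
#include <assert.h>
#include <stddef.h>

/* __typeof__ is a compiler extension (GCC, Clang), not standard C99. */
#define demo_typeof(x) __typeof__(x)

/* Advance pointer p by nbytes BYTES (not elements) and give the result
   the same pointer type as p. Ordinary p + k arithmetic can only move
   in whole elements; this form can also express fractional-element
   offsets such as half of a complex number. */
#define demo_ptr_add_bytes(p, nbytes) \
    ((demo_typeof(p))((char *)(p) + (nbytes)))

/* Hypothetical single-precision complex type used for the example. */
typedef struct { float re, im; } demo_scomplex;

/* Helper: distance in bytes from base to p. */
static ptrdiff_t demo_byte_offset(const void *base, const void *p)
{
    assert(base != NULL && p != NULL);
    return (const char *)p - (const char *)base;
}
```

For a demo_scomplex pointer, demo_ptr_add_bytes(p, 12) lands one and a half elements past p, an offset that element-wise pointer arithmetic cannot express.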

commit 46294d80e5a79c598e200e1c8ec2a642ff839971
Merge: d3159c57 a0a7b85a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Oct 27 12:41:23 2015 -0500

Merge pull request #35 from figual/master

Fixed incomplete code in the double precision ARMv8 microkernel.

commit a0a7b85ac3e157af53cff8db0e008f4a3f90372c
Author: Francisco Igual <figualucm.es>
Date: Tue Oct 27 08:59:15 2015 +0000

Fixed incomplete code in the double precision ARMv8 microkernel.

commit d3159c5740c9ee7f8c0b661003aab6f00646ad6f
Merge: b489152e 7e03e45b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 21 14:54:00 2015 -0500

Merge branch 'master' of github.com:flame/blis

commit b489152e112644ec3b6d19e687231a9607f7694f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 21 14:53:17 2015 -0500

Use vzeroall in haswell micro-kernels.

commit 7e03e45bfe6c27c4fdbf06b1caa7f49e9a5fef49
Merge: 77ddb0b1 4f88c29f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 14 13:26:07 2015 -0500

Merge pull request #33 from xianyi/master

Enable Travis CI

commit 4f88c29f9e634cbb6fb22d8c88931f0ec78ad7db
Author: Zhang Xianyi <traits.zhanggmail.com>
Date: Wed Oct 14 12:57:50 2015 -0500

Detect Intel Broadwell (using Haswell config).

commit 4b0ac1a9984a93f7ad4369b10fca63991107d9f5
Merge: fe3e355c 77ddb0b1
Author: Zhang Xianyi <traits.zhanggmail.com>
Date: Wed Oct 14 12:51:05 2015 -0500

Merge branch 'upstream_master'

commit 77ddb0b1d31ada111dadf392766ba6d9210ed9fb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Oct 13 12:53:06 2015 -0500

Removed flop-counting mechanism.

Details:
- Removed the optional flop-counting feature introduced in commit
7574c994.

commit 276da366187460a4c8e6e0910e79cb39ce780bfe
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 12 11:43:03 2015 -0500

Minor formatting change to README.md.

commit d17057446f5404824478e8a6cd08f242ab75544a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 12 11:39:49 2015 -0500

Added "Getting Started" section to README.md.

Details:
- Added section to README.md file containing links to wikis with brief
descriptions.

commit e7e1f2f7b601b21b50e3cdad8972cb3fe11018d3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 2 16:51:52 2015 -0500

Minor updates to CREDITS, README files.

commit 55329906ecd7ce1ab910e4d30a29354a9172e7ea
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Sep 26 20:47:19 2015 -0500

Minor edits to README.md, testsuite.

Details:
- Fixed typos in README.md.
- Fixed column heading alignment for testsuite when matlab output is
enabled.
- Minor updates to test/3m4m/runme.sh and test/3m4m/Makefile.

commit bbebdb5793a8fd6aaf257012ab0272beaa04a0de
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Sep 25 14:47:27 2015 -0500

Replaced README with README.md.

Details:
- Replaced the old (and short) README file with a much more comprehensive
version written in github-flavored markdown. The new file is based on
content taken from the old Google Code homepage.

commit e2e9d64a63485461192d9c2a6dd0183a8b71013c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 24 12:14:03 2015 -0500

Load balance thread ranges for arbitrary diagonals.

Details:
- Expanded/updated interface for bli_get_range_weighted() and
bli_get_range() so that the direction of movement is specified in the
function name (e.g. bli_get_range_l2r(), bli_get_range_weighted_t2b())
and also so that the object being partitioned is passed instead of an
uplo parameter. Updated invocations in level-3 blocked variants, as
appropriate.
- (Re)implemented bli_get_range_*() and bli_get_range_weighted_*() to
carefully take into account the location of the diagonal when computing
ranges so that the area of each subpartition (which, in all present
level-3 operations, is proportional to the amount of computation
engendered) is as equal as possible.
- Added calls to a new class of routines to all non-gemm level-3 blocked
variants:
bli_<oper>_prune_unref_mparts_[mnk]()
where <oper> is herk, trmm, or trsm and [mnk] is chosen based on which
dimension is being partitioned. These routines call a more basic
routine, bli_prune_unref_mparts(), to prune unreferenced/unstored
regions from matrices and simultaneously adjust other matrices which
share the same dimension accordingly.
- Simplified herk_blk_var2f, trmm_blk_var1f/b as a result of the
new pruning routines.
- Fixed incorrect blocking factors passed into bli_get_range_*() in
bli_trsm_blk_var[12][fb].c
- Added a new test driver in test/thread_ranges that can exercise the new
bli_get_range_*() and bli_get_range_weighted_*() under a range of
conditions.
- Reimplemented m and n fields of obj_t as elements in a "dim"
array field so that dimensions could be queried via index constant
(e.g. BLIS_M, BLIS_N). Adjusted/added query and modification
macros accordingly.
- Defined mdim_t type to enumerate BLIS_M and BLIS_N indexing values.
- Added bli_round() macro, which calls C math library function round(),
and bli_round_to_mult(), which rounds a value to the nearest multiple
of some other value.
- Added miscellaneous pruning- and mdim_t-related macros.
- Renamed bli_obj_row_offset(), bli_obj_col_offset() macros to
bli_obj_row_off(), bli_obj_col_off().
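
To make the load-balancing goal concrete (equal area per thread rather than equal width), here is a small sketch that splits the columns of an n x n lower-triangular matrix among threads. It uses a simple greedy scan over column weights and a hypothetical function name, demo_ranges_lower; it does not reproduce how bli_get_range_weighted_*() computes its ranges, and it ignores blocking factors and diagonal offsets.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: assign column ranges [start[t], end[t]) of an
   n x n lower-triangular matrix to nt threads so that each range holds
   roughly the same number of stored elements. Column j stores n - j
   elements, so left columns are "heavier" than right ones. */
static void demo_ranges_lower(size_t n, size_t nt,
                              size_t *start, size_t *end)
{
    size_t total, acc = 0, j = 0, t;
    assert(nt > 0 && start != NULL && end != NULL);
    total = n * (n + 1) / 2;              /* stored elements overall */
    for (t = 0; t < nt; ++t) {
        /* Cumulative area that threads 0..t should cover together. */
        size_t target = total * (t + 1) / nt;
        start[t] = j;
        while (j < n && acc < target) {
            acc += n - j;                 /* weight of column j */
            ++j;
        }
        end[t] = j;
    }
}
```

For n = 8 and two threads this yields columns 0-2 (21 elements) and 3-7 (15 elements), whereas an even split by width (0-3 and 4-7) would give 26 and 10.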

commit fe3e355c9c5a6f65b8736b009e2d501b62a83ea1
Merge: efa641e3 4dd9dd3e
Author: Zhang Xianyi <traits.zhanggmail.com>
Date: Fri Aug 21 14:38:36 2015 -0500

Merge branch 'upstream_master'

commit efa641e36b73abee34166a252e90e28a6281d92d
Author: Zhang Xianyi <traits.zhanggmail.com>
Date: Sat Aug 22 03:15:50 2015 +0800

Try to fix the compiling bug on travis.

commit 4dd9dd3e1de626b51bfe85d9ee65f193d60e8d38
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Aug 21 11:52:37 2015 -0500

Fixed minor alignment ambiguity bug in bli_pool.c.

Details:
- Fixed a typecasting ambiguity in bli_pool_alloc_block() in which
pointer arithmetic was performed on a void* as if it were a byte
pointer (such as char*). Some compilers may have already been
interpreting this situation as intended, despite the sloppiness.
Thanks to Aleksei Rechinskii for reporting this issue.
- Redefined pointer alignment macros to typecast to uintptr_t instead of
siz_t.
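
Both points in this commit (no arithmetic directly on void*, and uintptr_t as the integer type for address tests) can be illustrated with a short sketch. The names demo_is_aligned and demo_align_up are hypothetical and are not the macros defined in BLIS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p is a multiple of align. uintptr_t is the integer type
   intended to hold a converted pointer value. */
#define demo_is_aligned(p, align) (((uintptr_t)(p) % (align)) == 0)

/* Hypothetical helper: round p up to the next multiple of align. */
static void *demo_align_up(void *p, size_t align)
{
    uintptr_t rem;
    assert(align > 0);
    rem = (uintptr_t)p % align;
    if (rem == 0) return p;
    /* Arithmetic on void* is a compiler extension, so cast to char*
       first to make the byte-sized step explicit and portable. */
    return (char *)p + (align - rem);
}
```

The caller must ensure the underlying allocation has at least align - 1 bytes of slack so that the rounded-up pointer still falls inside it.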

commit 12ffd568b04feda57147c13b67717416a01c82f8
Author: Zhang Xianyi <traits.zhanggmail.com>
Date: Sat Aug 22 00:24:28 2015 +0800

Add Travis CI.

commit ecc3ebb749e0861c27deda52b5f87236ede4901b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 29 13:31:12 2015 -0500

CHANGELOG update (0.1.8)
