Accera

Latest version: v1.2.29


1.2.5

What's Changed
* Add link to the NCHWc 2D Convolution Case Study by marina-neseem in https://github.com/microsoft/Accera/pull/41

- Merged PR 2593: [docs] [release] bump docs version to 1.2.5 in
preparation for release. [Lisa Ong]

bump docs version to 1.2.5 in preparation for release
- Merged PR 2586: Loop order and indices as parameters [Denny Sun]

With this change, the user can write a schedule with loop_order parameterized:

```python
loop_order = create_parameters()
schedule.reorder(order=loop_order)

parameter_grid = {
    loop_order: (j, k, i, ii, jj, kk)
}

parameters = create_parameter_grid(
    parameter_grid,
    filter_func=lambda *p: schedule.is_valid_loop_order(p[0][0]),
    sample=5)
```

Add another function to the package:

```python
package.add(
    plan,
    args=(A, B, C),
    parameters=parameters,
    base_name="matmul_256_256_256"
)
```

Related work items: 3693
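The expand/filter/sample behavior of `create_parameter_grid` can be sketched in plain Python. This is a hypothetical standalone version to illustrate the concept (the function name, list-based candidate values, and `seed` argument are assumptions, not Accera's actual implementation):

```python
import itertools
import random

def create_parameter_grid_sketch(grid, filter_func=None, sample=None, seed=0):
    """Expand a dict of {parameter: candidate values} into a list of
    parameter combinations, optionally filtered and down-sampled."""
    names = list(grid)
    # Wrap scalar entries so itertools.product sees one candidate each.
    values = [v if isinstance(v, list) else [v] for v in grid.values()]
    combos = [dict(zip(names, combo)) for combo in itertools.product(*values)]
    if filter_func is not None:
        # Drop invalid combinations before any functions would be generated.
        combos = [c for c in combos if filter_func(c)]
    if sample is not None and sample < len(combos):
        # A fixed seed makes the sampling reproducible.
        combos = random.Random(seed).sample(combos, sample)
    return combos
```

Filtering runs before sampling, so the sample is always drawn from valid combinations only.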
- Merged PR 2591: Fixes more warnings. Enables STRICT_MODE for Linux PR
CI. [Kern Handa]
- Merged PR 2588: [test] Trim out redundant tests from ROCm pipeline.
[Lisa Ong]

The ROCm pipeline currently runs on a single agent; avoid running CPU tests that already run in other pipelines to speed up pipeline execution.
- Merged PR 2590: [nfc] Fixes a bunch of warnings in C++ layer. [Kern
Handa]

[nfc] Fixes a bunch of warnings in C++ layer
- Merged PR 2589: [test] Adds DSL tests for Schedule.pad. [Kern Handa]

Adds DSL tests for Schedule.pad
- Merged PR 2587: Sync Github to ADO. [Lisa Ong]

commit b934ad05f6b8cd84420226b93f57b8ac3229eadc
- Merged PR 2585: Use conditional instead of loop-unswitching on GPU.
[Chuck Jacobs]

This PR changes how boundary conditions are handled on GPU-bound loop indices. If a loop's increment doesn't evenly divide its bounds, the body is guarded by a conditional instead of unswitching that loop.

Related work items: 3703
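The idea can be illustrated with a scalar sketch in plain Python (illustrative only; the actual change operates on GPU-bound MLIR loops): a conditional guard inside the body handles the partial last tile, so every tile iteration has the same shape and no unswitched cleanup copy of the loop is needed.

```python
def sum_guarded(data, tile):
    """Process `data` in tiles of size `tile`; a conditional guard
    clamps the last (partial) tile instead of unswitching the loop."""
    total = 0
    n = len(data)
    for start in range(0, n, tile):
        for i in range(start, start + tile):
            if i < n:  # boundary guard: body is conditional, loop shape stays uniform
                total += data[i]
    return total
```

On a GPU this keeps the loop structure identical across threads, at the cost of evaluating the guard in the boundary tile.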
- Merged PR 2571: Add random seed to enable reproducible sampling.
[Denny Sun]

Giving users control over sampling strategies.
- Merged PR 2581: Add CUDA tensor core support. [Ritwik Das]

- Added CUDA tensor ops (no caching)
- Added validation tests
- Changed MMA enum names
- Bit of generated tensor op code in cuda:

```cpp
// ...
vhalf *var11 = (vhalf*)arg2;
wmma::fragment<wmma::accumulator, 16, 16, 16, vhalf> mmaMatrix_12;
wmma::load_matrix_sync(mmaMatrix_12, var11 + var9 * 16 + var10, 16, wmma::layout_t::mem_row_major);
vhalf *var13 = (vhalf*)arg0;
wmma::fragment<wmma::matrix_a, 16, 16, 16, vhalf, wmma::row_major> mmaMatrix_14;
wmma::load_matrix_sync(mmaMatrix_14, var13 + var9 * 16 + 0, 16);
vhalf *var15 = (vhalf*)arg1;
wmma::fragment<wmma::matrix_b, 16, 16, 16, vhalf, wmma::row_major> mmaMatrix_16;
wmma::load_matrix_sync(mmaMatrix_16, var15 + 0 * 16 + var10, 16);
wmma::fragment<wmma::accumulator, 16, 16, 16, vhalf> mmaMatrix_17;
wmma::mma_sync(mmaMatrix_17, mmaMatrix_14, mmaMatrix_16, mmaMatrix_12);
wmma::store_matrix_sync(var11 + var9 * 16 + var10, mmaMatrix_17, 16, wmma::layout_t::mem_row_major);
```


Related work items: 3694
- Merged PR 2584: Adds cublas_gemm benchmarking tool. [Kern Handa]

Adds cublas_gemm benchmarking tool
- Merged PR 2583: Don't hold ResolveWarpSize results with rvalue. [Mason
Remy]

Don't hold ResolveWarpSize results with rvalue

gcc appears to be inlining ResolveWarpSize incorrectly in some cases and
not holding the result with an rvalue pair appears to fix it.

This was resulting in some mod 0's and floordiv 0's when we would expect
the warp size constants to either be 32 or 64 exactly.
- Merged PR 2580: Fixes rocblas_gemm's fp32 -> fp16 conversion. [Kern
Handa]
- Merged PR 2579: Improves accera_gemm.py's handling of unsupported
configs. [Kern Handa]

Improves accera_gemm.py's handling of unsupported configs
- Merged PR 2578: Fixes time unit conversions in accera_gemm.py. [Kern
Handa]

Also addresses comments for the previous rocblas_gemm PR
- Merged PR 2577: Fixes accera_gemm.py code after Plan.tensorize API
change. [Kern Handa]

Fixes accera_gemm.py code after Plan.tensorize API change
- Merged PR 2575: Adds library warmup to rocblas_gemm benchmarker. [Kern
Handa]

Adds library warmup to rocblas_gemm benchmarker
- Merged PR 2572: [nfc] Move accera/viz -> tools/viz. [Kern Handa]

[nfc] Move accera/viz -> tools/viz
- Merged PR 2573: Update setup.cfg hatlib dependency version. [Mason
Remy]

Update setup.cfg hatlib dependency version
- Merged PR 2557: Overhauls the benchmarking tool. [Kern Handa]

This change moves the benchmarking tool to a top-level `tools/benchmarkers` directory. The tool has also been split up so that the accera portion is in its own file, while the driver portion of the tool remains intact and has gained the ability to run a rocblas gemm benchmarking utility.

The aforementioned rocblas gemm benchmarking utility is also added in this change. `rocblas_gemm` is a new executable that is not built by default since it relies on the rocblas library, which may not be available everywhere. Once this tool has been explicitly built, it can be passed in as an argument to the benchmarker tool, which will use it to generate a comparison between accera's benchmark results and rocblas's.

An example:
```sh
<build accera like usual>
ninja -C `git rev-parse --show-toplevel`/build/temp.linux-x86_64-3.8 rocblas_gemm
cd tools/benchmarkers
mkdir ~/accera_benchmarks
./gpu_benchmark_tool.py -i sgemm_bert_assorted.csv -t 'AMD MI100' -o ~/accera_benchmarks/results -r `git rev-parse --show-toplevel`/build/temp.linux-x86_64-3.8/tools/benchmarkers/rocblas/rocblas_gemm
```


Related work items: 3685
- Merged PR 2569: Make tensorization passes configurable, remove
dependency from split indices. [Ritwik Das]

- Make the mfma type a required parameter for tensorize() - this only chooses the underlying mfma op to use
- Additionally, the user can pass in the total number of passes to run (defaults to 1) instead of having a square tile calculated implicitly.
- Added documentation for the new enum type.
- Added some tests
- Current code does not work with K > M (still investigating this, but it should not block this PR)

Related work items: 3688
- Merged PR 2567: Fix vectorized access of LAST_MAJOR arrays. [Mason
Remy]

Fix vectorized access of LAST_MAJOR arrays

- mlir::vector::LoadOp and mlir::vector::StoreOp only support unit
strides on the minor dimension of the memref they access, so
reinterpretcast the memref to a flat buffer to pass that check
- add translation for reinterpretcastop
- improve vectorization of LAST_MAJOR matrices in cache accesses by
changing the traversal order of the cache region (when
filling/reducing) based on the memory ordering of the outer array
being acted on.
- Merged PR 2568: [Compliance] [nfc] Switch to Azure Container Registry
for ROCm build agent. [Lisa Ong]
- Merged PR 2560: Make register allocation during tensorization tunable.
[Ritwik Das]

- Add controllable number of fused mfma passes
- Add controllable scheduling policy of mfma ops
- Add tests

Related work items: 3687
- Merged PR 2565: [build] bump hatlib dependency to 0.0.13. [Lisa Ong]

hatlib 0.0.13 contains a fix to unblock ROCm buddy builds


New Contributors
* marina-neseem made their first contribution in https://github.com/microsoft/Accera/pull/41

**Full Changelog**: https://github.com/microsoft/Accera/compare/v1.2.4...v1.2.5

1.2.4

What's Changed
* Docs refactoring install by Arslan-e-Mustafa in https://github.com/microsoft/Accera/pull/27
* Revise Pi3_Cross_Compilation.md by Arslan-e-Mustafa in https://github.com/microsoft/Accera/pull/28
* Docs refactoring tutorials hello matmul by Arslan-e-Mustafa in https://github.com/microsoft/Accera/pull/29
* Docs refactoring tutorials hello matmul gpu by Arslan-e-Mustafa in https://github.com/microsoft/Accera/pull/30
* Docs refactoring tutorials optimized matmul by Arslan-e-Mustafa in https://github.com/microsoft/Accera/pull/31
* Refactoring of Accera.md from reference docs by Arslan-e-Mustafa in https://github.com/microsoft/Accera/pull/32
* Complete refactoring of safety analysis by Arslan-e-Mustafa in https://github.com/microsoft/Accera/pull/33
* Refactoring of functions docs in reference files by Arslan-e-Mustafa in https://github.com/microsoft/Accera/pull/34
* Demo fixes for hatlib 0.0.11 by lisaong in https://github.com/microsoft/Accera/pull/36
* [nfc] [doc] Update arrow label positions by lisaong in https://github.com/microsoft/Accera/pull/35
* completed reference docs by Arslan-e-Mustafa in https://github.com/microsoft/Accera/pull/37
* Update docstrings to match reference doc changes by lisaong in https://github.com/microsoft/Accera/pull/38
* [ci][nfc] Update CI pipeline to Azure Container Registry by lisaong in https://github.com/microsoft/Accera/pull/39
* [doc] Contributing guide for Case Studies by lisaong in https://github.com/microsoft/Accera/pull/40

- Merged PR 2563: Add a table of operators and code examples to the
Parameters.md. [Denny Sun]

Update the Manuals with the supported operators and code examples.
- Merged PR 2562: [nfc] Add some macOS targets and synced Model.md.
[Lisa Ong]

* Re-generated Model.md to add missing models
* Handle zero (unknown) vector_bytes cases in tests
* Opportunistically added these models used during development:
  * 2016 MacBook Pro
  * M1 Max
- Merged PR 2561: [docs][nfc] Sync changes from Github remote, bump doc
versions to 1.2.4. [Lisa Ong]
- Merged PR 2558: [nfc] update requirements to latest version of six.
[Lisa Ong]

Fixes this warning:


```
<frozen importlib._bootstrap>:914: ImportWarning: _SixMetaPathImporter.find_spec() not found; falling back to find_module()
```

- Merged PR 2559: Finer-granularity error reporting for python tests.
[Chuck Jacobs]

This PR modifies how the python tests are invoked, so that they can report pass/fail results per test. Hopefully that'll make it easier to pinpoint where things are failing during CI builds.
- Merged PR 2556: [non-functional] Change ROCM code to generate gcn
intrinsics when possible. [Ritwik Das]

- Use amd gcn intrinsics when possible (threadIdx, blockIdx, barrier)
- Add helpers which automatically check for runtime before emitting the proper code

Related work items: 3698
- Merged PR 2547: [non-functional] Change custom mfma types to Memref
and some refactoring. [Ritwik Das]

Make initial changes to remove custom mfma types

Related work items: 3691
- Merged PR 2555: create_parameters(count: int) no longer needs count as
an argument. [Denny Sun]

1. Remove the count of parameters to be created from the DSL
2. Throw an exception when users write the following code:
   `create_parameters()`
3. The correct way of calling `create_parameters()` is:
   `p1, p2, p3, ..., pN = create_parameters()`
- Merged PR 2554: [doc] Updated some missing enums and fixed Case Study
path. [Lisa Ong]
- Merged PR 2522: Generalize array indexing in tensorized GEMM. [Chuck
Jacobs]

This PR generalizes the MFMA tensorization pass to improve the handling of code in the innermost loop. It recognizes more ways of writing the GEMM kernel, and rejects many ill-formed GEMM kernels.

There are also a number of tests.

This PR doesn't yet generalize to batch-GEMM, where the matrices (typically) have 3 indices.

Related work items: 3676
- Merged PR 2551: [nfc][ci] Switch hosted pipelines to 1ES hosted pool.
[Lisa Ong]

* The Linux1ESPool is created to support internal builds of LLVM

* Fix regression in pipeline due to overzealous .dockerignore
- Merged PR 2550: [nfc] [docs] Merge changes from GitHub remote. [Lisa
Ong]

In preparation for merge from ADO to GitHub for Case Studies publishing
- Merged PR 2549: [Compliance] Switching from Dockerhub to ACR for third
party containers. [Lisa Ong]

Updating Dockerfile references
- Merged PR 2548: Add README file for case studies. [Denny Sun]

README file has a table where each case study points to the external repo link.
- Merged PR 2546: [dev] [nfc] Natively support macOS/arm64 for
development. [Lisa Ong]

Limited to local development scenarios (LLVM_SETUP_VARIANT=Default)

No plans to release pip packages until there is CI support

Verified on: Big Sur (MacOSX 12.3 arm64) / Python 3.10
- Merged PR 2543: Add precomputed offset map optimization for
tensorization (no caching) [Ritwik Das]

- Add flag to tensorize() to enable optimization (off by default)
- Optimization only affects load/store of accumulator (C) argument
- Supports all 4 mfma shapes

Related work items: 3671
- Merged PR 2542: An assortment of minor fixes. [Chuck Jacobs]

This PR is a hodgepodge of tiny fixes. I'm happy to split it up into separate PRs if a kitchen-sink PR is too gross.

The specific things are:
- Add 2 new target models to `Targets.py` (that correspond to my local dev boxes)
- Change the snapshot IR format for sub-passes to use the same format as the top-level passes (that is, not "generic" format)
- Print a warning message if `check_correctness` skips a correctness check because no hat file was generated
- Add a "minimum version" constraint to `requirements.txt` for `hatlib`
- Merged PR 2545: Unifies CUDA and CPP enum values to SOURCE for
Package.Format. [Kern Handa]

Unifies CUDA and CPP enum values to SOURCE for Package.Format

Related work items: 3679
- Merged PR 2544: [nfc] Removes now unnecessary ldebug output. [Kern
Handa]

[nfc] Removes now unnecessary ldebug output
- Merged PR 2527: Enable vectorized shared memory write. [Mason Remy]

Enable vectorized shared memory write

- This adds mod simplification support needed for vectorizing shared
memory writes
- Also refactors some of the affine simplification code slightly to
share some common code between the floordiv and mod simplifications

Related work items: 3586, 3661, 3689
- Merged PR 2526: Enable GPU global read vectorization. [Mason Remy]

Enable GPU global read vectorization

- Implements a floor div simplification that enables better recognition
of vectorizable load and stores

Related work items: 3661, 3690
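The floordiv/mod simplifications these two vectorization changes rely on can be illustrated arithmetically: for `0 <= b < c`, an index of the form `a*c + b` satisfies `(a*c + b) // c == a` and `(a*c + b) % c == b`, which is what lets consecutive lane accesses be recognized as unit-stride. A plain-Python sketch of the identity (not the MLIR implementation):

```python
def simplify_index(a, b, c):
    """For an index of the form a*c + b with 0 <= b < c, floordiv by c
    reduces to a and mod by c reduces to b -- the simplification that
    exposes unit-stride, vectorizable loads and stores."""
    assert 0 <= b < c
    idx = a * c + b
    return idx // c, idx % c

# All 32 lanes of a warp-sized chunk resolve to the same row index:
rows = {simplify_index(7, lane, 32)[0] for lane in range(32)}
```

Once the compiler proves the bound on `b`, the division and remainder fold away and the per-lane offsets form a contiguous run.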
- Merged PR 2541: Fix a few issues with GEMM benchmarking script. [Chuck
Jacobs]

This PR fixes a couple of errors:
- there was a bug in the GEMM kernel
- sometimes hatlib would fail to return a compiled function, but not throw an exception. These are now flagged as "uncompilable"

It makes a couple of other tweaks:
- it fails if the `alpha` and `beta` parameters aren't `1.0` and `0.0`
- it culls some variants with known-uncompilable tensorization parameters before trying to compile them
- Merged PR 2538: Fix std::pair unpacking issue in
TensorizeAffineForOpConversion. [Lisa Ong]

In debug builds, we are getting garbage values for warpSizeX and warpSizeY, resulting in division by 0 errors in the emitted .cu files
- Merged PR 2536: Parameter supports most of the
arithmetic/binary/unary operations defined in operator lib. [Denny
Sun]

Parameter supports the basic arithmetic operations (+, -, *, //, %), for example, the user can write the following code:

```python
fma_unit_count, vector_size = acc.create_parameters(2)
jjj = schedule.split(jj, fma_unit_count * vector_size)
jjjj = schedule.split(jjj, vector_size)
```

Related work items: 3692
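Conceptually, a parameter that participates in arithmetic before its value is known behaves like a deferred expression built through operator overloading. A minimal hypothetical sketch (the `Param` class and `value()` API are illustrative, not Accera's implementation):

```python
class Param:
    """Placeholder whose arithmetic builds a deferred expression,
    evaluated once concrete values are bound (illustrative only)."""

    def __init__(self, fn):
        self.fn = fn  # fn: env dict -> concrete value

    @staticmethod
    def leaf(name):
        return Param(lambda env: env[name])

    def _binop(self, other, op):
        getv = other.fn if isinstance(other, Param) else (lambda env: other)
        return Param(lambda env: op(self.fn(env), getv(env)))

    def __add__(self, other):
        return self._binop(other, lambda a, b: a + b)

    def __mul__(self, other):
        return self._binop(other, lambda a, b: a * b)

    def __floordiv__(self, other):
        return self._binop(other, lambda a, b: a // b)

    def __mod__(self, other):
        return self._binop(other, lambda a, b: a % b)

    def value(self, env):
        return self.fn(env)

# Mirrors the snippet above: the split size is a deferred product.
fma_unit_count = Param.leaf("fma_unit_count")
vector_size = Param.leaf("vector_size")
split_size = fma_unit_count * vector_size
```

Evaluation is delayed until a parameter grid binds concrete values, so one schedule definition can be stamped out per combination.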
- Merged PR 2539: [nfc][docs] Merging commits from Github/main. [Lisa
Ong]

commit ee28126a338d905eb5931038d3c5daba6ead3811
- Merged PR 2535: [ci] Self-hosted Azure DevOps build agent for ROCm
smoke tests. [Lisa Ong]

* Docker image for self-hosted build agent on the ROCm development machine
* Pipeline will front-load the Python ROCm tests so that we fail faster
* The agent runs ROCm 5.1.1 (the current latest). We can build/launch different containers for different versions if needed.
* CUDA_VISIBLE_DEVICES = 0 by default. This can be overwritten at pipeline scheduling time.
* The pipeline currently fails in the ROCm Python tests, so it does not block completion of the PR.
* Included some fixes that are not related to ROCm but generally needed to run on systems whose CPU names are resolved (e.g. "zen2"), i.e. the build agent itself.

Related work items: 3682
- Merged PR 2537: [Compliance] Make dependency on ffmpeg optional. [Lisa
Ong]

ffmpeg-python is only needed for video export from the Iteration Visualizer Tool

Removing the hard dependency from the tool.
- Merged PR 2525: Fix vectorization plumbing for GPU scenarios. [Mason
Remy]

Fix vectorization plumbing for GPU scenarios

Related work items: 3661
- Merged PR 2531: [nfc][docs] Merging weekly commits from Github/main.
[Lisa Ong]

commit d75d4a6b9cec2ccf90bdf27911d843be1833bc8d
- Merged PR 2530: Adds initial GPU benchmarking infrastructure. [Kern
Handa]

Related work items: 3685
- Merged PR 2524: [nfc] Refactor RangeValue utilities to separate file.
[Mason Remy]

[nfc] Refactor RangeValue utilities to separate file

Related work items: 3661
- Merged PR 2532: [prog] Fallback to known TargetDevice names for
looking up the LLVM triple. [Lisa Ong]

Resolves the issue where the CPU type is resolved (e.g. "zen2"), but does not match anything in the known triples list in TargetDevice.cpp

Future work can consider lifting the TargetDevice.cpp list to the Python layer
- Merged PR 2523: [nfc][docs] Incorporate generated visualizations from
Iteration Space Visualizer. [Lisa Ong]

* Add Alex's visualization tool to our tree
* Updated Schedule documentation and examples to align with existing visualizations
* Moved logos to subfolder under assets
- Merged PR 2521: Updates formatting of the unknown HOST warning
message. [Kern Handa]

Updates formatting of the unknown HOST warning message
- Merged PR 2514: Makes module compilation resist func compilation
fails. [Kern Handa]

Makes module compilation resist func compilation fails
- Merged PR 2517: Get the known device for host machine and give a
warning if the host is an unknown device. [Denny Sun]

When the target is the host, we call cpuinfo to query the CPU model of the host machine, then use a regex to match it against the model names of known devices. If a match is found, we use that device's configuration; otherwise we fall back to default configurations to generate code for the host target and warn the user about potentially suboptimal code.

Related work items: 3546
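The described lookup can be sketched as a regex search of the cpuinfo model string against a table of known devices. The patterns and configuration values below are purely illustrative, not Accera's actual tables:

```python
import re

# Hypothetical table: pattern fragment -> device config (illustrative values).
KNOWN_DEVICES = {
    r"Xeon\(R\) W-2123": {"vector_bytes": 32, "num_cores": 4},
    r"i7-1185G7": {"vector_bytes": 64, "num_cores": 4},
}
DEFAULT_CONFIG = {"vector_bytes": 16, "num_cores": 1}

def resolve_host_target(cpu_model):
    """Return (config, is_known) for a cpuinfo-style model string."""
    for pattern, config in KNOWN_DEVICES.items():
        if re.search(pattern, cpu_model):
            return config, True
    # Unknown host: fall back to conservative defaults; the caller
    # warns the user about potentially suboptimal generated code.
    return DEFAULT_CONFIG, False
```

Matching on a substring of the brand string tolerates the frequency/trademark noise that cpuinfo reports around the model name.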
- Merged PR 2519: Merging changes from Github remote. [Lisa Ong]

commit ee8ad1ed7b7911109d76a40fb3990a419de05fe5
- Merged PR 2513: Removed inaccurate warp size computation for Vulkan
targets. [Chuck Jacobs]

The previous barrier optimization PR added some inaccurate code to `util::resolveWarpSize()` for Vulkan targets. This PR removes that code and fixes up some tests that depended on it.
- Merged PR 2516: Add fp16 support for mfma in the DSL (+tests) [Ritwik
Das]

- Add support for fp16 input and fp32 output
- Support fp16 input and output
- Clean up some tests

Related work items: 3670
- Merged PR 2510: Add different mfma tile sizes for FP32. [Ritwik Das]

- Fix couple of offset bugs
- Add multi-block tile sizes
- Add unit tests

Related work items: 3666
- Merged PR 2511: Enable smoke test GPU matmul correctness checks.
[Mason Remy]

Enable smoke test GPU matmul correctness checks

- Also fix some FP16 scenarios
- Add some more Accera <-> numpy mapping utilities
- Merged PR 2502: Support different input array layouts for GPU caching.
[Mason Remy]

Support different input array layouts for GPU caching

This change mainly configures the thread assignments in order to get coalesced global memory access. The logical accesses were already correct; this change is primarily for performance.

Related work items: 3660
- Merged PR 2487: Barrier optimization, part 2. [Chuck Jacobs]

This PR improves the previous barrier optimization code. It now works with non-straight-line code (if/else constructs and loops).

It doesn't yet do the "move barriers outside of loops" optimization.

For debugging, there's an option to output a graphviz dot file showing the graph of relevant instructions that are used during the optimization:


```
acc-opt ... --barrier-opt-dot --barrier-opt-dot-filename="barrier.dot"
```


Related work items: 3649
- Merged PR 2509: [nfc] sync quickstart demo from GitHub/demo branch.
[Lisa Ong]

Use a subset of MLAS optimizations that are sufficient to show a 3x improvement over the default schedule.

This version was already in the Github repo for some time.

**Full Changelog**: https://github.com/microsoft/Accera/compare/v1.2.3...v1.2.4

1.2.3

What's Changed

* Docs refactoring manual fusing by Arslan-e-Mustafa in https://github.com/microsoft/Accera/pull/26

- Merged PR 2508: [release] Bump docs version to 1.2.3. [Lisa Ong]

In preparation for a PyPI release to facilitate community contributions for case studies

Synced doc editorials from public Github repo
- Merged PR 2503: [prog] Support unsigned integer types in the DSL.
[Lisa Ong]

* Add ScalarType.uint8/16/32/64 support
* Use UnrealizedConversionCastOps to convert these unsigned ints to signless ints
* Refactored CastImpl now that we have to handle both unsigned and signless cases for casts to/from ints
* Use a tuple of (mlir Type, llvm Type) to infer the C type when writing function declarations in the HAT file. The former holds sign-ness information, the latter determines the C type (e.g. pointer or not)
* Simplified CheckAllClose function to reduce unnecessary casting
* Doc updates
* Fixed HAT file issues with ScalarType.bool
- Merged PR 2507: Updates acc-translate output for ROCm 5.1. [Kern
Handa]
- Merged PR 2437: Add more known targets(from our team's devices) [Denny
Sun]

The new list covers the following CPUs, which are in use by our devs:
- Intel(R) Xeon(R) W-2123 CPU 3.60GHz
- 11th Gen Intel(R) Core(TM) i7-1185G7 3.00GHz
- Intel(R) Core(TM) i7-8650U CPU 1.90GHz 2.11 GHz
- Intel(R) Xeon(R) CPU E5-1650 v3 3.50GHz
- Intel(R) Xeon(R) Silver 4108 CPU 1.80GHz

Related work items: 3546
- Merged PR 2505: [nfc] Rename parameters for schedule.tile and
plan.bind. [Kern Handa]

[nfc] Rename parameters for schedule.tile and plan.bind
- Merged PR 2501: Adds support for more than one GPU function per
package. [Kern Handa]

Adds support for more than one GPU function per package

Related work items: 3686
- Merged PR 2504: [docs] Update stale versions in Reference docs. [Lisa
Ong]

Fixing while considering better approaches....
- Merged PR 2499: Updates the syntax for schedule.tile. [Kern Handa]

Updates the syntax for schedule.tile
- Merged PR 2498: Updates the syntax for plan.bind. [Kern Handa]

Updates the syntax for plan.bind

Related work items: 3678
- Merged PR 2500: Adds support for specifying index bitwidth for acc-
translate. [Kern Handa]

Adds support for specifying index bitwidth for acc-translate

Story 3669

Related work items: 3669
- Merged PR 2490: Restore CMake Export. [Abdul Dakkak]

Restore the CMake Export feature as it is used by argo-experiments. Note that you cannot use this feature if you are using the vcpkg llvm build
- Merged PR 2497: Fix vectorization plumbing to correctly handle zero
vectorization budget cases in cache reduce ops. [Mason Remy]

Fix vectorization plumbing to correctly handle zero vectorization budget cases in cache reduce ops
- Merged PR 2496: [nfc] Switch docs versioning to bump2version, replace
VERSION with simple git tag-based version. [Lisa Ong]

* Populate ACCERA_VERSION from the latest git tag
* bump2version is now configured for the docs/ tree
- Merged PR 2495: [test] Import break with python -m unittest discover.
[Lisa Ong]

`python -m unittest discover accera/test *.py` will interrogate verifiers.py and fail because of the relative import
- Merged PR 2492: Updates test verifier code to match hatlib API
changes. [Kern Handa]

Updates test verifier code to match hatlib API changes
- Merged PR 2488: Simplify RangeValue analysis. [Abdul Dakkak]

Uses LLVM's ConstantRange instead of implementing our own to delete a lot of code
- Merged PR 2489: add missing type_traits include. [Abdul Dakkak]

add missing type_traits include
- Merged PR 2482: Fix parameterized caches producing multiple caches
erroneously. [Mason Remy]

Fix parameterized caches producing multiple caches erroneously

- This is more of a one-off fix. A more generalized fix for resetting
schedules/plans for different parameter value resolution should be
implemented down the road
- Merged PR 2479: FP16 tensorization for ROCM. [Abdul Dakkak]
- Merged PR 2472: Tensorization + Caching. [Abdul Dakkak]
- Merged PR 2485: Add another keyword to function's auxiliary table.
[Denny Sun]

Add 'parameters' keyword to the parameter values in a function's auxiliary table; the table will then look like:

```toml
[functions.matmul_256_256_256_bdec0fac.auxiliary.accera]
[functions.matmul_256_256_256_bdec0fac.auxiliary.accera.parameters]
p_m_split_size = 16
p_n_split_size = 128
p_s_split_size = 256
p_s_split_2_size = 8
p_n_split_2_size = 16
p_n_split_3_size = 4
```

Related work items: 3662
- Merged PR 2484: [Pipelines] Enable uploads to PyPI when tagging a
release. [Lisa Ong]

Configurable service connection variable, allows setting of test and production PyPI service connections during scheduling.

Also cleaned up a stale workaround for auditwheel in the ManyLinux pipeline.
- Merged PR 2471: Fix to caching. [Abdul Dakkak]

This avoids the aggressive cache deletion specifically when it occurs within a loop. This is a temporary fix; a more elegant one would be to handle memory access info across loop boundaries.
- Merged PR 2476: Add accera.create_parameter_grid() with self-defined
filter and sample as arguments. [Denny Sun]

Provide a generic DSL function for users to create a parameter list from a dictionary (grid), supply a user-defined filter function to remove invalid parameter values, and limit the size of the parameter grid as well as the number of functions generated.

We discovered the need for this function while updating our matmul grid search case study.

Related work items: 3662
- Merged PR 2483: [Test] Integrate FileCheck into Python tests. [Lisa
Ong]

- Added FileCheck utility to the accera-llvm package
- Can be run on any output file produced by the Package.build process, e.g. .cu, .mlir
- Support some basic directives
- Added examples for caching and rocm validation

Example error spew:


```
/root/Accera/build/lib.linux-x86_64-3.9/accera/bin/FileCheck /root/Accera/build/lib.linux-x86_64-3.9/test_acccgen/test_rocm_gemm_tiled_output/test_rocm_gemm_tiled_output.cu.filecheck --input-file /root/Accera/build/lib.linux-x86_64-3.9/test_acccgen/test_rocm_gemm_tiled_output/test_rocm_gemm_tiled_output.cu

/root/Accera/build/lib.linux-x86_64-3.9/test_acccgen/test_rocm_gemm_tiled_output/test_rocm_gemm_tiled_output.cu.filecheck:2:16: error: CHECK-COUNT: expected string not found in input (4 out of 4)
CHECK-COUNT-4: for (int64_t idx{{[0-9]}} = 0; idx{{[0-9]}} < 16; idx{{[0-9]}} += 1) {
               ^
/root/Accera/build/lib.linux-x86_64-3.9/test_acccgen/test_rocm_gemm_tiled_output/test_rocm_gemm_tiled_output.cu:42:47: note: scanning from here
 for (int64_t idx2 = 0; idx2 < 16; idx2 += 1) {
                                              ^

Input file: /root/Accera/build/lib.linux-x86_64-3.9/test_acccgen/test_rocm_gemm_tiled_output/test_rocm_gemm_tiled_output.cu
Check file: /root/Accera/build/lib.linux-x86_64-3.9/test_acccgen/test_rocm_gemm_tiled_output/test_rocm_gemm_tiled_output.cu.filecheck

-dump-input=help explains the following input dump.

Input was:
<<<<<<
           .
           .
           .
          37:
          38:
          39: extern "C" __global__ __launch_bounds__(1) void test_rocm_gemm_tiled_output_710d7d7d2ca9ca9e__gpu__(float *arg0, float *arg1, float *arg2) {
          40:  for (int64_t idx0 = 0; idx0 < 16; idx0 += 1) {
          41:  for (int64_t idx1 = 0; idx1 < 16; idx1 += 1) {
          42:  for (int64_t idx2 = 0; idx2 < 16; idx2 += 1) {
count:2        X error: no match found
          43:  /*%0 = memref.load %arg0[%arg3, %arg5] : memref<16x16xf32, affine_map<(d0, d1) -> (d0 * 16 + d1)>>*/
count:2        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          44:  const auto arg0_offset0 = affine_map_func_0_i0(idx0, idx2);
count:2        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          45:  float var3 = ((float*)arg0)[arg0_offset0];
count:2        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          46:  /*%1 = memref.load %arg1[%arg5, %arg4] : memref<16x16xf32, affine_map<(d0, d1) -> (d0 * 16 + d1)>>*/
count:2        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          47:  const auto arg1_offset1 = affine_map_func_0_i0(idx2, idx1);
count:2        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           .
           .
           .
>>>>>>
```

- Merged PR 2480: Clean up cache vectorization argument plumbing. [Mason
Remy]

Clean up cache vectorization argument plumbing
- Merged PR 2481: Enables verification for ROCm smoke tests. [Kern
Handa]
- Merged PR 2473: Extends range analysis by adding support for
udiv,sdiv,urem,srem. [Abdul Dakkak]

These come up during code gen.
- Merged PR 2474: Add vectorize arg to plan.cache. [Mason Remy]

Add vectorize arg to plan.cache

- Enables specifying whether or not to vectorize ops for a given cache,
including an "AUTO" option, which will behave how caching
vectorization has behaved in the past, where it vectorizes the cache
if any loop in the loopnest is also vectorized
- Also fix some include paths

**Full Changelog**: https://github.com/microsoft/Accera/compare/v1.2.2...v1.2.3

1.2.2

What's Changed

**Full Changelog**: https://github.com/microsoft/Accera/compare/v1.2.1...v1.2.2

* Add Ubuntu CI workflow by lisaong in https://github.com/microsoft/Accera/pull/9
* Rework documentation sections by lisaong in https://github.com/microsoft/Accera/pull/8
* Manually run script to update doc versions by lisaong in https://github.com/microsoft/Accera/pull/10
* Show more perf gains for the quickstart example by lisaong in https://github.com/microsoft/Accera/pull/12
* Fix post merge build break by lisaong in https://github.com/microsoft/Accera/pull/14
* README.md refactoring by Arslan-e-Mustafa in https://github.com/microsoft/Accera/pull/13
* Complete refactoring of file array.md and simple affine loop nests.md file in manual docs by Arslan-e-Mustafa in https://github.com/microsoft/Accera/pull/16
* complete refactoring of introduction.md file in manual docs by Arslan-e-Mustafa in https://github.com/microsoft/Accera/pull/15
* Complete refactoring of vectorization and parallelization of manual docs by Arslan-e-Mustafa in https://github.com/microsoft/Accera/pull/20
* Complete refactoring of targets.md and previous typos by Arslan-e-Mustafa in https://github.com/microsoft/Accera/pull/18
* Complete refactoring of caching.md file by Arslan-e-Mustafa in https://github.com/microsoft/Accera/pull/19
* Complete refactoring of schedule.md file from manual docs by Arslan-e-Mustafa in https://github.com/microsoft/Accera/pull/17
* Complete refactoring of manual docs: Deferred Layout of Constant Arrays by Arslan-e-Mustafa in https://github.com/microsoft/Accera/pull/21
* Refactoring of packages(dot)md from manual docs by Arslan-e-Mustafa in https://github.com/microsoft/Accera/pull/23
* Complete refactoring of parameters dot md from manual docs by Arslan-e-Mustafa in https://github.com/microsoft/Accera/pull/22

- Merged PR 2439: Downstream doc changes from github/main. [Lisa Ong]

Squashed commit of the following:

commit 8a6e5535efe7cdf11e614b11abc5bde14ee76d5b
- Merged PR 2440: Enable tensorization for Rocm target. [Abdul Dakkak]
- Merged PR 2470: Adds support for the execution of GPU (CUDA only)
functions via hat. [Kern Handa]
- Merged PR 2467: Adding multiple functions in package.add() can't work
with stateful auxiliary metadata and index_map. [Denny Sun]

These bugs all involve sharing Python objects, such as auxiliary metadata and a schedule's indices, among different functions. When we call package.add() to add multiple parameterized functions, we add the functions one by one and then emit them one by one; at each step the state of a shared Python object changes, which results in only the first added function being emitted correctly. To make _add_function work, we need to make these shared Python objects stateless.

Related work items: 3662
- Merged PR 2469: Convert 'Local' memory space to 'Private' [Mason Remy]

Convert 'Local' memory space to 'Private'
- Merged PR 2463: Enable specifying double buffer memory space. [Mason
Remy]

Enable specifying double buffer memory space
- Merged PR 2468: Move to VS2022 for builds. [Kern Handa]

Move to VS2022 for builds
- Merged PR 2465: extend gpu target spec. [Abdul Dakkak]

extend gpu target spec
- Merged PR 2464: Compute a stable hash for function name suffixes.
[Lisa Ong]

Create a stable hash using md5 and json serialization of these stringized entries:
- Array args: shape, type, role, layout
- Parameter dictionary
- Target

Example output:

test_unequal_iteration_space_fusing_1 (__main__.DSLTest_04Fusing) ... DEBUG:root:Adding wrapped function
DEBUG:root:Adding wrapped function
Building function fusing_test_32d12fb1a01061ec
DEBUG:root:Detected logic function _ uses indices i,j
DEBUG:root:Detected logic function _ uses indices i,j
Building function _debug_check_allclose_16_16_4cfd65a8b606655b
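
The hashing scheme can be sketched in plain Python; the exact fields and suffix length below are illustrative assumptions, not Accera's actual serialization.

```python
import hashlib
import json

# Hedged sketch of the stable-suffix scheme described above: md5 over a
# JSON serialization of stringized signature entries (array args, the
# parameter dictionary, and the target).
def stable_suffix(array_args, parameters, target):
    payload = json.dumps(
        {
            "args": [str(a) for a in array_args],   # shape/type/role/layout
            "parameters": {k: str(v) for k, v in sorted(parameters.items())},
            "target": str(target),
        },
        sort_keys=True,
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()[:16]

suffix = stable_suffix([("256x256", "float32", "INPUT")], {"m": 256}, "HOST")
# the same inputs always produce the same hex suffix across runs, unlike
# naming schemes based on object identity or insertion order
```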

- Merged PR 2460: [nfc] Fix build.sh setting for vcpkg debug builds.
[Lisa Ong]
- Merged PR 2461: Replace MemoryType with MemorySpace for consistency.
[Mason Remy]

Replace MemoryType with MemorySpace for consistency
- Merged PR 2416: Implement initial thrifty caching support. [Mason
Remy]

Implement initial thrifty caching support

- This is a simple brute-force approach where each thrifty cache is
  examined element-by-element alongside the array it is caching to check
  whether there is a stride of 1 between successive accesses
- Currently this thrifty analysis and the potential erasing of thrifty
caches happens after the cache ops have been created. This is due to
needing the cache mapping to have already run in order to support
hierarchical caching scenarios. Eventually this should be refactored
and the thrifty analysis should be used to prevent creating the cache
ops, but that is a larger refactor than the scope for this task.
- When creating affine loads and stores into caches, this change also
  tacks some attributes onto the load/store ops to indicate how the
  original load or store accessed the base array. Since the base array
  -> cache position mapping is not always invertible (consider
  coefficient cache layout cases), this is one of the only ways to
  encode this information. Unfortunately, canonicalization on affine
  load/store ops will scrub away these attributes, so any reliance on
  them has to occur before a canonicalization pass. Similarly, the
  MakeCacheOps' record of which arguments of their accesses are the base
  array positions depends on the operand list being unchanged; however,
  canonicalization may remove operands if it determines they are not
  used. While this is fine for the load/store op itself, any assumption
  like "base array indices are at positions N...N+K in the operand list"
  is no longer valid

Related work items: 3575
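
The check can be illustrated with a plain-Python sketch (the real analysis runs on the IR, not on Python lists):

```python
# Plain-Python sketch of the brute-force thrifty check described above:
# walk the cache elements in order and verify that the flat base-array
# offsets they copy from advance with stride 1. If they do, the cache is a
# redundant contiguous copy of the array region and can be elided.

def is_thrifty(base_offsets):
    """base_offsets: flat base-array offset read by each successive cache
    element, listed in cache iteration order."""
    return all(b - a == 1 for a, b in zip(base_offsets, base_offsets[1:]))

# Caching a contiguous run of the base array in the same order is thrifty:
assert is_thrifty([6, 7, 8, 9, 10, 11])
# A reordered (e.g. transposed) cache layout of the same elements is not:
assert not is_thrifty([6, 9, 7, 10, 8, 11])
```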
- Merged PR 2459: Changes the order of the LLVM_SETUP_VARIANT detection.
[Kern Handa]

Changes the order of the LLVM_SETUP_VARIANT detection
- Merged PR 2458: Fixes building with clang++ on Linux/WSL. [Kern Handa]

Fixes building with clang++ on Linux/WSL
- Merged PR 2438: Support for double-buffer caching. [Mason Remy]

Support for double-buffer caching

- Adds plumbing from python dsl for double_buffer flag to cache API
- Implements double buffering by hoisting the initial cache fill outside
  of the cache trigger loop parent, then creating a prologue subnest
  that fills a temporary buffer with the (i+1)'th iteration's data and an
  epilogue subnest that moves that temporary buffer's data into the main
  cache buffer. The last iteration of the trigger loop's parent loop is
  unswitched, and no cache filling is done in that iteration.
- On GPU, the temporary buffer is allocated in private memory; if the
  cache is in shared memory, each thread holds onto its own contribution
  to the cache in its own private memory buffer until the epilogue fill
  nest
- Barrier ops are hoisted out of conditionals to avoid the potential for
  deadlocks. The conditionals introduced in this PR should be
  always-true or always-false, but this is added as a safety measure.
  Currently the hoisting is naive: any barrier within a conditional is
  erased and barriers are placed before and after the conditional block.
  This is not correct for all future conditional scenarios, as any
  operations within the conditional that depend on the barrier existing
  will be broken; however, it works for how conditionals are used
  currently and can be improved over time

Related work items: 3659
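
The loop structure can be sketched in plain Python (an illustration of the schedule shape, not Accera's cache API):

```python
# Plain-Python sketch of the double-buffering schedule described above: the
# initial cache fill is hoisted before the loop, each iteration's prologue
# prefetches iteration i+1 into a temporary buffer, and the epilogue moves
# it into the main cache; the last (unswitched) iteration does no prefetch.

def double_buffered_sum(tiles):
    cache = list(tiles[0])            # hoisted initial cache fill
    total = 0
    for i in range(len(tiles)):
        prefetch = i + 1 < len(tiles)
        if prefetch:                  # prologue subnest: fill temp buffer
            temp = list(tiles[i + 1])
        total += sum(cache)           # main work reads only the cache
        if prefetch:                  # epilogue subnest: temp -> main cache
            cache = temp
    return total

assert double_buffered_sum([[1, 2], [3, 4], [5, 6]]) == 21
```

Because the prefetch of tile i+1 overlaps the work on tile i, a real implementation hides the fill latency behind the compute; the sketch only shows the data movement order.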
- Merged PR 2450: Automatically add parameter dict as auxiliary data.
[Denny Sun]

Automatically add parameter dict as auxiliary data

Related work items: 3662
- Merged PR 2456: Updates CUDA source emission based on testing with
nvrtc. [Kern Handa]

Updates CUDA source emission based on testing with nvrtc
- Merged PR 2453: Sets CPU targets to default to openmp. [Kern Handa]

Sets CPU targets to default to openmp
- Merged PR 2443: Add FP16 support. [Abdul Dakkak]

preparation for adding mfma support for CUDA which only operates on FP16
- Merged PR 2452: Updates GPU source emitting path to emit host launcher
and device function pairs. [Kern Handa]
- Merged PR 2451: Updates IR util ResolveExec[Target,Runtime] to allow
for exact matches. [Kern Handa]

Updates IR util ResolveExec[Target,Runtime] to allow for exact matches
- Merged PR 2447: Makes Vulkan specific behavior pred. on Runtime. [Kern
Handa]

Makes Vulkan specific behavior pred. on Runtime
- Merged PR 2446: Updates Runtime enum in Targets.py to be more
comprehensive. [Kern Handa]

Updates Runtime enum in Targets.py to be more comprehensive
- Merged PR 2449: [Cleanup] Replace "rc*_" prefixes with "acc*_"
prefixes in tablegen'ed code. [Lisa Ong]

For *.td, perform the following replacements for ops:

s/rcv_/accv_/g
s/rc_/acc_/g
s/rcxp_/accxp_/g
s/rcln_/accln_/g
- Merged PR 2448: fix typo in the condition for mod in range analysis.
[Abdul Dakkak]

fix typo in the condition for mod in range analysis
- Merged PR 2445: Fix bind command when index is further split. [Abdul
Dakkak]
- Merged PR 2444: add range remainder. [Abdul Dakkak]

add range remainder
- Merged PR 2441: Fix APInt usage in RangeValueOptimizePass. [Mason
Remy]

Run the RangeValueOptimizePass as part of acc-to-llvm
- Merged PR 2442: Move ExecutionOptions to ir lib and create arrayattr
<-> struct utils. [Mason Remy]

Move ExecutionOptions to ir lib and create arrayattr <-> struct utils
- Simplify target passthrough layer. [Mason Remy]
- Move ExecutionOptions to ir lib and create arrayattr <-> struct utils.
[Mason Remy]
- Merged PR 2430: Remove unnecessary barrier ops. [Chuck Jacobs]

This PR adds an optimization pass that removes redundant / unnecessary barrier ops around shared memory usage.

The optimization pass in this PR is pretty simple and has a couple of limitations:
- it only works on straight-line code (that is, when all the loads, stores, and barriers are at the same loop level as each other).
- it considers all accesses to a specific array to be conflicts (that is, any write to an array followed by a read of that array will want to have a barrier in between them, even if the writes and reads are to different elements in the array)

I should be following up with a PR that deals with barrier and memory ops at different loop levels pretty soon after this.

Related work items: 3648
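
The pass's conservative model can be sketched in plain Python: any array written before a barrier and read after it (regardless of which elements) makes the barrier necessary; otherwise the barrier is removed.

```python
# Plain-Python sketch of the straight-line barrier optimization described
# above. Ops are tuples: ("write", name), ("read", name), or ("barrier",).

def remove_redundant_barriers(ops):
    kept = []
    for i, op in enumerate(ops):
        if op[0] != "barrier":
            kept.append(op)
            continue
        written_before = {o[1] for o in ops[:i] if o[0] == "write"}
        read_after = {o[1] for o in ops[i + 1:] if o[0] == "read"}
        if written_before & read_after:   # real producer/consumer hand-off
            kept.append(op)
    return kept

ops = [("write", "A"), ("barrier",), ("read", "A"),
       ("barrier",), ("read", "B")]
kept = remove_redundant_barriers(ops)
# the first barrier survives (A is written before it and read after it);
# the second is removed (only B, which is never written, is read after it)
```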
- PR comments. [Charles Jacobs]
- Fixed lit test. [Charles Jacobs]
- Remove duplicated code Remove hardcoded numeric memory space values.
[Charles Jacobs]
- Remove multiprocessing import. [Abdul Dakkak]
- Add python test cases. [Abdul Dakkak]
- Cleanup. [Charles Jacobs]
- Added lit test Removed debug output. [Charles Jacobs]
- Basic version working (straight-line code, all-or-nothing memory
accesses) Moved much of the analysis from the "analysis" class to the
pass. [Charles Jacobs]

added try/except in __init__ for gpu and llvm submodules
- Propagating read info. [Charles Jacobs]
- Removed barrier op rewrite pattern. [Charles Jacobs]
- Simple case of all-or-nothing write-only barriers on inline code
working (I think) [Charles Jacobs]
- Tweaked barrier opt pass Added BarrierOp to Python DSL Added
BarrierScope to Python DSL. [Charles Jacobs]
- Add trivial barrier optimization pass. [Charles Jacobs]
- Propagate the runtime info throughout the accera pipeline. [Abdul
Dakkak]
- Merged PR 2431: Support split after fusing unequal iteration spaces.
[Lisa Ong]

* For each correspondence index entry, perform end-padding to the largest-sized index range
* Support different sized iteration spaces in any fusion order
* Add boundary block splits whenever the InRange predicate is applied on an outer split index. Currently, this can incur more splits than necessary to ensure correctness

Related work items: 3476
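
The end-padding idea can be illustrated with a plain-Python sketch: corresponding indices with unequal ranges are fused by padding to the largest range, and each kernel is guarded by an InRange-style predicate over its original extent.

```python
# Plain-Python sketch of fusing two schedules whose corresponding index
# has different ranges; the shorter range is end-padded to the larger one.

def fuse_unequal(extent_a, extent_b, body_a, body_b):
    for i in range(max(extent_a, extent_b)):   # fused, end-padded range
        if i < extent_a:                       # InRange guard, schedule A
            body_a(i)
        if i < extent_b:                       # InRange guard, schedule B
            body_b(i)

calls = []
fuse_unequal(3, 5, lambda i: calls.append(("a", i)),
                   lambda i: calls.append(("b", i)))
# body_a ran 3 times and body_b ran 5 times inside one fused loop of extent 5
```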
- Merged PR 2436: [Debug mode] Support subarray's as function arguments.
[Lisa Ong]

Force an identity affine map if it cannot be canonicalized

Related work items: 3647
- Merged PR 2435: Binding thread and block ids updates launch params.
[Kern Handa]

Binding thread and block ids updates launch params
- Merged PR 2433: Remove defunct accera dialect transforms. [Kern Handa]

Remove defunct accera dialect transforms
- Merged PR 2432: split range optimization into two parts (analysis and
optimization) [Abdul Dakkak]

this uses the analysis pipeline to implement the range analysis; the range optimization is one instance of code that uses the range analysis info
- Merged PR 2428: Perform Optimization using Range Analysis. [Abdul
Dakkak]

This uses range reduction to remove provably true/false conditions. For example, given the following MLIR file:


gpu.module NestFunction_0_module attributes {gpu.binary = "HSACO"} {
gpu.func NestFunction_0(%arg0: memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>, %arg1: memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>) kernel attributes {blockSize = [16 : i32, 16 : i32, 1 : i32], gridSize = [2 : i32, 2 : i32, 1 : i32]} {
%c16 = constant 16 : index
%c8 = constant 8 : index
%c0 = constant 0 : index
%c15 = constant 15 : index
%c-1 = constant -1 : index
%c1 = constant 1 : index
%c2 = constant 2 : index
%0 = "gpu.thread_id"() {dimension = "y"} : () -> index
%1 = "gpu.thread_id"() {dimension = "x"} : () -> index
%2 = "gpu.block_id"() {dimension = "y"} : () -> index
%3 = "gpu.block_id"() {dimension = "x"} : () -> index
%4 = memref.alloc() : memref<32x16xf32, 3>
scf.for %arg2 = %c0 to %c2 step %c1 {
%10 = cmpi sge, %arg2, %c0 : index
%11 = muli %arg2, %c-1 : index
%12 = addi %11, %c1 : index
%13 = cmpi sge, %12, %c0 : index
%14 = and %10, %13 : i1
%15 = cmpi sge, %0, %c0 : index
%16 = and %14, %15 : i1
%17 = muli %0, %c-1 : index
%18 = addi %17, %c15 : index
%19 = cmpi sge, %18, %c0 : index
%20 = and %16, %19 : i1
%21 = cmpi sge, %1, %c0 : index
%22 = and %20, %21 : i1
%23 = muli %1, %c-1 : index
%24 = addi %23, %c15 : index
%25 = cmpi sge, %24, %c0 : index
%26 = and %22, %25 : i1
scf.if %26 {
%27 = muli %3, %c16 : index
%28 = addi %27, %c8 : index
%29 = memref.load %arg0[%28, %c0] : memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>
memref.store %29, %4[%c0, %c8] : memref<32x16xf32, 3>
}
}
gpu.barrier
%5 = muli %2, %c16 : index
%6 = addi %0, %5 : index
%7 = memref.load %4[%6, %1] : memref<32x16xf32, 3>
%8 = muli %3, %c16 : index
%9 = addi %1, %8 : index
memref.store %7, %arg1[%9, %6] : memref<32x32xf32, affine_map<(d0, d1) -> (d0 * 32 + d1)>>
gpu.return
}
}



one can run it through the `--optimize-range-value --cse --sccp --symbol-dce` pass sequence to get


map = affine_map<(d0, d1) -> (d0 * 32 + d1)>
module {
gpu.module NestFunction_0_module attributes {gpu.binary = "HSACO"} {
gpu.func NestFunction_0(%arg0: memref<32x32xf32, map>, %arg1: memref<32x32xf32, map>) kernel attributes {blockSize = [16 : i32, 16 : i32, 1 : i32], gridSize = [2 : i32, 2 : i32, 1 : i32]} {
%true = constant true
%c2 = constant 2 : index
%c1 = constant 1 : index
%c-1 = constant -1 : index
%c15 = constant 15 : index
%c0 = constant 0 : index
%c8 = constant 8 : index
%c16 = constant 16 : index
%0 = "gpu.thread_id"() {dimension = "y"} : () -> index
%1 = "gpu.thread_id"() {dimension = "x"} : () -> index
%2 = "gpu.block_id"() {dimension = "y...
- Merged PR 2395: Add strided View implementation for containers.
[Ritwik Das]

- Add strided View implementation for containers
- added tests

Related work items: 3641
- Merged PR 2429: Merged changes from GitHub remote. [Lisa Ong]

commit 8c0fb548afcd376de5d1832835166ffb9dc04e57
- Squashed commit of the following: [Lisa Ong]

commit 8c0fb548afcd376de5d1832835166ffb9dc04e57
- Squashed commit of the following: [Lisa Ong]

commit 4fccbb9af41d0638d32afe7c795e2b0eda836df5
- Merged PR 2427: Disable CI triggers for weekly SDL pipelines. [Lisa
Ong]

`trigger` needs to be explicitly set to `none` to be disabled, otherwise it will run as a CI trigger on the default branch.
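
A minimal Azure Pipelines sketch of this setting (the cron cadence and branch name are illustrative):

```yaml
# Omitting `trigger` implies a CI trigger on the default branch, so a
# scheduled-only pipeline must disable it explicitly:
trigger: none
schedules:
  - cron: "0 0 * * 1"        # illustrative weekly cadence
    displayName: Weekly run
    branches:
      include: [main]
```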
- Merged PR 2426: Expose and plumb support for ARM Cortex-M4F. [Kern
Handa]

Expose and plumb support for ARM Cortex-M4F
- Merged PR 2425: Set non-CI pipelines on a weekly schedule. [Lisa Ong]

Setup a weekly schedule to conserve resources. The pipelines will run at the beginning of the week to verify the payload from the previous week.

Also removed the main trigger from Linux Package Build as it is only used in CI.
- Merged PR 2424: Const arrays reused in two different functions results
in duplicate symbol name. [Denny Sun]

valueModuleOp.lookupSymbol() should be called here to look for an existing symbol, but so far it doesn't work as expected, so we manually walk the top-level ops inside the ValueModuleOp to look for the symbol. Replace this workaround with a ValueModuleOp SymbolTable lookup once the issues with comparing mlir::Identifiers are resolved.

Related work items: 3583
- Fix the typo and simplify for loop. [Denny Sun]
- Fix the build break caused by typo. [Denny Sun]
- Walk the top level ops inside the ValueModuleOp to look for the
symbol. [Denny Sun]
- Correct the logic. [Denny Sun]
- Draft PR for fixing dup global const. [Denny Sun]
- Merged PR 2422: Simplify implicit type casting check and add support
for more conversions. [Kern Handa]

Simplify implicit type casting check and add support for more conversions
- . [Kern Handa]
- Change controversial impl. [Kern Handa]
- Simplify implicit type casting check and add support for more
conversions. [Kern Handa]
- Merged PR 2421: Update accera-translate to emit CUDA code from acc-opt
output. [Kern Handa]

Update accera-translate to emit CUDA code from acc-opt output

This change adds support for new ops and refactors support for existing
ops, with the primary goal being the emission of CUDA code from the IR
output from `acc-opt` (when given the proper CL args). A secondary goal
was to clean up the code and make it easier to expand the support for
both C++ and CUDA-like code in the future.
- Merged PR 2419: Add automatic type casting for basic scalar types.
[Denny Sun]

This change enables Scalar to perform upcasting, e.g. Int8 to Float, Int8 to Int32, Float to Double, etc.
With this fix, users can write the following Python code without explicit type casting, where A is an array object of Float type:
A[i, j] = 5

Related work items: 3570
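
The rule can be sketched in plain Python (the type names and widening table below are illustrative, not Accera's internal lattice): narrower values upcast implicitly, while narrowing stays an error.

```python
# Plain-Python sketch of the implicit upcasting described above:
# Int8 -> Float, Int8 -> Int32, Float -> Double, etc.

WIDENING = {("int8", "int32"), ("int8", "float32"),
            ("int32", "float32"), ("float32", "float64")}

def implicit_cast(value, from_type, to_type):
    if from_type == to_type:
        return value
    if (from_type, to_type) in WIDENING:
        return float(value) if to_type.startswith("float") else int(value)
    raise TypeError(f"no implicit cast {from_type} -> {to_type}")

# A[i, j] = 5 against a Float array: the int literal upcasts silently
assert implicit_cast(5, "int8", "float32") == 5.0
```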
- Add more type cast. [Denny Sun]
- Type casting support in Scalar. [Denny Sun]
- Merged PR 2420: Add smoke test repro'ing CONST array bug. [Mason Remy]

Add smoke test repro'ing CONST array bug
- Add smoke test repro'ing CONST array bug. [Mason Remy]
- Merged PR 2418: Fix Package format enum and memory space tablegen
enum. [Mason Remy]

Fix Package format enum and memory space tablegen enum
- Fix Package format enum and memory space tablegen enum. [Mason Remy]
- Merged PR 2415: propagate the runtime info throughout the accera
pipeline. [Abdul Dakkak]

propagate the runtime info throughout the accera pipeline
- Updated smoke_test.py. [Abdul Dakkak]
- Resolve comments. [Abdul Dakkak]
- Propagate the runtime info throughout the accera pipeline. [Abdul
Dakkak]
- Merged PR 2410: Initial work on supporting GPU Caching. [Abdul Dakkak]

This merges Alex's work on GPU caching. Tests do not run for reasons unrelated to caching
- Resolve merge conflicts. [Abdul Dakkak]
- Rename memory_space to location per the docs. [Abdul Dakkak]
- Checkpoint. [Abdul Dakkak]
- Continue merge. [Abdul Dakkak]
- Continue merge. [Abdul Dakkak]
- Merge from main. [Abdul Dakkak]
- Resolve comments. [Abdul Dakkak]
- Disable rocm test for now. [Abdul Dakkak]
- Fix typo. [Abdul Dakkak]
- Do not generate the global helpers if we are generating only gpu code
and the target is not an object. [Abdul Dakkak]
- Merge branch 'dev/kerha/fix_quiet_and_gpu_only' into
dev/adakkak/initial_gpu_caching. [Abdul Dakkak]
- Fix quiet and gpu_only integrations in FE. [Kern Handa]
- Fix setting of target category. [Abdul Dakkak]
- Checkpoint. [Abdul Dakkak]
- Initial work on merging Alex's branch. [Abdul Dakkak]
- Merged PR 2412: Merged commits from GitHub main. [Lisa Ong]

Changes:

commit fa63d6e9cde2b86df28e824bc907cee7a25f9d8b
- Squashed commit of the following: [Lisa Ong]

commit 2668b4ba996cf1ee488509881682e5a8937c13dd
- Update vcpkg. [Lisa Ong]
- Removed stale files from merge. [Lisa Ong]
- Squashed commit of the following: [Lisa Ong]

commit fa63d6e9cde2b86df28e824bc907cee7a25f9d8b
- Merged PR 2414: LLVM 13.0.1 update. [Lisa Ong]

Dependent PR: !2413

Incremental version update to LLVM, no changes in MLIR: https://github.com/llvm/llvm-project/compare/llvmorg-13.0.0...llvmorg-13.0.1

Related work items: 3620
- Merged PR 2409: [nfc] Pull out ACCERA_TOOLS_DIR to the main
CMakeLists.txt file. [Kern Handa]

[nfc] Pull out ACCERA_TOOLS_DIR to the main CMakeLists.txt file
- Merged PR 2408: [nfc] Update python style config, apply it on tree.
[Kern Handa]

[nfc] Update python style config, apply it on tree
- Merged PR 2407: Fix quiet and gpu_only integrations in FE. [Kern
Handa]

Fix quiet and gpu_only integrations in FE
- Merged PR 2406: Update accc.py to build with gpu_only mode. [Kern
Handa]

Update accc.py to build with gpu_only mode
- Merged PR 2404: Add support for translation of accera dialect to
cpp/cuda. [Kern Handa]
- Merged PR 2403: Update Python API for GPU support. [Kern Handa]

Update Python API for GPU support

Changes:

* Add support for gpu_only compilation mode in python layer
* Add support for specifying the execution runtime as a compiler option
* Adds the _gpu submodule to the accera module, which adds support for GPU specific ops - MFMA and Barrier, along with Python hooks for GPU Indices
- Merged PR 2402: Update Value library to support GPU related changes.
[Kern Handa]

Update Value library to support GPU related changes

* Add support for performing only GPU related compilations to compiler front-end
* Add support for strided views to Value DSL
* Matrix::SubMatrix has been updated to accept row, col strides
* Add support for MFMA ops in Value DSL (WIP)
* Add support for tensorization to Value DSL's GPUPlan
- Merged PR 2401: Adds the GPU pass pipeline and related transforms.
[Kern Handa]

Adds the GPU pass pipeline and related transforms

Additionally:
* Added support for adding execution runtime annotations to functions, enabling target compilation control
* Adds support for Tensorization
* Adds support for MFMA (WIP)
* Adds a debug utility transform to dump the module, but disabled from the build by default
* Improved subview op semantics, enabling support for strided views
- Merged PR 2400: Add support for MFMA, Tensorization, and Exec Runtime
in Value IR. [Kern Handa]

Add support for MFMA, Tensorization, and Exec Runtime in Value IR
- Merged PR 2399: Integrate acc-translate into the python layer. [Kern
Handa]

This adds build support for calling acc-translate after the MLIR transformations are done. Work is still pending for front-end and back-end support for GPU.
- Merged PR 2385: Add Execution Runtime to module. [Abdul Dakkak]

This continues on the merge path. It adds an execution runtime option to a GPUPlan which gets propagated and added to the module. Subsequent PRs will use this feature to dispatch to Rocm or SPIRV.
- Merged PR 2398: quiet the build output from python by default. [Kern
Handa]

quiet the build output from python by default

Related work items: 3557
- Merged PR 2397: Disable command line execution of accc.py. [Kern
Handa]

Disable command line execution of accc.py
- Merged PR 2396: AVX512 was being incorrectly gated. [Kern Handa]

AVX512 was being incorrectly gated
- Merged PR 2393: Pick up demo fixups from github/main. [Lisa Ong]

The binder demo now runs in a separate branch github/demos so that binder does not try to build the accera package.

The github/demos branch will not be merged into github/main (similar to github/gh-pages), and lives to host binder demos.
- Merged PR 2394: Support Python 3.10. [Lisa Ong]

* Add Python 3.10 to package builds and as the default Python version for all pipelines
* Exception: Linux, macOS, and Windows buddy builds rely on onnxruntime, which only supports up to 3.9. Since these pipelines only build one Python version, we'll keep them at 3.9 so that there's test coverage for ORT.

Related work items: 3643

New Contributors
* Arslan-e-Mustafa made their first contribution in https://github.com/microsoft/Accera/pull/13

1.2.1

What's Changed

- Merged PR 2391: Update quickstart example, updated docs structure per
feedback. [Lisa Ong]

* Teasers for transformations in the Quickstart sample (to differentiate Accera from others), with benchmarking
* Removed the Miscellaneous section, redistributed various docs to various related locations
* Renamed the cross compilation tutorial so that it is ordered last
- Merged PR 2392: Populate Target.Models based on known devices. [Kern
Handa]

Populate Target.Models based on known devices
- Merged PR 2390: Merge multiple HAT files during project building.
[Kern Handa]

Merge multiple HAT files during project building

Related work items: 3559
- Merged PR 2386: Add support for various targets. [Kern Handa]

Add support for various targets

Related work items: 3631
- Merged PR 2389: [nfc] Doc typos and consistency fixes. [Lisa Ong]
- Merged PR 2388: Update quickstart example, add binder quickstart.
[Lisa Ong]

* Update quickstart example to perform a matmul + ReLU (unoptimized)
* Add Launch in Binder button to run everything in the browser
- Merged PR 2387: Placeholder GPU GridUnit definitions, add library
creation from multiple object files. [Lisa Ong]

Dependent HAT PR: https://github.com/microsoft/hat/pull/21

* GridUnit definitions are static until we have real GPU targets. These are updated just to be consistent with the Manual
* When not cross compiling, combine multiple .obj/.o into .lib/.a

Related work items: 3576
- Merged PR 2384: Update target docs, split Intel generation 8 and 9 for
consistency. [Lisa Ong]

* Update target docs to list the name of the target in the table
* Define separate models for Intel generation 8 and 9 for consistency

Related work items: 3631
- Merged PR 2383: Support dynamic libs from `Package.build` [Lisa Ong]

* Add static and dynamic variants to the HAT and MLIR formats
* MLIR format is also split because we'd want to support MLIR inspection of the cross-compilation scenario without forcing users to switch between dynamic and static
* Updated README sample

Left for future work:
* Combining multiple object files into a static lib or dynamic lib. We'd need to think about how HAT packages can be merged together (for example, how to reconcile the metadata in the HAT file, such as description, author - do we merge all metadata or just pick the first HAT file encountered as the "master", etc)

Related PR: https://github.com/microsoft/hat/pull/18

Related work items: 3576
- Merged PR 2382: [nfc] Move Case Studies out of the Accera repo. [Lisa
Ong]

Case Studies will live in other repositories, and be cross linked from the Accera repo's Case Studies README.md (to be added in the future).

Related work items: 3632
- Merged PR 2379: Specify dynamic lib dependencies from the HAT Package.
[Lisa Ong]

This is the final missing piece before we transition to building static / dynamic libs using `hatlib`.

* `Plan` infers additional dynamic dependencies when the target is GPU or when parallelization is requested.
* `Package.add` collects the dependency info from the various `Plan` instances.
* In `Package.build`, the platform parameter is used to resolve to the appropriate library (either a path or a -l directive).
* For library paths that cannot be fully determined in advance, we default to the current working directory, so the user can put the lib in the same path as the binaries. (This needs to be fleshed out more.)
* Removed dead code

Dependent hatlib PR: https://github.com/microsoft/hat/pull/16/files

Related work items: 3576
- Merged PR 2380: Add Raspberry Pi 4 (B) support. [Kern Handa]

Related work items: 3631
- Merged PR 2368: Update and optimize acc-translate. [Abdul Dakkak]

- propagate constants while generating C++ code
- inline mlir within the C++ code to ease debugging
- increase support for vector ops
- silence a lot of warnings that were being emitted in the acc-translate codebase

The following


// CONFIG: {"K":2048,"M":2048,"N":2048,"block":{"x":16,"y":16,"z":1},"grid":{"x":128,"y":128,"z":1}}
module gemm_naive_14479263422999410716_module attributes {gpu.binary = "HSACO"} {
func gemm_naive_14479263422999410716(%arg0: memref<2048x2048xf32> loc(unknown), %arg1: memref<2048x2048xf32> loc(unknown), %arg2: memref<2048x2048xf32> loc(unknown)) {
%c16 = constant 16 : index loc(unknown)
%c0 = constant 0 : index loc(unknown)
%c2048 = constant 2048 : index loc(unknown)
%c1 = constant 1 : index loc(unknown)
%cst = constant 0.000000e+00 : f32 loc(unknown)
%0 = "gpu.thread_id"() {dimension = "x"} : () -> index loc(unknown)
%1 = "gpu.thread_id"() {dimension = "y"} : () -> index loc(unknown)
%2 = "gpu.block_id"() {dimension = "x"} : () -> index loc(unknown)
%3 = "gpu.block_id"() {dimension = "y"} : () -> index loc(unknown)
%4 = scf.for %arg3 = %c0 to %c2048 step %c1 iter_args(%arg4 = %cst) -> (f32) {
%11 = muli %3, %c16 : index loc(unknown)
%12 = addi %1, %11 : index loc(unknown)
%13 = memref.load %arg0[%12, %arg3] : memref<2048x2048xf32> loc(unknown)
%14 = muli %2, %c16 : index loc(unknown)
%15 = addi %0, %14 : index loc(unknown)
%16 = memref.load %arg1[%arg3, %15] : memref<2048x2048xf32> loc(unknown)
%17 = mulf %13, %16 {RelaxedPrecision} : f32 loc(unknown)
%18 = addf %arg4, %17 {RelaxedPrecision} : f32 loc(unknown)
scf.yield %18 : f32 loc(unknown)
} loc(unknown)
%5 = muli %3, %c16 : index loc(unknown)
%6 = addi %1, %5 : index loc(unknown)
%7 = muli %2, %c16 : index loc(unknown)
%8 = addi %0, %7 : index loc(unknown)
%9 = memref.load %arg2[%6, %8] : memref<2048x2048xf32> loc(unknown)
%10 = addf %9, %4 {RelaxedPrecision} : f32 loc(unknown)
memref.store %10, %arg2[%6, %8] : memref<2048x2048xf32> loc(unknown)
return loc(unknown)
} loc(unknown)
} loc(unknown)


generates the following cpp file


#if defined(__HIP_PLATFORM_AMD__)
#include <hip/hip_runtime.h>
using vfloatx2_t = float __attribute__((ext_vector_type(2)));
using vfloatx4_t = float __attribute__((ext_vector_type(4)));
using vfloatx16_t = float __attribute__((ext_vector_type(16)));
#else
#include "cuda_fp16.h"
#endif // !defined(__HIP_PLATFORM_AMD__)

#include <math.h>
#include <stdint.h>

__global__ void gemm_naive_14479263422999410716(float (*arg0)[2048], float (*arg1)[2048], float (*arg2)[2048])
{
/*%0 = "gpu.thread_id"() {dimension = "x"} : () -> index*/
const uint threadIdx_x_0 = threadIdx.x;
/*%1 = "gpu.thread_id"() {dimension = "y"} : () -> index*/
const uint threadIdx_y_1 = threadIdx.y;
/*%2 = "gpu.block_id"() {dimension = "x"} : () -> index*/
const uint blockIdx_x_2 = blockIdx.x;
/*%3 = ...

- Merged PR 2376: [build] Install acc-lsp-server as an internal tool.
[Lisa Ong]

Removes acc-lsp-server from accera-compilers

Minor CMake macro renames to (hopefully) improve usability
- Merged PR 2378: [doc] Update doc links after DSL changes, fix missing
file warnings. [Lisa Ong]

Verified by:


cd <accera_root>
pip install mkdocs-material mkdocs-git-revision-date-plugin
mkdocs serve

- Merged PR 2377: Retire Benchmark.py, use hatlib for benchmarking and
shared library creation. [Lisa Ong]

This cleanup work precedes the actual work to produce static or dynamic libraries by migrating existing HAT Python scripts to consume hatlib. Next PRs will consume hatlib to produce those libraries.

hatlib defines a HAT package as .hat files and a library.

* Remove accera.tuning.AutoBenchmark and replace usages with hat.run_benchmark in case studies
* Removed accera.tuning.CorrectnessCheck. Baked correctness checking into accera.test.verifiers
* Disabled some tests in preparation for coming work (next PRs)
* parallelization tests: need to specify lomp as a link target dependency in the HAT file, and update hatlib to honor this flag
* emit_unpacked_buffer_tests: to resolve multi-MLIR-module scenario where we have a globals module in addition to the package module

Depends on this PR: https://github.com/microsoft/hat/pull/15

Related work items: 3556
- Merged PR 2374: Retain and honor the order of functions added to the
package. [Kern Handa]

Retain and honor the order of functions added to the package

Related work items: 3629
- Merged PR 2371: add lsp server for accera. [Abdul Dakkak]

this adds an LSP server to be used with the MLIR VS Code extension https://marketplace.visualstudio.com/items?itemName=llvm-vs-code-extensions.vscode-mlir . You will have to specify the LSP server in your settings.json; on my system this means adding the following setting:


"mlir.server_path": "${workspaceFolder}/build/accera/acc-lsp-server/acc-lsp-server",


It's not super robust though
- Merged PR 2372: reduce install size. For example, on Linux the install
size goes from 873M to 742M. [Abdul Dakkak]

reduce install size. For example, on Linux the install dir goes from 873M to 742M. More can be done along those lines
- Merged PR 2369: run clang-format on acc_translate. [Abdul Dakkak]

run clang-format on acc_translate. There are no modifications to the code
- Merged PR 2367: Selectively emit GPU utilities. [Kern Handa]

Selectively emit GPU utilities

Related work items: 3559
- Merged PR 2366: [build] Fix manylinux package build. [Lisa Ong]

Apply updated requirements.txt without rebuilding docker image
- Merged PR 2365: Unify Package.add_function and Package.add_functions
into Package.add. [Kern Handa]

Related work items: 3549
- Merged PR 2363: Initial quickstart example in the main README. [Lisa
Ong]

* The quickstart example demonstrates how to do everything (including calling the function) from Python
* hatlib is now a runtime dependency as a result.
* We should consider updating at least the HelloMatMul Tutorials to also cover how to call functions from Python for quick testing. Calling from C++ is still the mainline scenario for performance

Dependent PR: https://github.com/microsoft/hat/pull/11

Related work items: 3630
- Merged PR 2364: Rename action plan references to plan. [Kern Handa]

Related work items: 3563
- Merged PR 2362: [hygiene] Move manylinux pipeline triggers from
classic to YAML. [Lisa Ong]

For maintainability, so that the triggers for that pipeline are in one place
- Merged PR 2350: LLVM update to 13.0.0. [Lisa Ong]

Updated LLVM to the "llvmorg-13.0.0" tag

Related work items: 3618
- Merged PR 2361: Build release versions of binaries for packaging
purposes, workaround auditwheel compression bug. [Lisa Ong]

* We currently build RelWithDebInfo instead of Release. This can result in packages that are too big to be uploaded to PyPI. A quick fix is to enable Release builds when invoked by the CI pipelines.
* Add triggering by tags for all pipelines that produce packages intended for PyPI (Windows, ManyLinux, macOS)
* Add pipeline to automate creating an LLVM build environment for the ManyLinux pipeline
* Revert to a last known good version of auditwheel (5.0.0) due to a compression bug (https://github.com/pypa/auditwheel/issues/366)
- [doc] tweaking public links. [Lisa Ong]
- Merged PR 2352: Reference github URLs for links in README.md. [Lisa
Ong]

README.md is referenced in PyPI, so these need to be fully-qualified URLs.

(The links will not work until the repo is published)

Related work items: 3619
- Merged PR 2360: Fix divide-by-0 crash when the active block exceeds
the vectorizable size. [Mason Remy]

Fix divide-by-0 crash when the active block exceeds the vectorizable
size in the innermost dimension
- Add smoke test for this case. [Mason Remy]
- Fix divide-by-0 crash when the active block exceeds the vectorizable
size in the innermost dimension. [Mason Remy]
- Squashed commit of the following: [Lisa Ong]

commit add8396adc6e0f4e3cf0ae89796d08ac416c00a4
- Tweaked dark mode for better contrast, added favicon, improved
navigation. [Lisa Ong]
- Merged PR 2359: [docs] Fix rendering issues with code blocks and
bullet points. [Lisa Ong]

Also added sticky nav and tabs
- Minor typos in docs (4) [Lisa Ong]

* Update 00 Introduction.md

* Update mkdocs.yml

* Update Installing_accera_on_MacOS.md

* Update Installing_accera_on_Ubuntu.md

* Update Installing_on_MacOS.md

* Update Installing_on_Ubuntu.md

* Update Optimized_MatMul.md

* Update Hello_MatMul.md

* Update Cross_Compilation_PI3.md
- Add copyright. [Lisa Ong]
- Use mkdocs for documentation (3) [Lisa Ong, Lisa Ong]

* mkdocs integration

* add publishing workflow

* doc the doc
- Backport doc fixes from gh-pages to main (https://github.com/microsoft/Accera/commit/ff491e3401691124b2aa6c3ee1d317bf264bdc11) [Lisa Ong]
- Merged PR 2345: Infer number of threads from the parallelization
indices. [Lisa Ong]

The number of threads was previously set to Target.num_threads.

This change treats Target.num_threads as a capacity setting, and infers the number of threads from an aggregate of:
* the number of unsplit indices
* the number of split blocks for each outermost index

This gives the user control over how many threads to request.

Examples:
* indices = i, j, k : 3 threads, one per index. Reason is that it doesn't make sense to just use 1 thread. For the future, we may want to add an explicit parameter to control the number of threads for this case
* indices = i, where ii = i.split(N//4): N//4 threads. We could have used ceiling(N/4), but due to loop unswitching, we don't directly apply the extra thread to the boundary loop. (future work?)
* indices = i, j, where ii = i.split(N//4): N//4 + 1 threads.

Implementation detail: if workshare loop collapsing happens because the indices are contiguous, the number of threads assigned is unaffected.

Related work items: 3554
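The counting rules above can be modeled in a few lines of plain Python. This is a sketch of the arithmetic only, not Accera's implementation, and the entry encodings are invented for illustration; it assumes a split index contributes its number of main-loop blocks (the boundary loop, per the note above, does not get its own thread):

```python
def inferred_num_threads(parallelized):
    """Model of the thread-count inference described above.

    `parallelized` lists one entry per parallelized index:
      - ("unsplit",)            -> contributes 1 thread
      - ("split", extent, size) -> contributes extent // size main-loop
                                   blocks; any boundary block is folded
                                   into the last thread, not given its own
    """
    total = 0
    for entry in parallelized:
        if entry[0] == "unsplit":
            total += 1
        else:
            _, extent, size = entry
            total += extent // size
    return total

# The three examples from the PR description, with N = 256, split size 4:
print(inferred_num_threads([("unsplit",)] * 3))                 # i, j, k
print(inferred_num_threads([("split", 256, 4)]))                # i split
print(inferred_num_threads([("split", 256, 4), ("unsplit",)]))  # i split, j
```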

1.2.0

------------
- Merged PR 2349: Add missing steps to CMake build instructions. [Lisa
Ong]
- Merged PR 2347: Add pip install for linux. [Lisa Ong]

Linux packages can now be pip installed directly

Some cosmetic edits to install instructions
- Merged PR 2326: Update install docs for Visual Studio 2022. [JUBI
TANEJA]

Related work items: 3605
- Nit. [Jubi Taneja]
- Merge branch 'main' of vs-
ssh.visualstudio.com:v3/intelligentDevices/ELL/Accera into
dev/jubitaneja/VS2022-install-docs. [Jubi Taneja]
- Edits. [Jubi Taneja]
- Update install docs for Visual Studio 2022. [Jubi Taneja]
- Merged PR 2346: Canary workflows for building with latest LLVM
release. [Lisa Ong]

This pipeline is part of a two-stage workflow.

Stage 1:
* Weekly docker image build that pulls the latest tagged official release of LLVM and rebuilds the image.
* Currently lives in: https://github.com/lisaong/accera-llvm-canary but can be moved to a more permanent location once this pipeline is stable.
* Github actions are used here for convenience (longer timeouts, better integration). In the future we can move to Azure DevOps if similar functionality is available.

Stage 2: (this PR)
* Weekly canary build that consumes the latest docker image produced in stage 1. This is on a weekly schedule because triggering on container pushes is not yet supported by ADO.

When a new release of LLVM is published:
* Stage 1's weekly build will fail because the port SHA will change. This is ok because we want manual intervention to update the LLVM vcpkg portfile to update the patches, etc.
* Stage 2's weekly build may also fail. This is where we would stage changes in an Accera branch to support the new LLVM release.

As of this PR, LLVM 13.0.1 is being pre-released. TODO: test out the workflow with the upcoming pre-release.

Related work items: 3616
- Merged PR 2340: Support Max element / budget caching for manual
caches. [Mason Remy]

Support Max element / budget caching for manual caches

Max element / budget caching previously only worked for automatic
caches, however the hierarchical caching change made automatic caches
harder to request from the DSL. This change enables max element caches
for manual caches by iteratively searching for the level at which a
cache should be placed due to the budget.

Notes:
- Currently if the budget is 0, that is treated as though the budget is
1, however maybe we want this to be an error case
- Different boundary condition sections of the loopnest may end up with
differently sized caches because of how the budget computation works.
For example, if caching around a main loop would exceed the budget but
caching around the boundary loop would not, then the same cache would
exist inside the main loop and outside the boundary loop.

Related work items: 3615
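The iterative level search described above can be sketched in plain Python. This models only the idea (find the outermost loopnest level whose active-block footprint fits the budget); it is not Accera's code, and the footprint numbers used below are invented:

```python
def cache_level_for_budget(footprints, budget):
    """Sketch of a budget-driven cache-level search.

    `footprints[level]` is the element count of the array's active block
    if the cache were placed at that level, ordered from outermost (0) to
    innermost.  Returns the outermost level whose footprint fits within
    `budget`.  A budget of 0 is treated as 1, matching the note above
    (later made an error in a follow-up commit).
    """
    budget = max(budget, 1)
    for level, footprint in enumerate(footprints):
        if footprint <= budget:
            return level
    return len(footprints) - 1  # fall back to the innermost level

# With per-level footprints 4096, 1024, 64, 4 and a budget of 512
# elements, the cache lands at level 2:
print(cache_level_for_budget([4096, 1024, 64, 4], 512))
```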
- Merge branch 'main' into review/masonr/max_element_caching. [Mason
Remy]
- Merged PR 2342: Add logic check for target compat; Debug mode makes
use of func target. [Kern Handa]

Add logic check for target compat; Debug mode makes use of func target

This change adds the concept of Target compatibility so that functions
that are for the same target but have different settings can be added
freely. This is particularly helpful when adding Debug mode checks for a
function, as the Debug mode function naturally is going to be a subset
of the original function's target.

This change also introduces the concept of the maximum for a number of
Target properties, which is used to test whether one target is
compatible with another.

Another related change is that Debug mode now makes use of the target of
the function being checked. This might need to be further addressed to
figure out the correct way to debug GPU or remote targets
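The compatibility idea can be sketched as a simple property check: a candidate function target (such as a Debug-mode variant) is compatible with a host target when each numeric property it requires stays within the host's stated maximum. This is a model of the concept only; the property names below are illustrative, not Accera's actual Target fields:

```python
def is_compatible(host, candidate):
    """Return True if every numeric property the candidate target
    requires is within the host target's maximum for that property.
    Properties the host does not declare default to 0 (incompatible).
    """
    for prop, required in candidate.items():
        if required > host.get(prop, 0):
            return False
    return True

host = {"num_threads": 16, "vector_bytes": 32}
print(is_compatible(host, {"num_threads": 4}))    # a subset: compatible
print(is_compatible(host, {"vector_bytes": 64}))  # exceeds the maximum
```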
- Add logic check for target compat; Debug mode makes use of func
target. [Kern Handa]
- Merged PR 2343: Switch to PNGs for logo assets. [Lisa Ong]

This allows the images to render more reliably in preview mode
- Merged PR 2344: Add libvulkan to Manylinux builds. [Lisa Ong]

Set LD_LIBRARY_PATH so that auditwheel can find the dependency.

Assumes that target Linux system will have the lib preinstalled per install instructions.

Update docker image used by pipeline.

Related work items: 3529
- Merged PR 2339: Add logo and badges to README.md, licenses to whls.
[Lisa Ong]

Further adjustments deferred until repo is made public
- Fix c++ dsl test vectorize invocations. [Mason Remy]
- Make budget = 0 an error. [Mason Remy]
- Taking PR feedback. [Mason Remy]
- Fix C++ DSL test failures by making C++ DSL always create manual
active block caches. [Mason Remy]
- Support Max element / budget caching for manual caches. [Mason Remy]
- Merged PR 2329: Define CPU targets in Accera. [JUBI TANEJA]

- Intel Core processors and Intel Xeon
- documentation and definitions in python bindings

Related work items: 3571
- Fix dsl_tests and other references of intel core processor. [Jubi
Taneja]
- Targets definition. [Jubi Taneja]
- More details of targets. [Jubi Taneja]
- Edits. [Jubi Taneja]
- Edits. [Jubi Taneja]
- More details on extensions. [Jubi Taneja]
- Add target details. [Jubi Taneja]
- Merged PR 2338: Check for presence of libomp in macOS and Linux. [Lisa
Ong]

Only apply the linkage to libomp if present in the target system.

This fixes the manylinux pipeline smoke test failure, which fails because the manylinux system does not have a compatible libomp installed.
- Merged PR 2337: Build manylinux wheels for PyPI uploads. [Lisa Ong]

* Add an Azure Pipeline that builds and uploads packages based on manylinux2014. This uses a container that contains accera-llvm and other build dependencies pre-installed
* Link to the system libomp at target accc time (when libomp is not present in the manylinux2014 build system, but may be present in an Ubuntu target system, for example)
* manylinux2014 is what onnxruntime uses as well. manylinux_2_24 is available but not as widely used, as far as I can tell (punt for future work).

Misc fixes:
* Missing copyright blurbs
* Updated accera/python/README.md
* Drop `clean --all` from build.sh/build.bat so that full rebuilds are not the default
- Merged PR 2336: value::Abs now supports non-fp types, fixes non-fp
Debug mode. [Kern Handa (KERN)]

value::Abs now supports non-fp types, fixes non-fp Debug mode
- Merged PR 2333: Initialize vcpkg in the SDL pipelines. [Lisa Ong]

Missed these changes from the previous PR, now that vcpkg and packages need to be installed.
- Merged PR 2335: Ignore PyPI when installing local wheels in the CI
pipelines. [Lisa Ong]

Update the CI pipelines to ignore PyPI when installing accera wheels.
- Merged PR 2334: Add support for hierarchical caching. [Mason Remy]

Add support for hierarchical caching

This adds support for creating an active block cache of an existing
active block cache (note that hierarchical caching for automatic caches
was already supported, however the cache itself was not used as an
argument to the cache call in that scenario).

This change includes:
- Moving cache access maps and arrays of loopnest index attributes onto
the MakeCacheOps
- Adding helpers to MakeCacheOp to construct access maps for the caches
given a position in the loopnest
- Remove redundant access map computation in active block cache copy and
reduce
- Support for hierarchical caches that are parameterized
- Implicitly hides automatic caches by assuming a layout on a cache call
which doesn't have a layout provided. Since any cache call with a
layout becomes an active block cache, this turns all cache calls into
active block caches. In a later PR we could add an undocumented flag
to enable users to request an automatic cache if we want to, however
long-term automatic caches should be removed completely.
- Fix cache merging bug where output caches with a boundary condition on
the cache level weren't constructing a union of the different loop
branches when computing the active block
- Fix multi-cache merging bug where a boundary condition on a loop
between the trigger level and the cache level and that loop IV is used
to index into the cache was resulting in the caches being merged.
Instead, these caches should not be merged since they are accessing
different regions. An unfortunate side-effect of this fix is that some
multi-caches which have a boundary condition between the trigger level
and the cache level where the boundary loop IV is not used to index
into the cache won't be successfully merged. This isn't technically
wrong as we are still copying data the number of times requested based
on the multicache definition, however it is a potential missed
optimization opportunity in this edge case.
- Disables max_element caching as this was only supported for automatic
caches. A later PR will support this for active block caches

Related work items: 3453
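The cache-of-a-cache idea can be modeled in plain Python with nested sub-block copies: an outer cache holds an active block of the array, and an inner cache holds an active block of that cache. This is a toy model, not the Accera API (in the DSL, the handle returned by one cache call would be passed to a second cache call), and the block sizes are invented; real Accera derives them from the schedule:

```python
def sub_block(mat, r0, r1, c0, c1):
    """Copy rows r0:r1, cols c0:c1 of a list-of-lists matrix."""
    return [row[c0:c1] for row in mat[r0:r1]]

# An 8x8 matrix with element (r, c) = r * 8 + c
A = [[r * 8 + c for c in range(8)] for r in range(8)]

outer_cache = sub_block(A, 0, 4, 0, 4)            # active block of A
inner_cache = sub_block(outer_cache, 0, 2, 0, 2)  # active block of the cache

# The inner cache holds the same data as the corresponding block of A
print(inner_cache)
```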
- Add support for hierarchical caching. [Mason Remy]
- Merged PR 2325: Support building LLVM via vcpkg (no remote caching)
[Lisa Ong]

This change adds vcpkg support for external developers to build their own copy of LLVM based on public github sources.

Due to complexities of hosting large Nuget packages, the vcpkg built LLVM is local-only. We're still using Conan for LLVM for internal use.

* Added vcpkg as a submodule
* Migrated tomlplusplus and catch2 to vcpkg. pybind11 is untouched because it uses CMake FetchContent (the simplest and most direct method)
* Added top level build scripts for generating the Python packages

* Added support for installing LLVM via vcpkg. This is opted-in by setting the environment variable `LLVM_SETUP_VARIANT` or by passing in `-DLLVM_SETUP_VARIANT` during configuration:
* `LLVM_SETUP_VARIANT=Conan` will use Conan to acquire pre-built LLVM bits (internal use only)
* If unset, default behavior is to use vcpkg to build and install LLVM bits

* Whenever we update LLVM, we need to
* Build and upload the internal packages [as before]
* Update the vcpkg port by revising the Git hash and applying any patches. [new]

Related work items: 3611
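Based on the description above, the opt-in might look like the following. These are hypothetical invocations assuming a standard CMake configure step; the exact script names and arguments in the repo may differ:

```shell
# Default: build and install LLVM locally from the vcpkg port
cmake -S . -B build

# Internal use: acquire pre-built LLVM bits via Conan instead
cmake -S . -B build -DLLVM_SETUP_VARIANT=Conan

# The same switch can be set via the environment variable
LLVM_SETUP_VARIANT=Conan cmake -S . -B build
```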
- Merged PR 2331: Add Kernel::GetIndices and wire it up. [Kern Handa
(KERN)]

Add Kernel::GetIndices and wire it up
- Merged PR 2332: LoopNestBuilder minor code fixes. [Kern Handa (KERN)]

LoopNestBuilder minor code fixes

Related work items: 3602
- Merged PR 2330: Rename main pass to acc-to-llvm. [Kern Handa (KERN)]

Rename main pass to acc-to-llvm
- Merged PR 2328: Make git ignore .vscode symlinks as well. [Kern Handa
(KERN)]

Make git ignore .vscode symlinks as well
- Merged PR 2327: Updated pip install instructions to official PyPI
repositories (windows, macOS) [Lisa Ong]

Linux instructions will be updated once the manylinux distribution packages are ready and uploaded.
- Merged PR 2324: bugfix in benchmark HAT package while generating
main.cpp. [JUBI TANEJA]

bugfix in benchmark HAT package while generating main.cpp to include correct .hat files

Related work items: 3613
- Merged PR 2322: Port dev/byronc/address_sdl_timeouts to Accera repo.
[Lisa Ong]

Add BinSkim tool to SDL pipeline runs. Split the original SDL pipeline into 3 stages to avoid timeouts in ADO.

Add build flags recommended by BinSkim.

Original PR: !2303

Related work items: 3599
- Merged PR 2323: Re-enabling code signing for Windows distributions.
[Lisa Ong]

Disabled for Linux and macOS pending future support
- Merged PR 2320: Split accera python wheels to within 100MB. [Lisa Ong]

* Enable splitting into component packages when building for the Packaging CI pipelines
* Support development mode for top level setup.py to place everything in build/lib.*
* Import and shared library paths are unchanged. Only moved executables into the accera/bin folder
* Currently only the accera-llvm and accera-compilers packages are required dependencies for the accera package

[spec](https://microsoft.sharepoint.com/:w:/t/AICompiler/EWYtk8DCuQNHk0yGZ4071f0BWM0lxtw2Rd1FlACmsU-b-g?e=hkDHec)

Azure artifacts feed: https://intelligentdevices.visualstudio.com/ELL/_packaging?_a=feed&feed=Accera

Manually seeded Windows and macOS packages on PyPI (linux is pending 3529):
* https://pypi.org/project/accera/
* https://pypi.org/project/accera-compilers/
* https://pypi.org/project/accera-gpu/
* https://pypi.org/project/accera-llvm/

Related work items: 3577
- Merged PR 2319: [forward port] [build] Fix py3.7 test regression,
applied workaround for Azure Pipelines caching infra issue. [Lisa Ong]

b005dad65a4a36a52dd627af5e38b803bb8102e1
- Merged PR 2318: [forward port] Fix a few typos and nits in
installation guide for Windows. [Lisa Ong]

be37216a165ed72cb51636cf88bb9f40dbb8f9cc
- Merged PR 2317: [nfc] Add licensing information to all source files.
[Lisa Ong]

* Added MIT license blob
* One-liner blurbs are mostly empty, except for a handful that are already commented at the top of the file.
* Authors are grandfathered (existing ones maintained, no new ones added)

Related work items: 3579
- Merged PR 2316: Migrating from old repo to new repo. [Lisa Ong]

Old repo commit id: d61bd7d31e2febc45321da72ed18278e27dbe4cb

Related work items: 3608
- SUPPORT.md committed. [Microsoft Open Source]
- SECURITY.md committed. [Microsoft Open Source]
- LICENSE committed. [Microsoft Open Source]
- README.md committed. [Microsoft Open Source]
- CODE_OF_CONDUCT.md committed. [Microsoft Open Source]
- Initial commit. [microsoft-github-operations[bot]]
