Accera

Latest version: v1.2.29


1.2.23

What's Changed
------------
- Merged PR 3131: Set masked load/store inbounds flag to true. [Mason
Remy]

Set masked load/store inbounds flag to true

The mask we generate, as well as the rest of our infrastructure, will
prevent out-of-bounds accesses when used properly. Therefore, for
performance reasons, we don't want MLIR to generate runtime bounds
checks
- Merged PR 3130: Recognize and simplify always true EQ and NE CmpOps.
[Mason Remy]

Recognize and simplify always true EQ and NE CmpOps

These would already get simplified after converting to the builtin
dialects, but this makes the simplification happen earlier in the lowering
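
The PR text does not list which comparisons are recognized. As a rough illustration of the idea only, the plain-Python sketch below folds two cases that are provably always true: an EQ whose operands are the same value, and an NE between two different constants. The operand encoding and the name `fold_always_true` are hypothetical and are not Accera or MLIR APIs.

```python
from typing import Optional, Union

# Hypothetical operand encoding: an int is a compile-time constant,
# a str is the name of an SSA value whose runtime value is unknown.
Operand = Union[int, str]


def fold_always_true(predicate: str, lhs: Operand, rhs: Operand) -> Optional[bool]:
    """Return True if the comparison is provably always true, else None."""
    if predicate == "eq" and lhs == rhs:
        # Same SSA value on both sides, or two equal constants.
        return True
    if (predicate == "ne" and isinstance(lhs, int)
            and isinstance(rhs, int) and lhs != rhs):
        # Two different constants can never compare equal.
        return True
    return None  # not provably always true; leave the op in place


print(fold_always_true("eq", "%i", "%i"))  # same value on both sides
print(fold_always_true("ne", "%i", "%j"))  # unknown values, not folded
```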
- Merged PR 3129: Optimize 1-row horizontal i16->i32 sum reduction.
[Mason Remy]

Optimize 1-row horizontal i16->i32 sum reduction
- Merged PR 3118: vectorize accumulation of results of two masked load
ops. [JUBI TANEJA]

This PR vectorizes a pattern that occurs in MMIF where there are two conditional loads, followed by an accumulation operation, and a conditional store. On vectorizing the following DSL:

N_input = 8
N_output = 5
Input = Array(role=Role.INPUT, element_type=ScalarType.int32, shape=(N_input, ))
Output = Array(role=Role.INPUT_OUTPUT, element_type=ScalarType.int32, shape=(N_output, ))
nest = Nest(shape=(N_input, ))
i, = nest.get_indices()

@nest.iteration_logic
def _nest():

    def store_value():
        Output[i] += Input[i]

    _If(i < N_output, store_value)

Vectorizing this nest produces the following assembly. The `vpmaskmovd` instructions are the ones to look for: they correspond to the vector.transfer_read/vector.transfer_write ops in MLIR.

0000000000000030 <test_vectorized_masked_accumulate_3e5de44f3dcca64e>:
  30:   c5 fd 6f 05 00 00 00    vmovdqa 0x0(%rip),%ymm0        # 38 <test_vectorized_masked_accumulate_3e5de44f3dcca64e+0x8>
  37:   00
  38:   c4 e2 7d 8c 0e          vpmaskmovd (%rsi),%ymm0,%ymm1
  3d:   c4 e2 7d 8c 17          vpmaskmovd (%rdi),%ymm0,%ymm2
  42:   c5 ed fe c9             vpaddd %ymm1,%ymm2,%ymm1
  46:   c4 e2 7d 8e 0e          vpmaskmovd %ymm1,%ymm0,(%rsi)
  4b:   c5 f8 77                vzeroupper
  4e:   c3                      retq
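
To make the data flow of that instruction sequence easier to follow, here is a plain-Python emulation of it: two masked loads, a lane-wise add, and a masked store. It is a sketch of the semantics only, not Accera code; the helper names `masked_load` and `masked_store` and the sample values are made up for illustration.

```python
def masked_load(mem, mask, fill=0):
    """Read mem[lane] only where the mask is set; other lanes get `fill`."""
    return [mem[lane] if on else fill for lane, on in enumerate(mask)]


def masked_store(mem, mask, values):
    """Write values[lane] to mem[lane] only where the mask is set."""
    for lane, on in enumerate(mask):
        if on:
            mem[lane] = values[lane]


N_input, N_output = 8, 5
input_buf = list(range(10, 10 + N_input))   # 8 elements
output_buf = [1] * N_output                 # only 5 elements

# One 8-lane mask, true for the lanes where i < N_output.
mask = [lane < N_output for lane in range(N_input)]

acc = masked_load(output_buf, mask)         # never touches output_buf[5..7]
inp = masked_load(input_buf, mask)
masked_store(output_buf, mask, [a + b for a, b in zip(acc, inp)])

print(output_buf)
```

Because the mask is false for lanes 5 to 7, the 5-element output buffer is never read or written out of bounds even though the vector is 8 lanes wide.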

- Merged PR 3126: [test] Adds more tests for vectorized transpose. [Kern
Handa]

[test] Adds more tests for vectorized transpose
- Merged PR 3121: [nfc] Separate bounds checking into separate pass
file. [Mason Remy]

[nfc] Separate bounds checking into separate pass file

This removes the bounds checking code from
ExecutionPlanToAffineLoweringPass and creates a separate pass file for
it. There is no change in when and where the checking occurs (currently
it only happens for caching-generated loads and stores).

In a future change we will further separate the pass and run it at a
different phase of the lowering and plumb controls for
enabling/disabling it to the DSL
- Merged PR 3122: Fix reinterpret_cast output memref shape. [Mason Remy]

Fix reinterpret_cast output memref shape
- Merged PR 3115: Normalize AffineForOps to have unit stride and begin
at 0. [Mason Remy]

Normalize AffineForOps to have unit stride and begin at 0
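
As a rough illustration of what this normalization means, the plain-Python sketch below rewrites a loop with an arbitrary lower bound and a positive constant step into a loop that starts at 0 with step 1, recovering the original index from the new one. It is not the MLIR pass itself; the function names are hypothetical and a positive step is assumed.

```python
def original_indices(lb, ub, step):
    """Indices visited by the original loop: for i = lb to ub step `step`."""
    return list(range(lb, ub, step))


def normalized_indices(lb, ub, step):
    """Same indices, produced by a loop that starts at 0 with unit stride."""
    assert step > 0, "this sketch assumes a positive constant step"
    trip_count = max(0, -(-(ub - lb) // step))   # ceil((ub - lb) / step)
    # The normalized induction variable k runs 0, 1, ..., trip_count - 1;
    # every use of the old index is replaced by lb + k * step.
    return [lb + k * step for k in range(trip_count)]


print(normalized_indices(3, 20, 4))
```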
- Merged PR 3117: Vectorize horizontal multi-dim sum reductions. [Mason
Remy]

Vectorize horizontal multi-dim sum reductions

Recognizes and vectorizes these sum reductions:
4x16xi16 -> 4x1xi32
4x8xi32 -> 4x1xi32
4x8xf32 -> 4x1xf32
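
For reference, the shape of these reductions can be sketched in plain Python. Each row is reduced to one element by repeatedly adding adjacent lanes; the pairwise order is an assumption made for illustration, not a statement about the instructions Accera emits. Python integers do not overflow, so the i16 -> i32 widening is implicit here; `horizontal_sum_rows` is a hypothetical name.

```python
def horizontal_sum_rows(tile):
    """Reduce each row of `tile` to a single element by pairwise addition."""
    result = []
    for row in tile:
        lanes = list(row)
        n = len(lanes)
        assert n > 0 and (n & (n - 1)) == 0, "row length must be a power of two"
        while len(lanes) > 1:
            # Add adjacent pairs, halving the number of live lanes each step.
            lanes = [lanes[k] + lanes[k + 1] for k in range(0, len(lanes), 2)]
        result.append([lanes[0]])
    return result


# 4x16 tile of i16-range values reduced to 4x1. Each row sum (480000) does
# not fit in i16, which is why the result element type is i32.
tile_i16 = [[30000] * 16 for _ in range(4)]
sums_i32 = horizontal_sum_rows(tile_i16)
print(sums_i32)
```

For integer inputs the order of additions does not change the result; for f32 inputs it can change the rounding.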
- Merged PR 3099: Adds pattern rewriting for AVX2 vectorized transpose.
[Kern Handa]


**Full Changelog**: https://github.com/microsoft/Accera/compare/v1.2.22...v1.2.23

1.2.22

What's Changed
------------
- Merged PR 3107: Make vectorization happen after inlining and
simplification. [Mason Remy]

Make vectorization happen after inlining and simplification

This change fills out the vectorization passes and removes vectorization
from LoopNestToValueFunc. Some bugs were exposed that this also fixes.

Since vectorization is now a separate pass, mlir filecheck lit tests can
be run more easily. This change adds the initial file with one test, but
we should continue expanding this test suite
- Merged PR 3108: extend vectorization for masked store case. [JUBI
TANEJA]
- Merged PR 3109: Set conan version < 2.0.0. [Mason Remy]

Our infra isn't set up for the new conan 2 behavior, so fix our usage to
version 1 until we take the upgrade intentionally
- Merged PR 3104: Position fusing dim after the fused dimensions.
[Captain Jack Sparrow]

Position fusing dim after the fused dimensions
- Merged PR 3096: Add "RelWithDebInfo"-like option to accc. [Chuck
Jacobs]

This PR adds another option to the `Options` flag for `AcceraProject.generate_and_emit` to keep some debug info (the frame pointers) around when building the Accera project. This can be helpful when trying to interpret perf profiler output.

**Full Changelog**: https://github.com/microsoft/Accera/compare/v1.2.21...v1.2.22

1.2.21

What's Changed
------------
- Merged PR 3101: [build] install pkg-config for macos buddy builds.
[Lisa Ong]

Fixes macos packaging build failure:

https://intelligentdevices.visualstudio.com/ELL/_build/results?buildId=47235&view=results
- Merged PR 3098: [nfc] Move vectorization code to separate files.
[Mason Remy]

[nfc] Move vectorization code to separate files

Moves vectorization code out of ExecutionPlanToAffineLoweringPass in
preparation for separating out a vectorization pass that can run
later in the lowering than vectorization currently does
- Merged PR 3100: Adds CMake dependencies to acc-translate to ensure
correct build. [Kern Handa]

Adds CMake dependencies to acc-translate to ensure correct build
- Merged PR 3095: Remove duplicate SubArray class. [Mason Remy]

Remove duplicate SubArray class
- Merged PR 3073: vectorize masked load store. [JUBI TANEJA]

This PR handles vectorization specifically for a masked buffer fill, where the output size is larger than the input. There is a conditional load and vector store.

Given the nest:

@nest.iteration_logic
def _nest():
    def store_value():
        Output[i] = Input[i]
    def store_zero():
        Output[i] = 0
    _If(i < N_input, store_value).Else(store_zero)

The unoptimized MLIR is as follows:

%c0_i32 = arith.constant 0 : i32
%c5 = arith.constant 5 : index
"accv.lambda"() ({
  affine.for %arg2 = 0 to 8 {
    %0 = "accv.cmp"(%arg2, %c5) {predicate = 2 : i64} : (index, index) -> i1
    scf.if %0 {
      %1 = affine.load %arg0[%arg2] : memref<5xi32>
      affine.store %1, %arg1[%arg2] : memref<8xi32>
    } else {
      affine.store %c0_i32, %arg1[%arg2] : memref<8xi32>
    }
  }

On vectorizing this for loop, we get the vectorized MLIR (simplified version) as follows:

%c5 = arith.constant 5 : index
%cst = arith.constant dense<false> : vector<8xi1>
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c2 = arith.constant 2 : index
%c3 = arith.constant 3 : index
%c4 = arith.constant 4 : index
%c6 = arith.constant 6 : index
%c7 = arith.constant 7 : index
%c0_i32 = arith.constant 0 : i32
"accv.lambda"() ({
  affine.for %arg2 = 0 to 8 step 8 {

    %7 = "accv.cmp"(%arg2, %c5) {predicate = 2 : i64} : (index, index) -> i1
    %9 = "accv.cmp"(%0, %c5) {predicate = 2 : i64} : (index, index) -> i1
    %11 = "accv.cmp"(%1, %c5) {predicate = 2 : i64} : (index, index) -> i1
    %13 = "accv.cmp"(%2, %c5) {predicate = 2 : i64} : (index, index) -> i1
    %15 = "accv.cmp"(%3, %c5) {predicate = 2 : i64} : (index, index) -> i1
    %17 = "accv.cmp"(%4, %c5) {predicate = 2 : i64} : (index, index) -> i1
    %19 = "accv.cmp"(%5, %c5) {predicate = 2 : i64} : (index, index) -> i1
    %21 = "accv.cmp"(%6, %c5) {predicate = 2 : i64} : (index, index) -> i1

    %23 = memref.reinterpret_cast %arg0 to offset: [0], sizes: [5], strides: [1] : memref<5xi32> to memref<5xi32>
    %24 = vector.transfer_read %23[%arg2], %c0_i32, %22 : memref<5xi32>, vector<8xi32>

    %25 = memref.reinterpret_cast %arg1 to offset: [0], sizes: [8], strides: [1] : memref<8xi32> to memref<8xi32>
    vector.store %24, %25[%arg2] : memref<8xi32>, vector<8xi32>
  }
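
The equivalence between the scalar nest and the vectorized form can be checked with a small plain-Python model. This is a sketch of the semantics only, not Accera or MLIR code: `scalar_fill` mirrors the `_If(...).Else(...)` nest, and `vectorized_fill` mirrors one masked read with a padding value followed by an unconditional full-width store. Both function names are hypothetical.

```python
def scalar_fill(input_buf, output_len):
    """Element-by-element version: copy where i < N_input, otherwise store 0."""
    output = [None] * output_len
    for i in range(output_len):
        if i < len(input_buf):
            output[i] = input_buf[i]
        else:
            output[i] = 0
    return output


def vectorized_fill(input_buf, output_len, padding=0):
    """Vector version: one masked load with padding, then one plain store."""
    # The mask is true only for lanes that are in bounds for the input.
    mask = [lane < len(input_buf) for lane in range(output_len)]
    # Masked-off lanes take the padding value instead of reading memory.
    loaded = [input_buf[lane] if on else padding for lane, on in enumerate(mask)]
    # The store is unconditional: every output lane is written.
    return list(loaded)


input_buf = [1, 2, 3, 4, 5]
print(vectorized_fill(input_buf, 8))
```

Using the padding value of the masked read as the "else" value is what removes the branch: the zero store of the scalar version falls out of the load itself.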

- Merged PR 3093: Add meaningful error messages for c++ exceptions.
[Captain Jack Sparrow]

Add meaningful error messages for c++ exceptions
- Merged PR 3092: Add type size getter utility. [Captain Jack Sparrow]

Add type size getter utility
- Merged PR 3074: Add rudimentary pass to fix redundant load/store
issue. [Chuck Jacobs]

This PR adds a simple pattern to `ValueSimplifyPass` that looks for the redundant load/store pattern we often see at the end of kernels, and removes them.
- Merged PR 3075: Enable `fast_exp` operation. [Chuck Jacobs]

This PR makes a few changes to enable the `fast_exp` operation:
- Adds `fast_exp` to the python DSL
- Enables vectorization of `abs` instruction (which is used by `fast_exp`)

It also makes a couple of other minor changes:
- Improves auto-naming of nest indices
- Better support for using custom LLVM builds with Accera
- Merged PR 3088: Support dynamic sub_array shape, split_dim size.
[Mason Remy]

Support dynamic sub_array shape, split_dim size

This still requires that the sizes are static before lowering, but it
supports dynamic sizes temporarily before inlining into an outer static
function
- Merged PR 3078: Adds reinterpret_cast functionality to Array. [Kern
Handa]

Adds reinterpret_cast functionality to Array
- Merged PR 3070: Fixes for sub_array and _split_dimension. [Mason Remy]

Fixes for sub_array and _split_dimension

This fixes the sub array and split dim ops to work with the accera
codebase that has been updated around them. Some MemoryLayout assumptions
were getting in the way and have been disabled in the short term. Long
term, however, our memory layout behavior should more closely match what
MLIR affine maps can represent, for more generalized dynamic support
- Merged PR 3063: Refactor Dimension with C++ backend container class
and few other fixes. [Captain Jack Sparrow]

- Refactor Dimension with C++ backend container (ScalarDimension)
- Enable output scalar variables
- Fix dynamic sized TEMP arrays
- Merged PR 3072: Bump hatlib version to 0.0.34, skip unsupported test
on arm64 macOS, minor targets doc update. [Lisa Ong]

Update hatlib version since there is no incompatibility

**Full Changelog**: https://github.com/microsoft/Accera/compare/v1.2.20...v1.2.21

1.2.20

What's Changed
------------

- Merged PR 3070: Fixes for sub_array and _split_dimension [Mason Remy]

Fixes for sub_array and _split_dimension

This fixes the sub array and split dim ops to work with the accera
codebase that has been updated around them. Some MemoryLayout assumptions
were getting in the way and have been disabled in the short term. Long
term, however, our memory layout behavior should more closely match what
MLIR affine maps can represent, for more generalized dynamic support

- Merged PR 3063: Refactor Dimension with C++ backend container class and few other fixes [Captain Jack Sparrow]

- Refactor Dimension with C++ backend container (ScalarDimension)
- Enable output scalar variables
- Fix dynamic sized TEMP arrays

- Merged PR 3072: Bump hatlib version to 0.0.34, skip unsupported test on arm64 macOS, minor targets doc update [Lisa Ong]

Update hatlib version since there is no incompatibility

**Full Changelog**: https://github.com/microsoft/Accera/compare/v1.2.19...v1.2.20

1.2.19

What's Changed
------------
- Merged PR 3069: Set target device features on module and check when
matching avx2/512 ops. [Mason Remy]

Set target device features on module and check when matching avx2/512 ops
- Merged PR 3060: Adds support for sqrt op in acc-translate. [Kern
Handa]

**Full Changelog**: https://github.com/microsoft/Accera/compare/v1.2.18...v1.2.19

1.2.18

What's Changed
------------
- Merged PR 3055: Move value unrolling to after function inlining and
loop simplification. [Mason Remy]

Move value unrolling to after function inlining and loop simplification

This enables dynamically-sized inner functions that get inlined into
statically-sized regions to have loop unrolling affect their
actually-statically-sized loops when possible
- Merged PR 3053: Add package.build flags for building with higher-
precision FP vector ops. [Mason Remy]

Add package.build flags for building with higher-precision FP vector ops

Setting this new flag prevents a vmulps -> vaddps sequence
from being contracted into a vfmaddps
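
Why contraction matters for precision can be shown outside Accera. The sketch below is plain Python and does not use the new flag: it emulates float32 rounding with `struct` and compares a multiply followed by a separately rounded add against a fused multiply-add, which rounds only once. The helper name `f32` and the input values are chosen purely for illustration.

```python
import struct


def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


a = f32(1 + 2**-12)
b = a
c = f32(-(1 + 2**-11))

# Separate multiply and add: the product is rounded to float32 first,
# which drops its low-order 2**-24 term before the add happens.
unfused = f32(f32(a * b) + c)

# Fused multiply-add: the exact product feeds the add and only the final
# result is rounded. Here a * b + c is exact in float64, so rounding once
# to float32 models the fused result.
fused = f32(a * b + c)

print(unfused, fused)
```

The two results differ, so enabling or disabling contraction can change numerical output; which one is preferable depends on whether bit-for-bit agreement with an unfused reference matters more than speed.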
- Merged PR 3052: Place heap allocations at the top level of the
function. [Mason Remy]

Place heap allocations at the top level of the function
- Merged PR 3050: [non-func, API] Change Nest.get_shape() to always
return a list. [Captain Jack Sparrow]

Change Nest.get_shape() to always return a list
- Merged PR 3030: Include acc-translate whenever accera is installed.
[Lisa Ong]

Perhaps a longer-term fix is to merge the accera-gpu package into accera-compilers so we have one less package to maintain.

However, that adds constraints to the binary size of acc-opt (to not push us past the 100MB PyPI hard limit), so punting until we have cycles for this.
- Merged PR 3035: [nfc] Adds my machine to targets.py. [Kern Handa]

**Full Changelog**: https://github.com/microsoft/Accera/compare/v1.2.17...v1.2.18
