Accera

Latest version: v1.2.29


1.2.11

What's Changed
* Update vcpkg by AtariDreams in https://github.com/microsoft/Accera/pull/52

- Merged PR 2924: Update hatlib dependency in setup.cfg, add comment.
[Lisa Ong]
- Merged PR 2922: [Github] Update vcpkg. [Lisa Ong]

From commit c2177e64bdf05e7ea69a8d41f420fcdb42df49db
- Merged PR 2910: Updates hatlib dependency to 0.0.29. [Kern Handa]
- Merged PR 2905: Fix internal param name in GPU benchmarks. [Captain
Jack Sparrow]

Fix internal param name in GPU benchmarks
- Merged PR 2902: Increase ROCm baseline benchmark timeout to 10 hours.
[Captain Jack Sparrow]

- Increase ROCm baseline benchmark timeout to 10 hours
- Add category to the gemm input for classification
- Merged PR 2901: Increase ROCm baseline timeout to 7 hours. [Captain
Jack Sparrow]

Increase ROCm baseline timeout to 7 hours
- Merged PR 2900: Prune gemm benchmark input for big sizes by removing
NT and TT configs. [Captain Jack Sparrow]

- Prune gemm benchmark input for big sizes by removing NT and TT configs
- Disable verification for resnet sizes
- Fix baseline tagging for pytorch
- Merged PR 2896: Dynamic shared memory allocation support. [Captain
Jack Sparrow]

- Add optional param in plan.cache for memory offset
- Add optional param in schedule.create_plan for total dynamic memory size in bytes
- Update benchmarks to allow dynamic shared memory usage

Related work items: 3735
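The offset-based suballocation that these optional parameters enable can be illustrated in plain Python. This is a sketch of the mechanics only; the buffer size and offsets are hypothetical values, and no actual `plan.cache`/`schedule.create_plan` parameter names are shown.

```python
# Illustrative sketch: carving two caches out of one dynamically sized
# shared-memory buffer using byte offsets, as the new optional parameters
# allow. Sizes and offsets are hypothetical, not Accera API calls.
total_shared_bytes = 4096          # total dynamic shared memory requested
a_offset, a_bytes = 0, 1024        # cache for array A starts at offset 0
b_offset, b_bytes = 1024, 2048     # cache for array B starts after A

buf = bytearray(total_shared_bytes)
a_cache = memoryview(buf)[a_offset:a_offset + a_bytes]
b_cache = memoryview(buf)[b_offset:b_offset + b_bytes]

assert len(a_cache) == 1024 and len(b_cache) == 2048
assert a_offset + a_bytes <= b_offset   # the two caches must not overlap
```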
- Merged PR 2898: Add pytorch gemm implementation for GPU benchmark
baselines. [Ritwik Das]

Add pytorch gemm implementation for GPU benchmark baselines
- Merged PR 2897: Generalize partial dynamic size support. [Mason Remy]

Generalize partial dynamic size support

Plumbs through mappings from arrays to the arguments that provide their dimension
sizes more generically.

This also generalizes dynamic size support beyond matmul scenarios.

Note: due to assumptions in the debug mode plumbing, the size arguments
still must occur first in the argument list, and a later PR should
generalize that
- Merged PR 2894: Add one test case for partially dynamic sized array.
[Denny Sun]
- Merged PR 2891: [nfc][release] Rev docs to 1.2.11. [Lisa Ong]
- Merged PR 2882: Add tests for thread coarsening and update GPU
benchmarks. [Ritwik Das]

- Add tests for thread coarsening and update GPU benchmarks

Related work items: 3684
- Merged PR 2890: Add folding scenario for cast ops where the only
downcasts are. [Mason Remy]

Add folding scenario for cast ops where the only downcasts are
internally-generated

This is useful for converting uint8*uint8->uint8 to
int16*int16->int32 using cache element types as is needed in the
vpmaddwd matmul scenario
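The arithmetic motivation for this conversion can be checked directly: a pair-sum of uint8*uint8 products overflows uint8 and even int16, while the worst-case pair-sum of int16*int16 products still fits in int32. This is a quick sanity check of the ranges involved, not Accera code:

```python
# Why the cast folding targets int16*int16 -> int32:
u8_max = 255
pair_sum_u8 = u8_max * u8_max * 2          # two adjacent products summed
assert pair_sum_u8 > 255                   # overflows uint8
assert pair_sum_u8 > 2**15 - 1             # even overflows int16
assert pair_sum_u8 < 2**31 - 1             # but fits easily in int32

i16_max = 2**15 - 1
assert i16_max * i16_max * 2 < 2**31 - 1   # worst-case int16 pair-sum fits int32
```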
- Merged PR 2889: [refactoring] Prevent overloading of keyword "Tensor"
- disambiguate with "MMAFragment" [Ritwik Das]

Prevent overloading of keyword "Tensor" - disambiguate with "MMAFragment"

New Contributors
* AtariDreams made their first contribution in https://github.com/microsoft/Accera/pull/52

**Full Changelog**: https://github.com/microsoft/Accera/compare/v1.2.10...v1.2.11

1.2.10

What's Changed
* Update ci.yml to fix path changes by lisaong in https://github.com/microsoft/Accera/pull/49
* Add unrolled convolution case study link by marina-neseem in https://github.com/microsoft/Accera/pull/50
* Bump protobuf from 3.20.1 to 3.20.2 in /accera/onnx-emitter/test by dependabot in https://github.com/microsoft/Accera/pull/51

- Merged PR 2886: [release] Bump docs to 1.2.10, sync GH to ADO. [Lisa
Ong]

* Bulk docs version update

* Bump protobuf from 3.20.1 to 3.20.2 in /accera/onnx-emitter/test (d1b87ec6820417da0e86d60094eacf354398d4b9)

* Also fixing a minor docs bug (errant backtick)
- Merged PR 2884: Add DSL test for runtime size correctness. [Denny Sun]
- Merged PR 2878: Optimize warp id calculation by forcing scalar
registers. [Ritwik Das]

- ROCM: use **__builtin_amdgcn_readfirstlane** to force scalar reg usage
- CUDA: don't use anything special since **__shfl_sync** seems to generate slower code
- Merged PR 2885: Updates python dependencies. [Kern Handa]

Updates hatlib version
- Merged PR 2881: Fix the runtime crash caused by incorrectly generated
LLVM IR. [Denny Sun]

1. Call the specific version of LLVM type converter for dynamic memory
2. Create MemRefDescriptor from dynamic memory shape by associating the arrays with correct size arguments

With this change, the following DSL test can succeed and pass correctness check.


```python
M = Dimension()
N = Dimension()
K = Dimension()

A = Array(shape=(M, K), element_type=ScalarType.float32, role=Array.Role.INPUT)
B = Array(shape=(K, N), element_type=ScalarType.float32, role=Array.Role.INPUT)
C = Array(shape=(M, N), element_type=ScalarType.float32, role=Array.Role.INPUT_OUTPUT)

nest = Nest(shape=(M, N, K))
i, j, k = nest.get_indices()

@nest.iteration_logic
def _():
    C[i, j] += A[i, k] * B[k, j]

M_test = np.int64(64)
N_test = np.int64(128)
K_test = np.int64(32)
A_test = np.random.random((M_test, K_test)).astype(np.float32)
B_test = np.random.random((K_test, N_test)).astype(np.float32)
C_test = np.random.random((M_test, N_test)).astype(np.float32)

correctness_check_values = {
    "pre": [M_test, N_test, K_test, A_test, B_test, C_test],
    "post": [M_test, N_test, K_test, A_test, B_test, C_test + A_test @ B_test],
}

function = package.add(nest, args=(M, N, K, A, B, C), base_name="runtimesizes")

with verifiers.VerifyPackage(self, "test_runtimesizes", TEST_PACKAGE_DIR) as v:
    package.build("test_runtimesizes", format=TEST_FORMAT | Package.Format.MLIR_VERBOSE, mode=TEST_MODE, output_dir=TEST_PACKAGE_DIR)
    if correctness_check_values:
        v.check_correctness(
            function.name,
            before=correctness_check_values["pre"],
            after=correctness_check_values["post"],
        )
```

- Merged PR 2879: Fix exception in GPU baseline benchmark. [Ritwik Das]

Fix exception in GPU baseline benchmark
- Merged PR 2856: Enable output caching in ROCM for all MMA shapes.
[Ritwik Das]
- Merged PR 2876: Introduce warp bindings in CUDA. [Ritwik Das]

- Bind indices to WARP_X/Y along with tensorization (exclusively from thread id mapping)
- The warp x dimension is always a multiple of the warp size. For example, dividing a 64x64 block tile into four 32x32 subtiles, where each subtile is computed by a single warp, yields a blockDim of (64,2,1).
- This is required because with tensorization we want block dims to be generated in a specific way that differs from the non-tensorized case. Calculating offsets within the matrix based on warps is non-trivial, if not impossible, with thread bindings alone.

Related work items: 3726
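The blockDim arithmetic described above can be reproduced in a few lines (a warp size of 32 is assumed, as on the targets discussed; this is an illustration of the stated example, not Accera's implementation):

```python
WARP_SIZE = 32

def block_dims(block_tile, warp_subtile):
    # One warp computes each subtile; the x dimension is expressed in
    # threads (warps_x * WARP_SIZE), while y and z are expressed in warps.
    warps_x = block_tile[0] // warp_subtile[0]
    warps_y = block_tile[1] // warp_subtile[1]
    return (warps_x * WARP_SIZE, warps_y, 1)

# A 64x64 block tile split into four 32x32 subtiles -> blockDim (64, 2, 1)
assert block_dims((64, 64), (32, 32)) == (64, 2, 1)
```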
- Merged PR 2874: Add unrolled convolution case study link (50) [Lisa
Ong]

Add unrolled convolution case study link (50)

* Update README.md

Add unrolled convolution case study reference link

* Update the reference link

Update the reference according to latest updates in the case study
- Merged PR 2873: Convert function signature from dynamic memref type to
llvm type. [Denny Sun]

With this change, Accera is able to write the correct function signature of dynamic memref type to HAT file
- Merged PR 2871: Update hatlib version. [Denny Sun]

from 0.0.23 to 0.0.25
- Merged PR 2870: Filter benchmark kernels based on scheduling policy.
[Ritwik Das]

Filter benchmark kernels based on scheduling policy
- Merged PR 2867: [build][github] Update test path in github actions.
[Lisa Ong]

Fixes https://github.com/microsoft/Accera/actions/runs/3071905923


**Full Changelog**: https://github.com/microsoft/Accera/compare/v1.2.9...v1.2.10

1.2.9

- Merged PR 2862: write runtime size of index type to Hat. [Denny Sun]

write runtime size of index type to Hat
- Merged PR 2861: Fix cache_C benchmark variable which is not getting
set properly for CUDA. [Ritwik Das]

Fix cache_C benchmark variable which is not getting set properly for CUDA
- Merged PR 2864: [build]: fix breaks due to agent image updates. [Lisa
Ong]

Latest version of azure pipelines images now set VCPKG_ROOT, which overrides the submodule used by Accera.

See: https://github.com/actions/runner-images/commit/ef638dd95f31092f88c882294b0db3bfe9728685

* Only pipelines that rely on azure build agents are affected.
* We still need to keep the submodule around to enable external builds from the Github repo.
* Remove defunct pipeline
* Update vcpkg submodule while we're here
- Merged PR 2839: Enable CUDA output caching. [Ritwik Das]

- Add a Tensor memory space type to denote memory fragments for caching (e.g. C in gemm). This might go away in the future and be replaced with Private once the caching code is unified with ROCM behavior.
- Change the caching code to generate MMALoad/StoreOps for caching of the output.

Related work items: 3725
- Merged PR 2813: Add pass to recognize patterns that look like int16
matrix multiply. [Chuck Jacobs]

This PR adds a pass to rewrite GEMM-like loops that multiply-accumulate int16 matrices into an int32 result. If this pattern gets invoked, the output should contain the much-sought `vpmaddwd` instruction.

It also fixes some old low-level tests of integer arithmetic.
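For reference, the semantics of `vpmaddwd` (multiply adjacent int16 pairs and horizontally add each pair into an int32 lane) can be emulated in a few lines of Python. This illustrates the instruction's behavior only, not the rewrite pass itself:

```python
def vpmaddwd(a, b):
    """Emulate one vpmaddwd: a and b are equal-length lists of int16 values.

    Each output int32 lane is a[2i]*b[2i] + a[2i+1]*b[2i+1].
    """
    assert len(a) == len(b) and len(a) % 2 == 0
    return [a[2 * i] * b[2 * i] + a[2 * i + 1] * b[2 * i + 1]
            for i in range(len(a) // 2)]

# 4 int16 inputs -> 2 int32 outputs
assert vpmaddwd([1, 2, 3, 4], [5, 6, 7, 8]) == [17, 53]
```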
- Merged PR 2847: [release] Bump docs version to 1.2.9 and update github
action container. [Lisa Ong]

* Rev docs to 1.2.9

* Update github workflow to reference updated tag for 14.0.6-1
- Merged PR 2845: Filter GPU benchmarks by de-parameterizing cache
layouts. [Ritwik Das]

Filter GPU benchmarks by de-parameterizing cache layouts
- Merged PR 2843: Fix bug in GPU benchmark to calculate valid variant.
[Ritwik Das]

- Fix bug in GPU benchmark to calculate valid variant
- Add cosmosdb util to cleanup old entries
- Merged PR 2835: Merge in MLIR fixes for LocationSnapshot and
MemRefCastOp. [Lisa Ong]

From commit 1abc4a981067ef1fd9bf717d7fabc4f6d75520d1
- Merged PR 2842: Parameterize cache strategy in GPU benchmarks and fix
  kernel filters. [Ritwik Das]

Parameterize cache strategy in GPU benchmarks and fix kernel filters
- Merged PR 2836: Value DSL support for runtime sized output arrays.
[Lisa Ong]

* This adds memref-in-memref support for output arrays that are allocated in the function
* A new "Pointer" Value wrapper class with a Store() function which creates an accv.StoreOp, similar to Array, Scalar
* Update accv.StoreOp to support memrefs-in-memrefs

Value pointer levels are defined as follows:

|Layout|Example|Pointer level|C-type|
|--|--|--|--|
|scalar|int16, float32, index, ...|0|int16_t, float32_t, int64_t, ...|
|single-level memref|memref<1xindex>, memref<3x2xf32>, memref<10x16x11x?xf32>|1|int64_t*, float32_t*, float32_t*|
|memref in memref|memref<memref<?x?x?f32>>|at least 2 (= the number of levels of memrefs)|float32_t**|

Future work:
* End-to-end lowering through Python DSL
* Bare pointer convention for output arrays
* Custom allocator functions. Currently we use the built-in std alloc.

Related work items: 3730
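The pointer levels in the table above map directly onto C pointer types. A small `ctypes` sketch makes the correspondence concrete (purely illustrative; this is unrelated to Accera's own emitter):

```python
import ctypes

scalar = ctypes.c_float                     # pointer level 0: float32_t
memref = ctypes.POINTER(scalar)             # pointer level 1: float32_t*
memref_in_memref = ctypes.POINTER(memref)   # pointer level 2: float32_t**

assert memref_in_memref._type_ is memref    # each level wraps the previous one
```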
- Merged PR 2840: [nfc] Remove redundant ACR info from docker scripts.
[Lisa Ong]

The container registry allows pull-only access
- Merged PR 2838: Runtime sized Array lowering to LLVM, accv.alloc to
LLVM malloc. [Denny Sun]

1. Make a deep copy of the range end of the value type when cloning ops
2. Plumb the runtime size through to LLVM
3. Transform memref.alloc into an LLVM malloc
4. Convert between block arguments and symbol names

The generated IRs:

**Initial.mlir**

`%2 = "accv.alloc"(%arg0, %arg1) {sym_name = "diff"} : (index, index) -> memref<?x?xf32> loc(loc)`

**LoopNestToValueFunc.mlir**


```mlir
%2 = "accv.alloc"(%arg0, %arg1) {sym_name = "diff"} : (index, index) -> memref<?x?xf32> loc(loc)
affine.for %arg4 = 0 to %arg0 {
  affine.for %arg5 = 0 to %arg1 {
  }
}
```


**ConvertValueToStd.mlir**

`%0 = memref.alloc(%arg0, %arg1) : memref<?x?xf32>`

**ConvertValueToLLVM.mlir**


```mlir
%8 = llvm.mul %arg1, %arg0 : i64
%9 = llvm.mlir.null : !llvm.ptr<f32>
%10 = llvm.getelementptr %9[%8] : (!llvm.ptr<f32>, i64) -> !llvm.ptr<f32>
%11 = llvm.ptrtoint %10 : !llvm.ptr<f32> to i64
%12 = llvm.call malloc(%11) : (i64) -> !llvm.ptr<i8>
```


Related work items: 3733
- Merged PR 2831: Record unique IDs so that different processes acting
on a value module. [Mason Remy]

Record unique IDs so that different processes acting on a value module
don't produce conflicting IDs
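One common way to keep generated IDs from colliding across processes (purely illustrative; the changelog entry does not describe Accera's actual scheme) is to scope a counter by process ID:

```python
import itertools
import os

def make_id_generator(prefix="id"):
    # Scoping generated names by PID means two processes acting on the
    # same value module cannot hand out the same ID.
    counter = itertools.count()
    return lambda: f"{prefix}_{os.getpid()}_{next(counter)}"

gen = make_id_generator()
ids = [gen() for _ in range(1000)]
assert len(set(ids)) == 1000   # unique within (and labeled by) this process
```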
- Merged PR 2837: Fix WPT calculation to prevent 0 work and filter
benchmarks. [Ritwik Das]

Fix WPT calculation to prevent 0 work and filter benchmarks
- Merged PR 2832: Caching strategy flag and thread ID optimization (GPU)
[Ritwik Das]

- Add a flag to plan.cache() to expose the different thread <--> data arrangements
- Optimize thread ID calculation to check blockdim first
- Merged PR 2829: Add handwritten caching implementation for GPU.
[Ritwik Das]

Add GPUBlockCacheOp, which lowers to a handwritten caching implementation on the GPU that supports access patterns for minimizing bank conflicts in shared memory and maximizing coalescing of global memory accesses.
- Merged PR 2821: Fixes constraint logic for fusion of more than two
schedules. [Kern Handa]

Fixes constraint logic for fusion of more than two schedules
- Merged PR 2830: Fixes macOS CI build. [Kern Handa]

Fixes macOS CI build
- Merged PR 2806: Enable specifying cache element type. [Mason Remy]

Enable specifying cache element type

- Supports accumulating and/or computing in a different element type and
batching up the casts for those types
- Also adds support for binop/castop expansion and castop folding
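The benefit of accumulating in a wider element type can be seen with plain integers (a generic illustration of why cache element types matter, not the Accera API):

```python
# Accumulating squares of uint8-sized values:
vals = [200] * 10

acc_u8 = 0
for v in vals:
    acc_u8 = (acc_u8 + v * v) & 0xFF     # 8-bit accumulator wraps around

acc_i32 = sum(v * v for v in vals)       # wider accumulator is exact

assert acc_i32 == 400000
assert acc_u8 == 400000 % 256            # == 128: silently wrapped
```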
- Merged PR 2818: Upgrade hatlib dependency to v0.0.23. [Ritwik Das]

Upgrade hatlib dependency to v0.0.23
- Merged PR 2792: Refactor cast to a value cast op. [Mason Remy]

Refactor cast to a value cast op
- Merged PR 2788: Re-enabled fusing test that was taking too long.
[Chuck Jacobs]

This PR just re-enables a skipped test that was taking too long
- Merged PR 2816: Upgrade hatlib requirement to 0.0.22. [Ritwik Das]

Upgrade hatlib requirement to 0.0.22
- Merged PR 2811: [nfc] Upgrade CUDA to 11.7 on NVidia benchmark
machines. [Lisa Ong]

According to https://hub.docker.com/r/nvidia/cuda/tags, 11.7.0 is still the latest.


**Full Changelog**: https://github.com/microsoft/Accera/compare/v1.2.8...v1.2.9

1.2.8

What's Changed
* Set license field in metadata of package by tonybaloney in https://github.com/microsoft/Accera/pull/46
* Github codespaces configuration by lisaong in https://github.com/microsoft/Accera/pull/48

------------
- Merged PR 2814: Parameterize batch_size in GPU benchmarks. [Ritwik
Das]

Parameterize batch_size in GPU benchmarks
- Merged PR 2810: [release] [nfc] Bump docs version to 1.2.8, bump
github actions to llvm 14.0.6. [Lisa Ong]

Preparation for 1.2.8 release
- Merged PR 2808: [ci] Add vcpkg caching for buddy builds, disable flaky
parallelized tests. [Lisa Ong]

* Enable vcpkg binary caching for CI pipelines that use non-custom agents. This reduces vcpkg install time from 2-3 minutes to ~30 seconds
* `ctest --parallel` on macOS can sometimes fail randomly. The tests will need to be updated to support running in parallel
- Merged PR 2804: [ci] Reduce runtimes of PR Buddy Builds. [Lisa Ong]

* Remove redundant setup.py builds in pipelines with cmake builds
* Build debug for Linux only (the fastest config)
* Add pipeline caching for ccache, conan, and pip where applicable
* Add parallel configs where applicable
* Filter out some tests on windows due to slow runtimes. These should have coverage on Linux and macOS.
- Merged PR 2807: Enable verification for CK baselines. [Ritwik Das]

- Enable verification for CK baselines
- increase timeout for cuda resnet
- add functionality for extracting kernel code from cosmosdb
- Merged PR 2802: Fix barrier optimization pass. [Chuck Jacobs]

This PR fixes a couple of barrier-related issues:
- The barrier optimization pass wasn't keeping barriers that protected vector load/store ops
- Multiple barriers were getting generated when hoisting barriers out of conditionals

Related work items: 3732
- Merged PR 2800: Add max_threads to parallelize and change default
behavior. [Ritwik Das]

- Add num_threads to parallelize
- change default behavior to count the number of iterations of the given indices
- Update documentation
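The described default (count the iterations of the given indices, capped by the new parameter) amounts to the following behavioral sketch. The function name and signature are hypothetical, not the actual `parallelize` implementation:

```python
def default_thread_count(index_extents, max_threads=None):
    # Default behavior: thread count = number of iterations of the
    # given indices (product of their extents), optionally capped.
    n = 1
    for extent in index_extents:
        n *= extent
    return n if max_threads is None else min(n, max_threads)

assert default_thread_count([4, 8]) == 32
assert default_thread_count([4, 8], max_threads=16) == 16
```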
- Merged PR 2801: Remove verification on cuda-fp32-big benchmark.
[Ritwik Das]

Remove verification on cuda-fp32-big benchmark
- Merged PR 2798: LLVM 14.0.6 upgrade. [Lisa Ong]

An incremental upgrade with minimal or no changes to MLIR
- Merged PR 2796: Makes NestedPassAdaptor's pipeline consistent. [Kern
Handa]

Makes NestedPassAdaptor's pipeline consistent

This change makes it so NestedPassAdaptor creates a new pass manager
every time a new pass is added. Prior to this change, if dumpPasses was
false, the same nested pass manager would be used. If dumpPasses was
true, a new nested pass manager would be created per call to addPass.
This difference in behavior was also resulting in the lowering pipeline
to be different, depending on the value of dumpPasses.

For example, in the following code in AcceraPasses.cpp, all the passes
that are added to `funcOpPM` run BEFORE `createConvertSCFToOpenMPPass`
if `dumpPasses` was false.

```cpp
auto funcOpPM = pmAdaptor.nestPassManager([&]() -> OpPassManager& { return pm.nest<v::ValueModuleOp>().nest<FuncOp>(); });
funcOpPM.addPass(createConvertLinalgToAffineLoopsPass());
funcOpPM.addPass(createSimplifyAffineStructuresPass());
funcOpPM.addPass(createCanonicalizerPass());
funcOpPM.addPass(createLoopInvariantCodeMotionPass());
funcOpPM.addPass(createCSEPass());

pmAdaptor.addPass(createConvertSCFToOpenMPPass());
pmAdaptor.addPass(value::createValueToStdPass(options.enableProfile));
funcOpPM.addPass(value::createBarrierOptPass(options.writeBarrierGraph.getValue(), options.barrierGraphFilename.getValue()));
pmAdaptor.addPass(value::createRangeValueOptimizePass());
pmAdaptor.addPass(createCanonicalizerPass());
pmAdaptor.addPass(createCSEPass());
```


Additionally, this change exposed the fact that the BarrierOpt pass is
incorrectly erasing barriers, and so has been made into a no-op until
this correctness issue has been fixed.
- Merged PR 2795: [docs] Cleanup viz scripts, clarify reorder
illustrations. [Lisa Ong]

* Clarify in the labels while working on the animated version

* Cleanup and rename .js files for (slightly) easier lookup
- Merged PR 2475: LLVM 14.0.0 upgrade. [Lisa Ong]

Tag: llvmorg-14.0.0

Notable changes:
* std dialect ops are now moved to arith, math dialects
* StrEnumAttribute is now replaced by simple enums. This affects things like gpu.dimension.x
* [Issue] linalg.copy is removed, replaced by memref.copy, which introduces a runtime dependency on a `memrefCopy` C function for non-identity layout copies. This affects Array.sub_array in debug mode.
* [Regression] OMP to LLVM lowering will crash in mlir-translate findAlloc due to a empty set of blocks being emitted. This only affects dynamic scheduling with collapsed loops.
* Lots of renames
* Upgraded macOS to macOS-12

Related work items: 3646
- Merged PR 2753: accera.Dimension and runtime-sized Arrays in the
Python DSL. [Denny Sun]

With this change, Accera is able to generate the initial MLIR for runtime-sized Arrays. The IR lowering is not fully working due to a bug, which will be fixed in later changes.


```python
M = Dim()
N = Dim()
K = Dim()

A = Array(shape=(M, K), element_type=ScalarType.float32, role=Array.Role.INPUT)
B = Array(shape=(K, N), element_type=ScalarType.float32, role=Array.Role.INPUT)
C = Array(shape=(M, N), element_type=ScalarType.float32, role=Array.Role.INPUT_OUTPUT)

nest = Nest((M, N, K))
i, j, k = nest.get_indices()

@nest.iteration_logic
def _():
    C[i, j] += A[i, k] * B[k, j]

package.add()
package.build()
```



```mlir
domain0 = accln<"idomain{{i,3}={0:{op_idx:0}:1}, {j,4}={0:{op_idx:1}:1}, {k,5}={0:{op_idx:2}:1}}">
domain1 = accln<"idomain{{i,9}={0:{op_idx:0}:1}, {j,10}={0:{op_idx:1}:1}}">
domain2 = accln<"idomain{{i,6}={0:1:1}}">

map = affine_map<(d0, d1)[s0] -> (d0 * s0 + d1)>
xdomain0 = accln<"xfdomain{dims: {{i,3}, {j,4}, {k,5}}, indices: {{{i,3} : {0:{op_idx:0}:1}}, {{j,4} : {0:{op_idx:1}:1}}, {{k,5} : {0:{op_idx:2}:1}}}}">
xdomain1 = accln<"xfdomain{dims: {{i,9}, {j,10}}, indices: {{{i,9} : {0:{op_idx:0}:1}}, {{j,10} : {0:{op_idx:1}:1}}}}">
xdomain2 = accln<"xfdomain{dims: {{i,6}}, indices: {{{i,6} : {0:1:1}}}}">

module test_runtimesizes attributes {llvm.data_layout = "... ..."} {
  accv.module "test_runtimesizes" {
    accv.func nested runtimesizes_..._impl_...(%arg0: index loc(unknown), %arg1: index loc(unknown), %arg2: index loc(unknown), %arg3: memref<?x?xf32, map> loc(unknown), %arg4: memref<?x?xf32, map> loc(unknown), %arg5: memref<?x?xf32, map> loc(unknown)) attributes {accv.output_verifiers = ["", "", "", "", "", "_debug_check_allclose_<accera.lang.Dim.Dim object at ...>_<accera.lang.Dim.Dim object at ...>_..."], exec_target = 0 : i64} {
      %0 = "accv.get_element"(<<UNKNOWN SSA VALUE>>) : (memref<index>) -> index loc(loc)
      %1 = "accv.get_element"(<<UNKNOWN SSA VALUE>>) : (memref<index>) -> index loc(loc)
      %2 = "accv.get_element"(<<UNKNOWN SSA VALUE>>) : (memref<index>) -> index loc(loc)
      "accln.nest"(%0, %1, %2) ( {
        %3 = accln.sym_index {name = "i"} accln<"index{i,3}"> loc(loc)
        %4 = accln.sym_index {name = "j"} accln<"index{j,4}"> loc(loc)
        %5 = accln.sym_index {name = "k"} accln<"index{k,5}"> loc(loc)
        "accln.kernel"() ( {
          %7 = "accv.slice"(%arg5, %3, %4) {sliceDimensions = [0, 1]} : (memref<?x?xf32, map>, index, index) -> memref<f32> loc(loc)
          ... ...
          accln.terminator loc(loc)
        }) {sym_name = "_"} : () -> () loc(loc)
        ... ...
        accln.terminator loc(loc)
      }) {domain = domain0, exec_target = 0 : i64, kernels = []} : (index, index, index) -> () loc(loc)
      accv.return loc(loc)
    } loc(loc)
    accv.func runtimesizes_...(%arg0: index loc(unknown), %arg1: index loc(unknown), %arg2: index lo...
```
- Merged PR 2793: support sign extend op in canVectorize() function to
improve generated MLIR. [JUBI TANEJA]

While trying to optimize `int16` `MatMul` with the vectorize transformation in the DSL, we noticed an unrolled loop with load, binop, sexti, and store instructions. No vector instruction was emitted, which hinted that the sign-extend instruction was not supported in the `canVectorize` function. With this op now supported, we can emit some vector instructions in the MLIR.
- Merged PR 2790: Filter invalid kernels from GPU benchmarks. [Ritwik
Das]

- Filter invalid kernels from GPU benchmarks
- Disable verification on cuda f16 benchmarks
- Remove frequent cleanups
- Merged PR 2787: Remove MLIR flag from package format in benchmarks.
[Ritwik Das]

Remove MLIR flag from package format in benchmarks
- Merged PR 2784: Merge Github changes to ADO. [Lisa Ong]
- Merged PR 2776: Make fusing more efficient. [Chuck Jacobs]

This PR refactors the code generation for schedules and makes it more efficient. This makes a big difference for complex schedules with constraints on the kernels (like the ones generated when fusing schedules).

Here are some timings on a few tests (modified versions of Mason's example script) I ran:

| test | main branch | PR branch |
|----|----|----|
| 3 fused schedules, tile first only | 18.8s | 5.8s |
| 3 fused schedules, tile 1 & 2 | 190s | 6.2s |
| 3 fused schedules, tile all 3 | ???? | 7.2s |

Related work items: 3731
- Merged PR 2781: Fix benchmark with MLIR format and add repro test.
[Ritwik Das]
- Merged PR 2780: Type support for tensor ops in CUDA. [Ritwik Das]

- Add support for FP32 input (TF32 compute)
- Add support for bfloat16 input/FP32 output
- Add support for integer types

Related work items: 3709, 3710
- Merged PR 2779: Some assorted benchmark fixes. [Ritwik Das]

- Build Accera in release mode
- Shuffle gemm sizes to run small sizes first
- Increase tolerance to account for floating point drift for large k-split
- Merged PR 2774: Add input caching tests for CUDA, enable tests in PR
pipelines. [Ritwik Das]

Add input caching tests in CUDA

Related work items: 3725
- Merged PR 2677: Unify rocm/cuda tensor ops lowering under accv
dialect. [Ritwik Das]

- remove gpu dialect lowering (CUDA)
- add accv dialect lowering for CUDA
- rocm and cuda lowering use the same semantics

Related work items: 3728
- Merged PR 2764: [doc] Rename acc.Dim to acc.Dimension and add
create_dimensions() [Lisa Ong]

* Rename `acc.Dim` to `acc.Dimension`, `acc.Dim.Role` to `acc.Dimension.Role`
* Add the simplified `acc.create_dimensions()` construction pattern
* Kept the `acc.Dimension` constructor for advanced use cases involving generator patterns

Related work items: 3720
- Merged PR 2752: Add nargs to input args in benchmark tool. [Ritwik
Das]

add nargs to input args in benchmark tool
- Merged PR 2680: [doc] Manual and Reference doc updates for Runtime
Array DSL. [Lisa Ong]

Proposed DSL changes for supporting runtime array sizes:
* Adds a new dimension type that serves as a placeholder for runtime dimension sizes for `Array` and `Nest`. Supports both input and output dimensions
* Adds output-only Arrays
* Add the Scalar type
* Example kernels demonstrating different aspects:
* Gather: basic features
* Range: scalar function arguments
* ReduceMean: fusion

Related work items: 3720
- Merged PR 2683: Support conditionals in Logic Function. [Denny Sun]

Before this change, there was no way to emit conditionals in a logic function.

With this change, the user is able to write the following logic function:


```python
def if_func():
    T[i, j] = A[i, j] + B[i, j]
    C[i, j] += T[i, j] ** 2.

def elseif_func():
    T[i, j] = A[i, j] - B[i, j]
    C[i, j] += T[i, j] ** 2.

def else_func():
    C[i, j] = A[i, j] + B[i, j]

@nest.iteration_logic
def _():
    _If(j < 100, if_func).ElseIf(i > 100, elseif_func).Else(else_func)
```


Related work items: 3706

New Contributors
* tonybaloney made their first contribution in https://github.com/microsoft/Accera/pull/46

**Full Changelog**: https://github.com/microsoft/Accera/compare/v1.2.7...v1.2.8

1.2.7

------------
- Merged PR 2744: [doc] Fixes link in reference/functions/cast.md, revs
version on all docs. [Kern Handa]

[doc] Fixes link in reference/functions/cast.md
- Merged PR 2743: [DSL] Document implicit casting rules and the explicit
`cast` function. [Lisa Ong]

* Document implicit casting rules implemented by !2693
* Promote `acc.cast` to a documented function to give the user control to override implicit casting behavior
- Merged PR 2739: Updates ROCM tensorization pattern to handle casting.
[Kern Handa]

Updates ROCM tensorization pattern to handle casting
- Merged PR 2643: Some fixes for last major array caching in
tensorization. [Mason Remy]

Some fixes for last major array caching in tensorization
- Merged PR 2693: Updates DSL codegen to implicitly cast if possible.
[Kern Handa]

Updates DSL codegen to implicitly cast if possible
- Merged PR 2735: Pass multiple input files as comma-separated list to
benchmark tool. [Ritwik Das]

https://intelligentdevices.visualstudio.com/ELL/_build/results?buildId=41588&view=logs&j=d78921a4-2f18-50b0-77ad-4c6803f3371b&t=f97c60f6-ada7-5ec9-5ea1-510216c408e9

The above pipeline did not run the 2nd set of input sizes because the 1st process did not exit until the pipeline timeout was hit. After the fix, we will always have a single job.
- Merged PR 2721: Remove unnecessary logging in benchmarks. [Ritwik Das]

Remove unnecessary logging in benchmarks
- Merged PR 2674: Support emitting runtime array sizes in the Value DSL.
[Lisa Ong]

* Minimum set of changes to support runtime sizes in the Value DSL without transformations
* Add a ScalarDimension type (name TBC) which is aliased to Scalar
* Support variable ends in MemoryLayout, ScheduledLoopOp, RangeValueAnalysis
* Use mlir::ShapedType::kDynamicSize and mlir::ShapedType::kDynamicStrideOrOffset as sentinel values, following the pattern in MemRefOps, TensorOps, etc.
* TODO: E2E verification in the next PR
* TODO: Python DSL changes in the next PR

Output of mlir-translate for the runtime_sizes_all case, where %21, %22 and %23 are the runtime sizes for M, N, and K:


```llvm
define void NestMatMul(float* %0, float* %1, i64 %2, i64 %3, i64 %4, i64 %5, i64 %6, float* %7, float* %8, i64 %9, i64 %10, i64 %11, i64 %12, i64 %13, float* %14, float* %15, i64 %16, i64 %17, i64 %18, i64 %19, i64 %20, i64 %21, i64 %22, i64 %23) !dbg !3 {
  br label %25, !dbg !7

25:                                               ; preds = %57, %24
  %26 = phi i64 [ %58, %57 ], [ 0, %24 ]
  %27 = icmp slt i64 %26, %21, !dbg !9
  br i1 %27, label %28, label %59, !dbg !10

28:                                               ; preds = %25
  br label %29, !dbg !11

29:                                               ; preds = %55, %28
  %30 = phi i64 [ %56, %55 ], [ 0, %28 ]
  %31 = icmp slt i64 %30, %22, !dbg !12
  br i1 %31, label %32, label %57, !dbg !13

32:                                               ; preds = %29
  br label %33, !dbg !14

33:                                               ; preds = %36, %32
  %34 = phi i64 [ %54, %36 ], [ 0, %32 ]
  %35 = icmp slt i64 %34, %23, !dbg !15
  br i1 %35, label %36, label %55, !dbg !16

36:                                               ; preds = %33
  %37 = mul i64 %26, %5, !dbg !17
  %38 = add i64 %37, %34, !dbg !18
  %39 = getelementptr float, float* %1, i64 %38, !dbg !19
  %40 = load float, float* %39, align 4, !dbg !20
  %41 = mul i64 %34, %12, !dbg !21
  %42 = add i64 %41, %30, !dbg !22
  %43 = getelementptr float, float* %8, i64 %42, !dbg !23
  %44 = load float, float* %43, align 4, !dbg !24
  %45 = fmul float %40, %44, !dbg !25
  %46 = mul i64 %26, %19, !dbg !26
  %47 = add i64 %46, %30, !dbg !27
  %48 = getelementptr float, float* %15, i64 %47, !dbg !28
  %49 = load float, float* %48, align 4, !dbg !29
  %50 = fadd float %49, %45, !dbg !30
  %51 = mul i64 %26, %19, !dbg !31
  %52 = add i64 %51, %30, !dbg !32
  %53 = getelementptr float, float* %15, i64 %52, !dbg !33
  store float %50, float* %53, align 4, !dbg !34
  %54 = add i64 %34, 1, !dbg !35
  br label %33, !dbg !36

55:                                               ; preds = %33
  %56 = add i64 %30, 1, !dbg !37
  br label %29, !dbg !38

57:                                               ; preds = %29
  %58 = add i64 %26, 1, !dbg !39
  br label %25, !dbg !40

59:                                               ; preds = %25
  ret void, !dbg !41
}
```


Related work items: 3716, 3717
- Merged PR 2682: Add nvidia device optimized sizes and some benchmark
fixes. [Ritwik Das]

Add nvidia dev opt sizes and some bench fixes
- Merged PR 2676: Add automated weekly rocm baseline benchmark. [Ritwik
Das]

https://intelligentdevices.visualstudio.com/ELL/_build/results?buildId=41316&view=logs&j=4f7f213a-5f0f-58b0-1189-99ef12faf0d8&t=687344d2-d6b6-5d8c-dd9d-6aab558fd96c

https://intelligentdevices.visualstudio.com/ELL/_build/results?buildId=41314&view=logs&j=4f7f213a-5f0f-58b0-1189-99ef12faf0d8
- Merged PR 2673: Add automated weekly baseline benchmarks on Nvidia
GPU. [Ritwik Das]

1.2.6

What's Changed
* Bump urllib3 from 1.25.8 to 1.26.5 in /tools/benchmarkers by dependabot in https://github.com/microsoft/Accera/pull/42
* [ci] Fix out of disk space errors for CI workflow by lisaong in https://github.com/microsoft/Accera/pull/43
* Bump bottle from 0.12.19 to 0.12.20 in /tools/viz by dependabot in https://github.com/microsoft/Accera/pull/44
- Merged PR 2657: Add conversion pass from gpu ops to rocdl ops. [Ritwik Das]

- switch to gpu dialect for gpu index ops
- add conversion pass from gpu dialect to rocdl
- Merged PR 2652: Add integer tensor ops support for AMD targets.
[Ritwik Das]

- int mfma ops
- tests
- static_cast in c++

Related work items: 3727
- Merged PR 2650: [release] Docs version to 1.2.6, sync Github to ADO.
[Lisa Ong]
- Merged PR 2624: Add more MMA shapes for CUDA. [Ritwik Das]

Add more MMA shapes for CUDA
- 32x8x16
- 8x32x16
- Merged PR 2644: Enable CUDA benchmarks only for A6000. [Lisa Ong]

* Manually set the Target.Model user capability on agents running A6000
* Update benchmarking pipelines to demand A6000s

https://docs.microsoft.com/en-us/azure/devops/pipelines/process/demands?view=azure-devops&tabs=yaml#feedback
- Merged PR 2634: Remove couple more big gemm sizes. [Ritwik Das]

Remove couple more big gemm sizes
- Merged PR 2626: [refactor] Moving debug mode to its own lowering pass.
[Lisa Ong]

Move the emitting of the debug mode wrapper function out of MLIREmitterContext into a lowering pass.

This makes it easier to expand debug mode in the future.
- Merged PR 2633: Bump hatlib to 0.0.19 to unblock CUDA T4 devices.
[Lisa Ong]
