Fp8 runs faster when the kernel name has "cutlass" in it

ValueError('Layout mismatch in broadcast: SliceLayout(dim=1, parent=BlockedLayout(size_per_thread=[1, 128], threads_per_warp=[32, 1], warps_per_cta=[4, 1], order=[0, 1], ctas_per_cga=[1, 1], cta_split_num=[1, 1], cta_order=[1, 0])) vs SliceLayout(dim=1, parent=DistributedLinearLayout(reg_bases=[[0, 64], [0, 1], [0, 2], [0, 4], [0, 8], [0, 16], [0, 32]], lane_bases=[[1, 0], [2, 0], [4, 0], [8, 0], [16, 0]], warp_bases=[[32, 0], [64, 0]], block_bases=[], shape=[128, 128]))')

It seems that p ends up with a linear layout instead of a blocked layout. I am not sure why, though. I believe layout inference should try a blocked layout first before falling back to a linear layout.
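
The two parent layouts named in the error can be compared by hand. The following is a minimal sketch, not Triton code: it assumes that linear layout bases combine by XOR over the set bits of the register, lane, and warp indices, and that the blocked layout with size_per_thread=[1, 128] walks its 128 columns in natural power-of-two order. The helper names (apply_bases, coord) and BLOCKED_REG_BASES are hypothetical; only the linear, lane, and warp bases are copied from the error message.

```python
# Bases copied from the DistributedLinearLayout in the error (shape [128, 128]).
# Each basis is [row, col]: the coordinate contributed by one index bit.
LINEAR_REG_BASES = [[0, 64], [0, 1], [0, 2], [0, 4], [0, 8], [0, 16], [0, 32]]
LANE_BASES = [[1, 0], [2, 0], [4, 0], [8, 0], [16, 0]]
WARP_BASES = [[32, 0], [64, 0]]

# ASSUMPTION: register bases the BlockedLayout would have if its registers
# cover columns 0..127 in natural order (bit i -> column 2**i).
BLOCKED_REG_BASES = [[0, 1 << i] for i in range(7)]


def apply_bases(bases, index):
    """XOR together the bases selected by the set bits of `index`."""
    row = col = 0
    for bit, (r, c) in enumerate(bases):
        if (index >> bit) & 1:
            row ^= r
            col ^= c
    return row, col


def coord(reg_bases, reg, lane, warp):
    """(row, col) of the element held in register `reg` of a given thread."""
    r0, c0 = apply_bases(reg_bases, reg)
    r1, c1 = apply_bases(LANE_BASES, lane)
    r2, c2 = apply_bases(WARP_BASES, warp)
    return r0 ^ r1 ^ r2, c0 ^ c1 ^ c2


# Register 1 of thread 0 lands on different columns in the two parents.
linear_reg1 = coord(LINEAR_REG_BASES, 1, 0, 0)    # (0, 64)
blocked_reg1 = coord(BLOCKED_REG_BASES, 1, 0, 0)  # (0, 1)

# Each thread still owns the same set of columns in both parents.
same_columns = (
    {coord(LINEAR_REG_BASES, reg, 0, 0)[1] for reg in range(128)}
    == {coord(BLOCKED_REG_BASES, reg, 0, 0)[1] for reg in range(128)}
)

# Dropping dim 1 (what SliceLayout(dim=1, ...) keeps) leaves only the row.
same_rows = all(
    coord(LINEAR_REG_BASES, reg, lane, warp)[0]
    == coord(BLOCKED_REG_BASES, reg, lane, warp)[0]
    for reg in range(128)
    for lane in range(32)
    for warp in range(4)
)

print(linear_reg1, blocked_reg1, same_columns, same_rows)
```

Under those assumptions the two parents differ only in register order (the first register basis is column 64 rather than column 1), while every thread sits on the same row in both. That would be consistent with the two slices along dim=1 describing the same distribution even though the layout objects do not compare equal, but it does not explain why inference chose the linear layout in the first place.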
