Add ~128 MB of VMEM (800x more on-chip memory than a GPU SM), and XLA’s automatic fusion, and the score matrix just… stays on-chip. My handwritten tiling was reimplementing what the hardware and compiler already handle, but worse. (At production scale — multi-head, longer sequences, larger d — the tradeoffs shift and Splash Attention becomes necessary. But for the single-head setup I was benchmarking, the compiler had it covered.)
3月7日晚间,围绕上述领域,何小鹏接受了南方周末等媒体的采访。
。WPS极速下载页对此有专业解读
Global news & analysis
人 民 网 版 权 所 有 ,未 经 书 面 授 权 禁 止 使 用