| dbp:proof
|
- We can draw the computation graph of as a cube of lattice points, each point is of form . Since , computing requires the processor to have access to each point within the cube at least once. So the problem becomes covering the lattice points with a minimal amount of communication.
If is large, then we can simply load all entries then write entries. This is uninteresting.
If is small, then we can divide the minimal-communication algorithm into separate segments. During each segment, it performs exactly reads to cache, and any number of writes from cache.
During each segment, the processor has access to at most different points from .
Let be the set of lattice points covered during this segment. Then by the Loomis–Whitney inequality,
with constraint .
By the inequality of arithmetic and geometric means, we have , with extremum reached when .
Thus the arithmetic intensity is bounded above by where , and so the communication is bounded below by .
Direct computation verifies that the tiling matrix multiplication algorithm reaches the lower bound. (en)
|