4 thoughts on “Doubling Computer Speed”

  1. Haven’t read the paper, and not to denigrate what is probably sound research, but I’m guessing it’s not as big a deal as it sounds. Even the popular write-up admits that “those figures were only calculated under the hardest workloads”. Amdahl’s Law always gets you in the end.
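
    For reference, the usual statement of Amdahl’s Law (the standard formula, not anything from the paper): if a fraction p of the work speeds up by a factor s, the serial remainder caps the overall speedup no matter how large s gets:

    ```latex
    S(s) = \frac{1}{(1 - p) + p/s}, \qquad \lim_{s \to \infty} S(s) = \frac{1}{1 - p}
    ```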

  2. When I was writing a matrix-multiply library for a parallel machine built on the i860, I found it was far better to let the compute kernel over-calculate dummy matrix elements of zero than to bring the floating-point pipeline out of super-scalar mode prematurely. You’d get varying sizes of square compute matrices from tessellating a much larger matrix, so that the smaller matrices could be computed in parallel among multiple (in our case, 28) i860s. The square sub-matrices also had to fit in the i860’s cache. Because the tessellation wasn’t always perfect for arbitrary-sized matrices, it was easier to pad the sub-squares with zero column entries and just let the pipeline pour through them, trimming them out at the end when gathering the result matrix back in memory. (A rough sketch of the padding trick follows below.)

    What a misspent youth…
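
    A minimal C sketch of that padding idea, under an assumed tile size and made-up names (the real kernel was i860-specific):

    ```c
    #include <stdio.h>
    #include <string.h>

    #define TILE 8  /* assumed tile size, chosen so a tile fits in cache */

    /* Copy an m x n corner of src (row stride lds) into a TILE x TILE
       buffer, zero-filling the slack so the kernel always sees full tiles. */
    static void pack_tile(double dst[TILE][TILE],
                          const double *src, int lds, int m, int n)
    {
        memset(dst, 0, sizeof(double[TILE][TILE]));
        for (int i = 0; i < m; i++)
            memcpy(dst[i], src + (size_t)i * lds, sizeof(double) * n);
    }

    /* The kernel multiplies full tiles unconditionally; the padded zeros
       contribute nothing and get trimmed when the result is gathered. */
    static void tile_mul_acc(double c[TILE][TILE],
                             const double a[TILE][TILE],
                             const double b[TILE][TILE])
    {
        for (int i = 0; i < TILE; i++)
            for (int k = 0; k < TILE; k++)
                for (int j = 0; j < TILE; j++)
                    c[i][j] += a[i][k] * b[k][j];
    }

    int main(void) {
        double big[5][5], t[TILE][TILE], c[TILE][TILE] = {{0}};
        for (int i = 0; i < 5; i++)
            for (int j = 0; j < 5; j++)
                big[i][j] = 1.0;
        pack_tile(t, &big[0][0], 5, 5, 5);   /* 5x5 data in an 8x8 tile */
        tile_mul_acc(c, t, t);
        printf("c[0][0] = %g\n", c[0][0]);   /* 5: only real elements count */
        return 0;
    }
    ```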

  3. One thing I’ve thought of is a memory system that allows three simultaneous reads and one write in the same cycle.

    In a static RAM, it’s pretty simple to allow multiple, separate row and column lines to access the same memory cell at the same time, including a write.

    This would allow each memory-access cycle to perform an instruction fetch, two operand reads, and a result write-back.
    With an INSTR dataA dataB outputC format, C = A operation B could be performed in one swoop, where the write is actually the result of the previous memory access, and the instruction fetch likewise leads code execution by one cycle. (A toy model in C is sketched below.)
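
    A toy C model of such a machine, with a made-up opcode set and encoding; the port comments just mark which of the four simultaneous accesses each line stands for:

    ```c
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint8_t op; uint16_t a, b, c; } instr_t; /* hypothetical encoding */
    enum { OP_ADD, OP_MUL, OP_HALT };

    int main(void) {
        uint32_t mem[16] = {0};
        mem[4] = 6; mem[5] = 7;

        instr_t prog[] = {
            { OP_ADD, 4, 5, 6 },  /* mem[6] = 6 + 7                        */
            { OP_MUL, 4, 5, 7 },  /* mem[7] = 6 * 7                        */
            { OP_ADD, 6, 4, 8 },  /* mem[8] = 13 + 6; mem[6] readable now  */
            { OP_HALT, 0, 0, 0 },
        };

        uint32_t result = 0;      /* last cycle's result, not yet written */
        uint16_t result_addr = 0;
        int have_result = 0;

        for (int pc = 0; ; pc++) {
            instr_t in = prog[pc];    /* port 1: instruction fetch */
            uint32_t a = mem[in.a];   /* port 2: operand read      */
            uint32_t b = mem[in.b];   /* port 3: operand read      */
            if (have_result)          /* port 4: write-back, one   */
                mem[result_addr] = result; /* cycle behind the op  */
            if (in.op == OP_HALT) break;
            result = (in.op == OP_ADD) ? a + b : a * b;
            result_addr = in.c;
            have_result = 1;
        }
        printf("%u %u %u\n", mem[6], mem[7], mem[8]);  /* 13 42 19 */
        return 0;
    }
    ```

    Note the one-cycle hazard this scheme implies: a result is written back during the following cycle, so it isn’t readable until the cycle after that.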

    Instead of six transistors per bit it would take twelve, but the memory throughput is four times higher, so call it a doubling of performance. The downside is that the bus needs to be much wider, especially for the instruction and the three memory addresses it specifies. For 32-bit addressing and a 32-bit opcode, that’s 128 bits for the instruction, in parallel with two 64-bit address reads and one 64-bit address write: 320 bits for the address bus plus 64 bits for the data bus, a 384-bit total bus width.

    But it might also be an interesting approach to speeding up a single-chip microcontroller, where a 16-bit architecture would have a 64-bit address bus and a 16-bit data bus, for an 80-bit total bus width, all internal to the chip.

    1. Mind-numbing detail follows. Mostly for George’s benefit. You have been warned…

      The i860 had a 128-bit data path between its cache and registers.

      The fld.q instruction would load four 32-bit single-precision floating-point registers, or two 64-bit double-precision ones, in two cycles if the address hit in cache. It offered an auto-increment addressing mode on the integer side to allow constant-stride addressing for vector processing. In super-scalar mode the integer side and floating-point side operated concurrently, i.e. it would dispatch two instructions per cycle. (A loose C analogy follows.)
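
      Loosely, in C terms (an analogy only, not i860 code): each fld.q-style access pulls four consecutive single-precision values while the address bumps by a constant stride:

      ```c
      #include <stdio.h>

      int main(void) {
          float v[32];
          for (int i = 0; i < 32; i++) v[i] = (float)i;

          const float *p = v;
          const int stride = 8;  /* elements advanced per access (assumed) */
          float sum = 0.0f;

          for (int n = 0; n < 32 / stride; n++) {
              /* one fld.q-style access: four consecutive floats at once... */
              float f0 = p[0], f1 = p[1], f2 = p[2], f3 = p[3];
              sum += f0 + f1 + f2 + f3;
              p += stride;       /* ...and the address auto-increments */
          }
          printf("sum = %g\n", sum);  /* 216 */
          return 0;
      }
      ```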

      The pipelined floating-point instructions operated on a per-cycle basis, but it took two additional cycles (three in total) from when the first instruction was pushed into the pipeline before you got back the first result that could be stored in a register. It did, however, support the dual-op multiply-add instruction, meaning you could push two operands into the multiply pipe while the adder took the multiply result from two ops back, added it into its running total on the next cycle, and made its current output available for writing to a register. (A toy timing model follows.)
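
      A toy model of that pipeline latency in plain C, nothing i860-specific: a product pushed in at cycle c only becomes available three cycles later, so the adder is always consuming the multiply from a few ops back:

      ```c
      #include <stdio.h>

      #define STAGES 3  /* products emerge three cycles after they go in */
      #define N 8

      int main(void) {
          double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
          double b[N] = {1, 1, 1, 1, 1, 1, 1, 1};
          double pipe[STAGES] = {0};  /* products in flight */
          double sum = 0.0;

          for (int cycle = 0; cycle < N + STAGES; cycle++) {
              int slot = cycle % STAGES;
              if (cycle >= STAGES)
                  sum += pipe[slot];  /* adder consumes the product from 3 cycles back */
              /* push the next product; pad with zeros past the end so the
                 pipe drains with no special cases -- the same trick as the
                 zero-padded matrix tiles above */
              pipe[slot] = (cycle < N) ? a[cycle] * b[cycle] : 0.0;
          }
          printf("dot product = %g\n", sum);  /* 36 */
          return 0;
      }
      ```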

      As long as you could keep the pipeline full it would chug along at three operations per cycle: fp-multiply, fp-add, and ld/st++ or bla (load/store with auto-increment, or a branch instruction with a delay slot). You had to take care not to intermingle loads and stores: after a store you had to wait a cycle for the internal bus to turn around before you could issue another load. A third type of load, called a pipelined load or pfld for short, would allow you to fetch non-cached data asynchronously through a three-stage pipeline to memory. It executed in one cycle, but the result loaded into the fp register was from the address issued three pfld’s previously. (Sketched in C below.)
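
      The pfld pattern, modeled in C (a hypothetical sketch, not i860 code): each issue hands back the data for the address issued three pflds earlier, so you prime three loads and then run the loop ahead of itself:

      ```c
      #include <stdio.h>

      #define DEPTH 3  /* three pfld requests in flight to memory */
      #define N 16

      int main(void) {
          double mem[N];
          for (int i = 0; i < N; i++) mem[i] = (double)i;

          double fifo[DEPTH];
          double sum = 0.0;

          /* prologue: issue the first DEPTH addresses to prime the pipeline */
          for (int i = 0; i < DEPTH; i++) fifo[i] = mem[i];

          /* steady state: each "pfld" returns the value requested DEPTH
             issues ago while the next request goes out */
          for (int i = DEPTH; i < N + DEPTH; i++) {
              double data = fifo[i % DEPTH];             /* result of issue i - DEPTH */
              fifo[i % DEPTH] = (i < N) ? mem[i] : 0.0;  /* next request (pad to drain) */
              sum += data;
          }
          printf("sum = %g\n", sum);  /* 0 + 1 + ... + 15 = 120 */
          return 0;
      }
      ```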

      We got the full 40 MFLOPS (out of a theoretical 40 MFLOPS) on DGEMM (Double-precision GEneral Matrix Multiply), and I was able to juice it up to ~68 MFLOPS (out of a theoretical 80 MFLOPS) on SGEMM. (The double-precision peak was half the single-precision one because the multiplier produced a double-precision result only every other cycle.)

      Not bad for a single chip CPU in 1990.

      With no pfld.q instruction (128-bit access off-chip) on the XR, we were memory-starved on SGEMM. I might have been able to fix that with the XP version of the i860, but the company folded first.
