How did 8088/86 PCs implement flood-fill before SSE and GPUs?


Honeywell machines had cool hardware that would increment a loop index in the same cycle it executed an instruction, but the 8088 didn't even have that.

8086 / 8088 does have rep stosw for memset with a repeating 16-bit pattern. (Or rep stosb for 8-bit patterns.) https://www.felixcloutier.com/x86/rep:repe:repz:repne:repnz
It's like .label: mov [es:di], ax / lea di, [di+2] / loop .label, except it also checks CX before the first iteration, as if there were a jcxz .skip_loop before entering the loop. (rep stos doesn't update FLAGS; I used lea instead of add to emulate that.)
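Spelled out as a plain loop, the equivalent looks like this (just a sketch of the semantics; the real instruction is microcoded, not a literal loop of these instructions):

    jcxz .done           ; rep checks CX before the first iteration
    .store:
    mov  [es:di], ax     ; store the 16-bit pattern in AX
    lea  di, [di+2]      ; advance DI without touching FLAGS (assumes DF=0)
    loop .store          ; dec CX / jump if CX != 0, also FLAGS-preserving
    .done: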

The CX decrements happen somewhat in parallel with the stores and pointer increments, since it's all microcode. This is potentially useful for drawing a horizontal line in video RAM, if that's what you were picturing as the use-case for CPU SIMD (see footnote 1).
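For a concrete picture, here's a hedged sketch of drawing a horizontal run with rep stosb, using VGA's mode 13h (320x200, 1 byte per pixel at segment 0A000h) for simplicity, since CGA's packed 2-bpp modes would need read-modify-write masking at the run's ends. The register assignments are my own assumptions:

    ; assumes: ax = y, bx = x, cx = run length in pixels, dl = color
    push dx              ; mul writes dx:ax, so save the color
    mov  di, 320
    mul  di              ; ax = y * 320 (fits in 16 bits for y < 200)
    pop  dx
    add  ax, bx          ; + x
    mov  di, ax          ; di = offset of the first pixel
    mov  ax, 0A000h
    mov  es, ax          ; video segment
    mov  al, dl          ; stosb stores AL
    cld                  ; DF = 0: DI counts upward
    rep  stosb           ; fill cx pixels in one instruction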

https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html says that on 8088, rep stosw costs 9 + 14n cycles, where n is the count in CX. (8088 has an 8-bit data bus, so each word store costs 2 memory transactions = 8 clock cycles. rep stosb on 8088 costs 9 + 10n, only 4 cycles less per iteration, so unfortunately there's no true parallelism between the memory ops and the ALU work to update registers. Probably the same adder is needed for address calculations, and/or the microcode is limited in its ability to start different parts of the CPU working on different things at the same time. Or it's just that 8088 was an afterthought, with the 16-bit-bus 8086 being the primary target, although I assume misaligned rep stosw is total garbage there, too.)
So if you have an 8-bit pattern, unless your memset is very small, it's probably best to mov ah, al / shr cx, 1 (and jnc over a stosb) to set up for rep stosw. All that setup does effectively make the startup overhead a lot higher for rep stosw, though; let's say an extra 32 cycles, counting those 2-byte instructions as 8 cycles each (the time to fetch them) and the jnc as 16. (This is a rough guess without looking up the taken and not-taken costs or the non-rep stosb cost.) The break-even point comes when 32 + 9 + 14(n/2) = 9 + 10n, i.e. 32 = 3n, so n = 32/3 ≈ 10.7: for n >= 11 bytes, it's worth setting up for rep stosw instead of just using rep stosb. If tuning for 8086 rather than 8088, aligning the pointer for rep stosw is probably important.
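As a sketch, that setup looks like the following, with my assumed register usage: AL = fill byte, ES:DI = destination, CX = byte count.

    mov  ah, al          ; duplicate the byte into a 16-bit pattern in AX
    shr  cx, 1           ; CX = word count; CF = low bit of the byte count
    jnc  .even           ; even byte count: no leftover byte
    stosb                ; store the odd byte first, DI += 1
    .even:
    rep  stosw           ; store CX words of AX at ES:DI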

Compare that with the fastest way to loop on 8088 (unlike on modern CPUs, where loop is slow): the loop instruction costs 17 cycles when taken. mov mem, reg stores cost 13 + EA (effective-address calculation time; the cheapest mode, a single register like [di], costs 5 cycles). So even with loop unrolling, rep stos is a huge win, and it also avoids code fetch for the mov instructions.
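To put numbers on that, here's a simple store loop against the rep version, with rough per-instruction costs from that table as my annotations (8086 base timings; 8088 pays extra for the word-sized memory access):

    .fill:
    mov  [es:di], ax     ; 13 + EA(5) cycles, +2 for the es: prefix
    add  di, 2           ; 4 cycles
    loop .fill           ; 17 cycles taken: roughly 40 cycles per word stored

    rep  stosw           ; vs. 14 cycles per word after 9 cycles of startup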

And that page doesn't count initial instruction fetch; the timings assume instructions have already been fetched into the prefetch buffer (e.g. after a slow instruction). (This is a bad assumption on an actual 8088, where memory access was the major bottleneck; it's somewhat true even on 8086 with its 16-bit bus.) So that 13 + EA, plus the times for a pointer increment and a loop instruction, is maybe an underestimate, although mov to mem is slow and not all of its cycles are spent keeping the bus busy.

See also "When specifying Intel 80x86 instruction execution time, what is included in the cycle count?" and https://stackoverflow.com/questions/67400133/increasing-efficiency-of-binary-gray-code-for-8086 for more details on 8086 / 8088 cycle counting.


On modern CPUs, a SIMD loop using 16 or 32-byte vectors is usually faster than rep-string for memset/memcpy, especially for small sizes where microcode startup overhead is a big fraction of the total time.
(And SIMD is much faster for memchr/memcmp: until the last couple of generations of Intel CPUs, rep[n]e scas/cmps had no fast-strings support and just went one element per 1 or 2 cycles, which is absolute garbage vs. checking 32 bytes per vector at 2 or 3 vectors per clock. That's with data hot in L1d cache; less when coming from DRAM. Apparently there's now a feature bit for fast rep scas/cmps, but I haven't looked into it. It still won't be good for short strings, since there will be microcode startup overhead even if it does use vectors internally.)
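For the memset case, a minimal AVX2 inner loop looks something like this (a sketch assuming the destination is 32-byte aligned and the size is a nonzero multiple of 32; a real memset needs head/tail handling):

    ; assumes: rdi = dst (32-byte aligned), rcx = byte count (multiple of 32),
    ; low byte of eax = fill byte
    vmovd        xmm0, eax
    vpbroadcastb ymm0, xmm0    ; splat the byte across all 32 lanes
    .loop:
    vmovdqa      [rdi], ymm0   ; one 32-byte store per iteration
    add          rdi, 32
    sub          rcx, 32
    jnz          .loop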

Intel since P6 (and also AMD) has "fast strings" microcode that does wider stores, like handling rep stosb/w/d 8 bytes at a time on PPro, a 32-bit CPU with a 64-bit data bus and store path to L1d cache.

Before that (including P5, which also had a 64-bit data path from the FPU to L1d cache), the store width was the operand size of the instruction. But modern CPUs will internally use 16- or 32-byte vectors for rep stos or rep movs, except with overlapping src/dst, or with DF=1 going backwards: there, CPUs give up and use byte-at-a-time microcode rather than adding even more branching to pick a good vector strategy when the overlap distance is large enough to allow one.

This applies to AMD as well, although rep movs / rep stos on AMD are sometimes relatively worse vs. SIMD than on Intel CPUs. See also "Enhanced REP MOVSB for memcpy" on SO for more about memcpy / memset performance when it misses in cache on modern x86, with NT stores vs. normal SIMD vs. ERMSB rep.


Footnote 1: CPU SIMD for more than just memset in flood-fill?

You could maybe get some use out of SSE2 / AVX2, especially if you have a shadow copy of video RAM or something so reads are efficient. You can check multiple pixels at once and efficiently branch on the result, like how SIMD memchr works: movdqu xmm0, [mem] loads + pcmpeqb against a broadcast of the value you're searching for, then pmovmskb eax, xmm0 and test eax, eax / jnz .match, then maybe bsf eax, eax to find the position of a match (perhaps starting at a column near where you found a match in the previous row).
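Putting those instructions together, a minimal version of that scan (assuming 1 byte per pixel, rsi pointing at the search start, xmm1 pre-loaded with 16 copies of the pixel value to find, and no end-of-buffer handling):

    .scan:
    movdqu   xmm0, [rsi]       ; load 16 pixels, alignment not required
    pcmpeqb  xmm0, xmm1        ; 0xFF in each byte that matches, else 0
    pmovmskb eax, xmm0         ; one mask bit per byte
    test     eax, eax
    jnz      .match            ; any match in this vector?
    add      rsi, 16
    jmp      .scan
    .match:
    bsf      eax, eax          ; bit index = byte offset of the first match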

For less than 8 bits per pixel, see my answer on "Efficiently find least significant set bit in a large array?" (intrinsics and commented asm for AVX2, plus discussion of what you'd do differently with SSE2 or AVX-512). But full-on SIMD searching for the first non-zero 16- or 32-byte vector would be checking 128 or 256 pixels at a time for 1-bpp image formats, so it's probably slower than a scalar search, unless avoiding branch mispredicts is enough of an upside. (If the common case is finding a match in the first vector, branching will go the same way almost every time.) But for 1 bit per pixel especially, even just 32- or 64-bit integer registers would often be fine if you have a good initial guess position to search from, and you can tzcnt or bsf that register directly after branching on that one element having a match, so there's significantly less latency after breaking out of the loop. Out-of-order exec can hide latency if there's independent work soon after that, so my SIMD strategy isn't a disaster.
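For the 1-bpp case with plain integer registers, the scalar version is about this simple (a sketch assuming rsi points at a dword-aligned word near the guess position, a set bit exists before the end of the row, and ignoring the bit-order-within-byte question that real 1-bpp formats raise):

    .scan:
    mov  eax, [rsi]        ; 32 pixels per load at 1 bpp
    test eax, eax
    jnz  .found
    add  rsi, 4
    jmp  .scan
    .found:
    bsf  eax, eax          ; bit position of the first set pixel in this dword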

If reading before storing is near-free, as it is on cacheable memory when you aren't using NT stores (e.g. if you're updating a pixmap you're later going to copy to video RAM), you could just search + fill until you find an ending column in this row. Otherwise, find the end point ahead of time, then efficiently fill between start and end.
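A hedged scalar sketch of the search-and-fill-in-one-pass version, at 8 bpp scanning rightward (register roles are my assumptions):

    ; assumes: rsi = first interior pixel, al = fill color, dl = boundary color
    .fill:
    cmp  byte [rsi], dl    ; hit the boundary color? then this row's run ends
    je   .done
    mov  [rsi], al         ; fill this pixel; the read-before-write is cheap
    inc  rsi               ;   on cacheable memory
    jmp  .fill
    .done: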
