[futurebasic] Re: Assembly - Speed or not?


From: BMichael <BMichael@...>
Date: Tue, 9 Dec 1997 12:23:06 EST
>If all code (C, Pascal, FB, any language like that) is compiled into 
>a CODE resource when its app is built, why is there a speed increase 
>when writing in Assembly? Now, I understand inserting assembly in an 
>INTERPRETED language (AppleScript, UserTalk, HyperTalk, C64 BASIC, 
>Apple II BASIC, etc.) would DEFINITELY show an increase in speed, but why
>does it make ANY difference at all in COMPILED languages? Is it because
>humans can write tighter assembly code than current computers can, 
>resulting in fewer instructions, and a shorter execution time (I don't 
>know here, just guessing :) )?
>Or, is all of this "Assembly = Speed" issue just a myth?

Sorry to be "long-winded" here, but this shifted me into "professor" mode...

There are a lot of assumptions made about Asm = Speed, and when these 
assumptions are incorrect, it is indeed a myth. However, if the initial 
assumptions are correct, the hand-coded, "tweaked" assembly can be 
considerably faster.

Rick's example of comparing two ranges of memory is a case where the 
assumption is that the language does not have a built-in method of doing 
this, and that the Basic code will generate less-than-ideally-efficient 
code, compared to what he can write. I'm sure he's correct, given FBII 
and the Mac OS.

A common _bad_ assumption though is that Joe Programmer can write a 
bit-blitter that's "faster" than COPYBITS. In some cases, sure; when 
there are very explicit predefined conditions, given a good asm 
programmer, you'll see a measurable speed gain. However, 99% of the time, 
attempting to "beat" COPYBITS is a waste of effort; it's _already_ very 
tightly honed assembly code, with a lot of "special cases" built in.

Another bad assumption is that the assembly code you write that's so much 
faster on _your_ machine will continue to be faster than the high-level 
language equivalent on any machine using a compatible processor, even 
with compiler and/or emulator changes. This is particularly true when 
dealing with multi-processor systems! I gained a _great_ deal of respect 
for compiler-writers when I worked on a system that had 16 CPUs, and 
actually had to write 16 "threads". To explain what I mean here, let's 
look at Rick's range-compare algorithm, and consider a FB^4 compiler that 
runs on a multi-CPU Mac. (I'm assuming that FB^4 is multiprocessor 
"savvy".) We'll also consider the case of a PPC "GX" chip that has 
expanded commands available...

His function receives two pointers and a length, and returns true if the 
memory content at each pointer is identical for "length" bytes. The BASIC 
code compares the ranges one byte at a time, setting flags for the match 
and end conditions, so for "n" bytes of length, there are n comparisons, 
n tests, etc.; a classic "n*x" cycles of CPU time will be used.
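In C terms, the byte-at-a-time version looks roughly like the sketch below (my names, not Rick's actual code); every one of the "n" bytes costs a load, a compare, and a branch, which is where the "n*x" comes from:

```c
#include <stddef.h>

/* Sketch of the byte-by-byte range compare: returns 1 if the n bytes
   at p1 exactly match the n bytes at p2, 0 otherwise.  n iterations,
   each with a load, a compare, and a branch -- "n*x" cycles total. */
static int records_match(const unsigned char *p1, const unsigned char *p2,
                         size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (p1[i] != p2[i])
            return 0;   /* mismatch: bail out early */
    }
    return 1;           /* all n bytes identical */
}
```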

Now, let's change his code up a bit so it's more "obvious" to the 
compiler what's going on; instead of a DO.. UNTIL with a CASE, let's put 
the whole thing in a simple FOR loop:

LOCAL FN RecordsMatch(ptr1&,ptr2&,numBytes&)
  DIM mismatch%
  'returns _zTrue if the numBytes& bytes starting at ptr1&
  'exactly match the numBytes& bytes which start at ptr2&
  mismatch% = 0
  FOR i& = 0 TO numBytes& - 1
    mismatch% = PEEK(ptr1&+i&) - PEEK(ptr2&+i&)
    IF mismatch% THEN i& = numBytes&
  NEXT i&
END FN = FN zNot(mismatch%)

(Rick, if you're energetic, want to compare this to your original? You 
can also get creative and use PEEK LONG and a stepsize of 4, if you 
special case the non-multiples of 4...)
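For the curious, here's what that "stepsize of 4" trick looks like as a C sketch (hypothetical names; the memcpy into a temporary avoids any unaligned-access problems on odd addresses):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the "PEEK LONG with a stepsize of 4" idea: compare four
   bytes per iteration, then special-case the leftover 0-3 bytes (the
   non-multiple-of-4 tail) one byte at a time. */
static int records_match4(const unsigned char *p1, const unsigned char *p2,
                          size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {   /* four bytes per compare */
        uint32_t a, b;
        memcpy(&a, p1 + i, 4);     /* safe even if p1+i is unaligned */
        memcpy(&b, p2 + i, 4);
        if (a != b)
            return 0;
    }
    for (; i < n; i++) {           /* the 0-3 leftover bytes */
        if (p1[i] != p2[i])
            return 0;
    }
    return 1;
}
```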

If we compile this code with FBII on a Mac today, we're looking at "n*x" 
cycles, just as before.

If we recompile that same code on a multiprocessor machine... the 
compiler may decide (and should) to assign "1/p" (where "p" is the number 
of processors) of that function to each processor. The time to execute 
_then_ becomes "((n*x)/p)+overhead", and handily beats his one-CPU 
hand-coded assembly solution!

For example, if "numBytes&" was a multiple of 4, and we had 4 CPUs, the 
compiler could assign loop "i = 0 to numBytes&/4 - 1" to the first CPU, "i = 
numBytes&/4 to numBytes&/2 - 1" to the second, etc.; since "mismatch%" is a 
temporary variable, it could be assigned to a register in each CPU, and 
any non-zero value appearing in that register on any CPU could cause a 
"break" in the function, terminating execution. (Hence the overhead; how 
do the other 3 CPUs know when the first one is finished? With tests!)
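To make the idea concrete, here's a hypothetical C sketch of the kind of code such a multiprocessor-savvy compiler might emit, using POSIX threads and a shared flag standing in for the per-CPU register; the 4-way split and all the names are my invention:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NCPUS 4

typedef struct {
    const unsigned char *p1, *p2;
    size_t lo, hi;           /* this thread's slice: [lo, hi) */
    atomic_int *mismatch;    /* shared "break" flag */
} slice_t;

static void *compare_slice(void *arg)
{
    slice_t *s = arg;
    for (size_t i = s->lo; i < s->hi; i++) {
        /* Poll the shared flag occasionally so one CPU's mismatch
           stops the others -- these tests are the "overhead". */
        if ((i & 0xFFF) == 0 && atomic_load(s->mismatch))
            return NULL;
        if (s->p1[i] != s->p2[i]) {
            atomic_store(s->mismatch, 1);
            return NULL;
        }
    }
    return NULL;
}

/* Returns 1 if the two ranges match; the work is split 1/p per CPU,
   so the time is roughly ((n*x)/p) + overhead. */
static int records_match_mp(const unsigned char *p1,
                            const unsigned char *p2, size_t n)
{
    pthread_t tid[NCPUS];
    slice_t s[NCPUS];
    atomic_int mismatch = 0;
    size_t chunk = n / NCPUS;

    for (int t = 0; t < NCPUS; t++) {
        s[t] = (slice_t){ p1, p2, t * chunk,
                          (t == NCPUS - 1) ? n : (t + 1) * chunk,
                          &mismatch };
        pthread_create(&tid[t], NULL, compare_slice, &s[t]);
    }
    for (int t = 0; t < NCPUS; t++)
        pthread_join(tid[t], NULL);
    return !mismatch;
}
```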

The second case is with the PPC "GX" chip; let's say that instead of a 
compare instruction being limited to 8 bits (as assumed by the BASIC 
code) there's a "LONG COMPARE" that'll compare 64 bits at a time. _IF_ 
the compiler is "smart" enough to know about the "GX" processor, and _IF_ 
the compiler recognizes that we're looping through an incrementing 
pointer... it _could_ decide to replace numBytes& with numWords& (= 
numBytes&/8) and PEEK with "PEEK64", and do this loop in 1/8 the time!

In case you doubt the ability/sneakiness of compiler writers, there _are_ 
compilers that do both of the things I've described above. Obviously, not 
your everyday Mac compiler, but on other "big" systems, where multiple 
CPUs and multiple instruction-word-lengths are common. In fact, there are 
compilers that "know" what certain common _benchmark_ programs "look" 
like, and instead of producing their "normal" output when they see those 
benchmarks, they insert a block of carefully hand-tuned assembly... or 
even just the _answer_!