>If all code (C, Pascal, FB, any language like that) is compiled into >a CODE resource when it's app is built, why is there a speed increase >when writting in Assembly? Now, I understand inserting assembly in an >INTERPRETED languages (Applescript, Usertalk, Hypertalk, C64 Basic, >AppleII Basic, etc) would DEFINANTLY show an increase in speed, but why >does it make ANY differance at all in COMPILED languages? Is it because >humans can write tigher assembly code then current computers can, >resulting in less instructions, and a shorter execution time (I don't >know here, just guessing :) )? > >Or, is all of this "Assembly = Speed" issue just a myth? Sorry to be "long winded" here, but this shifted me into "professor" mode... There are a lot of assumptions made about Asm = Speed, and when these assumptions are incorrect, it is indeed a myth. However, if the initial assumptions are correct, the hand-coded, "tweaked" assembly can be considerably faster. Rick's example of comparing two ranges of memory is a case where the assumption is that the language does not have a built-in method of doing this, and that the Basic code will generate less-than-ideally-efficient code, compared to what he can write. I'm sure he's correct, given FBII and the Mac OS. A common _bad_ assumption though is that Joe Programmer can write a bit-blitter that's "faster" than COPYBITS. In some cases, sure; when there are very explicit predefined conditions, given a good asm programmer, you'll see a measurable speed gain. However, 99% of the time, attempting to "beat" COPYBITS is a waste of effort; it's _already_ very tightly honed assembly code, with a lot of "special cases" built in. Another bad assumption is that the assembly code you write that's so much faster on _your_ machine will continue to be faster than the high-level language equivalent on any machine using a compatible processor, even with compiler and/or emulator changes. This is particularly true when dealing with multi-processor systems! I gained a _great_ deal of respect for compiler-writers when I worked on a system that had 16 CPUs, and actually had to write 16 "threads". To explain what I mean here, let's look at Rick's range-compare algorithm, and consider a FB^4 compiler that runs on a multi-CPU Mac. (I'm assuming that FB^4 is multiprocessor "savvy".) We'll also consider the case of a PPC "GX" chip that has expanded commands available... His function receives two pointers and a length, and returns true if the memory content at each pointer is identical for "length" bytes. The BASIC code compares the ranges one byte at a time, setting flags for the match and end conditions, so for "n" bytes of length, there are n comparisons, n tests, etc.; a classic "n*x" cycles of CPU time will be used. Now, let's change his code up a bit so it's more "obvious" to the compiler what's going on; instead of a DO.. UNTIL with a CASE, let's put in; LOCAL FN RecordsMatch(ptr1&,ptr2&,numBytes&) DIM mismatch% 'returns _zTrue if the numBytes& bytes starting at ptr1& 'exactly match the numBytes& bytes which start at ptr2& mismatch% = 0 FOR i& = 0 TO numBytes& mismatch% = PEEK(ptr1&+i&) - PEEK(ptr2&+i&) IF mismatch% THEN i& = numBytes& NEXT i& END FN = FN zNot(mismatch%) (Rick, if you're energetic, want to compare this to your original? You can also get creative and use PEEK LONG and a stepsize of 4, if you special case the non-multiples of 4...) If we compile this code with FBII on a Mac today, we're looking at "n*x" cycles. If we recompile that same code on a multiprocessor machine... the compiler may decide (and should) to assign "1/p" (where "p" is the number of processors) of that function to each processor. The time to execute _then_ becomes "((n*x)/p)+overhead", and handily beats his one-CPU hand-coded assembly solution! For example, if "numBytes&" was a multiple of 4, and we had 4 CPUs, the compiler could assign loop "i = 0 to numBytes&/4" to the first CPU, "i = numBytes&/4 to numBytes&/2" to the second, etc.; since "mismatch%" is a temporary variable, it could be assigned to a register in each CPU, and any non-zero value appearing in that register on any CPU could cause a "break" in the function, terminating execution. (Thus the overhead; how do the other 3 CPUs know when the first one is finished? With tests!) The second case is with the PPC "GX" chip; let's say that instead of a compare instruction being limited to 8 bits (as assumed by the BASIC code) there's a "LONG COMPARE" that'll compare 64 bits at a time. _IF_ the compiler is "smart" enough to know about the "GX" processor, and _IF_ the compiler recognizes that we're looping through an incrementing pointer... it _could_ decide to replace numBytes& with numWords& (= numBytes&/8) and PEEK with "PEEK64", and do this loop in 1/8 the time! In case you doubt the ability/sneakiness of compiler writers, there _are_ compilers that do both of the things I've described above. Obviously, not your everyday Mac compiler, but on other "big" systems, where multiple CPUs and multiple instruction-word-lengths are common. In fact, there are compilers that "know" what certain common _benchmark_ programs "look" like, and instead of producing their "normal" output when they see those benchmarks, they insert a block of carefully hand-tuned assembly... or even just the _answer_! Bill