Curious adventures in (Commodore) BASIC tokenizing.

No, this is not about a one-hit-wonder from the charts of 1977, rather, it is about something that could have happened on a Commodore PET around this time. More generally, it is about a curious bug/feature in early BASICs and how this gave way to an additional keyword in Commodore BASIC for the PET.
Both are related to BASIC syntax and how white space in BASIC text is handled by the interpreter. — Which sets up the topic of today’s installment, namely, the tokenizer routine in early MS 9-digit BASIC for the 6502 and how this evolved over time. As this promises some fun with #BASIC and #6502 code on the #PET and involves some #archeology, as well, we really ought to have a look at this…
To begin with, let’s fire up a PET 2001 with BASIC 1.0, the first ROM version shipped with the PET 2001, also known as “Old ROM”, to start a little experiment:
*** COMMODORE BASIC *** 7167 BYTES FREE READY. 10 IF LS = LE THEN GOTO 100 LIST 10 IF LS = LETHEN GOTO 100 READY. RUN ?SYNTAX ERROR IN 10 READY. █So, what’s happening here?
And why does a perfectly valid BASIC statement throw a sntax error?
Let’s investigate…
As you may probably know, all white space in BASIC is optional and we do not require any of this as a keyword delimiter or separator of any kind. The two following statements work perfectly the same:
100 FOR I = 0 TO 10 STEP 2 100FORI=0TO10STEP2However, there’s actually a small difference: while the first one may be more readable, the second one executes a bit faster and requires less memory, for which it is preferred by the greybeards.
Going back to our little experiment, something that may spring to the eye is a tiny but crucial difference in the BASIC text as entered and the BASIC text as listed:
10 IF LS = LE THEN GOTO 100 LIST 10 IF LS = LETHEN GOTO 100 READY.Did you see it? — No, there is no paranormal entity. — Rather, it’s something missing in the listing, namely the blank between “LE” and “THEN”, indicating that this is about a misshap in white space handling.
In order to investigate this further, we first have to ask
Words and Letters, or, What is “Tokenizing”?
An important consequence of not requiring any separation between keywords and other program text is that every letter in the program text can be the beginning of a keyword. E.g., as we encounter
20 LET A=VWe do not know, yet, whether this is the start of a keyword, as in
20 LET A=VAL(A$)or “V” is part of an identifier (variable name), as in
20 LET A=V1+5The only way to discern this is by scanning ahead and finding out if this adds up to a valid keyword (a BASIC command) or not.
Now, going over a list of all keywords for each letter, comparing each letter to a valid start of a keyword and then scanning ahead to see, if this does indeed match, is a rather costly operation. It would be nice, if we could somehow avoid this.
And this is where the tokenizing comes to the rescue: every time, we enter a line of BASIC, we’ll copied the line to an input buffer, and there will be routine to go over this, in order to discern any keywords. Every time the routine does find a keyword, it will replace it by an integer value not allowed in regular BASIC text. As BASIC is generally restricted to 7-bit ASCII (that is, outside strings), we can use any values starting at 128 (hex 0x80) and above. This way, we do it once and for all. Later, when the runtime routine of the interpreter encounters such a value (a token), it will immediately know what this is about and it will be even able to use this as an index into a lookup table for quick access.
Moreover, in times when memory was still expensive and thus scarce, this reduced the memory requirements for storing a BASIC program quite significantly. In an average BASIC listing, keywords make up for about half of the text, besides line numbers, so the savings will quite dramatically extend the reach of our memory.
What’s not to love about this? Tokens are really the key to BASIC.
In-Memory BASIC Programs
Knowing this, we may have a deeper look into BASIC text as is represented in memory. As for instance, on the example of our offending line of BASIC:
10 IF LS = LE THEN GOTO 100 LIST 10 IF LS = LETHEN GOTO 100 READY.The equivalent tokenized program text is stored in the user area of the memory, which starts on the PET at 0x0401:
addr code semantics 0401 17 04 link: $0417 (addr. of next line, 16-bit little endian) 0403 0A 00 line# 10 (16-bit binary) 0405 8B token IF 0406 20 4C 53 20 ascii « LS » 040A B2 token = 040B 20 ascii « » 040C 88 token LET 040D 48 45 4E 20 ascii «HEN » 0411 89 token GOTO 0412 20 31 30 30 ascii « 100» 0416 00 -EOL- 0417 00 00 -EOT- (link = null)At first glance, we can see that BASIC has converted our input string “LE THEN ” into the token for “LET” (0x88) and the remaining text “HEN ”. Moreover, we can see that any keyword tokens are encoded as a single byte with a set sign-bit (n ≥ 0x80) and anything else is preserved as-is in plain ASCII text.
So, quite clearly, BASIC skipped over the blank separating “LE” from “THEN”, recognizing the keyword (command) “LET” and encoding this as the token ID 0x88. — But, why should it do so?
BASIC and White Space
Initially, BASIC simply ignored any white space, just like FORTRAN and Algol did. The parser simply skipped over any blanks and these were of no significance, whatsoever. This is how Dartmouth BASIC did it and how many versions of BASIC did it in the 1970s. Notably, is this also what you would find in listings.
But this had consequences: e.g., at this time, it wasn’t that clear whether “GOTO” was a single word or two, as in “GO TO”. In English, these are certainly two words and to the interpreter it didn’t matter. So we may be rather undecided on this. — And, indeed, there were plenty of listings, where you could find constructs like:
120 IF A > 10 THEN GO TO 200In MS BASIC, this is taken care of by the tokenizer routine. As we may observe in our above peek into the in-memory representation, it’s particularly in keywords, where the white space was ignored, while other instances of white space are preserved and faithfully reproduced in any listings.
That is, AppleSoft BASIC, another early version of MS BASIC for the 6502 and a close relative to Commodore BASIC, skips any white space and inserts a single blank after each keyword.
Thus, on the Apple ][, we get (even more obviously giving away, what might have gone wrong):
This is not dissimilar from what Commodore BASIC does with line numbers:
10A=2*0.5 20 PRINT A LIST 10 A=2*0.5 20 PRINT A READY.Like AppleSoft BASIC in any other context, it skips over any white space after the line number and inserts a single blank for listings.
Which is also why we can’t have indented text and have to resort to tricks, if we really want indentations:
100 FOR Y=1 TO 20 110 : FOR X = 1 TO 40 120 : PRINT X*Y 130 : NEXT X 140 NEXT Y LIST 100 FOR Y=1 TO 20 110 : FOR X = 1 TO 40 120 : PRINT X*Y 130 : NEXT X 140 NEXT Y READY.As we may see, Commodore BASIC doesn’t skip any blanks after a separating colon (“:”).
And Numbers? (BTW)
Glad you should ask. It’s a lesser known fact that numbers are another area where Commodore BASIC deliberately skips over any blanks:
### COMMODORE BASIC ### 15359 BYTES FREE READY. 10 A= 1 000 000 .00 20 A= A*2 30 PRINT A LIST 10 A= 1 000 000 .00 20 A= A*2 30 PRINT A READY. RUN 2000000 READY.While the listing proves that the constant “1 000 000 .00” is faithfully stored in memory, the interpreter doesn’t hick up over the interspersed blanks, rather, it elegantly skips over them.
We can see this also, as we peek into memory:
addr code semantics 0401 16 04 link: $0416 0403 0A 00 line# 10 0405 41 ascii «A» 0406 B2 token = 0407 20 31 20 30 30 30 ascii « 1 000 000 .00» 040D 20 30 30 30 20 2E 0413 30 30 0415 00 -EOL- 0416 21 04 link: $0421 0418 14 00 line# 20 041A 41 ascii «A» 041B B2 token = 041C 20 41 ascii « A» 041E AC token * 041F 32 ascii «2» 0420 00 -EOL- 0421 29 04 link: $0429 0423 1E 00 line# 30 0425 99 token PRINT 0426 20 41 ascii « A» 0428 00 -EOL- 0429 00 00 -EOT- (link = null)But why should it do so? Obviously, this is slow and cumbersome, since the interpreter has to parse each of those constants anew and skip any of those blanks, whenever it encounters a number, often repeatedly so, like in a FOR–NEXT loop.
Well, there‘s more than a single way to represent a number.E.g., all those are the same:
1000 1000.0000 1 000 1 000.00 1E3 1.0E+3 (and so on)It wouldn’t be that user friendly to type
100 A= 1000.00just to have this listed back as, say,
LIST 100 A=1.0E3The same goes for identifiers (variable names). The interpreter could maintain a reference table and just insert offsets into the program, but, for a dialog-like system, it’s much better to maintain a meaningful symbolic representation for listing and editing (instead of maybe an offset marker or hash, like #34e2). Or, as only the first two characters are significant, it could easily truncate any longer names.
A compiler, on the other hand, can convert this at compile time, it can also resolve any calculations at compile time and even optimize over such precomputations. E.g, our above example
10 A= 1 000 000 .00 20 A= A*2could become just something along the lines of
<STORE-REAL@FRAME-OFFSET> 0x48 2e6Meaning, if BASIC is slow (and it is), it is so, because it’s user-friendly.
It’s slow so that it can preserve all the intricacies of the input, in order to reproduce them in a meaningful listing.
That is but for the instances where it doesn’t, like with our “LET HEN” example.
Crunch Time
As for now, we have already established that our syntax error isn’t a bug, but a user error: if the keyword routine can skip over blanks to correctly identify constructs like “GO TO”, this implies that no adjacent strings of letters may add up to something that contains a valid BASIC command, even across word boundaries. (That is, outside of quoted string constants, of course.)
In this case, we got the legitimate keyword “LET”, which isn’t a valid right-hand part of a comparison. So, yes, this a syntax error, indeed.
There isn’t any argueing over this: if skipping blanks is a feature, the error is in front of the screen.
Notably, this is not an issue on a PET with the upgraded Commodore BASIC 2.0 (AKA “New ROM”) or, for the matter, with any other BASIC for Commodore 8-bits, thereafter.
Commodore BASIC 2.0, “New ROM”:
### COMMODORE BASIC ### 7167 BYTES FREE READY. 10 IF LS = LE THEN GOTO 100 100 PRINT "OK" LIST 10 IF LS = LE THEN GOTO 100 100 PRINT "OK" READY. RUN OK READY.So, what is special about Commodore BASIC 1.0 (or any other early flavor of MS BASIC for the 6502) and what does it do differently?
PET BASIC, “Old ROM”
What‘s needed for keyword parsing comes in to parts: a list of keywords and a routine to do the actual work, suggestively named “CRUNCH”, which splits kewords from other text in the input stream and converts them to tokens.
Let‘s have a look at the data first.
The list of keywords starts at 0xC092 (underlined characters indicate a set sign-bit):
The tokenizer routine (AKA CRUNCH) traverses over this list, using a set sign-bit for a word termination after that given character. In other words: a set sign-bit indicates the last character of any keyword. If a match is found, the index into this list added by 0x80 (again, the sign-bit set) provides the token ID.
If no match is found, as indicated by reaching the terminating zero-byte, the character, which was just inspected, must be a plain ASCII character (maybe part of a name) and is consequently copied to the program text as-is.
Thus, the above list translates to the following keyword-token table:
code keyword token index 45 4E C4 END 0x80 ( 0) 46 4F D2 FOR 0x81 ( 1) 4E 45 58 D4 NEXT 0x82 ( 2) 44 41 54 C1 DATA 0x83 ( 3) 49 4E 50 55 54 A3 INPUT# 0x84 ( 4) 49 4E 50 55 D4 INPUT 0x85 ( 5) 44 49 CD DIM 0x86 ( 6) 52 45 41 C4 READ 0x87 ( 7) 4C 45 D4 LET 0x88 ( 8) 47 4F 54 CF GOTO 0x89 ( 9) 52 55 CE RUN 0x8A (10) 49 C6 IF 0x8B (11) 52 45 53 54 4F 52 C5 RESTORE 0x8C (12) 47 4F 53 55 C2 GOSUB 0x8D (13) 52 45 54 55 52 CE RETURN 0x8E (14) 52 45 CD REM 0x8F (15) 53 54 4F D0 STOP 0x90 (16) 4F CE ON 0x91 (17) 57 41 49 D4 WAIT 0x92 (18) 4C 4F 41 C4 LOAD 0x93 (19) 53 41 56 C5 SAVE 0x94 (20) 56 45 52 49 46 D9 VERIFY 0x95 (21) 44 45 C6 DEF 0x96 (22) 50 4F 4B C5 POKE 0x97 (23) 50 52 49 4E 54 A3 PRINT# 0x98 (24) 50 52 49 4E D4 PRINT 0x99 (25) 43 4F 4E D4 CONT 0x9A (26) 4C 49 53 D4 LIST 0x9B (27) 43 4C D2 CLR 0x9C (28) 43 4D C4 CMD 0x9D (29) 53 59 D3 SYS 0x9E (30) 4F 50 45 CE OPEN 0x9F (31) 43 4C 4F 53 C5 CLOSE 0xA0 (32) 47 45 D4 GET 0xA1 (33) 4E 45 D7 NEW 0xA2 (34) 54 41 42 A8 TAB( 0xA3 (35) 54 CF TO 0xA4 (36) 46 CE FN 0xA5 (37) 53 50 43 A8 SPC( 0xA6 (38) 54 48 45 CE THEN 0xA7 (39) 4E 4F D4 NOT 0xA8 (40) 53 54 45 D0 STEP 0xA9 (41) AB + 0xAA (42) AD - 0xAB (43) AA * 0xAC (44) AF / 0xAD (45) DE ^ 0xAE (46) 41 4E C4 AND 0xAF (47) 4F D2 ON 0xB0 (48) BE > 0xB1 (49) BD = 0xB2 (50) BC < 0xB3 (51) 53 47 CE SGN 0xB4 (52) 49 4E D4 INT 0xB5 (53) 41 42 D3 ABS 0xB6 (54) 55 53 D2 USR 0xB7 (55) 46 52 C5 FRE 0xB8 (56) 50 4F D3 POS 0xB9 (57) 53 51 D2 SQR 0xBA (58) 52 4E C4 RND 0xBB (59) 4C 4F C7 LOG 0xBC (60) 45 58 D0 EXP 0xBD (61) 43 4F D3 COS 0xBE (62) 53 49 CE SIN 0xBF (63) 54 41 CE TAN 0xC0 (64) 41 54 CE ATN 0xC1 (65) 50 45 45 CB PEEK 0xC2 (66) 4C 45 CE LEN 0xC3 (67) 53 54 52 A4 STR$ 0xC4 (68) 56 41 CC VAL 0xC5 (69) 41 53 C3 ASC 0xC6 (70) 43 48 52 A4 CHR$ 0xC7 (71) 4C 45 46 54 A4 LEFT$ 0xC8 (72) 52 49 47 48 54 A4 RIGHT$ 0xC9 (73) 4D 49 44 A4 MID$ 0xCA (74) 00 <end-of-list>We may observe a few pecularities:
- There are 75 tokens in total.
- There’s some optimization in the order of keywords, with frequently occuring keywords like “FOR”, “NEXT”, “INPUT”, “DATA”, “READ”, “DIM”, “LET”, “GOTO” listed first. (The token 0x80, index #0, is a “natural” candidate for “END”.)
- The “#” in “INPUT#” and “PRINT#” is not just a decoration, rather, these are keywords in their own right, represented by their own, unique tokens.
- This also means that “INPUT#” must be inspected before “INPUT”, as is “PRINT#” before “PRINT”. Otherwise, we’d always match on the respective latter and never catch the former.
(As we’ll see later, this has some consequences.) - Interestingly, “GET#” is not a token, meaning, here, “#” is a decoration that has to be checked for separately.
- “SPC(” and “TAB(” include the opening paranthesis, while other function keywords do not.
- This also means that “SPC” and “TAB” are valid variable names!
(Only the first two characters will be significant, but the tokenizer routine will not trigger on those names and will copy the full strings to the program text.)
I guess, this another remarkable effort for broader comaptibilty with pre-existing BASIC listings? - While operators are generally tokens, there are no tokens for “>=”, “<=”, or “<>” (or, alternatively, “=>”, “=<”, “><” — yes, BASIC has a double-arrow-operator and a bow-tie operator, but these are just alternative forms of the common relational operators.)
Instead of having their own tokens, these operators are built from combinations of “<”, “>”, and “=”.
The Code
So, as we know what’s actually processed, we may have a look at the CRUNCH-y routine.
This starts $C48D and processes zero-terminated input data from the BASIC buffer, which is for BASIC 1.0 located at $000A–$0059 (10–89, 80 bytes long). The initial index from $0000 into this will be available in $C9, pointing to the first non-blank character after the line number (at least 0x0B or decimal 11).
The BASIC input buffer also doubles as the output buffer, meaning, CRUNCH converts the contents of the BASIC buffer in situ. (Due to tokenization, the output will be almost always shorter than the input and is guaranteed to not exceed it. Moreover, while reading starts at the first character after the line number, which is at least at 0x0B, the ouput will follow behind, starting at 0x0A.)
The X register is generally used as an index or cursor for reading from the BASIC buffer, while the Y register is used either as an index into the keyword list or as an output cursor.
Note: With JavaScript enabled, you can hover over a labeled target to highlight the related label.)
C48D LDX $C9 ;set read cursor to initial char in BASIC buffer C48F LDY #$04 ;initialize output cursor ($0005+4 = one before $000A) C491 STY $60 ;also set this as mode flag (statement/string/DATA) C493 iC493 LDA $00,X ;read next char C495 BPL iC49E ;sign-bit not set: it's ASCII, skip ahead… C497 CMP #$FF ;is it the token for pi ("π")? C499 BEQ iC4DC ;yes, branch to copy it… C49B INX ;ignore (graphics) & advance read cursor C49C BNE iC493 ;redo for next char, unless zero… (unconditional) C49E iC49E CMP #$20 ;process character; is it a blank? C4A0 BEQ iC4DC ;yes, branch to copy it… C4A2 STA $5B ;store it as a delimiter C4A4 CMP #$22 ;is it a quotation mark (`"`)? C4A6 BEQ iC500 ;yes, branch to copy the string… C4A8 BIT $60 ;check mode flag C4AA BVS iC4DC ;if bit #6 set ($49: DATA), branch to copy as-is… C4AC CMP #$3F ;is it a question mark ("?") C4AE BNE iC4B4 ;no, skip ahead… C4B0 LDA #$99 ;load token for PRINT C4B2 BNE iC4DC ;unconditional branch to save… C4B4 iC4B4 CMP #$30 ;compare to ASCII "0" C4B6 BCC iC4BC ;smaller, not a digit, skip ahead… C4B8 CMP #$3C ;compare it to "<" C4BA BCC iC4DC ;branch to copy "0"…"9", ":", ";" as-is… C4BC iC4BC STY $C0 ;keyword search; store a backup of the output cursor C4BE LDY #$00 ;load 0 C4C0 STY $5C ;reset keyword index C4C2 DEY ;prepare Y for pre-increment loop C4C3 STX $C9 ;store a backup of the read cursor position C4C5 DEX ;prepare X for pre-increment loop C4C6 iC4C6 INY ;increment index into the keyword list C4C7 iC4C7 INX ;advance read cursor C4C8 iC4C8 LDA $00,X ;read char C4CA CMP #$20 ;is it a blank? C4CC BEQ iC4C7 ;yes: skip and branch to read the next char… C4CE SEC ;prepare subtraction C4CF SBC $C092,Y ;subtract current keyword char for comparison C4D2 BEQ iC4C6 ;exact match: continue with next char… C4D4 CMP #$80 ;is the difference the sign-bit (end of word)? C4D6 BNE iC507 ;it's a mismatch: prepare for next keyword… C4D8 ORA $5C ;it’s a token: $80 OR keyword index (in $5C) C4DA iC4DA LDY $C0 ;restore ouput cursor from backup C4DC iC4DC INX ;save a byte; pre-increment read cursor C4DD INY ;advance output cursor C4DE STA $0005,Y ;store the processed byte (char or token) C4E1 LDA $0005,Y ;post-processing; load again to set flags C4E4 BEQ iC51A ;was it zero (EOL)? branch to finalize… C4E6 SEC ;prepare subtraction C4E7 SBC #$3A ;subtract ":" (statement separator) C4E9 BEQ iC4EF ;found end of BASIC statement, skip to set mode flag… C4EB CMP #$49 ;was it $83 ($3A+$49, token for DATA)? C4ED BNE iC4F1 ;no, skip ahead… C4EF iC4EF STA $60 ;mode-flag: $00,$04=statement, $49=DATA C4F1 iC4F1 SEC ;prepare subtraction C4F2 SBC #$55 ;was it $8F ($3A+$55, token for REM)? C4F4 BNE iC493 ;no: branch to read & process next input char… C4F6 STA $5B ;set the delimiter to zero (EOL) C4F8 iC4F8 LDA $00,X ;get next input char C4FA BEQ iC4DC ;zero (EOL), branch to store it as-is… C4FC CMP $5B ;compare it to delimiter byte ($00 or $22) C4FE BEQ iC4DC ;found delimiter, branch to store it… C500 iC500 INY ;simple copy (REM, string), increment output cursor C501 STA $0005,Y ;store current byte C504 INX ;advance read cursor C505 BNE iC4F8 ;branch to copy next byte, unless zero… (unconditional) C507 iC507 LDX $C9 ;restore read cursor from backup C509 INC $5C ;increment keyword index C50B iC50B INY ;search ahead in list for end of keyword C50C LDA $C091,Y ;read keyword byte C50F BPL iC50B ;last char (sign-bit set)? if not, get next byte… C511 LDA $C092,Y ;peek into next byte C514 BNE iC4C8 ;branch to start over and try the next keyword… C516 LDA $00,X ;end of list, no match: re-read the input character C518 BPL iC4DA ;unconditional branch to copy it as-is… C51A iC51A STA $0007,Y ;end of line; store zero byte (could be again) C51D LDA #$09 ;reset start position for read cursor to $09 C51F STA $C9 ;store it C521 RTS ;doneWell, this may be quite a bit to swollow at once. Let’s have look at this in decent bytes.
— (I am not sorry.) —
The first part is just a bit of initialization, which also introduces a few important concepts:
C48D A6 C9 LDX $C9 ;first char position from $0000 C48F A0 04 LDY #$04 ;initial write index from $0005 C491 84 60 STY $60First, we set up our read cursor, reading it from location $C9. This will be initially set (and reset) to 0x09, but as we enter the routine, this will have been advanced to point to the very first non-blank character after the line number. (So we have already skipped over any leading white space.) Notably, this will be used as an index from $0000.
The output/write cursor is initialized to 0x04. This an offset from $0005, so this will point to $0009, just before the start of the BASIC buffer.
Finally, we also store 0x04 in $60, which is used as mode flag. (0x04 is coincidentially the ASCII code for EOT.) Over the run of the routine, this mode flag may also receive the values 0 or 0x49. But we’ll only care about bit #6 of this value, which will only set in 0x49, indicating that we’re reading a “DATA” section.
C493 B5 00 iC493 LDA $00,X ;fetch char from buffer C495 10 07 BPL iC49E C497 C9 FF CMP #$FF ;`π`? C499 F0 41 BEQ iC4DC C49B E8 INX C49C D0 F5 BNE iC493This is the head of the main read-process loop.
“LDA $00,X” loads the next byte/character to inspect. First, we check the sign-bit. If it’s unset, this means we’re dealing with a regular ASCII character and skip ahead to the next block. Otherwise, we test for 0xFF, the code for pi (π). If it is pi, we branch to write it to the output position as-is.
If we’re still going, the sign-bit is set and it’s not pi. Meaning, it must be an illegal (shifted or graphics) character, which is not valid input text. Hence, we advance the read cursor and branch unconditionally to the beginning of the loop for the next character, ignoring the invalid byte.
C49E C9 20 iC49E CMP #$20 ;blank? C4A0 F0 3A BEQ iC4DC C4A2 85 5B STA $5B C4A4 C9 22 CMP #$22 ;`"`? C4A6 F0 58 BEQ iC500 C4A8 24 60 BIT $60 C4AA 70 30 BVS iC4DC C4AC C9 3F CMP #$3F ;`?`? C4AE D0 04 BNE iC4B4 C4B0 A9 99 LDA #$99 C4B2 D0 28 BNE iC4DC C4B4 C9 30 iC4B4 CMP #$30 ;`0`? C4B6 90 04 BCC iC4BC C4B8 C9 3C CMP #$3C ;`<`? C4BA 90 20 BCC iC4DCHere, we’re back for valid input characters. Time to handle some further special cases:
First, we check for a blank (0x20), and if it is, we branch to $C4DC to copy it as-is.
Then, we store the current character in $5B to be used as a string delimiter later on. (This will only matter, if the current character is a quotation mark.)
Let’s check, if it actually is a quotation mark (0x22): if so, we branch to $C500 to copy the entire string.
Now it‘s $60’s heyday: the BIT instruction transfers (besides other things) bit #6 from our mode flag into the V flag of the processor status register. If it’s set, this indicates that we’re currently reading a DATA section and we branch (again) to that part at $C4DC to copy this as-is.
The next check is for 0x39 (“?”): if it’s not a question mark, we skip ahead. But, if it is, this is the standard shorthand for PRINT (which is not in our keyword list) and we load the corresponding token number (0x99) and branch unconditionally (we just loaded the non-zero value 0x99, therefor the zero flag is unset) to $C4DC to write this byte.
Finally, we check for digits (0…9), “:” (0x3A) and “;” (0x3B), which can be copied at $C4DC, as well.
Now, if we’re still processing, this means that it will be probably a letter or an operator. Time to check for keywords…
C4BC 84 C0 iC4BC STY $C0 C4BE A0 00 LDY #$00 C4C0 84 5C STY $5C ;keyword index C4C2 88 DEY C4C3 86 C9 STX $C9 C4C5 CA DEXLet’s prepare for the keyword search: First, we have to backup our write cursor, since we will now use the Y register as an index into our keyword list, and also, as we’re at it, we reset Y to zero. Next, we reset the the counter in $5C, which keeps track of the current keyword index, to zero, as well. Then, we decrement the read index into the keyword list (in Y) to -1, because our main search loop will start with an increment.
We also store a backup of our buffer read cursor (in X) at $C9 (the same pointer, we initialized it from), since we’re going to probingly read ahead and will probably have to start over again. And, just as before, we decrement it to prepare for the coming pre-increment loop.
C4C6 C8 iC4C6 INY C4C7 E8 iC4C7 INX C4C8 B5 00 iC4C8 LDA $00,X C4CA C9 20 CMP #$20 ;blank? C4CC F0 F9 BEQ iC4C7 C4CE 38 SEC C4CF F9 92 C0 SBC $C092,Y C4D2 F0 F2 BEQ iC4C6 C4D4 C9 80 CMP #$80 ;sign-bit? C4D6 D0 2F BNE iC507 C4D8 05 5C ORA $5C ;keyword indexWelcome to the main attraction, this is why we came here: the inner loop of the keyword search!
We start by advancing both the index into the keyword list and the read cursor and read the next byte. If it’s a blank (0x20), we branch back to get read the next character from the buffer.
That’s it: this is how the tokenizer routine skips specially over any white space in keywords!
If it’s a non-blank character, we’re engaging into some destructive tests:
First, we subtract the corresponding byte from the keyword list. If the result is zero, we have an exact match, meaning, we have still a valid candidate, but not yet reached the final character, and redo for the next pair.
Next, we check for the end of a keyword, as indicated by the difference amounting to exactly 0x80. If the difference amounts to anything else, we have a mismatch and branch to $C507, in order to prepare for another go with the next keyword candidate.
If, on the other hand, the difference is 0x80, which is a set sign-bit, we successfully advanced to the very last character of the given keyword and matched it. By OR-ing this remaining 0x80 with the current keyword index (as stored in $5C) we get our token.
This should be worth a brief contemplation: in a byte-size subtraction, it doesn’t matter, which of the two operands has the sign-bit set and which one not: (0x80+x)-x ≡ x-(0x80+x) ≡ 0x80.
Since there is no carry involved to propagate the sign, this is essentially an XOR operation. (Which is also the raison d'être of the overflow flag: otherwise, we’d never know.)
Indeed, the same could have been achieved by an EOR instruction — and even a bit more economically so. (That it is not so, hints at an origin in a more general codebase adn option to encode this in a different manner.)
Coincidentially (or, rather, very much so by design), a shifted PETSCII character is the same as its unshifted counterpart, with a set sign-bit. This is how BASIC abbreviations work in Commodore BASIC: simply by tricking the tokenizer routine into matching a keyword early. If we were successfully matching a candidate word up to here and the difference is 0x80, the routine assumes that it reached the end of the keyword and matched the entire word: voila, here is your token!
On the other hand, this is also responsible for some awkwardness in this: e.g., if we enter “iN” or “inpU”, this matches “INPUT#” rather than the much more frequent “INPUT”, because the former is tested first (compare the order of the keyword list). And necessarily so, otherwise the tokenizer routine couldn’t match “INPUT#”, at all. And the same applies to “PRINT#”.
Skipping out-of-order ahead to $C507, we get to where the loop is prepared for a flying start-over to check against the next keyword candidate in the list (which is where to branched to in case of a missmatch):
C507 A6 C9 iC507 LDX $C9 C509 E6 5C INC $5C ;keyword index C50B C8 iC50B INY C50C B9 91 C0 LDA $C091,Y C50F 10 FA BPL iC50B ;last keyword char? C511 B9 92 C0 LDA $C092,Y C514 D0 B2 BNE iC4C8 ;end of list? C516 B5 00 LDA $00,X C518 10 C0 BPL iC4DA ;write it (unconditional)First, we restore our buffer read cursor to where we started trying to match a keyword, then, we increment our counter into the keyword list (as a base for a potential token ID).
Then we advance our index into the keyword list to the next end of word: starting with the usual pre-increment, we read the current byte again. Mind that we’re not reading from $C092 as before, but from $C091. And we repeat this until we find a byte with a set sign-bit.
Having thus reached the end of the word, we peek ahead, this time reading from $C092, to see, if this the zero-byte marking the end of the keyword list, or if there’s more to do. If there is, we branch to the entrance of our keyword search loop with X and Y already set up.
In case we exhausted the list and read a zero-byte, we just write the current input character as-is. For this, we read the current input character from the BASIC buffer fresh again and branch unconditionally (there is no way of an overflow of the index register for a buffer that ends at $0059) to $C4DA, where we will write this to the output.
So far, we’ve seen the greater workings of the loop. If we previously fell through with a valid BASIC token in the accumulator, we first restore the output cursor from its backup:
C4DA A4 C0 iC4DA LDY $C0Then it’s time to actually store a processed byte, which is either a token or a raw character (this is the very address, we branched to earlier for various special cases):
C4DC E8 iC4DC INX C4DD C8 INY C4DE 99 05 00 STA $0005,Y C4E1 B9 05 00 LDA $0005,Y C4E4 F0 34 BEQ iC51A ;EOL?After a now customary preincrement to advance both the read and write cursors to the respective next buffer position, we simply store the current byte — just to load it again, do do some post-processing tests. First, we check if we just stored a zero-byte (EOL), indicative of having reached the of the line. If so, we branch to $C51A to finalize.
(We may observe that both instructions for write and read-back are of a 3-byte, full address format, where a zero-byte address mode would have done perfectly. There’s no technical reason for this, since there’s no chance that we could exceed the indexing range for a buffer, which ends at $59. There’s obvious opportunity for further optimzation. But, more importantlty, it’s another hint at this having been derived from a more general codebase.)
If still in business and there’s still more input to process, we engage in another round of destructive testing:
C4E6 38 SEC C4E7 E9 3A SBC #$3A ;`:` C4E9 F0 04 BEQ iC4EF C4EB C9 49 CMP #$49 ;$83 = DATA? C4ED D0 02 BNE iC4F1 C4EF 85 60 iC4EF STA $60 C4F1 38 iC4F1 SEC C4F2 E9 55 SBC #$55 ;$8F = REM? C4F4 D0 9D BNE iC493 ;to entrance of main loopFirst, we subtract 0x3A, the code for a colon (“:”). If it is now zero, it’s a match and we skip ahead to store this as the current mode flag in 0x60. Otherwise, we compare this to 0x49.
0x3A + 0x49 = 0x83, which is — consulting our keyword-token table — the token for “DATA”. If it is now 0x49, it must have been a DATA token before the subtraction, and we skip ahead to store this remainder as the new mode flag.
In any other case, we skip ahead and subtract 0x55:
0x3A + 0x55 = 0x8F, which happens to be the token for “REM”, and, if the subtraction yields a zero in the accumulator, we have a match.
If it’s not a REM statement, we branch back to the entrance of our read-process loop at $C493 to fetch and process the next input character.
If it was “REM”, we clear the delimiter in 0x5B, in order to read to the end of line (EOL) and to consume the remaining line entirely.
What follows is the loop that copies a continous string of characters, as for a REM clause or a quoted string. We load the current character, and, if it’s not the end of the line, we compare it to the delimiter. If it is the end of line or the delimiter, we jump back to $C4F8 to write the character as usual (where also the EOL zero-byte will be finally caught).
Otherwise, we advance the output cursor and write the byte. And, after a pre-increment of the input cursor, we unconditionally branch to the entrance of this copy loop. (Again, there is no way to overflow on X for a buffer of 80 bytes length.)
Mind that this close copy loop applies to quoted strings and REM statements only. Notably, REM statements may contain shifted characters, while, due to a bug in the LIST routine, these will be expanded to their full keyword equivalent in any listing, as if these were tokens.
C51A 99 07 00 iC51A STA $0007,Y C51D A9 09 LDA #$09 C51F 85 C9 STA $C9 C521 60 RTSThe last few lines are finalizing the transformation and we only get here, if we read a terminating zero-byte. First we write this zero-byte to the buffer (we may have done so previously, but better to be sure) and after resetting the index for the input cursor to 0x09, we finally exit the routine.
— Phew! —
Closing Observations
By this, the BASIC line
10 PRINT A+5will be transformed in the buffer from
0000: 4C 30 D1 00 00 00 7F 7F L0...... 0008: 0A 00 31 30 20 50 52 49 ..10 PRI 0010: 4E 54 20 41 2B 35 00 44 NT A+5.Dinto the equivalent tokenized code:
0000: 4C 30 D1 00 00 00 7F 7F L0...... 0008: 0A 00 99 20 41 AA 35 00 ... A.5. 0010: 4E 00 20 41 2B 35 00 44 N. A+5.D(Mind the stray zero-byte at $0011, an artefact of the final better-sure-than-sorry write operation, which happened after the byte was already written and the pointers (cursors in X and Y) were pre-incremented again. We may also observe that this does no harm and that there’s also a termination in the correct memory location.)
If in direct mode, this code can be executed immediately from here, otherwise, it will be copied to user memory as (part of) a stored program.
We may also observe the 16-bit binary equivalent (0A 00) of the already processed line number immediately before this, at locations $0008 and $0009.
A few observations:
- Overall, this is a rather efficient, self-contained routine. There are only branch instructions and no jumps, saving a few of those precious ROM bytes.
- There are also a few inefficiencies, owed to the fact that this is from a more general codebase and wasn’t specially adapted to the BASIC buffer being located in the zero-page (as for the codebase, it could be located anywhere). Or the end-of-keyword matching that may have been performed by an EOR instruction more compactly.
- There’s also a major inefficiency, namely with the handling of a terminating zero-byte. Not only does the routine write this in most cases repeatedly, it’s also only checked for after a byte (character or token) has been written and reloaded, which happens only after the attempt to match a keyword. In other words, as we load a zero-byte at $C493, we pass all the checks for special cases and enter the keyword matching routine at $C4BC, just to turn a victory lap over all the 75 possible keywords. Only then we fall through to writing the byte at $C4E4, where the zero-byte will be finally detected, but only after the write operation, and from here we finally jump to the end of the routine, just to write this byte again.
- The implementation of the special skipping over any white space in keywords is rather unspectacular: just a simple “CMP #$20” followed by a “BEQ” branch instruction (at $C4CA and $C4CC).
- And there’s an actual bug. A subtle one and only concerning an edge-case, but still a bug!
The CRUNCH Bug
This bug is exclusive to Commodre BASIC, as it is related to PETSCII encoding. As we have seen, keyword abbreviations are really a side effect of how shifted characters are encoding in PETSCII, matching a character with a set sign-bit as stored in the keword list. The tokenizer routine is entirely agnostic of this.
This also means that we can enter a keyword terminating character. While this will be rejected in a leading position, it will be processed like any other byte, when we read ahead trying to match any keyword candidates. As far as the keyword search code is concerned, any such premature matching of the terminating difference of 0x80 is just a side effect of the subtraction working either way, regardless of which of the two compared bytes has the sign-bit set.
Now, what happens, if we enter a shifted a shifted character at the final position of a keyword? We’ll have a perfect character match with zero difference! Meaning, the routine will still go on, trying to match what must remaining characters of the keyword candidate, thus spilling over into next keyword candidate, but without incrementing the count of keywords which have been inspected already. Should this eventually match (be it on this next keyword, or on any consecuitive try), the keyword count will be lagging behind by one, resulting in an incorrect token.
We can demonstrate this by a little trick of magic:
My attractive assistant will now enter some nonsensical BASIC code containing two consecutive keywords, and by the magic touch of the RETURN key, one of these keywords will vanish and the line will be transformed into perfectly valid BASIC!To improve the performance and to amuse our esteemed audience, we will switch for this to the lower-case/upper-case character set:

I hope, you’re suitably impressed!
Peeking into memory, we can prove that the magical transformation is complete and not a fluke:
addr code semantics 0430 3A 04 link: $043A 0432 C8 00 line# 200 0434 8D token GOSUB 0435 20 31 30 30 ascii « 100» 0439 00 -EOL- 043A 00 00 -EOT-Now, how does this work?
Well, “gosuB” is number 13 in the keyword list (token 0x8D) and “returN” comes immedtiatly after this, at index 14 (token 0x8E).
As the keyword matching code runs away over the perfect match of the shifted “B” in both the input and the keyword candidate (indicating that we’re still in the middle of a valid keyword), it eventually matches “return”, but without having incremented the keyword count. Thus, the resulting token is 0x8D, the token for GOSUB, while the entire string has been consumed and the tokenized line amounts to just “200 GOSUB 100”.
The input string “return”, though, is gone, gone with the magic wind of a run-away error!
And, by, this, we‘ve seen what‘s to be seen, in tokenizing with Commodore BASIC 1.0.
Further Developments
As already indicated, this was not to last. The upgraded “New ROM” with BASIC 2.0 (the one introducing proper IEEE488 disk support), brought various changes and modifications, also doing away with skipping over white space in keywords. And “LET HEN” was no more.
BASIC 2.0 “New ROM”
All that had to be done to mend the “LET HEN” bug was the ommision of two instructions.
The BASIC input buffer is now located at $0200 to $0250 (moved to outside of the zero-page — we can see now why those write instructions hadn’t been zero-page instructions, before) and the there’s a related extra instruction near the end for resetting the read cursor (for use as a double-byte pointer at $77 / $78).
Otherwise the code is still the same, as can be seen in this comparison:
Accordingly, “LE THEN” is tokenized as expected:
### COMMODORE BASIC ### 7167 BYTES FREE READY. 10 IF LS = LE THEN GOTO 100 LIST 10 IF LS = LE THEN GOTO 100 READY.And, as in memory:
addr code semantics 0401 17 04 link: $0417 0403 0A 00 line# 10 0405 8B token IF 0406 20 4C 53 20 ascii « LS » 040A B2 token = 040B 20 4C 45 20 ascii « LE » 040F A7 token THEN 0410 20 ascii « » 0411 89 token GOTO 0412 20 31 30 30 ascii « 100» 0416 00 -EOL- 0417 00 00 -EOT-But now, there was a problem with two-word “GO TO”, the traditional and most frequent use of white space skipping in keywords. For this, the keyword list was amended for another token, “GO”, token ID 0xCB:
code keyword token index ... 4C 45 46 54 A4 LEFT$ 0xC8 (72) 52 49 47 48 54 A4 RIGHT$ 0xC9 (73) 4D 49 44 A4 MID$ 0xCA (74) 47 CF GO 0xCB (75) 00 <end-of-list>The new token “GO” bears all the hallmarks of an afterthought: It works only in place of a plain “GOTO” and relies on the already existing token “TO” (as in “FOR … TO”). As there is no token for “SUB”, we can‘t use it for “GOSUB” and a two-word “GO TO” isn’t checked for in a “ON … GOTO …” construct, either.
But, it’s for this new token that we can still write and successfully execute
100 GO TO 200and even:
100 IF A > B THEN GO TO 200But, we can’t use “GO TO’ in place of “THEN”, like “GOTO”:
100 IF A > B GO TO 200 RUN ?SYNTAX ERROR IN 100 █While more of a superficial fix, “GO” makes for a rather baffling new effect of the CRUNCH run-away bug:
(here for better readability in the lower-case/upper-case character set)

What happens here? Well, when the routine runs away over the shifted “B” in “gosubB” and subsequently fails on matching the next keyword candidate, “RETURN” as the remaining part, it continues the keyword search as usual, but with the keyword counter now lagging behind by one. As it finally finds “GO” at the very end of the list, the resulting token is thus 0xCA instead of the correct 0xCB. — And 0xCA is really the token for “MID$”, found just before “GO” in the keword list!
Subsequently, the remaining “suB” is parsed as plain ASCII text, with the shifted letter “B” rejected.
Thus, we end up with the following tokenized line of BASIC:
— Yay, this worked out greatly! —
Having seen the major attraction, we may reassueredly turn our attention to the next major iteration, which is
Commodore BASIC 4.0
This one came with new commands for disk control, some new PETSCII control characters (like for switching the character set or setting tabulators), and featured an enhanced editor with support for things like sub-windowing for proper business use on the 40xx/80xx series. Since this required another 4K of ROM, at 0xB000– 0XBFFF, this wasn’t available for the original PET 2001 (the one with the built-in Datasette), which lacked the required ROM slot. But it was available as an upgrade for the PET 2001-N and other non-CRTC PETs (but here, the code for the new editor features was disabled, while the BASIC ROMs where the same).
Here’s the new version in comparison to Commodore BASIC 2.0 (“New ROM”), it shared the system addresses (zero-page, etc.) with:
The changes are really just about how the keyword list is accessed. While it was previously done as in
SBC $C092,Y ;$C4D5 LDA $C091,Y ;$C513the keyword list is now accessed in post-incremented indirect address mode via a pointer in 0x1F / 0x20:
SBC ($1F),Y ;$B548 LDA ($1F),Y ;$B590Where previously the buffer was transformed in-place, just to be copied to memory in another step, now, thanks to the use of this pointer, we may write the output to anywhere in memory, just as we please.
All other changes are related to this, setting and updating the pointer in 0x1F and 0x20 accordingly.
(Moreover, the read cursor in X is now generally adjusted first, before the new pointer is handled.)
But there’s also a curious peculiarity to this new code, found at $B584:
B584 iB584 LDA ($1F),Y B586 PHP ;push A to stack B587 INC $1F B589 BNE iB58D B58B INC $20 B58D iB58D PLP ;pull A from stackWhat are PHA and PLA meant to achieve? Neither BNE nor INC affect the accumulator!
It‘s more like someone familiar with an other machine, say, the DEC PDP-1, where the IDX instruction, similar to INC, leaves the incremented value in the accumulator as a side effect, had a hand in this. On the 6502, though, this is not the case and these two instructions are entirely luxurious, achieving nothing but burning a few processor cycles.
(This piece of code is run whenever we failed to match a keyword candidate and read ahead in the list, in order to advance to the beginning of the next keyword candidate. Luckily, the input of a BASIC line isn’t that time critical.)
Anyways, none of these changes affect the run-away bug, which is still what it ever was:
*** commodore basic 4.0 *** 31743 bytes free ready. 10 gosuB 100 20 gosuBreturn 100 list 10 mid$su 100 20 gosub 100 ready. █Beyond the PET
BASIC V.2 for the VIC-20 and C64 is essentially derived from BASIC 2.0 for the PET. BASIC V.2 does away with the second cassette buffer, adds support for color (i.e., code maintaining the color RAM), implements IEEE-over serial (somewhat unfortunately so, as the clock synchronization didn’t work out as expected), adds support for RS-232C via the user port, and adds a few control characters (much like it’s found in BASIC 4.0 for the PET). But, otherwise it’s still the same.
Commodore 64 BASIC V.2, Kernal Rev. 2
Hence, it’s small wonder that the CRUNCH routine for rev. 2 of the C64 ROM is still the same.
(We may safely assume it’s also the same for the VIC and the rarer rev. 1 Kernal. I’ve to admit, I didn’t even check.)
System addresses are now slightly different, otherwise, it’s still the same, as is the keyword/token list:
As may be expected from this, the code is also functionally the same.
Commodore 64 BASIC V.2, Kernal Rev. 3
In Kernal Rev. 3, the last incarnation of this code, which is also found in the C128, the tokenizing is still the same, while its address has moved by 4 bytes in ROM:
Therefor, all the observation, we made for PET BASIC 1.0 (less the two instructions for white space skipping in keywords) and BASIC 2.0 are still valid, half a decade later.
BTW, you can easily discern the Kernal revision of a C64 by inspecting the contents of ROM location $FF80 (as in “PEEK (65408)”):
- 0xAA (dec. 170) … Rev. 1 (the address contains just the usual fill byte)
- 0x00 (dec. 0) … Rev. 2
- 0x03 (dec. 3) … Rev. 3
- 0x43 (dec. 67) … SX-64 (similar to Rev. 3, but there are no tape routines and screen colors differ)
- 0x64 (dec. 100) … CBM 4064 / Educator 64
And that’s it, for today. — With some of the questions actually answered, we drop the curtain (in lesser dismay.)
Norbert Landsteiner,
Vienna, 2025-07-05