This is a minimal ANSI Forth implementation. It implements the core and core extension wordsets in the Forth 2012 Standard, or at least its current draft.
It is minimal in two senses:
-
Other then the above two wordsets, there are only a handful of extra words (see below)
-
Aside from a few very basic words, everything is written in Forth itself
Rationale
There is (almost) none. I wanted to try to implement a fairly complete Forth system by myself, not a toy Forth like (the otherwise wonderful) jonesforth, or the Forths built in books like Let Over Lambda (one of my favorite programming texts).
I was also intrigued by the minimal implementations given in the credits below,
and wanted to know if these ideas hold even in a full system.
While I had to make some concessions (like adding the words UM* and UM/MOD
as basic words), I am quite content with the results.
That said, this is a doubly inefficient implementation:
-
It is interpreted (not compiled)
-
Even its interpreter is interpreted (not native)
The latter is because I wanted to show that even the interpreter can
be written in Forth itself; it also helps keeping the virtual machine
to a minimum (its interpreter only understands decimal numbers, and
does not handle string sources given by EVALUATE).
As for the former, it shouldn't be too hard to change it to a
compiler: a new basic word like MEM ( addr -- real-addr ) would be
needed to map relative addresses to real addresses, and some ,s
would need to be changed to COMPILE,s, which, in turn, should be
defined appropriately. An even more efficient implementation would
use the hardware stack, which could be done by changing a few words.
Credits
Based on milliForth, which is in turn based on sectorforth, which was inspired by a Usenet post by Bernd Paysan.
Some words are defined based on reference implementations at the Forth 2012 Standard website.
A few words are modeled after their implementation in Gforth.
Extra words
The following words are defined, but not part of the core and core extension datasets:
COMPARE ( c-addr1 u1 c-addr2 u2 -- n )compares two strings (part of the string wordset)DABS ( d -- ud )computes the absolute value of a double cell (part of the double wordset)JUMP ( -- )jumps to the relative address in the next cell?JUMP ( flag -- )jumps to the relative address in the next cell whenflagis0; otherwise jumps over the next cellJUMP! ( offset -- )compiles the computation and storing of the current relative address plus the given offset (storage address is given on the data stack in runtime)LIT@ ( -- x )loads the next cell onto the data stack and jumps over itNAME>INTERPRET ( nt -- xt | 0 )returns the interpretation semantics of the given name tag (part of the tools extension wordset); this implementation never returns0NAND ( x1 x2 -- x3 )is the bitwise NAND operationPARSE-NUMBER ( c-addr n1 -- FALSE | n2 TRUE )tries to parse the string as a number (using(PARSE-NUMBER)and(PARSE-BASE)helper functions)
An additional file (utils.fs) contains some standard utilities from the tools wordset:
.S ( -- )SEE ( "<spaces>name" -- )[very rudimentary]
It also provides some other useful words:
FORGET-NAME ( "<spaces>name" -- )selectively forgets words by setting their name length to 0VERBOSITY! ( 0 | 1 | 2 -- )sets the verbosity level of the interpreter
About the virtual machine
The machine memory has the following structure:
<- low addresses . . . . . . high addresses ->
VVVVVMMMMMMMMMMMMMMMMMMMMMMMMMMMMRRRRRTTDDDDDD
M-> <-R <-D
V: system variables
M: main memory (where the user dictionary lives)
R: return stack (grows downward)
T: text input buffer (stores the current line)
D: data stack (grows downward)
The first part of the memory contains the following system variables:
| Cell | Name | Meaning |
|---|---|---|
| 0 | HERE | start of unreserved data space |
| 1 | DICT | address of top of the dictionary |
| 2 | MEMEND | end of user memory |
| 3 | RSP | return stack pointer |
| 4 | TIB | text input buffer address (= bottom of the return stack) |
| 5 | >IN | input buffer offset |
| 6 | TIBSIZE | text input buffer size |
| 7 | DSP | data stack pointer |
| 8 | MEMSIZE | memory size (= bottom of the data stack) |
| 9 | STATE | TRUE during compilation |
| 10 | LATEST | address of the latest (started) definition |
| 11-19 | - | (reserved for variables in core.fs) |
The virtual machine also knows some basic operations:
! ( x a-addr -- )+ ( n1|u1 n2|u2 -- n3|u3 )0< ( n -- flag )@ ( a-addr -- x )ACCEPT ( c-addr +n1 -- +n2 )CELLS ( n1 -- n2 )EMIT ( x -- )EXIT ( -- ) ( R: nest-sys -- )KEY ( -- char )[currently not implemented in the VM]NAND ( x1 x2 -- x3 )UM* ( u1 u2 -- ud )UM/MOD ( ud u1 -- u2 u3 )
It also has an interpreter that can parse decimal numbers, and define new words with : ... ;.
This is only for bootstrapping; the interpreter and :, ; are rewritten later in core.fs.
Note that the VM implementation of : does not leave anything on the stack.
A dictionary entry has the following form:
| Length | Contents |
|---|---|
| 1 cell | Link to previous entry (or 0 at the first entry) |
| 1 bit | 1 if immediate |
| 2 bits | (reserved) |
| 5 bits | name length (max. 31) |
| N bytes | name |
| K bytes | aligning (K is minimal s.t. 1+N+K is a multiple of the cell size) |
| M cells | body (addresses of other bodies) |
VM implementation notes
Characters are assumed to be 1 byte, but cell size is easily adjustable (default is 32 bits).
Needed changes: redefine cell, ucell, udcell and CELL_SIZE in vm.c,
and check the mixed-precision functions fn_mult and fn_divmod.
The system is assumed to be little-endian, and endline character is 10 (line feed).
The body of system words (i.e., one of
! + 0< : ; @ ACCEPT CELLS EMIT EXIT KEY NAND UM* UM/MOD)
is a single negative number (-1 for !, -2 for + etc. to -14 for UM/MOD).
Literals are loaded by a special code (-15), but this is reimplemented later in LIT@.
Instead of using a full-fledged readline library, input-ending newlines are avoided using VT100 control sequences (so it looks better on a terminal that understands that).
Testing
The test framework and the test for the core wordset were developed by John Hayes, and those for the core extension wordset by Gerry Jackson.
Testing itself was an eye-opening experience. Who knew there could be so many bugs in my code? And there are still quite a few, I am sure, but at least now it passes...
- all tests in
test/prelimtest.fth - all tests in
test/core.fr(*) - all tests in
test/coreplustest.fth - almost all tests in
test/coreexttest.fth(**)
(*) One minor deviation is that since this implementation reads only from the standard input,
the test for ACCEPT is not very meaningful.
(**) The only failing test is the last one:
: SSQ9 S\" 11 : SSQ10 S\\\" \\x32\\x32\" EVALUATE ; SSQ10 33" EVALUATE ;This should leave 11 22 33 on the stack, but currently nested EVALUATEs are not supported.