A good interpreter has to be small and fast. These days, "small" is no longer considered a mandatory property, but things look very different when all you have is 250k of RAM shared with other (potentially even more demanding) applications running concurrently. So we have taken some steps to pursue both goals; specifically:
1. Our interpreter, as a virtual machine, works with registers, not with a [virtual] stack. There are two virtual registers, R0 and R1 (corresponding to the real registers er0 and er1 of the H8S processor), which are used as operands. This does not lead to opcode bloat, however: each register has a hard-coded role; in other words, no opcode contains a "register bit". For example, push implies register R0, pop implies register R1, and so on. This may sound weird at first, but there are very clever ways to make use of such opcodes, and our existing C compiler proves that. As for opcode implementations, consider this one:
    add:    add.l   er1, er0    ; 2 bytes, 1 cycle
            bra     scheduler   ; 2 bytes, 2 cycles
It is that simple.
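To restate the idea in C terms, here is a minimal sketch of a two-register dispatch loop in which every opcode has its operand registers hard-wired, just as described above. The opcode names, encodings, and data structures are hypothetical and are not taken from the actual CyOS sources; they only mirror the scheme (push works on R0, pop works on R1, add leaves its result in R0).

    #include <stdint.h>

    enum { OP_PUSH, OP_POP, OP_ADD, OP_HALT };   /* hypothetical encodings */

    typedef struct {
        int32_t r0, r1;          /* virtual R0 / R1 (er0 / er1 on the H8S) */
        int32_t stack[64];
        int     sp;
    } Vm;

    static void run(Vm *vm, const uint8_t *pc)
    {
        for (;;) {
            switch (*pc++) {
            case OP_PUSH: vm->stack[vm->sp++] = vm->r0; break;  /* push implies R0 */
            case OP_POP:  vm->r1 = vm->stack[--vm->sp]; break;  /* pop implies R1  */
            case OP_ADD:  vm->r0 += vm->r1;             break;  /* add: R0 += R1   */
            case OP_HALT: return;
            }
        }
    }

With such fixed roles, an expression like a + b compiles to "load a into R0; push; load b into R0; pop; add", and no opcode ever needs bits to name its registers.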
2. There is no translation layer between CyOS and the hardware on one side and the bytecode interpreter on the other. For example, the push, pop, calln, and retn opcodes use the regular hardware stack, not an emulated one. Upon calls to CyOS and to so-called extension functions (see below), which expect the first 3 parameters to be in registers, those parameters' values end up in the proper registers as a natural result of the standard expression-evaluation sequences (e.g. extension functions, written in "regular" C, expect their 3rd argument to be 'this' and expect it to be passed in register er2, and the bytecode interpreter always keeps 'this' in er2 anyway). That means the interpreter does not have to shuffle register values around before each call. Thus the bytecode interpreter and the rest of the system are pretty well integrated.
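As an illustration of that calling convention, here is what an extension function might look like. The function name, parameter names, and types are purely illustrative and are not taken from the CyOS headers; the point is only that the first three arguments travel in er0..er2 and the third one is 'this', which the interpreter already keeps in er2.

    struct Object;                                  /* opaque 'this' type */

    /* Hypothetical extension function, written in "regular" C. */
    long ext_set_field(long value, long index,      /* arrive in er0 and er1  */
                       struct Object *self)         /* 'this', arrives in er2 */
    {
        /* ... ordinary C code operating on the object ... */
        (void)value; (void)index; (void)self;
        return 0;
    }

Because the interpreter keeps 'this' in er2 at all times, a call to such a function needs no register shuffling at all.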
3. The bytecode interpreter has neither a 'stack frame pointer' nor a 'code buffer pointer'. For those of you familiar with the i80x86 processors: the 'stack frame pointer' used to be BP in 16-bit programs and EBP in 32-bit ones. Compiler writers learned to compute local variables' addresses relative to ESP instead, thus freeing EBP for use as a general register; in GNU C/C++ this is known as the 'omit frame pointer' optimization. Our C compiler always performs this optimization, so we dropped the very notion of a 'stack frame pointer' from our virtual machine specification.
Almost the same applies to the 'code buffer pointer' - there is no such thing. Even an instruction such as leag.u contains a displacement relative to the very next instruction, not an "absolute" offset; other instructions, such as calln.s and jump.c, are similar in this respect. In other words, the code is essentially position-independent (PIC).
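The decoding step for such instructions is straightforward. The sketch below shows one plausible way a PC-relative opcode such as calln.s could resolve its target; the function name, the big-endian byte order, and the signedness of the displacement are assumptions for the sake of illustration, not a statement of the actual encoding.

    #include <stdint.h>

    /* Resolve a target address from a 2-byte displacement that is
       relative to the address of the very next instruction. */
    static const uint8_t *resolve_target(const uint8_t *pc /* points at the displacement */)
    {
        int16_t disp = (int16_t)((pc[0] << 8) | pc[1]);   /* assumed: signed, big-endian */
        const uint8_t *next = pc + 2;                     /* address of the next instruction */
        return next + disp;                               /* position-independent target */
    }

Since every reference is made relative to the instruction stream itself, a module can be loaded at any address without fix-ups.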
4. The maximum size of a bytecode module is 64k; therefore, the "local" address space is essentially 16-bit and, consequently, "static" offsets (including those used in the calln.s opcode) are 2 bytes. Furthermore, stack frames are addressed with unsigned 1-byte displacements (thus, a function cannot have more than ((255 - 4 - 1) & ~3) == 248 bytes of arguments and 'auto' variables in total); objects are also addressed with unsigned 1-byte displacements (therefore, no more than 256 bytes are directly addressable via 'this', although objects themselves can be of any size up to 64k). But as soon as any address gets loaded into a register (say, as a result of executing the leal.b bytecode, which effectively adds the 1-byte offset it contains to the current value of the stack pointer and places the result in R0), it becomes a valid 32-bit address, fully compatible with those used by the rest of the system.
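The address-widening step is trivial, which is exactly the point. The helper below is a hedged sketch of what leal.b effectively does (the name and signature are hypothetical): an unsigned 1-byte displacement plus the real stack pointer yields an ordinary 32-bit address in R0.

    #include <stdint.h>

    /* What leal.b effectively computes: sp is the hardware stack pointer,
       disp is the unsigned 1-byte displacement stored in the opcode. */
    static int32_t leal_b(const uint8_t *sp, uint8_t disp)
    {
        /* from here on the value in R0 is a plain 32-bit pointer,
           usable by the rest of the system without any translation */
        return (int32_t)(intptr_t)(sp + disp);
    }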
5. There are bytecodes for "object" commands, that is, opcodes that operate on data addressed relative to the special 'this' pointer. In other words, there are provisions for object-oriented languages such as C++. The current C implementation has some OO extensions built on these opcodes, which save considerable space. See leat.b for more information.
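Continuing the hedged sketch from point 4, a this-relative opcode such as leat.b would differ from leal.b only in its base register: 'this' (kept in er2) instead of the stack pointer. Again, the helper name and signature are hypothetical.

    #include <stdint.h>

    /* Hypothetical counterpart of leal_b: the unsigned 1-byte displacement
       is added to 'this' rather than to the stack pointer. */
    static int32_t leat_b(const uint8_t *this_ptr, uint8_t disp)
    {
        return (int32_t)(intptr_t)(this_ptr + disp);
    }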
6. Not all arithmetic opcodes are 32-bit. Multiplication and division are essentially 16-bit: if their operands do not fit within the -32768..32767 range, the results are unpredictable (this restriction does not apply to the dividend). This is a deliberate design decision.
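To make the restriction concrete, here is one way to model it in C, patterned after a 16 x 16 -> 32 signed multiply (such as the H8S mulxs.w instruction). The wrapper is hypothetical and only shows which operand ranges give a defined result; the actual interpreter's behavior outside that range is, as stated above, unpredictable.

    #include <stdint.h>

    /* Model of the 16-bit multiply: defined only when both operands
       fit in -32768..32767; the product itself may be a full 32 bits.
       (For division, only the divisor is restricted this way.) */
    static int32_t vm_mul(int32_t a, int32_t b)
    {
        return (int32_t)(int16_t)a * (int32_t)(int16_t)b;
    }

Under this model, vm_mul(1000, 1000) yields the expected 1000000, while vm_mul(70000, 2) silently folds 70000 down to 4464 first - the kind of surprising result the restriction permits.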
7. Unsigned data types are not supported, just like in Java. The only supported data types are signed char, short (a synonym for int), and long. However, unlike Java, there is neither an unsigned right shift nor a dedicated Unicode character type in the Cybiko bytecode interpreter. Note that the H8S has a 24-bit address space and no virtual memory, so the 8 high-order bits of any valid address are guaranteed to be clear; in other words, addresses can safely be treated as signed values and still compare correctly. We used this fact for a major optimization: we excluded all opcodes for unsigned comparisons.
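The argument can be restated in a few lines of C. The helper below is purely illustrative (it is not part of the interpreter): because valid H8S addresses occupy only the low 24 bits, both values are always in 0..0x00FFFFFF, so a signed comparison orders them exactly as an unsigned one would.

    #include <stdint.h>

    /* Illustrative only: why signed comparison suffices for addresses
       on a machine whose addresses never exceed 24 bits. */
    static int addr_less(uint32_t a, uint32_t b)
    {
        /* both operands are non-negative even when viewed as signed
           32-bit integers, so the ordering is identical */
        return (int32_t)a < (int32_t)b;
    }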
8. Opcodes are bytes, so there may be up to 256 of them. The thing is, we take the famous "profile anything, assume nothing" rule quite seriously. From this perspective, we want to build as good an instruction set as possible while maintaining maximum compatibility (including full backward compatibility) with the current implementation. Therefore, we decided to implement a minimalist set of opcodes, wait until large pieces of software that use that set appear, "profile" them to see which bytecode sequences are used most often, and then turn those sequences into new bytecodes. A similar approach has already proved highly efficient in our proprietary compressor.
9. The bytecode interpreter is fully re-entrant. Any number of applications executing bytecode modules (even with different sets of extension functions) share a single in-memory image of the respective dynamic library (bytecode.dl).