# ROP Virtual Machine

a virtual machine for the Z80 using the return-of-procedure technique.

Compiling a higher language to the Z80 microprocessor is a challenge:
either 'fast' and large code or small code and slow execution or both...

Higher languages require local variables, these are dynamically located in memory making it for the Z80 CPU very complicated to access them. Therefore compiled machine code becomes very longish and not really fast.

If you use a virtual machine then you can provide 'words', virtual opcodes which do the bulky things and need only be placed once in memory. Code becomes compact, as it can consists of the virtual opcode addresses only. But you sacrifice a register pair for the virtual machine's instruction pointer and fetching and calling the virtual opcodes takes a lot of time - every opcode.

An example, using bc for the VM IP and sacrificing hl to do the jump:

	.macro NEXT
		ld	a,(bc) 	; 7
		inc bc		; 6
		ld	l,a		; 4
		ld	a,(bc) 	; 7
		inc bc		; 6
		ld	h,a		; 4
		jp	hl		; 4	= 38 clock cycles
	.endm

I was thinking of a 'middle' way: compact and not too slow: return-oriented programming. But i did not know how good it would work. There would be many challenges. First, the idea was to use the stack pointer as the VM instruction pointer and replace the above macro by a simple 'ret' insruction:

					; jump to next opcode:
		ret			; 10 clock cycles

	pro:
	compact code, nearly as compact as in the above example
	little delay for jumping to the next opcode
	even faster than calling a subroutine from machine code: no 'call'!

	contra:
	using the stack pointer for something else than the stack is ridiculous.
	the sp points into the virtual code: the Z80 has no ssp and therefore interrupts will overwrite some code!
	Accessing local variables at arbitrary positions will still be slow.

The primary problem to solve was the overwritten code: when the sp is used for the VM.IP then it points into the code. When an interrupt occurs then it pushes the return address on the stack – into the code.

One possible solution is to disable interrupts, at least most of the time. That would leave only very few use cases, because at least a timer interrupt is always required for regular actions, like polling the user input.

Another idea is to copy the code and repair the overwritten word in the interrupt routine. If you have enough spare memory then this may be a good option, but this way we don't keep the code size small. At least we would only need to duplicate the virtual code region where the sp actually can point in.

The last idea was to calculate checksums over ranges of virtual code and restore the overwritten word by recalculating the checksum in every interrupt: way too slow...

How slow?

A first estimation yielded that at least it would not be dead slow:

For sectors of 64 bytes the raw calculation of the checksum would take 32 * (10+11) = 672 clock cycles 'cc':

	.rept 32
		pop	de		; use sp to address words
		add	hl,de
	.endm

Adding another 400cc for the overhead, multiply with the interrupt frequency, we get (672+400)*50 = 53600cc, which is 1.5% of the processing time of the CPU at 3.5MHz – on a ZX Spectrum.

Only 1.5% ... really?

That's the way to go!

Off course things a slightly different if the clock frequency is lower or the system generates more interrupts per second. But even in the worst real cases we won't need more than 10% of the time and when comparing machine code to virtual code the speed ratio is easily times 2.

The actual current implementation uses 32 byte segments and takes 928cc for the whole interrupt processing including everything, except the user actions, if the sp actually pointed into VM code, 206cc if not. The additional code is 127 bytes plus 2 bytes every 32 byte segment of VM code, roughly 7% of size. This could be adjusted to use 64 byte segments for a smaller footprint of the table, but then the unrolled checksum loop will become longer, eating some of the saved space, therefore 32 byte segments seemed to be the best compromize to me.

Not so obvious are some hidden disadvantages of this method:

	The virtual code must be located in ram.
	The VM code must reside in one segment.
	Nested interrupts are not possible because only one word can be repaired.
	The stack pointer must always (!) point at even addresses in the VM code
	the full segments at start and end of the VM code must be immutable too.
	the full 100h pages at start and end of VM code must be immutable too or the sp must never point into this region.

Some of these restrictions could be overcome with more code and more time to handle it. This is what i found to be the best compromise.

More Challenge

Now that the interrupt problem is solved, to the other challenges:

We no longer have a stack pointer. We need a stack, at least for return addresses, and i finally decided to use the iy register. It also provides - slow - access to local variables.

We no longer have a stack for push, pop, call and return! Before a VM opcode can use the stack it must store the current sp (which points into the VM code) and load the sp from the iy register. Finally, before jumping to the next opcode with 'return' it must reload the stack pointer. If opcodes call other opcodes (just because they return with return, which makes this easy!) then care must be taken that not two nested calls store their return address in the same location.

Pushing and popping on the new stack is extremely painful:

	.macro push
		dec	iy		; 10
		ld	(iy),h	; 19
		dec iy		; 10
		ld	(iy),l	; 19 = 58 cpu cycles!
	.endm

So we want to avoid it.

Therefore reentrant procedures allocate all required memory once on procedure entry and deallocate it on return. During their life time no pushing and popping occurs: data is just written into to appropriate locations in the stack space.

Calling Convention

All functions (VM opcodes, reentrant and not reentrant functions) are entered in native Z80 mode.

The first 2 arguments of any size (up to 4 bytes) are passed in registers. This also makes a very consistent ABI. Arguments 3++ are passed on the stack (arg#3 beeing the lowest / last added):

	register hl = argument #1
	register de = argument #2

If a byte value is passed, then registers l and e are used only, the high byte is void.

If a dword value is passed, either int32 or float32, then the secondary register set is used to hold he high word.

This appoach for dwords also has some flaws, e.g. it is slightly slower to load and store the 2nd registers into memory, because the value and the pointer are in different sets, which requires some additional 'exx'ing.

A return value is returned in hl (or l or hl+hl'). Some functions return 2 values, e.g. 'div' also returns the remainder. This second return value is returned in de (or e or de+de').


Register Usage

Finally the register assignment looks like this:

	af,af'	general calculations
	bc		always preserved and used for register variables
	bc'		scratch
	de+de'	2nd argument
	hl+hl'	1st argument
	ix		scratch
	iy		VM stack pointer
	sp		VM instruction pointer


Procedures and Functions

As said, access to local variables using index addressing with the iy register is slow. In fact it is so slow, that starting with 2-byte variables accessing a local variable on the real processor stack (in some other system) using hl and doing the arithmetics by hand is faster than using iy. So we want to avoid it if possible.

Therefore this VM makes a distinction between reentrant 'procedures' and not reentrant 'functions':

'procedures' store their local variables and return address on the stack. Slow but this always works.

'functions' store their local variables and return address in static variables. The VM source provides a set of macros to overlap them as far as possible: not reentrant functions can call each others only in a predictable hierarchy, so a function can know the maximum amount of static data used by the functions it calls and place it's variables just below. Then the VM source provides macros to access these local variables with fixed direct addresses, the fastest the Z80 can do.


Multithreading and Interrupts

As may become clear from the above, multithreading or other task switching cannot be done at arbitrary positions and interrupt code cannot execute VM opcodes.

Task switching can only occure in reentrant procedures, or in other words, outside of any not reentrant function. Task switches should be determined in the (timer) interrupt and set a flag, VM code in procedures should regularly poll this flag, e.g. between instuctions, and switch context or call an appropriate event handler. This event handler can be a reentrant procedure, if it itself should be interruptable, or a not interruptable function.


Summary

So the ROP-based VM provides all the building blocks needed for a system with real-time support. As long as use of the stack can be avoided, this should be actually really fast, because jump to the next opcode is so fast – faster than a 'call' + 'return' in machine code itself! And the virtual code is not optimal but pretty compact. And the virtual code is also easily hand-coded, with some practice.


Examples

stack notation:

	( n16 u8 -- n32 )
	-->
		arg1:	n16 = i16 or u16 (int16 or uint16)
		arg2:	u8  = uint8
		result:	n32 = i32 or u32


Example of a VM opcode:

	sub:: 	; ( n16 n16 -- n16 )
		and	a
		sbc	hl,de	; doit
		ret			; next opcode


Example of a non-reentrant function:

	FUNC minmax 	; (i16:a i16:n i16:e -- i16)
		.dw	max		; max(a,n)
		.dw	pop_de	; min(max,e)
		.dw	min
		.return
	ENDF minmax


Example of a reentrant function:

	PROC minmax
		.dw	max		; max(a,n)
		.dw	pop_de	; min(max,e)
		.dw	min
		.return
	ENDP minmax

currently the macros must actually be helped:

	PROC minmax, stacksize		; adding param stacksize
		.dw	max
		.dw	pop_de
		.dw	min
		.return
	ENDP minmax