Pointers in CentralData: ------------------------ | ptr heap_end; | ↑ | ptr heap_top; // allocation pointer | ↑ // used and formerly used chunks | ptr heap_base; Memory Block Header: -------------------- All allocated blocks are aligned to natural_alignment. Each block is prefixed by a MemHead block descriptor. MemHead is at least int32 but on 64bit systems it is int64. Within the MemHead are certain bitfields: MemHead32: %gaoxxxxx.xxxxxxxx.ssssssss.ssssssss %000sssss.ssssssss.ssssssss.ssssssss flat array or object %001sssss.ssssssss.ssssssss.ssssssss nested array %011ooooo.oooooooo.ssssssss.ssssssss object MemHead64: %gao00000.00000000.oooooooo.oooooooo : ssss… g = bit used by garbage collector a = 0 = array, 1 = object o = 0 = flat, 1 = nested ooo = object ID to look up it's members by the gc. sss = size measured in bytes. Arrays larger than 32bit are not possible because ranges only support Int indexes. Strings have no mandatory terminating 0-byte, but the alignment bytes are null bytes! Stacks & Variables: ------------------- The garbage collector must find all used blocks. Therefore all pointers to objects in the heap are stored in a special region: | ptr stack_end; | ↑ | ptr ssp; // string stack pointer | ↑ // local variables and arguments | int num_gsvars; // num global string variables | ↑ // global string variables | ptr gvars; // base for gvar addressing | ↓ // global normal global and local variables | … The virtual engine holds the pointers `ssp` and `gvars` in registers. It also has a `tops` register for the top value on the ssp, but this is "internal" and not important here. Memory Allocation & Deallocation: --------------------------------- Memory allocation is done by a variant of the function `alloc()`. Some `alloc()` returns null if the request cannot be fulfilled others immediately call `compact()`. Deallocation is implicitely done when a variable on the ssp goes out of scope and the ssp is decremented. In addition pointers may be explicitely set to null. Rules for Pointers on the `ssp`: -------------------------------- All pointers to objects in the heap must be stored on the ssp. All pointers into the heap must point to the start of the object. (behind the MemHead) Pointers on the ssp may point to objects outside the heap. Objects outside the heap must obey the rules for objects in the heap: - alignment - preceded by a MemHead Objects outside the heap will be skipped by the gc because their gc bit was never set. Therefore all subobjects must also be outside the heap, because they are never visited by the gc. Pointers into objects are only allowed for reading or writing with no additional calls in between: - calculate the address - peek it or - calculate the value - calculate the address - poke it Garbage Collection ------------------ The function to do the garbage collection is called `compact()`. When it is called the ssp must be valid and the ssp and tops registers of the VM are void. It functions as follows: 1: walk all chunks and set the gc bit 2: walk all variables from `gvars` to `ssp` and recursively in arrays and objects and reset the gc bit. when a gc bit is already clear then skip it: the object has several referrers. 3: find the first 8 (16? 32?) consecutive gaps with unused chunks and construct a table with start address and distance to move it down. 4: walk all variables and walk all chunks (not recursively) and adjust the pointers acc. to the table for the forthcoming move 5: move the first 8 (16? 32?) consecutive blocks of used chunks down to close the gaps 6: repeat step 3, 4 and 5 until all gaps are closed 7: clear the reclaimed memory to zero. Objects, Stings, Arrays: ------------------------ Objects are stored one per chunk in the heap. Data members are accessed using the `IVAR` opcode family. These opcodes use an offset measured in bytes: - IVARx - IGETx - ISETx Arrays are stored one dimension per chunk. Multidimensional arrays have nested subarrays. The compiler may choose whether to fully preallocate them or initially store null pointers. Array elements are accessed using the `ATI` opcode family: - ATIx - ATIGETx - ATISETx Global variables are accessed using the `GVAR` opcode falmily. Variables in the heap are stored at a positive offset, normal variables at a negative offset. These opcodes use an offset measured in bytes: - GVARx - GGETx - GSETx Local variables are accessed using the `LVAR` opcode family. Variables in the heap are stored on the ssp which starts at gvars and extends upwards, so offsets to access them are negative. Normal variables are stored on the sp which starts at gvars and extends downwards, so offsets to access them are positive. Local variables are accessed using the sp and ssp which move with every value pushed and popped! These opcodes use an offset measured in bytes: - LVARx - LGETx - LSETx Ranges and Strings: ------------------- References to single items in an array or object are only allowed for immediate reading or writing. Sub ranges of an array are often useful esp. in string operations. Here you have the choice whether a function which returns a substring from a string returns a newly allocated string on the heap or a `range` within the original string. A `range` consists of a pointer on the `ssp` and a `start` and a `end` index combined into a single `Long` on the normal stack `sp`. The start and and index are measured in bytes (as opposed to indexes used in the ATI opcode family) and are verified to be ordered and to be within the array size. Empty arrays always point to a static array and set start = end = 0, except for ranges over a nullptr array which point to null too. This complies with the requirements of the GC, avoids unneccessary allocations on the heap and is faster. The disadvantage is, that you need functions which accept ranges, often doubling variants for plain strings. Ranges are also useful for assigning to ranges of elements in a targeted array. A special case is where `Int == Long` on a 32bit machine: The VM must provide, just for this purpose, an auxiliary top extension register. The `start` and `end` of the range are compacted into a single `Long` because this enables normal functions to return a range: otherwise the `end` value would be on the stack which requires a whole new set of `RETURN` opcodes to handle this... Strings can be created by 1: allocating the string in the heap and copying the data 2: only pushing a pointer to a const string on the ssp 3: pushing a `range` which contains the character array on the ssp and sp 1) This is the simplest yet the slowest method. It also leads to frequent invocations of the GC! The string can easily follow the opcode without any further requirements. 2) The string must either follow directly or is stored somewhere else in the code. In either case it must conform to the heap requirements: alignment and preceded by a 4-byte (8-byte) MemHead. The alignment requirement is esp. ugly for inline strings. 3) The string must either follow directly or is stored somewhere else in the code. It needs not to be aligned or prefixed with a MemHead. Instead it can use a `universal MemHead` which could be located at the start of the code segment. Most range functions don't care about the MemHead, so addresses could be anywhere, but the GC requires valid MemHeads pointed to by pointers on the ssp. In any case it is probably a good idea to always add a null character at the end of the string. (not accounted for, just present!) C-Strings: ---------- C-style strings are used when calling outside functions. These expect possibly temporary strings and return possibly temporary strings. These are allocated in `tempmem` using function `tempstr()`. If a cstring was created explicitely by the program, then it may require a new allocation on the heap to add the final null byte. So the cstring pointers need to be on the ssp forcing all other cstrings outside the heap to comply with the heap rules. Therefore the cstring must be created by the wrapper function: Then the cstring can always use the string or range directly, provided there happens to be a null byte right at the end, which is often the case and can also be enforced by the IVALstr opcode if it always appends a (not accounted for) null byte to the string literal. Otherwise the cstring should be created in tempmem, though it could also be allocated on the heap. If the external function returns a string, it must be copied into the heap if the wrapper function returns a `string`. If it returns a `range`, then wrapping the cstring itself in a range may work: Then the `RANGE2STR` opcode must be prevented from even trying to use the underlying string which it does if start and end enclose the whole string. This is easily accomplished by using the `universal MemHead` and offsetting start and end accordingly if this is possible. On a 64bit system the distance may be too large or negative. Then the wrapper function must copy the string into the heap, no matter what. Evtl. it helps if we limit the size of arrays to 2GB (31 bit) on 64bit machines. On 32bit machines it is 512MB (29 bit) anyway. Then a range could also use negative offsets. (which is officially illegal) Using the cstring directly is also a little bit dangerous if the range hangs around too long because we don't know how long it remains valid. A tempstr or a static buffer can be overwritten by ongoing activity.