Details: Published: 14 January 2014

Introduction

newRPL has a complete modular framework for integrating new RPL commands and objects into an existing core. Each individual module is called a "library" for historical reasons. A newRPL library is a self-contained C module that responds to object actions and executes RPL commands issued by the user.

A library/module works very closely with the RPL execution core, so there are several concepts that need to be reviewed before going into detail on the libraries themselves.

Data types

newRPL defines a few data types for convenience and uniformity in the code:

WORD = An unsigned 32-bit number

BINT = A signed 32-bit number

WORDPTR = A pointer to a WORD (pointers can be 32 or 64-bit depending on architecture).

BINT64 = A signed 64-bit number

BYTE = An unsigned 8-bit number
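
As a minimal sketch, assuming a C11 compiler with <stdint.h>, the types above could be declared as follows. The mapping to fixed-width standard types is an assumption made for illustration and is not taken from the newRPL headers.

```c
#include <stdint.h>

/* Illustrative typedefs only; the actual newRPL headers may differ. */
typedef uint32_t WORD;     /* unsigned 32-bit number */
typedef int32_t  BINT;     /* signed 32-bit number */
typedef WORD    *WORDPTR;  /* pointer to a WORD (32 or 64 bits wide) */
typedef int64_t  BINT64;   /* signed 64-bit number */
typedef uint8_t  BYTE;     /* unsigned 8-bit number */

/* Compile-time checks that the widths match the descriptions above. */
_Static_assert(sizeof(WORD) == 4, "WORD must be 32 bits");
_Static_assert(sizeof(BINT) == 4, "BINT must be 32 bits");
_Static_assert(sizeof(BINT64) == 8, "BINT64 must be 64 bits");
_Static_assert(sizeof(BYTE) == 1, "BYTE must be 8 bits");
```

Only the size of WORDPTR depends on the target architecture; the other types keep the same width everywhere.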

System Pointers

The RPL core needs to keep the state of execution, and it does so in several system pointers. Some of these pointers are protected during a Garbage Collection, and some others are not (usually, because they don't need to be).

The system pointers will be described together with the memory areas they provide access to.

Memory areas

The system has several different memory spaces to store different elements. The following memory areas are utilized by the environment:

ROM: This is the memory area where the system resides. It is read-only and not accessible to the user. However, the ROM might contain useful objects that can be directly referenced from other areas.
Data Stack: This memory area stores consecutive 32-bit words, each word being a pointer to an object. The size of this memory area is variable, and as such the Hardware abstraction layer must provide a mechanism to grow/reduce the stack during runtime. The first element is the bottom of the stack, and the last element is the top, where most operations occur.

DStk = (WORDPTR *) Pointer to the entire Data Stack memory region.

DSTop = (WORDPTR *) Pointer to the Top of the stack. The stack works in "Increase after write" or "Decrease before read", so this pointer is actually pointing to an unused pointer, immediately after level 1.
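
The "Increase after write" and "Decrease before read" convention can be sketched as below. The fixed-size storage and the helper names sk_push, sk_pop and sk_depth are hypothetical; the real system grows the stack through the hardware abstraction layer and exposes its own API.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t WORD;
typedef WORD *WORDPTR;

#define SK_STACK_SLOTS 16          /* hypothetical fixed size for the sketch */

WORDPTR  sk_storage[SK_STACK_SLOTS];
WORDPTR *DStk  = sk_storage;       /* base of the data stack region */
WORDPTR *DSTop = sk_storage;       /* first unused slot, just past level 1 */

/* "Increase after write": store at DSTop, then advance it. */
int sk_push(WORDPTR obj)
{
    if (DSTop - DStk >= SK_STACK_SLOTS) return 0;  /* a real system would grow here */
    *DSTop++ = obj;
    return 1;
}

/* "Decrease before read": step back, then read the slot. */
WORDPTR sk_pop(void)
{
    if (DSTop == DStk) return NULL;   /* empty stack */
    return *--DSTop;
}

/* Number of object pointers currently on the stack. */
ptrdiff_t sk_depth(void)
{
    return DSTop - DStk;
}
```

With this convention, level 1 is always found at DSTop[-1], and an empty stack is simply DSTop == DStk.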

Return stack: This memory area stores consecutive 32-bit words, each word being a pointer to code within a program. The size of this memory area is variable, and as such the Hardware abstraction layer must provide a mechanism to grow/reduce the stack during runtime. The first element is the bottom of the stack, and the last element is the top, where most operations occur.

RStk = (WORDPTR *) Pointer to the entire Return Stack memory region.

RSTop = (WORDPTR *) Pointer to the Top of the stack. The stack works in "Increase after write" or "Decrease before read", so this pointer is actually pointing to an unused pointer, immediately after level 1.

Temporary Objects (TempOb): This area stores consecutive objects. For all purposes in this part of the specification, an object is defined as a block of memory with arbitrary contents. The size and type of objects will be defined later, and are of no concern to the running environment.

TempOb = (WORDPTR *) Pointer to the entire Object memory region.

TempObEnd = (WORDPTR *) Pointer to the end of the region. This pointer is actually pointing to an address immediately after the last used 32-bit word.

Temporary Named Variables stack: This area stores consecutive pairs of 32-bit pointers. In each pair, the first pointer is a reference to the name of the variable, represented by an object, and the second is a reference to an object representing the value of the variable. This area is used to create temporary variables during program execution.
Permanent Storage directories: This area stores consecutive pairs of 32-bit pointers. In each pair, the first pointer is a reference to the name of the variable, represented by an object, and the second is a reference to an object representing the value of the variable. This area is used to permanently store named objects.

Libraries

The code and objects that comprise the running environment are encapsulated in individual libraries. The library is the fundamental component of the running environment. Each library will communicate with the environment through a set of API functions (for example to push/pop objects on the stack) and system variables that the environment provides for the libraries.

Libraries are responsible for:

Defining object types and their behavior
Defining commands that use the object types
Executing those commands
Compiling text into the objects/commands they define
Decompiling objects/commands back to text

Each library has a single entry point, called the handler, which is a function that will receive words from the execution environment and will have to perform an action according to the value in that word. The environment calls this handler every time it needs to compile/decompile a token, and to execute words.

A library is identified by a number. Libraries can be numbered from 0 to 4095, with the lower numbers (0-255) reserved for use by the system, and all other numbers available for user extensions. These libraries are written in C, and are not to be confused with RPL libraries, which are objects containing RPL code.

The system operates such that libraries with a higher number have priority over lower numbers. For example, if two libraries define a command with the same name, the one with the higher number will receive the token for compilation first, overriding the other command.

The execution environment

The execution environment is the core of the system, responsible for:

Allocating/releasing memory for all the memory areas above, and exposing an API for libraries to use.
Garbage collection of unused objects in the various memory areas.
Compiling text by splitting the code in tokens, which are then fed to the libraries for compilation. The system must manage the resulting code from multiple libraries.
Managing the decompilation process, feeding commands and objects to libraries and processing the final result.
Executing code. The system delegates the bulk of execution to the libraries, but it has to provide basic operations on the runstream for libraries to use.

Executing RPL code

Code in this environment is defined as a sequence of 32-bit words and data words. The system executes words one after another by passing the word to the libraries for execution. Each word has the number of the library that compiled it embedded in it.

The basic execution loop is:

1. Read a 32-bit word from the Instruction Pointer (this is a system variable).
2. Extract the library number from the word.
3. Call the library handler function, passing the word.
4. When the library returns, check for exceptions/errors and proceed accordingly (end execution loop).
5. Update the IP by skipping the current word and payload data, if applicable.
6. Loop back to 1.

We have so far defined that every word must have the library number embedded in it. Also, step 5 needs to know the amount of payload data in order to skip it when data is embedded in the runstream. All kinds of arbitrary data can be embedded in the runstream. The execution environment does not manipulate the content of the data in any way. Libraries use the data to define objects and operate on them.

The 32-bit word in the execution stream will therefore be defined as:

Bits 20-31 (12 bits): Library number
Bit 19 (1 bit): 0 = There's no additional data, 1 = There's a payload of data following this word
Bits 0-18 (19 bits):
- When bit 19 is zero, these bits are reserved for use within the library and are preserved by the system but not used at all. Libraries will typically use them to define opcodes or command numbers to properly direct execution. Some opcodes are pre-defined by the system to have consistency between libraries.
- When bit 19 is one, bits 0-17 (the lower 18 bits) hold the size of the payload, and bit 18 indicates whether that size is expressed in 32-bit words or in kilobytes. In other words, the 18-bit count is the number of 32-bit words or 1024-byte blocks that follow this word in the runstream and are data, not words to be executed.
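
The bit layout above can be sketched with a few helpers. All names below (SK_LIBNUM, sk_make_command and so on) are hypothetical and are not the macros shipped with newRPL. The text does not say which value of bit 18 selects which unit, so the sketch only exposes that bit without interpreting it.

```c
#include <stdint.h>

typedef uint32_t WORD;

/* Hypothetical helper names; the layout follows the bit assignments above. */
#define SK_LIBNUM(w)       (((WORD)(w) >> 20) & 0xFFFu)   /* bits 20-31 */
#define SK_HAS_PAYLOAD(w)  (((WORD)(w) >> 19) & 1u)       /* bit 19 */
#define SK_LOW19(w)        ((WORD)(w) & 0x7FFFFu)         /* bits 0-18 */
#define SK_SIZE_UNIT(w)    (((WORD)(w) >> 18) & 1u)       /* bit 18, prologs only */
#define SK_SIZE_COUNT(w)   ((WORD)(w) & 0x3FFFFu)         /* bits 0-17, prologs only */

/* Build a command word: library number plus a 19-bit opcode, bit 19 clear. */
WORD sk_make_command(WORD lib, WORD opcode)
{
    return ((lib & 0xFFFu) << 20) | (opcode & 0x7FFFFu);
}

/* Build a prolog word: library number, bit 19 set, and an 18-bit block count. */
WORD sk_make_prolog(WORD lib, WORD count)
{
    return ((lib & 0xFFFu) << 20) | (1u << 19) | (count & 0x3FFFFu);
}
```

For example, under these assumptions sk_make_command(1234, 1) yields 0x4D200001, since 1234 is 0x4D2 and occupies the top 12 bits.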

A typical runstream example could be:

CMD1 CMD2 PROLOG1 OBJECT1..n CMD3 ...

where CMDn are 32-bit words with bit 19 set to zero (commands), then PROLOG1 is a 32-bit word with bit 19 set to 1, to indicate there's a payload. OBJECT1..n are 'n' words that contain the data of the object.

Execution of the above example will first get the CMD1 word, call its library for execution, and then increment the IP by 1, since this word was a command. This leaves the IP pointing to CMD2, which is executed in the same way, and then PROLOG1 is read and executed. When the word is passed to the library for execution, the library will typically need to access the object data, which is readily available at (IP+1).

This is an important definition: When the library handler function is called IP points to the prolog of the object or command word being executed.

In this example, the library will most typically push the object onto the data stack (by pushing a copy of the IP to the data stack) and return. Then the main execution loop will check bit 19 of PROLOG1, see that it is set to one, extract the number of payload words 'n' from it, and advance IP past the prolog and its n payload words, so that IP now points to CMD3 and execution continues.

Another important definition: The main loop passes the current word to the library, and the library passes the current word back to the loop (possibly modified). The main loop updates the IP pointer based on the word received from the library, NOT the original word.
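
The loop described above can be sketched as follows. This is a simplified illustration, not the newRPL core: the handler table sk_libs, the function sk_run, the error code SK_EX_NOLIB and the example handler are hypothetical, the word is assumed to travel to and from the handler in the global CurOpcode, and payload sizes are assumed to be counted in 32-bit words.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t WORD;
typedef WORD *WORDPTR;
typedef void (*sk_handler)(void);   /* a handler takes and returns nothing */

#define SK_LIBNUM(w)      (((w) >> 20) & 0xFFFu)
#define SK_HAS_PAYLOAD(w) (((w) >> 19) & 1u)
#define SK_PAYLOAD(w)     ((w) & 0x3FFFFu)  /* assumes size counted in words */
#define SK_EX_NOLIB       1u                /* hypothetical error code */

WORDPTR    IPtr;              /* instruction pointer */
WORD       CurOpcode;         /* word handed to, and left by, the handler */
WORD       Exception;         /* nonzero once an error has been raised */
sk_handler sk_libs[4096];     /* hypothetical table: one handler per library */

/* Run the words in [start, end); returns 0 or the exception code. */
WORD sk_run(WORDPTR start, WORDPTR end)
{
    IPtr = start;
    Exception = 0;
    while (IPtr < end) {
        CurOpcode = *IPtr;                             /* 1. read the word */
        sk_handler h = sk_libs[SK_LIBNUM(CurOpcode)];  /* 2. library number */
        if (h == NULL) return Exception = SK_EX_NOLIB;
        h();                                           /* 3. call the handler */
        if (Exception) return Exception;               /* 4. check for errors */
        IPtr += 1;                                     /* 5. skip word... */
        if (SK_HAS_PAYLOAD(CurOpcode))                 /*    ...and payload, using */
            IPtr += SK_PAYLOAD(CurOpcode);             /*    the returned word */
    }                                                  /* 6. loop */
    return 0;
}

/* Example handler for a hypothetical library 300: counts commands and
   remembers the prolog address of the last object it was asked to execute. */
int     sk_commands_seen;
WORDPTR sk_last_object;

void sk_lib300(void)
{
    if (SK_HAS_PAYLOAD(CurOpcode)) sk_last_object = IPtr;  /* stands in for a push */
    else sk_commands_seen++;
}
```

Because step 5 reads CurOpcode after the handler has returned, a handler that rewrites CurOpcode changes how far the loop advances.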

This allows libraries to alter the execution stream. For example, a library defines a command MYIF that works in this way:

true/false MYIF execute-only-if-true ENDIF ...

The library executing MYIF will take an object from the data stack, and if it's TRUE then returns, so the main loop executes the following word, but if it's FALSE, then it can prepare a new 32-bit word with bit 19 set to one and treat the following code as a "data payload" that the loop will skip, so execution will resume after ENDIF. Alternatively, the library could set the IP pointing to ENDIF, and pass the ENDIF word back, so the main loop will simply "skip" the ENDIF word, resuming execution after ENDIF.

This is the main vehicle for the libraries to manipulate the runstream: modifying IP directly, and modifying the current word.
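
A minimal sketch of the first MYIF strategy (returning a modified word) is shown below. The library number, the opcode values and the helper names are hypothetical. The search for ENDIF is deliberately naive: it assumes ENDIF is present and ignores nested MYIF blocks and embedded object data that might happen to look like ENDIF.

```c
#include <stdint.h>

typedef uint32_t WORD;
typedef WORD *WORDPTR;

#define SK_LIB          300u                     /* hypothetical library number */
#define SK_OP_MYIF      ((SK_LIB << 20) | 1u)    /* hypothetical opcodes */
#define SK_OP_ENDIF     ((SK_LIB << 20) | 2u)
#define SK_PAYLOAD_BIT  (1u << 19)

/* Returns the word MYIF hands back to the main loop.
   ip points at the MYIF word; condition is the value taken from the stack. */
WORD sk_myif(WORDPTR ip, int condition)
{
    if (condition) return *ip;     /* unchanged: the loop runs the next word */

    /* Count the words after MYIF up to and including ENDIF. */
    WORD skip = 1;
    while (ip[skip] != SK_OP_ENDIF) skip++;

    /* Present MYIF as a prolog whose payload is the guarded code plus ENDIF. */
    return (SK_LIB << 20) | SK_PAYLOAD_BIT | skip;
}

/* Step 5 of the main loop: advance past the returned word and its payload. */
WORDPTR sk_next_ip(WORDPTR ip, WORD returned)
{
    ip += 1;
    if (returned & SK_PAYLOAD_BIT) ip += returned & 0x3FFFFu;
    return ip;
}
```

For the runstream MYIF A B ENDIF C, a true condition leaves IP on A, while a false condition returns a word with a payload of 3, so the loop lands on C.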

Inside libraries

A library has a number associated with it (in the range 0-4095). The library number determines the order in which libraries will receive tokens to compile (from higher number to lower number). This allows a developer to "override" or "overload" a command by simply defining a command of the same name within a library.

For example, a library could implement a faster factorial algorithm than the one included in the base system, and could define a new command "!". As long as the new library has a number higher than the original factorial, it will receive the "!" token first, which will compile to its own opcode.

This means that after the new library is installed in the system, any program compiled will be using the new algorithm. Preexisting programs will still use the original one and will function normally.
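
The priority rule can be sketched as a search from the highest library number down. The table sk_compilers, the callback type and the function names are hypothetical and only illustrate the ordering; the real compiler talks to libraries through their handlers.

```c
#include <stddef.h>
#include <string.h>

#define SK_MAX_LIBS 4096

/* Hypothetical per-library hook: returns 1 if the library accepts the token. */
typedef int (*sk_probe)(const char *token, size_t len);

sk_probe sk_compilers[SK_MAX_LIBS];   /* indexed by library number */

/* Offer the token to libraries from the highest number down.
   Returns the number of the first library that accepts it, or -1. */
int sk_find_compiler(const char *token, size_t len)
{
    for (int lib = SK_MAX_LIBS - 1; lib >= 0; lib--) {
        if (sk_compilers[lib] != NULL && sk_compilers[lib](token, len))
            return lib;
    }
    return -1;
}

/* Both the original and the replacement library recognize "!". */
int sk_accepts_factorial(const char *token, size_t len)
{
    return len == 1 && memcmp(token, "!", 1) == 0;
}
```

If the same hook is registered at a system library number and again at a higher user number, sk_find_compiler returns the higher one, which is why newly compiled programs pick up the replacement while already compiled words keep the library number embedded in them.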

A library has a single function called "handler", which is of type LIBHANDLER. It does not take arguments and does not return a value.

The execution loop has a system variable called CurOpcode (among many others that will be detailed later), holding the 32-bit word with the bytecode to execute.

The library handler has to perform different tasks according to this opcode:

CurOpcode contains bytecode with a payload (bit 19=1): This happens when an object embedded in the runstream is "executed". Typically a library will push a pointer to the object on the stack, but nothing prevents more sophisticated behavior (for example, the "::" and ";" markers are defined to create a secondary that is executed immediately, rather than pushed on the stack like << >> secondaries).
CurOpcode contains no payload, and the opcode is one of the OVR_... pre-defined opcodes. These opcodes correspond to overloadable operators. If a library defines an object, it is mandatory that the library process all overloadable opcodes, even if the library does not implement them (in that case, it is the library's responsibility to issue a "Bad Argument Type" exception for any unimplemented opcode). If a library does not define an object, then these opcodes can be reused for other purposes, though this is not recommended, to keep the code clean.
CurOpcode contains no payload, and the opcode is one of the system defined opcodes. These are opcodes to Compile / Decompile and display objects. It is mandatory for all libraries to process and respond to these opcodes accordingly.

A typical library handler that responds to a single command called MYCMD would look like:

void sample_handler()
{
    // If the current opcode is the prolog of an object...
    if(ISPROLOG(CurOpcode)) {
        // ... issue an error, since this example library only defines a command,
        // not objects with a data payload

        // Raise a "Bad Opcode" exception
        Exception=EX_BADOPCODE;
        // Indicate which instruction caused the error (given by the RPL Instruction Pointer, IPtr)
        ExceptionPointer=IPtr;
        return;
    }

    switch(OPCODE(CurOpcode))
    {
    case MYCMD:
        ...

    // Also process all mandatory opcodes
    case ....
        ...
    default:
        // Raise a "Bad Opcode" exception
        Exception=EX_BADOPCODE;
        // Indicate which instruction caused the error (pointed to by the RPL Instruction Pointer, IPtr)
        ExceptionPointer=IPtr;
        return;
    }
}

The first thing the handler does is to look at CurOpcode. In this example, we have no plans to define a new object type, with data payload, so if CurOpcode has any payload (in other words, is a prolog of an object, hence the ISPROLOG() macro was used), we simply raise an error and return quickly.

After that, the switch statement gets ready to process the opcodes. The OPCODE() macro helps isolate the opcode number, removing the library number and the payload bit from CurOpcode. The first case statement is the command we are defining, the sole purpose of this example. However, the handler also has to respond to certain system-defined opcodes that are mandatory. These will be explained later in detail; for now it suffices to say that they are mandatory and each library has to provide the case statements to handle all of them.

The default case is again processed by raising a "Bad Opcode" error.

All library handlers follow the same basic structure, processing opcodes as the core finds them in the run stream.

System reserved opcodes

The library handler will need to process the following mandatory system opcodes:

Library install/remove
Probe Token
Compile Token
Decompile Object
Display Object

Each system opcode is used by the RPL core to communicate with the library during certain events (installation/removal of the library, etc.), while other non-system opcodes are passed to the library during RPL code execution.

Compile: When the library receives this opcode, it also receives several system variables pointing to the token being compiled. Notice that these are pointers into the text that contains the entire program being compiled. The library should NEVER access any text outside the token that is being analyzed. Also, the token may or may not be followed by a blank space and more tokens, so the library should never assume that there is a null character terminating the string.

TokenStart = (BYTEPTR) Pointer to the start of the token string.

BlankStart = (BYTEPTR) Pointer to the first blank character (or the null-terminator) immediately after the token (this marks the end of the token).

TokenLen = (BINT) Number of characters in the token. It's the number of Unicode Code Points, not the number of bytes in the Token.

NextToken = (BYTEPTR) Pointer to the first character of the next token. It marks the end of blank spaces after the current token.
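
A token comparison that respects these rules could look like the sketch below. The helper sk_token_is is hypothetical, and BYTEPTR is assumed here to be a pointer to BYTE. The byte length is taken from the two pointers rather than from TokenLen, because TokenLen counts code points, not bytes.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef uint8_t BYTE;
typedef const BYTE *sk_cbyteptr;   /* read-only stand-in for BYTEPTR */

/* Returns 1 when the bytes in [token_start, blank_start) spell exactly `name`.
   Only bytes inside the token are read; no terminator is assumed in the text. */
int sk_token_is(sk_cbyteptr token_start, sk_cbyteptr blank_start, const char *name)
{
    size_t token_bytes = (size_t)(blank_start - token_start);
    size_t name_bytes  = strlen(name);   /* name is the library's own C string */
    return token_bytes == name_bytes &&
           memcmp(token_start, name, name_bytes) == 0;
}
```

A library would typically call this with TokenStart and BlankStart and one of its command names, for example sk_token_is(TokenStart, BlankStart, "MYCMD").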

The text passed to the library should never be modified. The library has to treat the text as read-only and generate bytecodes. To generate bytecode, the library can use the following API function:

rplCompileAppend(WORD bytecode);

This function appends the 32-bit word given in bytecode to the current compilation stream. The first word has to be a properly formed opcode, with the library number encoded, and for convenience it can be generated with the following macros:

MK_OPCODE(library_number,opcode)

MK_PROLOG(library_number,object_size)

The first will form a bytecode typically used for commands. The arguments are the library number to encode and the arbitrary 19-bit opcode. For example, a library number 1234 that wants to compile the current token as command number 1 of its own, will call:

rplCompileAppend(MK_OPCODE(1234,1));

And that will suffice to generate a proper bytecode command.

The MK_PROLOG macro is used to generate a bytecode sequence that has a payload attached. Typically this is used for objects, where the first word is called the prolog and the rest is the payload. The arguments are the library number and the number of 32-bit words of payload that will follow.

For example, to encode an object with 2 words of payload (for library 1234), call:

rplCompileAppend(MK_PROLOG(1234,2));
rplCompileAppend(payload_word_1);
rplCompileAppend(payload_word_2);

IMPORTANT: The library must ensure that the number of words in the prolog matches EXACTLY the number of words actually passed to rplCompileAppend, otherwise it creates a malformed object and is likely to crash the execution environment.
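
One way to keep the prolog and the payload in agreement is to emit both from a single helper, so the count is written only once. The sketch below uses stand-ins (SK_MK_PROLOG, sk_compile_append, sk_stream) for the real macro, API function and compilation stream, so that it compiles on its own; their definitions are assumptions.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t WORD;

/* Stand-in for MK_PROLOG, following the bit layout described earlier. */
#define SK_MK_PROLOG(lib, size) \
    ((((WORD)(lib) & 0xFFFu) << 20) | (1u << 19) | ((WORD)(size) & 0x3FFFFu))

#define SK_STREAM_WORDS 64
WORD   sk_stream[SK_STREAM_WORDS];   /* hypothetical compilation stream */
size_t sk_stream_len;

/* Plays the role of rplCompileAppend; silently drops words when full. */
void sk_compile_append(WORD bytecode)
{
    if (sk_stream_len < SK_STREAM_WORDS)
        sk_stream[sk_stream_len++] = bytecode;
}

/* Emit a prolog followed by exactly `count` payload words. */
void sk_compile_object(WORD lib, const WORD *payload, WORD count)
{
    sk_compile_append(SK_MK_PROLOG(lib, count));
    for (WORD i = 0; i < count; i++)
        sk_compile_append(payload[i]);
}
```

Since the same count drives both the prolog and the loop, the size in the prolog cannot drift from the number of words actually appended.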

Whether a library decides to compile an opcode or not, it must communicate this to the compiler. This is done using the global system variable RetNum.

RetNum can take one of several values:

When the library does not recognize a token, it must return with an ERR_NOTMINE error to indicate it has not processed the token. The compiler will then pass the same token to another library.

RetNum = ERR_NOTMINE;

When the library recognizes a token but the token does not belong there (for example, THEN without a previous IF), then it can return ERR_SYNTAX to indicate to the compiler that this token caused an error.

RetNum = ERR_SYNTAX;

When the library recognizes a token and compiles it, it returns OK_CONTINUE. The compiler will then move to the next token.

RetNum = OK_CONTINUE;

When an object includes white spaces in its definition, it may require more than one token to completely define an object. Such is the case for strings, for example. When a library identifies the start of an object, it compiles it and returns an OK_NEEDMORE to the compiler, to instruct the compiler to send all tokens from now on to the same library, using the reserved opcode Compile-continue, described next.

RetNum = OK_NEEDMORE;

When the object being compiled is part of a construct (a construct being a structured sequence of tokens), the OK_STARTCONSTRUCT is used to inform the compiler that the opcode just compiled started a new construct. The compiler does not know anything about the construct, but it records the location where it starts and the opcode that started it. Constructs can be nested.

RetNum = OK_STARTCONSTRUCT;

When the object being compiled is the end of a construct, use OK_ENDCONSTRUCT to inform the compiler that a construct was finished.

RetNum = OK_ENDCONSTRUCT;

When the object being compiled converts one construct into a different one, use OK_CHANGECONSTRUCT to inform the compiler that a construct has morphed into a different state.

RetNum = OK_CHANGECONSTRUCT;
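
The two simplest outcomes, a recognized token and a token that belongs to some other library, can be sketched as below. The numeric values of the result codes, the library number and the variable sk_last_compiled are hypothetical stand-ins; a real library would use the constants from the newRPL headers and call rplCompileAppend. Constructs are left out of this sketch.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef uint32_t WORD;
typedef int32_t  BINT;
typedef uint8_t  BYTE;

/* Hypothetical values; the real constants come from the newRPL headers. */
enum { SK_OK_CONTINUE = 1, SK_ERR_NOTMINE = 2 };

#define SK_LIB    1234u     /* hypothetical library number */
#define SK_MYCMD  1u        /* hypothetical command number */

BINT RetNum;                /* result reported back to the compiler */
WORD sk_last_compiled;      /* stands in for the compilation stream */

/* Compile one token delimited by [token_start, blank_start). */
void sk_compile_token(const BYTE *token_start, const BYTE *blank_start)
{
    size_t len = (size_t)(blank_start - token_start);

    if (len == 5 && memcmp(token_start, "MYCMD", 5) == 0) {
        /* Recognized: emit the command word and let the compiler move on. */
        sk_last_compiled = (SK_LIB << 20) | SK_MYCMD;
        RetNum = SK_OK_CONTINUE;
        return;
    }

    /* Not one of ours: the compiler will offer the token to another library. */
    RetNum = SK_ERR_NOTMINE;
}
```

The function always sets RetNum before returning, so the compiler never sees a stale result from a previous token.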

Special consideration needs to be given to constructs, as they allow libraries to define very complex syntactic arrangements. Let's start with an example of a list construct. This is a construct formed by an opening bracket, a sequence of arbitrary objects and a closing bracket. For example, the list:

{ obj1 obj2 }

is formed with four tokens, which will be processed one by one by the compiler. The first token will be passed on to all libraries, which will return ERR_NOTMINE, until it's given to the library that defines the lists. This library will recognize the opening bracket, compile it as a 32-bit prolog for a list and return with OK_STARTCONSTRUCT. The compiler will then keep a record of where this construct started and the type of construct. The next two tokens will be processed normally by their own libraries, and will be compiled immediately following our list prolog. Finally, the compiler reaches the closing bracket. When the list library receives the token, it verifies that the current construct is the prolog of a list. If it is not, then the library will issue ERR_SYNTAX. This can occur for example if obj1 opens a new construct and doesn't close it, as in the following list:

{ << 1 }

where the second token will trigger an OK_STARTCONSTRUCT that starts a secondary object. The closing bracket will then find that the current construct is a secondary, not a list and will issue the syntax error. If the current construct is a list, then an end list marker is compiled, and the library returns OK_ENDCONSTRUCT. The compiler will automatically detect if the construct was a prolog with payload, and adjust the size of the payload accordingly. If the construct was not an object, then it is simply considered closed, and the previous construct becomes current.
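
The list example can be sketched with a small construct stack that records where each construct starts and patches the prolog size when it closes. Everything here is a simplification under stated assumptions: the library numbers and opcode values are hypothetical, a single function stands in for all libraries, the patched size is assumed to include the end marker, and there are no bounds checks.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t WORD;

enum { SK_OK_CONTINUE, SK_OK_STARTCONSTRUCT, SK_OK_ENDCONSTRUCT, SK_ERR_SYNTAX };

#define SK_PROLOG(lib) (((WORD)(lib) << 20) | (1u << 19))  /* size 0 for now */
#define SK_LIST    SK_PROLOG(500)          /* hypothetical library numbers */
#define SK_SECO    SK_PROLOG(501)
#define SK_ENDLIST ((500u << 20) | 1u)     /* hypothetical end-of-list marker */
#define SK_OBJECT  ((502u << 20) | 7u)     /* placeholder for any other token */

WORD sk_stream[32];  int sk_len;           /* compiled words */
int  sk_starts[8];   int sk_depth;         /* stream index of each open construct */

/* Library side: compile one token and report the outcome. */
int sk_library(const char *tok)
{
    if (!strcmp(tok, "{"))  { sk_stream[sk_len++] = SK_LIST; return SK_OK_STARTCONSTRUCT; }
    if (!strcmp(tok, "<<")) { sk_stream[sk_len++] = SK_SECO; return SK_OK_STARTCONSTRUCT; }
    if (!strcmp(tok, "}")) {
        /* The innermost open construct must be a list prolog. */
        if (sk_depth == 0 || sk_stream[sk_starts[sk_depth - 1]] != SK_LIST)
            return SK_ERR_SYNTAX;
        sk_stream[sk_len++] = SK_ENDLIST;
        return SK_OK_ENDCONSTRUCT;
    }
    sk_stream[sk_len++] = SK_OBJECT;
    return SK_OK_CONTINUE;
}

/* Compiler side: track constructs; returns 1 if the tokens compile cleanly. */
int sk_compile(const char **tokens, int count)
{
    sk_len = sk_depth = 0;
    for (int i = 0; i < count; i++) {
        int before = sk_len;               /* where this token's word will go */
        int ret = sk_library(tokens[i]);
        if (ret == SK_ERR_SYNTAX) return 0;
        if (ret == SK_OK_STARTCONSTRUCT) sk_starts[sk_depth++] = before;
        if (ret == SK_OK_ENDCONSTRUCT) {
            int start = sk_starts[--sk_depth];
            sk_stream[start] |= (WORD)(sk_len - start - 1);  /* words after prolog */
        }
    }
    return sk_depth == 0;                  /* an unclosed construct is an error */
}
```

With these assumptions, { 1 2 } compiles to a list prolog with a payload of 3 words (two objects plus the end marker), while { << 1 } is rejected because the closing bracket finds a secondary as the current construct.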

More complex constructs can change state, for example IF/THEN/ELSE/END. When the following is compiled:

IF obj1 THEN obj2 ELSE obj3 END

the first token IF will return OK_STARTCONSTRUCT. When the library receives the THEN token, it needs to verify that the current construct is indeed an IF statement. If so, then it changes the state of the construct, rather than starting another one, by returning OK_CHANGECONSTRUCT. The compiler now switches the current construct type from an IF to a THEN construct. When it's time to compile the ELSE word, the library checks if the type of the current construct is a THEN, otherwise this would be a misplaced ELSE (and should therefore return ERR_SYNTAX). The current construct is changed again by returning OK_CHANGECONSTRUCT, so the current construct now becomes type ELSE. When the END is compiled, the library checks whether the current construct is a THEN or an ELSE (since ELSE is optional), and finally returns OK_ENDCONSTRUCT to finalize.

Notice that in the example above, obj1, obj2 or obj3 may define their own nested constructs inside. As long as the constructs are properly formed and finalized, the compiler will provide a proper flow. Also notice that the token END is used to end many different flow control constructs, so the decisions made by the library are actually more complex than what was explained in this example.
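
The IF/THEN/ELSE/END state changes can be sketched as a small state machine. The construct kinds, the result-code values and the out-parameter used to report the new construct type are all assumptions made for this illustration; the text does not say how a library passes that type to the compiler, and END is handled here for the IF construct only.

```c
#include <string.h>

enum sk_ret  { SK_OK_CONTINUE, SK_OK_STARTCONSTRUCT, SK_OK_CHANGECONSTRUCT,
               SK_OK_ENDCONSTRUCT, SK_ERR_SYNTAX };
enum sk_kind { SK_NONE, SK_IF, SK_THEN, SK_ELSE };  /* hypothetical construct types */

#define SK_MAX_DEPTH 16
enum sk_kind sk_open[SK_MAX_DEPTH];   /* construct stack kept by the compiler */
int sk_depth;

/* Innermost open construct, or SK_NONE when nothing is open. */
enum sk_kind sk_current(void)
{
    return sk_depth ? sk_open[sk_depth - 1] : SK_NONE;
}

/* Library side: outcome for one token, plus the new construct type if any. */
enum sk_ret sk_library(const char *tok, enum sk_kind *next)
{
    enum sk_kind cur = sk_current();
    if (!strcmp(tok, "IF")) { *next = SK_IF; return SK_OK_STARTCONSTRUCT; }
    if (!strcmp(tok, "THEN")) {
        *next = SK_THEN;
        return cur == SK_IF ? SK_OK_CHANGECONSTRUCT : SK_ERR_SYNTAX;
    }
    if (!strcmp(tok, "ELSE")) {
        *next = SK_ELSE;
        return cur == SK_THEN ? SK_OK_CHANGECONSTRUCT : SK_ERR_SYNTAX;
    }
    if (!strcmp(tok, "END"))
        return (cur == SK_THEN || cur == SK_ELSE) ? SK_OK_ENDCONSTRUCT : SK_ERR_SYNTAX;
    return SK_OK_CONTINUE;            /* any other token: an ordinary object */
}

/* Compiler side: apply the outcomes; returns 1 if the sequence is well formed. */
int sk_check(const char **tokens, int count)
{
    sk_depth = 0;
    for (int i = 0; i < count; i++) {
        enum sk_kind next = SK_NONE;
        switch (sk_library(tokens[i], &next)) {
        case SK_ERR_SYNTAX:
            return 0;
        case SK_OK_STARTCONSTRUCT:
            if (sk_depth == SK_MAX_DEPTH) return 0;   /* sketch limit reached */
            sk_open[sk_depth++] = next;
            break;
        case SK_OK_CHANGECONSTRUCT:
            sk_open[sk_depth - 1] = next;             /* same slot, new type */
            break;
        case SK_OK_ENDCONSTRUCT:
            sk_depth--;                               /* previous becomes current */
            break;
        case SK_OK_CONTINUE:
            break;
        }
    }
    return sk_depth == 0;
}
```

Nested constructs work because each IF pushes a new entry, THEN and ELSE only rewrite the top entry, and END pops it, which makes the enclosing construct current again.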

Library Developer Guide - Part I