So what is this YASEP thing ?

 

In the beginning, YASEP was called "the VSP", which stood for "Very Simple Processor", but "Very Silly Processor" fits well too. It has since been renamed YASEP ("Yet Another Small Embedded Processor"), yet some original documents keep the old name for historical reasons.

YASEP is a microcontroller core with 16/32-bit instructions and 16×32-bit registers, with emphasis on simplicity, small size and memory bandwidth. The initial target performance lies between a Microchip PIC18F and an MMU-less ARM.

It is also an experimental core that uses very unusual techniques. The YASEP introduces and tests new methods in many domains : ISA design methodology, architecture, software development environment...

Because it is a microcontroller, it has no need for sophisticated instructions such as multiply/divide. The YASEP's intended job is (hardware and environment) management : it moves data around a SoC, keeping other specialised IP blocks fed with data, answering user requests through a keyboard/LCD screen interface, and overseeing the system's health (power management and system configuration / hotswap).

So it's an embedded core. Target speed is in the range of 10 million instructions per second with an old SiO2 process. Every instruction is single-cycle and not pipelined. It's rather simple, until you look at the memory interface, which (like F-CPU's FC0) plays a critical role in the system's efficiency.

 

Why simple ?

When there is no need for horsepower, implementing complex stuff is a waste of resources, time, money, effort, silicon, energy...

So it naturally takes the barebone RISC route, with a very clean instruction format, but it further simplifies this by merging some of the instruction fetch mechanism into the data fetch machinery (or is it the reverse ???). Both are software controlled and this spares more instructions.

This comes at some price : it is furiously uncommon and might create new issues when programming, such as register usage through function calls. But you don't want to use C with YASEP (unless the compiler is designed specifically for this architecture).

However, the design shines with its simplicity and low resource consumption, as you can see from the draft below. It has evolved a bit since, but not radically.

The YASEP's initial idea and structure

 

Why do this ?

Because i needed something like this at one time (around the end of 2002). Well, now i don't need it anymore for a "real industrial project" but the idea sounded so coooool and there was no reason to stop, even if the original project was cancelled.

And because F-CPU development is almost stalled (huh .....). It is a good way to develop something fun and helpful later for FC0, which has many many implementation problems. I'll probably be able to solve them once i've put YASEP together.

 

Why bother ?

Like other projects, you may be interested because it is fun, it is very instructive and contributes to the Free Hardware movement.

Another better reason is that it is weird and you may want to scratch itches here and there. Help yourself.

A still better reason : you may need something like that. ARM's license costs may annoy you, and other "free" cores don't suit your needs. In fact, most fall in the same category of 16/32-bit small processors for low-power and low-resource applications. But YASEP has an edge when it comes to memory bandwidth.

Finally, because i will reinvest all the efforts from YASEP into F-CPU/FC0, you may want to see how F-CPU works (well, roughly) by examining a small-scale loosely related core.

 

What it is not

The YASEP is NOT a workstation processor meant to run UNIX (or whatever). It is NOT meant to have protected or paged memory and it is specifically designed to NOT run fast. You are even discouraged from making it fast or running *nix on it. If you want horsepower, try F-CPU (well.... when it's ready).

 

So what about high-level language support ?

If you're brave enough, you can attempt to create a C compiler for YASEP. Everyone has his very own vices. If you want to lose time for nothing, port GCC. Learning C compiler structures would be faster, though !

YASEP's architecture does not suit C or GCC well. In fact it is designed to be programmed in assembly language, which is quite simple thanks to a barebone RISC background. Hey, it's a microcontroller, after all.

If you're feeling fit and lucky, you'll certainly bounce against weird problems like pointer comparisons (mask the 4 MSB !) or PFQ optimisations...

However one side effect of YASEP's structure is that it is quite easy to use stack-based languages, particularly FORTH, and maybe JAVA. Porting FORTH is probably the easiest entry point and the best suited environment for such a microcontroller. Everything written mostly from scratch, much room for improvement, no bloat, flat access to all resources...

Another "solution" is the GNL project that i have also restarted at http://f-cpu.seul.org/whygee/gnl/, and i will certainly "link" the online architecture simulator with the GNL code generation backend. Some early interface experiments are available here.

 

What is the link with F-CPU ?

The YASEP reuses some parts and concepts developed for FC0. It also inherits some common ideas developed in the 90's for my personal projects.

The instruction set is very orthogonal, making decoding particularly straightforward. Several opcodes may seem redundant or congruent, but eliminating them would make the decoder too complex.

More than FC0, instructions are so simple that they all take the same amount of time to complete, so there is no scheduling problem. The possible points of stall are only in the memory interface and the buffers (instruction or data not ready). Furthermore, because the core is not pipelined, invalid instructions are detected and handled without needing specific synchronisation circuitry.

The same separation of the configuration space (the Special Registers and the get/put instructions) from the memory space makes YASEP and FC0 close cousins.

The YASEP reuses the same development methodology and tools as F-CPU, not being targeted at a specific industrial process. Being smaller, YASEP will fit in a cheaper FPGA, though ;-)

However, the YASEP is not designed to be scalable or fast, or to run Linux. It will always remain 32-bit, running an ad hoc monitoring program at around 10MHz. Heavy processing tasks are handled by adapted coprocessors.

 

Architectural description of the YASEP execution core

 

Please note that I make a distinction between the execution core and the whole YASEP "IP core" because the current analysis (as of 2007) lacks the memory interface and controller, which is quite complex. The execution core runs almost separately from the memory controller thanks to small cache buffers that are not yet designed.

 

The instruction format

A look at the instruction set of a computer tells a lot about its structure, features and inherent complexity.

YASEP was designed for economy and code compactness so a mixed 16-bit and 32-bit format was chosen. Well, in fact, "compactness" does not matter that much, and extreme code compression would make the instructions too complex and difficult to decode with as few logic gates as possible, which goes against the primary requirement of economy. Memory is dirt cheap today (but not access time or bandwidth), and the bandwidth/speed ratio is still comfortable. So simplicity (and orthogonality) is more important than feature creep.

There are no 16/32-bit "modes", only a single bit in the instruction that tells the decoder how to behave. And even then, the instruction is almost identical. The only difference between 16-bit and 32-bit forms is the 16-bit immediate field (often called Imm16). The rest of the instruction fits in 16 bits, including the 16/32-bit flag.

The instructions generally consist of the "immediate flag", followed by a 7-bit opcode and two 4-bit register addresses. So in the 16-bit form, the core is a "2-address machine", where one address points to both one source and the destination (think x86...). In the presence of the "immediate flag", a 3rd immediate operand is provided so the 2 register addresses point to one source and one destination.
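As a rough illustration of the field layout described above, here is a minimal Python sketch of a decoder. The text only gives the field widths and their order, so the exact bit positions used here (flag in bit 15, opcode in bits 14..8, register addresses in bits 7..4 and 3..0) are an assumption, and the real YASEP encoding may differ. The names `Decoded`, `decode`, `reg_a` and `reg_b` are hypothetical.

```python
from typing import NamedTuple, Optional


class Decoded(NamedTuple):
    long_form: bool        # True when the immediate flag is set (32-bit form)
    opcode: int            # 7-bit opcode
    reg_a: int             # first 4-bit register address
    reg_b: int             # second 4-bit register address
    imm16: Optional[int]   # 16-bit immediate, only present in the long form


def decode(halfword: int, next_halfword: int = 0) -> Decoded:
    """Split one 16-bit instruction word into its fields.

    Assumed layout (not confirmed by the source):
    bit 15 = immediate flag, bits 14..8 = opcode,
    bits 7..4 and 3..0 = the two register addresses.
    """
    long_form = bool((halfword >> 15) & 1)
    opcode = (halfword >> 8) & 0x7F
    reg_a = (halfword >> 4) & 0xF
    reg_b = halfword & 0xF
    # The Imm16 field only exists when the flag selects the 32-bit form.
    imm16 = (next_halfword & 0xFFFF) if long_form else None
    return Decoded(long_form, opcode, reg_a, reg_b, imm16)
```

Whatever the real bit order is, the point remains that the 16-bit and 32-bit forms share the same decoding path : only the presence of Imm16 changes.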

The YASEP's instruction structure : add/sub with skip on carry

The instructions are aligned on natural boundaries : short instructions are aligned on a 2-byte boundary and long instructions (with immediate data) are aligned on a 4-byte boundary.

This is quite efficient because most instructions benefit from both forms. Some exceptions will trigger an error, but most instructions that don't need Imm16 will silently ignore it (often for upward compatibility and alignment purposes, see below). There is also the case when the Imm/Reg flag is set to "immediate" when the instruction is in an odd position (unaligned). These errors are easily detected and may eventually, maybe, open new opcodes in the far future.

The alignment requirement means that a padding NOP is sometimes (with 1/3 probability) needed before a long instruction. Alternatively, the preceding short instruction can be extended to the long form if it ignores the Imm16 field. This removes some benefits from the instruction format, unless unaligned long instructions become allowed later... This would make the core a bit more complex but given today's memory sizes, it is less of a concern than 20 to 40 years ago.

Furthermore, a decoding mechanism skips the "padding nops" by detecting them at the odd locations, so the execution time is not impacted.
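The alignment rule above can be sketched as a tiny layout pass, the kind of thing an assembler would do. This is only an illustration under the stated rules (short = 2 bytes on a 2-byte boundary, long = 4 bytes on a 4-byte boundary); the function name `layout` and the "short"/"long"/"nop" labels are hypothetical, and the alternative of widening the previous short instruction is not modelled.

```python
def layout(instructions):
    """Place instructions in memory, padding where alignment requires it.

    `instructions` is a list of "short" (2 bytes) or "long" (4 bytes).
    Returns a list of (byte_offset, kind) pairs, where kind may also be
    "nop" for a padding NOP inserted before a misaligned long instruction.
    """
    placed = []
    offset = 0
    for kind in instructions:
        if kind == "long" and offset % 4 != 0:
            # A long instruction must start on a 4-byte boundary, so a
            # 2-byte padding NOP fills the odd half-word slot.
            placed.append((offset, "nop"))
            offset += 2
        placed.append((offset, kind))
        offset += 2 if kind == "short" else 4
    return placed
```

For example, `layout(["short", "long"])` puts the short instruction at offset 0, a padding NOP at offset 2 and the long instruction at offset 4. The NOP always lands in an odd half-word slot, which is what lets the decoder spot and skip it.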

The opcode map shows that the opcodes are grouped into 16 groups of 8 (or less) closely related functions. Currently defined groups are :

  • CTL (control),
  • MOV
  • ASU (Add and Subtract Unit, with optional skip on carry/borrow),
  • ROP2 (boolean operations),
  • SHL (bit SHuffLing, rotate and Shift),
  • IE (Insert/Extract bytes (8 bits) and half-words (16 bits) into/from words (32 bits)),
  • MISC (various operations).
    There are 4 groups of conditional instructions with similar behaviour and identical conditions :
  • CMOV (Conditional MOVe),
  • JMP (conditional JuMP),
  • SKIP (conditional SKIP),
  • Q (conditional switch to another Queue).
    The last group is RSV (ReSerVed). Some groups are also assigned to future uses : SMT and PFQ.

     

    Data format

    The YASEP handles integers only. The natural data format is the 32-bit Word, used by most operations. Some operations can also process 16-bit Half-words and 8-bit Bytes (in the IE unit). Bit-granularity operations are provided by the ROP2 and SHL units.

     

    The registers

    OK so we have 16 register addresses but no load/store instruction nor relative branches, or even absolute jumps. That's where it is getting weird. Beware, you are entering another dimension.

    Only registers #12 to #15 are "normal registers". There is also a "scratch area" in the SR space, but that's all.

    The 12 remaining registers are a set of 6 pairs of data/address registers. Each register in the pair is linked to the other : when the address register is written to, the data register is updated with the value (if any) of the 32-bit word stored at the new address. That's a "read" operation. The write is performed by writing to the data register when the address register points to the memory that you want to update.

    The address/data pair is often named a "queue" or "prefetch queue" ("Q" or "PFQ").

    This is quite similar to how the CDC6600 Central Processing Unit worked in the 60's. Except that the YASEP has another twist : the 4 MSB of the pointer contain "update flags" which indicate how to update the pointer after the data register is read or written. Depending on the configuration, the pointer can be post-incremented or post-decremented, or not changed at all (so the data register becomes a simple register).
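To make the address/data coupling more concrete, here is a small behavioural model of one queue in Python. It is a sketch under several assumptions that the text does not specify : the numeric encoding of the update flags (`NO_UPDATE`, `POST_INC`, `POST_DEC`), a step of 4 bytes per 32-bit word, and silent wrap-around instead of a trap. The class `Queue` and its method names are hypothetical.

```python
ADDR_MASK = 0x0FFFFFFF                     # low 28 bits hold the address
NO_UPDATE, POST_INC, POST_DEC = 0, 1, 2    # assumed flag encodings
WORD = 4                                   # assumed step: one 32-bit word


class Queue:
    """One address/data register pair backed by a word-addressed dict."""

    def __init__(self, memory):
        self.memory = memory   # maps byte address -> 32-bit word
        self.a = 0             # address register (flags in the 4 MSB)
        self.d = 0             # data register

    def write_a(self, pointer):
        # Writing the address register triggers a "read":
        # the data register is refreshed from the new address.
        self.a = pointer & 0xFFFFFFFF
        self.d = self.memory.get(self.a & ADDR_MASK, 0)

    def _update(self):
        flags = self.a >> 28
        addr = self.a & ADDR_MASK
        if flags == POST_INC:
            addr = (addr + WORD) & ADDR_MASK
        elif flags == POST_DEC:
            addr = (addr - WORD) & ADDR_MASK
        else:
            return  # no update: the data register behaves like a plain register
        # Wrap-around is masked here; the real core might trap instead.
        self.write_a((flags << 28) | addr)

    def read_d(self):
        value = self.d
        self._update()
        return value

    def write_d(self, value):
        # Writing the data register stores to the pointed location.
        self.memory[self.a & ADDR_MASK] = value & 0xFFFFFFFF
        self.d = value & 0xFFFFFFFF
        self._update()
```

With a post-increment pointer, successive `read_d()` calls walk through memory like a stream, which is why the pair is called a queue.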

    #   Name  Description                                                Type
    0   A0    Queue #0's address register, default instruction pointer   Executable queues
    1   D0    Queue #0's data register, default current instruction      Executable queues
    2   A1    Queue #1's address register                                Executable queues
    3   D1    Queue #1's data register                                   Executable queues
    4   A2    Queue #2's address register                                Executable queues
    5   D2    Queue #2's data register                                   Executable queues
    6   A3    Queue #3's address register                                Executable queues
    7   D3    Queue #3's data register                                   Executable queues
    8   A4    Queue #4's address register, stack pointer                 Data-only queues
    9   D4    Queue #4's data register, stack top                        Data-only queues
    10  A5    Queue #5's address register, alternate stack pointer       Data-only queues
    11  D5    Queue #5's data register, alternate stack top              Data-only queues
    12  R0    Standard register #0                                       Static registers
    13  R1    Standard register #1                                       Static registers
    14  R2    Standard register #2                                       Static registers
    15  R3    Standard register #3                                       Static registers

    Stacks are emulated with the 4th option : pre-increment. The address register of the pair thus becomes the "stack pointer" and the data register is the "stack top". Of course, the stack could extend down or up, at will, but having only a pre-increment option limits the direction somewhat...

    This can be used on all six "queues" so up to six stacks can be used at the same time. Or almost so because in practice, one must fetch instructions somewhere, and in the YASEP, guess where they come from ?

     

    Fetching instructions

    Instruction fetch is shared with data access. The decoder takes instructions out of one of the first four queues, leaving the two others to stacks. When the instruction stream is linear, a couple of queues are enough, so four queues are used for data moves. When the code becomes more complex, the four queues can contain pointers and instructions from several entry points in loops or functions, leaving one stack (or none) and one (or two) queue(s) for everyday data moves.

    Out of the four 'instruction' queues, only one is used to fetch the current instruction at any time. It is indicated by the "CQ" (Current Queue) two-bit register.

     

    Branches

    Like F-CPU, jumps usually require the destination to be prepared, then the right destination is chosen/selected according to the outcome of a comparison instruction, which contains the candidate new queue. If the test succeeds, the number of the new queue is copied into CQ. During the next cycle, the CQ will choose the new instruction stream.
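The "prepare, then select" idea can be pictured with a very small Python model. It only tracks the four executable queues' address registers and the two-bit CQ; the class `Core` and the methods `prepare` and `switch_if` are hypothetical names, and the condition is passed in as a plain boolean because the actual condition encodings are not detailed here.

```python
class Core:
    """Toy model of queue-based branching: CQ selects the instruction stream."""

    def __init__(self):
        self.cq = 0                      # Current Queue, a two-bit value
        self.queue_addr = [0, 0, 0, 0]   # address registers A0..A3

    def prepare(self, queue, address):
        # Step 1: load the branch target into a queue ahead of time.
        self.queue_addr[queue & 3] = address

    def switch_if(self, condition, queue):
        # Step 2: a conditional instruction names the candidate queue.
        # If the test succeeds, its number is copied into CQ.
        if condition:
            self.cq = queue & 3
        # The next fetch comes from whichever queue CQ now designates.
        return self.queue_addr[self.cq]
```

A loop would, for instance, call `prepare(1, loop_start)` once and then use `switch_if(counter != 0, 1)` at the bottom of the loop body; no target address needs to be computed at branch time.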

    The difference with F-CPU is the use of an explicit queue number, while the FC0, which emulates a more "standard" architecture, needs to maintain lookup tables for associating a register (containing a pointer to the instruction) with an address in the cache memory. This is all simplified in YASEP !

    The YASEP (in 2007) provides 4 different conditional (and unconditional) groups of instructions, with different modes of operation :

    On top of that, some instructions provide means to avoid a branch instruction :

     

    Conditions

    The branches or skips can occur under some (homogeneous) conditions, usually requiring the read of a register.

    Except for the unconditional case, the conditions can be negated, so each group of conditional instructions has 1+(2×3)=7 sub-functions.

    The YASEP's instruction structure : conditional instructions

     

    Pointer format

    The YASEP can address 256MiB of memory.

    The main reason comes from the 4 MSB of the data pointers that hold the auto-update flag. 32 bits - 4 = 28 bits for the addresses. This is well enough for embedded devices.

    Programming with this processor requires a lot of caution to avoid accidental changes of the update flags. Alterations of the 4 MSB as the result of additions (for example) could trap the processor to signal a pointer overflow. This can also occur when auto-inc/decrement wraps around (but then, only 28-bit address calculations are performed).

    These 4 bits are less used in the instruction pointer, but the 2 MSB store the CQ when a trap occurs. The processor can thus restart at the right address with the right CQ, and needs only to save/restore 10 other registers (9 if it is smart) :
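The pointer layout and its pitfalls can be summarised in a few lines of Python. This sketch only relies on what is stated above (4 flag bits on top, 28 address bits below); the helper names `split_pointer`, `same_address`, `add_offset` and the `PointerOverflow` exception are hypothetical, and the exception merely stands in for whatever trap the hardware would raise.

```python
ADDR_BITS = 28
ADDR_MASK = (1 << ADDR_BITS) - 1   # 0x0FFFFFFF, i.e. 256 MiB of address space


class PointerOverflow(Exception):
    """Stands in for the trap raised when arithmetic would touch the flags."""


def split_pointer(pointer):
    """Return (update_flags, address) for a 32-bit YASEP-style pointer."""
    return (pointer >> ADDR_BITS) & 0xF, pointer & ADDR_MASK


def same_address(p, q):
    # Pointer comparisons must mask the 4 MSB, otherwise two pointers to
    # the same word with different update flags compare as different.
    return (p & ADDR_MASK) == (q & ADDR_MASK)


def add_offset(pointer, offset):
    """Add to the address part only, refusing to spill into the flags."""
    flags, addr = split_pointer(pointer)
    new_addr = addr + offset
    if not 0 <= new_addr <= ADDR_MASK:
        raise PointerOverflow(hex(pointer))
    return (flags << ADDR_BITS) | new_addr
```

The masked comparison in `same_address` is the kind of extra step a C compiler would have to emit for every pointer test.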

    Context buffer

    Pos.  Data     Description
    0     CQ|IP    Address of the instruction with the CQ
    1     A(CQ+1)
    2     A(CQ+2)
    3     A(CQ+3)
    4     A4       Queue #4's address register
    5     A5       Queue #5's address register
    6     R0       Standard register #0
    7     R1       Standard register #1
    8     R2       Standard register #2
    9     R3       Standard register #3

    So you see how fast context swap can be : because a queue's data just comes out of the memory, there is no need to save it. Only the pointer has to be saved, sparing six items from being transferred :-)

    The catch is that the position of the 3 other queue numbers must be deduced from the indicated CQ, otherwise it would need 11 elements instead of 10. Well, that's just a few logic gates.
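The context buffer and the "deduce the other queues from CQ" trick can be sketched as follows. This is an illustration only : it assumes the CQ sits in bits 31..30 of the saved instruction pointer and that queue numbers wrap modulo 4, and the function names `save_context` and `restore_context` are hypothetical.

```python
IP_MASK = 0x3FFFFFFF   # everything below the 2 MSB that hold the CQ (assumed)


def save_context(cq, a, r):
    """Build the 10-entry context buffer.

    `a` holds the six address registers A0..A5, `r` the registers R0..R3.
    The data registers D0..D5 are not saved: they are reloaded from memory
    when the pointers are restored.
    """
    ip = (cq << 30) | (a[cq] & IP_MASK)
    # The three other executable queues follow in rotating order.
    others = [a[(cq + i) % 4] for i in range(1, 4)]
    return [ip] + others + [a[4], a[5]] + list(r)


def restore_context(buf):
    """Inverse of save_context: returns (cq, address_registers, registers)."""
    cq = buf[0] >> 30
    a = [0] * 6
    a[cq] = buf[0] & IP_MASK
    for i in range(1, 4):
        a[(cq + i) % 4] = buf[i]
    a[4], a[5] = buf[4], buf[5]
    return cq, a, list(buf[6:10])
```

Because the slot order rotates with the CQ, no separate entry is needed to say which queue was executing; the 2 MSB of the first entry are enough.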

     

    All together !

    I have not yet determined how the address computations are performed. That might be inside or outside the execution core. Every cycle can perform 4 memory accesses (3 reads and 1 write, including instruction fetch) but 3 pointer updates (convention says that write has precedence over read for pointer updates). This means that 3 26-bit adders are needed in parallel with the core.

    For ease of design, i have split the core into four stages as shown below :

    The YASEP execution core's structure

    This does not mean however that the core is pipelinable, despite some potential. It's not even worth considering this because it would bring little performance boost (50% ?) at the price of a lot of circuitry. The core works well, slowly but simply, without pipeline gates (here, the registers act as clock boundaries).

    However, there might be another possibility later : transforming it into an SMT core (Simultaneous MultiThreading) to execute 2 or 4 threads, or more, concurrently. There, all the latencies are hidden from each other and interrupt latency might be greatly reduced. There are relatively few hazards to check and the performance boost is much better than with simple pipelining. Each thread runs as slowly as in a single-thread processor, but four threads would be able to run at the same time, boosting the instructions per second rating. The only practical limitation would be the memory bandwidth, and some 32KB of on-chip cache or SRAM would be helpful.

    But that's for waaayyy later.

     

    Now, you are encouraged to read the instruction set overview.

     

    More information (old, preliminary and written differently) can be found in this text