Investigations on building a 1000-node z80 cluster

Mar 9, 2020 1:00 AM

I was taking notes on the The Connection Machine the other week and tried a few things with the goal of building a simulator of a machine with similar properties. The Connection Machine (CM-1) is a computer with a unique architecture; the first realization of the architecture ran 50,000 active devices running at 4MHZ consuming 1 watt of power. Algorithms involving spatial locality between data points worked particularly well, like fast fourier transforms.

I tried out some libraries so I can build something purely in software. I’m fond of old hardware, and wonder how commodity hardware from the 80s would perform if they access to the same scale of memory and bandwidth that we have today.

A z80 emulator

For my first prototype, I want to simulate a dual z80 processor with four banks of memory. I want to measure data processing bandwidth, where processing power is fixed. A 4MHz z80 is trivial for today’s computers, and I would expect to emulate at least 100 of them on a single 4GHz core, assuming emulation overhead.

So far, I’ve dug into a fairly deep trove of z80-related software. It turns out that emulation has been the least painful part of the process. I chose to use floooh/rz80, an emulator core written in Rust. A System struct implements power_on and step_frame, which I use to create and run a z80 CPU.

pub fn power_on(&mut self) {
    let mut cpu = self.cpu.borrow_mut();
    // map some writable memory to address 0x0000, 0x00000, 0x0000, true, 0x1000);

    // a little Z80 machine code program to add 2 numbers
    let prog = [
    // put the program at address 0x0100
    cpu.mem.write(0x0100, &prog);
    // set PC to address 0x0100
bytes assembly time in cycles
0x3E, 0x11 LD A,0x11 7 cycles
0x06, 0x22 LD B,0x22 7 cycles
0x80, 0x33 ADD A,B 4 cycles

I stepped through the frames a few times. I found running these 18 cycles takes 7 microseconds on my computer. To get a better estimate of timing, I would need to consider memory access and processor communication.

I’ve yet to do the back-of-the napkin math, but if I run into issues running many of these CPUs in realtime, I can always run slower than realtime.

A z80 assembler

I tried and failed to reuse an assembler built in lisp. Why? I want the abilities to write macros and to toughen up the scheme chops that I have somewhere. But more importantly, I wanted to have something that is completely cross-platform. I didn’t want to write my own, but after reading the source for a few assemblers, I grasp the basic idea behind them. Their design mirrors the process for building an emulator. floooh describes this in great detail on the evolution another z80 emulator he wrote.

First, I tried wesen/z80-asm which is written in Common Lisp. It didn’t work with SBCL on Windows, so I moved on after failing to parse through the error messages.

Next, I tried the assembler from siraben/zkeme80 which is designed to go along with an operating system designed for the TI-84. It runs on GNU Guile, which unfortunately doesn’t have pre-compiled binaries for Windows. I do have Racket, which is another Scheme dialect that I have installed.

I’ve ported most of this over to Racket. While maybe not the most productive of routes, it has been an interesting way to become familiar with Scheme.

#lang racket
(require "assembler.rkt")

(define add-two
    `((ld a #x11)
      (ld b #x22)
      (add a b)))

(println add-two)
(assemble-to-hex add-two)

The output is close, but no cigar. The final output is missing a byte somewhere.

'((ld a 17) (ld b 34) (add a b))
'("3e" "11" "6" "22" "80")

I’ve also had to modify the import system to fetch dependencies. Most of these are straightforward because of the (somewhat) standardized Scheme libraries.

;; from
-(use-modules (ice-9 match) (rnrs io ports) (rnrs bytevectors) (srfi srfi-9))
+#lang racket
+(require racket/match rnrs/io/ports-6 rnrs/bytevectors-6 srfi/9 srfi/60)

I did have to replace ice-9 match with the native match in Racket. This required annotating the list S-expression directly.

 (define (assemble-ld args)
   (match args
     ('(sp hl)                                                  (assemble-ld-sp-hl))
-    (((? 8-bit-reg? a) (? 8-bit-reg? b))                       (assemble-ld-reg8-reg8 a b))
-    (((? 8-bit-reg? a) ('+ (? index-reg? b) (? 8-bit-imm? c))) (assemble-ld-reg8-index-offset a b c))
-    (((? 8-bit-reg? a) ('+ (? 8-bit-imm? c) (? index-reg? b))) (assemble-ld-reg8-index-offset a b c))
+    ((list (? 8-bit-reg? a) (? 8-bit-reg? b))                       (assemble-ld-reg8-reg8 a b))
+    ((list (? 8-bit-reg? a) (list '+ (? index-reg? b) (? 8-bit-imm? c))) (assemble-ld-reg8-index-offset a b c))
+    ((list (? 8-bit-reg? a) (list '+ (? 8-bit-imm? c) (? index-reg? b))) (assemble-ld-reg8-index-offset a b c))

Another annoyance was the difference between the format functions. In most Schemes, the second argument determines whether the string should be sent to io (e.g. stdout) or returned as a string. In Racket, these are separated into a format and printf, somewhat like the C string libraries.

A few more search and replaces were needed before this program would run. When the second argument was #f, the argument was removed. When it was #t, (format #t ...) was converted into (printf ...).

- (error (format #f "Operand to jr ~a not an 8-bit signed integer." offset))
+ (error (format "Operand to jr ~a not an 8-bit signed integer." offset))

Next steps

This is where I stand today, but there’s a lot of exciting challenges for me to figure out. I am going to install asmotor/asmotor, which is an assembler that succeeds rednex/rgbasm. It’ll be useful when I write a program for the Gameboy and start studying the z80 in deeper depth.

The idea for the first non-trivial program will be to sort a large list of numbers. There are many algorithms to choose from, but the optimal one will change as I increase the number of elements. As simple 8 bit machines, z80’s can only address 65k of memory. What I plan to do is to bank the memory, having any bank acecssible to any CPU, but only 1 at a time.

This will also be an exercise in writing a bus connecting several of these virtual machines together. I’ll need to hook the pins between the processors, and allow them to share memory via hypervisor. The CM-1 is packet based, where this design is switched based. In the end, I expect to see it to have similar characteristics to modern MapReduce systems.

I might test it by seeing how long it takes to sort a large, in memory list. I’m forward to spending my time trying these ideas out. Another thing to try out in the future is a clustering algorithm like K-means to see if this computer can scale.

Another project (if I ever have an FPGA handy) would be to build a replica of one of the CM-1 processors. There are 16 processors arranged in a square grid, each connected to the rest of the network by a router. Each of the individual processors is simple.