Computer architecture for developers

4 November 2022 · 2755 words

computer architecture

An adaption of a talk I gave at work aimed at raising awareness of, and interest in, computer architecture. The intended audience has low to no knowledge of computer architecture, but hopefully anyone can enjoy it.

A note on accessiblity

This article contains a lot of images. I have tried to make the article accessible by explaining most things in text, so you wouldn't necessarily need the images, they're mostly there for flair. If there are accessibility problems though, please let me know.

Act one - why do we need computers?

I guess it all starts with nonsense like "what is 1 + 1?". Many people throughout history have asked questions like that, and most settled for pen and paper — sensible, efficient. But just like a kid's dinosaur you eventually replace it with something more exciting (like complex numbers, perhaps?) and someone needs to be there to pick up the slack.

A half adder is at least as sensible as pen and paper for calculating 1 + 1, so people started using half adders. Half adders can be built by wiring together an XOR gate and an AND gate, yielding two inputs (A, B) and two outputs (Sum, Carry).

A half adder has a truth table that looks like this

input A	input B	output Sum	output Carry
0	0	0	0
1	0	1	0
0	1	1	0
1	1	0	1

That worked for a while, but wiring up half adders all day gets old fast. "And nobody liked that", as it were. Long story short: that's why we invented assembly language.

mov ax, 1
mov bx, 1
add ax, bx

This is assembly language in nasm form. mov ax, 1 means put the number 1 in register ax, basically let ax = 1; in pseudocode. add ax, bx on the last line adds the numbers together and puts the result in the first register. Writing it as pseudocode would yield something like let ax = ax + bx;.

nasm is an x86 assembler, and it's the nasm syntax I have used above. Check it out, it's pretty cool.

The nasm compiler compiles assembly code into x86 machine code, and we'll have a peek at what that looks like and what kind of conclusions can be drawn, but first we'll take a quick look at hexadecimal notation.

Decimal	Hex	Binary
0	0	0000
1	1	0001
...	...	...
10	a	1010
11	b	1011
...	...	...
15	f	1111

The first column shows a decimal number 0 to 15, the second column would be the corresponding hexadecimal number and the last column is the binary equivalent. A lot of rows were skipped 😇.

Now that we've at least seen some hexadecimal we can convert our assembly code into something called machine code by compiling it using the nasm compiler which turns it into x86 machine code.

# mov ax, 1
b8 01 00
# mov bx, 1
bb 01 00
# add ax, bx
01 d8

The first thing we see in the compiled x86 machine code is that each textual instruction has been converted into a sequence of bytes.

As an aside, I can mention that x86 instructions are stored as little endian.

Little endian is one of two ways of storing or transmitting bytes in a computer or over a computer network. The other way would be big endian. Every byte of a given sequence of bytes would be stored with the "little number at the lowest byte" in little endian, so a number like 0x0A0B0C0D would have its last byte (meaning last when reading from left to right, the least significant byte, or the byte with the least "important" information: 0x0D) stored at the lowest memory address, with each consecutive byte being stored at a higher address. Read the endiannesss article on wikipedia if you want to do a deep-dive.

I personally think that the terms little-endian and big-endian are extremely confusing and I have to look them up every time. Another bad computer science / programming name is .filter() for arrays, which should really have been called .keep().

Let's look a little closer at the machine code.

The leftmost column is a byte called opcode. The opcode tells the CPU what we want it to do. b8 would mean mov data into the register ax. bb on the other hand means mov data into the register bx. The data columns for the first two lines both hold the number 1 in little endian.

The third line is a little more interesting because it's the opcode 01 which is add, but it only takes a single byte argument. That byte needs to be split into three parts: The first two bits signify what the next six bytes represent. 11 means both of the coming three bit sequences are registers, 011 means bx and 000 means ax.

But what would it look like if we turned all of these hexadecimal machine code instructions into binary numbers instead?

I'd say that looks a bit like the inputs to the half adder we studied above, just a teeny tiny bit more complex.

Maybe you already knew, but...

machine code is generally independent of the host operating system of your computer. If you have an x86 mac, x86 windows, or x86 linux computer then most of the instructions inside a program would be the same no matter the platform. Only the code talking to the operating system would differ significantly.

But we were talking about adding numbers, weren't we? For a long time people were writing computer programs in assembly language to perform complex tasks like putting people on the moon.

And for a long time that was the best way to do it.

But later scientist discovered another way to add numbers together. Unfortunately that method was the C programming language, which I hope most modern number adder people will shy away from like the plague.

C is... a (not very good) language but has found a niche together with C++ as unreliable, unsafe, but very fast number adders. But there are a lot of number adder languages out there, and some of them can be laid out on the Garbage Collected (GC)/non-GC vs unsafe/safe axes like so.

The unsafe + GC quadrant is conspicuously vacant. We have C and C++ in the unsafe + non-GC quadrant, Java, Python, Javascript, and Go in the safe + GC quadrant, and lastly Swift and Rust in the safe + non-GC quadrant.

What is a (memory) safe language anyway? Safe languages exhibit at least the following properties

No undefined behavior
No use-after-free memory reads
No out-of-bounds memory reads
No double-free memory releases
Only "logic bugs" can happen

What is a GC (Garbage Collected) language?

Memory safe
Most things tend to be heap allocated
Memory is freed from the heap when the runtime decides (as opposed to the programmer)
Tend to be a bit slower, tend to have "GC pauses".

Let's have a look at my completely made up and over-generalizing chart™️

The picture shows a rough timeline and a rough fast/slow categorization of the same languages that were compared previously. Assembly was invented around 1947, a good 25 years before C. In the 90's we got Python, Javascript, and Java tightly following each other, all Garbage Collected languages (Abe Simpson voice: which was the style at the time).

In the 2000s we got Golang, also a safe GC language. In 2010 and 2014 we saw the birth of two, currently very popular, safe non-GC languages: Rust and Swift, respectively.

From my totally made up diagram we can draw the conclusion that most GC languages are pretty slow, but Go seems to be an outlier? What's going on? Let's put a pin in that for now.

Act two - Why do we need to know how computers work?

Rhetorical, obviously. The textbook answer goes something like "to be able to write better programs", but the correct answer is "because computers are Cool".

If act one was setting up the story, getting to know our protagonist, who just wants to "add 1 + 1 really fast", then act two will launch us into conflict with our common enemy — memory.

Let's try to deduce why a GC'd language would be slower than a non-GC'd one. Let's try to conquer memory 💪.

A novice tries to conquer memory

The novice knows that memory is "segmented" into two main areas: the stack and the heap. Functions mostly use stack memory while "classes" would be put on the heap. Java almost entire works off of classes, so almost all the memory would be heap allocated.

The novice also knows that the heap is slower than the stack because things can be created anywhere on the heap, meaning the program has to follow a lot of pointers to run a program.

Not only that, but storing the thing AND the pointer takes up more space which probably makes things slower too. But why?

A competent programmer gives it a try

Inside the CPU there are several cores and each core has several layers of caching for memory, and the different layers have different latencies. So accessing memory tha has already been cached is way faster.

The latencies are relative to each other depending on how far from the CPU registers the memory is cached. If the registers have latency 1 (no unit) then L1 cache would be 3, L2 cache would be 12, RAM would be 240 and finally an SSD would be 2400 times slower than accessing a register.

The sizes of the different levels of memory storage are important, too. The completely made up computer architecture we will be discussing has 256 bytes of registers, 32 kB of L1 cache, 2 MB of L2 cache, and 16 GB of RAM.

This makes sense because if there is a lot of memory on the heap then stuff in the L1 and L2 caches would need to be cleared more often because stuff is just located in different parts of memory all the time.

When the CPU core doesn't have the wanted memory in L1 or L2 cache, that's called a cache miss, and the computer would need to fetch the data from RAM, which is slow.

An expert gives it a try

Modern computers operate using Virtual Memory, where 64-bit CPUs use 48 bits of virtual address space giving a maximum theoretical available physical RAM of 256 TB.

Most computers don't have that much memory. But the CPU, or the Operating System for that matter, don't care about that. Each program will be given a huge chunk of virtual memory space, and programs running in different processes can even get the same, overlapping, virtual memory space.

When a CPU loads memory it does so using Memory Pages, which have a predetermined size depending on your hardware (and sometimes configurable by the OS). A common size is a page size of 4 kB. If there are too many pages to fit in RAM, the Operating System will "page out" to SSD, which means copy the bytes from RAM to SSD to make space for other things in RAM.

When the CPU later asks for that page, it will be loaded back into RAM, and some other page might page out instead.

Unfortunately the novice and competent programmers have been lied to. The stack / heap dichotomy is a lie.

Instead of the stack growing from lower physical addresses and the heap growing from larger physical addresses, leading to an inevitable memory clash in the middle of the physical stick of RAM, memory is arranged in pages which are mapped using virtual memory with a seemingly endless supply of addresses.

Program	Type	Address	Data
Chrome	Stack	0xFFFF	0xFEEDC0DE
VSCode	Heap	0xFFA0	0xF00DC0DE
Terminal	Stack	0xFECC	0xCAFEC0DE

Each page contains data from a singe program and the pages have no particular order in physical RAM, they're just scattered around.

Things we didn't talk about

OS memory overcommit. Translation Lookaside Buffer (TLB). Memory Management Unit (MMU). All real, not made up things, that you can search for on your favorite search engine.

Act three - Explaining fast software

Act one introduced us to the concept of math, and how humans have tried to avoid doing it since way before I was born. Software seems like a good solution but some programming languages are slower than others.

Act two tried to lay the foundation for understanding why some programming languages are slow by explaining hardware, but didn't reveal any solutions.

Now is the time to combine our newfound knowledge of hardware and software to do what we came to do. Slay our enemy, Memory.

Imagine that you have some data that could be expressed as an array, where each piece of data takes 64 bytes, and there are 512 elements, how would that data be laid out in memory?

In Java, it would probably be an array of 512 elements where each element was an 8 byte pointer to an object taking up 64 bytes. The total amount of memory needed would be 512 * (8 + 64) bytes.

In a real world Java application, each 64 byte node would probably contain several pointers to other pieces of data as well. A tasty pointer soup, if you will.

The more memory each element uses, the more likely it becomes that a complete data set that is currently used by the CPU to do computations can't fit into L1 or L2 cache.

Each time the size of the data set is larger than L1 cache the CPU hits a latency cliff, where it needs to ask L2 cache for more data. When the data set is larger than L2 another latency penalty is experienced by the program.

First a 3x latency penalty for data sets larger than L1, then a 12x latency penalty for data sets larger than L2, and so on. And in the ultimate worst case, penalties for memory being paged in and out of SSD storage.

Having fast software, or Killing Memory as it were, is a matter of seeing the whole picture. Packing data tightly, not needing to fetch data from a more expensive type of memory, and using less memory, all of these are important to keep software fast. Modern computers are complex and complicated, and it's not obvious what is going on inside of them. The more you know about the particular computer you are targeting, the better.

Calculating 1 + 1 is finally in reach. Or maybe something a bit more complicated, too.

Epilogue

It's not all bad.

What are the available mitigations, for a software developer like you? First of all you need to start using languages with larger amount of control of data layout. Languages that give you the opportunity to choose if data is behind a pointer allocation (heap) or not. Languages like Rust.

And what about Go? We put a pin in it, remember? Go occupies a strange niche: A GC language that's faster than its siblings. It's more modern, but not based on LLVM like Swift and Rust. It uses the Plan 9 toolchain, but more importantly Go has structs that can live on the stack, and mostly uses its GC for long-lived objects things are passed around between different parts of the program. And even then it may skip heap allocations by doing escape analysis.

Escape analysis is a compiler optimization technique that elides heap allocations if the compiler can prove that it can do the given task directly on the stack instead. Rust uses the LLVM compiler for optimizations, and LLVM can also do escape analysis.

Whew, I hoped any of that made sense.

If you find any errors, or just want to ask a question, contact me using my e-mail address.