Tuesday, June 2, 2020

How to Write a (Toy) JVM


How to write a (toy) JVM

Whether we like it or not, but Java is one of the most widely used programming languages. However, since most of the applications in Java are either too boring or too complex - not every Java developer has enough curiosity to look under the hood and see how JVM works.

In this post I will try to write a toy (and incomplete) JVM to show the core principles behind it and hopefully sparkle some interest in you to learn it further.

Our humble goal

Let’s start with something really simple:

public class Add { public static int add(int a, int b) { return a + b; } } 

We compile our class with javac Add.java and it results in Add.class. This class file is the actual binary file that JVM can execute. All that is left to do is to implement such a JVM that would execute it correctly.

If we look inside the Add.class with a hexdump - we probably won’t get too impressed:

00000000 ca fe ba be 00 00 00 34 00 0f 0a 00 03 00 0c 07 |.......4........| 00000010 00 0d 07 00 0e 01 00 06 3c 69 6e 69 74 3e 01 00 |........<init>..| 00000020 03 28 29 56 01 00 04 43 6f 64 65 01 00 0f 4c 69 |.()V...Code...Li| 00000030 6e 65 4e 75 6d 62 65 72 54 61 62 6c 65 01 00 03 |neNumberTable...| 00000040 61 64 64 01 00 05 28 49 49 29 49 01 00 0a 53 6f |add...(II)I...So| 00000050 75 72 63 65 46 69 6c 65 01 00 08 41 64 64 2e 6a |urceFile...Add.j| 00000060 61 76 61 0c 00 04 00 05 01 00 03 41 64 64 01 00 |ava........Add..| 00000070 10 6a 61 76 61 2f 6c 61 6e 67 2f 4f 62 6a 65 63 |.java/lang/Objec| 00000080 74 00 21 00 02 00 03 00 00 00 00 00 02 00 01 00 |t.!.............| 00000090 04 00 05 00 01 00 06 00 00 00 1d 00 01 00 01 00 |................| 000000a0 00 00 05 2a b7 00 01 b1 00 00 00 01 00 07 00 00 |...*............| 000000b0 00 06 00 01 00 00 00 01 00 09 00 08 00 09 00 01 |................| 000000c0 00 06 00 00 00 1c 00 02 00 02 00 00 00 04 1a 1b |................| 000000d0 60 ac 00 00 00 01 00 07 00 00 00 06 00 01 00 00 |`...............| 000000e0 00 03 00 01 00 0a 00 00 00 02 00 0b |............| 

Although we don’t see a clear structure here yet - we need to find a way how to parse it: what are these ()V and (II)I, what is <init>, and why does it start with “cafe babe”?

You probably have seen another way to dump class files, which is often more useful:

$ javap -c Add Compiled from "Add.java" public class Add { public Add(); Code: 0: aload_0 1: invokespecial #1 // Method java/lang/Object."<init>":()V 4: return public static int add(int, int); Code: 0: iload_0 1: iload_1 2: iadd 3: ireturn } 

Now we see our class, its constructor, and a method. Both constructor and method contain a few instructions and it now becomes more or less clear what our add() method does: it loads two arguments (iload_0 and iload_1), adds them, and returns the result. JVM is a stack machine, so there are no registers, all arguments to the instructions are stored on the internal stack and the results are pushed on the stack as well.

Class loader

Now, how can we achieve what javap did here, how do we parse the class file?

If we look into the JVM specification, we learn about classfile structure. It always starts with 4 bytes signature (CAFEBABE), then 2+2 bytes for version, sounds simple.

Since we would have to read bytes, shorts, ints, and byte sequences from the binary file, we can start implementing our loader like this:

type loader struct { r io.Reader err error } func (l *loader) bytes(n int) []byte { b := make([]byte, n, n) // we don't want to handle errors in each step, // so simply store the first found error till the end // and do nothing if we entered an erroneous state if l.err == nil { _, l.err = io.ReadFull(l.r, b) } return b } func (l *loader) u1() uint8 { return l.bytes(1)[0] } func (l *loader) u2() uint16 { return binary.BigEndian.Uint16(l.bytes(2)) } func (l *loader) u4() uint32 { return binary.BigEndian.Uint32(l.bytes(4)) } func (l *loader) u8() uint64 { return binary.BigEndian.Uint64(l.bytes(8)) } // Usage: f, _ := os.Open("Add.class") loader := &loader{r: f} cafebabe := loader.u4() major := loader.u2() minor := loader.u2() 

And then the spec tells us that we need to parse the constant pool. What is it? It is a special part of the class file, that contains constants needed to run the class. All the strings, numerical constants, and references are stored there and each has a unique uint16 index (thus, a class may have up to 64K constants).

There are several types of constants in the pool, each containing a different set of values. We would be interested in:

  • UTF8: a plain string literal
  • Class: an index of a class name string (indirect reference)
  • Name and type: an index of a type name and descriptor, used for fields and methods
  • Field and method references: indices referring to classes and name-and-type constants

As you see, constants in the pool are referring to each other a lot. Since we are implementing a JVM in Go and there are no union types, let’s make a Const type that would contain various possible constant fields in it:

type Const struct { Tag byte NameIndex uint16 ClassIndex uint16 NameAndTypeIndex uint16 StringIndex uint16 DescIndex uint16 String string } type ConstPool []Const 

Then, following the JVM spec we could parse the constant pool data like this:

func (l *loader) cpinfo() (constPool ConstPool) { constPoolCount := l.u2() // Valid constant pool indices start from 1 for i := uint16(1); i < constPoolCount; i++ { c := Const{Tag: l.u1()} switch c.Tag { case 0x01: // UTF8 string literal, 2 bytes length + data c.String = string(l.bytes(int(l.u2()))) case 0x07: // Class index c.NameIndex = l.u2() case 0x08: // String reference index c.StringIndex = l.u2() case 0x09, 0x0a: // Field and method: class index + NaT index c.ClassIndex = l.u2() c.NameAndTypeIndex = l.u2() case 0x0c: // Name-and-type c.NameIndex, c.DescIndex = l.u2(), l.u2() default: l.err = fmt.Errorf("unsupported tag: %d", c.Tag) } constPool = append(constPool, c) } return constPool } 

We keep things simple here, but in real JVM we would have to treat long and double constant types uniquely, by inserting an additional unused const item, as JVM spec tells us (since const items are considered to be 32-bit).

To easier get string literals by indices, we would implement a Resolve(index uint16) string method:

func (cp ConstPool) Resolve(index uint16) string { if cp[index-1].Tag == 0x01 { return cp[index-1].String } return "" } 

Now we have to add similar helpers to parse a list of class interfaces, fields and methods, and their attributes:

func (l *loader) interfaces(cp ConstPool) (interfaces []string) { interfaceCount := l.u2() for i := uint16(0); i < interfaceCount; i++ { interfaces = append(interfaces, cp.Resolve(l.u2())) } return interfaces } // Field type is used for both, fields and methods type Field struct { Flags uint16 Name string Descriptor string Attributes []Attribute } // Attributes contain addition information about fields and classes // The most useful is "Code" attribute, which contains actual byte code type Attribute struct { Name string Data []byte } func (l *loader) fields(cp ConstPool) (fields []Field) { fieldsCount := l.u2() for i := uint16(0); i < fieldsCount; i++ { fields = append(fields, Field{ Flags: l.u2(), Name: cp.Resolve(l.u2()), Descriptor: cp.Resolve(l.u2()), Attributes: l.attrs(cp), }) } return fields } func (l *loader) attrs(cp ConstPool) (attrs []Attribute) { attributesCount := l.u2() for i := uint16(0); i < attributesCount; i++ { attrs = append(attrs, Attribute{ Name: cp.Resolve(l.u2()), Data: l.bytes(int(l.u4())), }) } return attrs } 

Both, fields and methods are represented as Fields, which is very fortunate and saves us some time. Finally, we can assemble it all together and parse our complete class:

type Class struct { ConstPool ConstPool Name string Super string Flags uint16 Interfaces []string Fields []Field Methods []Field Attributes []Attribute } func Load(r io.Reader) (Class, error) { loader := &loader{r: r} c := Class{} loader.u8() // magic (u32), minor (u16), major (u16) cp := loader.cpinfo() // const pool info c.ConstPool = cp c.Flags = loader.u2() // access flags c.Name = cp.Resolve(loader.u2()) // this class c.Super = cp.Resolve(loader.u2()) // super class c.Interfaces = loader.interfaces(cp) c.Fields = loader.fields(cp) // fields c.Methods = loader.fields(cp) // methods c.Attributes = loader.attrs(cp) // methods return c, loader.err } 

Now if we look into the resulting class info we will see that it has zero fields and two methods - <init>:()V and add:(II)V. What are these things that look like roman numbers with parens? Those are descriptors, they define what types of arguments a method takes and what it returns. In this case <init> (a synthetic method, used to initialize objects when they are constructed) takes no arguments and returns nothing (V=void), while the “add” method takes two ints (I=int32) and also returns nothing.

Bytecode

If we look closer, we’ll see that each method in our parsed class has one attribute, named “Code”. This attribute has a slice of bytes as a payload. The bytes are the following:

<init>: [0 1 0 1 0 0 0 5 42 183 0 1 177 0 0 0 1 0 7 0 0 0 6 0 1 0 0 0 1] add: [0 2 0 2 0 0 0 4 26 27 96 172 0 0 0 1 0 7 0 0 0 6 0 1 0 0 0 3] 

If we look at the spec, this time in the bytecode section, we will see that “Code” attribute starts with maxstack value (2 bytes), then maxlocals (2 bytes), then code length (4 bytes), and then actual code. So our attributes can be read like this:

<init>: maxstack: 1, maxlocals: 1, code: [42 183 0 1 177] add: maxstack: 2, maxlocals: 2, code: [26 27 96 172] 

Yes, we only have 4 and 5 bytes of code in each method. What do those bytes mean?

Like I said, JVM is a stack machine. Each instruction is encoded as a single byte, which might be followed by some additional arguments. If we look at the spec, we will see that the “add” method has the following instructions:

 26 = iload_0 27 = iload_1 96 = iadd 172 = ireturn 

Exactly like we saw in javap output at the beginning! But how shall we execute it?

JVM frames

When a method is executed inside the JVM, it has its own stack for temporary operands, its own local variables, and its own chunk of code to execute. All these parameters are stored in a single execution frame. Additionally, frames contain the current instruction pointer (how far we have advanced while executing the bytecode) and a pointer to the class, which contained the method. The latter is needed to get access to the const pool of the class, as well as other details.

Let’s make a method that constructs a frame for the given method to be called with the given arguments. I will be using interface{} type here as the Value type, although proper union types would of course be a safer choice.

type Frame struct { Class Class IP uint32 Code []byte Locals []interface{} Stack []interface{} } func (c Class) Frame(method string, args ...interface{}) Frame { for _, m := range c.Methods { if m.Name == method { for _, a := range m.Attributes { if a.Name == "Code" && len(a.Data) > 8 { maxLocals := binary.BigEndian.Uint16(a.Data[2:4]) frame := Frame{ Class: c, Code: a.Data[8:], Locals: make([]interface{}, maxLocals, maxLocals), } for i := 0; i < len(args); i++ { frame.Locals[i] = args[i] } return frame } } } } panic("method not found") } 

So, we got the Frame, with initialized locals, empty stack, and preloaded bytecode. Now it’s time to execute the bytecode:

func Exec(f Frame) interface{} { for { op := f.Code[f.IP] log.Printf("OP:%02x STACK:%v", op, f.Stack) n := len(f.Stack) switch op { case 26: // iload_0 f.Stack = append(f.Stack, f.Locals[0]) case 27: // iload_1 f.Stack = append(f.Stack, f.Locals[1]) case 96: a := f.Stack[n-1].(int32) b := f.Stack[n-2].(int32) f.Stack[n-2] = a + b f.Stack = f.Stack[:n-1] case 172: // ireturn v := f.Stack[n-1] f.Stack = f.Stack[:n-1] return v } f.IP++ } } 

Finally, we can put it all together and run by calling our add() method:

f, _ := os.Open("Add.class") class, _ := Load(f) frame := class.Frame("add", int32(2), int32(3)) result := Exec(frame) log.Println(result) // OUTPUT: OP:1a STACK:[] OP:1b STACK:[2] OP:60 STACK:[2 3] OP:ac STACK:[5] 5 

So, it works. Yes, it’s a very lousy and pitiful JVM, but still, it does what JVM does - loads bytecode and interprets it (but of course, the real JVM does way more than that).

what’s missing

The other two hundred instructions, the runtime, OOP type system, and a few other things.

There are 11 groups of instructions and most of them are trivial:

  • Constants (put a null or a small number or values from const pool on the stack).
  • Loads (put locals on the stack). There are 32 instructions like that.
  • Stores (pop from the stack into locals). Another 32 boring instructions.
  • Stack (pop/dup/swap), like in every stack machine.
  • Math (add/sub/div/mul/rem/shift/logic). For different value types, 36 instructions in total.
  • Conversions (int to short, int to float, …).
  • Comparisons (eq/ne/le/…). Useful for making conditionals, like if/else.
  • Control (goto/return). Useful for loops and subroutines.
  • References. The most interesting part, fields and methods, exceptions, and object monitors.
  • Extended. Something that would look like an ugly workaround at a first glance. And it probably won’t change over time.
  • Reserved. Breakpoint instruction 0xca goes here.

Most instructions are trivial to implement - they take one or two arguments from the stack, perform some operation on them, and push the result. The only thing to keep in mind here is that long and double instructions expect that each value takes two slots on the stack, so you may require additional push() and pop(), which makes it harder to group the instructions.

Implementing References requires to think about the object model - how you would like to store Objects and their Classes, how to represent inheritance, where to store instance fields and class fields. Also, this is where you would have to be careful about method dispatching - there are multiple “invoke” instructions and they behave in a slighly different manner:

  • invokestatic: invoke a static method on a class, no surprises.
  • invokespecial: invoke an instance method directly, mostly used for synthetic methods, like <init>, or private methods.
  • invokevirtual: invoke an instance method based on the class hierachy.
  • invokeinterface: invoke an interface method, similar to invokevirtual, but does different checks and optimizations.
  • invokedynamic: invoke a dynamically-computed call site, new in Java 7, useful for dynamic methods and MethodHandles.

If you implement a JVM in a language without garbage collection - this is also where you should think about how to perform the garbage collection: reference counting, mark-and-sweep, etc. Handling exceptions by implementing athrow, propagating them through the frames and handling them with exception tables is another interesting topic.

Finally, your JVM remains useless if there no runtime classes. Without java/lang/Object you are unlikely to even see how new instruction works by constructing new objects. Your runtime may provide some common JRE classes from java.lang, java.io, and java.util packages, or it may be something more domain-specific. Most likely some methods in the classes would have to be implemented natively and not in Java. This will raise the question of how to find and execute such methods and it becomes another edge case for your JVM.

In other words, implementing a proper JVM is not so trivial, however, understanding how it is implemented is not so complex either.

I only had one summer weekend to spare, and my JVM still has a long way to go, but the structure looks more or less clear: https://github.com/zserge/tojvm (PRs are always welcome!)

The actual code snippets from this blog post are even smaller and available as a gist.

If you would like to explore the topic deeper - you may consider looking at small JVMs:

I hope this article did not turn you away from Java. Virtual machines are fun, and JVM truly deserves its place.

Jun 01, 2020

like   tweet  
rss   @me   </>me


from Hacker News https://ift.tt/2TZILlp

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.