Java Reflection Isn't Slow

When reading up on Java reflection it’s hard to browse very far without hearing about how slow reflection supposedly is. Probably most of us have seen about a bajillion benchmarks comparing reflection to direct method dispatch, lambdas, etc., and reflection usually loses badly. The other day, however, I encountered a situation where reflection significantly outperformed the alternatives.

Over the last few months I’ve been putting most of my free time into a scripting language/interpreter for game development. The language is dynamically typed and compiles into a custom bytecode format that is run by an interpreter written in Java. The original design of the interpreter utilized a “catch all” base class that relied on overridden methods to handle most interactions with the interpreter (very similar to how Python handles operator overloading). The problem with this design was that it was freakishly verbose and resulted in ultra-dense, ugly code even for simple operations. For example, here’s the integer addition code:

@Override
public CObject __plus__(CObject other){
	if(other instanceof CInt){
		// int + int
		return new CInt(intValue + ((CInt) other).intValue);
	}else if(other instanceof CFloat){
		// int + float promotes the result to a float
		return new CFloat((float)intValue + ((CFloat) other).floatValue);
	}else{
		throw new UnimplementedOperation(String.format("Undefined operation: cannot perform int + %s", other.getClass().getSimpleName()));
	}
}

There were two major problems with this design:

  1. The code was nasty to write and maintain.
  2. Interop with Java code was quite tricky - you’d have to write your own wrapper to expose Java objects to the interpreter, one wrapper per Java class.

To work around the second issue I had plans to write a reflective “auto wrapper.” Before I had a chance to do that, I got tired of writing the freakishly tangled up code above and totally rewrote the internals of the interpreter. The new design uses reflection to dispatch opcodes to Java methods, replacing the base-class override design above. I was concerned with the performance impact this would have, and when I did the rewrite I took the time to set up the project with proper JMH benchmarks. I’ve been using a simple count-to-one-million loop to benchmark/profile the core “basic” functionality of the runtime:

var x = 0
while(x < 1000000){
    x = x + 1
}
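For context, the JMH harness around that loop is roughly this shape - a minimal sketch with hypothetical names (CompiledScript, Compiler.compile, and Interpreter.run are stand-ins for whatever the project actually calls them), not the real benchmark class:

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class CountToAMillionBench {

    Interpreter interpreter;
    CompiledScript script;   // hypothetical wrapper for the compiled bytecode

    @Setup
    public void setup() {
        // compile once; each benchmark invocation runs the whole script
        script = Compiler.compile("var x = 0\nwhile(x < 1000000){ x = x + 1 }");
        interpreter = new Interpreter();
    }

    @Benchmark
    public Object countToAMillion() {
        return interpreter.run(script);
    }
}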

Sure enough, my shiny new reflective design benchmarked at a dismal 7 seconds to complete one iteration of the count-to-a-million benchmark. I knew it wouldn’t be optimal because I was looking up the Method objects with each opcode dispatch, but I didn’t expect it to be that bad. I promptly added a Method cache to save and reuse method references between dispatches (the lookup is done only once, the first time a method is called). I also disabled access checking by calling method.setAccessible(true). The results were drastically better - ~0.27 seconds to run the benchmark on my main development machine (about 0.18 seconds on a machine with a newer processor and faster RAM clocks). Profiling with VisualVM showed that about 1/3 of the time per opcode was spent decoding and preparing arguments and 2/3 was spent by Java in the reflective method invocation, so I decided to leave well enough alone and move on to other, higher-priority tasks on the project.
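For the curious, the cached dispatch boils down to something like this - a simplified sketch (assuming java.util.HashMap/Arrays and java.lang.reflect.Method; invokeCached and the cache-key format are illustrative, not the interpreter’s actual API):

private final Map<String, Method> methodCache = new HashMap<>();

Object invokeCached(Object target, String name, Object[] args) throws Exception {
    Class<?>[] paramTypes = new Class<?>[args.length];
    for(int i = 0; i < args.length; i++){
        paramTypes[i] = args[i].getClass();
    }
    // The key covers receiver class, method name, and argument types so that
    // overloads resolve to separate cache entries.
    String key = target.getClass().getName() + "#" + name + Arrays.toString(paramTypes);
    Method m = methodCache.get(key);
    if(m == null){
        // Exact-type lookup keeps the sketch short; a real resolver would also
        // have to consider supertypes of the argument classes.
        m = target.getClass().getMethod(name, paramTypes);
        m.setAccessible(true);   // skip access checking on every invoke
        methodCache.put(key, m);
    }
    return m.invoke(target, args);
}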

The other day, however, I had the idea of reintroducing the old style base class into the new design. Objects that extended from the base class would be dispatched statically through virtual method calls, check their parameter types, and perform the operations without reflection ever being involved. Objects that didn’t sport this “optimization” would go through the plain ol’ reflective dispatch. I tossed in enough code to use this technique for the benchmark above and fired up JMH, expecting a modest performance improvement since I’d bypassed the reflective invocation.

Much to my surprise, the reflective version ran the benchmark ~20% faster. My “optimization” spent more time checking parameter types and invoking the operation than the JVM did. In hindsight, I shouldn’t have been surprised - whenever you go from dynamically typed parameters to statically typed method invocations you’ve got to pay some type checking cost, and the JVM apparently has a more optimized way to do this than “if-instanceof-then-cast” techniques.

My takeaways from this:

  1. As usual, make a good design first and optimize the unacceptably slow parts later (after proper benchmarking/profiling). Performance issues often aren’t where you think they are.
  2. Reflection is slower than static method dispatch, but if you’ve got to dynamically bind method invocations (and invokedynamic/MethodHandles - sketched below - aren’t a good/applicable option), it may very well be faster than handwritten alternatives. In other words, reflection can be fast - quite fast.
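For completeness, the MethodHandle route mentioned in point 2 would look roughly like this. It is a sketch only - the CInt receiver and its plus method are stand-ins, and the interpreter doesn’t currently do this - but the idea is that the handle is resolved once, cached (ideally in a static final field), and then invoked per dispatch, which tends to JIT better than Method.invoke():

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class MethodHandleDispatchSketch {
    static final MethodHandle PLUS;
    static {
        try {
            PLUS = MethodHandles.lookup().findVirtual(
                    CInt.class, "plus",                              // hypothetical receiver/method
                    MethodType.methodType(CInt.class, CInt.class));  // returns CInt, takes one CInt
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static Object add(Object lh, Object rh) throws Throwable {
        return PLUS.invoke(lh, rh);   // invoke() adapts the Object arguments to (CInt, CInt)CInt
    }
}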

I could not agree more: Write good code first. Then measure. Then, if at all needed, optimize. I think in 99.9999% of cases (keep adding nines) optimization is premature and the root of all evil :slight_smile:
As you say: do not underestimate the clever engineers behind the JVM; it is incredibly good at optimization when the code is good.


No, that’s a myth that often arises because people include the method lookup in their benchmark.

Calling invoke() on a method is no slower than calling a method that calls another method - i.e., at most you get double the overhead of a method call. Meaning, an immeasurably small value.

In some cases, reflection is actually slightly faster because you avoid the normal method resolution Java has to do for the regular push(class), push(method name) pair required by a normal invoke bytecode.

The only reason I know the benchmarks must be bogus is that you mention lambdas… and they will have exactly the same overhead as a cached method dispatch.

…and if someone is doing reflection without caching Method/Field/whatever objects then, yeah, that’s super-dumb.

(I’ve been doing reflection-based framework programming since 1998 or so without issue. I even have old open source tech along these lines: http://meta-jb.org/ )


Yeah, I was trying to contrast “common” (wrong) knowledge with a result showing something else, but I worded it poorly. Edited for clarity.

Isn’t there some overhead caused by type checking the arguments though?

I wouldn’t have guessed that. Good to know.

Lots of benchmarks compare apples to oranges… and on top of that Java’s notoriously tricky to benchmark because of the JIT.

Took a quick look at it. Nifty!

Yes, that’s true if you have arguments… but a regular method call is not 100% free in this regard, either. Lambdas will also have the exact same problem, as I recall.

You will also have the overhead of creating the argument array… if you have arguments.

Hm, I thought the compiler/bytecode verifier would take care of all type checking.

This turned out to be a surprisingly high cost on the GC end of things. Once I made the method caching optimization I mentioned in my original post, I ran it through VisualVM to get a feel for garbage creation, and my simple little benchmark loop was beating the GC mercilessly because I was creating two arrays per opcode dispatch - an array of parameters to pass to the method and an array of the parameter types to verify method signatures before calling. Caching and reusing those arrays made the interpreter garbage-free. The reuse made no measurable difference to performance over allocating new arrays (I wouldn’t expect it to), but now I have a nice, steady, gradual allocation slope as the immutable integer objects are created in the benchmark’s while-loop.
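Concretely, the reuse amounts to something like the following sketch (the declarations are hypothetical, and cachedMethod stands in for the cached Method reference). One array per parameter count is allocated when the interpreter is constructed, and each dispatch just overwrites the argument slots:

// one reusable argument array per parameter count; slot 0 always holds the interpreter
private final Object[][] internalParams = new Object[][]{
        null,
        new Object[]{ this },               // 1-parameter calls
        new Object[]{ this, null },         // 2-parameter calls
        new Object[]{ this, null, null }    // 3-parameter calls
};

// per dispatch - no new allocation:
internalParams[2][1] = rh;
Object result = cachedMethod.invoke(lh, internalParams[2]);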

These objects in the young generation will make your stats look bad but will not affect performance… they are effectively free for the GC.

I know that the cost is only paid by longer-lived objects that persist past the young generation, but doesn’t filling up the young generation trigger collections more frequently to clear it out?

Yes and no - if you create more objects there are of course more objects to collect, but GCs such as G1 are very good at handling that. It’s actually pretty hard to choke it. Don’t worry about it unless you have measured and concluded that it is indeed young-generation objects causing performance issues.

The GC wasn’t causing any performance issues in my tests - I just reused the arrays because they were low-hanging fruit and didn’t significantly complicate the code. It might not make a significant improvement, but it doesn’t hurt, and it makes the interpreter 100% garbage-free during the bytecode dispatch operation. Also, while it might not make any difference at low scale, when deployed I expect a high number of scripts (hundreds to thousands) to be running simultaneously on a single server. Even then garbage arrays might not be a problem at all, but with array reuse I know they won’t be. :wink:

Java’s GCs are marvelous, and I’m looking forward to trying out G1 at operational scale. I always chuckle when C++ programmers talk up the (supposed) performance advantages of not having a garbage collector and then happily roll out (significantly higher-overhead) reference-counting shared pointers as an alternative. :rofl:

…so I assume you’ve made sure that the array caching is in a ThreadLocal or something so that threaded calls don’t share it? Or are the values essentially unchanged also?

At the moment the interpreter isn’t designed for multi-threaded use, no. The interpreter right now really only encapsulates the script’s execution state, memory, and a few internals like the method cache and the argument arrays. The scripting language itself has no native concept of threads, and if I ever do support concurrency I’m leaning in favor of an Erlang/Akka actors & message passing pattern (with no script-accessible shared memory) rather than raw threads/locks/etc.

Multithreaded use from the host JVM should be straightforward with that same paradigm.

It’s just that caching can introduce a whole host of “I didn’t think of that” style bugs that are really strange to track down. If it were me, I’d leave in the option to toggle it off with a boolean flag or something. That way, if there is ever a “why am I getting these values?!?” type of situation, you can just turn it off. (Or you can leave it off until you are ready to benchmark, and if you see strange behavior then you know where to look.)

In this case “caching” is probably a misnomer - I’m just using a set of Object arrays that are stored as instance members of the interpreter class. Each array is overwritten every time it’s used (with the exception of location 0 which always holds a reference to the interpreter and never changes), and they’re only written in one place and only read in another.

I like the idea of the boolean flag, but as the core dispatch loop is written now it would require a major refactoring to support (and since the arrays are overwritten with every bytecode dispatch I don’t see the value of such a refactoring). Using them is pretty simple, too, because they’re only used for one type of opcode.

The interpreter has three types of opcodes:

  1. Interpreter state opcodes (stack push, pop, etc.). These are handled immediately in the bytecode dispatch loop and don’t make any calls (and thus don’t touch the call arrays).
  2. “Internal” (sort of) opcodes. These are what the argument arrays are used for. Opcodes like add, subtract, etc. fall in this category. These run by popping a target and one or more arguments off of the stack. Arguments are written into an argument array, which holds the exact set of arguments that will be passed to the target method. The bytecode dispatch loop then calls another internal method specifying the type of operation (add, subtract, etc), the name of the method to call on the target object (“plus” in the case of addition), and the parameter array. That method takes care of retrieving cached method references, checking parameter types, and performing the final call to the matching method on the target. This makes operator overloading trivial, and since a reference to the interpreter is always passed it’s very easy for the target object to cooperate with the interpreter’s memory tracing system.
  3. “External” opcodes - right now there’s only one, and it results in a plain ol’ reflective call to a method of any name on the target object. This bridges the gap between Java’s method convention and the scripting language’s methods-as-objects convention, so calling plain ol’ Java objects from within the scripting language is easy.

The argument arrays are only used for opcodes of type 2, and since these opcodes always only take 1, 2, or 3 arguments (depending on the type of opcode) it’s easy to keep the argument arrays straight. Any bugs related to reusing these arrays should pop up quite quickly once I have full unit test coverage of the opcode dispatch.

Here are snippets from the dispatch loop and the Integer runtime class to illustrate:

Interpreter dispatch loop:

case ADD:
    rh = this.pop();                              // right-hand operand
    lh = this.pop();                              // left-hand operand (the dispatch target)
    internalParams[2][1] = rh;                    // slot 0 already holds the interpreter reference
    this.push(doInternal(InternalOp.ADD, lh, 2)); // look up and invoke the target's "plus" method
    ip++;
    break;

The array assignment line sets the right-hand side of an addition expression to be the argument to a two-parameter method invocation (the first parameter is a reference to the interpreter and is set previously). The InternalOp plus the types of the objects in that array are used to look up the method on the target object.

The doInternal() method signature:

private Object doInternal(InternalOp op, Object target, int paramCount)

(I should probably just pass the array instead of the paramCount (which is used to look up the right array) - that’s probably an artifact of the days before I’d fully distinguished between opcode types 2 & 3).

The CInteger.java “plus” method:

public CInteger plus(Interpreter vm, CInteger other){
    vm.traceMem(4);   // report the allocation to the interpreter's memory tracing system
    return new CInteger(value + other.value);
}

Aside: the great thing about this design is that supporting floating-point addition is as easy as adding an overloaded “plus()” method that takes a float instead of an integer - the interpreter automagically calls the right one based on parameter types.
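Hypothetically (assuming a CFloat runtime class that mirrors CInteger), that overload would just be:

public CFloat plus(Interpreter vm, CFloat other){
    vm.traceMem(4);                          // assumed tracing cost, mirroring the integer case
    return new CFloat(value + other.value);  // int + float promotes to float
}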

So the trouble with the boolean flag is that I’d have to jump through some extra hoops to avoid using the per-instance arrays. It’s certainly doable, but considering how simple this design is I’m not sure it’s worth it - any bugs should crop up pretty quickly.

And then they’re cleared again after use? So as to avoid holding onto object references and preventing them from getting GC’ed?

Anyway, things will either work fine for you or they won’t. Just be prepared for the potential “they won’t” to result in days of nasty debugging sessions before finding that some reference wasn’t cleared like it was supposed to be, or that an iteration stopped short while overwriting because of some other error, etc. It probably won’t happen, but just keep it in mind.

Reusing data structures across completely different uses (of which caching is the primary example) is always a good place for these weird “why do these two unrelated things affect each other somehow” type of errors.

Got me there. They are now. :wink:

The “retention time” of arguments in the arrays shouldn’t be high anyway because of how soon they’re likely to be overwritten, but it makes zero performance difference and you’re right that it’s better to get rid of them ASAP.
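(In code it’s just a hypothetical one-liner at the end of the dispatch, e.g. with java.util.Arrays:

// null out everything except slot 0 (the interpreter reference) once the call returns
Arrays.fill(internalParams[2], 1, internalParams[2].length, null);

so the arrays never pin operands past the dispatch that used them.)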

Noted, thanks.

Yes, and it’s a pattern I generally avoid like the plague. In this case I think it’s a good tradeoff to relieve some of the pressure on the GC with a lot of scripts running.