How fast is Java? Teaching an old dog new tricks


Java, the language everyone loves to hate. Long has it garnered the reputation of a stagnant, obtuse, bloated beast only used by those who have no other choice because some enterprise corporation believed the hype train of the 90s. But is that all still true today? Has Java calcified in its unchanging, object-forward ways, destined to fall into the well of has-beens? Or has the old dog learned some new tricks?

The challenge: use every new trick in the “book of Java” to simulate as many particles as possible, CPU only, and maybe, just maybe, beat Rust?


Pregaming

Java has always had a special place in my heart as it was the first language I went deep into. In large part this was because my university used it as the de facto standard. That was well over ten years ago. Damn. Time is cruel.

I did game dev with Java and remember writing my first multi-threaded particle simulation and the obligatory Minecraft clone even though I have never played Minecraft. Good stuff. Painful by today's standards but fun at the time. Fast forward a decade of never touching the language and I find myself reading about a fancy SIMD incubator API.

“SIMD? in Java?” I says to myself, “Has hell frozen over?”

That is correct, hell has frozen over and Java has a SIMD api which abstracts the complexity of SIMD behind a fascinating api.

For the uninitiated, go google what SIMD is, or AI it, I don't care.

For the lazy, SIMD is a set of vector instructions for the CPU which can crunch multiple numbers at a time. While many compilers are smart enough to do this automagically for you, it is not guaranteed to happen, nor to produce an optimal set of instructions.

The problem with SIMD is that each CPU architecture has a different spec on what vector instructions it supports based on the physical hardware. Some support only 128-bit vectors, which is four 32-bit floating point numbers at a time. Others support 256 or even 512 bits. This means you often have to write the same thing multiple times, once for each instruction set, or use a library that has already done this for you, but where is the fun in that.

Java is special. It is write once run everywhere or so they claim.

To find out how well Java's SIMD api works, I need a baseline. Luckily, I have recently written multi-threaded SIMD'd particle sims in both Rust and Swift. I did similar in Javascript and Go but they are slow languages that do not have native SIMD apis. Maybe v8 does if you look at it right.

Additionally, Java has some fancy lambdas and multi-threaded iterators which in theory should make threading a cakewalk. Fantastic.

I'll look at the specific APIs later but first I need to figure out how to draw pixels onto a screen in Java.

Painting in Java

grug not smart grug is image

I want to keep the setup similar to what I did with Rust and Swift. The simulation will be 2d with the ability to apply gravity to a point on the screen pulling particles around. Each particle will be a single pixel and the GPU is off limits as much as possible.

So how do I draw pixels to a window in Java?

With Rust, I had to use a windowing library that abstracted away the OS specific work for managing windows. Swift only works in the land of Tim Apple (rest in peace) so the Swift native APIs were enough. Java, on the other hand, has a set of default APIs which work across all OSs. They don't use the native OS look but they do work.

Last time I touched Java the UI solution was the built-in Swing library which was supposed to be replaced with JavaFX. Today...JavaFX “seems” to be the standard but you have to download the jar (Java's version of a package) and include it in maven or gradle...

Swing it is, because fuck maven and gradle. I am not sorry. Java, I love you but really? No default package manager...still? Can I not, jfaster add jfx? Disappointing.

This wouldn't have mattered much anyway because, either way, painting pixels onto a window means writing into an in-memory image and drawing that. In Swing this is a BufferedImage (JavaFX has its own equivalent, WritableImage). A BufferedImage is what it sounds like. Some pixels in RAM which can be drawn to the screen. Perfect.
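To make “some pixels in RAM” concrete, here is a minimal sketch, not code from the sim. It creates a small grayscale BufferedImage and writes into its backing byte array directly. The names `PixelBufferSketch` and `backingArray` are made up for illustration; the image type is the same `TYPE_BYTE_GRAY` used further down.

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;

class PixelBufferSketch {
    // Returns the byte array that backs a TYPE_BYTE_GRAY image.
    // Writes to this array change the image without any copying.
    static byte[] backingArray(BufferedImage image) {
        return ((DataBufferByte) image.getRaster().getDataBuffer()).getData();
    }

    public static void main(String[] args) {
        int width = 4;
        int height = 3;
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        byte[] pixels = backingArray(image);

        // One byte per pixel, row-major: index = y * width + x.
        int x = 2;
        int y = 1;
        pixels[y * width + x] = (byte) 200;

        // Reading the sample back through the raster sees the same value.
        System.out.println(image.getRaster().getSample(x, y, 0));
    }
}
```

Working on the raw array like this avoids a method call per pixel, which matters once there are millions of particles per frame.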

I will save you the Java Swing boilerplate. The gist is that you create a JFrame and add UI components to it. You create objects that extend or implement Swing apis to create custom logic full OOP style. In this case, I have a custom Panel (the particle sim) which I add to a JFrame.

Java does have a fancy new main function which doesn't require a class but alas it didn't work with Swing. Oh well.

    import javax.swing.JFrame;

    public class ParticleSim {
        public static void main(String[] args) {
            new ParticleSim().createAndShowGUI();
        }

        private void createAndShowGUI() {
            JFrame frame = new JFrame("Sips Java");
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);

            int width = 1200;
            int height = 800;
            ParticlePanel particlePanel = new ParticlePanel(width, height);
            frame.add(particlePanel);

            frame.pack();
            frame.setLocationRelativeTo(null); // center the window on screen
            frame.setVisible(true);

            particlePanel.startSimulation();
        }
    }

The ParticlePanel is long, verbose and Java'y. Initially, I did the logic on the event dispatch thread, overriding the JPanel's paint method (paintComponent) and having a Timer call a tick method which triggers the simulation. Here is the outline.

Be warned, this is long. Hence, the Java'y.

    import java.awt.Graphics;
    import java.awt.event.ActionEvent;
    import java.awt.event.ActionListener;
    import java.awt.event.MouseListener;
    import java.awt.event.MouseMotionListener;
    import java.awt.image.BufferedImage;
    import java.awt.image.DataBufferByte;
    import javax.swing.JFrame;
    import javax.swing.JPanel;

    public class ParticleSim {
        public static void main(String[] args) {
            new ParticleSim().createAndShowGUI();
        }

        private void createAndShowGUI() {
            JFrame frame = new JFrame("Vector API Particle Sim (Requires flags)");
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);

            int width = 1600;
            int height = 900;
            ParticlePanel particlePanel = new ParticlePanel(width, height);
            frame.add(particlePanel);

            frame.pack();
            frame.setLocationRelativeTo(null);
            frame.setVisible(true);

            particlePanel.startSimulation();
        }
    }

    // Outline only: the timer, isTicking, lastTickTime and frames fields and the
    // mouse listener methods are left out.
    class ParticlePanel extends JPanel implements ActionListener, MouseListener, MouseMotionListener {
        private static final int NUM_PARTICLES = 80_000_000;
        private static final int UPDATE_RATE = 1000 / 60; // ms between ticks (about 60 fps)

        // Structure of arrays: one flat float array per component.
        private float[] positionsX = new float[NUM_PARTICLES];
        private float[] positionsY = new float[NUM_PARTICLES];
        private float[] velocitiesX = new float[NUM_PARTICLES];
        private float[] velocitiesY = new float[NUM_PARTICLES];

        private final BufferedImage image;
        private final byte[] pixelArray;
        private final int panelWidth;
        private final int panelHeight;

        public ParticlePanel(int width, int height) {
            // The outline left these final fields unassigned; this wiring is assumed.
            panelWidth = width;
            panelHeight = height;
            image = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
            pixelArray = ((DataBufferByte) image.getRaster().getDataBuffer()).getData();
            initializeParticles();
        }

        private void initializeParticles() { }

        public void startSimulation() {
            timer.start();
        }

        // Called by the Swing Timer on the event dispatch thread.
        @Override
        public void actionPerformed(ActionEvent e) {
            if (isTicking) {
                return;
            }
            isTicking = true;

            long now = System.nanoTime();
            float deltaTime = (now - lastTickTime) / 1_000_000_000.0f; // seconds
            lastTickTime = now;

            updatePhysics(deltaTime);
            renderToPixelArray();
            repaint();

            frames++;
            isTicking = false;
        }

        private void updatePhysics(float deltaTime) { }

        private void renderToPixelArray() { }

        @Override
        protected void paintComponent(Graphics g) {
            super.paintComponent(g);
            g.drawImage(image, 0, 0, this);
        }
    }

Most of it is standard Java Swing code. Rendering is a little interesting as I map particles into a flat buffer of pixels.

    private void renderToPixelArray() {
        final int w = panelWidth;
        final int h = panelHeight;

        // Clear the previous frame.
        Arrays.fill(pixelArray, (byte) 0);

        for (int i = 0; i < NUM_PARTICLES; i++) {
            int px = (int) positionsX[i];
            int py = (int) positionsY[i];

            // Skip particles that are off screen.
            if (px < 0 || px >= w || py < 0 || py >= h) {
                continue;
            }

            // Each particle adds 1 to its pixel's brightness, saturating at 255.
            int index = py * w + px;
            int lu = pixelArray[index] & 0xFF; // read the byte as unsigned
            lu = Math.min(255, lu + 1);
            pixelArray[index] = (byte) lu;
        }
    }
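The `& 0xFF` in there matters because Java bytes are signed. Here is a small standalone sketch of the same saturating counter, with made-up names (`SaturatingCounterSketch`, `hit`), showing what happens when more than 255 particles land on one pixel.

```java
class SaturatingCounterSketch {
    // Adds one "particle hit" to a pixel, treating the byte as unsigned (0..255)
    // and clamping at 255 so it cannot wrap around to 0.
    static void hit(byte[] pixels, int index) {
        int brightness = pixels[index] & 0xFF; // signed byte -> 0..255
        pixels[index] = (byte) Math.min(255, brightness + 1);
    }

    public static void main(String[] args) {
        byte[] pixels = new byte[1];
        for (int i = 0; i < 300; i++) {
            hit(pixels, 0);
        }
        // The raw byte reads as negative; masking recovers the unsigned value.
        System.out.println(pixels[0] + " " + (pixels[0] & 0xFF));
    }
}
```

Without the mask, a pixel that has already reached 128 would read back as a negative number, `Math.min` would happily keep incrementing it, and the brightness would wrap around instead of stopping at white.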

But the real interesting part is updatePhysics as that is where the magic happens.

What Species are you?

Remember that SIMD vectors can have different widths, and therefore different lane counts, depending on hardware. It is usually ideal to use the largest supported width. Java's SIMD api handles this by defining a species for a given data type such as float or int. You can then load and store data through Vector types sized by the species.

This is what it looks like.

    private static final VectorSpecies<Float> F_SPECIES = FloatVector.SPECIES_PREFERRED;
    private static final int LANE_SIZE = F_SPECIES.length();
    private static final FloatVector PULL_VEC = FloatVector.broadcast(F_SPECIES, 500f);

This allows code that is pretty generic across the hardware it runs on. SPECIES_PREFERRED is the one that supposedly will perform the best. There is also SPECIES_MAX, which will use the largest size but may not perform the best. On my M1 they both come out at 4 floats, or 128 bits. I'll stick to preferred.

I don't know why but I love the name species here, it cracks me up. I can only imagine a giant committee of people coming to a consensus on it.

Multi-threaded SIMD

I got ahead of myself and thought I could jump right into multi-threading too using Java's neat streaming api. The idea is to create an IntStream.range which will chunk up the particles based on the number of lanes and then do parallel iteration over them.

    final int w = this.panelWidth;
    final int h = this.panelHeight;
    final float wFloat = (float) w;
    final float hFloat = (float) h;

    // Per-frame values, broadcast to every lane.
    final FloatVector DT_VEC = FloatVector.broadcast(F_SPECIES, deltaTime);
    final var MOUSE_X_VEC = FloatVector.broadcast(F_SPECIES, (float) mousePosition.x);
    final var MOUSE_Y_VEC = FloatVector.broadcast(F_SPECIES, (float) mousePosition.y);

    final int VECTOR_CHUNKS = NUM_PARTICLES / LANE_SIZE;
    final int SCALAR_START_INDEX = VECTOR_CHUNKS * LANE_SIZE; // first leftover particle

    // GT and LT are the comparison operators from VectorOperators.
    IntStream.range(0, VECTOR_CHUNKS).parallel().forEach(chunkIndex -> {
        int i = chunkIndex * LANE_SIZE;

        // Load one vector's worth of particles.
        var px = FloatVector.fromArray(F_SPECIES, positionsX, i);
        var py = FloatVector.fromArray(F_SPECIES, positionsY, i);
        var vx = FloatVector.fromArray(F_SPECIES, velocitiesX, i);
        var vy = FloatVector.fromArray(F_SPECIES, velocitiesY, i);

        if (mouseIsPressed) {
            var dx = MOUSE_X_VEC.sub(px);
            var dy = MOUSE_Y_VEC.sub(py);
            var distSq = dx.mul(dx).add(dy.mul(dy));

            // Only pull particles farther away than the minimum distance.
            var gravityMask = distSq.compare(GT, MIN_DIST_SQ_VEC);
            if (gravityMask.anyTrue()) {
                var dist = distSq.sqrt();
                var forceX = dx.div(dist).mul(PULL_SCALED_VEC);
                var forceY = dy.div(dist).mul(PULL_SCALED_VEC);
                // Masked add: lanes outside the mask keep their velocity.
                vx = vx.add(forceX, gravityMask);
                vy = vy.add(forceY, gravityMask);
            }
        }

        // Apply friction, then integrate positions.
        vx = vx.mul(FRICTION_DT_VEC);
        vy = vy.mul(FRICTION_DT_VEC);
        px = px.add(vx.mul(DT_VEC));
        py = py.add(vy.mul(DT_VEC));

        // Bounce off the left/right edges: scale the velocity, clamp the position.
        var maskLeftX = px.compare(LT, ZERO_VEC);
        var maskRightX = px.compare(GT, W_VEC);
        var maskBounceX = maskLeftX.or(maskRightX);
        vx = vx.blend(vx.mul(BOUNCE_MULTIPLIER_VEC), maskBounceX);
        px = px.blend(ZERO_VEC, maskLeftX);
        px = px.blend(W_VEC, maskRightX);

        // Same for the top/bottom edges.
        var maskTopY = py.compare(LT, ZERO_VEC);
        var maskBottomY = py.compare(GT, H_VEC);
        var maskBounceY = maskTopY.or(maskBottomY);
        vy = vy.blend(vy.mul(BOUNCE_MULTIPLIER_VEC), maskBounceY);
        py = py.blend(ZERO_VEC, maskTopY);
        py = py.blend(H_VEC, maskBottomY);

        // Store the results back into the arrays.
        px.intoArray(positionsX, i);
        py.intoArray(positionsY, i);
        vx.intoArray(velocitiesX, i);
        vy.intoArray(velocitiesY, i);
    });

The multi-threading ergonomics are very similar to both Rust and Swift. More importantly, it works.

The var keyword makes Java a bit easier on the eyes too but my old hands usually stick to explicit types.

Now, there is an issue the astute observer will have noticed. The particle count will not always be perfectly divisible by the lane size. This means there will be leftover particles that are not processed as it stands right now.

This can be handled in 2 ways. Keep processing as vectors by padding out any empty lanes for the vectors. Or, have a non-vectorized loop which updates any remaining particles left over. I chose the latter option but won't show it here for brevity.
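For reference, the second option looks roughly like the sketch below. This is not the code from the sim: `TailLoopSketch` and `integrate` are made-up names, and a plain inner loop stands in for the vector operations so the example runs without the incubator module. The part that carries over is the index arithmetic.

```java
import java.util.Arrays;

class TailLoopSketch {
    // Advances positions by velocity * dt. The main loop walks in steps of
    // `lanes` (standing in for one vector operation per step); the tail loop
    // picks up whatever is left when the count is not a multiple of `lanes`.
    static void integrate(float[] pos, float[] vel, float dt, int lanes) {
        int vectorizedEnd = (pos.length / lanes) * lanes;

        for (int i = 0; i < vectorizedEnd; i += lanes) {
            // In SIMD code this inner loop would be one load/compute/store.
            for (int lane = 0; lane < lanes; lane++) {
                pos[i + lane] += vel[i + lane] * dt;
            }
        }

        // Scalar tail: at most lanes - 1 particles.
        for (int i = vectorizedEnd; i < pos.length; i++) {
            pos[i] += vel[i] * dt;
        }
    }

    public static void main(String[] args) {
        float[] pos = new float[10];
        float[] vel = new float[10];
        Arrays.fill(vel, 2f);
        // With 4 lanes: 8 particles go through the main loop, 2 through the tail.
        integrate(pos, vel, 0.5f, 4);
        System.out.println(Arrays.toString(pos));
    }
}
```

The tail never handles more than `lanes - 1` particles, so its cost is noise next to the main loop.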

It works, and here is 20m particles. Note that for a pixel to be fully white, 255 particles must be in the same exact location.

It isn't that fast, struggling to hit 20fps. It also doesn't look that interesting. I think I know why.

A Grand Refactor

One of the reasons it is slow is because of the rendering and not the simulation. It is effectively accessing the pixel buffer randomly. This is where the majority of the time was spent in the Rust and Swift versions as well and Java is no exception. There is little that can be done to optimize this other than spreading out the slowness to more threads. This will make it faster but it doesn't solve the cache thrashing access pattern.

There are a few other issues. For one, the render loop is on the event dispatch thread. That needs to go. It will introduce some complexity with input handling. Furthermore, it looks rather boring.

In the Rust and Swift versions I colored the pixels by scaling the rgb components by the x/y distances relative to the window width/height. To change things up a bit here I am going to give each particle a dedicated color. I am also going to color them all fancy a bit later. I also want to support window resize, panning, and being able to slow down particle velocities as some QoL.

I will skip over the intermediary steps and focus on where I landed.

First, getting off the event dispatch thread.

I will use one of Java's new lambda features, passing a method reference to a Thread object.

    public void startSimulation() {
        if (!running) {
            running = true;
            gameLoopThread = new Thread(this::gameLoop);
            gameLoopThread.start();
        }
    }

The game loop will use a busy while loop sleeping for a small chunk of time before checking if it can render again. I want to poll input from the event dispatch thread in between the small waits rather than at the rate I can render. This helps to keep the simulation more responsive. However, there are some events which need to run based on the render rate such as panning since the time delta needs to be factored in.

    private void gameLoop() {
        lastTickTime = System.nanoTime();

        while (running) {
            long now = System.nanoTime();
            long timeElapsed = now - lastTickTime;

            // Poll input on every pass, not just once per rendered frame.
            this.processInputRequests();

            if (timeElapsed >= NS_PER_TICK) {
                float deltaTime = timeElapsed / (float) NS_PER_SECOND;
                lastTickTime = now;

                // Keyboard panning depends on the frame's time delta.
                for (var key : this.keysPressed) {
                    float speed = 500;
                    if (this.velInputMap.containsKey(key)) {
                        this.panDeltaInput.x += velInputMap.get(key).x * speed * deltaTime;
                        this.panDeltaInput.y += velInputMap.get(key).y * speed * deltaTime;
                    }
                }

                long tickStart = System.nanoTime();
                tick(deltaTime);
                long tickEnd = System.nanoTime();
                long tickDuration = (tickEnd - tickStart);

                long renderStart = System.nanoTime();
                render();
                // Active rendering: draw straight to the panel.
                Graphics2D g = (Graphics2D) getGraphics();
                g.drawImage(image, 0, 0, this);
                g.dispose();
                Toolkit.getDefaultToolkit().sync();
                frames++;
                long renderEnd = System.nanoTime();
                long renderDuration = (renderEnd - renderStart);
            } else {
                try {
                    Thread.sleep(1);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and stop the loop.
                    Thread.currentThread().interrupt();
                    running = false;
                }
            }
        }
    }

The key line is this

if (timeElapsed >= NS_PER_TICK)

It will let me know if I need to render another frame. This should also allow for rendering more than 60 fps for those few beefcake machines out there or at low particle counts.

You may notice that I am now drawing the image after updating the pixels here too. I turned off Swing's built-in passive rendering as I will be actively rendering instead. Less complicated this way.

The tick function is next.

    private void tick(float deltaTime) {
        // Only whole vectors are handed to the workers.
        final int vectorizedEndIndex = (NUM_PARTICLES / LANE_SIZE) * LANE_SIZE;
        final int chunkSize = vectorizedEndIndex / CPU_COUNT;
        final var futures = new ArrayList<Future<?>>(CPU_COUNT);

        // Copy the latest input so the workers never touch shared input state.
        final int panDx = this.panDeltaInput.x;
        final int panDy = this.panDeltaInput.y;
        final float vScale = this.isSlowDownRequested ? this.inputVelScale : 0f;
        this.panDeltaInput.x = 0;
        this.panDeltaInput.y = 0;

        // One contiguous slice per worker; the last one takes the remainder.
        for (int i = 0; i < CPU_COUNT; i++) {
            int start = i * chunkSize;
            int end = (i == CPU_COUNT - 1) ? vectorizedEndIndex : start + chunkSize;
            ParticleUpdateTask task = tasks[i];
            task.updateParams(i, start, end, this, deltaTime, panDx, panDy, vScale);
            futures.add(executorService.submit(task));
        }

        // Wait for every worker to finish before rendering.
        for (Future<?> future : futures) {
            try {
                future.get();
            } catch (InterruptedException | ExecutionException e) {
                e.printStackTrace();
            }
        }
    }

There are several things of note here. First, borrowing from Rust, to keep from having to sync/lock variables across threads, I copy the latest input data before running the hot simulation loop. Most of the input values are synced between the main and event dispatch threads. This locking is fine since the path is not hot but for the particle updates copying the data is a must.
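The copy-before-the-hot-loop idea can be sketched on its own like this. It is an illustration under assumptions, not the sim's input code: `InputSnapshotSketch`, `Snapshot`, `addPan` and `take` are hypothetical names, and a single lock object stands in for whatever syncing the real input handling does.

```java
class InputSnapshotSketch {
    // Immutable copy of the input values a single tick needs.
    static final class Snapshot {
        final int panDx;
        final int panDy;
        final boolean slowDown;

        Snapshot(int panDx, int panDy, boolean slowDown) {
            this.panDx = panDx;
            this.panDy = panDy;
            this.slowDown = slowDown;
        }
    }

    private final Object lock = new Object();
    private int panDx;
    private int panDy;
    private boolean slowDown;

    // Called from the event dispatch thread whenever input arrives.
    void addPan(int dx, int dy) {
        synchronized (lock) {
            panDx += dx;
            panDy += dy;
        }
    }

    void setSlowDown(boolean requested) {
        synchronized (lock) {
            slowDown = requested;
        }
    }

    // Called once per tick by the simulation thread: copy the state,
    // then reset the accumulated pan so it is applied exactly once.
    Snapshot take() {
        synchronized (lock) {
            Snapshot snapshot = new Snapshot(panDx, panDy, slowDown);
            panDx = 0;
            panDy = 0;
            return snapshot;
        }
    }
}
```

The lock is taken once per tick and once per input event. The workers only ever see the immutable snapshot, so the per-particle loop stays lock free.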

The other thing is that I profiled the app and found that ~20% of the time was spent in Java's thread management internals. This is somewhat expected. In Swift, Rust, and Go there is also overhead in using the fancy threaded iterators, both in terms of memory pressure and processing time. It is faster to create a pool of workers and reuse those.

This is also a chance to use Java's Future objects for a bit of async control. I don't do much with the API here but it is fun.
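Stripped of the particle logic, the pool-plus-futures pattern is small. The sketch below uses made-up names (`WorkerPoolSketch`, `doubleAll`) and a trivial workload; the chunking matches the tick function above, with the last worker taking the remainder.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class WorkerPoolSketch {
    // Doubles every element, giving each worker one contiguous chunk.
    static void doubleAll(float[] data, ExecutorService pool, int workers) throws Exception {
        int chunkSize = data.length / workers;
        List<Future<?>> futures = new ArrayList<>(workers);

        for (int w = 0; w < workers; w++) {
            final int start = w * chunkSize;
            // The last worker also takes the remainder.
            final int end = (w == workers - 1) ? data.length : start + chunkSize;
            futures.add(pool.submit(() -> {
                for (int i = start; i < end; i++) {
                    data[i] *= 2f;
                }
            }));
        }

        // Block until every chunk is done; get() also rethrows worker failures.
        for (Future<?> future : futures) {
            future.get();
        }
    }

    public static void main(String[] args) throws Exception {
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            float[] data = new float[1_000_003];
            Arrays.fill(data, 1f);
            // The same pool can be reused for every frame.
            doubleAll(data, pool, workers);
            System.out.println(data[0] + " " + data[data.length - 1]);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each worker writes to its own index range, the only synchronization needed is the `get()` at the end.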

The ParticleUpdateTask code is generally the same with a few changes. Here are the important parts.

    class ParticleUpdateTask implements Runnable {
        // Fields and updateParams(...) omitted.

        @Override
        public void run() {
            // Simulate this worker's slice, one vector at a time.
            for (int i = startIndex; i < vectorEndIndex; i += LANE_SIZE) {
                FloatVector px = FloatVector.fromArray(F_SPECIES, positionsX, i);
                FloatVector py = FloatVector.fromArray(F_SPECIES, positionsY, i);
                FloatVector vx = FloatVector.fromArray(F_SPECIES, velocitiesX, i);
                FloatVector vy = FloatVector.fromArray(F_SPECIES, velocitiesY, i);

                if (mouseIsPressed) {
                    FloatVector dx = MOUSE_X_VEC.sub(px);
                    FloatVector dy = MOUSE_Y_VEC.sub(py);
                    FloatVector distSq = dx.mul(dx).add(dy.mul(dy));

                    var gravityMask = distSq.compare(GT, minPullDist);
                    if (gravityMask.anyTrue()) {
                        FloatVector dist = distSq.sqrt();
                        FloatVector forceX = dx.div(dist).mul(gf);
                        FloatVector forceY = dy.div(dist).mul(gf);
                        vx = vx.add(forceX, gravityMask);
                        vy = vy.add(forceY, gravityMask);
                    }
                }

                // Integrate, apply the pan offset, then apply friction.
                px = px.add(vx.mul(deltaTime)).add(ox);
                py = py.add(vy.mul(deltaTime)).add(oy);
                vx = vx.mul(FRICTION_DT_VEC);
                vy = vy.mul(FRICTION_DT_VEC);

                px.intoArray(positionsX, i);
                py.intoArray(positionsY, i);
                vx.intoArray(velocitiesX, i);
                vy.intoArray(velocitiesY, i);
            }

            // Draw this worker's particles into its own pixel buffer.
            var pixels = panel.threadPixelBuffers[id];
            Arrays.fill(pixels, 0);
            for (int i = startIndex; i < endIndex; i++) {
                // Clamp to the window so the index is always valid.
                int px = (int) Math.min(Math.max(positionsX[i], 0), w - 1);
                int py = (int) Math.min(Math.max(positionsY[i], 0), h - 1);
                int index = py * w + px;
                pixels[index] = colors[i]; // last writer wins
            }
        }
    }

I removed the edge bouncing from the particles and cleaned up the SIMD code a bit more. Most importantly, I gave each thread a local pixel buffer to draw particles to.

I did experiment with SIMD'ing the pixel buffer clamping and index code but it ended up being slower. I also tried using some more fancy SIMD functions from Java's api such as fma but it too ended up being slower.

The final major change was accumulating the local worker pixel buffers directly into the BufferedImage's data.

    private void render() {
        // Write straight into the BufferedImage's backing array.
        int[] buff = ((DataBufferInt) image.getRaster().getDataBuffer()).getData();
        Arrays.fill(buff, 0);
        final int PIXEL_COUNT = buff.length;

        // Each parallel task merges one contiguous range of pixels.
        IntStream.range(0, CPU_COUNT).parallel().forEach(chunkIndex -> {
            int chunkSize = PIXEL_COUNT / CPU_COUNT;
            int start = chunkIndex * chunkSize;
            int end = (chunkIndex == CPU_COUNT - 1) ? PIXEL_COUNT : start + chunkSize;

            for (int i = start; i < end; i++) {
                // Take the first non-zero color found across the worker buffers.
                int color = 0;
                for (int localIndex = 0; localIndex < CPU_COUNT; localIndex++) {
                    int col = threadPixelBuffers[localIndex][i];
                    if (col != 0) {
                        color = col;
                        break;
                    }
                }
                buff[i] = (0xFF << 24) | color; // force full alpha
            }
        });
    }

I am using a “last writer wins” strategy for thread local pixel buffers and a “first writer wins” strategy when merging. Having the threads write directly to the same pixel buffer does work but introduces some cache coherency issues. Performance is more consistent when each worker gets its own buffer to draw to, although it does bloat the memory footprint.
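To see the merge rule on a tiny input, here is a single-threaded sketch with made-up names (`BufferMergeSketch`, `merge`). It assumes, as the render code does, that 0 means “no particle drew here”.

```java
import java.util.Arrays;

class BufferMergeSketch {
    // Merges per-thread pixel buffers into `out`. For each pixel the first
    // buffer holding a non-zero color wins; 0 means "no particle here".
    static void merge(int[][] threadBuffers, int[] out) {
        for (int i = 0; i < out.length; i++) {
            int color = 0;
            for (int[] buffer : threadBuffers) {
                if (buffer[i] != 0) {
                    color = buffer[i];
                    break;
                }
            }
            out[i] = 0xFF000000 | color; // force full alpha
        }
    }

    public static void main(String[] args) {
        int[][] buffers = {
            {0, 0x0000FF, 0}, // worker 0
            {0x00FF00, 0xFF0000, 0}, // worker 1
        };
        int[] out = new int[3];
        merge(buffers, out);
        // Pixel 1 was drawn by both workers; worker 0's color is kept.
        System.out.println(Arrays.toString(out));
    }
}
```

One side effect of using 0 as the empty marker is that a particle whose color is pure black is indistinguishable from no particle at all.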

Speaking of memory. Java clocks in at a pretty high baseline of over 300mb for 1m particles. However, it scales in line with Rust. 100m particles is about the same as Rust + the 300mb baseline.

How is the performance though? Before that...

Perfectly Placed Pixels

I added a colors array which stores particle colors. I want to fill this with something interesting. A friend had the idea of coloring the particles based on their angle to the center point, simulating a color wheel. I thought this was cool but thought that I should use an OKLAB way of going about it.

I don't know much about color formats and my time for learning these specific details is limited. So, I had AI whip up some code to find the OKLAB color hue based on the angle of the particle towards the center point on the screen. Is it correct? I doubt it. I'd prefer to use a library but you know...gradle.

I went ahead and added a few more placement styles too: one that does it as a circle (I wrote most of this code) and another which will have N points randomly distributed, coloring each particle based on the distance to the points. I did not write this one but it looks pretty interesting.
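The angle-to-hue part of that idea can be sketched with what ships in the JDK. To be loud about it: this uses plain HSB via `Color.HSBtoRGB` as a stand-in and is NOT OKLAB, so it will not have OKLAB's perceptual evenness. `ColorWheelSketch` and `colorFor` are made-up names.

```java
import java.awt.Color;

class ColorWheelSketch {
    // Maps a particle's angle around the center to a hue and returns packed RGB.
    // Uses HSB as a simple stand-in; this is NOT OKLAB.
    static int colorFor(float x, float y, float centerX, float centerY) {
        double angle = Math.atan2(y - centerY, x - centerX); // -PI..PI
        double turn = (angle + 2 * Math.PI) % (2 * Math.PI); // 0..2*PI
        float hue = (float) (turn / (2 * Math.PI)); // 0..1
        return Color.HSBtoRGB(hue, 1f, 1f) & 0x00FFFFFF; // drop the alpha bits
    }

    public static void main(String[] args) {
        float cx = 800;
        float cy = 450;
        // A particle directly to the right of the center sits at hue 0 (red).
        System.out.printf("%06X%n", colorFor(cx + 100, cy, cx, cy));
        // A particle directly to the left sits half a turn away.
        System.out.printf("%06X%n", colorFor(cx - 100, cy, cx, cy));
    }
}
```

The result fits in the low 24 bits, which is the shape the render step expects before it ORs in the alpha channel.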

Ok, here is a demo.

I think it looks amazing. You can right click to pan or use wasd. Space bar will slow the particles down. I love playing around with this. If I spent a fraction of the time figuring out the OKLAB coloring as I did playing around with this, I'd probably have gotten that done without AI but alas the gooey dopamine addict goblin in my brain must be fed.

Speaking of the gooey goblin, there was a reason I picked a 32-bit int color on the particles rather than a single byte or even an on/off flag.

What if I colored the particles based on a user defined Image? I could scale the image up or down based on the particle count being careful to place any extra particles in duplicate locations. That'd be neat, right?

Well, press the number 4 and pick a picture. Here is 20m particles with a favorite of mine.

Too darn cool. It will scale the image up or down based on the particles. This image is 4k scaled up to 20m particles which is almost 3x the size.
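I won't show the image mapping code, but the core idea of spreading N particles over M pixels can be sketched in one line of integer math. This is an assumed, simplified 1D version with made-up names (`ImageMappingSketch`, `pixelIndexFor`), not what the sim does.

```java
class ImageMappingSketch {
    // Picks the source pixel index for particle i so that particleCount
    // particles are spread evenly over pixelCount pixels. The long cast
    // keeps i * pixelCount from overflowing at large counts.
    static int pixelIndexFor(int i, int particleCount, int pixelCount) {
        return (int) ((long) i * pixelCount / particleCount);
    }

    public static void main(String[] args) {
        int particleCount = 10;
        int pixelCount = 4;
        // More particles than pixels: several particles share each pixel.
        for (int i = 0; i < particleCount; i++) {
            System.out.print(pixelIndexFor(i, particleCount, pixelCount) + " ");
        }
        System.out.println();
    }
}
```

With more particles than pixels the extras double up on existing pixels, and with fewer particles some pixels are simply skipped, so the same function covers scaling both up and down.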

It makes me want to add zooming too but I think I need to stop.

So, performance?

How fast IS Java?

I'll compare it to Rust as it is currently the champ, but note that the versions are not a perfect one-to-one mapping. Rust only writes 1 byte per pixel, but memory speed is not the limiting factor when rendering compared to the random access into the pixel buffer. Java's Vector api is also still in an “incubator” state so perhaps it will improve in the future.

Here are the results on an M1 Air.

Language  Count  Total MS  Render MS  Tick MS
Rust      1m     ~8.8      7.7        1.1
Java      1m     ~7        4.5        2.5
Rust      10m    ~8.1      1.9        6.2
Java      10m    ~18.2     3.7        14.4
Rust      20m    ~14.5     1.58       12.79
Java      20m    ~24.7     2.8        21.8
Rust      50m    ~36.2     1.8        34.4
Java      50m    ~67.8     2.7        65
Rust      100m   ~68.7     1.7        67
Java      100m   ~118.8    2.7        116
Rust      200m   ~144.3    1.8        142.5
Java      200m   ~216.5    2.5        214.5

Well would you look at that. From 10m particles up, Rust is roughly 1.5x to 2.2x faster. Only at 1m does Java come out ahead.
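To put numbers on that comparison, the Java/Rust ratio of the Total MS column can be computed directly. A throwaway sketch (`SpeedupSketch` and `ratio` are made-up names; the values are copied from the table above):

```java
class SpeedupSketch {
    // How many times longer Java's frame takes than Rust's.
    static double ratio(double javaMs, double rustMs) {
        return javaMs / rustMs;
    }

    public static void main(String[] args) {
        // Total MS values copied from the results table.
        String[] counts = {"1m", "10m", "20m", "50m", "100m", "200m"};
        double[] rust = {8.8, 8.1, 14.5, 36.2, 68.7, 144.3};
        double[] java = {7.0, 18.2, 24.7, 67.8, 118.8, 216.5};

        for (int i = 0; i < counts.length; i++) {
            System.out.printf("%s: %.2fx%n", counts[i], ratio(java[i], rust[i]));
        }
    }
}
```

A ratio below 1 means Java was the faster of the two at that particle count.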

Both versions fill their pixel buffers in the tick which is why the render doesn't change much across scales. That is the time it takes to accumulate the buffers and blit the buffer to the screen which is usually a constant.

It is odd that at 1m Rust “seems” to take longer rendering which was reproducible on my side consistently but I don't know why.

Rust allocates memory much faster. This is because Java is allocating on the JVM's managed heap. It is possible to use off-heap memory, which is often 2-3x as fast to create, but I found its performance to be ever so slightly worse than on-heap. We are talking only a few percentage points, but it is a consistent few percentage points, specifically around reading rather than writing.
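For anyone curious what off-heap storage looks like, here is one way to do it with a direct NIO buffer. This is an assumption for illustration: the post does not say which off-heap API was used in the experiment, and `OffHeapSketch` and `allocate` are made-up names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

class OffHeapSketch {
    // Allocates room for `count` floats outside the Java heap.
    static FloatBuffer allocate(int count) {
        return ByteBuffer.allocateDirect(count * Float.BYTES)
                .order(ByteOrder.nativeOrder()) // match the CPU's byte order
                .asFloatBuffer();
    }

    public static void main(String[] args) {
        FloatBuffer positionsX = allocate(1_000_000);

        // Absolute get/put take the place of array indexing.
        positionsX.put(42, 3.5f);
        System.out.println(positionsX.get(42) + " direct=" + positionsX.isDirect());
    }
}
```

Every access goes through `get` and `put` instead of plain array indexing, which is the kind of indirection that could plausibly account for a small read-side difference like the one measured above.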

Java has come a long way since I last used it. It is only about half as fast as Rust without any borrow checking!

Sadly, while the language of Java has made strides allowing for a more ergonomic style, the overall ecosystem leaves much to be desired.

If I were to write a game again today, what stops me from picking Java is not the language but the fact that setting up a decent build system pulling in a small set of libraries is such a pain in the ass. I remember in the past having to write a convoluted build to get all the right vorbis and opengl jars, dlls, etc to work. It was a massive pain and nothing has changed on that front.

I suppose you can teach an old dog new tricks but if the dog still hangs out in the same dusty old dog park with no grass or big yellow balls, it won't get a chance to show off any new hotness.

Still, Java is a dog that will always have a special place in my heart. Good boy.

Until next time.
