Screenshot-based JME testing

I’ve been thinking about a couple of problems we have:

  • JME has a slow release cycle
  • A lot of manual work is involved in releasing JME (obviously these first two are related)
  • Manual testing often unearths bugs that have to be fixed before the release

Looking at the issues that came out of testing the jMonkeyEngine v3.7.0-beta1 release, they were mostly of the form “I did this slightly unusual thing and it looks different from how it was in 3.6.1”. Given that, would it be worth us adding (if it doesn’t already exist) screenshot-based testing, where JME is set up in a “unit” test, a screenshot is taken, and that is compared with a PNG committed to the repo? If they are identical the test passes; otherwise the test fails (and if the new version is “better”, an updated screenshot is added to the repository).
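
As a rough illustration of the comparison step (a minimal sketch, not code from the proof of concept; the class and method names are made up), checking “identical” really can be as blunt as a pixel-for-pixel comparison of the two PNGs:

    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;
    import javax.imageio.ImageIO;

    /** Hypothetical helper: true only if the two PNGs are pixel identical. */
    public final class ImageCompare {

        public static boolean pixelIdentical(File expectedPng, File actualPng) throws IOException {
            BufferedImage expected = ImageIO.read(expectedPng);
            BufferedImage actual = ImageIO.read(actualPng);

            if (expected.getWidth() != actual.getWidth()
                    || expected.getHeight() != actual.getHeight()) {
                return false; // different resolutions can never match
            }
            for (int y = 0; y < expected.getHeight(); y++) {
                for (int x = 0; x < expected.getWidth(); x++) {
                    if (expected.getRGB(x, y) != actual.getRGB(x, y)) {
                        return false; // the first mismatching pixel fails the test
                    }
                }
            }
            return true;
        }

        private ImageCompare() {}
    }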

I created a proof of concept at GitHub - richardTingle/jmeSnapshotTestProofOfConcept. It has two tests, SimpleBlueCube and SimpleFailBlueCube, that compare against the files in resources.

Locally this works nicely. However, the GitHub runners don’t have GPUs, so it can’t work as part of a pipeline. I don’t know if JME has runners that are GPU enabled? Although even if it only worked locally it might help make testing edge cases easier.

Does any of this seem like a good idea?

7 Likes

I think it’s a good idea and a step in the right direction. I haven’t looked at the code yet, but I would add the use of some AI API, asking it to compare the before and after snapshots. I guess we can ask many interesting questions about the actual picture, the model structure, position in space, etc. The sky is the limit.
It has a cost, of course, but we can reduce it by using some Hugging Face models instead of the well-known commercial ones.

I suppose my question would be: why use an AI model at all? If they aren’t pixel identical, they should be human reviewed.

2 Likes

My thought was that the pictures not being pixel identical is exactly where the AI can provide answers similar to or better than a human’s, such as: compare the pictures. Which is sharper / brighter / richer / has more detail? Are they different at all, and in what way and by how much? What is the difference in size? And any other questions a human asks themselves when looking at the pictures.

WDYT?

1 Like

Yes, I understand your point: this is a good, quick indication of when a human should take a look.

Maybe I missed it in the code, but I can’t find where you switch between engine versions (e.g. 3.6.1-stable and 3.7.0-beta1) before taking snapshots and doing the comparison. Or does it just rely on a previous version’s snapshot already existing in a “snapshots” folder, comparing it to a snapshot taken from the version stated in the build.gradle?

Yes, the idea is that the old state is committed in the repository (with new snapshots automatically generated for new tests), and the generated images come from the current branch (3.6.1 in my proof of concept).

Ideally a real version would be in the jMonkeyEngine repo, so a PR might contain the change itself, the tests, and the screenshots. If a PR made a change that made things “better”, you could even review the change by looking at the images in the PR.

File size might be a reason not to do that though

2 Likes

It seems to be a good automation idea. However, I think it will be hard to compare complex test cases, so we might also think of other ways to validate things, for example logging mechanisms, or perhaps a server-based messaging system.

EDIT:
How does the test handle complex scenes?

Yes, I’m certainly not imagining this being the be-all and end-all of testing. But so much of what jMonkey does is visual, and that is closed off to traditional automated testing. I know that there was nervousness about the new rendering pipeline and I’m hopeful this sort of thing might help reassure people.

I’ll give a complex scene a go and report back. But I think complexity won’t be the issue so much as determinism: if there are any random effects the screenshot test will see a difference, but as long as the scene is deterministic it should be fine.

2 Likes

Even if the comparison can’t be reliably automated, perhaps a report with small-ish images side-by-side would work. Then humans (us) could quickly scan through for issues… especially during the release process.

3 Likes

I like this idea. For comparison of subtle differences, I don’t trust AI models to give consistently accurate results. It would be quite easy to code a “visual diff” tool that could show differences by something like setting all differing pixels to a specific color (lime green, bright red, etc). That way for subtle differences you could quickly load the test result and get a side-by-side + difference view.
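
As a minimal sketch of that kind of “visual diff” (plain java.awt; the class name is made up and the choice of bright red is arbitrary), something like this would write a third image with every differing pixel highlighted:

    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;
    import javax.imageio.ImageIO;

    /** Hypothetical visual-diff sketch: differing pixels are painted bright red. */
    public final class VisualDiff {

        public static void writeDiff(File expectedPng, File actualPng, File diffPng) throws IOException {
            BufferedImage expected = ImageIO.read(expectedPng);
            BufferedImage actual = ImageIO.read(actualPng);

            // if the sizes differ, only the overlapping region is compared
            int width = Math.min(expected.getWidth(), actual.getWidth());
            int height = Math.min(expected.getHeight(), actual.getHeight());
            BufferedImage diff = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);

            for (int y = 0; y < height; y++) {
                for (int x = 0; x < width; x++) {
                    int e = expected.getRGB(x, y);
                    int a = actual.getRGB(x, y);
                    // identical pixels keep the expected colour, differences become pure red
                    diff.setRGB(x, y, e == a ? e : 0xFFFF0000);
                }
            }
            ImageIO.write(diff, "png", diffPng);
        }

        private VisualDiff() {}
    }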

1 Like

I think the biggest issue is going to be how to create the reference images. To not generate a bunch of false positive image errors, I guess the images have to be created on the same machine for the comparison?

So are we talking about a:

  1. render reference images
  2. apply/merge changes
  3. render images
  4. compare

chain of actions?

If I am not wrong, all of the current issues with the glTF loader and the 3.7 release could be tested in code? At least the tests would have shown that there were changes introduced to the geometry mapping as well as changes to the material parameters. But I am very inexperienced when it comes to writing tests, so I am not sure about all this.

But sometimes multiple changes combine to produce the same visual. Just because the scene graph is different doesn’t mean that the loader is broken.

Yes. That’s why I suggested a report that can be visually scanned. Even different driver versions (in my experience) can result in different pixel-level images with the exact same code.

An imperfect solution that requires a very low level of volunteer effort (after setup) is often better than a perfect solution that will never really work perfectly.

So I have news! It is possible to run JME within a headless GitHub runner and get it to render images. For example, here is the water post-processor example that was generated in a GitHub pipeline (and collected from the run artifacts).

org.jmonkeyengine.water.TestPostWater.testPostWater

Further, this was deterministic. I committed this image back as the reference image and the test passed with no difference!

@pspeed was absolutely correct though: my Windows machine and the GitHub runner produced visually identical images but not pixel-identical results. That’s a shame, but I don’t think it’s fatal.

How I did this

I had the GitHub runner install Mesa3D (a software OpenGL implementation) and Xvfb (a virtual framebuffer) and use them to render. I’m sure it isn’t very fast, but I only want to render a single frame so it doesn’t really matter.

    - name: Install Mesa3D
      run: |
        sudo apt-get update
        sudo apt-get install -y mesa-utils libgl1-mesa-dri libgl1-mesa-glx xvfb

    - name: Set environment variables for Mesa3D
      run: |
        echo "LIBGL_ALWAYS_SOFTWARE=1" >> $GITHUB_ENV
        echo "MESA_LOADER_DRIVER_OVERRIDE=llvmpipe" >> $GITHUB_ENV
    - name: Start xvfb
      run: |
        sudo Xvfb :99 -ac -screen 0 1280x1024x16 &
        export DISPLAY=:99
        echo "DISPLAY=:99" >> $GITHUB_ENV

At present I’m using workflow steps to install those things, but creating a Docker image with them preinstalled would probably help keep the thing stable for the long term (rather than risking the most recent versions of things causing pixel differences).

Thoughts on a workflow

I already have it so that the pipeline collects the generated images as artefacts when the reference and generated images differ. That would make accepting a change as easy as collecting that image and committing it.
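
On the test side, that can be as simple as only writing the freshly rendered image into a well-known folder when it differs, so the pipeline’s artefact step has a single directory to upload. A sketch (the folder name is invented, and it reuses the hypothetical pixelIdentical helper sketched earlier):

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    /** Sketch: keep only mismatching screenshots so CI can upload one folder as artefacts. */
    public final class ChangedImageCollector {

        private static final Path CHANGED_IMAGES = Paths.get("build", "changed-images");

        public static void collectIfDifferent(File referencePng, File generatedPng) throws IOException {
            if (ImageCompare.pixelIdentical(referencePng, generatedPng)) {
                return; // identical: nothing for a human to look at
            }
            Files.createDirectories(CHANGED_IMAGES);
            Files.copy(generatedPng.toPath(),
                    CHANGED_IMAGES.resolve(referencePng.getName()),
                    StandardCopyOption.REPLACE_EXISTING);
        }

        private ChangedImageCollector() {}
    }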

Report with images

I like @pspeed’s and @danielp’s idea about a report with screenshots (and a diff map). I think Allure reports can have images added into them, so I’ll attempt to do that.

Notes

The only weird thing was that Mesa3D’s background was white by default, whereas my Windows machine’s was black. That makes me a little nervous, but the more complex scene was fine.

9 Likes

I downloaded your image and it’s in fact transparent.
Maybe that helps in finding the issue. I couldn’t find a concrete answer with a quick Google search, but I suspect the problem is in the Xvfb setup rather than the Mesa driver.
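
If the root cause really is the default clear colour / alpha of the framebuffer, one way to take the platform default out of the picture might be for the test harness to force a known opaque background before grabbing the frame. A sketch using the standard ViewPort API (the app state itself is hypothetical, not part of the proof of concept):

    import com.jme3.app.Application;
    import com.jme3.app.state.BaseAppState;
    import com.jme3.math.ColorRGBA;

    /**
     * Hypothetical app state: force a known, opaque clear colour so screenshots
     * don't depend on the platform's default framebuffer contents.
     */
    public class OpaqueBackgroundState extends BaseAppState {

        @Override
        protected void initialize(Application app) {
            app.getViewPort().setBackgroundColor(ColorRGBA.Black);
        }

        @Override
        protected void cleanup(Application app) {}

        @Override
        protected void onEnable() {}

        @Override
        protected void onDisable() {}
    }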

2 Likes

I have been unclear: it would not be a hard error, but a hint in the report that the scene graph layout has changed. It would be up to the implementer to decide whether this is expected/wanted or an unwanted side effect. The goal would be to make you aware of the changes.

Same with material properties: I think as long as no shaders are changed, having different material properties hints at either a fixed bug or a regression. Again, up to the implementer to decide, but it would be nice if there were a warning.

When it comes down to the visual inspection we are in the land of soft warnings anyway.

I totally agree with the proposed solutions here.

Note that in my experience, false positives only have to meet an exceedingly low threshold (maybe as low as 10%) before the results are just ignored out of hand.

But I take your point.

But it starts to get complicated here, because “what related thing changed?” is not very straightforward. A material parameter might not be set because now it’s a material parameter override. (Or vice versa.) Material definitions, vert/frag shaders, glsllib files, all need to be checked for “did something related change or not?”

In general, JME has 2-3 ways to do the same thing and there is no automated way to easily check them all. So we again rely on the author or another contributor to look at the results and decode them.

Now factor in that the scene graphs could be identical and some other JME thing changed to mess up the visuals… and you end up with an imperfect check that requires detailed review and only covers some percentage of the cases. I’m not sure the benefit justifies the work involved.

While image comparison is definitely “soft warnings” and is imperfect… it is at least accurate in the sense that ultimately the visual result is what matters. And it’s something that anyone with eyes can look through for differences.

A point in your favor is that this would not have caught the glTF-induced duplicate control problem. So a scene graph comparison as part of loader unit tests is probably a good thing… in that case a change in behavior is something that gets recorded in the tests. The previous scene graph was the previous contract; the new scene graph is either a bug or the new contract.
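
As a rough illustration of what such a loader-contract test might look like (a sketch only: the asset path and expected numbers are invented, and registering the right loaders is glossed over):

    import com.jme3.anim.AnimComposer;
    import com.jme3.asset.AssetManager;
    import com.jme3.asset.DesktopAssetManager;
    import com.jme3.scene.Node;
    import org.junit.Assert;
    import org.junit.Test;

    /** Sketch of a loader "contract" test; the expected structure here is made up. */
    public class LoaderContractTest {

        @Test
        public void loadedModelKeepsItsStructure() {
            AssetManager assetManager = new DesktopAssetManager(true);
            Node model = (Node) assetManager.loadModel("Models/Example/example.j3o");

            // the previously agreed scene graph is the contract
            Assert.assertEquals(3, model.getChildren().size());

            // a duplicate-control regression would show up as more than one AnimComposer
            int composers = 0;
            for (int i = 0; i < model.getNumControls(); i++) {
                if (model.getControl(i) instanceof AnimComposer) {
                    composers++;
                }
            }
            Assert.assertEquals(1, composers);
        }
    }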

But this thread is talking about a much wider scope of detection.

1 Like

I tried Allure reports and I wasn’t a fan of them for this. You couldn’t just open the generated HTML report in Chrome because of a bunch of CORS errors; you had to create a server and host the thing. It felt like a hassle. So I’ve used Extent reports and gotten good results.

Here is the way it looks

I have the expected image, the actual image, and a red-lined diff image. I have messed with the reference image and drawn a mouse on it, and you can see that outlined in red in the diff image.

I tried creating a single PDF document but it was a bit horrible (tiny images, and the Extent extension for that was a bit hard to use). I think this works for quickly clicking through.

As an interesting aside this is what the water image looked like on my computer vs the software renderer reference image

Almost no pixel was the same between the two!

I’m thinking of having 2 levels of test:

  • On the reference infrastructure the image generated is consistently the same. If a difference is detected in these images, the test is marked as “Failed” and the step fails (maybe this could run before a PR is merged, blocking a failing merge?)
  • Even on the reference infrastructure the image generated is variable (non-deterministic). These are marked as “Warning” in the report and can be manually reviewed, but they do not fail the step (a rough sketch of this split is below)
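
A sketch of how that split could look inside the test harness (purely illustrative; the deterministic flag is made up, and it reuses the hypothetical pixelIdentical helper from earlier):

    import java.io.File;
    import java.io.IOException;

    /** Sketch: deterministic scenes hard-fail on a difference, variable ones only warn. */
    public final class ScreenshotAssertions {

        public static void assertOrWarn(boolean deterministicOnReferenceInfra,
                                        File referencePng, File generatedPng) throws IOException {
            if (ImageCompare.pixelIdentical(referencePng, generatedPng)) {
                return; // nothing to report
            }
            if (deterministicOnReferenceInfra) {
                // level 1: a real failure that should block the merge
                throw new AssertionError("Screenshot differs from reference: " + referencePng);
            }
            // level 2: flagged as "Warning" in the report, left for a human to review
            System.err.println("WARNING: non-deterministic scene differs: " + referencePng);
        }

        private ScreenshotAssertions() {}
    }
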
4 Likes

I’ve created a PR to add the testing framework and the first 5 tests at #2279 screenshot tests by richardTingle · Pull Request #2280 · jMonkeyEngine/jmonkeyengine · GitHub (you can also see the Run Screenshot Tests running and passing on it, which is pleasant). Assuming people are happy with the approach I’ll add more test cases, but I didn’t want to get in too deep before people had reviewed it (and also didn’t want to produce a gigantic PR).

Thank you to everyone who has created and maintained the examples in the jme3-examples module. That is going to make the whole thing much easier. All of my initial test cases were based on those examples (just converted to AppStates and any manual interaction converted to test parameters).

Is there a list of those examples that is always run manually as part of a release? I will concentrate on automating those if there is one.

This is an example of a nice simple test, TestOgreConvert. It just loads an Ogre model, converts it to JME’s binary format and back, and takes two screenshots of it (two because it is animated). Some of the other ones are a bit more complicated, but they are basically copy-pasted from jme3-examples.

    @Test
    public void testOgreConvert(){

        screenshotTest(
                new BaseAppState(){
                    @Override
                    protected void initialize(Application app){
                        AssetManager assetManager = app.getAssetManager();
                        Node rootNode = ((SimpleApplication)app).getRootNode();
                        Camera cam = app.getCamera();
                        Spatial ogreModel = assetManager.loadModel("Models/Oto/Oto.mesh.xml");

                        DirectionalLight dl = new DirectionalLight();
                        dl.setColor(ColorRGBA.White);
                        dl.setDirection(new Vector3f(0,-1,-1).normalizeLocal());
                        rootNode.addLight(dl);

                        cam.setLocation(new Vector3f(0, 0, 15));

                        try {
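                            // round-trip the model through JME's binary (.j3o) format
                            // to check that export/import preserves it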
                            ByteArrayOutputStream baos = new ByteArrayOutputStream();
                            BinaryExporter exp = new BinaryExporter();
                            exp.save(ogreModel, baos);

                            ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
                            BinaryImporter imp = new BinaryImporter();
                            imp.setAssetManager(assetManager);
                            Node ogreModelReloaded = (Node) imp.load(bais, null, null);

                            AnimComposer composer = ogreModelReloaded.getControl(AnimComposer.class);
                            composer.setCurrentAction("Walk");

                            rootNode.attachChild(ogreModelReloaded);
                        } catch (IOException ex){
                            throw new RuntimeException(ex);
                        }
                    }

                    @Override
                    protected void cleanup(Application app){}

                    @Override
                    protected void onEnable(){}

                    @Override
                    protected void onDisable(){}
                }
        )
        .setFramesToTakeScreenshotsOn(1, 5)
        .run();

    }

You can see the screenshots at #2279 screenshot tests by richardTingle · Pull Request #2280 · jMonkeyEngine/jmonkeyengine · GitHub

3 Likes

What are people’s thoughts on this?

I merged all the work since I originally wrote this into the branch, and the tests still all pass, suggesting good stability.

Assuming people are happy with this approach I’ll automate more test cases but I don’t want to produce a giant PR that people may not want.

I have also updated the report so it gives different colours for different levels of difference: green for no more than 1 colour step of difference, blue for “almost the same”, then yellow, orange, red. That should help when reading the report on non-reference machines: if it is all green and blue, it is probably just differences between OpenGL implementations.
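
For the curious, per-pixel “levels of difference” can be classified with something as simple as the worst per-channel delta. A sketch (the thresholds here are invented, not the ones used in the PR):

    import java.awt.Color;

    /**
     * Sketch: map a per-pixel difference to a severity colour for the diff map.
     * The thresholds are invented for illustration.
     */
    public final class DiffSeverity {

        public static Color classify(int expectedRgb, int actualRgb) {
            int dr = Math.abs(((expectedRgb >> 16) & 0xFF) - ((actualRgb >> 16) & 0xFF));
            int dg = Math.abs(((expectedRgb >> 8) & 0xFF) - ((actualRgb >> 8) & 0xFF));
            int db = Math.abs((expectedRgb & 0xFF) - (actualRgb & 0xFF));
            int worst = Math.max(dr, Math.max(dg, db));

            if (worst <= 1)  return Color.GREEN;  // no more than one colour step out
            if (worst <= 5)  return Color.BLUE;   // "almost the same"
            if (worst <= 15) return Color.YELLOW;
            if (worst <= 40) return Color.ORANGE;
            return Color.RED;                     // clearly different
        }

        private DiffSeverity() {}
    }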

5 Likes