Test the tests for invalid xml #1594

brontolosone · 2025-08-22T08:42:55Z

Following up on several discussions on Slack.

We had a whole bunch of form test XML in our tests that wasn't actually valid XML, which we used to get away with since the in-Central parser is very, very lenient.
The problem is that then when one swaps out that parser for an XML parser that expects, well, XML, one runs into problems with that faulty test data.

Thus, the faulty testdata needed fixing, and in the process I thought it would be good to add an automatic test for our test data with which we test Central 😆

Goals:

automatically verify at least the form XML we have in test/data/xml.js
provide a function to do such validation going forwards for other xml documents we use in tests outside of that test/data/xml.js. We might want to say something in review if we see new tests that contain handcrafted XML; it would be good if that XML itself is tested because it's quite easy to make little mistakes. Even not closing a tag - violating the most basic form of well-formedness - doesn't alarm the in-Central parser (bbf6213). Example of making sure ones test XML is actually XML in 54cde7b.
fix up the test XML so that it passes the failing test

Non-goals that had to be accomplished:

deal with the fallout: there's a few tests that started to fail once I fixed the invalid xml. I just updated the (overspecific) assertions to match (58c1036).

What has been done to verify that this works as intended?

Ran the new test, fixed the broken xml found with the new test, ran the tests, fixed a number of newly failing tests that shouldn't have been failing, ...

Why is this the best possible solution? Were any other approaches considered?

To check whether XML is actually XML, it's easiest to just feed it to an XML parser and see what it makes of it. I picked libxmljs which is just a bunch of bindings to the venerable libxml2 library which I've worked with before (and is also used by Postgres). The downside of that is that upon install of dev depencies, a compiler runs to compile the bindings. For me it worked without a hitch but then again on my system I have compilers and header files for both libxml2 and node, perhaps the install procedure is less pleasant on a consumer-oriented system. Or maybe not. I'm not so familiar with the node-gyp stuff. It runs on CircleCI without problems but let's hear it if any developer runs into any problems with this.

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

It's a development thing, it doesn't affect users.

Does this change require updates to the API documentation? If so, please update docs/api.yaml as part of this PR.

No.

Before submitting this PR, please make sure you have:

run make test and confirmed all checks still pass OR confirm CircleCI build passes
verified that any code from external sources are properly credited in comments or that everything is internally sourced

matthew-white

I have a few small questions/comments, but it's looking good to me 👍

test/data/test-xml.js

package.json

test/data/test-xml.js

brontolosone · 2025-08-25T13:38:04Z

I've found and fixed a bunch more invalid XML and changed things around a bit, will update this PR

brontolosone · 2025-09-02T10:12:50Z

Updated, please one of you re-review @matthew-white / review @ktuite

matthew-white · 2025-09-02T18:22:06Z

I'm happy to re-review since I already left a review.

test/ci-mocha-reporter.js

test/integration/api/submissions.js

test/integration/other/select-many.js

test/palatable-xml.js

test/integration/api/forms/forms.js

test/data/xml.js

matthew-white

I'm feeling some reservations about the libxmljs dependency: see below. I'm suggesting an alternative, curious what you think. I've done a few rounds of review on this PR, so my preference would be to hold off on merging and to wait until I'm able to re-review again.

matthew-white · 2025-09-07T18:56:20Z

package.json

    "eslint-plugin-no-only-tests": "~3.1",
    "fetch-cookie": "^2.2.0",
    "http-proxy-middleware": "^2.0.9",
+    "libxmljs": "^1.0.11",


I'm starting to feel less sure about adding this dependency. I wrote above that I felt fine with us adding another XML parsing dependency even though we have others, and I still feel that way. However, there seem to be issues with this specific dependency:

No release in almost two years. No answer to a question about the status of the project: Current maintenance status of this project? libxmljs/libxmljs#664

Reports of vulnerability alerts: 2 high severity issues for the latest libxmljs library with node js 18.0 libxmljs/libxmljs#661. @brontolosone, what do you see when you run npm audit? Any alerts about this dependency or its subdependencies?

There have also been reports that Node 22 isn't supported: libxmljs/libxmljs#660. I'm not as worried about this one though, since we're already on Node 22, and like you said, things work in CircleCI.

Something you said gave me an idea though, I wonder what you think about it:

The nice thing about libxmljs is that it's just straight bindings to the venerable libxml2 library which is what Postgres and many, many other things use. So I have good hopes it's going to trip up over the same things that Postgres will trip up over.

If the main goal of isPalatableXML() is to verify that the given XML will be acceptable to Postgres, what about the idea of using Postgres to complete that check? I can see some minor downsides to that approach, including that it'd force isPalatableXML() to become async. But there are also advantages: (1) it guarantees that the XML parsing will be done in the exact same way as Postgres, and (2) it avoids an additional dependency. Specifically, I'm thinking of a function like:

const isPalatableXML = async (container, allegedXML) => { await container.run(sql`SELECT ${allegedXML}::xml`); return true; };

This change requires isPalatableXML() to receive the container as an argument. However, almost every test that calls isPalatableXML() or its cousin palatableXML() has access to container via testService(). The main exception is test/data/test-xml.js, but I think that could be changed to use testService() or testContainer(). Either one would provide access to container.

on vulnerabilities

Note how we use libxml2/libxmljs here:

only in testing (it doesn't even get installed outside of test environments)

only with xml crafted by ourselves

Given those conditions, I'm not at all worried about those vulnerabilities. Sure if I want to craft XML to exploit my own process to gain... a code execution opportunity that I already have as I already control the code, then... I could?

on no new releases for a bit

Great, a mature project, the scope is clear to everyone, the thing does what it needs to do. Move slow and don't break anything. If there are no wild new interfaces in libxml2 (which there won't be anytime soon) or changes to node-gyp's or swig's interfaces (also not something changed on a whim) then... then I don't need a new release, for my purposes.

If we find out that for some reason libxmljs doesn't work for us anymore: we'll see that quite clearly when it happens (in test) and all we need to do to get things going at that immediate moment is to make isPalatableXML() a no-op, while we look for an alternative to it at our leasure (meanwhile verifying the XML we use in our tests in "by hand", which is what we could have (but haven't) been doing all along. No panic.

on using PostgreSQL itself

I've found that to trigger a PostgreSQL exception you also have to actually let it execute some xpath. Not undoable of course. Just saying that an ::xml cast is not enough.

Feels weird to roundtrip to the DB for this, slows things down, indeed necessitates async which would make it odd to work with — it wouldn't be the transparent change anymore that it is now, and the dependency on a PG connection prevents usage of isPalatableXML() in other scenarios (simple unit test).

on alternatives

When the day comes to switch to something else, I think it wouldn't be hard to find another set of bindings to libxml2. People want to interface with libxml2 so there'll always be a couple around I suppose. For instance there's this one (which I wouldn't like so much as it ships object code — but it's an alternative nonetheless).

I could be open to libxmljs, but I have some questions and another idea I want to run by you. I also just left a high-level question below.

Thanks for your comment, @brontolosone. Just to be clear, I support the overall goals of this PR. 👍 That said, I still have concerns about libxmljs. Sorry for the long reply, but I'm glad we're discussing this.

on using PostgreSQL itself

necessitates async which would make it odd to work with

I've been thinking about this, and I think there may be a route to making palatableXML() synchronous without the use of a dependency. isPalatableXML() is only called in two places, so I think it's OK if it remains async. I'm thinking that palatableXML() could just keep track of XML that's been encountered, then that XML could be validated in an async afterEach hook. I'm thinking of something like:

test/palatable-xml.js

const isPalatableXML = async (db, allegedXML) => { await db.run(sql`SELECT ${allegedXML}::xml`); return true; }; const xmlToCheck = []; const palatableXML = (allegedXML) => { xmlToCheck.push(allegedXML); }; const checkPalatableXML = (db) => { while (xmlToCheck.length !== 0) await isPalatableXML(xmlToCheck.shift()); };

test/integration/setup.js

afterEach(() => { checkPalatableXML(db); });

slows things down

There aren't a huge number of uses of palatableXML(), so I don't think it'd slow things down that much. We query the DB many thousands of times in our tests.

the dependency on a PG connection prevents usage of isPalatableXML() in other scenarios (simple unit test).

This is a valid point for sure. But the vast majority of usage of isPalatableXML() is in integration tests. Especially if we're not planning to keep isPalatableXML() around forever, I think handling the majority of cases should do the trick for now.

on vulnerabilities

Note how we use libxml2/libxmljs here:

only in testing (it doesn't even get installed outside of test environments)

only with xml crafted by ourselves

Given those conditions, I'm not at all worried about those vulnerabilities. Sure if I want to craft XML to exploit my own process to gain... a code execution opportunity that I already have as I already control the code, then... I could?

In general, we try to address vulnerabilities in all packages, not just those used in production. Part of the concern is that tests are run locally, not just in CI.

I don't see how we can know a priori whether a vulnerability in a development package is exploitable. It takes research to evaluate (and therefore time). I know you're saying that it's only XML that we've crafted, but not all vulnerabilities depend on the specific input. They don't necessarily operate on that level.

It's not like we address all vulnerability alerts. For some alerts, they can't be easily addressed, and we've evaluated that they don't affect us. But each new vulnerability alert takes time to evaluate.

on no new releases for a bit

For me, this concern is mostly the same as the "on vulnerabilities" concern. If the project isn't maintained, it's more likely to continue to accrue vulnerabilities as time goes on.

on alternatives

When the day comes to switch to something else, I think it wouldn't be hard to find another set of bindings to libxml2. People want to interface with libxml2 so there'll always be a couple around I suppose. For instance there's this one (which I wouldn't like so much as it ships object code — but it's an alternative nonetheless).

If you don't like my suggested code above, and maintained alternatives are available, then how about we run with one of those? The object code could be encapsulated within isPalatableXML().

Outside this thread, you wrote below:

in the forthcoming PR that provides some other mechanism that'd make invalid XML fail in some way (probably at DB ingestion time), I can remove the wrappers and thus the libxmljs dependency. How about that?

When is the forthcoming PR coming? If it's coming soon, how about we just wait for that? If it's not coming soon, then I think we should address the points above.

By the way, if there's invalid XML in the codebase now, I think we could merge that part ASAP. It doesn't have to be the same PR as the addition of isPalatableXML().

We could also make isPalatableXML() a no-op for now if we wanted to merge this PR right away. Then you and I could continue discussing the specifics of isPalatableXML() for a follow-up PR.

Let's keep it at that we model threats (and their priors) differently 🤔

The afterEach() hook idea, initially I liked it, yet I think that nothing beats a straight up isPalatableXML() that one can run wherever, whenever, decoupled from the DB or the testing harness control flow.

To remove controversy and get this merged I'll just remove the isPalatableXML().

I'll add a follow-up PR that stops un-xml at DB insertion time, when forms and submissions are uploaded.

I'll add a follow-up PR that stops un-xml at DB insertion time, when forms and submissions are uploaded.

That sounds excellent to me. 👍 It'll help to know that users aren't uploading invalid XML.

Let's keep it at that we model threats (and their priors) differently 🤔

I'm open to having my model changed, but it sounds like we have a good path forward on this one. Happy to chat more about this on a call sometime!

matthew-white · 2025-09-08T05:24:08Z

One high-level question I have is, what's the long-term plan for XML validation? If we're going to be adding more validation to lib/ soon anyway, I don't feel like we necessarily even need to merge isPalatableXML(). Maybe in identifying the test XML that was broken, the function has done all that we need it to. What if we just merged the actual changes to XML?

brontolosone · 2025-09-09T04:47:56Z

One high-level question I have is, what's the long-term plan for XML validation? If we're going to be adding more validation to lib/ soon anyway, I don't feel like we necessarily even need to merge isPalatableXML(). Maybe in identifying the test XML that was broken, the function has done all that we need it to. What if we just merged the actual changes to XML?

If we're going to be adding more validation to lib/ soon anyway

Ah, but are we? And soon? I sure want to, and have code that does that (by necessity, since some amount of form inspection is done in-db, which then expects valid xml). At the same time it's evidently been a challenge for developers and code reviewers to ascertain XML validity "by hand", so in absence of a concrete alternative (though there are ideas, and some unmerged not-yet-PRed-even code) I think we should keep these wrappers. For code reviewers it's great: you see that wrapper, you know you don't have to scrutinize the XML.
But, in the forthcoming PR that provides some other mechanism that'd make invalid XML fail in some way (probably at DB ingestion time), I can remove the wrappers and thus the libxmljs dependency. How about that?

matthew-white · 2025-10-16T03:57:15Z

Sorry for the delay, @brontolosone! I promise I'll take another look soon. Much mapping to do on the frontend, but I should be done soon.

matthew-white · 2025-11-04T06:35:38Z

@brontolosone, we still want to merge this PR, right? I'm thinking to take another look at this PR this week.

test/data/xml.js

brontolosone · 2025-11-11T16:59:10Z

Closing (per), continuing in #1681 .

brontolosone force-pushed the test-the-tests_invalid-xml branch 4 times, most recently from 8702172 to 54cde7b Compare August 22, 2025 09:54

brontolosone marked this pull request as ready for review August 22, 2025 10:05

brontolosone requested a review from matthew-white August 22, 2025 10:05

matthew-white reviewed Aug 22, 2025

View reviewed changes

test/data/test-xml.js Outdated Show resolved Hide resolved

package.json Show resolved Hide resolved

test/data/test-xml.js Outdated Show resolved Hide resolved

brontolosone force-pushed the test-the-tests_invalid-xml branch from 66532f0 to a0ffa33 Compare September 2, 2025 09:48

brontolosone requested review from ktuite and matthew-white September 2, 2025 10:07

matthew-white reviewed Sep 2, 2025

View reviewed changes

brontolosone force-pushed the test-the-tests_invalid-xml branch 5 times, most recently from cff9686 to dfe84b4 Compare September 4, 2025 13:34

brontolosone requested a review from matthew-white September 4, 2025 13:49

matthew-white requested changes Sep 7, 2025

View reviewed changes

brontolosone requested a review from matthew-white September 8, 2025 05:04

brontolosone mentioned this pull request Oct 21, 2025

make entities.uuid DB column of type uuid (rather than varchar(255)) #1618

Open

2 tasks

brontolosone force-pushed the test-the-tests_invalid-xml branch from dfe84b4 to 4a4cb31 Compare November 4, 2025 15:46

ktuite reviewed Nov 4, 2025

View reviewed changes

test/data/xml.js Show resolved Hide resolved

install libxmljs

cf2ceaa

brontolosone added 7 commits November 11, 2025 15:55

test the tests: Form XML validity

4478c09

fix formtest-xml: wellformedness of forms.binaryType

09a0e8d

fix formtest-xml: entities namespace URI (was same as odk NSURI)

d847444

play overly specific test whack-a-mole

3092020

palatableXML() helper and example callsite

3dedfa6

turn more unxml into xml, move isPalatable

89d9b2c

add closing tags...

848dd51

brontolosone force-pushed the test-the-tests_invalid-xml branch from 4a4cb31 to 848dd51 Compare November 11, 2025 15:56

brontolosone mentioned this pull request Nov 11, 2025

Fix invalid xml in the tests — sans unpalatable palatableXML() #1681

Merged

2 tasks

brontolosone closed this Nov 11, 2025

Test the tests for invalid xml #1594

Test the tests for invalid xml #1594

Uh oh!

Conversation

brontolosone commented Aug 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What has been done to verify that this works as intended?

Why is this the best possible solution? Were any other approaches considered?

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

Does this change require updates to the API documentation? If so, please update docs/api.yaml as part of this PR.

Before submitting this PR, please make sure you have:

Uh oh!

matthew-white left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

brontolosone commented Aug 25, 2025

Uh oh!

brontolosone commented Sep 2, 2025

Uh oh!

matthew-white commented Sep 2, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

matthew-white left a comment

Choose a reason for hiding this comment

Uh oh!

matthew-white Sep 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

brontolosone Sep 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

on vulnerabilities

on no new releases for a bit

on using PostgreSQL itself

on alternatives

Uh oh!

matthew-white Sep 8, 2025

Choose a reason for hiding this comment

Uh oh!

matthew-white Nov 7, 2025

Choose a reason for hiding this comment

on using PostgreSQL itself

on vulnerabilities

on no new releases for a bit

on alternatives

Uh oh!

brontolosone Nov 10, 2025

Choose a reason for hiding this comment

Uh oh!

matthew-white Nov 11, 2025

Choose a reason for hiding this comment

Uh oh!

matthew-white commented Sep 8, 2025

Uh oh!

brontolosone commented Sep 9, 2025

Uh oh!

matthew-white commented Oct 16, 2025

Uh oh!

matthew-white commented Nov 4, 2025

Uh oh!

Uh oh!

brontolosone commented Nov 11, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

brontolosone commented Aug 22, 2025 •

edited

Loading

matthew-white Sep 7, 2025 •

edited

Loading

brontolosone Sep 8, 2025 •

edited

Loading