Skip to content

Conversation

@brontolosone
Copy link
Contributor

@brontolosone brontolosone commented Aug 22, 2025

Following up on several discussions on Slack.

We had a whole bunch of form test XML in our tests that wasn't actually valid XML, which we used to get away with since the in-Central parser is very, very lenient.
The problem is that then when one swaps out that parser for an XML parser that expects, well, XML, one runs into problems with that faulty test data.

Thus, the faulty testdata needed fixing, and in the process I thought it would be good to add an automatic test for our test data with which we test Central 😆

Goals:

  • automatically verify at least the form XML we have in test/data/xml.js
  • provide a function to do such validation going forwards for other xml documents we use in tests outside of that test/data/xml.js. We might want to say something in review if we see new tests that contain handcrafted XML; it would be good if that XML itself is tested because it's quite easy to make little mistakes. Even not closing a tag - violating the most basic form of well-formedness - doesn't alarm the in-Central parser (bbf6213). Example of making sure ones test XML is actually XML in 54cde7b.
  • fix up the test XML so that it passes the failing test

Non-goals that had to be accomplished:

  • deal with the fallout: there's a few tests that started to fail once I fixed the invalid xml. I just updated the (overspecific) assertions to match (58c1036).

What has been done to verify that this works as intended?

Ran the new test, fixed the broken xml found with the new test, ran the tests, fixed a number of newly failing tests that shouldn't have been failing, ...

Why is this the best possible solution? Were any other approaches considered?

To check whether XML is actually XML, it's easiest to just feed it to an XML parser and see what it makes of it. I picked libxmljs which is just a bunch of bindings to the venerable libxml2 library which I've worked with before (and is also used by Postgres). The downside of that is that upon install of dev depencies, a compiler runs to compile the bindings. For me it worked without a hitch but then again on my system I have compilers and header files for both libxml2 and node, perhaps the install procedure is less pleasant on a consumer-oriented system. Or maybe not. I'm not so familiar with the node-gyp stuff. It runs on CircleCI without problems but let's hear it if any developer runs into any problems with this.

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

It's a development thing, it doesn't affect users.

Does this change require updates to the API documentation? If so, please update docs/api.yaml as part of this PR.

No.

Before submitting this PR, please make sure you have:

  • run make test and confirmed all checks still pass OR confirm CircleCI build passes
  • verified that any code from external sources are properly credited in comments or that everything is internally sourced

@brontolosone brontolosone force-pushed the test-the-tests_invalid-xml branch 4 times, most recently from 8702172 to 54cde7b Compare August 22, 2025 09:54
@brontolosone brontolosone marked this pull request as ready for review August 22, 2025 10:05
Copy link
Member

@matthew-white matthew-white left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few small questions/comments, but it's looking good to me 👍

@brontolosone
Copy link
Contributor Author

I've found and fixed a bunch more invalid XML and changed things around a bit, will update this PR

@brontolosone
Copy link
Contributor Author

Updated, please one of you re-review @matthew-white / review @ktuite

@matthew-white
Copy link
Member

I'm happy to re-review since I already left a review.

@brontolosone brontolosone force-pushed the test-the-tests_invalid-xml branch 5 times, most recently from cff9686 to dfe84b4 Compare September 4, 2025 13:34
Copy link
Member

@matthew-white matthew-white left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm feeling some reservations about the libxmljs dependency: see below. I'm suggesting an alternative, curious what you think. I've done a few rounds of review on this PR, so my preference would be to hold off on merging and to wait until I'm able to re-review again.

"eslint-plugin-no-only-tests": "~3.1",
"fetch-cookie": "^2.2.0",
"http-proxy-middleware": "^2.0.9",
"libxmljs": "^1.0.11",
Copy link
Member

@matthew-white matthew-white Sep 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm starting to feel less sure about adding this dependency. I wrote above that I felt fine with us adding another XML parsing dependency even though we have others, and I still feel that way. However, there seem to be issues with this specific dependency:

There have also been reports that Node 22 isn't supported: libxmljs/libxmljs#660. I'm not as worried about this one though, since we're already on Node 22, and like you said, things work in CircleCI.

Something you said gave me an idea though, I wonder what you think about it:

The nice thing about libxmljs is that it's just straight bindings to the venerable libxml2 library which is what Postgres and many, many other things use. So I have good hopes it's going to trip up over the same things that Postgres will trip up over.

If the main goal of isPalatableXML() is to verify that the given XML will be acceptable to Postgres, what about the idea of using Postgres to complete that check? I can see some minor downsides to that approach, including that it'd force isPalatableXML() to become async. But there are also advantages: (1) it guarantees that the XML parsing will be done in the exact same way as Postgres, and (2) it avoids an additional dependency. Specifically, I'm thinking of a function like:

const isPalatableXML = async (container, allegedXML) => {
  await container.run(sql`SELECT ${allegedXML}::xml`);
  return true;
};

This change requires isPalatableXML() to receive the container as an argument. However, almost every test that calls isPalatableXML() or its cousin palatableXML() has access to container via testService(). The main exception is test/data/test-xml.js, but I think that could be changed to use testService() or testContainer(). Either one would provide access to container.

Copy link
Contributor Author

@brontolosone brontolosone Sep 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on vulnerabilities

Note how we use libxml2/libxmljs here:

  • only in testing (it doesn't even get installed outside of test environments)
  • only with xml crafted by ourselves

Given those conditions, I'm not at all worried about those vulnerabilities. Sure if I want to craft XML to exploit my own process to gain... a code execution opportunity that I already have as I already control the code, then... I could?

on no new releases for a bit

  1. Great, a mature project, the scope is clear to everyone, the thing does what it needs to do. Move slow and don't break anything. If there are no wild new interfaces in libxml2 (which there won't be anytime soon) or changes to node-gyp's or swig's interfaces (also not something changed on a whim) then... then I don't need a new release, for my purposes.
  2. If we find out that for some reason libxmljs doesn't work for us anymore: we'll see that quite clearly when it happens (in test) and all we need to do to get things going at that immediate moment is to make isPalatableXML() a no-op, while we look for an alternative to it at our leasure (meanwhile verifying the XML we use in our tests in "by hand", which is what we could have (but haven't) been doing all along. No panic.

on using PostgreSQL itself

  1. I've found that to trigger a PostgreSQL exception you also have to actually let it execute some xpath. Not undoable of course. Just saying that an ::xml cast is not enough.
  2. Feels weird to roundtrip to the DB for this, slows things down, indeed necessitates async which would make it odd to work with — it wouldn't be the transparent change anymore that it is now, and the dependency on a PG connection prevents usage of isPalatableXML() in other scenarios (simple unit test).

on alternatives

When the day comes to switch to something else, I think it wouldn't be hard to find another set of bindings to libxml2. People want to interface with libxml2 so there'll always be a couple around I suppose. For instance there's this one (which I wouldn't like so much as it ships object code — but it's an alternative nonetheless).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be open to libxmljs, but I have some questions and another idea I want to run by you. I also just left a high-level question below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comment, @brontolosone. Just to be clear, I support the overall goals of this PR. 👍 That said, I still have concerns about libxmljs. Sorry for the long reply, but I'm glad we're discussing this.

on using PostgreSQL itself

necessitates async which would make it odd to work with

I've been thinking about this, and I think there may be a route to making palatableXML() synchronous without the use of a dependency. isPalatableXML() is only called in two places, so I think it's OK if it remains async. I'm thinking that palatableXML() could just keep track of XML that's been encountered, then that XML could be validated in an async afterEach hook. I'm thinking of something like:

test/palatable-xml.js

const isPalatableXML = async (db, allegedXML) => {
  await db.run(sql`SELECT ${allegedXML}::xml`);
  return true;
};

const xmlToCheck = [];

const palatableXML = (allegedXML) => {
  xmlToCheck.push(allegedXML);
};

const checkPalatableXML = (db) => {
  while (xmlToCheck.length !== 0)
    await isPalatableXML(xmlToCheck.shift());
};

test/integration/setup.js

afterEach(() => {
  checkPalatableXML(db);
});

slows things down

There aren't a huge number of uses of palatableXML(), so I don't think it'd slow things down that much. We query the DB many thousands of times in our tests.

the dependency on a PG connection prevents usage of isPalatableXML() in other scenarios (simple unit test).

This is a valid point for sure. But the vast majority of usage of isPalatableXML() is in integration tests. Especially if we're not planning to keep isPalatableXML() around forever, I think handling the majority of cases should do the trick for now.

on vulnerabilities

Note how we use libxml2/libxmljs here:

  • only in testing (it doesn't even get installed outside of test environments)
  • only with xml crafted by ourselves

Given those conditions, I'm not at all worried about those vulnerabilities. Sure if I want to craft XML to exploit my own process to gain... a code execution opportunity that I already have as I already control the code, then... I could?

In general, we try to address vulnerabilities in all packages, not just those used in production. Part of the concern is that tests are run locally, not just in CI.

I don't see how we can know a priori whether a vulnerability in a development package is exploitable. It takes research to evaluate (and therefore time). I know you're saying that it's only XML that we've crafted, but not all vulnerabilities depend on the specific input. They don't necessarily operate on that level.

It's not like we address all vulnerability alerts. For some alerts, they can't be easily addressed, and we've evaluated that they don't affect us. But each new vulnerability alert takes time to evaluate.

on no new releases for a bit

For me, this concern is mostly the same as the "on vulnerabilities" concern. If the project isn't maintained, it's more likely to continue to accrue vulnerabilities as time goes on.

on alternatives

When the day comes to switch to something else, I think it wouldn't be hard to find another set of bindings to libxml2. People want to interface with libxml2 so there'll always be a couple around I suppose. For instance there's this one (which I wouldn't like so much as it ships object code — but it's an alternative nonetheless).

If you don't like my suggested code above, and maintained alternatives are available, then how about we run with one of those? The object code could be encapsulated within isPalatableXML().


Outside this thread, you wrote below:

in the forthcoming PR that provides some other mechanism that'd make invalid XML fail in some way (probably at DB ingestion time), I can remove the wrappers and thus the libxmljs dependency. How about that?

When is the forthcoming PR coming? If it's coming soon, how about we just wait for that? If it's not coming soon, then I think we should address the points above.


By the way, if there's invalid XML in the codebase now, I think we could merge that part ASAP. It doesn't have to be the same PR as the addition of isPalatableXML().

We could also make isPalatableXML() a no-op for now if we wanted to merge this PR right away. Then you and I could continue discussing the specifics of isPalatableXML() for a follow-up PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Let's keep it at that we model threats (and their priors) differently 🤔
  • The afterEach() hook idea, initially I liked it, yet I think that nothing beats a straight up isPalatableXML() that one can run wherever, whenever, decoupled from the DB or the testing harness control flow.
  • To remove controversy and get this merged I'll just remove the isPalatableXML().
    • I'll add a follow-up PR that stops un-xml at DB insertion time, when forms and submissions are uploaded.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a follow-up PR that stops un-xml at DB insertion time, when forms and submissions are uploaded.

That sounds excellent to me. 👍 It'll help to know that users aren't uploading invalid XML.

Let's keep it at that we model threats (and their priors) differently 🤔

I'm open to having my model changed, but it sounds like we have a good path forward on this one. Happy to chat more about this on a call sometime!

@matthew-white
Copy link
Member

One high-level question I have is, what's the long-term plan for XML validation? If we're going to be adding more validation to lib/ soon anyway, I don't feel like we necessarily even need to merge isPalatableXML(). Maybe in identifying the test XML that was broken, the function has done all that we need it to. What if we just merged the actual changes to XML?

@brontolosone
Copy link
Contributor Author

One high-level question I have is, what's the long-term plan for XML validation? If we're going to be adding more validation to lib/ soon anyway, I don't feel like we necessarily even need to merge isPalatableXML(). Maybe in identifying the test XML that was broken, the function has done all that we need it to. What if we just merged the actual changes to XML?

If we're going to be adding more validation to lib/ soon anyway

Ah, but are we? And soon? I sure want to, and have code that does that (by necessity, since some amount of form inspection is done in-db, which then expects valid xml). At the same time it's evidently been a challenge for developers and code reviewers to ascertain XML validity "by hand", so in absence of a concrete alternative (though there are ideas, and some unmerged not-yet-PRed-even code) I think we should keep these wrappers. For code reviewers it's great: you see that wrapper, you know you don't have to scrutinize the XML.
But, in the forthcoming PR that provides some other mechanism that'd make invalid XML fail in some way (probably at DB ingestion time), I can remove the wrappers and thus the libxmljs dependency. How about that?

@matthew-white
Copy link
Member

Sorry for the delay, @brontolosone! I promise I'll take another look soon. Much mapping to do on the frontend, but I should be done soon.

@matthew-white
Copy link
Member

@brontolosone, we still want to merge this PR, right? I'm thinking to take another look at this PR this week.

@brontolosone brontolosone force-pushed the test-the-tests_invalid-xml branch from dfe84b4 to 4a4cb31 Compare November 4, 2025 15:46
@brontolosone
Copy link
Contributor Author

Closing (per), continuing in #1681 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants