-
Notifications
You must be signed in to change notification settings - Fork 85
Test the tests for invalid xml #1594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test the tests for invalid xml #1594
Conversation
8702172 to
54cde7b
Compare
matthew-white
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few small questions/comments, but it's looking good to me 👍
|
I've found and fixed a bunch more invalid XML and changed things around a bit, will update this PR |
66532f0 to
a0ffa33
Compare
|
Updated, please one of you re-review @matthew-white / review @ktuite |
|
I'm happy to re-review since I already left a review. |
cff9686 to
dfe84b4
Compare
matthew-white
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm feeling some reservations about the libxmljs dependency: see below. I'm suggesting an alternative, curious what you think. I've done a few rounds of review on this PR, so my preference would be to hold off on merging and to wait until I'm able to re-review again.
| "eslint-plugin-no-only-tests": "~3.1", | ||
| "fetch-cookie": "^2.2.0", | ||
| "http-proxy-middleware": "^2.0.9", | ||
| "libxmljs": "^1.0.11", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm starting to feel less sure about adding this dependency. I wrote above that I felt fine with us adding another XML parsing dependency even though we have others, and I still feel that way. However, there seem to be issues with this specific dependency:
- No release in almost two years. No answer to a question about the status of the project: Current maintenance status of this project? libxmljs/libxmljs#664
- Reports of vulnerability alerts: 2 high severity issues for the latest libxmljs library with node js 18.0 libxmljs/libxmljs#661. @brontolosone, what do you see when you run
npm audit? Any alerts about this dependency or its subdependencies?
There have also been reports that Node 22 isn't supported: libxmljs/libxmljs#660. I'm not as worried about this one though, since we're already on Node 22, and like you said, things work in CircleCI.
Something you said gave me an idea though, I wonder what you think about it:
The nice thing about
libxmljsis that it's just straight bindings to the venerablelibxml2library which is what Postgres and many, many other things use. So I have good hopes it's going to trip up over the same things that Postgres will trip up over.
If the main goal of isPalatableXML() is to verify that the given XML will be acceptable to Postgres, what about the idea of using Postgres to complete that check? I can see some minor downsides to that approach, including that it'd force isPalatableXML() to become async. But there are also advantages: (1) it guarantees that the XML parsing will be done in the exact same way as Postgres, and (2) it avoids an additional dependency. Specifically, I'm thinking of a function like:
const isPalatableXML = async (container, allegedXML) => {
await container.run(sql`SELECT ${allegedXML}::xml`);
return true;
};This change requires isPalatableXML() to receive the container as an argument. However, almost every test that calls isPalatableXML() or its cousin palatableXML() has access to container via testService(). The main exception is test/data/test-xml.js, but I think that could be changed to use testService() or testContainer(). Either one would provide access to container.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on vulnerabilities
Note how we use libxml2/libxmljs here:
- only in testing (it doesn't even get installed outside of test environments)
- only with xml crafted by ourselves
Given those conditions, I'm not at all worried about those vulnerabilities. Sure if I want to craft XML to exploit my own process to gain... a code execution opportunity that I already have as I already control the code, then... I could?
on no new releases for a bit
- Great, a mature project, the scope is clear to everyone, the thing does what it needs to do. Move slow and don't break anything. If there are no wild new interfaces in libxml2 (which there won't be anytime soon) or changes to node-gyp's or swig's interfaces (also not something changed on a whim) then... then I don't need a new release, for my purposes.
- If we find out that for some reason
libxmljsdoesn't work for us anymore: we'll see that quite clearly when it happens (in test) and all we need to do to get things going at that immediate moment is to makeisPalatableXML()a no-op, while we look for an alternative to it at our leasure (meanwhile verifying the XML we use in our tests in "by hand", which is what we could have (but haven't) been doing all along. No panic.
on using PostgreSQL itself
- I've found that to trigger a PostgreSQL exception you also have to actually let it execute some xpath. Not undoable of course. Just saying that an
::xmlcast is not enough. - Feels weird to roundtrip to the DB for this, slows things down, indeed necessitates async which would make it odd to work with — it wouldn't be the transparent change anymore that it is now, and the dependency on a PG connection prevents usage of
isPalatableXML()in other scenarios (simple unit test).
on alternatives
When the day comes to switch to something else, I think it wouldn't be hard to find another set of bindings to libxml2. People want to interface with libxml2 so there'll always be a couple around I suppose. For instance there's this one (which I wouldn't like so much as it ships object code — but it's an alternative nonetheless).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could be open to libxmljs, but I have some questions and another idea I want to run by you. I also just left a high-level question below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comment, @brontolosone. Just to be clear, I support the overall goals of this PR. 👍 That said, I still have concerns about libxmljs. Sorry for the long reply, but I'm glad we're discussing this.
on using PostgreSQL itself
necessitates async which would make it odd to work with
I've been thinking about this, and I think there may be a route to making palatableXML() synchronous without the use of a dependency. isPalatableXML() is only called in two places, so I think it's OK if it remains async. I'm thinking that palatableXML() could just keep track of XML that's been encountered, then that XML could be validated in an async afterEach hook. I'm thinking of something like:
test/palatable-xml.js
const isPalatableXML = async (db, allegedXML) => {
await db.run(sql`SELECT ${allegedXML}::xml`);
return true;
};
const xmlToCheck = [];
const palatableXML = (allegedXML) => {
xmlToCheck.push(allegedXML);
};
const checkPalatableXML = (db) => {
while (xmlToCheck.length !== 0)
await isPalatableXML(xmlToCheck.shift());
};test/integration/setup.js
afterEach(() => {
checkPalatableXML(db);
});slows things down
There aren't a huge number of uses of palatableXML(), so I don't think it'd slow things down that much. We query the DB many thousands of times in our tests.
the dependency on a PG connection prevents usage of
isPalatableXML()in other scenarios (simple unit test).
This is a valid point for sure. But the vast majority of usage of isPalatableXML() is in integration tests. Especially if we're not planning to keep isPalatableXML() around forever, I think handling the majority of cases should do the trick for now.
on vulnerabilities
Note how we use libxml2/libxmljs here:
- only in testing (it doesn't even get installed outside of test environments)
- only with xml crafted by ourselves
Given those conditions, I'm not at all worried about those vulnerabilities. Sure if I want to craft XML to exploit my own process to gain... a code execution opportunity that I already have as I already control the code, then... I could?
In general, we try to address vulnerabilities in all packages, not just those used in production. Part of the concern is that tests are run locally, not just in CI.
I don't see how we can know a priori whether a vulnerability in a development package is exploitable. It takes research to evaluate (and therefore time). I know you're saying that it's only XML that we've crafted, but not all vulnerabilities depend on the specific input. They don't necessarily operate on that level.
It's not like we address all vulnerability alerts. For some alerts, they can't be easily addressed, and we've evaluated that they don't affect us. But each new vulnerability alert takes time to evaluate.
on no new releases for a bit
For me, this concern is mostly the same as the "on vulnerabilities" concern. If the project isn't maintained, it's more likely to continue to accrue vulnerabilities as time goes on.
on alternatives
When the day comes to switch to something else, I think it wouldn't be hard to find another set of bindings to
libxml2. People want to interface withlibxml2so there'll always be a couple around I suppose. For instance there's this one (which I wouldn't like so much as it ships object code — but it's an alternative nonetheless).
If you don't like my suggested code above, and maintained alternatives are available, then how about we run with one of those? The object code could be encapsulated within isPalatableXML().
Outside this thread, you wrote below:
in the forthcoming PR that provides some other mechanism that'd make invalid XML fail in some way (probably at DB ingestion time), I can remove the wrappers and thus the libxmljs dependency. How about that?
When is the forthcoming PR coming? If it's coming soon, how about we just wait for that? If it's not coming soon, then I think we should address the points above.
By the way, if there's invalid XML in the codebase now, I think we could merge that part ASAP. It doesn't have to be the same PR as the addition of isPalatableXML().
We could also make isPalatableXML() a no-op for now if we wanted to merge this PR right away. Then you and I could continue discussing the specifics of isPalatableXML() for a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Let's keep it at that we model threats (and their priors) differently 🤔
- The
afterEach()hook idea, initially I liked it, yet I think that nothing beats a straight upisPalatableXML()that one can run wherever, whenever, decoupled from the DB or the testing harness control flow. - To remove controversy and get this merged I'll just remove the
isPalatableXML().- I'll add a follow-up PR that stops un-xml at DB insertion time, when forms and submissions are uploaded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a follow-up PR that stops un-xml at DB insertion time, when forms and submissions are uploaded.
That sounds excellent to me. 👍 It'll help to know that users aren't uploading invalid XML.
Let's keep it at that we model threats (and their priors) differently 🤔
I'm open to having my model changed, but it sounds like we have a good path forward on this one. Happy to chat more about this on a call sometime!
|
One high-level question I have is, what's the long-term plan for XML validation? If we're going to be adding more validation to lib/ soon anyway, I don't feel like we necessarily even need to merge |
Ah, but are we? And soon? I sure want to, and have code that does that (by necessity, since some amount of form inspection is done in-db, which then expects valid xml). At the same time it's evidently been a challenge for developers and code reviewers to ascertain XML validity "by hand", so in absence of a concrete alternative (though there are ideas, and some unmerged not-yet-PRed-even code) I think we should keep these wrappers. For code reviewers it's great: you see that wrapper, you know you don't have to scrutinize the XML. |
|
Sorry for the delay, @brontolosone! I promise I'll take another look soon. Much mapping to do on the frontend, but I should be done soon. |
|
@brontolosone, we still want to merge this PR, right? I'm thinking to take another look at this PR this week. |
dfe84b4 to
4a4cb31
Compare
4a4cb31 to
848dd51
Compare
Following up on several discussions on Slack.
We had a whole bunch of form test XML in our tests that wasn't actually valid XML, which we used to get away with since the in-Central parser is very, very lenient.
The problem is that then when one swaps out that parser for an XML parser that expects, well, XML, one runs into problems with that faulty test data.
Thus, the faulty testdata needed fixing, and in the process I thought it would be good to add an automatic test for our test data with which we test Central 😆
Goals:
test/data/xml.jstest/data/xml.js. We might want to say something in review if we see new tests that contain handcrafted XML; it would be good if that XML itself is tested because it's quite easy to make little mistakes. Even not closing a tag - violating the most basic form of well-formedness - doesn't alarm the in-Central parser (bbf6213). Example of making sure ones test XML is actually XML in 54cde7b.Non-goals that had to be accomplished:
What has been done to verify that this works as intended?
Ran the new test, fixed the broken xml found with the new test, ran the tests, fixed a number of newly failing tests that shouldn't have been failing, ...
Why is this the best possible solution? Were any other approaches considered?
To check whether XML is actually XML, it's easiest to just feed it to an XML parser and see what it makes of it. I picked libxmljs which is just a bunch of bindings to the venerable
libxml2library which I've worked with before (and is also used by Postgres). The downside of that is that upon install of dev depencies, a compiler runs to compile the bindings. For me it worked without a hitch but then again on my system I have compilers and header files for both libxml2 and node, perhaps the install procedure is less pleasant on a consumer-oriented system. Or maybe not. I'm not so familiar with the node-gyp stuff. It runs on CircleCI without problems but let's hear it if any developer runs into any problems with this.How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?
It's a development thing, it doesn't affect users.
Does this change require updates to the API documentation? If so, please update docs/api.yaml as part of this PR.
No.
Before submitting this PR, please make sure you have:
make testand confirmed all checks still pass OR confirm CircleCI build passes