Description
Background
The --text output format used by dwalk and other tools in the suite emits a parseable string based on the following format:
mpifileutils/src/common/mfu_flist_io.c
Lines 1643 to 1645 in 8e5e35f
numbytes = snprintf(buffer, bufsize, "%s %s %s %7.3f %3s %s %s\n",
        mode_format, username, groupname,
        size_tmp, size_units, modify_s, file
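For context, a consumer of this format might parse each line roughly as follows. This is only a sketch: the struct and function names (text_record, parse_text_line) are hypothetical, the buffer sizes are arbitrary, and it assumes that the user, group, and timestamp fields never contain spaces.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record type mirroring the fields of the snprintf format. */
struct text_record {
    char mode[32];
    char user[64];
    char group[64];
    double size;
    char units[8];
    char mtime[64];   /* assumed to contain no spaces */
    char path[4096];  /* rest of the line; may itself contain spaces */
};

/* Parse one line of --text output.
 * Returns 1 if all seven fields were found, 0 otherwise. */
static int parse_text_line(const char *line, struct text_record *rec)
{
    assert(line != NULL && rec != NULL);
    int n = sscanf(line, "%31s %63s %63s %lf %7s %63s %4095[^\n]",
                   rec->mode, rec->user, rec->group, &rec->size,
                   rec->units, rec->mtime, rec->path);
    return n == 7;
}
```

The path is taken as everything up to the end of the line, so a reader like this only works when each record occupies exactly one line.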
Problem
Reliable line-by-line parsing is defeated by paths whose filenames contain control characters such as CR or LF. POSIX does not restrict the bytes in a path element except for the forward slash and the null character. This can lead to some pretty interesting filenames, like:
rank0.log


source script_run_kooky_12gpu_a75.sh 192.168.1.10 2>&1 | tee rank42
Yes, that's an actual filename containing three linefeeds that I found (sanitized for anonymity).
Solution
Encoding only CR and LF would be sufficient to fix line-based parsing. However, I recommend selective percent-encoding of all ASCII control characters from 0x01 through 0x1F, plus 0x7F, since these characters can do very strange things to terminal output. The percent sign (0x25) must also be encoded, because it is the escape character used in percent-encoding. I do not recommend a complete RFC 3986 urlencode-style solution: it would encode every slash as %2F, which makes the file unreadable and adds unnecessary bloat.
The encoding would be enabled by a new option used together with --text, for example:
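The selective encoding described above could be sketched in C along these lines. This is an illustration of the rule only, not the code from the pending pull request; encode_path is a hypothetical name, and bytes at 0x80 and above are assumed to pass through unchanged so that UTF-8 filenames stay readable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not the mpifileutils implementation.
 * Copies src into dst, percent-encoding bytes 0x01-0x1F, 0x7F, and '%'.
 * Returns the encoded length, or (size_t)-1 if dst is too small. */
static size_t encode_path(const char *src, char *dst, size_t dstsize)
{
    static const char hex[] = "0123456789ABCDEF";
    assert(src != NULL && dst != NULL);
    if (dstsize == 0) {
        return (size_t)-1;
    }
    size_t out = 0;
    for (const unsigned char *p = (const unsigned char *)src; *p != '\0'; p++) {
        int escape = (*p <= 0x1F) || (*p == 0x7F) || (*p == '%');
        size_t need = escape ? 3 : 1;
        if (out + need >= dstsize) {   /* keep room for the terminating NUL */
            return (size_t)-1;
        }
        if (escape) {
            dst[out++] = '%';
            dst[out++] = hex[*p >> 4];
            dst[out++] = hex[*p & 0x0F];
        } else {
            dst[out++] = (char)*p;     /* slashes, spaces, UTF-8 bytes pass through */
        }
    }
    dst[out] = '\0';
    return out;
}
```

In the worst case every byte is escaped, so a destination buffer of three times the source length plus one byte is always large enough.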
printf(" -t, --text - use with -o; write processed list to file in ascii format\n");
printf(" -E, --urlencode - use with -t; percent-encode ASCII control characters in filenames\n");
Adding this as an option leaves the current behavior in place for backwards compatibility, while putting entries such as the one above on a single line, so that readline() processing of the text file works without errors. For example:
rank0.log%0A%0A%0Asource script_run_kooky_12gpu_a75.sh 192.168.1.10 2>&1 | tee rank42
Any program reading a text file written this way would need to decode each path to obtain the actual filename, for example with Python's urllib.parse.unquote (urllib.unquote in Python 2).
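For consumers written in C, the reverse step might look like the sketch below. The names hex_value and decode_path are hypothetical, and the choice to reject malformed escapes and %00 is an assumption about what a strict reader would want, not something the proposal specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Hypothetical helper: decodes %XX sequences in place.
 * Returns the decoded length, or (size_t)-1 on a malformed escape
 * or on %00 (in which case the buffer may be partially rewritten). */
static size_t decode_path(char *s)
{
    assert(s != NULL);
    size_t in = 0, out = 0;
    while (s[in] != '\0') {
        if (s[in] == '%') {
            int hi = hex_value(s[in + 1]);
            /* Only look at the second digit if the first one was valid,
             * so we never read past the terminating NUL. */
            int lo = (hi < 0) ? -1 : hex_value(s[in + 2]);
            if (hi < 0 || lo < 0 || (hi == 0 && lo == 0)) {
                return (size_t)-1;
            }
            s[out++] = (char)(hi * 16 + lo);
            in += 3;
        } else {
            s[out++] = s[in++];
        }
    }
    s[out] = '\0';
    return out;
}
```

Decoding in place is safe because the decoded form is never longer than the encoded form.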
Timeline
I have this solution implemented and tested (manually) in dcmp, dfind, drm, dsh, and dwalk. I will submit a pull request for this.