Add --urlencode option to safely output filenames with control characters #660

@jeking3

Background

The --text output format used by dwalk and other tools in the suite emits a parseable string produced by the following call:

numbytes = snprintf(buffer, bufsize, "%s %s %s %7.3f %3s %s %s\n",
    mode_format, username, groupname,
    size_tmp, size_units, modify_s, file
);

Problem

Reliable line-by-line parsing is defeated by paths whose filenames contain control characters such as CR or LF. POSIX does not restrict anything in a path element except the forward slash and the null character. This can lead to some pretty interesting filenames, like:

rank0.log


source script_run_kooky_12gpu_a75.sh 192.168.1.10  2>&1 | tee rank42

Yes, that's an actual filename containing three linefeeds that I found (sanitized for anonymity).
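
To make the failure concrete, here is a minimal sketch that formats one record with the same format string and counts how many lines it occupies. The helper name `count_record_lines` and every field value other than the filename are placeholders assumed for illustration; they are not taken from the tool's source.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format a single record the way --text does and
 * return the number of newline characters in it. A well-behaved record
 * occupies exactly one line. Returns 0 if formatting fails or truncates. */
size_t count_record_lines(const char *file)
{
    char buffer[4096];

    /* All fields except the filename are made-up placeholder values. */
    int n = snprintf(buffer, sizeof buffer, "%s %s %s %7.3f %3s %s %s\n",
                     "-rw-r--r--", "user", "group",
                     1.000, "KB", "2024-01-01T00:00:00", file);
    if (n < 0 || (size_t)n >= sizeof buffer) {
        return 0;
    }

    size_t lines = 0;
    for (int i = 0; i < n; i++) {
        if (buffer[i] == '\n') {
            lines++;
        }
    }
    return lines;
}
```

With an ordinary name the function returns 1. With a name containing three linefeeds it returns 4, so a line-oriented reader would see four records where only one was written.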

Solution

Encoding just CR and LF would be sufficient. However, I recommend selective percent-encoding of the control characters 0x01 through 0x1F and 0x7F, as these characters can do very strange things to terminal output. The percent sign (0x25) must also be encoded, since it is the escape character in percent-encoding. I do not recommend a complete RFC 3986 urlencode-style solution: every slash would be encoded as %2F, making paths unreadable, and it adds unnecessary bloat.
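
The encoding rule described above could be sketched roughly as follows. This is not the code from the pull request; the function name `encode_controls` and its buffer-based signature are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: percent-encode ASCII control characters
 * (0x01-0x1F and 0x7F) and the percent sign itself, copying every other
 * byte unchanged (including '/' and bytes >= 0x80).
 * Returns the number of bytes written, not counting the terminating NUL,
 * or -1 if dst is too small to hold the result. */
int encode_controls(const char *src, char *dst, size_t dstsize)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t out = 0;

    if (dstsize == 0) {
        return -1;
    }

    for (const unsigned char *p = (const unsigned char *)src; *p != '\0'; p++) {
        int escape = (*p < 0x20) || (*p == 0x7F) || (*p == '%');
        size_t need = escape ? 3 : 1;

        /* Keep room for this byte (or its %XX form) plus the terminator. */
        if (out + need + 1 > dstsize) {
            return -1;
        }

        if (escape) {
            dst[out++] = '%';
            dst[out++] = hex[*p >> 4];
            dst[out++] = hex[*p & 0x0F];
        } else {
            dst[out++] = (char)*p;
        }
    }

    dst[out] = '\0';
    return (int)out;
}
```

Because only control characters and the percent sign are rewritten, ordinary paths pass through unchanged and stay readable, which is the point of not using full RFC 3986 encoding.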

This mode would be an option to use with the --text mode, for example:

    printf("  -t, --text              - use with -o; write processed list to file in ascii format\n");
    printf("  -E, --urlencode         - use with -t; percent-encode ASCII control characters in filenames\n");

Adding this option leaves the current behavior in place for backwards compatibility, but fixes the output of examples such as the one above so that each record stays on a single line and readline() processing of the text file works without errors. For example:

rank0.log%0A%0A%0Asource script_run_kooky_12gpu_a75.sh 192.168.1.10  2>&1 | tee rank42

Any program reading a text file written this way would need to decode each path with a function such as Python's urllib.parse.unquote (urllib.unquote in Python 2) to obtain the actual path.
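
For a consumer written in C, the reading side might look something like this sketch. The names `hex_value` and `decode_percent` are hypothetical, and copying malformed sequences through literally is a design assumption here, chosen to match how Python's unquote treats them.

```c
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Hypothetical helper: decode %XX sequences in place and return the
 * decoded length. A '%' that is not followed by two hex digits is copied
 * through unchanged. The output is never longer than the input, so
 * decoding in place is safe. */
size_t decode_percent(char *s)
{
    char *out = s;
    const char *in = s;

    while (*in != '\0') {
        int hi = -1;
        int lo = -1;

        if (in[0] == '%') {
            hi = hex_value((unsigned char)in[1]);
            if (hi >= 0) {
                /* in[1] is a hex digit, so in[2] is still inside the string. */
                lo = hex_value((unsigned char)in[2]);
            }
        }

        if (hi >= 0 && lo >= 0) {
            *out++ = (char)((hi << 4) | lo);
            in += 3;
        } else {
            *out++ = *in++;
        }
    }

    *out = '\0';
    return (size_t)(out - s);
}
```

A reader would split each line into fields first and decode only the filename field, since decoding before splitting would reintroduce the embedded linefeeds.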

Timeline

I have this solution implemented and tested (manually) in dcmp, dfind, drm, dsh, and dwalk. I will submit a pull request for this.
