Strings & binary data
Librcd uses the data type fstr_t for representing binary data. An fstr_t consists of two members: a length (size_t len) and a pointer to the data (uint8_t* str). The memory is not owned by the structure, and in fact fstrings can point into arbitrary memory, e.g. the middle of another fstring. To represent owned memory, an fstr_mem_t* is used.
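The non-owning nature of such a structure can be illustrated with a minimal plain C sketch. The names demo_fstr_t and demo_slice are hypothetical stand-ins written for this example, not librcd's definitions; the real fstr_slice may handle its arguments differently, so consult fstring.h for the actual behavior.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for fstr_t: a length plus a pointer to memory
 * that the structure does not own. */
typedef struct {
    size_t len;
    uint8_t *str;
} demo_fstr_t;

/* Returns a view of bytes [start, end) of s, clamped to the bounds of s.
 * Nothing is copied: the result points into the same memory as s. */
static demo_fstr_t demo_slice(demo_fstr_t s, size_t start, size_t end) {
    if (end > s.len)
        end = s.len;
    if (start > end)
        start = end;
    return (demo_fstr_t){ .len = end - start, .str = s.str + start };
}
```

Because a slice shares memory with the string it was taken from, writing through the slice changes the original buffer, and the slice becomes invalid once that buffer is freed or goes out of scope.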
Many convenience functions and macros are defined for fstr_t values, for instance:
- fstr_slice, for creating a substring (slice) of another fstring, pointing into the same memory
- fstr_cpy, fstr_cpy_over, for copying data between fstrings
- conc, for concatenating fstrings (returning an fstr_mem_t*; see also concs and sconc)
- fss, for converting from fstr_mem_t* to fstr_t
- FSTR_PACK, for packing an arbitrary C type into an fstr_t by taking its address and size
A more complete list, including detailed comments, can be found in fstring.h.
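To make the idea behind packing and copying concrete, here is a rough plain C sketch. DEMO_PACK and demo_copy_over are hypothetical names invented for this illustration; they only mirror the descriptions in the list above and are not the signatures or semantics of FSTR_PACK or fstr_cpy_over, which are documented in fstring.h.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for fstr_t (length plus non-owning pointer). */
typedef struct {
    size_t len;
    uint8_t *str;
} demo_fstr_t;

/* Views the object x as raw bytes by taking its address and its size.
 * x must be an lvalue, since its address is taken. */
#define DEMO_PACK(x) \
    ((demo_fstr_t){ .len = sizeof(x), .str = (uint8_t *)&(x) })

/* Copies as many bytes as fit from src into dst and returns the number
 * of bytes copied. memmove is used because two views may overlap. */
static size_t demo_copy_over(demo_fstr_t dst, demo_fstr_t src) {
    size_t n = src.len < dst.len ? src.len : dst.len;
    if (n > 0)
        memmove(dst.str, src.str, n);
    return n;
}
```

With these helpers, copying one uint32_t into another could be written as demo_copy_over(DEMO_PACK(b), DEMO_PACK(a)), which treats both variables as 4-byte buffers.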
Text strings are normally represented by their UTF-8 encoded forms. There are a few helper functions for dealing with UTF-8-encoded text (such as extracting Unicode code points), but usually they are not needed: applications tend to be content agnostic except for small sets of parse-affecting control characters, and those almost always have single-byte encodings within UTF-8¹.
Librcd's preprocessor (rcd-pp) will automatically convert any string literals to their fstr_t equivalents - for instance, "abc" will be converted into ((fstr_t){.str = "abc", .len = 3}). This conversion is only performed after the magic word #pragma librcd occurs in the preprocessed source code, which must happen only within source (*.c) files. This avoids inflicting global state upon unrelated header files. To define fstring constants within headers, the fstr macro can be used to convert C string literals into fstr_ts.
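The effect of this conversion can be approximated in plain C with a macro that measures a literal at compile time. DEMO_LIT and demo_fstr_t below are hypothetical names for this sketch; this is not the definition of librcd's fstr macro, which may be implemented differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for fstr_t (length plus non-owning pointer). */
typedef struct {
    size_t len;
    uint8_t *str;
} demo_fstr_t;

/* Wraps a C string literal in a length/pointer pair.
 * - "" s "" only compiles when s is a string literal.
 * - sizeof counts the terminating NUL, hence the "- 1".
 * The bytes belong to the literal and must not be written to. */
#define DEMO_LIT(s) \
    ((demo_fstr_t){ .str = (uint8_t *)("" s ""), .len = sizeof("" s "") - 1 })
```

Since the length comes from sizeof rather than strlen, a literal with an embedded NUL such as DEMO_LIT("a\0b") has length 3, which fits the use of the same type for binary data.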
¹ Note that due to the design of UTF-8, a byte cannot be both a starting byte and a continuation byte. Thus, when searching for single-byte characters it is not necessary to care about parsing of prior multi-byte characters. The same property is coincidentally also what makes concatenation of content-controlled strings with trusted ones safe, even for content containing broken UTF-8.
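The property in this footnote can be sketched in plain C: a single-byte (ASCII) delimiter is found with a plain byte scan, with no decoding of the surrounding text. The function names below are hypothetical and not part of librcd.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the first byte equal to ascii_ch in data[0..len),
 * or len if there is none. ascii_ch is expected to be below 0x80.
 * Every byte of a multi-byte UTF-8 sequence is 0x80 or above, so a match
 * can never land inside another character. */
static size_t demo_find_ascii(const uint8_t *data, size_t len, uint8_t ascii_ch) {
    for (size_t i = 0; i < len; i++) {
        if (data[i] == ascii_ch)
            return i;
    }
    return len;
}

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx).
 * Start bytes of multi-byte sequences have the pattern 11xxxxxx, so the
 * two sets are disjoint. */
static int demo_is_continuation(uint8_t b) {
    return (b & 0xC0) == 0x80;
}
```

For the byte sequence 'n', 0xC3, 0xA9, ',', 0xC3, 0xB6 (the text "né,ö"), the scan for ',' reports index 3 without examining how the neighboring two-byte characters are encoded.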