
Strings & binary data

simonlindholm edited this page Dec 7, 2014 · 2 revisions

Librcd uses the data type fstr_t for representing binary data. An fstr_t consists of two members: a length (size_t len) and a pointer to the data (uint8_t* str). The memory is not owned by the structure, and in fact fstrings can point into arbitrary memory, e.g. the middle of another fstring. To represent owned memory, an fstr_mem_t* is used.
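The non-owning layout described above can be pictured with a small stand-alone sketch. The demo_fstr_t type and the demo_write_through_view function are hypothetical stand-ins that only mirror the two members named in the text (the member order is an assumption); librcd's real declaration lives in fstring.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in mirroring the two members described above.
 * This is NOT librcd's declaration; see fstring.h for the real one. */
typedef struct {
    size_t len;    /* number of bytes referenced */
    uint8_t* str;  /* start of the bytes; not owned by this struct */
} demo_fstr_t;

/* Builds a view of the middle of a buffer and writes through it,
 * showing that a view aliases the original memory instead of copying it. */
static uint8_t demo_write_through_view(void) {
    static uint8_t buf[] = {'h', 'e', 'l', 'l', 'o'};
    demo_fstr_t whole = {.len = sizeof(buf), .str = buf};
    demo_fstr_t middle = {.len = 3, .str = buf + 1};  /* covers "ell" */

    assert(middle.str == whole.str + 1);  /* same memory, different start */
    middle.str[0] = 'E';                  /* modifies buf[1] directly */
    return whole.str[1];                  /* the change is visible via `whole` */
}
```

Because neither value owns the buffer, the buffer has to outlive every view that points into it; in this sketch that is guaranteed by making it static.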

Many convenience functions and macros are defined for fstr_ts, for instance:

  • fstr_slice, for creating a substring (slice) of another fstring, pointing into the same memory
  • fstr_cpy, fstr_cpy_over, for copying data between fstrings
  • conc, for concatenating fstrings (returning an fstr_mem_t* - see also concs and sconc)
  • fss, for converting from fstr_mem_t* to fstr_t
  • FSTR_PACK, for packing an arbitrary C type into an fstr_t by taking its address and size

A more complete list, including detailed comments, can be found in fstring.h.
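To give a feel for what slicing, copying and packing mean for a pointer-plus-length view, here is a minimal sketch in plain C. All names (demo_fstr_t, demo_slice, demo_copy, DEMO_PACK) are hypothetical, and their signatures and clamping behavior are assumptions made for illustration; the actual signatures and semantics of fstr_slice, fstr_cpy and FSTR_PACK are documented in fstring.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in type, not librcd's declaration. */
typedef struct { size_t len; uint8_t* str; } demo_fstr_t;

/* Hypothetical slice helper: a view of bytes [start, end) of s.
 * Out-of-range indices are clamped (an assumption of this sketch). */
static demo_fstr_t demo_slice(demo_fstr_t s, size_t start, size_t end) {
    if (end > s.len) end = s.len;
    if (start > end) start = end;
    return (demo_fstr_t){.len = end - start, .str = s.str + start};
}

/* Hypothetical copy helper: copies as many bytes as fit into dst and
 * returns the part of dst that was written. */
static demo_fstr_t demo_copy(demo_fstr_t src, demo_fstr_t dst) {
    size_t n = src.len < dst.len ? src.len : dst.len;
    memmove(dst.str, src.str, n);  /* memmove tolerates overlapping views */
    return demo_slice(dst, 0, n);
}

/* Hypothetical pack macro: view the raw bytes of any lvalue. */
#define DEMO_PACK(obj) \
    ((demo_fstr_t){.len = sizeof(obj), .str = (uint8_t*)&(obj)})

static void demo_usage(void) {
    uint8_t data[] = {1, 2, 3, 4, 5};
    demo_fstr_t all = {.len = sizeof(data), .str = data};

    demo_fstr_t mid = demo_slice(all, 1, 4);  /* bytes 2, 3, 4 */
    assert(mid.len == 3 && mid.str == data + 1);

    uint32_t value = 7;
    demo_fstr_t packed = DEMO_PACK(value);    /* aliases `value` */
    assert(packed.len == sizeof(uint32_t));
}
```

None of these helpers allocates: a slice or a packed value is only valid for as long as the memory it points into.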

Text strings are normally represented by their UTF-8 encoded forms. There are a few helper functions for dealing with UTF-8-encoded text (such as extracting Unicode code points), but usually they are not needed: applications tend to be content agnostic except for small sets of parse-affecting control characters, and those almost always have single-byte encodings within UTF-8¹.

Librcd's preprocessor (rcd-pp) will automatically convert any string literals to their fstr_t equivalents - for instance, "abc" will be converted into ((fstr_t){.str = "abc", .len = 3}). This conversion is only performed after the magic word #pragma librcd occurs in the preprocessed source code, and that pragma must only appear within source (*.c) files. This avoids inflicting global state upon unrelated header files. To define fstring constants within headers, the fstr macro can be used to convert C string literals into fstr_ts.
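The conversion shown above amounts to pairing the literal's address with its length excluding the terminating NUL. A macro with that effect could be sketched as follows; DEMO_FSTR and demo_fstr_t are hypothetical names, and this is not claimed to be how librcd's fstr macro or rcd-pp is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in type, not librcd's declaration. */
typedef struct { size_t len; uint8_t* str; } demo_fstr_t;

/* Hypothetical literal-to-view macro. sizeof on a string literal counts
 * the terminating NUL, hence the "- 1". The `"" lit` concatenation makes
 * the macro fail to compile for anything that is not a string literal,
 * such as a char* whose sizeof would be the pointer size.
 * The bytes must not be written through the resulting view. */
#define DEMO_FSTR(lit) \
    ((demo_fstr_t){.len = sizeof("" lit) - 1, .str = (uint8_t*)("" lit)})

static size_t demo_literal_len(void) {
    demo_fstr_t s = DEMO_FSTR("abc");
    assert(s.str[0] == 'a');
    return s.len;  /* 3: the NUL terminator is not part of the view */
}

static size_t demo_binary_len(void) {
    /* The length is explicit, so an embedded NUL is ordinary data. */
    demo_fstr_t s = DEMO_FSTR("a\0b");
    return s.len;  /* 3 */
}
```

Carrying the length explicitly is what lets the same type hold both text and arbitrary binary data, since no byte value is reserved as a terminator.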

¹ Note that due to the design of UTF-8, a byte cannot be both a starting byte and a continuation byte. Thus, when searching for single-byte characters it is not necessary to care about parsing of prior multi-byte characters. The same property is coincidentally also what makes concatenation of content-controlled strings with trusted ones safe, even for content containing broken UTF-8.
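The property in this footnote can be illustrated with a plain byte scan. In the sketch below, all function names are hypothetical and none of them is part of librcd. Searching for an ASCII delimiter needs no UTF-8 decoding, because every byte of a multi-byte sequence has its high bit set, and code points in well-formed text can be counted by skipping continuation bytes (those of the form 10xxxxxx).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the first byte equal to `delim`, or `len` if it
 * does not occur. `delim` must be ASCII; such a byte can never appear
 * inside a multi-byte UTF-8 sequence, so no decoding is needed. */
static size_t demo_find_ascii(const uint8_t* s, size_t len, uint8_t delim) {
    assert(delim < 0x80);
    for (size_t i = 0; i < len; i++) {
        if (s[i] == delim) return i;
    }
    return len;
}

/* Continuation bytes have the bit pattern 10xxxxxx. */
static int demo_is_continuation(uint8_t b) {
    return (b & 0xC0) == 0x80;
}

/* Counts code points in well-formed UTF-8 by counting the bytes that
 * start a sequence, i.e. all bytes that are not continuation bytes. */
static size_t demo_count_code_points(const uint8_t* s, size_t len) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (!demo_is_continuation(s[i])) n++;
    }
    return n;
}
```

For example, in the 11-byte string "k\xc3\xa4y=v\xc3\xa4rde" (where \xc3\xa4 encodes "ä"), the scan finds '=' at byte index 4 without ever inspecting the two-byte sequences as characters.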
