Alex Vandiver authored
Stop encoding all data as UTF-8 before inserting -- it is clearly incorrect in the case of binary data. While it is tempting to instead encode as UTF-8 only if the data is textual, this too is incorrect. As 18c810d0 describes, the character set of textual data is not guaranteed to be UTF-8; as such, by having stored characters instead of bytes in the serialized form, information has been lost. There are two recovery methods, neither terribly appealing: update to store charset="utf-8" on all data on insert, or attempt to guess the original encoding and re-encode via that. The former is distasteful because it alters the database upon serializing; the latter is fragile because the guessed encoding is not guaranteed to be the same as the original. Instead, serialize the bytes as they occurred in the database, and import them explicitly as bytes. This does make it possible to insert invalid UTF-8 into the database -- but contrary to what 74683a70 implies, this is not incorrect, as binary data (for instance) is seldom UTF-8. 3a9c38ed ensures that anything which contains high-bit characters will be QP-encoded.
0077837f
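The difference between the two approaches can be sketched in a few lines. This is a Python illustration only, not the code touched by this commit; the function names and the sample values are hypothetical. It assumes the old behaviour amounted to treating stored bytes as characters and then encoding them as UTF-8, and it uses the standard `quopri` module to stand in for the QP encoding step.

```python
import quopri

# Hypothetical stored values: Latin-1 text and a binary blob.
# Neither is valid UTF-8.
latin1_text = "caf\u00e9".encode("latin-1")            # b'caf\xe9'
binary_blob = bytes([0x89, 0x50, 0x4E, 0x47, 0xFF, 0x00])


def lossy_roundtrip(raw: bytes) -> bytes:
    """Assumed old behaviour: treat bytes as characters, then encode as UTF-8."""
    chars = raw.decode("latin-1")   # each byte becomes one code point
    return chars.encode("utf-8")    # high-bit bytes become two bytes each


def bytes_roundtrip(raw: bytes) -> bytes:
    """Behaviour described above: carry the bytes through unchanged."""
    return bytes(raw)


def qp_roundtrip(raw: bytes) -> bytes:
    """Quoted-printable keeps high-bit bytes recoverable in a 7-bit form."""
    encoded = quopri.encodestring(raw)
    assert all(b < 0x80 for b in encoded)   # encoded form is pure ASCII
    return quopri.decodestring(encoded)


for sample in (latin1_text, binary_blob):
    print(sample, lossy_roundtrip(sample) == sample, bytes_roundtrip(sample) == sample)
```

In this sketch, re-encoding changes both samples, while the byte-for-byte and quoted-printable paths return exactly the bytes that were stored.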