|Title||non-ASCII on non-Unicode Perforce server breaks replicator|
|Assigned user||Nick Barnes|
|Description||When non-ASCII characters are stored on a non-Unicode Perforce server (e.g. by users entering them in a changelist description), the P4DTI replicator doesn't know how to interpret them. It treats them as raw bytes, which then fail when it tries to encode them as (e.g.) Latin-1 or ASCII.|
|Analysis||Always use the same encoding/decoding to/from non-Unicode servers. The most sensible encoding is probably whatever P4Win and/or P4V use. However, research shows that this is locale-dependent.|
We need a good default option if this encoding breaks on whatever characters we read from Perforce. The UTF-8 encoding, for instance, balks at any byte in the range 80-bf that does not follow a lead byte, such as the common byte 0x92 (which is the Windows-1252 encoding for U+2019 RIGHT SINGLE QUOTATION MARK). Windows-1252 is undefined for bytes 81, 8d, 8f, 90, 9d. Latin-1 has the advantage of being defined (as the identity) on every byte. If we use a fully-defined or mostly-defined encoding, such as Latin-1 or Windows-1252, it might be tolerable to replace undefined bytes with a fixed replacement character, which is easy in Python (using the "replace" error handler).
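As a rough illustration of the behaviour described above, the following Python 3 sketch decodes two sample byte strings with each codec. The sample data and variable names are made up for this example and do not come from the replicator.

```python
# Hypothetical sample data: bytes as they might arrive from a
# non-Unicode Perforce server.
raw = b"don\x92t"        # 0x92 is a right single quotation mark in Windows-1252
undefined = b"a\x81b"    # 0x81 is unassigned in Windows-1252

# UTF-8 rejects 0x92 here because it is a continuation byte with no lead byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Windows-1252 maps 0x92 to U+2019.
cp1252_text = raw.decode("cp1252")

# Latin-1 maps every byte to the code point with the same number,
# so 0x92 becomes the C1 control character U+0092.
latin1_text = raw.decode("latin-1")

# With the "replace" error handler, an undefined byte becomes U+FFFD
# instead of raising UnicodeDecodeError.
replaced = undefined.decode("cp1252", "replace")
```

Note that Latin-1 never fails, but it does not produce the character the user typed when the data was really Windows-1252.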
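We can get the locale encoding with (_,encoding) = locale.getdefaultlocale(). The result may be None if the locale cannot be determined. Test for existence with codecs.lookup(encoding), which raises LookupError for an unknown encoding, and default to "latin-1" if it doesn't exist?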
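A minimal sketch of the lookup-with-fallback idea, written for Python 3. The helper names choose_encoding and decode_from_perforce are hypothetical and are not part of the P4DTI. The sketch also uses locale.getpreferredencoding in place of locale.getdefaultlocale, because recent Python versions deprecate the latter. That is a substitution made for this example, not what the analysis proposes.

```python
import codecs
import locale


def choose_encoding(candidate, fallback="latin-1"):
    """Return candidate if Python has a codec for it, otherwise fallback.

    Hypothetical helper for illustration only.
    """
    if not candidate:
        # The locale lookup may yield None or an empty string.
        return fallback
    try:
        codecs.lookup(candidate)
    except LookupError:
        # Python has no codec registered under this name.
        return fallback
    return candidate


def decode_from_perforce(data, encoding):
    """Decode bytes, substituting U+FFFD for any undefined byte."""
    return data.decode(encoding, "replace")


# Assumption: the locale's preferred encoding stands in for whatever
# encoding the Perforce clients on this machine use.
locale_encoding = choose_encoding(locale.getpreferredencoding(False))
```

For example, decode_from_perforce(b"a\x81b", "cp1252") yields "a" followed by U+FFFD and "b", rather than raising an exception.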
|Created by||Nick Barnes|
|Created on||2009-04-24 14:02:21|
|Last modified by||Nick Barnes|
|Last modified on||2009-05-07 19:49:08|
|History||2009-04-24 NB Created|
|167927||closed||2009-05-06 16:48:44||Nick Barnes||Improve encoding used for talking to non-Unicode Perforce server.|