P4DTI issue job002109

Title	non-ASCII on non-Unicode Perforce server breaks replicator
Status	closed
Priority	essential
Assigned user	Nick Barnes
Organization	Ravenbrook
Description	When non-ASCII characters are stored on a non-Unicode Perforce server (e.g. by users entering them in a changelist description) the P4DTI replicator doesn't know how to interpret them. It treats them as raw bytes, which then fail when encoded as (e.g.) Latin-1 or ASCII.
Analysis	Always use the same encoding/decoding to/from non-Unicode servers. The most sensible encoding is probably whatever P4Win and/or P4V use. However, research shows that this is locale-dependent. We need a good default option in case this encoding breaks on whatever characters we read from Perforce. The UTF-8 encoding, for instance, rejects any byte in the range 80-bf that does not follow a valid lead byte, such as the common byte 0x92 (which is the Windows-1252 encoding for U+2019 RIGHT SINGLE QUOTATION MARK). Windows-1252 is undefined for bytes 81, 8d, 8f, 90, 9d. Latin-1 has the advantage of being defined (as the identity) on every byte. If we use a fully-defined or mostly-defined encoding, such as Latin-1 or Windows-1252, it might be tolerable to replace undefined characters with a fixed replacement, which is easy in Python (using the "replace" error handler). We can get the locale encoding with (_,encoding) = locale.getdefaultlocale(). Test for existence with codecs.lookup(encoding), and default to "latin-1" if it doesn't exist?
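The approach in the analysis can be sketched as follows. This is a minimal illustration in Python 3, not the code from the actual fix: the function names choose_encoding and decode_from_perforce are hypothetical, and it assumes locale.getpreferredencoding(False) in place of locale.getdefaultlocale(), which is deprecated in recent Python versions.

```python
import codecs
import locale


def choose_encoding(default="latin-1"):
    """Return the locale's encoding if Python has a codec for it, else default.

    Hypothetical helper; uses getpreferredencoding() as a stand-in for the
    getdefaultlocale() call mentioned in the analysis.
    """
    encoding = locale.getpreferredencoding(False)
    try:
        # codecs.lookup raises LookupError for an unknown codec name.
        codecs.lookup(encoding)
    except (LookupError, TypeError):
        return default
    return encoding


def decode_from_perforce(raw, encoding=None):
    """Decode bytes read from a non-Unicode server (hypothetical helper).

    Bytes that the codec leaves undefined become U+FFFD rather than
    raising UnicodeDecodeError.
    """
    if encoding is None:
        encoding = choose_encoding()
    return raw.decode(encoding, "replace")
```

For example, decode_from_perforce(b"it\x92s", "cp1252") gives "it" + U+2019 + "s", the same bytes decoded as "latin-1" map 0x92 to U+0092 unchanged, and strict UTF-8 decoding of those bytes raises UnicodeDecodeError.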
How found	customer
Evidence	`http://info.ravenbrook.com/mail/2009/04/23/23-14-52/0.txt`
Observed in	2.4.4
Introduced in	2.4.3
Created by	Nick Barnes
Created on	2009-04-24 14:02:21
Last modified by	Nick Barnes
Last modified on	2009-05-07 19:49:08
History	2009-04-24 NB Created

Fixes

Change	Effect	Date	User	Description
167927	closed	2009-05-06 16:48:44	Nick Barnes	Improve encoding used for talking to non-Unicode Perforce server.