P4DTI issue job002109

Title: non-ASCII on non-Unicode Perforce server breaks replicator
Status: closed
Priority: essential
Assigned user: Nick Barnes
Organization: Ravenbrook
Description: When non-ASCII characters are stored on a non-Unicode Perforce server (e.g. by users entering them in a changelist description), the P4DTI replicator doesn't know how to interpret them. They are treated as raw binary, and the replicator then breaks when encoding them as (e.g.) Latin-1 or ASCII.
Analysis: Always use the same encoding when encoding to and decoding from non-Unicode servers. The most sensible encoding is probably whatever P4Win and/or P4V use. However, research shows that this is locale-dependent.
We need a good default option in case this encoding breaks on whatever characters we read from Perforce. The UTF-8 encoding, for instance, balks at any byte in the range 80-bf that does not follow a suitable lead byte, such as the common lone byte 0x92 (which is the Windows-1252 encoding for U+2019 RIGHT SINGLE QUOTATION MARK). Windows-1252 is undefined for bytes 81, 8d, 8f, 90, 9d. Latin-1 has the advantage of being defined (as the identity) on every byte. If we use a fully-defined or mostly-defined encoding, such as Latin-1 or Windows-1252, it might be tolerable to replace undefined characters with a fixed replacement, which is easy in Python (using the "replace" error handler).
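The decoder behaviour described in this paragraph can be checked with a few lines of Python. This is an illustrative sketch only, not code from the replicator; the helper name try_decode is made up for the example, and "cp1252" is Python's codec name for Windows-1252.

```python
def try_decode(data, encoding, errors="strict"):
    """Return the decoded text, or None if the codec rejects the bytes.

    Hypothetical helper for illustration; not part of the P4DTI.
    """
    try:
        return data.decode(encoding, errors)
    except UnicodeDecodeError:
        return None


# A lone 0x92 is not valid UTF-8, so strict decoding fails.
utf8_92 = try_decode(b"\x92", "utf-8")

# Windows-1252 maps 0x92 to U+2019 RIGHT SINGLE QUOTATION MARK.
cp1252_92 = try_decode(b"\x92", "cp1252")

# Latin-1 is the identity on byte values: 0x92 becomes U+0092.
latin1_92 = try_decode(b"\x92", "latin-1")

# 0x81 is one of the bytes Windows-1252 leaves undefined.
cp1252_81 = try_decode(b"\x81", "cp1252")

# With the "replace" error handler the undefined byte becomes U+FFFD.
cp1252_81_replaced = try_decode(b"\x81", "cp1252", "replace")
```

Note that Latin-1 never fails, but it turns 0x92 into the control character U+0092 instead of the quotation mark the user probably typed, which is the trade-off between "always defined" and "probably what the client meant".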
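We can get the locale encoding with (_, encoding) = locale.getdefaultlocale() (encoding may be None if it cannot be determined). Test that Python has a codec for it with codecs.lookup(encoding), which raises LookupError for an unknown name, and default to "latin-1" if it doesn't exist?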
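A minimal sketch of the proposal above, under stated assumptions: the function names choose_encoding, locale_encoding and decode_from_perforce are hypothetical, and this is not the code from the actual fix. It also uses locale.getpreferredencoding(False) in place of locale.getdefaultlocale(), because the latter is deprecated in recent Python versions; that substitution is an assumption of this example, not something the analysis proposes.

```python
import codecs
import locale

FALLBACK_ENCODING = "latin-1"  # defined on every byte, so decoding cannot fail


def choose_encoding(candidate):
    """Return candidate if Python has a codec for it, else the fallback."""
    if not candidate:
        return FALLBACK_ENCODING
    try:
        codecs.lookup(candidate)
    except LookupError:
        return FALLBACK_ENCODING
    return candidate


def locale_encoding():
    """Pick the encoding suggested by the locale, falling back to Latin-1.

    Assumption: getpreferredencoding(False) stands in for
    locale.getdefaultlocale()[1] from the analysis.
    """
    return choose_encoding(locale.getpreferredencoding(False))


def decode_from_perforce(raw, encoding):
    """Decode bytes read from a non-Unicode server, never raising.

    Bytes the codec leaves undefined are replaced with U+FFFD.
    """
    return raw.decode(encoding, "replace")
```

For example, decode_from_perforce(b"it\x92s", choose_encoding("cp1252")) yields the text with a right single quotation mark, while an unknown name such as "no-such-codec" makes choose_encoding return "latin-1".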
How found: customer
Evidence: http://info.ravenbrook.com/mail/2009/04/23/23-14-52/0.txt
Observed in: 2.4.4
Introduced in: 2.4.3
Created by: Nick Barnes
Created on: 2009-04-24 14:02:21
Last modified by: Nick Barnes
Last modified on: 2009-05-07 19:49:08
History: 2009-04-24 NB Created

Fixes

Change  Effect  Date                 User         Description
167927  closed  2009-05-06 16:48:44  Nick Barnes  Improve encoding used for talking to non-Unicode Perforce server.