| Title | non-ASCII on non-Unicode Perforce server breaks replicator |
| Status | closed |
| Priority | essential |
| Assigned user | Nick Barnes |
| Organization | Ravenbrook |
| Description | When non-ASCII characters are stored on a non-Unicode Perforce server (e.g. by users entering them in a changelist description) the P4DTI replicator doesn't know how to interpret them. It treats them as raw bytes, and the replicator then breaks when it tries to encode them as (e.g.) Latin-1 or ASCII. |
| Analysis | Always use the same encoding/decoding to/from non-Unicode servers. The most sensible encoding is probably whatever P4Win and/or P4V use. However, research shows that this is locale-dependent. We need a good fallback in case the locale's encoding cannot decode the bytes we read from Perforce. The UTF-8 encoding, for instance, balks at any byte in the range 80-bf that does not continue a multi-byte sequence, such as the common byte 0x92 (which is the Windows-1252 encoding for U+2019 RIGHT SINGLE QUOTATION MARK). Windows-1252 is undefined for bytes 81, 8d, 8f, 90, 9d. Latin-1 has the advantage of being defined (as the identity) on every byte. If we use a fully-defined or mostly-defined encoding, such as Latin-1 or Windows-1252, it might be tolerable to replace undefined bytes with a fixed replacement character, which is easy in Python (using the "replace" error handler). We can get the locale encoding with (_,encoding) = locale.getdefaultlocale(). Test for existence with codecs.lookup(encoding), and default to "latin-1" if it doesn't exist? |
| How found | customer |
| Evidence | http://info.ravenbrook.com/mail/2009/04/23/23-14-52/0.txt |
| Observed in | 2.4.4 |
| Introduced in | 2.4.3 |
| Created by | Nick Barnes |
| Created on | 2009-04-24 14:02:21 |
| Last modified by | Nick Barnes |
| Last modified on | 2009-05-07 19:49:08 |
| History | 2009-04-24 NB Created |
| Change | Effect | Date | User | Description |
|---|---|---|---|---|
| 167927 | closed | 2009-05-06 16:48:44 | Nick Barnes | Improve encoding used for talking to non-Unicode Perforce server. |
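
The approach proposed in the Analysis field could be sketched roughly as follows. This is an illustration only, not the code from change 167927: the names choose_encoding, decode_from_perforce and FALLBACK_ENCODING are hypothetical, and locale.getpreferredencoding(False) is swapped in for locale.getdefaultlocale(), which is deprecated in recent Python versions.

```python
import codecs
import locale

# Hypothetical name: Latin-1 is the fallback because it maps every
# byte value 00-ff to a character, so decoding can never fail.
FALLBACK_ENCODING = "latin-1"


def choose_encoding(candidate=None):
    """Return a codec name Python knows about, else the Latin-1 fallback.

    Assumption: when no candidate is given, the locale's preferred
    encoding stands in for the (_, encoding) = locale.getdefaultlocale()
    call mentioned in the Analysis.
    """
    if candidate is None:
        candidate = locale.getpreferredencoding(False)
    if not candidate:
        return FALLBACK_ENCODING
    try:
        # codecs.lookup raises LookupError for an unknown codec name.
        codecs.lookup(candidate)
    except LookupError:
        return FALLBACK_ENCODING
    return candidate


def decode_from_perforce(raw, encoding=None):
    """Decode bytes read from a non-Unicode server without raising.

    The "replace" error handler substitutes U+FFFD for any byte the
    chosen encoding leaves undefined.
    """
    return raw.decode(choose_encoding(encoding), "replace")


if __name__ == "__main__":
    # Byte 0x92 is a right single quotation mark in Windows-1252.
    print(repr(decode_from_perforce(b"it\x92s", "cp1252")))
    # Byte 0x81 is undefined in Windows-1252 and gets replaced.
    print(repr(decode_from_perforce(b"\x81", "cp1252")))
```

Encoding text back to the server would need the same codec and the same error handler, so that both directions agree on what each byte means.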