Opened 9 years ago

Closed 8 years ago

#436 closed defect (fixed)

Platform dependent encoding instead of UTF-8 in DataHelper

Reported by: John Doo
Owned by: zzz
Priority: minor
Milestone: 0.8.6
Component: api/data
Version: 0.8.4
Keywords:
Cc:
Parent Tickets:
Sensitive: no

Description

net.i2p.data.DataHelper contains the following lines:

private final static byte SEMICOLON_BYTES[] = ";".getBytes(); // in UTF-8
private final static byte EQUAL_BYTES[] = "=".getBytes(); // in UTF-8

As the comment already says, the byte arrays are supposed to contain a UTF-8 encoding of ';' and '=', but they actually contain a platform-dependent encoding, because getBytes() without an argument uses the platform's default charset. So I suggest replacing them with:

private static final Charset UTF_8 = Charset.forName("UTF-8");
private static final byte[] EQUAL_BYTES = "=".getBytes(UTF_8); // in UTF-8
private static final byte[] SEMICOLON_BYTES = ";".getBytes(UTF_8); // in UTF-8

[Using a (static final) Charset has the advantage that it is not necessary to wrap getBytes in a try-catch block, unlike getBytes(String), which throws the checked UnsupportedEncodingException.]
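
The suggestion can be tried in isolation with a minimal sketch like the following. The class name DelimiterBytes is hypothetical and only stands in for DataHelper; the UTF-16 line is added purely for contrast and is not part of the proposed change.

```java
import java.nio.charset.Charset;

// Hypothetical stand-in for DataHelper, used only to illustrate the suggestion.
final class DelimiterBytes {
    // Looked up once; Charset.forName throws no checked exception.
    private static final Charset UTF_8 = Charset.forName("UTF-8");

    static final byte[] EQUAL_BYTES = "=".getBytes(UTF_8);     // always { 0x3d }
    static final byte[] SEMICOLON_BYTES = ";".getBytes(UTF_8); // always { 0x3b }

    public static void main(String[] args) {
        // Prints: '=' -> 0x3d, ';' -> 0x3b
        System.out.printf("'=' -> 0x%02x, ';' -> 0x%02x%n",
                EQUAL_BYTES[0], SEMICOLON_BYTES[0]);

        // For contrast: the same character in UTF-16 takes more than one byte,
        // which is what a no-argument getBytes() would produce if UTF-16
        // were the platform default.
        byte[] utf16 = "=".getBytes(Charset.forName("UTF-16"));
        System.out.println("UTF-16 length of '=': " + utf16.length);
    }
}
```

Because the charset is passed explicitly, the constants no longer depend on the file.encoding of the JVM that loads the class.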

Attachments (1)

CharsetTest.java (1.1 KB) - added by John Doo 9 years ago.

Change History (7)

comment:1 Changed 9 years ago by zzz

But '=' and ';' are part of the base ASCII charset, which is encoded the same way in any character encoding, right? Can you name a locale where they are encoded differently?

comment:2 Changed 9 years ago by John Doo

Among the charsets are IBM-Thai, IBM01140-IBM01140, UTF-16 and its variations, and several more. As I understand it, UTF-8 incorporates ASCII so the first 128 characters are the same. But apparently that's not true for all available character sets.
I have attached a small program that prints the charsets in which the byte values of ';' and '=' differ from those in UTF-8.
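
The attached CharsetTest.java is not reproduced here. A rough sketch of the same idea, assuming nothing about the attachment's actual contents (the class and method names below are hypothetical), might look like this:

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the CharsetTest.java attached to the ticket.
final class CharsetScanSketch {

    // Returns the names of installed charsets that encode `text`
    // to different bytes than UTF-8 does.
    static List<String> charsetsDifferingFromUtf8(String text) {
        byte[] expected = text.getBytes(Charset.forName("UTF-8"));
        List<String> differing = new ArrayList<>();
        for (Charset cs : Charset.availableCharsets().values()) {
            if (!cs.canEncode()) {
                continue; // decode-only charsets cannot be used with getBytes
            }
            if (!Arrays.equals(text.getBytes(cs), expected)) {
                differing.add(cs.name());
            }
        }
        return differing;
    }

    public static void main(String[] args) {
        // The list depends on which charsets the local JRE ships with.
        for (String name : charsetsDifferingFromUtf8(";=")) {
            System.out.println(name);
        }
    }
}
```

The output varies between JREs, since the set of installed charsets is not fixed; UTF-16 is guaranteed to be present and should always appear in the list.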

Changed 9 years ago by John Doo

Attachment: CharsetTest.java added

comment:3 Changed 9 years ago by zzz

OK, you got me there.

I suspect there are a TON of places in the code where we assume that the local charset encoding is the same as 7-bit ASCII (the lower half of ISO-8859-1) for the characters 0x20 - 0x7e, and probably for \r and \n as well. I expect that anybody running I2P with a charset where that is not the case would see it fail in strange and non-obvious ways. The good news is that the encoding is part of the version information listed on /logs.jsp, so if anybody reports something unusual we can check.

So it's something to look for elsewhere in the code also.
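
As a rough illustration of that assumption, the sketch below checks whether a given charset encodes \r, \n and the characters 0x20 - 0x7e exactly as US-ASCII does. The class and method names are hypothetical and nothing like this is claimed to exist in the I2P code.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical helper, shown only to make the assumption testable.
final class AsciiCompatCheck {

    // True if `cs` encodes \r, \n and 0x20..0x7e to the same bytes as US-ASCII.
    static boolean isAsciiCompatible(Charset cs) {
        StringBuilder sb = new StringBuilder("\r\n");
        for (char c = 0x20; c <= 0x7e; c++) {
            sb.append(c);
        }
        String sample = sb.toString();
        byte[] ascii = sample.getBytes(Charset.forName("US-ASCII"));
        if (!cs.canEncode()) {
            return false; // decode-only charset, cannot produce bytes at all
        }
        return Arrays.equals(sample.getBytes(cs), ascii);
    }

    public static void main(String[] args) {
        Charset def = Charset.defaultCharset();
        System.out.println(def.name() + " ASCII-compatible: " + isAsciiCompatible(def));
    }
}
```

A check like this only covers encoding; code that decodes bytes with the default charset would need a matching test in the other direction.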

comment:4 Changed 9 years ago by zzz

P.S. String.getBytes(Charset) is not supported until Java 6.
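
On a pre-Java 6 runtime, one possible workaround is the older String.getBytes(String) overload, wrapped so that the checked exception does not leak into static initializers. This is only a sketch under that assumption; the class and helper names are hypothetical, and it is not necessarily what the eventual fix looks like.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical sketch of a Java 5-compatible alternative.
final class LegacyUtf8 {

    // Encodes `s` as UTF-8 using the charset-name overload of getBytes.
    static byte[] getUtf8(String s) {
        try {
            return s.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Should not happen: every Java platform is required to support UTF-8.
            throw new RuntimeException("UTF-8 not supported", e);
        }
    }

    static final byte[] EQUAL_BYTES = getUtf8("=");     // { 0x3d }
    static final byte[] SEMICOLON_BYTES = getUtf8(";"); // { 0x3b }
}
```

The try-catch is confined to one helper, so the constant declarations stay as short as in the Charset-based version.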

comment:5 Changed 9 years ago by zzz

Milestone: 0.8.5 → 0.8.6
Owner: set to zzz
Priority: major → minor
Status: new → accepted

I've fixed DataHelper in my naming branch, to be propped over for 0.8.6, but the general issue of non-ASCII-friendly encodings will require a big audit/fix. Entering new ticket #457 for that.

comment:6 Changed 8 years ago by zzz

Resolution: fixed
Status: accepted → closed

This was fixed in 0.8.6, but the general issue, now tracked in ticket #457, is not yet addressed.