Opened 6 months ago

Last modified 6 months ago

#2304 new defect

Torrents wrecked by inconsistent handling of "unsafe" characters

Reported by: jogger Owned by: zzz
Priority: minor Milestone: undecided
Component: apps/i2psnark Version: 0.9.36
Keywords: Cc:
Parent Tickets:

Description

The bug surfaces with filenames written by some Mac applications that contain non-ASCII characters (bytes with the high bit set). Example: the hexdump "EFBC8F", which is displayed as "／". Many such sequences exist.

Torrent created correctly with these characters inside the torrent file on Linux and Mac, Java 9 and 10.

Torrent downloads unchanged to Linux, Java 9 and 10. Downloaded torrent checks clean when moved to another instance on Linux or after crash. Same behaviour observed on Mac for 0.9.35 and Java 9.

On Mac with 0.9.36 and Java 10, the above sequence is changed to a single underscore. Torrents do not check clean after a crash, or when moved in after being downloaded on Linux. As a consequence, one cannot be sure that it will be possible to seed a downloaded torrent at a later time or on a different machine.

Note about the standard for testing this kind of issue:

It was Kernighan & Pike in The Practice of Programming who said as much in Chapter 6, Testing, §6.5 Stress Tests:

When Steve Bourne was writing his Unix shell (which came to be known as the Bourne shell), he made a directory of 254 files with one-character names, one for each byte value except '\0' and slash, the two characters that cannot appear in Unix file names. He used that directory for all manner of tests of pattern-matching and tokenization. (The test directory was of course created by a program.) For years afterwards, that directory was the bane of file-tree-walking programs; it tested them to destruction.

Change History (2)

comment:1 Changed 6 months ago by zzz

related: #571 #771 #1132 #1415

https://www.fileformat.info/info/unicode/char/ff0f/index.htm

The byte sequence EF BC 8F is valid UTF-8 for U+FF0F FULLWIDTH SOLIDUS.
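A quick way to confirm this is to decode the three bytes with the JDK. The class and method names below are made up for illustration; only standard library calls are used.

```java
import java.nio.charset.StandardCharsets;

public class FullwidthSolidusDemo {
    /** Decodes the three bytes from the report as UTF-8. */
    public static String decodeReportedBytes() {
        byte[] raw = {(byte) 0xEF, (byte) 0xBC, (byte) 0x8F};
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = decodeReportedBytes();
        int cp = s.codePointAt(0);
        // Three bytes on disk, but a single character once decoded.
        System.out.printf("chars=%d codePoint=U+%04X name=%s%n",
                s.length(), cp, Character.getName(cp));
    }
}
```

Because the three bytes decode to one character, a per-character replacement would turn the whole sequence into a single underscore, which matches what the report describes.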

We validate based on the default charset for the JVM, which comes from the OS. If a character is not available in the default charset, it can't be mapped to that charset, so we need to replace it with something else. We use '_'. Converting between charsets is lossy; there's no way to fix it. In addition, even in the same charset, different OSes have different rules on valid chars in file names, and things may happen to file names when you copy them between OSes. Again, that's not fixable by us.
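The effect of that validation can be illustrated roughly as follows. This is a hypothetical sketch, not the actual I2PSnark code: the class and method names are invented, and the charset is passed in explicitly rather than read from Charset.defaultCharset() so that both outcomes can be shown on one machine.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class CharsetFilterSketch {
    /**
     * Hypothetical helper (not from I2PSnark): replaces every code point
     * that the given charset cannot encode with '_'.
     */
    public static String replaceUnmappable(String name, Charset cs) {
        CharsetEncoder enc = cs.newEncoder();
        StringBuilder out = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            String ch = new String(Character.toChars(cp));
            out.append(enc.canEncode(ch) ? ch : "_");
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String name = "a\uFF0Fb"; // contains U+FF0F FULLWIDTH SOLIDUS
        // With a UTF-8 default the name survives unchanged.
        System.out.println(replaceUnmappable(name, StandardCharsets.UTF_8));
        // With an ASCII-like default the character becomes '_'.
        System.out.println(replaceUnmappable(name, StandardCharsets.US_ASCII));
    }
}
```

If the JVM's default charset differs between two installations, the same torrent metadata yields different on-disk names, so a recheck on the second installation no longer finds the files.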

comment:2 Changed 6 months ago by jogger

Basically you are saying that torrents downloaded (not created) to a Mac with 0.9.35 / Java 9 with "unsafe characters" intact are wrecked on the very same machine after upgrading to 0.9.36 / Java 10, because for some reason the filenames are no longer valid. Bad news if you cannot move them to Linux.

As a further consequence I can no longer move torrents downloaded on a Mac to Linux because on Linux the characters now considered unsafe on the Mac are still valid.

I suggest changing the policy and abandoning all character conversion except for null and slash whenever some Unix is detected as the underlying OS.
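A minimal sketch of what such a policy could look like is below. The class and method names are hypothetical, and treating every non-Windows os.name value as "some Unix" is an assumption made only for this illustration, not something I2PSnark is known to do.

```java
import java.util.Locale;

public class UnixOnlyFilterSketch {
    /** Assumption for this sketch: any os.name not starting with "windows" counts as Unix. */
    public static boolean looksLikeUnix(String osName) {
        return !osName.toLowerCase(Locale.ROOT).startsWith("windows");
    }

    /**
     * On Unix, replaces only NUL and '/' (the two bytes that cannot appear
     * in a Unix file name) and leaves every other character untouched.
     * Other OSes are out of scope here and the name is returned as-is.
     */
    public static String sanitize(String name, boolean unix) {
        if (!unix) {
            return name;
        }
        return name.replace('\0', '_').replace('/', '_');
    }

    public static void main(String[] args) {
        boolean unix = looksLikeUnix(System.getProperty("os.name"));
        // U+FF0F is kept; only the ASCII slash and NUL are replaced on Unix.
        System.out.println(sanitize("a/b\uFF0Fc", unix));
    }
}
```

Under this policy the result would no longer depend on the JVM's default charset, at the cost of relying on the filesystem to accept whatever the torrent contains.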
