This was posted in one of the Russian forums. It was not possible to archive
(under Linux, using tar) a vfat directory where files had long Russian names
(really long - over 150-170 characters) - tar returned a stat failure. When
looking with plain ls, the file names appeared truncated.
Now looking at the current (2.6.21) fat driver, __fat_readdir allocates a
large enough buffer (PAGE_SIZE-522) for the UTF-8 name; but for iocharset=utf8
it calls uni16_to_x8(), which artificially limits the length of the UTF-8 name
to 256 bytes ... which is obviously not enough for a long UTF-8 Russian string
(2 bytes per character), not to mention the - theoretical - general case of
6-byte UTF-8 characters.
A similar problem apparently affects the vfat_lookup()->...->fat_search_long()
call chain. Except this appears to be broken even in the "utf8" case, because
fat_search_long() allocates a fixed 256-byte buffer for the UTF-8 name.
Am I off track here?
-andrey
Hi Andrey!
On 7 May 2007, at 19:51, Andrey Borzenkov wrote:
> This was posted in one of Russian forums. It was not possible to archive
> (under Linux, using tar) vfat directory where files had long Russian names
> (really long - over 150 - 170 characters) - tar returned stat failure. When
> looking with plain ls, file names appeared truncated.
>
> Now looking at current (2.6.21) fat driver, __fat_readdir allocates large
> enough buffer (PAGE_SIZE-522) for UTF-8 name; but for iocharset=utf8 it calls
> uni16_to_x8() which artificially limits length of UTF-8 name to 256 ... which
> is obviously not enough for long UTF-8 Russian string (2 bytes per character)
> not to mention the - theoretical - general case of 6 bytes UTF-8 characters.
>
> Similar problem has apparently vfat_lookup()->...->fat_search_long() call
> chain. Except this appears to be broken even in case of "utf8", because
> fat_search_long allocates fixed 256 bytes buffer for UTF-8 name.
>
> Am I off track here?
>
PATH_MAX specifically counts _bytes_ not characters, so UTF-8 does
not matter. ISTR that PATH_MAX was 256 at some point, but I just
quickly grepped /usr/include and found various mention of 4096, so
where's the central repository for this configuration item? A hard-
coded value of 256 somewhere inside the kernel smells like a bug.
Ciao,
Roland
--
TU Muenchen, Physik-Department E18, James-Franck-Str., 85748 Garching
Telefon 089/289-12575; Telefax 089/289-12570
--
CERN office: 892-1-D23 phone: +41 22 7676540 mobile: +41 76 487 4482
--
Any society that would give up a little liberty to gain a little
security will deserve neither and lose both. - Benjamin Franklin
-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GS/CS/M/MU d-(++) s:+ a-> C+++ UL++++ P+++ L+++ E(+) W+ !N K- w--- M
+ !V Y+
PGP++ t+(++) 5 R+ tv-- b+ DI++ e+++>++++ h---- y+++
------END GEEK CODE BLOCK------
Roland Kuhn <[email protected]> writes:
> PATH_MAX specifically counts _bytes_ not characters, so UTF-8 does not
> matter. ISTR that PATH_MAX was 256 at some point, but I just quickly
> grepped /usr/include and found various mention of 4096, so where's the
> central repository for this configuration item? A hard-
> coded value of 256 somewhere inside the kernel smells like a bug.
There is PATH_MAX and there is NAME_MAX, and only the latter (which is
260 for vfat) matters here.
Andreas.
--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
Hi,
Roland Kuhn <[email protected]> writes:
>> This was posted in one of Russian forums. It was not possible to archive
>> (under Linux, using tar) vfat directory where files had long Russian names
>> (really long - over 150 - 170 characters) - tar returned stat failure. When
>> looking with plain ls, file names appeared truncated.
>>
>> Now looking at current (2.6.21) fat driver, __fat_readdir allocates large
>> enough buffer (PAGE_SIZE-522) for UTF-8 name; but for iocharset=utf8 it calls
>> uni16_to_x8() which artificially limits length of UTF-8 name to 256 ... which
>> is obviously not enough for long UTF-8 Russian string (2 bytes per character)
>> not to mention the - theoretical - general case of 6 bytes UTF-8 characters.
>>
>> Similar problem has apparently vfat_lookup()->...->fat_search_long() call
>> chain. Except this appears to be broken even in case of "utf8", because
>> fat_search_long allocates fixed 256 bytes buffer for UTF-8 name.
>>
>> Am I off track here?
>>
> PATH_MAX specifically counts _bytes_ not characters, so UTF-8 does
> not matter. ISTR that PATH_MAX was 256 at some point, but I just
> quickly grepped /usr/include and found various mention of 4096, so
> where's the central repository for this configuration item? A hard-
> coded value of 256 somewhere inside the kernel smells like a bug.
There is a nasty issue here. FAT is limited to 255 unicode chars or so.
So, we would need to count the number of unicode chars in the filename.
That's not implemented currently.
--
OGAWA Hirofumi <[email protected]>
On Monday 07 May 2007, Andreas Schwab wrote:
> Roland Kuhn <[email protected]> writes:
> > PATH_MAX specifically counts _bytes_ not characters, so UTF-8 does not
> > matter. ISTR that PATH_MAX was 256 at some point, but I just quickly
> > grepped /usr/include and found various mention of 4096, so where's the
> > central repository for this configuration item? A hard-
> > coded value of 256 somewhere inside the kernel smells like a bug.
>
> There is PATH_MAX and there is NAME_MAX, and only the latter (which is
> 260 for vfat) matters here.
>
Do you imply that Linux is unable to represent full VFAT names (255 UCS2
characters) by design? Hmm ... testing ... looks like it:
touch $(perl -e 'print "ф" x 200;') -> Name too long.
So what do we do with long VFAT names? As the initial post shows, they do exist.
Andrey Borzenkov <[email protected]> writes:
> On Monday 07 May 2007, Andreas Schwab wrote:
>> Roland Kuhn <[email protected]> writes:
>> > PATH_MAX specifically counts _bytes_ not characters, so UTF-8 does not
>> > matter. ISTR that PATH_MAX was 256 at some point, but I just quickly
>> > grepped /usr/include and found various mention of 4096, so where's the
>> > central repository for this configuration item? A hard-
>> > coded value of 256 somewhere inside the kernel smells like a bug.
>>
>> There is PATH_MAX and there is NAME_MAX, and only the latter (which is
>> 260 for vfat) matters here.
>>
>
> Do you imply that Linux is unable to represent full VFAT names (255 UCS2
> characters) by design? Hmm ... testing ... looks like it,
Yes. But I think it's not design, and it should be fixed.
--
OGAWA Hirofumi <[email protected]>
OGAWA Hirofumi wrote:
>>>
>> PATH_MAX specifically counts _bytes_ not characters, so UTF-8 does
>> not matter. ISTR that PATH_MAX was 256 at some point, but I just
>> quickly grepped /usr/include and found various mention of 4096, so
>> where's the central repository for this configuration item? A hard-
>> coded value of 256 somewhere inside the kernel smells like a bug.
>
> There is a nasty issue here. FAT is limited by 255 unicode chars or so.
> So, we would need to count number of unicode chars of filename.
>
> That's not implemented currently.
Note also there is PATH_MAX and NAME_MAX; the latter is 255 I believe.
POSIX allows NAME_MAX to vary on a filesystem by filesystem basis,
although that is not currently implemented in Linux.
-hpa
Hi!
On 7 May 2007, at 20:27, OGAWA Hirofumi wrote:
> Roland Kuhn <[email protected]> writes:
>> PATH_MAX specifically counts _bytes_ not characters, so UTF-8 does
>> not matter. ISTR that PATH_MAX was 256 at some point, but I just
>> quickly grepped /usr/include and found various mention of 4096, so
>> where's the central repository for this configuration item? A hard-
>> coded value of 256 somewhere inside the kernel smells like a bug.
>
> There is a nasty issue here. FAT is limited by 255 unicode chars or
> so.
> So, we would need to count number of unicode chars of filename.
>
No, we don't. At least not when looking at the POSIX spec, which
explicitly mentions _bytes_ and _not_ unicode characters. So, to be
on the safe side, FAT filesystems would need to support a NAME_MAX of
roughly 6*255+3=1533 bytes (not to mention the hassles of forbidden
sequences, etc.; do we need to count zero-width characters?) and
report it through pathconf() to userspace, then userspace could do
with that whatever it liked.
What happened to: "file names are just sequences of octets, excluding
'/' and NUL"? Adding unicode parsing to the kernel is completely
useless _and_ a big trouble maker.
Ciao,
Roland
Roland Kuhn <[email protected]> writes:
>> Roland Kuhn <[email protected]> writes:
>>> PATH_MAX specifically counts _bytes_ not characters, so UTF-8 does
>>> not matter. ISTR that PATH_MAX was 256 at some point, but I just
>>> quickly grepped /usr/include and found various mention of 4096, so
>>> where's the central repository for this configuration item? A hard-
>>> coded value of 256 somewhere inside the kernel smells like a bug.
>>
>> There is a nasty issue here. FAT is limited by 255 unicode chars or
>> so.
>> So, we would need to count number of unicode chars of filename.
>>
> No, we don't. At least not when looking at the POSIX spec, which
> explicitly mentions _bytes_ and _not_ unicode characters. So, to be
> on the safe side, FAT filesystems would need to support a NAME_MAX of
> roughly 6*255+3=1533 bytes (not to mention the hassles of forbidden
> sequences, etc.; do we need to count zero-width characters?) and
> report it through pathconf() to userspace, then userspace could do
> with that whatever it liked.
>
> What happened to: "file names are just sequences of octets, excluding
> '/' and NUL"? Adding unicode parsing to the kernel is completely
> useless _and_ a big trouble maker.
The UCS2 in FAT is just on-disk format of the filename. So...
--
OGAWA Hirofumi <[email protected]>
Roland Kuhn wrote:
>
> No, we don't. At least not when looking at the POSIX spec, which
> explicitly mentions _bytes_ and _not_ unicode characters. So, to be on
> the safe side, FAT filesystems would need to support a NAME_MAX of
> roughly 6*255+3=1533 bytes (not to mention the hassles of forbidden
> sequences, etc.; do we need to count zero-width characters?) and report
> it through pathconf() to userspace, then userspace could do with that
> whatever it liked.
>
> What happened to: "file names are just sequences of octets, excluding
> '/' and NUL"? Adding unicode parsing to the kernel is completely useless
> _and_ a big trouble maker.
>
"Filenames are just octets" has never applied to alien filesystems like
VFAT.
-hpa
Andrey Borzenkov writes:
> This was posted in one of Russian forums. It was not possible to
> archive (under Linux, using tar) vfat directory where files had
> long Russian names (really long - over 150 - 170 characters) - tar
> returned stat failure. When looking with plain ls, file names
> appeared truncated.
I have an idea to deal with this, but first a rant...
At two bytes per character, you get 127 characters in a filename.
That's wider than the standard 80-column display, and far wider
than the 28 or 29 characters that an "ls -l" has room for. In a
GUI file manager or file dialog box, you'll have to scroll sideways.
In a web browser directory listing, you'll almost certainly have
to scroll sideways. Most of this even applies to Windows tools.
In other words, this is user error. Somebody thought that a filename
was a place to store a document, probably a README file. What next,
shall we MIME-encode an icon into the filename?
Fix: the vfat driver should use the 8.3 name for such files.
On May 8 2007 00:43, Albert Cahalan wrote:
>
> I have an idea to deal with this, but first a rant...
>
> At two bytes per character, you get 127 characters in a filename.
> That's wider than the standard 80-column display, and far wider
> than the 28 or 29 characters that an "ls -l" has room for. In a
> GUI file manager or file dialog box, you'll have to scroll sideways.
> In a web browser directory listing, you'll almost certainly have
> to scroll sideways. Most of this even applies to Windows tools.
>
> In other words, this is user error. Somebody thought that a filename
> was a place to store a document, probably a README file. What next,
> shall we MIME-encode an icon into the filename?
Surprise surprise, Microsoft Word uses the document's title as filename.
So when you write your diploma thesis about "The Recent Improvements on
The Completely Fair Scheduler And The Staircase Deadline Scheduler and
its Impact On The Community and Mailing Lists", Word is likely to use
that as a filename. Maybe truncated a bit, I have not figured out the
heuristic yet.
> Fix: the vfat driver should use the 8.3 name for such files.
Or the 31-character ISO Level 1(?).
Jan
--
Albert Cahalan <[email protected]> wrote:
[...]
> At two bytes per character, you get 127 characters in a filename.
> That's wider than the standard 80-column display, and far wider
> than the 28 or 29 characters that an "ls -l" has room for. In a
> GUI file manager or file dialog box, you'll have to scroll sideways.
> In a web browser directory listing, you'll almost certainly have
> to scroll sideways. Most of this even applies to Windows tools.
>
> In other words, this is user error. Somebody thought that a filename
> was a place to store a document, probably a README file. What next,
> shall we MIME-encode an icon into the filename?
I've got a music file where the artist name is 113 characters long, if I use
an abbreviation.
> Fix: the vfat driver should use the 8.3 name for such files.
Let's port that to ext3! I wanted the system to rename my files to
abcd~123.ext all my life! NOT!
--
If at first you don't succeed call in an air-strike.
Friß, Spammer: [email protected] [email protected]
[email protected] [email protected]
On Monday 07 May 2007, Roland Kuhn wrote:
> Hi!
>
> On 7 May 2007, at 20:27, OGAWA Hirofumi wrote:
> > Roland Kuhn <[email protected]> writes:
> >> PATH_MAX specifically counts _bytes_ not characters, so UTF-8 does
> >> not matter. ISTR that PATH_MAX was 256 at some point, but I just
> >> quickly grepped /usr/include and found various mention of 4096, so
> >> where's the central repository for this configuration item? A hard-
> >> coded value of 256 somewhere inside the kernel smells like a bug.
> >
> > There is a nasty issue here. FAT is limited by 255 unicode chars or
> > so.
> > So, we would need to count number of unicode chars of filename.
>
> No, we don't. At least not when looking at the POSIX spec, which
> explicitly mentions _bytes_ and _not_ unicode characters. So, to be
> on the safe side, FAT filesystems would need to support a NAME_MAX of
> roughly 6*255+3=1533 bytes (not to mention the hassles of forbidden
> sequences, etc.; do we need to count zero-width characters?)
How is this issue related to character *width* at all?
> and
> report it through pathconf() to userspace, then userspace could do
> with that whatever it liked.
>
> What happened to: "file names are just sequences of octets, excluding
> '/' and NUL"? Adding unicode parsing to the kernel is completely
> useless _and_ a big trouble maker.
>
Who is talking about unicode parsing? UCS2 to UTF-8 transformation is not
parsing and requires none; it is simply a conversion between the on-disk and
in-kernel representations (like endian conversion). Anyway, we are doing it
already; how does support for a larger name length limit change that?
-andrey
On 5/8/07, Jan Engelhardt <[email protected]> wrote:
> On May 8 2007 00:43, Albert Cahalan wrote:
> > Fix: the vfat driver should use the 8.3 name for such files.
>
> Or the 31-character ISO Level 1(?).
That might be appropriate for a similar problem on CD-ROM
filesystems. (when the CD is rockridge KOI8 and you want UTF-8)
It may even be appropriate for Joliet, though 8.3 may be
the better choice in that case.
It's not appropriate for vfat, HPFS, JFS, or NTFS. All of those
have built-in support for 8.3 aliases. Normally the 8.3 names
are like hidden hard links, except that deletion of either name
will wipe out the other. (same as case differences too)
So the names are there, and they should already work.
They just need to be reported for directory listings when the
long names would be too long.
On Wednesday 09 May 2007, Albert Cahalan wrote:
> On 5/8/07, Jan Engelhardt <[email protected]> wrote:
> > On May 8 2007 00:43, Albert Cahalan wrote:
> > > Fix: the vfat driver should use the 8.3 name for such files.
> >
> > Or the 31-character ISO Level 1(?).
>
> That might be appropriate for a similar problem on CD-ROM
> filesystems. (when the CD is rockridge KOI8 and you want UTF-8)
> It may even be appropriate for Joliet, though 8.3 may be
> the better choice in that case.
>
> It's not appropriate for vfat, HPFS, JFS, or NTFS. All of those
> have built-in support for 8.3 aliases. Normally the 8.3 names
> are like hidden hard links, except that deletion of either name
> will wipe out the other. (same as case differences too)
> So the names are there, and they should already work.
> They just need to be reported for directory listings when the
> long names would be too long.
several problems are associated with it:
1. those names are rather meaningless. How do you find out which file they
refer to? It is OK for trivial cases but not in a directory full of long
names; nor am I sure how many unique short names can be generated.
2. directory contents are effectively invalidated upon backup and restore (tar
c; rm -rf; tar x). It is impossible to infer long names from short ones.
3. this still does not answer how I can *create* a long name from within Linux.
Currently the workaround is to mount with a single-byte character set, like
koi8-r; of course that becomes quite awkward if the rest of the system is
using UTF-8.
On 5/9/07, Andrey Borzenkov <[email protected]> wrote:
> On Wednesday 09 May 2007, Albert Cahalan wrote:
...
>>> On May 8 2007 00:43, Albert Cahalan wrote:
>>>> Fix: the vfat driver should use the 8.3 name for such files.
...
>> It's not appropriate for vfat, HPFS, JFS, or NTFS. All of those
>> have built-in support for 8.3 aliases. Normally the 8.3 names
>> are like hidden hard links, except that deletion of either name
>> will wipe out the other. (same as case differences too)
>> So the names are there, and they should already work.
>> They just need to be reported for directory listings when the
>> long names would be too long.
>
> several problems associated with it
>
> 1. those names are rather meaningless. How do you find out which file they
> refer to? It is OK for trivial cases but not in a directory full of long
> names; nor am I sure how many unique short names can be generated.
If a short name cannot be generated, then no OS could
create the file at all. The vfat and iso9660 filesystems require
short names. Any OS writing to such a filesystem MUST
generate short names in addition to any long names.
Mount your vfat as filesystem type "msdos" to see.
By default, Windows will also generate short names on NTFS.
Note that you can't put your files on a CD-ROM in a way
that Windows could read the filenames. Windows limits
CD-ROM filenames to 63 characters; you get at most 103
if you violate the spec.
> 2. directory contents is effectively invalidated upon backup and restore (tar
> c; rm -rf; tar x). It is impossible to infer long names from short ones.
It may be that tar fails to use the vfat ioctl calls to save
and restore short names. You could try using Wine to
run a Windows-native backup program. This shouldn't
really matter though; you'd only be getting short names
for files that had truly unreasonable long names anyway.
I suppose somebody should check to see if there is a
danger of overwrite when the short-named files get
written back. The safest thing might be to mount the
filesystem as type "msdos".
> 3. this still does not answer how can I *create* long name from within Linux.
WTF? These names are too annoying to use, even if there
weren't this limit. Anything over about 29 characters is in
need of a rename. (that'd be 58 bytes for you, which is OK)
The limit is already 4 times larger than what is reasonable.
Albert Cahalan <[email protected]> wrote:
> On 5/9/07, Andrey Borzenkov <[email protected]> wrote:
>> 3. this still does not answer how can I *create* long name from within Linux.
>
> WTF? These names are too annoying to use, even if there
> weren't this limit. Anything over about 29 characters is in
> need of a rename. (that'd be 58 bytes for you, which is OK)
> The limit is already 4 times larger than what is reasonable.
Just because you limit yourself to 80 chars minus "ls -l"-clutter, this is
no reason why I shouldn't use long filenames. If I need to handle these
filenames, I can enlarge the terminal window or read the next line.
E.g.: I have a music file named "artist - title.ext", where the artist
name is 103 characters long, using abbreviations. In order to enter that
name, I have to press seven keys, including the escape character.
There is nothing unreasonable in using that name.
If I could not use these filenames, I'd have to use e.g. numbers instead of
filenames and a database containing the mapping from name to number,
duplicating the function of a directory. That's bullsh... . You should rather
bump NAME_MAX and, if you are concerned about legacy applications, make it
optional. Having some names be illegal because of their length, on a
filesystem that does not allow command.com metacharacters, is not a problem.
--
Funny quotes:
3. On the other hand, you have different fingers.
Friß, Spammer: [email protected] [email protected]
On May 10 2007 16:49, Bodo Eggert wrote:
>
>Just because you limit yourself to 80 chars minus "ls -l"-clutter, this is
>no reason why I shouldn't use long filenames. If I need to handle these
>filenames, I can enlarge the terminal window or read the next line.
>
>E.g.: I have a music file named "artist - title.ext", where the artist
>name is 103 characters long, using abbreviations. In order to enter that
>name, I have to press seven keys, including the escape character.
>There is nothing unreasonable in using that name.
What name would that be? I cannot dream up any IME that outputs _that_
many characters for that few keystrokes. Even with CJ(K), 7 keystrokes can
make at most 21 bytes if I had to take a good guess.
Jan
--
> >Just because you limit yourself to 80 chars minus "ls -l"-clutter, this is
> >no reason why I shouldn't use long filenames. If I need to handle these
> >filenames, I can enlarge the terminal window or read the next line.
> >
> >E.g.: I have a music file named "artist - title.ext", where the artist
> >name is 103 characters long, using abbreviations. In order to enter that
> >name, I have to press seven keys, including the escape character.
> >There is nothing unreasonable in using that name.
>
> What name would that be? I cannot dream up any IME that outputs _that_
> many characters for that few keystrokes. Even with CJ(K), 7 keystrokes can
> make at most 21 bytes if I had to take a good guess.
Probably talking about tab-completion in the shell, countering the comment
that names that long are too unwieldy to use.
On the other hand, limits are always bad. Haven't we seen bajillions of
cases in computing history where we start with limits (like only N of the
first characters of identifiers used in compilers), followed by people
finding those limits annoying, then followed by the limits being removed?
How about the Year-20?? limitations? Same thing: limits that were fine,
then limits were not fine, then limits were removed.
In this era, couldn't we just skip the whole thing and stop putting limits
in from the beginning? The only reason to have a PATH_MAX/NAME_MAX at all
is to make life easier for programmers implementing things and people
using things. It would require a LOT of work and changes, but life would
be so much better if there were a (configurable, of course!) option to say
"use standard values" or "use custom values" or "use runtime-dynamic
values". Option 1: No change. Option 2: Alter your limits.h too, and
recompile programs (or fix them). Option 3: Change programs from
compile-time inclusion of limits.h to run-time checking of a /sys value.