2008-11-21 12:51:26

by Alexey Salmin

Subject: Maximum filename length

Hello! I'm not sure the developers' mailing list is the right place for
philosophical discussions; it's more about features, bugs and
patches. But there is a topic that is important to me and that I want
to raise.

The limits of the ext4 file system are really huge. I just can't
imagine a 1 EiB disk array; it's beyond my mind's bounds. The maximum
file size is quite big too. But there is one limitation that looks tiny
against these tera- and exbibytes: the maximum filename length is 255
bytes. Is 255 characters enough? I think it's enough for the vast
majority of users. But there is one problem: 255 bytes and 255
characters are no longer equal. Multibyte encodings are spreading fast,
and that should be taken into account.

For a long time I used the simple koi8-r encoding, and it was enough.
Even when my favorite Debian distribution moved to UTF-8, I kept it.
Even when I discovered that GTK and Qt applications always use UTF-8,
so that every I/O operation causes conversions, I kept using koi8-r.
But when I found out that the first thing gcc does with source code is
convert it to UTF-8, I thought it really was time to move ahead. I was
full of optimism converting my file systems to UTF-8, but I discovered
that my book collection cannot be stored correctly because of the
resulting 127-character filename length limitation. Actually, I'm lucky
to need only two bytes per character: a UTF-8 character can take up to
4 bytes, which reduces the limit to 63 characters. I really see no
reason for keeping such a terrible limitation. The ext4 branch was
created because there were too many things to change compared to ext3.
And it's very sad that such a simple improvement was forgotten :(
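
The gap between bytes and characters described above can be checked with a few lines of C. This is only a minimal sketch: the helper name `utf8_char_count` is hypothetical, and it assumes the input is already valid UTF-8. It counts code points by skipping continuation bytes, so a Cyrillic word can be compared against its `strlen()`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count the code points in a UTF-8 string by counting every byte that
 * is NOT a continuation byte (continuation bytes look like 10xxxxxx).
 * Hypothetical helper; assumes the input is valid UTF-8.
 */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the five-letter Russian word "книга" (bytes D0 BA D0 BD D0 B8 D0 B3 D0 B0), `utf8_char_count()` reports 5 while `strlen()` reports 10, which is why a 255-byte limit leaves room for only 127 such letters.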

Alexey


2008-11-21 22:33:06

by Theodore Ts'o

Subject: Re: Maximum filename length

On Fri, Nov 21, 2008 at 06:51:23PM +0600, Alexey Salmin wrote:
> But there is one limitation looking tiny against these Tera-
> and Exbi-bytes: maximum filename length is 255 bytes. Is 255
> characters enough? I think it's enough for the vast majority of users.
> But there is one problem: 255 bytes and 255 characters are no longer
> equal. Multibyte encodings are spreading fast and it should be taken
> into account. For a long time I was using the simple koi8-r encoding
> and it was enough. Even when my favorite debian distribution moved to
> utf8 I was still keeping it.

Yeah, Unicode and UTF-8 are unfortunate for the Cyrillic and Greek
alphabets, since they are non-Latin alphabets where multiple
characters form a word. Most other writing systems either use a
single glyph to represent a word (such as the CJK, or
Chinese-Japanese-Korean, characters), or are based on the Latin
alphabet, so that in practice only a few of their characters are
encoded using two bytes, and most require only one byte.

> Actually I'm lucky having only two bytes per character,
> utf8-character can contain up to 4 bytes which reduces the limit to 63
> characters. Really I see no reasons for keeping such a terrible

In practice the bulk of the characters which require 3 bytes to encode
are used to denote a word (which in most other languages might be
spelled with 3 to 20 letters). There are a few writing systems that
have letters encoded at or above U+0800, and so require 3 bytes per
letter, but they tend to belong to "niche" languages that are rarely
used in computing. For example, the Buhid script, which is used by the
indigenous Mangyan people, who live in the province of Mindoro in the
Philippines, and which has about 8,000 users in the world, occupies
Unicode characters U+1740 through U+175F, and so requires 3 bytes per
character. The Native American Cherokee language, which has about
10,000 speakers in the world, uses Unicode symbols U+13A0 through
U+13F4, and similarly needs 3 bytes per character.

Characters that require 4 bytes to encode are those at or above
U+10000, which are used primarily by dead languages (i.e., no one alive
speaks them as a primary language, and in some cases no one alive has
any idea how to speak them). For example, the Linear B script, which
was used in Mycenaean civilization sometime around the 13th and 14th
centuries BC (i.e., over 3 millennia ago), is assigned Unicode
characters U+10000 through U+100FF, and so would require 4 bytes per
Linear B glyph to encode. However, aside from researchers in ancient
languages, it is doubtful anyone would actually be using it, and it's
even less likely anyone would be trying to catalog books or mp3
filenames using Linear B glyphs. :-)
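
The byte counts quoted above follow directly from the UTF-8 range boundaries (U+0080, U+0800, U+10000). A minimal sketch in C, with hypothetical helper names, that maps a code point to its encoded length and then to the number of such characters that fit in a byte budget:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Number of bytes UTF-8 needs for one code point. Returns 0 for values
 * that cannot be encoded (surrogates, or anything above U+10FFFF).
 */
int utf8_encoded_len(uint32_t cp)
{
    if (cp < 0x80)
        return 1;               /* ASCII */
    if (cp < 0x800)
        return 2;               /* e.g. Cyrillic, Greek, Hebrew, Arabic */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not valid in UTF-8 */
    if (cp < 0x10000)
        return 3;               /* e.g. CJK, Cherokee, Buhid */
    if (cp <= 0x10FFFF)
        return 4;               /* e.g. Linear B */
    return 0;
}

/* How many copies of code point cp fit into limit_bytes once encoded. */
size_t chars_that_fit(size_t limit_bytes, uint32_t cp)
{
    int len = utf8_encoded_len(cp);

    assert(len > 0);            /* caller must pass an encodable code point */
    return limit_bytes / (size_t)len;
}
```

With a 255-byte budget this gives 255 Latin letters, 127 Cyrillic letters, 85 Chinese ideographs or Cherokee letters, and 63 Linear B glyphs, matching the figures used in this thread.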


So in practice, among the common languages likely to be used in
computing whose writing is based on phonemes (such as most European,
Russian, and Greek writing systems) as opposed to ideographs (i.e.,
the CJK writing systems), Russian, Greek, Hebrew, and Arabic are the
unlucky ones that are not based on a Latin-1 alphabet, and have this
problem where 2 bytes are required. Curiously enough, it's generally
people using the Cyrillic alphabet that tend to complain; I suspect
that it has the largest number of users who are likely to use those
letters in computing. (In practice, not many people complain about the
Hebrew writing system, and I suspect that it's partially because of
the relative difference in the number of people using the Hebrew
writing system as compared to the Cyrillic, and also that most Israeli
computer folk I know tend to do most of their computing work in
English, and not in Hebrew.)

> Really I see no reasons for keeping such a terrible
> limitation. Ext4 branch was created because there were too many things
> to change compared to ext3. And it's very sad that such a simple
> improvement was forgotten :(

It wouldn't be _that_ hard to add an extension to ext4 to support
longer filenames (it would mean a new directory entry format, and a
way of marking a directory inode as to whether the old or new
directory format was being used). Unfortunately, the 255 byte limit
is encoded not only in the filesystem, but also in the kernel.
Changing it in the kernel is not just a matter of a #define constant,
but also fixing places which put filename[NAME_MAX] on the stack, and
where increasing NAME_MAX might cause kernel functions to blow the
limited stack space available to kernel code. In addition, there are
numerous userspace and, in some cases, protocol limitations which
assume that the total length of a pathname is no more than 1024
bytes. (I suspect there is at least some userspace code that would
also blow up if an individual filename exceeded NAME_MAX, i.e. 255
bytes, or 256 with the terminating NUL.)
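
The userspace side of this concern can be illustrated with a short sketch. It assumes a POSIX system and uses the standard `pathconf()` call with `_PC_NAME_MAX`; the helper names `name_limit` and `alloc_name_buf` are hypothetical. Instead of a fixed `char name[NAME_MAX + 1]` array, the code asks the file system for its limit at run time and allocates the buffer on the heap, so it would not break if some file system reported a larger value.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Longest filename, in bytes, allowed in directory dir. Falls back to
 * the given value when pathconf() reports an error or no fixed limit
 * (both are signalled by a return value of -1).
 */
long name_limit(const char *dir, long fallback)
{
    long n;

    assert(dir != NULL);
    n = pathconf(dir, _PC_NAME_MAX);
    return n > 0 ? n : fallback;
}

/*
 * Allocate a buffer big enough for any filename in dir plus the
 * terminating NUL. Returns NULL on allocation failure; the caller
 * must free() the result.
 */
char *alloc_name_buf(const char *dir)
{
    long n = name_limit(dir, 255);   /* 255 is the traditional limit */

    return malloc((size_t)n + 1);
}
```

This does nothing for the kernel stack problem, and it cannot help programs that have already compiled in a 255-byte assumption; it only shows what limit-agnostic userspace code might look like.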

So the problem is that even if we were to add that enhancement to
ext4, there are lots of other things, both in and outside of the
kernel, that would likely need to be changed in order to support this.
I will say personally that it's rare for me to use filenames longer
than 50-60 characters, just because they are a pain in the *ss to
type. However, I can see how someone using a graphical interface
might be happy with filenames in the 100-120 character range. The
question though is whether it is worth trying to fix this by
increasing the filename length beyond 255 bytes or not, given the
amount of effort that would be required in the kernel, libc,
userspace, etc.

- Ted

2008-11-22 02:50:27

by cheng renquan

Subject: Re: Maximum filename length

On Sat, Nov 22, 2008 at 6:32 AM, Theodore Tso <[email protected]> wrote:
> So the problem is that even if we were to add that enhancement to
> ext4, there are lots of other things, both in and outside of the
> kernel, that would likely need to be changed in order to support this.
> I will say personally that it's rare for me to use filenames longer
> than 50-60 characters, just because they are a pain in the *ss to
> type. However, I can see how someone using a graphical interface
> might be happy with filenames in the 100-120 character range. The
> question though is whether it is worth trying to fix this by
> increasing the filename length beyond 255 bytes or not, given the
> amount of effort that would be required in the kernel, libc,
> userspace, etc.

In China there is also a trend of moving from 2-byte charsets
(GB2312/GBK/GB18030/BIG5) to UTF-8, so each Chinese character will
need 3 bytes instead of 2 from then on. The 255-byte filename length
limit restricts a Chinese filename to 85 characters.

Now try to touch a file with 86 Chinese characters:

[email protected] ~/tmp 0 $ touch
中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中国
touch: cannot touch
`中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中国':
File name too long
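
The arithmetic behind this failure is 85 * 3 = 255 bytes (fits) versus 86 * 3 = 258 bytes (too long). A minimal sketch in C that reproduces it without touching the file system; the constant `FILENAME_LIMIT_BYTES` and both helper names are hypothetical, and the 255-byte limit is taken from this thread rather than queried from the system:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FILENAME_LIMIT_BYTES 255   /* assumed per-name limit, in bytes */

/*
 * Fill buf with n copies of U+4E2D (bytes E4 B8 AD in UTF-8) and
 * NUL-terminate it. Returns the length in bytes, excluding the NUL.
 */
size_t repeat_zhong(char *buf, size_t bufsize, size_t n)
{
    static const char zhong[] = "\xe4\xb8\xad";
    size_t len = n * 3;
    size_t i;

    assert(buf != NULL && bufsize > len);
    for (i = 0; i < n; i++)
        memcpy(buf + i * 3, zhong, 3);
    buf[len] = '\0';
    return len;
}

/* Nonzero if the name's byte length is within the assumed limit. */
int name_fits(const char *name)
{
    return strlen(name) <= FILENAME_LIMIT_BYTES;
}
```

A name of 85 ideographs passes `name_fits()`, and a name of 86 does not, which corresponds to the "File name too long" error shown above.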


It's very difficult to solve all these problems in the kernel, libc,
and userspace, I know, but I think we should keep the option of fixing
them in the future. Maybe in the next POSIX standard, we should change
NAME_MAX to [255 * max_bytes_per_character]?


Regards,




--
Cheng Renquan, Shenzhen, China
Yogi Berra - "I never said most of the things I said."

2008-11-22 18:48:28

by Alexey Salmin

Subject: Re: Maximum filename length

2008/11/22 Theodore Tso <[email protected]>:
> It wouldn't be _that_ hard to add an extension to ext4 to support
> longer filenames (it would mean a new directory entry format, and a
> way of marking a directory inode as to whether the old or new
> directory format was being used). Unfortunately, the 255 byte limit
> is encoded not only in the filesystem, but also in the kernel.
> Changing it in the kernel is not just a matter of a #define constant,
> but also fixing places which put filename[NAME_MAX] on the stack, and
> where increasing NAME_MAX might cause kernel functions to blow the
> limited stack space available to kernel code. In addition, there are
> numerous userspace and in some cases, protocol limitations which
> assume that the total overall length of a pathname is no more than
> 1024 bytes. (I suspect there is at least userspace code that also
> would blow up if an individual pathname exceeded NAME_MAX, or 256
> bytes.)
>

Sure, I understand the problems you've mentioned. But every big
undertaking has a beginning. Adding the extension to ext4 is only the
first step. Of course it will cause crashes and other problems in many
places, from the kernel to userspace code. But these problems will
affect only people who really use this extension (like me). Anyway,
most of these bugs will be fixed some day, maybe in two or three years.
No one is saying it's a fast process, but it will reach its end, and
that's good, I think.

> I will say personally that it's rare for me to use filenames longer
> than 50-60 characters, just because they are a pain in the *ss to
> type. However, I can see how someone using a graphical interface
> might be happy with filenames in the 100-120 character range.

Same here: most of my filenames are _way_ shorter than 50-60
characters. Besides, I almost always use English filenames.
But there are some cases when long Cyrillic names are needed, and
it's sad for me to have problems there.

Alexey

2008-11-22 23:36:22

by Andreas Dilger

Subject: Re: Maximum filename length

On Nov 23, 2008 00:48 +0600, Alexey Salmin wrote:
> Sure, I understand the problems you've mentioned. But every big act
> has the beginning. Adding the extension to the ext4 is only the first
> step. Of course it'll cause crashes and other problems in many places
> from kernel to userspace code. But these problems will disturb only
> people who will really use this extension (like me). Anyway most of these
> bugs will be fixed some day, maybe in two or three years. No one is
> talking that it's a fast process but it will reach its end and that's good
> I think.

If you are motivated to work on this, there are a number of possible ways
that this could be done. The simplest would be to create a new directory
entry (replacing ext4_dir_entry_2) that has a longer name_len field, and
ideally it would also have space for a 48-bit inode number (ext4 will
NEVER need more than 280 trillion inodes I think).

I don't know that it is practical to require this format for the entire
directory, because it would mean in some rare cases rewriting 1M entries
(or whatever) in a large directory to the new format. It would be better
to allow either just the leaf block to hold the new record format (with a
marker at the start of the block), or individual records having the new
format, possibly marked by a bit in the "file_type" field.

It's kind of ugly, but it needs to be possible to detect if the entry is
the old format or the new one.

#define EXT4_DIRENT3_FL		0x00400000 /* directory has any dir_entry_3 */

#define EXT4_FT_ENTRY_3		0x80 /* file_type for dir_entry_3 */
#define EXT4_FT_MASK		0x0f /* EXT4_FT_* mask */
/* clears the high file_type byte; the inode number itself is 48 bits */
#define EXT4_INODE_MASK		0x00ffffffffffffffULL
#define EXT4_NAME_LEN3		1012

struct ext4_dir_entry_3 {
	__le64	inode;		/* High byte holds file_type */
	__le16	rec_len;
	__le16	name_len;
	char	name[EXT4_NAME_LEN3];
};

static inline __u8 ext4_get_de_file_type(struct ext4_dir_entry_2 *dirent)
{
	return (dirent->file_type & EXT4_FT_MASK);
}

static inline int ext4_get_de_name_len(struct ext4_dir_entry_2 *dirent)
{
	if (dirent->file_type & EXT4_FT_ENTRY_3) {
		struct ext4_dir_entry_3 *dirent3 =
			(struct ext4_dir_entry_3 *)dirent;

		return le16_to_cpu(dirent3->name_len);
	}

	return dirent->name_len;
}

static inline int ext4_get_de_rec_len(struct ext4_dir_entry_2 *dirent)
{
	if (dirent->file_type & EXT4_FT_ENTRY_3) {
		struct ext4_dir_entry_3 *dirent3 =
			(struct ext4_dir_entry_3 *)dirent;

		return le16_to_cpu(dirent3->rec_len);
	}

	return le16_to_cpu(dirent->rec_len);
}

static inline __u64 ext4_get_de_inode(struct ext4_dir_entry_2 *dirent)
{
	if (dirent->file_type & EXT4_FT_ENTRY_3) {
		struct ext4_dir_entry_3 *dirent3 =
			(struct ext4_dir_entry_3 *)dirent;

		return le64_to_cpu(dirent3->inode) & EXT4_INODE_MASK;
	}

	return le32_to_cpu(dirent->inode);
}

/* Returns a pointer to the name, which starts at a different offset
 * in the two formats. */
static inline char *ext4_get_de_name(struct ext4_dir_entry_2 *dirent)
{
	if (dirent->file_type & EXT4_FT_ENTRY_3) {
		struct ext4_dir_entry_3 *dirent3 =
			(struct ext4_dir_entry_3 *)dirent;

		return dirent3->name;
	}

	return dirent->name;
}
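
The helpers above can test `dirent->file_type` before they know which format they are looking at. This works because `file_type` sits at byte offset 7 of the existing entry (4-byte inode, 2-byte rec_len, 1-byte name_len, 1-byte file_type), and byte 7 is also the high byte of a little-endian 64-bit inode field. The following userspace sketch checks that reading of the layout; the struct and function names are hypothetical mirrors, not kernel definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FT_ENTRY_3  0x80u                   /* marks a new-format entry */
#define FT_MASK     0x0fu                   /* bits holding the file type */
#define INODE_MASK  0x00ffffffffffffffULL   /* clears the file_type byte */

/* Userspace mirror of the fixed header of the existing entry format. */
struct dirent2_hdr {
    uint32_t inode;
    uint16_t rec_len;
    uint8_t  name_len;
    uint8_t  file_type;
};

/* Combine an inode number and a file type into the proposed 64-bit field. */
uint64_t pack_inode3(uint64_t ino, uint8_t file_type)
{
    assert((ino & ~INODE_MASK) == 0);       /* top byte must be free */
    return ino | ((uint64_t)(file_type | FT_ENTRY_3) << 56);
}

/*
 * Byte 7 of the value's little-endian encoding, i.e. the byte that
 * shares its on-disk offset with dirent2_hdr.file_type.
 */
uint8_t byte7_le(uint64_t v)
{
    return (uint8_t)(v >> 56);
}
```

Because the marker bit (0x80) lies outside `FT_MASK`, the same byte carries both the format flag and the ordinary file type.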


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-11-23 04:43:41

by Theodore Ts'o

Subject: Re: Maximum filename length

On Sat, Nov 22, 2008 at 10:50:25AM +0800, rae l wrote:
> In China, there's also a trend moving ahead from 2bytes charsets
> (GB2312/GBK/GB18030/BIG5) to UTF-8, so all Chinese characters will
> need 3 bytes each instead of 2 from then on. The 255
> filename length limit the Chinese filename to 85 characters:

Sure, but 85 characters is a lot, given that each character might be
the equivalent of multiple letters. For example, the English word
"country", which takes seven characters, or 7 bytes in UTF-8, can be
encoded as a single Chinese ideograph, which takes 3 bytes in UTF-8.
Something like "United States of America" is encoded in 24 bytes in
English, and in 6 bytes (two ideographs) in Chinese in UTF-8. My name,
"Theodore Yue Tak Ts'o", takes 21 bytes in English in UTF-8. In
Chinese, it's 3 ideographs, or 9 bytes in UTF-8. I'm choosing fairly
biased examples here, of course, but I think it's true in general.

As a final example, consider "The Tao which can be described is not
the true Tao". This can be expressed *much* more succinctly in
Chinese. :-)

So I don't think people who use the Chinese writing system have much
to complain about with respect to the 255 byte / 85 ideograph limit.
I have much more sympathy for people who are trying to write
something like "Union of Soviet Socialist Republics" in Russian, and
find that it takes many more bytes in UTF-8...

- Ted

2008-11-23 22:15:32

by Andi Kleen

Subject: Re: Maximum filename length

Andreas Dilger <[email protected]> writes:
>
> #define EXT4_FT_ENTRY_3 0x80 /* file_type for dir_entry_3 */
> #define EXT4_FT_MASK 0x0f /* EXT4_FT_* mask */
> #define EXT4_INODE_MASK 0x00ffffffffffffff /* 48-bit inode number mask */
> #define EXT4_NAME_LEN3 1012
>
> struct ext4_dir_entry_3 {
> __le64 inode; /* High byte holds file_type */
> __le16 rec_len;
> __le16 name_len;

The new format should also reserve space for a checksum. Adding that
would be actually a reasonable practical improvement for everyone.

> char name[EXT4_NAME_LEN3];
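
One way the checksum suggestion might look is sketched below. Everything here is an assumption made for illustration: the message does not say which algorithm to use or where the field would go. The sketch places a 32-bit field after `name_len` in a hypothetical header and uses a plain bitwise CRC-32C (Castagnoli polynomial, reflected form 0x82F63B78) as the checksum function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry header with space reserved for a checksum. */
struct dirent3_hdr {
    uint64_t inode;      /* high byte would hold file_type */
    uint16_t rec_len;
    uint16_t name_len;
    uint32_t checksum;   /* reserved: e.g. a CRC over the entry */
};

/* Bitwise CRC-32C: slow but short, enough to show the idea. */
uint32_t crc32c(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    size_t i;
    int bit;

    assert(p != NULL || len == 0);
    for (i = 0; i < len; i++) {
        crc ^= p[i];
        for (bit = 0; bit < 8; bit++) {
            /* Shift right; XOR in the polynomial if the low bit was set. */
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

The standard check value for CRC-32C is 0xE3069283 for the ASCII string "123456789", which gives a quick way to validate the routine.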

-Andi

--
[email protected]