2018-12-27 17:25:34

by Florian Weimer

[permalink] [raw]
Subject: d_off field in struct dirent and 32-on-64 emulation

We have a bit of an interesting problem with respect to the d_off
field in struct dirent.

When running a 64-bit kernel on certain file systems, notably ext4,
this field uses the full 63 bits even for small directories (strace -v
output, wrapped here for readability):

getdents(3, [
{d_ino=1494304, d_off=3901177228673045825, d_reclen=40, d_name="authorized_keys", d_type=DT_REG},
{d_ino=1494277, d_off=7491915799041650922, d_reclen=24, d_name=".", d_type=DT_DIR},
{d_ino=1314655, d_off=9223372036854775807, d_reclen=24, d_name="..", d_type=DT_DIR}
], 32768) = 88

When running in 32-bit compat mode, this value is somehow truncated to
31 bits, for both the getdents and the getdents64 (!) system call (at
least on i386).

In an effort to simplify support for future architectures which only
have the getdents64 system call, we changed glibc 2.28 to use the
getdents64 system call unconditionally, and perform translation if
necessary. This translation is noteworthy because it includes
overflow checking for the d_ino and d_off members of struct dirent.
We did not initially observe a regression because the kernel performs
consistent d_off truncation (with the ext4 file system; small
directories do not show this issue on XFS), so the overflow check does
not fire.

However, both qemu-user and the 9p file system can run in such a way
that the kernel is entered from a 64-bit process, but the actual usage
is from a 32-bit process:

<https://sourceware.org/bugzilla/show_bug.cgi?id=23960>

I think diagrammatically, this looks like this:

guest process (32-bit)
| getdents64, 32-bit UAPI
qemu-user (64-bit)
| getdents, 64-bit UAPI
host kernel (64-bit)

Or:

guest process
| getdents64, 32-bit UAPI
guest kernel (64-bit)
| 9p over virtio (64-bit d_off in struct p9_dirent)
qemu
| getdents, 64-bit UAPI
host kernel (64-bit)

Back when we still called getdents, in the first case, the 32-bit
getdents system call emulation in a 64-bit qemu-user process would
just silently truncate the d_off field as part of the translation, not
reporting an error. The second case is more complicated, and I have
not figured out where the truncation happens.

This truncation has always been a bug; it breaks telldir/seekdir at
least in some cases. But use of telldir/seekdir is comparatively
rare. In contrast, now that we detect d_off overflow in glibc,
readdir will always fail in the sketched configurations, which is bad.
(glibc exposes the d_off field to applications, and it cannot know
whether the application will use it or not, so there is no direct way
to restrict the overflow error to the telldir/seekdir use case.)

We could switch glibc to call getdents again if the system call is
available. But that merely relies on the existence of the truncation
bug somewhere else in the file system stack. This is why I don't
think it's the right solution, just the path of least resistance.

I don't want to reimplement the ext4 truncation behavior in glibc (it
doesn't look like a straightforward truncation), and it wouldn't work
for the second scenario where we see the 9p file system in the 32-bit
glibc, not the ext4 file system. So that's not a good solution.

There is another annoying aspect: The standards expose d_off through
the telldir function, and that returns long int on all architectures
(not off_t, so unchanged by _FILE_OFFSET_BITS). That's mostly a
userspace issue and thus needing different steps to resolve (possibly
standards action).

Any suggestions how to solve this? Why does the kernel return
different d_off values for 32-bit and 64-bit processes even when using
getdents64, for the same directory?


2018-12-29 16:49:35

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

> On Dec 28, 2018, at 6:54 PM, Matthew Wilcox <[email protected]> wrote:
>
>> On Sat, Dec 29, 2018 at 12:12:27AM +0000, Peter Maydell wrote:
>> On Fri, 28 Dec 2018 at 23:16, Andreas Dilger <[email protected]> wrot
>>> On Dec 28, 2018, at 4:18 AM, Peter Maydell <[email protected]> wrote:
>>>> The problem is that there is no 32-bit API in some cases
>>>> (unless I have misunderstood the kernel code) -- not all
>>>> host architectures implement compat syscalls or allow them
>>>> to be called from 64-bit processes or implement all the older
>>>> syscall variants that had smaller offets. If there was a guaranteed
>>>> "this syscall always exists and always gives me 32-bit offsets"
>>>> we could use it.
>>>
>>> The "32bitapi" mount option would use 32-bit hash for seekdir
>>> and telldir, regardless of what kernel API was used. That would
>>> just set the FMODE_32BITHASH flag in the file->f_mode for all files.
>>
>> A mount option wouldn't be much use to QEMU -- we can't tell
>> our users how to mount their filesystems, which they're
>> often doing lots of other things with besides running QEMU.
>> (Otherwise we could just tell them "don't use ext4", which
>> would also solve the problem :-)) We need something we can
>> use at the individual-syscall level.
>
> Could you use a prctl to set whether you were running in 32 or 64 bit
> mode? Or do you change which kind of task you're emulating too often
> to make this a good idea?


How would this work? We already have the separate
COMPAT_DEFINE_SYSCALL entries *and* in_compat_syscall(). Now we’d have
a third degree of freedom.

Either the arches people care about should add reasonable ways to
issue 32-bit syscalls from 64-bit mode or there should be an explicit
way to ask for the 32-bit directory offsets.

2018-12-27 17:56:05

by Florian Weimer

[permalink] [raw]
Subject: Re: d_off field in struct dirent and 32-on-64 emulation

* Andy Lutomirski:

>> On Dec 27, 2018, at 10:18 AM, Florian Weimer <[email protected]> wrote:
>>
>> We have a bit of an interesting problem with respect to the d_off
>> field in struct dirent.
>>
>> When running a 64-bit kernel on certain file systems, notably ext4,
>> this field uses the full 63 bits even for small directories (strace -v
>> output, wrapped here for readability):
>>
>> getdents(3, [
>> {d_ino=1494304, d_off=3901177228673045825, d_reclen=40,
>> d_name="authorized_keys", d_type=DT_REG},
>> {d_ino=1494277, d_off=7491915799041650922, d_reclen=24, d_name=".",
>> d_type=DT_DIR},
>> {d_ino=1314655, d_off=9223372036854775807, d_reclen=24,
>> d_name="..", d_type=DT_DIR}
>> ], 32768) = 88
>>
>> When running in 32-bit compat mode, this value is somehow truncated to
>> 31 bits, for both the getdents and the getdents64 (!) system call (at
>> least on i386).
>
> I imagine you’re encountering this bug:
>
> https://lkml.org/lkml/2018/10/18/859

It's definitely in this area. However, the original collision problem
with 32-bit hashes is also real, so I can see the desire to use more
bits.

> Presumably the right fix involves modifying the relevant VFS file
> operations to indicate the relevant ABI to the implementations.

Not sure. How does NFS solve this problem when access happens from a
32-bit process and the rest (client kernel, transport, server kernel)
is 64-bit all the way?

> I would guess that 9p is triggering the “not really in the syscall you
> think you’re in” issue.

I think the issue is more like the networking case for 9p. In this
scenario, the server shouldn't have to care whether the client process
is in 32-bit mode or 64-bit mode. But maybe the only solution is to
pass through some sort of flag, as Peter Maydell has just suggested.

2018-12-28 12:21:10

by Adhemerval Zanella

[permalink] [raw]
Subject: Re: d_off field in struct dirent and 32-on-64 emulation



On 28/12/2018 10:01, Florian Weimer wrote:
> * Florian Weimer:
>
>> * Adhemerval Zanella:
>>
>>> On 27/12/2018 16:09, Florian Weimer wrote:
>>>> * Adhemerval Zanella:
>>>>
>>>>> Also for glibc standpoint, although reverting it back to use getdents
>>>>> syscall for non-LFS mode might fix this issue for architectures that
>>>>> provides non-LFS getdents syscall it won't be a fix for architectures
>>>>> that still provides off_t different than off64_t *and* only provides
>>>>> getdents64 syscall.
>>>>>
>>>>> Currently we only have nios2 and csky (unfortunately). But since generic
>>>>> definition for off_t and off64_t still assumes non-LFS support, all new
>>>>> 32-bits ports potentially might carry the issue.
>>>>
>>>> For csky, we could still change the type of the non-standard d_off
>>>> field to long long int. This way, only telldir would have to fail
>>>> when truncation is necessary, as mentioned below:
>>>
>>> I think it makes no sense to continue making non-LFS as default for
>>> newer 32 bits ports, the support will be emulated with LFS syscalls.
>>
>> Sorry, I don't see how this matters. seekdir and telldir are NOT
>> affected by LFS.
>
> Ah, right. If struct dirent is 64-bit only, then the d_off member
> will be 64 bits as well. But it is unclear whether you can use that
> with lseek (probably yes, in its 64-bit variant), and it's unlikely
> it's going to work with seekdir because of the POSIX-required long int
> type.
>

I was referring to all other API that uses off_t as well (pread for
instance), where new ports will have non-LFS variants that will call
only LFS variants.

2018-12-28 00:23:31

by Andreas Dilger

[permalink] [raw]
Subject: Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

On Dec 27, 2018, at 10:41 AM, Peter Maydell <[email protected]> wrote:
>
> On Thu, 27 Dec 2018 at 17:19, Florian Weimer <[email protected]> wrote:
>> We have a bit of an interesting problem with respect to the d_off
>> field in struct dirent.
>>
>> When running a 64-bit kernel on certain file systems, notably ext4,
>> this field uses the full 63 bits even for small directories (strace -v
>> output, wrapped here for readability):
>>
>> getdents(3, [
>> {d_ino=1494304, d_off=3901177228673045825, d_reclen=40, d_name="authorized_keys", d_type=DT_REG},
>> {d_ino=1494277, d_off=7491915799041650922, d_reclen=24, d_name=".", d_type=DT_DIR},
>> {d_ino=1314655, d_off=9223372036854775807, d_reclen=24, d_name="..", d_type=DT_DIR}
>> ], 32768) = 88
>>
>> When running in 32-bit compat mode, this value is somehow truncated to
>> 31 bits, for both the getdents and the getdents64 (!) system call (at
>> least on i386).
>
> Yes -- look for hash2pos() and friends in fs/ext4/dir.c.
> The ext4 code in the kernel uses a 32 bit hash if (a) the kernel
> is 32 bit (b) this is a compat syscall (b) some other bit of
> the kernel asked it to via the FMODE_32BITHASH flag (currently only
> NFS does that I think).
>
> As you note, this causes breakage for userspace programs which
> need to implement an API/ABI with 32-bit offset but which only
> have access to the kernel's 64-bit offset API/ABI.

This is (IMHO) a bit of an oxymoron, isn't it? Applications using
the 64-bit API, but storing the value in a 32-bit field? The same
problem would exist for filesystems with 64-bit inodes or 64-bit
file offsets trying to store these values in 32-bit variables.
It might work most of the time, but it can also break randomly.

> I think the best fix for this would be for the kernel to either
> (a) consistently use a 32-bit hash or (b) to provide an API
> so that userspace can use the FMODE_32BITHASH flag the way
> that kernel-internal users already can.

It would be relatively straight forward to add a "32bitapi" mount
option to return a 32-bit directory hash to userspace for operations
on that mountpoint (ext4 doesn't have 64-bit inode numbers yet).
However, I can't think of an easy way to do this on a per-process
basis without just having it call the 32-bit API directly.

> I couldn't think of or find any existing way for userspace
> to get the right results here, which is why
> 32-bit-guest-on-64-bit-host QEMU doesn't work on these filesystems
> (depending on what exactly the guest's libc etc do).
>
>> the 32-bit getdents system call emulation in a 64-bit qemu-user
>> process would just silently truncate the d_off field as part of
>> the translation, not reporting an error.
>> [...]
>> This truncation has always been a bug; it breaks telldir/seekdir
>> at least in some cases.
>
> Yes; you can't fit a quart into a pint pot, so if the guest
> only handles 32-bit offsets then truncation is about all we
> can do. This works fine if offsets are offsets, assuming the
> directory isn't so enormous it would have broken the guest
> anyway. I'm not aware of any issues with this other than the
> oddball ext4 offsets-are-hashes situation -- could you expand
> on the telldir/seekdir issue? (I suppose we should probably
> make QEMU's syscall emulation layer return "no more entries"
> rather than entries with truncated hashes.)

For ext4 at least, you could just shift the high 32-bit part of
the 64-bit hash down into a 32-bit value in telldir(), and
shift it back up when seekdir() is called.

Cheers, Andreas






Attachments:
signature.asc (873.00 B)
Message signed with OpenPGP

2018-12-28 15:26:15

by Andy Lutomirski

[permalink] [raw]
Subject: Re: d_off field in struct dirent and 32-on-64 emulation

[sending again, slightly edited, due to email client issues]

On Thu, Dec 27, 2018 at 9:25 AM Florian Weimer <[email protected]> wrote:
>
> We have a bit of an interesting problem with respect to the d_off
> field in struct dirent.
>
> When running a 64-bit kernel on certain file systems, notably ext4,
> this field uses the full 63 bits even for small directories (strace -v
> output, wrapped here for readability):
>
> getdents(3, [
> {d_ino=1494304, d_off=3901177228673045825, d_reclen=40, d_name="authorized_keys", d_type=DT_REG},
> {d_ino=1494277, d_off=7491915799041650922, d_reclen=24, d_name=".", d_type=DT_DIR},
> {d_ino=1314655, d_off=9223372036854775807, d_reclen=24, d_name="..", d_type=DT_DIR}
> ], 32768) = 88
>
> When running in 32-bit compat mode, this value is somehow truncated to
> 31 bits, for both the getdents and the getdents64 (!) system call (at
> least on i386).
>

...

>
> However, both qemu-user and the 9p file system can run in such a way
> that the kernel is entered from a 64-bit process, but the actual usage
> is from a 32-bit process:


I imagine that at least some of the problems you're seeing are due to this bug:

https://lkml.org/lkml/2018/10/18/859

Presumably the right fix involves modifying the relevant VFS file
operations to indicate the relevant ABI to the implementations. I
would guess that 9p is triggering the “not really in the syscall you
think you’re in” issue.

2018-12-29 01:54:59

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

On Sat, Dec 29, 2018 at 12:12:27AM +0000, Peter Maydell wrote:
> On Fri, 28 Dec 2018 at 23:16, Andreas Dilger <[email protected]> wrot
> > On Dec 28, 2018, at 4:18 AM, Peter Maydell <[email protected]> wrote:
> > > The problem is that there is no 32-bit API in some cases
> > > (unless I have misunderstood the kernel code) -- not all
> > > host architectures implement compat syscalls or allow them
> > > to be called from 64-bit processes or implement all the older
> > > syscall variants that had smaller offets. If there was a guaranteed
> > > "this syscall always exists and always gives me 32-bit offsets"
> > > we could use it.
> >
> > The "32bitapi" mount option would use 32-bit hash for seekdir
> > and telldir, regardless of what kernel API was used. That would
> > just set the FMODE_32BITHASH flag in the file->f_mode for all files.
>
> A mount option wouldn't be much use to QEMU -- we can't tell
> our users how to mount their filesystems, which they're
> often doing lots of other things with besides running QEMU.
> (Otherwise we could just tell them "don't use ext4", which
> would also solve the problem :-)) We need something we can
> use at the individual-syscall level.

Could you use a prctl to set whether you were running in 32 or 64 bit
mode? Or do you change which kind of task you're emulating too often
to make this a good idea?

2018-12-28 07:46:35

by Florian Weimer

[permalink] [raw]
Subject: Re: d_off field in struct dirent and 32-on-64 emulation

* Dmitry V. Levin:

> On Thu, Dec 27, 2018 at 06:18:19PM +0100, Florian Weimer wrote:
>> We have a bit of an interesting problem with respect to the d_off
>> field in struct dirent.
>>
>> When running a 64-bit kernel on certain file systems, notably ext4,
>> this field uses the full 63 bits even for small directories (strace -v
>> output, wrapped here for readability):
>>
>> getdents(3, [
>> {d_ino=1494304, d_off=3901177228673045825, d_reclen=40,
>> d_name="authorized_keys", d_type=DT_REG},
>> {d_ino=1494277, d_off=7491915799041650922, d_reclen=24,
>> d_name=".", d_type=DT_DIR},
>> {d_ino=1314655, d_off=9223372036854775807, d_reclen=24,
>> d_name="..", d_type=DT_DIR}
>> ], 32768) = 88
>>
>> When running in 32-bit compat mode, this value is somehow truncated to
>> 31 bits, for both the getdents and the getdents64 (!) system call (at
>> least on i386).
>
> Why getdents64 system call is affected by this truncation,
> isn't it a kernel bug that has to be fixed in the kernel instead?

It's required because POSIX specifies that telldir and seekdir use
long int (and not off_t) as the seek offset. If the kernel does not
truncate while keeping a useful value, these functions would turn
unusable.

2018-12-28 02:23:42

by Dmitry V. Levin

[permalink] [raw]
Subject: Re: d_off field in struct dirent and 32-on-64 emulation

On Thu, Dec 27, 2018 at 06:18:19PM +0100, Florian Weimer wrote:
> We have a bit of an interesting problem with respect to the d_off
> field in struct dirent.
>
> When running a 64-bit kernel on certain file systems, notably ext4,
> this field uses the full 63 bits even for small directories (strace -v
> output, wrapped here for readability):
>
> getdents(3, [
> {d_ino=1494304, d_off=3901177228673045825, d_reclen=40, d_name="authorized_keys", d_type=DT_REG},
> {d_ino=1494277, d_off=7491915799041650922, d_reclen=24, d_name=".", d_type=DT_DIR},
> {d_ino=1314655, d_off=9223372036854775807, d_reclen=24, d_name="..", d_type=DT_DIR}
> ], 32768) = 88
>
> When running in 32-bit compat mode, this value is somehow truncated to
> 31 bits, for both the getdents and the getdents64 (!) system call (at
> least on i386).

Why getdents64 system call is affected by this truncation,
isn't it a kernel bug that has to be fixed in the kernel instead?


--
ldv


Attachments:
(No filename) (961.00 B)
signature.asc (801.00 B)
Download all attachments

2018-12-29 04:23:34

by Dominique Martinet

[permalink] [raw]
Subject: Re: [V9fs-developer] [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

Theodore Y. Ts'o wrote on Fri, Dec 28, 2018:
> On Sat, Dec 29, 2018 at 03:37:21AM +0100, Dominique Martinet wrote:
> > > Are there going to be cases where a process or a thread will sometimes
> > > want the 64-bit interface, and sometimes want the 32-bit interface?
> > > Or is it always going to be one or the other? I wonder if we could
> > > simply add a new flag to the process personality(2) flags.
> >
> > That would likely work for qemu user, but the qemu system+9p case is
> > going to be more painful..
> > More precisely, the 9p protocol does not plan for anything other than
> > 64bit offset so if the vfs needs to hand out a 32bit offset we'll need
> > to make a correspondance table between the 32bit offsets we hand off and
> > the 64bit ones to use; unless some flag can be passed at lopen to tell
> > the server to always hand out 32bit offsets for this directory... And if
> > we do that then 9p servers will need a way to use both APIs in parallel
> > for both types of directories.
>
> How about if we add a fcntl(2) mediated flag, which is tied to a
> struct file? Would that be more or less painful for 9p and qemu
> system+9p?

Hmm. 9P2000.L doesn't have anything akin to fcntl either, the only two
obvious places where we could pass a flag is lopen (which already
handles a bunch of linux-specific flags, e.g. passing O_LARGEFILE
O_NOATIME etc will just forward these through for qemu/diod at least),
or adding a new parameter to the 9p readdir.

The former would let us get away without modifying the protocol as
servers will just ignore flags they don't handle on implementations I
checked, so it'd definitely be the least effort choice from what I can
tell.


On the other hand a fcntl would solve the server-side problem, it'd
allow the server to request appropriately-sized offsets per fd, so it's
a good start; we "just" need to figure how to translate that on the wire.

--
Dominique Martinet | Asmadeus

2018-12-28 12:01:47

by Florian Weimer

[permalink] [raw]
Subject: Re: d_off field in struct dirent and 32-on-64 emulation

* Florian Weimer:

> * Adhemerval Zanella:
>
>> On 27/12/2018 16:09, Florian Weimer wrote:
>>> * Adhemerval Zanella:
>>>
>>>> Also for glibc standpoint, although reverting it back to use getdents
>>>> syscall for non-LFS mode might fix this issue for architectures that
>>>> provides non-LFS getdents syscall it won't be a fix for architectures
>>>> that still provides off_t different than off64_t *and* only provides
>>>> getdents64 syscall.
>>>>
>>>> Currently we only have nios2 and csky (unfortunately). But since generic
>>>> definition for off_t and off64_t still assumes non-LFS support, all new
>>>> 32-bits ports potentially might carry the issue.
>>>
>>> For csky, we could still change the type of the non-standard d_off
>>> field to long long int. This way, only telldir would have to fail
>>> when truncation is necessary, as mentioned below:
>>
>> I think it makes no sense to continue making non-LFS as default for
>> newer 32 bits ports, the support will be emulated with LFS syscalls.
>
> Sorry, I don't see how this matters. seekdir and telldir are NOT
> affected by LFS.

Ah, right. If struct dirent is 64-bit only, then the d_off member
will be 64 bits as well. But it is unclear whether you can use that
with lseek (probably yes, in its 64-bit variant), and it's unlikely
it's going to work with seekdir because of the POSIX-required long int
type.

2018-12-27 17:42:10

by Peter Maydell

[permalink] [raw]
Subject: Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

On Thu, 27 Dec 2018 at 17:19, Florian Weimer <[email protected]> wrote:
> We have a bit of an interesting problem with respect to the d_off
> field in struct dirent.
>
> When running a 64-bit kernel on certain file systems, notably ext4,
> this field uses the full 63 bits even for small directories (strace -v
> output, wrapped here for readability):
>
> getdents(3, [
> {d_ino=1494304, d_off=3901177228673045825, d_reclen=40, d_name="authorized_keys", d_type=DT_REG},
> {d_ino=1494277, d_off=7491915799041650922, d_reclen=24, d_name=".", d_type=DT_DIR},
> {d_ino=1314655, d_off=9223372036854775807, d_reclen=24, d_name="..", d_type=DT_DIR}
> ], 32768) = 88
>
> When running in 32-bit compat mode, this value is somehow truncated to
> 31 bits, for both the getdents and the getdents64 (!) system call (at
> least on i386).

Yes -- look for hash2pos() and friends in fs/ext4/dir.c.
The ext4 code in the kernel uses a 32 bit hash if (a) the kernel
is 32 bit (b) this is a compat syscall (b) some other bit of
the kernel asked it to via the FMODE_32BITHASH flag (currently only
NFS does that I think).

As you note, this causes breakage for userspace programs which
need to implement an API/ABI with 32-bit offset but which only
have access to the kernel's 64-bit offset API/ABI.

I think the best fix for this would be for the kernel to either
(a) consistently use a 32-bit hash or (b) to provide an API
so that userspace can use the FMODE_32BITHASH flag the way
that kernel-internal users already can.

I couldn't think of or find any existing way for userspace
to get the right results here, which is why
32-bit-guest-on-64-bit-host QEMU doesn't work on these filesystems
(depending on what exactly the guest's libc etc do).

> the 32-bit getdents system call emulation in a 64-bit qemu-user
> process would just silently truncate the d_off field as part of
> the translation, not reporting an error.
> [...]
> This truncation has always been a bug; it breaks telldir/seekdir
> at least in some cases.

Yes; you can't fit a quart into a pint pot, so if the guest
only handles 32-bit offsets then truncation is about all we
can do. This works fine if offsets are offsets, assuming the
directory isn't so enormous it would have broken the guest
anyway. I'm not aware of any issues with this other than the
oddball ext4 offsets-are-hashes situation -- could you expand
on the telldir/seekdir issue? (I suppose we should probably
make QEMU's syscall emulation layer return "no more entries"
rather than entries with truncated hashes.)

thanks
-- PMM

2018-12-28 23:16:28

by Andreas Dilger

[permalink] [raw]
Subject: Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

On Dec 28, 2018, at 4:18 AM, Peter Maydell <[email protected]> wrote:
>
> On Fri, 28 Dec 2018 at 00:23, Andreas Dilger <[email protected]> wrote:
>> On Dec 27, 2018, at 10:41 AM, Peter Maydell <[email protected]> wrote:
>>> As you note, this causes breakage for userspace programs which
>>> need to implement an API/ABI with 32-bit offset but which only
>>> have access to the kernel's 64-bit offset API/ABI.
>>
>> This is (IMHO) a bit of an oxymoron, isn't it? Applications using
>> the 64-bit API, but storing the value in a 32-bit field?
>
> I didn't say "which choose to store the value in a 32-bit field",
> I said "which have to implement an API/ABI which has 32-bit fields".
> In QEMU's case, we use the host kernel's ABI, which has 64-bit
> offset fields. We implement a syscall ABI for the guest binary
> we are running under emulation, which may have 32-bit offset fields
> (for instance if we are running a 32-bit Arm binary.) Both of
> these ABIs are fixed -- QEMU doesn't have a choice here, it
> just has to make the best effort it can with what the host kernel
> provides it, to provide the semantics the guest binary needs.
> My suggestion in this thread is that the host kernel provides
> a wider range of facilities so that QEMU can do the job it's
> trying to do.
>
>> The same
>> problem would exist for filesystems with 64-bit inodes or 64-bit
>> file offsets trying to store these values in 32-bit variables.
>> It might work most of the time, but it can also break randomly.
>
> In general inodes and offsets start from 0 and work up --
> so almost all of the time they don't actually overflow.
> The problem with ext4 directory hash "offsets" is that they
> overflow all the time and immediately, so instead of "works
> unless you have a weird edge case" like all the other filesystems,
> it's "never works".
>
>>> I think the best fix for this would be for the kernel to either
>>> (a) consistently use a 32-bit hash or (b) to provide an API
>>> so that userspace can use the FMODE_32BITHASH flag the way
>>> that kernel-internal users already can.
>>
>> It would be relatively straight forward to add a "32bitapi" mount
>> option to return a 32-bit directory hash to userspace for operations
>> on that mountpoint (ext4 doesn't have 64-bit inode numbers yet).
>> However, I can't think of an easy way to do this on a per-process
>> basis without just having it call the 32-bit API directly.
>
> The problem is that there is no 32-bit API in some cases
> (unless I have misunderstood the kernel code) -- not all
> host architectures implement compat syscalls or allow them
> to be called from 64-bit processes or implement all the older
> syscall variants that had smaller offets. If there was a guaranteed
> "this syscall always exists and always gives me 32-bit offsets"
> we could use it.

The "32bitapi" mount option would use 32-bit hash for seekdir
and telldir, regardless of what kernel API was used. That would
just set the FMODE_32BITHASH flag in the file->f_mode for all files.

Using 32-bit directory hash values is not necessarily harmful, but
it returns the possibility to hit the problem with hash collisions
that previously existed before the move to 64-bit hash values.
This becomes more of a problem as directory sizes increase.

>> For ext4 at least, you could just shift the high 32-bit part of
>> the 64-bit hash down into a 32-bit value in telldir(), and
>> shift it back up when seekdir() is called.
>
> Yes, that has been suggested, but it seemed a bit dubious
> to bake in knowledge of ext4's internal implementation details.
> Can we rely on this as an ABI promise that will always work
> for all versions of all file systems going forwards?

Well, the directory cookies need to be relatively stable over
time because they are exported to applications and possibly
remote nodes via NFS, so it can't be changed very much.

Cheers, Andreas






Attachments:
signature.asc (873.00 B)
Message signed with OpenPGP

2018-12-27 17:58:54

by Adhemerval Zanella

[permalink] [raw]
Subject: Re: d_off field in struct dirent and 32-on-64 emulation



On 27/12/2018 15:18, Florian Weimer wrote:
> We have a bit of an interesting problem with respect to the d_off
> field in struct dirent.
>
> When running a 64-bit kernel on certain file systems, notably ext4,
> this field uses the full 63 bits even for small directories (strace -v
> output, wrapped here for readability):
>
> getdents(3, [
> {d_ino=1494304, d_off=3901177228673045825, d_reclen=40, d_name="authorized_keys", d_type=DT_REG},
> {d_ino=1494277, d_off=7491915799041650922, d_reclen=24, d_name=".", d_type=DT_DIR},
> {d_ino=1314655, d_off=9223372036854775807, d_reclen=24, d_name="..", d_type=DT_DIR}
> ], 32768) = 88
>
> When running in 32-bit compat mode, this value is somehow truncated to
> 31 bits, for both the getdents and the getdents64 (!) system call (at
> least on i386).
>
> In an effort to simplify support for future architectures which only
> have the getdents64 system call, we changed glibc 2.28 to use the
> getdents64 system call unconditionally, and perform translation if
> necessary. This translation is noteworthy because it includes
> overflow checking for the d_ino and d_off members of struct dirent.
> We did not initially observe a regression because the kernel performs
> consistent d_off truncation (with the ext4 file system; small
> directories do not show this issue on XFS), so the overflow check does
> not fire.
>
> However, both qemu-user and the 9p file system can run in such a way
> that the kernel is entered from a 64-bit process, but the actual usage
> is from a 32-bit process:
>
> <https://sourceware.org/bugzilla/show_bug.cgi?id=23960>
>
> I think diagrammatically, this looks like this:
>
> guest process (32-bit)
> | getdents64, 32-bit UAPI
> qemu-user (64-bit)
> | getdents, 64-bit UAPI
> host kernel (64-bit)
>
> Or:
>
> guest process
> | getdents64, 32-bit UAPI
> guest kernel (64-bit)
> | 9p over virtio (64-bit d_off in struct p9_dirent)
> qemu
> | getdents, 64-bit UAPI
> host kernel (64-bit)
>
> Back when we still called getdents, in the first case, the 32-bit
> getdents system call emulation in a 64-bit qemu-user process would
> just silently truncate the d_off field as part of the translation, not
> reporting an error. The second case is more complicated, and I have
> not figured out where the truncation happens.
>
> This truncation has always been a bug; it breaks telldir/seekdir at
> least in some cases. But use of telldir/seekdir is comparatively
> rare. In contrast, now that we detect d_off overflow in glibc,
> readdir will always fail in the sketched configurations, which is bad.
> (glibc exposes the d_off field to applications, and it cannot know
> whether the application will use it or not, so there is no direct way
> to restrict the overflow error to the telldir/seekdir use case.)
>
> We could switch glibc to call getdents again if the system call is
> available. But that merely relies on the existence of the truncation
> bug somewhere else in the file system stack. This is why I don't
> think it's the right solution, just the path of least resistance.
>
> I don't want to reimplement the ext4 truncation behavior in glibc (it
> doesn't look like a straightforward truncation), and it wouldn't work
> for the second scenario where we see the 9p file system in the 32-bit
> glibc, not the ext4 file system. So that's not a good solution.

Also for glibc standpoint, although reverting it back to use getdents
syscall for non-LFS mode might fix this issue for architectures that
provides non-LFS getdents syscall it won't be a fix for architectures
that still provides off_t different than off64_t *and* only provides
getdents64 syscall.

Currently we only have nios2 and csky (unfortunately). But since generic
definition for off_t and off64_t still assumes non-LFS support, all new
32-bits ports potentially might carry the issue.

>
> There is another annoying aspect: The standards expose d_off through
> the telldir function, and that returns long int on all architectures
> (not off_t, so unchanged by _FILE_OFFSET_BITS). That's mostly a
> userspace issue and thus needing different steps to resolve (possibly
> standards action).
>
> Any suggestions how to solve this? Why does the kernel return
> different d_off values for 32-bit and 64-bit processes even when using
> getdents64, for the same directory?
>

2018-12-28 11:56:38

by Florian Weimer

[permalink] [raw]
Subject: Re: d_off field in struct dirent and 32-on-64 emulation

* Adhemerval Zanella:

> On 27/12/2018 16:09, Florian Weimer wrote:
>> * Adhemerval Zanella:
>>
>>> Also for glibc standpoint, although reverting it back to use getdents
>>> syscall for non-LFS mode might fix this issue for architectures that
>>> provides non-LFS getdents syscall it won't be a fix for architectures
>>> that still provides off_t different than off64_t *and* only provides
>>> getdents64 syscall.
>>>
>>> Currently we only have nios2 and csky (unfortunately). But since generic
>>> definition for off_t and off64_t still assumes non-LFS support, all new
>>> 32-bits ports potentially might carry the issue.
>>
>> For csky, we could still change the type of the non-standard d_off
>> field to long long int. This way, only telldir would have to fail
>> when truncation is necessary, as mentioned below:
>
> I think it makes no sense to continue making non-LFS as default for
> newer 32 bits ports, the support will be emulated with LFS syscalls.

Sorry, I don't see how this matters. seekdir and telldir are NOT
affected by LFS.

2018-12-29 00:12:40

by Peter Maydell

[permalink] [raw]
Subject: Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

On Fri, 28 Dec 2018 at 23:16, Andreas Dilger <[email protected]> wrot
> On Dec 28, 2018, at 4:18 AM, Peter Maydell <[email protected]> wrote:
> > The problem is that there is no 32-bit API in some cases
> > (unless I have misunderstood the kernel code) -- not all
> > host architectures implement compat syscalls or allow them
> > to be called from 64-bit processes or implement all the older
> > syscall variants that had smaller offets. If there was a guaranteed
> > "this syscall always exists and always gives me 32-bit offsets"
> > we could use it.
>
> The "32bitapi" mount option would use 32-bit hash for seekdir
> and telldir, regardless of what kernel API was used. That would
> just set the FMODE_32BITHASH flag in the file->f_mode for all files.

A mount option wouldn't be much use to QEMU -- we can't tell
our users how to mount their filesystems, which they're
often doing lots of other things with besides running QEMU.
(Otherwise we could just tell them "don't use ext4", which
would also solve the problem :-)) We need something we can
use at the individual-syscall level.

thanks
-- PMM

2018-12-29 02:15:01

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

On Fri, Dec 28, 2018 at 11:18:18AM +0000, Peter Maydell wrote:
> In general inodes and offsets start from 0 and work up --
> so almost all of the time they don't actually overflow.
> The problem with ext4 directory hash "offsets" is that they
> overflow all the time and immediately, so instead of "works
> unless you have a weird edge case" like all the other filesystems,h
> it's "never works".

Actually, XFS uses the inode number to encode the location of the
inode (it doesn't have a fixed inode table, so it's effectively the
block number shifted left by 3 or 4 bits, with the low bits indicating
the slot in the 4k block). It has a hack to provide backwards
compatibility for 32-bit API's, but there is a similar, "oh, we're on
a non-paleolithic CPU, let's use the full 64-bits" sort of logic that
ext4 has.

> The problem is that there is no 32-bit API in some cases
> (unless I have misunderstood the kernel code) -- not all
> host architectures implement compat syscalls or allow them
> to be called from 64-bit processes or implement all the older
> syscall variants that had smaller offets. If there was a guaranteed
> "this syscall always exists and always gives me 32-bit offsets"
> we could use it.

Are there going to be cases where a process or a thread will sometimes
want the 64-bit interface, and sometimes want the 32-bit interface?
Or is it always going to be one or the other? I wonder if we could
simply add a new flag to the process personality(2) flags.

> Yes, that has been suggested, but it seemed a bit dubious
> to bake in knowledge of ext4's internal implementation details.
> Can we rely on this as an ABI promise that will always work
> for all versions of all file systems going forwards?

Yeah, that seems dubious because I'm pretty sure there are other file
systems that may have their own 32/64-bit quirks.

- Ted

2018-12-29 04:18:45

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

On Sat, Dec 29, 2018 at 03:37:21AM +0100, Dominique Martinet wrote:
> > Are there going to be cases where a process or a thread will sometimes
> > want the 64-bit interface, and sometimes want the 32-bit interface?
> > Or is it always going to be one or the other? I wonder if we could
> > simply add a new flag to the process personality(2) flags.
>
> That would likely work for qemu user, but the qemu system+9p case is
> going to be more painful..
> More precisely, the 9p protocol does not plan for anything other than
> 64bit offset so if the vfs needs to hand out a 32bit offset we'll need
> to make a correspondance table between the 32bit offsets we hand off and
> the 64bit ones to use; unless some flag can be passed at lopen to tell
> the server to always hand out 32bit offsets for this directory... And if
> we do that then 9p servers will need a way to use both APIs in parallel
> for both types of directories.

How about if we add a fcntl(2) mediated flag, which is tied to a
struct file? Would that be more or less painful for 9p and qemu
system+9p?

- Ted

2018-12-28 11:18:31

by Peter Maydell

[permalink] [raw]
Subject: Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

On Fri, 28 Dec 2018 at 00:23, Andreas Dilger <[email protected]> wrote:
> On Dec 27, 2018, at 10:41 AM, Peter Maydell <[email protected]> wrote:
> > As you note, this causes breakage for userspace programs which
> > need to implement an API/ABI with 32-bit offset but which only
> > have access to the kernel's 64-bit offset API/ABI.
>
> This is (IMHO) a bit of an oxymoron, isn't it? Applications using
> the 64-bit API, but storing the value in a 32-bit field?

I didn't say "which choose to store the value in a 32-bit field",
I said "which have to implement an API/ABI which has 32-bit fields".
In QEMU's case, we use the host kernel's ABI, which has 64-bit
offset fields. We implement a syscall ABI for the guest binary
we are running under emulation, which may have 32-bit offset fields
(for instance if we are running a 32-bit Arm binary.) Both of
these ABIs are fixed -- QEMU doesn't have a choice here, it
just has to make the best effort it can with what the host kernel
provides it, to provide the semantics the guest binary needs.
My suggestion in this thread is that the host kernel provides
a wider range of facilities so that QEMU can do the job it's
trying to do.

> The same
> problem would exist for filesystems with 64-bit inodes or 64-bit
> file offsets trying to store these values in 32-bit variables.
> It might work most of the time, but it can also break randomly.

In general inodes and offsets start from 0 and work up --
so almost all of the time they don't actually overflow.
The problem with ext4 directory hash "offsets" is that they
overflow all the time and immediately, so instead of "works
unless you have a weird edge case" like all the other filesystems,
it's "never works".

> > I think the best fix for this would be for the kernel to either
> > (a) consistently use a 32-bit hash or (b) to provide an API
> > so that userspace can use the FMODE_32BITHASH flag the way
> > that kernel-internal users already can.
>
> It would be relatively straight forward to add a "32bitapi" mount
> option to return a 32-bit directory hash to userspace for operations
> on that mountpoint (ext4 doesn't have 64-bit inode numbers yet).
> However, I can't think of an easy way to do this on a per-process
> basis without just having it call the 32-bit API directly.

The problem is that there is no 32-bit API in some cases
(unless I have misunderstood the kernel code) -- not all
host architectures implement compat syscalls or allow them
to be called from 64-bit processes or implement all the older
syscall variants that had smaller offets. If there was a guaranteed
"this syscall always exists and always gives me 32-bit offsets"
we could use it.

> For ext4 at least, you could just shift the high 32-bit part of
> the 64-bit hash down into a 32-bit value in telldir(), and
> shift it back up when seekdir() is called.

Yes, that has been suggested, but it seemed a bit dubious
to bake in knowledge of ext4's internal implementation details.
Can we rely on this as an ABI promise that will always work
for all versions of all file systems going forwards?

thanks
-- PMM

2018-12-29 02:37:39

by Dominique Martinet

[permalink] [raw]
Subject: Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

Theodore Y. Ts'o wrote on Fri, Dec 28, 2018:
> > The problem is that there is no 32-bit API in some cases
> > (unless I have misunderstood the kernel code) -- not all
> > host architectures implement compat syscalls or allow them
> > to be called from 64-bit processes or implement all the older
> > syscall variants that had smaller offets. If there was a guaranteed
> > "this syscall always exists and always gives me 32-bit offsets"
> > we could use it.
>
> Are there going to be cases where a process or a thread will sometimes
> want the 64-bit interface, and sometimes want the 32-bit interface?
> Or is it always going to be one or the other? I wonder if we could
> simply add a new flag to the process personality(2) flags.

That would likely work for qemu user, but the qemu system+9p case is
going to be more painful..
More precisely, the 9p protocol does not plan for anything other than
64bit offset so if the vfs needs to hand out a 32bit offset we'll need
to make a correspondance table between the 32bit offsets we hand off and
the 64bit ones to use; unless some flag can be passed at lopen to tell
the server to always hand out 32bit offsets for this directory... And if
we do that then 9p servers will need a way to use both APIs in parallel
for both types of directories.

(Although I'd rather not have to do either in the first place, keeping
track of all used offsets just in case seems like a waste even if we
only do it for processes in 32bit mode, and a new flag would be a
protocol change with 9p not being designed to catter for subtle protocol
changes so would be rather painful to roll out)


No bright idea here, sorry.
--
Dominique Martinet | Asmadeus

2018-12-27 18:09:14

by Florian Weimer

[permalink] [raw]
Subject: Re: d_off field in struct dirent and 32-on-64 emulation

* Adhemerval Zanella:

> Also for glibc standpoint, although reverting it back to use getdents
> syscall for non-LFS mode might fix this issue for architectures that
> provides non-LFS getdents syscall it won't be a fix for architectures
> that still provides off_t different than off64_t *and* only provides
> getdents64 syscall.
>
> Currently we only have nios2 and csky (unfortunately). But since generic
> definition for off_t and off64_t still assumes non-LFS support, all new
> 32-bits ports potentially might carry the issue.

For csky, we could still change the type of the non-standard d_off
field to long long int. This way, only telldir would have to fail
when truncation is necessary, as mentioned below:

>> There is another annoying aspect: The standards expose d_off through
>> the telldir function, and that returns long int on all architectures
>> (not off_t, so unchanged by _FILE_OFFSET_BITS). That's mostly a
>> userspace issue and thus needing different steps to resolve (possibly
>> standards action).

2018-12-28 11:53:27

by Adhemerval Zanella

[permalink] [raw]
Subject: Re: d_off field in struct dirent and 32-on-64 emulation



On 27/12/2018 16:09, Florian Weimer wrote:
> * Adhemerval Zanella:
>
>> Also for glibc standpoint, although reverting it back to use getdents
>> syscall for non-LFS mode might fix this issue for architectures that
>> provides non-LFS getdents syscall it won't be a fix for architectures
>> that still provides off_t different than off64_t *and* only provides
>> getdents64 syscall.
>>
>> Currently we only have nios2 and csky (unfortunately). But since generic
>> definition for off_t and off64_t still assumes non-LFS support, all new
>> 32-bits ports potentially might carry the issue.
>
> For csky, we could still change the type of the non-standard d_off
> field to long long int. This way, only telldir would have to fail
> when truncation is necessary, as mentioned below:

I think it makes no sense to continue making non-LFS as default for
newer 32 bits ports, the support will be emulated with LFS syscalls.

2018-12-30 13:59:29

by Peter Maydell

[permalink] [raw]
Subject: Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

On Sat, 29 Dec 2018 at 16:49, Andy Lutomirski <[email protected]> wrote:
> > Could you use a prctl to set whether you were running in 32 or 64 bit
> > mode? Or do you change which kind of task you're emulating too often
> > to make this a good idea?

QEMU's linux-user mode always only runs the single process,
which is a fixed guest architecture. But it also wants to
make system calls on its own behalf, as well as the ones it
is passing through from the guest, and I suspect it would
confuse the host libc if we changed the semantics of those
under its feet.

> How would this work? We already have the separate
> COMPAT_DEFINE_SYSCALL entries *and* in_compat_syscall(). Now we’d have
> a third degree of freedom.
>
> Either the arches people care about should add reasonable ways to
> issue 32-bit syscalls from 64-bit mode or there should be an explicit
> way to ask for the 32-bit directory offsets.

The first of those is not sufficient for QEMU if done
as a per-architecture thing, because there may not even be
a 32-bit syscall interface on the host kernel. The second
sounds better -- there's nothing conceptually architecture
specific about what we want to do or which is tied to the
idea of whether there's a 32-bit compat mode in the host
architecture or not.

thanks
-- PMM

2018-12-31 17:03:53

by Joseph Myers

[permalink] [raw]
Subject: Re: d_off field in struct dirent and 32-on-64 emulation

On Fri, 28 Dec 2018, Adhemerval Zanella wrote:

> >> Currently we only have nios2 and csky (unfortunately). But since generic
> >> definition for off_t and off64_t still assumes non-LFS support, all new
> >> 32-bits ports potentially might carry the issue.
> >
> > For csky, we could still change the type of the non-standard d_off
> > field to long long int. This way, only telldir would have to fail
> > when truncation is necessary, as mentioned below:
>
> I think it makes no sense to continue making non-LFS as default for
> newer 32 bits ports, the support will be emulated with LFS syscalls.

Any new 32-bit port that uses 64-bit time_t will also use 64-bit offsets
(because we don't have any glibc configurations that support the
combination of 64-bit time with 32-bit offsets, and don't want to add
them). That should apply for RISC-V 32-bit at least.

I've filed <https://sourceware.org/bugzilla/show_bug.cgi?id=24050> for
missing overflow checks in telldir when the default off_t is wider than
long int (currently just applies to x32; not sure why we don't see glibc
test failures on x32 resulting from the quiet truncation, as the issue is
certainly there in the source code).

--
Joseph S. Myers
[email protected]

2018-12-27 17:38:10

by Andy Lutomirski

[permalink] [raw]
Subject: Re: d_off field in struct dirent and 32-on-64 emulation



> On Dec 27, 2018, at 10:18 AM, Florian Weimer <[email protected]> wrote:
>
> We have a bit of an interesting problem with respect to the d_off
> field in struct dirent.
>
> When running a 64-bit kernel on certain file systems, notably ext4,
> this field uses the full 63 bits even for small directories (strace -v
> output, wrapped here for readability):
>
> getdents(3, [
> {d_ino=1494304, d_off=3901177228673045825, d_reclen=40, d_name="authorized_keys", d_type=DT_REG},
> {d_ino=1494277, d_off=7491915799041650922, d_reclen=24, d_name=".", d_type=DT_DIR},
> {d_ino=1314655, d_off=9223372036854775807, d_reclen=24, d_name="..", d_type=DT_DIR}
> ], 32768) = 88
>
> When running in 32-bit compat mode, this value is somehow truncated to
> 31 bits, for both the getdents and the getdents64 (!) system call (at
> least on i386).

I imagine you’re encountering this bug:

https://lkml.org/lkml/2018/10/18/859

Presumably the right fix involves modifying the relevant VFS file operations to indicate the relevant ABI to the implementations.

I would guess that 9p is triggering the “not really in the syscall you think you’re in” issue.

2019-01-02 13:16:44

by Adhemerval Zanella

[permalink] [raw]
Subject: Re: d_off field in struct dirent and 32-on-64 emulation



On 31/12/2018 15:03, Joseph Myers wrote:
> On Fri, 28 Dec 2018, Adhemerval Zanella wrote:
>
>>>> Currently we only have nios2 and csky (unfortunately). But since generic
>>>> definition for off_t and off64_t still assumes non-LFS support, all new
>>>> 32-bits ports potentially might carry the issue.
>>>
>>> For csky, we could still change the type of the non-standard d_off
>>> field to long long int. This way, only telldir would have to fail
>>> when truncation is necessary, as mentioned below:
>>
>> I think it makes no sense to continue making non-LFS as default for
>> newer 32 bits ports, the support will be emulated with LFS syscalls.
>
> Any new 32-bit port that uses 64-bit time_t will also use 64-bit offsets
> (because we don't have any glibc configurations that support the
> combination of 64-bit time with 32-bit offsets, and don't want to add
> them). That should apply for RISC-V 32-bit at least.
>
> I've filed <https://sourceware.org/bugzilla/show_bug.cgi?id=24050> for
> missing overflow checks in telldir when the default off_t is wider than
> long int (currently just applies to x32; not sure why we don't see glibc
> test failures on x32 resulting from the quiet truncation, as the issue is
> certainly there in the source code).
>

What about csky? Should we still make it use 32-bit offsets as default
configuration even when kernel does not support it natively?