2023-10-27 15:15:04

by Jens Axboe

[permalink] [raw]
Subject: lockdep: holding locks across syscall boundaries

Hi,

Normally we'd expect locking state to be clean and consistent across
syscall entry and exit, as that is always the case for sync syscalls.
We currently have a work-around for holding a lock from aio, see
kiocb_start_write(), which pretends to drop the lock from lockdeps
perspective, as it's held from submission to until kiocb_end_write() is
called at completion time.

This is a bit of an ugly work-around, and defeats the purpose of
lockdep.

Since I've now got another case where I want to hold a resource across
syscalls, is there a better way to do this?

This is for inode_dio_start(), which increments an inode int count, and
inode_dio_end() which decrements it. If a task is doing
inode_dio_start() and then inode_dio_wait(), I want to trigger this. I
have a hack that does this, but it disables lockdep_sys_exit() as
otherwise I just get that warning rather than the more useful one.

Thanks,
--
Jens Axboe


2023-10-27 16:00:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: lockdep: holding locks across syscall boundaries

On Fri, Oct 27, 2023 at 09:14:53AM -0600, Jens Axboe wrote:
> Hi,
>
> Normally we'd expect locking state to be clean and consistent across
> syscall entry and exit, as that is always the case for sync syscalls.

> We currently have a work-around for holding a lock from aio, see
> kiocb_start_write(), which pretends to drop the lock from lockdeps
> perspective, as it's held from submission to until kiocb_end_write() is
> called at completion time.

I was not aware of this, the only such hack I knew about was the
filesystem freezer thing.

The problem with holding locks past the end of a syscall is that you'll
nest whatever random lock hierarchies possibly by every other syscall
under that lock.

> This is a bit of an ugly work-around, and defeats the purpose of
> lockdep.
>
> Since I've now got another case where I want to hold a resource across
> syscalls, is there a better way to do this?
>
> This is for inode_dio_start(), which increments an inode int count, and
> inode_dio_end() which decrements it. If a task is doing
> inode_dio_start() and then inode_dio_wait(), I want to trigger this. I
> have a hack that does this, but it disables lockdep_sys_exit() as
> otherwise I just get that warning rather than the more useful one.

Suppose syscall-a returns with your kiocb thing held, call it lock A
Suppose syscall-b returns with your inode thing held, call it lock B

Then userspace does:

syscall-a
syscall-b

while it also does:

syscall-b
syscall-a

and we're up a creek, no?

2023-10-27 16:06:49

by Jens Axboe

[permalink] [raw]
Subject: Re: lockdep: holding locks across syscall boundaries

On 10/27/23 9:59 AM, Peter Zijlstra wrote:
> On Fri, Oct 27, 2023 at 09:14:53AM -0600, Jens Axboe wrote:
>> Hi,
>>
>> Normally we'd expect locking state to be clean and consistent across
>> syscall entry and exit, as that is always the case for sync syscalls.
>
>> We currently have a work-around for holding a lock from aio, see
>> kiocb_start_write(), which pretends to drop the lock from lockdeps
>> perspective, as it's held from submission to until kiocb_end_write() is
>> called at completion time.
>
> I was not aware of this, the only such hack I knew about was the
> filesystem freezer thing.
>
> The problem with holding locks past the end of a syscall is that you'll
> nest whatever random lock hierarchies possibly by every other syscall
> under that lock.

Can you expand on that bit, not quite sure I follow. Do we reset the
locking dependencies between syscalls?

>> This is a bit of an ugly work-around, and defeats the purpose of
>> lockdep.
>>
>> Since I've now got another case where I want to hold a resource across
>> syscalls, is there a better way to do this?
>>
>> This is for inode_dio_start(), which increments an inode int count, and
>> inode_dio_end() which decrements it. If a task is doing
>> inode_dio_start() and then inode_dio_wait(), I want to trigger this. I
>> have a hack that does this, but it disables lockdep_sys_exit() as
>> otherwise I just get that warning rather than the more useful one.
>
> Suppose syscall-a returns with your kiocb thing held, call it lock A
> Suppose syscall-b returns with your inode thing held, call it lock B
>
> Then userspace does:
>
> syscall-a
> syscall-b
>
> while it also does:
>
> syscall-b
> syscall-a
>
> and we're up a creek, no?

Should this not get caught by the usual lock ordering rules? Because
that is obviously a bug, ordering would have to be consistent, just like
if we have:

syscall-a
lock(a);
lock(b);

syscall-b
lock(b);
lock(a)

--
Jens Axboe

2023-10-27 16:14:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: lockdep: holding locks across syscall boundaries

On Fri, Oct 27, 2023 at 10:06:33AM -0600, Jens Axboe wrote:
> On 10/27/23 9:59 AM, Peter Zijlstra wrote:
> > On Fri, Oct 27, 2023 at 09:14:53AM -0600, Jens Axboe wrote:
> >> Hi,
> >>
> >> Normally we'd expect locking state to be clean and consistent across
> >> syscall entry and exit, as that is always the case for sync syscalls.
> >
> >> We currently have a work-around for holding a lock from aio, see
> >> kiocb_start_write(), which pretends to drop the lock from lockdeps
> >> perspective, as it's held from submission to until kiocb_end_write() is
> >> called at completion time.
> >
> > I was not aware of this, the only such hack I knew about was the
> > filesystem freezer thing.
> >
> > The problem with holding locks past the end of a syscall is that you'll
> > nest whatever random lock hierarchies possibly by every other syscall
> > under that lock.
>
> Can you expand on that bit, not quite sure I follow. Do we reset the
> locking dependencies between syscalls?

What I'm saying is that if syscall-a (see below) returns with lock-A
held, and userspace goes on and does syscall-b syscall-c ... syscall-z
before the lock gets released. Then all the lock chains of
syscall-[b..z] will be under lock-a.

This very easily leads to inversions, and is thus a strong reason to not
allow syscalls to 'leak' locks.

> >> This is a bit of an ugly work-around, and defeats the purpose of
> >> lockdep.
> >>
> >> Since I've now got another case where I want to hold a resource across
> >> syscalls, is there a better way to do this?
> >>
> >> This is for inode_dio_start(), which increments an inode int count, and
> >> inode_dio_end() which decrements it. If a task is doing
> >> inode_dio_start() and then inode_dio_wait(), I want to trigger this. I
> >> have a hack that does this, but it disables lockdep_sys_exit() as
> >> otherwise I just get that warning rather than the more useful one.
> >
> > Suppose syscall-a returns with your kiocb thing held, call it lock A
> > Suppose syscall-b returns with your inode thing held, call it lock B
> >
> > Then userspace does:
> >
> > syscall-a
> > syscall-b
> >
> > while it also does:
> >
> > syscall-b
> > syscall-a
> >
> > and we're up a creek, no?
>
> Should this not get caught by the usual lock ordering rules? Because

Well, yes, lockdep will yell, and a machine without lockdep will
deadlock.

This is bad syscall design if you can deadlock like this. We must assume
userspace is out to get us, if possible it will do this.

> that is obviously a bug, ordering would have to be consistent, just like
> if we have:
>
> syscall-a
> lock(a);
> lock(b);
>
> syscall-b
> lock(b);
> lock(a)

The difference is that in this case the full lock order is determined by
kernel code (under our full control), while in the earlier example, the
lock order is determined by syscall order -- out of our control.

2023-10-27 16:39:29

by Jens Axboe

[permalink] [raw]
Subject: Re: lockdep: holding locks across syscall boundaries

On 10/27/23 10:12 AM, Peter Zijlstra wrote:
> The difference is that in this case the full lock order is determined by
> kernel code (under our full control), while in the earlier example, the
> lock order is determined by syscall order -- out of our control.

Ah yes, good point - this seems like the key concept here. I think we're
better off doing this seperately and just return -EDEADLK or something
like that if it's being violated, rather than spew complaints.

Thanks!

--
Jens Axboe

2023-10-29 22:02:38

by David Laight

[permalink] [raw]
Subject: RE: lockdep: holding locks across syscall boundaries

From: Peter Zijlstra
> Sent: 27 October 2023 17:00
>
> On Fri, Oct 27, 2023 at 09:14:53AM -0600, Jens Axboe wrote:
> > Hi,
> >
> > Normally we'd expect locking state to be clean and consistent across
> > syscall entry and exit, as that is always the case for sync syscalls.
>
> > We currently have a work-around for holding a lock from aio, see
> > kiocb_start_write(), which pretends to drop the lock from lockdeps
> > perspective, as it's held from submission to until kiocb_end_write() is
> > called at completion time.
>
> I was not aware of this, the only such hack I knew about was the
> filesystem freezer thing.
>
> The problem with holding locks past the end of a syscall is that you'll
> nest whatever random lock hierarchies possibly by every other syscall
> under that lock.
>
...
>
> Suppose syscall-a returns with your kiocb thing held, call it lock A
> Suppose syscall-b returns with your inode thing held, call it lock B
>
> Then userspace does:
>
> syscall-a
> syscall-b
>
> while it also does:
>
> syscall-b
> syscall-a
>
> and we're up a creek, no?

Isn't it also open to a massive denial-of-service attack?
syscall-a
sleep(infinity)

assuming you actually catch:
syscall-a
_exit()

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)