2012-05-01 14:32:07

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] Describe race of direct read and fork for unaligned buffers

On Mon, Apr 30, 2012 at 5:30 AM, Jan Kara <[email protected]> wrote:
> This is a long standing problem (or a surprising feature) in our implementation
> of get_user_pages() (used by direct IO). Since several attempts to fix it
> failed (e.g.
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg06542.html, or
> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html refused in
> http://comments.gmane.org/gmane.linux.kernel.mm/31569) and it's not completely
> clear whether we really want to fix it given the costs, let's at least document
> it.
>
> CC: [email protected]
> CC: Jeff Moyer <[email protected]>
> Signed-off-by: Jan Kara <[email protected]>
> ---
>
> --- a/man2/open.2 ? ? ? 2012-04-27 00:07:51.736883092 +0200
> +++ b/man2/open.2 ? ? ? 2012-04-27 00:29:59.489892980 +0200
> @@ -769,7 +769,12 @@
> ?and the file offset must all be multiples of the logical block size
> ?of the file system.
> ?Under Linux 2.6, alignment to 512-byte boundaries
> -suffices.
> +suffices. However, if the user buffer is not page aligned and direct read
> +runs in parallel with a
> +.BR fork (2)
> +of the reader process, it may happen that the read data is split between
> +pages owned by the original process and its child. Thus effectively read
> +data is corrupted.
> ?.LP
> ?The
> ?.B O_DIRECT

Hello,

Thank you revisit this. But as far as my remember is correct, this issue is NOT
unaligned access issue. It's just get_user_pages(_fast) vs fork race issue. i.e.
DIRECT_IO w/ multi thread process should not use fork().


2012-05-01 14:37:43

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] Describe race of direct read and fork for unaligned buffers

> Hello,
>
> Thank you revisit this. But as far as my remember is correct, this issue is NOT
> unaligned access issue. It's just get_user_pages(_fast) vs fork race issue. i.e.
> DIRECT_IO w/ multi thread process should not use fork().

The problem is, fork (and its COW logic) assume new access makes cow break,
But page table protection can't detect a DMA write. Therefore DIO may override
shared page data.

2012-05-01 15:21:06

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH] Describe race of direct read and fork for unaligned buffers

KOSAKI Motohiro <[email protected]> writes:

>> Hello,
>>
>> Thank you revisit this. But as far as my remember is correct, this issue is NOT
>> unaligned access issue. It's just get_user_pages(_fast) vs fork race issue. i.e.
>> DIRECT_IO w/ multi thread process should not use fork().
>
> The problem is, fork (and its COW logic) assume new access makes cow break,
> But page table protection can't detect a DMA write. Therefore DIO may override
> shared page data.

Hm, I've only seen this with misaligned or multiple sub-page-sized reads
in the same page. AFAIR, aligned, page-sized I/O does not get split.
But, I could be wrong...

Cheers,
Jeff

2012-05-01 15:34:26

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] Describe race of direct read and fork for unaligned buffers

On Tue, May 1, 2012 at 11:11 AM, Jeff Moyer <[email protected]> wrote:
> KOSAKI Motohiro <[email protected]> writes:
>
>>> Hello,
>>>
>>> Thank you revisit this. But as far as my remember is correct, this issue is NOT
>>> unaligned access issue. It's just get_user_pages(_fast) vs fork race issue. i.e.
>>> DIRECT_IO w/ multi thread process should not use fork().
>>
>> The problem is, fork (and its COW logic) assume new access makes cow break,
>> But page table protection can't detect a DMA write. Therefore DIO may override
>> shared page data.
>
> Hm, I've only seen this with misaligned or multiple sub-page-sized reads
> in the same page. ?AFAIR, aligned, page-sized I/O does not get split.
> But, I could be wrong...

If my remember is correct, the reproducer of past thread is misleading.

dma_thread.c in
http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html has
align parameter. But it doesn't only change align. Because of, every
worker thread read 4K (pagesize), then
- when offset is page aligned
-> every page is accessed from only one worker
- when offset is not page aligned
-> every page is accessed from two workers

But I don't remember why two threads are important things. hmm.. I'm
looking into the code a while.
Please don't 100% trust me.

2012-05-01 15:39:05

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH] Describe race of direct read and fork for unaligned buffers

KOSAKI Motohiro <[email protected]> writes:

> On Tue, May 1, 2012 at 11:11 AM, Jeff Moyer <[email protected]> wrote:
>> KOSAKI Motohiro <[email protected]> writes:
>>
>>>> Hello,
>>>>
>>>> Thank you revisit this. But as far as my remember is correct, this issue is NOT
>>>> unaligned access issue. It's just get_user_pages(_fast) vs fork race issue. i.e.
>>>> DIRECT_IO w/ multi thread process should not use fork().
>>>
>>> The problem is, fork (and its COW logic) assume new access makes cow break,
>>> But page table protection can't detect a DMA write. Therefore DIO may override
>>> shared page data.
>>
>> Hm, I've only seen this with misaligned or multiple sub-page-sized reads
>> in the same page.  AFAIR, aligned, page-sized I/O does not get split.
>> But, I could be wrong...
>
> If my remember is correct, the reproducer of past thread is misleading.
>
> dma_thread.c in
> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html has
> align parameter. But it doesn't only change align. Because of, every
> worker thread read 4K (pagesize), then
> - when offset is page aligned
> -> every page is accessed from only one worker
> - when offset is not page aligned
> -> every page is accessed from two workers
>
> But I don't remember why two threads are important things. hmm.. I'm
> looking into the code a while.
> Please don't 100% trust me.

I bet Andrea or Larry would remember the details.

Cheers,
Jeff

2012-05-01 15:50:48

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH] Describe race of direct read and fork for unaligned buffers

On 2 May 2012 01:38, Jeff Moyer <[email protected]> wrote:
> KOSAKI Motohiro <[email protected]> writes:
>
>> On Tue, May 1, 2012 at 11:11 AM, Jeff Moyer <[email protected]> wrote:
>>> KOSAKI Motohiro <[email protected]> writes:
>>>
>>>>> Hello,
>>>>>
>>>>> Thank you revisit this. But as far as my remember is correct, this issue is NOT
>>>>> unaligned access issue. It's just get_user_pages(_fast) vs fork race issue. i.e.
>>>>> DIRECT_IO w/ multi thread process should not use fork().
>>>>
>>>> The problem is, fork (and its COW logic) assume new access makes cow break,
>>>> But page table protection can't detect a DMA write. Therefore DIO may override
>>>> shared page data.
>>>
>>> Hm, I've only seen this with misaligned or multiple sub-page-sized reads
>>> in the same page.  AFAIR, aligned, page-sized I/O does not get split.
>>> But, I could be wrong...
>>
>> If my remember is correct, the reproducer of past thread is misleading.
>>
>> dma_thread.c in
>> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html has
>> align parameter. But it doesn't only change align. Because of, every
>> worker thread read 4K (pagesize), then
>>  - when offset is page aligned
>>     -> every page is accessed from only one worker
>>  - when offset is not page aligned
>>     -> every page is accessed from two workers
>>
>> But I don't remember why two threads are important things. hmm.. I'm
>> looking into the code a while.
>> Please don't 100% trust me.
>
> I bet Andrea or Larry would remember the details.

KOSAKI-san is correct, I think.

The race is something like this:

DIO-read
page = get_user_pages()
fork()
COW(page)
touch(page)
DMA(page)
page_cache_release(page);

So whether parent or child touches the page, determines who gets the
actual DMA target, and who gets the copy.

2 threads are not required, but it makes the race easier to code and a
larger window, I suspect.

It can also be hit with a single thread, using AIO.

2012-05-01 23:51:45

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] Describe race of direct read and fork for unaligned buffers

Hi Nick!

On Wed, May 02, 2012 at 01:50:46AM +1000, Nick Piggin wrote:
> KOSAKI-san is correct, I think.
>
> The race is something like this:
>
> DIO-read
> page = get_user_pages()
> fork()
> COW(page)
> touch(page)
> DMA(page)
> page_cache_release(page);

Yes. More in general this race happens every time the kernel wrprotect
a writable anon pte, if get_user_pages had a pin on the page while the
pte is being wrprotected.

fork can't just abort (like KSM does) when it notices mapcount <
page_count.

The only way to avoid this, is that somehow the GUP-pinned page should
remain pointed at all times by the pte of the process that pinned the
page (no matter the cows), and that's not happening.

> So whether parent or child touches the page, determines who gets the
> actual DMA target, and who gets the copy.

Correct, so far there are two reproducers, triggering two different
kind of corruption.

The corruption may appear in different ways:

1) we could lose the direct-io read in the parent (if the forked child
does nothing and just quits), that was the basic case in dma_thread.c,
a dummy fork was run just to mark the pte wrprotected

2) the destination of the direct-io read may also become visible to the
child if the child written to the page before the I/O is complete,
leading to random mm corruption in the child

3) it's a direct-io write, then the child could write random data to
disk by accident without noticing, if the DMA wasn't started yet and
the child got the pinned page mapped in the child pte

We had two working fixes for this and personally I'd prefer to apply
them than to document the bug. The probability that who writes code
that can hit the bug is reading the note in the manpage seems pretty
small, especially in the short/mid term. This lkml thread as reminder
may actually have higher chance of being noticed than the manpage
maybe. Nevertheless documenting it is better than nothing if the fixes
aren't applied :). However I'm afraid after we officially document it
the chances of fixing it becomes zero.

> 2 threads are not required, but it makes the race easier to code and a
> larger window, I suspect.
>
> It can also be hit with a single thread, using AIO.

Yes, it requires running fork in the same process that pinned a page
with GUP, and then writing to a buffer in the same page that is under
the GUP pin before the GUP pin is released.

It's not just direct-io, and not just direct-io read (see point 3).

2012-05-02 08:17:12

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] Describe race of direct read and fork for unaligned buffers

On Wed 02-05-12 01:50:46, Nick Piggin wrote:
> On 2 May 2012 01:38, Jeff Moyer <[email protected]> wrote:
> > KOSAKI Motohiro <[email protected]> writes:
> >
> >> On Tue, May 1, 2012 at 11:11 AM, Jeff Moyer <[email protected]> wrote:
> >>> KOSAKI Motohiro <[email protected]> writes:
> >>>
> >>>>> Hello,
> >>>>>
> >>>>> Thank you revisit this. But as far as my remember is correct, this issue is NOT
> >>>>> unaligned access issue. It's just get_user_pages(_fast) vs fork race issue. i.e.
> >>>>> DIRECT_IO w/ multi thread process should not use fork().
> >>>>
> >>>> The problem is, fork (and its COW logic) assume new access makes cow break,
> >>>> But page table protection can't detect a DMA write. Therefore DIO may override
> >>>> shared page data.
> >>>
> >>> Hm, I've only seen this with misaligned or multiple sub-page-sized reads
> >>> in the same page. ?AFAIR, aligned, page-sized I/O does not get split.
> >>> But, I could be wrong...
> >>
> >> If my remember is correct, the reproducer of past thread is misleading.
> >>
> >> dma_thread.c in
> >> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html has
> >> align parameter. But it doesn't only change align. Because of, every
> >> worker thread read 4K (pagesize), then
> >> ?- when offset is page aligned
> >> ? ? -> every page is accessed from only one worker
> >> ?- when offset is not page aligned
> >> ? ? -> every page is accessed from two workers
> >>
> >> But I don't remember why two threads are important things. hmm.. I'm
> >> looking into the code a while.
> >> Please don't 100% trust me.
> >
> > I bet Andrea or Larry would remember the details.
>
> KOSAKI-san is correct, I think.
>
> The race is something like this:
>
> DIO-read
> page = get_user_pages()
> fork()
> COW(page)
> touch(page)
> DMA(page)
> page_cache_release(page);
>
> So whether parent or child touches the page, determines who gets the
> actual DMA target, and who gets the copy.
OK, this is roughly what I understood from original threads as well. So
if our buffer is page aligned and its size is page aligned, you would hit
the corruption only if you do modify the buffer while IO to / from that buffer
is in progress. And that would seem like a really bad programming practice
anyway. So I still believe that having everything page size aligned will
effectively remove the problem although I agree it does not aim at the core
of it.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-05-02 09:09:56

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH] Describe race of direct read and fork for unaligned buffers

On 2 May 2012 18:17, Jan Kara <[email protected]> wrote:
> On Wed 02-05-12 01:50:46, Nick Piggin wrote:

>> KOSAKI-san is correct, I think.
>>
>> The race is something like this:
>>
>> DIO-read
>>     page = get_user_pages()
>>                                                         fork()
>>                                                             COW(page)
>>                                                          touch(page)
>>     DMA(page)
>>     page_cache_release(page);
>>
>> So whether parent or child touches the page, determines who gets the
>> actual DMA target, and who gets the copy.
>  OK, this is roughly what I understood from original threads as well. So
> if our buffer is page aligned and its size is page aligned, you would hit
> the corruption only if you do modify the buffer while IO to / from that buffer
> is in progress. And that would seem like a really bad programming practice
> anyway. So I still believe that having everything page size aligned will
> effectively remove the problem although I agree it does not aim at the core
> of it.

I see what you mean.

I'm not sure, though. For most apps it's bad practice I think. If you get into
realm of sophisticated, performance critical IO/storage managers, it would
not surprise me if such concurrent buffer modifications could be allowed.
We allow exactly such a thing in our pagecache layer. Although probably
those would be using shared mmaps for their buffer cache.

I think it is safest to make a default policy of asking for IOs against private
cow-able mappings to be quiesced before fork, so there are no surprises
or reliance on COW details in the mm. Do you think?

2012-05-02 09:18:41

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] Describe race of direct read and fork for unaligned buffers

On Wed 02-05-12 19:09:54, Nick Piggin wrote:
> On 2 May 2012 18:17, Jan Kara <[email protected]> wrote:
> > On Wed 02-05-12 01:50:46, Nick Piggin wrote:
>
> >> KOSAKI-san is correct, I think.
> >>
> >> The race is something like this:
> >>
> >> DIO-read
> >> ? ? page = get_user_pages()
> >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? fork()
> >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? COW(page)
> >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?touch(page)
> >> ? ? DMA(page)
> >> ? ? page_cache_release(page);
> >>
> >> So whether parent or child touches the page, determines who gets the
> >> actual DMA target, and who gets the copy.
> > ?OK, this is roughly what I understood from original threads as well. So
> > if our buffer is page aligned and its size is page aligned, you would hit
> > the corruption only if you do modify the buffer while IO to / from that buffer
> > is in progress. And that would seem like a really bad programming practice
> > anyway. So I still believe that having everything page size aligned will
> > effectively remove the problem although I agree it does not aim at the core
> > of it.
>
> I see what you mean.
>
> I'm not sure, though. For most apps it's bad practice I think. If you get into
> realm of sophisticated, performance critical IO/storage managers, it would
> not surprise me if such concurrent buffer modifications could be allowed.
> We allow exactly such a thing in our pagecache layer. Although probably
> those would be using shared mmaps for their buffer cache.
>
> I think it is safest to make a default policy of asking for IOs against private
> cow-able mappings to be quiesced before fork, so there are no surprises
> or reliance on COW details in the mm. Do you think?
Yes, I agree that (and MADV_DONTFORK) is probably the best thing to have
in documentation. Otherwise it's a bit too hairy...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-05-02 19:15:03

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] Describe race of direct read and fork for unaligned buffers

Hello,

>> I see what you mean.
>>
>> I'm not sure, though. For most apps it's bad practice I think. If you get into
>> realm of sophisticated, performance critical IO/storage managers, it would
>> not surprise me if such concurrent buffer modifications could be allowed.
>> We allow exactly such a thing in our pagecache layer. Although probably
>> those would be using shared mmaps for their buffer cache.
>>
>> I think it is safest to make a default policy of asking for IOs against private
>> cow-able mappings to be quiesced before fork, so there are no surprises
>> or reliance on COW details in the mm. Do you think?
> Yes, I agree that (and MADV_DONTFORK) is probably the best thing to have
> in documentation. Otherwise it's a bit too hairy...

I neglected this issue for years because Linus asked who need this and
I couldn't
find real world usecase.

Ah, no, not exactly correct. Fujitsu proprietary database had such
usecase. But they
quickly fixed it. Then I couldn't find alternative usecase.

I'm not sure why you say "hairy". Do you mean you have any use case of this?

Thank you.

2012-05-02 19:23:33

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] Describe race of direct read and fork for unaligned buffers

On Wed 02-05-12 15:14:33, KOSAKI Motohiro wrote:
> Hello,
>
> >> I see what you mean.
> >>
> >> I'm not sure, though. For most apps it's bad practice I think. If you get into
> >> realm of sophisticated, performance critical IO/storage managers, it would
> >> not surprise me if such concurrent buffer modifications could be allowed.
> >> We allow exactly such a thing in our pagecache layer. Although probably
> >> those would be using shared mmaps for their buffer cache.
> >>
> >> I think it is safest to make a default policy of asking for IOs against private
> >> cow-able mappings to be quiesced before fork, so there are no surprises
> >> or reliance on COW details in the mm. Do you think?
> > Yes, I agree that (and MADV_DONTFORK) is probably the best thing to have
> > in documentation. Otherwise it's a bit too hairy...
>
> I neglected this issue for years because Linus asked who need this and
> I couldn't
> find real world usecase.
>
> Ah, no, not exactly correct. Fujitsu proprietary database had such
> usecase. But they quickly fixed it. Then I couldn't find alternative usecase.
One of our customers hit this bug recently which is why I started to look
at this. But they also modified their application not to hit the problem.

> I'm not sure why you say "hairy". Do you mean you have any use case of this?
I meant that if we should describe conditions like "if you have page
aligned buffer and you don't write to it while the IO is running, the
problem also won't occur", then it's already too detailed and might
easily change in future kernels...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-05-02 19:26:36

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] Describe race of direct read and fork for unaligned buffers

On Wed, May 2, 2012 at 3:23 PM, Jan Kara <[email protected]> wrote:
> On Wed 02-05-12 15:14:33, KOSAKI Motohiro wrote:
>> Hello,
>>
>> >> I see what you mean.
>> >>
>> >> I'm not sure, though. For most apps it's bad practice I think. If you get into
>> >> realm of sophisticated, performance critical IO/storage managers, it would
>> >> not surprise me if such concurrent buffer modifications could be allowed.
>> >> We allow exactly such a thing in our pagecache layer. Although probably
>> >> those would be using shared mmaps for their buffer cache.
>> >>
>> >> I think it is safest to make a default policy of asking for IOs against private
>> >> cow-able mappings to be quiesced before fork, so there are no surprises
>> >> or reliance on COW details in the mm. Do you think?
>> > ? ?Yes, I agree that (and MADV_DONTFORK) is probably the best thing to have
>> > in documentation. Otherwise it's a bit too hairy...
>>
>> I neglected this issue for years because Linus asked who need this and
>> I couldn't
>> find real world usecase.
>>
>> Ah, no, not exactly correct. Fujitsu proprietary database had such
>> usecase. But they quickly fixed it. Then I couldn't find alternative usecase.
> ?One of our customers hit this bug recently which is why I started to look
> at this. But they also modified their application not to hit the problem.
>
>> I'm not sure why you say "hairy". Do you mean you have any use case of this?
> ?I meant that if we should describe conditions like "if you have page
> aligned buffer and you don't write to it while the IO is running, the
> problem also won't occur", then it's already too detailed and might
> easily change in future kernels...

ok, thanks.

Subject: Re: [PATCH] Describe race of direct read and fork for unaligned buffers

On Thu, May 3, 2012 at 7:25 AM, KOSAKI Motohiro
<[email protected]> wrote:
> On Wed, May 2, 2012 at 3:23 PM, Jan Kara <[email protected]> wrote:
>> On Wed 02-05-12 15:14:33, KOSAKI Motohiro wrote:
>>> Hello,
>>>
>>> >> I see what you mean.
>>> >>
>>> >> I'm not sure, though. For most apps it's bad practice I think. If you get into
>>> >> realm of sophisticated, performance critical IO/storage managers, it would
>>> >> not surprise me if such concurrent buffer modifications could be allowed.
>>> >> We allow exactly such a thing in our pagecache layer. Although probably
>>> >> those would be using shared mmaps for their buffer cache.
>>> >>
>>> >> I think it is safest to make a default policy of asking for IOs against private
>>> >> cow-able mappings to be quiesced before fork, so there are no surprises
>>> >> or reliance on COW details in the mm. Do you think?
>>> > ? ?Yes, I agree that (and MADV_DONTFORK) is probably the best thing to have
>>> > in documentation. Otherwise it's a bit too hairy...
>>>
>>> I neglected this issue for years because Linus asked who need this and
>>> I couldn't
>>> find real world usecase.
>>>
>>> Ah, no, not exactly correct. Fujitsu proprietary database had such
>>> usecase. But they quickly fixed it. Then I couldn't find alternative usecase.
>> ?One of our customers hit this bug recently which is why I started to look
>> at this. But they also modified their application not to hit the problem.
>>
>>> I'm not sure why you say "hairy". Do you mean you have any use case of this?
>> ?I meant that if we should describe conditions like "if you have page
>> aligned buffer and you don't write to it while the IO is running, the
>> problem also won't occur", then it's already too detailed and might
>> easily change in future kernels...

So, am I correct to assume that right text to add to the page is as below?

Nick, can you clarify what you mean by "quiesced"?

[[
O_DIRECT IOs should never be run concurrently with fork(2) system call,
when the memory buffer is anonymous memory, or comes from mmap(2)
with MAP_PRIVATE.

Any such IOs, whether submitted with asynchronous IO interface or from
another thread in the process, should be quiesced before fork(2) is called.
Failure to do so can result in data corruption and undefined behavior in
parent and child processes.

This restriction does not apply when the memory buffer for the O_DIRECT
IOs comes from mmap(2) with MAP_SHARED or from shmat(2).
Nor does this restriction apply when the memory buffer has been advised
as MADV_DONTFORK with madvise(2), ensuring that it will not be available
to the child after fork(2).
]]

Thanks,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/

2012-05-05 15:29:27

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] Describe race of direct read and fork for unaligned buffers

> So, am I correct to assume that right text to add to the page is as below?
>
> Nick, can you clarify what you mean by "quiesced"?

finished?

>
> [[
> O_DIRECT IOs should never be run concurrently with fork(2) system call,
> when the memory buffer is anonymous memory, or comes from mmap(2)
> with MAP_PRIVATE.
>
> Any such IOs, whether submitted with asynchronous IO interface or from
> another thread in the process, should be quiesced before fork(2) is called.
> Failure to do so can result in data corruption and undefined behavior in
> parent and child processes.
>
> This restriction does not apply when the memory buffer for the O_DIRECT
> IOs comes from mmap(2) with MAP_SHARED or from shmat(2).
> Nor does this restriction apply when the memory buffer has been advised
> as MADV_DONTFORK with madvise(2), ensuring that it will not be available
> to the child after fork(2).
> ]]

I don't have good English and I can't make editorial check. But at least,
I don't find any technical incorrect explanation here.

Acked-by: KOSAKI Motohiro <[email protected]>

2012-05-08 23:10:18

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH] Describe race of direct read and fork for unaligned buffers

On 6 May 2012 01:29, KOSAKI Motohiro <[email protected]> wrote:
>> So, am I correct to assume that right text to add to the page is as below?
>>
>> Nick, can you clarify what you mean by "quiesced"?
>
> finished?

Yes exactly. That might be a simpler word. Thanks!

>
>>
>> [[
>> O_DIRECT IOs should never be run concurrently with fork(2) system call,
>> when the memory buffer is anonymous memory, or comes from mmap(2)
>> with MAP_PRIVATE.
>>
>> Any such IOs, whether submitted with asynchronous IO interface or from
>> another thread in the process, should be quiesced before fork(2) is called.
>> Failure to do so can result in data corruption and undefined behavior in
>> parent and child processes.
>>
>> This restriction does not apply when the memory buffer for the O_DIRECT
>> IOs comes from mmap(2) with MAP_SHARED or from shmat(2).
>> Nor does this restriction apply when the memory buffer has been advised
>> as MADV_DONTFORK with madvise(2), ensuring that it will not be available
>> to the child after fork(2).
>> ]]
>
> I don't have good English and I can't make editorial check. But at least,
> I don't find any technical incorrect explanation here.
>
>  Acked-by: KOSAKI Motohiro <[email protected]>

Subject: Re: [PATCH] Describe race of direct read and fork for unaligned buffers

On Wed, May 9, 2012 at 11:10 AM, Nick Piggin <[email protected]> wrote:
> On 6 May 2012 01:29, KOSAKI Motohiro <[email protected]> wrote:
>>> So, am I correct to assume that right text to add to the page is as below?
>>>
>>> Nick, can you clarify what you mean by "quiesced"?
>>
>> finished?
>
> Yes exactly. That might be a simpler word. Thanks!

Thanks.

But see below. I realize the text is still ambiguous.

>>> [[
>>> O_DIRECT IOs should never be run concurrently with fork(2) system call,
>>> when the memory buffer is anonymous memory, or comes from mmap(2)
>>> with MAP_PRIVATE.
>>>
>>> Any such IOs, whether submitted with asynchronous IO interface or from
>>> another thread in the process, should be quiesced before fork(2) is called.
>>> Failure to do so can result in data corruption and undefined behavior in
>>> parent and child processes.
>>>
>>> This restriction does not apply when the memory buffer for the O_DIRECT
>>> IOs comes from mmap(2) with MAP_SHARED or from shmat(2).
>>> Nor does this restriction apply when the memory buffer has been advised
>>> as MADV_DONTFORK with madvise(2), ensuring that it will not be available
>>> to the child after fork(2).
>>> ]]

In the above, the status of a MAP_SHARED MAP_ANONYMOUS buffer is
unclear. The first paragraph implies that such a buffer is unsafe,
while the third paragraph implies that it *is* safe, thus
contradicting the first paragraph. Which is correct?

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/

2012-05-09 07:01:44

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH] Describe race of direct read and fork for unaligned buffers

On 9 May 2012 15:35, Michael Kerrisk (man-pages) <[email protected]> wrote:
> On Wed, May 9, 2012 at 11:10 AM, Nick Piggin <[email protected]> wrote:
>> On 6 May 2012 01:29, KOSAKI Motohiro <[email protected]> wrote:
>>>> So, am I correct to assume that right text to add to the page is as below?
>>>>
>>>> Nick, can you clarify what you mean by "quiesced"?
>>>
>>> finished?
>>
>> Yes exactly. That might be a simpler word. Thanks!
>
> Thanks.
>
> But see below. I realize the text is still ambiguous.
>
>>>> [[
>>>> O_DIRECT IOs should never be run concurrently with fork(2) system call,
>>>> when the memory buffer is anonymous memory, or comes from mmap(2)
>>>> with MAP_PRIVATE.
>>>>
>>>> Any such IOs, whether submitted with asynchronous IO interface or from
>>>> another thread in the process, should be quiesced before fork(2) is called.
>>>> Failure to do so can result in data corruption and undefined behavior in
>>>> parent and child processes.
>>>>
>>>> This restriction does not apply when the memory buffer for the O_DIRECT
>>>> IOs comes from mmap(2) with MAP_SHARED or from shmat(2).
>>>> Nor does this restriction apply when the memory buffer has been advised
>>>> as MADV_DONTFORK with madvise(2), ensuring that it will not be available
>>>> to the child after fork(2).
>>>> ]]
>
> In the above, the status of a MAP_SHARED MAP_ANONYMOUS buffer is
> unclear. The first paragraph implies that such a buffer is unsafe,
> while the third paragraph implies that it *is* safe, thus
> contradicting the first paragraph. Which is correct?

Yes I see. It's because MAP_SHARED | MAP_ANONYMOUS isn't *really*
anonymous from the virtual memory subsystem's point of view. But that
just serves to confuse userspace I guess.

Anything with MAP_SHARED, shmat, or MADV_DONTFORK is OK.

Anything else (MAP_PRIVATE, brk, without MADV_DONTFORK) is
dangerous. These type are used by standard heap allocators malloc,
new, etc.

Subject: Re: [PATCH] Describe race of direct read and fork for unaligned buffers

On Wed, May 9, 2012 at 7:01 PM, Nick Piggin <[email protected]> wrote:
> On 9 May 2012 15:35, Michael Kerrisk (man-pages) <[email protected]> wrote:
>> On Wed, May 9, 2012 at 11:10 AM, Nick Piggin <[email protected]> wrote:
>>> On 6 May 2012 01:29, KOSAKI Motohiro <[email protected]> wrote:
>>>>> So, am I correct to assume that right text to add to the page is as below?
>>>>>
>>>>> Nick, can you clarify what you mean by "quiesced"?
>>>>
>>>> finished?
>>>
>>> Yes exactly. That might be a simpler word. Thanks!
>>
>> Thanks.
>>
>> But see below. I realize the text is still ambiguous.
>>
>>>>> [[
>>>>> O_DIRECT IOs should never be run concurrently with fork(2) system call,
>>>>> when the memory buffer is anonymous memory, or comes from mmap(2)
>>>>> with MAP_PRIVATE.
>>>>>
>>>>> Any such IOs, whether submitted with asynchronous IO interface or from
>>>>> another thread in the process, should be quiesced before fork(2) is called.
>>>>> Failure to do so can result in data corruption and undefined behavior in
>>>>> parent and child processes.
>>>>>
>>>>> This restriction does not apply when the memory buffer for the O_DIRECT
>>>>> IOs comes from mmap(2) with MAP_SHARED or from shmat(2).
>>>>> Nor does this restriction apply when the memory buffer has been advised
>>>>> as MADV_DONTFORK with madvise(2), ensuring that it will not be available
>>>>> to the child after fork(2).
>>>>> ]]
>>
>> In the above, the status of a MAP_SHARED MAP_ANONYMOUS buffer is
>> unclear. The first paragraph implies that such a buffer is unsafe,
>> while the third paragraph implies that it *is* safe, thus
>> contradicting the first paragraph. Which is correct?
>
> Yes I see. It's because MAP_SHARED | MAP_ANONYMOUS isn't *really*
> anonymous from the virtual memory subsystem's point of view. But that
> just serves to confuse userspace I guess.
>
> Anything with MAP_SHARED, shmat, or MADV_DONTFORK is OK.
>
> Anything else (MAP_PRIVATE, brk, without MADV_DONTFORK) is
> dangerous. These type are used by standard heap allocators malloc,
> new, etc.

So, would the following text be okay:

O_DIRECT I/Os should never be run concurrently with the fork(2)
system call, if the memory buffer is a private mapping (i.e.,
any mapping created with the mmap(2) MAP_PRIVATE flag; this
includes memory allocated on the heap and statically allocated
buffers). Any such I/Os, whether submitted via an asynchronous
I/O interface or from another thread in the process, should be
completed before fork(2) is called. Failure to do so can
result in data corruption and undefined behavior in parent and
child processes. This restriction does not apply when the mem‐
ory buffer for the O_DIRECT I/Os was created using shmat(2) or
mmap(2) with the MAP_SHARED flag. Nor does this restriction
apply when the memory buffer has been advised as MADV_DONTFORK
with madvise(2), ensuring that it will not be available to the
child after fork(2).

Thanks,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/

2012-05-10 15:01:18

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] Describe race of direct read and fork for unaligned buffers

On Wed 09-05-12 19:18:16, Michael Kerrisk (man-pages) wrote:
> On Wed, May 9, 2012 at 7:01 PM, Nick Piggin <[email protected]> wrote:
> > On 9 May 2012 15:35, Michael Kerrisk (man-pages) <[email protected]> wrote:
> >> On Wed, May 9, 2012 at 11:10 AM, Nick Piggin <[email protected]> wrote:
> >>> On 6 May 2012 01:29, KOSAKI Motohiro <[email protected]> wrote:
> >>>>> So, am I correct to assume that right text to add to the page is as below?
> >>>>>
> >>>>> Nick, can you clarify what you mean by "quiesced"?
> >>>>
> >>>> finished?
> >>>
> >>> Yes exactly. That might be a simpler word. Thanks!
> >>
> >> Thanks.
> >>
> >> But see below. I realize the text is still ambiguous.
> >>
> >>>>> [[
> >>>>> O_DIRECT IOs should never be run concurrently with fork(2) system call,
> >>>>> when the memory buffer is anonymous memory, or comes from mmap(2)
> >>>>> with MAP_PRIVATE.
> >>>>>
> >>>>> Any such IOs, whether submitted with asynchronous IO interface or from
> >>>>> another thread in the process, should be quiesced before fork(2) is called.
> >>>>> Failure to do so can result in data corruption and undefined behavior in
> >>>>> parent and child processes.
> >>>>>
> >>>>> This restriction does not apply when the memory buffer for the O_DIRECT
> >>>>> IOs comes from mmap(2) with MAP_SHARED or from shmat(2).
> >>>>> Nor does this restriction apply when the memory buffer has been advised
> >>>>> as MADV_DONTFORK with madvise(2), ensuring that it will not be available
> >>>>> to the child after fork(2).
> >>>>> ]]
> >>
> >> In the above, the status of a MAP_SHARED MAP_ANONYMOUS buffer is
> >> unclear. The first paragraph implies that such a buffer is unsafe,
> >> while the third paragraph implies that it *is* safe, thus
> >> contradicting the first paragraph. Which is correct?
> >
> > Yes I see. It's because MAP_SHARED | MAP_ANONYMOUS isn't *really*
> > anonymous from the virtual memory subsystem's point of view. But that
> > just serves to confuse userspace I guess.
> >
> > Anything with MAP_SHARED, shmat, or MADV_DONTFORK is OK.
> >
> > Anything else (MAP_PRIVATE, brk, without MADV_DONTFORK) is
> > dangerous. These type are used by standard heap allocators malloc,
> > new, etc.
>
> So, would the following text be okay:
>
> O_DIRECT I/Os should never be run concurrently with the fork(2)
> system call, if the memory buffer is a private mapping (i.e.,
> any mapping created with the mmap(2) MAP_PRIVATE flag; this
> includes memory allocated on the heap and statically allocated
> buffers). Any such I/Os, whether submitted via an asynchronous
> I/O interface or from another thread in the process, should be
> completed before fork(2) is called. Failure to do so can
> result in data corruption and undefined behavior in parent and
> child processes. This restriction does not apply when the mem‐
> ory buffer for the O_DIRECT I/Os was created using shmat(2) or
> mmap(2) with the MAP_SHARED flag. Nor does this restriction
> apply when the memory buffer has been advised as MADV_DONTFORK
> with madvise(2), ensuring that it will not be available to the
> child after fork(2).
This text looks OK, to me. Thanks for putting it together.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR